Hi all, I have a SchemaRDD that is loaded from a file. Each Row contains 7 fields, one of which holds the text for a sentence from a document.

    # Load sentence data table
    sentenceRDD = sqlContext.parquetFile('s3n://some/path/thing')
    sentenceRDD.take(3)

    Out[20]:
    [Row(annotID=118, annotSet=u'ge', annotType=u'sentence', endOffset=20194, pii=u'0094576587900440', startOffset=20062, text=u'Paper IAF-86-85 presented at the 37th Congress of the International Astronautical Federation, Innsbruck, Austria, 4-11 October 1986.'),
     Row(annotID=163, annotSet=u'ge', annotType=u'sentence', endOffset=20249, pii=u'0094576587900440', startOffset=20194, text=u"The landsat sensors: Eosat's plans for landsats 6 and 7"),
     Row(annotID=190, annotSet=u'ge', annotType=u'sentence', endOffset=20342, pii=u'0094576587900440', startOffset=20334, text=u'Abstract')]

I have this registered as a table and can query it with SQL select statements. I would also like to filter the RDD using text operations such as regexps, which have greater capabilities than SQL's LIKE operator. However, the code below does not work. Instead, I get a runtime error.

    openProbsRDD = sentenceRDD.filter(lambda row: "remains unknown" in row["text"])
    openProbsRDD.take(5)
    ...
    TypeError: tuple indices must be integers, not str
    ...

If I use row[6] instead of row["text"], I get what I am looking for. However, finding the right numeric index could be a pain.

Can I access the fields in a Row of a SchemaRDD by name, so that I can map, filter, etc. without a trial-and-error process of finding the right int for the field name?

Thanks,
Ron Daniel
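
P.S. One possible stopgap is to build a name-to-index lookup once and then index each row by position. Below is a minimal sketch in plain Python, not tested against Spark itself. The field order is assumed from the Row output above (text at index 6). FIELD_NAMES, FIELD_INDEX, field, and is_open_problem are hypothetical names, not part of any Spark API. Plain tuples stand in for the rows, and the last sample row is made up for illustration.

```python
import re

# Assumed field order, read off the Row repr above (text ends up at index 6).
FIELD_NAMES = ["annotID", "annotSet", "annotType", "endOffset",
               "pii", "startOffset", "text"]

# Hypothetical lookup table: field name -> position in the row tuple.
FIELD_INDEX = {name: i for i, name in enumerate(FIELD_NAMES)}


def field(row, name):
    """Return the value of the named field from a tuple-like row."""
    return row[FIELD_INDEX[name]]


# A regexp filter that SQL's LIKE could not express as easily.
OPEN_PROBLEM = re.compile(r"remains\s+unknown", re.IGNORECASE)


def is_open_problem(row):
    """True if the row's text field matches the open-problem pattern."""
    return OPEN_PROBLEM.search(field(row, "text")) is not None


# Plain tuples standing in for SchemaRDD rows; the last one is invented.
sample_rows = [
    (163, "ge", "sentence", 20249, "0094576587900440", 20194,
     "The landsat sensors: Eosat's plans for landsats 6 and 7"),
    (190, "ge", "sentence", 20342, "0094576587900440", 20334, "Abstract"),
    (999, "ge", "sentence", 0, "0094576587900440", 0,
     "The cause of this effect remains unknown."),
]

matches = [row for row in sample_rows if is_open_problem(row)]
```

Since the rows behave like tuples, the same predicate could presumably be passed as sentenceRDD.filter(is_open_problem), but that is an assumption rather than something verified here.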