RE: filtering a SchemaRDD

2014-11-16 Thread Daniel, Ronald (ELS-SDG)
Indeed it did. Thanks!

Ron


From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, November 14, 2014 9:53 PM
To: Daniel, Ronald (ELS-SDG)
Cc: user@spark.apache.org
Subject: Re: filtering a SchemaRDD


If I use row[6] instead of row["text"] I get what I am looking for. However, 
finding the right numeric index could be a pain.

Can I access the fields in a Row of a SchemaRDD by name, so that I can map, 
filter, etc. without a trial and error process of finding the right int for the 
fieldname?

row.text should work.

More examples here: 
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#tab_python_2

Michael


Re: filtering a SchemaRDD

2014-11-14 Thread Vikas Agarwal
Hi, did you try using single quotes instead of double quotes around the
column name? I faced a similar situation with Apache Phoenix.

On Saturday, November 15, 2014, Daniel, Ronald (ELS-SDG) 
r.dan...@elsevier.com wrote:

  Hi all,



 I have a SchemaRDD that is loaded from a file. Each Row contains 7 fields,
 one of which holds the text for a sentence from a document.



   # Load sentence data table

   sentenceRDD = sqlContext.parquetFile('s3n://some/path/thing')

   sentenceRDD.take(3)

 Out[20]: [Row(annotID=118, annotSet=u'ge', annotType=u'sentence',
 endOffset=20194, pii=u'0094576587900440', startOffset=20062, text=u'Paper
 IAF-86-85 presented at the 37th Congress of the International Astronautical
 Federation, Innsbruck, Austria, 4-11 October 1986.'), Row(annotID=163,
 annotSet=u'ge', annotType=u'sentence', endOffset=20249,
 pii=u'0094576587900440', startOffset=20194, text=u"The landsat sensors:
 Eosat's plans for landsats 6 and 7"), Row(annotID=190, annotSet=u'ge',
 annotType=u'sentence', endOffset=20342, pii=u'0094576587900440',
 startOffset=20334, text=u'Abstract')]



 I have this registered as a table and can query it with SQL select
 statements. I would also like to filter the RDD using text operations like
 regexps that have greater capabilities than SQL's LIKE operator. However,
 the code below does not work. Instead I get a runtime error.



 openProbsRDD = sentenceRDD.filter(lambda row: "remains unknown" in
 row["text"] )

 openProbsRDD.take(5)

 …

 TypeError: tuple indices must be integers, not str

 …



 If I use row[6] instead of row["text"] I get what I am looking for.
 However, finding the right numeric index could be a pain.



 Can I access the fields in a Row of a SchemaRDD by name, so that I can
 map, filter, etc. without a trial and error process of finding the right
 int for the fieldname?



 Thanks,

 Ron Daniel



-- 
Regards,
Vikas Agarwal
91 – 9928301411

InfoObjects, Inc.
Execution Matters
http://www.infoobjects.com
2041 Mission College Boulevard, #280
Santa Clara, CA 95054
+1 (408) 988-2000 Work
+1 (408) 716-2726 Fax


Re: filtering a SchemaRDD

2014-11-14 Thread Michael Armbrust


 If I use row[6] instead of row["text"] I get what I am looking for.
 However, finding the right numeric index could be a pain.



 Can I access the fields in a Row of a SchemaRDD by name, so that I can
 map, filter, etc. without a trial and error process of finding the right
 int for the fieldname?


row.text should work.

More examples here:
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#tab_python_2

Michael
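
For anyone reading this thread later, here is a minimal sketch of the attribute-access approach suggested above, combined with the regexp filtering from the original question. It does not use Spark: a collections.namedtuple stands in for a SchemaRDD Row (an assumption, chosen because it is also a tuple with named fields), the field list is shortened, and the first sample sentence and the regular expression are made up for illustration.

```python
import re
from collections import namedtuple

# Stand-in for a Spark SQL Row (assumption): a tuple whose fields are
# also reachable as attributes. Only three of the seven fields are kept.
Row = namedtuple("Row", ["annotID", "annotType", "text"])

rows = [
    # Invented sample sentence, not taken from the thread's data.
    Row(118, "sentence", "The cause of the anomaly remains unknown."),
    Row(163, "sentence", "The landsat sensors: Eosat's plans for landsats 6 and 7"),
    Row(190, "sentence", "Abstract"),
]

# A regular expression can express matches that SQL's LIKE cannot.
pattern = re.compile(r"remains?\s+unknown", re.IGNORECASE)

# Access the field by name instead of by numeric index.
open_probs = [row for row in rows if pattern.search(row.text)]

# Indexing a tuple with a string is what triggers the TypeError
# reported in the original question.
try:
    rows[0]["text"]
    string_index_works = True
except TypeError:
    string_index_works = False
```

On the SchemaRDD from the question, the equivalent predicate would presumably be sentenceRDD.filter(lambda row: pattern.search(row.text) is not None). That line is untested here and assumes the Row objects expose their fields as attributes, as the reply above indicates.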