Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-06 Thread Davies Liu
There is a PR to fix this: https://github.com/apache/spark/pull/1802 On Tue, Aug 5, 2014 at 10:11 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: I concur that printSchema works; it just seems to be operations that use the data where trouble happens. Thanks for posting the bug. -Brad

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-06 Thread Nicholas Chammas
Nice catch Brad and thanks to Yin and Davies for getting on it so quickly. On Wed, Aug 6, 2014 at 2:45 AM, Davies Liu dav...@databricks.com wrote: There is a PR to fix this: https://github.com/apache/spark/pull/1802 On Tue, Aug 5, 2014 at 10:11 PM, Brad Miller bmill...@eecs.berkeley.edu

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Nicholas Chammas
I believe this is a known issue in 1.0.1 that's fixed in 1.0.2. See: SPARK-2376: Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException https://issues.apache.org/jira/browse/SPARK-2376 On Tue, Aug 5, 2014 at 2:55 PM, Brad Miller bmill...@eecs.berkeley.edu

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Michael Armbrust
Is this on 1.0.1? I'd suggest running this on master or the 1.1-RC, which should be coming out this week. PySpark did not have good support for nested data previously. If you still encounter issues using a more recent version, please file a JIRA. Thanks! On Tue, Aug 5, 2014 at 11:55 AM, Brad

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
Nick: Thanks for both the original JIRA bug report and the link. Michael: This is on the 1.0.1 release. I'll update to master and follow up if I have any problems. best, -Brad On Tue, Aug 5, 2014 at 12:04 PM, Michael Armbrust mich...@databricks.com wrote: Is this on 1.0.1? I'd suggest

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
Hi All, I've built and deployed the current head of branch-1.0, but it seems to have only partly fixed the bug. This code now runs as expected with the indicated output: srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":[1,2,3]}', '{"foo":[4,5,6]}'])) srdd.printSchema() root |-- foo:

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Nicholas Chammas
This looks to be fixed in master: from pyspark.sql import SQLContext sqlContext = SQLContext(sc) sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}', '{"foo":[[1,2,3], [4,5,6]]}']) ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:315 sqlContext.jsonRDD(sc.parallelize(['{"foo":[[1,2,3],

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
Hi All, I checked out and built master. Note that Maven had a problem building Kafka (in my case, at least); I was unable to fix this easily so I moved on since it seemed unlikely to have any influence on the problem at hand. Master improves functionality (including the example Nicholas just

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Yin Huai
I tried jsonRDD(...).printSchema() and it worked. Seems the problem is when we take the data back to the Python side, SchemaRDD#javaToPython failed on your cases. I have created https://issues.apache.org/jira/browse/SPARK-2875 to track it. Thanks, Yin On Tue, Aug 5, 2014 at 9:20 PM, Brad

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Brad Miller
I concur that printSchema works; it just seems to be operations that use the data where trouble happens. Thanks for posting the bug. -Brad On Tue, Aug 5, 2014 at 10:05 PM, Yin Huai yh...@databricks.com wrote: I tried jsonRDD(...).printSchema() and it worked. Seems the problem is when we
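
For anyone trying to reproduce the behaviour discussed in this thread, here is a minimal sketch, not the original poster's exact script. It assumes the double quotes around the JSON keys were stripped by the archive and restores them, and it assumes a Spark 1.0.x-era `sc` and `sqlCtx` are already available. The helper name `reproduce` is hypothetical, and `collect()` is used only as one example of an operation that pulls the data back to Python.

```python
import json

# Records from the thread, with double quotes around the keys restored
# (assumption: the archive appears to have stripped them).
records = ['{"foo":[1,2,3]}', '{"foo":[4,5,6]}']

# Plain-Python sanity check: each record is valid JSON with a list under "foo".
parsed = [json.loads(r) for r in records]


def reproduce(sc, sqlCtx):
    """Hypothetical helper: run the thread's steps against a live Spark context."""
    srdd = sqlCtx.jsonRDD(sc.parallelize(records))
    srdd.printSchema()     # reported in the thread to work
    return srdd.collect()  # an operation that uses the data, where trouble was reported
```

Calling `reproduce(sc, sqlCtx)` from a PySpark shell would exercise both the schema-only path and the path that returns rows to Python.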