Hi All,

I've built and deployed the current head of branch-1.0, but it seems to have only partly fixed the bug.
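In case it helps anyone verify, here are both tests as a single copy-pasteable snippet for a fresh pyspark shell (the same code as the transcripts below, just gathered in one place; only `sc` is assumed to come from the shell):

from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)  # `sc` is the SparkContext the pyspark shell creates

# Case 1: flat arrays -- now works at the head of branch-1.0.
srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":[1,2,3]}', '{"foo":[4,5,6]}']))
srdd.printSchema()
print srdd.collect()

# Case 2: nested arrays -- collect() still fails with the PickleException.
srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}',
                                      '{"foo":[[1,2,3], [4,5,6]]}']))
srdd.printSchema()
srdd.collect()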
This code now runs as expected with the indicated output:

> srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":[1,2,3]}', '{"foo":[4,5,6]}']))
> srdd.printSchema()
root
 |-- foo: ArrayType[IntegerType]
> srdd.collect()
[{u'foo': [1, 2, 3]}, {u'foo': [4, 5, 6]}]

This code still crashes:

> srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}', '{"foo":[[1,2,3], [4,5,6]]}']))
> srdd.printSchema()
root
 |-- foo: ArrayType[ArrayType(IntegerType)]
> srdd.collect()
Py4JJavaError: An error occurred while calling o63.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3.0:29 failed 4 times, most recent failure: Exception failure in TID 67 on host kunitz.research.intel-research.net: net.razorvine.pickle.PickleException: couldn't introspect javabean: java.lang.IllegalArgumentException: wrong number of arguments
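In the meantime, the workaround from my first message (quoted below) still seems to apply: parse the JSON in Python and build the SchemaRDD with inferSchema instead of jsonRDD. A minimal sketch follows; in 1.0.x inferSchema accepts an RDD of dicts, and this path has worked on the data I've tried, though I haven't re-tested the nested-array case exhaustively, so treat it as a workaround to try rather than a guarantee:

import json

# Workaround sketch: parse the JSON on the Python side and hand inferSchema
# an RDD of dicts, bypassing the jsonRDD serialization path that raises the
# PickleException.
raw = sc.parallelize(['{"foo":[[1,2,3], [4,5,6]]}',
                      '{"foo":[[1,2,3], [4,5,6]]}'])
srdd = sqlCtx.inferSchema(raw.map(json.loads))
srdd.printSchema()
print srdd.collect()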
I could check whether this is fixed in master, but since it isn't fixed in 1.0.3 (the current branch-1.0 head) it seems unlikely that master behaves differently. I previously tried master as well, but ran into a build problem that did not occur with the 1.0 branch.

Can anybody else verify that the second example still crashes (and is meant to work)? If so, would it be best to update SPARK-2376 or file a new bug?

https://issues.apache.org/jira/browse/SPARK-2376

best,
-Brad

On Tue, Aug 5, 2014 at 12:10 PM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:

> Nick: Thanks for both the original JIRA bug report and the link.
>
> Michael: This is on the 1.0.1 release. I'll update to master and
> follow up if I have any problems.
>
> best,
> -Brad
>
>
> On Tue, Aug 5, 2014 at 12:04 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Is this on 1.0.1? I'd suggest running this on master or the 1.1-RC, which
>> should be coming out this week. Pyspark did not have good support for
>> nested data previously. If you still encounter issues using a more recent
>> version, please file a JIRA. Thanks!
>>
>>
>> On Tue, Aug 5, 2014 at 11:55 AM, Brad Miller <bmill...@eecs.berkeley.edu>
>> wrote:
>>
>>> Hi All,
>>>
>>> I am interested in using jsonRDD and jsonFile to create a SchemaRDD out of
>>> some JSON data I have, but I've run into some instability involving the
>>> following java exception:
>>>
>>> An error occurred while calling o1326.collect.
>>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>>> Task 181.0:29 failed 4 times, most recent failure: Exception failure in
>>> TID 1664 on host neal.research.intel-research.net:
>>> net.razorvine.pickle.PickleException: couldn't introspect javabean:
>>> java.lang.IllegalArgumentException: wrong number of arguments
>>>
>>> I've pasted code that produces the error, as well as the full traceback,
>>> below. Note that I don't have any problem when I parse the JSON myself and
>>> use inferSchema.
>>>
>>> Is anybody able to reproduce this bug?
>>>
>>> -Brad
>>>
>>> > srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[1,2,3]}', '{"foo":"boom", "baz":[1,2,3]}']))
>>> > srdd.printSchema()
>>> root
>>>  |-- baz: ArrayType[IntegerType]
>>>  |-- foo: StringType
>>>
>>> > srdd.collect()
>>>
>>> ---------------------------------------------------------------------------
>>> Py4JJavaError                             Traceback (most recent call last)
>>> <ipython-input-89-ec7e8e8c68c4> in <module>()
>>> ----> 1 srdd.collect()
>>>
>>> /home/spark/spark-1.0.1-bin-hadoop1/python/pyspark/rdd.py in collect(self)
>>>     581         """
>>>     582         with _JavaStackTrace(self.context) as st:
>>> --> 583             bytesInJava = self._jrdd.collect().iterator()
>>>     584         return list(self._collect_iterator_through_file(bytesInJava))
>>>     585
>>>
>>> /usr/local/lib/python2.7/dist-packages/py4j/java_gateway.pyc in __call__(self, *args)
>>>     535         answer = self.gateway_client.send_command(command)
>>>     536         return_value = get_return_value(answer, self.gateway_client,
>>> --> 537                 self.target_id, self.name)
>>>     538
>>>     539         for temp_arg in temp_args:
>>>
>>> /usr/local/lib/python2.7/dist-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
>>>     298                 raise Py4JJavaError(
>>>     299                     'An error occurred while calling {0}{1}{2}.\n'.
>>> --> 300                     format(target_id, '.', name), value)
>>>     301             else:
>>>     302                 raise Py4JError(
>>>
>>> Py4JJavaError: An error occurred while calling o1326.collect.
>>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>>> Task 181.0:29 failed 4 times, most recent failure: Exception failure in
>>> TID 1664 on host neal.research.intel-research.net:
>>> net.razorvine.pickle.PickleException: couldn't introspect javabean:
>>> java.lang.IllegalArgumentException: wrong number of arguments
>>>         net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603)
>>>         net.razorvine.pickle.Pickler.dispatch(Pickler.java:299)
>>>         net.razorvine.pickle.Pickler.save(Pickler.java:125)
>>>         net.razorvine.pickle.Pickler.put_map(Pickler.java:322)
>>>         net.razorvine.pickle.Pickler.dispatch(Pickler.java:286)
>>>         net.razorvine.pickle.Pickler.save(Pickler.java:125)
>>>         net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392)
>>>         net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
>>>         net.razorvine.pickle.Pickler.save(Pickler.java:125)
>>>         net.razorvine.pickle.Pickler.dump(Pickler.java:95)
>>>         net.razorvine.pickle.Pickler.dumps(Pickler.java:80)
>>>         org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385)
>>>         org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385)
>>>         scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>>         org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:294)
>>>         org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
>>>         org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
>>>         org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
>>>         org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
>>>         org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
>>> Driver stacktrace:
>>>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>>>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>         at scala.Option.foreach(Option.scala:236)
>>>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>>>         at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)