Nick: Thanks for both the original JIRA bug report and the link.

Michael: This is on the 1.0.1 release. I'll update to master and follow up if I have any problems.
best,
-Brad

On Tue, Aug 5, 2014 at 12:04 PM, Michael Armbrust <mich...@databricks.com> wrote:

> Is this on 1.0.1? I'd suggest running this on master or the 1.1-RC, which
> should be coming out this week. PySpark did not have good support for
> nested data previously. If you still encounter issues using a more recent
> version, please file a JIRA. Thanks!
>
>
> On Tue, Aug 5, 2014 at 11:55 AM, Brad Miller <bmill...@eecs.berkeley.edu>
> wrote:
>
>> Hi All,
>>
>> I am interested in using jsonRDD and jsonFile to create a SchemaRDD from
>> some JSON data I have, but I've run into some instability involving the
>> following Java exception:
>>
>> An error occurred while calling o1326.collect.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 181.0:29 failed 4 times, most recent failure: Exception failure in TID 1664
>> on host neal.research.intel-research.net:
>> net.razorvine.pickle.PickleException: couldn't introspect javabean:
>> java.lang.IllegalArgumentException: wrong number of arguments
>>
>> I've pasted code that produces the error, as well as the full traceback,
>> below. Note that I don't have any problem when I parse the JSON myself and
>> use inferSchema.
>>
>> Is anybody able to reproduce this bug?
>>
>> -Brad
>>
>> > srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[1,2,3]}',
>> '{"foo":"boom", "baz":[1,2,3]}']))
>> > srdd.printSchema()
>>
>> root
>>  |-- baz: ArrayType[IntegerType]
>>  |-- foo: StringType
>>
>> > srdd.collect()
>>
>> ---------------------------------------------------------------------------
>> Py4JJavaError                             Traceback (most recent call last)
>> <ipython-input-89-ec7e8e8c68c4> in <module>()
>> ----> 1 srdd.collect()
>>
>> /home/spark/spark-1.0.1-bin-hadoop1/python/pyspark/rdd.py in collect(self)
>>     581         """
>>     582         with _JavaStackTrace(self.context) as st:
>> --> 583             bytesInJava = self._jrdd.collect().iterator()
>>     584         return list(self._collect_iterator_through_file(bytesInJava))
>>     585
>>
>> /usr/local/lib/python2.7/dist-packages/py4j/java_gateway.pyc in __call__(self, *args)
>>     535         answer = self.gateway_client.send_command(command)
>>     536         return_value = get_return_value(answer, self.gateway_client,
>> --> 537                 self.target_id, self.name)
>>     538
>>     539         for temp_arg in temp_args:
>>
>> /usr/local/lib/python2.7/dist-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
>>     298                 raise Py4JJavaError(
>>     299                     'An error occurred while calling {0}{1}{2}.\n'.
>> --> 300                     format(target_id, '.', name), value)
>>     301             else:
>>     302                 raise Py4JError(
>>
>> Py4JJavaError: An error occurred while calling o1326.collect.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 181.0:29 failed 4 times, most recent failure: Exception failure in TID 1664
>> on host neal.research.intel-research.net:
>> net.razorvine.pickle.PickleException: couldn't introspect javabean:
>> java.lang.IllegalArgumentException: wrong number of arguments
>>         net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603)
>>         net.razorvine.pickle.Pickler.dispatch(Pickler.java:299)
>>         net.razorvine.pickle.Pickler.save(Pickler.java:125)
>>         net.razorvine.pickle.Pickler.put_map(Pickler.java:322)
>>         net.razorvine.pickle.Pickler.dispatch(Pickler.java:286)
>>         net.razorvine.pickle.Pickler.save(Pickler.java:125)
>>         net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392)
>>         net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
>>         net.razorvine.pickle.Pickler.save(Pickler.java:125)
>>         net.razorvine.pickle.Pickler.dump(Pickler.java:95)
>>         net.razorvine.pickle.Pickler.dumps(Pickler.java:80)
>>         org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385)
>>         org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385)
>>         scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>         org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:294)
>>         org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
>>         org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
>>         org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
>>         org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
>>         org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
>> Driver stacktrace:
>>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>         at scala.Option.foreach(Option.scala:236)
>>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>>         at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
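For reference, here is a minimal sketch of the workaround Brad mentions
(parsing the JSON in Python and calling inferSchema instead of jsonRDD).
It is a sketch only, assuming the same sc and sqlCtx as in the snippet
above and the Spark 1.0.x Python API:

    import json

    # Same sample records as the failing jsonRDD call above.
    raw = sc.parallelize(['{"foo":"bar", "baz":[1,2,3]}',
                          '{"foo":"boom", "baz":[1,2,3]}'])

    # json.loads yields plain Python dicts and lists.
    parsed = raw.map(lambda line: json.loads(line))

    # In the 1.0.x API, inferSchema accepts an RDD of dicts and infers
    # the fields' names and types (here, an ArrayType for baz) from the
    # data. Per the report above, collect() succeeds on this SchemaRDD
    # where the jsonRDD-built one raises the PickleException.
    srdd = sqlCtx.inferSchema(parsed)
    srdd.collect()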