I think this is due to the JSON file format.  The DataFrame JSON reader only
accepts files with one complete, valid JSON record per line (JSON Lines), so a
record that spans multiple lines is invalid input for it.
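
One workaround is to flatten each record to a single JSON string before the
reader sees it, e.g. by reading whole files and re-serializing them with
Python's json module. A rough sketch only (it assumes each input file holds
exactly one JSON object, and the folder path is just the placeholder from the
original mail):

import json
import pyspark

sc = pyspark.SparkContext()
sqlContext = pyspark.SQLContext(sc)

# Read each file as a single (path, content) pair, so a record may span lines.
raw = sc.wholeTextFiles("[Path to folder containing file with above json]")

# Re-serialize every object as one compact JSON string per record.
one_per_line = raw.map(lambda kv: json.dumps(json.loads(kv[1])))

# jsonRDD infers the schema from an RDD of JSON strings, one record per string.
pageviews = sqlContext.jsonRDD(one_per_line)
pageviews.collect()

Alternatively, converting the files to JSON Lines format (each record on its
own line) lets sqlContext.read.json load them directly.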


On Tue, Oct 6, 2015 at 2:48 AM, Davies Liu <dav...@databricks.com> wrote:

> Could you create a JIRA to track this bug?
>
> On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan
> <balaji.k.vija...@gmail.com> wrote:
> > Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1.
> >
> > I'm trying to read in a large quantity of json data in a couple of files and
> > I receive a scala.MatchError when I do so. Json, Python and stack trace all
> > shown below.
> >
> > Json:
> >
> > {
> >     "dataunit": {
> >         "page_view": {
> >             "nonce": 438058072,
> >             "person": {
> >                 "user_id": 5846
> >             },
> >             "page": {
> >                 "url": "http://mysite.com/blog"
> >             }
> >         }
> >     },
> >     "pedigree": {
> >         "true_as_of_secs": 1438627992
> >     }
> > }
> >
> > Python:
> >
> > import pyspark
> > sc = pyspark.SparkContext()
> > sqlContext = pyspark.SQLContext(sc)
> > pageviews = sqlContext.read.json("[Path to folder containing file with above json]")
> > pageviews.collect()
> >
> > Stack Trace:
> > Py4JJavaError: An error occurred while calling
> > z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> > in stage 32.0 failed 1 times, most recent failure: Lost task 1.0 in stage
> > 32.0 (TID 133, localhost): scala.MatchError:
> > (VALUE_STRING,ArrayType(StructType(),true)) (of class scala.Tuple2)
> >         at org.apache.spark.sql.json.JacksonParser$.convertField(JacksonParser.scala:49)
> >         at org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:201)
> >         at org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:193)
> >         at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> >         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> >         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> >         at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:116)
> >         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> >         at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:111)
> >         at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> >         at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> >         at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> >         at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> >         at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:111)
> >         at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> >         at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:111)
> >         at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> >         at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:111)
> >         at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
> >         at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
> >         at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
> >         at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
> >         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> >         at org.apache.spark.scheduler.Task.run(Task.scala:70)
> >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >         at java.lang.Thread.run(Thread.java:745)
> >
> > Driver stacktrace:
> >         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
> >         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
> >         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
> >         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> >         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> >         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
> >         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> >         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> >         at scala.Option.foreach(Option.scala:236)
> >         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
> >         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
> >         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
> >         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> >
> > It seems like this issue has been resolved in Scala per SPARK-3390
> > <https://issues.apache.org/jira/browse/SPARK-3390>; any thoughts on the
> > root cause of this in pyspark?
> >
> >
> >
> >


-- 
Best Regards

Jeff Zhang
