Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-20 Thread Jeff Zhang
BTW, I think the JSON parser should at least validate the JSON format when it
infers the schema.
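
Until something like that lands, a quick user-side sanity check can be run
before calling read.json (a minimal sketch; the file name is a placeholder,
not from this thread):

import json

# Check that every non-empty line parses as a standalone JSON record,
# which is the format DataFrame's JSON reader expects.
with open("pageviews.json") as f:
    for lineno, line in enumerate(f, 1):
        line = line.strip()
        if not line:
            continue
        try:
            json.loads(line)
        except ValueError:
            print("line %d is not a standalone JSON record" % lineno)
            break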

On Wed, Oct 21, 2015 at 12:59 PM, Jeff Zhang  wrote:

> I think this is due to the JSON file format. DataFrame can only accept a
> JSON file with one valid record per line; a record spread across multiple
> lines is invalid for DataFrame.
>
>
> On Tue, Oct 6, 2015 at 2:48 AM, Davies Liu  wrote:
>
>> Could you create a JIRA to track this bug?
>>
>> On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan
>>  wrote:
>> > Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1.
>> > [...]

Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-20 Thread Jeff Zhang
I think this is due to the JSON file format. DataFrame can only accept a
JSON file with one valid record per line; a record spread across multiple
lines is invalid for DataFrame.
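
A possible workaround until the data is reformatted (a minimal sketch,
assuming each input file holds exactly one pretty-printed record; the path
placeholder matches the one in the original report): read each file whole,
collapse each record onto a single line, and let jsonRDD infer the schema
from the one-record-per-line strings.

import json
import pyspark

sc = pyspark.SparkContext()
sqlContext = pyspark.SQLContext(sc)

# wholeTextFiles yields (filename, full contents) pairs, so a record that
# spans multiple lines stays together.
raw = sc.wholeTextFiles("[Path to folder containing file with above json]")

# Re-serialize each record as a single compact line of JSON.
one_line = raw.map(lambda kv: json.dumps(json.loads(kv[1])))

# jsonRDD (still available in 1.4) infers the schema from these strings.
pageviews = sqlContext.jsonRDD(one_line)
pageviews.collect()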


On Tue, Oct 6, 2015 at 2:48 AM, Davies Liu  wrote:

> Could you create a JIRA to track this bug?
>
> On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan
>  wrote:
> > Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1.
> > [...]

Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-20 Thread Balaji Vijayan
You are correct; that was the issue.
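
For anyone who runs into the same error, here is a minimal sketch of the
pre-processing that turns pretty-printed files into the one-record-per-line
format DataFrame expects (the paths are placeholders, and each input file is
assumed to hold exactly one record):

import glob
import json

# Placeholder locations; adjust to the real input folder and output file.
input_files = glob.glob("input_json/*.json")

with open("pageviews.jsonl", "w") as out:
    for path in input_files:
        with open(path) as f:
            record = json.load(f)               # parse the pretty-printed record
        out.write(json.dumps(record) + "\n")    # write it back as one line

After that, sqlContext.read.json("pageviews.jsonl") should load the data
without the MatchError.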

On Tue, Oct 20, 2015 at 10:18 PM, Jeff Zhang  wrote:

> BTW, I think the JSON parser should at least validate the JSON format when
> it infers the schema.
>
> On Wed, Oct 21, 2015 at 12:59 PM, Jeff Zhang  wrote:
>
>> I think this is due to the JSON file format. DataFrame can only accept a
>> JSON file with one valid record per line; a record spread across multiple
>> lines is invalid for DataFrame.
>>
>>
>> On Tue, Oct 6, 2015 at 2:48 AM, Davies Liu  wrote:
>>
>>> Could you create a JIRA to track this bug?
>>>
>>> On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan
>>>  wrote:
>>> > Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1.
>>> > [...]

Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-05 Thread Davies Liu
Could you create a JIRA to track this bug?

On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan
 wrote:
> Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1.
> [...]

Reading JSON in Pyspark throws scala.MatchError

2015-10-02 Thread balajikvijayan
Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1.

I'm trying to read in a large quantity of JSON data spread across a couple of
files, and I receive a scala.MatchError when I do so. The JSON, the Python
code, and the stack trace are all shown below.

Json:

{
  "dataunit": {
    "page_view": {
      "nonce": 438058072,
      "person": {
        "user_id": 5846
      },
      "page": {
        "url": "http://mysite.com/blog"
      }
    }
  },
  "pedigree": {
    "true_as_of_secs": 1438627992
  }
}

Python:

import pyspark
sc = pyspark.SparkContext()
sqlContext = pyspark.SQLContext(sc)
pageviews = sqlContext.read.json("[Path to folder containing file with above json]")
pageviews.collect()

Stack Trace:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 32.0 failed 1 times, most recent failure: Lost task 1.0 in stage 32.0 (TID 133, localhost): scala.MatchError: (VALUE_STRING,ArrayType(StructType(),true)) (of class scala.Tuple2)
    at org.apache.spark.sql.json.JacksonParser$.convertField(JacksonParser.scala:49)
    at org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:201)
    at org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:193)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:116)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:111)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:111)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:111)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:111)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

It seems like this issue has been resolved in Scala per SPARK-3390; any
thoughts on the root cause of this in PySpark?




Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-02 Thread Ted Yu
I got the following when parsing your input with the master branch (Python
version 2.6.6):

http://pastebin.com/1w8WM3tz

FYI

On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan 
wrote:

> Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1.
> [...]