[ https://issues.apache.org/jira/browse/SPARK-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196665#comment-15196665 ]

Hyukjin Kwon commented on SPARK-13764:
--------------------------------------

The issue SPARK-3308 is related to supporting rows that are wrapped in an array.
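
For example, with that support, a top-level JSON array such as the sample below 
(an illustrative document, not data from this issue) can be read as two rows:

{noformat}
[{"id": 1}, {"id": 2}]
{noformat}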

For other types, this works like the PERMISSIVE mode in the CSV data source, 
but when the actual data is an array and the given data type is {{StructType}}, 
it emits the exception above.
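
A minimal sketch of how to reproduce this, assuming spark-shell (so 
{{sqlContext}} is predefined) and a hypothetical file {{people.json}} holding 
the two documents from the description, one document per line:

{noformat}
import org.apache.spark.sql.types._

// A user-supplied schema that expects "user" to be a struct.
val schema = StructType(Seq(
  StructField("request", StructType(Seq(
    StructField("user", StructType(Seq(
      StructField("id", LongType)
    )))
  )))
))

// The second document has "user": [], so the parser meets an array where the
// schema expects a struct; evaluating a predicate on the parsed row then
// fails with the ClassCastException shown in the description.
val df = sqlContext.read.schema(schema).json("people.json")
df.filter("request.user.id = 123").count()
{noformat}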

This is because the JSON data source converts each value to the desired type 
based on the combination of the Jackson parser's current token and the given 
data type, and the combination of {{START_ARRAY}} and {{StructType}} exists to 
support the issue above.

So, if the given schema is {{StructType}} but the actual data is an array, the 
{{START_ARRAY}} and {{StructType}} case is applied, producing array data rather 
than a row, which causes the exception above.
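
A simplified sketch of that dispatch, with the conversion helpers stubbed out 
(this abbreviates the real {{JacksonParser}}; the signatures here are 
illustrative, not the actual source):

{noformat}
import com.fasterxml.jackson.core.{JsonParser, JsonToken}
import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

// Stubs standing in for the real conversion helpers.
def convertObject(parser: JsonParser, st: StructType): Any = ???
def convertArray(parser: JsonParser, et: DataType): Any = ???

// Each value is converted by matching on (current token, expected type).
def convertField(parser: JsonParser, schema: DataType): Any =
  (parser.getCurrentToken, schema) match {
    case (JsonToken.START_OBJECT, st: StructType) =>
      convertObject(parser, st)        // a JSON object becomes a row
    case (JsonToken.START_ARRAY, st: StructType) =>
      convertArray(parser, st)         // SPARK-3308: every element of a JSON
                                       // array becomes a row, so this returns
                                       // array data, not a single row
    case (JsonToken.START_ARRAY, ArrayType(et, _)) =>
      convertArray(parser, et)         // a JSON array becomes array data
    case (JsonToken.START_OBJECT, ArrayType(et, _)) =>
      convertField(parser, et) :: Nil  // SPARK-3308: wrap a lone object
    // ... combinations for atomic types elided ...
  }
{noformat}

A caller that expects an {{InternalRow}} then hits the cast error when the 
{{START_ARRAY}} and {{StructType}} case returns array data.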


> Parse modes in JSON data source
> -------------------------------
>
>                 Key: SPARK-13764
>                 URL: https://issues.apache.org/jira/browse/SPARK-13764
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> Currently, the JSON data source simply fails to read if some JSON documents 
> are malformed.
> For example, given the two JSON documents below:
> {noformat}
> {
>   "request": {
>     "user": {
>       "id": 123
>     }
>   }
> }
> {noformat}
> {noformat}
> {
>   "request": {
>     "user": []
>   }
> }
> {noformat}
> This will fail, emitting the exception below:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): 
> java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>       at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
>       at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
>       at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
>       at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>       at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>       at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>       at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>       at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>       at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
>       at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>       at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>       at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>       at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>       at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>       at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>       at org.apache.spark.scheduler.Task.run(Task.scala:88)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> {noformat}
> So, just like the parse modes in the CSV data source (see 
> https://github.com/databricks/spark-csv), it would be great if there were 
> parse modes so that users do not have to filter or pre-process the data 
> themselves.
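> A rough sketch of what the usage could look like, borrowing the option name 
> and mode values from spark-csv (these are assumptions, not an existing JSON 
> API):
> {noformat}
> // Hypothetical: parse modes for JSON, mirroring spark-csv.
> //   PERMISSIVE    - keep reading, nulling out values that cannot be converted
> //   DROPMALFORMED - drop documents that do not match the schema
> //   FAILFAST      - abort on the first malformed document (current behaviour)
> val df = sqlContext.read
>   .schema(customSchema)             // a user-supplied schema
>   .option("mode", "DROPMALFORMED")  // e.g. drop the document with "user": []
>   .json("people.json")              // assumed input path
> {noformat}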
> This happens only when a custom schema is set. When the schema is inferred, 
> the type is inferred as {{StringType}}, which reads the data successfully 
> anyway.


