I've noticed some strange behavior when I try to use SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file that contains elements with nested arrays. For example, with a file test.json that contains the single line:
{"values":[1,2,3]} and with code like the following: scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) scala> val test = sqlContext.jsonFile("test.json") scala> test.saveAsTable("test") it creates the table but fails when inserting the data into it. Here¹s the exception: scala.MatchError: ArrayType(IntegerType,true) (of class org.apache.spark.sql.catalyst.types.ArrayType) at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:2 47) at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247) at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala :84) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.appl y(Projection.scala:66) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.appl y(Projection.scala:50) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sq l$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.sca la:149) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHive File$1.apply(InsertIntoHiveTable.scala:158) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHive File$1.apply(InsertIntoHiveTable.scala:158) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1 145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java: 615) at java.lang.Thread.run(Thread.java:745) I'm guessing that this is due to the slight difference in the schemas of these tables: scala> test.printSchema root |-- values: array (nullable = true) | |-- element: integer (containsNull = false) scala> sqlContext.table("test").printSchema root |-- values: array (nullable = true) | |-- element: integer (containsNull = true) If I reload the file using the schema that was created for the Hive table then try inserting the data into the table, it works: scala> sqlContext.jsonFile("file:///home/hadoop/test.json", sqlContext.table("test").schema).insertInto("test") scala> sqlContext.sql("select * from test").collect().foreach(println) [ArrayBuffer(1, 2, 3)] Does this mean that there is a bug with how the schema is being automatically determined when you use HiveContext.jsonFile() for JSON files that contain nested arrays? (i.e., should containsNull be true for the array elements?) Or is there a bug with how the Hive table is created from the SchemaRDD? (i.e., should containsNull in fact be false?) I can probably get around this by defining the schema myself rather than using auto-detection, but for now I¹d like to use auto-detection. By the way, I'm using Spark 1.1.0. Thanks, Jonathan --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org