Hello Jonathan,

There was a bug in how data types were cast before inserting into a Hive
table. Hive does not have the notion of "containsNull" for array values, so
for a Hive table, containsNull will always be true for an array, and we
should ignore this field for Hive. This issue has been fixed by
https://issues.apache.org/jira/browse/SPARK-4245, which will be released
with 1.2.
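Until 1.2 is out, a workaround on 1.1 is to make the JSON schema match what
Hive expects by declaring containsNull = true yourself, as in your point 2
below. A minimal sketch (untested here; it assumes a spark-shell session
with the usual sc, and the single-column layout of your test.json):

  import org.apache.spark.sql._
  import org.apache.spark.sql.hive.HiveContext

  val sqlContext = new HiveContext(sc)

  // Declare the array element type with containsNull = true so the
  // SchemaRDD's schema matches the schema Hive reports for the table.
  val schema = StructType(Seq(
    StructField("values", ArrayType(IntegerType, containsNull = true),
      nullable = true)))

  val test = sqlContext.jsonFile("test.json", schema)
  test.saveAsTable("test")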
Thanks,

Yin

On Wed, Nov 26, 2014 at 9:01 PM, Kelly, Jonathan <jonat...@amazon.com> wrote:

> After playing around with this a little more, I discovered that:
>
> 1. If test.json contains something like {"values":[null,1,2,3]}, the
> schema auto-determined by SchemaRDD.jsonFile() will have "element: integer
> (containsNull = true)", and then
> SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work (which of course
> makes sense but doesn't really help).
> 2. If I specify the schema myself (e.g., sqlContext.jsonFile("test.json",
> StructType(Seq(StructField("values", ArrayType(IntegerType, true),
> true))))), that also makes SchemaRDD.saveAsTable()/SchemaRDD.insertInto()
> work, though as I mentioned before, this is less than ideal.
>
> Why don't saveAsTable/insertInto work when the containsNull properties
> don't match? I can understand how inserting data with containsNull=true
> into a column where containsNull=false might fail, but I think the other
> way around (which is the case here) should work.
>
> ~ Jonathan
>
>
> On 11/26/14, 5:23 PM, "Kelly, Jonathan" <jonat...@amazon.com> wrote:
>
> >I've noticed some strange behavior when I try to use
> >SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON
> >file that contains elements with nested arrays. For example, with a file
> >test.json that contains the single line:
> >
> >  {"values":[1,2,3]}
> >
> >and with code like the following:
> >
> >  scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> >  scala> val test = sqlContext.jsonFile("test.json")
> >  scala> test.saveAsTable("test")
> >
> >it creates the table but fails when inserting the data into it. Here's
> >the exception:
> >
> >scala.MatchError: ArrayType(IntegerType,true) (of class
> >org.apache.spark.sql.catalyst.types.ArrayType)
> >  at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
> >  at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
> >  at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
> >  at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
> >  at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
> >  at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
> >  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
> >  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
> >  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
> >  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> >  at org.apache.spark.scheduler.Task.run(Task.scala:54)
> >  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> >  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >  at java.lang.Thread.run(Thread.java:745)
> >
> >I'm guessing that this is due to the slight difference in the schemas of
> >these tables:
> >
> >  scala> test.printSchema
> >  root
> >   |-- values: array (nullable = true)
> >   |    |-- element: integer (containsNull = false)
> >
> >  scala> sqlContext.table("test").printSchema
> >  root
> >   |-- values: array (nullable = true)
> >   |    |-- element: integer (containsNull = true)
> >
> >If I reload the file using the schema that was created for the Hive table
> >and then try inserting the data into the table, it works:
> >
> >  scala> sqlContext.jsonFile("file:///home/hadoop/test.json",
> >         sqlContext.table("test").schema).insertInto("test")
> >  scala> sqlContext.sql("select * from test").collect().foreach(println)
> >  [ArrayBuffer(1, 2, 3)]
> >
> >Does this mean that there is a bug in how the schema is automatically
> >determined when you use HiveContext.jsonFile() for JSON files that
> >contain nested arrays? (i.e., should containsNull be true for the array
> >elements?) Or is there a bug in how the Hive table is created from the
> >SchemaRDD? (i.e., should containsNull in fact be false?) I can probably
> >get around this by defining the schema myself rather than using
> >auto-detection, but for now I'd like to use auto-detection.
> >
> >By the way, I'm using Spark 1.1.0.
> >
> >Thanks,
> >Jonathan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org