Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection
Yeah, only a few hours after I sent my message I saw some correspondence on this other thread: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-complex-types-like-map-lt-string-map-lt-string-int-gt-gt-in-spark-sql-td19603.html, which describes the exact same issue. Glad to find that this should be fixed in 1.2.0! I'll give that a try later.

Thanks a lot,
Jonathan

From: Yin Huai <huaiyin@gmail.com>
Date: Thursday, November 27, 2014 at 4:37 PM
To: Jonathan Kelly <jonat...@amazon.com>
Cc: user@spark.apache.org
Subject: Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection
SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection
I've noticed some strange behavior when I try to use SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file that contains elements with nested arrays. For example, with a file test.json that contains the single line:

{"values":[1,2,3]}

and with code like the following:

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val test = sqlContext.jsonFile("test.json")
scala> test.saveAsTable("test")

it creates the table but fails when inserting the data into it. Here's the exception:

scala.MatchError: ArrayType(IntegerType,true) (of class org.apache.spark.sql.catalyst.types.ArrayType)
        at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
        at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
        at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
        at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
        at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
        at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
        at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
        at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

I'm guessing that this is due to the slight difference in the schemas of these tables:

scala> test.printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = false)

scala> sqlContext.table("test").printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = true)

If I reload the file using the schema that was created for the Hive table and then try inserting the data into the table, it works:

scala> sqlContext.jsonFile("file:///home/hadoop/test.json", sqlContext.table("test").schema).insertInto("test")
scala> sqlContext.sql("select * from test").collect().foreach(println)
[ArrayBuffer(1, 2, 3)]

Does this mean that there is a bug with how the schema is being automatically determined when you use HiveContext.jsonFile() for JSON files that contain nested arrays? (i.e., should containsNull be true for the array elements?) Or is there a bug with how the Hive table is created from the SchemaRDD? (i.e., should containsNull in fact be false?) I can probably get around this by defining the schema myself rather than using auto-detection, but for now I'd like to use auto-detection. By the way, I'm using Spark 1.1.0.

Thanks,
Jonathan
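[A minimal sketch of the explicit-schema workaround mentioned above, assuming Spark 1.1's type aliases exported by the org.apache.spark.sql package and the same test.json; this is an illustration, not code from the thread:]

import org.apache.spark.sql._

// Declare the array element as nullable up front (containsNull = true),
// matching what the Hive table will use, so the insert avoids the MatchError.
val schema = StructType(Seq(
  StructField("values", ArrayType(IntegerType, containsNull = true), nullable = true)))

val test = sqlContext.jsonFile("test.json", schema)
test.saveAsTable("test")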
Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection
After playing around with this a little more, I discovered that:

1. If test.json contains something like {"values":[null,1,2,3]}, the schema auto-determined by SQLContext.jsonFile() will have element: integer (containsNull = true), and then SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work (which of course makes sense but doesn't really help).

2. If I specify the schema myself (e.g., sqlContext.jsonFile("test.json", StructType(Seq(StructField("values", ArrayType(IntegerType, true), true))))), that also makes SchemaRDD.saveAsTable()/SchemaRDD.insertInto() work, though as I mentioned before, this is less than ideal.

Why don't saveAsTable/insertInto work when the containsNull properties don't match? I can understand how inserting data with containsNull=true into a column where containsNull=false might fail, but I think the other way around (which is the case here) should work.

~ Jonathan
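[A compact way to see observation 1 in the shell; a hedged sketch, where test2.json is a hypothetical file whose single line is {"values":[null,1,2,3]}:]

scala> val test2 = sqlContext.jsonFile("test2.json")
scala> test2.printSchema   // element: integer (containsNull = true), because of the null
scala> test2.saveAsTable("test2")   // succeeds, since the schemas now agree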
Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection
Hello Jonathan,

There was a bug regarding casting data types before inserting into a Hive table. Hive does not have the notion of containsNull for array values, so for a Hive table containsNull will always be true for an array, and we should ignore this field for Hive. This issue has been fixed by https://issues.apache.org/jira/browse/SPARK-4245, which will be released with 1.2.

Thanks,
Yin
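[Until 1.2 is available, one possible workaround on 1.1 is to rewrite the auto-detected schema so that every array and map allows nulls, then reload the file with that schema. A hedged sketch: relaxNulls is a hypothetical helper, not part of Spark, and this assumes the catalyst type aliases exported by the org.apache.spark.sql package:]

import org.apache.spark.sql._

// Hypothetical helper (not part of Spark): force containsNull /
// valueContainsNull to true throughout a schema, mirroring what Hive
// assumes for complex types; recurses through structs, arrays, and maps.
def relaxNulls(dt: DataType): DataType = dt match {
  case ArrayType(elementType, _) =>
    ArrayType(relaxNulls(elementType), containsNull = true)
  case MapType(keyType, valueType, _) =>
    MapType(relaxNulls(keyType), relaxNulls(valueType), valueContainsNull = true)
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = relaxNulls(f.dataType))))
  case other => other
}

val raw = sqlContext.jsonFile("test.json")
val schema = relaxNulls(raw.schema).asInstanceOf[StructType]
sqlContext.jsonFile("test.json", schema).saveAsTable("test")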