Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-27 Thread Kelly, Jonathan
Yeah, only a few hours after I sent my message I saw some correspondence on 
this other thread: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-complex-types-like-map-lt-string-map-lt-string-int-gt-gt-in-spark-sql-td19603.html,
 which is the exact same issue.  Glad to find that this should be fixed in 
1.2.0!  I'll give that a try later.

Thanks a lot,
Jonathan

From: Yin Huai huaiyin@gmail.com
Date: Thursday, November 27, 2014 at 4:37 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded 
from a JSON file using schema auto-detection

Hello Jonathan,

There was a bug regarding casting data types before inserting into a Hive 
table. Hive does not have the notion of containsNull for array values. So, 
for a Hive table, containsNull will always be true for an array, and we 
should ignore this field for Hive. This issue has been fixed by 
https://issues.apache.org/jira/browse/SPARK-4245, which will be released with 
1.2.

Thanks,

Yin
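
As a stopgap before 1.2, one workaround would presumably be to relax an
auto-detected schema so that containsNull is true everywhere before saving
to Hive. A rough, untested sketch of that idea (a hypothetical helper,
written against the 1.1-era catalyst types that appear in the stack trace
later in this thread):

import org.apache.spark.sql.catalyst.types._

// Recursively force containsNull/valueContainsNull to true, since Hive
// effectively treats array and map elements as nullable anyway.
def relaxContainsNull(dt: DataType): DataType = dt match {
  case ArrayType(elementType, _) =>
    ArrayType(relaxContainsNull(elementType), containsNull = true)
  case MapType(keyType, valueType, _) =>
    MapType(relaxContainsNull(keyType), relaxContainsNull(valueType),
      valueContainsNull = true)
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = relaxContainsNull(f.dataType))))
  case other => other
}

// For example, reload the JSON with the relaxed schema and then save:
// val relaxed = relaxContainsNull(test.schema).asInstanceOf[StructType]
// sqlContext.jsonFile("test.json", relaxed).saveAsTable("test")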

SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-26 Thread Kelly, Jonathan
I've noticed some strange behavior when I try to use
SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file
that contains elements with nested arrays.  For example, with a file
test.json that contains the single line:

{"values":[1,2,3]}

and with code like the following:

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val test = sqlContext.jsonFile("test.json")
scala> test.saveAsTable("test")

it creates the table but fails when inserting the data into it.  Here's
the exception:

scala.MatchError: ArrayType(IntegerType,true) (of class org.apache.spark.sql.catalyst.types.ArrayType)
    at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
    at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
    at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
    at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    at org.apache.spark.scheduler.Task.run(Task.scala:54)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I'm guessing that this is due to the slight difference in the schemas of
these tables:

scala> test.printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = false)


scala> sqlContext.table("test").printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = true)

If I reload the file using the schema that was created for the Hive table
and then try inserting the data into the table, it works:

scala> sqlContext.jsonFile("file:///home/hadoop/test.json",
sqlContext.table("test").schema).insertInto("test")
scala> sqlContext.sql("select * from test").collect().foreach(println)
[ArrayBuffer(1, 2, 3)]

Does this mean that there is a bug with how the schema is being
automatically determined when you use HiveContext.jsonFile() for JSON
files that contain nested arrays?  (i.e., should containsNull be true for
the array elements?)  Or is there a bug with how the Hive table is created
from the SchemaRDD?  (i.e., should containsNull in fact be false?)  I can
probably get around this by defining the schema myself rather than using
auto-detection, but for now I'd like to use auto-detection.
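
(For reference, the kind of manual schema definition I mean would look
roughly like this -- an untested sketch against the 1.1 API, reusing the
same file path and table name as above:)

import org.apache.spark.sql.catalyst.types._

val schema = StructType(Seq(
  StructField("values", ArrayType(IntegerType, containsNull = true), nullable = true)))
val test = sqlContext.jsonFile("file:///home/hadoop/test.json", schema)
test.saveAsTable("test")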

By the way, I'm using Spark 1.1.0.

Thanks,
Jonathan





Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-26 Thread Kelly, Jonathan
After playing around with this a little more, I discovered that:

1. If test.json contains something like {"values":[null,1,2,3]}, the
schema auto-determined by SchemaRDD.jsonFile() will have element: integer
(containsNull = true), and then
SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work (which of course
makes sense but doesn't really help; see the sketch after this list).
2. If I specify the schema myself (e.g., sqlContext.jsonFile("test.json",
StructType(Seq(StructField("values", ArrayType(IntegerType, true),
true))))), that also makes SchemaRDD.saveAsTable()/SchemaRDD.insertInto()
work, though as I mentioned before, this is less than ideal.
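
A quick illustration of (1), assuming a hypothetical file
test_with_null.json whose single line is {"values":[null,1,2,3]} --
untested:

scala> val withNull = sqlContext.jsonFile("test_with_null.json")
scala> withNull.printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = true)

scala> withNull.saveAsTable("test_with_null")  // succeeds: containsNull already matches the Hive table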

Why don't saveAsTable/insertInto work when the containsNull properties
don't match?  I can understand how inserting data with containsNull=true
into a column where containsNull=false might fail, but I think the other
way around (which is the case here) should work.

~ Jonathan





Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-26 Thread Yin Huai
Hello Jonathan,

There was a bug regarding casting data types before inserting into a Hive
table. Hive does not have the notion of containsNull for array values.
So, for a Hive table, containsNull will always be true for an array, and
we should ignore this field for Hive. This issue has been fixed by
https://issues.apache.org/jira/browse/SPARK-4245, which will be released
with 1.2.

Thanks,

Yin
