[jira] [Commented] (SPARK-10848) Applied JSON Schema Works for json RDD but not when loading json file

Xin Wu (JIRA) Wed, 11 Nov 2015 20:19:56 -0800

    [ https://issues.apache.org/jira/browse/SPARK-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001667#comment-15001667 ]


Xin Wu commented on SPARK-10848:
--------------------------------

Debugging through the code, I found that the ResolvedDataSource.apply function 
purposely invokes asNullable when constructing the schema from the 
user-specified schema. The resulting schema (StructType) is used to create the 
relation, in this case JsonRelation.
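
To illustrate the effect of that call, here is a minimal sketch of what an 
asNullable-style conversion does to a schema. toNullable is a hypothetical 
helper written for this example, not Spark's own code (asNullable itself is 
not part of the public API), and it assumes the Spark SQL types are on the 
classpath, e.g. in spark-shell.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: returns a copy of the given type in which every
// struct field, array element, and map value is marked nullable.
def toNullable(dt: DataType): DataType = dt match {
  case StructType(fields) =>
    StructType(fields.map(f =>
      f.copy(dataType = toNullable(f.dataType), nullable = true)))
  case ArrayType(elementType, _) =>
    ArrayType(toNullable(elementType), containsNull = true)
  case MapType(keyType, valueType, _) =>
    MapType(toNullable(keyType), toNullable(valueType), valueContainsNull = true)
  case other => other // atomic types carry no nullability of their own
}

// A user-specified schema with NOT NULL fields, as in this issue.
val strict = StructType(Seq(
  StructField("OrderID", LongType, nullable = false),
  StructField("CustomerID", IntegerType, nullable = false)))

// After the conversion, the nullable = false flags are gone.
val relaxed = toNullable(strict).asInstanceOf[StructType]
relaxed.printTreeString()
```

A schema that goes through a conversion like this before the relation is 
created can no longer report nullable = false, whatever the user specified.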

Removing this invocation of course results in a schema definition that is 
consistent with the user-specified one in terms of nullability. I guess the 
design was meant to cover the case where the loaded data file may not have a 
value for certain fields: if the user-specified schema defines those fields as 
NOT NULL, would that lead to some issue later?

But I tried loading a json file with a missing value for a field that is 
defined as NOT NULL, and I was still able to do show, orderBy, etc.
{code}
Testing started at 8:14 PM ...
+-------+----------+
|OrderID|CustomerID|
+-------+----------+
|      1|       452|
|      2|      null|
+-------+----------+

+-------+----------+
|OrderID|CustomerID|
+-------+----------+
|      2|      null|
|      1|       452|
+-------+----------+

root
 |-- OrderID: long (nullable = false)
 |-- CustomerID: integer (nullable = false)
{code}

The json file has the following:
{code}
{"OrderID": 1, "CustomerID":452}
{"OrderID": 2}  
{code}
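
The experiment above can be approximated with the sketch below. The helper 
name checkStrictLoad, the file name orders_missing.json, and the choice of 
CustomerID as the sort column are assumptions made for illustration; the 
comment does not state them. It assumes a Spark 1.5.x spark-shell session.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

// Both fields are declared NOT NULL, matching the printed schema above.
val strictSchema = StructType(Seq(
  StructField("OrderID", LongType, nullable = false),
  StructField("CustomerID", IntegerType, nullable = false)))

// Hypothetical helper: loads a JSON file with the strict schema and runs
// the same kinds of actions mentioned in the comment.
def checkStrictLoad(sqlContext: SQLContext, path: String): Unit = {
  val df = sqlContext.read.schema(strictSchema).json(path)
  df.show()                       // rows with a missing CustomerID show null
  df.orderBy("CustomerID").show() // sort column is an assumption
  df.printSchema()                // reports the nullability Spark ended up with
}
```

In spark-shell this would be called as 
checkStrictLoad(sqlContext, "orders_missing.json"). On an unmodified 1.5.0 
build, the issue description says the printed schema would show 
nullable = true for every field; the nullable = false output above 
presumably comes from a build with the asNullable call removed.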


> Applied JSON Schema Works for json RDD but not when loading json file
> ---------------------------------------------------------------------
>
>                 Key: SPARK-10848
>                 URL: https://issues.apache.org/jira/browse/SPARK-10848
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Miklos Christine
>            Priority: Minor
>
> Using a defined schema to load a json rdd works as expected. Loading the json 
> records from a file does not apply the supplied schema. Specifically, the 
> nullable flag isn't applied correctly: loading from a file uses nullable=true 
> on all fields regardless of the applied schema.
> Code to reproduce:
> {code}
> import org.apache.spark.sql.types._
> // Two sample JSON records, one per string
> val jsonRdd = sc.parallelize(List(
>   """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": "WQT648", "Qty": 5}""",
>   """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}"""))
> // User-specified schema: the first five fields are declared non-nullable
> val mySchema = StructType(Array(
>   StructField(name="OrderID"   , dataType=LongType, nullable=false),
>   StructField("CustomerID", IntegerType, false),
>   StructField("OrderDate", DateType, false),
>   StructField("ProductCode", StringType, false),
>   StructField("Qty", IntegerType, false),
>   StructField("Discount", FloatType, true),
>   StructField("expressDelivery", BooleanType, true)
> ))
> // Schema applied to an RDD of JSON strings
> val myDF = sqlContext.read.schema(mySchema).json(jsonRdd)
> myDF.printSchema()
> // Same schema applied when loading the records from a file
> val dfDFfromFile = sqlContext.read.schema(mySchema).json("Orders.json")
> dfDFfromFile.printSchema()
> {code}
> Orders.json
> {code}
> {"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": "WQT648", "Qty": 5}
> {"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}
> {code}
> The behavior should be consistent. 



