[ https://issues.apache.org/jira/browse/SPARK-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001667#comment-15001667 ]
Xin Wu commented on SPARK-10848:
--------------------------------

Debugging through the code, I found that the ResolvedDataSource.apply function purposely invokes asNullable when constructing the schema from the user-specified schema. The resulting schema (StructType) is used to create the relation, in this case JsonRelation. Removing this invocation, of course, yields a schema whose nullability is consistent with the user-specified one.

I guess the design was meant to cover the case where the loaded data file may not have a value for certain fields: if the user-specified schema defines those fields as NOT NULL, would that lead to some issue later? But I tried loading a JSON file with a missing value for a field that is defined as NOT NULL, and I was still able to call show, orderBy, etc.

{code}
Testing started at 8:14 PM ...
+-------+----------+
|OrderID|CustomerID|
+-------+----------+
|      1|       452|
|      2|      null|
+-------+----------+

+-------+----------+
|OrderID|CustomerID|
+-------+----------+
|      2|      null|
|      1|       452|
+-------+----------+

root
 |-- OrderID: long (nullable = false)
 |-- CustomerID: integer (nullable = false)
{code}

The JSON file contains the following:

{code}
{"OrderID": 1, "CustomerID":452}
{"OrderID": 2}
{code}

> Applied JSON Schema Works for json RDD but not when loading json file
> ---------------------------------------------------------------------
>
>                 Key: SPARK-10848
>                 URL: https://issues.apache.org/jira/browse/SPARK-10848
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Miklos Christine
>            Priority: Minor
>
> Using a defined schema to load a JSON RDD works as expected. Loading the JSON
> records from a file does not apply the supplied schema. Mainly, the nullable
> field isn't applied correctly: loading from a file uses nullable=true on all
> fields regardless of the applied schema.
> Code to reproduce:
> {code}
> import org.apache.spark.sql.types._
> val jsonRdd = sc.parallelize(List(
>   """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": "WQT648", "Qty": 5}""",
>   """{"OrderID": 2, "CustomerID":16 , "OrderDate": "2015-07-11", "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}"""))
> val mySchema = StructType(Array(
>   StructField(name="OrderID" , dataType=LongType, nullable=false),
>   StructField("CustomerID", IntegerType, false),
>   StructField("OrderDate", DateType, false),
>   StructField("ProductCode", StringType, false),
>   StructField("Qty", IntegerType, false),
>   StructField("Discount", FloatType, true),
>   StructField("expressDelivery", BooleanType, true)
> ))
> val myDF = sqlContext.read.schema(mySchema).json(jsonRdd)
> val schema1 = myDF.printSchema
> val dfDFfromFile = sqlContext.read.schema(mySchema).json("Orders.json")
> val schema2 = dfDFfromFile.printSchema
> {code}
> Orders.json
> {code}
> {"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": "WQT648", "Qty": 5}
> {"OrderID": 2, "CustomerID":16 , "OrderDate": "2015-07-11", "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}
> {code}
> The behavior should be consistent.
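
For reference, the effect of the asNullable call discussed in the comment can be approximated with public API. This is only a minimal sketch, not Spark's actual code: relaxNullability is a hypothetical helper introduced here for illustration, and it only touches top-level fields, whereas the internal method, as far as I can tell, also relaxes nested struct, array, and map types.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (not part of Spark): returns a copy of the schema in
// which every top-level field is marked nullable. Nested types are left
// untouched in this sketch.
def relaxNullability(schema: StructType): StructType =
  StructType(schema.fields.map(_.copy(nullable = true)))

// A two-field subset of the schema from the issue, both declared NOT NULL.
val userSchema = StructType(Array(
  StructField("OrderID", LongType, nullable = false),
  StructField("CustomerID", IntegerType, nullable = false)
))

val relaxed = relaxNullability(userSchema)

// The user-specified schema keeps its NOT NULL flags...
userSchema.fields.foreach(f => println(s"${f.name}: nullable = ${f.nullable}"))
// ...while the relaxed copy reports every field as nullable.
relaxed.fields.foreach(f => println(s"${f.name}: nullable = ${f.nullable}"))
```

If the file-based read path relaxes the schema in this way before building the relation, that would explain why printSchema reports nullable = true for every field when loading from a file, regardless of what the user supplied.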