Maxim Gekk created SPARK-24269:
----------------------------------

             Summary: Infer nullability rather than declaring all columns as 
nullable
                 Key: SPARK-24269
                 URL: https://issues.apache.org/jira/browse/SPARK-24269
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Maxim Gekk


Currently, the CSV and JSON datasources set the *nullable* flag to true for 
every column during schema inference, regardless of the data itself.

JSON: 
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala#L126
CSV: 
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L51

For example, suppose the source dataset has the following schema:
{code}
root
 |-- item_id: integer (nullable = false)
 |-- country: string (nullable = false)
 |-- state: string (nullable = false)
{code}

If we save it and read it back, the inferred schema of the resulting dataset is:
{code}
root
 |-- item_id: integer (nullable = true)
 |-- country: string (nullable = true)
 |-- state: string (nullable = true)
{code}
This ticket aims to set the nullable flag more precisely during schema 
inference, based on the data that is actually read.
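
The idea can be illustrated with a minimal, Spark-independent sketch in plain 
Python. It is not the code in JsonInferSchema.scala or CSVInferSchema.scala, 
and it does not use any Spark API. The function name infer_nullability, the 
null_values parameter, and the sample rows are hypothetical and exist only for 
this illustration. The sketch marks a column as nullable only if at least one 
record is missing the field or holds a null value for it. Treating an empty 
CSV field as null is an assumption of the sketch.

```python
import csv
import io
import json


def infer_nullability(records, columns, null_values=()):
    """Return {column: nullable} for an iterable of dict-like records.

    A column is reported as nullable only if some record lacks the field,
    holds None for it, or holds one of the tokens in null_values.
    Note: for an empty input every column is reported as non-nullable; a real
    implementation would probably prefer to fall back to nullable = True.
    """
    nullable = {name: False for name in columns}
    for record in records:
        for name in columns:
            value = record.get(name)
            if value is None or value in null_values:
                nullable[name] = True
    return nullable


columns = ["item_id", "country", "state"]

# Hypothetical CSV input: the second row has an empty "state" field.
csv_text = "item_id,country,state\n1,US,CA\n2,US,\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
# Assumption: an empty CSV field is interpreted as null.
csv_nullable = infer_nullability(csv_rows, columns, null_values=("",))

# Hypothetical JSON-lines input: every field is present and non-null.
json_lines = [
    '{"item_id": 1, "country": "US", "state": "CA"}',
    '{"item_id": 2, "country": "US", "state": "NY"}',
]
json_rows = [json.loads(line) for line in json_lines]
json_nullable = infer_nullability(json_rows, columns)

print(csv_nullable)
print(json_nullable)
```

In this sketch only the CSV "state" column ends up nullable, because it is the 
only column with a missing value. One design point worth noting: a result 
inferred this way describes only the records that were scanned, so inference 
over a sample of the data could report a column as non-nullable even though 
unscanned records contain nulls.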



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
