Maxim Gekk created SPARK-24269:
----------------------------------

             Summary: Infer nullability rather than declaring all columns as nullable
                 Key: SPARK-24269
                 URL: https://issues.apache.org/jira/browse/SPARK-24269
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Maxim Gekk
Currently, the CSV and JSON datasources set the *nullable* flag to true regardless of the data itself during schema inference.

JSON: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala#L126
CSV: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L51

For example, a source dataset has the schema:
{code}
root
 |-- item_id: integer (nullable = false)
 |-- country: string (nullable = false)
 |-- state: string (nullable = false)
{code}
If we save it and read it back, the schema of the inferred dataset is:
{code}
root
 |-- item_id: integer (nullable = true)
 |-- country: string (nullable = true)
 |-- state: string (nullable = true)
{code}
This ticket aims to set the nullable flag more precisely during schema inference, based on the data that is actually read.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
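The proposed behavior could be sketched roughly as follows. This is a minimal illustration, not Spark's actual code: the names `Field` and `inferNullability` are hypothetical, and real inference would operate on parsed tokens per record. The idea is to mark a column nullable only when a missing value is actually observed in the data:

```scala
// Hypothetical sketch of data-driven nullability inference.
// A column is flagged nullable only if at least one row is
// missing a value for it, instead of hard-coding nullable = true.
case class Field(name: String, nullable: Boolean)

def inferNullability(
    header: Seq[String],
    rows: Seq[Seq[Option[String]]]): Seq[Field] =
  header.zipWithIndex.map { case (name, i) =>
    // lift(i) handles short rows; a missing or None cell makes the
    // column nullable
    Field(name, nullable = rows.exists(r => r.lift(i).flatten.isEmpty))
  }
```

With rows where every `item_id` and `country` cell is present but one `state` cell is missing, only `state` would come back with `nullable = true`, matching the schema the ticket wants to preserve.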