[ https://issues.apache.org/jira/browse/SPARK-24269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482148#comment-16482148 ]
Hyukjin Kwon commented on SPARK-24269:
--------------------------------------

I can at least see the reason: if the data doesn't contain nulls, the more correct schema shouldn't set nullable to true. Although I wouldn't do it either.

> Infer nullability rather than declaring all columns as nullable
> ---------------------------------------------------------------
>
>                 Key: SPARK-24269
>                 URL: https://issues.apache.org/jira/browse/SPARK-24269
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maxim Gekk
>            Priority: Minor
>
> Currently, the CSV and JSON datasources set the *nullable* flag to true during schema inference, independently of the data itself.
> JSON: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala#L126
> CSV: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L51
> For example, the source dataset has the schema:
> {code}
> root
>  |-- item_id: integer (nullable = false)
>  |-- country: string (nullable = false)
>  |-- state: string (nullable = false)
> {code}
> If we save it and read it back, the schema of the inferred dataset is:
> {code}
> root
>  |-- item_id: integer (nullable = true)
>  |-- country: string (nullable = true)
>  |-- state: string (nullable = true)
> {code}
> This ticket aims to set the nullable flag more precisely during schema inference, based on the data that is read.
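A minimal Scala sketch of the round trip described in the ticket, assuming a local SparkSession; the sample values and the /tmp output path are illustrative, not taken from the issue. It builds a DataFrame with an explicitly non-nullable schema, writes it out as CSV, and reads it back with schema inference, at which point every column is reported as nullable = true.

{code}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object Spark24269Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-24269-repro")
      .master("local[*]")
      .getOrCreate()

    // Build a DataFrame whose schema explicitly declares every column
    // non-nullable, matching the example in the ticket description.
    val schema = StructType(Seq(
      StructField("item_id", IntegerType, nullable = false),
      StructField("country", StringType, nullable = false),
      StructField("state", StringType, nullable = false)))
    val rows = Seq(Row(1, "US", "CA"), Row(2, "DE", "BY"))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
    df.printSchema()  // all three columns: nullable = false

    // Round-trip through CSV with schema inference: the inferred schema marks
    // every column nullable = true, even though the data contains no nulls.
    val path = "/tmp/spark-24269-items"  // illustrative path
    df.write.mode("overwrite").option("header", "true").csv(path)
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)
      .printSchema()  // all three columns: nullable = true

    spark.stop()
  }
}
{code}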