[ https://issues.apache.org/jira/browse/SPARK-24269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16484071#comment-16484071 ]
Simeon Simeonov commented on SPARK-24269:
-----------------------------------------

There are many reasons why correct nullability inference is important for any data source, not just CSV & JSON.

# It can be used to verify data contracts, especially in data exchange with third parties, via something as simple as schema (StructType) equality. The common practice is to persist a JSON representation of the expected schema.
# It can substantially improve performance and reduce memory use when dealing with Dataset[A <: Product] by using B <: AnyVal directly in case classes, as opposed to via Option[B].
# It can simplify the use of code-generation tools.

As an example of (2), consider the following:

{code:scala}
import org.apache.spark.util.SizeEstimator
import scala.util.Random.nextInt

case class WithNulls(a: Option[Int], b: Option[Int])
case class WithoutNulls(a: Int, b: Int)

val sizeWith = SizeEstimator.estimate(WithNulls(Some(nextInt), Some(nextInt))) // 88
val sizeWithout = SizeEstimator.estimate(WithoutNulls(nextInt, nextInt))       // 24
val percentMemoryReduction = 100.0 * (sizeWith - sizeWithout) / sizeWith       // 72.7
{code}

I would argue that 70+% savings in memory use are a pretty big deal. The savings can be even bigger in the case of many columns with small primitive types (Byte, Short, ...).

As an example of (3), consider tools that code-generate case classes from schemas. We use tools like that at Swoop for efficient & performant transformations that cannot easily be expressed via the provided operations that work on internal rows. Without proper nullability inference, manual configuration has to be provided to these tools. We do this routinely, even for ad hoc data transformations in notebooks.

[~Teng Peng] I agree that, given Spark's current behavior, nullability inference should not be the default. It should be activated via an option.

> Infer nullability rather than declaring all columns as nullable
> ---------------------------------------------------------------
>
>                 Key: SPARK-24269
>                 URL: https://issues.apache.org/jira/browse/SPARK-24269
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maxim Gekk
>            Priority: Minor
>
> Currently, the CSV and JSON data sources set the *nullable* flag to true independently of the data itself during schema inference.
> JSON: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala#L126
> CSV: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L51
> For example, a source dataset has the schema:
> {code}
> root
>  |-- item_id: integer (nullable = false)
>  |-- country: string (nullable = false)
>  |-- state: string (nullable = false)
> {code}
> If we save it and read it again, the schema of the inferred dataset is
> {code}
> root
>  |-- item_id: integer (nullable = true)
>  |-- country: string (nullable = true)
>  |-- state: string (nullable = true)
> {code}
> This ticket aims to set the nullable flag more precisely during schema inference, based on the data read.
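To make (1) and the round-trip behavior described in the ticket concrete, here is a minimal sketch. It assumes a local SparkSession; the Item case class and the /tmp/items_csv path are made up for illustration.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical example record; item_id is a primitive Int, so nullable = false in the encoder schema.
case class Item(item_id: Int, country: String, state: String)

val spark = SparkSession.builder.master("local[*]").appName("nullability-demo").getOrCreate()
import spark.implicits._

val ds = Seq(Item(1, "US", "CA"), Item(2, "US", "NY")).toDS()
ds.printSchema() // item_id: integer (nullable = false)

// Persist the expected schema as JSON -- the "data contract" from (1).
val expectedJson: String = ds.schema.json

// Round-trip through CSV: today's inference marks every column nullable.
ds.write.mode("overwrite").option("header", "true").csv("/tmp/items_csv")
val readBack = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/items_csv")
readBack.printSchema() // item_id: integer (nullable = true)

// A simple StructType equality check therefore fails on nullability alone.
val expected = DataType.fromJson(expectedJson).asInstanceOf[StructType]
val contractHolds = readBack.schema == expected // false with current behavior
{code}

With nullability inferred from the data, the final equality check could pass for datasets that contain no nulls, which is exactly what makes schema equality usable as a contract check.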
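As for (3), the sketch below is a deliberately simplified, hypothetical generator, not the actual tooling referenced above; it only shows how field nullability would drive the choice between Option[T] and T in generated case classes.

{code:scala}
import org.apache.spark.sql.types._

// Hypothetical, minimal code generator: nullable fields become Option[T].
// Only a handful of types are handled; a real tool would cover the full type system.
def caseClassFor(name: String, schema: StructType): String = {
  def scalaType(dt: DataType): String = dt match {
    case IntegerType => "Int"
    case LongType    => "Long"
    case DoubleType  => "Double"
    case BooleanType => "Boolean"
    case StringType  => "String"
    case other       => other.simpleString // fallback, not exhaustive
  }
  val fields = schema.fields.map { f =>
    val t = scalaType(f.dataType)
    val fieldType = if (f.nullable) s"Option[$t]" else t
    s"${f.name}: $fieldType"
  }
  s"case class $name(${fields.mkString(", ")})"
}

// With the current all-nullable inference, every generated field is wrapped in Option,
// even for columns that never contain nulls.
val inferred = StructType(Seq(
  StructField("item_id", IntegerType, nullable = true),
  StructField("country", StringType, nullable = true)))
caseClassFor("Item", inferred)
// => case class Item(item_id: Option[Int], country: Option[String])
{code}

With inferred nullability, the same generator would emit plain Int and String fields for columns that contain no nulls, which is where the memory and ergonomics benefits from (2) come from.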