[ https://issues.apache.org/jira/browse/SPARK-24269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16484071#comment-16484071 ]
Simeon Simeonov commented on SPARK-24269:
-----------------------------------------

There are many reasons why correct nullability inference is important for any data source, not just CSV & JSON.

# It can be used to verify data contracts, especially in data exchange with third parties, via something as simple as schema (StructType) equality. The common practice is to persist a JSON representation of the expected schema.
# It can substantially improve performance and reduce memory use when dealing with Dataset[A <: Product] by using B <: AnyVal directly in case classes, as opposed to via Option[B].
# It can simplify the use of code-generation tools.

As an example of (2), consider the following:

{code:scala}
import org.apache.spark.util.SizeEstimator
import scala.util.Random.nextInt

case class WithNulls(a: Option[Int], b: Option[Int])
case class WithoutNulls(a: Int, b: Int)

val sizeWith = SizeEstimator.estimate(WithNulls(Some(nextInt), Some(nextInt))) // 88
val sizeWithout = SizeEstimator.estimate(WithoutNulls(nextInt, nextInt))       // 24
val percentMemoryReduction = 100.0 * (sizeWith - sizeWithout) / sizeWith       // 72.7
{code}

I would argue that 70+% savings in memory use are a pretty big deal. The savings can be even bigger in the case of many columns with small primitive types (Byte, Short, ...).

As an example of (3), consider tools that code-generate case classes from schemas. We use tools like that at Swoop for efficient & performant transformations that cannot easily be expressed via the provided operations that work on internal rows. Without proper nullability inference, manual configuration has to be provided to these tools. We do this routinely, even for ad hoc data transformations in notebooks.

[~Teng Peng] I agree that, given Spark's current behavior, nullability inference should not be the default. It should be activated via an option.

> Infer nullability rather than declaring all columns as nullable
> ---------------------------------------------------------------
>
>                 Key: SPARK-24269
>                 URL: https://issues.apache.org/jira/browse/SPARK-24269
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maxim Gekk
>            Priority: Minor
>
> Currently, the CSV and JSON data sources set the *nullable* flag to true independently of the data itself during schema inference.
> JSON: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala#L126
> CSV: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L51
> For example, a source dataset has the schema:
> {code}
> root
>  |-- item_id: integer (nullable = false)
>  |-- country: string (nullable = false)
>  |-- state: string (nullable = false)
> {code}
> If we save it and read it again, the schema of the inferred dataset is
> {code}
> root
>  |-- item_id: integer (nullable = true)
>  |-- country: string (nullable = true)
>  |-- state: string (nullable = true)
> {code}
> This ticket aims to set the nullable flag more precisely during schema inference, based on the data read.
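To make (1) and the round-trip behavior described in the ticket concrete, here is a minimal sketch. It assumes a local SparkSession; the Item case class and the /tmp/items_csv path are made up for illustration.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical example record; item_id is a primitive Int, so nullable = false in the encoder schema.
case class Item(item_id: Int, country: String, state: String)

val spark = SparkSession.builder.master("local[*]").appName("nullability-demo").getOrCreate()
import spark.implicits._

val ds = Seq(Item(1, "US", "CA"), Item(2, "US", "NY")).toDS()
ds.printSchema() // item_id: integer (nullable = false)

// Persist the expected schema as JSON -- the "data contract" from (1).
val expectedJson: String = ds.schema.json

// Round-trip through CSV: today's inference marks every column nullable.
ds.write.mode("overwrite").option("header", "true").csv("/tmp/items_csv")
val readBack = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/items_csv")
readBack.printSchema() // item_id: integer (nullable = true)

// A simple StructType equality check therefore fails on nullability alone.
val expected = DataType.fromJson(expectedJson).asInstanceOf[StructType]
val contractHolds = readBack.schema == expected // false with current behavior
{code}

With nullability inferred from the data, the final equality check could pass for datasets that contain no nulls, which is exactly what makes schema equality usable as a contract check.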
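As for (3), the sketch below is a deliberately simplified, hypothetical generator, not the actual tooling referenced above; it only shows how field nullability would drive the choice between Option[T] and T in generated case classes.

{code:scala}
import org.apache.spark.sql.types._

// Hypothetical, minimal code generator: nullable fields become Option[T].
// Only a handful of types are handled; a real tool would cover the full type system.
def caseClassFor(name: String, schema: StructType): String = {
  def scalaType(dt: DataType): String = dt match {
    case IntegerType => "Int"
    case LongType    => "Long"
    case DoubleType  => "Double"
    case BooleanType => "Boolean"
    case StringType  => "String"
    case other       => other.simpleString // fallback, not exhaustive
  }
  val fields = schema.fields.map { f =>
    val t = scalaType(f.dataType)
    val fieldType = if (f.nullable) s"Option[$t]" else t
    s"${f.name}: $fieldType"
  }
  s"case class $name(${fields.mkString(", ")})"
}

// With the current all-nullable inference, every generated field is wrapped in Option,
// even for columns that never contain nulls.
val inferred = StructType(Seq(
  StructField("item_id", IntegerType, nullable = true),
  StructField("country", StringType, nullable = true)))
caseClassFor("Item", inferred)
// => case class Item(item_id: Option[Int], country: Option[String])
{code}

With inferred nullability, the same generator would emit plain Int and String fields for columns that contain no nulls, which is where the memory and ergonomics benefits from (2) come from.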