GitHub user mn-mikke opened a pull request:

    https://github.com/apache/spark/pull/21747

    [SPARK-24165][SQL][branch-2.3] Fixing conditional expressions to handle nullability of nested types

    ## What changes were proposed in this pull request?
    This PR proposes a fix for the output data type of the ```If``` and 
```CaseWhen``` expressions. Until now, the implementation of these expressions 
has ignored the nullability of nested types from different execution branches 
and returned the type of the first branch.
    
    This could lead to an unexpected ```NullPointerException``` in other 
expressions that depend on an ```If```/```CaseWhen``` expression.
    
    Example:
    ```
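    import java.util
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{lit, struct, when}
    import org.apache.spark.sql.types._
    // Assumes an active SparkSession named `spark`, as in spark-shell.
    import spark.implicits._
    
    val rows = new util.ArrayList[Row]()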
    rows.add(Row(true, ("a", 1)))
    rows.add(Row(false, (null, 2)))
    val schema = StructType(Seq(
      StructField("cond", BooleanType, false),
      StructField("s", StructType(Seq(
        StructField("val1", StringType, true),
        StructField("val2", IntegerType, false)
      )), false)
    ))
    
    val df = spark.createDataFrame(rows, schema)
    
    df
      .select(when('cond, struct(lit("x").as("val1"), lit(10).as("val2"))).otherwise('s) as "res")
      .select('res.getField("val1"))
      .show()
    ```
    Exception:
    ```
    Exception in thread "main" java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
        at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
        at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
    ...
    ```
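    Output schema (```val1``` is reported as non-nullable, taken from the first 
branch, even though the second branch can produce null):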
    ```
    root
     |-- res.val1: string (nullable = false)
    ```
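    
    The idea described above, making a nested field nullable in the result 
whenever it is nullable in any branch, can be sketched roughly as follows. This 
is only an illustration and not the code in this PR: ```DType```, ```Field```, 
```StructT```, ```ArrayT``` and ```NullabilityMerge``` are hypothetical, 
simplified stand-ins for Spark's type classes, chosen so the sketch runs without 
Spark on the classpath.
    ```scala
    // Hypothetical, simplified stand-ins for Spark's DataType hierarchy.
    sealed trait DType
    case object StringT extends DType
    case object IntT extends DType
    final case class Field(name: String, dtype: DType, nullable: Boolean)
    final case class StructT(fields: Seq[Field]) extends DType
    final case class ArrayT(element: DType, containsNull: Boolean) extends DType

    object NullabilityMerge {
      // Merge two types that differ at most in their nullability flags.
      // A nested slot is nullable in the result if it is nullable in either input.
      def merge(a: DType, b: DType): DType = (a, b) match {
        case (StructT(fa), StructT(fb)) if fa.length == fb.length =>
          StructT(fa.zip(fb).map { case (x, y) =>
            require(x.name == y.name, s"field name mismatch: ${x.name} vs ${y.name}")
            Field(x.name, merge(x.dtype, y.dtype), x.nullable || y.nullable)
          })
        case (ArrayT(ea, na), ArrayT(eb, nb)) =>
          ArrayT(merge(ea, eb), na || nb)
        case (x, y) if x == y => x
        case (x, y) =>
          throw new IllegalArgumentException(s"incompatible types: $x vs $y")
      }

      // Result type of a conditional: merge the types of all branches
      // instead of returning the type of the first branch.
      def resultType(branches: Seq[DType]): DType = branches.reduce(merge)
    }

    object NullabilityMergeDemo {
      def main(args: Array[String]): Unit = {
        // Mirrors the example: the literal struct has a non-nullable val1,
        // while column `s` declares val1 as nullable.
        val thenBranch = StructT(Seq(
          Field("val1", StringT, nullable = false),
          Field("val2", IntT, nullable = false)))
        val elseBranch = StructT(Seq(
          Field("val1", StringT, nullable = true),
          Field("val2", IntT, nullable = false)))
        // In the merged type val1 is nullable and val2 stays non-nullable.
        println(NullabilityMerge.resultType(Seq(thenBranch, elseBranch)))
      }
    }
    ```
    With a merged type like this, a downstream expression that reads 
```val1``` would see it as nullable and would be expected to check for null 
rather than assume a value is present.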
    
    ## How was this patch tested?
    New test cases were added to:
    - DataFrameSuite.scala
    - conditionalExpressions.scala

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mn-mikke/spark SPARK-24165-branch-2.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21747.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21747
    
----
commit a2fe63e1d48b0291feaa9fcd008654da051d1f1b
Author: Marek Novotny <mn.mikke@...>
Date:   2018-07-11T04:21:03Z

    [SPARK-24165][SQL][branch-2.3] Fixing conditional expressions to handle nullability of nested types

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
