Spark SQL DataFrame: Nullable column and filtering

martinibus77 Thu, 30 Jul 2015 11:20:33 -0700

Hi all,

1. *Columns in DataFrames can be nullable or non-nullable. Given a
nullable column of Doubles, I can use the following Scala code to keep only
the "non-null" rows:*


  val df = ..... // some code that creates a DataFrame
  df.filter( df("columnname").isNotNull )  // isNotNull is declared without parentheses

+-+-----+----+                                                                  
|x|    a|   y|
+-+-----+----+
|1|hello|null|
|2|  bob|   5|
+-+-----+----+

root
 |-- x: integer (nullable = false)
 |-- a: string (nullable = true)
 |-- y: integer (nullable = true)

With the filter expression applied to column y, the output is:
+-+---+-+                                                                       
|x|  a|y|
+-+---+-+
|2|bob|5|
+-+---+-+


Unfortunately, while this works for a nullable column (according to
df.printSchema), it does not work for a column that is marked as not nullable:


+-+-----+----+                                                                  
|x|    a|   y|
+-+-----+----+
|1|hello|null|
|2|  bob|   5|
+-+-----+----+

root
 |-- x: integer (nullable = false)
 |-- a: string (nullable = true)
 |-- y: integer (nullable = false)

+-+-----+----+                                                                  
|x|    a|   y|
+-+-----+----+
|1|hello|null|
|2|  bob|   5|
+-+-----+----+

The second table is the result after filtering: the output is not affected by
the filter, and the row with the null value is still there. Is this intended?
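
For anyone who wants to reproduce the two cases, here is a minimal sketch. It
assumes the Spark 1.4-era API (SQLContext, createDataFrame with an explicit
schema) and a local master. The names NullableFilterRepro and schemaWithY are
made up for illustration, and the rows simply mirror the tables above. Newer
Spark versions may validate non-nullable fields and reject the null row instead
of showing it, so the behavior may differ by version.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical repro object; the names are illustrative, not from my real code.
object NullableFilterRepro {

  // Same three columns as in the tables above; only the nullable flag of y varies.
  def schemaWithY(yNullable: Boolean): StructType = StructType(Seq(
    StructField("x", IntegerType, nullable = false),
    StructField("a", StringType, nullable = true),
    StructField("y", IntegerType, nullable = yNullable)))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("nullable-filter-repro").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    try {
      // The first row carries a null in y, regardless of what the schema claims.
      val rows = sc.parallelize(Seq(Row(1, "hello", null), Row(2, "bob", 5)))
      for (yNullable <- Seq(true, false)) {
        val df = sqlContext.createDataFrame(rows, schemaWithY(yNullable))
        val filtered = df.filter(df("y").isNotNull)
        df.printSchema()
        filtered.show()
        // Prints the logical and physical plans, to check whether the
        // isNotNull predicate is still present after optimization.
        filtered.explain(true)
      }
    } finally {
      sc.stop()
    }
  }
}
```

Comparing the two explain(true) outputs should show whether the filter is kept
or dropped when y is declared as not nullable.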


2. *What is the cheapest way (in terms of performance) to turn a non-nullable
column into a nullable column?
I came up with this:*

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.types.{StructField, StructType}

  /**
   * Set whether a column is nullable.
   * @param df source DataFrame
   * @param cn name of the column to change
   * @param nullable flag to set, such that the column is either nullable or not
   */
  def setNullableStateOfColumn(df: DataFrame, cn: String, nullable: Boolean): DataFrame = {
    val schema = df.schema
    // Rebuild the schema, changing only the nullable flag of the matching column
    val newSchema = StructType(schema.map {
      case StructField(c, t, _, m) if c.equals(cn) => StructField(c, t, nullable = nullable, m)
      case y: StructField => y
    })
    // Re-create the DataFrame from the underlying RDD with the new schema
    df.sqlContext.createDataFrame(df.rdd, newSchema)
  }

Is there a cheaper solution?
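
To make it easier to compare alternatives, the schema rewrite can be separated
from the DataFrame re-creation. The sketch below is only a restructuring of the
function above, not a cheaper solution. NullableSchema, withNullable and
setNullable are hypothetical names, and the rewrite relies on the copy method
that StructField has as a case class. It lets one check the schema change
without touching any data.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Hypothetical helper object; the names are illustrative.
object NullableSchema {

  // Pure schema rewrite: sets the nullable flag of column cn and
  // leaves every other field untouched.
  def withNullable(schema: StructType, cn: String, nullable: Boolean): StructType =
    StructType(schema.map { f =>
      if (f.name == cn) f.copy(nullable = nullable) else f
    })

  // Same approach as setNullableStateOfColumn: re-create the DataFrame
  // from its RDD with the rewritten schema.
  def setNullable(df: DataFrame, cn: String, nullable: Boolean): DataFrame =
    df.sqlContext.createDataFrame(df.rdd, withNullable(df.schema, cn, nullable))
}
```

Usage would look like NullableSchema.setNullable(df, "y", nullable = true),
followed by printSchema to confirm the flag changed.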

3. *Any comments?*

Cheers and thx in advance,

Martin






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-DataFrame-Nullable-column-and-filtering-tp24087.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
