Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

Reynold Xin Mon, 11 May 2015 13:37:03 -0700

Thanks for catching this. I didn't read carefully enough.

It'd make sense to have the UDAF result be non-nullable if the input
expressions are indeed non-nullable.
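
The rule proposed above can be sketched in plain Scala. This is a simplified,
hypothetical model (the names `Expr`, `Column` and `Sum` are invented for
illustration and are not Spark's Catalyst classes); it only shows the idea of
deriving an aggregate's nullability from its input expression.

```scala
// Hypothetical, simplified model; NOT Spark's actual Catalyst classes.
sealed trait Expr { def nullable: Boolean }

// A column reference that carries the nullability declared in the schema.
final case class Column(name: String, nullable: Boolean) extends Expr

// Proposed rule: the aggregate is nullable only if its input expression is.
final case class Sum(child: Expr) extends Expr {
  def nullable: Boolean = child.nullable
}

val sumOfValue = Sum(Column("value", nullable = false))
val sumOfKey   = Sum(Column("key", nullable = true))
```

Under this model `sumOfValue.nullable` is false and `sumOfKey.nullable` is
true. One caveat worth checking before adopting such a rule: in SQL, SUM over
zero rows yields NULL, so an aggregate without a groupBy could still produce
null on empty input.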


On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot <ssab...@gmail.com> wrote:

> Hi Haopu,
> Actually, `key` is nullable here because that is what your input's schema says:
>
> scala> result.printSchema
> root
> |-- key: string (nullable = true)
> |-- SUM(value): long (nullable = true)
>
> scala> df.printSchema
> root
> |-- key: string (nullable = true)
> |-- value: long (nullable = false)
>
> I tried it with a schema where the key is not flagged as nullable, and the
> schema is actually respected. What you can argue, however, is that SUM(value)
> should also be non-nullable, since value is not nullable.
>
> @rxin do you think it would be reasonable to flag the Sum aggregation
> function as nullable (or not) depending on the input expression's schema?
>
> Regards,
>
> Olivier.
> On Mon, May 11, 2015 at 22:07, Reynold Xin <r...@databricks.com> wrote:
>
>> Not by design. Would you be interested in submitting a pull request?
>>
>> On Mon, May 11, 2015 at 1:48 AM, Haopu Wang <hw...@qilinsoft.com> wrote:
>>
>>> I am trying to get the result schema of aggregate functions using the
>>> DataFrame API.
>>>
>>> However, I find that the result fields for groupBy columns are always
>>> nullable, even when the source field is not nullable.
>>>
>>> I would like to know whether this is by design. Thank you! Below is some
>>> simple code that shows the issue.
>>>
>>> ======
>>>
>>>   import sqlContext.implicits._
>>>   import org.apache.spark.sql.functions._
>>>   case class Test(key: String, value: Long)
>>>   val df = sc.makeRDD(Seq(Test("k1",2),Test("k1",1))).toDF
>>>
>>>   val result = df.groupBy("key").agg($"key", sum("value"))
>>>
>>>   // From the output, you can see the "key" column is nullable. Why?
>>>   result.printSchema
>>>   // root
>>>   //  |-- key: string (nullable = true)
>>>   //  |-- SUM(value): long (nullable = true)
>>>
>>
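
For readers who want to reproduce the check Olivier describes, here is a
minimal spark-shell sketch. It assumes the Spark 1.4-era `sc` and `sqlContext`
from the original snippet are in scope, and it builds the DataFrame from an
explicit schema so that `key` is declared non-nullable (a case class field of
type String is inferred as nullable, which is what df.printSchema shows in the
thread). The names `schema`, `rows`, `strictDf` and `strictResult` are
illustrative only.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Assumption: `sc` and `sqlContext` are provided by spark-shell.
// Declare both columns as non-nullable instead of relying on inference.
val schema = StructType(Seq(
  StructField("key", StringType, nullable = false),
  StructField("value", LongType, nullable = false)))

// Same data as the original example, expressed as generic rows.
val rows = sc.parallelize(Seq(Row("k1", 2L), Row("k1", 1L)))
val strictDf = sqlContext.createDataFrame(rows, schema)

// Assumption: this version keeps grouping columns in the output; if yours
// does not, pass the key column to agg as the original snippet does.
val strictResult = strictDf.groupBy("key").agg(sum("value"))

// Compare the nullable flags of the input and of the aggregated result.
strictDf.printSchema()
strictResult.printSchema()
```

According to Olivier's report, the `key` field of the result should keep the
non-nullable flag from the input schema; the flag printed for the SUM column
is the part under discussion in this thread.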
