Hello everyone,

Consider this toy example:

import org.apache.spark.sql.functions.collect_set
import sparkSession.implicits._

case class Foo(x: String, y: String)
val df = sparkSession.createDataFrame(Seq(Foo(null, null), Foo("a", "a"), Foo("b", "b")))
df.select(collect_set($"x")).show

In Spark 2.0.0 I get the following results:

+--------------+
|collect_set(x)|
+--------------+
|  [null, b, a]|
+--------------+

In 1.6.* the same collect_set produces:

+--------------+
|collect_set(x)|
+--------------+
|        [b, a]|
+--------------+

Is there any way to get this aggregation to ignore nulls?  I understand the
trivial fix would be to filter out null x beforehand, but in my actual use
case I'm calling collect_set inside withColumn over a window specification
(roughly as sketched below), so I want rows whose x is null to keep their
place and end up with empty arrays rather than being dropped.
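
For context, the window usage looks roughly like this sketch (the
partitioning column grp is made up for illustration; the real job
partitions on something else):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_set

// Hypothetical window spec over a grouping column.
val w = Window.partitionBy($"grp")

// In 2.0.0, rows whose x is null end up with null inside the collected
// set instead of a clean (possibly empty) array.
val withSets = df.withColumn("xs", collect_set($"x").over(w))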

For now I'm using this hack of a workaround:

import org.apache.spark.sql.functions.udf

val removenulls = udf((xs: Seq[String]) => xs.filter(_ != null))
df.select(removenulls(collect_set($"x"))).show
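
Applied over the window from the sketch above, the same wrapper looks like:

val withSets = df.withColumn("xs", removenulls(collect_set($"x").over(w)))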


Any suggestions are appreciated.

Thanks,
Lee
