Hello everyone,

Consider this toy example:
  case class Foo(x: String)

  import sparkSession.implicits._
  import org.apache.spark.sql.functions.collect_set

  val df = sparkSession.createDataFrame(Seq(Foo(null), Foo("a"), Foo("b")))
  df.select(collect_set($"x")).show

In Spark 2.0.0 I get the following result:

  +--------------+
  |collect_set(x)|
  +--------------+
  |  [null, b, a]|
  +--------------+

In 1.6.* the same collect_set produces:

  +--------------+
  |collect_set(x)|
  +--------------+
  |        [b, a]|
  +--------------+

Is there any way to get this aggregation to ignore nulls? I understand the trivial way would be to filter on x beforehand, but in my actual use case I'm calling collect_set in withColumn over a window specification, so dropping the null rows isn't an option: on rows where x is null I want an empty array instead.

For now I'm using this hack of a workaround:

  import org.apache.spark.sql.functions.udf

  val removenulls = udf((l: scala.collection.mutable.WrappedArray[String]) => l.filter(x => x != null))
  df.select(removenulls(collect_set($"x"))).show

Any suggestions are appreciated.

Thanks,
Lee
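P.S. For concreteness, here is roughly the shape of my real usage. The "userId" column and the window spec are made-up placeholders, but this sketch shows why pre-filtering the rows doesn't work for me:

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{collect_set, udf}
  import sparkSession.implicits._

  // Hypothetical window: one partition per user ("userId" is illustrative only).
  val w = Window.partitionBy($"userId")

  // Same workaround UDF as above, repeated here so the sketch is self-contained.
  val removenulls = udf((l: scala.collection.mutable.WrappedArray[String]) => l.filter(x => x != null))

  // Every row keeps its place; a partition whose x values are all null
  // ends up with an empty array rather than Array(null) under 2.0.0.
  val result = df.withColumn("xs", removenulls(collect_set($"x").over(w)))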