Hi everyone,
I've started experimenting with my codebase to see how much work it will
take to port it from Spark 1.6.1 to 2.0.0. While regression-testing some of
my DataFrame transforms, I've discovered that I can no longer pair a
countDistinct with a collect_set in the same aggregation.
Consider:
import org.apache.spark.sql.functions.{collect_set, countDistinct}

val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c", "a"))).toDF("x", "y")
val grouped = df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y"))
When it comes time to execute (via collect or show), I get the following
error:
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate
> operator containing aggregate functions which don't support partial
> aggregation.
I never encountered this behavior in previous Spark versions. Are there
workarounds that don't require computing each aggregation separately and
joining the results afterwards? Is there a partial-aggregation-friendly
version of collect_set? One idea I've been considering is sketched below.
Thanks,
Lee