Hi everyone,
I've started experimenting with my codebase to see how much work it will
take to port it from Spark 1.6.1 to 2.0.0. While regression-testing some of
my DataFrame transforms, I've discovered that I can no longer pair a
countDistinct with a collect_set in the same aggregation.
Consider:
import org.apache.spark.sql.functions.{collect_set, countDistinct}

val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c", "a"))).toDF("x", "y")
val grouped = df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y"))
When it comes time to execute (via collect or show), I get the following
error:
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate
> operator containing aggregate functions which don't support partial
> aggregation.
I never encountered this behavior in previous Spark versions. Are there
workarounds that don't require computing each aggregation separately and
joining the results afterwards? Is there a partial-aggregation-friendly
version of collect_set? One idea I've been considering is sketched below.
Thanks,
Lee