Hi everyone, I've started experimenting with my codebase to see how much work porting it from 1.6.1 to 2.0.0 will take. While regression-testing some of my DataFrame transforms, I've discovered I can no longer pair a countDistinct with a collect_set in the same aggregation.
Consider:

```scala
val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c", "a"))).toDF("x", "y")
val grouped = df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y"))
```

When it comes time to execute (via collect or show), I get the following error:

> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate operator containing aggregate functions which don't support partial aggregation.

I never encountered this behavior in previous Spark versions. Are there workarounds that don't require computing each aggregation separately and joining later? Is there a partial-aggregation version of collect_set?

Thanks,
Lee
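P.S. One workaround I'm considering, sketched below and not yet verified against 2.0.0: since the distinct count per group is just the number of elements in the collected set, compute collect_set once and take its size with the built-in size function, avoiding countDistinct entirely. (The column aliases `ys` and `distinct_y` are mine, just for illustration.)

```scala
import org.apache.spark.sql.functions.{collect_set, size}

// Sketch: derive the distinct count from the collected set instead of
// mixing countDistinct and collect_set in one aggregation.
val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c", "a"))).toDF("x", "y")
val grouped = df.groupBy($"x").agg(
  collect_set($"y").as("ys"),            // the set of distinct y values
  size(collect_set($"y")).as("distinct_y") // its cardinality == countDistinct($"y")
)
```

This keeps everything in a single groupBy/agg, at the cost of materializing the set (which the original query needed anyway).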