Re: countDistinct, partial aggregates and Spark 2.0

2016-08-12 Thread Lee Becker
On Fri, Aug 12, 2016 at 11:55 AM, Lee Becker wrote:
> val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c", "a"))).toDF("x", "y")
> val grouped = df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y"))

This workaround executes with no exceptions: val
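(The reply's workaround is cut off in this archive snippet. One commonly used approach for this Spark 2.0.0 limitation, not necessarily the one truncated above, is to derive the distinct count from the collected set itself via size(collect_set(...)), so that no distinct aggregate is declared alongside collect_set. A minimal sketch; the SparkSession setup, local master, and object wrapper are assumptions added for runnability, since the original was run in a shell with sc in scope:)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, size}

object CountDistinctWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("countDistinct workaround sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", "a"), ("b", "c"), ("c", "a")).toDF("x", "y")

    // size(collect_set($"y")) counts the distinct non-null values of y,
    // matching countDistinct($"y") semantics, while using only one
    // aggregate function per output column.
    val grouped = df.groupBy($"x")
      .agg(
        size(collect_set($"y")).as("distinct_y_count"),
        collect_set($"y").as("y_set"))

    grouped.show()
    spark.stop()
  }
}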

countDistinct, partial aggregates and Spark 2.0

2016-08-12 Thread Lee Becker
Hi everyone,

I've started experimenting with my codebase to see how much work it will take to port it from 1.6.1 to 2.0.0. In regression-testing some of my DataFrame transforms, I've discovered I can no longer pair a countDistinct with a collect_set in the same aggregation. Consider:

val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c", "a"))).toDF("x", "y")
val grouped = df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y"))
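(For readers who want to reproduce the failure, here is a self-contained version of the snippet above. The SparkSession setup, local master, and object wrapper are assumptions added for runnability; the thread's subject suggests the planner rejects mixing a distinct aggregate with collect_set because the latter did not support partial aggregation in 2.0.0:)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, countDistinct}

object CountDistinctRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("countDistinct + collect_set repro")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", "a"), ("b", "c"), ("c", "a")).toDF("x", "y")

    // Worked on 1.6.1; on 2.0.0 this aggregation fails at planning time
    // when the distinct aggregate is combined with collect_set.
    val grouped = df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y"))
    grouped.show()

    spark.stop()
  }
}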