I'm also interested in this.
Does the partition count of a DataFrame depend on the fields used in groupBy?
Also, is the performance of groupBy-agg comparable to reduceByKey/aggregateByKey?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-join-groupBy-agg-question-tp28849p2887
JavaPairRDD has aggregateByKey, which combines values within each partition
before the shuffle, so far less data is shuffled and performance is impressive.
The aggregateByKey function requires 3 parameters:
# An initial ‘zero’ value that will not affect the total values to be
collected
# A combining function accepting two parameters.
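
To make the roles of these parameters concrete, here is a minimal plain-Python sketch that mimics the semantics of aggregateByKey without needing Spark. The helper name `aggregate_by_key` and the sample data are made up for illustration; in Spark itself the three arguments are the zero value, a function that folds one value into a per-partition accumulator, and a function that merges two accumulators.

```python
def aggregate_by_key(partitions, zero_value, seq_op, comb_op):
    """Mimic RDD.aggregateByKey on a list of partitions of (key, value) pairs.

    Hypothetical helper for illustration only; not part of Spark.
    Assumes zero_value is immutable (e.g. an int or a tuple).
    """
    # Step 1: within each partition, fold every value into a per-key
    # accumulator that starts from zero_value (the map-side combine).
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = seq_op(acc.get(key, zero_value), value)
        partials.append(acc)

    # Step 2: merge the per-partition accumulators key by key. Only these
    # partial results, not the raw values, would cross the shuffle.
    merged = {}
    for acc in partials:
        for key, partial in acc.items():
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


# Two "partitions" of (key, value) pairs; sample data invented for the sketch.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5)],
]

# Sum per key: the zero value 0 does not change the total.
totals = aggregate_by_key(partitions, 0, lambda acc, v: acc + v, lambda x, y: x + y)

# (sum, count) per key: the accumulator type differs from the value type.
sum_count = aggregate_by_key(
    partitions,
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
)

print(totals)
print(sum_count)
```

The second call shows why aggregateByKey takes a zero value at all: the accumulator (a sum and a count) can have a different type from the values being aggregated, which reduceByKey does not allow.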