On Wednesday 19 July 2017 06:20 PM,
qihuagao wrote:
> Java pair RDDs have aggregateByKey, which can avoid a full shuffle and so
> has impressive performance. The aggregateByKey function requires 3
> parameters:
>
> # An initial 'zero' value that will not affect the totals to be collected.
> # A combining function accepting two parameters. The second parameter is
>   merged into the first parameter. This function combines/merges values
>   within a partition.
> # A merging function accepting two parameters. In this case the parameters
>   are merged into one. This step merges values across partitions.
>
> With DataFrames, I noticed groupByKey, which can do the same job as
> aggregateByKey but without the merge functions, so I assumed it would
> trigger a full shuffle. Is this true?

No. For built-in aggregates (like avg, sum, ...) it will already do the partition-wise partial aggregates, then shuffle only the partial results to merge. Usually it should give better performance than the corresponding RDD APIs due to code generation and related optimizations. Only Hive user-defined aggregate functions do not support partial aggregation (SPARK-10992). For reference see the comments:

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L151

> If true, should we have a function with performance like aggregateByKey
> for DataFrames?
>
> Thanks.

regards

--
Sumedh Wale
SnappyData (http://www.snappydata.io)

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
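To illustrate the three-parameter contract described above, here is a minimal sketch in plain Scala collections (no Spark required), treating each inner sequence as one "partition". The object and method names are invented for illustration; the point is that seqOp does the partition-local partial aggregation and combOp merges the per-partition partials, which is exactly what the built-in DataFrame aggregates also do under the hood:

```scala
// Hypothetical sketch of aggregateByKey's semantics using plain Scala
// collections -- each inner Seq plays the role of one RDD partition.
object AggregateByKeySketch {
  // zero:   initial value that must not affect the result (0 for a sum)
  // seqOp:  merges a value into the accumulator within a partition
  // combOp: merges two accumulators across partitions
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zero: U)(
      seqOp: (U, V) => U)(combOp: (U, U) => U): Map[K, U] = {
    // Step 1: partial aggregation inside each partition (no shuffle yet).
    val partials: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zero), v))
      }
    }
    // Step 2: the "shuffle" -- only per-partition partial results are
    // brought together and merged per key with combOp.
    partials.flatten.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).reduce(combOp)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Seq("a" -> 1, "b" -> 2, "a" -> 3), // partition 0
      Seq("a" -> 4, "b" -> 5)            // partition 1
    )
    // Per-key sums: seqOp and combOp are both addition here.
    val sums = aggregateByKey(data, 0)(_ + _)(_ + _)
    println(sums) // a -> 8, b -> 7
  }
}
```

Note that only one partial value per key per partition crosses the "shuffle" boundary in step 2, which is why this pattern moves far less data than grouping all raw values first.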