On Wednesday 19 July 2017 06:20 PM, qihuagao wrote:
Java pair RDDs have aggregateByKey, which can avoid a full shuffle and so
has impressive performance.
The aggregateByKey function requires 3 parameters:
# An initial ‘zero’ value that will not affect the total values to be
collected
# A combining function accepting two parameters. The second parameter is
merged into the first parameter. This function combines/merges values within
a partition.
# A merging function accepting two parameters. In this case the two
parameters are merged into one. This step merges values across partitions.
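The three parameters above can be sketched in plain Python — a minimal simulation of the semantics over hypothetical in-memory "partitions", not actual Spark code (the names zero, seq_op, and comb_op are illustrative, chosen to mirror the zeroValue/seqOp/combOp parameters):

```python
# Simulation of aggregateByKey semantics (plain Python, not Spark).
# Goal: per key, collect (sum, count) so an average can be derived later.

zero = (0, 0)  # initial 'zero' value: contributes nothing to the result

def seq_op(acc, value):
    # combining function: merges a value into the partition-local accumulator
    return (acc[0] + value, acc[1] + 1)

def comb_op(a, b):
    # merging function: merges accumulators produced by different partitions
    return (a[0] + b[0], a[1] + b[1])

# Two hypothetical partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5)],
]

# Phase 1: aggregate within each partition (no shuffle needed yet).
partials = []
for part in partitions:
    local = {}
    for k, v in part:
        local[k] = seq_op(local.get(k, zero), v)
    partials.append(local)

# Phase 2: only the small partial results are shuffled and merged.
result = {}
for local in partials:
    for k, acc in local.items():
        result[k] = comb_op(result.get(k, zero), acc)

print(result)  # {'a': (8, 3), 'b': (7, 2)}
```

Because phase 1 shrinks each partition down to one accumulator per key, only those small partial results cross the network in phase 2 — that is the shuffle savings the question refers to.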

With DataFrames, I noticed groupByKey, which can serve the same function as
aggregateByKey but without the merge functions, so I assumed it would trigger
a shuffle operation. Is this true?

No. For built-in aggregates (like avg, sum, ...) it will already do partition-wise partial aggregates, then shuffle only the partial results to merge them. Usually this gives better performance than the corresponding RDD APIs due to code generation and related optimizations. Only Hive user-defined aggregate functions do not support partial aggregation (SPARK-10992). For reference, see the comments at: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L151
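A sketch of what "partition-wise partial aggregates, then shuffle to merge" means for a built-in aggregate like avg, again as a plain-Python simulation rather than actual Spark internals (the update/merge/evaluate names loosely mirror the phases described in the interfaces.scala comments linked above; the data is made up):

```python
# Simulation of partial aggregation for avg (plain Python, not Spark).
# Each partition produces a small (sum, count) buffer; only buffers are
# shuffled, and the final average is computed after merging them.

def update(buffer, value):
    # partial aggregation within a partition: accumulate (sum, count)
    return (buffer[0] + value, buffer[1] + 1)

def merge(b1, b2):
    # merge partial buffers from different partitions after the shuffle
    return (b1[0] + b2[0], b1[1] + b2[1])

def evaluate(buffer):
    # compute the final result from the fully merged buffer
    return buffer[0] / buffer[1]

# Two hypothetical partitions of (key, value) rows for one group.
partitions = [[("a", 2.0), ("a", 4.0)], [("a", 6.0)]]

# Partial aggregates per partition.
buffers = []
for part in partitions:
    buf = (0.0, 0)
    for _, v in part:
        buf = update(buf, v)
    buffers.append(buf)

# Merge the shuffled buffers and evaluate.
final = buffers[0]
for b in buffers[1:]:
    final = merge(final, b)

print(evaluate(final))  # 4.0
```

This is why the DataFrame API does not need user-supplied combine/merge functions: for built-in aggregates the engine already knows how to split the work into these phases.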

If true, should we have a function for DataFrames with performance like
aggregateByKey?

Thanks.

regards
--
Sumedh Wale
SnappyData (http://www.snappydata.io)