A Java pair RDD has aggregateByKey, which combines values within each partition before the shuffle, so far less data crosses the network and performance can be impressive. The aggregateByKey function requires 3 parameters: # An initial 'zero' value that does not affect the values to be collected (e.g. 0 for a sum) # A combining function accepting two parameters; the second parameter is merged into the first. This function combines/merges values within a partition. # A merging function accepting two parameters; in this case the two partial results are merged into one. This step merges values across partitions.
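The three parameters can be sketched in plain Python, simulating the two phases without a Spark cluster. The function name and sample data are made up for illustration; only the zero value / within-partition combine / across-partition merge structure mirrors aggregateByKey:

```python
# Hypothetical sample data: (key, value) pairs split across two partitions,
# mimicking a pair RDD without needing a Spark cluster.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5)],
]

zero = 0                          # 1. 'zero' value: neutral for the aggregation (0 for a sum)
seq_op = lambda acc, v: acc + v   # 2. merges a value into the accumulator, within a partition
comb_op = lambda a, b: a + b      # 3. merges two per-partition accumulators, across partitions

def aggregate_by_key(parts, zero, seq_op, comb_op):
    # Phase 1: per-partition (map-side) aggregation -- one partial result
    # per key per partition, which is all that then needs to be shuffled.
    partials = []
    for part in parts:
        acc = {}
        for k, v in part:
            acc[k] = seq_op(acc.get(k, zero), v)
        partials.append(acc)
    # Phase 2: merge the per-partition partials across partitions.
    result = {}
    for acc in partials:
        for k, partial in acc.items():
            result[k] = comb_op(result[k], partial) if k in result else partial
    return result

print(aggregate_by_key(partitions, zero, seq_op, comb_op))  # {'a': 8, 'b': 7}
```

Note that only the per-key partial results (here, four small sums) would be shuffled, not every individual record.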
With DataFrames, I noticed groupByKey, which can achieve the same result as aggregateByKey, but it takes no merge functions, so I assume it triggers a full shuffle. Is this true? If so, should DataFrames have a function with performance like aggregateByKey? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/about-aggregateByKey-of-pairrdd-tp28878.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
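The worry behind the question can be made concrete with a back-of-the-envelope comparison in plain Python (sample data is made up): grouping the raw values ships every (key, value) record across the network, while a pre-aggregating operator ships only one partial result per key per partition:

```python
# Two hypothetical partitions of (key, value) records.
partitions = [
    [("a", 1)] * 1000 + [("b", 1)] * 500,  # partition 0: 1500 records
    [("a", 1)] * 800,                      # partition 1: 800 records
]

# Grouping raw values: every record must cross the network.
records_shuffled_grouping = sum(len(p) for p in partitions)

# Pre-aggregating first: one partial result per distinct key per partition.
records_shuffled_preagg = sum(len({k for k, _ in p}) for p in partitions)

print(records_shuffled_grouping)  # 2300
print(records_shuffled_preagg)    # 3
```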