A Java pair RDD has aggregateByKey, which combines values within each partition before the shuffle, so far less data crosses the network and performance can be impressive. The aggregateByKey function requires 3 parameters: # An initial 'zero' value that does not affect the values to be collected (e.g. 0 for a sum) # A combining function accepting two parameters; the second parameter is merged into the first. This function combines/merges values within a partition. # A merging function accepting two parameters; in this case the two partial results are merged into one. This step merges values across partitions.
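The three parameters can be sketched in plain Python, simulating the two phases without a Spark cluster. The function name and sample data are made up for illustration; only the zero value / within-partition combine / across-partition merge structure mirrors aggregateByKey:

```python
# Hypothetical sample data: (key, value) pairs split across two partitions,
# mimicking a pair RDD without needing a Spark cluster.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5)],
]

zero = 0                          # 1. 'zero' value: neutral for the aggregation (0 for a sum)
seq_op = lambda acc, v: acc + v   # 2. merges a value into the accumulator, within a partition
comb_op = lambda a, b: a + b      # 3. merges two per-partition accumulators, across partitions

def aggregate_by_key(parts, zero, seq_op, comb_op):
    # Phase 1: per-partition (map-side) aggregation -- one partial result
    # per key per partition, which is all that then needs to be shuffled.
    partials = []
    for part in parts:
        acc = {}
        for k, v in part:
            acc[k] = seq_op(acc.get(k, zero), v)
        partials.append(acc)
    # Phase 2: merge the per-partition partials across partitions.
    result = {}
    for acc in partials:
        for k, partial in acc.items():
            result[k] = comb_op(result[k], partial) if k in result else partial
    return result

print(aggregate_by_key(partitions, zero, seq_op, comb_op))  # {'a': 8, 'b': 7}
```

Note that only the per-key partial results (here, four small sums) would be shuffled, not every individual record.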
With DataFrames, I noticed groupByKey, which can achieve the same result as aggregateByKey, but it takes no merge functions, so I assume it triggers a full shuffle. Is this true? If so, should DataFrames have a function with performance like aggregateByKey? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/about-aggregateByKey-of-pairrdd-tp28878.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
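The worry behind the question can be made concrete with a back-of-the-envelope comparison in plain Python (sample data is made up): grouping the raw values ships every (key, value) record across the network, while a pre-aggregating operator ships only one partial result per key per partition:

```python
# Two hypothetical partitions of (key, value) records.
partitions = [
    [("a", 1)] * 1000 + [("b", 1)] * 500,  # partition 0: 1500 records
    [("a", 1)] * 800,                      # partition 1: 800 records
]

# Grouping raw values: every record must cross the network.
records_shuffled_grouping = sum(len(p) for p in partitions)

# Pre-aggregating first: one partial result per distinct key per partition.
records_shuffled_preagg = sum(len({k for k, _ in p}) for p in partitions)

print(records_shuffled_grouping)  # 2300
print(records_shuffled_preagg)    # 3
```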