Hi Luis,

You might want to consider upgrading to Spark 2.0 - but in Spark 1.6.2 you
can call groupBy on a Dataset followed by reduce on the resulting
GroupedDataset
( http://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.GroupedDataset ) -
despite the different name, this works on a per-key basis. In Spark 2.0 you
would use groupByKey on the Dataset followed by reduceGroups
( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.KeyValueGroupedDataset ).
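For example, here is a rough, untested sketch - it assumes a spark-shell
style sc and a toy Dataset of (word, count) pairs:

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._

  val ds = Seq(("a", 1), ("a", 3), ("b", 2)).toDS()

  // 1.6.2: groupBy on a Dataset returns a GroupedDataset;
  // reduce applies the function within each key's group.
  val reduced = ds.groupBy(_._1).reduce((x, y) => if (x._2 < y._2) x else y)

  // 2.0 equivalent:
  // val reduced = ds.groupByKey(_._1)
  //   .reduceGroups((x, y) => if (x._2 < y._2) x else y)

  reduced.show()

Either way you get back a Dataset of (key, reducedValue) pairs.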
Cheers,

Holden :)

On Wed, Aug 10, 2016 at 5:15 PM, luismattor <luismat...@gmail.com> wrote:
> Hi everyone,
>
> Consider the following code:
>
> val result = df.groupBy("col1").agg(min("col2"))
>
> I know that rdd.reduceByKey(func) produces the same RDD as
> rdd.groupByKey().mapValues(value => value.reduce(func)). However,
> reduceByKey is more efficient, as it avoids shipping each value to the
> reducer doing the aggregation (it ships partial aggregations instead).
>
> I wonder whether the DataFrame API optimizes the code by doing something
> similar to what RDD.reduceByKey does.
>
> I am using Spark 1.6.2.
>
> Regards,
> Luis

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau