In case the two API names are confusing: I think the DataFrame groupBy name
came from SQL, but it works much like reduceByKey, not like the RDD groupByKey.
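
One way to see this is to look at the physical plan. A minimal sketch,
assuming a DataFrame df with columns col1 and col2 as in the original
question:

  import org.apache.spark.sql.functions.min

  val result = df.groupBy("col1").agg(min("col2"))
  result.explain()
  // The plan splits the aggregate into a partial stage that runs before
  // the Exchange (shuffle) and a final stage after it, so each partition
  // pre-combines its values, much like reduceByKey's map-side combine.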
On 29 Aug 2016 20:57, "Marius Soutier" <mps....@gmail.com> wrote:

> In DataFrames (and thus in 1.5 in general) this is not possible, correct?
>
> On 11.08.2016, at 05:42, Holden Karau <hol...@pigscanfly.ca> wrote:
>
> Hi Luis,
>
> You might want to consider upgrading to Spark 2.0 - but in Spark 1.6.2 you
> can do groupBy followed by a reduce on the GroupedDataset (
> http://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.GroupedDataset
> ) - this works on a per-key basis despite the different name. In Spark 2.0
> you would use groupByKey on the Dataset followed by reduceGroups (
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.KeyValueGroupedDataset ).
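>
> A minimal sketch of both, assuming a case class Record(col1: String,
> col2: Int) and a Dataset[Record] named ds (names are just for
> illustration), in a shell with the implicits imported:
>
>   // Spark 1.6.2: groupBy with a key function returns a GroupedDataset,
>   // whose reduce merges the records of each key pairwise.
>   val r16 = ds.groupBy(_.col1).reduce((a, b) => if (a.col2 < b.col2) a else b)
>
>   // Spark 2.0: groupByKey returns a KeyValueGroupedDataset; reduceGroups
>   // does the per-key reduction.
>   val r20 = ds.groupByKey(_.col1).reduceGroups((a, b) => if (a.col2 < b.col2) a else b)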
>
> Cheers,
>
> Holden :)
>
> On Wed, Aug 10, 2016 at 5:15 PM, luismattor <luismat...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> Consider the following code:
>>
>> val result = df.groupBy("col1").agg(min("col2"))
>>
>> I know that rdd.reduceByKey(func) produces the same RDD as
>> rdd.groupByKey().mapValues(value => value.reduce(func)). However,
>> reduceByKey is more efficient because it avoids shipping every value to
>> the reducer doing the aggregation (it ships partial aggregations
>> instead).
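>>
>> For example, with a hypothetical pair RDD of (String, Int) values:
>>
>>   val pairs = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2)))
>>   // Same result either way, but reduceByKey pre-combines the values
>>   // of each key within every partition before the shuffle:
>>   val viaReduce = pairs.reduceByKey((a, b) => math.min(a, b))
>>   val viaGroup  = pairs.groupByKey().mapValues(_.reduce((a, b) => math.min(a, b)))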
>>
>> I wonder whether the DataFrame API optimizes the code doing something
>> similar to what RDD.reduceByKey does.
>>
>> I am using Spark 1.6.2.
>>
>> Regards,
>> Luis
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-reduceByKey-functionality-in-DataFrame-API-tp27508.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>
>
