Hi Michael,

From:  Michael Armbrust <mich...@databricks.com>
Date:  Saturday, February 13, 2016 at 9:31 PM
To:  Koert Kuipers <ko...@tresata.com>
Cc:  "user @spark" <user@spark.apache.org>
Subject:  Re: GroupedDataset needs a mapValues

> Instead of grouping with a lambda function, you can do it with a column
> expression to avoid materializing an unnecessary tuple:
> 
> df.groupBy($"_1")


I am unfamiliar with this notation. Is there something similar for Java and
Python?
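
(A minimal sketch of what the $"_1" notation expands to, assuming Spark 1.6
with sqlContext.implicits._ in scope; the closest Java equivalent is
functions.col("_1"), and in Python col("_1") or df["_1"]:)

    import org.apache.spark.sql.functions.col

    // import sqlContext.implicits._ enables the $"..." syntax,
    // which builds a Column referring to the named field.
    df.groupBy($"_1")      // ColumnName via the $ string interpolator
    df.groupBy(col("_1"))  // equivalent, using functions.col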

Kind regards

Andy


> 
> Regarding the mapValues, you can do something similar using an Aggregator
> <https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html>,
> but I agree that we should consider something lighter weight like the
> mapValues you propose.
> 
> On Sat, Feb 13, 2016 at 1:35 PM, Koert Kuipers <ko...@tresata.com> wrote:
>> I have a Dataset[(K, V)].
>> I would like to group by K and then reduce V using a function (V, V) => V.
>> How do I do this?
>> 
>> I would expect something like:
>> val ds: Dataset[(K, V)] = ...
>> ds.groupBy(_._1).mapValues(_._2).reduce(f)
>> or better:
>> ds.grouped.reduce(f)  // grouped would only work on Dataset[(_, _)], and I
>> don't care about the Java API
>> 
>> But there is no mapValues or grouped. ds.groupBy(_._1) gives me a
>> GroupedDataset[K, (K, V)], which is inconvenient: I could carry the key
>> through the reduce operation, but that seems ugly and inefficient.
>> 
>> Any thoughts?
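
A minimal sketch of the workaround described above (carrying the key through
the reduce and dropping it afterwards), assuming Spark 1.6's
GroupedDataset.reduce and concrete K and V types with encoders in scope:

    // ds: Dataset[(K, V)], f: (V, V) => V
    // groupBy(_._1) yields GroupedDataset[K, (K, V)], so the key rides along.
    val reduced: Dataset[(K, (K, V))] =
      ds.groupBy(_._1).reduce((a, b) => (a._1, f(a._2, b._2)))
    // Drop the duplicated key to get back to Dataset[(K, V)].
    val result: Dataset[(K, V)] = reduced.map { case (k, (_, v)) => (k, v) }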

