Instead of grouping with a lambda function, you can group with a column
expression and avoid materializing an unnecessary tuple:

df.groupBy($"_1")
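
For concreteness, here is a rough sketch of the difference (the sample data
and variable names are just illustrative, and it assumes a 1.6-style
sqlContext with its implicits imported):

import sqlContext.implicits._  // for toDS(), the $"..." syntax and the tuple encoders

val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()  // a Dataset[(String, Int)]

// Lambda grouping: every row has to be deserialized into the (String, Int)
// tuple just so the closure can pull out its first element.
val byLambda = ds.groupBy(_._1)

// Column-expression grouping: the key is resolved against the existing "_1"
// column, so no per-row tuple is constructed.
val byColumn = ds.groupBy($"_1")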

Regarding mapValues, you can do something similar today using an Aggregator
<https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html>,
but I agree that we should consider something lighter weight, like the
mapValues you propose.
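
As a rough sketch of that approach against the 1.6 API linked above (the
names and the use of + are illustrative; it reuses the ds and implicits from
the snippet above):

import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.expressions.Aggregator

// Reduces the value half of a Dataset[(String, Int)] with +, standing in
// for an arbitrary (V, V) => V.
val reduceValues: TypedColumn[(String, Int), Int] =
  new Aggregator[(String, Int), Int, Int] with Serializable {
    def zero: Int = 0                                     // identity element for +
    def reduce(b: Int, a: (String, Int)): Int = b + a._2  // fold one record into the buffer
    def merge(b1: Int, b2: Int): Int = b1 + b2            // combine partial buffers
    def finish(b: Int): Int = b                           // nothing left to unwrap
  }.toColumn                                              // encoders come from the implicits

// Groups by the key and aggregates only the values, yielding a
// Dataset[(String, Int)] of (key, reduced value).
val reduced = ds.groupBy(_._1).agg(reduceValues)

A reduce with no natural zero would need an Option buffer on top of this,
which is part of why something lighter weight would be nice.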

On Sat, Feb 13, 2016 at 1:35 PM, Koert Kuipers <ko...@tresata.com> wrote:

> I have a Dataset[(K, V)].
> I would like to group by K and then reduce V using a function (V, V) => V.
> How do I do this?
>
> I would expect something like:
> val ds: Dataset[(K, V)] = ...
> ds.groupBy(_._1).mapValues(_._2).reduce(f)
> or, better:
> ds.grouped.reduce(f)  // grouped would only work on Dataset[(_, _)], and I
> don't care about the Java API
>
> But there is no mapValues or grouped. ds.groupBy(_._1) gives me a
> GroupedDataset[K, (K, V)], which is inconvenient. I could carry the key
> through the reduce operation, but that seems ugly and inefficient.
>
> Any thoughts?
>
