Instead of grouping with a lambda function, you can do it with a column expression to avoid materializing an unnecessary tuple:
    df.groupBy($"_1")

Regarding the mapValues, you can do something similar using an Aggregator
<https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html>,
but I agree that we should consider something lighter weight like the
mapValues you propose.

On Sat, Feb 13, 2016 at 1:35 PM, Koert Kuipers <ko...@tresata.com> wrote:

> i have a Dataset[(K, V)]
> i would like to group by k and then reduce V using a function (V, V) => V
> how do i do this?
>
> i would expect something like:
> val ds = Dataset[(K, V)]
> ds.groupBy(_._1).mapValues(_._2).reduce(f)
> or better:
> ds.grouped.reduce(f) # grouped only works on Dataset[(_, _)] and i dont
> care about java api
>
> but there is no mapValues or grouped. ds.groupBy(_._1) gives me a
> GroupedDataset[(K, (K, V))] which is inconvenient. i could carry the key
> through the reduce operation but that seems ugly and inefficient.
>
> any thoughts?
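
A minimal sketch of the Aggregator approach described in the reply above, assuming the Spark 1.6 API (zero/reduce/merge/finish, with the buffer and output encoders supplied implicitly through toColumn). The names maxValue and reduced are illustrative, and a concrete value type (Int, reduced with max) stands in for the generic (V, V) => V from the question:

    import org.apache.spark.sql.expressions.Aggregator
    import sqlContext.implicits._   // assumes a SQLContext named sqlContext is in scope

    // Aggregator over the full (key, value) tuple that only folds the value column,
    // so the key never has to be carried through the reduce function by hand.
    val maxValue = new Aggregator[(String, Int), Int, Int] {
      def zero: Int = Int.MinValue
      def reduce(b: Int, a: (String, Int)): Int = math.max(b, a._2)
      def merge(b1: Int, b2: Int): Int = math.max(b1, b2)
      def finish(b: Int): Int = b
    }.toColumn

    val ds = Seq(("a", 1), ("a", 3), ("b", 2)).toDS()

    // Group by the key and aggregate the values per group.
    val reduced = ds.groupBy(_._1).agg(maxValue)   // Dataset[(String, Int)]
    // reduced.collect() should yield Array(("a", 3), ("b", 2))

The groupBy in this sketch still uses the lambda form from the question; whether the column-expression form (ds.groupBy($"_1")) composes with agg in exactly the same way depends on how the Row-typed key is handled, so it is left out here.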