Two more ways:
*Using the Typed Dataset API with Rows*
Caveat: the docs for flatMapGroups do warn: "This function does not
support partial aggregation, and as a result requires shuffling all the
data in the Dataset. If an application intends to perform an aggregation
over each key, it is best to
For the curious, I played around with a UDAF for this (shown below). On the
downside, it builds a Map of every distinct value in the column, and that
Map has to be held in memory somewhere.
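To make the memory concern concrete, here is a minimal sketch in plain Scala (no Spark dependency) of the kind of buffer such an aggregation keeps: a Map from each distinct value to its count. The names `ModeBuffer`, `update`, `merge` and `result` are hypothetical and are not the UDAF from this thread; they only mirror the usual update / merge / evaluate shape of an aggregate function.

```scala
// Hypothetical count-map buffer; an illustration, not the UDAF discussed in this thread.
object ModeBuffer {
  type Counts = Map[String, Long]

  val empty: Counts = Map.empty

  // Fold one input value into the buffer.
  def update(counts: Counts, value: String): Counts =
    counts.updated(value, counts.getOrElse(value, 0L) + 1L)

  // Combine two partial buffers (for example, from different partitions).
  def merge(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (value, n)) =>
      acc.updated(value, acc.getOrElse(value, 0L) + n)
    }

  // Most frequent value, or None for an empty buffer.
  // Ties go to the smallest value so the result is deterministic.
  def result(counts: Counts): Option[String] =
    if (counts.isEmpty) None
    else Some(counts.toSeq.sortBy { case (v, n) => (-n, v) }.head._1)
}
```

The buffer holds one entry per distinct value seen, so its size grows with the cardinality of the column rather than with the number of rows.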
I suspect some kind of sorted groupByKey + cogroup could stream values
through, though might not support part
Hi,
One common situation I run across is that I want to compact my data and
select the mode (most frequent value) in several columns for each group.
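As a small illustration of what "mode per column for each group" means, here is a sketch over plain Scala collections rather than Spark. The `Rec` type, its column names, and the `GroupModes` helper are made up for the example, and the tie-breaking rule is an assumption.

```scala
// Plain-collections sketch; the record type and column names are hypothetical.
final case class Rec(key: String, color: String, size: String)

object GroupModes {
  // Most frequent element of a non-empty sequence.
  // Ties go to the smallest value so the result is deterministic (an assumption).
  def mode(values: Seq[String]): String =
    values.groupBy(identity)
      .map { case (v, vs) => (v, vs.size) }
      .toSeq
      .sortBy { case (v, n) => (-n, v) }
      .head._1

  // One output record per key, holding the mode of each column independently.
  def compact(rows: Seq[Rec]): Seq[Rec] =
    rows.groupBy(_.key).toSeq.sortBy(_._1).map { case (key, group) =>
      Rec(key, mode(group.map(_.color)), mode(group.map(_.size)))
    }
}
```

Each column's mode is computed independently, so a compacted record need not match any single input row.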
Even calculating mode for one column in SQL is a bit tricky. The ways I've
seen usually involve a nested sub-select with a group by + count and then