Re: Calculate mode separately for multiple columns in row

2017-05-01 Thread Everett Anderson
Two more ways: *Using the Typed Dataset API with Rows* Caveat: The docs about flatMapGroups do warn "This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to

Re: Calculate mode separately for multiple columns in row

2017-04-27 Thread Everett Anderson
For the curious, I played around with a UDAF for this (shown below). On the downside, it assembles a Map of all possible values of the column that'll need to be stored in memory somewhere. I suspect some kind of sorted groupByKey + cogroup could stream values through, though might not support part
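
The UDAF itself is cut off in this listing, so the following is not the author's code. It is only a minimal plain-Python sketch of the memory trade-off described above, assuming the aggregation buffer is a value-to-count map for a single column. The class name `ModeAggregator` and its `update`/`merge`/`evaluate` methods are hypothetical and merely mirror the usual shape of an aggregate function.

```python
from collections import Counter


class ModeAggregator:
    """Hypothetical stand-in for an aggregation buffer: one value -> count map."""

    def __init__(self):
        # Holds one entry per distinct value seen, so memory grows with
        # the number of distinct values in the column.
        self.counts = Counter()

    def update(self, value):
        # Assumption: nulls (None) are ignored when computing the mode.
        if value is not None:
            self.counts[value] += 1

    def merge(self, other):
        # Counter.update adds counts together, which is what combining
        # two partial buffers requires.
        self.counts.update(other.counts)

    def evaluate(self):
        # Most frequent value, or None if nothing was seen.
        return self.counts.most_common(1)[0][0] if self.counts else None


# Two partial buffers, as if built on separate partitions.
left = ModeAggregator()
for v in ["a", "b", "a"]:
    left.update(v)

right = ModeAggregator()
for v in ["b", "b", None]:
    right.update(v)

left.merge(right)
mode_value = left.evaluate()
distinct_values_held = len(left.counts)
```

Because the partial maps can be merged, this shape allows partial aggregation, at the cost of holding every distinct value of the column for the group in memory.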

Calculate mode separately for multiple columns in row

2017-04-26 Thread Everett Anderson
Hi, One common situation I run across is that I want to compact my data and select the mode (most frequent value) in several columns for each group. Even calculating mode for one column in SQL is a bit tricky. The ways I've seen usually involve a nested sub-select with a group by + count and then
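
To make the goal concrete, here is a minimal plain-Python sketch of "mode per column, per group". It assumes rows are dictionaries; the function name `mode_per_group` and the sample column names (`id`, `color`, `size`) are made up for illustration. It shows the intended result only, not a Spark or SQL solution.

```python
from collections import Counter, defaultdict


def mode_per_group(rows, key, columns):
    """Return {group key: {column: most frequent value in that group}}."""
    # One Counter per (group, column) pair.
    counts = defaultdict(lambda: {c: Counter() for c in columns})
    for row in rows:
        for c in columns:
            counts[row[key]][c][row[c]] += 1
    # Ties are broken by first-seen order, which is how
    # Counter.most_common orders equal counts.
    return {
        group: {c: counter.most_common(1)[0][0] for c, counter in per_col.items()}
        for group, per_col in counts.items()
    }


rows = [
    {"id": 1, "color": "red", "size": "S"},
    {"id": 1, "color": "red", "size": "M"},
    {"id": 1, "color": "blue", "size": "M"},
    {"id": 2, "color": "green", "size": "L"},
]

result = mode_per_group(rows, "id", ["color", "size"])
```

Each column's mode is computed independently, so the compacted row for a group can combine values that never appeared together in any single input row (here, `red` with `M` for group 1).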