Hi,

A common situation I run across is wanting to compact my data by
selecting the mode (most frequent value) of several columns for each group.

Even calculating the mode for a single column in SQL is a bit tricky. The
approaches I've seen usually involve a nested sub-select with a GROUP BY +
count, and then a window function using rank().
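
For reference, here is roughly that pattern as a sketch, run against the
contrived input table defined further down (the view name is just for
illustration; note that rank() keeps all tied values, while row_number()
would pick one arbitrarily):

input.createOrReplaceTempView("creatures")

val colorMode = spark.sql("""
    SELECT creature, color
    FROM (
        SELECT creature, color,
               rank() OVER (PARTITION BY creature ORDER BY cnt DESC) AS rnk
        FROM (
            SELECT creature, color, COUNT(*) AS cnt
            FROM creatures
            GROUP BY creature, color) counted) ranked
    WHERE rnk = 1""")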

However, what if you want to calculate the mode for several columns at
once, producing one result row per group? And let's say the set of
columns is only known at runtime.

In Spark SQL, that leads me down a road of many self-joins, roughly one
per column. The more efficient way seems to be either RDD[Row] or
Dataset[Row], where I could do a groupByKey + flatMapGroups, keeping
per-column counts as state while I iterate over the Rows in each group.
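
Here is a rough sketch of what I mean, assuming string-typed columns
(the column names are illustrative, ties are broken arbitrarily by
maxBy, and RowEncoder lives in an internal package but is the usual way
I know of to get an Encoder for a schema only known at runtime):

import scala.collection.mutable

import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val keyCol   = "creature"               // grouping column
val modeCols = Seq("color", "feature")  // only known at runtime

val outSchema = StructType(
    StructField(keyCol, StringType) +: modeCols.map(StructField(_, StringType)))

val modes = input
    .groupByKey(_.getAs[String](keyCol))(Encoders.STRING)
    .flatMapGroups { (key: String, rows: Iterator[Row]) =>
        // one mutable frequency map per mode column, filled in a single pass
        val counts = modeCols.map(_ =>
            mutable.Map.empty[String, Long].withDefaultValue(0L))
        for (row <- rows; i <- modeCols.indices)
            counts(i)(row.getAs[String](modeCols(i))) += 1
        // emit one row per group: the key plus each column's most frequent value
        Iterator(Row.fromSeq(key +: counts.map(_.maxBy(_._2)._1)))
    }(RowEncoder(outSchema))

For the example data below, this should reproduce expectedOutput, modulo
row order.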

What's the best way?

Here's a contrived example:

import spark.implicits._  // for .toDF on the RDD of tuples

val input = spark.sparkContext.parallelize(Seq(
    ("catosaur", "black", "claws"),
    ("catosaur", "orange", "scales"),
    ("catosaur", "black", "scales"),
    ("catosaur", "orange", "scales"),
    ("catosaur", "black", "spikes"),
    ("bearcopter", "gray", "claws"),
    ("bearcopter", "black", "fur"),
    ("bearcopter", "gray", "flight"),
    ("bearcopter", "gray", "flight")))
    .toDF("creature", "color", "feature")

+----------+------+-------+
|creature  |color |feature|
+----------+------+-------+
|catosaur  |black |claws  |
|catosaur  |orange|scales |
|catosaur  |black |scales |
|catosaur  |orange|scales |
|catosaur  |black |spikes |
|bearcopter|gray  |claws  |
|bearcopter|black |fur    |
|bearcopter|gray  |flight |
|bearcopter|gray  |flight |
+----------+------+-------+

val expectedOutput = spark.sparkContext.parallelize(Seq(
    ("catosaur", "black", "scales"),
    ("bearcopter", "gray", "flight")))
    .toDF("creature", "color", "feature")

+----------+-----+-------+
|creature  |color|feature|
+----------+-----+-------+
|catosaur  |black|scales |
|bearcopter|gray |flight |
+----------+-----+-------+
