Hi,

One common situation I run into is wanting to compact my data by selecting the mode (most frequent value) of several columns for each group.
Even calculating the mode for one column in SQL is a bit tricky. The approaches I've seen usually involve a nested sub-select with a GROUP BY + count, and then a window function using rank() (sketched at the end of this post). However, what if you want to calculate the mode for several columns at once, producing a single new row per group with the results? And let's say the set of columns is only known at runtime.

In Spark SQL, I start going down a road of many self-joins. The more efficient way seems to lead to either RDD[Row] or Dataset[Row], where I could do a groupByKey + flatMapGroups, keeping state as I iterate over the Rows in each group (also sketched below). What's the best way?

Here's a contrived example:

import spark.implicits._

val input = spark.sparkContext.parallelize(Seq(
    ("catosaur", "black", "claws"),
    ("catosaur", "orange", "scales"),
    ("catosaur", "black", "scales"),
    ("catosaur", "orange", "scales"),
    ("catosaur", "black", "spikes"),
    ("bearcopter", "gray", "claws"),
    ("bearcopter", "black", "fur"),
    ("bearcopter", "gray", "flight"),
    ("bearcopter", "gray", "flight")))
  .toDF("creature", "color", "feature")

+----------+------+-------+
|creature  |color |feature|
+----------+------+-------+
|catosaur  |black |claws  |
|catosaur  |orange|scales |
|catosaur  |black |scales |
|catosaur  |orange|scales |
|catosaur  |black |spikes |
|bearcopter|gray  |claws  |
|bearcopter|black |fur    |
|bearcopter|gray  |flight |
|bearcopter|gray  |flight |
+----------+------+-------+

val expectedOutput = spark.sparkContext.parallelize(Seq(
    ("catosaur", "black", "scales"),
    ("bearcopter", "gray", "flight")))
  .toDF("creature", "color", "feature")

+----------+-----+-------+
|creature  |color|feature|
+----------+-----+-------+
|catosaur  |black|scales |
|bearcopter|gray |flight |
+----------+-----+-------+
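For reference, this is roughly the single-column approach I mean, written against the example above. It's only a sketch: I use row_number() instead of rank() so that ties don't produce duplicate rows per group.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Mode of one column ("color") per group ("creature"):
// count occurrences, then keep the highest count within each group.
val counts = input
  .groupBy($"creature", $"color")
  .agg(count(lit(1)).as("cnt"))

val byCount = Window.partitionBy($"creature").orderBy($"cnt".desc)

val colorMode = counts
  .withColumn("rn", row_number().over(byCount)) // rank() would keep ties
  .filter($"rn" === 1)
  .select($"creature", $"color")

This works for one column, but repeating it per column and joining the results back together is the pile of self-joins I'd like to avoid.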
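And this is the direction I mean with groupByKey + flatMapGroups, handling a runtime-determined list of value columns in a single pass over each group. Again just a sketch: ties are broken arbitrarily by maxBy, and the Row encoder line depends on the Spark version.

import org.apache.spark.sql.{Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import scala.collection.mutable

val keyCol = "creature"
val valueCols = Seq("color", "feature") // only known at runtime

// Output rows have the same shape as the input: key + one mode per value column.
// (Spark 3.5+ exposes Encoders.row(schema) as the public equivalent.)
implicit val enc: Encoder[Row] = RowEncoder(input.schema)

val modes = input
  .groupByKey(_.getAs[String](keyCol))
  .flatMapGroups { (key, rows) =>
    // One frequency map per value column, filled while iterating the group once.
    val freqs = Array.fill(valueCols.length)(mutable.Map.empty[Any, Long])
    rows.foreach { r =>
      valueCols.indices.foreach { i =>
        val v = r.getAs[Any](valueCols(i))
        freqs(i)(v) = freqs(i).getOrElse(v, 0L) + 1L
      }
    }
    // maxBy picks an arbitrary winner on ties.
    Iterator(Row.fromSeq(key +: freqs.map(_.maxBy(_._2)._1).toSeq))
  }

My impression is that this typed path gives up some Catalyst optimization, which is why I'm asking whether there's a better way.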