Hi all,

What's the best way to do top-k with windowing in the Dataset world?

I have a snippet of code that filters the data down to the top-k rows per
key, but the keys are skewed:

val windowSpec = Window.partitionBy(skewedKeys).orderBy(dateTime)
val rank = row_number().over(windowSpec)

input.withColumn("rank", rank).filter("rank <= 10").drop("rank")

The problem with this code is that Spark doesn't know it can sort the data
locally and take a local top-k first. What it ends up doing is shuffling
everything by the skewed keys and sorting each partition, and that blew up
the cluster because a few keys hold most of the rows.
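As a sanity check that a local-first rewrite is even valid, here's a
plain-Scala illustration (no Spark; the chunking just simulates partitions)
that top-k distributes: taking per-chunk top-k and then top-k of the merged
partials gives the same answer as one global top-k.

```scala
// Illustration that top-k can be computed in two phases:
// top-k of the union of per-partition top-k's == global top-k.
val k = 3
val rows = Seq(9, 1, 7, 4, 8, 2, 6, 3, 5)
val partitions = rows.grouped(4).toSeq  // pretend these are Spark partitions

// Phase 1: local top-k within each "partition"
val localTopK = partitions.flatMap(_.sorted(Ordering[Int].reverse).take(k))
// Phase 2: top-k over the merged partial results
val twoPhase = localTopK.sorted(Ordering[Int].reverse).take(k)
// Single global pass, for comparison
val onePhase = rows.sorted(Ordering[Int].reverse).take(k)

assert(twoPhase == onePhase)  // both are Seq(9, 8, 7)
```

This is exactly the partial-aggregation step I'd hope the planner could
apply before the shuffle, but with a Window it doesn't.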

In the RDD world we can do something like:
rdd.mapPartitions(iterator => topK(iterator))
but I can't think of an obvious way to do this in the Dataset API,
especially with Window functions. I guess some UserDefinedAggregateFunction
would do, but I wonder if there's an obvious way that I missed.
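For the record, this is the kind of topK helper I have in mind (plain
Scala, no Spark; the name topK and the generic signature are just mine): a
bounded min-heap per key, so a partition never holds more than k values per
key no matter how skewed the input is.

```scala
import scala.collection.mutable

// Hypothetical per-partition helper: keep at most the k largest values
// per key. Memory stays O(#keys-in-partition * k) regardless of skew.
def topK[K, V](rows: Iterator[(K, V)], k: Int)
              (implicit ord: Ordering[V]): Iterator[(K, V)] = {
  val heaps = mutable.Map.empty[K, mutable.PriorityQueue[V]]
  for ((key, value) <- rows) {
    // Reverse the ordering so the heap's head is the SMALLEST retained
    // value, i.e. the one to evict when something larger arrives.
    val heap = heaps.getOrElseUpdate(key,
      mutable.PriorityQueue.empty[V](ord.reverse))
    if (heap.size < k) heap.enqueue(value)
    else if (ord.gt(value, heap.head)) { heap.dequeue(); heap.enqueue(value) }
  }
  heaps.iterator.flatMap { case (key, heap) => heap.iterator.map(key -> _) }
}
```

The idea would then be rdd.mapPartitions(it => topK(it, 10)) for the local
pass, followed by one more topK over the shuffled partials; I just don't
see how to get the Dataset planner to do that for me.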

-------
Regards,
Andy
