What about
https://github.com/myui/hivemall/wiki/Efficient-Top-k-computation-on-Apache-Hive-using-Hivemall-UDTF
Koert Kuipers wrote on Wed, Jan 4, 2017 at 16:11:
> i assumed topk of frequencies in one pass. if it's topk by known
> sorting/ordering then use priority queue
i assumed topk of frequencies in one pass. if it's topk by known
sorting/ordering then use priority queue aggregator instead of spacesaver.
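As a rough illustration of the priority-queue idea above (plain Python, not Spark or Algebird code), a bounded min-heap keeps the k largest items in one pass without sorting the whole input. The names `top_k` and `merge_top_k` are hypothetical, and the sketch assumes "top" means largest under the values' natural ordering.

```python
import heapq


def top_k(values, k):
    """Return the k largest values, largest first, in a single pass."""
    if k <= 0:
        return []
    heap = []  # min-heap of at most k items; heap[0] is the smallest one kept
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # v beats the weakest retained item, so swap it in
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)


def merge_top_k(a, b, k):
    """Combine two partial results, as an aggregator would across partitions."""
    return top_k(list(a) + list(b), k)


print(top_k([5, 1, 9, 3, 7, 9], 3))
print(merge_top_k([9, 7], [8, 2], 2))
```

Memory stays at k items per partial result, which is what makes this usable as an aggregator: partial heaps can be merged in any order.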
On Tue, Jan 3, 2017 at 3:11 PM, Koert Kuipers wrote:
> i don't know anything about windowing or about not using developer apis...
>
> but
03, 2017 8:03 PM
To: Mendelson, Assaf
Cc: user
Subject: Re: top-k function for Window
> Furthermore, in your example you don’t even need a window function, you can
> simply use groupby and explode
Can you clarify? You need to sort somehow (be it map-side sorting or
reduce-side s
i don't know anything about windowing or about not using developer apis...
but a trivial implementation of top-k requires a total sort per group. this
can be done with dataset. we do this using spark-sorted (
https://github.com/tresata/spark-sorted) but it's not hard to do it yourself
for
Hi Austin,
It's trivial to implement top-k in the RDD world - however I would like to
stay in the Dataset API world instead of flip-flopping between the two APIs
(consistency, wholestage codegen etc).
The twitter library appears to support only RDD, and the solution you gave
me is very similar
Andy,
You might also want to check out the Algebird libraries from Twitter. They have
topK and a lot of other helpful functions. I’ve used the Algebird topk
successfully on very large data sets.
You can also use Spark SQL to do a “poor man’s” topK. This depends on how
scrupulous you are about
You can write a UDAF in which the buffer contains the top K and manage it. This
means you don’t need to sort at all. Furthermore, in your example you don’t
even need a window function, you can simply use groupby and explode.
Of course, this is only relevant if k is small…
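A minimal sketch of that buffer idea, in plain Python rather than Spark's actual UDAF API: each group keeps a small sorted buffer of at most k values, and the final step "explodes" every buffer into one row per retained value. The function names and the (key, value) row shape are assumptions made for illustration.

```python
from bisect import insort
from collections import defaultdict


def update(buffer, value, k):
    """Insert value into the ascending buffer; drop the smallest on overflow."""
    insort(buffer, value)
    if len(buffer) > k:
        buffer.pop(0)


def top_k_per_group(rows, k):
    """rows is an iterable of (key, value); returns (key, value) rows, top k per key."""
    buffers = defaultdict(list)
    for key, value in rows:
        update(buffers[key], value, k)
    # "explode": one output row per retained value, largest first within each key
    return [(key, v) for key in sorted(buffers) for v in reversed(buffers[key])]


rows = [("a", 3), ("b", 10), ("a", 7), ("a", 1), ("b", 4), ("a", 5)]
print(top_k_per_group(rows, 2))
```

Each update costs O(k) because of the list insert and pop, which is in line with the caveat that this only pays off when k is small.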
From: Andy Dang