Re: top-k function for Window

2017-01-04 Thread Georg Heiler
What about https://github.com/myui/hivemall/wiki/Efficient-Top-k-computation-on-Apache-Hive-using-Hivemall-UDTF Koert Kuipers wrote on Wed, Jan 4, 2017 at 16:11: > I assumed top-k of frequencies in one pass. If it's top-k by a known > sorting/ordering, then use priority queue

Re: top-k function for Window

2017-01-04 Thread Koert Kuipers
I assumed top-k of frequencies in one pass. If it's top-k by a known sorting/ordering, then use a priority queue aggregator instead of spacesaver. On Tue, Jan 3, 2017 at 3:11 PM, Koert Kuipers wrote: > I don't know anything about windowing or about not using developer APIs... > > but
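
To illustrate the "top-k of frequencies in one pass" case, here is a minimal plain-Python sketch of the Space-Saving counting scheme, on the assumption that this is what "spacesaver" refers to. It is not Algebird's or Spark's API; the names space_saving, top_k_frequent and capacity are hypothetical, and the counts it returns are upper-bound estimates rather than exact frequencies.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def space_saving(stream: Iterable[Hashable], capacity: int) -> Dict[Hashable, int]:
    """Track at most `capacity` counters in a single pass (hypothetical helper)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    counts: Dict[Hashable, int] = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < capacity:
            counts[item] = 1
        else:
            # Evict the item with the smallest counter; the newcomer inherits
            # that count plus one, so counts can only overestimate.
            # This linear scan is O(capacity); a real implementation would
            # keep the counters in a structure with cheap minimum lookup.
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    return counts


def top_k_frequent(stream: Iterable[Hashable], k: int, capacity: int) -> List[Tuple[Hashable, int]]:
    """Return the k items with the largest estimated counts."""
    counts = space_saving(stream, capacity)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

When the ranking is by a known ordering rather than by frequency, no counters are needed and a bounded priority queue holding the current k best values is enough.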

RE: top-k function for Window

2017-01-03 Thread Mendelson, Assaf
03, 2017 8:03 PM To: Mendelson, Assaf Cc: user Subject: Re: top-k function for Window > Furthermore, in your example you don’t even need a window function, you can > simply use groupby and explode Can you clarify? You need to sort somehow (be it map-side sorting or reduce-side s

Re: top-k function for Window

2017-01-03 Thread Koert Kuipers
I don't know anything about windowing or about not using developer APIs... but a trivial implementation of top-k requires a total sort per group. This can be done with Dataset. We do this using spark-sorted ( https://github.com/tresata/spark-sorted) but it's not hard to do it yourself for
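
A minimal sketch of the "total sort per group" approach, written in plain Python rather than against the Dataset API or spark-sorted. The function name top_k_by_sorting and the (key, value) row shape are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def top_k_by_sorting(rows: Iterable[Tuple[Hashable, float]], k: int) -> Dict[Hashable, List[float]]:
    """Group rows by key, fully sort each group, and keep the first k values."""
    groups: Dict[Hashable, List[float]] = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    # Sorting every group in full costs O(n log n) per group even though only
    # k values survive, which is why this is the "trivial" implementation.
    return {key: sorted(values, reverse=True)[:k] for key, values in groups.items()}
```

For example, top_k_by_sorting([("a", 3), ("b", 1), ("a", 7), ("a", 5), ("b", 4)], 2) keeps the two largest values for each key.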

Re: top-k function for Window

2017-01-03 Thread Andy Dang
Hi Austin, It's trivial to implement top-k in the RDD world; however, I would like to stay in the Dataset API world instead of flip-flopping between the two APIs (consistency, whole-stage codegen, etc.). The Twitter library appears to support only RDDs, and the solution you gave me is very similar

Re: top-k function for Window

2017-01-03 Thread HENSLEE, AUSTIN L
Andy, You might also want to check out the Algebird libraries from Twitter. They have topK and a lot of other helpful functions. I’ve used the Algebird topK successfully on very large data sets. You can also use Spark SQL to do a “poor man’s” topK. This depends on how scrupulous you are about

RE: top-k function for Window

2017-01-03 Thread Mendelson, Assaf
You can write a UDAF in which the buffer contains the top K and manage it. This means you don’t need to sort at all. Furthermore, in your example you don’t even need a window function, you can simply use groupby and explode. Of course, this is only relevant if k is small… From: Andy Dang
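
The buffer idea described above can be sketched in plain Python as follows. This is not Spark's UDAF interface; the class TopKBuffer and its update/merge/result methods are hypothetical names that only mirror the usual shape of an aggregation buffer (update per row, merge across partitions, final result).

```python
import heapq
from typing import List


class TopKBuffer:
    """Hypothetical aggregation buffer that retains only the k largest values."""

    def __init__(self, k: int) -> None:
        if k <= 0:
            raise ValueError("k must be positive")
        self.k = k
        # Min-heap: heap[0] is the smallest value currently retained.
        self.heap: List[float] = []

    def update(self, value: float) -> None:
        """Fold one input value into the buffer."""
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, value)
        elif value > self.heap[0]:
            # The new value beats the current minimum, so swap it in.
            heapq.heapreplace(self.heap, value)

    def merge(self, other: "TopKBuffer") -> None:
        """Combine a partial buffer, e.g. one built on another partition."""
        for value in other.heap:
            self.update(value)

    def result(self) -> List[float]:
        """Return the retained values, largest first."""
        return sorted(self.heap, reverse=True)
```

Each update costs O(log k) and the buffer never holds more than k values, so no full sort of the group is needed. As noted above, this only pays off when k is small.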