Re: Efficient way to get top K values per key in (key, value) RDD?

2015-06-17 Thread Xiangrui Meng
This is implemented in MLlib: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L41. -Xiangrui

On Wed, Jun 10, 2015 at 1:53 PM, erisa erisa...@gmail.com wrote: Hi, I am a Spark newbie, and trying to solve the same problem, and
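
A minimal usage sketch of the topByKey helper that link points to, assuming the spark-mllib artifact is on the classpath (the sample data and app name below are illustrative, not from the thread):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._  // implicit conversion adds topByKey to pair RDDs

val sc = new SparkContext(new SparkConf().setAppName("topByKey").setMaster("local[*]"))

// Hypothetical (category, score) pairs.
val pairs = sc.parallelize(Seq(
  ("a", 5.0), ("a", 9.0), ("a", 1.0),
  ("b", 3.0), ("b", 7.0)))

// topByKey(k) keeps the k largest values per key (by the implicit Ordering)
// and returns an RDD[(K, Array[V])] with each array sorted largest-first.
val top2 = pairs.topByKey(2)
top2.collect().foreach { case (k, vs) => println(s"$k -> ${vs.mkString(", ")}") }

sc.stop()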

Re: Efficient way to get top K values per key in (key, value) RDD?

2015-06-10 Thread erisa
Hi, I am a Spark newbie, trying to solve the same problem, and I have implemented exactly the solution that sowen is suggesting. I am using priority queues to keep track of the top 25 sub_categories for each category, and using the combineByKey function to do that. However I run into the

Re: Efficient way to get top K values per key in (key, value) RDD?

2014-12-04 Thread Sean Owen
You probably want to use combineByKey, and create an empty min queue for each key. Merge values into the queue while its size is < K. Once it reaches K, only merge a new value if it exceeds the smallest element; if so, add it and remove the smallest element. This gives you an RDD of keys mapped to collections of
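
A compact sketch of that recipe with combineByKey and a size-bounded min queue (the Double value type, function names, and return shape are assumptions for illustration; the MLlib topByKey linked above packages the same idea):

import scala.collection.mutable
import org.apache.spark.rdd.RDD

def topKByKey(rdd: RDD[(String, Double)], k: Int): RDD[(String, Array[Double])] = {
  // Reverse the default ordering so the queue's head is the *smallest* kept value.
  implicit val minOrd: Ordering[Double] = Ordering[Double].reverse

  def insert(q: mutable.PriorityQueue[Double], v: Double) = {
    if (q.size < k) q.enqueue(v)                        // not full yet: always keep v
    else if (v > q.head) { q.dequeue(); q.enqueue(v) }  // v beats the current minimum
    q
  }

  rdd.combineByKey(
    (v: Double) => insert(mutable.PriorityQueue.empty[Double], v),  // createCombiner
    insert,                                                          // mergeValue
    (q1: mutable.PriorityQueue[Double], q2: mutable.PriorityQueue[Double]) => {
      q2.foreach(v => insert(q1, v)); q1                             // mergeCombiners
    }
  ).mapValues(_.dequeueAll.reverse.toArray)             // largest value first
}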