You probably want to use combineByKey, creating an empty min queue (a bounded priority queue) for each key. While a queue has fewer than K elements, merge each value into it. Once it holds K elements, merge a new value only if it exceeds the smallest element in the queue, in which case you add it and remove that smallest element.
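A rough sketch in Scala of what that could look like (assuming, purely for illustration, an RDD called edges of ((srcID, dstID), weight) pairs and K = 10; the names and sample data are made up, not taken from your code):

import scala.collection.mutable

// Illustrative input, e.g. in the spark-shell: ((srcID, dstID), edgeWeight) pairs.
val edges = sc.parallelize(Seq(
  ((1L, 2L), 0.5), ((1L, 2L), 0.9), ((1L, 2L), 0.1),
  ((2L, 3L), 1.5)
))

val k = 10

// Reversed ordering turns PriorityQueue into a min queue: the smallest
// weight sits at the head, so it is cheap to inspect and evict.
def newQueue(v: Double) = {
  val q = new mutable.PriorityQueue[Double]()(Ordering[Double].reverse)
  q.enqueue(v)
  q
}

// Keep a value only if the queue is not full yet, or if it beats the
// current smallest element.
def insert(q: mutable.PriorityQueue[Double], v: Double) = {
  if (q.size < k) q.enqueue(v)
  else if (v > q.head) { q.dequeue(); q.enqueue(v) }
  q
}

def merge(a: mutable.PriorityQueue[Double], b: mutable.PriorityQueue[Double]) = {
  b.foreach(insert(a, _))
  a
}

val topKPerKey = edges
  .combineByKey(newQueue _, insert _, merge _)
  .mapValues(_.toArray.sorted(Ordering[Double].reverse)) // heaviest first

Since combineByKey merges combiners map-side by default, at most K values per key per partition cross the shuffle, which is the main saving over groupByKey.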
This gives you an RDD of keys mapped to collections of up to K values each, and should be about as efficient as it gets in general.

On Thu, Dec 4, 2014 at 8:53 AM, Theodore Vasiloudis <theodoros.vasilou...@gmail.com> wrote:
> Hello everyone,
>
> I was wondering what is the most efficient way of retrieving the top K
> values per key in a (key, value) RDD.
>
> The simplest way I can think of is to do a groupByKey, sort the iterables
> and then take the top K elements for every key.
>
> But groupByKey is an operation that can be very costly.
>
> This thread
> <http://apache-spark-user-list.1001560.n3.nabble.com/Folding-an-RDD-in-order-td16577.html>
> seems related; there it is recommended to change the key to include the
> value we want to sort on, and then perform an aggregate operation.
>
> My use case would be to filter an RDD representing the edges of a graph,
> ((srcID, dstID), edgeWeight),
> so that we retain at most the top K edges according to weight for each
> (srcID, dstID) key.
> The graph can have multiple edges between the same two vertices.