Hello everyone, I was wondering what the most efficient way is to retrieve the top K values per key in a (key, value) RDD.
The simplest way I can think of is to do a groupByKey, sort each iterable, and then take the top K elements for every key. But groupByKey is an operation that can be very costly. This <http://apache-spark-user-list.1001560.n3.nabble.com/Folding-an-RDD-in-order-td16577.html> thread seems related; it recommends changing the key to include the value we want to sort on, and then performing an aggregate operation.

My use case is to filter an RDD representing the edges of a graph, ((srcID, dstID), edgeWeight), so that we retain at most the top K edges by weight for each (srcID, dstID) key. The graph can have multiple edges between the same two vertices.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-way-to-get-top-K-values-per-key-in-key-value-RDD-tp20370.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
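For what it's worth, one way I've seen this done without a full groupByKey is with aggregateByKey, keeping a bounded accumulator of at most K weights per key so only small lists are shuffled. Below is a minimal sketch of the two aggregation functions; the names (seq_op, comb_op, K=3) and the `edges` RDD are my own illustration, not from the thread above, and the combine logic is shown locally so it can be checked without a cluster:

```python
import heapq

K = 3  # hypothetical cutoff: keep at most the top-3 edge weights per key

def seq_op(acc, weight):
    # Fold one edgeWeight into a key's accumulator, keeping only the K largest.
    acc.append(weight)
    return heapq.nlargest(K, acc)

def comb_op(acc1, acc2):
    # Merge two partial accumulators (from different partitions),
    # again keeping only the K largest weights.
    return heapq.nlargest(K, acc1 + acc2)

# With a live SparkContext, the call would look something like:
#   topk = edges.aggregateByKey([], seq_op, comb_op)
# where `edges` is an RDD of ((srcID, dstID), edgeWeight) pairs and the
# result maps each (srcID, dstID) to its list of top-K weights.

# Local demonstration of the fold on one key's weights:
acc = []
for w in [5.0, 1.0, 9.0, 3.0, 7.0]:
    acc = seq_op(acc, w)
print(acc)  # -> [9.0, 7.0, 5.0]
```

Since the accumulator never grows past K elements, each partition sends at most K values per key over the network, which is the main saving over sorting full per-key iterables after a groupByKey.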