This is implemented in MLlib: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L41. -Xiangrui
On Wed, Jun 10, 2015 at 1:53 PM, erisa &lt;erisa...@gmail.com&gt; wrote:
> Hi,
>
> I am a Spark newbie trying to solve the same problem, and I have
> implemented the exact solution that sowen is suggesting: I am using
> priority queues to keep track of the top 25 sub_categories per category,
> via the combineByKey function. However, I run into the following
> exception when I submit the Spark job:
>
> ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 17)
> java.lang.UnsupportedOperationException: unsuitable as hash key
>         at scala.collection.mutable.PriorityQueue.hashCode(PriorityQueue.scala:226)
>
> From the error it looks like Spark is trying to use the mutable priority
> queue as a hash key, so the error makes sense, but I don't get why it is
> doing that, since the priority queue is the value of the RDD record, not
> the key.
>
> Maybe there is a more straightforward way to achieve what I want, so any
> suggestion is appreciated :)
>
> Thanks,
> Erisa
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-way-to-get-top-K-values-per-key-in-key-value-RDD-tp20370p23263.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> ---------------------------------------------------------------------
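One way to sidestep the error is to keep the mutable.PriorityQueue strictly internal to the combiner functions and convert it to an immutable List before any later stage can touch the value (mutable.PriorityQueue deliberately throws from hashCode). Below is a minimal pure-Scala sketch, under stated assumptions, of the three functions one would pass to combineByKey, exercised locally without a SparkContext; the names (Entry, topK), the small K, and the sample data are illustrative only, not the poster's actual code:

```scala
import scala.collection.mutable

type Entry = (String, Int) // hypothetical (sub_category, count) pair

val K = 3 // the question uses 25; a small K keeps the demo readable

// Min-heap on count: the queue's head is the smallest retained entry,
// so when the size exceeds K we evict the current minimum.
val ord: Ordering[Entry] = Ordering.by[Entry, Int](_._2).reverse

def createCombiner(v: Entry): mutable.PriorityQueue[Entry] =
  mutable.PriorityQueue(v)(ord)

def mergeValue(q: mutable.PriorityQueue[Entry], v: Entry): mutable.PriorityQueue[Entry] = {
  q.enqueue(v)
  while (q.size > K) q.dequeue() // drop the smallest count
  q
}

def mergeCombiners(a: mutable.PriorityQueue[Entry],
                   b: mutable.PriorityQueue[Entry]): mutable.PriorityQueue[Entry] = {
  b.foreach(v => mergeValue(a, v))
  a
}

// In Spark this would look something like:
//   pairRdd.combineByKey(createCombiner, mergeValue, mergeCombiners)
//          .mapValues(q => q.dequeueAll.toList.reverse)
// where the mapValues step replaces the queue with an immutable List
// *before* anything downstream can try to hash the value.
// Here we just fold locally to show the merge logic:
def topK(values: Seq[Entry]): List[Entry] = {
  val q = values.tail.foldLeft(createCombiner(values.head))(mergeValue)
  q.dequeueAll.toList.reverse // descending by count
}

println(topK(Seq(("a", 5), ("b", 9), ("c", 1), ("d", 7), ("e", 3))))
// prints List((b,9), (d,7), (a,5))
```

The conversion at the end is the important part: the queue only ever lives inside the aggregation, and the value that survives is a plain immutable List. MLlib's topByKey (linked above) takes the same bounded-heap approach internally.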