Hi, I am a Spark newbie trying to solve the same problem, and I have implemented exactly the solution that sowen is suggesting: I am using priority queues to keep track of the top 25 sub_categories for each category, and building them with the combineByKey function.
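Roughly, my job looks like the sketch below (simplified for the list; the category names, the sample data, and the helper names insert/mergeQueues are made up for illustration, and the real code reads its input from elsewhere):

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.PriorityQueue

object TopSubCategories {
  val K = 25

  type Entry = (Long, String) // (count, sub_category)

  // Reversed ordering makes the queue a min-heap on the count, so
  // dequeue() removes the smallest of the entries kept so far.
  val minFirst: Ordering[Entry] = Ordering.by[Entry, Long](_._1).reverse

  // Add one entry, keeping at most K entries in the queue.
  def insert(pq: PriorityQueue[Entry], e: Entry): PriorityQueue[Entry] = {
    pq.enqueue(e)
    if (pq.size > K) pq.dequeue() // drop the current minimum
    pq
  }

  // Fold one partial queue into another when combiners are merged.
  def mergeQueues(a: PriorityQueue[Entry],
                  b: PriorityQueue[Entry]): PriorityQueue[Entry] = {
    b.foreach(e => insert(a, e))
    a
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TopSubCategories"))

    // (category, (sub_category, count)) -- made-up sample data
    val counts = sc.parallelize(Seq(
      ("books", ("scifi", 42L)),
      ("books", ("poetry", 7L)),
      ("music", ("jazz", 13L))
    ))

    val top25 = counts.combineByKey(
      (v: (String, Long)) => insert(PriorityQueue.empty[Entry](minFirst), (v._2, v._1)),
      (pq: PriorityQueue[Entry], v: (String, Long)) => insert(pq, (v._2, v._1)),
      mergeQueues _
    )

    // dequeueAll drains in ascending count order, so reverse for top-down.
    top25.collect().foreach { case (cat, pq) =>
      println(s"$cat -> ${pq.dequeueAll.reverse.toList}")
    }
  }
}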
However, when I submit the Spark job, I run into the following exception:

ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 17)
java.lang.UnsupportedOperationException: unsuitable as hash key
        at scala.collection.mutable.PriorityQueue.hashCode(PriorityQueue.scala:226)

From the error it looks like Spark is trying to use the mutable priority queue as a hash key, and the exception itself makes sense, since scala.collection.mutable.PriorityQueue deliberately throws from hashCode. What I don't understand is why Spark is hashing the queue at all: in my RDD records the priority queue is the value, not the key. Maybe there is a more straightforward way to achieve what I want, so any suggestion is appreciated :)

Thanks,
Erisa