Hi,

I am a Spark newbie trying to solve the same problem, and I have implemented exactly the solution that sowen is suggesting: I use priority queues to keep track of the top 25 sub_categories per category, building them with the combineByKey function.
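In case it is useful, this is roughly what my combineByKey call looks like (a simplified sketch; the RDD pairs, the field names, and the add helper are placeholders rather than my exact code):

import scala.collection.mutable.PriorityQueue

// pairs: RDD[(String, (String, Long))]  --  (category, (sub_category, count))
val K = 25

// Reverse the ordering so dequeue() removes the *smallest* count; the queue
// then behaves like a bounded min-heap holding the top K counts seen so far.
val ord: Ordering[(String, Long)] = Ordering.by[(String, Long), Long](_._2).reverse

def add(pq: PriorityQueue[(String, Long)], v: (String, Long)): PriorityQueue[(String, Long)] = {
  pq.enqueue(v)
  if (pq.size > K) pq.dequeue()   // evict the current minimum once we exceed K
  pq
}

val topK = pairs.combineByKey(
  (v: (String, Long)) => add(PriorityQueue.empty[(String, Long)](ord), v),   // createCombiner
  add _,                                                                      // mergeValue
  (a: PriorityQueue[(String, Long)], b: PriorityQueue[(String, Long)]) => {   // mergeCombiners
    b.foreach(add(a, _)); a
  }
)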
However, I run into the following exception when I submit the Spark job:

ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 17)
java.lang.UnsupportedOperationException: unsuitable as hash key
    at scala.collection.mutable.PriorityQueue.hashCode(PriorityQueue.scala:226)


From the error it looks like Spark is trying to use the mutable priority queue as a hash key, so the exception itself makes sense, but I don't understand why Spark would do that, since the priority queue is the value of each RDD record, not the key.
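
For what it is worth, the exception seems to come straight from Scala rather than Spark: in the Scala version on my cluster, calling hashCode on a mutable PriorityQueue throws exactly this error (per the stack trace above), so presumably something inside the aggregation ends up hashing the queues:

import scala.collection.mutable.PriorityQueue

val pq = PriorityQueue(1, 2, 3)
pq.hashCode()  // java.lang.UnsupportedOperationException: unsuitable as hash key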

Maybe there is a more straightforward solution to what I want to achieve, so
any suggestion is appreciated :)

Thanks,
Erisa


