Re: Efficient way to get top K values per key in (key, value) RDD?

2015-06-17 Thread Xiangrui Meng
This is implemented in MLlib:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L41.
-Xiangrui
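
For reference, a minimal usage sketch of the topByKey helper defined in that file (this is not code from the thread); the sample data, app name, and local master are illustrative assumptions, and the spark-mllib artifact needs to be on the classpath:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._  // brings topByKey onto pair RDDs

object TopByKeyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "topByKey-example")

    // Hypothetical (category, score) pairs.
    val pairs = sc.parallelize(Seq(
      ("a", 3.0), ("a", 9.0), ("a", 5.0),
      ("b", 1.0), ("b", 7.0)))

    // Keep the 2 largest values per key: returns RDD[(String, Array[Double])].
    val top2 = pairs.topByKey(2)

    top2.collect().foreach { case (k, vs) => println(s"$k -> ${vs.mkString(", ")}") }
    sc.stop()
  }
}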

On Wed, Jun 10, 2015 at 1:53 PM, erisa erisa...@gmail.com wrote:
 Hi,

 I am a Spark newbie trying to solve the same problem, and I have
 implemented exactly the solution that sowen is suggesting: I am using
 priority queues to keep track of the top 25 sub_categories for each category,
 via the combineByKey function.
 However, I run into the following exception when I submit the Spark job:

 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 17)
 java.lang.UnsupportedOperationException: unsuitable as hash key
 at
 scala.collection.mutable.PriorityQueue.hashCode(PriorityQueue.scala:226)


 From the error it looks like Spark is trying to use the mutable priority
 queue as a hash key, so the error makes sense, but I don't understand why it
 is doing that, since the priority queue is the value of the RDD record, not the key.

 Maybe there is a more straightforward solution to what I want to achieve, so
 any suggestion is appreciated :)

 Thanks,
 Erisa








Re: Efficient way to get top K values per key in (key, value) RDD?

2015-06-10 Thread erisa
Hi,

I am a Spark newbie trying to solve the same problem, and I have
implemented exactly the solution that sowen is suggesting: I am using
priority queues to keep track of the top 25 sub_categories for each category,
via the combineByKey function.
However, I run into the following exception when I submit the Spark job:

ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 17)
java.lang.UnsupportedOperationException: unsuitable as hash key
at
scala.collection.mutable.PriorityQueue.hashCode(PriorityQueue.scala:226)


From the error it looks like Spark is trying to use the mutable priority
queue as a hash key, so the error makes sense, but I don't understand why it
is doing that, since the priority queue is the value of the RDD record, not the key.

Maybe there is a more straightforward solution to what I want to achieve, so
any suggestion is appreciated :)

Thanks,
Erisa
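
For concreteness, a minimal sketch of the approach described above, under assumed (category, (subCategory, count)) records and hypothetical names: it keeps a bounded mutable.PriorityQueue per category inside combineByKey and converts each queue to an immutable Vector at the end, so nothing downstream ever relies on PriorityQueue.hashCode (the method throwing the "unsuitable as hash key" error above).

import org.apache.spark.SparkContext
import scala.collection.mutable

object TopSubCategories {
  val K = 25  // number of sub_categories to keep per category (assumed)

  // Min-queue by count: the smallest retained (subCategory, count) sits at the head.
  def newQueue(): mutable.PriorityQueue[(String, Long)] =
    mutable.PriorityQueue.empty[(String, Long)](Ordering.by[(String, Long), Long](_._2).reverse)

  def add(q: mutable.PriorityQueue[(String, Long)], v: (String, Long)): mutable.PriorityQueue[(String, Long)] = {
    q.enqueue(v)
    if (q.size > K) q.dequeue()  // drop the current smallest once capacity is exceeded
    q
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "top-subcategories")

    // Hypothetical input: (category, (subCategory, count)) pairs.
    val records = sc.parallelize(Seq(
      ("books", ("scifi", 42L)), ("books", ("poetry", 7L)),
      ("music", ("jazz", 13L)), ("music", ("rock", 99L))))

    val topPerCategory = records
      .combineByKey(
        (v: (String, Long)) => add(newQueue(), v),
        add _,
        (q1: mutable.PriorityQueue[(String, Long)], q2: mutable.PriorityQueue[(String, Long)]) => {
          q2.foreach(add(q1, _)); q1
        })
      // Turn each queue into an immutable Vector, largest count first, so no later
      // stage ever calls PriorityQueue.hashCode.
      .mapValues(q => q.toVector.sortBy(v => -v._2))

    topPerCategory.collect().foreach(println)
    sc.stop()
  }
}

It is hard to tell from the snippet alone what is hashing the queue in the failing job, so this is only a sketch of one way to keep the mutable queue out of any code path that might hash it.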






Efficient way to get top K values per key in (key, value) RDD?

2014-12-04 Thread Theodore Vasiloudis
Hello everyone,

I was wondering what the most efficient way is to retrieve the top K
values per key in a (key, value) RDD.

The simplest way I can think of is to do a groupByKey, sort the iterables,
and then take the top K elements for every key.

But groupByKey is an operation that can be very costly.
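
For concreteness, a minimal sketch (not from the original message) of this naive groupByKey approach; the sample data and the choice of K are illustrative assumptions:

import org.apache.spark.SparkContext

object NaiveTopKPerKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "naive-topk")
    val k = 3  // assumed K

    val pairs = sc.parallelize(Seq(("x", 5), ("x", 1), ("x", 9), ("x", 2), ("y", 4)))

    // groupByKey pulls all values for a key into one iterable, then we sort and take k.
    // Shuffling every value is exactly the cost concern raised above.
    val topK = pairs.groupByKey().mapValues(_.toSeq.sortBy(v => -v).take(k))

    topK.collect().foreach(println)
    sc.stop()
  }
}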

This thread seems related:
http://apache-spark-user-list.1001560.n3.nabble.com/Folding-an-RDD-in-order-td16577.html
There it is recommended to change the key to include the value we want to
sort on, and then perform an aggregate operation.

My use case is to filter an RDD representing the edges of a graph,
((srcID, dstID), edgeWeight),
so that we only retain at most the top K edges, according to weight, for each
(srcID, dstID) key.
The graph can have multiple edges between the same two vertices.






Re: Efficient way to get top K values per key in (key, value) RDD?

2014-12-04 Thread Sean Owen
You probably want to use combineByKey, and create an empty min queue
for each key. Merge a value into the queue if its size is < K. If the size
is >= K, only merge the value if it exceeds the smallest element; if so,
add it and remove the smallest element.

This gives you an RDD of keys mapped to collections of up to K values
each, and should be about as efficient as it gets in general.
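
A minimal sketch of this recipe applied to the edge use case from the question (this is not Sean's code); K, the sample edges, the string vertex IDs, and the use of scala.collection.mutable.PriorityQueue as the min queue are all illustrative assumptions:

import org.apache.spark.SparkContext
import scala.collection.mutable

object TopKEdgesPerPair {
  val K = 2  // assumed K

  // With the reversed ordering, the smallest retained weight sits at the head.
  val minOrd: Ordering[Double] = Ordering[Double].reverse

  def merge(q: mutable.PriorityQueue[Double], w: Double): mutable.PriorityQueue[Double] = {
    if (q.size < K) {
      q.enqueue(w)              // below capacity: always merge
    } else if (w > q.head) {    // exceeds the smallest retained element
      q.dequeue()               // remove the smallest
      q.enqueue(w)              // add the new value
    }
    q
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "topk-edges")

    // Hypothetical edges: ((srcID, dstID), edgeWeight), with string vertex IDs.
    val edges = sc.parallelize(Seq(
      (("a", "b"), 1.0), (("a", "b"), 5.0), (("a", "b"), 3.0), (("c", "d"), 2.0)))

    val topK = edges.combineByKey(
      (w: Double) => merge(mutable.PriorityQueue.empty[Double](minOrd), w),
      merge _,
      (q1: mutable.PriorityQueue[Double], q2: mutable.PriorityQueue[Double]) => {
        q2.foreach(merge(q1, _)); q1
      })

    // Each key now maps to a collection of up to K weights, printed largest first.
    topK.mapValues(_.toList.sorted.reverse).collect().foreach(println)
    sc.stop()
  }
}

Because the queues are combined map-side, at most K elements per key cross the shuffle boundary, which is what avoids materializing every value per key the way groupByKey does.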

On Thu, Dec 4, 2014 at 8:53 AM, Theodore Vasiloudis
theodoros.vasilou...@gmail.com wrote:
 Hello everyone,

 I was wondering what the most efficient way is to retrieve the top K
 values per key in a (key, value) RDD.

 The simplest way I can think of is to do a groupByKey, sort the iterables,
 and then take the top K elements for every key.

 But groupByKey is an operation that can be very costly.

 This thread seems related:
 http://apache-spark-user-list.1001560.n3.nabble.com/Folding-an-RDD-in-order-td16577.html
 There it is recommended to change the key to include the value we want to
 sort on, and then perform an aggregate operation.

 My use case is to filter an RDD representing the edges of a graph,
 ((srcID, dstID), edgeWeight),
 so that we only retain at most the top K edges, according to weight, for each
 (srcID, dstID) key.
 The graph can have multiple edges between the same two vertices.




