Selecting first ten values in a RDD/partition

2014-05-29 Thread nilmish
I have a DStream which consists of RDDs batched every 2 seconds. I have sorted
each RDD and want to retain only the top 10 values and discard the rest.
How can I retain only the top 10 values?

I am trying to get the top 10 hashtags.  Instead of sorting the entire set of
5-minute counts (thereby incurring the cost of a data shuffle), I am trying
to get the top 10 hashtags in each partition. I am stuck at how to retain the
top 10 hashtags in each partition.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Selecting-first-ten-values-in-a-RDD-partition-tp6517.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Anwar Rizal
Can you clarify what you're trying to achieve here?

If you want to take only the top 10 of each RDD, why not a sort followed by
take(10) on every RDD?

Or do you want to take the top 10 over the five-minute window?

Cheers,
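A minimal sketch of this sort-then-take(10) suggestion, in plain Scala with a local collection standing in for one RDD's (hashtag, count) pairs (the data and names are hypothetical):

```scala
// Hypothetical (hashtag, count) pairs from one 2-second RDD.
val counts = Seq(("#spark", 42), ("#scala", 17), ("#bigdata", 99), ("#akka", 5))

// In Spark this would be roughly rdd.sortBy(-_._2).take(10); here we sort
// the local collection by descending count and keep at most 10 entries.
val top10 = counts.sortBy(-_._2).take(10)
```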





Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Brian Gawalt
Try looking at the .mapPartitions() method implemented for RDD[T] objects.
It will give you direct access to an iterator over the member objects of
each partition, for doing the kind of within-partition hashtag counts
you're describing.
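A plain-Scala sketch of this per-partition idea, with a Seq of Seqs standing in for an RDD's partitions (the data and names are hypothetical); in Spark the same step would be written with rdd.mapPartitions:

```scala
// Each inner Seq stands in for one partition's (hashtag, count) pairs.
val partitions: Seq[Seq[(String, Int)]] = Seq(
  Seq(("#spark", 42), ("#scala", 17), ("#akka", 5)),
  Seq(("#bigdata", 99), ("#flink", 3))
)

// Per-partition top 10, roughly what
//   rdd.mapPartitions(it => it.toSeq.sortBy(-_._2).take(10).iterator)
// would compute -- no shuffle is needed for this step.
val perPartitionTop10 = partitions.map(_.sortBy(-_._2).take(10))
```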





Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Gerard Maas
DStream has a helper method to print the first 10 elements of each RDD. You
could take some inspiration from it, as the use case is practically the same
and the code will probably be very similar:  rdd.take(10)...

https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala#L591

-kr, Gerard.






Re: Selecting first ten values in a RDD/partition

2014-05-30 Thread nilmish
My primary goal: to get the top 10 hashtags for every 5-minute interval.

I want to do this efficiently. I have already done this by using
reduceByKeyAndWindow() and then sorting all hashtags in the 5-minute interval,
taking only the top 10 elements. But this is very slow.

So now I am thinking of retaining only the top 10 hashtags in each RDD, because
only these could appear in the final answer.

I am stuck at: how do I retain only the top 10 hashtags in each RDD of my
DStream? Basically I need to transform my DStream so that each RDD contains
only the top 10 hashtags, keeping the number of hashtags in the 5-minute
interval low.

If there is a more efficient way of doing this then please let me know that
as well.

Thanks,
Nilesh
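A plain-Scala sketch of the slow approach described above: sum counts per hashtag across the window's batches, then sort everything and take 10. The batch data is hypothetical, and in Spark the summing step would be done by reduceByKeyAndWindow:

```scala
// Hypothetical per-batch (hashtag, count) pairs within one 5-minute window.
val windowBatches: Seq[Seq[(String, Int)]] = Seq(
  Seq(("#spark", 2), ("#scala", 1)),
  Seq(("#spark", 3), ("#akka", 4))
)

// Sum counts per hashtag over the window (the reduceByKeyAndWindow step),
// then sort *every* hashtag by total count -- the expensive part -- and take 10.
val totals = windowBatches.flatten
  .groupBy(_._1)
  .map { case (tag, pairs) => (tag, pairs.map(_._2).sum) }
val windowTop10 = totals.toSeq.sortBy(-_._2).take(10)
```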





Re: Selecting first ten values in a RDD/partition

2014-06-29 Thread Chris Fregly
As Brian G alluded to earlier, you can use DStream.mapPartitions() to
return the partition-local top 10 for each partition.  Once you collect
the results from all the partitions, you can do a global top-10 merge
across all partitions.

This leads to a much smaller dataset being shuffled back to the driver
to calculate the global top 10.
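A plain-Scala sketch of this two-phase approach: the per-partition top 10s (at most 10 × numPartitions pairs) are merged into the global top 10. The partition data is hypothetical; in Spark the flatten would correspond to collecting the mapPartitions output on the driver:

```scala
// Hypothetical local top-10 lists, one per partition.
val localTops: Seq[Seq[(String, Int)]] = Seq(
  Seq(("#bigdata", 99), ("#spark", 42)),
  Seq(("#scala", 17), ("#flink", 3))
)

// Only these few survivors are brought together and sorted -- far less
// data than shuffling every hashtag count in the window.
val globalTop10 = localTops.flatten.sortBy(-_._2).take(10)
```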

