Re: Spark-Streaming collect/take functionality.

2014-08-26 Thread Chris Fregly
good suggestion, td.

and i believe the optimization that jon.burns is referring to (from the
big data mini course) is a step earlier: the sorting mechanism that
produces sortedCounts.

you can use mapPartitions() to get a top k locally on each partition, then
shuffle only (k * number of partitions) elements to the driver for the final
sort, versus shuffling the whole dataset from all partitions. this saves a
lot of network IO.
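
a minimal sketch of that technique on a plain RDD (the topK name, the
counts argument, and k are illustrative, not from the mini course):

import org.apache.spark.rdd.RDD

// the global top k is always contained in the union of the per-partition
// top k's, so this is exact while moving at most k * numPartitions
// elements to the driver.
def topK(counts: RDD[(String, Int)], k: Int): Array[(String, Int)] = {
  counts
    .mapPartitions { iter =>
      iter.toArray.sortBy(-_._2).take(k).iterator  // local top k per partition
    }
    .collect()                 // at most k * numPartitions elements shuffled
    .sortBy(-_._2)
    .take(k)
}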


On Tue, Jul 15, 2014 at 9:41 AM, jon.burns  wrote:

> It works perfectly, thanks! I feel like I should have figured that out;
> I'll chalk it up to inexperience with Scala. Thanks again.


Re: Spark-Streaming collect/take functionality.

2014-07-15 Thread jon.burns
It works perfectly, thanks! I feel like I should have figured that out;
I'll chalk it up to inexperience with Scala. Thanks again.





Re: Spark-Streaming collect/take functionality.

2014-07-14 Thread Tathagata Das
Why doesn't something like this work? If you want a continuously updated
reference to the top counts, you can use a global variable.

var topCounts: Array[(String, Int)] = null

sortedCounts.foreachRDD { rdd =>
  val currentTopCounts = rdd.take(10)
  // print currentTopCounts here, or do whatever else you need with it
  topCounts = currentTopCounts
}
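
Note that the foreachRDD body runs on the driver once per batch, so if
another driver thread polls topCounts, it is safer to declare it
@volatile. A minimal sketch of that variant (the monitoring thread is
purely illustrative):

@volatile var topCounts: Array[(String, Int)] = Array.empty

sortedCounts.foreachRDD { rdd =>
  topCounts = rdd.take(10)  // updated on the driver every batch interval
}

// elsewhere on the driver, e.g. a simple monitoring thread:
new Thread(new Runnable {
  def run(): Unit = while (true) {
    println("current top 10: " + topCounts.mkString(", "))
    Thread.sleep(10000)
  }
}).start()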

TD


On Mon, Jul 14, 2014 at 4:11 PM, jon.burns  wrote:

> Hello everyone,
>
> I'm an undergrad working on a summarization project. I've created a
> summarizer in normal Spark and it works great; however, I want to write it
> for Spark Streaming to increase its functionality. Basically, I take in a
> bunch of text and get the most popular words as well as the most popular
> bi-grams (two words together), and I've managed to do this with streaming
> (and made it stateful, which is great). However, the next part of my
> algorithm requires me to get the top 10 words and top 10 bigrams and store
> them in a vector-like structure. With plain Spark I would use code like:
>
> array_of_words = words.sortByKey().top(50)
>
> Is there a way to mimic this with streaming? I was following along with
> the ampcamp tutorial
> (http://ampcamp.berkeley.edu/big-data-mini-course/realtime-processing-with-spark-streaming.html),
> so I know that you can print the top 10 by using:
>
> sortedCounts.foreach(rdd =>
>   println("\nTop 10 hashtags:\n" + rdd.take(10).mkString("\n")))
>
> However, I can't seem to alter this to store the top 10 instead of just
> printing them. The instructor mentions at the end that
>
> "one can get the top 10 hashtags in each partition, collect them together
> at
> the driver and then find the top 10 hashtags among them" but they leave it
> as an exercise. I would appreciate any help :)
>
> Thanks