Why doesn't something like this work? If you want a continuously updated
reference to the top counts, you can use a global variable.
var topCounts: Array[(String, Int)] = null
sortedCounts.foreachRDD { rdd =>
  val currentTopCounts = rdd.take(10)
  // print currentTopCounts, or do whatever you want with it
  topCounts = currentTopCounts
}
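For the "top 10 in each partition, collect at the driver, then merge" exercise mentioned in the quoted message below, note that RDD.top already implements exactly that pattern: it keeps the top k elements of each partition in a bounded queue, ships those small arrays to the driver, and merges them there. A minimal sketch (assuming a DStream[(String, Int)] of word counts; the names counts and topWords are illustrative):

import org.apache.spark.streaming.dstream.DStream

// Holds the most recent top-10 list; updated once per batch.
var topWords: Array[(String, Int)] = Array.empty

def captureTop(counts: DStream[(String, Int)]): Unit = {
  counts.foreachRDD { rdd =>
    // Order pairs by their count (the second element). top() does the
    // per-partition top-k plus the merge at the driver for you.
    topWords = rdd.top(10)(Ordering.by(_._2))
  }
}

Because foreachRDD runs on the driver, assigning to a driver-side variable like this is safe; the same would not work inside a map or foreachPartition, which run on the executors.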
TD
On Mon, Jul 14, 2014 at 4:11 PM, jon.burns wrote:
> Hello everyone,
>
> I'm an undergrad working on a summarization project. I've created a
> summarizer in normal Spark and it works great; however, I want to write it
> for Spark Streaming to increase its functionality. Basically I take in a
> bunch of text and get the most popular words as well as the most popular
> bigrams (two words together), and I've managed to do this with streaming
> (and made it stateful, which is great). However, the next part of my
> algorithm requires me to get the top 10 words and top 10 bigrams and store
> them in a vector-like structure. With just Spark I would use code like:
>
> array_of_words = words.sortByKey().top(50)
>
> Is there a way to mimic this with streaming? I was following along with
> the
> ampcamp tutorial
> <
> http://ampcamp.berkeley.edu/big-data-mini-course/realtime-processing-with-spark-streaming.html
> >
> so I know that you can print the top 10 by using:
>
> sortedCounts.foreach(rdd =>
> println("\nTop 10 hashtags:\n" + rdd.take(10).mkString("\n")))
>
> However, I can't seem to alter this to store the top 10 rather than just
> print them. The instructor mentions at the end that
>
> "one can get the top 10 hashtags in each partition, collect them together
> at
> the driver and then find the top 10 hashtags among them" but they leave it
> as an exercise. I would appreciate any help :)
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-collect-take-functionality-tp9670.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>