Nicola Ferraro created SPARK-6974:
-------------------------------------

             Summary: Possible error in TwitterPopularTags
                 Key: SPARK-6974
                 URL: https://issues.apache.org/jira/browse/SPARK-6974
             Project: Spark
          Issue Type: Bug
            Reporter: Nicola Ferraro
            Priority: Minor


Looking at the Twitter popular tags example in Spark Streaming 
(https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala),
 it seems that the algorithm may print incorrect results in some cases.

The top-k tags are computed by calling the following function on a DStream:
topCounts60.foreachRDD(rdd => { ... print ... })

But the function passed to "foreachRDD" is called once per RDD in the DStream, 
so when the DStream is composed of multiple RDDs it is called multiple times, 
resulting in multiple top-k charts instead of one.

This scenario is probably unlikely in practice, because a previous 
transformation on the DStream (reduceByKeyAndWindow) collapses all RDDs of the 
stream into a single one.
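
To make the concern concrete, here is a minimal sketch in plain Scala with no 
Spark dependency. It only models the behavior described above and is not 
Spark's implementation: Batch, foreachBatch, topK and charts are hypothetical 
names, with a Batch standing in for one RDD of (tag, count) pairs and 
foreachBatch mimicking the shape of "foreachRDD".

```scala
object ForeachRddModel {
  // Hypothetical stand-in for one RDD of (tag, count) pairs.
  type Batch = Seq[(String, Int)]

  // Mimics the shape of foreachRDD: the callback runs once per batch.
  def foreachBatch(stream: Seq[Batch])(f: Batch => Unit): Unit =
    stream.foreach(f)

  // Top-k tags of a single batch, highest count first.
  def topK(batch: Batch, k: Int): Seq[(String, Int)] =
    batch.sortBy { case (_, count) => -count }.take(k)

  // Runs the callback over the stream and collects one chart per batch.
  def charts(stream: Seq[Batch], k: Int): Seq[Seq[(String, Int)]] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[Seq[(String, Int)]]
    foreachBatch(stream) { batch => out += topK(batch, k); () }
    out.toList
  }

  def main(args: Array[String]): Unit = {
    // Two batches in the modeled stream, so two charts are printed.
    val stream = Seq(
      Seq("#spark" -> 5, "#scala" -> 3),
      Seq("#scala" -> 4, "#jira" -> 1)
    )
    charts(stream, 1).foreach(println)
  }
}
```

With a single batch in the modeled stream the callback runs once and exactly 
one chart comes out, which is the case the example appears to depend on.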

The problem is that this behavior is not stated in the documentation and could 
change in future versions.
Moreover, correctly computing the top-k chart in streaming seems impossible if 
you rely on the documentation alone, yet it is the base algorithm for many 
real-time dashboard use cases.
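
For illustration, one defensive way to get a single chart regardless of how 
many partial results arrive is to merge the counts first and rank once. The 
sketch below is plain Scala, not Spark API and not what TwitterPopularTags 
does. The name mergedTopK is hypothetical, and it assumes the parts are 
disjoint partial counts for the same window, so that summing them is valid.

```scala
object MergedTopK {
  // Sums per-tag counts across all partial results, then ranks once.
  // Ties are broken by tag name so the result is deterministic.
  def mergedTopK(parts: Seq[Seq[(String, Int)]], k: Int): Seq[(String, Int)] =
    parts.flatten
      .groupBy { case (tag, _) => tag }
      .map { case (tag, pairs) => tag -> pairs.map(_._2).sum }
      .toSeq
      .sortBy { case (tag, count) => (-count, tag) }
      .take(k)
}
```

Ranking each part separately and concatenating the results would give 
several charts again, which is why the merge happens before the sort.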

I have also asked about this on Stack Overflow 
(http://stackoverflow.com/questions/29539655/how-to-compute-the-top-k-words).



