Nicola Ferraro created SPARK-6974:
-------------------------------------

             Summary: Possible error in TwitterPopularTags
                 Key: SPARK-6974
                 URL: https://issues.apache.org/jira/browse/SPARK-6974
             Project: Spark
          Issue Type: Bug
            Reporter: Nicola Ferraro
            Priority: Minor

Looking at the Twitter popular tags example in Spark Streaming (https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala), it seems that the algorithm can produce incorrect results in some cases.

The top-k tags are computed by applying the following function to a DStream:

    topCounts60.foreachRDD(rdd => { ... print ... })

However, the function passed to "foreachRDD" is called once per RDD in the DStream, so when the DStream is composed of multiple RDDs it is called multiple times and produces multiple top-k charts.

This scenario is probably unlikely in practice, because an earlier transformation on the DStream (reduceByKeyAndWindow) collapses all RDDs of the stream into a single one. The problem is that this behavior is not stated in the documentation and could change in future versions.

Moreover, computing the top-k chart correctly in streaming seems impossible if you rely on the documentation alone, yet it is the base algorithm for many real-time dashboard use cases.

I have also asked about this on Stack Overflow (http://stackoverflow.com/questions/29539655/how-to-compute-the-top-k-words).
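
To make the concern concrete, here is a minimal sketch in plain Scala, with no Spark dependency, of what "one top-k chart per RDD" would look like. It is only a model: `Batch` stands in for one RDD of (hashtag, count) pairs, and `topK` and `chartsPerBatch` are hypothetical helpers invented for this illustration, not part of the Spark API or of the TwitterPopularTags example.

```scala
object TopKPerBatchSketch {
  // One batch of (hashtag, count) pairs; a stand-in for a single RDD.
  type Batch = Seq[(String, Int)]

  // Hypothetical helper: the k most frequent tags in one batch,
  // highest count first, ties broken alphabetically by tag.
  def topK(batch: Batch, k: Int): Seq[(String, Int)] =
    batch.sortBy { case (tag, count) => (-count, tag) }.take(k)

  // Stand-in for a per-RDD callback: the function is applied once per
  // batch, so the result holds one chart per batch, not one overall chart.
  def chartsPerBatch(stream: Seq[Batch], k: Int): Seq[Seq[(String, Int)]] =
    stream.map(batch => topK(batch, k))

  def main(args: Array[String]): Unit = {
    // Made-up sample data, for illustration only.
    val stream: Seq[Batch] = Seq(
      Seq("#spark" -> 5, "#scala" -> 3, "#jira" -> 1),
      Seq("#scala" -> 4, "#spark" -> 2)
    )
    chartsPerBatch(stream, 2).zipWithIndex.foreach { case (chart, i) =>
      val line = chart.map { case (tag, count) => s"$tag ($count)" }.mkString(", ")
      println(s"batch $i: $line")
    }
  }
}
```

In this model, a stream with two batches yields two separate charts. A single chart only comes out if an earlier step has already merged all counts into one batch, which is the undocumented assumption described above.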