[ https://issues.apache.org/jira/browse/SPARK-6974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499496#comment-14499496 ]
Sean Owen commented on SPARK-6974:
----------------------------------

No, foreachRDD always means "for each RDD in the DStream, one at a time". In the windowed DStream, that is one RDD representing the preceding 60 seconds of data. It may contain data from 3 or more batch intervals' worth of blocks, but that does not mean you get 3 RDDs per window shift.

> Possible error in TwitterPopularTags
> ------------------------------------
>
>                 Key: SPARK-6974
>                 URL: https://issues.apache.org/jira/browse/SPARK-6974
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Nicola Ferraro
>            Priority: Minor
>
> Looking at the Twitter popular tags example in Spark Streaming
> (https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala),
> the algorithm appears to have issues in some cases.
> The top-k tags are computed with the following call on a DStream:
>
>   topCounts60.foreachRDD(rdd => { ... print ... })
>
> But the function passed to "foreachRDD" is invoked once per RDD in the
> DStream, so when the DStream is composed of multiple RDDs it runs multiple
> times, producing multiple top-k charts.
> This scenario is probably unlikely in practice, because a previous
> transformation on the DStream (reduceByKeyAndWindow) collapses all RDDs of
> the stream into a single one.
> The problem is that this behavior is not stated in the documentation and
> could change in future versions.
> Moreover, correctly computing a top-k chart over a stream seems impossible
> if you rely on the documentation alone, yet it is the basic algorithm behind
> many real-time dashboard use cases.
> I have also tried to get an answer on Stack Overflow
> (http://stackoverflow.com/questions/29539655/how-to-compute-the-top-k-words).
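
For reference, a minimal sketch of the windowed top-k pattern under discussion, in the style of TwitterPopularTags. The socket input source, host/port, checkpoint directory, and batch/window/slide durations here are assumptions chosen to make the sketch self-contained; the real example reads from Twitter instead. The point it illustrates is Sean's: foreachRDD fires once per slide interval, and the windowed DStream hands it exactly one RDD covering the whole window, so only one top-k chart is printed per window shift.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TopHashtagsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TopHashtagsSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))
    // Checkpointing is required by reduceByKeyAndWindow with an inverse function.
    ssc.checkpoint("checkpoint")

    // Assumed input: whitespace-separated text on a socket, e.g. fed by `nc -lk 9999`.
    val lines = ssc.socketTextStream("localhost", 9999)
    val hashTags = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))

    // One count per hashtag over a sliding 60-second window, recomputed every 10 seconds.
    // The inverse function (_ - _) lets Spark subtract the data that slid out of the window.
    val topCounts60 = hashTags.map(tag => (tag, 1))
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))
      .map { case (tag, count) => (count, tag) }
      .transform(_.sortByKey(false))

    // Called once every 10 seconds with a single RDD representing the last 60 seconds,
    // regardless of how many batch intervals' worth of blocks that RDD contains.
    topCounts60.foreachRDD { rdd =>
      val topList = rdd.take(10)
      println(s"\nPopular topics in last 60 seconds (${rdd.count()} total):")
      topList.foreach { case (count, tag) => println(s"$tag ($count tweets)") }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that the single-RDD-per-window behavior is a property of how windowed DStreams are materialized, which is the undocumented guarantee the reporter is asking about; the sketch relies on it the same way the shipped example does.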