[ 
https://issues.apache.org/jira/browse/SPARK-6974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499496#comment-14499496
 ] 

Sean Owen commented on SPARK-6974:
----------------------------------

No, foreachRDD always means "for each RDD in the DStream, one at a time". In 
the windowed DStream, each RDD represents the preceding 60 seconds of data. 
It may contain blocks from 3 or more batch intervals, but this does not mean 
you get 3 RDDs for each window shift.
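
To make this concrete, here is a minimal sketch rather than the actual 
TwitterPopularTags code. It assumes a local master, a 1-second batch interval, 
a 3-second window, and a hypothetical queue of small RDDs standing in for the 
Twitter hashtag stream; the object and helper names are made up for 
illustration. The function passed to foreachRDD receives a single RDD per 
window slide, so it prints one chart per slide even though each window spans 
several batches.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedTopKSketch {
  // Pure helper (hypothetical): render one chart from ranked (tag, count) pairs.
  def formatChart(label: String, ranked: Seq[(String, Int)]): String =
    (s"Top tags for window ending at $label:" +:
      ranked.map { case (tag, n) => s"  $tag ($n)" }).mkString("\n")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedTopKSketch")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batches (assumed)

    // Hypothetical input: a queue of small RDDs, consumed one per batch.
    val batches = new mutable.Queue[RDD[String]]()
    Seq(Seq("#a", "#b"), Seq("#a"), Seq("#c", "#a"), Seq("#b")).foreach { tags =>
      batches += ssc.sparkContext.parallelize(tags)
    }
    val hashTags = ssc.queueStream(batches, oneAtATime = true)

    // 3-second window sliding every second: each window spans up to 3 batches.
    val counts = hashTags.map(tag => (tag, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(3), Seconds(1))

    // Invoked once per slide, with a single RDD holding the whole window.
    counts.foreachRDD { (rdd, time) =>
      val ranked = rdd.top(2)(Ordering.by[(String, Int), Int](_._2))
      println(formatChart(time.toString, ranked.toSeq))
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(8000) // run briefly, then shut down
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Running this needs spark-streaming on the classpath. Each printed chart 
corresponds to one window, not to one batch inside the window.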

> Possible error in TwitterPopularTags
> ------------------------------------
>
>                 Key: SPARK-6974
>                 URL: https://issues.apache.org/jira/browse/SPARK-6974
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Nicola Ferraro
>            Priority: Minor
>
> Looking at the example for Twitter popular tags in spark streaming 
> (https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala),
>  it seems that the algorithm can have issues in some cases.
> Top k tags are computed using the following function on a DStream:
> topCounts60.foreachRDD(rdd => { ... print ... })
> But the function passed to "foreachRDD" is called multiple times when your 
> DStream is composed of multiple RDDs, once per RDD in the DStream, resulting 
> in multiple Top-k charts.
> Probably this scenario is unlikely to happen, because a previous 
> transformation on the DStream (reduceByKeyAndWindow) collapses all RDDs of 
> the stream into a single one.
> The problem is that this behavior is not stated in the documentation and 
> could change in future versions.
> Moreover, correctly computing the top-k chart in streaming seems impossible 
> if you rely on the documentation only, yet it is the base algorithm for many 
> real-time dashboard use cases.
> I have also tried to get an answer on Stack Overflow 
> (http://stackoverflow.com/questions/29539655/how-to-compute-the-top-k-words).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
