[ https://issues.apache.org/jira/browse/SPARK-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386568#comment-14386568 ]
Sean Owen commented on SPARK-6605:
----------------------------------

Thanks, that's very useful. I think the behavior is expected, but it's not obvious. I assume you are printing the RDD from a window with no data. Both give the same answer in that both show a count of zero for every key; the second example just has an explicit 0 for two keys instead of an implicit 0 for all of them.

The more expected answer is the first one: no results. The first version gets that exactly, since it re-counts the whole window, which has no data. The second is the result of the optimization offered by invFunc. It correctly finds that the count is 0 in the current window for these two keys, but it has no notion that a count of 0 is the same as no value at all. You and I know that, and you could simply apply a filter() to remove these redundant entries if desired.

I'm not sure it's "fixable" in general without the user being able to supply a {{(V,V) => Option[V]}} or something similar as the {{invFunc}}. But it's not really getting the wrong answer either.

> Same transformation in DStream leads to different result
> --------------------------------------------------------
>
>                 Key: SPARK-6605
>                 URL: https://issues.apache.org/jira/browse/SPARK-6605
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.3.0
>            Reporter: SaintBacchus
>             Fix For: 1.4.0
>
>
> The transformation *reduceByKeyAndWindow* has two implementations: one uses *WindowedDStream* and the other uses *ReducedWindowedDStream*.
> The results are always the same, except when an empty window occurs.
> In a word-count example, if a period of time longer than the window duration has no data coming in, the first *reduceByKeyAndWindow* has no elements inside, but the second has many elements with a value of zero inside.
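The zero-retention behavior can be illustrated without Spark at all. Below is a minimal plain-Scala sketch; the {{step}} helper and the sample data are hypothetical, standing in only for the add/subtract bookkeeping that the invFunc-based path performs, not for the real ReducedWindowedDStream implementation:

```scala
// Hypothetical model of the incremental window count: new batch counts are
// added, and counts for the batch leaving the window are subtracted (the
// invFunc role). A key whose count reaches 0 stays in the map with an
// explicit 0 rather than disappearing.
object InvFuncZeroDemo {
  def step(window: Map[String, Int],
           incoming: Map[String, Int],
           leaving: Map[String, Int]): Map[String, Int] = {
    val added = incoming.foldLeft(window) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }
    leaving.foldLeft(added) { case (acc, (k, v)) =>
      // Subtract, but never remove the key: this is why explicit zeros remain.
      acc.updated(k, acc.getOrElse(k, 0) - v)
    }
  }

  def main(args: Array[String]): Unit = {
    val batch = Map("spark" -> 2, "streaming" -> 1)
    // The window slides past the only non-empty batch:
    // it is first added, then subtracted back out.
    val afterAdd = step(Map.empty, batch, Map.empty)
    val afterInv = step(afterAdd, Map.empty, batch)
    println(afterInv)                    // keys remain, each with an explicit 0
    println(afterInv.filter(_._2 != 0))  // the filter() workaround drops them
  }
}
```

The filter on the last line is the workaround suggested above: it recovers the "no results" answer the non-incremental path produces naturally.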
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org