François Garillot created SPARK-9819:
----------------------------------------

             Summary: reduceBy(KeyAnd)Window should specify which is the 
accumulator argument in invReduceFunc
                 Key: SPARK-9819
                 URL: https://issues.apache.org/jira/browse/SPARK-9819
             Project: Spark
          Issue Type: Bug
          Components: Streaming
    Affects Versions: 1.5.0
         Environment: All
            Reporter: François Garillot


{{reduceByWindow}} has an optional {{invReduceFunc}} argument which allows the 
reduction to be performed incrementally.
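
For reference, the overload in question on {{DStream}} looks roughly like this (paraphrased; see {{DStream.scala}} for the exact declaration):

{code:scala}
def reduceByWindow(
    reduceFunc: (T, T) => T,
    invReduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T]
{code}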

The incremental reduction [performed in 
{{ReducedWindowedDStream}}|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/ReducedWindowedDStream.scala#L157]
 only depends on the reduction and its inverse function being associative (as 
shown by the reduce applied to {{oldValues}}), but does not require those 
functions to be commutative.

In particular, if the inverse reduction is non-commutative subtraction (e.g. what you're computing is a running sum), it's necessary to know that the intermediate result (the value to be subtracted from) is the first argument of {{invReduceFunc}} and that the second argument is the old value to subtract.

It's only in the commutative case that we don't care which is which.
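
For example, here is a minimal sketch of the running-sum case (assuming a {{DStream[Long]}} named {{counts}}; the name is made up):

{code:scala}
import org.apache.spark.streaming.Seconds

// Running sum over a sliding window, computed incrementally.
// Addition is commutative, but subtraction is not, so the argument order of
// invReduceFunc matters: the first argument is the previous window's result,
// the second is the old value leaving the window.
// (Incremental windowing also requires checkpointing to be enabled.)
val windowedSum = counts.reduceByWindow(
  (acc, v) => acc + v,      // reduceFunc
  (acc, old) => acc - old,  // invReduceFunc: acc - old, never old - acc
  Seconds(30),              // window duration
  Seconds(10)               // slide duration
)
{code}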

The Scaladoc for the various overloads of {{reduceByWindow}} should let the user know which argument is the accumulator and which is the old value. A concise, unambiguous way to state this is to write an inversion law in the Scaladoc:

{{invReduceFunc(reduceFunc(x, y), y) = x}}
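
Concretely, the Scaladoc for those overloads could spell it out along these lines (the wording below is only a suggestion):

{code:scala}
/**
 * @param invReduceFunc inverse reduce function; must satisfy
 *                      invReduceFunc(reduceFunc(x, y), y) = x, i.e. its first
 *                      argument is the previously reduced (accumulated) value
 *                      and its second argument is the old value to remove.
 */
{code}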

We should also remind the user to use associative reduction (and inverse reduction) functions, since the computation makes that assumption.


