Jorge created ARROW-9937:
----------------------------

             Summary: [Rust] [DataFusion] Average is not correct
                 Key: ARROW-9937
                 URL: https://issues.apache.org/jira/browse/ARROW-9937
             Project: Apache Arrow
          Issue Type: Bug
          Components: Rust, Rust - DataFusion
            Reporter: Jorge


The current design of aggregates makes the calculation of the average incorrect.
It also makes it impossible to compute the [geometric 
mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other 
operations. 

The central issue is that Accumulator returns a `ScalarValue` during partial 
aggregations via {{get_value}}, but a single `ScalarValue` often does not 
carry enough information to perform the full aggregation.

A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are 
distributed in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current 
calculation computes partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then 
reduces them using another average, i.e.

{{((x1+x2)/2 + (x3+x4)/2 + x5)/3}}

which is not, in general, equal to {{(x1 + x2 + x3 + x4 + x5)/5}}.
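
To make the discrepancy concrete, here is a minimal sketch in plain Rust. It does not use any DataFusion code; the helper names {{mean}} and {{mean_of_batch_means}} are made up for illustration and only mimic the calculation described above.

```rust
/// Arithmetic mean of a non-empty slice.
fn mean(values: &[f64]) -> f64 {
    values.iter().sum::<f64>() / values.len() as f64
}

/// Mean of the per-batch means (hypothetical helper): this mirrors the
/// partial-mean-then-average calculation described above.
/// Assumes `values` is non-empty and `batch_size` is greater than zero.
fn mean_of_batch_means(values: &[f64], batch_size: usize) -> f64 {
    let partial_means: Vec<f64> = values
        .chunks(batch_size)
        .map(|batch| mean(batch))
        .collect();
    mean(&partial_means)
}
```

With the values 1, 2, 3, 4, 5 in batches of 2, the partial means are 1.5, 3.5 and 5, so {{mean_of_batch_means}} gives 10/3 (about 3.33), while {{mean}} over all five values gives 3.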

I believe that our Accumulators need to pass more information from the partial 
aggregations to the final aggregation.

We could consider adopting an API equivalent to 
[Spark's|https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html], 
i.e. have an `update`, a `merge` and an `evaluate`.
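
As a rough sketch of what such an API could look like for the average, consider the following. Everything here is an assumption made for illustration: the trait, {{AvgState}}, {{AvgAccumulator}}, {{partial_state}} and {{partitioned_avg}} are hypothetical names, values are fixed to f64 for brevity, and none of this is the existing DataFusion trait. The point is that the partial aggregation hands over a (sum, count) pair rather than a single value.

```rust
/// Hypothetical intermediate state for an average: enough information
/// to combine partial results without losing the element count.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct AvgState {
    sum: f64,
    count: u64,
}

/// Hypothetical accumulator API with `update`, `merge` and `evaluate`.
trait Accumulator {
    type State;
    /// Fold one input value into the accumulator (partial aggregation).
    fn update(&mut self, value: f64);
    /// Expose the intermediate state passed to the final aggregation.
    fn partial_state(&self) -> Self::State;
    /// Combine another accumulator's intermediate state into this one.
    fn merge(&mut self, other: &Self::State);
    /// Produce the final result; `None` if no values were seen.
    fn evaluate(&self) -> Option<f64>;
}

#[derive(Default)]
struct AvgAccumulator {
    state: AvgState,
}

impl Accumulator for AvgAccumulator {
    type State = AvgState;

    fn update(&mut self, value: f64) {
        self.state.sum += value;
        self.state.count += 1;
    }

    fn partial_state(&self) -> AvgState {
        self.state
    }

    fn merge(&mut self, other: &AvgState) {
        self.state.sum += other.sum;
        self.state.count += other.count;
    }

    fn evaluate(&self) -> Option<f64> {
        if self.state.count == 0 {
            None
        } else {
            Some(self.state.sum / self.state.count as f64)
        }
    }
}

/// Aggregate each batch separately, then merge the partial states.
fn partitioned_avg(batches: &[Vec<f64>]) -> Option<f64> {
    let mut final_acc = AvgAccumulator::default();
    for batch in batches {
        let mut partial = AvgAccumulator::default();
        for &value in batch {
            partial.update(value);
        }
        final_acc.merge(&partial.partial_state());
    }
    final_acc.evaluate()
}
```

Because {{merge}} adds sums and counts instead of averaging averages, the batches {[1, 2], [3, 4], [5]} evaluate to 3, the same as the mean over all five values. Under the same assumptions, other aggregates would only need a different state type, for example a set of seen values for a distinct sum.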


