suibianwanwank commented on issue #15524:
URL: https://github.com/apache/datafusion/issues/15524#issuecomment-2786137208
I'm very interested in this issue, so I try a simple POC. Below are my
observations and recommendations.
If my understanding is correct, a truly linear function should satisfy the
identity:
`f(3x + 2y + 1) = 3f(x) + 2f(y) + f(1)`,
However, in practice, none of the basic SQL aggregate functions—`SUM`,
`COUNT`, `MAX`, or `MIN`—exhibit this linearity.
## SUM
Take the expression:
`sum(3x + 2y + 1)`,
we want to split it into
`3sum(x) + 2sum(y) + sum(1)`
This conversion is based on the assumption that both `x` and `y` are
non-null:
For any row, assuming x is null, the corresponding sum(3x + 2y + 1) should
accumulate 0, but after splitting, sum(1) will accumulate 1
To make the transformation semantically equivalent, we need to add **null
filtering** from the original expression. A more accurate rewrite would be:
```
3 * sum(x) filter (y is not null) + 2 * sum(y) filter (x is not null) +
sum(1) filter (x, y is not null)
```
Similarly, for Q29, `SUM("ResolutionWidth" + 1)` should be converted to
`SUM(ResolutionWidth) + 1 * COUNT(ResolutionWidth)` rather than `COUNT(*)`.
## MAX/MIN
We seem to be able to support only the most basic scenarios e.g..
`max(a + 1)=max(a) + 1`
Consider the following scenario, the conversions should not be equivalent.
`max(1 - a) != 1 - max(a)`
`max(a + b) != max(a) + max(b)`
## COUNT
`COUNT` has almost none of the characteristics of linear, e.g..
`count(x + 1) != count(x) + 1`
The optimization I can think of is: if all fields in expr are null and expr
always outputs null, then we can eliminate the computation of expr except for
the fields. i.e.
`count(x + 1) => count(x)`
## Conclusion
In conclusion, unfortunately, I haven't found a universal approach to handle
the envisioned "linear functions" as none fully meet the criteria of linear
functions.
Moreover, in my view, the optimization scenarios described above have very
limited applicability. I'm uncertain whether such queries would occur in
real-world scenarios. However, for Q29, the SUM optimization can provide
significant performance improvements.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]