berkaysynnada commented on issue #15524:
URL: https://github.com/apache/datafusion/issues/15524#issuecomment-2771901348
I haven't reviewed the PR yet, but I agree with @Dandandan, and I think we
can improve this further. We've actually thought about this issue before and
sketched out an initial design. Let me share some notes from that:
This simplification should be based on the **linearity property**, not just
`SUM()` and `COUNT()` rewrites. Formally:
**_f(x + y) = f(x) + f(y)_**, if f is a linear function.
So, I believe we should define a `"linear function"` tag for all such
functions.
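To spell out how the property applies to an aggregate, here is a short derivation. It assumes $N$ rows with non-NULL values $x_i$, a constant $n$, and no integer overflow; these symbols are introduced here for illustration only:

$$
\mathrm{SUM}(x + n) = \sum_{i=1}^{N} (x_i + n) = \sum_{i=1}^{N} x_i + \sum_{i=1}^{N} n = \mathrm{SUM}(x) + n \cdot N
$$

Here $N$ is the number of rows that contribute to the sum, which is what `COUNT` returns over those rows.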
Consider the same query:
```sql
SELECT SUM(id), SUM(id + 1), SUM(id + 2), ..., SUM(id + 89) FROM employees;
```
Logical plan (LP):
```
--Aggregate: groupBy=[[]], aggr=[
sum(__common_expr_1 AS employees.id),
sum(__common_expr_1 AS employees.id + Int64(1)),
...,
sum(__common_expr_1 AS employees.id + Int64(89))
]
----Projection: CAST(employees.id AS Int64) AS __common_expr_1
------TableScan: employees projection=[id]
```
Physical plan (PP):
```
--AggregateExec: mode=Single, gby=[], aggr=[
sum(employees.id),
sum(employees.id + Int64(1)),
...,
sum(employees.id + Int64(89))
]
----ProjectionExec: expr=[CAST(id@0 AS Int64) as __common_expr_1]
------MemoryExec: partitions=1, partition_sizes=[1]
```
We should apply the linearity property here to simplify expressions like
`SUM(id + n)` into `SUM(id) + n * COUNT(1)` when `n` is a constant (with
`COUNT(id)` in place of `COUNT(1)` if `id` is nullable, since `SUM` skips
NULL rows). The non-constant case doesn't affect the performance of this
ClickBench query, but we should handle it properly as well.
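As a sanity check on the constant case, here is a minimal, self-contained sketch in plain Rust (no DataFusion APIs). The helper names `sql_sum`, `sql_count`, `sum_plus_const_direct`, and `sum_plus_const_rewritten` are hypothetical and exist only for this illustration. The sketch models a nullable `Int64` column as `&[Option<i64>]`, assumes standard SQL semantics (`SUM` and `COUNT(col)` skip NULLs, `SUM` over no non-NULL input is NULL), and ignores overflow:

```rust
/// SQL-style SUM over a nullable Int64 column: NULLs are skipped, and the
/// result is NULL (`None`) when there is no non-NULL input.
fn sql_sum(col: &[Option<i64>]) -> Option<i64> {
    let mut acc: Option<i64> = None;
    for v in col.iter().flatten() {
        acc = Some(acc.unwrap_or(0) + *v);
    }
    acc
}

/// SQL-style COUNT(col): the number of non-NULL values.
fn sql_count(col: &[Option<i64>]) -> i64 {
    col.iter().filter(|v| v.is_some()).count() as i64
}

/// Direct evaluation of SUM(col + n). NULL + n is NULL, so NULL rows
/// drop out of the sum.
fn sum_plus_const_direct(col: &[Option<i64>], n: i64) -> Option<i64> {
    let shifted: Vec<Option<i64>> = col.iter().map(|v| v.map(|x| x + n)).collect();
    sql_sum(&shifted)
}

/// Rewritten form: SUM(col) + n * COUNT(col).
/// If SUM(col) is NULL, the whole expression stays NULL.
fn sum_plus_const_rewritten(col: &[Option<i64>], n: i64) -> Option<i64> {
    sql_sum(col).map(|s| s + n * sql_count(col))
}
```

For a column `[1, NULL, 3]` and `n = 5`, both forms should agree, whereas multiplying `n` by the total row count (3) would not. That is why this sketch counts non-NULL values rather than rows. Extending the idea to a non-constant `n` would need similar care with NULLs in both operands.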