andygrove commented on issue #11850:
URL: https://github.com/apache/datafusion/issues/11850#issuecomment-2273402515
@alamb Sure, here is one of the query stages after we have translated it to
a DataFusion plan. Note that we are performing a join on the output of two
partial aggregates and then applying the final aggregate after the join. Having
duplicates on either input to the join causes extra rows to be generated in the
join output.
Perhaps we'll need to start thinking about having a physical optimizer phase
in Comet so that we can leverage the "skip partial aggregates" feature in some
cases.
```
ProjectionExec: expr=[sum@0 as col_0, sum@1 as col_1, sum@2 as col_2]
  AggregateExec: mode=Final, gby=[], aggr=[sum, sum, sum]
    AggregateExec: mode=Partial, gby=[], aggr=[sum, sum, sum]
      ProjectionExec: expr=[col_0@0 as col_0, col_0@2 as col_1]
        SortMergeJoin: join_type=Full, on=[(col_0@0, col_0@0), (col_1@1, col_1@1)]
          SortExec: expr=[col_0@0 ASC,col_1@1 ASC], preserve_partitioning=[false]
            CopyExec
              ProjectionExec: expr=[col_0@0 as col_0, col_1@1 as col_1]
                AggregateExec: mode=Partial, gby=[col_0@0 as col_0, col_1@1 as col_1], aggr=[]
                  ScanExec: schema=[col_0: Int32, col_1: Int32]
          SortExec: expr=[col_0@0 ASC,col_1@1 ASC], preserve_partitioning=[false]
            CopyExec
              ProjectionExec: expr=[col_0@0 as col_0, col_1@1 as col_1]
                AggregateExec: mode=Partial, gby=[col_0@0 as col_0, col_1@1 as col_1], aggr=[]
                  ScanExec: schema=[col_0: Int32, col_1: Int32]
```
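
To make the duplicate-row effect concrete, here is a minimal Python sketch (not Comet or DataFusion code). It assumes, purely for illustration, that a partial group-by with no aggregate expressions deduplicates keys only within each input batch, so the same key can appear more than once in its output. The helper names `partial_distinct` and `full_join` and the sample batches are hypothetical.

```python
from collections import Counter


def partial_distinct(batches):
    """Simulate a Partial group-by with aggr=[]: dedup keys within each batch only.

    Assumption for this sketch: no state is shared across batches, so a key
    that occurs in two batches is emitted twice.
    """
    out = []
    for batch in batches:
        seen = set()
        for key in batch:
            if key not in seen:
                seen.add(key)
                out.append(key)
    return out


def full_join(left, right):
    """Full outer equi-join on the whole key; returns (left_row, right_row) pairs."""
    left_counts = Counter(left)
    right_counts = Counter(right)
    rows = []
    for l in left:
        if l in right_counts:
            # One output row per matching right-side row.
            rows.extend((l, l) for _ in range(right_counts[l]))
        else:
            rows.append((l, None))
    for r in right:
        if r not in left_counts:
            rows.append((None, r))
    return rows


# Hypothetical (col_0, col_1) keys, split into batches.
left_batches = [[(1, 1), (1, 1), (2, 2)], [(1, 1), (3, 3)]]
right_batches = [[(1, 1)], [(1, 1), (2, 2)]]

left_partial = partial_distinct(left_batches)    # (1, 1) appears twice
right_partial = partial_distinct(right_batches)  # (1, 1) appears twice

# Joining the partial outputs directly, as in the plan above.
joined_partial = full_join(left_partial, right_partial)

# Joining fully deduplicated inputs, as a final aggregate before the join would give.
joined_distinct = full_join(sorted(set(left_partial)), sorted(set(right_partial)))

print(len(joined_partial))   # 6 rows: key (1, 1) alone contributes 2 x 2 = 4
print(len(joined_distinct))  # 3 rows
```

In this sketch the final aggregate above the join would sum over 6 rows instead of 3, which is the kind of discrepancy described above.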