haohuaijin opened a new pull request, #8266: URL: https://github.com/apache/arrow-datafusion/pull/8266
## Which issue does this PR close?
<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases. You can
link an issue to this PR using the GitHub syntax. For example `Closes #123`
indicates that this PR will close issue #123.
-->
Closes #8123
## Rationale for this change
in this pr
```
❯ SELECT "RegionID", SUM("AdvEngineID"), COUNT(DISTINCT "UserID") FROM
'../benchmarks/data/hits.parquet' GROUP BY "RegionID" order by "RegionID" limit
10;
+----------+--------------------------------------------------+--------------------------------------------------------+
| RegionID | SUM(../benchmarks/data/hits.parquet.AdvEngineID) |
COUNT(DISTINCT ../benchmarks/data/hits.parquet.UserID) |
+----------+--------------------------------------------------+--------------------------------------------------------+
| 0 | 0 | 8
|
| 1 | 147946 | 239380
|
| 2 | 441662 | 1081016
|
| 3 | 39724 | 131195
|
| 4 | 34557 | 79500
|
| 5 | 13502 | 40914
|
| 6 | 24338 | 55768
|
| 7 | 28417 | 64989
|
| 8 | 34483 | 65472
|
| 9 | 38047 | 91576
|
+----------+--------------------------------------------------+--------------------------------------------------------+
10 rows in set. Query took 0.935 seconds.
❯ explain SELECT "RegionID", SUM("AdvEngineID"), COUNT(DISTINCT "UserID")
FROM '../benchmarks/data/hits.parquet' GROUP BY "RegionID" order by "RegionID"
limit 10;
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Limit: skip=0, fetch=10
|
| | Sort: ../benchmarks/data/hits.parquet.RegionID ASC NULLS
LAST, fetch=10
|
| | Projection: ../benchmarks/data/hits.parquet.RegionID,
SUM(alias2) AS SUM(../benchmarks/data/hits.parquet.AdvEngineID), COUNT(alias1)
AS COUNT(DISTINCT ../benchmarks/data/hits.parquet.UserID)
|
| | Aggregate:
groupBy=[[../benchmarks/data/hits.parquet.RegionID]], aggr=[[SUM(alias2),
COUNT(alias1)]]
|
| | Aggregate:
groupBy=[[../benchmarks/data/hits.parquet.RegionID,
../benchmarks/data/hits.parquet.UserID AS alias1]],
aggr=[[SUM(CAST(../benchmarks/data/hits.parquet.AdvEngineID AS Int64)) AS
alias2]]
|
| | TableScan: ../benchmarks/data/hits.parquet
projection=[RegionID, UserID, AdvEngineID]
|
| physical_plan | GlobalLimitExec: skip=0, fetch=10
|
| | SortPreservingMergeExec: [RegionID@0 ASC NULLS LAST],
fetch=10
|
| | SortExec: TopK(fetch=10), expr=[RegionID@0 ASC NULLS
LAST]
|
| | ProjectionExec: expr=[RegionID@0 as RegionID,
SUM(alias2)@1 as SUM(../benchmarks/data/hits.parquet.AdvEngineID),
COUNT(alias1)@2 as COUNT(DISTINCT ../benchmarks/data/hits.parquet.UserID)]
|
| | AggregateExec: mode=FinalPartitioned,
gby=[RegionID@0 as RegionID], aggr=[SUM(alias2), COUNT(alias1)]
|
| | CoalesceBatchesExec: target_batch_size=8192
|
| | RepartitionExec:
partitioning=Hash([RegionID@0], 24), input_partitions=24
|
| | AggregateExec: mode=Partial, gby=[RegionID@0
as RegionID], aggr=[SUM(alias2), COUNT(alias1)]
|
| | AggregateExec: mode=FinalPartitioned,
gby=[RegionID@0 as RegionID, alias1@1 as alias1], aggr=[alias2]
|
| | CoalesceBatchesExec:
target_batch_size=8192
|
| | RepartitionExec:
partitioning=Hash([RegionID@0, alias1@1], 24), input_partitions=24
|
| | AggregateExec: mode=Partial,
gby=[RegionID@0 as RegionID, UserID@1 as alias1], aggr=[alias2]
|
| | ParquetExec: file_groups={24
groups: [[home/hhj/datafusion/benchmarks/data/hits.parquet:0..615832352],
[home/hhj/datafusion/benchmarks/data/hits.parquet:615832352..1231664704],
[home/hhj/datafusion/benchmarks/data/hits.parquet:1231664704..1847497056],
[home/hhj/datafusion/benchmarks/data/hits.parquet:1847497056..2463329408],
[home/hhj/datafusion/benchmarks/data/hits.parquet:2463329408..3079161760],
...]}, projection=[RegionID, UserID, AdvEngineID] |
| |
|
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.043 seconds.
```
in main 393e48f98872c696a90fce033fa584533d2326fa
```
❯ SELECT "RegionID", SUM("AdvEngineID"), COUNT(DISTINCT "UserID") FROM
'../benchmarks/data/hits.parquet' GROUP BY "RegionID" order by "RegionID" limit
10;
+----------+--------------------------------------------------+--------------------------------------------------------+
| RegionID | SUM(../benchmarks/data/hits.parquet.AdvEngineID) |
COUNT(DISTINCT ../benchmarks/data/hits.parquet.UserID) |
+----------+--------------------------------------------------+--------------------------------------------------------+
| 0 | 0 | 8
|
| 1 | 147946 | 239380
|
| 2 | 441662 | 1081016
|
| 3 | 39724 | 131195
|
| 4 | 34557 | 79500
|
| 5 | 13502 | 40914
|
| 6 | 24338 | 55768
|
| 7 | 28417 | 64989
|
| 8 | 34483 | 65472
|
| 9 | 38047 | 91576
|
+----------+--------------------------------------------------+--------------------------------------------------------+
10 rows in set. Query took 1.349 seconds.
❯ explain SELECT "RegionID", SUM("AdvEngineID"), COUNT(DISTINCT "UserID")
FROM '../benchmarks/data/hits.parquet' GROUP BY "RegionID" order by "RegionID"
limit 10;
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Limit: skip=0, fetch=10
|
| | Sort: ../benchmarks/data/hits.parquet.RegionID ASC NULLS
LAST, fetch=10
|
| | Aggregate:
groupBy=[[../benchmarks/data/hits.parquet.RegionID]],
aggr=[[SUM(CAST(../benchmarks/data/hits.parquet.AdvEngineID AS Int64)),
COUNT(DISTINCT ../benchmarks/data/hits.parquet.UserID)]]
|
| | TableScan: ../benchmarks/data/hits.parquet
projection=[RegionID, UserID, AdvEngineID]
|
| physical_plan | GlobalLimitExec: skip=0, fetch=10
|
| | SortPreservingMergeExec: [RegionID@0 ASC NULLS LAST],
fetch=10
|
| | SortExec: TopK(fetch=10), expr=[RegionID@0 ASC NULLS
LAST]
|
| | AggregateExec: mode=FinalPartitioned,
gby=[RegionID@0 as RegionID],
aggr=[SUM(../benchmarks/data/hits.parquet.AdvEngineID), COUNT(DISTINCT
../benchmarks/data/hits.parquet.UserID)]
|
| | CoalesceBatchesExec: target_batch_size=8192
|
| | RepartitionExec: partitioning=Hash([RegionID@0],
24), input_partitions=24
|
| | AggregateExec: mode=Partial, gby=[RegionID@0
as RegionID], aggr=[SUM(../benchmarks/data/hits.parquet.AdvEngineID),
COUNT(DISTINCT ../benchmarks/data/hits.parquet.UserID)]
|
| | ParquetExec: file_groups={24 groups:
[[home/hhj/datafusion/benchmarks/data/hits.parquet:0..615832352],
[home/hhj/datafusion/benchmarks/data/hits.parquet:615832352..1231664704],
[home/hhj/datafusion/benchmarks/data/hits.parquet:1231664704..1847497056],
[home/hhj/datafusion/benchmarks/data/hits.parquet:1847497056..2463329408],
[home/hhj/datafusion/benchmarks/data/hits.parquet:2463329408..3079161760],
...]}, projection=[RegionID, UserID, AdvEngineID] |
| |
|
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.040 seconds.
```
## What changes are included in this PR?
add no-distinct sum/min/max aggregate support in single_distinct_to_group_by
rule
## Are these changes tested?
yes, add some tests
## Are there any user-facing changes?
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->
<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
