Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-04-20 Thread via GitHub


Rachelint commented on issue #15383:
URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2817566333

   > However, I found the performance won't be better for Clickbench queries 4 
and 7.
   
   I think it may be possible that the test queries can't reflect the 
improvement well.
   I may try to make some cases this evening.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-04-20 Thread via GitHub


goldmedal commented on issue #15383:
URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2817168799

   I have another implementation for this issue 
https://github.com/goldmedal/datafusion/pull/4 
   The concept is that getting the row according to indices in the selection 
vector instead of going through all the rows in the batch.
   
   Because it may involve many changes, I want to check if the implementations 
make sense.
   Currently, I only implement `GroupValuesPrimitive::intern` for the group-by 
values. For the aggregation, I only implement `count` and some aggregations 
that use `GroupsAccumulator`.
   
   I also did some optimization for the sv-mode repartition 
https://github.com/apache/datafusion/pull/15423/files#r2051721176.
   
   However, I found the performance won't be better for Clickbench queries 4 
and 7. 
   ```
   Query 4: SELECT COUNT(DISTINCT "UserID") FROM hits;
   Query 7: SELECT "AdvEngineID", COUNT(*) FROM hits WHERE "AdvEngineID" <> 0 
GROUP BY "AdvEngineID" ORDER BY COUNT(*) DESC;
   
   
   Benchmark clickbench_1.json
   
   ┏━━┳━━┳━━┳━━━┓
   ┃ Query┃ feat_hash-agg-sv-disable ┃ feat_hash-agg-sv ┃Change ┃
   ┡━━╇━━╇━━╇━━━┩
   │ QQuery 4 │ 311.74ms │ 320.72ms │ no change │
   │ QQuery 7 │  30.28ms │  29.01ms │ no change │
   └──┴──┴──┴───┘
   ```
   I'm not sure if I'm on the right way 🤔 
   
   @Dandandan @Rachelint Do you have any suggestions for it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-04-05 Thread via GitHub


Dandandan commented on issue #15383:
URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2767253437

   > I'm considering another approach. Maybe I shouldn't use 
filter_record_batch 🤔. It filters the all column iteratly. I should filter the 
row when the accumulator merge_batch 🤔
   
   Yes, doing so will copy the entire batch (which is what we try to avoid) and 
will be slower than `take` (in the end it will do the same).
   I think what we probably want is to get the indices via 
https://docs.rs/arrow/latest/arrow/buffer/struct.BooleanBuffer.html#method.set_indices
 so it only will aggregate the values for those indices.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-04-01 Thread via GitHub


Rachelint commented on issue #15383:
URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2770996279

   > I'm considering another approach. Maybe I shouldn't use 
filter_record_batch 🤔. It filters the all column iteratly. I should filter the 
row when the accumulator merge_batch 🤔
   
   I think also need to filter rows in `GroupValues::intern`, too.
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-03-31 Thread via GitHub


goldmedal commented on issue #15383:
URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2766751553

   I'm considering another approach. Maybe I shouldn't use 
`filter_record_batch` 🤔. It filters the all column iteratly. I should filter 
the row when the accumulator `update_batch` 🤔 
   I'll draft another PR for the approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-03-31 Thread via GitHub


zebsme commented on issue #15383:
URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2766784298

   > I'm considering another approach. Maybe I shouldn't use 
`filter_record_batch` 🤔. It filters the all column iteratly. I should filter 
the row when the accumulator `merge_batch` 🤔 I'll draft another PR for the 
approach.
   
   Agree with you, we should try to avoid directly operate on the record batch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-03-31 Thread via GitHub


goldmedal commented on issue #15383:
URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2766551662

   Based on https://github.com/goldmedal/datafusion/pull/3, I did the some 
benchmarks(`clieckbench_1`, `h2o_medium`) for it.
   `feat_zero-copy-hash-agg-false` is the branch that disables the 
configuration.
   `feat_zero-copy-hash-agg` is the branch enabling the configuration.
   
   In conclusion, HashAggregate is slower in the selection vector mode.
   ```
   Comparing feat_zero-copy-hash-agg-false and feat_zero-copy-hash-agg
   
   Benchmark clickbench_1.json
   
   
┏━━┳━━━┳━┳━━━┓
   ┃ Query┃ feat_zero-copy-hash-agg-false ┃ feat_zero-copy-hash-agg ┃   
 Change ┃
   
┡━━╇━━━╇━╇━━━┩
   │ QQuery 0 │0.24ms │  0.32ms │  
1.33x slower │
   │ QQuery 1 │   26.98ms │ 24.56ms │ 
+1.10x faster │
   │ QQuery 2 │   55.89ms │ 52.55ms │ 
+1.06x faster │
   │ QQuery 3 │   48.20ms │ 45.62ms │ 
+1.06x faster │
   │ QQuery 4 │  313.79ms │347.08ms │  
1.11x slower │
   │ QQuery 5 │  490.80ms │471.41ms │   
  no change │
   │ QQuery 6 │   25.06ms │ 25.46ms │   
  no change │
   │ QQuery 7 │   28.17ms │ 27.29ms │   
  no change │
   │ QQuery 8 │  353.53ms │406.58ms │  
1.15x slower │
   │ QQuery 9 │  514.71ms │478.99ms │ 
+1.07x faster │
   │ QQuery 10│  132.73ms │130.81ms │   
  no change │
   │ QQuery 11│  142.59ms │143.29ms │   
  no change │
   │ QQuery 12│  475.75ms │493.83ms │   
  no change │
   │ QQuery 13│  569.90ms │630.60ms │  
1.11x slower │
   │ QQuery 14│  435.30ms │444.02ms │   
  no change │
   │ QQuery 15│  361.60ms │406.62ms │  
1.12x slower │
   │ QQuery 16│  825.41ms │856.13ms │   
  no change │
   │ QQuery 17│  752.13ms │766.95ms │   
  no change │
   │ QQuery 18│ 1813.04ms │   1934.07ms │  
1.07x slower │
   │ QQuery 19│   40.67ms │ 41.49ms │   
  no change │
   │ QQuery 20│  621.14ms │625.89ms │   
  no change │
   │ QQuery 21│  769.98ms │749.81ms │   
  no change │
   │ QQuery 22│ 1544.70ms │   1560.61ms │   
  no change │
   │ QQuery 23│ 4471.51ms │   4356.12ms │   
  no change │
   │ QQuery 24│  257.77ms │265.81ms │   
  no change │
   │ QQuery 25│  268.53ms │273.24ms │   
  no change │
   │ QQuery 26│  294.19ms │307.36ms │   
  no change │
   │ QQuery 27│  983.41ms │987.90ms │   
  no change │
   │ QQuery 28│ 7514.46ms │   7533.94ms │   
  no change │
   │ QQuery 29│  346.70ms │344.54ms │   
  no change │
   │ QQuery 30│  387.65ms │405.92ms │   
  no change │
   │ QQuery 31│  390.81ms │427.40ms │  
1.09x slower │
   │ QQuery 32│ 1597.45ms │   1987.50ms │  
1.24x slower │
   │ QQuery 33│ 1753.56ms │   1863.63ms │  
1.06x slower │
   │ QQuery 34│ 1950.84ms │   1945.21ms │   
  no change │
   │ QQuery 35│  510.78ms │560.47ms │  
1.10x slower │
   │ QQuery 36│  105.22ms │110.02ms │   
  no change │
   │ QQuery 37│   56.69ms │ 53.63ms │ 
+1.06x faster │
   │ QQuery 38│   74.69ms │ 77.84ms │   
  no change │
   │ QQuery 39│  189.59ms │193.83ms │   
  no change │
   │ QQuery 40│   24.37ms │ 24.50ms │   
  no change │
   │ QQuery 41│   23.01ms │ 23.45ms │   
  no change │
   │ QQuery 42│   27.48ms │ 27.82ms │   
  no change

Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-03-29 Thread via GitHub


goldmedal commented on issue #15383:
URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2763293892

   @Dandandan 
   I have a draft https://github.com/goldmedal/datafusion/pull/3 based on 
#15423 for `HashAggregate`. Could you check if it's heading in the right 
direction?  
   
   When the selection vector mode is enabled:  
   - `CoalesceBatchesExec` is not added for `FinalPartitioned`.  
   - The selection vector is used to filter the required rows before merging 
batches.  
   
   The plan looks like this:
   ```
   > create table t(c int) as values (1), (1), (1), (1), (2), (2), (3), (3)
   > explain select count(distinct c) from t;
   
+---+--+
   | plan_type | plan   
  |
   
+---+--+
   | logical_plan  | Projection: count(alias1) AS count(DISTINCT t.c)   
  |
   |   |   Aggregate: groupBy=[[]], aggr=[[count(alias1)]]  
  |
   |   | Aggregate: groupBy=[[t.c AS alias1]], aggr=[[]]
  |
   |   |   TableScan: t projection=[c]  
  |
   | physical_plan | ProjectionExec: expr=[count(alias1)@0 as count(DISTINCT 
t.c)]|
   |   |   AggregateExec: mode=Final, gby=[], aggr=[count(alias1)]  
  |
   |   | CoalescePartitionsExec 
  |
   |   |   AggregateExec: mode=Partial, gby=[], 
aggr=[count(alias1)]  |
   |   | AggregateExec: mode=FinalPartitioned, 
gby=[alias1@0 as alias1], aggr=[]  |
   |   |   RepartitionExec: 
partitioning=HashSelectionVector([alias1@0], 12), input_partitions=12 |
   |   | RepartitionExec: 
partitioning=RoundRobinBatch(12), input_partitions=1|
   |   |   AggregateExec: mode=Partial, gby=[c@0 as 
alias1], aggr=[]  |
   |   | DataSourceExec: partitions=1, 
partition_sizes=[1]|
   |   |
  |
   
+---+--+
   ```
   
   I'll review more aggregation patterns and add additional tests.
   Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]