Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]
Rachelint commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2817566333 > However, I found the performance won't be better for Clickbench queries 4 and 7. I think it may be possible that the test queries can't reflect the improvement well. I may try to make some cases this evening. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]
goldmedal commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2817168799 I have another implementation for this issue https://github.com/goldmedal/datafusion/pull/4 The concept is that getting the row according to indices in the selection vector instead of going through all the rows in the batch. Because it may involve many changes, I want to check if the implementations make sense. Currently, I only implement `GroupValuesPrimitive::intern` for the group-by values. For the aggregation, I only implement `count` and some aggregations that use `GroupsAccumulator`. I also did some optimization for the sv-mode repartition https://github.com/apache/datafusion/pull/15423/files#r2051721176. However, I found the performance won't be better for Clickbench queries 4 and 7. ``` Query 4: SELECT COUNT(DISTINCT "UserID") FROM hits; Query 7: SELECT "AdvEngineID", COUNT(*) FROM hits WHERE "AdvEngineID" <> 0 GROUP BY "AdvEngineID" ORDER BY COUNT(*) DESC; Benchmark clickbench_1.json ┏━━┳━━┳━━┳━━━┓ ┃ Query┃ feat_hash-agg-sv-disable ┃ feat_hash-agg-sv ┃Change ┃ ┡━━╇━━╇━━╇━━━┩ │ QQuery 4 │ 311.74ms │ 320.72ms │ no change │ │ QQuery 7 │ 30.28ms │ 29.01ms │ no change │ └──┴──┴──┴───┘ ``` I'm not sure if I'm on the right way 🤔 @Dandandan @Rachelint Do you have any suggestions for it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]
Dandandan commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2767253437 > I'm considering another approach. Maybe I shouldn't use filter_record_batch 🤔. It filters the all column iteratly. I should filter the row when the accumulator merge_batch 🤔 Yes, doing so will copy the entire batch (which is what we try to avoid) and will be slower than `take` (in the end it will do the same). I think what we probably want is to get the indices via https://docs.rs/arrow/latest/arrow/buffer/struct.BooleanBuffer.html#method.set_indices so it only will aggregate the values for those indices. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]
Rachelint commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2770996279 > I'm considering another approach. Maybe I shouldn't use filter_record_batch 🤔. It filters the all column iteratly. I should filter the row when the accumulator merge_batch 🤔 I think also need to filter rows in `GroupValues::intern`, too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]
goldmedal commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2766751553 I'm considering another approach. Maybe I shouldn't use `filter_record_batch` 🤔. It filters the all column iteratly. I should filter the row when the accumulator `update_batch` 🤔 I'll draft another PR for the approach. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]
zebsme commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2766784298 > I'm considering another approach. Maybe I shouldn't use `filter_record_batch` 🤔. It filters the all column iteratly. I should filter the row when the accumulator `merge_batch` 🤔 I'll draft another PR for the approach. Agree with you, we should try to avoid directly operate on the record batch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]
goldmedal commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2766551662 Based on https://github.com/goldmedal/datafusion/pull/3, I did the some benchmarks(`clieckbench_1`, `h2o_medium`) for it. `feat_zero-copy-hash-agg-false` is the branch that disables the configuration. `feat_zero-copy-hash-agg` is the branch enabling the configuration. In conclusion, HashAggregate is slower in the selection vector mode. ``` Comparing feat_zero-copy-hash-agg-false and feat_zero-copy-hash-agg Benchmark clickbench_1.json ┏━━┳━━━┳━┳━━━┓ ┃ Query┃ feat_zero-copy-hash-agg-false ┃ feat_zero-copy-hash-agg ┃ Change ┃ ┡━━╇━━━╇━╇━━━┩ │ QQuery 0 │0.24ms │ 0.32ms │ 1.33x slower │ │ QQuery 1 │ 26.98ms │ 24.56ms │ +1.10x faster │ │ QQuery 2 │ 55.89ms │ 52.55ms │ +1.06x faster │ │ QQuery 3 │ 48.20ms │ 45.62ms │ +1.06x faster │ │ QQuery 4 │ 313.79ms │347.08ms │ 1.11x slower │ │ QQuery 5 │ 490.80ms │471.41ms │ no change │ │ QQuery 6 │ 25.06ms │ 25.46ms │ no change │ │ QQuery 7 │ 28.17ms │ 27.29ms │ no change │ │ QQuery 8 │ 353.53ms │406.58ms │ 1.15x slower │ │ QQuery 9 │ 514.71ms │478.99ms │ +1.07x faster │ │ QQuery 10│ 132.73ms │130.81ms │ no change │ │ QQuery 11│ 142.59ms │143.29ms │ no change │ │ QQuery 12│ 475.75ms │493.83ms │ no change │ │ QQuery 13│ 569.90ms │630.60ms │ 1.11x slower │ │ QQuery 14│ 435.30ms │444.02ms │ no change │ │ QQuery 15│ 361.60ms │406.62ms │ 1.12x slower │ │ QQuery 16│ 825.41ms │856.13ms │ no change │ │ QQuery 17│ 752.13ms │766.95ms │ no change │ │ QQuery 18│ 1813.04ms │ 1934.07ms │ 1.07x slower │ │ QQuery 19│ 40.67ms │ 41.49ms │ no change │ │ QQuery 20│ 621.14ms │625.89ms │ no change │ │ QQuery 21│ 769.98ms │749.81ms │ no change │ │ QQuery 22│ 1544.70ms │ 1560.61ms │ no change │ │ QQuery 23│ 4471.51ms │ 4356.12ms │ no change │ │ QQuery 24│ 257.77ms │265.81ms │ no change │ │ QQuery 25│ 268.53ms │273.24ms │ no change │ │ QQuery 26│ 294.19ms │307.36ms │ no change │ │ QQuery 27│ 983.41ms │987.90ms │ no change │ │ QQuery 28│ 7514.46ms │ 7533.94ms │ no change │ │ QQuery 29│ 346.70ms │344.54ms │ no change │ │ QQuery 30│ 387.65ms │405.92ms │ no change │ │ QQuery 31│ 390.81ms │427.40ms │ 1.09x slower │ │ QQuery 32│ 1597.45ms │ 1987.50ms │ 1.24x slower │ │ QQuery 33│ 1753.56ms │ 1863.63ms │ 1.06x slower │ │ QQuery 34│ 1950.84ms │ 1945.21ms │ no change │ │ QQuery 35│ 510.78ms │560.47ms │ 1.10x slower │ │ QQuery 36│ 105.22ms │110.02ms │ no change │ │ QQuery 37│ 56.69ms │ 53.63ms │ +1.06x faster │ │ QQuery 38│ 74.69ms │ 77.84ms │ no change │ │ QQuery 39│ 189.59ms │193.83ms │ no change │ │ QQuery 40│ 24.37ms │ 24.50ms │ no change │ │ QQuery 41│ 23.01ms │ 23.45ms │ no change │ │ QQuery 42│ 27.48ms │ 27.82ms │ no change
Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]
goldmedal commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2763293892 @Dandandan I have a draft https://github.com/goldmedal/datafusion/pull/3 based on #15423 for `HashAggregate`. Could you check if it's heading in the right direction? When the selection vector mode is enabled: - `CoalesceBatchesExec` is not added for `FinalPartitioned`. - The selection vector is used to filter the required rows before merging batches. The plan looks like this: ``` > create table t(c int) as values (1), (1), (1), (1), (2), (2), (3), (3) > explain select count(distinct c) from t; +---+--+ | plan_type | plan | +---+--+ | logical_plan | Projection: count(alias1) AS count(DISTINCT t.c) | | | Aggregate: groupBy=[[]], aggr=[[count(alias1)]] | | | Aggregate: groupBy=[[t.c AS alias1]], aggr=[[]] | | | TableScan: t projection=[c] | | physical_plan | ProjectionExec: expr=[count(alias1)@0 as count(DISTINCT t.c)]| | | AggregateExec: mode=Final, gby=[], aggr=[count(alias1)] | | | CoalescePartitionsExec | | | AggregateExec: mode=Partial, gby=[], aggr=[count(alias1)] | | | AggregateExec: mode=FinalPartitioned, gby=[alias1@0 as alias1], aggr=[] | | | RepartitionExec: partitioning=HashSelectionVector([alias1@0], 12), input_partitions=12 | | | RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1| | | AggregateExec: mode=Partial, gby=[c@0 as alias1], aggr=[] | | | DataSourceExec: partitions=1, partition_sizes=[1]| | | | +---+--+ ``` I'll review more aggregation patterns and add additional tests. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
