alamb commented on issue #1708:
URL:
https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1029014217
> However, as the group-by key cardinality grows, the bottleneck of hash
aggregation or hash join row concatenation becomes more memory access pattern
related.
One thing we might consider is not storing the group key values directly in
the hash table, but separately. Something like:
```text
┌─────────────┐ ┌─────────┬─────────┐
│ │ │Group Key│Group Key│
│ HashTable │ │ Col 1 │ Col 2 │
│ │ ├─────────┼─────────┤
├──────┬──────┤ │ │ │
│ │ │ │ │ │
│ │ │ │ │ │
│ │ │ │ │ │
│ │ │ ├ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ┤
│ │ │ ┌───────▶ idx │
│ key │value │ │ ├ ─ ─ ─ ─ ┬ ─ ─ ─ ─ ┤
│ ├──────┤ │ │ │ │
│ │ idx │───────┘ │ │ │
│ ├──────┤ │ │ │
│ │ │ │ │ │
│ │ │ │ │ │
└──────┴──────┘ │ │ │
│ │ │
HashTable holds indexes │ │ │
to mutable arary │ │ │
│ ... │ ... │
└─────────┴─────────┘
Mutable (Appendable) Array
New group keys are
appended at the end
```
The current hash aggregate code takes this approach -- but instead of using
Mutable/Appendable arrays it uses a Vec of `GroupState):
https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/hash_aggregate.rs#L615
> I'm not proposing to change all these data structures at once to row-wised
but to point out existing theory and practice on using the row-wise
organization for these pipeline breakers' states. We should always implement
and benchmark before deciding on these performance-critical codes, as we did in
the past.
I agree with this 💯 and among other reasons is why I enjoy working with you
(and the rest of the people on this chain!)
BTW the DuckDB sorting blog
https://duckdb.org/2021/08/27/external-sorting.html (and the paper it
references by Goetz Grafe) have a good treatment of sorting (specifically the
calculation of sort keys and then a secondary memory shuffle to sort move the
original data around correctly)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]