yjshen commented on issue #1708: URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1028963254
Thank you all for your comments and for fixing my mistakes ❤️

I agree that for a small hashtable, i.e., one whose whole data structure is CPU cache resident or at most a few times larger than the cache, the columnar organization gives us all the benefits at no cost. However, as the group-by key cardinality grows, hash aggregation and hash join row concatenation become increasingly bound by the memory access pattern. A columnar-structured hashtable causes N times the cache misses/loads, since we constantly access one tuple at a time across N separate column buffers. This problem is identified as [the memory wall](https://dl.acm.org/doi/10.1145/1409360.1409380). As a result, quite a few systems use a row-wise hashtable even though they employ a vectorized execution model: [Vertica](https://15721.courses.cs.cmu.edu/spring2020/papers/13-execution/shrinivas-icde2013.pdf), [Typer and Tectorwise](https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf), [Operator Fusion](http://www.vldb.org/pvldb/vol11/p1-menon.pdf), and also the latest [DuckDB](https://github.com/duckdb/duckdb/blob/master/src/common/types/row_layout.cpp#L32).

For sort, the problem is similar: re-ordering causes a random access pattern in memory for each column, and with many payload columns this becomes slow.

The row <-> columnar conversion comes at a price: it needs extra computation. But since we always have to buffer intermediate data in memory anyway, and we compose/decompose rows while the data is cache resident, the conversion can be quite efficient.

I'm not proposing to change all these data structures to row-wise at once, but to point out the existing theory and practice of using a row-wise organization for these pipeline breakers' states. We should always implement and benchmark before deciding on such performance-critical code, as we did in the past.

For the row structure, I'd like a combination of all three systems:

- a byte-aligned null bit set
- all attributes stored sequentially
  - for data columns, no padding
  - for aggregate state columns, padding to keep them aligned, which makes the CPU happier when we repeatedly check and update them
- for var-len attributes, (offset + length) stored at the attribute's position, with the actual bytes appended at the end of the row
  - we could introduce a short-string representation later for space efficiency, at the cost of extra effort

The rationale for the proposed structure: we can determine each attribute's offset from the schema alone, and we pay the bill of some extra padding space for better computation performance. A sketch of how the offsets could be computed is shown below.
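To make the layout concrete, here is a minimal Rust sketch of how each attribute's offset could be derived from the schema alone. The `RowType` enum, the `fixed_offsets` function, and the exact padding policy are illustrative assumptions for this comment, not existing DataFusion or Arrow APIs.

```rust
/// Hypothetical column kinds; a real implementation would work from
/// Arrow's `DataType` instead.
#[derive(Clone, Copy)]
enum RowType {
    /// Fixed-width data column: `size` bytes stored inline, no padding.
    Fixed { size: usize },
    /// Fixed-width aggregate state: padded to its own size so repeated
    /// check-and-update cycles hit an aligned address.
    AggState { size: usize },
    /// Var-len attribute: an inline (offset: u32, length: u32) pair,
    /// with the actual bytes appended after the fixed-width region.
    VarLen,
}

/// Compute the byte offset of each attribute within a row.
/// Row layout: [byte-aligned null bit set][attr 0][attr 1]...[var-len tail].
/// Returns the per-attribute offsets and the size of the fixed-width region.
fn fixed_offsets(schema: &[RowType]) -> (Vec<usize>, usize) {
    // One null bit per attribute, rounded up to whole bytes.
    let mut cursor = (schema.len() + 7) / 8;
    let mut offsets = Vec::with_capacity(schema.len());
    for &ty in schema {
        match ty {
            RowType::Fixed { size } => {
                offsets.push(cursor); // packed, no padding
                cursor += size;
            }
            RowType::AggState { size } => {
                // Round up to a multiple of `size` before placing the state.
                cursor = (cursor + size - 1) / size * size;
                offsets.push(cursor);
                cursor += size;
            }
            RowType::VarLen => {
                offsets.push(cursor); // (offset, length) pair = two u32s
                cursor += 8;
            }
        }
    }
    (offsets, cursor)
}

fn main() {
    // e.g. a u64 group key, a var-len string key, and a u64 count state.
    let schema = [
        RowType::Fixed { size: 8 },
        RowType::VarLen,
        RowType::AggState { size: 8 },
    ];
    let (offsets, fixed_len) = fixed_offsets(&schema);
    // prints: offsets = [1, 9, 24], fixed region = 32 bytes
    println!("offsets = {offsets:?}, fixed region = {fixed_len} bytes");
}
```

Because the offsets depend only on the schema, accessing or updating attribute `i` of any row is a single load/store at `row_base + offsets[i]`, with no per-row decoding.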
