yjshen commented on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1028963254


   Thank you all for your comments and for fixing my mistakes ❤️
   
   I agree that for a small hashtable, i.e., one whose whole data structure is CPU cache resident or only a few times larger than the cache, the columnar organization gives us all the benefits at no cost.
   
   However, as the group-by key cardinality grows, the bottleneck of hash aggregation or hash-join row concatenation shifts toward the memory access pattern. A columnar-structured hashtable would incur N times the cache misses/loads, since we constantly access one tuple at a time while its N attributes live in N separate columns.
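   To make the access-pattern argument concrete, here is a minimal Rust sketch (the type and function names are illustrative, not DataFusion APIs): with columnar state, updating one group touches a separate allocation per attribute, while row-wise state keeps a group's attributes on the same cache line.

   ```rust
   // Columnar state: each attribute of a group lives in its own vector,
   // so updating one group means one load per attribute vector.
   struct ColumnarGroups {
       keys: Vec<u64>,
       sums: Vec<i64>,
       counts: Vec<i64>,
   }

   // Row-wise state: one contiguous record per group, so updating a
   // group typically touches a single cache line.
   struct RowGroup {
       key: u64,
       sum: i64,
       count: i64,
   }

   fn update_columnar(g: &mut ColumnarGroups, idx: usize, v: i64) {
       g.sums[idx] += v;   // access into allocation #1
       g.counts[idx] += 1; // access into allocation #2
   }

   fn update_row(rows: &mut [RowGroup], idx: usize, v: i64) {
       let r = &mut rows[idx]; // both fields sit in one record
       r.sum += v;
       r.count += 1;
   }
   ```

   With random group indices (high cardinality), each columnar update risks a cache miss per attribute; the row-wise update amortizes them into one.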
   
   The problem is known as [the memory wall](https://dl.acm.org/doi/10.1145/1409360.1409380). As a result, quite a lot of systems use a row-wise hashtable even though they employ a vectorized execution model: [Vertica](https://15721.courses.cs.cmu.edu/spring2020/papers/13-execution/shrinivas-icde2013.pdf), [Typer and Tectorwise](https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf), [Operator Fusion](http://www.vldb.org/pvldb/vol11/p1-menon.pdf), and also the latest [DuckDB](https://github.com/duckdb/duckdb/blob/master/src/common/types/row_layout.cpp#L32).
 
   
   For sort, the problem is similar: re-ordering causes a random memory access pattern for each column, and with many payload columns this becomes slow.
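   A small Rust sketch of what that looks like (the `gather` helper is illustrative, not an existing API): after the sort produces a permutation of row indices, a columnar layout must gather every payload column through that permutation, which is one random-access pass per column.

   ```rust
   // Materialize one column of sorted output by gathering through a
   // permutation of row indices. Each call is a random-access pass
   // over `values`; with k payload columns we make k such passes,
   // whereas a row-wise buffer would be permuted once.
   fn gather<T: Copy>(values: &[T], perm: &[usize]) -> Vec<T> {
       perm.iter().map(|&i| values[i]).collect()
   }
   ```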
   
   The row <-> columnar conversion comes at a price: extra computation. But since we always need to buffer intermediate data in memory anyway, and can compose/decompose while the data is cache resident, the conversion could be quite efficient.
   
   I'm not proposing to change all these data structures to row-wise at once, but rather to point out the existing theory and practice of using a row-wise organization for these pipeline breakers' states. As we did in the past, we should always implement and benchmark before deciding on performance-critical code like this.
   
   For the row structure, I'd like a combination of all three systems:
   
   - a byte-aligned null bit set
   - all attributes stored sequentially
     - for data columns, no padding
     - for aggregate state columns, padding to keep them aligned, which keeps the CPU happy when we repeatedly check and update
     - for var-len attributes, we store (offset + length) in the attribute's slot and append the actual bytes at the end of each row
       - we could later add a short-string representation for space efficiency, at the cost of extra effort
   
   The rationale for the proposed structure: each attribute's offset can be determined from the schema alone, and we pay the bill of extra padding space for better computational performance.
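   A minimal sketch of that offset computation in Rust (the `FieldType` enum and `slot_offsets` helper are hypothetical names for illustration): one null bit per field rounded up to whole bytes, then fixed-width slots laid out sequentially, with each var-len attribute occupying a fixed-size (offset, length) slot so every offset is schema-derived.

   ```rust
   #[derive(Clone, Copy)]
   enum FieldType {
       Fixed(usize), // width in bytes, e.g. 8 for u64/i64/f64
       VarLen,       // slot holds (u32 offset, u32 length) = 8 bytes
   }

   /// Compute each field's slot offset within a row, plus the total
   /// fixed-width prefix length. Var-len payload bytes would be
   /// appended after this prefix; the prefix layout depends only on
   /// the schema, not on any row's data.
   fn slot_offsets(fields: &[FieldType]) -> (Vec<usize>, usize) {
       // one null bit per field, rounded up to whole bytes
       let null_bytes = (fields.len() + 7) / 8;
       let mut offsets = Vec::with_capacity(fields.len());
       let mut cursor = null_bytes;
       for f in fields {
           offsets.push(cursor);
           cursor += match f {
               FieldType::Fixed(w) => *w,
               FieldType::VarLen => 8,
           };
       }
       (offsets, cursor)
   }
   ```

   This sketch omits the alignment padding for aggregate state slots; adding it would only change the `cursor` bump to round up to the slot's alignment.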
   

