paul-rogers commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1019575466


   One more factor to consider: memory management. Row-based solutions are far, 
far easier to manage in memory than vectors. Row-based solutions reduce internal 
fragmentation from 1/4 of the batch size to 1/2 of the (much smaller) row size.
   
   We talked early on about the difficulty of sizing vectors and batches. If a 
record has 20 columns, then each is a vector. If the columns are fixed-width 
(`INT`, `BIGINT`, etc.) then we can predict the size of a batch with, say, 1K 
rows. What we can't predict is the size of variable-width columns (`VARCHAR`, 
say) so sometimes we have a vector of size 1K (because all the `VARCHAR` values 
are one byte) and sometimes we have 1GB (because all the values are 1MB.) Drill 
now has lots of ad-hoc code to try to guess the size, which is much better than 
the early days, but still not great.
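
   To make that unpredictability concrete, here is a back-of-the-envelope sketch 
(plain Java, not Drill code; the numbers are just the ones used above). The 
fixed-width vectors are knowable up front; the `VARCHAR` vector is anywhere from 
a kilobyte to a gigabyte.

```java
// Rough illustration of why fixed-width vector sizes are predictable
// and variable-width vector sizes are not, for a 1K-row batch.
public class BatchSizeEstimate {
  static final int ROWS = 1024;

  public static void main(String[] args) {
    // Fixed-width: size is simply (width in bytes) * (row count).
    long intVector = 4L * ROWS;      // INT:    4 KB, known before reading data
    long bigintVector = 8L * ROWS;   // BIGINT: 8 KB, known before reading data

    // Variable-width: size is (average value size) * (row count), and the
    // average is unknown until the data has actually been read.
    long varcharBest = 1L * ROWS;             // all values 1 byte -> ~1 KB
    long varcharWorst = 1024L * 1024 * ROWS;  // all values 1 MB   -> 1 GB

    System.out.printf("INT: %d B, BIGINT: %d B, VARCHAR: %d B .. %d B%n",
        intVector, bigintVector, varcharBest, varcharWorst);
  }
}
```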
   
   We often want to go the other way: we want to control memory use, so we 
allocate, say, 1MB per batch. When we do this, if we know the sort is given 1GB 
of buffer space, then we know we can store 1024 batches. (Picking simple 
numbers here just for discussion: actual numbers will vary.) So, what do we do 
to limit overall batch size to 1MB?
   
   Before EVF, we couldn't, and batch sizes could grow large and cause 
intermittent OOM errors. With EVF, there is complex and elaborate code to 
detect when the next doubling of any vector could push us past the target size. 
EVF reports "your batch is full", and the batch is sent downstream. Today, this 
works only in the scan; all other operators retain their ad-hoc "batch sizer" 
based solutions. (Full disclosure: I originally wrote the batch sizer as a 
quick & dirty solution for the sort, hence its silly name. Little did I realize 
it would spread to other operators and still be with us years later: I thought 
EVF would have won the battle by now. Sigh...)
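
   To sketch the idea (hypothetical names and a deliberately simplified shape, 
not the actual EVF or batch-sizer code): the writer tracks what each vector has 
allocated and declares the batch full as soon as the next doubling of any vector 
would overshoot the target.

```java
import java.util.List;

// Sketch only: the general shape of the "would the next doubling blow the
// budget?" check. The real EVF logic is far more involved.
class BatchLimitSketch {

  /** Hypothetical stand-in for a value vector's current allocation. */
  static class VectorState {
    final long allocatedBytes;
    VectorState(long allocatedBytes) { this.allocatedBytes = allocatedBytes; }
  }

  /**
   * Returns true if letting any one vector double would push the batch past
   * the target size, i.e. the batch should be declared "full" now.
   */
  static boolean batchIsFull(List<VectorState> vectors, long targetBatchBytes) {
    long current = vectors.stream().mapToLong(v -> v.allocatedBytes).sum();
    for (VectorState v : vectors) {
      // Doubling vector v adds another v.allocatedBytes to the batch total.
      if (current + v.allocatedBytes > targetBatchBytes) {
        return true;
      }
    }
    return false;
  }
}
```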
   
   When sizing vectors this way, we end up with a large amount of *internal 
fragmentation*: unused space within vectors. On average, the last half of a 
vector will be half used, meaning that the vector is, on average, 3/4 full. (The 
reason is subtle: we doubled the vector when it was full. After the doubling, 
the vector is half full. Variable-width vector sizes are uncorrelated, so, on 
average we'll have half-filled the second half when the batch as a whole 
fills.) So, with our 1MB batch, about 1/4 or 256K goes unused. That's pretty 
inefficient!
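
   A toy simulation of that argument (assuming, as above, that the fill point is 
uniformly distributed over a vector's last doubling) lands right on the 3/4 
figure:

```java
import java.util.Random;

// Toy simulation to illustrate the ~3/4 average fill claim above;
// not a model of real Drill data.
public class FragmentationSketch {
  public static void main(String[] args) {
    Random rand = new Random(42);
    int trials = 1_000_000;
    double totalFill = 0;
    for (int i = 0; i < trials; i++) {
      // A vector that just doubled is 50% full; by the time the batch as a
      // whole fills, this vector has used some uniform fraction of the new half.
      double fill = 0.5 + 0.5 * rand.nextDouble();
      totalFill += fill;
    }
    System.out.printf("Average vector fill: %.3f (expected ~0.75)%n",
        totalFill / trials);
  }
}
```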
   
   Now, consider a row-based solution. Again we allocate 1MB for our buffer. We 
fill it with rows, end-to-end. When the buffer fills, we will, on average, have 
written half a row. We shift that row to a new buffer and send the current one 
on its way. Now, our fragmentation is 1/2 the row width. If the row width is << 
the buffer size, then we're making much better use of our scarce memory 
resource (especially in buffering operators such as sort, join, and 
aggregation.)
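
   A minimal sketch of that row-buffer discipline (hypothetical API, not Drill 
code, and assuming a row always fits in an empty buffer): rows are packed 
end-to-end, and the row that doesn't fit simply starts the next buffer, so the 
wasted space per buffer is at most one row.

```java
import java.nio.ByteBuffer;

// Sketch of packing encoded rows end-to-end into fixed-size buffers.
// Assumes every row is smaller than one buffer.
public class RowBufferWriter {
  static final int BUFFER_SIZE = 1 << 20;   // 1 MB buffers

  private ByteBuffer current = ByteBuffer.allocate(BUFFER_SIZE);

  /** Appends one encoded row, sealing the current buffer if the row won't fit. */
  public void writeRow(byte[] row) {
    if (row.length > current.remaining()) {
      ByteBuffer full = current;
      current = ByteBuffer.allocate(BUFFER_SIZE);  // next buffer (from a pool)
      current.put(row);                            // the row moves forward
      sendDownstream(full);
      return;
    }
    current.put(row);
  }

  private void sendDownstream(ByteBuffer buffer) {
    buffer.flip();
    // Hand off to the downstream operator or spill file (omitted here).
  }
}
```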
   
   Lest you think this is somehow a Drill bug, recall how Parquet is built. The 
entire data set is buffered in memory before it is written to the file. Why? 
Parquet has no idea how large a column or row group will be until the columnar 
data blocks are actually created. To pack the data optimally in the file, it is, 
sub-optimally, all buffered in memory prior to writing. In a way, this Parquet 
buffering (and the resulting memory pressure) is very similar to the exchange 
buffering we discussed in prior notes.
   
   There is another huge win for row-based buffers: they are all of the same 
size (or, if we get very fancy, of a few sizes.) Rather than needing a 
malloc-style variable-block size allocator, we can use a much simpler DB-style 
buffer pool. New buffers come out of the pool, retired buffers go back in. We 
know exactly how many buffers we have, and can easily allocate quantities to 
queries and fragments. All the guesswork evaporates from sort, join and 
aggregate planning: they instead know exactly how many buffers they have to 
work with, and can plan spilling accordingly. None of this is new; you'll find 
it written up in any book on DB internals.
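
   The core of such a pool is only a few lines. A sketch of the DB-style version 
(hypothetical names, not Drill code):

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a classic fixed-size buffer pool: because every buffer is the same
// size, allocation is just "take one off the free list, put it back when done" --
// no malloc-style size classes or doubling.
public class BufferPool {
  private final Deque<ByteBuffer> free = new ArrayDeque<>();

  public BufferPool(int bufferCount, int bufferSize) {
    for (int i = 0; i < bufferCount; i++) {
      free.push(ByteBuffer.allocateDirect(bufferSize));
    }
  }

  /** Hands out a buffer, or null if the pool's budget is exhausted. */
  public synchronized ByteBuffer acquire() {
    return free.isEmpty() ? null : free.pop();
  }

  /** Returns a retired buffer to the pool for reuse. */
  public synchronized void release(ByteBuffer buffer) {
    buffer.clear();
    free.push(buffer);
  }

  /** Exact count available -- a sort or join can plan spilling from this. */
  public synchronized int available() {
    return free.size();
  }
}
```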
   
   Thus, in just about every dimension, columnar storage is suboptimal as an 
internal query engine data format. The Impala guys knew this and benefited from 
their row-based structure. Drill is still suffering from its early 
misunderstanding that the benefits of columnar storage somehow translate into 
benefits as an internal data format.

