Hi All,

Glad to see the Arrow discussion heating up and that it is causing us to ask 
deeper questions.

Here I want to get a bit techie on everyone and highlight two potential memory 
management problems with Arrow.

First: memory fragmentation. Recall that this is how we started on the EVF 
path. Arrow allocates large, variable-size blocks of memory. To quote a 
35-year-old DB paper [1]: "[V]ariable-sized pages would cause heavy 
fragmentation problems."

Second: the idea of Arrow is that tool A creates a set of vectors that tool B 
will consume. This means that tool A and B have to agree on vector (buffer) 
size. Suppose tool A wants really big batches, but B can handle only small 
batches. In a columnar system, there is no good way to split a big batch into 
smaller ones. One can copy values, but this is exactly what Arrow is supposed 
to avoid.
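
To make the copy concrete, here is a minimal Java sketch. BatchSplit and 
split are made-up names, and a plain int[] stands in for one column's value 
buffer; this is not Arrow or Drill code, just an illustration of the cost.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: a plain int[] stands in for one column's buffer.
final class BatchSplit {

    // Splits one large column into pieces of at most maxRows values each.
    // Every piece is a fresh array, so every value is copied exactly once.
    static List<int[]> split(int[] column, int maxRows) {
        if (maxRows <= 0) {
            throw new IllegalArgumentException("maxRows must be positive");
        }
        List<int[]> parts = new ArrayList<>();
        for (int start = 0; start < column.length; start += maxRows) {
            int end = Math.min(start + maxRows, column.length);
            parts.add(Arrays.copyOfRange(column, start, end)); // the copy
        }
        return parts;
    }
}
```

A real batch would repeat this for every column, and for variable-width 
columns it would also have to rebuild the offsets.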

Hence, when using Arrow, a data producer dictates to Drill a crucial factor in 
memory management: batch size. Drill, in turn, dictates batch size to its 
clients. Reconciling these sizes will require complex negotiation logic, all 
to avoid a copy when the tools will communicate via RPC anyway. This is, in 
the larger picture, not a very 
good design at all. Needless to say, I am personally very skeptical of the 
benefits.

A possibly better alternative, one that we prototyped some time back, is to 
base Drill memory on fixed-size "blocks", say 1 MB in size. Any given vector 
can use part of a block, a whole block, or several blocks to store its data. 
The blocks are at least as large as the CPU cache lines, so we retain that 
benefit. Memory 
management is now far easier, and we can exploit 40 years of experience in 
effective buffer management. (Plus, the blocks are easy to spill to disk using 
classic RDBMS algorithms.)
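
As a rough sketch of the fixed-size block idea (not the prototype itself): 
the class below stores an int column across 1 MB blocks. BlockIntVector and 
its methods are hypothetical names, and it allocates heap ByteBuffers on 
demand where a real system would presumably draw blocks from a shared pool.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: an int vector stored in fixed-size blocks.
final class BlockIntVector {
    static final int BLOCK_SIZE = 1 << 20;                       // 1 MB per block
    static final int INTS_PER_BLOCK = BLOCK_SIZE / Integer.BYTES;

    private final List<ByteBuffer> blocks = new ArrayList<>();
    private int count;

    // Appends a value, adding a new block only when the last one is full.
    void add(int value) {
        int blockIndex = count / INTS_PER_BLOCK;
        if (blockIndex == blocks.size()) {
            // Assumption: a real system would take this from a block pool.
            blocks.add(ByteBuffer.allocate(BLOCK_SIZE));
        }
        int offset = (count % INTS_PER_BLOCK) * Integer.BYTES;
        blocks.get(blockIndex).putInt(offset, value);
        count++;
    }

    // Reads a value by locating its block and its byte offset in that block.
    int get(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int offset = (index % INTS_PER_BLOCK) * Integer.BYTES;
        return blocks.get(index / INTS_PER_BLOCK).getInt(offset);
    }

    int size() {
        return count;
    }

    int blockCount() {
        return blocks.size();
    }
}
```

Because every block is the same size, a freed block can be reused by any 
other vector, and a big vector can be handed downstream, or spilled, one 
block at a time.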

Point is: let's not blindly accept the work that Arrow has done. Let's do our 
homework to figure out the best system for Drill: whether that be Arrow, 
fixed-size buffers, the current vectors, or something else entirely.

Thanks,
- Paul

 

[1] http://users.informatik.uni-halle.de/~hinnebur/Lehre/2008_db_iib_web/uebung3_p560-effelsberg.pdf
