Hi All,

Glad to see the Arrow discussion heating up and that it is causing us to ask deeper questions.
Here I want to get a bit techie on everyone and highlight two potential memory management problems with Arrow.

First: memory fragmentation. Recall that this is how we started on the EVF path. Arrow allocates large, variable-size blocks of memory. To quote a 35-year-old DB paper [1]: "[V]ariable-sized pages would cause heavy fragmentation problems."

Second: the idea of Arrow is that tool A creates a set of vectors that tool B will consume. This means that tools A and B have to agree on vector (buffer) size. Suppose tool A wants really big batches, but B can handle only small batches. In a columnar system, there is no good way to split a big batch into smaller ones. One can copy values, but this is exactly what Arrow is supposed to avoid.

Hence, when using Arrow, a data producer dictates to Drill a crucial factor in memory management: batch size. And Drill, in turn, dictates batch size to its clients. Reconciling these sizes will require complex negotiation logic, all to avoid a copy when the tools will communicate via RPC anyway. This is, in the larger picture, not a very good design at all. Needless to say, I am personally very skeptical of the benefits.

A possibly better alternative, one that we prototyped some time back, is to base Drill memory on fixed-size "blocks", say 1 MB in size. Any given vector can use part of a block, all of a block, or several blocks to store its data. The blocks are at least as large as the CPU cache lines, so we retain that benefit. Memory management is now far easier, and we can exploit 40 years of experience in effective buffer management. (Plus, the blocks are easy to spill to disk using classic RDBMS algorithms.)

Point is: let's not blindly accept the work that Arrow has done. Let's do our homework to figure out the best system for Drill, whether that be Arrow, fixed-size buffers, the current vectors, or something else entirely.

Thanks,
- Paul

[1] http://users.informatik.uni-halle.de/~hinnebur/Lehre/2008_db_iib_web/uebung3_p560-effelsberg.pdf
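
As a footnote to the fixed-size block idea, here is a minimal Java sketch of what "a vector stored in one or more 1 MB blocks" could look like. This is not the prototype mentioned above, and it is not Drill's or Arrow's API. The class names BlockPool and BlockIntVector are hypothetical, and heap-backed ByteBuffers stand in for whatever direct memory a real implementation would use.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pool that only ever hands out blocks of one fixed size. */
final class BlockPool {
    static final int BLOCK_SIZE = 1 << 20; // 1 MB, the example size from the text

    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
    private int allocated = 0;

    /** Reuse a free block if one exists; otherwise allocate a new one. */
    ByteBuffer acquire() {
        ByteBuffer block = free.poll();
        if (block == null) {
            // Assumption: heap memory for simplicity; a real pool would
            // likely use direct (off-heap) buffers.
            block = ByteBuffer.allocate(BLOCK_SIZE);
            allocated++;
        }
        return block;
    }

    /** Every block is the same size, so any freed block fits any request. */
    void release(ByteBuffer block) {
        free.push(block);
    }

    int allocatedBlocks() { return allocated; }
    int freeBlocks() { return free.size(); }
}

/** Hypothetical int vector that spans as many fixed-size blocks as it needs. */
final class BlockIntVector {
    private static final int INTS_PER_BLOCK = BlockPool.BLOCK_SIZE / Integer.BYTES;

    private final BlockPool pool;
    private final List<ByteBuffer> blocks = new ArrayList<>();
    private int count = 0;

    BlockIntVector(BlockPool pool) { this.pool = pool; }

    void append(int value) {
        int slot = count % INTS_PER_BLOCK;
        if (slot == 0) {
            // No block yet, or the last block is full: take another one.
            blocks.add(pool.acquire());
        }
        blocks.get(blocks.size() - 1).putInt(slot * Integer.BYTES, value);
        count++;
    }

    int get(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        ByteBuffer block = blocks.get(index / INTS_PER_BLOCK);
        return block.getInt((index % INTS_PER_BLOCK) * Integer.BYTES);
    }

    int size() { return count; }
    int blockCount() { return blocks.size(); }

    /** Return all blocks to the pool, e.g. once the batch has been consumed. */
    void release() {
        for (ByteBuffer block : blocks) {
            pool.release(block);
        }
        blocks.clear();
        count = 0;
    }
}
```

Because every block has the same size, a released block can satisfy any later request, which is the property that sidesteps the fragmentation concern. The cost in this sketch is one extra division per value access to locate the block.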
