Hi Paul,

I would like to add that, judging from your wiki, Drill's own vectors also
have the same fragmentation issues you describe as the first problem. So I
don't think that alone can be a reason to abandon Arrow completely now.

About the second problem, I agree that this might be a big issue. But it
seems that other open-source tools have not widely adopted Arrow, and the
high-performance native readers appear to be implemented by Arrow's backers
as proprietary software. So most probably we won't hit such problems,
because we will use our own readers for most storage plugins.

Although none of this will matter if, in the end, we choose our own path of
development.

Thanks,
Igor



On Fri, Jan 10, 2020 at 9:17 PM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi All,
>
> Glad to see the Arrow discussion heating up and that it is causing us to
> ask deeper questions.
>
> Here I want to get a bit techie on everyone and highlight two potential
> memory management problems with Arrow.
>
> First: memory fragmentation. Recall that this is how we started on the EVF
> path. Arrow allocates large, variable-size blocks of memory. To quote a
> 35-year-old DB paper [1]: "[V]ariable-sized pages would cause heavy
> fragmentation problems."
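>
> To make this concrete, here is a toy first-fit allocator (an illustration
> only, not actual Drill or Arrow code): after a few variable-size
> allocations and frees, half of a 16-unit arena is free, yet a request for
> that much contiguous space fails.
>
>   public class FragmentationDemo {
>     private static final int ARENA = 16;
>     private final boolean[] used = new boolean[ARENA];
>
>     // First-fit: find 'size' contiguous free units, or return -1.
>     int alloc(int size) {
>       for (int start = 0; start + size <= ARENA; start++) {
>         int i = 0;
>         while (i < size && !used[start + i]) i++;
>         if (i == size) {
>           for (int j = 0; j < size; j++) used[start + j] = true;
>           return start;
>         }
>       }
>       return -1; // enough total space may exist, just not contiguous
>     }
>
>     void free(int start, int size) {
>       for (int j = 0; j < size; j++) used[start + j] = false;
>     }
>
>     public static void main(String[] args) {
>       FragmentationDemo heap = new FragmentationDemo();
>       int a = heap.alloc(4);             // units 0..3
>       heap.alloc(4);                     // units 4..7
>       int c = heap.alloc(4);             // units 8..11
>       heap.alloc(4);                     // units 12..15
>       heap.free(a, 4);
>       heap.free(c, 4);
>       // 8 units are free in total, but never 8 in a row:
>       System.out.println(heap.alloc(8)); // prints -1
>     }
>   }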
>
> Second: the idea of Arrow is that tool A creates a set of vectors that
> tool B will consume. This means that tools A and B have to agree on vector
> (buffer) size. Suppose tool A wants really big batches, but B can handle
> only small batches. In a columnar system, there is no good way to split a
> big batch into smaller ones. One can copy values, but this is exactly what
> Arrow is supposed to avoid.
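>
> To see why, picture a variable-width column in the usual Arrow-style
> layout: a data buffer plus an offsets buffer. Carving rows [start, end)
> out into a self-contained smaller batch forces a copy of both buffers,
> plus a rebase of the offsets. (A rough sketch under those assumptions,
> not Arrow's actual classes.)
>
>   import java.util.Arrays;
>
>   class VarCharColumn {
>     final byte[] data;   // concatenated UTF-8 values
>     final int[] offsets; // offsets[i]..offsets[i+1] bounds value i
>
>     VarCharColumn(byte[] data, int[] offsets) {
>       this.data = data;
>       this.offsets = offsets;
>     }
>
>     // Extract rows [start, end) as a standalone column. Note the two
>     // copies -- exactly what zero-copy sharing is meant to avoid.
>     VarCharColumn split(int start, int end) {
>       byte[] newData =
>           Arrays.copyOfRange(data, offsets[start], offsets[end]);
>       int[] newOffsets = new int[end - start + 1];
>       for (int i = 0; i <= end - start; i++) {
>         newOffsets[i] = offsets[start + i] - offsets[start];
>       }
>       return new VarCharColumn(newData, newOffsets);
>     }
>   }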
>
> Hence, when using Arrow, a data producer dictates to Drill a crucial
> factor in memory management: batch size. And Drill, in turn, dictates
> batch size to its clients. Working around that will require complex
> negotiation logic. All to avoid a copy when the tools will communicate
> via RPC anyway. This is, in the larger picture, not a very good design at
> all. Needless to say, I am personally very skeptical of the benefits.
>
> A possible better alternative, one that we prototyped some time back, is
> to base Drill memory on fixed-size "blocks", say 1 MB in size. Any given
> vector can use part of a block, a whole block, or multiple blocks to
> store its data. The blocks are at least as large as CPU cache lines, so
> we retain that benefit. Memory management is now far easier, and we can
> exploit 40 years of experience in effective buffer management. (Plus, the
> blocks are easy to spill to disk using classic RDBMS algorithms.)
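>
> The shape of such a vector might look like the following sketch (the
> names are hypothetical; this is the prototype idea, not committed code).
> Because the allocator hands out only one block size, every freed block
> can satisfy any later request, so fragmentation disappears.
>
>   import java.util.ArrayList;
>   import java.util.List;
>
>   class BlockBackedIntVector {
>     static final int BLOCK_BYTES = 1 << 20;            // 1 MB blocks
>     static final int INTS_PER_BLOCK = BLOCK_BYTES / 4; // 4-byte ints
>
>     private final List<int[]> blocks = new ArrayList<>();
>     private int valueCount;
>
>     void append(int value) {
>       int block = valueCount / INTS_PER_BLOCK;
>       if (block == blocks.size()) {
>         blocks.add(new int[INTS_PER_BLOCK]); // grab one more block
>       }
>       blocks.get(block)[valueCount % INTS_PER_BLOCK] = value;
>       valueCount++;
>     }
>
>     int get(int index) {
>       return blocks.get(index / INTS_PER_BLOCK)[index % INTS_PER_BLOCK];
>     }
>   }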
>
> Point is: let's not blindly accept the work that Arrow has done. Let's do
> our homework to figure out the best system for Drill: whether that be
> Arrow, fixed-size buffers, the current vectors, or something else entirely.
>
> Thanks,
> - Paul
>
>
>
> [1]
> http://users.informatik.uni-halle.de/~hinnebur/Lehre/2008_db_iib_web/uebung3_p560-effelsberg.pdf
