Re: "Crude-but-effective" Arrow integration

Paul Rogers Mon, 20 Aug 2018 10:51:41 -0700

Hi Ted,

We may be confusing two very different ideas. The first is a Drill-to-Arrow 
adapter on Drill's periphery; this is the "crude-but-effective" integration 
suggestion. On the periphery we do not change existing code: we just build an 
adapter to read Arrow data into Drill, or to convert Drill output to Arrow.

The other idea, being discussed in a parallel thread, is to convert Drill's 
runtime engine to use Arrow. That is a whole other beast.

When changing Drill internals, code must change. There is a cost associated 
with that. Whether the Arrow code is better or not is not the key question. 
Rather, the key question is simply the volume of changes.

Drill divides into roughly two main layers: plan-time and run-time. Plan-time 
is not much affected by Arrow. But, run-time code is all about manipulating 
vectors and their metadata, often in quite detailed ways with APIs unique to 
Drill. While swapping Arrow vectors for Drill vectors is conceptually simple, 
those of us who've looked at the details have noted that the sheer volume of 
the lines of code that must change is daunting.

It would be good to get second opinions. That PR I mentioned will show the 
volume of code that changed at that time (but Drill has grown since then). 
Parth is another good resource, as he reviewed the original PR and has kept a 
close eye on Arrow.

When considering Arrow in the Drill execution engine, we must realistically 
understand the cost and then ask: do the benefits we gain justify those costs? 
Would Arrow be the highest-priority investment? Frankly, would Arrow 
integration increase Drill adoption more than the many other topics discussed 
recently on these mailing lists?

Charles and others make a strong case for Arrow for integration. What is the 
strong case for Drill's internals? That's really the question the group will 
want to answer.

More details below.

Thanks,
- Paul

    On Monday, August 20, 2018, 9:41:49 AM PDT, Ted Dunning 
<[email protected]> wrote:  

 Inline.

On Mon, Aug 20, 2018 at 9:20 AM Paul Rogers <[email protected]>
wrote:

> ...
> By contrast, migrating Drill internals to Arrow has always been seen as
> the bulk of the cost; costs which the "crude-but-effective" suggestion
> seeks to avoid. Some of the full-integration costs include:
>
> * Reworking Drill's direct memory model to work with Arrow's.
>

Ted: This should be relatively isolated to the allocation/deallocation code. The
deallocation should become a no-op. The allocation becomes simpler and
safer.

Paul: If only that were true. Drill has an ingenious integration of vector 
allocation and Netty. Arrow may have done the same. (It probably did, since 
such integration is key to avoiding copies on send/receive.) That code is 
highly complex. Clearly, the swap can be done; it will simply take some work to 
get right.

> * Changing all low-level runtime code that works with vectors to instead
> work with Arrow vectors.
>

Ted: Why? You already said that most code doesn't have to change since the
format is the same.

Paul: My comment about the format being the same was that the direct memory 
layout is the same, allowing conversion of a Drill vector to an Arrow vector by 
relabeling the direct memory that holds the data.

Paul: But, in the Drill runtime engine, we don't work with the memory directly, 
we use the vector APIs, mutator APIs and so on. These all changed in Arrow. 
Granted, the Arrow versions are cleaner. But, that does mean that every vector 
reference (of which there are thousands) must be revised to use the Arrow APIs. 
That is the cost that has put us off a bit.
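
To make the distinction concrete, here is a minimal Python sketch. The two 
classes are hypothetical stand-ins, not the real Drill or Arrow classes; they 
only illustrate how one block of memory can be relabeled without a copy while 
every caller still has to change because the access API differs.

```python
import struct


class _Accessor:
    """Hypothetical accessor object, in the style of an accessor-based vector."""

    def __init__(self, view):
        self._view = view

    def get(self, index):
        return self._view[index]


class DrillStyleIntVector:
    """Stand-in for a vector whose values are read via a separate accessor."""

    def __init__(self, buf):
        self.buf = buf                          # shared backing memory
        self._view = memoryview(buf).cast("i")  # interpret bytes as native ints

    def get_accessor(self):
        return _Accessor(self._view)


class ArrowStyleIntVector:
    """Stand-in for a vector whose values are read directly from the vector."""

    def __init__(self, buf):
        self.buf = buf
        self._view = memoryview(buf).cast("i")

    def get(self, index):
        return self._view[index]


def relabel(drill_vector):
    # No copy: the new wrapper points at the same underlying bytes.
    return ArrowStyleIntVector(drill_vector.buf)


shared = bytearray(struct.pack("4i", 10, 20, 30, 40))
drill_vec = DrillStyleIntVector(shared)
arrow_vec = relabel(drill_vec)

old_style = drill_vec.get_accessor().get(2)  # call shape used by existing code
new_style = arrow_vec.get(2)                 # call shape the code must move to
```

Both calls read the same value from the same bytes, yet every call site written 
in the first style would need to be rewritten in the second.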

> * Change all Drill's vector metadata, and code that uses that metadata, to
> use Arrow's metadata instead.
>

Ted: Why? You said that converting Arrow metadata to Drill's metadata would be
simple. Why not just continue with that?

Paul: In an API, we can convert one data structure to the other by writing code 
to copy data. But, if we change Drill's internals, we must rewrite code in 
every operator that uses Drill's metadata to instead use Arrow's. That is a 
much more extensive undertaking than simply converting metadata on input or 
output.
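
As a rough illustration of the boundary-only approach, the Python sketch below 
converts metadata in one small function. The field classes and the type map are 
simplified, hypothetical stand-ins rather than the actual Drill or Arrow 
metadata classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillFieldMeta:
    """Hypothetical, simplified Drill-side column description."""
    name: str
    minor_type: str  # e.g. "INT", "VARCHAR"
    data_mode: str   # "REQUIRED" or "OPTIONAL"


@dataclass(frozen=True)
class ArrowFieldMeta:
    """Hypothetical, simplified Arrow-side column description."""
    name: str
    type_name: str
    nullable: bool


# Assumed mapping for illustration only; a real adapter would cover all types.
_TYPE_MAP = {"INT": "int32", "BIGINT": "int64", "VARCHAR": "utf8"}


def to_arrow(field):
    """Copy one field description across the boundary."""
    return ArrowFieldMeta(
        name=field.name,
        type_name=_TYPE_MAP[field.minor_type],
        nullable=(field.data_mode == "OPTIONAL"),
    )


drill_schema = [
    DrillFieldMeta("id", "INT", "REQUIRED"),
    DrillFieldMeta("name", "VARCHAR", "OPTIONAL"),
]
arrow_schema = [to_arrow(f) for f in drill_schema]
```

The conversion lives in one place at the edge. Changing the internals would 
instead mean touching every operator that reads the Drill-side structure.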

> * Since generated code works directly with vectors, change all the code
> generation.
>

Ted: Why? You said the UDFs would just work.

Paul: Again, I fear we are confusing two issues. If we don't change Drill's 
internals, then UDFs will work as today. If we do change Drill to Arrow, then, 
since UDFs are part of the code gen system, they must change to adapt to the 
Arrow APIs. Specifically, Drill "holders" must be converted to Arrow holders. 
Drill complex writers must convert to Arrow complex writers.

Paul: Here I'll point out that the Arrow vector code and writers have the same 
uncontrolled memory flaw that they inherited from Drill. So, if we replace the 
mutators and writers, we might as well use the "result set loader" model which 
a) hides the details, and b) manages memory to a given budget.  Either way, 
UDFs must change if we move to Arrow for Drill internals.
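
The idea of writing to a memory budget can be sketched in a few lines of 
Python. This is not the actual result set loader; the class, its methods, and 
the byte-counting rule are assumptions made purely for illustration.

```python
class BudgetedBatchWriter:
    """Hypothetical writer that refuses rows once a byte budget is reached."""

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self.rows = []

    def write_row(self, value):
        """Return True if the row fit in the current batch, False otherwise."""
        if self.used_bytes + len(value) > self.budget_bytes:
            return False
        self.rows.append(value)
        self.used_bytes += len(value)
        return True

    def harvest(self):
        """Hand back the finished batch and start an empty one."""
        batch, self.rows, self.used_bytes = self.rows, [], 0
        return batch


def load(values, budget_bytes):
    """Write values into batches, none of which exceeds the budget."""
    writer = BudgetedBatchWriter(budget_bytes)
    batches = []
    for value in values:
        if writer.write_row(value):
            continue
        # The current batch is full: emit it, then retry in a fresh batch.
        if writer.rows:
            batches.append(writer.harvest())
        if not writer.write_row(value):
            raise ValueError("single value exceeds the batch budget")
    if writer.rows:
        batches.append(writer.harvest())
    return batches


batches = load([b"aaaa", b"bbbb", b"cccc", b"dd"], budget_bytes=8)
```

The caller never sees how the bytes are stored; it only learns that a batch is 
full, which is what keeps memory use within the stated limit.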

> * Since Drill vectors and metadata are exposed via the Drill client to
> JDBC and ODBC, those must be revised as well.
>

Ted: How much given the high level of compatibility?

Paul: As with Drill internals, all JDBC/ODBC code that uses Drill vector and 
metadata classes must be revised to use Arrow vectors and metadata, adapting 
the code to the changed APIs. This is not a huge technical challenge; it is 
just a pile of work. Perhaps this was done in that Arrow conversion PR.

> * Since the wire format will change, clients of Drill must upgrade their
> JDBC/ODBC drivers when migrating to an Arrow-based Drill.
>

Ted: Doesn't this have to happen fairly often anyway?

Ted: Perhaps this would be a good excuse for a 2.0 step.

Paul: As Drill matures, users would appreciate the ability to use JDBC and ODBC 
drivers with multiple Drill versions. If a shop has 1000 desktops using the 
drivers against five Drill clusters, it is impractical to upgrade everything in 
one go.

Paul: You hit the nail on the head: conversion to Arrow would justify a jump to 
"Drill 2.0" to explain the required big-bang upgrade (and to highlight the 
cool new capabilities that come with Arrow).
