Re: "Crude-but-effective" Arrow integration

Paul Rogers Mon, 20 Aug 2018 09:20:39 -0700

Hi Ted,

The "crude but effective" integration suggestion allows Drill to participate in 
an Arrow pipeline with minimal work.

By contrast, migrating Drill's internals to Arrow has always been seen as the 
bulk of the cost, a cost that the "crude-but-effective" suggestion seeks to 
avoid. Some of the full-integration costs include:

* Reworking Drill's direct memory model to work with Arrow's.
* Changing all low-level runtime code that works with vectors to instead work 
with Arrow vectors.
* Changing all of Drill's vector metadata, and the code that uses that 
metadata, to use Arrow's metadata instead.
* Since generated code works directly with vectors, changing all the code 
generation.
* Since Drill vectors and metadata are exposed via the Drill client to JDBC and 
ODBC, those must be revised as well.
* Since the wire format will change, clients of Drill must upgrade their 
JDBC/ODBC drivers when migrating to an Arrow-based Drill.

At one time I had some hope that the "result set loader" work could ease the 
pain. The result set loader is a uniform way to read and write vectors that 
encapsulates much of the low-level work (while also providing tight controls on 
memory usage). Once all Drill operators used the result set loader, moving to 
Arrow would mean changing just one place: the result set loader implementation.
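
To illustrate the "change just one place" idea, here is a minimal Java sketch. 
It is not Drill's actual result set loader API: the names IntColumnWriter, 
DrillStyleWriter, ArrowStyleWriter and ScanOperatorSketch are hypothetical, and 
plain Java collections and arrays stand in for real Drill or Arrow vectors. The 
only point is that operator code written against a writer interface stays the 
same when the backing vector implementation is swapped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical facade: operators read and write a column only through this.
interface IntColumnWriter {
    void write(int value);
    int get(int index);
    int rowCount();
}

// Stand-in for a writer backed by Drill-style vectors (illustrative only).
final class DrillStyleWriter implements IntColumnWriter {
    private final List<Integer> values = new ArrayList<>();

    public void write(int value) { values.add(value); }
    public int get(int index) { return values.get(index); }
    public int rowCount() { return values.size(); }
}

// Stand-in for a writer backed by Arrow-style vectors (illustrative only).
final class ArrowStyleWriter implements IntColumnWriter {
    private int[] buffer = new int[4];
    private int count;

    public void write(int value) {
        // Grow the backing buffer when it is full.
        if (count == buffer.length) {
            buffer = Arrays.copyOf(buffer, count * 2);
        }
        buffer[count++] = value;
    }
    public int get(int index) { return buffer[index]; }
    public int rowCount() { return count; }
}

// "Operator" code depends only on the facade, never on a vector class.
final class ScanOperatorSketch {
    static int fill(IntColumnWriter writer, int rows) {
        for (int i = 0; i < rows; i++) {
            writer.write(i * 10);
        }
        return writer.rowCount();
    }
}
```

In this sketch, ScanOperatorSketch.fill() works unchanged with either writer, 
which is the property that would have confined an Arrow migration to the loader 
implementation.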

As it has turned out, however, the "sizer"-based solution gave 80% of the 
memory management benefit for 20% of the cost. So, to move to Arrow, we either 
need to take a new look at using the result set loader, or make the major 
changes noted above.

That said, there was a PR that contained all those changes, though it is quite 
old now. The Dremio fork of Drill also contains those changes, since Dremio is 
based on Arrow. Perhaps someone could work through, and update, those many 
changes for the current state of Drill.

Thanks,
- Paul

    On Monday, August 20, 2018, 8:17:32 AM PDT, Ted Dunning 
<[email protected]> wrote:  

 This makes it sound like allocation is the important difference. If so,
that might mean that converting Drill would be easier than was thought.

On Sat, Aug 18, 2018, 16:44 Paul Rogers <[email protected]> wrote:

> Hi All,
>
> Charles recently suggested why Arrow integration could be helpful. (See
> quote below.)  When we've looked at reworking Drill's internals to use
> Arrow, we found the project to be costly with little direct benefit in
> terms of performance or stability. But, Charles points out that the real
> value is in data exchange, not in changing Drill's internals.
>
>  It might be fairly simple to integrate with Arrow for input or output.
> Why? As it turns out (last time I checked) the memory layout of Arrow
> vectors is identical to Drill's, so it is simply a matter of reinterpreting
> Drill's vectors as Arrow vectors (or vice versa); possibly passing memory
> ownership somehow. (I suspect the memory ownership issue will be the
> fussiest part of the whole exercise.)
>
>
> Drill and Arrow use different metadata formats. But, since they both
> describe the same in-memory layout, we can probably translate from one to
> the other with some straightforward code. Since metadata is a small part of
> a typical result set, the overhead of the metadata translation is likely
> negligible.
>
>
> If an Arrow client wants to consume Drill output, someone could wrap the
> native Drill Client API that speaks Drill value vectors. The wrapper
> could reinterpret Drill vectors as Arrow vectors, and convert metadata.
>
>
> If we want Drill to consume Arrow data, then we'd have to play the same
> trick in reverse: reinterpret Arrow vectors as Drill vectors, then convert
> Arrow metadata to Drill format.
>
> The community could build such an integration without changes to Drill's
> internals. Granted, this approach is a bit on the "crude-but-effective"
> side. But, if the integration proves valuable, then there is justification
> for a next round of deeper integration.
>
>
>  Charles' original comment from the discussion about project state:
>
> (quote)
> The first [suggested improvement] is the Arrow integration.  I’m not
> enough of a software engineer to understand all the internal details
> here, but as I understand it, the promise of Arrow is that many tools
> will share a common memory model and that it will be possible to transfer
> data from one tool to the other without having to serialize/deserialize
> the data.  In the data science community many of the major platforms,
> Python-pandas, R, and Spark, are moving to or have adopted Arrow.
>
> Drill’s strength is the ease with which it can query many different data
> sources, and if Drill were to adopt Arrow, I suspect that many people
> would adopt it as a part of a machine learning pipeline.  Just recently,
> I attempted to do some data manipulation using Spark, and couldn’t help
> but notice how difficult it was in contrast with Drill. I’m sure this is
> a very complex task, but I do think that it could be worth it in the end.
>
> (unquote)
>
> Thanks,
> - Paul
>
>
