Hi Charles,
I didn't see anything on this on the public mailing list. Haven't seen any
commits related to it either. My guess is that this kind of interface is not
important for the kind of data warehouse use cases that MapR is probably still
trying to capture.
I followed the Arrow mailing lists for much of last year. Not much activity in
the Java arena. (I think most of that might be done by Dremio.) Most activity
in other languages. The code itself has drifted far away from the original
Drill structure. I found that even the metadata had changed drastically; it turned
out to be far too much work to port the "Row Set" stuff I did for Drill.
This does mean, BTW, that the Drill folks did the right thing by not following
Arrow. They'd have spent a huge amount of time tracking the massive changes.
Still, converting Arrow vectors to Drill vectors might be an exercise in bit
twiddling and memory ownership. It's harder now than it once was, since I think Arrow
defines all vectors to be nullable, and uses a different scheme than Drill for
representing nulls.
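To make that concrete, here is a minimal sketch of the repacking an adapter would
have to do, in plain Java. As I recall, Drill's nullable vectors carry a separate
"bits" buffer with one byte per value (1 = not null), while Arrow packs validity
into a bitmap with one bit per value (1 = valid); the arrays below stand in for
the direct-memory buffers both engines actually use.

    // Sketch only: repack a Drill-style "bits" buffer (one byte per value)
    // into an Arrow-style validity bitmap (one bit per value).
    static byte[] toValidityBitmap(byte[] drillBits, int valueCount) {
      byte[] bitmap = new byte[(valueCount + 7) / 8];
      for (int i = 0; i < valueCount; i++) {
        if (drillBits[i] != 0) {                     // non-zero byte means "value is set"
          bitmap[i >> 3] |= (byte) (1 << (i & 7));   // set the corresponding validity bit
        }
      }
      return bitmap;
    }

Going the other direction is the same loop in reverse, but either way it is a copy
per column per batch, plus the question of who owns the resulting buffer.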
Thanks,
- Paul
On Monday, January 28, 2019, 5:54:12 PM PST, Charles Givre
<[email protected]> wrote:
Hey Paul,
I’m curious as to what, if anything, ever came of this thread? IMHO, you’re on
to something here. We could get the benefit of Arrow—specifically the
interoperability with other big data tools—without the pain of having to
completely re-work Drill. This seems like a real win-win to me.
— C
> On Aug 20, 2018, at 13:51, Paul Rogers <[email protected]> wrote:
>
> Hi Ted,
>
> We may be confusing two very different ideas. One is a Drill-to-Arrow adapter
> on Drill's periphery; this is the "crude-but-effective" integration
> suggestion. On the periphery we are not changing existing code, we're just
> building an adapter to read Arrow data into Drill, or convert Drill output to
> Arrow.
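>
> To make the "periphery" idea concrete, here is a rough sketch of the
> Arrow-reading half of such an adapter. The Arrow calls are the real Arrow Java
> IPC API (as best I recall it); the Drill-side write is only a placeholder
> comment, since the record-reader hookup depends on which plugin framework the
> adapter plugs into, and the single INT column is just for illustration.
>
>     import java.io.InputStream;
>     import org.apache.arrow.memory.BufferAllocator;
>     import org.apache.arrow.memory.RootAllocator;
>     import org.apache.arrow.vector.IntVector;
>     import org.apache.arrow.vector.VectorSchemaRoot;
>     import org.apache.arrow.vector.ipc.ArrowStreamReader;
>
>     // Read an Arrow IPC stream batch by batch and copy values toward Drill.
>     static void copyArrowStream(InputStream in) throws Exception {
>       try (BufferAllocator allocator = new RootAllocator();
>            ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
>         VectorSchemaRoot root = reader.getVectorSchemaRoot();
>         while (reader.loadNextBatch()) {
>           IntVector ids = (IntVector) root.getVector("id");  // assumed column
>           for (int row = 0; row < root.getRowCount(); row++) {
>             if (!ids.isNull(row)) {
>               int value = ids.get(row);
>               // ... hand 'value' to the Drill batch writer here (plugin API omitted)
>             }
>           }
>         }
>       }
>     }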
>
> The other idea, being discussed in a parallel thread, is to convert Drill's
> runtime engine to use Arrow. That is a whole other beast.
>
> When changing Drill internals, code must change. There is a cost associated
> with that. Whether the Arrow code is better or not is not the key question.
> Rather, the key question is simply the volume of changes.
>
> Drill divides into roughly two main layers: plan-time and run-time. Plan-time
> is not much affected by Arrow. But, run-time code is all about manipulating
> vectors and their metadata, often in quite detailed ways with APIs unique to
> Drill. While swapping Arrow vectors for Drill vectors is conceptually simple,
> those of us who've looked at the details have noted that the sheer volume of
> code that must change is daunting.
>
> It would be good to get second opinions. That PR I mentioned will show the volume
> of code that changed at that time (though Drill has grown since then). Parth is
> another good resource as he reviewed the original PR and has kept a close eye
> on Arrow.
>
> When considering Arrow in the Drill execution engine, we must realistically
> understand the cost, then ask: do the benefits we gain justify those costs?
> Would Arrow be the highest-priority investment? Frankly, would Arrow
> integration increase Drill adoption more than the many other topics discussed
> recently on these mail lists?
>
> Charles and others make a strong case for Arrow for integration. What is the
> strong case for Drill's internals? That's really the question the group will
> want to answer.
>
> More details below.
>
> Thanks,
> - Paul
>
>
>
> On Monday, August 20, 2018, 9:41:49 AM PDT, Ted Dunning
><[email protected]> wrote:
>
> Inline.
>
>
> On Mon, Aug 20, 2018 at 9:20 AM Paul Rogers <[email protected]>
> wrote:
>
>> ...
>> By contrast, migrating Drill internals to Arrow has always been seen as
>> the bulk of the cost; costs which the "crude-but-effective" suggestion
>> seeks to avoid. Some of the full-integration costs include:
>>
>> * Reworking Drill's direct memory model to work with Arrow's.
>>
>
>
> Ted: This should be relatively isolated to the allocation/deallocation code.
> The
> deallocation should become a no-op. The allocation becomes simpler and
> safer.
>
> Paul: If only that were true. Drill has an ingenious integration of vector
> allocation and Netty. Arrow may have done the same. (Probably did, since such
> integration is key to avoiding copies on send/receive.) That code is highly
> complex. Clearly, the swap can be done; it will simply take some work to get
> right.
>
>
>> * Changing all low-level runtime code that works with vectors to instead
>> work with Arrow vectors.
>>
>
>
> Ted: Why? You already said that most code doesn't have to change since the
> format is the same.
>
> Paul: My comment about the format being the same was that the direct memory
> layout is the same, allowing conversion of a Drill vector to an Arrow vector
> by relabeling the direct memory that holds the data.
>
> Paul: But, in the Drill runtime engine, we don't work with the memory
> directly; we use the vector APIs, mutator APIs and so on. These all changed
> in Arrow. Granted, the Arrow versions are cleaner. But, that does mean that
> every vector reference (of which there are thousands) must be revised to use
> the Arrow APIs. That is the cost that has put us off a bit.
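>
> Paul: To give a flavor of the mechanical change involved, here is the same
> logical write done both ways. The Drill signatures are from memory of its
> generated vector classes, so treat them as approximate; the Arrow half uses
> Arrow's current Java IntVector API.
>
>     // Drill style: mutation goes through the vector's Mutator.
>     static void writeDrill(org.apache.drill.exec.vector.NullableIntVector v) {
>       v.allocateNew(3);
>       v.getMutator().setSafe(0, 10);
>       // index 1 is simply skipped; Drill treats unset positions as null
>       v.getMutator().setSafe(2, 30);
>       v.getMutator().setValueCount(3);
>     }
>
>     // Arrow style: the same write, but with different classes, methods, and
>     // null handling, which is why every one of those thousands of call sites
>     // has to be touched.
>     static void writeArrow(org.apache.arrow.vector.IntVector v) {
>       v.allocateNew(3);
>       v.setSafe(0, 10);
>       v.setNull(1);
>       v.setSafe(2, 30);
>       v.setValueCount(3);
>     }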
>
>
>> * Change all Drill's vector metadata, and code that uses that metadata, to
>> use Arrow's metadata instead.
>>
>
>
> Ted: Why? You said that converting Arrow metadata to Drill's metadata would be
> simple. Why not just continue with that?
>
> Paul: In an API, we can convert one data structure to the other by writing
> code to copy data. But, if we change Drill's internals, we must rewrite code
> in every operator that uses Drill's metadata to instead use Arrow's. That is a
> much more extensive undertaking than simply converting metadata on input or
> output.
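>
> Paul: The boundary conversion is the easy kind of code. Something like the
> sketch below, repeated per type, is all an adapter needs (the Drill factory
> calls are from memory, so treat the exact names as approximate, and only the
> 32-bit INT case is shown). Rewriting every operator that consumes that
> metadata is the expensive kind.
>
>     import org.apache.arrow.vector.types.pojo.ArrowType;
>     import org.apache.arrow.vector.types.pojo.Field;
>     import org.apache.drill.common.types.TypeProtos.MinorType;
>     import org.apache.drill.common.types.Types;
>     import org.apache.drill.exec.record.MaterializedField;
>
>     // Map one Arrow field description to one Drill field description.
>     static MaterializedField toDrillField(Field arrowField) {
>       ArrowType type = arrowField.getType();
>       if (type instanceof ArrowType.Int && ((ArrowType.Int) type).getBitWidth() == 32) {
>         // Arrow fields are nullable by default, so use Drill's OPTIONAL mode.
>         return MaterializedField.create(arrowField.getName(),
>             Types.optional(MinorType.INT));
>       }
>       throw new UnsupportedOperationException("type not mapped in this sketch: " + type);
>     }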
>
>
>> * Since generated code works directly with vectors, change all the code
>> generation.
>>
>
> Ted: Why? You said the UDFs would just work.
>
> Paul: Again, I fear we are confusing two issues. If we don't change Drill's
> internals, then UDFs will work as today. If we do change Drill to Arrow,
> then, since UDFs are part of the code gen system, they must change to adapt
> to the Arrow APIs. Specifically, Drill "holders" must be converted to Arrow
> holders. Drill complex writers must convert to Arrow complex writers.
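>
> Paul: For anyone who hasn't written one, a Drill UDF touches the holder fields
> directly, roughly as in the stripped-down sketch below (the annotations, the
> DrillSimpleFunc scaffolding and setup() are omitted, and the field names are
> from memory):
>
>     // What a typical UDF eval() actually does with its holders.
>     class AddOneSketch {
>       org.apache.drill.exec.expr.holders.NullableIntHolder in =
>           new org.apache.drill.exec.expr.holders.NullableIntHolder();
>       org.apache.drill.exec.expr.holders.NullableIntHolder out =
>           new org.apache.drill.exec.expr.holders.NullableIntHolder();
>
>       public void eval() {
>         if (in.isSet == 0) {
>           out.isSet = 0;               // propagate the null
>         } else {
>           out.isSet = 1;
>           out.value = in.value + 1;    // the actual function logic
>         }
>       }
>     }
>
> Paul: Under an Arrow-based engine, every one of those holder references would
> have to point at Arrow's equivalents (or at a replacement writer API), even
> though the UDF's logic is unchanged.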
>
> Paul: Here I'll point out that the Arrow vector code and writers have the
> same uncontrolled memory flaw that they inherited from Drill. So, if we
> replace the mutators and writers, we might as well use the "result set
> loader" model which a) hides the details, and b) manages memory to a given
> budget. Either way, UDFs must change if we move to Arrow for Drill internals.
>
>
>> * Since Drill vectors and metadata are exposed via the Drill client to
>> JDBC and ODBC, those must be revised as well.
>>
>
> Ted: How much given the high level of compatibility?
>
> Paul: As with Drill internals, all JDBC/ODBC code that uses Drill vector and
> metadata classes must be revised to use Arrow vectors and metadata, adapting
> the code to the changed APIs. This is not a huge technical challenge; it is
> just a pile of work. Perhaps this was done in that Arrow conversion PR.
>
>
>
>> * Since the wire format will change, clients of Drill must upgrade their
>> JDBC/ODBC drivers when migrating to an Arrow-based Drill.
>>
>
>
> Ted: Doesn't this have to happen fairly often anyway?
>
> Ted: Perhaps this would be a good excuse for a 2.0 step.
>
> Paul: As Drill matures, users would appreciate the ability to use JDBC and
> ODBC drivers with multiple Drill versions. If a shop has 1000 desktops using
> the drivers against five Drill clusters, it is impractical to upgrade
> everything in one go.
>
> Paul: You hit the nail on the head: conversion to Arrow would justify a jump
> to "Drill 2.0" to explain the required big-bang upgrade (and, to highlight
> the cool new capabilities that come with Arrow.)
>