Hi Aman,
Thanks for sending. I looked through the slides and really liked the presentation.
@Paul, how would a Drill-to-Arrow bridge work exactly? Would it require serialization/deserialization of Drill objects?
—C
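One way to picture the client-side bridge discussed below, without touching Drill internals: a shim runs the query through the Drill client (or the REST API) and pivots the row-oriented result into the columnar arrays an Arrow batch is built from. This is a hedged, illustrative sketch only — `drill_rows_to_columns` is a hypothetical helper, not a Drill or Arrow API, and a real bridge would hand the columns to the Arrow library (e.g. Arrow Java or pyarrow) rather than keep plain Python lists. The point it makes: the only serialization is the one the Drill client already performs; the bridge itself is just a pivot.

```python
# Hypothetical sketch of the bridge's core step: pivot row-oriented
# Drill results (as the REST API returns them) into column-oriented
# arrays. None is preserved so a validity bitmap could be derived.

def drill_rows_to_columns(columns, rows):
    """Pivot rows (list of dicts keyed by column name) into a dict
    mapping column name -> list of values, in column order."""
    out = {name: [] for name in columns}
    for row in rows:
        for name in columns:
            out[name].append(row.get(name))  # missing key -> None (null)
    return out

rows = [
    {"id": 1, "name": "alice"},
    {"id": 2},                     # name is null in this row
    {"id": 3, "name": "carol"},
]
cols = drill_rows_to_columns(["id", "name"], rows)
# cols["id"] == [1, 2, 3]; cols["name"] == ["alice", None, "carol"]
```

From here, a real bridge would wrap each list in an Arrow vector and expose the batches over Arrow's RPC layer, as Paul suggests.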
> On Jan 30, 2019, at 02:16, Paul Rogers <[email protected]> wrote:
>
> Hi Aman,
>
> Thanks for sharing the update. Glad to hear things are still percolating.
>
> I think Drill is an under-appreciated treasure for doing queries in the complex systems that folks seem to be building today. The ability to read multiple data sources is something that maybe only Spark can do as well. (And Spark can't act as a general-purpose query engine like Drill can.) Adding Arrow support for input and output would build on this advantage.
>
> I wonder if the output (client) side might be a great first step. It could be built as a separate app just by combining Arrow and the Drill client code together. It would let lots of Arrow-aware apps query data with Drill rather than having to write their own readers, own filters, own aggregators and, in the end, their own query engine.
>
> Charles was asking about Summer of Code ideas. This might be one: a stand-alone Drill-to-Arrow bridge. I think Arrow has an RPC layer. Add that, and any Arrow tool in any language could talk to Drill via the bridge.
>
> Thanks,
> - Paul
>
> On Tuesday, January 29, 2019, 1:54:30 PM PST, Aman Sinha <[email protected]> wrote:
>
> Hi Charles,
> You may have seen the talk that was given on the Drill Developer Day [1] by Karthik and me ... look for the slides on 'Drill-Arrow Integration', which describe 2 high-level options and what the integration might entail. Option 1 corresponds to what you and Paul are discussing in this thread. Option 2 is the deeper integration. We do plan to work on one of them (not finalized yet), but it will likely be after 1.16.0, since Statistics support and Resource Manager related tasks (these were also discussed in the Developer Day) are consuming our time. If you are interested in contributing/collaborating, let me know.
> [1] https://drive.google.com/drive/folders/17I2jZq2HdDwUDXFOIg1Vecry8yGTDWhn
>
> Aman
>
> On Tue, Jan 29, 2019 at 12:08 AM Paul Rogers <[email protected]> wrote:
>
>> Hi Charles,
>> I didn't see anything on this on the public mailing list. I haven't seen any commits related to it either. My guess is that this kind of interface is not important for the kind of data warehouse use cases that MapR is probably still trying to capture.
>> I followed the Arrow mailing lists for much of last year. Not much activity in the Java arena. (I think most of that might be done by Dremio.) Most activity is in other languages. The code itself has drifted far away from the original Drill structure. I found that even the metadata had vastly changed; it turned out to be far too much work to port the "Row Set" stuff I did for Drill.
>> This does mean, BTW, that the Drill folks did the right thing by not following Arrow. They'd have spent a huge amount of time tracking the massive changes.
>> Still, converting Arrow vectors to Drill vectors might be an exercise in bit twirling and memory ownership. Harder now than it once was, since I think Arrow defines all vectors to be nullable, and uses a different scheme than Drill for representing nulls.
>> Thanks,
>> - Paul
>>
>> On Monday, January 28, 2019, 5:54:12 PM PST, Charles Givre <[email protected]> wrote:
>>
>> Hey Paul,
>> I'm curious as to what, if anything, ever came of this thread? IMHO, you're on to something here. We could get the benefit of Arrow—specifically the interoperability with other big data tools—without the pain of having to completely re-work Drill. This seems like a real win-win to me.
>> — C
>>
>>> On Aug 20, 2018, at 13:51, Paul Rogers <[email protected]> wrote:
>>>
>>> Hi Ted,
>>>
>>> We may be confusing two very different ideas. One is a Drill-to-Arrow adapter on Drill's periphery; this is the "crude-but-effective" integration suggestion.
>>> On the periphery we are not changing existing code; we're just building an adapter to read Arrow data into Drill, or convert Drill output to Arrow.
>>>
>>> The other idea, being discussed in a parallel thread, is to convert Drill's runtime engine to use Arrow. That is a whole other beast.
>>>
>>> When changing Drill internals, code must change. There is a cost associated with that. Whether the Arrow code is better or not is not the key question. Rather, the key question is simply the volume of changes.
>>>
>>> Drill divides into roughly two main layers: plan-time and run-time. Plan-time is not much affected by Arrow. But run-time code is all about manipulating vectors and their metadata, often in quite detailed ways with APIs unique to Drill. While swapping Arrow vectors for Drill vectors is conceptually simple, those of us who've looked at the details have noted that the sheer volume of the lines of code that must change is daunting.
>>>
>>> It would be good to get second opinions. The PR I mentioned will show the volume of code that changed at that time (but Drill has grown since then). Parth is another good resource, as he reviewed the original PR and has kept a close eye on Arrow.
>>>
>>> When considering Arrow in the Drill execution engine, we must realistically understand the cost, then ask: do the benefits we gain justify those costs? Would Arrow be the highest-priority investment? Frankly, would Arrow integration increase Drill adoption more than the many other topics discussed recently on these mailing lists?
>>>
>>> Charles and others make a strong case for Arrow for integration. What is the strong case for Drill's internals? That's really the question the group will want to answer.
>>>
>>> More details below.
>>>
>>> Thanks,
>>> - Paul
>>>
>>> On Monday, August 20, 2018, 9:41:49 AM PDT, Ted Dunning <[email protected]> wrote:
>>>
>>> Inline.
>>> On Mon, Aug 20, 2018 at 9:20 AM Paul Rogers <[email protected]> wrote:
>>>
>>>> ...
>>>> By contrast, migrating Drill internals to Arrow has always been seen as the bulk of the cost; costs which the "crude-but-effective" suggestion seeks to avoid. Some of the full-integration costs include:
>>>>
>>>> * Reworking Drill's direct memory model to work with Arrow's.
>>>
>>> Ted: This should be relatively isolated to the allocation/deallocation code. The deallocation should become a no-op. The allocation becomes simpler and safer.
>>>
>>> Paul: If only that were true. Drill has an ingenious integration of vector allocation and Netty. Arrow may have done the same. (Probably did, since such integration is key to avoiding copies on send/receive.) That code is highly complex. Clearly, the swap can be done; it will simply take some work to get right.
>>>
>>>> * Changing all low-level runtime code that works with vectors to instead work with Arrow vectors.
>>>
>>> Ted: Why? You already said that most code doesn't have to change, since the format is the same.
>>>
>>> Paul: My comment about the format being the same was that the direct memory layout is the same, allowing conversion of a Drill vector to an Arrow vector by relabeling the direct memory that holds the data.
>>>
>>> Paul: But in the Drill runtime engine, we don't work with the memory directly; we use the vector APIs, mutator APIs, and so on. These all changed in Arrow. Granted, the Arrow versions are cleaner. But that does mean that every vector reference (of which there are thousands) must be revised to use the Arrow APIs. That is the cost that has put us off a bit.
>>>
>>>> * Change all Drill's vector metadata, and code that uses that metadata, to use Arrow's metadata instead.
>>>
>>> Ted: Why? You said that converting Arrow metadata to Drill's metadata would be simple.
>>> Why not just continue with that?
>>>
>>> Paul: In an API, we can convert one data structure to the other by writing code to copy data. But if we change Drill's internals, we must rewrite code in every operator that uses Drill's metadata to instead use Arrow's. That is a much more extensive undertaking than simply converting metadata on input or output.
>>>
>>>> * Since generated code works directly with vectors, change all the code generation.
>>>
>>> Ted: Why? You said the UDFs would just work.
>>>
>>> Paul: Again, I fear we are confusing two issues. If we don't change Drill's internals, then UDFs will work as today. If we do change Drill to Arrow, then, since UDFs are part of the code-gen system, they must change to adapt to the Arrow APIs. Specifically, Drill "holders" must be converted to Arrow holders, and Drill complex writers must convert to Arrow complex writers.
>>>
>>> Paul: Here I'll point out that the Arrow vector code and writers have the same uncontrolled-memory flaw that they inherited from Drill. So, if we replace the mutators and writers, we might as well use the "result set loader" model, which (a) hides the details, and (b) manages memory to a given budget. Either way, UDFs must change if we move to Arrow for Drill internals.
>>>
>>>> * Since Drill vectors and metadata are exposed via the Drill client to JDBC and ODBC, those must be revised as well.
>>>
>>> Ted: How much, given the high level of compatibility?
>>>
>>> Paul: As with Drill internals, all JDBC/ODBC code that uses Drill vector and metadata classes must be revised to use Arrow vectors and metadata, adapting the code to the changed APIs. This is not a huge technical challenge; it is just a pile of work. Perhaps this was done in that Arrow conversion PR.
>>>> * Since the wire format will change, clients of Drill must upgrade their JDBC/ODBC drivers when migrating to an Arrow-based Drill.
>>>
>>> Ted: Doesn't this have to happen fairly often anyway?
>>>
>>> Ted: Perhaps this would be a good excuse for a 2.0 step.
>>>
>>> Paul: As Drill matures, users would appreciate the ability to use JDBC and ODBC drivers with multiple Drill versions. If a shop has 1000 desktops using the drivers against five Drill clusters, it is impractical to upgrade everything in one go.
>>>
>>> Paul: You hit the nail on the head: conversion to Arrow would justify a jump to "Drill 2.0" to explain the required big-bang upgrade (and to highlight the cool new capabilities that come with Arrow).
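On Paul's point that Arrow "uses a different scheme than Drill for representing nulls": Drill's nullable vectors carry a separate "bits" buffer with one byte per value (1 = value set), while Arrow packs validity into a bitmap with one bit per value, least-significant bit first (1 = valid). A periphery adapter would re-encode that buffer for each batch. A minimal sketch, with illustrative function names (neither project's API):

```python
# Hedged sketch of the null-representation mismatch: re-encode between
# an Arrow-style packed validity bitmap (1 bit/value, LSB first) and a
# Drill-style byte-per-value "bits" buffer. Function names are
# illustrative only.

def arrow_validity_to_drill_bits(bitmap: bytes, length: int) -> list:
    """Unpack a packed validity bitmap into byte-per-value flags."""
    return [(bitmap[i // 8] >> (i % 8)) & 1 for i in range(length)]

def drill_bits_to_arrow_validity(bits: list) -> bytes:
    """Pack byte-per-value flags into an Arrow-style validity bitmap."""
    out = bytearray((len(bits) + 7) // 8)
    for i, flag in enumerate(bits):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

# values [10, null, 30, 40, null] -> validity bitmap 0b01101 = 0x0d
bits = arrow_validity_to_drill_bits(b"\x0d", 5)
# bits == [1, 0, 1, 1, 0]
assert drill_bits_to_arrow_validity(bits) == b"\x0d"
```

This re-encoding is cheap per batch, which is part of why the periphery adapter looks so much less costly than swapping the vector classes throughout Drill's runtime.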
