Hi Aman,

Thanks for sharing the update. Glad to hear things are still percolating.

I think Drill is an underappreciated treasure for querying the complex 
systems that folks seem to be building today. The ability to read multiple 
data sources is something that maybe only Spark can do as well. (And Spark 
can't act as a general-purpose query engine like Drill can.) Adding Arrow 
support for input and output would build on this advantage.

I wonder if the output (client) side might be a great first step. It could 
be built as a separate app just by combining Arrow and the Drill client code. 
That would let lots of Arrow-aware apps query data with Drill rather than 
having to write their own readers, their own filters, their own aggregators 
and, in the end, their own query engine.
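
To make that concrete, here is a rough sketch of what such a bridge might 
look like, using Drill's standard JDBC driver as the client API and Arrow's 
Java vectors on the output side. (The single VARCHAR column and the query 
against the bundled employee.json sample are just for illustration; a real 
bridge would copy whole record batches and map all of Drill's types.)

    import java.nio.charset.StandardCharsets;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.VarCharVector;

    public class DrillArrowBridge {
      public static void main(String[] args) throws Exception {
        // Connect to a Drillbit through the standard Drill JDBC driver.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT full_name FROM cp.`employee.json` LIMIT 100");
             RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             VarCharVector names = new VarCharVector("full_name", allocator)) {
          // Copy each row into the Arrow vector, mapping SQL NULL to an
          // Arrow null along the way.
          int row = 0;
          while (rs.next()) {
            String name = rs.getString(1);
            if (name == null) {
              names.setNull(row);
            } else {
              names.setSafe(row, name.getBytes(StandardCharsets.UTF_8));
            }
            row++;
          }
          names.setValueCount(row);
          // 'names' is now an ordinary Arrow vector that any Arrow-aware
          // app can consume.
        }
      }
    }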

Charles was asking about Summer of Code ideas. This might be one: a stand-alone 
Drill-to-Arrow bridge. I think Arrow has an RPC layer. Add that, and any Arrow 
tool in any language could talk to Drill via the bridge.
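
If that RPC layer is the new "Flight" work, the bridge could expose Drill 
results as a Flight service. A minimal sketch, assuming Arrow's Java Flight 
API and treating the Flight ticket as the SQL text to run; fetchFromDrill() 
is a hypothetical stand-in for the JDBC-to-Arrow copy sketched above:

    import java.nio.charset.StandardCharsets;

    import org.apache.arrow.flight.FlightServer;
    import org.apache.arrow.flight.Location;
    import org.apache.arrow.flight.NoOpFlightProducer;
    import org.apache.arrow.flight.Ticket;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.VarCharVector;
    import org.apache.arrow.vector.VectorSchemaRoot;

    public class DrillFlightBridge {
      public static void main(String[] args) throws Exception {
        BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
        NoOpFlightProducer producer = new NoOpFlightProducer() {
          @Override
          public void getStream(CallContext context, Ticket ticket,
                                ServerStreamListener listener) {
            String sql = new String(ticket.getBytes(), StandardCharsets.UTF_8);
            // Run the query through Drill, stream the batch to the caller.
            try (VectorSchemaRoot root = fetchFromDrill(sql, allocator)) {
              listener.start(root);
              listener.putNext();    // one batch; a real bridge would loop
              listener.completed();
            }
          }
        };
        Location location = Location.forGrpcInsecure("0.0.0.0", 47470);
        try (FlightServer server =
                 FlightServer.builder(allocator, location, producer).build()) {
          server.start();
          server.awaitTermination();
        }
      }

      // Hypothetical stand-in for the JDBC copy above; returns one dummy
      // single-row batch so the sketch is self-contained.
      static VectorSchemaRoot fetchFromDrill(String sql,
                                             BufferAllocator allocator) {
        VarCharVector v = new VarCharVector("full_name", allocator);
        v.setSafe(0, sql.getBytes(StandardCharsets.UTF_8));
        v.setValueCount(1);
        return VectorSchemaRoot.of(v);
      }
    }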

Thanks,
- Paul

    On Tuesday, January 29, 2019, 1:54:30 PM PST, Aman Sinha <[email protected]> wrote:

Hi Charles,

You may have seen the talk that was given on the Drill Developer Day [1] by
Karthik and me ... look for the slides on 'Drill-Arrow Integration', which
describe two high-level options and what the integration might entail.
Option 1 corresponds to what you and Paul are discussing in this thread.
Option 2 is the deeper integration. We do plan to work on one of them (not
finalized yet), but it will likely be after 1.16.0, since Statistics support
and Resource Manager-related tasks (also discussed at the Developer Day)
are consuming our time. If you are interested in contributing/collaborating,
let me know.

[1]
https://drive.google.com/drive/folders/17I2jZq2HdDwUDXFOIg1Vecry8yGTDWhn

Aman

On Tue, Jan 29, 2019 at 12:08 AM Paul Rogers <[email protected]>
wrote:

> Hi Charles,
> I didn't see anything on this on the public mailing list. Haven't seen any
> commits related to it either. My guess is that this kind of interface is
> not important for the kind of data warehouse use cases that MapR is
> probably still trying to capture.
> I followed the Arrow mailing lists for much of last year. Not much
> activity in the Java arena. (I think most of that might be done by Dremio.)
> Most activity in other languages. The code itself has drifted far away from
> the original Drill structure. I found that even the metadata had vastly
> changed; it turned out to be far too much work to port the "Row Set" stuff I
> did for Drill.
> This does mean, BTW, that the Drill folks did the right thing by not
> following Arrow. They'd have spent a huge amount of time tracking the
> massive changes.
> Still, converting Arrow vectors to Drill vectors might be an exercise in
> bit twirling and memory ownership. It's harder now than it once was, since I
> think Arrow defines all vectors to be nullable and uses a different scheme
> than Drill for representing nulls.
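>
> For the curious: Drill tracks nulls in a separate "bits" vector, one byte
> per value, while Arrow packs them into a validity bitmap, one bit per
> value. A converter would have to repack them. A sketch, using plain Java
> arrays rather than either project's buffer classes:
>
>     // Repack Drill-style one-byte-per-value "is set" flags into an
>     // Arrow-style validity bitmap (one bit per value, 1 = non-null).
>     static byte[] toValidityBitmap(byte[] drillBits) {
>       byte[] bitmap = new byte[(drillBits.length + 7) / 8];
>       for (int i = 0; i < drillBits.length; i++) {
>         if (drillBits[i] != 0) {
>           bitmap[i / 8] |= (byte) (1 << (i % 8));
>         }
>       }
>       return bitmap;
>     }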
> Thanks,
> - Paul
>
>
>
>    On Monday, January 28, 2019, 5:54:12 PM PST, Charles Givre <
> [email protected]> wrote:
>
>  Hey Paul,
> I’m curious as to what, if anything, ever came of this thread. IMHO,
> you’re on to something here.  We could get the benefit of
> Arrow—specifically the interoperability with other big data tools—without
> the pain of having to completely re-work Drill. This seems like a real
> win-win to me.
> — C
>
> > On Aug 20, 2018, at 13:51, Paul Rogers <[email protected]>
> wrote:
> >
> > Hi Ted,
> >
> > We may be confusing two very different ideas. The first is a
> Drill-to-Arrow adapter on Drill's periphery; this is the
> "crude-but-effective" integration suggestion. On the periphery we are not
> changing existing code; we're just building an adapter to read Arrow data
> into Drill, or to convert Drill output to Arrow.
> >
> > The other idea, being discussed in a parallel thread, is to convert
> Drill's runtime engine to use Arrow. That is a whole other beast.
> >
> > When changing Drill internals, code must change. There is a cost
> associated with that. Whether the Arrow code is better or not is not the
> key question. Rather, the key question is simply the volume of changes.
> >
> > Drill divides into roughly two main layers: plan-time and run-time.
> Plan-time is not much affected by Arrow. But, run-time code is all about
> manipulating vectors and their metadata, often in quite detailed ways with
> APIs unique to Drill. While swapping Arrow vectors for Drill vectors is
> conceptually simple, those of us who've looked at the details have noted
> that the sheer volume of the lines of code that must change is daunting.
> >
> > It would be good to get second opinions. That PR I mentioned will show the
> volume of code that changed at that time (though Drill has grown since then).
> Parth is another good resource, as he reviewed the original PR and has kept
> a close eye on Arrow.
> >
> > When considering Arrow in the Drill execution engine, we must
> realistically understand the cost, then ask: do the benefits we gain justify
> those costs? Would Arrow be the highest-priority investment? Frankly, would
> Arrow integration increase Drill adoption more than the many other topics
> discussed recently on these mail lists?
> >
> > Charles and others make a strong case for Arrow for integration. What is
> the strong case for Drill's internals? That's really the question the group
> will want to answer.
> >
> > More details below.
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Monday, August 20, 2018, 9:41:49 AM PDT, Ted Dunning <
> [email protected]> wrote:
> >
> > Inline.
> >
> >
> > On Mon, Aug 20, 2018 at 9:20 AM Paul Rogers <[email protected]>
> > wrote:
> >
> >> ...
> >> By contrast, migrating Drill internals to Arrow has always been seen as
> >> the bulk of the cost; costs which the "crude-but-effective" suggestion
> >> seeks to avoid. Some of the full-integration costs include:
> >>
> >> * Reworking Drill's direct memory model to work with Arrow's.
> >>
> >
> >
> > Ted: This should be relatively isolated to the allocation/deallocation
> code. The
> > deallocation should become a no-op. The allocation becomes simpler and
> > safer.
> >
> > Paul: If only that were true. Drill has an ingenious integration of
> vector allocation and Netty. Arrow may have done the same. (Probably did,
> since such integration is key to avoiding copies on send/receive.) That
> code is highly complex. Clearly, the swap can be done; it will simply take
> some work to get right.
> >
> >
> >> * Changing all low-level runtime code that works with vectors to instead
> >> work with Arrow vectors.
> >>
> >
> >
> > Ted: Why? You already said that most code doesn't have to change since
> the
> > format is the same.
> >
> > Paul: My comment about the format being the same was that the direct
> memory layout is the same, allowing conversion of a Drill vector to an
> Arrow vector by relabeling the direct memory that holds the data.
> >
> > Paul: But, in the Drill runtime engine, we don't work with the memory
> directly; we use the vector APIs, mutator APIs, and so on. These all changed
> in Arrow. Granted, the Arrow versions are cleaner. But, that does mean that
> every vector reference (of which there are thousands) must be revised to
> use the Arrow APIs. That is the cost that has put us off a bit.
> >
> >
> >> * Change all Drill's vector metadata, and code that uses that metadata,
> to
> >> use Arrow's metadata instead.
> >>
> >
> >
> > Ted: Why? You said that converting Arrow metadata to Drill's metadata
> would be
> > simple. Why not just continue with that?
> >
> > Paul: In an API, we can convert one data structure to the other by
> writing code to copy data. But, if we change Drill's internals, we must
> rewrite code in every operator that uses Drill's metadata to instead use
> Arrow's. That is a much more extensive undertaking than simply converting
> metadata on input or output.
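> >
> > Paul: To make that boundary conversion concrete, a toy mapping from an
> > Arrow field to Drill metadata might look like the sketch below. (Just two
> > types shown; real code would cover every Drill type and recurse into
> > children.)
> >
> >     import org.apache.arrow.vector.types.pojo.ArrowType;
> >     import org.apache.arrow.vector.types.pojo.Field;
> >     import org.apache.drill.common.types.TypeProtos.MinorType;
> >     import org.apache.drill.common.types.Types;
> >     import org.apache.drill.exec.record.MaterializedField;
> >
> >     // Toy Arrow-to-Drill metadata converter; only two types are mapped.
> >     static MaterializedField toDrill(Field arrowField) {
> >       ArrowType type = arrowField.getType();
> >       MinorType minor;
> >       if (type instanceof ArrowType.Int
> >           && ((ArrowType.Int) type).getBitWidth() == 32) {
> >         minor = MinorType.INT;
> >       } else if (type instanceof ArrowType.Utf8) {
> >         minor = MinorType.VARCHAR;
> >       } else {
> >         throw new UnsupportedOperationException("unmapped type: " + type);
> >       }
> >       // Arrow vectors are always nullable, so use Drill's OPTIONAL mode.
> >       return MaterializedField.create(arrowField.getName(),
> >           Types.optional(minor));
> >     }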
> >
> >
> >> * Since generated code works directly with vectors, change all the code
> >> generation.
> >>
> >
> > Ted: Why? You said the UDFs would just work.
> >
> > Paul: Again, I fear we are confusing two issues. If we don't change
> Drill's internals, then UDFs will work as today. If we do change Drill to
> Arrow, then, since UDFs are part of the code gen system, they must change
> to adapt to the Arrow APIs. Specifically, Drill "holders" must be converted to
> Arrow holders. Drill complex writers must convert to Arrow complex writers.
> >
> > Paul: Here I'll point out that the Arrow vector code and writers have
> the same uncontrolled memory flaw that they inherited from Drill. So, if we
> replace the mutators and writers, we might as well use the "result set
> loader" model which a) hides the details, and b) manages memory to a given
> budget.  Either way, UDFs must change if we move to Arrow for Drill
> internals.
> >
> >
> >> * Since Drill vectors and metadata are exposed via the Drill client to
> >> JDBC and ODBC, those must be revised as well.
> >>
> >
> > Ted: How much, given the high level of compatibility?
> >
> > Paul: As with Drill internals, all JDBC/ODBC code that uses Drill vector
> and metadata classes must be revised to use Arrow vectors and metadata,
> adapting the code to the changed APIs. This is not a huge technical
> challenge, it is just a pile of work. Perhaps this was done in that Arrow
> conversion PR.
> >
> >
> >
> >> * Since the wire format will change, clients of Drill must upgrade their
> >> JDBC/ODBC drivers when migrating to an Arrow-based Drill.
> >
> >
> > Ted: Doesn't this have to happen fairly often anyway?
> >
> > Ted: Perhaps this would be a good excuse for a 2.0 step.
> >
> > Paul: As Drill matures, users would appreciate the ability to use JDBC
> and ODBC drivers with multiple Drill versions. If a shop has 1000 desktops
> using the drivers against five Drill clusters, it is impractical to upgrade
> everything in one go.
> >
> > Paul: You hit the nail on the head: conversion to Arrow would justify a
> jump to "Drill 2.0" to explain the required big-bang upgrade (and to
> highlight the cool new capabilities that come with Arrow).