Jim, I really like this use case. As a data scientist myself, I see Drill's big value as being able to rapidly get raw data ready for machine learning. It would be great if we could make this work! (A few rough sketches of what it might look like follow the quoted thread below.)
> On Jan 30, 2019, at 08:43, Jim Scott <[email protected]> wrote:
>
> Paul,
>
> Your example is exactly the same as one I discussed with some people on the RAPIDS.ai project: using Drill as a tool to gather (query) all the data to get a representative data set for an ML/AI workload, then feeding the result set directly into GPU memory. RAPIDS.ai is based on Arrow, which created a GPU data frame. The whole point of that project was to reduce the total number of memcopy operations and produce an end-to-end speedup.
>
> That model, letting Drill plug into other tools, would be a GREAT use case for Drill.
>
> Jim
>
> On Wed, Jan 30, 2019 at 2:17 AM Paul Rogers <[email protected]> wrote:
>
>> Hi Aman,
>>
>> Thanks for sharing the update. Glad to hear things are still percolating.
>>
>> I think Drill is an underappreciated treasure for doing queries in the complex systems that folks seem to be building today. The ability to read multiple data sources is something that maybe only Spark can do as well. (And Spark can't act as a general-purpose query engine like Drill can.) Adding Arrow support for input and output would build on this advantage.
>>
>> I wonder if the output (client) side might be a great first step. It could be built as a separate app just by combining Arrow and the Drill client code. That would let lots of Arrow-aware apps query data with Drill rather than having to write their own readers, their own filters, their own aggregators and, in the end, their own query engine.
>>
>> Charles was asking about Summer of Code ideas. This might be one: a stand-alone Drill-to-Arrow bridge. I think Arrow has an RPC layer; add that, and any Arrow tool in any language could talk to Drill via the bridge.
>>
>> Thanks,
>> - Paul
>>
>> On Tuesday, January 29, 2019, 1:54:30 PM PST, Aman Sinha <[email protected]> wrote:
>>
>> Hi Charles,
>> You may have seen the talk that Karthik and I gave at the Drill Developer Day [1]; look for the slides on 'Drill-Arrow Integration', which describe two high-level options and what the integration might entail. Option 1 corresponds to what you and Paul are discussing in this thread. Option 2 is the deeper integration. We do plan to work on one of them (not finalized yet), but it will likely be after 1.16.0, since Statistics support and Resource Manager related tasks (also discussed at the Developer Day) are consuming our time. If you are interested in contributing/collaborating, let me know.
>>
>> [1] https://drive.google.com/drive/folders/17I2jZq2HdDwUDXFOIg1Vecry8yGTDWhn
>>
>> Aman
>>
>> On Tue, Jan 29, 2019 at 12:08 AM Paul Rogers <[email protected]> wrote:
>>
>>> Hi Charles,
>>> I didn't see anything on this on the public mailing list, and haven't seen any commits related to it either.
>>> My guess is that this kind of interface is not important for the kind of data warehouse use cases that MapR is probably still trying to capture.
>>> I followed the Arrow mailing lists for much of last year. Not much activity in the Java arena (I think most of that might be done by Dremio); most activity is in other languages. The code itself has drifted far from the original Drill structure. I found that even the metadata had vastly changed; it turned out to be far too much work to port the "Row Set" work I did for Drill.
>>> This does mean, BTW, that the Drill folks did the right thing by not following Arrow. They'd have spent a huge amount of time tracking the massive changes.
>>> Still, converting Arrow vectors to Drill vectors might be an exercise in bit twirling and memory ownership. It is harder now than it once was, since I think Arrow defines all vectors to be nullable and uses a different scheme than Drill for representing nulls.
>>> Thanks,
>>> - Paul
>>>
>>> On Monday, January 28, 2019, 5:54:12 PM PST, Charles Givre <[email protected]> wrote:
>>>
>>> Hey Paul,
>>> I'm curious as to what, if anything, ever came of this thread. IMHO, you're on to something here. We could get the benefit of Arrow (specifically the interoperability with other big data tools) without the pain of having to completely rework Drill. This seems like a real win-win to me.
>>> -- C
>>>
>>>> On Aug 20, 2018, at 13:51, Paul Rogers <[email protected]> wrote:
>>>>
>>>> Hi Ted,
>>>>
>>>> We may be confusing two very different ideas. One is a Drill-to-Arrow adapter on Drill's periphery; this is the "crude-but-effective" integration suggestion. On the periphery we are not changing existing code, we're just building an adapter to read Arrow data into Drill, or to convert Drill output to Arrow.
>>>>
>>>> The other idea, being discussed in a parallel thread, is to convert Drill's runtime engine to use Arrow. That is a whole other beast.
>>>>
>>>> When changing Drill internals, code must change, and there is a cost associated with that. Whether the Arrow code is better or not is not the key question. Rather, the key question is simply the volume of changes.
>>>>
>>>> Drill divides into roughly two main layers: plan-time and run-time. Plan-time is not much affected by Arrow. But run-time code is all about manipulating vectors and their metadata, often in quite detailed ways with APIs unique to Drill. While swapping Arrow vectors for Drill vectors is conceptually simple, those of us who've looked at the details have noted that the sheer volume of the lines of code that must change is daunting.
>>>>
>>>> It would be good to get second opinions. That PR I mentioned will show the volume of code that changed at that time (but Drill has grown since then). Parth is another good resource, as he reviewed the original PR and has kept a close eye on Arrow.
>>>>
>>>> When considering Arrow in the Drill execution engine, we must realistically understand the cost, then ask: do the benefits we gain justify those costs? Would Arrow be the highest-priority investment? Frankly, would Arrow integration increase Drill adoption more than the many other topics discussed recently on these mailing lists?
>>>>
>>>> Charles and others make a strong case for Arrow for integration. What is the strong case for Drill's internals?
>>>> That's really the question the group will want to answer.
>>>>
>>>> More details below.
>>>>
>>>> Thanks,
>>>> - Paul
>>>>
>>>> On Monday, August 20, 2018, 9:41:49 AM PDT, Ted Dunning <[email protected]> wrote:
>>>>
>>>> Inline.
>>>>
>>>> On Mon, Aug 20, 2018 at 9:20 AM Paul Rogers <[email protected]> wrote:
>>>>
>>>>> ...
>>>>> By contrast, migrating Drill internals to Arrow has always been seen as the bulk of the cost; costs which the "crude-but-effective" suggestion seeks to avoid. Some of the full-integration costs include:
>>>>>
>>>>> * Reworking Drill's direct memory model to work with Arrow's.
>>>>>
>>>> Ted: This should be relatively isolated to the allocation/deallocation code. The deallocation should become a no-op. The allocation becomes simpler and safer.
>>>>
>>>> Paul: If only that were true. Drill has an ingenious integration of vector allocation and Netty. Arrow may have done the same (probably did, since such integration is key to avoiding copies on send/receive). That code is highly complex. Clearly, the swap can be done; it will simply take some work to get right.
>>>>
>>>>> * Changing all low-level runtime code that works with vectors to instead work with Arrow vectors.
>>>>>
>>>> Ted: Why? You already said that most code doesn't have to change since the format is the same.
>>>>
>>>> Paul: My comment about the format being the same was that the direct memory layout is the same, allowing conversion of a Drill vector to an Arrow vector by relabeling the direct memory that holds the data.
>>>>
>>>> Paul: But in the Drill runtime engine we don't work with the memory directly; we use the vector APIs, mutator APIs and so on. These all changed in Arrow. Granted, the Arrow versions are cleaner. But that does mean that every vector reference (of which there are thousands) must be revised to use the Arrow APIs. That is the cost that has put us off a bit.
>>>>
>>>>> * Changing all of Drill's vector metadata, and code that uses that metadata, to use Arrow's metadata instead.
>>>>>
>>>> Ted: Why? You said that converting Arrow metadata to Drill's metadata would be simple. Why not just continue with that?
>>>>
>>>> Paul: In an API, we can convert one data structure to the other by writing code to copy data. But if we change Drill's internals, we must rewrite code in every operator that uses Drill's metadata to instead use Arrow's. That is a much more extensive undertaking than simply converting metadata on input or output.
>>>>
>>>>> * Since generated code works directly with vectors, changing all the code generation.
>>>>>
>>>> Ted: Why? You said the UDFs would just work.
>>>>
>>>> Paul: Again, I fear we are confusing two issues. If we don't change Drill's internals, then UDFs will work as they do today. If we do change Drill to Arrow, then, since UDFs are part of the code-gen system, they must change to adapt to the Arrow APIs. Specifically, Drill "holders" must be converted to Arrow holders, and Drill complex writers must convert to Arrow complex writers.
>>>>
>>>> Paul: Here I'll point out that the Arrow vector code and writers have the same uncontrolled memory flaw that they inherited from Drill.
>>>> So, if we replace the mutators and writers, we might as well use the "result set loader" model, which a) hides the details, and b) manages memory to a given budget. Either way, UDFs must change if we move to Arrow for Drill internals.
>>>>
>>>>> * Since Drill vectors and metadata are exposed via the Drill client to JDBC and ODBC, those must be revised as well.
>>>>>
>>>> Ted: How much, given the high level of compatibility?
>>>>
>>>> Paul: As with Drill internals, all JDBC/ODBC code that uses Drill vector and metadata classes must be revised to use Arrow vectors and metadata, adapting the code to the changed APIs. This is not a huge technical challenge; it is just a pile of work. Perhaps this was done in that Arrow conversion PR.
>>>>
>>>>> * Since the wire format will change, clients of Drill must upgrade their JDBC/ODBC drivers when migrating to an Arrow-based Drill.
>>>>>
>>>> Ted: Doesn't this have to happen fairly often anyway?
>>>>
>>>> Ted: Perhaps this would be a good excuse for a 2.0 step.
>>>>
>>>> Paul: As Drill matures, users would appreciate the ability to use JDBC and ODBC drivers with multiple Drill versions. If a shop has 1000 desktops using the drivers against five Drill clusters, it is impractical to upgrade everything in one go.
>>>>
>>>> Paul: You hit the nail on the head: conversion to Arrow would justify a jump to "Drill 2.0" to explain the required big-bang upgrade (and to highlight the cool new capabilities that come with Arrow).
>
> --
> *Jim Scott* | Mobile/Text: +1 (989) 450-0212
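
To make some of the ideas above concrete, a few rough sketches follow; none of this is tested or authoritative. First, Paul's stand-alone Drill-to-Arrow bridge. This is a minimal sketch that assumes only the Drill JDBC driver and Arrow's arrow-jdbc adapter on the classpath; the connection URL and query are placeholders, and the exact JdbcToArrow overload varies by Arrow version.

// Hypothetical bridge sketch: run a query through the Drill JDBC driver and
// convert the ResultSet into Arrow vectors with the arrow-jdbc adapter.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Calendar;

import org.apache.arrow.adapter.jdbc.JdbcToArrow;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;

public class DrillArrowBridge {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             DriverManager.getConnection("jdbc:drill:drillbit=localhost");  // placeholder URL
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT * FROM cp.`employee.json` LIMIT 100");
         RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         // Materialize the JDBC rows as a single Arrow record batch.
         VectorSchemaRoot root = JdbcToArrow.sqlToArrow(rs, allocator, Calendar.getInstance())) {
      System.out.println(root.getSchema());
      System.out.println(root.getRowCount() + " rows now sit in Arrow memory");
      // From here the batch could be handed to any Arrow-aware tool, or served
      // over Arrow's RPC layer as Paul suggests.
    }
  }
}

An adapter like this copies data once at the query boundary, which is exactly the trade-off of the "crude-but-effective" approach discussed above: Drill's internals stay untouched.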
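
Second, on Paul's point that Arrow "uses a different scheme than Drill for representing nulls": Drill's nullable vectors carry a separate "bits" vector with one byte per value, while Arrow packs validity into a bitmap with one bit per value. Relabeling the data buffer therefore isn't enough; the null information has to be repacked, roughly as below (ignoring buffer ownership and reference counting, which are the hard parts Paul alludes to).

// Illustration only: repack Drill-style one-byte-per-value null flags into an
// Arrow-style validity bitmap (least-significant bit first, 1 = value present).
public class NullRepacking {

  static byte[] bytesToValidityBitmap(byte[] drillBits) {
    byte[] bitmap = new byte[(drillBits.length + 7) / 8];
    for (int i = 0; i < drillBits.length; i++) {
      if (drillBits[i] != 0) {                    // non-zero byte means "set" in Drill
        bitmap[i / 8] |= (byte) (1 << (i % 8));   // set the matching Arrow validity bit
      }
    }
    return bitmap;
  }

  public static void main(String[] args) {
    // Values 0 and 2 present, value 1 null.
    byte[] bitmap = bytesToValidityBitmap(new byte[] {1, 0, 1});
    System.out.println(Integer.toBinaryString(bitmap[0] & 0xFF)); // prints 101
  }
}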
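
Third, the metadata conversion Paul describes as feasible at the boundary might look roughly like the sketch below, which maps a handful of Drill scalar types onto Arrow field types. The Drill and Arrow class names are real as far as I know, but the mapping itself is purely illustrative; a real bridge would cover the full type system, cardinality modes, and nested types.

// Hedged sketch: translate a Drill MaterializedField into an Arrow Field for a
// few scalar types. Real code would handle the full MinorType enum,
// REQUIRED/REPEATED modes, decimals, and nested maps/lists.
import java.util.Collections;

import org.apache.arrow.vector.types.FloatingPointPrecision;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.drill.exec.record.MaterializedField;

public class DrillToArrowSchema {

  static Field toArrowField(MaterializedField drillField) {
    ArrowType arrowType;
    switch (drillField.getType().getMinorType()) {
      case INT:     arrowType = new ArrowType.Int(32, true); break;
      case BIGINT:  arrowType = new ArrowType.Int(64, true); break;
      case FLOAT8:  arrowType = new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE); break;
      case VARCHAR: arrowType = new ArrowType.Utf8(); break;
      default:
        throw new UnsupportedOperationException(
            "No mapping sketched for " + drillField.getType().getMinorType());
    }
    // Arrow fields are nullable by default, matching Paul's note that Arrow
    // treats all vectors as nullable.
    return new Field(drillField.getName(), FieldType.nullable(arrowType),
        Collections.emptyList());
  }
}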
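
Finally, for a flavor of why "every vector reference (of which there are thousands) must be revised": the same trivial write looks different through Drill's mutator API and Arrow's vector API. The Drill half is sketched only in comments, from memory; the Arrow half should compile against arrow-vector.

// Toy comparison: write "42, then a null" through Arrow's IntVector. The
// Drill-side equivalent is sketched in comments for contrast; this is the
// kind of call site that exists by the thousands in Drill's operators and
// generated code.
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class MutatorVsVector {
  public static void main(String[] args) {
    // Drill today (sketch, from memory):
    //   NullableIntVector v = ...;
    //   v.allocateNew(2);
    //   v.getMutator().setSafe(0, 42);     // value present
    //   // index 1 left unset => null
    //   v.getMutator().setValueCount(2);

    // Arrow equivalent:
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         IntVector v = new IntVector("a", allocator)) {
      v.allocateNew(2);
      v.setSafe(0, 42);      // value present
      v.setNull(1);          // explicit null
      v.setValueCount(2);
      System.out.println(v.getObject(0) + ", isNull(1)=" + v.isNull(1));
    }
  }
}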
