Hi Jim, Thanks for the description of the real-world use case. I like your idea of letting Drill do the grunt work, then letting the ML/AI workload focus on that aspect of the problem.
Charles, just brainstorming a bit, I think the easiest way to start is to create a simple, stand-alone server that speaks Arrow to the client and uses the native Drill client to speak to Drill. The native Drill client exposes Drill value vectors, so one trick would be converting Drill vectors to the Arrow format. I believe the data vectors share the same layout, and possibly the offset vectors as well, but Arrow went its own way with null-value vectors (where Drill uses a byte-per-value "is-set" vector). So some conversions might be no-ops, while others might need to rewrite a vector. The good news is that this work is purely at the vector level, so it would be easy to write.

The next issue is the one that Parth has long pointed out: Drill and Arrow each have their own memory allocators. How could we share a data vector between the two? The simplest initial solution is just to copy the data from Drill to Arrow. Slow, but transparent to the client.

A crude first approximation of the development steps:

1. Create the client shell server.
2. Implement the Arrow client protocol. We need some way to accept a query and return batches of results.
3. Forward the query to Drill using the native Drill client.
4. As a first pass, copy vectors from Drill to Arrow and return them to the client.
5. Then, solve the memory allocator problem so data can be passed without copying.

Once the experimental work is done in the stand-alone server, the next step is to consider merging it into Drill itself for better performance. Still, since all the bridge does is transform data, it may work fine as a separate process.

FWIW, I did a prototype of something similar a couple of years ago. The idea there was to convert Drill's vectors to a row format, but the overall approach is similar; it might offer one or two ideas to get someone started. [1] One thing I'd do differently today is to use something like gRPC instead of Netty for the RPC layer.

Something like this is well isolated and not hard if you take it step-by-step.
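To make the null-vector difference concrete (this sketch is mine, not from the thread): Drill tracks nulls with an is-set vector of one byte per value, while Arrow uses a bit-packed validity bitmap, one bit per value, LSB-first. A real bridge would do this in Java over raw buffers; here is a minimal Python illustration of the rewrite step 4 above would need for nullable vectors:

```python
def isset_to_validity_bitmap(is_set: list) -> bytearray:
    """Convert a Drill-style is-set vector (one byte per value, 1 = present)
    into an Arrow-style validity bitmap (one bit per value, LSB-first)."""
    bitmap = bytearray((len(is_set) + 7) // 8)
    for i, flag in enumerate(is_set):
        if flag:
            bitmap[i // 8] |= 1 << (i % 8)
    return bitmap

# Three values: present, null, present -> bits 0 and 2 set -> 0b00000101
print(list(isset_to_validity_bitmap([1, 0, 1])))  # [5]
```

This is one of the cases where the conversion cannot be a no-op: the buffer must be rewritten (and shrinks by roughly a factor of eight in the process).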
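For reference on the "offset vectors may be the same format" point: Arrow's layout for variable-width data (which Drill's varchar vectors closely resemble) is a data buffer of concatenated bytes plus an (n+1)-entry offset buffer, with value i spanning offsets[i] to offsets[i+1]. A small stdlib-only Python sketch of that layout (illustrative only; not Drill or Arrow API code):

```python
def build_varchar_vector(values: list):
    """Build an Arrow-style variable-width vector: a data buffer of the
    concatenated bytes plus an (n+1)-entry offset list, offsets[0] = 0."""
    data = bytearray()
    offsets = [0]
    for v in values:
        data.extend(v)
        offsets.append(len(data))
    return bytes(data), offsets

def get_value(data: bytes, offsets: list, i: int) -> bytes:
    # Value i occupies data[offsets[i]:offsets[i+1]]; empty values
    # simply repeat the previous offset.
    return data[offsets[i]:offsets[i + 1]]

data, offsets = build_varchar_vector([b"drill", b"", b"arrow"])
print(offsets)                      # [0, 5, 5, 10]
print(get_value(data, offsets, 2))  # b'arrow'
```

If Drill and Arrow do agree on this layout, copying a varchar vector reduces to a straight buffer copy, which is what makes the copy-first bridge in the steps above cheap to prototype.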
That's why it seemed a good Summer of Code project for an enterprising student interested in networking and data munging.

Thanks,
- Paul

[1] https://github.com/paul-rogers/drill-jig

On Wednesday, January 30, 2019, 10:18:47 AM PST, Charles Givre <[email protected]> wrote:

Jim, I really like this use case. As a data scientist myself, I see the big value of Drill as being able to rapidly get raw data ready for machine learning. It would be great if we could do this!

> On Jan 30, 2019, at 08:43, Jim Scott <[email protected]> wrote:
>
> Paul,
>
> Your example is exactly the same as one I discussed with some people on
> the RAPIDS.ai project: using Drill as a tool to gather (query) all
> the data to get a representative data set for an ML/AI workload, then
> feeding the result set directly into GPU memory. RAPIDS.ai is based on Arrow,
> which created a GPU data frame. The whole point of that project was to
> reduce the total number of memcopy operations for an end-to-end speed-up.
>
> That model of letting Drill plug into other tools would be a GREAT use
> case for Drill.
>
> Jim
>
> On Wed, Jan 30, 2019 at 2:17 AM Paul Rogers <[email protected]> wrote:
>
>> Hi Aman,
>>
>> Thanks for sharing the update. Glad to hear things are still percolating.
>>
>> I think Drill is an underappreciated treasure for doing queries in the
>> complex systems that folks seem to be building today. The ability to read
>> multiple data sources is something that maybe only Spark can do as well.
>> (And Spark can't act as a general-purpose query engine like Drill can.)
>> Adding Arrow support for input and output would build on this advantage.
>>
>> I wonder if the output (client) side might be a great first step. It could
>> be built as a separate app just by combining Arrow and the Drill client
>> code. That would let lots of Arrow-aware apps query data with Drill
>> rather than having to write their own readers, their own filters, their own
>> aggregators and, in the end, their own query engine.
>>
>> Charles was asking about Summer of Code ideas. This might be one: a
>> stand-alone Drill-to-Arrow bridge. I think Arrow has an RPC layer; add that,
>> and any Arrow tool in any language could talk to Drill via the bridge.
>>
>> Thanks,
>> - Paul
>>
>> On Tuesday, January 29, 2019, 1:54:30 PM PST, Aman Sinha <[email protected]> wrote:
>>
>> Hi Charles,
>> You may have seen the talk given at the Drill Developer Day [1] by
>> Karthik and me; look for the slides on 'Drill-Arrow Integration', which
>> describe two high-level options and what the integration might entail.
>> Option 1 corresponds to what you and Paul are discussing in this thread;
>> Option 2 is the deeper integration. We do plan to work on one of them (not
>> finalized yet), but it will likely be after 1.16.0, since Statistics support
>> and Resource Manager related tasks (also discussed at the
>> Developer Day) are consuming our time. If you are interested in
>> contributing/collaborating, let me know.
>>
>> [1] https://drive.google.com/drive/folders/17I2jZq2HdDwUDXFOIg1Vecry8yGTDWhn
>>
>> Aman
