Re: [DISCUSS] Restarting the Arrow Conversation

Charles Givre Mon, 03 Jan 2022 11:08:08 -0800

Thanks Ted for the perspective!  I had always wished to be a "fly on the wall" 
in those conversations.  :-)
-- C


> On Jan 3, 2022, at 11:00 AM, Charles Givre <[email protected]> wrote:
> 
> Hello all, 
> A recently closed PR [1] included a discussion between 
> z0ltrix, James Turton and a few others about integrating Drill with Apache 
> Arrow and why it was never done.  I'd like to share my perspective 
> as someone who has been around Drill for some time but also as someone who 
> never worked for MapR or Dremio.  This just represents my understanding of 
> events as an outsider, and I could be wrong about some or all of this.   
> Please forgive (or correct) any inaccuracies. 
> 
> When I first learned of Arrow and the idea of integrating Arrow with Drill, 
> the thing that interested me the most was the ability to move data between 
> platforms without having to serialize/deserialize the data.  From my 
> understanding, MapR did some research and didn't find a significant 
> performance advantage and hence didn't really pursue the integration.  The 
> other side of it was that it would require a significant amount of work to 
> refactor major parts of Drill. 
> 
> I don't know the internal politics, but this was one of the major points of 
> divergence between Dremio and Drill.
> 
> With that said, there was a renewed discussion on the list [2] where Paul 
> Rogers proposed what he described as a "Crude but Effective" approach to an 
> Arrow integration.  
> 
> The full message is at the link [2], but here is part of Paul's email:
> 
>> Charles, just brainstorming a bit, I think the easiest way to start is to 
>> create a simple, stand-alone server that speaks Arrow to the client, and 
>> uses the native Drill client to speak to Drill. The native Drill client 
>> exposes Drill value vectors. One trick would be to convert Drill vectors to 
>> the Arrow format. I think that data vectors are the same format. Possibly 
>> offset vectors. I think Arrow went its own way with null-value (Drill's 
>> is-set) vectors. So, some conversion might be a no-op, others might need to 
>> rewrite a vector. Good thing, this is purely at the vector level, so would 
>> be easy to write. The next issue is the one that Parth has long pointed out: 
>> Drill and Arrow each have their own memory allocators. How could we share a 
>> data vector between the two? The simplest initial solution is just to copy 
>> the data from Drill to Arrow. Slow, but transparent to the client.
>> 
>> A crude first-approximation of the development steps: 
>> 1. Create the client shell server. 
>> 2. Implement the Arrow client protocol. Need some way to accept a query and 
>> return batches of results. 
>> 3. Forward the query to Drill using the native Drill client. 
>> 4. As a first pass, copy vectors from Drill to Arrow and return them to the 
>> client. 
>> 5. Then, solve that memory allocator problem to pass data without copying.
> 
> One point that Paul made was that these pieces are fairly discrete and could 
> be implemented without refactoring major components of Drill.  Of course, 
> this could be something for Drill 2.0.  At a minimum, could we take the 
> conversation off of the PR and put it in the email list? ;-)
> 
> Let's discuss... All ideas are welcome!
> 
> Best,
> -- C
> 
> 
> [1]: https://github.com/apache/drill/pull/2412
> [2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l
> 
> 
>
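
On the vector conversion Paul describes in the quoted message, here is a minimal sketch of what "rewriting" a null-value vector might look like, in plain Python with no Drill or Arrow dependency. It assumes Drill's is-set vector stores one byte per value (nonzero meaning the value is set); that is an assumption and should be checked against the Drill source. The target layout is the Arrow validity bitmap: one bit per value, least-significant bit first, 1 meaning valid. The function names are hypothetical, and the buffer padding the Arrow format recommends is left out.

```python
def pack_is_set_to_validity_bitmap(is_set: bytes) -> bytes:
    """Pack one-byte-per-value flags into an LSB-first bitmap.

    Assumption: each input byte is a Drill-style is-set flag
    (nonzero = value present). Output follows the Arrow validity
    bitmap convention (bit i of the bitmap describes value i).
    """
    bitmap = bytearray((len(is_set) + 7) // 8)  # one bit per value, rounded up
    for i, flag in enumerate(is_set):
        if flag:
            bitmap[i // 8] |= 1 << (i % 8)  # LSB-first within each byte
    return bytes(bitmap)


def unpack_validity_bitmap(bitmap: bytes, length: int) -> bytes:
    """Inverse of the above: expand a bitmap back to one byte per value."""
    return bytes((bitmap[i // 8] >> (i % 8)) & 1 for i in range(length))


if __name__ == "__main__":
    flags = bytes([1, 0, 1, 1, 0, 0, 0, 0, 1])  # nine values, four of them set
    packed = pack_is_set_to_validity_bitmap(flags)
    print(packed.hex())
    print(list(unpack_validity_bitmap(packed, len(flags))))
```

This is the copying approach from step 4 of Paul's list, applied to one buffer. Data vectors, if they really do share a format, would need no such rewrite.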
