Re: [DISCUSS] Restarting the Arrow Conversation

Z0ltrix Mon, 03 Jan 2022 11:37:22 -0800

Hi Charles, Ted, and the others here,

It is very interesting to hear about the evolution of Drill, Dremio and Arrow 
in that context, and thank you, Charles, for restarting the discussion.


I think, and James mentioned this in the PR as well, that Drill could benefit 
from the continuous progress the Arrow project has made since its separation 
from Drill. The Arrow community seems to be large, so I assume this will go on 
and on with improvements, new features, etc., but I do not have enough 
experience with Drill internals to have an idea of how much refactoring this 
would lead to.

In addition to that, I'm not aware of the current roadmap of Arrow and whether 
it would fit into Drill's roadmap. Maybe Arrow will go in a different direction 
than Drill, and what should we do then if Drill is bound to Arrow?

On the other hand, Arrow could help Drill reach wider adoption through clients 
like pyarrow, arrow-flight, various other programming languages, etc., and (I'm 
not sure about that) maybe there is a performance benefit if Drill uses Arrow 
to read data from HDFS (for example), uses Arrow to work with it during 
execution, and hands the vectors directly to my Python (for example) program 
via arrow-flight so that I can play around with Pandas, etc.

Just some thoughts I have had since I have used Dremio with pyarrow and Drill 
with ODBC connections.

Regards
Christian
\-------- Original Message --------
On 3 Jan 2022, 20:08, Charles Givre wrote:

>
>
>
> Thanks Ted for the perspective! I had always wished to be a "fly on the wall" 
> in those conversations. :-)
> \-- C
>
> > On Jan 3, 2022, at 11:00 AM, Charles Givre <cgi...@gmail.com> wrote:
> >
> > Hello all,
> > There was a discussion in a recently closed PR \[1\] between z0ltrix, 
> > James Turton and a few others about integrating Drill with Apache Arrow 
> > and why it was never done. I'd like to share my 
> > perspective as someone who has been around Drill for some time but also as 
> > someone who never worked for MapR or Dremio. This just represents my 
> > understanding of events as an outsider, and I could be wrong about some or 
> > all of this. Please forgive (or correct) any inaccuracies.
> >
> > When I first learned of Arrow and the idea of integrating Arrow with Drill, 
> > the thing that interested me the most was the ability to move data between 
> > platforms without having to serialize/deserialize the data. From my 
> > understanding, MapR did some research and didn't find a significant 
> > performance advantage and hence didn't really pursue the integration. The 
> > other side of it was that it would require a significant amount of work to 
> > refactor major parts of Drill.
> >
> > I don't know the internal politics, but this was one of the major points of 
> > divergence between Dremio and Drill.
> >
> > With that said, there was a renewed discussion on the list \[2\] where Paul 
> > Rogers proposed what he described as a "Crude but Effective" approach to an 
> > Arrow integration.
> >
> > This is in the email link but here was a part of Paul's email:
> >
> >> Charles, just brainstorming a bit, I think the easiest way to start is to 
> >> create a simple, stand-alone server that speaks Arrow to the client, and 
> >> uses the native Drill client to speak to Drill. The native Drill client 
> >> exposes Drill value vectors. One trick would be to convert Drill vectors 
> >> to the Arrow format. I think that data vectors are the same format. 
> >> Possibly offset vectors. I think Arrow went its own way with null-value 
> >> (Drill's is-set) vectors. So, some conversion might be a no-op, others 
> >> might need to rewrite a vector. Good thing, this is purely at the vector 
> >> level, so would be easy to write. The next issue is the one that Parth has 
> >> long pointed out: Drill and Arrow each have their own memory allocators. 
> >> How could we share a data vector between the two? The simplest initial 
> >> solution is just to copy the data from Drill to Arrow. Slow, but 
> >> transparent to the client.
> >>
> >> A crude first-approximation of the development steps:
> >> 1. Create the client shell server.
> >> 2. Implement the Arrow client protocol. Need some way to accept a query 
> >> and return batches of results.
> >> 3. Forward the query to Drill using the native Drill client.
> >> 4. As a first pass, copy vectors from Drill to Arrow and return them to 
> >> the client.
> >> 5. Then, solve that memory allocator problem to pass data without copying.
> >
> > One point that Paul made was that these pieces are fairly discrete and 
> > could be implemented without refactoring major components of Drill. Of 
> > course, this could be something for Drill 2.0. At a minimum, could we take 
> > the conversation off of the PR and put it in the email list? ;-)
> >
> > Let's discuss... All ideas are welcome!
> >
> > Best,
> > -- C
> >
> >
> > \[1\]: https://github.com/apache/drill/pull/2412
> > \[2\]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l
> >
> >
> >
