Thanks, Ted, for the perspective! I had always wished to be a "fly on the
wall" in those conversations. :-)
-- C

> On Jan 3, 2022, at 11:00 AM, Charles Givre <[email protected]> wrote:
>
> Hello all,
> There was a discussion in a recently closed PR [1] between z0ltrix, James
> Turton and a few others about integrating Drill with Apache Arrow and
> wondering why it was never done. I'd like to share my perspective as
> someone who has been around Drill for some time but also as someone who
> never worked for MapR or Dremio. This just represents my understanding of
> events as an outsider, and I could be wrong about some or all of this.
> Please forgive (or correct) any inaccuracies.
>
> When I first learned of Arrow and the idea of integrating Arrow with Drill,
> the thing that interested me the most was the ability to move data between
> platforms without having to serialize/deserialize the data. From my
> understanding, MapR did some research and didn't find a significant
> performance advantage and hence didn't really pursue the integration. The
> other side of it was that it would require a significant amount of work to
> refactor major parts of Drill.
>
> I don't know the internal politics, but this was one of the major points of
> divergence between Dremio and Drill.
>
> With that said, there was a renewed discussion on the list [2] where Paul
> Rogers proposed what he described as a "Crude but Effective" approach to an
> Arrow integration.
>
> This is in the email link, but here is part of Paul's email:
>
>> Charles, just brainstorming a bit, I think the easiest way to start is to
>> create a simple, stand-alone server that speaks Arrow to the client, and
>> uses the native Drill client to speak to Drill. The native Drill client
>> exposes Drill value vectors. One trick would be to convert Drill vectors to
>> the Arrow format. I think that data vectors are the same format. Possibly
>> offset vectors. I think Arrow went its own way with null-value (Drill's
>> is-set) vectors. So, some conversion might be a no-op, others might need to
>> rewrite a vector. Good thing, this is purely at the vector level, so would
>> be easy to write. The next issue is the one that Parth has long pointed out:
>> Drill and Arrow each have their own memory allocators. How could we share a
>> data vector between the two? The simplest initial solution is just to copy
>> the data from Drill to Arrow. Slow, but transparent to the client.
>>
>> A crude first-approximation of the development steps:
>> 1. Create the client shell server.
>> 2. Implement the Arrow client protocol. Need some way to accept a query and
>>    return batches of results.
>> 3. Forward the query to Drill using the native Drill client.
>> 4. As a first pass, copy vectors from Drill to Arrow and return them to the
>>    client.
>> 5. Then, solve that memory allocator problem to pass data without copying.
>
> One point that Paul made was that these pieces are fairly discrete and could
> be implemented without refactoring major components of Drill. Of course,
> this could be something for Drill 2.0. At a minimum, could we take the
> conversation off of the PR and put it on the email list? ;-)
>
> Let's discuss... All ideas are welcome!
>
> Best,
> -- C
>
> [1]: https://github.com/apache/drill/pull/2412
> [2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l
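
To make the "some vectors need a rewrite" point in Paul's note concrete, here is a rough sketch of the null-vector conversion. It assumes the Drill side stores its is-set flags as one byte per value (1 = set) and relies on the Arrow columnar format's validity bitmap, which packs one bit per value, least-significant bit first. The function names are made up for illustration, and this is plain Python over bytes, not the Drill or Arrow API.

```python
def is_set_to_validity_bitmap(is_set: bytes) -> bytes:
    """Pack one-byte-per-value is-set flags into an LSB-first bitmap.

    Assumption: any non-zero byte means "value is present".
    """
    bitmap = bytearray((len(is_set) + 7) // 8)  # one bit per value, rounded up
    for i, flag in enumerate(is_set):
        if flag:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def null_count(is_set: bytes) -> int:
    """Count null entries, which Arrow tracks alongside the bitmap."""
    return sum(1 for flag in is_set if not flag)


if __name__ == "__main__":
    flags = bytes([1, 0, 1, 1, 0, 0, 0, 0, 1])  # nine values, five of them null
    print(is_set_to_validity_bitmap(flags).hex(), null_count(flags))
```

Because the layout changes from bytes to bits, this vector cannot be shared as-is and has to be rewritten, unlike a fixed-width data vector whose buffer could in principle be handed over once the allocator question is solved.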
