Re: About integration of drill and arrow

2020-01-14 Thread Charles Givre
Hi Andy, Paul, I would think that the machine learning "pipeline" would be a great use case for this. From my experience, Spark is not the easiest to do data manipulation with, especially if you have complex data pipelines. This is where Drill can really excel, so my thought is that if you ar

Re: About integration of drill and arrow

2020-01-13 Thread Paul Rogers
Thanks Andy! Very helpful. You have hit on one of the questions that we've been wrestling with: which tools would consume Drill data as Arrow? More generally, what are the use cases for Arrow data interchange? Flight makes sense for transferring large data sets, such as in exchanges within a d

Re: About integration of drill and arrow

2020-01-13 Thread Andy Grove
Hi Paul, There is a test Flight server in the Arrow Java project [1] that might be a good starting point, although I haven't used it myself. I was looking at Arrow Flight for my Ballista PoC [2] although I don't really have time to spend on that right now. I'm less sure of the value of having an

Re: About integration of drill and arrow

2020-01-13 Thread Paul Rogers
Hi Andy & Charles, We've discussed two ways for Drill to interface with Arrow, as an input or as an output: Arrow Producer --> Drill --> Arrow Consumer. Given how Drill works, the easier of the two is to create a storage plugin to read from an Arrow Producer, perhaps using Arrow Flight (than

Re: About integration of drill and arrow

2020-01-13 Thread Paul Rogers
Hi Igor, Thanks much for volunteering to create some POCs for our various options! It is not entirely obvious what we want to test, so let's think about it a bit. We want to identify those areas that are either the biggest risk or benefit to performance. We want to do that without the cost of a

Re: About integration of drill and arrow

2020-01-13 Thread Andy Grove
I just started working with Drill and I am a PMC member of Apache Arrow. I am in the process of writing my first storage plugin for Drill, and I think it would be interesting to build a storage plugin for the Apache Arrow Flight protocol as a way for Drill to query Arrow data, although I'm not sure

Re: About integration of drill and arrow

2020-01-13 Thread Igor Guzenko
Hi Paul and Volodymyr, Thank you very much Volodymyr and Paul for defining a good migration strategy. It really should work for a smooth migration. What I also really like about the discussion is that excellent questions appeared: - Aren't we just suffering from premature optimizations? - Were

Re: About integration of drill and arrow

2020-01-12 Thread Volodymyr Vysotskyi
Hi Paul, Thanks for summarizing, it looks even better than my previous letters. To answer Igor's question regarding conversion for joins, I imagined it in the following way. Let's look at a simple example first: Join / \ DrillScanConvert operator (Arrow ->

Re: About integration of drill and arrow

2020-01-12 Thread Charles Givre
Hello All, Glad to see this discussion happening! I apologize for the long email. I thought I would contribute my .02 here, which are more strategic than technical. When I first heard about Arrow a few years ago, I was very excited about the promise it had for enabling data interchange. From w

Re: About integration of drill and arrow

2020-01-12 Thread Paul Rogers
Hi All, As you've seen, I've been suggesting that we consider multiple choices for our internal data representation beyond just the current value vector layout and the "obvious" Arrow layout. And that we consider our options based on where we see Drill adding value in the open source community

Re: About integration of drill and arrow

2020-01-12 Thread Paul Rogers
Hi Volodymyr, You made a number of excellent points that we should remember as we continue our discussion. If I may paraphrase: 1. A conversion of our internal data layout will be complex. We can't expect to do it in a single step. Some readers may never convert. For a while, at least in a de

Re: About integration of drill and arrow

2020-01-10 Thread Paul Rogers
Hi Igor, You are right that Drill vectors are also subject to fragmentation (both internal and external.) We want to fix that bug. So, it does not help us to move from one implementation that suffers from fragmentation to another one that also suffers from fragmentation. In another e-mail you

Re: About integration of drill and arrow

2020-01-10 Thread Paul Rogers
Hi Igor, You asked about the fixed-size block idea. This is the classic DB memory management mechanism: a "buffer pool" consisting of some number of fixed-size blocks. Memory allocation is simply a matter of grabbing a buffer from the pool. Freeing memory returns the buffer to the pool. Since a
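
A minimal sketch of the buffer-pool idea described in the message above, assuming a toy in-memory pool. The class and method names (FixedSizeBufferPool, allocate, release) are hypothetical and chosen for illustration only; this is not Drill's or Arrow's allocator. It only shows that allocation takes a block from the pool and freeing returns it, with every block the same size.

```python
class FixedSizeBufferPool:
    """Toy buffer pool: all blocks share one size, so any free block fits any request."""

    def __init__(self, block_size: int, block_count: int) -> None:
        self.block_size = block_size
        # Pre-allocate every block up front; the pool never grows or shrinks.
        self._free = [bytearray(block_size) for _ in range(block_count)]

    def allocate(self) -> bytearray:
        # Allocation is just grabbing a free block from the pool.
        if not self._free:
            raise MemoryError("buffer pool exhausted")
        return self._free.pop()

    def release(self, block: bytearray) -> None:
        # Freeing returns the block to the pool for reuse.
        if len(block) != self.block_size:
            raise ValueError("block does not belong to this pool")
        self._free.append(block)

    @property
    def free_blocks(self) -> int:
        return len(self._free)


# Hypothetical usage: 4 blocks of 4 KiB each.
pool = FixedSizeBufferPool(block_size=4096, block_count=4)
a = pool.allocate()
b = pool.allocate()
pool.release(a)
c = pool.allocate()  # reuses the block that was just released
```

Because a released block is identical in size to every other block, it can satisfy the next request directly, which is the property the fixed-size design relies on.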

Re: About integration of drill and arrow

2020-01-10 Thread Igor Guzenko
Hi Paul, I would like to add that, from your wiki, it seems that Drill vectors also have the same fragmentation issues described as the first problem. So I don't think that it can be a reason to abandon Arrow completely now. About the second problem, I agree that this might be a big issue. But it seems

Re: About integration of drill and arrow

2020-01-10 Thread Igor Guzenko
Hello All, Volodymyr, from your last response, I can figure out that what you are suggesting is not the actual integration plan, right? To me, it sounds like we can create some POC to evaluate whether the migration to Arrow makes sense. And if it actually makes sense, integrate through the long wa

Re: About integration of drill and arrow

2020-01-10 Thread Igor Guzenko
Hello Paul, Thank you very much for your active participation in this discussion. I agree that we shouldn't blindly accept Arrow as the only option in the world. Also, I would like to learn more about the fixed-size blocks. So I'll read the paper and hope I'll have some related ideas to discuss la

Re: About integration of drill and arrow

2020-01-10 Thread Paul Rogers
Hi All, Glad to see the Arrow discussion heating up and that it is causing us to ask deeper questions. Here I want to get a bit techie on everyone and highlight two potential memory management problems with Arrow. First: memory fragmentation. Recall that this is how we started on the EVF path

Re: About integration of drill and arrow

2020-01-10 Thread Paul Rogers
Hi Volodymyr, Thanks much for the explanation. Your proposal is a good way for us to move to Arrow step-by-step -- if that is what we choose to do. In your proposal, the bulk of Drill code would be the same between current vectors and Arrow. Our memory API (whatever we choose to use) would prov

Re: About integration of drill and arrow

2020-01-10 Thread Paul Rogers
Hi Igor, +1! Very well said! This is exactly the discussion we should have. Perhaps we can drive towards an overall project goal: a vision that helps us choose which of the many options we should select. Here is a suggestion: Drill should become to queries what Python is to data science: a fle

Re: About integration of drill and arrow

2020-01-10 Thread Volodymyr Vysotskyi
Hi Paul and Igor, It is great that the discussion has touched on high-level questions about the effort and benefits of moving to Arrow. The main arguments for moving to Arrow for me were possible performance improvements (perhaps with Gandiva usage) and significant codebase improvements (perhaps wit

Re: About integration of drill and arrow

2020-01-10 Thread Igor Guzenko
Hello Drill Developers and Drill Users, This discussion started as migration to Arrow but uncovered questions about strategic plans for moving towards Apache Drill 2.0. Below are my personal thoughts on what we, as developers, should do to offer Drill users a better experience: 1. High performant bu

Re: About integration of drill and arrow

2020-01-09 Thread Paul Rogers
Hi Volodymyr, All good points. The Arrow/Drill conversion is a good option, especially for readers and clients. Between operators, such conversion is likely to introduce performance hits. As you know, the main feature that differentiates one query engine from another is performance, so adding c

Re: About integration of drill and arrow

2020-01-09 Thread Volodymyr Vysotskyi
Hi all, Glad to see that this discussion became active again! I have some comments regarding the steps for moving from Drill Vectors to Arrow Vectors. No doubt that using EVF for all operators and readers instead of value vectors will simplify things a lot. But considering the target goal - inte

Re: About integration of drill and arrow

2020-01-09 Thread Igor Guzenko
Hi Paul, Though I have very limited knowledge about Arrow at the moment, I can highlight a few advantages of trying it: 1. Allows fixing all the long-standing nullability issues and provides better integration for storage plugins like Hive. https://jira.apache.org/jira/browse/DRILL-

Re: About integration of drill and arrow

2020-01-08 Thread Paul Rogers
Hi Igor, With the background stuff out of the way, we can now discuss the gist of your idea: inserting an API layer between the memory layout (vector implementation) and the rest of Drill. As noted, this is a good idea for many reasons. One of the most compelling reasons is that, with this approa

Re: About integration of drill and arrow

2020-01-08 Thread Paul Rogers
Hi Igor, You mentioned EVF. For those who are newer to the project, let me recap the history of EVF and how it fits into the Arrow picture. The original idea of value vectors was to create long blocks of data that we can load into the CPU cache, then apply operations upon without CPU cache mis

Re: About integration of drill and arrow

2020-01-08 Thread Paul Rogers
Hi Igor, Before diving into design issues, it may be worthwhile to think about the premise: should Drill adopt Arrow as its internal memory layout? This is the question that the team has wrestled with since Arrow was launched. Arrow has three parts. Let's think about each. First is a direct me

Re: About integration of drill and arrow

2020-01-08 Thread Paul Rogers
Hi Igor, Bingo! You clearly explained the idea that has been simmering since starting on the "Row Set" framework (which evolved into EVF.) Some of the early ideas are in [1]. At one of the Drill Developer Days, there was brief discussion about the approach you propose: creating an API on top of

Re: About integration of drill and arrow

2020-01-08 Thread Igor Guzenko
Hello Paul, I totally agree that integrating Arrow by simply replacing Vectors usage everywhere will cause a disaster. After the first look at the new Enhanced Vector Framework (EVF) and based on your suggestions I think I have an idea to share. In my opinion, the integration can be done in the two

Re: About integration of drill and arrow

2019-12-09 Thread Paul Rogers
Hi All, Would be good to do some design brainstorming around this. Integration with other tools depends on the APIs (the first two items I mentioned.) Last time I checked (more than a year ago), memory layout of Arrow is close to that in Drill; so conversion is around "packaging" and metadata,

Re: About integration of drill and arrow

2019-12-09 Thread Charles Givre
Hi Igor, That would be really great if you could see that through to completion. IMHO, the value from this is not so much performance related but rather the ability to use Drill to gather and prep data and seamlessly "hand it off" to other platforms for machine learning. -- C

Re: About integration of drill and arrow

2019-12-09 Thread Igor Guzenko
Hello Nai and Paul, I would like to contribute full Apache Arrow integration. Thanks, Igor

Re: About integration of drill and arrow

2019-12-08 Thread Paul Rogers
Hi Nai Yan, Integration is still in the discussion stages. Work has been progressing on some foundations which would help that integration. At the Developer's Day we talked about several ways to integrate. These include: 1. A storage plugin to read Arrow buffers from some source so that you cou

About integration of drill and arrow

2019-12-08 Thread Nai Yan.
Greetings, As mentioned at Drill Developer Day 2018, there's a plan for Drill to integrate Arrow (Gandiva from Dremio). I was wondering how it is going. Thanks in advance. Nai Yan