Hi Paul,

Thanks for summarizing; it reads even better than my previous letters.
To answer Igor's question about the conversion for a join, I imagined it in the following way. Let's look at a simple example first:

               Join
              /    \
      DrillScan     Convert operator (Arrow -> Drill)
                         |
                    ArrowEvfScan

So the EVF API may be used in the Convert operator to create a row set from the Arrow vectors and populate that row set in Drill's format. These conversion operators may be inserted at planning time using trait-set logic (similar to how sort or distribution operators are inserted where required).
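To make this a bit more concrete, below is a rough sketch of what the body of such a Convert operator could look like for a single nullable INT column. The interfaces are only illustrative stand-ins for an Arrow column reader and the EVF row-set writer; the names are not the real Drill or Arrow APIs:

    // Illustrative stand-ins only, not the actual Drill or Arrow classes.
    interface ArrowColumnReader {          // would wrap an Arrow vector
      int valueCount();
      boolean isNull(int index);
      int getInt(int index);
    }

    interface DrillRowWriter {             // would wrap an EVF row-set loader
      void startRow();
      void setInt(String column, int value);
      void setNull(String column);
      void saveRow();
    }

    // The Convert operator reads the incoming Arrow batch value by value and
    // populates a Drill row set through the writer, so operators above the
    // converter never see Arrow vectors.
    class ArrowToDrillConvert {
      void convertBatch(String column, ArrowColumnReader in, DrillRowWriter out) {
        for (int i = 0; i < in.valueCount(); i++) {
          out.startRow();
          if (in.isNull(i)) {
            out.setNull(column);
          } else {
            out.setInt(column, in.getInt(i));
          }
          out.saveRow();
        }
      }
    }

In practice the copy would be specialized per column type; the point is only that the conversion can live behind the EVF API rather than inside each operator. (A second sketch at the very end of this mail, after the quoted thread, shows how such converters could be inserted between neighboring operators that produce different batch formats.)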
Kind regards,
Volodymyr Vysotskyi

On Sun, Jan 12, 2020 at 11:50 PM Paul Rogers <par0...@yahoo.com.invalid> wrote:

> Hi Volodymyr,
>
> You made a number of excellent points that we should remember as we continue our discussion. If I may paraphrase:
>
> 1. A conversion of our internal data layout will be complex. We can't expect to do it in a single step. Some readers may never convert. For a while, at least in a development branch, possibly in master, we must be able to run the two systems together.
>
> 2. Running two systems together requires that conversion between formats occur at some point in the DAG.
>
> 3. We have discussed an internal API to isolate operators from the details of the internal data layout. The column accessors (the core of EVF) are helpful for operators that work with column values, but not for bulk operations (such as exchanges, maybe Flatten, maybe Implicit Join, etc.). Specialized operators (the MapR readers, possibly Parquet) may require operations that do not yet appear in the column accessors. (The same can probably be said for code gen, which does some pretty unusual things.)
>
> 4. Before we tackle the full conversion, we should try Arrow (or whatever option we choose) in selected scenarios to verify the benefit we believe we will receive.
>
> You also suggest a strategy to address the above requirements. Again, to paraphrase:
>
> 1. Evaluate internal data layout alternatives to learn their advantages, and to identify what a common API might need beyond what we have today.
>
> 2. Pick a data layout based on the facts discovered above. Let's call this the "next gen" option.
>
> 3. Enhance/develop the required internal APIs.
>
> 4. Develop a conversion operator that we can insert between value vector-based operators and next gen-based operators.
>
> 5. Convert operators step-by-step, testing each for performance and functionality. Build on the APIs from step 3 and the conversion "shims" from step 4.
>
> Is this an accurate summary of your comments?
>
> Thanks,
> - Paul
>
>
> On Friday, January 10, 2020, 9:55:49 AM PST, Volodymyr Vysotskyi <volody...@apache.org> wrote:
>
> Hi Paul and Igor,
>
> It is great that the discussion has turned to the high-level questions of the effort and benefits of moving to Arrow.
> The main arguments for moving to Arrow, for me, were the possible performance improvements (perhaps with Gandiva) and significant codebase improvements (perhaps with additional bug fixes) compared to Drill's current vector code.
>
> In my previous letter, I concentrated on the case where we agree to move completely to Arrow and proposed an approach which, in my opinion, is better and easier to split into steps.
>
> I understand your concerns about having the two systems mixed.
> I don't propose recommending that Drill users enable Arrow while a lot of conversions between batches would be happening.
> But without trying to use the Arrow classes step by step, we risk ending up with dozens of unresolved Arrow-related issues, a lot of changes in unmerged branches, and merge conflicts after every new commit to the master branch.
>
> Regarding the unnecessary complexity of adapting Arrow and Drill vectors to work together: this conversion may be done at the EVF API level, with no need to dive into the vector implementation details and issues of each side. We just need to extend the EVF API if required and make sure that both implementations still work correctly with that API.
> Even with your strategy, we would have to adapt EVF to work with Arrow; I just propose doing it at the beginning and additionally preserving EVF over Drill value vectors.
>
> Also, once we are able to switch between Arrow and Drill even for some operator kinds, we will be able to measure how operator performance changes and decide whether to continue the integration based on real numbers.
> And if we are not satisfied with the performance changes, we may stop the integration before rewriting all of the project code. Even in that case we would have a minimal set of changes, enough to use Arrow as a data source and data sink for easier integration with other projects. I think this is a required minimum we should provide independently of the decision about deeper integration.
>
> Kind regards,
> Volodymyr Vysotskyi
>
>
> > > On Thursday, January 9, 2020, 05:57:52 AM PST, Volodymyr Vysotskyi <volody...@apache.org> wrote:
> > >
> > > Hi all,
> > >
> > > Glad to see that this discussion became active again!
> > >
> > > I have some comments regarding the steps for moving from Drill vectors to Arrow vectors.
> > >
> > > No doubt that using EVF for all operators and readers instead of value vectors will simplify things a lot.
> > > But considering the target goal, integration with Arrow, it may be the main show-stopper for it.
> > > There may be some operators which would be hard to adapt to EVF; for example, I think the Flatten operator will be among them, since its implementation is deeply tied to value vectors.
> > > Also, it requires moving all storage and format plugins to EVF, which may also be problematic; for example, some plugins like MaprDB have specific features, and they must be considered when moving to EVF.
> > > Some other plugins are so obsolete that I'm not sure they still work or that anyone still uses them, so besides moving them to EVF, they would have to be resurrected to verify that they weren't broken further.
> > >
> > > This is a huge piece of work, and only after it could we proceed with the next step: integrating Arrow into EVF and then handling new Arrow-related issues for all the operators and readers at the same time.
> > >
> > > I propose to update these steps a little bit.
> > >
> > > 1. I agree that at first we should extract the EVF-related classes into a separate module.
> > >
> > > 2. As the next step, I propose extracting an EVF API which doesn't depend on the vector implementation (Drill vectors or Arrow ones).
> > >
> > > 3. After that, introduce a module with Arrow which also implements this EVF API.
> > >
> > > 4. Introduce transformers that will be able to convert from Drill vectors into Arrow vectors and vice versa.
> > > These transformers may be implemented to work using EVF abstractions instead of operating on specific vector implementations.
> > >
> > > 5.1. At this point, we can introduce Arrow connectors to fetch data in the Arrow format or return it in that format, using the transformers from step 4.
> > >
> > > 5.2. Also, at this point, we may start rewriting operators to EVF and switching the EVF implementation from the one based on Drill vectors to the one which uses Arrow vectors, or switch implementations for the existing EVF-based format plugins and fix newly discovered issues in Arrow. Since at this point we will have operators which use the Arrow format and operators which use the Drill vectors format, we should insert the conversion operators introduced in step 4 between every pair of adjacent operators that return batches in different formats.
> > >
> > > I know that such an approach requires some additional work, like introducing the transformers from step 4, and may cause some performance degradation when the format transformation is complex for some types and when we still have sequences of operators with different formats.
> > >
> > > But with this approach, transitioning to Arrow wouldn't be blocked until everything is moved to EVF, it would be possible to transition step by step, and Drill would still be able to switch between formats if required.
> > >
> > > Kind regards,
> > > Volodymyr Vysotskyi
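P.S. To illustrate the idea of inserting converters between operators that produce different batch formats (step 5.2 of my January 9 letter quoted above, and the trait-set insertion mentioned at the top of this mail), here is a small sketch over a simplified linear pipeline. In Drill the insertion would actually be driven by Calcite trait sets during planning rather than by a post-pass, and all names below are illustrative, not real planner code:

    // Illustrative sketch only; not real Drill planner code.
    import java.util.ArrayList;
    import java.util.List;

    enum BatchFormat { DRILL_VECTORS, ARROW_VECTORS }

    // A simplified operator: it consumes and produces batches in one format.
    class Op {
      final String name;
      final BatchFormat format;
      Op(String name, BatchFormat format) { this.name = name; this.format = format; }
    }

    class FormatConversionPass {
      // Walks a linear pipeline from leaf to root and inserts a Convert
      // operator wherever the producer's format differs from the consumer's.
      List<Op> insertConverters(List<Op> pipeline) {
        List<Op> result = new ArrayList<>();
        for (Op op : pipeline) {
          if (!result.isEmpty()) {
            Op producer = result.get(result.size() - 1);
            if (producer.format != op.format) {
              // e.g. ArrowEvfScan (ARROW_VECTORS) feeding a Join (DRILL_VECTORS):
              // the inserted converter consumes the producer's format and
              // produces the consumer's format.
              result.add(new Op("Convert(" + producer.format + " -> " + op.format + ")", op.format));
            }
          }
          result.add(op);
        }
        return result;
      }
    }

With the real trait-set approach, the planner would do the same thing over an arbitrary DAG and only where a conversion is actually required, so only the boundaries between Arrow-based and Drill-vector-based operators pay the conversion cost.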