Hi Weijie, Thanks much for the update on your Gandiva work. It is great work.
Can you say more about how you are doing the integration? As you mentioned, the memory layout of Arrow's null vector differs from the "is set" vector in Drill. How did you work around that?

The Project operator is pretty simple if we are just copying or removing columns. However, much of Project deals with invoking Drill-provided functions: simple ones (add two ints) and complex ones (perform a regex match). To be useful, the integration would have to mimic Drill's behavior for each of these many functions.

Project currently works row-by-row. But to get the maximum performance, it would need to work column-by-column to take full advantage of vectorization. Doing that would require large changes to the code that sets up codegen and iterates over the batch.

For operators such as Sort, the only vector-based operations are 1) sort a batch using defined keys to get an offset vector, and 2) create a new vector by copying values, row-by-row, from one batch to another according to the offset vector.

The join and aggregate operations are even more complex, as are the partition senders and receivers.

Can you tell us where you've used Gandiva? Which operators? How did you handle the function integration? I am very curious how you were able to solve these problems.

Thanks,
- Paul

On Wednesday, April 3, 2019, 11:51:34 PM PDT, weijie tong <tongweijie...@gmail.com> wrote:

Hi,

Gandiva is a subproject of Arrow. By using LLVM codegen and SIMD techniques, Gandiva could achieve better query performance. Arrow and Drill have similar columnar memory formats. The main difference now is the null representation. Arrow has also made significant changes to its ValueVector. Adopting Arrow to replace Drill's VV has been discussed before; that would be a big job. But to leverage Gandiva by working at the physical memory address level, the work could be relatively small. I have now done the integration work on our own branch by making some changes to the Arrow branch, and have filed DRILL-7087 and ARROW-4819.
The main change in ARROW-4819 is to make some package-level methods public. But the Arrow community does not seem to plan to accept this change. Their advice is to have an Arrow branch. So what do you think?

1. Maintain our own branch of Arrow.
2. Wait for the Arrow integration to be completed.

Or are there other ideas?
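
To make the null-representation difference raised in this thread concrete, here is a minimal Java sketch. It assumes Drill's "is set" vector holds one byte per value (1 = set, 0 = null) and that Arrow's validity buffer is a bitmap with one bit per value, least-significant bit first (1 = valid). The class and method names are hypothetical and are not part of Drill, Arrow, or Gandiva. It works on plain byte arrays rather than the real vector classes or direct memory, so it shows only the layout conversion, not how the actual integration was done.

```java
// Hypothetical helper for illustration only; not a Drill or Arrow API.
final class NullLayoutSketch {

    // Pack a byte-per-value "is set" array (assumed: 1 = set, 0 = null)
    // into a bit-per-value validity bitmap (assumed: LSB first, 1 = valid).
    static byte[] isSetToValidityBitmap(byte[] isSet) {
        byte[] bitmap = new byte[(isSet.length + 7) / 8];
        for (int i = 0; i < isSet.length; i++) {
            if (isSet[i] != 0) {
                // Value i lives in byte i / 8, at bit position i % 8.
                bitmap[i >> 3] |= (byte) (1 << (i & 7));
            }
        }
        return bitmap;
    }

    // Expand a validity bitmap back into a byte-per-value "is set" array.
    static byte[] validityBitmapToIsSet(byte[] bitmap, int valueCount) {
        byte[] isSet = new byte[valueCount];
        for (int i = 0; i < valueCount; i++) {
            isSet[i] = (byte) ((bitmap[i >> 3] >> (i & 7)) & 1);
        }
        return isSet;
    }
}
```

Because the two layouts differ in size (one byte versus one bit per value), a conversion like this implies a copy of the null information per batch, while the data buffers themselves could in principle be shared by address if their layouts match.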