Hi Weijie,

Thanks much for the update on your Gandiva work. It is great work.

Can you say more about how you are doing the integration?

As you mentioned, the memory layout of Arrow's null vector differs from the "is 
set" vector in Drill. How did you work around that?
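
To make the layout question concrete, here is a minimal sketch. It assumes 
Drill keeps one byte per value (1 = set) and Arrow keeps one bit per value, 
least-significant bit first (1 = valid). The function names are made up for 
illustration, and this is not necessarily how your integration handles it.

```python
def is_set_bytes_to_validity_bitmap(is_set: bytes) -> bytes:
    """Pack one-byte-per-value flags into a bit-per-value bitmap, LSB first."""
    bitmap = bytearray((len(is_set) + 7) // 8)  # round up to whole bytes
    for i, flag in enumerate(is_set):
        if flag:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def validity_bitmap_to_is_set_bytes(bitmap: bytes, value_count: int) -> bytes:
    """Unpack a bit-per-value bitmap back into one byte per value."""
    return bytes((bitmap[i // 8] >> (i % 8)) & 1 for i in range(value_count))
```

A conversion like this touches every value, so it is a copy rather than a 
reinterpretation of the same memory.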

The Project operator is pretty simple if we are just copying or removing 
columns. However, much of Project deals with invoking Drill-provided functions: 
simple ones (add two ints) and complex ones (perform a regex match). To be 
useful, the integration would have to mimic Drill's behavior for each of these 
many functions.

Project currently works row-by-row. But, to get maximum performance, it would 
have to work column-by-column to take full advantage of vectorization. Doing 
that would require large changes to the code that sets up codegen and iterates 
over the batch.
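
As a rough illustration of the two iteration orders, here is a sketch for a 
hypothetical projection of two expressions, a + b and a * 2. The function 
names and the expressions are made up. In Python the loops only show the 
evaluation order; the actual benefit would come from SIMD in compiled code.

```python
def project_row_by_row(a, b):
    """Evaluate every expression for row 0, then row 1, and so on."""
    total, doubled = [], []
    for i in range(len(a)):
        total.append(a[i] + b[i])
        doubled.append(a[i] * 2)
    return total, doubled


def project_column_by_column(a, b):
    """Evaluate one expression over the whole column before the next one."""
    total = [x + y for x, y in zip(a, b)]
    doubled = [x * 2 for x in a]
    return total, doubled
```

Both produce the same output columns. Only the second form gives each 
expression a tight loop over contiguous values.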


For operators such as Sort, the only vector-based operations are 1) sort a 
batch using defined keys to get an offset vector, and 2) create a new vector by 
copying values, row-by-row, from one batch to another according to the offset 
vector.
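
The two Sort steps can be sketched roughly as follows, assuming a single key 
column and plain lists standing in for vectors. The function names are 
hypothetical.

```python
def sort_offsets(key_column):
    """Step 1: return row offsets in sorted key order, leaving data in place."""
    return sorted(range(len(key_column)), key=key_column.__getitem__)


def copy_by_offsets(column, offsets):
    """Step 2: build a new column by copying values in offset order."""
    return [column[i] for i in offsets]
```

The same offsets are applied to every column in the batch so that rows stay 
aligned across columns.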

The join and aggregate operations are even more complex, as are the partition 
senders and receivers.

Can you tell us where you've used Gandiva? Which operators? How did you handle 
the function integration? I am very curious how you were able to solve these 
problems.


Thanks,

- Paul

 

    On Wednesday, April 3, 2019, 11:51:34 PM PDT, weijie tong 
<tongweijie...@gmail.com> wrote:  
 
 Hi:

Gandiva is a subproject of Arrow. By using LLVM codegen and SIMD, Gandiva
can achieve better query performance. Arrow and Drill have similar columnar
memory formats. The main difference now is the null representation. Arrow
has also made large changes to the ValueVector. Adopting Arrow to replace
Drill's VV has been discussed before. That would be a big job. But
leveraging Gandiva by working at the physical memory address level would
take relatively little work.

I have now done the integration work on our own branch by making some
changes to an Arrow branch, and have filed DRILL-7087 and ARROW-4819. The
main change in ARROW-4819 is to make some package-level methods public. But
the Arrow community does not seem to plan to accept this change. Their
advice is to maintain an Arrow branch.

So what do you think?

1. Maintain our own branch of Arrow.
2. Wait for the complete Arrow integration.
Or are there other ideas?
