Hi Weijie,

Thanks much for the explanation. Sounds like you are making good progress.


For which operator is the filter pushed into the scan? Although Impala does 
this for all scans, AFAIK Drill does not. For example, the text and JSON 
readers do not handle filtering; filtering is instead done by the Filter 
operator in these cases. Perhaps you have your own special scan which handles 
filtering?


The concern in DRILL-6340 was that the user might do a project operation that 
causes the output batch to be much larger than the input batch. Someone 
suggested flatten as one example; string concatenation is another. The input 
batch might be large, and the result of the concatenation could be too large 
for available memory. So, the idea was to project the single input batch into 
two (or more) output batches to control batch size.
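
To put rough numbers on it, here is a small sketch of the split calculation 
(the class and names are made up for illustration; this is not Drill's actual 
batch-sizing code):

  // Illustrative only: split one input batch into several output batches so
  // each output batch stays within a memory budget, given an estimated
  // output row width (e.g. doubled by a concatenation).
  class OutputBatchSplitSketch {
    static int rowsPerOutputBatch(long targetOutputBytes, int estOutputRowWidth) {
      return (int) Math.max(1, targetOutputBytes / Math.max(1, estOutputRowWidth));
    }

    static int outputBatchCount(int inputRowCount, int rowsPerBatch) {
      return (inputRowCount + rowsPerBatch - 1) / rowsPerBatch;  // ceiling division
    }
  }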


I like how you've categorized the vectors into the set that Gandiva can 
project, and the set that Drill must handle. Maybe you can extend this idea for 
the case where input batches are split into multiple output batches.

Let Drill handle VarChar expressions that could increase column width (such as 
the concatenate operator). Let Drill decide the number of rows in the output 
batch. Then, for the columns that Gandiva can handle, project just those rows 
needed for the current output batch.
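
In rough pseudo-Java, the per-batch loop might look like this (the two 
interfaces are invented names just to show the split; they are not real Drill 
or Gandiva APIs):

  // Sketch: Drill evaluates the width-expanding VarChar expressions and
  // decides how many input rows fit in the current output batch; Gandiva
  // then projects only that row range for the columns it can handle.
  class HybridProjectSketch {
    interface DrillSideProjector {
      int projectUntilOutputFull(int startRow);   // returns rows consumed
    }
    interface GandivaSideProjector {
      void projectRange(int startRow, int rowCount);
    }

    static void project(int inputRowCount, DrillSideProjector drill,
                        GandivaSideProjector gandiva) {
      int start = 0;
      while (start < inputRowCount) {
        int rows = drill.projectUntilOutputFull(start);  // Drill picks row count
        gandiva.projectRange(start, rows);               // Gandiva fills its columns
        start += rows;                                   // next output batch
      }
    }
  }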

Your solution might also be extended to handle the Gandiva library issue. Since 
you are splitting vectors into the Drill group and the Gandiva group, if Drill 
runs on a platform without Gandiva support, or if the Gandiva library can't be 
found, just let all vectors fall into the Drill vector group.

If the user wants to use Gandiva, he/she could set a config option to point to 
the Gandiva library (and supporting files, if any), or use the existing 
LD_LIBRARY_PATH env. variable.
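
Setup could then probe Gandiva once and fall back if the native library does 
not load, something like this (sketch only; the class and method names are 
invented, but catching UnsatisfiedLinkError around the first native call is 
the usual pattern when a native library may be absent):

  class GandivaFallbackSketch {
    // Probe Gandiva once at setup. If the native library is missing or the
    // platform is unsupported, put every expression into the Drill group.
    static boolean gandivaUsable(Runnable firstGandivaCall) {
      try {
        firstGandivaCall.run();
        return true;
      } catch (UnsatisfiedLinkError | RuntimeException e) {
        return false;   // fall back to Drill's own code generation
      }
    }
  }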

Thanks,
- Paul

 

On Thursday, April 18, 2019, 11:45:08 PM PDT, weijie tong 
<tongweijie...@gmail.com> wrote:
Hi Paul:
Currently Gandiva only supports the Project and Filter operations. My work is
to integrate the Project operator, since most of the Filter work will be
pushed down to the Scan.

The Gandiva project interface works at the RecordBatch level. It accepts the
memory addresses of the input RecordBatch's vectors. Before that, it also
needs to construct a binary schema object to describe the input RecordBatch's
schema.
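
Roughly, the public Java binding looks like this (a sketch; exact signatures
may differ across Arrow/Gandiva versions):

  // Sketch of the Gandiva Java binding: build a Schema and an expression
  // tree, compile a Projector, then evaluate it against the input batch,
  // writing into pre-allocated output vectors.
  import java.util.Arrays;
  import org.apache.arrow.gandiva.evaluator.Projector;
  import org.apache.arrow.gandiva.exceptions.GandivaException;
  import org.apache.arrow.gandiva.expression.ExpressionTree;
  import org.apache.arrow.gandiva.expression.TreeBuilder;
  import org.apache.arrow.gandiva.expression.TreeNode;
  import org.apache.arrow.vector.types.pojo.ArrowType;
  import org.apache.arrow.vector.types.pojo.Field;
  import org.apache.arrow.vector.types.pojo.Schema;

  class GandivaProjectSketch {
    // Builds a Projector for "a + b" over two nullable INT32 columns.
    static Projector makeAddProjector() throws GandivaException {
      Field a = Field.nullable("a", new ArrowType.Int(32, true));
      Field b = Field.nullable("b", new ArrowType.Int(32, true));
      Schema schema = new Schema(Arrays.asList(a, b));

      TreeNode add = TreeBuilder.makeFunction("add",
          Arrays.asList(TreeBuilder.makeField(a), TreeBuilder.makeField(b)),
          new ArrowType.Int(32, true));
      ExpressionTree expr = TreeBuilder.makeExpression(add,
          Field.nullable("a_plus_b", new ArrowType.Int(32, true)));

      return Projector.make(schema, Arrays.asList(expr));
    }
    // projector.evaluate(arrowRecordBatch, outputVectors) then runs the
    // compiled projection over the input buffers / memory addresses.
  }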

The integration work mainly has two parts:
  1. At the setup step, find the expressions which can be handled by
Gandiva. The matched expressions will be evaluated by Gandiva; the others
will still be evaluated by Drill.
  2. Invoke the Gandiva native project method. Each matched expression's
ValueVector is given a corresponding ValueVector with Arrow's null
representation, and the null input vector's bits are set accordingly. The
same work is done for the output ValueVectors, transferring the Arrow output
null vectors back to Drill's null vectors (a rough sketch of this conversion
is below). Since the native method only cares about physical memory
addresses, invoking it is not hard work.
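
A rough sketch of that null-representation conversion, using plain byte
arrays just to show the logic (the real code works directly on the vectors'
DrillBuf/ArrowBuf memory):

  // Drill's nullable vectors carry one "is set" byte per value, while Arrow
  // uses a validity bitmap with one bit per value.
  class NullConversionSketch {
    // Drill "is set" bytes (0 or 1 per row) -> Arrow validity bitmap.
    static byte[] toArrowValidity(byte[] drillIsSet) {
      byte[] bitmap = new byte[(drillIsSet.length + 7) / 8];
      for (int row = 0; row < drillIsSet.length; row++) {
        if (drillIsSet[row] != 0) {
          bitmap[row >> 3] |= (byte) (1 << (row & 7));
        }
      }
      return bitmap;
    }

    // Arrow validity bitmap -> Drill "is set" bytes.
    static byte[] toDrillIsSet(byte[] validity, int rowCount) {
      byte[] isSet = new byte[rowCount];
      for (int row = 0; row < rowCount; row++) {
        isSet[row] = (byte) ((validity[row >> 3] >> (row & 7)) & 1);
      }
      return isSet;
    }
  }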

Since my current implementation predates DRILL-6340, it does not handle the
case where the output size of the project is less than the input size. To
cover that case, there is some more work to do which I have not focused on
yet.

To contribute this to the community, there is also a test-case problem to
consider, since the Gandiva jar is platform dependent.




On Fri, Apr 19, 2019 at 8:43 AM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi Weijie,
>
> Thanks much for the update on your Gandiva work. It is great work.
>
> Can you say more about how you are doing the integration?
>
> As you mentioned, the memory layout of Arrow's null vector differs from
> that of the "is set" vector in Drill. How did you work around that?
>
> The Project operator is pretty simple if we are just copying or removing
> columns. However, much of Project deals with invoking Drill-provided
> functions: simple ones (add two ints) and complex ones (perform a regex
> match). To be useful, the integration would have to mimic Drill's behavior
> for each of these many functions.
>
> Project currently works row-by-row. But, to get the maximum performance,
> it would work column-by-column to take full advantage of vectorization.
> Doing that would require large changes to the code that sets up codegen,
> and iterates over the batch.
>
>
> For operators such as Sort, the only vector-based operations are 1) sort a
> batch using defined keys to get an offset vector, and 2) create a new
> vector by copying values, row-by-row, from one batch to another according
> to the offset vector.
>
> The join and aggregate operations are even more complex, as are the
> partition senders and receivers.
>
> Can you tell us where you've used Gandiva? Which operators? How did you
> handle the function integration? I am very curious how you were able to
> solve these problems.
>
>
> Thanks,
>
> - Paul
>
>
>
>    On Wednesday, April 3, 2019, 11:51:34 PM PDT, weijie tong <
> tongweijie...@gmail.com> wrote:
>
>  Hi:
>
> Gandiva is a subproject of Arrow. By using LLVM codegen and SIMD, Gandiva
> can achieve better query performance. Arrow and Drill have a similar
> columnar memory format; the main difference now is the null
> representation. Also, Arrow has made great changes to the ValueVector.
> Adopting Arrow to replace Drill's VV has been discussed before. That would
> be a great job. But to leverage Gandiva by working at the physical memory
> address level, the work could be relatively small.
>
> Now I have done the integration work on our own branch by making some
> changes to the Arrow branch, and have filed DRILL-7087 and ARROW-4819. The
> main change in ARROW-4819 is to make some package-level methods public,
> but the Arrow community does not seem to plan to accept this change. Their
> advice is to keep our own Arrow branch.
>
> So what do you think?
>
> 1. Have our own branch of Arrow.
> 2. Wait for the Arrow integration to be completed.
> Or some other ideas?
