Hi Weijie,

Thanks much for the explanation. Sounds like you are making good progress.
For which operator is the filter pushed into the scan? Although Impala does this for all scans, AFAIK, Drill does not. For example, the text and JSON readers do not handle filtering; filtering is instead done by the Filter operator in those cases. Perhaps you have your own special scan that handles filtering?

The concern in DRILL-6340 was that the user might do a project operation that causes the output batch to be much larger than the input batch. Someone suggested flatten as one example. String concatenation is another: the input batch might be large, and the result of the concatenation could be too large for available memory. So, the idea was to project the single input batch into two (or more) output batches to control batch size.

I like how you've categorized the vectors into the set that Gandiva can project and the set that Drill must handle. Maybe you can extend this idea to the case where input batches are split into multiple output batches. Let Drill handle VarChar expressions that could increase column width (such as the concatenate operator). Let Drill decide the number of rows in the output batch. Then, for the columns that Gandiva can handle, project just those rows needed for the current output batch.

Your solution might also be extended to handle the Gandiva library issue. Since you are splitting vectors into the Drill group and the Gandiva group, if Drill runs on a platform without Gandiva support, or if the Gandiva library can't be found, just let all vectors fall into the Drill vector group. If the user wants to use Gandiva, he/she could set a config option to point to the Gandiva library (and supporting files, if any), or use the existing LD_LIBRARY_PATH env. variable.

Thanks,
- Paul

On Thursday, April 18, 2019, 11:45:08 PM PDT, weijie tong <tongweijie...@gmail.com> wrote:

Hi Paul:

Currently Gandiva only supports the Project and Filter operations. My work is to integrate the Project operator.
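A rough sketch of the batch-splitting idea Paul describes above (illustrative code only, not Drill's actual implementation; the memory limit and per-row estimate are assumed inputs): given an input batch whose projected rows may be wider than the input rows, compute how many size-limited output batches to emit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split one input batch's rows into several output
// batches so each output batch stays under a memory budget, even when the
// projection (e.g. string concatenation) widens the rows.
public class BatchSplitSketch {

    // Returns [start, end) row ranges, one per output batch.
    static List<int[]> splitRows(int inputRowCount, long batchMemoryLimit,
                                 long estimatedBytesPerOutputRow) {
        // Rows that fit in one output batch at the estimated widened row size.
        int rowsPerBatch =
            (int) Math.max(1, batchMemoryLimit / estimatedBytesPerOutputRow);
        List<int[]> ranges = new ArrayList<>();
        for (int start = 0; start < inputRowCount; start += rowsPerBatch) {
            int end = Math.min(inputRowCount, start + rowsPerBatch);
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // e.g. 4096 input rows, ~1 KB per projected row, 1 MB output budget
        List<int[]> ranges = splitRows(4096, 1 << 20, 1024);
        System.out.println(ranges.size());  // prints 4
    }
}
```

Per Paul's suggestion, Gandiva would then be asked to project only the rows of the current [start, end) range for the columns it handles.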
Most Filter operations will be pushed down to the Scan anyway. The Gandiva project interface works at the RecordBatch level: it accepts the memory addresses of the input RecordBatch's vectors. Before that, it also needs to construct a binary schema object to describe the input RecordBatch's schema.

The integration work mainly has two parts:

1. At the setup step, find the expressions which Gandiva can handle. The matched expressions will be evaluated by Gandiva; the others will still be evaluated by Drill.

2. Invoke the Gandiva native project method. For each matched expression's input ValueVectors, a corresponding Arrow-style null-representation vector is allocated, and the bits for the null input values are set. The same work is done in reverse for the output ValueVectors: the Arrow output null vector is transferred back to Drill's null vector. Since the native method only cares about physical memory addresses, invoking it is not hard.

Since my current implementation predates DRILL-6340, it does not handle the case where the project's output batch must be smaller than the input batch. To cover that case, there is some more work to do which I have not focused on yet. To contribute this to the community, there is also a test problem to consider, since the Gandiva jar is platform dependent.

On Fri, Apr 19, 2019 at 8:43 AM Paul Rogers <par0...@yahoo.com.invalid> wrote:

> Hi Weijie,
>
> Thanks much for the update on your Gandiva work. It is great work.
>
> Can you say more about how you are doing the integration?
>
> As you mentioned, the memory layout of Arrow's null vector differs from
> the "is set" vector in Drill. How did you work around that?
>
> The Project operator is pretty simple if we are just copying or removing
> columns. However, much of Project deals with invoking Drill-provided
> functions: simple ones (add two ints) and complex ones (perform a regex
> match).
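The null-representation bridge Weijie describes can be sketched in plain Java (illustrative class and method names, not Drill's or Arrow's actual APIs): Drill's nullable vectors carry an "is set" buffer of one byte per value, while Arrow and Gandiva expect a validity bitmap of one bit per value, LSB first. Converting between the two is bit packing and unpacking, independent of the data buffer itself.

```java
// Hypothetical sketch of the Drill <-> Arrow null-representation bridge.
public class NullBridgeSketch {

    // Drill-style byte-per-value "is set" buffer -> Arrow-style validity bitmap.
    static byte[] toArrowValidity(byte[] drillIsSet) {
        byte[] bitmap = new byte[(drillIsSet.length + 7) / 8];
        for (int i = 0; i < drillIsSet.length; i++) {
            if (drillIsSet[i] != 0) {
                // Set bit i of the bitmap (LSB-first within each byte).
                bitmap[i / 8] |= (byte) (1 << (i % 8));
            }
        }
        return bitmap;
    }

    // Arrow-style validity bitmap -> Drill-style byte-per-value buffer.
    static byte[] toDrillIsSet(byte[] bitmap, int valueCount) {
        byte[] isSet = new byte[valueCount];
        for (int i = 0; i < valueCount; i++) {
            isSet[i] = (byte) ((bitmap[i / 8] >> (i % 8)) & 1);
        }
        return isSet;
    }
}
```

In the integration, the forward conversion would be applied to the matched inputs before calling into Gandiva, and the reverse conversion to Gandiva's outputs before handing the batch back to Drill.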
> To be useful, the integration would have to mimic Drill's behavior for
> each of these many functions.
>
> Project currently works row by row. But, to get maximum performance, it
> would work column by column to take full advantage of vectorization.
> Doing that would require large changes to the code that sets up codegen
> and iterates over the batch.
>
> For operators such as Sort, the only vector-based operations are 1) sort
> a batch using the defined keys to get an offset vector, and 2) create a
> new vector by copying values, row by row, from one batch to another
> according to the offset vector.
>
> The join and aggregate operations are even more complex, as are the
> partition senders and receivers.
>
> Can you tell us where you've used Gandiva? Which operators? How did you
> handle the function integration? I am very curious how you were able to
> solve these problems.
>
> Thanks,
>
> - Paul
>
> On Wednesday, April 3, 2019, 11:51:34 PM PDT, weijie tong <
> tongweijie...@gmail.com> wrote:
>
> Hi:
>
> Gandiva is a subproject of Arrow. Using LLVM codegen and SIMD, Gandiva
> can achieve better query performance. Arrow and Drill have a similar
> columnar memory format; the main difference now is the null
> representation. Arrow has also made great changes to its ValueVector.
> Adopting Arrow to replace Drill's VV has been discussed before. That
> would be a big job. But leveraging Gandiva by working at the physical
> memory address level could be a relatively small one.
>
> I have now done the integration work on our own branch by making some
> changes to an Arrow branch, and filed DRILL-7087 and ARROW-4819. The
> main change in ARROW-4819 is to make some package-level methods public.
> But the Arrow community does not seem to plan to accept this change;
> their advice is to maintain an Arrow branch.
>
> So what do you think?
>
> 1. Maintain our own branch of Arrow.
> 2. Wait for the complete Arrow integration.
>
> Or some other ideas?
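As a footnote to the Sort discussion in Paul's quoted mail above, the second vector-based operation he names (copying values row by row according to an offset vector) can be sketched in a few lines; plain arrays stand in for Drill's ValueVectors and SelectionVector here, so this is illustrative only.

```java
// Minimal sketch of a selection-vector gather: build a new "vector" by
// copying values from the input in the order given by the offset vector
// produced by the sort step.
public class SelectionCopySketch {

    static int[] gather(int[] values, int[] selection) {
        int[] out = new int[selection.length];
        for (int i = 0; i < selection.length; i++) {
            // Row-by-row copy: output row i comes from input row selection[i].
            out[i] = values[selection[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] sorted = gather(new int[] {30, 10, 20}, new int[] {1, 2, 0});
        System.out.println(sorted[0]); // prints 10
    }
}
```

This copy step is inherently row at a time, which is part of why, as Paul notes, operators like Sort offer fewer opportunities for Gandiva-style columnar vectorization than Project does.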