Re: [Discuss] Integrate Arrow gandiva into Drill

Parth Chandra Tue, 23 Apr 2019 16:57:04 -0700

Finally!
We can definitely go ahead with Gandiva if it doesn't depend on the memory
allocator.
Last time I checked the C++ version of Arrow still had its own memory
allocation, but Gandiva likely does not use that for its C++ code.




On Tue, Apr 23, 2019 at 5:46 AM weijie tong <tongweijie...@gmail.com> wrote:

> Gandiva 's Project does not allocate any more memory to execute. It just
> calculates the input memory data whatever they are var-length or
> fixed-width. The output memory will also be allocated by the Drill ahead
> which needs to be fixed-width vectors. The var-width output vector cases
> should not be allowed the Gandiva to evaluate since that will need Gandiva
> to allocate additional memory which is not controlled by the JVM.
>
> I guess that's why Gandiva does not implement operator like HashJoin or
> HashAggregate which need to allocate additional memory to implement. But
> Arrow's WIP PR ARROW-3191 https://github.com/apache/arrow/pull/4151 will
> make that possible.
>
> On Tue, Apr 23, 2019 at 7:15 AM Parth Chandra <par...@apache.org> wrote:
>
> > Is there a way to provide Drill's memory allocator to Gandiva/Arrow? If
> > not, then how do we keep a proper accounting of any memory used by
> > Gandiva/Arrow?
> >
> > On Sat, Apr 20, 2019 at 7:05 PM Paul Rogers <par0...@yahoo.com.invalid>
> > wrote:
> >
> > > Hi Weijie,
> > >
> > > Thanks much for the explanation. Sounds like you are making good
> > progress.
> > >
> > >
> > > For which operator is the filter pushed into the scan? Although Impala
> > > does this for all scans, AFAIK, Drill does not do so. For example, the
> > text
> > > and JSON reader do not handle filtering. Filtering is instead done by
> the
> > > Filter operator in these cases. Perhaps you have your own special scan
> > > which handles filtering?
> > >
> > >
> > > The concern in DRILL-6340 was the user might do a project operation
> that
> > > causes the output batch to be much larger than the input batch. Someone
> > > suggested flatten as one example. String concatenation is another
> > example.
> > > The input batch might be large. The result of the concatenation could
> be
> > > too large for available memory. So, the idea was to project the single
> > > input batch into two (or more) output batches to control batch size.
> > >
> > >
> > > II like how you've categorized the vectors into the set that Gandiva
> can
> > > project, and the set that Drill must handle. Maybe you can extend this
> > idea
> > > for the case where input batches are split into multiple output
> batches.
> > >
> > >  Let Drill handle VarChar expressions that could increase column width
> > > (such as the concatenate operator.) Let Drill decide the number of rows
> > in
> > > the output batch. Then, for the columns that Gandiva can handle,
> project
> > > just those rows needed for the current output batch.
> > >
> > > Your solution might also be extended to handle the Gandiva library
> issue.
> > > Since you are splitting vectors into the Drill group and the Gandiva
> > group,
> > > if Drill runs on a platform without Gandiva support, or if the Gandiva
> > > library can't be found, just let all vectors fall into the Drill vector
> > > group.
> > >
> > > If the user wants to use Gandiva, he/she could set a config option to
> > > point to the Gandiva library (and supporting files, if any.) Or, use
> the
> > > existing LD_LIBRARY_PATH env. variable.
> > >
> > > Thanks,
> > > - Paul
> > >
> > >
> > >
> > >     On Thursday, April 18, 2019, 11:45:08 PM PDT, weijie tong <
> > > tongweijie...@gmail.com> wrote:
> > >
> > >  Hi Paul:
> > > Currently Gandiva only supports Project ,Filter operations. My work is
> to
> > > integrate Project operator. Since most of the Filter operator will be
> > > pushed down to the Scan.
> > >
> > > The Gandiva project interface works at the RecordBatch level. It
> accepts
> > > the memory address of the vectors of  input RecordBatch and . Before
> that
> > > it also need to construct a binary schema object to describe the input
> > > RecordBatch schema.
> > >
> > > The integration work mainly has two parts:
> > >   1. at the setup step, find the expressions which can be solved by the
> > > Gandiva . The matched expression will be solved by the Gandiva, others
> > will
> > > still be solved by Drill.
> > >   2. invoking the Gandiva native project method. The matched
> expressions'
> > > ValueVectors will all be allocated corresponding Arrow type null
> > > representation ValueVector. The null input vector's bit  will also be
> > set.
> > > The same work will also be done to the output ValueVectors, transfer
> the
> > > arrow output null vector to Drill's null vector. Since the native
> method
> > > only care the physical memory address, invoking that native method is
> > not a
> > > hard work.
> > >
> > > Since my current implementation is before DRILL-6340, it does not solve
> > the
> > > output size of the project which is less than the input size case. To
> > cover
> > > that case , there's some more work to do which I have not focused on.
> > >
> > > To contribute to community , there's also some test case problem which
> > > needs to be considered, since the Gandiva jar is platform dependent.
> > >
> > >
> > >
> > >
> > > On Fri, Apr 19, 2019 at 8:43 AM Paul Rogers <par0...@yahoo.com.invalid
> >
> > > wrote:
> > >
> > > > Hi Weijie,
> > > >
> > > > Thanks much for the update on your Gandiva work. It is great work.
> > > >
> > > > Can you say more about how you are doing the integration?
> > > >
> > > > As you mentioned the memory layout of Arrow's null vector differs
> from
> > > the
> > > > "is set" vector in Drill. How did you work around that?
> > > >
> > > > The Project operator is pretty simple if we are just copying or
> > removing
> > > > columns. However, much of Project deals with invoking Drill-provided
> > > > functions: simple ones (add two ints) and complex ones (perform a
> regex
> > > > match). To be useful, the integration would have to mimic Drill's
> > > behavior
> > > > for each of these many functions.
> > > >
> > > > Project currently works row-by-row. But, to get the maximum
> > performance,
> > > > it would work column-by-column to take full advantage of
> vectorization.
> > > > Doing that would require large changes to the code that sets up
> > codegen,
> > > > and iterates over the batch.
> > > >
> > > >
> > > > For operators such as Sort, the only vector-based operations are 1)
> > sort
> > > a
> > > > batch using defined keys to get an offset vector, and 2) create a new
> > > > vector by copying values, row-by-row, from one batch to another
> > according
> > > > to the offset vector.
> > > >
> > > > The join and aggregate operations are even more complex, as are the
> > > > partition senders and receivers.
> > > >
> > > > Can you tell us where you've used Gandiva? Which operators? How did
> you
> > > > handle the function integration? I am very curious how you were able
> to
> > > > solve these problems.
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > - Paul
> > > >
> > > >
> > > >
> > > >    On Wednesday, April 3, 2019, 11:51:34 PM PDT, weijie tong <
> > > > tongweijie...@gmail.com> wrote:
> > > >
> > > >  HI :
> > > >
> > > > Gandiva is a sub project of Arrow. Arrow gandiva using LLVM codegen
> and
> > > > simd skill could achieve better query performance.  Arrow and Drill
> has
> > > > similar column memory format. The main difference now is the null
> > > > representation. Also Arrow has made great changes to the ValueVector.
> > To
> > > > adopt Arrow to replace Drill's VV has been discussed before. That
> would
> > > be
> > > > a great job. But to leverage gandiva , by working at the physical
> > memory
> > > > address level , this work could be little relatively.
> > > >
> > > > Now I have done the integration work at our own branch by make some
> > > changes
> > > > to the Arrow branch, and issued DRILL-7087 and ARROW-4819. The main
> > > changes
> > > > to ARROW-4819 is to make some package level method to be public. But
> > > arrow
> > > > community seems not plan to accept this change. Their advice is to
> > have a
> > > > arrow branch.
> > > >
> > > > So what do you think?
> > > >
> > > > 1、Have a self branch of Arrow.
> > > > 2、waiting for the Arrow integration completely.
> > > > or some other ideas?
> >
>

Re: [Discuss] Integrate Arrow gandiva into Drill

Reply via email to