Re: [C++] Revamping approach to Arrow compute kernel development

Wes McKinney Mon, 11 May 2020 15:19:00 -0700

I'm working actively on this but perhaps as expected it has ballooned
into a very large project -- it's unclear at the moment whether I'll
be able to break the work into smaller patches that are easier to
digest. I'm working as fast as I can to have an initial
feature-preserving PR up, but the changes to the src/arrow/compute
directory are extensive, so I would please ask that we do not merge
non-essential patches into cpp/src/arrow/compute until I get the PR up
(estimated later this week at present rate) so we can see where things
stand.


On Wed, Apr 22, 2020 at 8:06 AM Wes McKinney <[email protected]> wrote:
>
> On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield <[email protected]> 
> wrote:
> >
> > Hi Wes,
> > I haven't had time to read the doc, but wanted to ask some questions on
> > points raised on the thread.
> >
> > * For efficiency, kernels used for array-expr evaluation should write
> > > into preallocated memory as their default mode. This enables the
> > > interpreter to avoid temporary memory allocations and improve CPU
> > > cache utilization. Almost none of our kernels are implemented this way
> > > currently.
> >
> > Did something change, I was pretty sure I submitted a patch a while ago for
> > boolean kernels, that separated out memory allocation from computation.
> > Which should allow for writing to the same memory.  Is this a concern with
> > the public Function APIs for the Kernel APIs themselves, or a lower level
> > implementation concern?
>
> Yes, you did in the internal implementation [1]. The concern is the
> public API and the general approach to implementing new kernels.
>
> I'm working on this right now (it's a large project so it will take me
> a little while to produce something to be reviewed) so bear with me =)
>
> [1]: 
> https://github.com/apache/arrow/commit/4910fbf4fda05b864daaba820db08291e4afdcb6#diff-561ea05d36150eb15842f452e3f07c76
>
> > * Sorting is generally handled by different data processing nodes from
> > > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> > > Projections and Filters use expressions, they do not sort.
> >
> > Would sorting the list-column elements per row be an array-expr?
>
> Yes, as that's an element-wise function. When I said sorting I was
> referring to ORDER BY. The functions we have that do sorting do so in
> the context of a single array [2].
>
> A query engine must be able to sort a (potentially very large) stream
> of record batches. One approach is for the Sort operator to exhaust
> its child input, accumulating all of the record batches in memory
> (spilling to disk as needed) and then sorting and emitting record
> batches from the sorted records/tuples. See e.g. Impala's sorting code
> [3] [4]
>
> [2]: 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices.h#L34
> [3]: https://github.com/apache/impala/blob/master/be/src/runtime/sorter.h
> [4]: https://github.com/apache/impala/blob/master/be/src/exec/sort-node.h
>
> >
> > On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney <[email protected]> wrote:
> >
> > > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou <[email protected]> wrote:
> > > >
> > > >
> > > > Le 21/04/2020 à 13:53, Wes McKinney a écrit :
> > > > >>
> > > > >> That said, in the SortToIndices case, this wouldn't be a problem,
> > > since
> > > > >> only the second pass writes to the output.
> > > > >
> > > > > This kernel is not valid for normal array-exprs (see the spreadsheet I
> > > > > linked), such as what you can write in SQL
> > > > >
> > > > > Kernels like SortToIndices are a different type of function (in other
> > > > > words, "not a SQL function") and so if we choose to allow such a
> > > > > "non-SQL-like" functions in the expression evaluator then different
> > > > > logic must be used.
> > > >
> > > > Hmm, I think that maybe I'm misunderstanding at which level we're
> > > > talking here.  SortToIndices() may not be a "SQL function", but it looks
> > > > like an important basic block for a query engine (since, after all,
> > > > sorting results is an often used feature in SQL and other languages).
> > > > So it should be usable *inside* the expression engine, even though it's
> > > > not part of the exposed vocabulary, no?
> > >
> > > No, not as part of "expressions" as they are defined in the context of
> > > SQL engines.
> > >
> > > Sorting is generally handled by different data processing nodes from
> > > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> > > Projections and Filters use expressions, they do not sort.
> > >
> > > > Regards
> > > >
> > > > Antoine.
> > >

Re: [C++] Revamping approach to Arrow compute kernel development

Reply via email to