I'd guess I'm < 24 hours away from putting up my initial PR for this work. Since the work is large and (for all practical purposes) nearly impossible to separate into separately merge-ready PRs, I'll start a new e-mail thread describing what I've done in more detail and proposing a path for merging the PR (so as to unblock PRs into arrow/compute and avoid rebasing headaches) and addressing review feedback that will likely take several weeks to fully accumulate.
On Mon, May 11, 2020 at 5:17 PM Wes McKinney <wesmck...@gmail.com> wrote: > > I'm working actively on this but perhaps as expected it has ballooned > into a very large project -- it's unclear at the moment whether I'll > be able to break the work into smaller patches that are easier to > digest. I'm working as fast as I can to have an initial > feature-preserving PR up, but the changes to the src/arrow/compute > directory are extensive, so I would please ask that we do not merge > non-essential patches into cpp/src/arrow/compute until I get the PR up > (estimated later this week at present rate) so we can see where things > stand. > > On Wed, Apr 22, 2020 at 8:06 AM Wes McKinney <wesmck...@gmail.com> wrote: > > > > On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield <emkornfi...@gmail.com> > > wrote: > > > > > > Hi Wes, > > > I haven't had time to read the doc, but wanted to ask some questions on > > > points raised on the thread. > > > > > > * For efficiency, kernels used for array-expr evaluation should write > > > > into preallocated memory as their default mode. This enables the > > > > interpreter to avoid temporary memory allocations and improve CPU > > > > cache utilization. Almost none of our kernels are implemented this way > > > > currently. > > > > > > Did something change, I was pretty sure I submitted a patch a while ago > > > for > > > boolean kernels, that separated out memory allocation from computation. > > > Which should allow for writing to the same memory. Is this a concern with > > > the public Function APIs for the Kernel APIs themselves, or a lower level > > > implementation concern? > > > > Yes, you did in the internal implementation [1]. The concern is the > > public API and the general approach to implementing new kernels. > > > > I'm working on this right now (it's a large project so it will take me > > a little while to produce something to be reviewed) so bear with me =) > > > > [1]: > > https://github.com/apache/arrow/commit/4910fbf4fda05b864daaba820db08291e4afdcb6#diff-561ea05d36150eb15842f452e3f07c76 > > > > > * Sorting is generally handled by different data processing nodes from > > > > Projections, Aggregations / Hash Aggregations, Filters, and Joins. > > > > Projections and Filters use expressions, they do not sort. > > > > > > Would sorting the list-column elements per row be an array-expr? > > > > Yes, as that's an element-wise function. When I said sorting I was > > referring to ORDER BY. The functions we have that do sorting do so in > > the context of a single array [2]. > > > > A query engine must be able to sort a (potentially very large) stream > > of record batches. One approach is for the Sort operator to exhaust > > its child input, accumulating all of the record batches in memory > > (spilling to disk as needed) and then sorting and emitting record > > batches from the sorted records/tuples. See e.g. Impala's sorting code > > [3] [4] > > > > [2]: > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices.h#L34 > > [3]: https://github.com/apache/impala/blob/master/be/src/runtime/sorter.h > > [4]: https://github.com/apache/impala/blob/master/be/src/exec/sort-node.h > > > > > > > > On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney <wesmck...@gmail.com> wrote: > > > > > > > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou <anto...@python.org> > > > > wrote: > > > > > > > > > > > > > > > Le 21/04/2020 à 13:53, Wes McKinney a écrit : > > > > > >> > > > > > >> That said, in the SortToIndices case, this wouldn't be a problem, > > > > since > > > > > >> only the second pass writes to the output. > > > > > > > > > > > > This kernel is not valid for normal array-exprs (see the > > > > > > spreadsheet I > > > > > > linked), such as what you can write in SQL > > > > > > > > > > > > Kernels like SortToIndices are a different type of function (in > > > > > > other > > > > > > words, "not a SQL function") and so if we choose to allow such a > > > > > > "non-SQL-like" functions in the expression evaluator then different > > > > > > logic must be used. > > > > > > > > > > Hmm, I think that maybe I'm misunderstanding at which level we're > > > > > talking here. SortToIndices() may not be a "SQL function", but it > > > > > looks > > > > > like an important basic block for a query engine (since, after all, > > > > > sorting results is an often used feature in SQL and other languages). > > > > > So it should be usable *inside* the expression engine, even though > > > > > it's > > > > > not part of the exposed vocabulary, no? > > > > > > > > No, not as part of "expressions" as they are defined in the context of > > > > SQL engines. > > > > > > > > Sorting is generally handled by different data processing nodes from > > > > Projections, Aggregations / Hash Aggregations, Filters, and Joins. > > > > Projections and Filters use expressions, they do not sort. > > > > > > > > > Regards > > > > > > > > > > Antoine. > > > >