Hi all,
I wanted to give an update on work that several of us have been doing in
recent months on query processing in C++. Back in March, we circulated [1]
a document [2] with a proposal on implementing the basic pieces of a query
execution engine. Recent patches have introduced some key aspects of this
functionality, including

* ExecPlan and ExecNode API
* Scalar and GroupBy aggregation nodes
* OrderBy node
* A faster, asynchronous Dataset scanner implementation
* R bindings

You may also have noticed a lot of activity around compute functions
(kernels), which these can consume. We roughly doubled the number of
compute functions between 4.0.0 and 5.0.0, and for 6.0.0, we’ve added the
most common aggregation functions and have improved the scalar kernels to
support a more complete set of types. See the compute functions page in the
Arrow C++ library user guide for the full list of compute functions
available in the current release [3].

This means we can now scan, project, filter, aggregate, and sort in-memory,
although the functionality is not easily accessible from bindings other
than R just yet.

In the coming weeks and months, we’ll continue to push ahead on this work,
focusing on

* More nodes, including joins [4] and top-k/bottom-k [5]
* Alternative scheduling approaches such as work stealing approaches or
different methods of applying back pressure
* The ability to spill to disk when necessary to cut down on memory pressure
* Connecting the query engine to the experimental Compute IR [6] and
exposing it in Python via ibis

Thank you to the many members of the Arrow developer community who have
contributed code, comments, and reviews. If you are interested in following
or contributing to this work, please see the “Query engine 6.0 release” [7]
and “Compute kernels 6.0 release” [8] Confluence pages.

Neal

[1]:
https://lists.apache.org/thread.html/rb06b4dc2c6e53fe01784e22e669710710be747faadd46b608c9a27f5%40%3Cdev.arrow.apache.org%3E
[2]:
https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ/edit#heading=h.t89hffc3t7si
[3]: https://arrow.apache.org/docs/cpp/compute.html
[4]: https://github.com/apache/arrow/pull/11150
[5]: https://issues.apache.org/jira/browse/ARROW-13973
[6]: https://github.com/apache/arrow/pull/10934
[7]:
https://cwiki.apache.org/confluence/display/ARROW/Query+engine+6.0+release
[8]:
https://cwiki.apache.org/confluence/display/ARROW/Compute+kernels+6.0+release

Reply via email to