Hi all,

I wanted to give an update on work that several of us have been doing in recent months on query processing in C++. Back in March, we circulated [1] a document [2] with a proposal on implementing the basic pieces of a query execution engine. Recent patches have introduced some key aspects of this functionality, including
* ExecPlan and ExecNode API
* Scalar and GroupBy aggregation nodes
* OrderBy node
* A faster, asynchronous Dataset scanner implementation
* R bindings

You may also have noticed a lot of activity around compute functions (kernels), which these nodes can consume. We roughly doubled the number of compute functions between 4.0.0 and 5.0.0, and for 6.0.0, we’ve added the most common aggregation functions and have improved the scalar kernels to support a more complete set of types. See the compute functions page in the Arrow C++ library user guide for the full list of compute functions available in the current release [3].

This means we can now scan, project, filter, aggregate, and sort in-memory, although the functionality is not easily accessible from bindings other than R just yet.

In the coming weeks and months, we’ll continue to push ahead on this work, focusing on

* More nodes, including joins [4] and top-k/bottom-k [5]
* Alternative scheduling approaches, such as work stealing or different methods of applying back pressure
* The ability to spill to disk when necessary to cut down on memory pressure
* Connecting the query engine to the experimental Compute IR [6] and exposing it in Python via ibis

Thank you to the many members of the Arrow developer community who have contributed code, comments, and reviews. If you are interested in following or contributing to this work, please see the “Query engine 6.0 release” [7] and “Compute kernels 6.0 release” [8] Confluence pages.

Neal

[1]: https://lists.apache.org/thread.html/rb06b4dc2c6e53fe01784e22e669710710be747faadd46b608c9a27f5%40%3Cdev.arrow.apache.org%3E
[2]: https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ/edit#heading=h.t89hffc3t7si
[3]: https://arrow.apache.org/docs/cpp/compute.html
[4]: https://github.com/apache/arrow/pull/11150
[5]: https://issues.apache.org/jira/browse/ARROW-13973
[6]: https://github.com/apache/arrow/pull/10934
[7]: https://cwiki.apache.org/confluence/display/ARROW/Query+engine+6.0+release
[8]: https://cwiki.apache.org/confluence/display/ARROW/Compute+kernels+6.0+release