hi folks,

If I'm reading the tea leaves correctly, there is increased interest in developing a full-fledged "embeddable" analytical query engine written in C++ within the Apache Arrow project. I have long been a major proponent of this and have written and spoken publicly about it on numerous occasions. I am excited that other people believe this to be a good idea.

Similar work is already taking place in Rust with DataFusion and I see no particular conflict between the groups of developers, since a "native" engine can offer more flexibility at times, and I think we all stand to learn from each other throughout the process.

Building a query engine is obviously a complex project which will take on the order of years to fully realize, though there will be some pretty useful things that we can build and ship pretty quickly (e.g. being able to do a parallel scan, filter, and aggregation over a directory of Parquet files would be incredibly useful to a lot of people).

I saw that Anton Malakhov wrote a document last week about parallel execution in general (though not specifically about query execution), which includes a discussion of Intel TBB and how it may be helpful:

https://cwiki.apache.org/confluence/display/ARROW/Parallel+Execution+Engine

(BTW it appears that ASF Confluence is currently down)

I think it's helpful to articulate some of the problems we have to solve, to help focus discussions about such low-level details as what kind of threading management library to use. I wrote down some general (and pretty rough / unedited) thoughts about how I think an embeddable (i.e. it's a C++ shared library that can be shipped anywhere) Arrow-native query engine should work. I commented on some of the multithreading and concurrency issues that I see (particularly around nested parallelism -- think scan or projection nodes competing for CPU resources with aggregation nodes). I think it would be useful to analyze how we can effectively utilize multicore (and multi-CPU) architectures in the context of the particular kinds of queries that we expect to execute using our software.

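To make the "parallel scan, filter, and aggregation" example and the shared-resource concern a bit more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the proposed C++ design: the in-memory FILES list stands in for a directory of Parquet files, and the names scan_filter_aggregate, merge, and run_query are hypothetical. Each file is scanned and filtered into a partial aggregate, all tasks share one thread pool rather than each creating their own threads, and the partials are merged at the end.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a directory of Parquet files: each "file" is a
# list of (key, value) rows held in memory.
FILES = [
    [("a", 1), ("b", 5), ("a", 7)],
    [("b", 2), ("c", 9)],
    [("a", 4), ("c", 3), ("b", 8)],
]


def scan_filter_aggregate(rows, min_value):
    """Scan one file, keep rows with value >= min_value, and sum per key."""
    partial = {}
    for key, value in rows:
        if value >= min_value:
            partial[key] = partial.get(key, 0) + value
    return partial


def merge(partials):
    """Combine the per-file partial sums into one final result."""
    total = {}
    for partial in partials:
        for key, value in partial.items():
            total[key] = total.get(key, 0) + value
    return total


def run_query(files, min_value, max_workers=4):
    # One shared pool for the whole query, so per-file tasks do not each
    # spawn their own threads and oversubscribe the CPU.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = pool.map(
            lambda rows: scan_filter_aggregate(rows, min_value), files
        )
        # Consume the lazy map iterator while the pool is still open.
        return merge(partials)


result = run_query(FILES, min_value=3)
print(result)
```

The per-file partial aggregate followed by a final merge is one common way to keep the parallel stage free of shared mutable state; how scan and aggregation work would actually be scheduled in a C++ engine is exactly the kind of question the document is meant to open up.
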
Please have a read and comment on this thread or leave questions or comments in the Google document

https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit?usp=sharing

It would be a good idea at this stage to try to reduce confusion as much as possible by making the "goals" and "non-goals" as clear and uncontroversial as possible.

Having worked on analytics tools "for the edge" for over a decade now, I have suffered a great deal for lack of a high quality, reusable, embeddable query engine that can be used at the single node scale, so I am very keen to bring this work to fruition and for it to benefit as many communities as possible.

This is of course closely related to the "C++ Datasets" discussion I started a few weeks ago. The Datasets project is an important part of providing the brains for the "scan" operator in a query engine

https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing

Looking forward to discussing this with the community.

Thanks
Wes