Hi folks,

If I'm reading the tea leaves correctly, there is increased interest
in developing a full-fledged "embeddable" analytical query engine
written in C++ within the Apache Arrow project. I have long been a
major proponent of this and have written and spoken publicly about it
on numerous occasions. I am excited that other people believe this to be
a good idea.

Similar work is already taking place in Rust with DataFusion and I see
no particular conflict between the groups of developers, since a
"native" engine can offer more flexibility at times, and I think we
all stand to learn from each other throughout the process.

Building a query engine is obviously a complex project which will take
on the order of years to fully realize, though there will be some
pretty useful things that we can build and ship pretty quickly (e.g.
being able to do a parallel scan, filter, and aggregation over a
directory of Parquet files would be incredibly useful to a lot of
people).
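
To make the scan / filter / aggregate example concrete, here is a
minimal sketch that uses only the C++ standard library. It is not Arrow
or Parquet API: the Chunk type, FilteredSum, and ParallelFilteredSum are
hypothetical names, and each in-memory Chunk stands in for one file's
column. It launches one task per chunk, which a real engine would
replace with a bounded scheduler.

```cpp
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical stand-in for one Parquet file: a single int64 column
// already loaded in memory (an assumption made for illustration only).
using Chunk = std::vector<int64_t>;

// "Scan" one chunk, keep values greater than `threshold` (the filter),
// and return their sum (the partial aggregate).
int64_t FilteredSum(const Chunk& chunk, int64_t threshold) {
  int64_t sum = 0;
  for (int64_t v : chunk) {
    if (v > threshold) sum += v;
  }
  return sum;
}

// Launch one asynchronous task per chunk, then merge the partial sums.
// `chunks` must outlive the tasks; it does here because every future is
// consumed before the function returns.
int64_t ParallelFilteredSum(const std::vector<Chunk>& chunks,
                            int64_t threshold) {
  std::vector<std::future<int64_t>> partials;
  partials.reserve(chunks.size());
  for (const Chunk& chunk : chunks) {
    partials.push_back(std::async(std::launch::async, [&chunk, threshold] {
      return FilteredSum(chunk, threshold);
    }));
  }
  int64_t total = 0;
  for (auto& partial : partials) total += partial.get();
  return total;
}
```

Calling ParallelFilteredSum on the chunks {1, 5, 10}, {20, 3}, {} and
{7, 8} with a threshold of 4 sums 5, 10, 20, 7 and 8. On some platforms
the program needs to be linked with the threads library (e.g. -pthread).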

I saw that Anton Malakhov wrote a document last week about parallel
execution in general (though not specifically about query execution),
which includes a discussion of Intel TBB and how it may be helpful:

https://cwiki.apache.org/confluence/display/ARROW/Parallel+Execution+Engine

(BTW it appears that ASF Confluence is currently down)

I think it's helpful to articulate some of the problems we have to
solve to help focus discussions about such low-level details as what
kind of threading management library to use. I wrote down some general
(and pretty rough / unedited) thoughts that I have about how I think
an embeddable (i.e. it's a C++ shared library that can be shipped
anywhere) Arrow-native query engine should work. I commented on some
of the multithreading and concurrency issues that I see (particularly
around nested parallelism -- think scan or projection nodes competing
for CPU resources with aggregation nodes). I think it would be useful
to analyze how we can effectively utilize multicore (and multi-CPU)
architectures in the context of the particular kinds of queries that
we expect to execute using our software.

Please have a read and comment on this thread or leave questions or
comments in the Google document:

https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit?usp=sharing

It would be a good idea at this stage to try to reduce confusion as
much as possible by making the "goals" and "non-goals" as clear and
uncontroversial as possible. Having worked on analytics tools "for the
edge" for over a decade now, I have suffered a great deal for lack of
a high quality, reusable, embeddable query engine that can be used at
the single node scale, so I am very keen to bring this work to
fruition and for it to benefit as many communities as possible.

This is of course closely related to the "C++ Datasets" discussion I
started a few weeks ago. The Datasets project is an important part of
providing the brains for the "scan" operator in a query engine:

https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing

Looking forward to discussing this with the community.

Thanks
Wes
