There are a few levels of loops: two at the moment and three in the
future. Some are fused and some are not. What we have right now is at
an early stage and not ideal, and there are people investigating and
working on improvements. I can speak a little bit about where we want
to go; an example may be helpful.
For example, given a filter "x < 100 && x > 0" we have something like
(this is an approximation of the work done by
arrow::compute::ExecuteScalarExpression and not actual code):
```
for batch_of_128k_rows in data:
    auto lt_one_hundred = less_than(batch_of_128k_rows, 100)
    auto gt_zero = greater_than(batch_of_128k_rows, 0)
    auto filter_pred = and(lt_one_hundred, gt_zero)
    consume(filter(batch_of_128k_rows, filter_pred))
```
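To make that concrete, here is a rough sketch of what the unfused form
looks like if you drive the same kernels by hand through the public
compute function registry. The function names "less", "greater", "and",
and "filter" are real registry entries, but the wrapper function and its
use of a plain array are just for illustration; this is not what
ExecuteScalarExpression actually does internally:
```
#include <arrow/api.h>
#include <arrow/compute/api.h>

// Illustrative only: evaluates x < 100 && x > 0 and returns the
// filtered values. Every CallFunction below materializes a brand new
// output the full size of the input.
arrow::Result<arrow::Datum> EvalFilter(const std::shared_ptr<arrow::Array>& x) {
  namespace cp = arrow::compute;
  ARROW_ASSIGN_OR_RAISE(
      auto lt_one_hundred,
      cp::CallFunction("less", {x, arrow::Datum(int64_t{100})}));
  ARROW_ASSIGN_OR_RAISE(
      auto gt_zero, cp::CallFunction("greater", {x, arrow::Datum(int64_t{0})}));
  ARROW_ASSIGN_OR_RAISE(auto filter_pred,
                        cp::CallFunction("and", {lt_one_hundred, gt_zero}));
  // A final pass copies the selected rows into yet another new array.
  return cp::CallFunction("filter", {x, filter_pred});
}
```
Each call allocates its own output, which is exactly the allocation
problem described below.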
There are two big things we need to fix here. First,
`batch_of_128k_rows` is meant to be some percentage of one thread's
portion of the L3 cache. That is a good unit of parallelism, but it is
not ideal for processing: since we make three passes across
`batch_of_128k_rows`, we would rather it fit in the L2 cache. Second,
each of those `auto ... =` lines allocates new memory, and we would
like to avoid excess allocation if possible.
To solve the first problem we are moving towards the "morsel/batch"
model[1]. This means we have two "batch" sizes. The outer batch
(ironically, the morsel) is the larger of the two and is the unit used
for determining parallelism. The inner batch is smaller (sized based on
the L2 cache). A rough sketch of that split is below.
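As an Arrow-free illustration of that split (the L2 budget and row
width below are made-up numbers for the sake of the example, not values
Arrow actually uses):
```
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed numbers, purely for illustration.
constexpr std::size_t kL2BudgetBytes = 32 * 1024;  // per-thread slice of L2
constexpr std::size_t kRowWidthBytes = 8;          // e.g. one int64 column
constexpr std::size_t kBatchRows = kL2BudgetBytes / kRowWidthBytes;

// The morsel (~128k rows) is the unit handed to a worker thread; the
// inner loop walks it in L2-sized batches so that every pass of the
// expression sees data that is still cache-resident.
void ProcessMorsel(const std::vector<int64_t>& morsel) {
  for (std::size_t start = 0; start < morsel.size(); start += kBatchRows) {
    const std::size_t len = std::min(kBatchRows, morsel.size() - start);
    // evaluate_expression(morsel.data() + start, len);  // hypothetical
    (void)len;
  }
}
```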
To solve the second problem a number of solutions have been proposed
(thread-local buffer pools, thread-local buffer stacks, etc.) and we
will hopefully adopt one at some point. The snippet above would then
become something like:
```
thread_local auto lt_one_hundred = allocate_array(l2_sized_batch_size, bool)
thread_local auto gt_zero = allocate_array(l2_sized_batch_size, bool)
thread_local auto filter_pred = allocate_array(l2_sized_batch_size, bool)
for batch_of_128k_rows in data:
    for l2_sized_batch in batch_of_128k_rows:
        less_than(l2_sized_batch, 100, &lt_one_hundred)
        greater_than(l2_sized_batch, 0, &gt_zero)
        and(lt_one_hundred, gt_zero, &filter_pred)
        consume(filter(l2_sized_batch, filter_pred))
```
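Outside the pseudocode, the same idea can be sketched in plain C++
(this is not Arrow code; the batch size, buffer types, and function
names are placeholders) to show that the inner loop does no allocation
at all:
```
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBatchRows = 4096;  // assumed L2-sized batch

// Caller guarantees len <= kBatchRows. The three predicate buffers are
// allocated once per thread and reused for every batch, so steady-state
// processing performs no allocation for intermediates.
void FilterBatch(const int64_t* values, std::size_t len,
                 std::vector<int64_t>* out) {
  thread_local std::vector<uint8_t> lt_one_hundred(kBatchRows);
  thread_local std::vector<uint8_t> gt_zero(kBatchRows);
  thread_local std::vector<uint8_t> filter_pred(kBatchRows);
  for (std::size_t i = 0; i < len; ++i) lt_one_hundred[i] = values[i] < 100;
  for (std::size_t i = 0; i < len; ++i) gt_zero[i] = values[i] > 0;
  for (std::size_t i = 0; i < len; ++i) {
    filter_pred[i] = lt_one_hundred[i] & gt_zero[i];
  }
  for (std::size_t i = 0; i < len; ++i) {
    if (filter_pred[i]) out->push_back(values[i]);  // "consume"
  }
}
```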
There is still a fair amount of work to do before we get here but I
hope this gives you some idea of the direction we are headed.
[1] https://db.in.tum.de/~leis/papers/morsels.pdf
On Tue, May 24, 2022 at 6:27 AM Shawn Yang <[email protected]> wrote:
>
> Hi Ian, thank you for your reply, which recaps the history of Arrow compute.
> Those links are very valuable for understanding the Arrow compute internals. I
> took a quick look at those documents and will take a deeper look later. I
> have another question: does Arrow compute support loop fusion, i.e. executing
> multiple vectorized operations in one loop? This is very common in dataframe
> computing; our engine can extract those expressions into a dag/tree. If Arrow
> compute supports loop fusion, the performance would be very promising.
>
> Sent from my iPhone
>
> > On May 24, 2022, at 4:49 AM, Ian Cook <[email protected]> wrote:
> >
> > Hi Shawn,
> >
> > In March of 2021, when major work on the C++ query execution machinery
> > in Arrow was beginning, Wes sent a message [1] to the dev list and
> > linked to a doc [2] with some details about the planned design. A few
> > months later Neal sent an update [3] about this work. However those
> > documents are now somewhat out of date. More recently, Wes shared
> > another update [4] and linked to a doc [5] regarding task execution /
> > control flow / scheduling. However I think the best source of
> > information is the doc you linked to. The query execution work has
> > proceeded organically with many contributors, and efforts to document
> > the overall design in sufficient detail have not kept pace.
> >
> > Regarding benchmarks: There has been extensive work done using
> > Conbench [6] as part of the Arrow CI infrastructure to benchmark
> > commits, for purposes of avoiding / identifying performance
> > regressions and measuring efforts to improve performance. However I am
> > not aware of any efforts to produce and publicly share benchmarks for
> > the purpose of comparing performance vs. other query engines.
> >
> > There is a proposal [7] to give the name "Acero" to the Arrow C++
> > compute engine, so in the future you will likely see it referred to by
> > that name. I think that having a clearer name for this will motivate
> > more efforts to write and share more about it.
> >
> > Ian
> >
> > [1] https://lists.apache.org/thread/n632pmjnb85o49lyxy45f7sgh4cshoc0
> > [2]
> > https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ/
> > [3] https://lists.apache.org/thread/3pmb592zmonz86nmmbjcw08j5tcrfzm1
> > [4] https://lists.apache.org/thread/ltllzpt1r2ch06mv1ngfgdl7wv2tm8xc
> > [5]
> > https://docs.google.com/document/d/1216CUQZ7u4acZvC2jX7juqqQCXtdXMellk3lRrgP_WY/
> > [6] https://conbench.ursa.dev
> > [7] https://lists.apache.org/thread/7v7vkc005v9343n49b3shvrdn19wdpj1
> >
> >
> >
> >> On Mon, May 23, 2022 at 10:58 AM Shawn Yang <[email protected]>
> >> wrote:
> >>
> >> Hi, I'm considering using Arrow compute as an execution kernel for our
> >> distributed dataframe framework. I already read the great doc:
> >> https://arrow.apache.org/docs/cpp/compute.html, but it is a usage doc. Is
> >> there any design doc, internals introduction, or benchmarks for Arrow compute
> >> so I can quickly understand how Arrow compute works, what it can do, and
> >> what should be done by it? Or should I just read the code in
> >> https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
> >>
> >>