I see, the key for the multiple loops is to ensure the data can be held in the L2 cache, so that later calculations can process the batch without reading from main memory, and we can record the exec stats for every batch and do better local task scheduling based on those stats. Thanks a lot. Morsels are new to me, a very interesting idea.

Sent from my iPhone
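To check my understanding, here is a toy standalone C++ sketch (all names and helpers are my own illustration, not Arrow APIs) of timing each L2-sized batch inside a morsel so the stats could feed a local scheduler:

```
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

struct BatchStats {
  std::size_t rows;
  std::chrono::nanoseconds elapsed;
};

// Hypothetical kernel: append values of in[begin, end) with 0 < x < 100 to out.
void filter_batch(const std::vector<int>& in, std::size_t begin,
                  std::size_t end, std::vector<int>& out) {
  for (std::size_t i = begin; i < end; ++i) {
    if (in[i] > 0 && in[i] < 100) out.push_back(in[i]);
  }
}

// Process one morsel as a sequence of L2-sized batches, timing each batch
// so a local scheduler could adapt batch size or task placement.
std::vector<BatchStats> process_morsel(const std::vector<int>& morsel,
                                       std::size_t l2_batch_rows,
                                       std::vector<int>& out) {
  std::vector<BatchStats> stats;
  for (std::size_t begin = 0; begin < morsel.size(); begin += l2_batch_rows) {
    const std::size_t end = std::min(morsel.size(), begin + l2_batch_rows);
    const auto t0 = std::chrono::steady_clock::now();
    filter_batch(morsel, begin, end, out);
    const auto t1 = std::chrono::steady_clock::now();
    stats.push_back({end - begin,
        std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0)});
  }
  return stats;
}
```

The returned `BatchStats` vector is what a scheduler could inspect to tune `l2_batch_rows` per workload.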
> On May 25, 2022, at 7:23 AM, Weston Pace <[email protected]> wrote:
>
> There are a few levels of loops. Two at the moment and three in the
> future. Some are fused and some are not. What we have right now is at
> an early stage, is not ideal, and there are people investigating and
> working on improvements. I can speak a little bit about where we want
> to go. An example may be helpful.
>
> For example, given a filter "x < 100 && x > 0" we have something like
> (this is an approximation of the work done by
> arrow::compute::ExecuteScalarExpression and not actual code):
>
> ```
> for batch_of_128k_rows in data:
>   auto lt_one_hundred = less_than(batch_of_128k_rows, 100)
>   auto gt_zero = greater_than(batch_of_128k_rows, 0)
>   auto filter_pred = and(lt_one_hundred, gt_zero)
>   consume(filter(batch_of_128k_rows, filter_pred))
> ```
>
> There are two big things we need to fix here. First,
> `batch_of_128k_rows` is meant to be some percentage of one thread's
> portion of the L3 cache. This is a good unit of parallelism but it is
> not ideal for processing because we'd rather use the L2 cache, since
> we are making three passes across `batch_of_128k_rows`. Second, each
> of those `auto ... =` lines is allocating new memory. This is not
> ideal because we'd like to avoid excess allocation if possible.
>
> To solve the first problem we are moving towards the "morsel/batch"
> model [1]. This means we have two "batch" sizes. The outer batch
> (ironically, the morsel) is the largest and is the one used for
> determining parallelism. The inner batch should be smaller (size
> based on L2).
>
> To solve the second problem a number of solutions have been proposed
> (thread-local buffer pools, thread-local buffer stacks, etc.) and we
> will hopefully adopt one at some point. So the above code snippet
> would hopefully become something like:
>
> ```
> thread_local auto lt_one_hundred = allocate_array(l2_sized_batch_size, bool)
> thread_local auto gt_zero = allocate_array(l2_sized_batch_size, bool)
> thread_local auto filter_pred = allocate_array(l2_sized_batch_size, bool)
> for batch_of_128k_rows in data:
>   for l2_sized_batch in batch_of_128k_rows:
>     less_than(l2_sized_batch, 100, &lt_one_hundred)
>     greater_than(l2_sized_batch, 0, &gt_zero)
>     and(lt_one_hundred, gt_zero, &filter_pred)
>     consume(filter(l2_sized_batch, filter_pred))
> ```
>
> There is still a fair amount of work to do before we get here but I
> hope this gives you some idea of the direction we are headed.
>
> [1] https://db.in.tum.de/~leis/papers/morsels.pdf
>
>> On Tue, May 24, 2022 at 6:27 AM Shawn Yang <[email protected]> wrote:
>>
>> Hi Ian, thank you for your reply, which recaps the history of arrow
>> compute. Those links are very valuable for me to understand arrow
>> compute internals. I took a quick look at those documents and will
>> take a deeper look later. I have another question: does arrow compute
>> support loop fusion, which executes multiple vectorized operands in
>> one loop? This is very common in dataframe computing; our engine can
>> extract those expressions into a DAG/tree. If arrow compute supports
>> loop fusion, the performance would be very promising.
>>
>> Sent from my iPhone
>>
>>>> On May 24, 2022, at 4:49 AM, Ian Cook <[email protected]> wrote:
>>>
>>> Hi Shawn,
>>>
>>> In March of 2021, when major work on the C++ query execution
>>> machinery in Arrow was beginning, Wes sent a message [1] to the dev
>>> list and linked to a doc [2] with some details about the planned
>>> design.
>>> A few months later Neal sent an update [3] about this work. However,
>>> those documents are now somewhat out of date. More recently, Wes
>>> shared another update [4] and linked to a doc [5] regarding task
>>> execution / control flow / scheduling. However, I think the best
>>> source of information is the doc you linked to. The query execution
>>> work has proceeded organically with many contributors, and efforts to
>>> document the overall design in sufficient detail have not kept pace.
>>>
>>> Regarding benchmarks: There has been extensive work done using
>>> Conbench [6] as part of the Arrow CI infrastructure to benchmark
>>> commits, for purposes of avoiding / identifying performance
>>> regressions and measuring efforts to improve performance. However, I
>>> am not aware of any efforts to produce and publicly share benchmarks
>>> for the purpose of comparing performance vs. other query engines.
>>>
>>> There is a proposal [7] to give the name "Acero" to the Arrow C++
>>> compute engine, so in the future you will likely see it referred to
>>> by that name. I think that having a clearer name for this will
>>> motivate more efforts to write and share more about it.
>>>
>>> Ian
>>>
>>> [1] https://lists.apache.org/thread/n632pmjnb85o49lyxy45f7sgh4cshoc0
>>> [2] https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ/
>>> [3] https://lists.apache.org/thread/3pmb592zmonz86nmmbjcw08j5tcrfzm1
>>> [4] https://lists.apache.org/thread/ltllzpt1r2ch06mv1ngfgdl7wv2tm8xc
>>> [5] https://docs.google.com/document/d/1216CUQZ7u4acZvC2jX7juqqQCXtdXMellk3lRrgP_WY/
>>> [6] https://conbench.ursa.dev
>>> [7] https://lists.apache.org/thread/7v7vkc005v9343n49b3shvrdn19wdpj1
>>>
>>>> On Mon, May 23, 2022 at 10:58 AM Shawn Yang <[email protected]> wrote:
>>>>
>>>> Hi, I'm considering using arrow compute as an execution kernel for
>>>> our distributed dataframe framework. I already read the great doc
>>>> https://arrow.apache.org/docs/cpp/compute.html, but it is a usage
>>>> doc. Is there any design doc, internals introduction, or benchmarks
>>>> for arrow compute, so I can quickly understand how arrow compute
>>>> works, what can be done, and what should be done by it? Or should I
>>>> just read the code in
>>>> https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
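To illustrate the loop fusion I asked about earlier in the thread, here is a toy standalone C++ sketch (my own illustration with made-up names, not Arrow code), using the same "x < 100 && x > 0" predicate as Weston's example:

```
// Unfused vs. fused evaluation of "x < 100 && x > 0".
#include <cstddef>
#include <cstdint>
#include <vector>

// Unfused: three passes over the data, two temporary boolean arrays.
std::vector<std::uint8_t> unfused(const std::vector<int>& x) {
  std::vector<std::uint8_t> lt(x.size()), gt(x.size()), out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) lt[i] = x[i] < 100;
  for (std::size_t i = 0; i < x.size(); ++i) gt[i] = x[i] > 0;
  for (std::size_t i = 0; i < x.size(); ++i) out[i] = lt[i] & gt[i];
  return out;
}

// Fused: one pass, no temporaries; each element is read once and the
// whole predicate is computed while it is still in a register.
std::vector<std::uint8_t> fused(const std::vector<int>& x) {
  std::vector<std::uint8_t> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = (x[i] < 100) & (x[i] > 0);
  }
  return out;
}
```

The fused form is what an expression DAG/tree could be lowered to, which is why the single-loop execution matters so much for performance.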
