I see, thanks. I'll do more tests and dive deeper into the arrow compute code.

Sent from my iPhone
> On Jun 9, 2022, at 5:30 PM, Weston Pace <[email protected]> wrote:
>
>> Hi, do you guys know which functions support vectorized SIMD in arrow compute?
>
> I don't know that anyone has done a fully systematic analysis of which kernels do and do not support SIMD at the moment. The kernels are still in flux. There is an active effort to reduce overhead[1], which is the top priority, as this could possibly have more impact on performance than SIMD when running expressions involving multiple kernels across multiple threads.
>
>> I only found very few functions that support vectorized SIMD:
>> ● bloom filter: avx2
>> ● key compare: avx2
>> ● key hash: avx2
>> ● key map: avx2
>>
>> Does scalar operation support vectorized SIMD?
>
> A lack of explicit vectorization instructions does not mean a lack of SIMD support. For many kernels we expect modern compilers to be smart enough to implement vectorization automatically, as long as the data is provided in a vectorized fashion (e.g. columnar) and the kernel is simple enough. For more complex kernels there are options such as xsimd, but this hasn't yet been explored very thoroughly. At the moment I'm not aware of anyone writing explicitly vectorized kernels, as this tends to be rather hardware specific and have a small return on investment. Instead, we benchmark regularly and have micro-optimized certain critical sections (e.g. some of the hash stuff).
>
>> I tested with numpy and found arrow is ten times slower:
>
> That result you posted appears to be 3.5x slower. You might want to double check and ensure that Arrow was compiled with the appropriate architecture (the cmake files are generally good at figuring this out), but I wouldn't be too surprised if this was the case. Some of this might be unavoidable. For example, does numpy support null values (I don't know for sure, but I seem to recall it does not)? Some of this might be an inefficiency or overhead problem in Arrow-C++. It is possible that the multiply kernel is not being vectorized correctly by the compiler, but I don't think those numbers alone are enough proof of that.
>
> Performance can be quite tricky. It is important for us, but Arrow's compute functionality is still relatively new compared to numpy, and work on performance is balanced with work on features.
>
> [1] https://lists.apache.org/thread/rh10ykcolt0gxydhgv4vxk2m7ktwx5mh
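As a quick way to act on the suggestion above to check how the installed Arrow was built, here is a minimal sketch (assuming a pyarrow version recent enough to expose `runtime_info()`) that reports the SIMD level the build is using and the level detected on the CPU:

```
import pyarrow as pa

# Compare the SIMD level pyarrow is using against the level detected on this
# CPU; a mismatch suggests the build or an environment override is limiting
# vectorization.
info = pa.runtime_info()
print("SIMD level in use:  ", info.simd_level)
print("SIMD level detected:", info.detected_simd_level)
```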
>> On Wed, Jun 8, 2022 at 11:08 PM Shawn Yang <[email protected]> wrote:
>>
>> Hi, do you guys know which functions support vectorized SIMD in arrow compute? After a quick look at the arrow compute cpp code, I only found very few functions that support vectorized SIMD:
>> ● bloom filter: avx2
>> ● key compare: avx2
>> ● key hash: avx2
>> ● key map: avx2
>>
>> Does scalar operation support vectorized SIMD?
>>
>> I tested with numpy and found arrow is ten times slower:
>>
>> def test_multiply(rows=5000000):
>>     a = pa.array(list(range(rows, 0, -1)))
>>     b = pa.array(list(range(rows)))
>>     import pyarrow.compute as pc
>>
>>     print("arrow multiply took", timeit.timeit(
>>         lambda: pc.multiply(a, b), number=3))
>>     a = np.array(list(range(rows, 0, -1)))
>>     b = np.array(list(range(rows)))
>>     print("numpy multiply took", timeit.timeit(
>>         lambda: a * b, number=3))
>>     # arrow multiply took 0.14826057000000015
>>     # numpy multiply took 0.04047051300000071
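Following Weston's suggestion to double check, here is a slightly tightened variant of the benchmark above; this is only a sketch, and the numbers will vary by machine and build. Both inputs are built from the same contiguous int64 numpy data (with no nulls on the Arrow side), and `timeit.repeat` is used to reduce run-to-run noise:

```
import timeit

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc


def bench_multiply(rows=5_000_000, number=3, repeat=5):
    # Build both inputs from the same contiguous int64 buffers so the timed
    # section measures only the multiply itself.
    np_a = np.arange(rows, 0, -1, dtype=np.int64)
    np_b = np.arange(rows, dtype=np.int64)
    pa_a = pa.array(np_a)  # no validity (null) bitmap
    pa_b = pa.array(np_b)

    arrow_t = min(timeit.repeat(lambda: pc.multiply(pa_a, pa_b),
                                number=number, repeat=repeat))
    numpy_t = min(timeit.repeat(lambda: np_a * np_b,
                                number=number, repeat=repeat))
    print("arrow multiply took", arrow_t)
    print("numpy multiply took", numpy_t)


bench_multiply()
```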
>>> On Wed, May 25, 2022 at 10:09 PM Shawn Yang <[email protected]> wrote:
>>>
>>> I see, the key for the multiple loops is to ensure the data can be held in the L2 cache, so that later calculations can process a batch without reading from main memory, and we can record the exec stats for every batch and do better local task scheduling based on those stats. Thanks a lot. Morsels are new to me; a very interesting idea.
>>>
>>> Sent from my iPhone
>>>
>>>> On May 25, 2022, at 7:23 AM, Weston Pace <[email protected]> wrote:
>>>>
>>>> There are a few levels of loops: two at the moment and three in the future. Some are fused and some are not. What we have right now is at an early stage, is not ideal, and there are people investigating and working on improvements. I can speak a little bit about where we want to go. An example may be helpful.
>>>>
>>>> For example, given a filter "x < 100 && x > 0" we have something like (this is an approximation of the work done by arrow::compute::ExecuteScalarExpression and not actual code):
>>>>
>>>> ```
>>>> for batch_of_128k_rows in data:
>>>>     auto lt_one_hundred = less_than(batch_of_128k_rows, 100)
>>>>     auto gt_zero = greater_than(batch_of_128k_rows, 0)
>>>>     auto filter_pred = and(lt_one_hundred, gt_zero)
>>>>     consume(filter(batch_of_128k_rows, filter_pred))
>>>> ```
>>>>
>>>> There are two big things we need to fix here. First, `batch_of_128k_rows` is meant to be some percentage of one thread's portion of the L3 cache. This is a good unit of parallelism, but it is not ideal for processing because we'd rather use the L2 cache, since we are making three passes across `batch_of_128k_rows`. Second, each of those `auto ... =` lines is allocating new memory. This is not ideal because we'd like to avoid excess allocation if possible.
>>>>
>>>> To solve the first problem we are moving towards the "morsel/batch" model[1]. This means we have two "batch" sizes. The outer batch (ironically, the morsel) is the largest and is the one used for determining parallelism. The inner batch should be smaller (size based on L2).
>>>>
>>>> To solve the second problem a number of solutions have been proposed (thread-local buffer pools, thread-local buffer stacks, etc.) and we will hopefully adopt one at some point. So the above code snippet would hopefully become something like:
>>>>
>>>> ```
>>>> thread_local auto lt_one_hundred = allocate_array(l2_sized_batch_size, bool)
>>>> thread_local auto gt_zero = allocate_array(l2_sized_batch_size, bool)
>>>> thread_local auto filter_pred = allocate_array(l2_sized_batch_size, bool)
>>>> for batch_of_128k_rows in data:
>>>>     for l2_sized_batch in batch_of_128k_rows:
>>>>         less_than(l2_sized_batch, 100, &lt_one_hundred)
>>>>         greater_than(l2_sized_batch, 0, &gt_zero)
>>>>         and(lt_one_hundred, gt_zero, &filter_pred)
>>>>         consume(filter(l2_sized_batch, filter_pred))
>>>> ```
>>>>
>>>> There is still a fair amount of work to do before we get here, but I hope this gives you some idea of the direction we are headed.
>>>>
>>>> [1] https://db.in.tum.de/~leis/papers/morsels.pdf
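For readers following along in Python, here is a minimal pyarrow sketch of the same `x < 100 && x > 0` filter, mirroring the kernel-at-a-time evaluation the first pseudocode block above approximates; each compute call makes its own pass over the data and materializes an intermediate array:

```
import pyarrow as pa
import pyarrow.compute as pc

x = pa.array(range(-50, 200))

# Three separate kernel calls, each allocating a new boolean array, followed
# by the filter itself: roughly the per-batch work described above.
lt_one_hundred = pc.less(x, 100)
gt_zero = pc.greater(x, 0)
filter_pred = pc.and_(lt_one_hundred, gt_zero)
filtered = x.filter(filter_pred)

print(filtered.slice(0, 5))  # values 1 through 5
```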
>>>>> On Tue, May 24, 2022 at 6:27 AM Shawn Yang <[email protected]> wrote:
>>>>>
>>>>> Hi Ian, thank you for your reply, which recaps the history of arrow compute. Those links are very valuable for me to understand arrow compute internals. I took a quick look at those documents and will take a deeper look later. I have another question: does arrow compute support loop fusion, which executes multiple vectorized operands in one loop? This is very common in dataframe computing; our engine can extract those expressions into a dag/tree. If arrow compute supports loop fusion, the performance would be very promising.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On May 24, 2022, at 4:49 AM, Ian Cook <[email protected]> wrote:
>>>>>>
>>>>>> Hi Shawn,
>>>>>>
>>>>>> In March of 2021, when major work on the C++ query execution machinery in Arrow was beginning, Wes sent a message [1] to the dev list and linked to a doc [2] with some details about the planned design. A few months later Neal sent an update [3] about this work. However, those documents are now somewhat out of date. More recently, Wes shared another update [4] and linked to a doc [5] regarding task execution / control flow / scheduling. However, I think the best source of information is the doc you linked to. The query execution work has proceeded organically with many contributors, and efforts to document the overall design in sufficient detail have not kept pace.
>>>>>>
>>>>>> Regarding benchmarks: there has been extensive work done using Conbench [6] as part of the Arrow CI infrastructure to benchmark commits, for purposes of avoiding / identifying performance regressions and measuring efforts to improve performance. However, I am not aware of any efforts to produce and publicly share benchmarks for the purpose of comparing performance vs. other query engines.
>>>>>>
>>>>>> There is a proposal [7] to give the name "Acero" to the Arrow C++ compute engine, so in the future you will likely see it referred to by that name. I think that having a clearer name for this will motivate more efforts to write and share more about it.
>>>>>>
>>>>>> Ian
>>>>>>
>>>>>> [1] https://lists.apache.org/thread/n632pmjnb85o49lyxy45f7sgh4cshoc0
>>>>>> [2] https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ/
>>>>>> [3] https://lists.apache.org/thread/3pmb592zmonz86nmmbjcw08j5tcrfzm1
>>>>>> [4] https://lists.apache.org/thread/ltllzpt1r2ch06mv1ngfgdl7wv2tm8xc
>>>>>> [5] https://docs.google.com/document/d/1216CUQZ7u4acZvC2jX7juqqQCXtdXMellk3lRrgP_WY/
>>>>>> [6] https://conbench.ursa.dev
>>>>>> [7] https://lists.apache.org/thread/7v7vkc005v9343n49b3shvrdn19wdpj1
>>>>>>
>>>>>>> On Mon, May 23, 2022 at 10:58 AM Shawn Yang <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi, I'm considering using arrow compute as an execution kernel for our distributed dataframe framework. I already read the great doc https://arrow.apache.org/docs/cpp/compute.html, but it is a usage doc. Is there any design doc, internals introduction, or benchmarks for arrow compute so I can quickly understand how arrow compute works, what can be done and what should be done by it? Or should I just read the code in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
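As a small, hands-on complement to the docs linked above, here is a sketch of exploring what the compute function registry offers from Python; it assumes a reasonably recent pyarrow where `pc.list_functions()` and `pc.get_function()` are available:

```
import pyarrow.compute as pc

# Enumerate the registered compute functions, then inspect one of them.
# This is a quick way to see what "can be done" by the kernels today.
names = pc.list_functions()
print(len(names), "compute functions registered")

multiply = pc.get_function("multiply")
print(multiply.name, multiply.kind, multiply.num_kernels)
```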
