I see, thanks. I'll do more tests and dive deeper into the arrow compute code.

Sent from my iPhone
> On Jun 9, 2022, at 5:30 PM, Weston Pace <[email protected]> wrote:
>
>> Hi, do you guys know which functions support vectorized SIMD in arrow compute?
>
> I don't know that anyone has done a fully systematic analysis of which kernels do and do not support SIMD at the moment. The kernels are still in flux. There is an active effort to reduce overhead[1], which is the top priority, as this could possibly have more impact on performance than SIMD when running expressions involving multiple kernels across multiple threads.
>
>> I only found very few functions that support vectorized SIMD:
>> ● bloom filter: avx2
>> ● key compare: avx2
>> ● key hash: avx2
>> ● key map: avx2
>>
>> Does scalar operation support vectorized SIMD?
>
> A lack of explicit vectorization instructions does not mean a lack of SIMD support. For many kernels we expect modern compilers to be smart enough to implement vectorization automatically, as long as the data is provided in a vectorized fashion (e.g. columnar) and the kernel is simple enough. For more complex kernels there are options such as xsimd, but this hasn't yet been explored very thoroughly. At the moment I'm not aware of anyone writing explicitly vectorized kernels, as this tends to be rather hardware specific and have a small return on investment. Instead, we benchmark regularly and have micro-optimized certain critical sections (e.g. some of the hash stuff).
>
>> I tested with numpy and found arrow is ten times slower:
>
> That result you posted appears to be 3.5x slower. You might want to double check and ensure that Arrow was compiled with the appropriate architecture (the cmake files are generally good at figuring this out), but I wouldn't be too surprised if this was the case. Some of this might be unavoidable. For example, does numpy support null values (I don't know for sure, but I seem to recall it does not)? Some of this might be an inefficiency or overhead problem in Arrow-C++. It is possible that the multiply kernel is not being vectorized correctly by the compiler, but I don't think those numbers alone are enough proof of that.
>
> Performance can be quite tricky. It is important for us, but Arrow's compute functionality is still relatively new compared to numpy, and work on performance is balanced with work on features.
>
> [1] https://lists.apache.org/thread/rh10ykcolt0gxydhgv4vxk2m7ktwx5mh
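As a quick way to act on the suggestion above to check how the installed Arrow was built, here is a minimal sketch (assuming a pyarrow version recent enough to expose `runtime_info()`) that reports the SIMD level the build is using and the level detected on the CPU:

```
import pyarrow as pa

# Compare the SIMD level pyarrow is using against the level detected on this
# CPU; a mismatch suggests the build or an environment override is limiting
# vectorization.
info = pa.runtime_info()
print("SIMD level in use:  ", info.simd_level)
print("SIMD level detected:", info.detected_simd_level)
```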
>> On Wed, Jun 8, 2022 at 11:08 PM Shawn Yang <[email protected]> wrote:
>>
>> Hi, do you guys know which functions support vectorized SIMD in arrow compute? After a quick look at the arrow compute cpp code, I only found very few functions that support vectorized SIMD:
>> ● bloom filter: avx2
>> ● key compare: avx2
>> ● key hash: avx2
>> ● key map: avx2
>>
>> Does scalar operation support vectorized SIMD?
>>
>> I tested with numpy and found arrow is ten times slower:
>>
>> def test_multiply(rows=5000000):
>>     a = pa.array(list(range(rows, 0, -1)))
>>     b = pa.array(list(range(rows)))
>>     import pyarrow.compute as pc
>>
>>     print("arrow multiply took", timeit.timeit(
>>         lambda: pc.multiply(a, b), number=3))
>>     a = np.array(list(range(rows, 0, -1)))
>>     b = np.array(list(range(rows)))
>>     print("numpy multiply took", timeit.timeit(
>>         lambda: a * b, number=3))
>>     # arrow multiply took 0.14826057000000015
>>     # numpy multiply took 0.04047051300000071
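Following Weston's suggestion to double check, here is a slightly tightened variant of the benchmark above; this is only a sketch, and the numbers will vary by machine and build. Both inputs are built from the same contiguous int64 numpy data (with no nulls on the Arrow side), and `timeit.repeat` is used to reduce run-to-run noise:

```
import timeit

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc


def bench_multiply(rows=5_000_000, number=3, repeat=5):
    # Build both inputs from the same contiguous int64 buffers so the timed
    # section measures only the multiply itself.
    np_a = np.arange(rows, 0, -1, dtype=np.int64)
    np_b = np.arange(rows, dtype=np.int64)
    pa_a = pa.array(np_a)  # no validity (null) bitmap
    pa_b = pa.array(np_b)

    arrow_t = min(timeit.repeat(lambda: pc.multiply(pa_a, pa_b),
                                number=number, repeat=repeat))
    numpy_t = min(timeit.repeat(lambda: np_a * np_b,
                                number=number, repeat=repeat))
    print("arrow multiply took", arrow_t)
    print("numpy multiply took", numpy_t)


bench_multiply()
```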
>>> On Wed, May 25, 2022 at 10:09 PM Shawn Yang <[email protected]> wrote:
>>>
>>> I see, the key for the multiple loops is to ensure the data can be held in the L2 cache, so that later calculations can process a batch without reading from main memory, and we can record the exec stats for every batch and do better local task scheduling based on those stats. Thanks a lot. Morsels are new to me; a very interesting idea.
>>>
>>> Sent from my iPhone
>>>
>>>> On May 25, 2022, at 7:23 AM, Weston Pace <[email protected]> wrote:
>>>>
>>>> There are a few levels of loops: two at the moment and three in the future. Some are fused and some are not. What we have right now is at an early stage, is not ideal, and there are people investigating and working on improvements. I can speak a little bit about where we want to go. An example may be helpful.
>>>>
>>>> For example, given a filter "x < 100 && x > 0" we have something like (this is an approximation of the work done by arrow::compute::ExecuteScalarExpression and not actual code):
>>>>
>>>> ```
>>>> for batch_of_128k_rows in data:
>>>>     auto lt_one_hundred = less_than(batch_of_128k_rows, 100)
>>>>     auto gt_zero = greater_than(batch_of_128k_rows, 0)
>>>>     auto filter_pred = and(lt_one_hundred, gt_zero)
>>>>     consume(filter(batch_of_128k_rows, filter_pred))
>>>> ```
>>>>
>>>> There are two big things we need to fix here. First, `batch_of_128k_rows` is meant to be some percentage of one thread's portion of the L3 cache. This is a good unit of parallelism, but it is not ideal for processing because we'd rather use the L2 cache, since we are making three passes across `batch_of_128k_rows`. Second, each of those `auto ... =` lines is allocating new memory. This is not ideal because we'd like to avoid excess allocation if possible.
>>>>
>>>> To solve the first problem we are moving towards the "morsel/batch" model[1]. This means we have two "batch" sizes. The outer batch (ironically, the morsel) is the largest and is the one used for determining parallelism. The inner batch should be smaller (size based on L2).
>>>>
>>>> To solve the second problem a number of solutions have been proposed (thread-local buffer pools, thread-local buffer stacks, etc.) and we will hopefully adopt one at some point. So the above code snippet would hopefully become something like:
>>>>
>>>> ```
>>>> thread_local auto lt_one_hundred = allocate_array(l2_sized_batch_size, bool)
>>>> thread_local auto gt_zero = allocate_array(l2_sized_batch_size, bool)
>>>> thread_local auto filter_pred = allocate_array(l2_sized_batch_size, bool)
>>>> for batch_of_128k_rows in data:
>>>>     for l2_sized_batch in batch_of_128k_rows:
>>>>         less_than(l2_sized_batch, 100, &lt_one_hundred)
>>>>         greater_than(l2_sized_batch, 0, &gt_zero)
>>>>         and(lt_one_hundred, gt_zero, &filter_pred)
>>>>         consume(filter(l2_sized_batch, filter_pred))
>>>> ```
>>>>
>>>> There is still a fair amount of work to do before we get here, but I hope this gives you some idea of the direction we are headed.
>>>>
>>>> [1] https://db.in.tum.de/~leis/papers/morsels.pdf
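For readers following along in Python, here is a minimal pyarrow sketch of the same `x < 100 && x > 0` filter, mirroring the kernel-at-a-time evaluation the first pseudocode block above approximates; each compute call makes its own pass over the data and materializes an intermediate array:

```
import pyarrow as pa
import pyarrow.compute as pc

x = pa.array(range(-50, 200))

# Three separate kernel calls, each allocating a new boolean array, followed
# by the filter itself: roughly the per-batch work described above.
lt_one_hundred = pc.less(x, 100)
gt_zero = pc.greater(x, 0)
filter_pred = pc.and_(lt_one_hundred, gt_zero)
filtered = x.filter(filter_pred)

print(filtered.slice(0, 5))  # values 1 through 5
```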
>>>>> On Tue, May 24, 2022 at 6:27 AM Shawn Yang <[email protected]> wrote:
>>>>>
>>>>> Hi Ian, thank you for your reply, which recaps the history of arrow compute. Those links are very valuable for me to understand arrow compute internals. I took a quick look at those documents and will take a deeper look later. I have another question: does arrow compute support loop fusion, which executes multiple vectorized operands in one loop? This is very common in dataframe computing; our engine can extract those expressions into a dag/tree. If arrow compute supports loop fusion, the performance would be very promising.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On May 24, 2022, at 4:49 AM, Ian Cook <[email protected]> wrote:
>>>>>>
>>>>>> Hi Shawn,
>>>>>>
>>>>>> In March of 2021, when major work on the C++ query execution machinery in Arrow was beginning, Wes sent a message [1] to the dev list and linked to a doc [2] with some details about the planned design. A few months later Neal sent an update [3] about this work. However, those documents are now somewhat out of date. More recently, Wes shared another update [4] and linked to a doc [5] regarding task execution / control flow / scheduling. However, I think the best source of information is the doc you linked to. The query execution work has proceeded organically with many contributors, and efforts to document the overall design in sufficient detail have not kept pace.
>>>>>>
>>>>>> Regarding benchmarks: there has been extensive work done using Conbench [6] as part of the Arrow CI infrastructure to benchmark commits, for purposes of avoiding / identifying performance regressions and measuring efforts to improve performance. However, I am not aware of any efforts to produce and publicly share benchmarks for the purpose of comparing performance vs. other query engines.
>>>>>>
>>>>>> There is a proposal [7] to give the name "Acero" to the Arrow C++ compute engine, so in the future you will likely see it referred to by that name. I think that having a clearer name for this will motivate more efforts to write and share more about it.
>>>>>>
>>>>>> Ian
>>>>>>
>>>>>> [1] https://lists.apache.org/thread/n632pmjnb85o49lyxy45f7sgh4cshoc0
>>>>>> [2] https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ/
>>>>>> [3] https://lists.apache.org/thread/3pmb592zmonz86nmmbjcw08j5tcrfzm1
>>>>>> [4] https://lists.apache.org/thread/ltllzpt1r2ch06mv1ngfgdl7wv2tm8xc
>>>>>> [5] https://docs.google.com/document/d/1216CUQZ7u4acZvC2jX7juqqQCXtdXMellk3lRrgP_WY/
>>>>>> [6] https://conbench.ursa.dev
>>>>>> [7] https://lists.apache.org/thread/7v7vkc005v9343n49b3shvrdn19wdpj1
>>>>>>
>>>>>>> On Mon, May 23, 2022 at 10:58 AM Shawn Yang <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi, I'm considering using arrow compute as an execution kernel for our distributed dataframe framework. I already read the great doc https://arrow.apache.org/docs/cpp/compute.html, but it is a usage doc. Is there any design doc, internals introduction, or benchmarks for arrow compute so I can quickly understand how arrow compute works, what can be done and what should be done by it? Or should I just read the code in https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
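As a small, hands-on complement to the docs linked above, here is a sketch of exploring what the compute function registry offers from Python; it assumes a reasonably recent pyarrow where `pc.list_functions()` and `pc.get_function()` are available:

```
import pyarrow.compute as pc

# Enumerate the registered compute functions, then inspect one of them.
# This is a quick way to see what "can be done" by the kernels today.
names = pc.list_functions()
print(len(names), "compute functions registered")

multiply = pc.get_function("multiply")
print(multiply.name, multiply.kind, multiply.num_kernels)
```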
