Hi, do you know which functions support vectorized SIMD in Arrow compute?
After a quick look at the Arrow compute C++ code, I only found a few
functions with explicit SIMD implementations:
● bloom filter: avx2
● key compare: avx2
● key hash: avx2
● key map: avx2
Do scalar operations support vectorized SIMD?
I tested against numpy and found Arrow several times slower (about 3-4x in
the test below):
import timeit
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def test_multiply(rows=5000000):
    # Arrow: element-wise multiply via the compute kernel
    a = pa.array(list(range(rows, 0, -1)))
    b = pa.array(list(range(rows)))
    print("arrow multiply took", timeit.timeit(
        lambda: pc.multiply(a, b), number=3))
    # numpy: same data and the same multiply for comparison
    a = np.array(list(range(rows, 0, -1)))
    b = np.array(list(range(rows)))
    print("numpy multiply took", timeit.timeit(
        lambda: a * b, number=3))

test_multiply()
# arrow multiply took 0.14826057000000015
# numpy multiply took 0.04047051300000071
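Is there a recommended way to verify which SIMD level my pyarrow build
detects and actually uses? I was going to check it like this (assuming
pa.runtime_info() is the right API for this; the field names may differ
between versions):

import pyarrow as pa

# Print the SIMD level the Arrow C++ runtime detected on this CPU and the
# level it is actually using (these can differ if the SIMD level was
# limited at build time or at runtime).
info = pa.runtime_info()
print("detected SIMD level:", info.detected_simd_level)
print("SIMD level in use:  ", info.simd_level)
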
On Wed, May 25, 2022 at 10:09 PM Shawn Yang <[email protected]> wrote:
> I see, the key to the multiple loops is to ensure the data can be held in the
> L2 cache, so that later
> calculations can process the batch without reading from main memory,
> and we can record the exec stats for every batch and do better local task
> scheduling based on those stats. Thanks a lot. Morsels are new to me, very
> interesting ideas.
> Sent from my iPhone
>
> > On May 25, 2022, at 7:23 AM, Weston Pace <[email protected]> wrote:
> >
> > There are a few levels of loops. Two at the moment and three in the
> > future. Some are fused and some are not. What we have right now is
> > early stages, is not ideal, and there are people investigating and
> > working on improvements. I can speak a little bit about where we want
> > to go. An example may be helpful.
> >
> > For example, given a filter "x < 100 && x > 0" we have something like
> > (this is an approximation of the work done by
> > arrow::compute::ExecuteScalarExpression and not actual code):
> >
> > ```
> > for batch_of_128k_rows in data:
> >   auto lt_one_hundred = less_than(batch_of_128k_rows, 100)
> >   auto gt_zero = greater_than(batch_of_128k_rows, 0)
> >   auto filter_pred = and(lt_one_hundred, gt_zero)
> >   consume(filter(batch_of_128k_rows, filter_pred))
> > ```
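> >
> > (Purely for illustration, here is roughly the same filter expressed
> > through the pyarrow bindings; this is not the internal
> > ExecuteScalarExpression code path, but each compute call materializes a
> > new array, just like the `auto ... =` lines above.)
> >
> > ```
> > import pyarrow as pa
> > import pyarrow.compute as pc
> >
> > x = pa.array(range(1000))
> > # each call below allocates and fills a new result array
> > lt_one_hundred = pc.less(x, 100)
> > gt_zero = pc.greater(x, 0)
> > filter_pred = pc.and_(lt_one_hundred, gt_zero)
> > print(pc.filter(x, filter_pred))   # rows where 0 < x < 100
> > ```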
> >
> > There are two big things we need to fix here. First,
> > `batch_of_128k_rows` is meant to be some percentage of one thread's
> > portion of the L3 cache. This is a good unit of parallelism but it is
> > not ideal for processing because we'd rather use the L2 cache since we
> > are making three passes across `batch_of_128k_rows`. Second, each of
> > those `auto ... =` lines is allocating new memory. This is not ideal
> > because we'd like to avoid excess allocation if possible.
> >
> > To solve the first problem we are moving towards the "morsel/batch"
> > model[1]. This means we have two "batch" sizes. The outer batch
> > (ironically, the morsel) is the largest and is the one used for
> > determining parallelism. The inner batch should be smaller (size
> > based on L2).
> >
> > To solve the second problem a number of solutions have been proposed
> > (thread-local buffer pools, thread-local buffer stacks, etc.) and we
> > will hopefully adopt one at some point. So the above code snippet
> > would hopefully become something like:
> >
> > ```
> > thread_local auto lt_one_hundred = allocate_array(l2_sized_batch_size, bool)
> > thread_local auto gt_zero = allocate_array(l2_sized_batch_size, bool)
> > thread_local auto filter_pred = allocate_array(l2_sized_batch_size, bool)
> > for batch_of_128k_rows in data:
> >   for l2_sized_batch in batch_of_128k_rows:
> >     less_than(l2_sized_batch, 100, &lt_one_hundred)
> >     greater_than(l2_sized_batch, 0, &gt_zero)
> >     and(lt_one_hundred, gt_zero, &filter_pred)
> >     consume(filter(l2_sized_batch, filter_pred))
> > ```
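> >
> > Here is a rough toy of that shape in Python/numpy (not Arrow code, and
> > the sizes are made up) just to show the two-level loop with reused
> > scratch buffers:
> >
> > ```
> > import numpy as np
> >
> > MORSEL_SIZE = 128 * 1024   # outer unit: drives parallelism (L3-sized)
> > BATCH_SIZE = 16 * 1024     # inner unit: sized to stay in L2 (made up)
> >
> > data = np.random.randint(-50, 200, size=1_000_000)
> >
> > # scratch buffers allocated once (per thread) and reused for every batch
> > lt = np.empty(BATCH_SIZE, dtype=bool)
> > gt = np.empty(BATCH_SIZE, dtype=bool)
> > pred = np.empty(BATCH_SIZE, dtype=bool)
> >
> > total = 0
> > for m in range(0, len(data), MORSEL_SIZE):
> >     morsel = data[m:m + MORSEL_SIZE]
> >     for b in range(0, len(morsel), BATCH_SIZE):
> >         batch = morsel[b:b + BATCH_SIZE]
> >         n = len(batch)
> >         np.less(batch, 100, out=lt[:n])
> >         np.greater(batch, 0, out=gt[:n])
> >         np.logical_and(lt[:n], gt[:n], out=pred[:n])
> >         total += batch[pred[:n]].size   # stand-in for consume(...)
> > print(total)
> > ```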
> >
> > There is still a fair amount of work to do before we get here but I
> > hope this gives you some idea of the direction we are headed.
> >
> > [1] https://db.in.tum.de/~leis/papers/morsels.pdf
> >
> >> On Tue, May 24, 2022 at 6:27 AM Shawn Yang <[email protected]> wrote:
> >>
> >> Hi Ian, thank you for your reply, which recaps the history of Arrow
> >> compute. Those links are very valuable for understanding Arrow compute
> >> internals. I took a quick look at those documents and will take a deeper
> >> look later. I have another question: does Arrow compute support loop
> >> fusion, i.e. executing multiple vectorized operations in one loop? This
> >> is very common in dataframe computing; our engine can extract those
> >> expressions into a dag/tree. If Arrow compute supports loop fusion, the
> >> performance would be very promising.
> >>
> >> Sent from my iPhone
> >>
> >>>> On May 24, 2022, at 4:49 AM, Ian Cook <[email protected]> wrote:
> >>>
> >>> Hi Shawn,
> >>>
> >>> In March of 2021, when major work on the C++ query execution machinery
> >>> in Arrow was beginning, Wes sent a message [1] to the dev list and
> >>> linked to a doc [2] with some details about the planned design. A few
> >>> months later Neal sent an update [3] about this work. However those
> >>> documents are now somewhat out of date. More recently, Wes shared
> >>> another update [4] and linked to a doc [5] regarding task execution /
> >>> control flow / scheduling. However I think the best source of
> >>> information is the doc you linked to. The query execution work has
> >>> proceeded organically with many contributors, and efforts to document
> >>> the overall design in sufficient detail have not kept pace.
> >>>
> >>> Regarding benchmarks: There has been extensive work done using
> >>> Conbench [6] as part of the Arrow CI infrastructure to benchmark
> >>> commits, for purposes of avoiding / identifying performance
> >>> regressions and measuring efforts to improve performance. However I am
> >>> not aware of any efforts to produce and publicly share benchmarks for
> >>> the purpose of comparing performance vs. other query engines.
> >>>
> >>> There is a proposal [7] to give the name "Acero" to the Arrow C++
> >>> compute engine, so in the future you will likely see it referred to by
> >>> that name. I think that having a clearer name for this will motivate
> >>> more efforts to write and share more about it.
> >>>
> >>> Ian
> >>>
> >>> [1] https://lists.apache.org/thread/n632pmjnb85o49lyxy45f7sgh4cshoc0
> >>> [2] https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ/
> >>> [3] https://lists.apache.org/thread/3pmb592zmonz86nmmbjcw08j5tcrfzm1
> >>> [4] https://lists.apache.org/thread/ltllzpt1r2ch06mv1ngfgdl7wv2tm8xc
> >>> [5] https://docs.google.com/document/d/1216CUQZ7u4acZvC2jX7juqqQCXtdXMellk3lRrgP_WY/
> >>> [6] https://conbench.ursa.dev
> >>> [7] https://lists.apache.org/thread/7v7vkc005v9343n49b3shvrdn19wdpj1
> >>>
> >>>
> >>>
> >>>> On Mon, May 23, 2022 at 10:58 AM Shawn Yang <[email protected]> wrote:
> >>>>
> >>>> Hi, I'm considering using Arrow compute as an execution kernel for
> >>>> our distributed dataframe framework. I already read the great doc:
> >>>> https://arrow.apache.org/docs/cpp/compute.html, but it is a usage doc.
> >>>> Is there any design doc, internals introduction, or benchmarks for
> >>>> Arrow compute, so I can quickly understand how Arrow compute works,
> >>>> what it can do, and what should be done by it? Or should I just read
> >>>> the code in
> >>>> https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
> >>>>
> >>>>
>