Hello,

As part of https://github.com/apache/arrow/pull/7314, a discussion
started about our strategy for adding SIMD optimizations to various
routines and kernels.

Currently, we have no defined strategy and we have been adding
hand-written SIMD-optimized functions for particular primitives and
instruction sets, thanks to the submissions of contributors.  For
example, the above PR adds ~500 lines of code for the purpose of
accelerating the SUM kernel, when the input has no nulls, on the SSE
instruction set.

However, it seems that this ad hoc approach may not scale very well.
There are several widely-used SIMD instruction sets out there (the most
common being SSE[2], AVX[2], AVX512, Neon... I suppose ARM SVE will come
into play at some point), and there will be many potential functions to
optimize once we start writing a comprehensive library of computation
kernels.  Adding hand-written implementations, using intrinsic
functions, for each {routine, instruction set} pair threatens to create
a large maintenance burden.
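
To make the maintenance concern concrete, here is a minimal sketch of what
one such {routine, instruction set} pair can look like. It is not the code
from the PR above: the function names (SumScalar, SumSse2, SumNoNulls) are
hypothetical, and it only assumes an int32 input without nulls and an int64
accumulator. Each further instruction set (AVX2, AVX512, Neon...) would need
its own variant of the middle function.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Portable scalar baseline: works on every target.
inline int64_t SumScalar(const int32_t* values, std::size_t n) {
  int64_t total = 0;
  for (std::size_t i = 0; i < n; ++i) total += values[i];
  return total;
}

#if defined(__SSE2__)
// Hand-written SSE2 variant (hypothetical name). It processes four int32
// values per iteration and accumulates them in two 64-bit lanes.
inline int64_t SumSse2(const int32_t* values, std::size_t n) {
  __m128i acc = _mm_setzero_si128();
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    const __m128i v =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(values + i));
    // SSE2 has no 32->64 sign-extension instruction, so build the upper
    // halves from the sign bits and interleave them with the values.
    const __m128i sign = _mm_srai_epi32(v, 31);
    const __m128i lo = _mm_unpacklo_epi32(v, sign);
    const __m128i hi = _mm_unpackhi_epi32(v, sign);
    acc = _mm_add_epi64(acc, _mm_add_epi64(lo, hi));
  }
  // Horizontal reduction of the two 64-bit lanes.
  alignas(16) int64_t lanes[2];
  _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
  int64_t total = lanes[0] + lanes[1];
  // Scalar tail for the remaining 0..3 values.
  for (; i < n; ++i) total += values[i];
  return total;
}
#endif

// Compile-time selection; falls back to the scalar loop without SSE2.
inline int64_t SumNoNulls(const int32_t* values, std::size_t n) {
#if defined(__SSE2__)
  return SumSse2(values, n);
#else
  return SumScalar(values, n);
#endif
}
```

Even in this reduced form, the SSE2 variant is several times longer than the
scalar loop and has its own tail handling to test.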

In that PR, I suggested that we instead take a look at the SIMD wrapper
libraries available in C++.  There are several available:
* MIPP (https://github.com/aff3ct/MIPP)
* Vc (https://github.com/VcDevel/Vc)
* libsimdpp (https://github.com/p12tic/libsimdpp)
* (and others)
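
As a rough illustration of the wrapper idea, the sketch below defines a tiny
vector type with an SSE2 backend and a scalar fallback, and writes the kernel
once against that type. Int32x4 and AddArrays are hypothetical names invented
for this example; this is not the API of MIPP, Vc or libsimdpp, which are
far more complete.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical wrapper type holding four 32-bit integer lanes. Only this
// type knows about the instruction set; kernels are written against it.
struct Int32x4 {
#if defined(__SSE2__)
  __m128i reg;
  static Int32x4 Load(const int32_t* p) {
    return {_mm_loadu_si128(reinterpret_cast<const __m128i*>(p))};
  }
  Int32x4 operator+(Int32x4 other) const {
    return {_mm_add_epi32(reg, other.reg)};
  }
  void Store(int32_t* p) const {
    _mm_storeu_si128(reinterpret_cast<__m128i*>(p), reg);
  }
#else
  int32_t lanes[4];
  static Int32x4 Load(const int32_t* p) {
    return {{p[0], p[1], p[2], p[3]}};
  }
  Int32x4 operator+(Int32x4 other) const {
    Int32x4 out;
    for (int i = 0; i < 4; ++i) out.lanes[i] = lanes[i] + other.lanes[i];
    return out;
  }
  void Store(int32_t* p) const {
    for (int i = 0; i < 4; ++i) p[i] = lanes[i];
  }
#endif
};

// Kernel written once: element-wise addition of two int32 arrays.
// Assumes the sums fit in int32.
inline void AddArrays(const int32_t* a, const int32_t* b, int32_t* out,
                      std::size_t n) {
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    (Int32x4::Load(a + i) + Int32x4::Load(b + i)).Store(out + i);
  }
  for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar tail
}
```

Supporting another instruction set then means adding one more backend to the
wrapper type, rather than rewriting every kernel.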

In the course of the discussion, an interesting paper was mentioned:
https://dl.acm.org/doi/pdf/10.1145/3178433.3178435
together with an implementation comparison of a simple function:
https://gitlab.inria.fr/acassagn/mandelbrot

The SIMD wrappers met skepticism from Frank, the PR submitter, on the
basis that performance may not be optimal and that not all desired
features may be provided (such as runtime dispatching).
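
For readers unfamiliar with runtime dispatching, here is a minimal sketch of
the technique, assuming GCC or Clang on x86 (elsewhere it simply uses the
generic loop). It relies on the compiler builtins __builtin_cpu_init and
__builtin_cpu_supports and on the target("avx2") function attribute; the
names SumGeneric, SumAvx2, ResolveSum and DispatchedSum are hypothetical,
and this is not how Arrow or the PR implements it.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: the dispatch path is only enabled for GCC/Clang on x86.
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define SKETCH_HAVE_X86_DISPATCH 1
#else
#define SKETCH_HAVE_X86_DISPATCH 0
#endif

using SumFn = int64_t (*)(const int32_t*, std::size_t);

// Baseline compiled for the default target.
inline int64_t SumGeneric(const int32_t* values, std::size_t n) {
  int64_t total = 0;
  for (std::size_t i = 0; i < n; ++i) total += values[i];
  return total;
}

#if SKETCH_HAVE_X86_DISPATCH
// Same loop, but this one function is compiled with AVX2 enabled, so the
// compiler may vectorize it with AVX2 instructions. It must only be called
// on CPUs that actually support AVX2.
__attribute__((target("avx2")))
inline int64_t SumAvx2(const int32_t* values, std::size_t n) {
  int64_t total = 0;
  for (std::size_t i = 0; i < n; ++i) total += values[i];
  return total;
}
#endif

// Query the CPU and pick an implementation.
inline SumFn ResolveSum() {
#if SKETCH_HAVE_X86_DISPATCH
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) return SumAvx2;
#endif
  return SumGeneric;
}

// Public entry point: the implementation is resolved once, on first call.
inline int64_t DispatchedSum(const int32_t* values, std::size_t n) {
  static const SumFn impl = ResolveSum();
  return impl(values, n);
}
```

The same binary can then run on machines with and without AVX2, at the cost
of one indirect call per kernel invocation.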

However, we also have to take into account that, without a wrapper
library, we will probably only integrate and maintain a small fraction
of the optimized routines that would otherwise be possible with a more
abstracted approach.  So, while the hand-written approach can be better
on a single {routine, instruction set} pair, it may lead to a globally
suboptimal situation (that is, unless the number of full-time developers
and maintainers on Arrow C++ inflates significantly).

Personally, I would like interested developers and contributors (such as
Micah, Frank, Yibo Cai) to hash out the various possible approaches, and
propose a way forward (which may be hybrid).

Regards

Antoine.
