Hello,
As part of https://github.com/apache/arrow/pull/7314, a discussion started about our strategy for adding SIMD optimizations to various routines and kernels.

Currently, we have no defined strategy, and we have been adding hand-written SIMD-optimized functions for particular primitives and instruction sets, thanks to the submissions of contributors. For example, the above PR adds ~500 lines of code for the purpose of accelerating the SUM kernel, when the input has no nulls, on the SSE instruction set.

However, it seems that this ad hoc approach may not scale very well. There are several widely-used SIMD instruction sets out there (the most common being SSE[2], AVX[2], AVX512, Neon... I suppose ARM SVE will come into play at some point), and there will be many potential functions to optimize once we start writing a comprehensive library of computation kernels. Adding hand-written implementations, using intrinsic functions, for each {routine, instruction set} pair threatens to create a large maintenance burden.

In that PR, I suggested that we instead take a look at the SIMD wrapper libraries available in C++. There are several available:

* MIPP (https://github.com/aff3ct/MIPP)
* Vc (https://github.com/VcDevel/Vc)
* libsimdpp (https://github.com/p12tic/libsimdpp)
* (and others)

In the course of the discussion, an interesting paper was mentioned:
https://dl.acm.org/doi/pdf/10.1145/3178433.3178435

together with an implementation comparison of a simple function:
https://gitlab.inria.fr/acassagn/mandelbrot

The SIMD wrappers met skepticism from Frank, the PR submitter, on the basis that performance may not be optimal and that not all desired features may be provided (such as runtime dispatching). However, we also have to take into account that, without a wrapper library, we will probably only integrate and maintain a small fraction of the optimized routines that would otherwise be possible with a more abstracted approach.
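
To make the {routine, instruction set} pairing concrete, here is a minimal sketch of what one such pair looks like for a no-nulls sum over int64 values. It is not taken from the PR: the function names (SumScalar, SumSse2, Sum) and the SKETCH_HAVE_SSE2 macro are hypothetical, and the selection is done at compile time only, whereas a real kernel would presumably also want runtime dispatching. Each additional instruction set (AVX2, AVX512, Neon...) would need its own variant of the middle function.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1  // hypothetical macro, for this sketch only
#endif

// Portable scalar version: a single implementation for every target.
int64_t SumScalar(const int64_t* values, size_t n) {
  int64_t sum = 0;
  for (size_t i = 0; i < n; ++i) sum += values[i];
  return sum;
}

#ifdef SKETCH_HAVE_SSE2
// Hand-written SSE2 version: two 64-bit lanes per 128-bit register.
int64_t SumSse2(const int64_t* values, size_t n) {
  __m128i acc = _mm_setzero_si128();
  size_t i = 0;
  for (; i + 2 <= n; i += 2) {
    // Unaligned load, since the input pointer carries no alignment guarantee.
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(values + i));
    acc = _mm_add_epi64(acc, v);
  }
  // Horizontal reduction of the two lanes.
  int64_t lanes[2];
  _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
  int64_t sum = lanes[0] + lanes[1];
  // Scalar tail for an odd trailing element.
  for (; i < n; ++i) sum += values[i];
  return sum;
}
#endif

// Compile-time selection only; no runtime CPU feature detection here.
int64_t Sum(const int64_t* values, size_t n) {
#ifdef SKETCH_HAVE_SSE2
  return SumSse2(values, n);
#else
  return SumScalar(values, n);
#endif
}
```

Even for this trivial routine, the intrinsic version has to handle loads, the horizontal reduction and the tail by hand, and none of that code carries over to another instruction set.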
So, while the hand-written approach can be better on a single {routine, instruction set} pair, it may lead to a globally suboptimal situation (that is, unless the number of full-time developers and maintainers on Arrow C++ inflates significantly).

Personally, I would like interested developers and contributors (such as Micah, Frank, Yibo Cai) to hash out the various possible approaches, and propose a way forward (which may be hybrid).

Regards

Antoine.