Hi,
I would recommend against reinventing the wheel. It would be possible to
reuse an existing C++ SIMD library. There are several of them (Vc, xsimd,
libsimdpp...). Of course, "just use Gandiva" is another possible answer.

Regards

Antoine.

On 20/12/2019 at 08:32, Yibo Cai wrote:
> Hi,
>
> I'm investigating SIMD support for the C++ compute kernels (not Gandiva).
>
> A typical case is the sum kernel [1]. The tight loop below can easily be
> optimized with SIMD.
>
>     for (int64_t i = 0; i < length; i++) {
>       local.sum += values[i];
>     }
>
> The compiler already does loop vectorization, but it happens at compile
> time, without knowledge of the target CPU. Binaries compiled for AVX-512
> cannot run on older CPUs, while binaries compiled with only SSE4 enabled
> are suboptimal on newer hardware.
>
> I have some proposals and would like to hear comments from the community.
>
> - Based on our experience with the ISA-L [2] project (an optimized
>   storage acceleration library for x86 and Arm), a runtime dispatcher is
>   a good approach. Basically, it links in code optimized for different
>   CPU features (SSE4, AVX2, NEON, ...) and selects the variant best
>   suited to the target CPU at first invocation. This is similar to the
>   gcc indirect function mechanism [3], but doesn't depend on compiler
>   support.
>
> - Use gcc FMV [4] to generate multiple binaries for one function. See
>   the sample source and compiled code [5]. Though it looks simple, it
>   has many limitations: it's a gcc-specific feature, with no support
>   from clang or msvc, and it only works on x86, with no Arm support.
>   I think this approach is a no-go.
>
> - Don't do it. Gandiva leverages the LLVM JIT for runtime code
>   optimization. Is it duplicated effort to do this in the C++ kernels?
>   Will these vectorizable computations move to Gandiva in the future?
>
> [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L104-L106
> [2] https://github.com/intel/isa-l
> [3] https://willnewton.name/2013/07/02/using-gnu-indirect-functions/
> [4] https://lwn.net/Articles/691932/
> [5] https://godbolt.org/z/ajpuq_
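For reference, the runtime-dispatcher idea quoted above could be sketched
roughly like this. All names here are hypothetical; in a real build each
ISA-specific variant would live in its own translation unit compiled with
the matching -m flags, and CPU detection would use a proper feature probe
rather than the gcc builtin used below.

```cpp
#include <atomic>
#include <cstdint>

// Stand-in implementations. SumAvx2 would contain AVX2 intrinsics in a
// real build; here it just reuses the scalar body so the sketch compiles
// anywhere.
static int64_t SumScalar(const int64_t* values, int64_t length) {
  int64_t sum = 0;
  for (int64_t i = 0; i < length; ++i) sum += values[i];
  return sum;
}
static int64_t SumAvx2(const int64_t* values, int64_t length) {
  return SumScalar(values, length);
}

using SumFn = int64_t (*)(const int64_t*, int64_t);

static bool CpuHasAvx2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx2");
#else
  return false;
#endif
}

// The function pointer initially targets the dispatcher; the first call
// detects CPU features, caches the chosen variant, and later calls go
// straight to it -- no compiler-specific ifunc support needed.
static int64_t SumDispatch(const int64_t* values, int64_t length);
static std::atomic<SumFn> g_sum{SumDispatch};

static int64_t SumDispatch(const int64_t* values, int64_t length) {
  SumFn chosen = CpuHasAvx2() ? SumAvx2 : SumScalar;
  g_sum.store(chosen, std::memory_order_relaxed);
  return chosen(values, length);
}

int64_t Sum(const int64_t* values, int64_t length) {
  return g_sum.load(std::memory_order_relaxed)(values, length);
}
```

After the first invocation every call pays only one indirect jump, which
is the same cost model as the gcc indirect-function mechanism in [3].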
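And for completeness, the gcc FMV approach from [4]/[5] boils down to a
single attribute; the sketch below guards it behind a hypothetical macro
so the quoted portability limitations (gcc/clang on x86 only) are visible
in code. With the attribute active, the compiler emits one body per listed
target plus a resolver that picks the best one at load time.

```cpp
#include <cstdint>

// Hypothetical guard macro: FMV via target_clones is only available with
// gcc/clang on x86; elsewhere we fall back to a single plain version.
#if defined(__GNUC__) && defined(__x86_64__)
#define SUM_CLONES __attribute__((target_clones("avx2", "sse4.2", "default")))
#else
#define SUM_CLONES
#endif

// One source function, multiple compiled variants selected at load time.
SUM_CLONES
int64_t Sum(const int64_t* values, int64_t length) {
  int64_t sum = 0;
  for (int64_t i = 0; i < length; ++i) sum += values[i];
  return sum;
}
```

This keeps the source tiny, but as noted above it buys nothing on msvc or
on Arm, which is why the dispatcher approach is more portable.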