Hi,
I would recommend against reinventing the wheel. It would be possible
to reuse an existing C++ SIMD library. There are several of them (Vc,
xsimd, libsimdpp...). Of course, "just use Gandiva" is another possible
answer.
Regards
Antoine.
On 20/12/2019 at 08:32, Yibo Cai wrote:
> Hi,
>
> I'm investigating SIMD support for the C++ compute kernels (not Gandiva).
>
> A typical case is the sum kernel [1]. The tight loop below can easily be
> optimized with SIMD.
>
>   for (int64_t i = 0; i < length; i++) {
>     local.sum += values[i];
>   }
>
> The compiler already does loop vectorization, but it's done at compile time
> without knowledge of the target CPU.
> Binaries compiled with AVX-512 cannot run on old CPUs, while binaries compiled
> with only SSE4 enabled are suboptimal on new hardware.
>
> I have some proposals and would like to hear comments from the community.
>
> - Based on our experience with the ISA-L [2] project (an optimized storage
> acceleration library for x86 and Arm), a runtime dispatcher is a good approach.
> Basically, it links in code optimized for different CPU features (SSE4, AVX2,
> NEON, ...) and selects the best one for the target CPU at first invocation.
> This is similar to GCC indirect functions [3], but doesn't depend on the compiler.
>
> - Use GCC function multiversioning (FMV) [4] to generate multiple versions of
> one function. See the sample source and compiled code [5].
> Though it looks simple, it has many limitations: it's a GCC-specific feature
> with no support from clang or MSVC, and it only works on x86, with no Arm
> support. I think this approach is a no-go.
>
> - Don't do it.
> Gandiva leverages the LLVM JIT for runtime code optimization. Is it duplicated
> effort to do this in the C++ kernels? Will these vectorizable computations move
> to Gandiva in the future?
>
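
To make the runtime dispatcher proposal above concrete, here is a rough
sketch of the idea applied to the sum loop. It is only an illustration
under assumptions: the names (sketch::SumScalar, SumAvx2, ResolveSum, Sum)
are hypothetical and are not Arrow or ISA-L code, and for brevity CPU
detection uses the GCC/clang builtin __builtin_cpu_supports on x86-64,
whereas a compiler-independent dispatcher would query CPUID (or the Arm
equivalent) itself. On other platforms the sketch falls back to the scalar
loop.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: only GCC/clang on x86-64 get the AVX2 variant in this sketch.
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_HAVE_AVX2 1
#endif

namespace sketch {  // hypothetical namespace, for illustration only

using SumFn = int64_t (*)(const int64_t*, int64_t);

// Portable baseline: the same tight loop as in the sum kernel.
int64_t SumScalar(const int64_t* values, int64_t length) {
  int64_t sum = 0;
  for (int64_t i = 0; i < length; i++) sum += values[i];
  return sum;
}

#ifdef SKETCH_HAVE_AVX2
// Compiled for AVX2 regardless of the global -m flags of the build.
__attribute__((target("avx2")))
int64_t SumAvx2(const int64_t* values, int64_t length) {
  __m256i acc = _mm256_setzero_si256();
  int64_t i = 0;
  // Add four 64-bit lanes per iteration.
  for (; i + 4 <= length; i += 4) {
    acc = _mm256_add_epi64(
        acc, _mm256_loadu_si256(reinterpret_cast<const __m256i*>(values + i)));
  }
  // Horizontal reduction of the four lanes, then the scalar tail.
  int64_t lanes[4];
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(lanes), acc);
  int64_t sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
  for (; i < length; i++) sum += values[i];
  return sum;
}
#endif

// Picks the best variant the running CPU supports.
SumFn ResolveSum() {
#ifdef SKETCH_HAVE_AVX2
  if (__builtin_cpu_supports("avx2")) return SumAvx2;
#endif
  return SumScalar;
}

// Public entry point: the variant is resolved once, at first invocation.
int64_t Sum(const int64_t* values, int64_t length) {
  assert(length >= 0);
  static const SumFn impl = ResolveSum();  // thread-safe init since C++11
  return impl(values, length);
}

}  // namespace sketch
```

Callers only ever see sketch::Sum; adding an SSE4, AVX-512 or NEON variant
would mean one more function and one more branch in ResolveSum.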
> [1]
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sum_internal.h#L104-L106
> [2] https://github.com/intel/isa-l
> [3] https://willnewton.name/2013/07/02/using-gnu-indirect-functions/
> [4] https://lwn.net/Articles/691932/
> [5] https://godbolt.org/z/ajpuq_
>
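
For comparison, the GCC function multiversioning proposal could look
roughly like the sketch below. The function name SumFmv and the macro
SKETCH_FMV are hypothetical; the guard (GCC, x86-64, glibc) is my own
assumption about where target_clones can be relied upon, and everywhere
else the macro expands to nothing so that the code still builds as a plain
function.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: enable target_clones only for GCC on x86-64 with glibc,
// since the feature relies on indirect-function (ifunc) support.
#if defined(__x86_64__) && defined(__GNUC__) && !defined(__clang__) && \
    defined(__GLIBC__)
#define SKETCH_FMV __attribute__((target_clones("avx2", "default")))
#else
#define SKETCH_FMV
#endif

// With the attribute active, GCC emits one clone of this function per listed
// target plus a resolver that selects a clone when the program is loaded.
SKETCH_FMV
int64_t SumFmv(const int64_t* values, int64_t length) {
  assert(length >= 0);
  int64_t sum = 0;
  for (int64_t i = 0; i < length; i++) sum += values[i];
  return sum;
}
```

The source stays a single loop, which is the appeal of this approach; the
portability limits described in the proposal are what the preprocessor
guard has to work around.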