On Fri, 20 Mar 2020 10:56:51 +0800
Yibo Cai wrote:
> I'm revisiting this old thread as I see some avx512 code merged recently[1].
> Code maintenance will be non-trivial if we want to cover more
> hardware (SSE/AVX/AVX512/NEON/SVE/...) and optimize more code in the future.
> #ifdef is obviously no-go.
Thanks Wes for the quick response.
Yes, inlining can be a problem for a runtime dispatcher. It means we should
dispatch the whole loop[1], not the code inside the loop[2]. Otherwise this
may lead to traps for developers.
[1]
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bpacking.h#L3760
hi Yibo,
I agree with this, having #ifdef in many places in the codebase is not
maintainable longer-term.
As far as runtime dispatch, we could populate a function table of all
machine-dependent functions once, so that the dispatch isn't happening
on each function call. Or some similar strategy
This
I'm revisiting this old thread as I see some avx512 code merged recently[1].
Code maintenance will be non-trivial if we want to cover more
hardware (SSE/AVX/AVX512/NEON/SVE/...) and optimize more code in the future.
#ifdef is obviously no-go.
So I'm selling my proposal again :)
- put all
If we go the route of AOT-compilation of Gandiva kernels as an
approach to generate a shared library with many kernels, we might
indeed look at possibly generating a "fat" binary with runtime
dispatch between AVX2-optimized vs. SSE <= 4.2 (or non-SIMD
altogether) kernels. This is something we
Hi,
I would recommend against reinventing the wheel. It would be possible
to reuse an existing C++ SIMD library. There are several of them (Vc,
xsimd, libsimdpp...). Of course, "just use Gandiva" is another possible
answer.
Regards
Antoine.
On 20/12/2019 08:32, Yibo Cai wrote:
> Hi,
Hi,
I'm investigating SIMD support for the C++ compute kernels (not Gandiva).
A typical case is the sum kernel[1]. The tight loop below can be easily
optimized with SIMD.
for (int64_t i = 0; i < length; i++) {
  local.sum += values[i];
}
Compiler already does loop vectorization. But it's done at