Hi folks,

I had a conversation with the developers of xsimd last week in Paris
and learned that they are working on a substantial refactor of xsimd
to improve its usability for cross-compilation and for dynamic
dispatch based on runtime processor capabilities. The branch with the
refactor is located here:

https://github.com/xtensor-stack/xsimd/tree/feature/xsimd-refactoring

In particular, the SIMD batch API is changing from

template <class T, size_t N>
class batch;

to

template <class T, class arch>
class batch;

So rather than using xsimd::batch<uint32_t, 16> for an AVX512 batch,
you would write xsimd::batch<uint32_t, xsimd::arch::avx512> (or e.g.
neon/neon64 for ARM ISAs) and then obtain the batch width from the
static batch::size member.
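
To make the migration concrete, here is a minimal sketch, assuming
the names above (xsimd::arch::avx512, batch::size) land as currently
described on the branch; exact spellings may still change there:

#include <cstdint>
#include <xsimd/xsimd.hpp>

// Old API: the number of lanes is spelled out explicitly.
// xsimd::batch<std::uint32_t, 16> x;

// New API: the ISA tag determines the number of lanes.
using batch_t = xsimd::batch<std::uint32_t, xsimd::arch::avx512>;
static_assert(batch_t::size == 16,
              "16 lanes of uint32_t in a 512-bit register");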

A few comments for discussion / investigation:

* First, we will have to prepare to migrate to this new API in the
future.

* At some point, we will likely want to generate SIMD variants of our
C++ math kernels, usable via dynamic dispatch, for each supported CPU
level. It would be beneficial to author as much code as possible in
an ISA-independent fashion, so it can be cross-compiled into binary
code for each ISA (see the first sketch after this list). We should
investigate whether the new approach in xsimd will provide what we
need or whether we need to take a different approach.

* We have some dynamic dispatch code of our own that selects function
pointers at runtime based on the available SIMD level (the second
sketch below shows the pattern). Can we benefit from any of the work
that is happening in this xsimd refactor?

* We have some compute code (e.g. hash tables for aggregation and
joins) that uses explicit AVX2 intrinsics. Can some of this code be
ported to generic xsimd APIs, or will we need a fundamentally
different algorithm design to get maximum efficiency on each SIMD
ISA? (The third sketch below contrasts the two styles.)
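
On the ISA-independent authoring point, the kind of kernel we would
want to write once and compile per ISA might look like the sketch
below. This is illustrative only: add_arrays is a hypothetical
kernel, and the load/store spellings are assumed from the current
xsimd API:

#include <cstddef>
#include <xsimd/xsimd.hpp>

// Hypothetical kernel templated on the arch tag rather than a width.
// The same source can be compiled once per ISA, e.g. in separate
// translation units built with different -m flags.
template <class Arch>
void add_arrays(const float* a, const float* b, float* out,
                std::size_t n) {
  using batch_t = xsimd::batch<float, Arch>;
  constexpr std::size_t step = batch_t::size;
  std::size_t i = 0;
  for (; i + step <= n; i += step) {
    auto va = batch_t::load_unaligned(a + i);
    auto vb = batch_t::load_unaligned(b + i);
    (va + vb).store_unaligned(out + i);
  }
  for (; i < n; ++i) {
    out[i] = a[i] + b[i];  // scalar tail
  }
}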
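
For the dispatch bullet, runtime function pointer selection over the
per-ISA instantiations would then reduce to something like this,
where cpu_has_avx512() etc. stand in for whatever capability
detection we end up using (ours or xsimd's), and the arch tag
spellings are again assumed:

// Hypothetical runtime selection between per-ISA instantiations of
// the add_arrays kernel above.
using kernel_fn = void (*)(const float*, const float*, float*,
                           std::size_t);

kernel_fn select_add_kernel() {
  if (cpu_has_avx512()) return add_arrays<xsimd::arch::avx512>;
  if (cpu_has_avx2()) return add_arrays<xsimd::arch::avx2>;
  return add_arrays<xsimd::arch::sse2>;
}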
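
And for the last bullet, here is the contrast on a trivially portable
operation; the interesting hash-table pieces (gathers, shuffles,
masked operations) will not map this mechanically, which is exactly
the open question:

#include <cstdint>
#include <immintrin.h>
#include <xsimd/xsimd.hpp>

// Explicit AVX2, as in the current hash-table code paths:
__m256i add_lanes(__m256i a, __m256i b) {
  return _mm256_add_epi32(a, b);
}

// Generic xsimd equivalent, portable across arch tags:
template <class Arch>
xsimd::batch<std::int32_t, Arch>
add_lanes(xsimd::batch<std::int32_t, Arch> a,
          xsimd::batch<std::int32_t, Arch> b) {
  return a + b;  // the operator overload lowers to the ISA's integer add
}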

Thanks,
Wes
