Gilles Gouaillardet via users <users@lists.open-mpi.org> writes:

> One motivation is packaging: a single Open MPI implementation has to be
> built, that can run on older x86 processors (supporting only SSE) and the
> latest ones (supporting AVX512).

I take dispatch on micro-architecture for granted, but it doesn't
require an assembler/intrinsics implementation.  See the level-1
routines in recent BLIS, for example (an instance where GCC was supposed
to fail).  That works for all relevant architectures, though I don't
think the aarch64 and ppc64le dispatch was ever included.  Presumably
it's less prone to errors than low-level code.

> The op/avx component will select at
> runtime the most efficient implementation for vectorized reductions.

It will select the micro-architecture with the most features, which may
or may not be the most efficient.  Is the avx512 version actually faster
than avx2?

Anyway, if this is important at scale, which I can't test, please at
least vectorize op_base_functions.c for aarch64 and ppc64le.  With GCC,
and probably other compilers (at least clang, I think), it doesn't even
need changes to cc flags.  With GCC and recent glibc, target clones
cover micro-arches with practically no effort.  Otherwise you probably
need similar infrastructure to what's there now, but not to devote the
effort to using intrinsics, as far as I can see.
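
To illustrate the target-clones point, here is a minimal sketch, not the actual Open MPI code.  The function name sum_int32 and the macro SKETCH_CLONES are hypothetical, and the choice of clone targets is only an assumption for illustration.  The loop is plain C; with GCC on x86-64 and glibc, the attribute asks the compiler to emit one copy per listed target and pick one at load time, and each copy is left to the auto-vectorizer (at -O3, or -O2 in recent GCC).  On other compilers or platforms the macro expands to nothing and the same loop is built once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>   /* on glibc systems this makes __GLIBC__ visible */

/* Hypothetical helper macro: enable function multi-versioning only where
 * it is known to be available (GCC, x86-64, glibc with ifunc support).
 * Elsewhere it expands to nothing and a single generic copy is built. */
#if defined(__GNUC__) && !defined(__clang__) && \
    defined(__x86_64__) && defined(__GLIBC__)
#define SKETCH_CLONES \
    __attribute__((target_clones("default", "avx2", "avx512f")))
#else
#define SKETCH_CLONES
#endif

/* Element-wise sum reduction in the usual two-buffer form:
 * inout[i] = inout[i] + in[i].  No intrinsics; vectorization of each
 * clone is left to the compiler. */
SKETCH_CLONES
void sum_int32(const int32_t *restrict in, int32_t *restrict inout,
               size_t count)
{
    assert(count == 0 || (in != NULL && inout != NULL));
    for (size_t i = 0; i < count; i++) {
        inout[i] += in[i];
    }
}
```

A caller just invokes sum_int32(in, inout, count) as usual; the dispatch is invisible at the call site.  Whether the widest clone is actually the fastest on a given machine would still need measuring.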