Gilles Gouaillardet via users <users@lists.open-mpi.org> writes:

> One motivation is packaging: a single Open MPI implementation has to be
> built, that can run on older x86 processors (supporting only SSE) and the
> latest ones (supporting AVX512).

I take dispatch on micro-architecture for granted, but it doesn't
require an assembler/intrinsics implementation.  See the level-1
routines in recent BLIS, for example (an instance where GCC was supposed
to fail).  That approach works for all relevant architectures, though I
don't think the aarch64 and ppc64le dispatch was ever included.
Presumably plain C is also less prone to errors than low-level code.
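
To illustrate what I mean by dispatch without intrinsics, here is a
minimal sketch, not taken from BLIS or Open MPI.  It assumes GCC (or a
compatible compiler) on x86-64; the names add_generic, add_avx2 and
select_add are made up for the example.  The same plain C loop is
compiled twice, once with the baseline flags and once with AVX2 enabled
through the target attribute, and a selector picks one at run time:

```c
#include <stddef.h>

/* Hypothetical signature for an element-wise reduction: inout[i] += in[i]. */
typedef void (*add_fn)(const float *in, float *inout, size_t n);

/* Baseline version, compiled with whatever flags the build uses. */
static void add_generic(const float *restrict in, float *restrict inout,
                        size_t n)
{
    for (size_t i = 0; i < n; i++)
        inout[i] += in[i];
}

#if defined(__GNUC__) && defined(__x86_64__)
/* Same source loop; the compiler may use AVX2 when vectorizing it. */
__attribute__((target("avx2")))
static void add_avx2(const float *restrict in, float *restrict inout,
                     size_t n)
{
    for (size_t i = 0; i < n; i++)
        inout[i] += in[i];
}
#endif

/* Pick an implementation based on what the running CPU reports. */
add_fn select_add(void)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return add_avx2;
#endif
    return add_generic;
}
```

Whether the loop actually gets vectorized still depends on the
optimization level and compiler version, so that is worth checking in
the generated code.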

> The op/avx component will select at
> runtime the most efficient implementation for vectorized reductions.

It will select the micro-architecture with the most features, which may
or may not be the most efficient.  Is the avx512 version actually faster
than avx2?

Anyway, if this is important at scale, which I can't test, please at
least vectorize op_base_functions.c for aarch64 and ppc64le.  With GCC,
and probably other compilers -- at least clang, I think -- it doesn't
even need changes to cc flags.  With GCC and recent glibc, the
target_clones function attribute covers micro-arches with practically
no effort.  Otherwise you probably need infrastructure similar to what's
there now, but as far as I can see there's no need to devote the effort
to using intrinsics.
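
For reference, a minimal sketch of the target-clones approach.  It
assumes GCC on x86-64 Linux with glibc (for ifunc support); the macro
SUM_CLONES and the function sum_float are hypothetical names, not
anything in op_base_functions.c.  The compiler emits one clone per
listed target plus a resolver, so the source stays a single plain loop:

```c
#include <stddef.h>

/*
 * Hypothetical helper macro: enable function multi-versioning only
 * where this sketch assumes it is available (GCC-compatible compiler,
 * x86-64, Linux with ifunc support).  Elsewhere it expands to nothing
 * and the function is compiled once with the default flags.
 */
#if defined(__GNUC__) && defined(__x86_64__) && defined(__linux__)
#define SUM_CLONES __attribute__((target_clones("default", "avx2", "avx512f")))
#else
#define SUM_CLONES
#endif

/* Element-wise sum in the style of a reduction op: inout[i] += in[i]. */
SUM_CLONES
void sum_float(const float *restrict in, float *restrict inout, size_t n)
{
    for (size_t i = 0; i < n; i++)
        inout[i] += in[i];
}
```

Callers just call sum_float(); the choice of clone is made when the
symbol is resolved at load time.  Which clone is fastest on a given
machine is a separate question that would need measuring.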
