Together with Sayed Adel (cc) and Ralf, I am pleased to put the draft
version of NEP 38 [0] up for discussion. As per NEP 0, this is the next
step in the community accepting the approach laid out in the NEP. The
NEP PR [1] has already garnered a fair amount of discussion about the
viability of Universal SIMD Intrinsics, so I will try to capture some of
that here as well.
Abstract
While compilers are getting better at using hardware-specific routines
to optimize code, they sometimes do not produce optimal results. Also,
we would like to be able to copy optimized, binary C-extension modules
from one machine to another with the same base architecture (x86, ARM,
PowerPC) but with different capabilities, without recompiling.
We have a mechanism in the ufunc machinery to build alternative loops
indexed by CPU feature name. At import (in InitOperators), the loop
function that matches the run-time CPU info is chosen from the
candidates. This NEP proposes a mechanism to build on that for many more
features and architectures. The steps proposed are to:
- Establish a set of well-defined, architecture-agnostic, universal
intrinsics which capture features available across architectures.
- Capture these universal intrinsics in a set of C macros and use the
macros to build code paths for sets of features, from the baseline up
to the maximal set of features available on that architecture. Offer
these as a limited number of compiled alternative code paths (a rough
sketch follows this list).
- At runtime, discover which CPU features are available, and choose
from among the possible code paths accordingly.
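
As a rough illustration of what that macro layer could look like (all
npyv_* names here are hypothetical, not a final API; the NEP itself
would define the real set), the same abstract names can be mapped to
concrete intrinsics per architecture, and a single loop body written
once against them:

    /* Hedged sketch only: hypothetical npyv_* names, not a final API. */
    #if defined(__AVX2__)
        #include <immintrin.h>
        typedef __m256 npyv_f32;                  /* 8 packed floats */
        #define npyv_nlanes_f32     8
        #define npyv_load_f32(P)    _mm256_loadu_ps(P)
        #define npyv_store_f32(P,V) _mm256_storeu_ps(P, V)
        #define npyv_add_f32(A,B)   _mm256_add_ps(A, B)
    #elif defined(__SSE2__)
        #include <emmintrin.h>
        typedef __m128 npyv_f32;                  /* 4 packed floats */
        #define npyv_nlanes_f32     4
        #define npyv_load_f32(P)    _mm_loadu_ps(P)
        #define npyv_store_f32(P,V) _mm_storeu_ps(P, V)
        #define npyv_add_f32(A,B)   _mm_add_ps(A, B)
    #elif defined(__ARM_NEON)
        #include <arm_neon.h>
        typedef float32x4_t npyv_f32;             /* 4 packed floats */
        #define npyv_nlanes_f32     4
        #define npyv_load_f32(P)    vld1q_f32(P)
        #define npyv_store_f32(P,V) vst1q_f32(P, V)
        #define npyv_add_f32(A,B)   vaddq_f32(A, B)
    #endif

    /* One loop body, written once against the abstract names. */
    static void
    add_f32(const float *a, const float *b, float *r, int n)
    {
        int i = 0;
        for (; i + npyv_nlanes_f32 <= n; i += npyv_nlanes_f32) {
            npyv_f32 va = npyv_load_f32(a + i);
            npyv_f32 vb = npyv_load_f32(b + i);
            npyv_store_f32(r + i, npyv_add_f32(va, vb));
        }
        for (; i < n; i++) {                      /* scalar tail */
            r[i] = a[i] + b[i];
        }
    }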
Motivation and Scope
Traditionally NumPy has relied on compilers to generate optimal code
specifically for the target architecture. However, few users today
compile NumPy locally for their machines; most use the binary packages,
which must provide run-time support for the lowest-common-denominator
CPU architecture. Thus NumPy cannot take advantage of the more advanced
features of modern CPUs, since they may not be available on all users’
systems. The ufunc machinery already has a loop-selection protocol
based on dtypes, so it is straightforward to extend it to also select,
at runtime, an optimal loop for the CPU features actually available.
Traditionally, these features have been exposed through intrinsics:
compiler-specific instructions that map directly to assembly
instructions. Recently there have been discussions about the
effectiveness of adding more intrinsics (e.g., gh-11113 for AVX
optimizations for floats). In the past, architecture-specific code was
added to NumPy for fast AVX512 routines in various ufuncs, using the
mechanism described above to choose the best loop for the architecture.
However, that code is not generic and does not carry over to other
architectures.
Recently, OpenCV moved to using universal intrinsics in its Hardware
Abstraction Layer (HAL), which provides a nice abstraction for common
shared Single Instruction Multiple Data (SIMD) constructs. This NEP
proposes a similar mechanism for NumPy. There are three stages to using
the mechanism:
- Infrastructure is provided in the code for abstract intrinsics. The
ufunc machinery will be extended using sets of these abstract
intrinsics, so that a single ufunc will be expressed as a set of loops,
going from a minimal to a maximal set of possibly available intrinsics.
- At compile time, compiler macros and CPU detection are used to turn
the abstract intrinsics into concrete intrinsic calls. If an intrinsic
is not available on the platform, either because the CPU does not
support it (and so it cannot be tested) or because the abstract
intrinsic has no parallel concrete intrinsic there, this is not an
error; the corresponding loop is simply not produced and not added to
the set of possibilities.
- At runtime, the CPU detection code will further limit the set of
loops available, and the optimal one will be chosen for the ufunc (a
sketch of this selection follows the list).
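
A minimal sketch of what that runtime selection could look like (again
with hypothetical names: npy_cpu_have() stands in for the runtime
CPU-detection code, and the loop symbols stand in for variants compiled
from the same source at different feature levels):

    #include <numpy/ndarraytypes.h>   /* for npy_intp */

    /* The standard ufunc inner-loop signature. */
    typedef void (*loop_t)(char **args, npy_intp const *dimensions,
                           npy_intp const *steps, void *data);

    extern int npy_cpu_have(const char *feature);      /* hypothetical */
    extern void add_f32_avx512(char **, npy_intp const *,
                               npy_intp const *, void *);
    extern void add_f32_avx2(char **, npy_intp const *,
                             npy_intp const *, void *);
    extern void add_f32_baseline(char **, npy_intp const *,
                                 npy_intp const *, void *);

    static loop_t
    select_add_f32_loop(void)
    {
        /* Walk from the maximal compiled feature set down to the
         * baseline, returning the first variant this CPU can run. */
        if (npy_cpu_have("AVX512F")) {
            return add_f32_avx512;
        }
        if (npy_cpu_have("AVX2")) {
            return add_f32_avx2;
        }
        return add_f32_baseline;
    }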
The current NEP proposes only to use the runtime feature detection and
optimal loop selection mechanism for ufuncs. Future NEPs may propose
other uses for the proposed solution.
Usage and Impact
The end user will be able to get a list of intrinsics available for
their platform and compiler. Optionally, the user may be able to specify
which of the loops available at runtime will be used, perhaps via an
environment variable to enable benchmarking the impact of the different
loops. There should be no direct impact on naive end users: the results
of all the loops should be identical to within a small number (1-3?) of
ULPs. On the other hand, users with more capable machines should notice
a significant performance boost.
Binary releases - wheels on PyPI and conda packages
The binaries released by this process will be larger since they include
all possible loops for the architecture. Some packagers may prefer to
limit the number of loops in order to limit the size of the binaries;
we would hope they would still support a wide range of CPU families.
Note this problem already exists in the Intel MKL
offering, where the binary package includes an extensive set of
alternative shared objects (DLLs) for various CPU alternatives.
Source builds
See “Detailed Description” below. A source build where the packager
knows details of the target machine could theoretically produce a
smaller binary by choosing to compile only the loops needed by the
target via command line arguments.
How to run benchmarks to assess performance benefits
Adding more code which uses intrinsics will make the code harder to
maintain. Therefore, such code should only be added if it yields a
significant performance benefit. Assessing this performance benefit can
be nontrivial. To aid with this, the implementation for this NEP will
add a way to select which instruction sets can be used at runtime via
environment variables (name TBD). This ability is critical for CI code
verification.
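
Since the variable name is still to be decided, the following is only a
hedged sketch of how such an override could compose with the runtime
detection sketched above (NPY_DISABLE_CPU_FEATURES is a placeholder
name, not a decided one, and the substring match is deliberately crude):

    #include <stdlib.h>
    #include <string.h>

    extern int npy_cpu_have(const char *feature);  /* as sketched above */

    /* Treat any feature named in the (placeholder) environment variable
     * as absent, so CI can force e.g. the baseline loops and compare
     * their results against the optimized variants. */
    static int
    npy_cpu_have_checked(const char *feature)
    {
        const char *disabled = getenv("NPY_DISABLE_CPU_FEATURES");
        if (disabled != NULL && strstr(disabled, feature) != NULL) {
            return 0;
        }
        return npy_cpu_have(feature);
    }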
Diagnostics
A new dictionary __cpu_features__ will be available to Python. The keys
are the feature names; the values are booleans indicating whether each
feature is available at runtime. Various new private C functions will
be used internally to query available features. These might be exposed
via specific C-extension modules for testing.
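
Purely as a sketch (reusing the hypothetical npy_cpu_have() from above,
and an arbitrary feature list), the dictionary could be filled in from
the detection results roughly like this:

    #include <Python.h>

    extern int npy_cpu_have(const char *feature);      /* hypothetical */

    /* Build the __cpu_features__ dict for a fixed list of features. */
    static PyObject *
    build_cpu_features_dict(void)
    {
        static const char *features[] = {"SSE2", "AVX2", "AVX512F",
                                         "NEON", NULL};
        PyObject *dict = PyDict_New();
        if (dict == NULL) {
            return NULL;
        }
        for (int i = 0; features[i] != NULL; i++) {
            PyObject *val = npy_cpu_have(features[i]) ? Py_True : Py_False;
            if (PyDict_SetItemString(dict, features[i], val) < 0) {
                Py_DECREF(dict);
                return NULL;
            }
        }
        return dict;
    }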
Workflow for adding a new CPU architecture-specific optimization
NumPy will always have a baseline C implementation for any code that may
be a candidate for SIMD vectorization. If a contributor wants to add
SIMD support for some architecture (typically the one of most interest
to them), this is the proposed workflow:
TODO (see
https://github.com/numpy/numpy/pull/13516#issuecomment-558859638, needs
to be worked out more)
Reuse by other projects
It would be nice if the universal intrinsics were available to other
libraries like SciPy or Astropy that also build ufuncs, but that is not
an explicit goal of the first implementation of this NEP.
-----------------------------------------------------------------------------------
My biased summary of select comments from the PR:
(Raghuveer): A very similar SIMD library has been proposed for C++. Here
is the link to the details:
1. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r8.pdf
2. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/n4808.pdf
There is good discussion on the minimal/common set of instructions
across architectures (which narrows down to loads, stores, arithmetic,
compare, bitwise and shuffle instructions). Based on my developer
experience so far, these instructions aren't by themselves enough to
implement and optimize NumPy ufuncs. As I pointed out earlier, I think I
would find it useful to learn the workflow of how to use instructions
that don't fit in the Universal Intrinsic framework.
(Raghuveer) gave a well-laid-out table of the currently proposed
universal intrinsics by use: load/store, reorder, operators,
conversions, arithmetic, and misc [2], which led to a long response
from Sayed [3] with some sample code, demonstrating how more complex
operations can be built up from the primitives.
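
To give a flavor of that kind of composition (the sketch below is mine,
not Sayed's, and reuses the hypothetical npyv_* naming from above):
absolute value of packed floats needs no dedicated instruction, since
it falls out of a bitwise AND that clears the IEEE-754 sign bit:

    #if defined(__SSE2__)
    #include <emmintrin.h>
    typedef __m128 npyv_f32;
    #define npyv_and_f32(A,B) _mm_and_ps(A, B)

    /* |x| = x & 0x7fffffff: clear the sign bit with a mask. */
    static npyv_f32
    npyv_abs_f32(npyv_f32 v)
    {
        const npyv_f32 mask =
            _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
        return npyv_and_f32(v, mask);
    }
    #endif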
(catree) mentioned the Simd Library [4] and Halide [5] and asked about
maintainability.
(Ralf) responded [6] with concerns about competent developer bandwidth
for code review. He also mentioned that our CI system currently supports
all the architectures we are targeting (x86, aarch64, s390x, ppc64le)
although some of these machines may not have the most advanced hardware
to support the latest intrinsics.
I apologize if my summary is not accurate; please correct any mistakes
or misconceptions.
----------------------------------------------------------------------------------------
Barring complete rejection of the idea here, we will be pushing forward
with PRs to implement this. Comments either on the mailing list or in
those PRs are welcome.
Matti
[0] https://numpy.org/neps/nep-0038-SIMD-optimizations.html
[1] https://github.com/numpy/numpy/pull/15228
[2] https://github.com/numpy/numpy/pull/15228#issuecomment-580479336
[3] https://github.com/numpy/numpy/pull/15228#issuecomment-580605718
[4] https://github.com/ermig1979/Simd
[5] https://halide-lang.org
[6] https://github.com/numpy/numpy/pull/15228#issuecomment-581029991