Together with Sayed Adel (cc'd) and Ralf, I am pleased to put the draft version of NEP 38 [0] up for discussion. As per NEP 0, this is the next step in the community accepting the approach laid out in the NEP. The NEP PR [1] has already garnered a fair amount of discussion about the viability of Universal SIMD Intrinsics, so I will try to capture some of that here as well.

Abstract

While compilers are getting better at using hardware-specific routines to optimize code, they sometimes do not produce optimal results. Also, we would like to be able to copy optimized binary C-extension modules from one machine to another with the same base architecture (x86, ARM, PowerPC) but with different capabilities, without recompiling.

We have a mechanism in the ufunc machinery to build alternative loops indexed by CPU feature name. At import (in InitOperators), the loop function that matches the run-time CPU info is chosen from the candidates. This NEP proposes a mechanism to build on that for many more features and architectures. The steps proposed are to:

    Establish a set of well-defined, architecture-agnostic, universal intrinsics which capture features available across architectures.

    Capture these universal intrinsics in a set of C macros and use the macros to build code paths for sets of features, from the baseline up to the maximum set of features available on that architecture. Offer these as a limited number of compiled alternative code paths. (A sketch of such a macro layer follows this list.)

    At runtime, discover which CPU features are available, and choose from among the possible code paths accordingly.
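To make the macro step concrete, here is a minimal sketch of what such a layer might look like. The npyv_* spellings, the NPY_HAVE_* guard macros, and the loop body are illustrative assumptions rather than the NEP's actual API; only the concrete _mm256_* and v*q_f32 intrinsics are real.

    #include <stddef.h>

    /* Illustrative only: hypothetical npyv_* universal intrinsics expanded
     * to concrete compiler intrinsics per target.  The NPY_HAVE_* guards
     * stand in for feature macros defined at compile time. */
    #if defined(NPY_HAVE_AVX2)              /* x86, 256-bit vectors */
        #include <immintrin.h>
        typedef __m256 npyv_f32;
        #define npyv_nlanes_f32      8
        #define npyv_load_f32(p)     _mm256_loadu_ps(p)
        #define npyv_add_f32(a, b)   _mm256_add_ps(a, b)
        #define npyv_store_f32(p, v) _mm256_storeu_ps(p, v)
    #elif defined(NPY_HAVE_NEON)            /* ARM, 128-bit vectors */
        #include <arm_neon.h>
        typedef float32x4_t npyv_f32;
        #define npyv_nlanes_f32      4
        #define npyv_load_f32(p)     vld1q_f32(p)
        #define npyv_add_f32(a, b)   vaddq_f32(a, b)
        #define npyv_store_f32(p, v) vst1q_f32(p, v)
    #endif

    /* One loop body, written once against the abstract layer: */
    static void
    add_f32(const float *a, const float *b, float *out, size_t n)
    {
        size_t i = 0;
        for (; i + npyv_nlanes_f32 <= n; i += npyv_nlanes_f32) {
            npyv_f32 va = npyv_load_f32(a + i);
            npyv_f32 vb = npyv_load_f32(b + i);
            npyv_store_f32(out + i, npyv_add_f32(va, vb));
        }
        for (; i < n; i++) {                /* scalar tail */
            out[i] = a[i] + b[i];
        }
    }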

Motivation and Scope

Traditionally, NumPy has relied on compilers to generate optimal code specifically for the target architecture. However, few users today compile NumPy locally for their machines. Most use the binary packages, which must provide run-time support for the lowest-common-denominator CPU of each architecture. Thus NumPy cannot take advantage of more advanced CPU features, since they may not be available on all users' systems. The ufunc machinery already has a loop-selection protocol based on dtypes, so it is natural to extend it to also select an optimal loop for the CPU features available at runtime.

Traditionally, these features have been exposed through intrinsics, which are compiler-specific functions that map directly to assembly instructions. Recently, there were discussions about the effectiveness of adding more intrinsics (e.g., `gh-11113`_ for AVX optimizations for floats). In the past, architecture-specific code was added to NumPy for fast AVX512 routines in various ufuncs, using the mechanism described above to choose the best loop for the architecture. However, that code is not generic and does not generalize to other architectures.

Recently, OpenCV moved to using universal intrinsics in its Hardware Abstraction Layer (HAL), which provides a nice abstraction for common Single Instruction Multiple Data (SIMD) constructs. This NEP proposes a similar mechanism for NumPy. There are three stages to using the mechanism:


- Infrastructure is provided in the code for abstract intrinsics. The ufunc machinery will be extended to use sets of these abstract intrinsics, so that a single ufunc will be expressed as a set of loops, going from a minimal to a maximal set of possibly available intrinsics.


- At compile time, compiler macros and CPU detection are used to turn the abstract intrinsics into concrete intrinsic calls. If an abstract intrinsic is not available on the platform, either because the CPU does not support it (and so it cannot be tested) or because it has no parallel concrete intrinsic there, the build will not error; rather, the corresponding loop will simply not be produced or added to the set of possibilities.


- At runtime, the CPU detection code will further limit the set of available loops, and the optimal one will be chosen for the ufunc (a sketch of this selection follows the list).
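To sketch the third stage (under the assumption of hypothetical per-feature loop variants and a runtime query function, none of which are the NEP's actual API), the import-time selection might look like:

    #include <stddef.h>

    typedef void (*add_loop_t)(const float *, const float *, float *, size_t);

    /* Hypothetical: variants of the same abstract source, compiled once
     * per feature set.  The baseline variant is always present. */
    void add_f32_baseline(const float *, const float *, float *, size_t);
    void add_f32_avx2(const float *, const float *, float *, size_t);
    void add_f32_avx512f(const float *, const float *, float *, size_t);

    int npy_cpu_has(const char *feature);   /* assumed runtime query */

    /* At import time, pick the most capable loop this CPU supports. */
    static add_loop_t
    select_add_loop(void)
    {
        if (npy_cpu_has("AVX512F")) {
            return add_f32_avx512f;
        }
        if (npy_cpu_has("AVX2")) {
            return add_f32_avx2;
        }
        return add_f32_baseline;
    }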

The current NEP proposes to use the runtime feature detection and optimal loop selection mechanism only for ufuncs. Future NEPs may propose other uses for the proposed solution.


Usage and Impact

The end user will be able to get a list of intrinsics available for their platform and compiler. Optionally, the user may be able to specify which of the loops available at runtime will be used, perhaps via an environment variable, to enable benchmarking the impact of the different loops. There should be no direct impact on naive end users: the results of all the loops should be identical to within a small number (1-3?) of ULPs. On the other hand, users with more capable machines should notice a significant performance boost.
Binary releases - wheels on PyPI and conda packages

The binaries released by this process will be larger, since they include all possible loops for the architecture. Some packagers may prefer to limit the number of loops in order to limit the size of the binaries; we would hope they would still support a wide range of architecture families. Note that this situation already exists in the Intel MKL offering, where the binary package includes an extensive set of alternative shared objects (DLLs) for various CPU variants.


Source builds

See “Detailed Description” below. A source build where the packager knows the details of the target machine could theoretically produce a smaller binary, by using command-line arguments to compile only the loops needed by the target.
How to run benchmarks to assess performance benefits

Adding more code that uses intrinsics will make the codebase harder to maintain. Therefore, such code should only be added if it yields a significant performance benefit. Assessing this benefit can be nontrivial. To aid with this, the implementation of this NEP will add a way to select, via environment variables (name TBD), which instruction sets can be used at runtime. This ability is also critical for CI code verification.
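Since the variable name is still TBD, the sketch below uses a placeholder; it only illustrates the general technique of vetoing features through the environment, on top of the CPU query itself.

    #include <stdlib.h>
    #include <string.h>

    /* NPY_DISABLE_CPU_FEATURES is a placeholder -- the NEP leaves the
     * actual variable name TBD.  A feature listed in the variable is
     * treated as unavailable, so benchmarks and CI can force the
     * baseline or any intermediate loop. */
    static int
    feature_disabled(const char *feature)
    {
        const char *env = getenv("NPY_DISABLE_CPU_FEATURES");
        return env != NULL && strstr(env, feature) != NULL;
    }

The dispatcher would then test npy_cpu_has(feature) && !feature_disabled(feature) rather than the CPU query alone.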
Diagnostics

A new dictionary, __cpu_features__, will be available to Python. Its keys are the feature names; each value is a boolean indicating whether the feature is available at runtime. Various new private C functions will be used internally to query available features; these might be exposed via specific C-extension modules for testing.
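The NEP does not spell those private functions out, but on x86 with GCC or Clang such a query would typically go through CPUID; a minimal sketch:

    #include <cpuid.h>   /* GCC/Clang, x86 only */

    /* Sketch of one such private query.  A complete implementation would
     * also verify OS support for the wider registers (OSXSAVE/XGETBV),
     * and other architectures (ARM, POWER, s390x) need different
     * mechanisms, e.g. getauxval() on Linux. */
    static int
    cpu_has_avx2(void)
    {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
            return 0;
        }
        return (ebx >> 5) & 1;   /* AVX2: leaf 7, subleaf 0, EBX bit 5 */
    }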
Workflow for adding a new CPU architecture-specific optimization

NumPy will always have a baseline C implementation for any code that may be a candidate for SIMD vectorization. If a contributor wants to add SIMD support for some architecture (typically the one of most interest to them), this is the proposed workflow:

TODO (see https://github.com/numpy/numpy/pull/13516#issuecomment-558859638, needs to be worked out more)
Reuse by other projects

It would be nice if the universal intrinsics were available to other libraries, like SciPy or Astropy, that also build ufuncs, but that is not an explicit goal of the first implementation of this NEP.

-----------------------------------------------------------------------------------

My biased summary of select comments from the PR:

(Raghuveer): A very similar SIMD library has been proposed for C++. Here are the links to the details:

1. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r8.pdf
2. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/n4808.pdf

There is good discussion on the minimal/common set of instructions across architectures (which narrows down to loads, stores, arithmetic, compare, bitwise, and shuffle instructions). Based on my developer experience so far, these instructions aren't by themselves enough to implement and optimize NumPy ufuncs. As I pointed out earlier, I would find it useful to learn the workflow for using instructions that don't fit in the universal intrinsic framework.


(Raghuveer) gave a well-laid-out table of the currently proposed universal intrinsics by use: load/store, reorder, operators, conversions, arithmetic, and misc [2], which led to a long response from Sayed [3] with some sample code demonstrating how more complex operations can be built up from the primitives.
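For flavor (this is not the code from [3], just an illustration in the same spirit, reusing the hypothetical npyv_* spellings from above), composing a "missing" operation from universal primitives might look like:

    /* A signed 32-bit maximum for a target whose ISA lacks a native max
     * instruction (e.g. SSE2, before SSE4.1 added _mm_max_epi32),
     * composed from assumed compare and select primitives: */
    #define npyv_max_s32(a, b) \
        npyv_select_s32(npyv_cmpgt_s32(a, b), (a), (b))

    /* Likewise, absolute value from a bitwise AND that clears the
     * sign bit: */
    #define npyv_abs_f32(v) \
        npyv_and_f32((v), npyv_reinterpret_f32_u32(npyv_setall_u32(0x7fffffff)))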


(catree) mentioned the Simd Library [4] and Halide [5] and asked about maintainability.


(Ralf) responded [6] with concerns about competent-developer bandwidth for code review. He also mentioned that our CI system currently supports all the architectures we are targeting (x86, aarch64, s390x, ppc64le), although some of these machines may not have the most advanced hardware supporting the latest intrinsics.


I apologize if my summary is not accurate; please correct any mistakes or misconceptions.

----------------------------------------------------------------------------------------


Barring complete rejection of the idea here, we will be pushing forward with PRs to implement this. Comments either on the mailing list or in those PRs are welcome.

Matti


[0] https://numpy.org/neps/nep-0038-SIMD-optimizations.html

[1] https://github.com/numpy/numpy/pull/15228

[2] https://github.com/numpy/numpy/pull/15228#issuecomment-580479336

[3] https://github.com/numpy/numpy/pull/15228#issuecomment-580605718

[4] https://github.com/ermig1979/Simd

[5] https://halide-lang.org

[6] https://github.com/numpy/numpy/pull/15228#issuecomment-581029991
