Together with Sayed Adel (cc) and Ralf, I am pleased to put the draft
version of NEP 38 [0] up for discussion. As per NEP 0, this is the next
step in the community accepting the approach laid out in the NEP. The
NEP PR [1] has already garnered a fair amount of discussion about the
viability of Universal SIMD Intrinsics, so I will try to capture some of
that here as well.
Abstract
While compilers are getting better at using hardware-specific routines
to optimize code, they sometimes do not produce optimal results. Also,
we would like to be able to copy optimized, binary C-extension modules
from one machine to another with the same base architecture (x86, ARM,
PowerPC) but with different capabilities, without recompiling.
We have a mechanism in the ufunc machinery to build alternative loops
indexed by CPU feature name. At import (in InitOperators), the loop
function that matches the run-time CPU info is chosen from the
candidates. This NEP proposes a mechanism to build on that for many more
features and architectures. The steps proposed are to:
- Establish a set of well-defined, architecture-agnostic, universal
intrinsics which capture features available across architectures.
- Capture these universal intrinsics in a set of C macros and use the
macros to build code paths for sets of features, from the baseline up
to the maximal set of features available on that architecture. Offer
these as a limited number of compiled alternative code paths (a rough
sketch follows this list).
- At runtime, discover which CPU features are available, and choose
from among the possible code paths accordingly.
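
As a rough illustration of what that macro layer could look like (all
npyv_* names here are hypothetical, not a final API; the NEP itself
would define the real set), the same abstract names can be mapped to
concrete intrinsics per architecture, and a single loop body written
once against them:

    /* Hedged sketch only: hypothetical npyv_* names, not a final API. */
    #if defined(__AVX2__)
        #include <immintrin.h>
        typedef __m256 npyv_f32;                  /* 8 packed floats */
        #define npyv_nlanes_f32     8
        #define npyv_load_f32(P)    _mm256_loadu_ps(P)
        #define npyv_store_f32(P,V) _mm256_storeu_ps(P, V)
        #define npyv_add_f32(A,B)   _mm256_add_ps(A, B)
    #elif defined(__SSE2__)
        #include <emmintrin.h>
        typedef __m128 npyv_f32;                  /* 4 packed floats */
        #define npyv_nlanes_f32     4
        #define npyv_load_f32(P)    _mm_loadu_ps(P)
        #define npyv_store_f32(P,V) _mm_storeu_ps(P, V)
        #define npyv_add_f32(A,B)   _mm_add_ps(A, B)
    #elif defined(__ARM_NEON)
        #include <arm_neon.h>
        typedef float32x4_t npyv_f32;             /* 4 packed floats */
        #define npyv_nlanes_f32     4
        #define npyv_load_f32(P)    vld1q_f32(P)
        #define npyv_store_f32(P,V) vst1q_f32(P, V)
        #define npyv_add_f32(A,B)   vaddq_f32(A, B)
    #endif

    /* One loop body, written once against the abstract names. */
    static void
    add_f32(const float *a, const float *b, float *r, int n)
    {
        int i = 0;
        for (; i + npyv_nlanes_f32 <= n; i += npyv_nlanes_f32) {
            npyv_f32 va = npyv_load_f32(a + i);
            npyv_f32 vb = npyv_load_f32(b + i);
            npyv_store_f32(r + i, npyv_add_f32(va, vb));
        }
        for (; i < n; i++) {                      /* scalar tail */
            r[i] = a[i] + b[i];
        }
    }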
Motivation and Scope
Traditionally NumPy has relied on compilers to generate optimal code
specifically for the target architecture. However, few users today
compile NumPy locally for their machines; most use the binary packages,
which must provide run-time support for the lowest-common-denominator
CPU architecture. Thus NumPy cannot take advantage of the more advanced
features of modern CPUs, since they may not be available on all users’
systems. The ufunc machinery already has a loop-selection protocol
based on dtypes, so it is straightforward to extend it to also select,
at runtime, an optimal loop for the CPU features actually available.
Traditionally, these features have been exposed through intrinsics:
compiler-specific instructions that map directly to assembly
instructions. Recently there have been discussions about the
effectiveness of adding more intrinsics (e.g., gh-11113 for AVX
optimizations for floats). In the past, architecture-specific code was
added to NumPy for fast AVX512 routines in various ufuncs, using the
mechanism described above to choose the best loop for the architecture.
However, that code is not generic and does not carry over to other
architectures.
Recently, OpenCV moved to using universal intrinsics in its Hardware
Abstraction Layer (HAL), which provides a nice abstraction for common
shared Single Instruction Multiple Data (SIMD) constructs. This NEP
proposes a similar mechanism for NumPy. There are three stages to using
the mechanism:
- Infrastructure is provided in the code for abstract intrinsics. The
ufunc machinery will be extended using sets of these abstract
intrinsics, so that a single ufunc will be expressed as a set of loops,
going from a minimal to a maximal set of possibly available intrinsics.
- At compile time, compiler macros and CPU detection are used to turn
the abstract intrinsics into concrete intrinsic calls. If an intrinsic
is not available on the platform, either because the CPU does not
support it (and so it cannot be tested) or because the abstract
intrinsic has no parallel concrete intrinsic there, this is not an
error; the corresponding loop is simply not produced and not added to
the set of possibilities.
- At runtime, the CPU detection code will further limit the set of
loops available, and the optimal one will be chosen for the ufunc (a
sketch of this selection follows the list).
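
A minimal sketch of what that runtime selection could look like (again
with hypothetical names: npy_cpu_have() stands in for the runtime
CPU-detection code, and the loop symbols stand in for variants compiled
from the same source at different feature levels):

    #include <numpy/ndarraytypes.h>   /* for npy_intp */

    /* The standard ufunc inner-loop signature. */
    typedef void (*loop_t)(char **args, npy_intp const *dimensions,
                           npy_intp const *steps, void *data);

    extern int npy_cpu_have(const char *feature);      /* hypothetical */
    extern void add_f32_avx512(char **, npy_intp const *,
                               npy_intp const *, void *);
    extern void add_f32_avx2(char **, npy_intp const *,
                             npy_intp const *, void *);
    extern void add_f32_baseline(char **, npy_intp const *,
                                 npy_intp const *, void *);

    static loop_t
    select_add_f32_loop(void)
    {
        /* Walk from the maximal compiled feature set down to the
         * baseline, returning the first variant this CPU can run. */
        if (npy_cpu_have("AVX512F")) {
            return add_f32_avx512;
        }
        if (npy_cpu_have("AVX2")) {
            return add_f32_avx2;
        }
        return add_f32_baseline;
    }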
The current NEP proposes only to use the runtime feature detection and
optimal loop selection mechanism for ufuncs. Future NEPs may propose
other uses for the proposed solution.
Usage and Impact
The end user will be able to get a list of intrinsics available for
their platform and compiler. Optionally, the user may be able to specify
which of the loops available at runtime will be used, perhaps via an
environment variable to enable benchmarking the impact of the different
loops. There should be no direct impact on naive end users: the results
of all the loops should be identical to within a small number (1-3?) of
ULPs. On the other hand, users with more capable machines should notice
a significant performance boost.
Binary releases - wheels on PyPI and conda packages
The binaries released by this process will be larger since they include
all possible loops for the architecture. Some packagers may prefer to
limit the number of loops in order to limit the size of the binaries;
we would hope they would still support a wide range of CPU families.
Note this problem already exists in the Intel MKL
offering, where the binary package includes an extensive set of
alternative shared objects (DLLs) for various CPU alternatives.
Source builds
See “Detailed Description” below. A source build where the packager
knows details of the target machine could theoretically produce a
smaller binary by choosing to compile only the loops needed by the
target via command line arguments.
How to run benchmarks to assess performance benefits
Adding more code which uses intrinsics will make the code harder to
maintain. Therefore, such code should only be added if it yields a
significant performance benefit. Assessing this performance benefit can
be nontrivial. To aid with this, the implementation for this NEP will
add a way to select which instruction sets can be used at runtime via
environment variables (name TBD). This ability is critical for CI code
verification.
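
Since the variable name is still to be decided, the following is only a
hedged sketch of how such an override could compose with the runtime
detection sketched above (NPY_DISABLE_CPU_FEATURES is a placeholder
name, not a decided one, and the substring match is deliberately crude):

    #include <stdlib.h>
    #include <string.h>

    extern int npy_cpu_have(const char *feature);  /* as sketched above */

    /* Treat any feature named in the (placeholder) environment variable
     * as absent, so CI can force e.g. the baseline loops and compare
     * their results against the optimized variants. */
    static int
    npy_cpu_have_checked(const char *feature)
    {
        const char *disabled = getenv("NPY_DISABLE_CPU_FEATURES");
        if (disabled != NULL && strstr(disabled, feature) != NULL) {
            return 0;
        }
        return npy_cpu_have(feature);
    }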
Diagnostics
A new dictionary __cpu_features__ will be available to Python. The keys
are the feature names; the values are booleans indicating whether each
feature is available at runtime. Various new private C functions will
be used internally to query available features. These might be exposed
via specific C-extension modules for testing.
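
Purely as a sketch (reusing the hypothetical npy_cpu_have() from above,
and an arbitrary feature list), the dictionary could be filled in from
the detection results roughly like this:

    #include <Python.h>

    extern int npy_cpu_have(const char *feature);      /* hypothetical */

    /* Build the __cpu_features__ dict for a fixed list of features. */
    static PyObject *
    build_cpu_features_dict(void)
    {
        static const char *features[] = {"SSE2", "AVX2", "AVX512F",
                                         "NEON", NULL};
        PyObject *dict = PyDict_New();
        if (dict == NULL) {
            return NULL;
        }
        for (int i = 0; features[i] != NULL; i++) {
            PyObject *val = npy_cpu_have(features[i]) ? Py_True : Py_False;
            if (PyDict_SetItemString(dict, features[i], val) < 0) {
                Py_DECREF(dict);
                return NULL;
            }
        }
        return dict;
    }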
Workflow for adding a new CPU architecture-specific optimization
NumPy will always have a baseline C implementation for any code that may
be a candidate for SIMD vectorization. If a contributor wants to add
SIMD support for some architecture (typically the one of most interest
to them), this is the proposed workflow:
TODO (see
https://github.com/numpy/numpy/pull/13516#issuecomment-558859638, needs
to be worked out more)
Reuse by other projects
It would be nice if the universal intrinsics were available to other
libraries like SciPy or Astropy that also build ufuncs, but that is not
an explicit goal of the first implementation of this NEP.
-----------------------------------------------------------------------------------
My biased summary of select comments from the PR:
(Raghuveer): A very similar SIMD library has been proposed for C++. Here
is the link to the details:
1. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r8.pdf
2. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/n4808.pdf
There is good discussion on the minimal/common set of instructions
across architectures (which narrows down to loads, stores, arithmetic,
compare, bitwise and shuffle instructions). Based on my developer
experience so far, these instructions aren't by themselves enough to
implement and optimize NumPy ufuncs. As I pointed out earlier, I think I
would find it useful to learn the workflow of how to use instructions
that don't fit in the Universal Intrinsic framework.
(Raghuveer) gave a well-laid-out table of the currently proposed
universal intrinsics by use: load/store, reorder, operators,
conversions, arithmetic, and misc [2], which led to a long response
from Sayed [3] with some sample code, demonstrating how more complex
operations can be built up from the primitives.
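
To give a flavor of that kind of composition (the sketch below is mine,
not Sayed's, and reuses the hypothetical npyv_* naming from above):
absolute value of packed floats needs no dedicated instruction, since
it falls out of a bitwise AND that clears the IEEE-754 sign bit:

    #if defined(__SSE2__)
    #include <emmintrin.h>
    typedef __m128 npyv_f32;
    #define npyv_and_f32(A,B) _mm_and_ps(A, B)

    /* |x| = x & 0x7fffffff: clear the sign bit with a mask. */
    static npyv_f32
    npyv_abs_f32(npyv_f32 v)
    {
        const npyv_f32 mask =
            _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
        return npyv_and_f32(v, mask);
    }
    #endif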
(catree) mentioned the Simd Library [4] and Halide [5] and asked about
maintainability.
(Ralf) responded [6] with concerns about competent developer bandwidth
for code review. He also mentioned that our CI system currently supports
all the architectures we are targeting (x86, aarch64, s390x, ppc64le)
although some of these machines may not have the most advanced hardware
to support the latest intrinsics.
I apologize if my summary is not accurate; please correct any mistakes
or misconceptions.
----------------------------------------------------------------------------------------
Barring complete rejection of the idea here, we will be pushing forward
with PRs to implement this. Comments either on the mailing list or in
those PRs are welcome.
Matti
[0] https://numpy.org/neps/nep-0038-SIMD-optimizations.html
[1] https://github.com/numpy/numpy/pull/15228
[2] https://github.com/numpy/numpy/pull/15228#issuecomment-580479336
[3] https://github.com/numpy/numpy/pull/15228#issuecomment-580605718
[4] https://github.com/ermig1979/Simd
[5] https://halide-lang.org
[6] https://github.com/numpy/numpy/pull/15228#issuecomment-581029991