Hi all, Here is what is probably the second-to-last NEP for NumPy 2.0 (the last one being the informational summary NEP of all major changes): https://github.com/numpy/numpy/pull/25542. Full text below.
A lot of the work has been under discussion since the 2.0 developer meeting in April and has been merged. A few of PRs that didn't make sense as standalone changes without this NEP are still open (see the "NumPy 2.0 API Changes" label), and there's a couple ones that still need to be opened. For editorial comments on the text, please comment on GitHub. For significant conceptual/design comments, please post them on this thread. Cheers, Ralf ============================================================= NEP 56 — Array API standard support in NumPy's main namespace ============================================================= :Author: Ralf Gommers <ralf.gomm...@gmail.com> :Author: Mateusz Sokół <mso...@quansight.com> :Author: Nathan Goldbaum <ngoldb...@quansight.com> :Status: Draft :Type: Standards Track :Created: 2023-12-19 :Resolution: TODO mailing list link Abstract -------- This NEP proposes adding full support for the 2022.12 version of the array API standard in NumPy's main namespace for the 2.0 release. Motivation and scope -------------------- .. note:: The main changes proposed in this NEP were presented in the NumPy 2.0 Developer Meeting in April 2023 (see `here < https://github.com/numpy/archive/blob/main/2.0_developer_meeting/NumPy_2.0_devmeeting_array_API_adoption.pdf >`__ for presentations from that meeting) and given a thumbs up there. The majority of the implementation work for NumPy 2.0 has already been merged. For the rest, PRs are ready - those are mainly the items that are specific to array API support and we'd probably not consider for inclusion in NumPy without that context. This NEP will focus on those APIs and PRs in a bit more detail. :ref:`NEP47` contains the motivation for adding array API support to NumPy. This NEP expands on and supersedes NEP 47. The main reason NEP 47 aimed for a separate ``numpy.array_api`` submodule rather than the main namespace is that casting rules differed too much. With value-based casting being removed (:ref:`NEP50`), that will be resolved in NumPy 2.0. Having NumPy be a superset of the array API standard will be a significant improvement for code portability to other libraries (CuPy, JAX, PyTorch, etc.) and thereby address one of the top user requests from the 2020 NumPy user survey [4]_ (GPU support). See `the numpy.array_api API docs (1.26.x) < https://numpy.org/doc/1.26/reference/array_api.html#table-of-differences-between-numpy-array-api-and-numpy >`__ for an overview of differences between it and the main namespace (note that the "strictness" ones are not applicable). Experiences with ``numpy.array_api``, which is still marked as experimental, have shown that the separate strict implementation and separate array object are mostly good for testing purposes, but not for regular usage in downstream libraries. Having support in the main namespace resolves this issue. Hence this NEP supersedes NEP 47. The ``numpy.array_api`` module will be moved to a standalone package, to facilitate easier updates not tied to a NumPy release cycle. Some of the key design rules from the array API standard (e.g., output dtypes predictable from input dtypes, no polymorphic APIs with varying number of returns controlled by keywords) will also be applied to NumPy functions that are not part of the array API standard, because those design rules are now understood to be good practice in general. Those two design rules in particular make it easier for Numba and other JIT compilers to support NumPy or NumPy-compatible APIs. We'll note that making existing arguments positional-only and keyword-only is a good idea for functions added to NumPy in the future, but will not be done for existing functions since each such change is a backwards compatibility break and it's not necessary for writing code that is portable across libraries supporting the standard. An additional reason to apply those design rules to all functions in the main namespace now is that it then becomes much easier to deal with potential standardization of new functions already present in NumPy - those could otherwise be blocked or forced to use alternative function names due to the need for backwards compatibility. It is important that new functions added to the main namespace integrate well with the rest of NumPy. So they should for example follow broadcasting and other rules as expected, and work with all NumPy's dtypes rather than only the ones in the standard. The same goes for backwards-incompatible changes (e.g., linear algebra functions need to all support batching in the same way, and consider the last two axes as matrices). As a result, NumPy should become more rather than less consistent. We'll note that one remaining incompatibility will be that NumPy is returning array scalars rather than 0-D arrays in most cases where the standard, and other array libraries, return 0-D arrays (e.g., indexing and reductions). There have been multiple discussions over the past year about the feasibility of removing array scalars from NumPy, or at least no longer returning them by default. However, this would be a large effort with some uncertainty about technical risks and impact of the change, and no one has taken it on. Given that array scalars implement a mostly array-compatible interface, this doesn't seem like the highest-prio item regarding array API standard compatibility. Here are what we see as the main expected benefits and costs of the complete set of proposed changes: Benefits: - It will remove the "having to make a choice between the NumPy API and the Array API" issue for other libraries, - The array API standard tends to have more consistent behavior than NumPy itself has (in cases where there are differences between the two, see for example the `linear algebra design principles < https://data-apis.org/array-api/2022.12/extensions/linear_algebra_functions.html#design-principles >`__ and `data-dependent output shapes page < https://data-apis.org/array-api/2022.12/design_topics/data_dependent_output_shapes.html >`__ in the standard), - Easier for CuPy, JAX, PyTorch, Dask, Numba, and other such libraries and compilers to fully match or support NumPy, - A few new features that have benefits independent of the standard: adding ``matrix_transpose`` and ``ndarray.mT``, adding ``vecdot``, introducing ``matrix_norm``/``vector_norm`` (they can be made gufuncs, vecdot already has a PR making it one), Costs: - A number of backwards compatibility breaks (mostly minor, see the Backwards compatibility section further down), - Expanding the size of the main namespace with about ~20 aliases (e.g., ``acos`` & co. with C99 names aliasing ``arccos`` & co.). Overall we believe that the benefits significantly outweigh the gains - and are permanent, while the costs are largely temporary. In particular, the benefits to array libraries and compilers that want to achieve compatibility with NumPy are significant. And as a result, the long-term benefits for the PyData (or scientific Python) ecosystem as a whole - because of downstream libraries being able to support multiple array libraries as easily as possible - are significant too. The number of breaking changes needed is fairly limited, and the impact of those changes seems modest. Not painless, but smaller than the impact of other breaking changes in NumPy 2.0, and a price worth paying. In scope for this NEP are: - Changes to NumPy's Python API needed to support the 2022.12 version of the array API standard, - Changes in the behavior of existing NumPy functions not (or not yet) present in the array API standard, to align with key design principles of the standard. Out of scope for this NEP are: - Other changes to NumPy's Python API unrelated to the array API standard, - Changes to NumPy's C API. This NEP will supersede the following NEPs: - :ref:`NEP30` (never implemented) - :ref:`NEP31` (never implemented) - :ref:`NEP37` (never implemented; the ``__array_module__`` idea is basically the same as ``__array_namespace__``) - :ref:`NEP47` (implemented with an experimental label in ``numpy.array_api``, will be removed) Usage and impact ---------------- We have several different types of users in mind: end users writing numerical code, downstream packages that depend on NumPy who want to start supporting multiple array libraries, and other array libraries and tools which aim to implement NumPy-like or NumPy-compatible APIs. The most prominent users who will benefit from array API support are probably downstream libraries that want to start supporting CuPy, PyTorch, JAX, Dask, or other such libraries. SciPy and scikit-learn are already fairly far along the way of doing just that, and successfully support CuPy arrays and PyTorch tensors in a small part of their own APIs (that support is still marked as experimental). The main principle they use is that they replace the regular ``import numpy as np`` with a utility function to retrieve the array library namespace from the input array. They call it ``xp``, which is effectively an alias to ``np`` if the input is a NumPy array, ``cupy`` for a CuPy array, ``torch`` for a PyTorch tensor. This ``xp`` then allows writing code that works for all these libraries - because the array API standard is the common denominator. As a concrete example, this code is taken from ``scipy.cluster``: .. code:: python def vq_py(obs, code_book, check_finite=True): """Python version of vq algorithm""" xp = array_namespace(obs, code_book) obs = as_xparray(obs, xp=xp, check_finite=check_finite) code_book = as_xparray(code_book, xp=xp, check_finite=check_finite) if obs.ndim != code_book.ndim: raise ValueError("Observation and code_book should have the same rank") if obs.ndim == 1: obs = obs[:, xp.newaxis] code_book = code_book[:, xp.newaxis] # Once `cdist` has array API support, this `xp.asarray` call can be removed dist = xp.asarray(cdist(obs, code_book)) code = xp.argmin(dist, axis=1) min_dist = xp.min(dist, axis=1) return code, min_dist It mostly looks like normal NumPy code, but will run with for example PyTorch tensors as input and then return PyTorch tensors. There is a lot more to this story of course then this basic example. These blog posts on scikit-learn [1]_ and SciPy's [2]_ experiences and impact (large performance gains in some cases - ``LinearDiscriminantAnalysis.fit`` showed ~28x gain with PyTorch on GPU vs. NumPy) paint a more complete picture. For end users who are using NumPy directly, little changes aside from there being fewer differences between NumPy and other libraries they may want to use as well. This shortens their learning curve and makes it easier to switch between NumPy and PyTorch/JAX/CuPy. In addition, they should benefit from array-consuming libraries starting to support multiple array libraries, making their experience of using a stack of Python packages for scientific computing or data science more seamless. Finally, for authors of other array libraries as well as tools like Numba, array API standard support should save them time. The design rules ([3]_), and in some cases new APIs like the ``unique_*`` ones, are easier to implement on GPU and for JIT compilers as a result of more predictable behavior. Backward compatibility ---------------------- The changes that have a backwards compatibility impact fall into these categories: 1. Raising errors for consistency/strictness in some places where NumPy now allows more flexible behavior, 2. Dtypes of returned arrays for some element-wise functions and reductions, 3. Numerical behavior for few tolerance keywords, 4. Functions moved to ``numpy.linalg`` and supporting stacking/batching. Raising errors for consistency/strictness includes: 1. Making ``.T`` error for >2 dimensions, 2. Making ``cross`` error on size-2 vectors (only size-3 vectors are supported), 3. Making ``solve`` error on ambiguous input (only accept ``x2`` as vector if ``x2.ndim == 1``), 4. ``outer`` raises rather than flattens on >1-D inputs, 5. In-place operators are disallowed when the left-hand side would be promoted. Dtypes of returned arrays for some element-wise functions and reductions includes functions where dtypes need to be preserved: ``ceil``, ``floor``, and ``trunc`` will start returning arrays with the same integer dtypes if the input has an integer dtype. It also includes dtype changes: ``sum`` and ``prod`` always upcast lower-precision floating-point dtypes to ``float64`` when ``dtype=None`` (this upcasting is already done for inputs with lower-precision integer dtypes). Changes in numerical behavior include: - The ``rtol`` default value for ``pinv`` changes from ``1e-15`` to a dtype-dependent default value of ``None``, interpreted as ``max(M, N) * finfo(result_dtype).eps``, - The ``tol`` keyword to ``matrix_rank`` changes to ``rtol`` with a different interpretation. In addition, ``matrix_rank`` will no longer support 1-D array input, - ``argsort`` and ``sort`` will gain a ``stable`` keyword argument in addition to ``kind``, and the default will become ``stable=True``. - The ``ddof`` keyword in ``std`` and ``var`` changes its name to ``correction``. The ``diagonal`` and ``trace`` functions are part of the ``linalg`` submodule in the standard, rather than the main namespace. Hence they will be introduced in ``numpy.linalg``. They will operate on the last two rather than first two axes (for consistency, and to support stacking). Hence the ``linalg`` and main namespace functions of the same names will differ. This is technically not breaking, but potentially confusing because of the different behavior for functions with the same name. We may deprecate ``np.trace`` and ``np.diagonal`` to resolve it, but preferably not immediately to avoid users having to write ``if-2.0-else`` conditional code. A related note on terminology: "stacking" is a confusing and fairly NumPy-specific term, this is called "batching" in deep learning frameworks and elsewhere - we plan to change the terminology in NumPy to "batching". There may be other minor changes that don't quite fall in one of the categories above. For example, ``numpy.fft`` functions need to preserve precision for 32-bit input dtypes rather than upcast to ``float64``/``complex128``. And there's an issue with the ``s``/``axes`` argument in n-D transforms that needs solving (see `gh-25495 <https://github.com/numpy/numpy/pull/25495>`__). Adapting to the changes ^^^^^^^^^^^^^^^^^^^^^^^ Some part of the Array API has already been implemented as part of the general Python API cleanup for NumPy 2.0 (see NEP 52), such as: - establishing one and way for naming ``inf`` and ``nan`` that is array API compatible. - removing cryptic dtype names and establishing canonical names for each dtype. All instructions for migrating to a NEP 52 compatible codebase are available in the `NumPy 2.0 Migration Guide <https://numpy.org/devdocs/numpy_2_0_migration_guide.html>`__ . Additionally, a new ``ruff`` rule was implemented for an automatic migration of Python API changes. It's worth pointing out that the new rule NP201 is only to adhere to the NEP 52 changes, and does not cover using new functions that are part of the array API standard nor APIs with some types of backwards incompatible changes discussed above. For an automated migration to Array API compatible codebase, a new rule is being implemented (see issue `ruff#8615 < https://github.com/astral-sh/ruff/issues/8615>`__ and PR `ruff#8910 <https://github.com/astral-sh/ruff/pull/8910>`__). With both rules in place a downstream user should be able to update their project, to the extent that that is possible with automation, to a library agnostic codebase that can benefit from different array libraries and devices. Backwards incompatible changes that cannot be handled automatically (e.g., a change in ``rtol`` defaults for a linear algebra function), this will be handled the same way as any other backwards incompatible change in NumPy 2.0 - through documentation, release notes, and API migrations and deprecations over several releases. Detailed description -------------------- In this section we'll focus on specific API additions and functionality that we would not consider introducing into NumPy if the standard would not exist and we wouldn't have to think/worry about its main goal of writing code that is portable across multiple array libraries and their supported features like GPUs and other hardware accelerators or JIT compilers. ``device`` support ^^^^^^^^^^^^^^^^^^ Device support is perhaps the most obvious example. NumPy is and will remain a CPU-only library, so why bother introducing a ``ndarray.device`` attribute or ``device=`` keywords in several functions? This one feature is purely meant to make it easier to write code that is portable across libraries. The ``.device`` attribute will return an object representing CPU, and that object will be accepted as an input to ``device=`` keywords. For example: .. code:: # Should work when `xp` is `np` and `x1` a numpy array x2 = xp.asarray([0, 1, 2, 3], dtype=xp.float64, device=x1.device) This will work as expected for NumPy, creating a 1-D numpy array from the input list. It will also work for CuPy & co, where it may create a new array on a GPU or other supported device. ``isdtype`` ^^^^^^^^^^^ The array API standard introduced a new function ``isdtype`` for introspection of dtypes, because there was no suitable alternative in NumPy. The closest one is ``np.issubdtype``, however that assumes a complex class hierarchy which other array libraries don't have, isn't the most ergonomic API, and required a larger API surface (``np.floating`` and friends). ``isdtype`` will be the new and canonical way to introspect dtypes. All it requires from a dtype is that ``_eq__`` is implemented and has the expected behavior when compared with other dtypes from the same library. Note that as part of the effort on NEP 52, some dtype aliases were removed and canonical Python and C names documented. See also `gh-17325 <https://github.com/numpy/numpy/issues/17325>`__ covering issues with NumPy's lack of a good API for this. ``copy`` keyword semantics ^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``copy`` keyword in ``asarray`` and ``array`` will now support ``True``/``False``/``None`` with new meanings: - ``True`` - Always make a copy. - ``False`` - Never make a copy. If a copy is required a ``ValueError`` is raised. - ``None`` - A copy will only be made if it is necessary (previously ``False``). The ``copy`` keyword in ``astype`` will stick to its current meaning, because "never copy" when asking for a cast to a different dtype doesn't quite make sense. New function name aliases ^^^^^^^^^^^^^^^^^^^^^^^^^ In the Python API cleanup for NumPy 2.0 (see :ref:`NEP52`) we spent a lot of effort removing aliases. So introducing new aliases has to have a good rationale. In this case, it is needed in order to match other libraries. The main set of aliases added is for trigonometric functions, where the array API standard chose to follow C99 and other libraries in using ``acos``, ``asin`` etc. rather than ``arccos``, ``arcsin``, etc. NumPy usually also follows C99; it is not entirely clear why this naming choice was made many years ago. New keywords with overlapping semantics ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Similarly to function name aliases, there are a couple of new keywords which have overlap with existing ones: - ``correction`` keyword for ``std`` and ``var`` (overlaps with ``ddof``) - ``stable`` keyword for ``sort`` and ``argsort`` (overlaps with ``kind``) The ``correction`` name is for clarity ("delta degrees of freedom" is not easy to understand) and ``stable`` is complementary to ``kind``, allowing a library to reserve the right to change/improve the stable and unstable sorting algorithms. New ``unique_*`` functions ^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``unique`` function, with ``return_index``, ``return_inverse``, and ``return_counts`` arguments that influence the cardinality of the returned tuple, is replaced in the Array API by four respective functions: ``unique_all``, ``unique_counts``, ``unique_inverse``, and ``unique_values``. This new API helps to better anticipate the return object and support clearer typing. ``np.bool`` addition ^^^^^^^^^^^^^^^^^^^^ One of the aliases that used to live in NumPy but got removed was ``np.bool``. To comply with the Array API it got reintroduced with a different meaning, as now it points to NumPy's bool instead of a Python builtin. This change is a good idea and we were planning to make it anyway, because ``bool`` is a nicer name than ``bool_``. However, we may not have scheduled that reintroduction of the name for 2.0 perhaps if it had not been part of the array API standard. Related work ------------ The array API standard (`html docs <https://data-apis.org/array-api/2022.12/ >`__, `repository <https://github.com/data-apis/array-api/>`__) is the first related work; a lot of design discussion in its issue tracker may be relevant in case reasons for particular decisions need to be found. Downstream adoption from array-consuming libraries is actively happening at the moment, see for example: - scikit-learn `docs on array API support < https://scikit-learn.org/dev/modules/array_api.html>`__ and `PRs < https://github.com/scikit-learn/scikit-learn/pulls?q=is%3Aopen+is%3Apr+label%3A%22Array+API%22>`__ and `issues < https://github.com/scikit-learn/scikit-learn/issues?q=is%3Aopen+is%3Aissue+label%3A%22Array+API%22 >`__ labeled with *Array API*. - SciPy `docs on array API support < http://scipy.github.io/devdocs/dev/api-dev/array_api.html>`__ and `PRs < https://github.com/scipy/scipy/pulls?q=is%3Aopen+is%3Apr+label%3A%22array+types%22 >`__ and `issues < https://github.com/scipy/scipy/issues?q=is%3Aopen+is%3Aissue+label%3A%22array+types%22>`__ labeled with *array types*. - Einops `docs on supported frameworks < https://einops.rocks/#supported-frameworks>`__ and `PR to implement array API standard support < https://github.com/arogozhnikov/einops/pull/261>`__. Other array libraries either already have support or are implementing support for the array API standard (in sync with the changes for NumPy 2.0, since they usually try to be as compatible to NumPy as possible). For example: - CuPy's `docs on array API support < https://docs.cupy.dev/en/stable/reference/array_api.html>`__ and `PRs labelled with array-api < https://github.com/cupy/cupy/pulls?q=is%3Aopen+is%3Apr+label%3Aarray-api >`__. - JAX: enhancement proposal `Scope of JAX NumPy & SciPy Wrappers < https://jax.readthedocs.io/en/latest/jep/18137-numpy-scipy-scope.html#axis-2-array-api-alignment >`__ and `PR with initial implementation < https://github.com/google/jax/pull/16099>`__. Implementation -------------- The tracking issue for Array API standard support (`gh-25076 <https://github.com/numpy/numpy/issues/25076>`__) records progress of implementing full support and links to related discussions. It lists all relevant PRs (merged and pending) that verify or provide array API support. As NEP 52 blends to some degree with this NEP, we can find some relevant implementations and discussion also on its tracking issue (`gh-23999 < https://github.com/numpy/numpy/issues/23999>`__). The PR that was merged as one of the first contained a new CI job that adds the `array-api-tests <https://github.com/data-apis/array-api-tests>`__ test suite. This way we had a better control over which batch of functions/aliases is being added each time, and be sure that the implementation conforms with the Array API standard (see `gh-25167 <https://github.com/numpy/numpy/pull/25167>`__). Then, we continued to merge one batch at the time, adding a specific API section. Below we list some of the more substantial ones, including some that we discussed in the previous sections of this NEP: - `gh-25167: MAINT: Add array-api-tests CI stage, add ndarray.__array_namespace__ <https://github.com/numpy/numpy/pull/25167>`__. - `gh-25088: API: Add Array API setops [Array API] < https://github.com/numpy/numpy/pull/25088>`__ - `gh-25155: API: Add matrix_norm, vector_norm, vecdot and matrix_transpose [Array API] <https://github.com/numpy/numpy/pull/25155>`__ - `gh-25080: API: Add and redefine numpy.bool [Array API] < https://github.com/numpy/numpy/pull/25080>`__ - `gh-25054: API: Introduce np.isdtype function [Array API] < https://github.com/numpy/numpy/pull/25054>`__ - `gh-25168: API: Introduce copy argument for np.asarray [Array API] < https://github.com/numpy/numpy/pull/25168>`__ Alternatives ------------ The alternatives to implementing support for the array API standard in NumPy's main namespace include: - one or more of the superseded NEPs, or - making ``ndarray.__array_namespace__()`` return a hidden namespace with compatible functions, - not implementing support for the array API standard at all. The superseded NEPs all have some drawbacks compared to the array API standard, and by now a lot of work has gone into the standard - as well as adoption by other key libraries. So those alternatives are not appealing. Given the amount of interest in this topic, doing nothing also is not appealing. The "hidden namespace" option would be a smaller change to this proposal. We prefer not to do that since it leads to duplicate implementations staying around, a more complex implementation (e.g., potential issues with static typing), and still having two flavors of essentially the same API. An alternative to removing ``numpy.array_api`` from NumPy is to keep it in its current place, since it is still useful - it is the best way to test if downstream code is actually portable between array libraries. This is a very reasonable alternative, however there is a slight preference for taking that module and turning it into a standalone package. Discussion ---------- References and footnotes ------------------------ .. [1] https://labs.quansight.org/blog/array-api-support-scikit-learn .. [2] https://labs.quansight.org/blog/scipy-array-api .. [3] A. Meurer et al., "Python Array API Standard: Toward Array Interoperability in the Scientific Python Ecosystem." (2023), https://conference.scipy.org/proceedings/scipy2023/pdfs/aaron_meurer.pdf .. [4] https://numpy.org/user-survey-2020/, 2020 NumPy User Survey results Copyright --------- This document has been placed in the public domain.
_______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com