Hi all,

Here's a first draft NEP for comments.

--

Synopsis
========

Improving numpy's dtype system requires that ufunc loops start having
access to details of the specific dtype instance they are acting on:
e.g. an implementation of np.equal for strings needs access to the
dtype object in order to know what "n" to pass to strncmp. Similar
issues arise with variable length strings, missing values, categorical
data, unit support, datetime with timezone support, etc. -- this is a
major blocker for improving numpy.

Unfortunately, the current ufunc inner loop function signature makes
it very difficult to provide this information. We might be able to
wedge it in there, but it'd be ugly.

The other option would be to change the signature. What would happen
if we did this? For most common uses of the C API/ABI, we could do
this easily while maintaining backwards compatibility. But there are
also some rarely-used parts  of the API/ABI that would be
prohibitively difficult to preserve.

In addition, there are other potential changes to ufuncs on the
horizon (e.g. extensions of gufuncs to allow them to be used more
generally), and the current API exposure is so massive that any such
changes will be difficult to make in a fully compatible way. This NEP
thus considers the possibility of closing down the ufunc API to a
minimal, maintainable subset of the current API.

To better understand the consequences of this potential change, I
performed an exhaustive analysis of all the code on Github, Bitbucket,
and Fedora, among others. The results make me highly confident that of
all the publically available projects in the world, the only ones
which touch the problematic parts of the ufunc API are: Numba,
dynd-python, and `gulinalg <https://github.com/ContinuumIO/gulinalg>`_
(with the latter's exposure being trivial).

Given this, I propose that for 1.11 we:
1) go ahead and hide/disable the problematic parts of the ABI/API,
2) coordinate with the known affected projects to minimize disruption
to their users (which is made easier since they are all projects that
are almost exclusively distributed via conda, which enforces strict
NumPy ABI versioning),
3) publicize these changes widely so as to give any private code that
might be affected a chance to speak up or adapt, and
4) leave the "ABI version tag" as it is, so as not to force rebuilds
of the vast majority of projects that will be unaffected by these
changes.

This NEP defers the question of exactly what the improved API should
be, since there's no point in trying to nail down the details until
we've decided whether it's even possible to change.


Details
=======

The problem
-----------

Currently, a ufunc inner loop implementation is called via the
following function prototype::

    typedef void (*PyUFuncGenericFunction)
                (char **args,
                 npy_intp *dimensions,
                 npy_intp *strides,
                 void *innerloopdata);

Here ``args`` is an array of pointers to 1-d buffers of input/output
data, ``dimensions`` is a pointer to the number of entries in these
buffers, ``strides`` is an array of integers giving the strides for
each input/output array, and ``innerloopdata`` is an arbitrary void*
supplied by whoever registered the ufunc loop. (For gufuncs, extra
shape and stride information about the core dimensions also gets
packed into the ends of these arrays in a somewhat complicated way.)

There are 4 key items that define a NumPy array: data, shape, strides,
dtype. Notice that this function only gets access to 3 of them. Our
goal is to fix that. For example, a better signature would be::

    typedef void (*PyUFuncGenericFunction_NEW)
                (char **data,
                 npy_intp *shapes,
                 npy_intp *strides,
                 PyArray_Descr *dtypes,   /* NEW */
                 void *innerloopdata);

(In practice I suspect we might want to make some more changes as
well, like upgrading gufunc core shape/strides to proper arguments
instead of tacking it onto the existing arrays, and adding an "escape
valve" void* reserved for future extensions. But working out such
details is outside the scope of this NEP; the above will do for
illustration.)

The goal of this NEP is to clear the ground so that we can start
supporting ufunc inner loops that take dtype arguments, and make other
enhancements to ufunc functionality going forward.


Proposal
--------

Currently, the public API/ABI for ufuncs consists of the functions::

    PyUFunc_GenericFunction

    PyUFunc_FromFuncAndData
    PyUFunc_FromFuncAndDataAndSignature
    PyUFunc_RegisterLoopForDescr
    PyUFunc_RegisterLoopForType

    PyUFunc_ReplaceLoopBySignature
    PyUFunc_SetUsesArraysAsData

together with direct access to PyUFuncObject's internal fields::

    typedef struct {
        PyObject_HEAD
        int nin, nout, nargs;
        int identity;
        PyUFuncGenericFunction *functions;
        void **data;
        int ntypes;
        int check_return;
        const char *name;
        char *types;
        const char *doc;
        void *ptr;
        PyObject *obj;
        PyObject *userloops;
        int core_enabled;
        int core_num_dim_ix;
        int *core_num_dims;
        int *core_dim_ixs;
        int *core_offsets;
        char *core_signature;
        PyUFunc_TypeResolutionFunc *type_resolver;
        PyUFunc_LegacyInnerLoopSelectionFunc *legacy_inner_loop_selector;
        PyUFunc_InnerLoopSelectionFunc *inner_loop_selector;
        PyUFunc_MaskedInnerLoopSelectionFunc *masked_inner_loop_selector;
        npy_uint32 *op_flags;
        npy_uint32 iter_flags;
    } PyUFuncObject;

Obviously almost any future changes to how ufuncs work internally will
involve touching some part of this public API/ABI.

Concretely, the proposal here is that we avoid this by disabling the
following functions (i.e., any attempt to call them should simply
raise a ``NotImplementedError``)::

    PyUFunc_ReplaceLoopBySignature
    PyUFunc_SetUsesArraysAsData

and that we reduce the publicly visible portion of PyUFuncObject down to::

    typedef struct {
        PyObject_HEAD
        int nin, nout, nargs;
    } PyUFuncObject;


Data on current API/ABI usage
-----------------------------

In order to assess how much code would be affected by this proposal, I
used a combination of Github search and Searchcode.com to trawl
through the majority of all publicly available open source code.
Neither search tool provides a fine-grained enough query language to
directly tell us what we want to know, so I instead followed the
strategy of first, casting a wide net: picking a set of search terms
that are likely to catch all possibly-broken code (together with many
false positives), and second, using automated tools to sift out the
false positives and see what remained. Altogether, I reviewed 4464
search results.

The tool I wrote to do this is `available on github
<https://github.com/njsmith/codetrawl>`_, and so is `the analysis code
itself <https://github.com/njsmith/ufunc-abi-analysis>`_.


Uses of PyUFuncObject internals
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are no functions in the public API which return
``PyUFuncObject*`` values directly, so any code that access
PyUFuncObject fields will have to mention that token in the course of
defining a variable, performing a cast, setting up a typedef, etc.

Therefore, I searched Github for all files written in C, C++,
Objective C, Python, or Cython, which mentioned either "PyUFuncObject
AND types" or "PyUFuncObject AND NOT types". (This is to work around
limitations on how many results Github search is willing to return to
a single query.) In addition, I searched for ``PyUFuncObject`` on
searchcode.com.

The full report on these searches is available here:
https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/pyufuncobject-report.html

The following were screened out as non-problems:

- Copies of NumPy itself (an astonishing number of people have checked
in copies of it to their own source tree)
- NumPy forks / precursors / etc. (e.g. Numeric also had a type called
PyUFuncObject, the "bohrium" project has a fork of numpy 1.6, etc.)
- Cython-generated boilerplate used to generate the "object has
changed size" warning (which we `unconditionally filter out anyway
<https://github.com/numpy/numpy/blob/master/numpy/__init__.py#L226>`_)
- Lots of calls to ``PyUFunc_RegisterLoopForType`` and friends, which
require casting the first argument to ``PyUFuncObject*``
- Misc. other unproblematic stuff (like Cython header declarations
that never get used)

There were also several cases that actually referenced PyUFuncObject
internal fields:

- The "rational" dtype from numpy-dtypes, which is used in a few
projects, accesses ``ufunc->nargs`` as a safety check, but does not
touch any other fields (`see here
<https://github.com/numpy/numpy-dtypes/blob/c0175a6b1c5aa89b4520b29487f06d0e200e2a03/npytypes/rational/rational.c#L1140-L1151>`_).

- Numba: does some rather elaborate things to support the definition
of on-the-fly JITted ufuncs. These seem to be clear deficiencies in
the ufunc API (e.g., there's no good way to control the lifespan of
the array of function pointers passed to ``PyUFunc_FromFuncAndData``),
so we should work with them to provide the API they need to do this in
a maintainable way. Some of the relevant code:

  https://github.com/numba/numba/tree/master/numba/npyufunc
  
https://github.com/numba/numba/blob/98752647a95ac6c9d480e81ca5c8afcfa3ddfd18/numba/npyufunc/_internal.c

- dynd-python: Contains some code that attempts to extract the inner
loops from a numpy ufunc object and wrap them into the dynd 'ufunc'
equivalent:
  
https://github.com/libdynd/dynd-python/blob/c06f8fc4e72257abac589faf76f10df8c045159b/dynd/src/numpy_ufunc_kernel.cpp

- gulinalg: I'm not sure if anyone is still using this code since most
of it was merged into numpy itself, but it's not a big deal in any
case: all it contains is a `debugging function
<https://github.com/ContinuumIO/gulinalg/blob/2ef365c48427c026dab4f45dc6f8b1b9af184460/gulinalg/src/gulinalg.c.src#L527-L550>`_
that dumps some internal fields from the PyUFuncObject. If you look,
though, all calls to this function are already commented out :-).

The full report is available here:
https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/pyufuncobject-report.html

In the course of this analysis, it was also noted that the standard
Cython pxd files contain a wrapper for ufunc objects::

    cdef class ufunc [object PyUFuncObject]:
        ...

which means that Cython code can access internal struct fields via an
object of type ``ufunc``, and thus escape our string-based search
above. Therefore I also examined all Cython files on Github or
searchcode.com that matched the query ``ufunc``, and searched for any
lines matching any of the following regular expressions::

    cdef\W+ufunc
       catches: 'cdef ufunc fn'
    cdef\W+.*\.\W*ufunc
       catches: 'cdef np.ufunc fn'
    <.*ufunc\W*>
       catches: '(<ufunc> fn).nargs', '(< np.ufunc > fn).nargs'
    cdef.*\(.*ufunc
       catches: 'cdef doit(np.ufunc fn, ...):'

(I considered parsing the actual source and analysing it that way, but
decided I was too lazy. This could be done if anyone is worried that
the above regexes might miss things though.)

There were zero files that contained matches for any of the above regexes:
https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/ufunc-cython-report.html


Uses of PyUFunc_ReplaceLoopBySignature
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Applying the same screening as above, the only code that was found
that used this function is also in Numba:
https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/PyUFunc_ReplaceLoopBySignature-report.html


Uses of PyUFunc_SetUsesArraysAsData
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Aside from being semi-broken since 1.7 (it never got implemented for
"masked" ufunc loops, i.e. those that use where=), there appear to be
zero uses of this functionality either inside or outside NumPy:
https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/PyUFunc_SetUsesArraysAsData-report.html


Rationale
---------

**Rationale for preserving the remaining API functions**::

    PyUFunc_GenericFunction

    PyUFunc_FromFuncAndData
    PyUFunc_FromFuncAndDataAndSignature
    PyUFunc_RegisterLoopForDescr
    PyUFunc_RegisterLoopForType

In addition to being widely used, these functions can easily be
preserved even if we change how ufuncs work internally, because they
only ingest loop function pointers, they never return them. So they
can easily be modified to wrap whatever loop function(s) they receive
inside an adapter function that calls them at the appropriate time,
and then register that adapter function using whatever API we add in
the future.

**Rationale for preserving the particular fields that are preserved**:
Preserving ``nargs`` lets us avoid a little bit of breakage with the
random dtype, and it doesn't seem like preserving ``nin``, ``nout``,
``nargs`` fields will produce any undue burden on future changes to
ufunc internals; even if we were to introduce variadic ufuncs we could
always just stick a -1 in the appropriate fields or whatever.

**Rationale for removing PyUFunc_ReplaceLoopBySignature**: this
function *returns* the PyUFunc_GenericFunction that was replaced; if
we stop representing all loops using the legacy
PyUFunc_GenericFunction type, then this will not be possible to do in
the future.

**Rationale for removing PyUFunc_SetUsesArraysAsData**: If set as the
``innerloopdata`` on a ufunc loop, then this function acts as a
sentinel value, and causes the ``innerloopdata`` to instead be set to
a pointer to the passed-in PyArrayObjects.  In principle we could
preserve this function, but it has a number of deficiencies:
- No-one appears to use it.
- It's been buggy for several releases and no-one noticed.
- AFAIK the only reason it was included in the first place is that it
provides a backdoor for ufunc loops to get access to the dtypes -- but
we are planning to fix this in a better way.
- It can't be shimmed as easily as the loop registration functions,
because we don't anticipate that the new-and-improved ufunc loop
functions will *get* access to the array objects, only to the dtypes;
so this would have to remain cluttering up the core dispatch path
indefinitely.
- We have good reason for *not* wanting to get ufunc loops get access
to the actual array objects, because one of the goals on our roadmap
is exactly to enable the use of ufuncs on non-ndarray objects. Giving
ufuncs access to dtypes alone creates a clean boundary here: it
guarantees that ufunc loops can work equally on all duck-array objects
(so long as they have a dtype), and enforces the invariant that
anything which affects the interpretation of data values should be
attached to the dtype, not to the array object.


Rejected alternatives
---------------------

**Do nothing**: there's no way we'll ever be able to touch ufuncs at
all if we don't hide those fields sooner or later. While any amount of
breakage is regrettable, the costs of cleaning things up now are less
than the costs of never improving numpy's APIs.

**Somehow sneak the dtype information in via ``void
*innerloopdata``**: This might let us preserve the signature of
PyUFunc_GenericFunction, and thus preserve
PyUFunc_ReplaceLoopBySignature. But we'd still have the problem of
leaving way too much internal state exposed, and it's not even clear
how this would work, given that we actually do want to preserve the
use of ``innerloopdata`` for actual per-loop data. (This is where the
PyUFunc_SetUsesArraysAsData hack fails.)


-- 
Nathaniel J. Smith -- http://vorpus.org
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion

Reply via email to