Hi all, Here's a first draft NEP for comments.
-- Synopsis ======== Improving numpy's dtype system requires that ufunc loops start having access to details of the specific dtype instance they are acting on: e.g. an implementation of np.equal for strings needs access to the dtype object in order to know what "n" to pass to strncmp. Similar issues arise with variable length strings, missing values, categorical data, unit support, datetime with timezone support, etc. -- this is a major blocker for improving numpy. Unfortunately, the current ufunc inner loop function signature makes it very difficult to provide this information. We might be able to wedge it in there, but it'd be ugly. The other option would be to change the signature. What would happen if we did this? For most common uses of the C API/ABI, we could do this easily while maintaining backwards compatibility. But there are also some rarely-used parts of the API/ABI that would be prohibitively difficult to preserve. In addition, there are other potential changes to ufuncs on the horizon (e.g. extensions of gufuncs to allow them to be used more generally), and the current API exposure is so massive that any such changes will be difficult to make in a fully compatible way. This NEP thus considers the possibility of closing down the ufunc API to a minimal, maintainable subset of the current API. To better understand the consequences of this potential change, I performed an exhaustive analysis of all the code on Github, Bitbucket, and Fedora, among others. The results make me highly confident that of all the publically available projects in the world, the only ones which touch the problematic parts of the ufunc API are: Numba, dynd-python, and `gulinalg <https://github.com/ContinuumIO/gulinalg>`_ (with the latter's exposure being trivial). Given this, I propose that for 1.11 we: 1) go ahead and hide/disable the problematic parts of the ABI/API, 2) coordinate with the known affected projects to minimize disruption to their users (which is made easier since they are all projects that are almost exclusively distributed via conda, which enforces strict NumPy ABI versioning), 3) publicize these changes widely so as to give any private code that might be affected a chance to speak up or adapt, and 4) leave the "ABI version tag" as it is, so as not to force rebuilds of the vast majority of projects that will be unaffected by these changes. This NEP defers the question of exactly what the improved API should be, since there's no point in trying to nail down the details until we've decided whether it's even possible to change. Details ======= The problem ----------- Currently, a ufunc inner loop implementation is called via the following function prototype:: typedef void (*PyUFuncGenericFunction) (char **args, npy_intp *dimensions, npy_intp *strides, void *innerloopdata); Here ``args`` is an array of pointers to 1-d buffers of input/output data, ``dimensions`` is a pointer to the number of entries in these buffers, ``strides`` is an array of integers giving the strides for each input/output array, and ``innerloopdata`` is an arbitrary void* supplied by whoever registered the ufunc loop. (For gufuncs, extra shape and stride information about the core dimensions also gets packed into the ends of these arrays in a somewhat complicated way.) There are 4 key items that define a NumPy array: data, shape, strides, dtype. Notice that this function only gets access to 3 of them. Our goal is to fix that. For example, a better signature would be:: typedef void (*PyUFuncGenericFunction_NEW) (char **data, npy_intp *shapes, npy_intp *strides, PyArray_Descr *dtypes, /* NEW */ void *innerloopdata); (In practice I suspect we might want to make some more changes as well, like upgrading gufunc core shape/strides to proper arguments instead of tacking it onto the existing arrays, and adding an "escape valve" void* reserved for future extensions. But working out such details is outside the scope of this NEP; the above will do for illustration.) The goal of this NEP is to clear the ground so that we can start supporting ufunc inner loops that take dtype arguments, and make other enhancements to ufunc functionality going forward. Proposal -------- Currently, the public API/ABI for ufuncs consists of the functions:: PyUFunc_GenericFunction PyUFunc_FromFuncAndData PyUFunc_FromFuncAndDataAndSignature PyUFunc_RegisterLoopForDescr PyUFunc_RegisterLoopForType PyUFunc_ReplaceLoopBySignature PyUFunc_SetUsesArraysAsData together with direct access to PyUFuncObject's internal fields:: typedef struct { PyObject_HEAD int nin, nout, nargs; int identity; PyUFuncGenericFunction *functions; void **data; int ntypes; int check_return; const char *name; char *types; const char *doc; void *ptr; PyObject *obj; PyObject *userloops; int core_enabled; int core_num_dim_ix; int *core_num_dims; int *core_dim_ixs; int *core_offsets; char *core_signature; PyUFunc_TypeResolutionFunc *type_resolver; PyUFunc_LegacyInnerLoopSelectionFunc *legacy_inner_loop_selector; PyUFunc_InnerLoopSelectionFunc *inner_loop_selector; PyUFunc_MaskedInnerLoopSelectionFunc *masked_inner_loop_selector; npy_uint32 *op_flags; npy_uint32 iter_flags; } PyUFuncObject; Obviously almost any future changes to how ufuncs work internally will involve touching some part of this public API/ABI. Concretely, the proposal here is that we avoid this by disabling the following functions (i.e., any attempt to call them should simply raise a ``NotImplementedError``):: PyUFunc_ReplaceLoopBySignature PyUFunc_SetUsesArraysAsData and that we reduce the publicly visible portion of PyUFuncObject down to:: typedef struct { PyObject_HEAD int nin, nout, nargs; } PyUFuncObject; Data on current API/ABI usage ----------------------------- In order to assess how much code would be affected by this proposal, I used a combination of Github search and Searchcode.com to trawl through the majority of all publicly available open source code. Neither search tool provides a fine-grained enough query language to directly tell us what we want to know, so I instead followed the strategy of first, casting a wide net: picking a set of search terms that are likely to catch all possibly-broken code (together with many false positives), and second, using automated tools to sift out the false positives and see what remained. Altogether, I reviewed 4464 search results. The tool I wrote to do this is `available on github <https://github.com/njsmith/codetrawl>`_, and so is `the analysis code itself <https://github.com/njsmith/ufunc-abi-analysis>`_. Uses of PyUFuncObject internals ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ There are no functions in the public API which return ``PyUFuncObject*`` values directly, so any code that access PyUFuncObject fields will have to mention that token in the course of defining a variable, performing a cast, setting up a typedef, etc. Therefore, I searched Github for all files written in C, C++, Objective C, Python, or Cython, which mentioned either "PyUFuncObject AND types" or "PyUFuncObject AND NOT types". (This is to work around limitations on how many results Github search is willing to return to a single query.) In addition, I searched for ``PyUFuncObject`` on searchcode.com. The full report on these searches is available here: https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/pyufuncobject-report.html The following were screened out as non-problems: - Copies of NumPy itself (an astonishing number of people have checked in copies of it to their own source tree) - NumPy forks / precursors / etc. (e.g. Numeric also had a type called PyUFuncObject, the "bohrium" project has a fork of numpy 1.6, etc.) - Cython-generated boilerplate used to generate the "object has changed size" warning (which we `unconditionally filter out anyway <https://github.com/numpy/numpy/blob/master/numpy/__init__.py#L226>`_) - Lots of calls to ``PyUFunc_RegisterLoopForType`` and friends, which require casting the first argument to ``PyUFuncObject*`` - Misc. other unproblematic stuff (like Cython header declarations that never get used) There were also several cases that actually referenced PyUFuncObject internal fields: - The "rational" dtype from numpy-dtypes, which is used in a few projects, accesses ``ufunc->nargs`` as a safety check, but does not touch any other fields (`see here <https://github.com/numpy/numpy-dtypes/blob/c0175a6b1c5aa89b4520b29487f06d0e200e2a03/npytypes/rational/rational.c#L1140-L1151>`_). - Numba: does some rather elaborate things to support the definition of on-the-fly JITted ufuncs. These seem to be clear deficiencies in the ufunc API (e.g., there's no good way to control the lifespan of the array of function pointers passed to ``PyUFunc_FromFuncAndData``), so we should work with them to provide the API they need to do this in a maintainable way. Some of the relevant code: https://github.com/numba/numba/tree/master/numba/npyufunc https://github.com/numba/numba/blob/98752647a95ac6c9d480e81ca5c8afcfa3ddfd18/numba/npyufunc/_internal.c - dynd-python: Contains some code that attempts to extract the inner loops from a numpy ufunc object and wrap them into the dynd 'ufunc' equivalent: https://github.com/libdynd/dynd-python/blob/c06f8fc4e72257abac589faf76f10df8c045159b/dynd/src/numpy_ufunc_kernel.cpp - gulinalg: I'm not sure if anyone is still using this code since most of it was merged into numpy itself, but it's not a big deal in any case: all it contains is a `debugging function <https://github.com/ContinuumIO/gulinalg/blob/2ef365c48427c026dab4f45dc6f8b1b9af184460/gulinalg/src/gulinalg.c.src#L527-L550>`_ that dumps some internal fields from the PyUFuncObject. If you look, though, all calls to this function are already commented out :-). The full report is available here: https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/pyufuncobject-report.html In the course of this analysis, it was also noted that the standard Cython pxd files contain a wrapper for ufunc objects:: cdef class ufunc [object PyUFuncObject]: ... which means that Cython code can access internal struct fields via an object of type ``ufunc``, and thus escape our string-based search above. Therefore I also examined all Cython files on Github or searchcode.com that matched the query ``ufunc``, and searched for any lines matching any of the following regular expressions:: cdef\W+ufunc catches: 'cdef ufunc fn' cdef\W+.*\.\W*ufunc catches: 'cdef np.ufunc fn' <.*ufunc\W*> catches: '(<ufunc> fn).nargs', '(< np.ufunc > fn).nargs' cdef.*\(.*ufunc catches: 'cdef doit(np.ufunc fn, ...):' (I considered parsing the actual source and analysing it that way, but decided I was too lazy. This could be done if anyone is worried that the above regexes might miss things though.) There were zero files that contained matches for any of the above regexes: https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/ufunc-cython-report.html Uses of PyUFunc_ReplaceLoopBySignature ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Applying the same screening as above, the only code that was found that used this function is also in Numba: https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/PyUFunc_ReplaceLoopBySignature-report.html Uses of PyUFunc_SetUsesArraysAsData ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Aside from being semi-broken since 1.7 (it never got implemented for "masked" ufunc loops, i.e. those that use where=), there appear to be zero uses of this functionality either inside or outside NumPy: https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/PyUFunc_SetUsesArraysAsData-report.html Rationale --------- **Rationale for preserving the remaining API functions**:: PyUFunc_GenericFunction PyUFunc_FromFuncAndData PyUFunc_FromFuncAndDataAndSignature PyUFunc_RegisterLoopForDescr PyUFunc_RegisterLoopForType In addition to being widely used, these functions can easily be preserved even if we change how ufuncs work internally, because they only ingest loop function pointers, they never return them. So they can easily be modified to wrap whatever loop function(s) they receive inside an adapter function that calls them at the appropriate time, and then register that adapter function using whatever API we add in the future. **Rationale for preserving the particular fields that are preserved**: Preserving ``nargs`` lets us avoid a little bit of breakage with the random dtype, and it doesn't seem like preserving ``nin``, ``nout``, ``nargs`` fields will produce any undue burden on future changes to ufunc internals; even if we were to introduce variadic ufuncs we could always just stick a -1 in the appropriate fields or whatever. **Rationale for removing PyUFunc_ReplaceLoopBySignature**: this function *returns* the PyUFunc_GenericFunction that was replaced; if we stop representing all loops using the legacy PyUFunc_GenericFunction type, then this will not be possible to do in the future. **Rationale for removing PyUFunc_SetUsesArraysAsData**: If set as the ``innerloopdata`` on a ufunc loop, then this function acts as a sentinel value, and causes the ``innerloopdata`` to instead be set to a pointer to the passed-in PyArrayObjects. In principle we could preserve this function, but it has a number of deficiencies: - No-one appears to use it. - It's been buggy for several releases and no-one noticed. - AFAIK the only reason it was included in the first place is that it provides a backdoor for ufunc loops to get access to the dtypes -- but we are planning to fix this in a better way. - It can't be shimmed as easily as the loop registration functions, because we don't anticipate that the new-and-improved ufunc loop functions will *get* access to the array objects, only to the dtypes; so this would have to remain cluttering up the core dispatch path indefinitely. - We have good reason for *not* wanting to get ufunc loops get access to the actual array objects, because one of the goals on our roadmap is exactly to enable the use of ufuncs on non-ndarray objects. Giving ufuncs access to dtypes alone creates a clean boundary here: it guarantees that ufunc loops can work equally on all duck-array objects (so long as they have a dtype), and enforces the invariant that anything which affects the interpretation of data values should be attached to the dtype, not to the array object. Rejected alternatives --------------------- **Do nothing**: there's no way we'll ever be able to touch ufuncs at all if we don't hide those fields sooner or later. While any amount of breakage is regrettable, the costs of cleaning things up now are less than the costs of never improving numpy's APIs. **Somehow sneak the dtype information in via ``void *innerloopdata``**: This might let us preserve the signature of PyUFunc_GenericFunction, and thus preserve PyUFunc_ReplaceLoopBySignature. But we'd still have the problem of leaving way too much internal state exposed, and it's not even clear how this would work, given that we actually do want to preserve the use of ``innerloopdata`` for actual per-loop data. (This is where the PyUFunc_SetUsesArraysAsData hack fails.) -- Nathaniel J. Smith -- http://vorpus.org _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion