We already have the PyArrayObject vs. PyArrayObject_fields split in ndarraytypes.h, which forces access to the members to go through inline functions rather than touching the struct directly. That seems to me like the right way to go here too: leave no stone unturned, hide everything, and provide PyUFunc_NIN, PyUFunc_NOUT and friends to cover nin, nout and nargs as well.
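To make the idea concrete, here is a rough sketch of the pattern in plain C. All the names (SketchUFunc, SketchUFunc_fields, SketchUFunc_NIN and so on) are made up for illustration; this is not NumPy's actual layout or API, just the general shape of "opaque public type plus inline accessors":

```c
#include <stddef.h>

/* Hypothetical private layout; only the implementation knows this. */
typedef struct {
    int nin, nout, nargs;
    void *internals;            /* everything else stays hidden */
} SketchUFunc_fields;

/* What client code sees: an incomplete type with no visible members. */
typedef struct SketchUFunc_opaque SketchUFunc;

/* Accessors in the spirit of the suggested PyUFunc_NIN / PyUFunc_NOUT. */
static inline int SketchUFunc_NIN(const SketchUFunc *uf)
{
    return ((const SketchUFunc_fields *)uf)->nin;
}

static inline int SketchUFunc_NOUT(const SketchUFunc *uf)
{
    return ((const SketchUFunc_fields *)uf)->nout;
}

static inline int SketchUFunc_NARGS(const SketchUFunc *uf)
{
    return ((const SketchUFunc_fields *)uf)->nargs;
}

/* Tiny check: client code only ever goes through the accessors. */
static int demo_accessors(void)
{
    SketchUFunc_fields storage = {2, 1, 3, NULL};
    const SketchUFunc *uf = (const SketchUFunc *)&storage;
    return SketchUFunc_NIN(uf) + SketchUFunc_NOUT(uf) == SketchUFunc_NARGS(uf);
}
```

With something like this, the fields struct can be reordered or extended later without breaking callers, as long as the accessors keep their meaning.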
On Sun, Sep 20, 2015 at 9:13 PM, Nathaniel Smith <n...@pobox.com> wrote:
> Hi all,
>
> Here's a first draft NEP for comments.
>
> --
>
> Synopsis
> ========
>
> Improving numpy's dtype system requires that ufunc loops start having
> access to details of the specific dtype instance they are acting on:
> e.g. an implementation of np.equal for strings needs access to the
> dtype object in order to know what "n" to pass to strncmp. Similar
> issues arise with variable length strings, missing values, categorical
> data, unit support, datetime with timezone support, etc. -- this is a
> major blocker for improving numpy.
>
> Unfortunately, the current ufunc inner loop function signature makes
> it very difficult to provide this information. We might be able to
> wedge it in there, but it'd be ugly.
>
> The other option would be to change the signature. What would happen
> if we did this? For most common uses of the C API/ABI, we could do
> this easily while maintaining backwards compatibility. But there are
> also some rarely-used parts of the API/ABI that would be
> prohibitively difficult to preserve.
>
> In addition, there are other potential changes to ufuncs on the
> horizon (e.g. extensions of gufuncs to allow them to be used more
> generally), and the current API exposure is so massive that any such
> changes will be difficult to make in a fully compatible way. This NEP
> thus considers the possibility of closing down the ufunc API to a
> minimal, maintainable subset of the current API.
>
> To better understand the consequences of this potential change, I
> performed an exhaustive analysis of all the code on Github, Bitbucket,
> and Fedora, among others. The results make me highly confident that of
> all the publicly available projects in the world, the only ones
> which touch the problematic parts of the ufunc API are: Numba,
> dynd-python, and `gulinalg <https://github.com/ContinuumIO/gulinalg>`_
> (with the latter's exposure being trivial).
>
> Given this, I propose that for 1.11 we:
> 1) go ahead and hide/disable the problematic parts of the ABI/API,
> 2) coordinate with the known affected projects to minimize disruption
> to their users (which is made easier since they are all projects that
> are almost exclusively distributed via conda, which enforces strict
> NumPy ABI versioning),
> 3) publicize these changes widely so as to give any private code that
> might be affected a chance to speak up or adapt, and
> 4) leave the "ABI version tag" as it is, so as not to force rebuilds
> of the vast majority of projects that will be unaffected by these
> changes.
>
> This NEP defers the question of exactly what the improved API should
> be, since there's no point in trying to nail down the details until
> we've decided whether it's even possible to change.
>
>
> Details
> =======
>
> The problem
> -----------
>
> Currently, a ufunc inner loop implementation is called via the
> following function prototype::
>
>     typedef void (*PyUFuncGenericFunction)
>         (char **args,
>          npy_intp *dimensions,
>          npy_intp *strides,
>          void *innerloopdata);
>
> Here ``args`` is an array of pointers to 1-d buffers of input/output
> data, ``dimensions`` is a pointer to the number of entries in these
> buffers, ``strides`` is an array of integers giving the strides for
> each input/output array, and ``innerloopdata`` is an arbitrary void*
> supplied by whoever registered the ufunc loop. (For gufuncs, extra
> shape and stride information about the core dimensions also gets
> packed into the ends of these arrays in a somewhat complicated way.)
>
> There are 4 key items that define a NumPy array: data, shape, strides,
> dtype. Notice that this function only gets access to 3 of them. Our
> goal is to fix that.
> For example, a better signature would be::
>
>     typedef void (*PyUFuncGenericFunction_NEW)
>         (char **data,
>          npy_intp *shapes,
>          npy_intp *strides,
>          PyArray_Descr *dtypes,   /* NEW */
>          void *innerloopdata);
>
> (In practice I suspect we might want to make some more changes as
> well, like upgrading gufunc core shape/strides to proper arguments
> instead of tacking it onto the existing arrays, and adding an "escape
> valve" void* reserved for future extensions. But working out such
> details is outside the scope of this NEP; the above will do for
> illustration.)
>
> The goal of this NEP is to clear the ground so that we can start
> supporting ufunc inner loops that take dtype arguments, and make other
> enhancements to ufunc functionality going forward.
>
>
> Proposal
> --------
>
> Currently, the public API/ABI for ufuncs consists of the functions::
>
>     PyUFunc_GenericFunction
>
>     PyUFunc_FromFuncAndData
>     PyUFunc_FromFuncAndDataAndSignature
>     PyUFunc_RegisterLoopForDescr
>     PyUFunc_RegisterLoopForType
>
>     PyUFunc_ReplaceLoopBySignature
>     PyUFunc_SetUsesArraysAsData
>
> together with direct access to PyUFuncObject's internal fields::
>
>     typedef struct {
>         PyObject_HEAD
>         int nin, nout, nargs;
>         int identity;
>         PyUFuncGenericFunction *functions;
>         void **data;
>         int ntypes;
>         int check_return;
>         const char *name;
>         char *types;
>         const char *doc;
>         void *ptr;
>         PyObject *obj;
>         PyObject *userloops;
>         int core_enabled;
>         int core_num_dim_ix;
>         int *core_num_dims;
>         int *core_dim_ixs;
>         int *core_offsets;
>         char *core_signature;
>         PyUFunc_TypeResolutionFunc *type_resolver;
>         PyUFunc_LegacyInnerLoopSelectionFunc *legacy_inner_loop_selector;
>         PyUFunc_InnerLoopSelectionFunc *inner_loop_selector;
>         PyUFunc_MaskedInnerLoopSelectionFunc *masked_inner_loop_selector;
>         npy_uint32 *op_flags;
>         npy_uint32 iter_flags;
>     } PyUFuncObject;
>
> Obviously almost any future changes to how ufuncs work internally will
> involve touching some part of this public API/ABI.
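Going back to the string-equality motivation for the dtype argument, here is a small sketch of what an inner loop could look like once it can see the item size. Everything here is a stand-in: FakeDescr is a hypothetical replacement for PyArray_Descr, ptrdiff_t stands in for npy_intp, I am assuming one descriptor pointer per operand, and I am assuming both inputs have the same fixed width. It only illustrates where the "n" for strncmp would come from, not what the final API would be:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PyArray_Descr: only the item size matters here. */
typedef struct {
    int elsize;
} FakeDescr;

/* Sketch of an "equal" loop for fixed-width, NUL-padded strings. */
static void string_equal_loop(char **data, ptrdiff_t *shapes,
                              ptrdiff_t *strides, const FakeDescr **dtypes,
                              void *innerloopdata)
{
    char *in1 = data[0], *in2 = data[1], *out = data[2];
    size_t width = (size_t)dtypes[0]->elsize;   /* the "n" for strncmp */
    (void)innerloopdata;

    for (ptrdiff_t i = 0; i < shapes[0]; i++) {
        *out = (char)(strncmp(in1, in2, width) == 0);
        in1 += strides[0];
        in2 += strides[1];
        out += strides[2];
    }
}

/* Compare three 4-byte strings and pack the three results into one int. */
static int demo_string_equal(void)
{
    char a[3][4] = {"abc", "abd", "xy"};
    char b[3][4] = {"abc", "abc", "xy"};
    char out[3] = {0, 0, 0};
    FakeDescr s4 = {4}, b1 = {1};
    const FakeDescr *dtypes[3] = {&s4, &s4, &b1};
    char *data[3] = {&a[0][0], &b[0][0], out};
    ptrdiff_t shapes[1] = {3};
    ptrdiff_t strides[3] = {4, 4, 1};

    string_equal_loop(data, shapes, strides, dtypes, NULL);
    return out[0] * 4 + out[1] * 2 + out[2];
}
```

With the current four-argument signature there is simply nowhere for `width` to come from, short of smuggling it through `innerloopdata`.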
>
> Concretely, the proposal here is that we avoid this by disabling the
> following functions (i.e., any attempt to call them should simply
> raise a ``NotImplementedError``)::
>
>     PyUFunc_ReplaceLoopBySignature
>     PyUFunc_SetUsesArraysAsData
>
> and that we reduce the publicly visible portion of PyUFuncObject down to::
>
>     typedef struct {
>         PyObject_HEAD
>         int nin, nout, nargs;
>     } PyUFuncObject;
>
>
> Data on current API/ABI usage
> -----------------------------
>
> In order to assess how much code would be affected by this proposal, I
> used a combination of Github search and Searchcode.com to trawl
> through the majority of all publicly available open source code.
> Neither search tool provides a fine-grained enough query language to
> directly tell us what we want to know, so I instead followed the
> strategy of first, casting a wide net: picking a set of search terms
> that are likely to catch all possibly-broken code (together with many
> false positives), and second, using automated tools to sift out the
> false positives and see what remained. Altogether, I reviewed 4464
> search results.
>
> The tool I wrote to do this is `available on github
> <https://github.com/njsmith/codetrawl>`_, and so is `the analysis code
> itself <https://github.com/njsmith/ufunc-abi-analysis>`_.
>
>
> Uses of PyUFuncObject internals
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> There are no functions in the public API which return
> ``PyUFuncObject*`` values directly, so any code that accesses
> PyUFuncObject fields will have to mention that token in the course of
> defining a variable, performing a cast, setting up a typedef, etc.
>
> Therefore, I searched Github for all files written in C, C++,
> Objective C, Python, or Cython, which mentioned either "PyUFuncObject
> AND types" or "PyUFuncObject AND NOT types". (This is to work around
> limitations on how many results Github search is willing to return to
> a single query.)
> In addition, I searched for ``PyUFuncObject`` on
> searchcode.com.
>
> The full report on these searches is available here:
>
> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/pyufuncobject-report.html
>
> The following were screened out as non-problems:
>
> - Copies of NumPy itself (an astonishing number of people have checked
>   in copies of it to their own source tree)
> - NumPy forks / precursors / etc. (e.g. Numeric also had a type called
>   PyUFuncObject, the "bohrium" project has a fork of numpy 1.6, etc.)
> - Cython-generated boilerplate used to generate the "object has
>   changed size" warning (which we `unconditionally filter out anyway
>   <https://github.com/numpy/numpy/blob/master/numpy/__init__.py#L226>`_)
> - Lots of calls to ``PyUFunc_RegisterLoopForType`` and friends, which
>   require casting the first argument to ``PyUFuncObject*``
> - Misc. other unproblematic stuff (like Cython header declarations
>   that never get used)
>
> There were also several cases that actually referenced PyUFuncObject
> internal fields:
>
> - The "rational" dtype from numpy-dtypes, which is used in a few
>   projects, accesses ``ufunc->nargs`` as a safety check, but does not
>   touch any other fields (`see here
>   <https://github.com/numpy/numpy-dtypes/blob/c0175a6b1c5aa89b4520b29487f06d0e200e2a03/npytypes/rational/rational.c#L1140-L1151>`_).
>
> - Numba: does some rather elaborate things to support the definition
>   of on-the-fly JITted ufuncs. These seem to be clear deficiencies in
>   the ufunc API (e.g., there's no good way to control the lifespan of
>   the array of function pointers passed to ``PyUFunc_FromFuncAndData``),
>   so we should work with them to provide the API they need to do this in
>   a maintainable way.
>   Some of the relevant code:
>
>   https://github.com/numba/numba/tree/master/numba/npyufunc
>
>   https://github.com/numba/numba/blob/98752647a95ac6c9d480e81ca5c8afcfa3ddfd18/numba/npyufunc/_internal.c
>
> - dynd-python: Contains some code that attempts to extract the inner
>   loops from a numpy ufunc object and wrap them into the dynd 'ufunc'
>   equivalent:
>
>   https://github.com/libdynd/dynd-python/blob/c06f8fc4e72257abac589faf76f10df8c045159b/dynd/src/numpy_ufunc_kernel.cpp
>
> - gulinalg: I'm not sure if anyone is still using this code since most
>   of it was merged into numpy itself, but it's not a big deal in any
>   case: all it contains is a `debugging function
>   <https://github.com/ContinuumIO/gulinalg/blob/2ef365c48427c026dab4f45dc6f8b1b9af184460/gulinalg/src/gulinalg.c.src#L527-L550>`_
>   that dumps some internal fields from the PyUFuncObject. If you look,
>   though, all calls to this function are already commented out :-).
>
> The full report is available here:
>
> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/pyufuncobject-report.html
>
> In the course of this analysis, it was also noted that the standard
> Cython pxd files contain a wrapper for ufunc objects::
>
>     cdef class ufunc [object PyUFuncObject]:
>         ...
>
> which means that Cython code can access internal struct fields via an
> object of type ``ufunc``, and thus escape our string-based search
> above. Therefore I also examined all Cython files on Github or
> searchcode.com that matched the query ``ufunc``, and searched for any
> lines matching any of the following regular expressions::
>
>     cdef\W+ufunc
>         catches: 'cdef ufunc fn'
>     cdef\W+.*\.\W*ufunc
>         catches: 'cdef np.ufunc fn'
>     <.*ufunc\W*>
>         catches: '(<ufunc> fn).nargs', '(< np.ufunc > fn).nargs'
>     cdef.*\(.*ufunc
>         catches: 'cdef doit(np.ufunc fn, ...):'
>
> (I considered parsing the actual source and analysing it that way, but
> decided I was too lazy.
> This could be done if anyone is worried that
> the above regexes might miss things though.)
>
> There were zero files that contained matches for any of the above regexes:
>
> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/ufunc-cython-report.html
>
>
> Uses of PyUFunc_ReplaceLoopBySignature
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Applying the same screening as above, the only code that was found
> that used this function is also in Numba:
>
> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/PyUFunc_ReplaceLoopBySignature-report.html
>
>
> Uses of PyUFunc_SetUsesArraysAsData
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Aside from being semi-broken since 1.7 (it never got implemented for
> "masked" ufunc loops, i.e. those that use where=), there appear to be
> zero uses of this functionality either inside or outside NumPy:
>
> https://rawgit.com/njsmith/ufunc-abi-analysis/master/reports/PyUFunc_SetUsesArraysAsData-report.html
>
>
> Rationale
> ---------
>
> **Rationale for preserving the remaining API functions**::
>
>     PyUFunc_GenericFunction
>
>     PyUFunc_FromFuncAndData
>     PyUFunc_FromFuncAndDataAndSignature
>     PyUFunc_RegisterLoopForDescr
>     PyUFunc_RegisterLoopForType
>
> In addition to being widely used, these functions can easily be
> preserved even if we change how ufuncs work internally, because they
> only ingest loop function pointers, they never return them. So they
> can easily be modified to wrap whatever loop function(s) they receive
> inside an adapter function that calls them at the appropriate time,
> and then register that adapter function using whatever API we add in
> the future.
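The adapter idea in that last paragraph can be sketched in a few lines of C. This is only an illustration under assumptions: the new-style signature (NewLoop), the opaque descriptor type and the LegacyAdapterData struct are all hypothetical, and ptrdiff_t stands in for npy_intp. The point is that a legacy loop and its original innerloopdata can be bundled up and handed to a generic shim that ignores the extra dtype argument:

```c
#include <stddef.h>

struct OpaqueDescr;   /* hypothetical stand-in for PyArray_Descr */

/* Legacy four-argument loop signature, as in the current API. */
typedef void (*LegacyLoop)(char **args, ptrdiff_t *dimensions,
                           ptrdiff_t *strides, void *innerloopdata);

/* Hypothetical new-style signature that also receives dtypes. */
typedef void (*NewLoop)(char **data, ptrdiff_t *shapes, ptrdiff_t *strides,
                        const struct OpaqueDescr **dtypes,
                        void *innerloopdata);

/* What the registration shim would store for each wrapped legacy loop. */
typedef struct {
    LegacyLoop func;
    void *data;       /* the legacy loop's own innerloopdata */
} LegacyAdapterData;

/* Generic adapter: drops the dtypes and forwards to the legacy loop. */
static void legacy_adapter(char **data, ptrdiff_t *shapes, ptrdiff_t *strides,
                           const struct OpaqueDescr **dtypes,
                           void *innerloopdata)
{
    LegacyAdapterData *wrapped = innerloopdata;
    (void)dtypes;
    wrapped->func(data, shapes, strides, wrapped->data);
}

/* An ordinary legacy loop: elementwise addition of doubles. */
static void legacy_add_doubles(char **args, ptrdiff_t *dimensions,
                               ptrdiff_t *strides, void *innerloopdata)
{
    char *in1 = args[0], *in2 = args[1], *out = args[2];
    (void)innerloopdata;
    for (ptrdiff_t i = 0; i < dimensions[0]; i++) {
        *(double *)out = *(double *)in1 + *(double *)in2;
        in1 += strides[0];
        in2 += strides[1];
        out += strides[2];
    }
}

/* Call the legacy loop through the new-style function pointer. */
static int demo_adapter(void)
{
    double x[3] = {1.0, 2.0, 3.0}, y[3] = {10.0, 20.0, 30.0}, z[3] = {0};
    char *data[3] = {(char *)x, (char *)y, (char *)z};
    ptrdiff_t shapes[1] = {3};
    ptrdiff_t strides[3] = {(ptrdiff_t)sizeof(double),
                            (ptrdiff_t)sizeof(double),
                            (ptrdiff_t)sizeof(double)};
    LegacyAdapterData wrapped = {legacy_add_doubles, NULL};
    NewLoop loop = legacy_adapter;

    loop(data, shapes, strides, NULL, &wrapped);
    return z[0] == 11.0 && z[1] == 22.0 && z[2] == 33.0;
}
```

This works precisely because registration only takes function pointers in. Going the other way (handing a legacy-typed pointer back out, as PyUFunc_ReplaceLoopBySignature does) has no equally simple shim.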
>
> **Rationale for preserving the particular fields that are preserved**:
> Preserving ``nargs`` lets us avoid a little bit of breakage with the
> rational dtype, and it doesn't seem like preserving ``nin``, ``nout``,
> ``nargs`` fields will produce any undue burden on future changes to
> ufunc internals; even if we were to introduce variadic ufuncs we could
> always just stick a -1 in the appropriate fields or whatever.
>
> **Rationale for removing PyUFunc_ReplaceLoopBySignature**: this
> function *returns* the PyUFuncGenericFunction that was replaced; if
> we stop representing all loops using the legacy
> PyUFuncGenericFunction type, then this will not be possible to do in
> the future.
>
> **Rationale for removing PyUFunc_SetUsesArraysAsData**: If set as the
> ``innerloopdata`` on a ufunc loop, then this function acts as a
> sentinel value, and causes the ``innerloopdata`` to instead be set to
> a pointer to the passed-in PyArrayObjects. In principle we could
> preserve this function, but it has a number of deficiencies:
> - No-one appears to use it.
> - It's been buggy for several releases and no-one noticed.
> - AFAIK the only reason it was included in the first place is that it
>   provides a backdoor for ufunc loops to get access to the dtypes -- but
>   we are planning to fix this in a better way.
> - It can't be shimmed as easily as the loop registration functions,
>   because we don't anticipate that the new-and-improved ufunc loop
>   functions will *get* access to the array objects, only to the dtypes;
>   so this would have to remain cluttering up the core dispatch path
>   indefinitely.
> - We have good reason for *not* wanting to let ufunc loops get access
>   to the actual array objects, because one of the goals on our roadmap
>   is exactly to enable the use of ufuncs on non-ndarray objects.
>   Giving ufuncs access to dtypes alone creates a clean boundary here: it
>   guarantees that ufunc loops can work equally on all duck-array objects
>   (so long as they have a dtype), and enforces the invariant that
>   anything which affects the interpretation of data values should be
>   attached to the dtype, not to the array object.
>
>
> Rejected alternatives
> ---------------------
>
> **Do nothing**: there's no way we'll ever be able to touch ufuncs at
> all if we don't hide those fields sooner or later. While any amount of
> breakage is regrettable, the costs of cleaning things up now are less
> than the costs of never improving numpy's APIs.
>
> **Somehow sneak the dtype information in via ``void *innerloopdata``**:
> This might let us preserve the signature of
> PyUFuncGenericFunction, and thus preserve
> PyUFunc_ReplaceLoopBySignature. But we'd still have the problem of
> leaving way too much internal state exposed, and it's not even clear
> how this would work, given that we actually do want to preserve the
> use of ``innerloopdata`` for actual per-loop data. (This is where the
> PyUFunc_SetUsesArraysAsData hack fails.)
>
>
> --
> Nathaniel J. Smith -- http://vorpus.org
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>

--
(\__/)
( O.o)
( > <) This is Conejo. Copy Conejo into your signature and help him with his plans for world domination.