On Tue, Mar 17, 2020 at 9:03 PM Sebastian Berg <sebast...@sipsolutions.net> wrote:
> Hi all,
>
> in the spirit of trying to keep this moving, can I assume that the main
> reason for little discussion is that the actual changes proposed are
> not very far reaching as of now? Or is the reason that this is a
> fairly complex topic that you need more time to think about it?
>

Probably (a) it's a long NEP on a complex topic, (b) the past week has been
a very weird week for everyone (in the extra-news-reading-time I could
easily have re-reviewed the NEP), and (c) the amount of feedback one expects
to get on a NEP is roughly inversely proportional to the scope and
complexity of the NEP contents.

Today I re-read the parts I commented on before. This version is a big
improvement over the previous ones. Thanks in particular for adding clear
examples and the diagram, it helps a lot.

> If it is the latter, is there some way I can help with it? I tried to
> minimize how much is part of this initial NEP.
>
> If there is not much need for discussion, I would like to officially
> accept the NEP very soon, sending out an official one week notice in
> the next days.
>

I agree. I think I would like to keep the option open though to come back to
the NEP later to improve the clarity of the text about
motivation/plan/examples/scope, given that this will be the reference for a
major amount of work for a long time to come.

> To summarize one more time, the main point is that:
>

This point seems fine, and I'm +1 for going ahead with the described parts
of the technical design.

Cheers,
Ralf

>     type(np.dtype(np.float64))
>
> will be `np.dtype[float64]`, a subclass of dtype, so that:
>
>     issubclass(np.dtype[float64], np.dtype)
>
> is true. This means that we will have one class for every current type
> number: `dtype.num`. The implementation of these subclasses will be a
> C-written (extension) MetaClass, all details of this class are supposed
> to remain experimental in flux at this time.
>
> Cheers
>
> Sebastian
>
>
> On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote:
> > Hi all,
> >
> > I am pleased to propose NEP 41: First step towards a new Datatype
> > System https://numpy.org/neps/nep-0041-improved-dtype-support.html
> >
> > This NEP motivates the larger restructure of the datatype machinery
> > in NumPy and defines a few fundamental design aspects. The long term
> > user impact will be allowing easier and more rich featured user
> > defined datatypes.
> >
> > As this is a large restructure, the NEP represents only the first
> > steps with some additional information in further NEPs being
> > drafted [1] (this may be helpful to look at depending on the level
> > of detail you are interested in).
> > The NEP itself does not propose to add significant new public API.
> > Instead it proposes to move forward with an incremental internal
> > refactor and lays the foundation for this process.
> >
> > The main user facing change at this time is that datatypes will
> > become classes (e.g. ``type(np.dtype("float64"))`` will be a float64
> > specific class).
> > For most users, the main impact should be many new datatypes in the
> > long run (see the user impact section). However, for those interested
> > in API design within NumPy or with respect to implementing new
> > datatypes, this and the following NEPs are important decisions in the
> > future roadmap for NumPy.
> >
> > The current full text is reproduced below, although the above link is
> > probably a better way to read it.
> >
> > Cheers
> >
> > Sebastian
> >
> >
> > [1] NEP 40 gives some background information about the current
> > systems and issues with it:
> > https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst
> > and NEP 42 being a first draft of how the new API may look like:
> > https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst
> > (links to current rendered versions, check
> > https://github.com/numpy/numpy/pull/15505 and
> > https://github.com/numpy/numpy/pull/15507 for updates)
> >
> >
> > ----------------------------------------------------------------------
> >
> >
> > =================================================
> > NEP 41 — First step towards a new Datatype System
> > =================================================
> >
> > :title: Improved Datatype Support
> > :Author: Sebastian Berg
> > :Author: Stéfan van der Walt
> > :Author: Matti Picus
> > :Status: Draft
> > :Type: Standard Track
> > :Created: 2020-02-03
> >
> >
> > .. note::
> >
> >     This NEP is part of a series of NEPs encompassing first
> >     information about the previous dtype implementation and issues
> >     with it in NEP 40.
> >     NEP 41 (this document) then provides an overview and generic
> >     design choices for the refactor.
> >     Further NEPs 42 and 43 go into the technical details of the
> >     datatype and universal function related internal and external
> >     API changes.
> >     In some cases it may be necessary to consult the other NEPs for
> >     a full picture of the desired changes and why these changes are
> >     necessary.
> >
> >
> > Abstract
> > --------
> >
> > `Datatypes <data-type-objects-dtype>` in NumPy describe how to
> > interpret each element in arrays. NumPy provides ``int``,
> > ``float``, and ``complex`` numerical types, as well as string,
> > datetime, and structured datatype capabilities.
> > The growing Python community, however, has need for more diverse
> > datatypes.
> > Examples are datatypes with unit information attached (such as
> > meters) or categorical datatypes (fixed set of possible values).
> > However, the current NumPy datatype API is too limited to allow the
> > creation of these.
> >
> > This NEP is the first step to enable such growth; it will lead to
> > a simpler development path for new datatypes.
> > In the long run the new datatype system will also support the
> > creation of datatypes directly from Python rather than C.
> > Refactoring the datatype API will improve maintainability and
> > facilitate development of both user-defined external datatypes,
> > as well as new features for existing datatypes internal to NumPy.
> >
> >
> > Motivation and Scope
> > --------------------
> >
> > .. seealso::
> >
> >     The user impact section includes examples of what kind of new
> >     datatypes will be enabled by the proposed changes in the long
> >     run. It may thus help to read these sections out of order.
> >
> > Motivation
> > ^^^^^^^^^^
> >
> > One of the main issues with the current API is the definition of
> > typical functions such as addition and multiplication for parametric
> > datatypes (see also NEP 40) which require additional steps to
> > determine the output type.
> > For example when adding two strings of length 4, the result is a
> > string of length 8, which is different from the input.
> > Similarly, a datatype which embeds a physical unit must calculate the
> > new unit information: dividing a distance by a time results in a
> > speed.
> > A related difficulty is that the :ref:`current casting rules
> > <_ufuncs.casting>` -- the conversion between different datatypes --
> > cannot describe casting for such parametric datatypes implemented
> > outside of NumPy.
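
A minimal sketch of the extra step described in the paragraph above, written in plain Python rather than against NumPy. For a parametric datatype, the output dtype instance has to be computed from the input instances before the operation itself can run. The names `StringDType`, `resolve_add` and `add` are hypothetical and exist only for this illustration; they are not part of NumPy or of the NEP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringDType:
    """Toy fixed-width string dtype; ``length`` is its parameter."""
    length: int


def resolve_add(a: StringDType, b: StringDType) -> StringDType:
    # Concatenation: the output width is the sum of the input widths,
    # so the result dtype differs from both inputs.
    return StringDType(a.length + b.length)


def add(x: str, y: str, a: StringDType, b: StringDType):
    # Step 1: determine the output dtype instance from the input dtypes.
    out_dtype = resolve_add(a, b)
    # Step 2: only now perform the actual operation.
    result = x + y
    if len(result) > out_dtype.length:
        raise ValueError("result does not fit the resolved dtype")
    return result, out_dtype


s4 = StringDType(4)
value, dtype = add("spam", "eggs", s4, s4)
print(value, dtype)
```

A non-parametric dtype such as a float64 needs no such step, since the output instance is always the same.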
> >
> > This additional functionality for supporting parametric datatypes
> > introduces increased complexity within NumPy itself,
> > and furthermore is not available to external user-defined datatypes.
> > In general the concerns of different datatypes are not
> > well-encapsulated.
> > This burden is exacerbated by the exposure of internal C structures,
> > limiting the addition of new fields
> > (for example to support new sorting methods [new_sort]_).
> >
> > Currently there are many factors which limit the creation of new
> > user-defined datatypes:
> >
> > * Creating casting rules for parametric user-defined dtypes is either
> >   impossible or so complex that it has never been attempted.
> > * Type promotion, e.g. the operation deciding that adding float and
> >   integer values should return a float value, is very valuable for
> >   numeric datatypes but is limited in scope for user-defined and
> >   especially parametric datatypes.
> > * Much of the logic (e.g. promotion) is written in single functions
> >   instead of being split as methods on the datatype itself.
> > * In the current design datatypes cannot have methods that do not
> >   generalize to other datatypes. For example a unit datatype cannot
> >   have a ``.to_si()`` method to easily find the datatype which would
> >   represent the same values in SI units.
> >
> > The large need to solve these issues has driven the scientific
> > community to create work-arounds in multiple projects implementing
> > physical units as an array-like class instead of a datatype, which
> > would generalize better across multiple array-likes (Dask, pandas,
> > etc.).
> > Already, Pandas has made a push into the same direction with its
> > extension arrays [pandas_extension_arrays]_ and undoubtedly
> > the community would be best served if such new features could be
> > common between NumPy, Pandas, and other projects.
> >
> > Scope
> > ^^^^^
> >
> > The proposed refactoring of the datatype system is a large
> > undertaking and thus is proposed to be split into various phases,
> > roughly:
> >
> > * Phase I: Restructure and extend the datatype infrastructure (This
> >   NEP 41)
> > * Phase II: Incrementally define or rework API (Detailed largely in
> >   NEPs 42/43)
> > * Phase III: Growth of NumPy and Scientific Python Ecosystem
> >   capabilities.
> >
> > For a more detailed accounting of the various phases, see
> > "Plan to Approach the Full Refactor" in the Implementation section
> > below.
> > This NEP proposes to move ahead with the necessary creation of new
> > dtype subclasses (Phase I),
> > and start working on implementing current functionality.
> > Within the context of this NEP all development will be fully private
> > API or use preliminary underscored names which must be changed in the
> > future.
> > Most of the internal and public API choices are part of a second
> > Phase and will be discussed in more detail in the following NEPs 42
> > and 43.
> > The initial implementation of this NEP will have little or no effect
> > on users, but provides the necessary ground work for incrementally
> > addressing the full rework.
> >
> > The implementation of this NEP and the following, implied large
> > rework of how datatypes are defined in NumPy is expected to create
> > small incompatibilities (see backward compatibility section).
> > However, a transition requiring large code adaption is not
> > anticipated and not within scope.
> >
> > Specifically, this NEP makes the following design choices which are
> > discussed in more detail in the detailed description section:
> >
> > 1. Each datatype will be an instance of a subclass of ``np.dtype``,
> >    with most of the datatype-specific logic being implemented
> >    as special methods on the class. In the C-API, these correspond to
> >    specific slots.
> >    In short, for ``f = np.dtype("f8")``, ``isinstance(f, np.dtype)``
> >    will remain true, but ``type(f)`` will be a subclass of
> >    ``np.dtype`` rather than just ``np.dtype`` itself.
> >    The ``PyArray_ArrFuncs`` which are currently stored as a pointer
> >    on the instance (as ``PyArray_Descr->f``),
> >    should instead be stored on the class as typically done in Python.
> >    In the future these may correspond to Python-side dunder methods.
> >    Storage information such as itemsize and byteorder can differ
> >    between different dtype instances (e.g. "S3" vs. "S8") and will
> >    remain part of the instance.
> >    This means that in the long run the current low-level access to
> >    dtype methods will be removed (see ``PyArray_ArrFuncs`` in NEP 40).
> >
> > 2. The current NumPy scalars will *not* change, they will not be
> >    instances of datatypes. This will also be true for new datatypes,
> >    scalars will not be instances of a dtype (although
> >    ``isinstance(scalar, dtype)`` may be made to return ``True`` when
> >    appropriate).
> >
> > Detailed technical decisions to follow in NEP 42.
> >
> > Further, the public API will be designed in a way that is extensible
> > in the future:
> >
> > 3. All new C-API functions provided to the user will hide
> >    implementation details as much as possible. The public API should
> >    be an identical, but limited, version of the C-API used for the
> >    internal NumPy datatypes.
> >
> > The changes to the datatype system in Phase II must include a large
> > refactor of the UFunc machinery, which will be further defined in
> > NEP 43:
> >
> > 4. To enable all of the desired functionality for new user-defined
> >    datatypes, the UFunc machinery will be changed to replace the
> >    current dispatching and type resolution system.
> >    The old system should be *mostly* supported as a legacy version
> >    for some time.
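
Design choice 1 in the list above can be pictured with a small pure-Python model. This is only a sketch under stated assumptions: the classes `DType`, `Float64` and `Bytes` and the class-level `nonzero` function are hypothetical stand-ins, not NumPy's classes or C slots. It shows behaviour living on the class while storage information such as the itemsize stays on the instance.

```python
import struct


class DType:
    """Toy stand-in for ``np.dtype``; not NumPy's implementation."""

    def __init__(self, itemsize, byteorder="="):
        # Storage information can differ per instance, so it stays here.
        self.itemsize = itemsize
        self.byteorder = byteorder

    @classmethod
    def nonzero(cls, raw: bytes) -> bool:
        # Behaviour is defined once per class (compare PyArray_ArrFuncs).
        raise NotImplementedError


class Float64(DType):
    def __init__(self, byteorder="="):
        super().__init__(8, byteorder)

    @classmethod
    def nonzero(cls, raw):
        # Interpret the 8 raw bytes as a native-endian double.
        return struct.unpack("d", raw)[0] != 0.0


class Bytes(DType):
    @classmethod
    def nonzero(cls, raw):
        # Toy rule: truthy if any byte is not NUL.
        return any(raw)


f = Float64()
s3, s8 = Bytes(3), Bytes(8)

print(isinstance(f, DType), type(f) is DType)      # instance check vs. exact type
print(type(s3) is type(s8), s3.itemsize, s8.itemsize)
print(Float64.nonzero(struct.pack("d", 1.5)))
```

The two `Bytes` instances share one class and therefore one set of methods, yet differ in itemsize, which mirrors the "S3" vs. "S8" remark in the list item.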
> >
> > Additionally, as a general design principle, the addition of new
> > user-defined datatypes will *not* change the behaviour of programs.
> > For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or
> > ``b`` know that ``c`` exists.
> >
> >
> > User Impact
> > -----------
> >
> > The current ecosystem has very few user-defined datatypes using
> > NumPy, the two most prominent being: ``rational`` and ``quaternion``.
> > These represent fairly simple datatypes which are not strongly
> > impacted by the current limitations.
> > However, we have identified a need for datatypes such as:
> >
> > * bfloat16, used in deep learning
> > * categorical types
> > * physical units (such as meters)
> > * datatypes for tracing/automatic differentiation
> > * high, fixed precision math
> > * specialized integer types such as int2, int24
> > * new, better datetime representations
> > * extending e.g. integer dtypes to have a sentinel NA value
> > * geometrical objects [pygeos]_
> >
> > Some of these are partially solved; for example unit capability is
> > provided in ``astropy.units``, ``unyt``, or ``pint``, as
> > `numpy.ndarray` subclasses.
> > Most of these datatypes, however, simply cannot be reasonably defined
> > right now.
> > An advantage of having such datatypes in NumPy is that they should
> > integrate seamlessly with other array or array-like packages such as
> > Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
> >
> > The long term user impact of implementing this NEP will be to allow
> > both the growth of the whole ecosystem by having such new datatypes,
> > as well as consolidating implementation of such datatypes within
> > NumPy to achieve better interoperability.
> >
> >
> > Examples
> > ^^^^^^^^
> >
> > The following examples represent future user-defined datatypes we
> > wish to enable.
> > These datatypes are not part of the NEP and choices (e.g.
> > choice of casting rules) are possibilities we wish to enable and do
> > not represent recommendations.
> >
> > Simple Numerical Types
> > """"""""""""""""""""""
> >
> > Mainly used where memory is a consideration, lower-precision numeric
> > types such as
> > `bfloat16 <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_
> > are common in other computational frameworks.
> > For these types the definitions of things such as ``np.common_type``
> > and ``np.can_cast`` are some of the most important interfaces. Once
> > they support ``np.common_type``, it is (for the most part) possible
> > to find the correct ufunc loop to call, since most ufuncs -- such as
> > add -- effectively only require ``np.result_type``::
> >
> >     >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
> >
> > and `~numpy.result_type` is largely identical to
> > `~numpy.common_type`.
> >
> >
> > Fixed, high precision math
> > """"""""""""""""""""""""""
> >
> > Allowing arbitrary precision or higher precision math is important in
> > simulations. For instance ``mpmath`` defines a precision::
> >
> >     >>> import mpmath as mp
> >     >>> print(mp.dps)  # the current (default) precision
> >     15
> >
> > NumPy should be able to construct a native, memory-efficient array
> > from a list of ``mpmath.mpf`` floating point objects::
> >
> >     >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
> >     >>> print(arr_15_dps)  # Must find the correct precision from the objects:
> >     array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
> >
> > We should also be able to specify the desired precision when
> > creating the datatype for an array. Here, we use ``np.dtype[mp.mpf]``
> > to find the DType class (the notation is not part of this NEP),
> > which is then instantiated with the desired parameter.
> > This could also be written as ``MpfDType`` class::
> >
> >     >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
> >     >>> print(arr_15_dps + arr_100_dps)
> >     array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100])
> >
> > The ``mpf`` datatype can decide that the result of the operation
> > should be the higher precision one of the two, so uses a precision
> > of 100.
> > Furthermore, we should be able to define casting, for example as in::
> >
> >     >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
> >     True
> >     >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
> >     False  # loses precision
> >     >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
> >     True
> >
> > Casting from float is probably always at least a ``same_kind``
> > cast, but in general, it is not safe::
> >
> >     >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
> >     False
> >
> > since a float64 has a higher precision than the ``mpf`` datatype with
> > ``dps=4``.
> >
> > Alternatively, we can say that::
> >
> >     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
> >     np.dtype[mp.mpf](dps=10)
> >
> > And possibly even::
> >
> >     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
> >     np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)
> >
> > since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)``
> > safely.
> >
> >
> > Categoricals
> > """"""""""""
> >
> > Categoricals are interesting in that they can have fixed, predefined
> > values, or can be dynamic with the ability to modify categories when
> > necessary.
> > Fixed categories (defined ahead of time) are the most
> > straightforward categorical definition.
> > Categoricals are *hard*, since there are many strategies to implement
> > them, suggesting NumPy should only provide the scaffolding for
> > user-defined categorical types.
> > For instance::
> >
> >     >>> cat = Categorical(["eggs", "spam", "toast"])
> >     >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)
> >
> > could store the array very efficiently, since it knows that there are
> > only 3 categories.
> > Since a categorical in this sense knows almost nothing about the data
> > stored in it, few operations make sense, although equality does::
> >
> >     >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
> >     >>> breakfast == breakfast2
> >     array([True, False, True, False])
> >
> > The categorical datatype could work like a dictionary: no two
> > item names can be equal (checked on dtype creation), so that the
> > equality operation above can be performed very efficiently.
> > If the values define an order, the category labels (internally
> > integers) could be ordered the same way to allow efficient sorting
> > and comparison.
> >
> > Whether or not casting is defined from one categorical with fewer
> > values to one with strictly more values defined is something that
> > the Categorical datatype would need to decide. Both options should
> > be available.
> >
> >
> > Unit on the Datatype
> > """"""""""""""""""""
> >
> > There are different ways to define Units, depending on how the
> > internal machinery would be organized. One way is to have a single
> > Unit datatype for every existing numerical type.
> > This will be written as ``Unit[float64]``; the unit itself is part
> > of the DType instance, so ``Unit[float64]("m")`` is a ``float64``
> > with meters attached::
> >
> >     >>> from astropy import units
> >     >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
> >     >>> print(meters)
> >     array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
> >
> > Note that units are a bit tricky. It is debatable whether::
> >
> >     >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
> >
> > should be valid syntax (coercing the float scalars without a unit to
> > meters).
> > Once the array is created, math will work without any issue::
> >
> >     >>> meters / (2 * units.s)
> >     array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
> >
> > Casting is not valid from one unit to the other, but can be valid
> > between different scales of the same dimensionality (although this
> > may be "unsafe")::
> >
> >     >>> meters.astype(Unit[float64]("s"))
> >     TypeError: Cannot cast meters to seconds.
> >     >>> meters.astype(Unit[float64]("km"))
> >     >>> # Convert to centimeter-gram-second (cgs) units:
> >     >>> meters.astype(meters.dtype.to_cgs())
> >
> > The above notation is somewhat clumsy. Functions
> > could be used instead to convert between units.
> > There may be ways to make these more convenient, but those must be
> > left for future discussions::
> >
> >     >>> units.convert(meters, "km")
> >     >>> units.to_cgs(meters)
> >
> > There are some open questions. For example, whether additional
> > methods on the array object could exist to simplify some of the
> > notions, and how these would percolate from the datatype to the
> > ``ndarray``.
> >
> > The interaction with other scalars would likely be defined through::
> >
> >     >>> np.common_type(np.float64, Unit)
> >     Unit[np.float64](dimensionless)
> >
> > Ufunc output datatype determination can be more involved than for
> > simple numerical dtypes since there is no "universal" output type::
> >
> >     >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)
> >
> > In fact ``np.result_type(meters, seconds)`` must error without
> > context of the operation being done.
> > This example highlights how the specific ufunc loop
> > (loop with known, specific DTypes as inputs), has to be able to make
> > certain decisions before the actual calculation can start.
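
The last point can be made concrete with a toy model in plain Python. It is a sketch only: `UnitDType`, `common_dtype` and `resolve_multiply` are hypothetical names, not NumPy or astropy API. Operation-independent promotion has no sensible answer for meters and seconds, while the multiply "loop" can resolve its own output dtype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitDType:
    """Toy unit dtype; ``dims`` holds sorted (base unit, exponent) pairs."""
    dims: tuple

    @classmethod
    def of(cls, **dims):
        # Drop zero exponents so that equal units compare equal.
        return cls(tuple(sorted((k, v) for k, v in dims.items() if v != 0)))


def common_dtype(a: UnitDType, b: UnitDType) -> UnitDType:
    # Promotion without knowing the operation: only identical units work here.
    if a != b:
        raise TypeError(f"no common dtype for {a} and {b}")
    return a


def resolve_multiply(a: UnitDType, b: UnitDType) -> UnitDType:
    # The multiply loop knows the operation, so it can add the exponents.
    merged = dict(a.dims)
    for name, exponent in b.dims:
        merged[name] = merged.get(name, 0) + exponent
    return UnitDType.of(**merged)


meters = UnitDType.of(m=1)
seconds = UnitDType.of(s=1)

print(resolve_multiply(meters, seconds))
try:
    common_dtype(meters, seconds)
except TypeError as exc:
    print("promotion failed:", exc)
```

In this model addition would go through `common_dtype`, whereas multiplication and division need their own resolution step, which is the decision the specific loop has to make before any calculation starts.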
> >
> >
> >
> > Implementation
> > --------------
> >
> > Plan to Approach the Full Refactor
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > To address these issues in NumPy and enable new datatypes,
> > multiple development stages are required:
> >
> > * Phase I: Restructure and extend the datatype infrastructure (This
> >   NEP)
> >
> >   * Organize Datatypes like normal Python classes [`PR 15508`]_
> >
> > * Phase II: Incrementally define or rework API
> >
> >   * Create a new and easily extensible API for defining new datatypes
> >     and related functionality. (NEP 42)
> >
> >   * Incrementally define all necessary functionality through the new
> >     API (NEP 42):
> >
> >     * Defining operations such as ``np.common_type``.
> >     * Allowing to define casting between datatypes.
> >     * Add functionality necessary to create a numpy array from Python
> >       scalars (i.e. ``np.array(...)``).
> >     * …
> >
> >   * Restructure how universal functions work (NEP 43), in order to:
> >
> >     * make it possible to allow a `~numpy.ufunc` such as ``np.add``
> >       to be extended by user-defined datatypes such as Units.
> >
> >     * allow efficient lookup for the correct implementation for
> >       user-defined datatypes.
> >
> >     * enable reuse of existing code. Units should be able to use the
> >       normal math loops and add additional logic to determine output
> >       type.
> >
> > * Phase III: Growth of NumPy and Scientific Python Ecosystem
> >   capabilities:
> >
> >   * Cleanup of legacy behaviour where it is considered buggy or
> >     undesirable.
> >   * Provide a path to define new datatypes from Python.
> >   * Assist the community in creating types such as Units or
> >     Categoricals
> >   * Allow strings to be used in functions such as ``np.equal`` or
> >     ``np.add``.
> >   * Remove legacy code paths within NumPy to improve long term
> >     maintainability
> >
> > This document serves as a basis for phase I and provides the vision
> > and motivation for the full project.
> > Phase I does not introduce any new user-facing features,
> > but is concerned with the necessary conceptual cleanup of the current
> > datatype system.
> > It provides a more "pythonic" datatype Python type object, with a
> > clear class hierarchy.
> >
> > The second phase is the incremental creation of all APIs necessary to
> > define fully featured datatypes and reorganization of the NumPy
> > datatype system.
> > This phase will thus be primarily concerned with defining a stable
> > public API, which will initially be preliminary.
> >
> > Some of the benefits of a large refactor may only become evident
> > after the full deprecation of the current legacy implementation
> > (i.e. larger code removals).
> > However, these steps are necessary for improvements to many parts of
> > the core NumPy API, and are expected to make the implementation
> > generally easier to understand.
> >
> > The following figure illustrates the proposed design at a high level,
> > and roughly delineates the components of the overall design.
> > Note that this NEP only regards Phase I (shaded area); the rest
> > encompasses Phase II, and the design choices are up for discussion.
> > However, it highlights that the DType datatype class is the central,
> > necessary concept:
> >
> > .. image:: _static/nep-0041-mindmap.svg
> >
> >
> > First steps directly related to this NEP
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > The required changes necessary to NumPy are large and touch many
> > areas of the code base
> > but many of these changes can be addressed incrementally.
> >
> > To enable an incremental approach we will start by creating a C
> > defined ``PyArray_DTypeMeta`` class with its instances being the
> > ``DType`` classes, subclasses of ``np.dtype``.
> > This is necessary to add the ability of storing custom slots on the
> > DType in C.
> > This ``DTypeMeta`` will be implemented first to then enable
> > incremental restructuring of current code.
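
The relationship between ``DTypeMeta``, the DType classes and dtype instances described above can be imitated in pure Python. This is a loose analogy only, since the proposed ``PyArray_DTypeMeta`` is written in C and its details are explicitly in flux. The attribute names `type_num` and `abstract`, and the `Floating` class, are assumptions made for the illustration.

```python
class DTypeMeta(type):
    """Toy metaclass: its instances are the DType classes."""

    def __new__(mcls, name, bases, namespace, *, type_num=None, abstract=False):
        cls = super().__new__(mcls, name, bases, namespace)
        # Per-class "slots": data stored on the DType class itself.
        cls.type_num = type_num
        cls.abstract = abstract
        return cls

    def __init__(cls, name, bases, namespace, **kwargs):
        # Swallow the extra class keywords already handled in __new__.
        super().__init__(name, bases, namespace)


class DType(metaclass=DTypeMeta, abstract=True):
    """Toy stand-in for ``np.dtype``."""

    def __init__(self, byteorder="="):
        if type(self).abstract:
            raise TypeError(f"cannot instantiate abstract DType {type(self).__name__}")
        self.byteorder = byteorder


class Floating(DType, abstract=True):
    """Abstract DType, comparable in spirit to ``np.floating``."""


class Float64(Floating, type_num=12):  # 12 is np.dtype("float64").num
    itemsize = 8


f8 = Float64()
print(isinstance(Float64, DTypeMeta))   # the class is an instance of the metaclass
print(isinstance(f8, Floating))         # the instance fits the abstract hierarchy
print(Float64.type_num, Float64.itemsize)
```

Instantiating `Floating()` raises `TypeError` in this model, which is one possible way to picture the abstract datatypes mentioned in the detailed description.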
> >
> > The addition of ``DType`` will then enable addressing other changes
> > incrementally, some of which may begin before settling the full
> > internal API:
> >
> > 1. New machinery for array coercion, with the goal of enabling user
> >    DTypes with appropriate class methods.
> > 2. The replacement or wrapping of the current casting machinery.
> > 3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots
> >    into DType method slots.
> >
> > At this point, no or only very limited new public API will be added
> > and the internal API is considered to be in flux.
> > Any new public API may be set up to give warnings and will have
> > leading underscores to indicate that it is not finalized and can be
> > changed without warning.
> >
> >
> > Backward compatibility
> > ----------------------
> >
> > While the actual backward compatibility impact of implementing Phase
> > I and II is not yet fully clear, we anticipate and accept the
> > following changes:
> >
> > * **Python API**:
> >
> >   * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``,
> >     while right now ``type(np.dtype("f8")) is np.dtype``.
> >     Code should use ``isinstance`` checks, and in very rare cases may
> >     have to be adapted to use it.
> >
> > * **C-API**:
> >
> >   * In old versions of NumPy ``PyArray_DescrCheck`` is a macro
> >     which uses ``type(dtype) is np.dtype``. When compiling against an
> >     old NumPy version, the macro may have to be replaced with the
> >     corresponding ``PyObject_IsInstance`` call. (If this is a
> >     problem, we could backport fixing the macro)
> >
> >   * The UFunc machinery changes will break *limited* parts of the
> >     current implementation. Replacing e.g. the default
> >     ``TypeResolver`` is expected to remain supported for a time,
> >     although optimized masked inner loop iteration
> >     (which is not even used *within* NumPy) will no longer be
> >     supported.
> >
> >   * All functions currently defined on the dtypes, such as
> >     ``PyArray_Descr->f->nonzero``, will be defined and accessed
> >     differently.
> >     This means that in the long run low-level access code will
> >     have to be changed to use the new API. Such changes are expected
> >     to be necessary in very few projects.
> >
> > * **dtype implementors (C-API)**:
> >
> >   * The array which is currently provided to some functions (such as
> >     cast functions), will no longer be provided.
> >     For example ``PyArray_Descr->f->nonzero`` or
> >     ``PyArray_Descr->f->copyswapn``, may instead receive a dummy
> >     array object with only some fields (mainly the dtype), being
> >     valid.
> >     At least in some code paths, a similar mechanism is already used.
> >
> >   * The ``scalarkind`` slot and registration of scalar casting will
> >     be removed/ignored without replacement.
> >     It currently allows partial value-based casting.
> >     The ``PyArray_ScalarKind`` function will continue to work for
> >     builtin types, but will not be used internally and be deprecated.
> >
> >   * Currently user dtypes are defined as instances of ``np.dtype``.
> >     The creation works by the user providing a prototype instance.
> >     NumPy will need to modify at least the type during registration.
> >     This has no effect for either ``rational`` or ``quaternion`` and
> >     mutation of the structure seems unlikely after registration.
> >
> > Since there is a fairly large API surface concerning datatypes,
> > further changes or the limitation of certain functions to currently
> > existing datatypes are likely to occur.
> > For example functions which use the type number as input
> > should be replaced with functions taking DType classes instead.
> > Although public, large parts of this C-API seem to be used rarely,
> > possibly never, by downstream projects.
> >
> >
> >
> > Detailed Description
> > --------------------
> >
> > This section details the design decisions covered by this NEP.
> > The subsections correspond to the list of design choices presented
> > in the Scope section.
> >
> > Datatypes as Python Classes (1)
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > The current NumPy datatypes are not full-scale Python classes.
> > They are instead (prototype) instances of a single ``np.dtype``
> > class.
> > Changing this means that any special handling, e.g. for ``datetime``
> > can be moved to the Datetime DType class instead, away from
> > monolithic general code (e.g. current
> > ``PyArray_AdjustFlexibleDType``).
> >
> > The main consequence of this change with respect to the API is that
> > special methods move from the dtype instances to methods on the new
> > DType class.
> > This is the typical design pattern used in Python.
> > Organizing these methods and information in a more Pythonic way
> > provides a solid foundation for refining and extending the API in
> > the future.
> > The current API cannot be extended due to how it is exposed
> > publicly.
> > This means for example that the methods currently stored in
> > ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined
> > differently in the future and deprecated in the long run.
> >
> > The most prominent visible side effect of this will be that
> > ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.
> > Instead it will be a subclass of ``np.dtype`` meaning that
> > ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.
> > This will also add the ability to use
> > ``isinstance(dtype, np.dtype[float64])``
> > thus removing the need to use ``dtype.kind``, ``dtype.char``, or
> > ``dtype.type`` to do this check.
> >
> > With the design decision of DTypes as full-scale Python classes,
> > the question of subclassing arises.
> > Inheritance, however, appears problematic and a complexity best
> > avoided (at least initially) for container datatypes.
> > Further, subclasses may be more interesting for interoperability, for
> > example with GPU backends (CuPy) storing additional methods related
> > to the GPU, rather than as a mechanism to define new datatypes.
> > A class hierarchy does provide value; this may be achieved by
> > allowing the creation of *abstract* datatypes.
> > An example for an abstract datatype would be the datatype equivalent
> > of ``np.floating``, representing any floating point number.
> > These can serve the same purpose as Python's abstract base classes.
> >
> >
> > Scalars should not be instances of the datatypes (2)
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > For simple datatypes such as ``float64`` (see also below), it seems
> > tempting that the instance of a ``np.dtype("float64")`` can be the
> > scalar.
> > This idea may be even more appealing due to the fact that scalars,
> > rather than datatypes, currently define a useful type hierarchy.
> >
> > However, we have specifically decided against this for a number of
> > reasons.
> > First, the new datatypes described herein would be instances of DType
> > classes.
> > Making these instances themselves classes, while possible, adds
> > additional complexity that users need to understand.
> > It would also mean that scalars must have storage information (such
> > as byteorder) which is generally unnecessary and currently is not
> > used.
> > Second, while the simple NumPy scalars such as ``float64`` may be
> > such instances, it should be possible to create datatypes for Python
> > objects without enforcing NumPy as a dependency.
> > However, Python objects that do not depend on NumPy cannot be
> > instances of a NumPy DType.
> > Third, there is a mismatch between the methods and attributes which
> > are useful for scalars and datatypes.
> > For instance ``to_float()`` makes sense for a scalar
> > but not for a datatype and ``newbyteorder`` is not useful on a scalar
> > (or has a different meaning).
> >
> > Overall, it seems that rather than reducing the complexity, i.e. by
> > merging the two distinct type hierarchies, making scalars instances
> > of DTypes would increase the complexity of both the design and
> > implementation.
> >
> > A possible future path may be to instead simplify the current NumPy
> > scalars to be much simpler objects which largely derive their
> > behaviour from the datatypes.
> >
> > C-API for creating new Datatypes (3)
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > The current C-API with which users can create new datatypes
> > is limited in scope, and requires use of "private" structures. This
> > means the API is not extensible: no new members can be added to the
> > structure without losing binary compatibility.
> > This has already limited the inclusion of new sorting methods into
> > NumPy [new_sort]_.
> >
> > The new version shall thus replace the current ``PyArray_ArrFuncs``
> > structure used to define new datatypes.
> > Datatypes that currently exist and are defined using these slots will
> > be supported during a deprecation period.
> >
> > The most likely solution to hide the implementation from the user,
> > and thus make it extensible in the future, is to model the API after
> > Python's stable API [PEP-384]_:
> >
> > .. code-block:: C
> >
> >     static struct PyArrayMethodDef slots[] = {
> >         {NPY_dt_method, method_implementation},
> >         ...,
> >         {0, NULL}
> >     };
> >
> >     typedef struct{
> >       PyTypeObject *typeobj;  /* type of python scalar */
> >       ...;
> >       PyType_Slot *slots;
> >     } PyArrayDTypeMeta_Spec;
> >
> >     PyObject* PyArray_InitDTypeMetaFromSpec(
> >             PyArray_DTypeMeta *user_dtype,
> >             PyArrayDTypeMeta_Spec *dtype_spec);
> >
> > The C-side slots should be designed to mirror Python side methods
> > such as ``dtype.__dtype_method__``, although the exposure to Python
> > is a later step in the implementation to reduce the complexity of the
> > initial implementation.
> >
> >
> > C-API Changes to the UFunc Machinery (4)
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > Proposed changes to the UFunc machinery will be part of NEP 43.
> > However, the following changes will be necessary (see NEP 40 for a
> > detailed description of the current implementation and its issues):
> >
> > * The current UFunc type resolution must be adapted to allow better
> >   control for user-defined dtypes as well as resolve current
> >   inconsistencies.
> > * The inner-loop used in UFuncs must be expanded to include a return
> >   value.
> >   Further, error reporting must be improved, and passing in
> >   dtype-specific information enabled.
> >   This requires the modification of the inner-loop function signature
> >   and addition of new hooks called before and after the inner-loop is
> >   used.
> >
> > An important goal for any changes to the universal functions will be
> > to allow the reuse of existing loops.
> > It should be easy for a new units datatype to fall back to existing
> > math functions after handling the unit related computations.
> >
> >
> > Discussion
> > ----------
> >
> > See NEP 40 for a list of previous meetings and discussions.
> >
> >
> > References
> > ----------
> >
> > .. [pandas_extension_arrays] https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types
> >
> > .. _xarray_dtype_issue: https://github.com/pydata/xarray/issues/1262
> >
> > .. [pygeos] https://github.com/caspervdw/pygeos
> >
> > .. [new_sort] https://github.com/numpy/numpy/pull/12945
> >
> > .. [PEP-384] https://www.python.org/dev/peps/pep-0384/
> >
> > .. [PR 15508] https://github.com/numpy/numpy/pull/15508
> >
> >
> > Copyright
> > ---------
> >
> > This document has been placed in the public domain.
> >
> >
> > Acknowledgments
> > ---------------
> >
> > The effort to create new datatypes for NumPy has been discussed for
> > several years in many different contexts and settings, making it
> > impossible to list everyone involved.
> > We would like to thank especially Stephan Hoyer, Nathaniel Smith, and
> > Eric Wieser for repeated in-depth discussion about datatype design.
> > We are very grateful for the community input in reviewing and
> > revising this NEP and would like to thank especially Ross Barnowski
> > and Ralf Gommers.
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion@python.org
> > https://mail.python.org/mailman/listinfo/numpy-discussion
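
For anyone skimming the thread, the class relationships described in the
quoted section "Datatypes as Python Classes (1)" can be sketched in pure
Python. This is only an illustration and not the proposed implementation:
it does not use NumPy, the names DTypeMeta, DType, Floating, Float64 and
Int64 are hypothetical stand-ins, and the "abstract" flag is an assumption
made for the example (the NEP leaves the details of the C-written
metaclass open).

```python
class DTypeMeta(type):
    """Hypothetical metaclass: every DType class is an instance of it."""

    def __call__(cls, *args, **kwargs):
        # Assumption for this sketch: a class that sets ``abstract = True``
        # in its own body only groups other DTypes and cannot be instantiated.
        if cls.__dict__.get("abstract", False):
            raise TypeError(f"{cls.__name__} is abstract")
        return super().__call__(*args, **kwargs)


class DType(metaclass=DTypeMeta):
    """Plays the role of ``np.dtype`` as the common base class."""

    abstract = True

    def __init__(self, byteorder="="):
        # Storage details such as byte order live on the instance.
        self.byteorder = byteorder


class Floating(DType):
    """Abstract DType, the datatype analogue of ``np.floating``."""

    abstract = True


class Float64(Floating):
    """Concrete DType: one class per current type number."""

    itemsize = 8


class Int64(DType):
    """A second concrete DType, unrelated to ``Floating``."""

    itemsize = 8


native = Float64()
swapped = Float64(byteorder=">")

# The instance is still an instance of the base class ...
print(isinstance(native, DType))
# ... but its type is now the more specific class.
print(type(native) is Float64)
# Abstract DTypes allow category checks without ``kind`` or ``char``.
print(isinstance(native, Floating), isinstance(Int64(), Floating))
# Two byte orders share one class and differ only as instances.
print(type(native) is type(swapped), swapped.byteorder)
```

In this sketch isinstance(native, DType) stays true although type(native)
is Float64, which mirrors the quoted statement that
isinstance(np.dtype(np.float64), np.dtype) will remain true.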