Hi all, in the spirit of trying to keep this moving, can I assume that the main reason for the little discussion is that the actual changes proposed are not very far reaching as of now? Or is it rather that this is a fairly complex topic and you need more time to think about it? If it is the latter, is there some way I can help with it? I tried to minimize how much is part of this initial NEP.
If there is not much need for discussion, I would like to officially accept the NEP very soon, sending out an official one-week notice in the next few days.

To summarize one more time, the main point is that `type(np.dtype(np.float64))` will be `np.dtype[float64]`, a subclass of dtype, so that `issubclass(np.dtype[float64], np.dtype)` is true. This means that we will have one class for every current type number (`dtype.num`). The implementation of these subclasses will be a C-written (extension) metaclass; all details of this class are meant to remain experimental and in flux at this time.

Cheers

Sebastian

On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote:
> Hi all,
>
> I am pleased to propose NEP 41: First step towards a new Datatype
> System: https://numpy.org/neps/nep-0041-improved-dtype-support.html
>
> This NEP motivates the larger restructure of the datatype machinery in
> NumPy and defines a few fundamental design aspects. The long term user
> impact will be to allow easier creation of richer user-defined
> datatypes.
>
> As this is a large restructure, the NEP represents only the first
> steps, with some additional information in further NEPs being drafted
> [1] (these may be helpful to look at depending on the level of detail
> you are interested in).
> The NEP itself does not propose to add significant new public API.
> Instead it proposes to move forward with an incremental internal
> refactor and lays the foundation for this process.
>
> The main user facing change at this time is that datatypes will become
> classes (e.g. ``type(np.dtype("float64"))`` will be a float64-specific
> class).
> For most users, the main impact should be many new datatypes in the
> long run (see the user impact section). However, for those interested
> in API design within NumPy or with respect to implementing new
> datatypes, this and the following NEPs are important decisions in the
> future roadmap for NumPy.
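To make the summarized relationships concrete, here is a tiny pure-Python sketch. This is *not* NumPy's actual implementation (which will be in C), and all names here are illustrative stand-ins; it only models how the metaclass, the ``np.dtype`` base class, and the per-type subclasses would relate:

```python
# Toy sketch (NOT NumPy's implementation) of the proposed class layout:
# a metaclass, a common dtype base class, and one subclass per type number.

class DTypeMeta(type):
    """Stands in for the C-level PyArray_DTypeMeta metaclass."""

class dtype(metaclass=DTypeMeta):
    """Stands in for np.dtype, the common base of all datatype classes."""

class Float64DType(dtype):
    """Stands in for np.dtype[float64]: one class per current dtype.num."""
    num = 12  # illustrative type number only

f64 = Float64DType()  # a dtype *instance*, like np.dtype("float64")

# The relationships the summary above describes:
assert isinstance(f64, dtype)           # isinstance(f, np.dtype) stays true
assert issubclass(Float64DType, dtype)  # issubclass(np.dtype[float64], np.dtype)
assert type(type(f64)) is DTypeMeta     # dtype classes are metaclass instances
```

The real classes will be created in C, but the isinstance/issubclass relationships shown are exactly the ones proposed.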
>
> The current full text is reproduced below, although the above link is
> probably a better way to read it.
>
> Cheers
>
> Sebastian
>
>
> [1] NEP 40 gives some background information about the current system
> and issues with it:
> https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst
> and NEP 42 is a first draft of what the new API may look like:
> https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst
> (links to current rendered versions; check
> https://github.com/numpy/numpy/pull/15505 and
> https://github.com/numpy/numpy/pull/15507 for updates)
>
>
> ----------------------------------------------------------------------
>
>
> =================================================
> NEP 41 — First step towards a new Datatype System
> =================================================
>
> :Title: Improved Datatype Support
> :Author: Sebastian Berg
> :Author: Stéfan van der Walt
> :Author: Matti Picus
> :Status: Draft
> :Type: Standard Track
> :Created: 2020-02-03
>
>
> .. note::
>
>     This NEP is part of a series of NEPs, beginning with information
>     about the previous dtype implementation and its issues in NEP 40.
>     NEP 41 (this document) then provides an overview and generic
>     design choices for the refactor.
>     Further NEPs 42 and 43 go into the technical details of the
>     datatype and universal function related internal and external API
>     changes.
>     In some cases it may be necessary to consult the other NEPs for a
>     full picture of the desired changes and why these changes are
>     necessary.
>
>
> Abstract
> --------
>
> :ref:`Datatypes <data-type-objects-dtype>` in NumPy describe how to
> interpret each element in arrays.
> NumPy provides ``int``, ``float``, and ``complex`` numerical types, as
> well as string, datetime, and structured datatype capabilities.
> The growing Python community, however, has need for more diverse
> datatypes.
> Examples are datatypes with unit information attached (such as meters)
> or categorical datatypes (a fixed set of possible values).
> However, the current NumPy datatype API is too limited to allow the
> creation of these.
>
> This NEP is the first step to enable such growth; it will lead to
> a simpler development path for new datatypes.
> In the long run the new datatype system will also support the creation
> of datatypes directly from Python rather than C.
> Refactoring the datatype API will improve maintainability and
> facilitate development of both user-defined external datatypes,
> as well as new features for existing datatypes internal to NumPy.
>
>
> Motivation and Scope
> --------------------
>
> .. seealso::
>
>     The user impact section includes examples of what kind of new
>     datatypes will be enabled by the proposed changes in the long run.
>     It may thus help to read these sections out of order.
>
> Motivation
> ^^^^^^^^^^
>
> One of the main issues with the current API is the definition of
> typical functions such as addition and multiplication for parametric
> datatypes (see also NEP 40), which require additional steps to
> determine the output type.
> For example, when adding two strings of length 4, the result is a
> string of length 8, which is different from the input.
> Similarly, a datatype which embeds a physical unit must calculate the
> new unit information: dividing a distance by a time results in a
> speed.
> A related difficulty is that the :ref:`current casting rules
> <ufuncs.casting>` -- the conversion between different datatypes --
> cannot describe casting for such parametric datatypes implemented
> outside of NumPy.
>
> This additional functionality for supporting parametric datatypes
> introduces increased complexity within NumPy itself,
> and furthermore is not available to external user-defined datatypes.
> In general the concerns of different datatypes are not well
> encapsulated.
> This burden is exacerbated by the exposure of internal C structures,
> limiting the addition of new fields
> (for example to support new sorting methods [new_sort]_).
>
> Currently there are many factors which limit the creation of new
> user-defined datatypes:
>
> * Creating casting rules for parametric user-defined dtypes is either
>   impossible or so complex that it has never been attempted.
> * Type promotion, e.g. the operation deciding that adding float and
>   integer values should return a float value, is very valuable for
>   numeric datatypes but is limited in scope for user-defined and
>   especially parametric datatypes.
> * Much of the logic (e.g. promotion) is written in single functions
>   instead of being split as methods on the datatype itself.
> * In the current design datatypes cannot have methods that do not
>   generalize to other datatypes. For example a unit datatype cannot
>   have a ``.to_si()`` method to easily find the datatype which would
>   represent the same values in SI units.
>
> The pressing need to solve these issues has driven the scientific
> community to create work-arounds in multiple projects implementing
> physical units as an array-like class instead of a datatype, which
> would generalize better across multiple array-likes (Dask, pandas,
> etc.).
> Already, Pandas has made a push in the same direction with its
> extension arrays [pandas_extension_arrays]_, and undoubtedly the
> community would be best served if such new features could be common
> between NumPy, Pandas, and other projects.
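The parametric-dtype problem described above can already be observed with today's builtin string dtype: the output dtype (here, the string length) cannot be read off from the input dtypes by any generic rule; it must be computed per operation:

```python
import numpy as np

# Adding two length-4 strings yields a length-8 string, so the output
# dtype depends on a parameter (the itemsize) computed from the inputs.
a = np.array(["abcd"])        # dtype <U4
b = np.array(["efgh"])        # dtype <U4
result = np.char.add(a, b)    # element-wise string concatenation

assert a.dtype == np.dtype("U4")
assert result.dtype == np.dtype("U8")   # length 4 + 4 -> length 8
assert result[0] == "abcdefgh"
```

For builtin strings NumPy handles this with special-cased internal code; the point of this NEP is to let external parametric datatypes express the same kind of rule.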
>
> Scope
> ^^^^^
>
> The proposed refactoring of the datatype system is a large undertaking
> and thus is proposed to be split into various phases, roughly:
>
> * Phase I: Restructure and extend the datatype infrastructure (this
>   NEP, 41)
> * Phase II: Incrementally define or rework API (detailed largely in
>   NEPs 42/43)
> * Phase III: Growth of NumPy and Scientific Python ecosystem
>   capabilities.
>
> For a more detailed accounting of the various phases, see
> "Plan to Approach the Full Refactor" in the Implementation section
> below.
> This NEP proposes to move ahead with the necessary creation of new
> dtype subclasses (Phase I),
> and start working on implementing current functionality.
> Within the context of this NEP all development will be fully private
> API or use preliminary underscored names which must be changed in the
> future.
> Most of the internal and public API choices are part of the second
> phase and will be discussed in more detail in the following NEPs 42
> and 43.
> The initial implementation of this NEP will have little or no effect
> on users, but provides the necessary groundwork for incrementally
> addressing the full rework.
>
> The implementation of this NEP and the following, implied large rework
> of how datatypes are defined in NumPy is expected to create small
> incompatibilities (see the backward compatibility section).
> However, a transition requiring large code adaptation is not
> anticipated and not within scope.
>
> Specifically, this NEP makes the following design choices, which are
> discussed in more detail in the detailed description section:
>
> 1. Each datatype will be an instance of a subclass of ``np.dtype``,
>    with most of the datatype-specific logic being implemented as
>    special methods on the class. In the C-API, these correspond to
>    specific slots.
>    In short, for ``f = np.dtype("f8")``, ``isinstance(f, np.dtype)``
>    will remain true, but ``type(f)`` will be a subclass of
>    ``np.dtype`` rather than just ``np.dtype`` itself.
>    The ``PyArray_ArrFuncs``, which are currently stored as a pointer
>    on the instance (as ``PyArray_Descr->f``), should instead be stored
>    on the class, as typically done in Python.
>    In the future these may correspond to Python-side dunder methods.
>    Storage information such as itemsize and byteorder can differ
>    between dtype instances (e.g. "S3" vs. "S8") and will remain part
>    of the instance.
>    This means that in the long run the current low-level access to
>    dtype methods will be removed (see ``PyArray_ArrFuncs`` in NEP 40).
>
> 2. The current NumPy scalars will *not* change: they will not be
>    instances of datatypes. This will also be true for new datatypes;
>    scalars will not be instances of a dtype (although
>    ``isinstance(scalar, dtype)`` may be made to return ``True`` when
>    appropriate).
>
> Detailed technical decisions are to follow in NEP 42.
>
> Further, the public API will be designed in a way that is extensible
> in the future:
>
> 3. All new C-API functions provided to the user will hide
>    implementation details as much as possible. The public API should
>    be an identical, but limited, version of the C-API used for the
>    internal NumPy datatypes.
>
> The changes to the datatype system in Phase II must include a large
> refactor of the UFunc machinery, which will be further defined in
> NEP 43:
>
> 4. To enable all of the desired functionality for new user-defined
>    datatypes, the UFunc machinery will be changed to replace the
>    current dispatching and type resolution system.
>    The old system should be *mostly* supported as a legacy version for
>    some time.
>
> Additionally, as a general design principle, the addition of new
> user-defined datatypes will *not* change the behaviour of programs.
> For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or
> ``b`` know that ``c`` exists.
>
>
> User Impact
> -----------
>
> The current ecosystem has very few user-defined datatypes using NumPy,
> the two most prominent being ``rational`` and ``quaternion``.
> These represent fairly simple datatypes which are not strongly
> impacted by the current limitations.
> However, we have identified a need for datatypes such as:
>
> * bfloat16, used in deep learning
> * categorical types
> * physical units (such as meters)
> * datatypes for tracing/automatic differentiation
> * high, fixed precision math
> * specialized integer types such as int2, int24
> * new, better datetime representations
> * extending e.g. integer dtypes to have a sentinel NA value
> * geometrical objects [pygeos]_
>
> Some of these are partially solved; for example unit capability is
> provided in ``astropy.units``, ``unyt``, or ``pint``, as
> `numpy.ndarray` subclasses.
> Most of these datatypes, however, simply cannot be reasonably defined
> right now.
> An advantage of having such datatypes in NumPy is that they should
> integrate seamlessly with other array or array-like packages such as
> Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
>
> The long term user impact of implementing this NEP will be to allow
> both the growth of the whole ecosystem by having such new datatypes,
> as well as consolidating the implementation of such datatypes within
> NumPy to achieve better interoperability.
>
>
> Examples
> ^^^^^^^^
>
> The following examples represent future user-defined datatypes we wish
> to enable.
> These datatypes are not part of the NEP, and the choices shown (e.g.
> of casting rules) are possibilities we wish to enable; they do not
> represent recommendations.
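As a baseline for the examples that follow, today's builtin dtypes already answer the promotion and casting queries (``np.result_type``, ``np.can_cast``) that a new DType such as bfloat16 would have to implement for itself:

```python
import numpy as np

# Promotion: the ufunc output dtype matches np.result_type for simple
# numeric ufuncs, which is what lets NumPy pick the right inner loop.
arr1 = np.arange(3, dtype=np.int32)
arr2 = np.arange(3, dtype=np.float32)
assert np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)

# Casting queries: widening floats is "safe", narrowing is only
# "same_kind" -- exactly the distinctions the mpf example below makes.
assert np.can_cast(np.float32, np.float64, casting="safe")
assert not np.can_cast(np.float64, np.float32, casting="safe")
assert np.can_cast(np.float64, np.float32, casting="same_kind")
```

The examples below imagine user-defined datatypes plugging into these same interfaces.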
>
> Simple Numerical Types
> """"""""""""""""""""""
>
> Mainly used where memory is a consideration, lower-precision numeric
> types such as `bfloat16
> <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_
> are common in other computational frameworks.
> For these types the definitions of things such as ``np.common_type``
> and ``np.can_cast`` are some of the most important interfaces. Once
> they support ``np.common_type``, it is (for the most part) possible to
> find the correct ufunc loop to call, since most ufuncs -- such as add
> -- effectively only require ``np.result_type``::
>
>     >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
>
> and `~numpy.result_type` is largely identical to
> `~numpy.common_type`.
>
>
> Fixed, high precision math
> """"""""""""""""""""""""""
>
> Allowing arbitrary precision or higher precision math is important in
> simulations. For instance ``mpmath`` defines a precision::
>
>     >>> import mpmath as mp
>     >>> print(mp.dps)  # the current (default) precision
>     15
>
> NumPy should be able to construct a native, memory-efficient array
> from a list of ``mpmath.mpf`` floating point objects::
>
>     >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
>     >>> print(arr_15_dps)  # Must find the correct precision from the objects:
>     array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
>
> We should also be able to specify the desired precision when creating
> the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` to find
> the DType class (the notation is not part of this NEP), which is then
> instantiated with the desired parameter.
> This could also be written as an ``MpfDType`` class::
>
>     >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
>     >>> print(arr_15_dps + arr_100_dps)
>     array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100])
>
> The ``mpf`` datatype can decide that the result of the operation
> should be the higher precision one of the two, so it uses a precision
> of 100.
> Furthermore, we should be able to define casting, for example as in::
>
>     >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
>     True
>     >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
>     False  # loses precision
>     >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
>     True
>
> Casting from float is probably always at least a ``same_kind`` cast,
> but in general, it is not safe::
>
>     >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
>     False
>
> since a float64 has a higher precision than the ``mpf`` datatype with
> ``dps=4``.
>
> Alternatively, we can say that::
>
>     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
>     np.dtype[mp.mpf](dps=10)
>
> And possibly even::
>
>     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
>     np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)
>
> since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)``
> safely.
>
>
> Categoricals
> """"""""""""
>
> Categoricals are interesting in that they can have fixed, predefined
> values, or can be dynamic with the ability to modify categories when
> necessary.
> Fixed categories (defined ahead of time) are the most straightforward
> categorical definition.
> Categoricals are *hard*, since there are many strategies to implement
> them, suggesting NumPy should only provide the scaffolding for
> user-defined categorical types.
> For instance::
>
>     >>> cat = Categorical(["eggs", "spam", "toast"])
>     >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)
>
> could store the array very efficiently, since it knows that there are
> only 3 categories.
> Since a categorical in this sense knows almost nothing about the data
> stored in it, few operations make sense, although equality does::
>
>     >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
>     >>> breakfast == breakfast2
>     array([True, False, True, False])
>
> The categorical datatype could work like a dictionary: no two item
> names can be equal (checked on dtype creation), so that the equality
> operation above can be performed very efficiently.
> If the values define an order, the category labels (internally
> integers) could be ordered the same way to allow efficient sorting
> and comparison.
>
> Whether or not casting is defined from one categorical with fewer
> values to one with strictly more values defined is something that the
> Categorical datatype would need to decide. Both options should be
> available.
>
>
> Unit on the Datatype
> """"""""""""""""""""
>
> There are different ways to define units, depending on how the
> internal machinery would be organized. One way is to have a single
> Unit datatype for every existing numerical type.
> This will be written as ``Unit[float64]``; the unit itself is part of
> the DType instance, i.e. ``Unit[float64]("m")`` is a ``float64`` with
> meters attached::
>
>     >>> from astropy import units
>     >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
>     >>> print(meters)
>     array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
>
> Note that units are a bit tricky. It is debatable whether::
>
>     >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
>
> should be valid syntax (coercing the float scalars without a unit to
> meters).
> Once the array is created, math will work without any issue::
>
>     >>> meters / (2 * units.s)
>     array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
>
> Casting is not valid from one unit to the other, but can be valid
> between different scales of the same dimensionality (although this
> may be "unsafe")::
>
>     >>> meters.astype(Unit[float64]("s"))
>     TypeError: Cannot cast meters to seconds.
>     >>> meters.astype(Unit[float64]("km"))
>     >>> # Convert to centimeter-gram-second (cgs) units:
>     >>> meters.astype(meters.dtype.to_cgs())
>
> The above notation is somewhat clumsy. Functions could be used
> instead to convert between units.
> There may be ways to make these more convenient, but those must be
> left for future discussions::
>
>     >>> units.convert(meters, "km")
>     >>> units.to_cgs(meters)
>
> There are some open questions. For example, whether additional
> methods on the array object could exist to simplify some of the
> notions, and how these would percolate from the datatype to the
> ``ndarray``.
>
> The interaction with other scalars would likely be defined through::
>
>     >>> np.common_type(np.float64, Unit)
>     Unit[np.float64](dimensionless)
>
> Ufunc output datatype determination can be more involved than for
> simple numerical dtypes, since there is no "universal" output type::
>
>     >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)
>
> In fact ``np.result_type(meters, seconds)`` must error without
> context of the operation being done.
> This example highlights how the specific ufunc loop
> (the loop with known, specific DTypes as inputs) has to be able to
> make certain decisions before the actual calculation can start.
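The per-operation output-dtype resolution that the unit example relies on can be sketched in a few lines of plain Python. This is a toy, not NumPy API; ``UnitDType`` and ``divide_result_dtype`` are hypothetical names standing in for a parametric DType and one ufunc-specific resolution step:

```python
from dataclasses import dataclass

# Toy sketch: the unit lives on the dtype instance, and each operation
# derives its output dtype from the input dtypes (not from a universal rule).

@dataclass(frozen=True)
class UnitDType:
    unit: str  # e.g. "m", "s" -- a real implementation would parse units

def divide_result_dtype(a: UnitDType, b: UnitDType) -> UnitDType:
    """Output-dtype resolution for division: combine the unit strings."""
    return UnitDType(unit=f"{a.unit}/{b.unit}")

meters = UnitDType("m")
seconds = UnitDType("s")
speed = divide_result_dtype(meters, seconds)
assert speed == UnitDType("m/s")
```

Note that multiplication would need a *different* resolution function, which is exactly why a context-free ``np.result_type(meters, seconds)`` cannot give an answer.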
>
>
> Implementation
> --------------
>
> Plan to Approach the Full Refactor
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> To address these issues in NumPy and enable new datatypes, multiple
> development stages are required:
>
> * Phase I: Restructure and extend the datatype infrastructure (this
>   NEP)
>
>   * Organize datatypes like normal Python classes (`PR 15508`_)
>
> * Phase II: Incrementally define or rework API
>
>   * Create a new and easily extensible API for defining new datatypes
>     and related functionality. (NEP 42)
>
>   * Incrementally define all necessary functionality through the new
>     API (NEP 42):
>
>     * Defining operations such as ``np.common_type``.
>     * Defining casting between datatypes.
>     * Adding functionality necessary to create a NumPy array from
>       Python scalars (i.e. ``np.array(...)``).
>     * …
>
>   * Restructure how universal functions work (NEP 43), in order to:
>
>     * make it possible for a `~numpy.ufunc` such as ``np.add`` to be
>       extended by user-defined datatypes such as Units.
>
>     * allow efficient lookup of the correct implementation for
>       user-defined datatypes.
>
>     * enable reuse of existing code. Units should be able to use the
>       normal math loops and add additional logic to determine the
>       output type.
>
> * Phase III: Growth of NumPy and Scientific Python ecosystem
>   capabilities:
>
>   * Cleanup of legacy behaviour where it is considered buggy or
>     undesirable.
>   * Provide a path to define new datatypes from Python.
>   * Assist the community in creating types such as Units or
>     Categoricals.
>   * Allow strings to be used in functions such as ``np.equal`` or
>     ``np.add``.
>   * Remove legacy code paths within NumPy to improve long term
>     maintainability.
>
> This document serves as a basis for Phase I and provides the vision
> and motivation for the full project.
> Phase I does not introduce any new user-facing features,
> but is concerned with the necessary conceptual cleanup of the current
> datatype system.
> It provides a more "pythonic" datatype Python type object, with a
> clear class hierarchy.
>
> The second phase is the incremental creation of all APIs necessary to
> define fully featured datatypes and the reorganization of the NumPy
> datatype system.
> This phase will thus primarily be concerned with defining an,
> initially preliminary, stable public API.
>
> Some of the benefits of a large refactor may only become evident
> after the full deprecation of the current legacy implementation (i.e.
> larger code removals).
> However, these steps are necessary for improvements to many parts of
> the core NumPy API, and are expected to make the implementation
> generally easier to understand.
>
> The following figure illustrates the proposed design at a high level
> and roughly delineates the components of the overall design.
> Note that this NEP concerns only Phase I (shaded area); the rest
> encompasses Phase II and those design choices are up for discussion.
> However, it highlights that the DType class is the central, necessary
> concept:
>
> .. image:: _static/nep-0041-mindmap.svg
>
>
> First steps directly related to this NEP
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The required changes to NumPy are large and touch many areas of the
> code base, but many of these changes can be addressed incrementally.
>
> To enable an incremental approach we will start by creating a
> C-defined ``PyArray_DTypeMeta`` class with its instances being the
> ``DType`` classes, subclasses of ``np.dtype``.
> This is necessary to add the ability to store custom slots on the
> DType in C.
> This ``DTypeMeta`` will be implemented first to then enable
> incremental restructuring of current code.
>
> The addition of ``DType`` will then enable addressing other changes
> incrementally, some of which may begin before settling the full
> internal API:
>
> 1.
>    New machinery for array coercion, with the goal of enabling user
>    DTypes with appropriate class methods.
> 2. The replacement or wrapping of the current casting machinery.
> 3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots
>    into DType method slots.
>
> At this point, no or only very limited new public API will be added,
> and the internal API is considered to be in flux.
> Any new public API may be set up to give warnings and will have
> leading underscores to indicate that it is not finalized and can be
> changed without warning.
>
>
> Backward compatibility
> ----------------------
>
> While the actual backward compatibility impact of implementing Phases
> I and II is not yet fully clear, we anticipate, and accept, the
> following changes:
>
> * **Python API**:
>
>   * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``,
>     while right now ``type(np.dtype("f8")) is np.dtype``.
>     Code should use ``isinstance`` checks, and in very rare cases may
>     have to be adapted to use them.
>
> * **C-API**:
>
>   * In old versions of NumPy ``PyArray_DescrCheck`` is a macro which
>     uses ``type(dtype) is np.dtype``. When compiling against an old
>     NumPy version, the macro may have to be replaced with the
>     corresponding ``PyObject_IsInstance`` call. (If this is a
>     problem, we could backport fixing the macro.)
>
>   * The UFunc machinery changes will break *limited* parts of the
>     current implementation. Replacing e.g. the default
>     ``TypeResolver`` is expected to remain supported for a time,
>     although optimized masked inner-loop iteration
>     (which is not even used *within* NumPy) will no longer be
>     supported.
>
>   * All functions currently defined on the dtypes, such as
>     ``PyArray_Descr->f->nonzero``, will be defined and accessed
>     differently. This means that in the long run low-level access
>     code will have to be changed to use the new API. Such changes are
>     expected to be necessary in very few projects.
>
> * **dtype implementors (C-API)**:
>
>   * The array which is currently provided to some functions (such as
>     cast functions) will no longer be provided.
>     For example ``PyArray_Descr->f->nonzero`` or
>     ``PyArray_Descr->f->copyswapn`` may instead receive a dummy array
>     object with only some fields (mainly the dtype) being valid.
>     At least in some code paths, a similar mechanism is already used.
>
>   * The ``scalarkind`` slot and registration of scalar casting will
>     be removed/ignored without replacement.
>     It currently allows partial value-based casting.
>     The ``PyArray_ScalarKind`` function will continue to work for
>     builtin types, but will not be used internally and will be
>     deprecated.
>
>   * Currently user dtypes are defined as instances of ``np.dtype``.
>     The creation works by the user providing a prototype instance.
>     NumPy will need to modify at least the type during registration.
>     This has no effect for either ``rational`` or ``quaternion``, and
>     mutation of the structure seems unlikely after registration.
>
> Since there is a fairly large API surface concerning datatypes,
> further changes, or the limitation of certain functions to currently
> existing datatypes, are likely to occur.
> For example functions which use the type number as input should be
> replaced with functions taking DType classes instead.
> Although public, large parts of this C-API seem to be used rarely,
> possibly never, by downstream projects.
>
>
>
> Detailed Description
> --------------------
>
> This section details the design decisions covered by this NEP.
> The subsections correspond to the list of design choices presented
> in the Scope section.
>
> Datatypes as Python Classes (1)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The current NumPy datatypes are not full-scale Python classes.
> They are instead (prototype) instances of a single ``np.dtype``
> class.
> Changing this means that any special handling, e.g.
> for ``datetime``, can be moved to the Datetime DType class instead,
> away from monolithic general code (e.g. the current
> ``PyArray_AdjustFlexibleDType``).
>
> The main consequence of this change with respect to the API is that
> special methods move from the dtype instances to methods on the new
> DType class.
> This is the typical design pattern used in Python.
> Organizing these methods and information in a more Pythonic way
> provides a solid foundation for refining and extending the API in the
> future.
> The current API cannot be extended due to how it is exposed publicly.
> This means for example that the methods currently stored in
> ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined
> differently in the future and deprecated in the long run.
>
> The most prominent visible side effect of this will be that
> ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.
> Instead it will be a subclass of ``np.dtype``, meaning that
> ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.
> This will also add the ability to use
> ``isinstance(dtype, np.dtype[float64])``, thus removing the need to
> use ``dtype.kind``, ``dtype.char``, or ``dtype.type`` to do this
> check.
>
> With the design decision of DTypes as full-scale Python classes, the
> question of subclassing arises.
> Inheritance, however, appears problematic and a complexity best
> avoided (at least initially) for container datatypes.
> Further, subclasses may be more interesting for interoperability, for
> example with GPU backends (CuPy), storing additional methods related
> to the GPU, rather than serving as a mechanism to define new
> datatypes.
> While a class hierarchy does provide value, this may be achieved by
> allowing the creation of *abstract* datatypes.
> An example of an abstract datatype would be the datatype equivalent
> of ``np.floating``, representing any floating point number.
> These can serve the same purpose as Python's abstract base classes.
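The role such abstract datatypes would play can be sketched with Python's own ``abc`` machinery. Again a toy, not NumPy API; ``Floating`` and ``Float64DType`` are illustrative stand-ins for an abstract DType and a concrete one:

```python
from abc import ABCMeta

# Toy sketch: an abstract "Floating" datatype plays the role of
# np.floating -- it cannot be instantiated, only checked against,
# replacing dtype.kind/char/type inspection with isinstance checks.

class Floating(metaclass=ABCMeta):
    def __new__(cls, *args, **kwargs):
        if cls is Floating:
            raise TypeError("abstract datatype cannot be instantiated")
        return super().__new__(cls)

class Float64DType(Floating):
    itemsize = 8  # storage information stays on the concrete class/instance

f64 = Float64DType()
assert isinstance(f64, Floating)
assert issubclass(Float64DType, Floating)
try:
    Floating()
except TypeError:
    pass  # abstract datatypes are check-only, as intended
```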
>
>
> Scalars should not be instances of the datatypes (2)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> For simple datatypes such as ``float64`` (see also below), it seems
> tempting that the instance of a ``np.dtype("float64")`` could be the
> scalar.
> This idea may be even more appealing due to the fact that scalars,
> rather than datatypes, currently define a useful type hierarchy.
>
> However, we have specifically decided against this for a number of
> reasons.
> First, the new datatypes described herein would be instances of DType
> classes. Making these instances themselves classes, while possible,
> adds additional complexity that users need to understand.
> It would also mean that scalars must carry storage information (such
> as byteorder) which is generally unnecessary and currently not used.
> Second, while the simple NumPy scalars such as ``float64`` may be
> such instances, it should be possible to create datatypes for Python
> objects without enforcing NumPy as a dependency.
> However, Python objects that do not depend on NumPy cannot be
> instances of a NumPy DType.
> Third, there is a mismatch between the methods and attributes which
> are useful for scalars and datatypes. For instance ``to_float()``
> makes sense for a scalar but not for a datatype, and
> ``newbyteorder`` is not useful on a scalar (or has a different
> meaning).
>
> Overall, it seems that rather than reducing complexity by merging the
> two distinct type hierarchies, making scalars instances of DTypes
> would increase the complexity of both the design and implementation.
>
> A possible future path may be to instead simplify the current NumPy
> scalars to be much simpler objects which largely derive their
> behaviour from the datatypes.
>
> C-API for creating new Datatypes (3)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The current C-API with which users can create new datatypes is
> limited in scope, and requires use of "private" structures.
This means the API is not extensible: no new members can be added to the
structure without losing binary compatibility.
This has already limited the inclusion of new sorting methods into
NumPy [new_sort]_.

The new version shall thus replace the current ``PyArray_ArrFuncs``
structure used to define new datatypes.
Datatypes that currently exist and are defined using these slots will be
supported during a deprecation period.

The most likely solution to hide the implementation from the user, and thus
make it extensible in the future, is to model the API after Python's stable
API [PEP-384]_:

.. code-block:: C

    static struct PyArrayMethodDef slots[] = {
        {NPY_dt_method, method_implementation},
        ...,
        {0, NULL}
    };

    typedef struct {
        PyTypeObject *typeobj;  /* type of python scalar */
        ...;
        PyType_Slot *slots;
    } PyArrayDTypeMeta_Spec;

    PyObject* PyArray_InitDTypeMetaFromSpec(
            PyArray_DTypeMeta *user_dtype,
            PyArrayDTypeMeta_Spec *dtype_spec);

The C-side slots should be designed to mirror Python-side methods such as
``dtype.__dtype_method__``, although the exposure to Python is a later
step in the implementation to reduce the complexity of the initial
implementation.


C-API Changes to the UFunc Machinery (4)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Proposed changes to the UFunc machinery will be part of NEP 43.
However, the following changes will be necessary (see NEP 40 for a
detailed description of the current implementation and its issues):

* The current UFunc type resolution must be adapted to allow better
  control for user-defined dtypes as well as resolve current
  inconsistencies.
* The inner-loop used in UFuncs must be expanded to include a return
  value.
  Further, error reporting must be improved, and passing in dtype-specific
  information enabled.
  This requires the modification of the inner-loop function signature and
  the addition of new hooks called before and after the inner-loop is
  used.

An important goal for any changes to the universal functions will be to
allow the reuse of existing loops.
It should be easy for a new units datatype to fall back to existing math
functions after handling the unit-related computations.


Discussion
----------

See NEP 40 for a list of previous meetings and discussions.


References
----------

.. [pandas_extension_arrays]
   https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types

.. _xarray_dtype_issue: https://github.com/pydata/xarray/issues/1262

.. [pygeos] https://github.com/caspervdw/pygeos

.. [new_sort] https://github.com/numpy/numpy/pull/12945

.. [PEP-384] https://www.python.org/dev/peps/pep-0384/

.. [PR 15508] https://github.com/numpy/numpy/pull/15508


Copyright
---------

This document has been placed in the public domain.


Acknowledgments
---------------

The effort to create new datatypes for NumPy has been discussed for
several years in many different contexts and settings, making it
impossible to list everyone involved.
We would like to thank especially Stephan Hoyer, Nathaniel Smith, and Eric
Wieser for repeated in-depth discussion about datatype design.
We are very grateful for the community input in reviewing and revising
this NEP and would like to thank especially Ross Barnowski and Ralf
Gommers.

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion