Hi all, in the spirit of trying to keep this moving, can I assume that the main reason for the little discussion is that the actual changes proposed are not very far reaching as of now? Or is it rather that this is a fairly complex topic and you need more time to think about it? If it is the latter, is there some way I can help with it? I tried to minimize how much is part of this initial NEP.
If there is not much need for discussion, I would like to officially accept the NEP very soon, sending out an official one-week notice in the next few days.

To summarize one more time, the main point is that `type(np.dtype(np.float64))` will be `np.dtype[float64]`, a subclass of dtype, so that `issubclass(np.dtype[float64], np.dtype)` is true. This means that we will have one class for every current type number (`dtype.num`). The implementation of these subclasses will be a C-written (extension) metaclass; all details of this class are meant to remain experimental and in flux at this time.

Cheers

Sebastian

On Wed, 2020-03-11 at 17:02 -0700, Sebastian Berg wrote:
> Hi all,
>
> I am pleased to propose NEP 41: First step towards a new Datatype
> System: https://numpy.org/neps/nep-0041-improved-dtype-support.html
>
> This NEP motivates the larger restructure of the datatype machinery in
> NumPy and defines a few fundamental design aspects. The long term user
> impact will be to allow easier creation of richer user-defined
> datatypes.
>
> As this is a large restructure, the NEP represents only the first
> steps, with some additional information in further NEPs being drafted
> [1] (these may be helpful to look at depending on the level of detail
> you are interested in).
> The NEP itself does not propose to add significant new public API.
> Instead it proposes to move forward with an incremental internal
> refactor and lays the foundation for this process.
>
> The main user facing change at this time is that datatypes will become
> classes (e.g. ``type(np.dtype("float64"))`` will be a float64-specific
> class).
> For most users, the main impact should be many new datatypes in the
> long run (see the user impact section). However, for those interested
> in API design within NumPy or with respect to implementing new
> datatypes, this and the following NEPs are important decisions in the
> future roadmap for NumPy.
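To make the summarized relationships concrete, here is a tiny pure-Python sketch. This is *not* NumPy's actual implementation (which will be in C), and all names here are illustrative stand-ins; it only models how the metaclass, the ``np.dtype`` base class, and the per-type subclasses would relate:

```python
# Toy sketch (NOT NumPy's implementation) of the proposed class layout:
# a metaclass, a common dtype base class, and one subclass per type number.

class DTypeMeta(type):
    """Stands in for the C-level PyArray_DTypeMeta metaclass."""

class dtype(metaclass=DTypeMeta):
    """Stands in for np.dtype, the common base of all datatype classes."""

class Float64DType(dtype):
    """Stands in for np.dtype[float64]: one class per current dtype.num."""
    num = 12  # illustrative type number only

f64 = Float64DType()  # a dtype *instance*, like np.dtype("float64")

# The relationships the summary above describes:
assert isinstance(f64, dtype)           # isinstance(f, np.dtype) stays true
assert issubclass(Float64DType, dtype)  # issubclass(np.dtype[float64], np.dtype)
assert type(type(f64)) is DTypeMeta     # dtype classes are metaclass instances
```

The real classes will be created in C, but the isinstance/issubclass relationships shown are exactly the ones proposed.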
>
> The current full text is reproduced below, although the above link is
> probably a better way to read it.
>
> Cheers
>
> Sebastian
>
>
> [1] NEP 40 gives some background information about the current system
> and issues with it:
> https://github.com/numpy/numpy/blob/1248cf7a8765b7b53d883f9e7061173817533aac/doc/neps/nep-0040-legacy-datatype-impl.rst
> and NEP 42 is a first draft of what the new API may look like:
> https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst
> (links to current rendered versions; check
> https://github.com/numpy/numpy/pull/15505 and
> https://github.com/numpy/numpy/pull/15507 for updates)
>
>
> ----------------------------------------------------------------------
>
>
> =================================================
> NEP 41 — First step towards a new Datatype System
> =================================================
>
> :Title: Improved Datatype Support
> :Author: Sebastian Berg
> :Author: Stéfan van der Walt
> :Author: Matti Picus
> :Status: Draft
> :Type: Standard Track
> :Created: 2020-02-03
>
>
> .. note::
>
>     This NEP is part of a series of NEPs, beginning with information
>     about the previous dtype implementation and its issues in NEP 40.
>     NEP 41 (this document) then provides an overview and generic
>     design choices for the refactor.
>     Further NEPs 42 and 43 go into the technical details of the
>     datatype and universal function related internal and external API
>     changes.
>     In some cases it may be necessary to consult the other NEPs for a
>     full picture of the desired changes and why these changes are
>     necessary.
>
>
> Abstract
> --------
>
> :ref:`Datatypes <data-type-objects-dtype>` in NumPy describe how to
> interpret each element in arrays.
> NumPy provides ``int``, ``float``, and ``complex`` numerical types, as
> well as string, datetime, and structured datatype capabilities.
> The growing Python community, however, has need for more diverse
> datatypes.
> Examples are datatypes with unit information attached (such as meters)
> or categorical datatypes (a fixed set of possible values).
> However, the current NumPy datatype API is too limited to allow the
> creation of these.
>
> This NEP is the first step to enable such growth; it will lead to
> a simpler development path for new datatypes.
> In the long run the new datatype system will also support the creation
> of datatypes directly from Python rather than C.
> Refactoring the datatype API will improve maintainability and
> facilitate development of both user-defined external datatypes,
> as well as new features for existing datatypes internal to NumPy.
>
>
> Motivation and Scope
> --------------------
>
> .. seealso::
>
>     The user impact section includes examples of what kind of new
>     datatypes will be enabled by the proposed changes in the long run.
>     It may thus help to read these sections out of order.
>
> Motivation
> ^^^^^^^^^^
>
> One of the main issues with the current API is the definition of
> typical functions such as addition and multiplication for parametric
> datatypes (see also NEP 40), which require additional steps to
> determine the output type.
> For example, when adding two strings of length 4, the result is a
> string of length 8, which is different from the input.
> Similarly, a datatype which embeds a physical unit must calculate the
> new unit information: dividing a distance by a time results in a
> speed.
> A related difficulty is that the :ref:`current casting rules
> <ufuncs.casting>` -- the conversion between different datatypes --
> cannot describe casting for such parametric datatypes implemented
> outside of NumPy.
>
> This additional functionality for supporting parametric datatypes
> introduces increased complexity within NumPy itself,
> and furthermore is not available to external user-defined datatypes.
> In general the concerns of different datatypes are not well
> encapsulated.
> This burden is exacerbated by the exposure of internal C structures,
> limiting the addition of new fields
> (for example to support new sorting methods [new_sort]_).
>
> Currently there are many factors which limit the creation of new
> user-defined datatypes:
>
> * Creating casting rules for parametric user-defined dtypes is either
>   impossible or so complex that it has never been attempted.
> * Type promotion, e.g. the operation deciding that adding float and
>   integer values should return a float value, is very valuable for
>   numeric datatypes but is limited in scope for user-defined and
>   especially parametric datatypes.
> * Much of the logic (e.g. promotion) is written in single functions
>   instead of being split as methods on the datatype itself.
> * In the current design datatypes cannot have methods that do not
>   generalize to other datatypes. For example a unit datatype cannot
>   have a ``.to_si()`` method to easily find the datatype which would
>   represent the same values in SI units.
>
> The pressing need to solve these issues has driven the scientific
> community to create work-arounds in multiple projects implementing
> physical units as an array-like class instead of a datatype, which
> would generalize better across multiple array-likes (Dask, pandas,
> etc.).
> Already, Pandas has made a push in the same direction with its
> extension arrays [pandas_extension_arrays]_, and undoubtedly the
> community would be best served if such new features could be common
> between NumPy, Pandas, and other projects.
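The parametric-dtype problem described above can already be observed with today's builtin string dtype: the output dtype (here, the string length) cannot be read off from the input dtypes by any generic rule; it must be computed per operation:

```python
import numpy as np

# Adding two length-4 strings yields a length-8 string, so the output
# dtype depends on a parameter (the itemsize) computed from the inputs.
a = np.array(["abcd"])        # dtype <U4
b = np.array(["efgh"])        # dtype <U4
result = np.char.add(a, b)    # element-wise string concatenation

assert a.dtype == np.dtype("U4")
assert result.dtype == np.dtype("U8")   # length 4 + 4 -> length 8
assert result[0] == "abcdefgh"
```

For builtin strings NumPy handles this with special-cased internal code; the point of this NEP is to let external parametric datatypes express the same kind of rule.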
>
> Scope
> ^^^^^
>
> The proposed refactoring of the datatype system is a large undertaking
> and thus is proposed to be split into various phases, roughly:
>
> * Phase I: Restructure and extend the datatype infrastructure (this
>   NEP, 41)
> * Phase II: Incrementally define or rework API (detailed largely in
>   NEPs 42/43)
> * Phase III: Growth of NumPy and Scientific Python ecosystem
>   capabilities.
>
> For a more detailed accounting of the various phases, see
> "Plan to Approach the Full Refactor" in the Implementation section
> below.
> This NEP proposes to move ahead with the necessary creation of new
> dtype subclasses (Phase I),
> and start working on implementing current functionality.
> Within the context of this NEP all development will be fully private
> API or use preliminary underscored names which must be changed in the
> future.
> Most of the internal and public API choices are part of the second
> phase and will be discussed in more detail in the following NEPs 42
> and 43.
> The initial implementation of this NEP will have little or no effect
> on users, but provides the necessary groundwork for incrementally
> addressing the full rework.
>
> The implementation of this NEP and the following, implied large rework
> of how datatypes are defined in NumPy is expected to create small
> incompatibilities (see the backward compatibility section).
> However, a transition requiring large code adaptation is not
> anticipated and not within scope.
>
> Specifically, this NEP makes the following design choices, which are
> discussed in more detail in the detailed description section:
>
> 1. Each datatype will be an instance of a subclass of ``np.dtype``,
>    with most of the datatype-specific logic being implemented as
>    special methods on the class. In the C-API, these correspond to
>    specific slots.
>    In short, for ``f = np.dtype("f8")``, ``isinstance(f, np.dtype)``
>    will remain true, but ``type(f)`` will be a subclass of
>    ``np.dtype`` rather than just ``np.dtype`` itself.
>    The ``PyArray_ArrFuncs``, which are currently stored as a pointer
>    on the instance (as ``PyArray_Descr->f``), should instead be stored
>    on the class, as typically done in Python.
>    In the future these may correspond to Python-side dunder methods.
>    Storage information such as itemsize and byteorder can differ
>    between dtype instances (e.g. "S3" vs. "S8") and will remain part
>    of the instance.
>    This means that in the long run the current low-level access to
>    dtype methods will be removed (see ``PyArray_ArrFuncs`` in NEP 40).
>
> 2. The current NumPy scalars will *not* change: they will not be
>    instances of datatypes. This will also be true for new datatypes;
>    scalars will not be instances of a dtype (although
>    ``isinstance(scalar, dtype)`` may be made to return ``True`` when
>    appropriate).
>
> Detailed technical decisions are to follow in NEP 42.
>
> Further, the public API will be designed in a way that is extensible
> in the future:
>
> 3. All new C-API functions provided to the user will hide
>    implementation details as much as possible. The public API should
>    be an identical, but limited, version of the C-API used for the
>    internal NumPy datatypes.
>
> The changes to the datatype system in Phase II must include a large
> refactor of the UFunc machinery, which will be further defined in
> NEP 43:
>
> 4. To enable all of the desired functionality for new user-defined
>    datatypes, the UFunc machinery will be changed to replace the
>    current dispatching and type resolution system.
>    The old system should be *mostly* supported as a legacy version for
>    some time.
>
> Additionally, as a general design principle, the addition of new
> user-defined datatypes will *not* change the behaviour of programs.
> For example ``common_dtype(a, b)`` must not be ``c`` unless ``a`` or
> ``b`` know that ``c`` exists.
>
>
> User Impact
> -----------
>
> The current ecosystem has very few user-defined datatypes using NumPy,
> the two most prominent being ``rational`` and ``quaternion``.
> These represent fairly simple datatypes which are not strongly
> impacted by the current limitations.
> However, we have identified a need for datatypes such as:
>
> * bfloat16, used in deep learning
> * categorical types
> * physical units (such as meters)
> * datatypes for tracing/automatic differentiation
> * high, fixed precision math
> * specialized integer types such as int2, int24
> * new, better datetime representations
> * extending e.g. integer dtypes to have a sentinel NA value
> * geometrical objects [pygeos]_
>
> Some of these are partially solved; for example unit capability is
> provided in ``astropy.units``, ``unyt``, or ``pint``, as
> `numpy.ndarray` subclasses.
> Most of these datatypes, however, simply cannot be reasonably defined
> right now.
> An advantage of having such datatypes in NumPy is that they should
> integrate seamlessly with other array or array-like packages such as
> Pandas, ``xarray`` [xarray_dtype_issue]_, or ``Dask``.
>
> The long term user impact of implementing this NEP will be to allow
> both the growth of the whole ecosystem by having such new datatypes,
> as well as consolidating the implementation of such datatypes within
> NumPy to achieve better interoperability.
>
>
> Examples
> ^^^^^^^^
>
> The following examples represent future user-defined datatypes we wish
> to enable.
> These datatypes are not part of the NEP, and the choices shown (e.g.
> of casting rules) are possibilities we wish to enable; they do not
> represent recommendations.
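As a baseline for the examples that follow, today's builtin dtypes already answer the promotion and casting queries (``np.result_type``, ``np.can_cast``) that a new DType such as bfloat16 would have to implement for itself:

```python
import numpy as np

# Promotion: the ufunc output dtype matches np.result_type for simple
# numeric ufuncs, which is what lets NumPy pick the right inner loop.
arr1 = np.arange(3, dtype=np.int32)
arr2 = np.arange(3, dtype=np.float32)
assert np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)

# Casting queries: widening floats is "safe", narrowing is only
# "same_kind" -- exactly the distinctions the mpf example below makes.
assert np.can_cast(np.float32, np.float64, casting="safe")
assert not np.can_cast(np.float64, np.float32, casting="safe")
assert np.can_cast(np.float64, np.float32, casting="same_kind")
```

The examples below imagine user-defined datatypes plugging into these same interfaces.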
>
> Simple Numerical Types
> """"""""""""""""""""""
>
> Mainly used where memory is a consideration, lower-precision numeric
> types such as `bfloat16
> <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_
> are common in other computational frameworks.
> For these types the definitions of things such as ``np.common_type``
> and ``np.can_cast`` are some of the most important interfaces. Once
> they support ``np.common_type``, it is (for the most part) possible to
> find the correct ufunc loop to call, since most ufuncs -- such as add
> -- effectively only require ``np.result_type``::
>
>     >>> np.add(arr1, arr2).dtype == np.result_type(arr1, arr2)
>
> and `~numpy.result_type` is largely identical to
> `~numpy.common_type`.
>
>
> Fixed, high precision math
> """"""""""""""""""""""""""
>
> Allowing arbitrary precision or higher precision math is important in
> simulations. For instance ``mpmath`` defines a precision::
>
>     >>> import mpmath as mp
>     >>> print(mp.dps)  # the current (default) precision
>     15
>
> NumPy should be able to construct a native, memory-efficient array
> from a list of ``mpmath.mpf`` floating point objects::
>
>     >>> arr_15_dps = np.array(mp.arange(3))  # (mp.arange returns a list)
>     >>> print(arr_15_dps)  # Must find the correct precision from the objects:
>     array(['0.0', '1.0', '2.0'], dtype=mpf[dps=15])
>
> We should also be able to specify the desired precision when creating
> the datatype for an array. Here, we use ``np.dtype[mp.mpf]`` to find
> the DType class (the notation is not part of this NEP), which is then
> instantiated with the desired parameter.
> This could also be written as an ``MpfDType`` class::
>
>     >>> arr_100_dps = np.array([1, 2, 3], dtype=np.dtype[mp.mpf](dps=100))
>     >>> print(arr_15_dps + arr_100_dps)
>     array(['0.0', '2.0', '4.0'], dtype=mpf[dps=100])
>
> The ``mpf`` datatype can decide that the result of the operation
> should be the higher precision one of the two, so it uses a precision
> of 100.
> Furthermore, we should be able to define casting, for example as in::
>
>     >>> np.can_cast(arr_15_dps.dtype, arr_100_dps.dtype, casting="safe")
>     True
>     >>> np.can_cast(arr_100_dps.dtype, arr_15_dps.dtype, casting="safe")
>     False  # loses precision
>     >>> np.can_cast(arr_100_dps.dtype, arr_100_dps.dtype, casting="same_kind")
>     True
>
> Casting from float is probably always at least a ``same_kind`` cast,
> but in general, it is not safe::
>
>     >>> np.can_cast(np.float64, np.dtype[mp.mpf](dps=4), casting="safe")
>     False
>
> since a float64 has a higher precision than the ``mpf`` datatype with
> ``dps=4``.
>
> Alternatively, we can say that::
>
>     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.dtype[mp.mpf](dps=10))
>     np.dtype[mp.mpf](dps=10)
>
> And possibly even::
>
>     >>> np.common_type(np.dtype[mp.mpf](dps=5), np.float64)
>     np.dtype[mp.mpf](dps=16)  # equivalent precision to float64 (I believe)
>
> since ``np.float64`` can be cast to a ``np.dtype[mp.mpf](dps=16)``
> safely.
>
>
> Categoricals
> """"""""""""
>
> Categoricals are interesting in that they can have fixed, predefined
> values, or can be dynamic with the ability to modify categories when
> necessary.
> Fixed categories (defined ahead of time) are the most straightforward
> categorical definition.
> Categoricals are *hard*, since there are many strategies to implement
> them, suggesting NumPy should only provide the scaffolding for
> user-defined categorical types.
> For instance::
>
>     >>> cat = Categorical(["eggs", "spam", "toast"])
>     >>> breakfast = array(["eggs", "spam", "eggs", "toast"], dtype=cat)
>
> could store the array very efficiently, since it knows that there are
> only 3 categories.
> Since a categorical in this sense knows almost nothing about the data
> stored in it, few operations make sense, although equality does::
>
>     >>> breakfast2 = array(["eggs", "eggs", "eggs", "eggs"], dtype=cat)
>     >>> breakfast == breakfast2
>     array([True, False, True, False])
>
> The categorical datatype could work like a dictionary: no two item
> names can be equal (checked on dtype creation), so that the equality
> operation above can be performed very efficiently.
> If the values define an order, the category labels (internally
> integers) could be ordered the same way to allow efficient sorting
> and comparison.
>
> Whether or not casting is defined from one categorical with fewer
> values to one with strictly more values defined is something that the
> Categorical datatype would need to decide. Both options should be
> available.
>
>
> Unit on the Datatype
> """"""""""""""""""""
>
> There are different ways to define units, depending on how the
> internal machinery would be organized. One way is to have a single
> Unit datatype for every existing numerical type.
> This will be written as ``Unit[float64]``; the unit itself is part of
> the DType instance, i.e. ``Unit[float64]("m")`` is a ``float64`` with
> meters attached::
>
>     >>> from astropy import units
>     >>> meters = np.array([1, 2, 3], dtype=np.float64) * units.m  # meters
>     >>> print(meters)
>     array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
>
> Note that units are a bit tricky. It is debatable whether::
>
>     >>> np.array([1.0, 2.0, 3.0], dtype=Unit[float64]("m"))
>
> should be valid syntax (coercing the float scalars without a unit to
> meters).
> Once the array is created, math will work without any issue::
>
>     >>> meters / (2 * units.s)
>     array([0.5, 1.0, 1.5], dtype=Unit[float64]("m/s"))
>
> Casting is not valid from one unit to the other, but can be valid
> between different scales of the same dimensionality (although this
> may be "unsafe")::
>
>     >>> meters.astype(Unit[float64]("s"))
>     TypeError: Cannot cast meters to seconds.
>     >>> meters.astype(Unit[float64]("km"))
>     >>> # Convert to centimeter-gram-second (cgs) units:
>     >>> meters.astype(meters.dtype.to_cgs())
>
> The above notation is somewhat clumsy. Functions could be used
> instead to convert between units.
> There may be ways to make these more convenient, but those must be
> left for future discussions::
>
>     >>> units.convert(meters, "km")
>     >>> units.to_cgs(meters)
>
> There are some open questions. For example, whether additional
> methods on the array object could exist to simplify some of the
> notions, and how these would percolate from the datatype to the
> ``ndarray``.
>
> The interaction with other scalars would likely be defined through::
>
>     >>> np.common_type(np.float64, Unit)
>     Unit[np.float64](dimensionless)
>
> Ufunc output datatype determination can be more involved than for
> simple numerical dtypes, since there is no "universal" output type::
>
>     >>> np.multiply(meters, seconds).dtype != np.result_type(meters, seconds)
>
> In fact ``np.result_type(meters, seconds)`` must error without
> context of the operation being done.
> This example highlights how the specific ufunc loop
> (the loop with known, specific DTypes as inputs) has to be able to
> make certain decisions before the actual calculation can start.
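The per-operation output-dtype resolution that the unit example relies on can be sketched in a few lines of plain Python. This is a toy, not NumPy API; ``UnitDType`` and ``divide_result_dtype`` are hypothetical names standing in for a parametric DType and one ufunc-specific resolution step:

```python
from dataclasses import dataclass

# Toy sketch: the unit lives on the dtype instance, and each operation
# derives its output dtype from the input dtypes (not from a universal rule).

@dataclass(frozen=True)
class UnitDType:
    unit: str  # e.g. "m", "s" -- a real implementation would parse units

def divide_result_dtype(a: UnitDType, b: UnitDType) -> UnitDType:
    """Output-dtype resolution for division: combine the unit strings."""
    return UnitDType(unit=f"{a.unit}/{b.unit}")

meters = UnitDType("m")
seconds = UnitDType("s")
speed = divide_result_dtype(meters, seconds)
assert speed == UnitDType("m/s")
```

Note that multiplication would need a *different* resolution function, which is exactly why a context-free ``np.result_type(meters, seconds)`` cannot give an answer.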
>
>
> Implementation
> --------------
>
> Plan to Approach the Full Refactor
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> To address these issues in NumPy and enable new datatypes, multiple
> development stages are required:
>
> * Phase I: Restructure and extend the datatype infrastructure (this
>   NEP)
>
>   * Organize datatypes like normal Python classes (`PR 15508`_)
>
> * Phase II: Incrementally define or rework API
>
>   * Create a new and easily extensible API for defining new datatypes
>     and related functionality. (NEP 42)
>
>   * Incrementally define all necessary functionality through the new
>     API (NEP 42):
>
>     * Defining operations such as ``np.common_type``.
>     * Defining casting between datatypes.
>     * Adding functionality necessary to create a NumPy array from
>       Python scalars (i.e. ``np.array(...)``).
>     * …
>
>   * Restructure how universal functions work (NEP 43), in order to:
>
>     * make it possible for a `~numpy.ufunc` such as ``np.add`` to be
>       extended by user-defined datatypes such as Units.
>
>     * allow efficient lookup of the correct implementation for
>       user-defined datatypes.
>
>     * enable reuse of existing code. Units should be able to use the
>       normal math loops and add additional logic to determine the
>       output type.
>
> * Phase III: Growth of NumPy and Scientific Python ecosystem
>   capabilities:
>
>   * Cleanup of legacy behaviour where it is considered buggy or
>     undesirable.
>   * Provide a path to define new datatypes from Python.
>   * Assist the community in creating types such as Units or
>     Categoricals.
>   * Allow strings to be used in functions such as ``np.equal`` or
>     ``np.add``.
>   * Remove legacy code paths within NumPy to improve long term
>     maintainability.
>
> This document serves as a basis for Phase I and provides the vision
> and motivation for the full project.
> Phase I does not introduce any new user-facing features,
> but is concerned with the necessary conceptual cleanup of the current
> datatype system.
> It provides a more "pythonic" datatype Python type object, with a
> clear class hierarchy.
>
> The second phase is the incremental creation of all APIs necessary to
> define fully featured datatypes and the reorganization of the NumPy
> datatype system.
> This phase will thus primarily be concerned with defining an,
> initially preliminary, stable public API.
>
> Some of the benefits of a large refactor may only become evident
> after the full deprecation of the current legacy implementation (i.e.
> larger code removals).
> However, these steps are necessary for improvements to many parts of
> the core NumPy API, and are expected to make the implementation
> generally easier to understand.
>
> The following figure illustrates the proposed design at a high level
> and roughly delineates the components of the overall design.
> Note that this NEP concerns only Phase I (shaded area); the rest
> encompasses Phase II and those design choices are up for discussion.
> However, it highlights that the DType class is the central, necessary
> concept:
>
> .. image:: _static/nep-0041-mindmap.svg
>
>
> First steps directly related to this NEP
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The required changes to NumPy are large and touch many areas of the
> code base, but many of these changes can be addressed incrementally.
>
> To enable an incremental approach we will start by creating a
> C-defined ``PyArray_DTypeMeta`` class with its instances being the
> ``DType`` classes, subclasses of ``np.dtype``.
> This is necessary to add the ability to store custom slots on the
> DType in C.
> This ``DTypeMeta`` will be implemented first to then enable
> incremental restructuring of current code.
>
> The addition of ``DType`` will then enable addressing other changes
> incrementally, some of which may begin before settling the full
> internal API:
>
> 1.
>    New machinery for array coercion, with the goal of enabling user
>    DTypes with appropriate class methods.
> 2. The replacement or wrapping of the current casting machinery.
> 3. Incremental redefinition of the current ``PyArray_ArrFuncs`` slots
>    into DType method slots.
>
> At this point, no or only very limited new public API will be added,
> and the internal API is considered to be in flux.
> Any new public API may be set up to give warnings and will have
> leading underscores to indicate that it is not finalized and can be
> changed without warning.
>
>
> Backward compatibility
> ----------------------
>
> While the actual backward compatibility impact of implementing Phases
> I and II is not yet fully clear, we anticipate, and accept, the
> following changes:
>
> * **Python API**:
>
>   * ``type(np.dtype("f8"))`` will be a subclass of ``np.dtype``,
>     while right now ``type(np.dtype("f8")) is np.dtype``.
>     Code should use ``isinstance`` checks, and in very rare cases may
>     have to be adapted to use them.
>
> * **C-API**:
>
>   * In old versions of NumPy ``PyArray_DescrCheck`` is a macro which
>     uses ``type(dtype) is np.dtype``. When compiling against an old
>     NumPy version, the macro may have to be replaced with the
>     corresponding ``PyObject_IsInstance`` call. (If this is a
>     problem, we could backport fixing the macro.)
>
>   * The UFunc machinery changes will break *limited* parts of the
>     current implementation. Replacing e.g. the default
>     ``TypeResolver`` is expected to remain supported for a time,
>     although optimized masked inner-loop iteration
>     (which is not even used *within* NumPy) will no longer be
>     supported.
>
>   * All functions currently defined on the dtypes, such as
>     ``PyArray_Descr->f->nonzero``, will be defined and accessed
>     differently. This means that in the long run low-level access
>     code will have to be changed to use the new API. Such changes are
>     expected to be necessary in very few projects.
>
> * **dtype implementors (C-API)**:
>
>   * The array which is currently provided to some functions (such as
>     cast functions) will no longer be provided.
>     For example ``PyArray_Descr->f->nonzero`` or
>     ``PyArray_Descr->f->copyswapn`` may instead receive a dummy array
>     object with only some fields (mainly the dtype) being valid.
>     At least in some code paths, a similar mechanism is already used.
>
>   * The ``scalarkind`` slot and registration of scalar casting will
>     be removed/ignored without replacement.
>     It currently allows partial value-based casting.
>     The ``PyArray_ScalarKind`` function will continue to work for
>     builtin types, but will not be used internally and will be
>     deprecated.
>
>   * Currently user dtypes are defined as instances of ``np.dtype``.
>     The creation works by the user providing a prototype instance.
>     NumPy will need to modify at least the type during registration.
>     This has no effect for either ``rational`` or ``quaternion``, and
>     mutation of the structure seems unlikely after registration.
>
> Since there is a fairly large API surface concerning datatypes,
> further changes, or the limitation of certain functions to currently
> existing datatypes, are likely to occur.
> For example functions which use the type number as input should be
> replaced with functions taking DType classes instead.
> Although public, large parts of this C-API seem to be used rarely,
> possibly never, by downstream projects.
>
>
>
> Detailed Description
> --------------------
>
> This section details the design decisions covered by this NEP.
> The subsections correspond to the list of design choices presented
> in the Scope section.
>
> Datatypes as Python Classes (1)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The current NumPy datatypes are not full-scale Python classes.
> They are instead (prototype) instances of a single ``np.dtype``
> class.
> Changing this means that any special handling, e.g.
> for ``datetime``, can be moved to the Datetime DType class instead,
> away from monolithic general code (e.g. the current
> ``PyArray_AdjustFlexibleDType``).
>
> The main consequence of this change with respect to the API is that
> special methods move from the dtype instances to methods on the new
> DType class.
> This is the typical design pattern used in Python.
> Organizing these methods and information in a more Pythonic way
> provides a solid foundation for refining and extending the API in the
> future.
> The current API cannot be extended due to how it is exposed publicly.
> This means for example that the methods currently stored in
> ``PyArray_ArrFuncs`` on each datatype (see NEP 40) will be defined
> differently in the future and deprecated in the long run.
>
> The most prominent visible side effect of this will be that
> ``type(np.dtype(np.float64))`` will not be ``np.dtype`` anymore.
> Instead it will be a subclass of ``np.dtype``, meaning that
> ``isinstance(np.dtype(np.float64), np.dtype)`` will remain true.
> This will also add the ability to use
> ``isinstance(dtype, np.dtype[float64])``, thus removing the need to
> use ``dtype.kind``, ``dtype.char``, or ``dtype.type`` to do this
> check.
>
> With the design decision of DTypes as full-scale Python classes, the
> question of subclassing arises.
> Inheritance, however, appears problematic and a complexity best
> avoided (at least initially) for container datatypes.
> Further, subclasses may be more interesting for interoperability, for
> example with GPU backends (CuPy), storing additional methods related
> to the GPU, rather than serving as a mechanism to define new
> datatypes.
> While a class hierarchy does provide value, this may be achieved by
> allowing the creation of *abstract* datatypes.
> An example of an abstract datatype would be the datatype equivalent
> of ``np.floating``, representing any floating point number.
> These can serve the same purpose as Python's abstract base classes.
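The role such abstract datatypes would play can be sketched with Python's own ``abc`` machinery. Again a toy, not NumPy API; ``Floating`` and ``Float64DType`` are illustrative stand-ins for an abstract DType and a concrete one:

```python
from abc import ABCMeta

# Toy sketch: an abstract "Floating" datatype plays the role of
# np.floating -- it cannot be instantiated, only checked against,
# replacing dtype.kind/char/type inspection with isinstance checks.

class Floating(metaclass=ABCMeta):
    def __new__(cls, *args, **kwargs):
        if cls is Floating:
            raise TypeError("abstract datatype cannot be instantiated")
        return super().__new__(cls)

class Float64DType(Floating):
    itemsize = 8  # storage information stays on the concrete class/instance

f64 = Float64DType()
assert isinstance(f64, Floating)
assert issubclass(Float64DType, Floating)
try:
    Floating()
except TypeError:
    pass  # abstract datatypes are check-only, as intended
```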
>
>
> Scalars should not be instances of the datatypes (2)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> For simple datatypes such as ``float64`` (see also below), it seems
> tempting that the instance of a ``np.dtype("float64")`` could be the
> scalar.
> This idea may be even more appealing due to the fact that scalars,
> rather than datatypes, currently define a useful type hierarchy.
>
> However, we have specifically decided against this for a number of
> reasons.
> First, the new datatypes described herein would be instances of DType
> classes. Making these instances themselves classes, while possible,
> adds additional complexity that users need to understand.
> It would also mean that scalars must carry storage information (such
> as byteorder) which is generally unnecessary and currently not used.
> Second, while the simple NumPy scalars such as ``float64`` may be
> such instances, it should be possible to create datatypes for Python
> objects without enforcing NumPy as a dependency.
> However, Python objects that do not depend on NumPy cannot be
> instances of a NumPy DType.
> Third, there is a mismatch between the methods and attributes which
> are useful for scalars and datatypes. For instance ``to_float()``
> makes sense for a scalar but not for a datatype, and
> ``newbyteorder`` is not useful on a scalar (or has a different
> meaning).
>
> Overall, it seems that rather than reducing complexity by merging the
> two distinct type hierarchies, making scalars instances of DTypes
> would increase the complexity of both the design and implementation.
>
> A possible future path may be to instead simplify the current NumPy
> scalars to be much simpler objects which largely derive their
> behaviour from the datatypes.
>
> C-API for creating new Datatypes (3)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The current C-API with which users can create new datatypes is
> limited in scope, and requires use of "private" structures.
This means the API is not extensible: no new members can be added to the
structure without losing binary compatibility.
This has already limited the inclusion of new sorting methods into
NumPy [new_sort]_.

The new version shall thus replace the current ``PyArray_ArrFuncs``
structure used to define new datatypes.
Datatypes that currently exist and are defined using these slots will be
supported during a deprecation period.

The most likely solution to hide the implementation from the user, and thus
make it extensible in the future, is to model the API after Python's stable
API [PEP-384]_:

.. code-block:: C

    static struct PyArrayMethodDef slots[] = {
        {NPY_dt_method, method_implementation},
        ...,
        {0, NULL}
    };

    typedef struct {
        PyTypeObject *typeobj;  /* type of python scalar */
        ...;
        PyType_Slot *slots;
    } PyArrayDTypeMeta_Spec;

    PyObject* PyArray_InitDTypeMetaFromSpec(
            PyArray_DTypeMeta *user_dtype,
            PyArrayDTypeMeta_Spec *dtype_spec);

The C-side slots should be designed to mirror Python-side methods such as
``dtype.__dtype_method__``, although the exposure to Python is a later
step in the implementation to reduce the complexity of the initial
implementation.


C-API Changes to the UFunc Machinery (4)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Proposed changes to the UFunc machinery will be part of NEP 43.
However, the following changes will be necessary (see NEP 40 for a
detailed description of the current implementation and its issues):

* The current UFunc type resolution must be adapted to allow better
  control for user-defined dtypes as well as resolve current
  inconsistencies.
* The inner-loop used in UFuncs must be expanded to include a return
  value.
  Further, error reporting must be improved, and passing in dtype-specific
  information enabled.
  This requires the modification of the inner-loop function signature and
  the addition of new hooks called before and after the inner-loop is
  used.

An important goal for any changes to the universal functions will be to
allow the reuse of existing loops.
It should be easy for a new units datatype to fall back to existing math
functions after handling the unit-related computations.


Discussion
----------

See NEP 40 for a list of previous meetings and discussions.


References
----------

.. [pandas_extension_arrays]
   https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types

.. _xarray_dtype_issue: https://github.com/pydata/xarray/issues/1262

.. [pygeos] https://github.com/caspervdw/pygeos

.. [new_sort] https://github.com/numpy/numpy/pull/12945

.. [PEP-384] https://www.python.org/dev/peps/pep-0384/

.. [PR 15508] https://github.com/numpy/numpy/pull/15508


Copyright
---------

This document has been placed in the public domain.


Acknowledgments
---------------

The effort to create new datatypes for NumPy has been discussed for
several years in many different contexts and settings, making it
impossible to list everyone involved.
We would like to thank especially Stephan Hoyer, Nathaniel Smith, and Eric
Wieser for repeated in-depth discussion about datatype design.
We are very grateful for the community input in reviewing and revising
this NEP and would like to thank especially Ross Barnowski and Ralf
Gommers.

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion