The NEP was merged in draft form; see below.

https://numpy.org/neps/nep-0055-string_dtype.html

On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldb...@gmail.com> wrote:

> Hello all,
>
> I just opened a pull request to add NEP 55, see
> https://github.com/numpy/numpy/pull/24483.
>
> Per NEP 0, I've copied everything up to the "detailed description" section
> below.
>
> I'm looking forward to your feedback on this.
>
> -Nathan Goldbaum
>
> =========================================================
> NEP 55 — Add a UTF-8 Variable-Width String DType to NumPy
> =========================================================
>
> :Author: Nathan Goldbaum <ngoldb...@quansight.com>
> :Status: Draft
> :Type: Standards Track
> :Created: 2023-06-29
>
>
> Abstract
> --------
>
> We propose adding a new string data type to NumPy where each item in the
> array is an arbitrary-length UTF-8 encoded string. This will enable
> performance, memory usage, and usability improvements for NumPy users,
> including:
>
> * Memory savings for workflows that currently use fixed-width strings and
>   store primarily ASCII data or a mix of short and long strings in a
>   single NumPy array.
>
> * Downstream libraries and users will be able to move away from object
>   arrays currently used as a substitute for variable-length string arrays,
>   unlocking performance improvements by avoiding passes over the data
>   outside of NumPy.
>
> * A more intuitive user-facing API for working with arrays of Python
>   strings, without a need to think about the in-memory array
>   representation.
>
> Motivation and Scope
> --------------------
>
> First, we will describe how the current state of support for string or
> string-like data in NumPy arose. Next, we will summarize the last major
> previous discussion about this topic. Finally, we will describe the scope
> of the proposed changes to NumPy as well as changes that are explicitly
> out of scope of this proposal.
>
> History of String Support in NumPy
> **********************************
>
> Support in NumPy for textual data evolved organically in response to early
> user needs and, later, to changes in the Python ecosystem.
>
> Support for strings was added to NumPy to support users of the NumArray
> ``chararray`` type. Remnants of this are still visible in the NumPy API:
> string-related functionality lives in ``np.char``, to support the obsolete
> ``np.char.chararray`` class, deprecated since NumPy 1.4 in favor of string
> DTypes.
>
> NumPy's ``bytes_`` DType was originally used to represent the Python 2
> ``str`` type before Python 3 support was added to NumPy. The bytes DType
> makes the most sense when it is used to represent Python 2 strings or
> other null-terminated byte sequences. However, ignoring data after the
> first null character means the ``bytes_`` DType is only suitable for
> bytestreams that do not contain nulls, so it is a poor match for generic
> bytestreams.
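As a concrete illustration of the null-termination problem described above (a minimal sketch by way of example, not code from the NEP):

```python
import numpy as np

# The fixed-width bytes_ DType treats trailing nulls as padding and drops
# them on retrieval, so arbitrary bytestreams do not round-trip.
arr = np.array([b"ab\x00\x00"], dtype="S4")

print(arr[0])       # b'ab' -- the trailing null bytes are gone
print(len(arr[0]))  # 2, not 4
```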
>
> The ``unicode`` DType was added to support the Python 2 ``unicode`` type.
> It stores data in 32-bit UCS-4 codepoints (i.e. a UTF-32 encoding), which
> makes for a straightforward implementation but is inefficient for storing
> text that could be represented well using a one-byte ASCII or Latin-1
> encoding. This was not a problem in Python 2, where ASCII or mostly-ASCII
> text could use the Python 2 ``str`` type (the current ``bytes_`` DType).
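The four-bytes-per-character overhead is easy to observe (a small illustration, not part of the NEP text):

```python
import numpy as np

# A five-character ASCII string gets dtype <U5: 4 bytes per code point.
arr = np.array(["hello"])

print(arr.dtype)                     # <U5
print(arr.itemsize)                  # 20 bytes per element
print(len("hello".encode("utf-8")))  # 5 bytes would suffice in UTF-8
```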
>
> With the arrival of Python 3 support in NumPy, the string DTypes were
> largely left alone due to backward compatibility concerns, although the
> unicode DType became the default DType for ``str`` data and the old
> ``string`` DType was renamed the ``bytes_`` DType. This change left NumPy
> in the sub-optimal situation of shipping a data type originally intended
> for null-terminated bytestrings as the data type for *all* Python
> ``bytes`` data, and a default string type with an in-memory representation
> that consumes four times as much memory as needed for ASCII or
> mostly-ASCII data.
>
> Problems with Fixed-Width Strings
> *********************************
>
> Both existing string DTypes represent fixed-width sequences, allowing
> storage of the string data in the array buffer. This avoids adding
> out-of-band storage to NumPy; however, it makes for an awkward user
> interface. In particular, the maximum string size must be inferred by
> NumPy or estimated by the user before loading the data into a NumPy array
> or selecting an output DType for string operations. In the worst case,
> this requires an expensive pass over the full dataset to calculate the
> maximum length of an array element. It also wastes memory when array
> elements have varying lengths. Pathological cases where an array stores
> many short strings and a few very long strings are particularly bad for
> wasting memory.
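The pathological case is easy to demonstrate (an illustrative sketch, not taken from the NEP):

```python
import numpy as np

# One long element forces every element to reserve the maximum width.
data = ["short"] * 1000 + ["x" * 1000]
arr = np.array(data)

print(arr.dtype)   # <U1000: sized for the single longest string
print(arr.itemsize)  # 4000 bytes reserved per element
print(arr.nbytes)  # 1001 elements x 1000 chars x 4 bytes each
```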
>
> Downstream usage of string data in NumPy arrays has proven out the need
> for a variable-width string data type. In practice, most downstream users
> employ ``object`` arrays for this purpose. In particular, ``pandas`` has
> explicitly deprecated support for NumPy fixed-width strings, coerces NumPy
> fixed-width string arrays to ``object`` arrays, and in the future may
> switch to only supporting string data via ``PyArrow``, which has native
> support for UTF-8 encoded variable-width string arrays [1]_. This is
> unfortunate, since ``object`` arrays have no type guarantees, necessitate
> expensive sanitization passes, and prevent operations from releasing the
> GIL.
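The lack of type guarantees means any consumer of an object string array must fall back to a Python-level sanitization pass, along these lines (a minimal sketch):

```python
import numpy as np

# Nothing stops non-string objects from ending up in an "object string" array.
arr = np.array(["a", "b", 3, None], dtype=object)

# Consumers must check element by element, holding the GIL throughout.
all_strings = all(isinstance(x, str) for x in arr)
print(all_strings)  # False
```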
>
> Previous Discussions
> --------------------
>
> The project last discussed this topic in depth in 2017, when Julian Taylor
> proposed a fixed-width text data type parameterized by an encoding [2]_.
> This started a wide-ranging discussion about pain points for working with
> string data in NumPy and possible ways forward.
>
> In the end, the discussion identified two use-cases that the current
> support for strings does a poor job of handling:
>
> * Loading or memory-mapping scientific datasets with unknown encoding,
> * Working with string data in a manner that allows transparent conversion
>   between NumPy arrays and Python strings, including support for missing
>   strings.
>
> As a result of this discussion, improving support for string data was
> added to the NumPy project roadmap [3]_, with an explicit call-out to add
> a DType better suited to memory-mapping bytes with any or no encoding, and
> a variable-width string DType that supports missing data to replace usages
> of object string arrays.
>
> Proposed work
> -------------
>
> This NEP proposes adding ``StringDType``, a DType that stores
> variable-width heap-allocated strings in NumPy arrays, to replace
> downstream usages of the ``object`` DType for string data. This work will
> heavily leverage recent improvements in NumPy's support for user-defined
> DTypes, so we will also necessarily be working on the data type internals
> in NumPy. In particular, we propose to:
>
> * Add a new variable-length string DType to NumPy, targeting NumPy 2.0.
>
> * Work out issues related to adding a DType implemented using the
>   experimental DType API to NumPy itself.
>
> * Add support for a user-provided missing data sentinel.
>
> * Clean up ``np.char``, moving the ufunc-like functions to a new namespace
>   for functions and types related to string support.
>
> * Update the ``npy`` and ``npz`` file formats to allow storage of
>   arbitrary-length sidecar data.
>
> The following is out of scope for this work:
>
> * Changing DType inference for string data.
>
> * Adding a DType for memory-mapping text in unknown encodings or a DType
>   that attempts to fix issues with the ``bytes_`` DType.
>
> * Fully agreeing on the semantics of a missing data sentinel or adding a
>   missing data sentinel to NumPy itself.
>
> * Implementing fast ufuncs or SIMD optimizations for string operations.
>
> While we're explicitly ruling out implementing these items as part of this
> work, adding a new string DType helps set up future work that does
> implement some of these items.
>
> If implemented, this NEP will make it easier to add a new fixed-width text
> DType in the future by moving string operations into a long-term supported
> namespace. We are also proposing a memory layout that should be amenable
> to writing fast ufuncs and SIMD optimization in some cases, increasing the
> payoff for writing string operations as SIMD-optimized ufuncs in the
> future.
>
> While we are not proposing adding a missing data sentinel to NumPy, we are
> proposing adding support for an optional, user-provided missing data
> sentinel, so this does move NumPy a little closer to officially supporting
> missing data. We are attempting to avoid resolving the disagreement
> described in :ref:`NEP 26<NEP26>`, and this proposal does not require or
> preclude adding a missing data sentinel or bitflag-based missing data
> support in the future.
>
> Usage and Impact
> ----------------
>
> The DType is intended as a drop-in replacement for object string arrays.
> This means that we intend to support as many downstream usages of object
> string arrays as possible, including all supported NumPy functionality.
> Pandas is the obvious first user, and substantial work has already
> occurred to add support in a fork of Pandas. ``scikit-learn`` also uses
> object string arrays and will be able to migrate to a DType with
> guarantees that the arrays contain only strings. Both h5py [4]_ and
> PyTables [5]_ will be able to add first-class support for variable-width
> UTF-8 encoded string datasets in HDF5. String data are heavily used in
> machine-learning workflows, and downstream machine learning libraries will
> be able to leverage this new DType.
>
> Users who wish to load string data into NumPy and leverage NumPy features
> like advanced (fancy) indexing will have a natural choice that offers
> substantial memory savings over fixed-width unicode strings, along with
> better validation guarantees and better overall integration with NumPy
> than object string arrays. Moving to a first-class string DType also
> removes the need to acquire the GIL during string operations, unlocking
> future optimizations that are impossible with object string arrays.
>
> Performance
> ***********
>
> Here we briefly describe preliminary performance measurements of the
> prototype version of ``StringDType`` we have implemented outside of NumPy
> using the experimental DType API. All benchmarks in this section were
> performed on a Dell XPS 13 9380 running Ubuntu 22.04 and Python 3.11.3
> compiled using pyenv. NumPy, Pandas, and the ``StringDType`` prototype
> were all compiled with meson release builds.
>
> Currently, the ``StringDType`` prototype has performance comparable to
> object arrays and fixed-width string arrays. One exception is array
> creation from Python strings, where performance is somewhat slower than
> object arrays and comparable to fixed-width unicode arrays::
>
>     In [1]: from stringdtype import StringDType
>
>     In [2]: import numpy as np
>
>     In [3]: data = [str(i) * 10 for i in range(100_000)]
>
>     In [4]: %timeit arr_object = np.array(data, dtype=object)
>     3.55 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
>     In [5]: %timeit arr_stringdtype = np.array(data, dtype=StringDType())
>     12.9 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
>     In [6]: %timeit arr_strdtype = np.array(data, dtype=str)
>     11.7 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
> In this example, the object DType is substantially faster because the
> objects in the ``data`` list can be directly interned in the array, while
> the fixed-width ``str`` DType and ``StringDType`` need to copy the string
> data, and ``StringDType`` additionally needs to convert the data to UTF-8
> and perform heap allocations outside the array buffer. In the future, if
> Python moves to a UTF-8 internal representation for strings, the string
> loading performance of ``StringDType`` should improve.
>
> String operations have similar performance::
>
>     In [7]: %timeit np.array([s.capitalize() for s in data], dtype=object)
>     30.2 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
>     In [8]: %timeit np.char.capitalize(arr_stringdtype)
>     38.5 ms ± 3.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
>     In [9]: %timeit np.char.capitalize(arr_strdtype)
>     46.4 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> The poor performance here is a reflection of the slow iterator-based
> implementation of operations in ``np.char``. If we were to rewrite these
> operations as ufuncs, we could unlock substantial performance
> improvements. Take the example of the ``add`` ufunc, which we have
> implemented for the ``StringDType`` prototype::
>
>     In [10]: %timeit arr_object + arr_object
>     10 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
>     In [11]: %timeit arr_stringdtype + arr_stringdtype
>     5.91 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
>     In [12]: %timeit np.char.add(arr_strdtype, arr_strdtype)
>     65.9 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> As described below, we have already updated a fork of Pandas to use a
> prototype version of ``StringDType``. This demonstrates the performance
> improvements available when data are already loaded into a NumPy array and
> are passed to a third-party library. Currently, Pandas attempts to coerce
> all ``str`` data to ``object`` DType by default and has to check and
> sanitize existing ``object`` arrays that are passed in. This requires a
> copy or a pass over the data that is made unnecessary by first-class
> support for variable-width strings in both NumPy and Pandas::
>
>     In [13]: import pandas as pd
>
>     In [14]: %timeit pd.Series(arr_stringdtype)
>     20.9 µs ± 341 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>
>     In [15]: %timeit pd.Series(arr_object)
>     1.08 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>
> We have also implemented a Pandas extension DType that uses
> ``StringDType`` under the hood, which is also substantially faster for
> creating Pandas data structures than the existing Pandas string DType that
> uses ``object`` arrays::
>
>     In [16]: %timeit pd.Series(arr_stringdtype, dtype='string[numpy]')
>     54.7 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>
>     In [17]: %timeit pd.Series(arr_object, dtype='string[python]')
>     1.39 ms ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>
> Backward compatibility
> ----------------------
>
> We are not proposing a change to DType inference for Python strings and do
> not expect any impacts on existing usages of NumPy, besides warnings or
> errors related to new or expiring deprecations in ``np.char``.
>
>
>
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/