The NEP was merged in draft form; see below. https://numpy.org/neps/nep-0055-string_dtype.html
On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldb...@gmail.com> wrote:

> Hello all,
>
> I just opened a pull request to add NEP 55, see
> https://github.com/numpy/numpy/pull/24483.
>
> Per NEP 0, I've copied everything up to the "detailed description" section
> below.
>
> I'm looking forward to your feedback on this.
>
> -Nathan Goldbaum
>
> =========================================================
> NEP 55 — Add a UTF-8 Variable-Width String DType to NumPy
> =========================================================
>
> :Author: Nathan Goldbaum <ngoldb...@quansight.com>
> :Status: Draft
> :Type: Standards Track
> :Created: 2023-06-29
>
>
> Abstract
> --------
>
> We propose adding a new string data type to NumPy where each item in the
> array is an arbitrary-length UTF-8 encoded string. This will enable
> performance, memory usage, and usability improvements for NumPy users,
> including:
>
> * Memory savings for workflows that currently use fixed-width strings and
>   store primarily ASCII data or a mix of short and long strings in a
>   single NumPy array.
>
> * Downstream libraries and users will be able to move away from object
>   arrays currently used as a substitute for variable-length string
>   arrays, unlocking performance improvements by avoiding passes over the
>   data outside of NumPy.
>
> * A more intuitive user-facing API for working with arrays of Python
>   strings, without a need to think about the in-memory array
>   representation.
>
> Motivation and Scope
> --------------------
>
> First, we will describe how the current state of support for string or
> string-like data in NumPy arose. Next, we will summarize the most recent
> major discussion about this topic. Finally, we will describe the scope of
> the proposed changes to NumPy as well as changes that are explicitly out
> of scope of this proposal.
>
> History of String Support in NumPy
> **********************************
>
> Support in NumPy for textual data evolved organically in response to
> early user needs and then changes in the Python ecosystem.
>
> Support for strings was added to NumPy to support users of the NumArray
> ``chararray`` type. Remnants of this are still visible in the NumPy API:
> string-related functionality lives in ``np.char``, to support the
> obsolete ``np.char.chararray`` class, deprecated since NumPy 1.4 in favor
> of string DTypes.
>
> NumPy's ``bytes_`` DType was originally used to represent the Python 2
> ``str`` type before Python 3 support was added to NumPy. The bytes DType
> makes the most sense when it is used to represent Python 2 strings or
> other null-terminated byte sequences. However, ignoring data after the
> first null character means the ``bytes_`` DType is only suitable for
> bytestreams that do not contain nulls, so it is a poor match for generic
> bytestreams.
>
> The ``unicode`` DType was added to support the Python 2 ``unicode``
> type. It stores data in 32-bit UCS-4 codepoints (e.g. a UTF-32 encoding),
> which makes for a straightforward implementation, but is inefficient for
> storing text that can be represented well using a one-byte ASCII or
> Latin-1 encoding. This was not a problem in Python 2, where ASCII or
> mostly-ASCII text could use the Python 2 ``str`` DType (the current
> ``bytes_`` DType).
>
> With the arrival of Python 3 support in NumPy, the string DTypes were
> largely left alone due to backward compatibility concerns, although the
> unicode DType became the default DType for ``str`` data and the old
> ``string`` DType was renamed the ``bytes_`` DType.
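[Editor's note: both limitations described above are easy to observe with
the existing DTypes; this is a small illustration using only standard NumPy,
with example values chosen for the demonstration.]

```python
import numpy as np

# Null padding is lossy: trailing null bytes are stripped when an
# element of a fixed-width bytes array is read back out.
raw = np.array([b"ab\x00\x00"], dtype="S4")
assert raw[0] == b"ab"  # the trailing nulls are gone

# The fixed-width unicode DType stores UCS-4 codepoints, i.e. four
# bytes per character, even for pure-ASCII text.
ascii_text = np.array(["abc"])
assert ascii_text.dtype.kind == "U"
assert ascii_text.dtype.itemsize == 12  # 3 characters * 4 bytes each
```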
> This change left NumPy with the sub-optimal situation of shipping a data
> type originally intended for null-terminated bytestrings as the data type
> for *all* Python ``bytes`` data, and a default string type with an
> in-memory representation that consumes four times as much memory as
> needed for ASCII or mostly-ASCII data.
>
> Problems with Fixed-Width Strings
> *********************************
>
> Both existing string DTypes represent fixed-width sequences, allowing
> storage of the string data in the array buffer. This avoids adding
> out-of-band storage to NumPy; however, it makes for an awkward user
> interface. In particular, the maximum string size must be inferred by
> NumPy or estimated by the user before loading the data into a NumPy array
> or selecting an output DType for string operations. In the worst case,
> this requires an expensive pass over the full dataset to calculate the
> maximum length of an array element. It also wastes memory when array
> elements have varying lengths. Pathological cases where an array stores
> many short strings and a few very long strings are particularly bad for
> wasting memory.
>
> Downstream usage of string data in NumPy arrays has proven out the need
> for a variable-width string data type. In practice, most downstream users
> employ ``object`` arrays for this purpose. In particular, ``pandas`` has
> explicitly deprecated support for NumPy fixed-width strings, coerces
> NumPy fixed-width string arrays to ``object`` arrays, and in the future
> may switch to only supporting string data via ``PyArrow``, which has
> native support for UTF-8 encoded variable-width string arrays [1]_. This
> is unfortunate, since ``object`` arrays have no type guarantees,
> necessitating expensive sanitization passes, and operations on ``object``
> arrays cannot release the GIL.
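[Editor's note: the pathological mixed-length case described above can be
demonstrated directly with the fixed-width unicode DType; the array sizes
here are illustrative, not from the NEP.]

```python
import numpy as np

# A single long outlier forces every element to the widest size: each
# of the 1000 two-character strings still occupies 100 * 4 bytes.
data = ["hi"] * 1000 + ["x" * 100]
arr = np.array(data)  # NumPy infers the maximum width, here <U100
assert arr.dtype.itemsize == 400   # 100 UCS-4 codepoints per element
assert arr.nbytes == 1001 * 400    # ~400 KB for ~2 KB of actual text
```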
>
> Previous Discussions
> --------------------
>
> The project last discussed this topic in depth in 2017, when Julian
> Taylor proposed a fixed-width text data type parameterized by an encoding
> [2]_. This started a wide-ranging discussion about pain points for
> working with string data in NumPy and possible ways forward.
>
> In the end, the discussion identified two use cases that the current
> support for strings does a poor job of handling:
>
> * Loading or memory-mapping scientific datasets with unknown encoding,
> * Working with string data in a manner that allows transparent conversion
>   between NumPy arrays and Python strings, including support for missing
>   strings.
>
> As a result of this discussion, improving support for string data was
> added to the NumPy project roadmap [3]_, with an explicit call-out to add
> a DType better suited to memory-mapping bytes with any or no encoding,
> and a variable-width string DType that supports missing data to replace
> usages of object string arrays.
>
> Proposed work
> -------------
>
> This NEP proposes adding ``StringDType``, a DType that stores
> variable-width heap-allocated strings in NumPy arrays, to replace
> downstream usages of the ``object`` DType for string data. This work will
> heavily leverage recent improvements in NumPy to improve support for
> user-defined DTypes, so we will also necessarily be working on the data
> type internals in NumPy. In particular, we propose to:
>
> * Add a new variable-length string DType to NumPy, targeting NumPy 2.0.
>
> * Work out issues related to adding a DType implemented using the
>   experimental DType API to NumPy itself.
>
> * Add support for a user-provided missing data sentinel.
>
> * Clean up ``np.char``, moving the ufunc-like functions to a new
>   namespace for functions and types related to string support.
>
> * Update the ``npy`` and ``npz`` file formats to allow storage of
>   arbitrary-length sidecar data.
>
> The following is out of scope for this work:
>
> * Changing DType inference for string data.
>
> * Adding a DType for memory-mapping text in unknown encodings or a DType
>   that attempts to fix issues with the ``bytes_`` DType.
>
> * Fully agreeing on the semantics of missing data sentinels or adding a
>   missing data sentinel to NumPy itself.
>
> * Implementing fast ufuncs or SIMD optimizations for string operations.
>
> While we're explicitly ruling out implementing these items as part of
> this work, adding a new string DType helps set up future work that does
> implement some of these items.
>
> If implemented, this NEP will make it easier to add a new fixed-width
> text DType in the future by moving string operations into a long-term
> supported namespace. We are also proposing a memory layout that should be
> amenable to writing fast ufuncs and SIMD optimization in some cases,
> increasing the payoff for writing string operations as SIMD-optimized
> ufuncs in the future.
>
> While we are not proposing adding a missing data sentinel to NumPy, we
> are proposing adding support for an optional, user-provided missing data
> sentinel, so this does move NumPy a little closer to officially
> supporting missing data. We are attempting to avoid resolving the
> disagreement described in :ref:`NEP 26<NEP26>`, and this proposal does
> not require or preclude adding a missing data sentinel or bitflag-based
> missing data support in the future.
>
> Usage and Impact
> ----------------
>
> The DType is intended as a drop-in replacement for object string arrays.
> This means that we intend to support as many downstream usages of object
> string arrays as possible, including all supported NumPy functionality.
> Pandas is the obvious first user, and substantial work has already
> occurred to add support in a fork of Pandas.
> ``scikit-learn`` also uses object string arrays and will be able to
> migrate to a DType with guarantees that the arrays contain only strings.
> Both h5py [4]_ and PyTables [5]_ will be able to add first-class support
> for variable-width UTF-8 encoded string datasets in HDF5. String data are
> heavily used in machine-learning workflows, and downstream machine
> learning libraries will be able to leverage this new DType.
>
> Users who wish to load string data into NumPy and leverage NumPy features
> like advanced indexing will have a natural choice that offers substantial
> memory savings over fixed-width unicode strings, as well as better
> validation guarantees and overall integration with NumPy than object
> string arrays. Moving to a first-class string DType also removes the need
> to acquire the GIL during string operations, unlocking future
> optimizations that are impossible with object string arrays.
>
> Performance
> ***********
>
> Here we briefly describe preliminary performance measurements of the
> prototype version of ``StringDType`` we have implemented outside of NumPy
> using the experimental DType API. All benchmarks in this section were
> performed on a Dell XPS 13 9380 running Ubuntu 22.04 and Python 3.11.3
> compiled using pyenv. NumPy, Pandas, and the ``StringDType`` prototype
> were all compiled with meson release builds.
>
> Currently, the ``StringDType`` prototype has comparable performance with
> object arrays and fixed-width string arrays. One exception is array
> creation from Python strings, where performance is somewhat slower than
> object arrays and comparable to fixed-width unicode arrays::
>
>   In [1]: from stringdtype import StringDType
>
>   In [2]: import numpy as np
>
>   In [3]: data = [str(i) * 10 for i in range(100_000)]
>
>   In [4]: %timeit arr_object = np.array(data, dtype=object)
>   3.55 ms ± 51.3 µs per loop (mean ± std. dev.
>   of 7 runs, 100 loops each)
>
>   In [5]: %timeit arr_stringdtype = np.array(data, dtype=StringDType())
>   12.9 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
>   In [6]: %timeit arr_strdtype = np.array(data, dtype=str)
>   11.7 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
> In this example, object DTypes are substantially faster because the
> objects in the ``data`` list can be directly interned in the array, while
> the fixed-width ``str`` DType and ``StringDType`` need to copy the string
> data, and ``StringDType`` additionally needs to convert the data to UTF-8
> and perform heap allocations outside the array buffer. In the future, if
> Python moves to a UTF-8 internal representation for strings, the string
> loading performance of ``StringDType`` should improve.
>
> String operations have similar performance::
>
>   In [7]: %timeit np.array([s.capitalize() for s in data], dtype=object)
>   30.2 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
>   In [8]: %timeit np.char.capitalize(arr_stringdtype)
>   38.5 ms ± 3.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
>   In [9]: %timeit np.char.capitalize(arr_strdtype)
>   46.4 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> The poor performance here is a reflection of the slow iterator-based
> implementation of operations in ``np.char``. If we were to rewrite these
> operations as ufuncs, we could unlock substantial performance
> improvements. Using the example of the ``add`` ufunc, which we have
> implemented for the ``StringDType`` prototype::
>
>   In [10]: %timeit arr_object + arr_object
>   10 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
>   In [11]: %timeit arr_stringdtype + arr_stringdtype
>   5.91 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
>   In [12]: %timeit np.char.add(arr_strdtype, arr_strdtype)
>   65.9 ms ± 1.3 ms per loop (mean ± std. dev.
>   of 7 runs, 10 loops each)
>
> As described below, we have already updated a fork of Pandas to use a
> prototype version of ``StringDType``. This demonstrates the performance
> improvements available when data are already loaded into a NumPy array
> and are passed to a third-party library. Currently Pandas attempts to
> coerce all ``str`` data to ``object`` DType by default, and has to check
> and sanitize existing ``object`` arrays that are passed in. This requires
> a copy or pass over the data that is made unnecessary by first-class
> support for variable-width strings in both NumPy and Pandas::
>
>   In [13]: import pandas as pd
>
>   In [14]: %timeit pd.Series(arr_stringdtype)
>   20.9 µs ± 341 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>
>   In [15]: %timeit pd.Series(arr_object)
>   1.08 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>
> We have also implemented a Pandas extension DType that uses
> ``StringDType`` under the hood, which is also substantially faster for
> creating Pandas data structures than the existing Pandas string DType
> that uses ``object`` arrays::
>
>   In [16]: %timeit pd.Series(arr_stringdtype, dtype='string[numpy]')
>   54.7 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>
>   In [17]: %timeit pd.Series(arr_object, dtype='string[python]')
>   1.39 ms ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>
> Backward compatibility
> ----------------------
>
> We are not proposing a change to DType inference for Python strings and
> do not expect to see any impacts on existing usages of NumPy, besides
> warnings or errors related to new deprecations or expiring deprecations
> in ``np.char``.
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com