[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy

Warren Weckesser Sun, 03 Sep 2023 09:55:20 -0700

On Tue, Aug 29, 2023 at 10:09 AM Nathan <[email protected]> wrote:
>
> The NEP was merged in draft form, see below.
>
> https://numpy.org/neps/nep-0055-string_dtype.html
>
> On Mon, Aug 21, 2023 at 2:36 PM Nathan <[email protected]> wrote:
>>
>> Hello all,
>>
>> I just opened a pull request to add NEP 55, see
https://github.com/numpy/numpy/pull/24483.
>>
>> Per NEP 0, I've copied everything up to the "detailed description"
section below.
>>
>> I'm looking forward to your feedback on this.
>>
>> -Nathan Goldbaum
>>


This will be a nice addition to NumPy, and matches a suggestion by
@rkern (and probably others) made in the 2017 mailing list thread;
see the last bullet of

 https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html

So +1 for the enhancement!

Now for some nitty-gritty review...

There is a design change that I think should be made in the
implementation of missing values.

In the current design described in the NEP, and expanded on in the
comment

https://github.com/numpy/numpy/pull/24483#discussion_r1311815944,

the meaning of the values `{len = 0, buf = NULL}` in an instance of
`npy_static_string` depends on whether or not the `na_object` has been
set in the dtype. If it has not been set, that data represents a string
of length 0. If `na_object` *has* been set, that data represents a
missing value. To get a string of length 0 in this case, some non-NULL
value must be assigned to the `buf` field. (In the comment linked
above, @ngoldbaum suggested `{0, "\0"}`, but strings are not
NUL-terminated, so there is no need for that `\0` in `buf`, and in fact,
with `len == 0`, it would be a bug for the pointer to be dereferenced,
so *any* non-NULL value--valid pointer or not--could be used for `buf`.)

I think it would be better if `len == 0` *always* meant a string with
length 0, with no additional qualifications; it shouldn't be necessary
to put some non-NULL value in `buf` just to get an empty string. We
can achieve this if we use a bit in `len` as a flag for a missing value.
Reserving a bit from `len` as a flag reduces the maximum possible string
length, but as discussed in the NEP pull request, we're almost certainly
going to reserve at least the high bit of `len` when small string
optimization (SSO) is implemented. This will reduce the maximum string
length to `2**(N-1)-1`, where `N` is the bit width of `size_t`
(equivalent to using a signed type for `len`). Even if SSO isn't
implemented immediately, we can anticipate the need for flags stored
in `len`, and use them to implement missing values.

The actual implementation of SSO will require some more design work,
because the offset of the most significant byte of `len` within the
`npy_static_string` struct depends on the platform endianess. For
little-endian, the most significant byte is not the first byte in the
struct, so the bytes available for SSO within the struct are not
contiguous when the fields have the order `{len, buf}`.

I experimented with these ideas, and put the result at

https://github.com/WarrenWeckesser/experiments/tree/master/c/numpy-vstring

The idea that I propose there is to make the memory layout of the
struct depend on the endianess of the platform, so the most
significant byte of `len` (which I called `size`, to avoid any chance
of confusion with the actual length of the string [1]) is at the
beginning of the struct on big-endian platforms and at the end of the
struct for little-endian platforms. More details are included in the
file README.md. Note that I am not suggesting that all the SSO stuff
be included in the current NEP! This is just a proof-of-concept that
shows one possibility for SSO.

In that design, the high bit of `size` (which is `len` here) being set
indicates that the `npy_static_string` struct should not be interpreted
as the standard `{len, buf}` representation of a string. When the
second highest bit is set, it means we have a missing value. If the
second highest bit is not set, SSO is active; see the link above for
more details.

With this design, `len == 0` *always* means a string of length 0,
regardless of whether or not `na_object` is defined in the dtype.

Also with this design, an array created with `calloc()` will
automatically be an array of empty strings. With current design in
the NEP, an array created with `calloc()` will be either an array of
empty strings, or an array of missing values, depending on whether or
not the dtype has `na_object` defined. That conditional behavior
seems less than desirable.

What do you think?

--Warren

[1] I would like to see `len` renamed to `size` in the
`npy_static_string` struct, but that's bikeshed stuff, and not
a blocker.

_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: [email protected]

[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy

Reply via email to