On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <warren.weckes...@gmail.com>
wrote:

>
>
> On Tue, Aug 29, 2023 at 10:09 AM Nathan <nathan.goldb...@gmail.com> wrote:
> >
> > The NEP was merged in draft form, see below.
> >
> > https://numpy.org/neps/nep-0055-string_dtype.html
> >
> > On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldb...@gmail.com>
> wrote:
> >>
> >> Hello all,
> >>
> >> I just opened a pull request to add NEP 55, see
> https://github.com/numpy/numpy/pull/24483.
> >>
> >> Per NEP 0, I've copied everything up to the "detailed description"
> section below.
> >>
> >> I'm looking forward to your feedback on this.
> >>
> >> -Nathan Goldbaum
> >>
>
> This will be a nice addition to NumPy, and matches a suggestion by
> @rkern (and probably others) made in the 2017 mailing list thread;
> see the last bullet of
>
>  https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html
>
> So +1 for the enhancement!
>
> Now for some nitty-gritty review...
>

Thanks for the nitty-gritty review! I was on vacation last week and haven't
had a chance to look over this in detail yet, but at first glance this
seems like a really nice improvement.

I'm going to try to integrate your proposed design into the dtype prototype
this week. If that works, I'd like to include some of the text from the
README in your repo in the NEP and add you as an author, would that be
alright?


>
> There is a design change that I think should be made in the
> implementation of missing values.
>
> In the current design described in the NEP, and expanded on in the
> comment
>
> https://github.com/numpy/numpy/pull/24483#discussion_r1311815944,
>
> the meaning of the values `{len = 0, buf = NULL}` in an instance of
> `npy_static_string` depends on whether or not the `na_object` has been
> set in the dtype. If it has not been set, that data represents a string
> of length 0. If `na_object` *has* been set, that data represents a
> missing value. To get a string of length 0 in this case, some non-NULL
> value must be assigned to the `buf` field. (In the comment linked
> above, @ngoldbaum suggested `{0, "\0"}`, but strings are not
> NUL-terminated, so there is no need for that `\0` in `buf`, and in fact,
> with `len == 0`, it would be a bug for the pointer to be dereferenced,
> so *any* non-NULL value--valid pointer or not--could be used for `buf`.)
>
> I think it would be better if `len == 0` *always* meant a string with
> length 0, with no additional qualifications; it shouldn't be necessary
> to put some non-NULL value in `buf` just to get an empty string. We
> can achieve this if we use a bit in `len` as a flag for a missing value.
> Reserving a bit from `len` as a flag reduces the maximum possible string
> length, but as discussed in the NEP pull request, we're almost certainly
> going to reserve at least the high bit of `len` when small string
> optimization (SSO) is implemented. This will reduce the maximum string
> length to `2**(N-1)-1`, where `N` is the bit width of `size_t`
> (equivalent to using a signed type for `len`). Even if SSO isn't
> implemented immediately, we can anticipate the need for flags stored
> in `len`, and use them to implement missing values.
>
> The actual implementation of SSO will require some more design work,
> because the offset of the most significant byte of `len` within the
> `npy_static_string` struct depends on the platform endianess. For
> little-endian, the most significant byte is not the first byte in the
> struct, so the bytes available for SSO within the struct are not
> contiguous when the fields have the order `{len, buf}`.
>
> I experimented with these ideas, and put the result at
>
> https://github.com/WarrenWeckesser/experiments/tree/master/c/numpy-vstring
>
> The idea that I propose there is to make the memory layout of the
> struct depend on the endianess of the platform, so the most
> significant byte of `len` (which I called `size`, to avoid any chance
> of confusion with the actual length of the string [1]) is at the
> beginning of the struct on big-endian platforms and at the end of the
> struct for little-endian platforms. More details are included in the
> file README.md. Note that I am not suggesting that all the SSO stuff
> be included in the current NEP! This is just a proof-of-concept that
> shows one possibility for SSO.
>
> In that design, the high bit of `size` (which is `len` here) being set
> indicates that the `npy_static_string` struct should not be interpreted
> as the standard `{len, buf}` representation of a string. When the
> second highest bit is set, it means we have a missing value. If the
> second highest bit is not set, SSO is active; see the link above for
> more details.
>
> With this design, `len == 0` *always* means a string of length 0,
> regardless of whether or not `na_object` is defined in the dtype.
>
> Also with this design, an array created with `calloc()` will
> automatically be an array of empty strings. With current design in
> the NEP, an array created with `calloc()` will be either an array of
> empty strings, or an array of missing values, depending on whether or
> not the dtype has `na_object` defined. That conditional behavior
> seems less than desirable.
>
> What do you think?
>
> --Warren
>
> [1] I would like to see `len` renamed to `size` in the
> `npy_static_string` struct, but that's bikeshed stuff, and not
> a blocker.
>
>
>
>
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: nathan12...@gmail.com
>
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com

Reply via email to