[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy

Nathan Wed, 20 Sep 2023 08:30:31 -0700

On Wed, Sep 20, 2023 at 4:40 AM Kevin Sheppard <[email protected]>
wrote:


>
>
> On Wed, Sep 20, 2023 at 11:23 AM Ralf Gommers <[email protected]>
> wrote:
>
>>
>>
>> On Wed, Sep 20, 2023 at 8:26 AM Warren Weckesser <
>> [email protected]> wrote:
>>
>>>
>>>
>>> On Fri, Sep 15, 2023 at 3:18 PM Warren Weckesser <
>>> [email protected]> wrote:
>>> >
>>> >
>>> >
>>> > On Mon, Sep 11, 2023 at 12:25 PM Nathan <[email protected]>
>>> wrote:
>>> >>
>>> >>
>>> >>
>>> >> On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <
>>> [email protected]> wrote:
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Tue, Aug 29, 2023 at 10:09 AM Nathan <[email protected]>
>>> wrote:
>>> >>> >
>>> >>> > The NEP was merged in draft form, see below.
>>> >>> >
>>> >>> > https://numpy.org/neps/nep-0055-string_dtype.html
>>> >>> >
>>> >>> > On Mon, Aug 21, 2023 at 2:36 PM Nathan <[email protected]>
>>> wrote:
>>> >>> >>
>>> >>> >> Hello all,
>>> >>> >>
>>> >>> >> I just opened a pull request to add NEP 55, see
>>> https://github.com/numpy/numpy/pull/24483.
>>> >>> >>
>>> >>> >> Per NEP 0, I've copied everything up to the "detailed
>>> description" section below.
>>> >>> >>
>>> >>> >> I'm looking forward to your feedback on this.
>>> >>> >>
>>> >>> >> -Nathan Goldbaum
>>> >>> >>
>>> >>>
>>> >>> This will be a nice addition to NumPy, and matches a suggestion by
>>> >>> @rkern (and probably others) made in the 2017 mailing list thread;
>>> >>> see the last bullet of
>>> >>>
>>> >>>
>>> https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html
>>> >>>
>>> >>> So +1 for the enhancement!
>>> >>>
>>> >>> Now for some nitty-gritty review...
>>> >>
>>> >>
>>> >> Thanks for the nitty-gritty review! I was on vacation last week and
>>> haven't had a chance to look over this in detail yet, but at first glance
>>> this seems like a really nice improvement.
>>> >>
>>> >> I'm going to try to integrate your proposed design into the dtype
>>> prototype this week. If that works, I'd like to include some of the text
>>> from the README in your repo in the NEP and add you as an author, would
>>> that be alright?
>>> >
>>> >
>>> >
>>> > Sure, that would be fine.
>>> >
>>> > I have a few more comments and questions about the NEP that I'll
>>> finish up and send this weekend.
>>> >
>>>
>>> One more comment on the NEP...
>>>
>>> My first impression of the missing data API design is that
>>> it is more complicated than necessary. An alternative that
>>> is simpler--and is consistent with the pattern established for
>>> floats and datetimes--is to define a "not a string" value, say
>>> `np.nastring` or something similar, just like we have `nan` for
>>> floats and `nat` for datetimes. Its behavior could be what
>>> you called "nan-like".
>>>
>>
>> Float `np.nan` and datetime missing value sentinel are not all that
>> similar, and the latter was always a bit questionable (at least partially
>> it's a left-over of trying to introduce generic missing value support I
>> believe). `nan` is a float and part of C/C++ standards with well-defined
>> numerical behavior. In contrast, there is no `np.nat`; you can retrieve a
>> sentinel value with `np.datetime64("NaT")` only. I'm not sure if it's
>> possible to generate a NaT value with a regular operation on a datetime
>> array a la `np.array([1.5]) / 0.0`.
>>
>>> The handling of `np.nastring` would be an intrinsic part of the
>>> dtype, so there would be no need for the `na_object` parameter
>>> of `StringDType`. All `StringDType`s would handle `np.nastring`
>>> in the same consistent manner.
>>>
>>> The use-case for the string sentinel does not seem very
>>> compelling (but maybe I just don't understand the use-cases).
>>> If there is a real need here that is not covered by
>>> `np.nastring`, perhaps just a flag to control the repr of
>>> `np.nastring` for each StringDType instance would be enough?
>>>
>>
>> My understanding is that the NEP provides the necessary but limited
>> support to allow Pandas to adopt the new dtype. The scope section of the
>> NEP says: "Fully agreeing on the semantics of a missing data sentinels or
>> adding a missing data sentinel to NumPy itself.". And then further down:
>> "By only supporting user-provided missing data sentinels, we avoid
>> resolving exactly how NumPy itself should support missing data and the
>> correct semantics of the missing data object, leaving that up to users to
>> decide"
>>
>> That general approach I agree with, it's a large can of worms and not the
>> main purpose of this NEP. Nathan may have more thoughts about what, if
>> anything, from your suggestions could be adopted, but the general "let's
>> introduce a missing value thing" is a path we should not go down here imho.
>>
>>
>>
>>>
>>> If there is an objection to a potential proliferation of
>>> "not a thing" special values, one for each type that can
>>> handle them, then perhaps a generic "not a value" (say
>>> `np.navalue`) could be created that, when assigned to an
>>> element of an array, results in the appropriate "not a thing"
>>> value actually being assigned. In a sense, I guess this NEP is
>>> proposing that, but it is reusing the floating point object
>>> `np.nan` as the generic "not a thing" value
>>>
>>
>> It is explicitly not using `np.nan` but instead allowing the user to
>> provide their preferred sentinel. You're probably referring to the example
>> with `na_object=np.nan`, but that example would work with another sentinel
>> value too.
>>
>> Cheers,
>> Ralf
>>
>>
>>
>>> , and my preference
>>> is that, *if* we go with such a generic object, it is not
>>> the floating point value `nan` but a new thing with a name
>>> that reflects its purpose. (I guess Pandas users might be
>>> accustomed to `nan` being a generic sentinel for missing data,
>>> so its use doesn't feel as incohesive as it might to others.
>>> Passing a string array to `np.isnan()` just feels *wrong* to
>>> me.)
>>>
>>> Anyway, that's my 2¢.
>>>
>>> Warren
>>>
>>>
>>>
>>
> I was a bit surprised that len was not used as part of the missing value.
> The NEP proposal that a length of 0 is an empty string unless there is a
> sentinel, in which case it is a missing value, feels pretty limiting, since
> these are distinctly different things.
>
> Would it make sense for len<0 to indicate a missing value?  This would
> require using ssize_t instead of size_t, and would then limit the string
> size. In principle this would allow every negative value of ssize_t (half
> of its range) to serve as a distinct missing value.  I think ssize_t is
> well-defined on all platforms targeted by NumPy.
>
> Kevin
>
>
Hey Kevin,

Thanks for the comment. The NEP text is currently a little out of date
compared to the implementation. I've since rewritten the implementation to
use Warren's proposal more or less verbatim, so the missing value flag is
now stored in a bit of the size field.
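
As a rough illustration of what "a flag stored in a bit of the size field"
can look like, here is a minimal sketch. The 64-bit width, the choice of the
top bit, and the helper names pack_size and unpack_size are all assumptions
made for this example; they are not taken from the pull request.

```python
# Hypothetical layout: a 64-bit size field whose top bit marks a missing value.
SIZE_BITS = 64
MISSING_FLAG = 1 << (SIZE_BITS - 1)   # assumed flag position (top bit)
SIZE_MASK = MISSING_FLAG - 1          # the remaining 63 bits hold the byte length


def pack_size(nbytes, missing=False):
    """Combine a byte length and a missing flag into one unsigned integer."""
    if not 0 <= nbytes <= SIZE_MASK:
        raise ValueError("length does not fit in the size bits")
    return nbytes | (MISSING_FLAG if missing else 0)


def unpack_size(field):
    """Split a packed size field back into (nbytes, missing)."""
    return field & SIZE_MASK, bool(field & MISSING_FLAG)


empty = pack_size(0)                  # an empty string
missing = pack_size(0, missing=True)  # a missing value, distinct from empty
```

With an encoding along these lines, an empty string and a missing value stay
distinguishable without a signed length, at the cost of one bit of maximum
string size.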

See https://github.com/numpy/numpy-user-dtypes/pull/86 for the
implementation, which also includes a small string optimization.
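
For readers unfamiliar with the term, a small string optimization generally
means keeping short strings inline in the fixed-size array element and only
spilling longer ones to a separate buffer. The sketch below shows the general
idea only. The 16-byte slot, the tag byte, the struct format, and the names
store and load are assumptions for illustration and do not describe the
layout used in the pull request.

```python
import struct

SLOT_SIZE = 16                # assumed size of one array element, in bytes
MAX_INLINE = SLOT_SIZE - 1    # one byte is reserved for the tag/length
INLINE_TAG = 0x80             # assumed marker: high bit set means "stored inline"
HEAP_FORMAT = "<B3xIQ"        # tag, 3 pad bytes, 32-bit offset, 64-bit size


def store(text, heap):
    """Return a 16-byte slot; strings that do not fit spill into `heap`."""
    data = text.encode("utf-8")
    if len(data) <= MAX_INLINE:
        # Short string: the tag byte carries the length, the bytes follow inline.
        return bytes([INLINE_TAG | len(data)]) + data.ljust(MAX_INLINE, b"\0")
    # Long string: append the bytes to the heap and record where they live.
    offset = len(heap)
    heap.extend(data)
    return struct.pack(HEAP_FORMAT, 0, offset, len(data))


def load(slot, heap):
    """Recover the string from a slot produced by `store`."""
    tag = slot[0]
    if tag & INLINE_TAG:
        nbytes = tag & 0x7F
        return slot[1:1 + nbytes].decode("utf-8")
    _, offset, size = struct.unpack(HEAP_FORMAT, slot)
    return bytes(heap[offset:offset + size]).decode("utf-8")


heap = bytearray()
short_slot = store("héllo", heap)   # 6 UTF-8 bytes, fits inline
long_slot = store("x" * 40, heap)   # too long, goes to the heap
```

In this sketch every slot has the same size regardless of string length, so
short strings need no extra allocation and no pointer chase to read.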


> _______________________________________________
> NumPy-Discussion mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: [email protected]
>

