On Wed, Sep 20, 2023 at 4:40 AM Kevin Sheppard <kevin.k.shepp...@gmail.com> wrote:
> On Wed, Sep 20, 2023 at 11:23 AM Ralf Gommers <ralf.gomm...@gmail.com> wrote:
>
>> On Wed, Sep 20, 2023 at 8:26 AM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>>
>>> On Fri, Sep 15, 2023 at 3:18 PM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>>> >
>>> > On Mon, Sep 11, 2023 at 12:25 PM Nathan <nathan.goldb...@gmail.com> wrote:
>>> >>
>>> >> On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>>> >>>
>>> >>> On Tue, Aug 29, 2023 at 10:09 AM Nathan <nathan.goldb...@gmail.com> wrote:
>>> >>> >
>>> >>> > The NEP was merged in draft form, see below.
>>> >>> >
>>> >>> > https://numpy.org/neps/nep-0055-string_dtype.html
>>> >>> >
>>> >>> > On Mon, Aug 21, 2023 at 2:36 PM Nathan <nathan.goldb...@gmail.com> wrote:
>>> >>> >>
>>> >>> >> Hello all,
>>> >>> >>
>>> >>> >> I just opened a pull request to add NEP 55, see
>>> >>> >> https://github.com/numpy/numpy/pull/24483.
>>> >>> >>
>>> >>> >> Per NEP 0, I've copied everything up to the "detailed
>>> >>> >> description" section below.
>>> >>> >>
>>> >>> >> I'm looking forward to your feedback on this.
>>> >>> >>
>>> >>> >> -Nathan Goldbaum
>>> >>> >>
>>> >>>
>>> >>> This will be a nice addition to NumPy, and matches a suggestion by
>>> >>> @rkern (and probably others) made in the 2017 mailing list thread;
>>> >>> see the last bullet of
>>> >>>
>>> >>> https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html
>>> >>>
>>> >>> So +1 for the enhancement!
>>> >>>
>>> >>> Now for some nitty-gritty review...
>>> >>
>>> >> Thanks for the nitty-gritty review! I was on vacation last week and
>>> >> haven't had a chance to look over this in detail yet, but at first glance
>>> >> this seems like a really nice improvement.
>>> >>
>>> >> I'm going to try to integrate your proposed design into the dtype
>>> >> prototype this week.
>>> >> If that works, I'd like to include some of the text
>>> >> from the README in your repo in the NEP and add you as an author, would
>>> >> that be alright?
>>> >
>>> > Sure, that would be fine.
>>> >
>>> > I have a few more comments and questions about the NEP that I'll
>>> > finish up and send this weekend.
>>> >
>>>
>>> One more comment on the NEP...
>>>
>>> My first impression of the missing data API design is that
>>> it is more complicated than necessary. An alternative that
>>> is simpler--and is consistent with the pattern established for
>>> floats and datetimes--is to define a "not a string" value, say
>>> `np.nastring` or something similar, just like we have `nan` for
>>> floats and `nat` for datetimes. Its behavior could be what
>>> you called "nan-like".
>>
>> Float `np.nan` and datetime missing value sentinel are not all that
>> similar, and the latter was always a bit questionable (at least partially
>> it's a left-over of trying to introduce generic missing value support I
>> believe). `nan` is a float and part of C/C++ standards with well-defined
>> numerical behavior. In contrast, there is no `np.nat`; you can retrieve a
>> sentinel value with `np.datetime64("NaT")` only. I'm not sure if it's
>> possible to generate a NaT value with a regular operation on a datetime
>> array a la `np.array([1.5]) / 0.0`.
>>
>>> The handling of `np.nastring` would be an intrinsic part of the
>>> dtype, so there would be no need for the `na_object` parameter
>>> of `StringDType`. All `StringDType`s would handle `np.nastring`
>>> in the same consistent manner.
>>>
>>> The use-case for the string sentinel does not seem very
>>> compelling (but maybe I just don't understand the use-cases).
>>> If there is a real need here that is not covered by
>>> `np.nastring`, perhaps just a flag to control the repr of
>>> `np.nastring` for each StringDType instance would be enough?
>>
>> My understanding is that the NEP provides the necessary but limited
>> support to allow Pandas to adopt the new dtype. The scope section of the
>> NEP says: "Fully agreeing on the semantics of a missing data sentinels or
>> adding a missing data sentinel to NumPy itself.". And then further down:
>> "By only supporting user-provided missing data sentinels, we avoid
>> resolving exactly how NumPy itself should support missing data and the
>> correct semantics of the missing data object, leaving that up to users to
>> decide"
>>
>> That general approach I agree with, it's a large can of worms and not the
>> main purpose of this NEP. Nathan may have more thoughts about what, if
>> anything, from your suggestions could be adopted, but the general "let's
>> introduce a missing value thing" is a path we should not go down here imho.
>>
>>> If there is an objection to a potential proliferation of
>>> "not a thing" special values, one for each type that can
>>> handle them, then perhaps a generic "not a value" (say
>>> `np.navalue`) could be created that, when assigned to an
>>> element of an array, results in the appropriate "not a thing"
>>> value actually being assigned. In a sense, I guess this NEP is
>>> proposing that, but it is reusing the floating point object
>>> `np.nan` as the generic "not a thing" value
>>
>> It is explicitly not using `np.nan` but instead allowing the user to
>> provide their preferred sentinel. You're probably referring to the example
>> with `na_object=np.nan`, but that example would work with another sentinel
>> value too.
>>
>> Cheers,
>> Ralf
>>
>>> , and my preference
>>> is that, *if* we go with such a generic object, it is not
>>> the floating point value `nan` but a new thing with a name
>>> that reflects its purpose. (I guess Pandas users might be
>>> accustomed to `nan` being a generic sentinel for missing data,
>>> so its use doesn't feel as incohesive as it might to others.
>>> Passing a string array to `np.isnan()` just feels *wrong* to
>>> me.)
>>>
>>> Anyway, that's my 2¢.
>>>
>>> Warren
>
> I was a bit surprised that len was not used as part of the missing value.
> The NEP proposal that 0 is an empty string unless there is a sentinel, in
> which case it is a missing value, feels pretty limiting, since these are
> distinctly different things.
>
> Would it make sense for len<0 to indicate a missing value? This would
> require using ssize_t instead of size_t, and would then limit the string
> size. In principle this would allow for sizeof(ssize_t) / 2 distinct
> missing values. I think ssize_t is well-defined on all platforms
> targeted by NumPy.
>
> Kevin

Hey Kevin,

Thanks for the comment. The current NEP text is a little out of date
compared to the implementation. I've since rewritten it to use Warren's
proposal more or less verbatim, so now the missing value flag is stored
in a bit of the size field.

See https://github.com/numpy/numpy-user-dtypes/pull/86 for the
implementation, which also includes a small string optimization
implementation.
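
To make the "flag stored in a bit of the size field" idea concrete, here is a minimal Python sketch. It is not the code from the linked PR: the 64-bit field width, the choice of the top bit, and the names `pack_size` and `unpack_size` are all assumptions made purely for illustration.

```python
# Hypothetical layout (an assumption, not the layout used in the PR):
# a 64-bit unsigned size field whose top bit marks a missing value.
SIZE_BITS = 64
MISSING_FLAG = 1 << (SIZE_BITS - 1)  # top bit reserved for the flag
SIZE_MASK = MISSING_FLAG - 1         # remaining 63 bits hold the length


def pack_size(length, missing=False):
    """Combine a length and a missing-value flag into one unsigned field."""
    if not 0 <= length <= SIZE_MASK:
        raise ValueError("length does not fit in the available bits")
    return length | (MISSING_FLAG if missing else 0)


def unpack_size(field):
    """Return (length, missing) recovered from a packed size field."""
    return field & SIZE_MASK, bool(field & MISSING_FLAG)


empty = pack_size(0)                   # empty string: length 0, flag clear
missing = pack_size(0, missing=True)   # missing value: length 0, flag set
assert unpack_size(empty) == (0, False)
assert unpack_size(missing) == (0, True)
assert empty != missing                # the two cases stay distinguishable
```

With a layout like this, an empty string and a missing value get different encodings, so a length of 0 alone no longer has to stand in for both.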
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com