On Thu, Apr 20, 2017 at 8:55 PM Robert Kern <robert.k...@gmail.com> wrote:
> On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote:
>
> > Do you have comments on how to go forward, in particular in regards to
> > new dtype vs modify np.unicode?
>
> Can we restate the use cases explicitly? I feel like we ended up with the
> current sub-optimal situation because we never really laid out the use
> cases. We just felt like we needed bytestring and unicode dtypes, more out
> of completionism than anything, and we made a bunch of assumptions just to
> get each one done. I think there may be broad agreement that many of those
> assumptions are "wrong", but it would be good to reference that against
> concretely-stated use cases.

+1

> FWIW, if I need to work with in-memory arrays of strings in Python code,
> I'm going to use dtype=object a la pandas. It has almost no arbitrary
> constraints, and I can rely on Python's unicode facilities freely. There
> may be some cases where it's a little less memory-efficient (e.g.
> representing a column of enumerated single-character values like 'M'/'F'),
> but that's never prevented me from doing anything (compare to the
> uniform-length restrictions, which *have* prevented me from doing things).
>
> So what's left? Being able to memory-map to files that have string data
> conveniently laid out according to numpy assumptions (e.g. FITS). Being
> able to work with C/C++/Fortran APIs that have arrays of strings laid out
> according to numpy assumptions (e.g. HDF5). I think it would behoove us to
> canvass the needs of these formats and APIs before making any more
> assumptions.
>
> For example, to my understanding, FITS files more or less follow numpy
> assumptions for its string columns (i.e. uniform-length). But it enforces
> 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the
> singular motivating use case for the trailing-NULL behavior of np.string.
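For reference, the trailing-NULL behavior mentioned above can be seen directly in NumPy. This is only a minimal sketch: the 4-byte width and the sample values are assumptions chosen for illustration, and it simply contrasts the fixed-width `S` dtype with a `dtype=object` array.

```python
import numpy as np

# Fixed-width byte strings: every element occupies exactly 4 bytes
# (the width 4 is an arbitrary choice for this illustration).
a = np.array([b"ab\x00\x00", b"abcd"], dtype="S4")

raw = a.tobytes()   # the NUL padding is still present in the buffer
first = a[0]        # but trailing NULs are dropped when an item is read

# A value that genuinely ends in a NUL byte cannot be told apart from padding.
lossy = np.array([b"ab\x00"], dtype="S4")[0]

# An object array holds ordinary Python str objects of any length.
o = np.array(["M", "a much longer string"], dtype=object)

print(a.itemsize, raw, first, lossy, o[1])
```

The `lossy` value shows why the behavior is convenient for padded file formats but awkward for arbitrary binary data: the final NUL of the input does not survive the round trip.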
>

Actually if I understood the spec, FITS header lines are 80 bytes long and contain ASCII with no NULLs; strings are quoted and trailing spaces are stripped.

[...]

> If I had to jump ahead and propose new dtypes, I might suggest this:
>
> * For the most part, treat the string dtypes as temporary communication
> formats rather than the preferred in-memory working format, similar to how
> we use `float16` to communicate with GPU APIs.
>
> * Acknowledge the use cases of the current NULL-terminated np.string
> dtype, but perhaps add a new canonical alias, document it as being for
> those specific use cases, and deprecate/de-emphasize the current name.
>
> * Add a dtype for holding uniform-length `bytes` strings. This would be
> similar to the current `void` dtype, but work more transparently with the
> `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes`
> like `float64` does with `float`. This would not be NULL-terminated. No
> encoding would be implied.

How would this differ from a numpy array of bytes with one more dimension?

> * Maybe add a dtype similar to `object_` that only permits `unicode/str`
> (2.x/3.x) strings (and maybe None to represent missing data a la pandas).
> This maintains all of the flexibility of using a `dtype=object` array while
> allowing code to specialize for working with strings without all kinds of
> checking on every item. But most importantly, we can serialize such an
> array to bytes without having to use pickle. Utility functions could be
> written for en-/decoding to/from the uniform-length bytestring arrays
> handling different encodings and things like NULL-termination (also working
> with the legacy dtypes and handling structured arrays easily, etc.).

I think there may also be a niche for fixed-byte-size null-terminated strings of uniform encoding, that do decoding and encoding automatically.
The encoding would naturally be attached to the dtype, and they would handle too-long strings by either truncating to a valid encoding or simply raising an exception. As with the current fixed-length strings, they'd mostly be for communication with other code, so the necessity depends on whether such other codes exist at all. Databases, perhaps? Custom hunks of C that don't want to deal with variable-length packing of data? Actually this last seems plausible - if I want to pass a great wodge of data, including Unicode strings, to a C program, writing out a numpy array seems maybe the easiest.

Anne
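The "truncate to a valid encoding or raise" handling described in the message could be sketched roughly as follows. Nothing here is an existing NumPy API: `encode_fixed`, `decode_fixed`, and the `on_overflow` parameter are hypothetical names invented for this illustration, and the sketch assumes an encoding such as UTF-8 in which no character other than U+0000 contains a NUL byte.

```python
import numpy as np

def encode_fixed(text, nbytes, encoding="utf-8", on_overflow="truncate"):
    """Encode text into exactly nbytes bytes, NUL-padded (hypothetical helper)."""
    data = text.encode(encoding)
    if len(data) > nbytes:
        if on_overflow == "raise":
            raise ValueError(f"{text!r} needs {len(data)} bytes, limit is {nbytes}")
        # Cut at the byte limit, then drop any incomplete trailing character
        # so that what remains is still valid in the chosen encoding.
        data = data[:nbytes].decode(encoding, errors="ignore").encode(encoding)
    return data.ljust(nbytes, b"\x00")

def decode_fixed(raw, encoding="utf-8"):
    """Strip the NUL padding and decode (assumes NUL never occurs inside a character)."""
    return bytes(raw).rstrip(b"\x00").decode(encoding)

words = ["héllo", "ab"]
# Store the encoded values in a fixed-width array that C code could read directly.
arr = np.array([encode_fixed(w, 4) for w in words], dtype="S4")
decoded = [decode_fixed(item) for item in arr]
print(arr.tobytes(), decoded)
```

With a 4-byte limit, "héllo" is cut after the "l"; with a 2-byte limit the cut would fall inside the two-byte "é", so only "h" would be kept rather than leaving half a character in the buffer.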
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion