On Thu, Apr 20, 2017 at 12:53 PM, Robert Kern <robert.k...@gmail.com> wrote:
> On Thu, Apr 20, 2017 at 6:15 AM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote:
>
> > Do you have comments on how to go forward, in particular in regards to
> > new dtype vs modify np.unicode?
>
> Can we restate the use cases explicitly? I feel like we ended up with the
> current sub-optimal situation because we never really laid out the use
> cases. We just felt like we needed bytestring and unicode dtypes, more out
> of completionism than anything, and we made a bunch of assumptions just to
> get each one done. I think there may be broad agreement that many of those
> assumptions are "wrong", but it would be good to reference that against
> concretely-stated use cases.
>
> FWIW, if I need to work with in-memory arrays of strings in Python code,
> I'm going to use dtype=object a la pandas. It has almost no arbitrary
> constraints, and I can rely on Python's unicode facilities freely. There
> may be some cases where it's a little less memory-efficient (e.g.
> representing a column of enumerated single-character values like 'M'/'F'),
> but that's never prevented me from doing anything (compare to the
> uniform-length restrictions, which *have* prevented me from doing things).
>
> So what's left? Being able to memory-map to files that have string data
> conveniently laid out according to numpy assumptions (e.g. FITS). Being
> able to work with C/C++/Fortran APIs that have arrays of strings laid out
> according to numpy assumptions (e.g. HDF5). I think it would behoove us to
> canvass the needs of these formats and APIs before making any more
> assumptions.
>
> For example, to my understanding, FITS files more or less follow numpy
> assumptions for its string columns (i.e. uniform-length). But it enforces
> 7-bit-clean ASCII and pads with terminating NULLs; I believe this was the
> singular motivating use case for the trailing-NULL behavior of np.string.
>
> I don't know of a format off-hand that works with numpy uniform-length
> strings and Unicode as well. HDF5 (to my recollection) supports arrays of
> NULL-terminated, uniform-length ASCII like FITS, but only variable-length
> UTF8 strings.
>
> We should look at some of the newer formats and APIs, like Parquet and
> Arrow, and also consider the cross-language APIs with Julia and R.
>
> If I had to jump ahead and propose new dtypes, I might suggest this:
>
> * For the most part, treat the string dtypes as temporary communication
> formats rather than the preferred in-memory working format, similar to how
> we use `float16` to communicate with GPU APIs.
>
> * Acknowledge the use cases of the current NULL-terminated np.string
> dtype, but perhaps add a new canonical alias, document it as being for
> those specific use cases, and deprecate/de-emphasize the current name.
>
> * Add a dtype for holding uniform-length `bytes` strings. This would be
> similar to the current `void` dtype, but work more transparently with the
> `bytes` type, perhaps with the scalar type multiply-inheriting from `bytes`
> like `float64` does with `float`. This would not be NULL-terminated. No
> encoding would be implied.
>
> * Maybe add a dtype similar to `object_` that only permits `unicode/str`
> (2.x/3.x) strings (and maybe None to represent missing data a la pandas).
> This maintains all of the flexibility of using a `dtype=object` array while
> allowing code to specialize for working with strings without all kinds of
> checking on every item. But most importantly, we can serialize such an
> array to bytes without having to use pickle. Utility functions could be
> written for en-/decoding to/from the uniform-length bytestring arrays
> handling different encodings and things like NULL-termination (also working
> with the legacy dtypes and handling structured arrays easily, etc.).
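
To make the last point concrete, here is a rough sketch of what such en-/decoding utilities might look like. It uses only the Python standard library and plain `bytes` rather than any NumPy dtype, and the names `encode_fixed` and `decode_fixed` are hypothetical, not part of NumPy or any existing proposal:

```python
def encode_fixed(strings, width, encoding="ascii"):
    """Pack each str into exactly `width` bytes, right-padded with NULs.

    Hypothetical helper for illustration only; not a NumPy API.
    """
    out = []
    for s in strings:
        raw = s.encode(encoding)
        if len(raw) > width:
            raise ValueError(
                f"{s!r} needs {len(raw)} bytes but the field holds {width}"
            )
        out.append(raw.ljust(width, b"\x00"))
    return b"".join(out)


def decode_fixed(buf, width, encoding="ascii"):
    """Split `buf` into `width`-byte fields, strip NUL padding, and decode."""
    if len(buf) % width:
        raise ValueError("buffer length is not a multiple of the field width")
    fields = (buf[i:i + width] for i in range(0, len(buf), width))
    return [f.rstrip(b"\x00").decode(encoding) for f in fields]


packed = encode_fixed(["M", "F", "abc"], width=4)
roundtrip = decode_fixed(packed, width=4)
```

Note that the field width counts bytes, not characters, so with a multi-byte encoding such as UTF-8 a string can overflow its field even when it looks short. Stripping the padding on decode also means trailing NULs that were part of the data cannot be recovered.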

A little history: IIRC, storing null-terminated strings in fixed byte lengths was done in Fortran, where strings were usually stored in integers/integer arrays.

If memory mapping of arbitrary types is not important, I'd settle for ascii or latin-1, utf-8 fixed byte length, and arrays of fixed python object type. Using one-byte encodings and utf-8 avoids needing to deal with endianness.

Chuck
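
P.S. A small illustration of the endianness point, using only the Python standard library (the variable names are just for this sketch): one-byte encodings and UTF-8 produce a single byte sequence, while UTF-16 has two byte orders that must be agreed on by reader and writer.

```python
text = "numpy"

# One-byte encodings and UTF-8 have no byte-order variants; for ASCII-only
# text they even produce identical bytes.
utf8 = text.encode("utf-8")
latin1 = text.encode("latin-1")

# UTF-16 uses two-byte code units, so the same text has two possible
# layouts depending on byte order.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# Reading little-endian bytes as big-endian does not fail here; it
# silently yields different characters.
misread = utf16_le.decode("utf-16-be")

print(utf8 == latin1)        # same bytes for ASCII-only text
print(utf16_le == utf16_be)  # byte order changes the layout
print(misread == text)       # wrong byte order gives wrong text
```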
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion