On Apr 25, 2017 11:53 AM, "Robert Kern" <robert.k...@gmail.com> wrote:
On Tue, Apr 25, 2017 at 11:18 AM, Charles R Harris < charlesr.har...@gmail.com> wrote:
>
> On Tue, Apr 25, 2017 at 11:34 AM, Anne Archibald < peridot.face...@gmail.com> wrote:
>> Clearly there is a need for fixed-storage-size zero-padded UTF-8; two other packages are waiting specifically for it. But specifying this requires two pieces of information: What is the encoding? and How is the length specified? I know they're not numpy-compatible, but FITS header values are space-padded; does that occur elsewhere? Are there other ways existing data specifies string length within a fixed-size field? There are some cryptographic length-specification tricks - ANSI X.923, ISO 10126, PKCS7, etc. - but they are probably too specialized to need? We should make sure we can support all the ways that actually occur.
>
> Agree with the UTF-8 fixed byte length strings, although I would tend towards null terminated.

Just to clarify some terminology (because it wasn't originally clear to me until I looked it up in reference to HDF5):

* "NULL-padded" implies that, for a fixed width of N, there can be up to N non-NULL bytes. Any extra space left over is padded with NULLs, but no space needs to be reserved for NULLs.

* "NULL-terminated" implies that, for a fixed width of N, there can be up to N-1 non-NULL bytes. There must always be space reserved for the terminating NULL.

I'm not really sure if "NULL-padded" also specifies the behavior for embedded NULLs. It's certainly possible to deal with them: just strip trailing NULLs and leave any embedded ones alone. But I'm also sure that there are some implementations somewhere that interpret the requirement as "stop at the first NULL or the end of the fixed width, whichever comes first", effectively being NULL-terminated just not requiring the reserved space.
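As a rough illustration of the two conventions described above, here is a minimal pure-Python sketch. The function names are hypothetical and exist only for this example; they are not part of numpy or HDF5. The "terminated" decoder implements the lenient reading mentioned above (stop at the first NUL or at the end of the field).

```python
def encode_null_padded(value: bytes, width: int) -> bytes:
    """Store up to `width` bytes; pad any leftover space with NULs."""
    if len(value) > width:
        raise ValueError("value does not fit in the field")
    return value.ljust(width, b"\x00")


def encode_null_terminated(value: bytes, width: int) -> bytes:
    """Store up to `width - 1` bytes; one byte is reserved for the NUL."""
    if len(value) > width - 1:
        raise ValueError("value does not fit (terminator needs one byte)")
    return value.ljust(width, b"\x00")


def decode_null_padded(field: bytes) -> bytes:
    """Strip only trailing NULs; embedded NULs are left alone."""
    return field.rstrip(b"\x00")


def decode_null_terminated(field: bytes) -> bytes:
    """Stop at the first NUL or the end of the field, whichever comes first."""
    return field.split(b"\x00", 1)[0]


# A 4-byte field holds 4 payload bytes when padded, but only 3 when terminated.
padded_full = encode_null_padded(b"abcd", 4)
terminated_full = encode_null_terminated(b"abc", 4)

# The two decoders disagree as soon as there is an embedded NUL.
field_with_embedded_nul = b"a\x00b\x00"
as_padded = decode_null_padded(field_with_embedded_nul)
as_terminated = decode_null_terminated(field_with_embedded_nul)
```

With these definitions, `as_padded` keeps the embedded NUL while `as_terminated` is cut off at it, which is exactly the ambiguity raised above.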
And to save anyone else having to check, numpy's current NUL-padded dtypes only strip trailing NULs, so they can round-trip strings that contain NULs, just not strings where NUL is the last character.

So the set of strings representable by str/bytes is a strict superset of the set of strings representable by numpy U/S dtypes, which in turn is a strict superset of the set of strings representable by a hypothetical NUL-terminated dtype.

(Of course this doesn't matter for most practical purposes, because people rarely make strings with embedded NULs.)

-n
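For anyone who wants to check the round-trip behaviour described above, a short sketch along these lines should do it. It assumes NumPy is installed; the variable names are only illustrative.

```python
import numpy as np

# Embedded NUL: survives a round trip through a fixed-width S dtype.
embedded = np.array([b"a\x00b"], dtype="S4")
embedded_back = bytes(embedded[0])

# Trailing NUL: indistinguishable from padding, so it is stripped on read.
trailing = np.array([b"ab\x00"], dtype="S4")
trailing_back = bytes(trailing[0])

# The raw storage is the same fixed width either way, padded with NULs.
raw_storage = trailing.tobytes()

# The U (unicode) dtype follows the same rule.
embedded_u = np.array(["a\x00b"], dtype="U4")
embedded_u_back = str(embedded_u[0])
```

Here `embedded_back` still contains the embedded NUL, while `trailing_back` has lost its final NUL, which is why strings ending in NUL are the ones that cannot be represented.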
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion