Re: [Numpy-discussion] proposal: smaller representation of string arrays

Stephan Hoyer Mon, 24 Apr 2017 16:19:45 -0700

On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern <[email protected]> wrote:


> Let me make a counter-proposal for your latin-1 dtype (your #2) that might
> address your, Thomas's, and Julian's use cases:
>
> 2) We want a single-byte-per-character, NULL-terminated string dtype that
> can be used to represent mostly-ASCII textish data that may have some
> high-bit characters from some 8-bit encoding. It should be able to read
> arbitrary bytes (that is, up to the NULL-termination) and write them back
> out as the same bytes if unmodified. This lets us read this text from files
> where the encoding is unspecified (or is lying about the encoding) into
> `unicode/str` objects. The encoding is specified as `ascii` but the
> decoding/encoding is done with the `surrogateescape` option so that
> high-bit characters are faithfully represented in the `unicode/str` string
> but are not erroneously reinterpreted as other characters from an arbitrary
> encoding.
>
> I'd even be happy if Julian or someone wants to go ahead and implement
> this right now and leave the UTF-8 dtype for a later time.
>
> As long as this ASCII-surrogateescape dtype is not called np.realstring
> (it's *really* important to me that the bikeshed not be this color). ;-)
>

This sounds quite similar to my text[unknown] proposal, with the advantage
that the concept of "surrogateescape" already exists. Surrogate-escape
characters compare equal to themselves, which is maybe less than ideal, but
it looks like you can put them in real unicode strings, which is nice.
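
For concreteness, here is a minimal sketch of that behavior using only
Python 3's built-in codecs, not any proposed NumPy dtype. The sample bytes
and the variable names are made up for illustration.

```python
# Illustrative input: mostly-ASCII bytes with one high-bit byte (0xE9)
# from some unknown 8-bit encoding.
raw = b"caf\xe9"

# Decode as ASCII with surrogateescape: each undecodable byte 0xNN becomes
# the lone surrogate code point U+DCNN instead of raising an error.
text = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same options restores the original bytes.
round_tripped = text.encode("ascii", errors="surrogateescape")

# The escaped character lives in an ordinary str and equals itself,
# but it is not the Latin-1 character that the byte might have meant.
escaped = text[-1]
equals_itself = escaped == "\udce9"
equals_latin1 = escaped == "\xe9"

# A strict encode refuses lone surrogates, so escaped data cannot be
# written out as UTF-8 by accident.
try:
    text.encode("utf-8")
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False
```

The bytes survive the round trip unchanged, and the escaped character can be
compared or sliced like any other character in the string. It is never
reinterpreted as a character from a guessed encoding.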

_______________________________________________
NumPy-Discussion mailing list
[email protected]
https://mail.python.org/mailman/listinfo/numpy-discussion
