On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern <robert.k...@gmail.com> wrote:
> Let me make a counter-proposal for your latin-1 dtype (your #2) that might
> address your, Thomas's, and Julian's use cases:
>
> 2) We want a single-byte-per-character, NULL-terminated string dtype that
> can be used to represent mostly-ASCII textish data that may have some
> high-bit characters from some 8-bit encoding. It should be able to read
> arbitrary bytes (that is, up to the NULL-termination) and write them back
> out as the same bytes if unmodified. This lets us read this text from files
> where the encoding is unspecified (or is lying about the encoding) into
> `unicode/str` objects. The encoding is specified as `ascii` but the
> decoding/encoding is done with the `surrogateescape` option so that
> high-bit characters are faithfully represented in the `unicode/str` string
> but are not erroneously reinterpreted as other characters from an arbitrary
> encoding.
>
> I'd even be happy if Julian or someone wants to go ahead and implement
> this right now and leave the UTF-8 dtype for a later time.
>
> As long as this ASCII-surrogateescape dtype is not called np.realstring
> (it's *really* important to me that the bikeshed not be this color). ;-)
>

This sounds quite similar to my text[unknown] proposal, with the advantage
that the concept of "surrogateescape" already exists. Surrogate-escape
characters compare equal to themselves, which is maybe less than ideal, but
it looks like you can put them in real unicode strings, which is nice.
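
For concreteness, here is a minimal sketch in plain Python of the behaviour being discussed. It is not a NumPy dtype, and the helper names `decode_field` and `encode_field` are made up for illustration; the only real API used is the standard `surrogateescape` error handler on `bytes.decode` / `str.encode`.

```python
# Hypothetical helpers for illustration only; nothing here is a NumPy API.

def decode_field(raw: bytes) -> str:
    """Decode a NULL-terminated byte field as ASCII with surrogateescape."""
    # Keep only the bytes before the first NULL terminator/padding byte.
    payload = raw.split(b"\x00", 1)[0]
    # High-bit bytes become lone surrogates (U+DC80..U+DCFF) instead of errors.
    return payload.decode("ascii", errors="surrogateescape")


def encode_field(text: str, width: int) -> bytes:
    """Encode back to a fixed-width, NULL-padded field."""
    data = text.encode("ascii", errors="surrogateescape")
    if len(data) > width:
        raise ValueError("text does not fit in the field")
    return data.ljust(width, b"\x00")


# A field holding "caf" plus one high-bit byte from some unknown 8-bit encoding.
raw = b"caf\xe9\x00\x00\x00\x00"
text = decode_field(raw)

# Unmodified text writes back out as exactly the same bytes.
round_trip = encode_field(text, len(raw))

# The escaped byte is an ordinary (if unusual) code point inside a real str:
escaped = text[-1]
same_as_itself = escaped == "\udce9"      # compares equal to itself
same_as_latin1 = escaped == "\xe9"        # but not to the Latin-1 reading

# It can be concatenated with normal text, but a strict encode refuses it.
combined = "menu: " + text
try:
    combined.encode("utf-8")
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False
```

Note the trade-off this makes visible: two identical unknown bytes compare equal even though nobody knows what character they were meant to be, and the resulting strings need `surrogateescape` again on the way out.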
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion