Re: [Numpy-discussion] proposal: smaller representation of string arrays

Stephan Hoyer Mon, 24 Apr 2017 16:10:52 -0700

On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker <[email protected]>
wrote:


>> On the other hand, if this is the use-case, perhaps we really want an
>> encoding closer to "Python 2" string, i.e, "unknown", to let this be
>> signaled more explicitly. I would suggest that "text[unknown]" should
>> support operations like a string if it can be decoded as ASCII, and
>> otherwise error. But unlike "text[ascii]", it will let you store arbitrary
>> bytes.
>>
>
> I _think_ that is what using latin-1 (Or latin-9) gets you -- if it really
> is ascii, then it's perfect. If it really is latin-*, then you get some
> extra useful stuff, and if it's corrupted somehow, you still get the ascii
> text correct, and the rest won't barf and can be passed on through.
>

I am totally in agreement with Thomas that "We are living in a messy world
right now with messy legacy datasets that have character type data that are
*mostly* ASCII, but not infrequently contain non-ASCII characters."

My question: What are those non-ASCII characters? How often are they truly
latin-1/9 vs. some other text encoding vs. non-string binary data?

I don't think that silently (mis)interpreting non-ASCII characters as
latin-1/9 is a good idea, which is why I think it would be a mistake to use
'latin-1' for text data with unknown encoding.
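
To make the concern concrete, here is a small sketch of what "silently (mis)interpreting" looks like in plain Python. The sample word is made up for illustration, and the assumption is that the bytes were really written as UTF-8 by some upstream tool:

```python
# Illustrative data only: pretend these bytes came from a legacy file
# that was actually written as UTF-8.
original = "caf\u00e9"               # 'café'
raw = original.encode("utf-8")       # b'caf\xc3\xa9'

# latin-1 assigns a character to every byte value 0-255, so decoding
# can never fail. The wrong guess goes unnoticed.
as_latin1 = raw.decode("latin-1")    # 'cafÃ©', one character too long

# ASCII, by contrast, refuses to guess about bytes >= 128.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

The latin-1 result is a valid string that is simply wrong, and nothing downstream can tell.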

I could get behind a data type that compares equal to strings for ASCII
only and allows for *storing* other characters, but making blind
assumptions about characters 128-255 seems like a recipe for disaster.
Imagine text[unknown] as a one-byte-per-character string type that supports
.decode() like bytes, and in which every character in the range 128-255
compares like NaN: unequal to every other character, not even equal to itself.
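
A rough sketch of those semantics in pure Python, just to pin down the idea. UnknownText is a hypothetical name, not an existing or proposed NumPy API, and it simplifies by treating a whole value as "NaN-like" as soon as it contains any byte >= 128:

```python
class UnknownText:
    """Hypothetical stand-in for a text[unknown] scalar (not a NumPy API)."""

    def __init__(self, raw: bytes):
        self.raw = raw  # arbitrary bytes are stored untouched

    def decode(self, encoding: str) -> str:
        # Interpreting the bytes is always an explicit choice by the user.
        return self.raw.decode(encoding)

    def __eq__(self, other):
        other_raw = other.raw if isinstance(other, UnknownText) else other
        if isinstance(other_raw, str):
            try:
                other_raw = other_raw.encode("ascii")
            except UnicodeEncodeError:
                return False  # a non-ASCII str can never match
        if not isinstance(other_raw, bytes):
            return NotImplemented
        # Assumption: any byte >= 128 on either side makes the comparison
        # behave like NaN, i.e. it is never equal, not even to itself.
        if any(b >= 128 for b in self.raw + other_raw):
            return False
        return self.raw == other_raw

    __hash__ = None  # NaN-like equality makes hashing unsafe


plain = UnknownText(b"abc")
messy = UnknownText(b"caf\xe9")  # last byte has no known encoding
```

With this, plain == "abc" is True, messy == messy is False, and messy.decode("latin-1") still works for a user who actually knows the encoding.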

