On Thu, 20 Apr 2017 10:26:13 -0700
Stephan Hoyer <sho...@gmail.com> wrote:
>
> I agree with Anne here. Variable-length encoding would be great to have,
> but even fixed-length UTF-8 (in terms of memory usage, not characters)
> would solve NumPy's Python 3 string problem. NumPy's memory model needs a
> fixed size per array element, but that doesn't mean we need a fixed size
> per character. Each element in a UTF-8 array would be a string with a fixed
> number of bytes, not characters.
>
> In fact, we already have this sort of distinction between element size and
> memory usage: np.string_ uses null padding to store shorter strings in a
> larger dtype.
>
> The only reason I see for supporting encodings other than UTF-8 is for
> memory-mapping arrays stored with those encodings, but that seems like a
> lot of extra trouble for little gain.
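(A minimal sketch of the scheme Stephan describes, using plain NumPy: dtype 'S'
already stores fixed-size, null-padded byte strings, and a fixed-size UTF-8
dtype could pad UTF-8-encoded bytes the same way. The to_fixed_utf8 helper
below is purely illustrative, not an existing NumPy API.)

    import numpy as np

    # dtype 'S5' allocates 5 bytes per element; shorter strings are null-padded
    a = np.array([b'abc', b'a'], dtype='S5')
    print(a.itemsize)   # 5
    print(a[1])         # b'a' -- trailing nulls are stripped on access

    def to_fixed_utf8(strings, nbytes):
        """Encode Python strings as UTF-8, null-padded to a fixed byte width."""
        out = np.zeros(len(strings), dtype='S%d' % nbytes)
        for i, s in enumerate(strings):
            b = s.encode('utf-8')
            if len(b) > nbytes:
                raise ValueError('%r does not fit in %d bytes' % (s, nbytes))
            out[i] = b
        return out

    u = to_fixed_utf8(['naïve', 'hello'], nbytes=8)
    print([x.decode('utf-8') for x in u])   # ['naïve', 'hello']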
I think you want at least: ascii, utf8, ucs2 (aka utf16 without surrogates),
utf32. That is, three common fixed-width encodings and one variable-width
encoding.

Regards

Antoine.
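(A quick illustration, in plain Python rather than NumPy, of the fixed- vs.
variable-width distinction above: ASCII is 1 byte per code point, UCS-2
(UTF-16 without surrogates) is 2, UTF-32 is 4, while UTF-8 takes 1-4 bytes
depending on the character.)

    for ch in ('A', 'é', '€'):
        print(ch,
              len(ch.encode('utf-8')),      # 1, 2, 3 bytes -- variable width
              len(ch.encode('utf-16-le')),  # 2 bytes each (all BMP characters)
              len(ch.encode('utf-32-le')))  # 4 bytes each -- fixed width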