On Thu, 20 Apr 2017 10:26:13 -0700
Stephan Hoyer <sho...@gmail.com> wrote:
>
> I agree with Anne here. Variable-length encoding would be great to have,
> but even fixed-length UTF-8 (in terms of memory usage, not characters)
> would solve NumPy's Python 3 string problem. NumPy's memory model needs a
> fixed size per array element, but that doesn't mean we need a fixed size
> per character. Each element in a UTF-8 array would be a string with a fixed
> number of bytes, not characters.
>
> In fact, we already have this sort of distinction between element size and
> memory usage: np.string_ uses null padding to store shorter strings in a
> larger dtype.
>
> The only reason I see for supporting encodings other than UTF-8 is for
> memory-mapping arrays stored with those encodings, but that seems like a
> lot of extra trouble for little gain.
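(A minimal sketch of the scheme Stephan describes, using plain NumPy: dtype 'S'
already stores fixed-size, null-padded byte strings, and a fixed-size UTF-8
dtype could pad UTF-8-encoded bytes the same way. The to_fixed_utf8 helper
below is purely illustrative, not an existing NumPy API.)

    import numpy as np

    # dtype 'S5' allocates 5 bytes per element; shorter strings are null-padded
    a = np.array([b'abc', b'a'], dtype='S5')
    print(a.itemsize)   # 5
    print(a[1])         # b'a' -- trailing nulls are stripped on access

    def to_fixed_utf8(strings, nbytes):
        """Encode Python strings as UTF-8, null-padded to a fixed byte width."""
        out = np.zeros(len(strings), dtype='S%d' % nbytes)
        for i, s in enumerate(strings):
            b = s.encode('utf-8')
            if len(b) > nbytes:
                raise ValueError('%r does not fit in %d bytes' % (s, nbytes))
            out[i] = b
        return out

    u = to_fixed_utf8(['naïve', 'hello'], nbytes=8)
    print([x.decode('utf-8') for x in u])   # ['naïve', 'hello']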
I think you want at least: ascii, utf8, ucs2 (aka utf16 without surrogates),
utf32. That is, three common fixed-width encodings and one variable-width
encoding.

Regards

Antoine.
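(A quick illustration, in plain Python rather than NumPy, of the fixed- vs.
variable-width distinction above: ASCII is 1 byte per code point, UCS-2
(UTF-16 without surrogates) is 2, UTF-32 is 4, while UTF-8 takes 1-4 bytes
depending on the character.)

    for ch in ('A', 'é', '€'):
        print(ch,
              len(ch.encode('utf-8')),      # 1, 2, 3 bytes -- variable width
              len(ch.encode('utf-16-le')),  # 2 bytes each (all BMP characters)
              len(ch.encode('utf-32-le')))  # 4 bytes each -- fixed width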