Re: [Numpy-discussion] proposal: smaller representation of string arrays

Robert Kern Tue, 25 Apr 2017 21:22:08 -0700

On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris wrote:


> The maximum length of a UTF-8 character is 4 bytes, so we could use that
> to size arrays by character length. The advantage over UTF-32 is that it is
> easily compressible, probably by a factor of 4 in many cases. That doesn't
> solve the in-memory problem, but does have some advantages on disk as well
> as making for easy display. We could compress it ourselves after encoding
> by truncation.

The major use case that we have for a UTF-8 array is HDF5, and it specifies
the width in bytes, not Unicode characters.
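
To make the byte-versus-character distinction concrete, here is a rough
sketch in plain Python (no NumPy or HDF5 calls). The helper names
utf8_width and truncate_utf8 are made up for illustration and are not part
of any library. It compares the character count, the actual UTF-8 byte
count, and the 4-bytes-per-character worst case, and shows one possible way
to fit a string into a fixed byte width without splitting a character.

```python
def utf8_width(s: str) -> int:
    """Number of bytes needed to store s as UTF-8."""
    return len(s.encode("utf-8"))


def truncate_utf8(s: str, width: int) -> bytes:
    """Fit s into at most `width` bytes of UTF-8 (hypothetical helper).

    A plain byte slice may cut a multi-byte character in half, so the
    incomplete trailing sequence is dropped by decoding with
    errors="ignore" and re-encoding.
    """
    data = s.encode("utf-8")[:width]
    return data.decode("utf-8", errors="ignore").encode("utf-8")


# ASCII letter, e-acute, euro sign, and an emoji outside the BMP.
samples = ["abc", "\u00e9", "\u20ac", "\U0001F600"]

for s in samples:
    n_chars = len(s)
    # characters, actual UTF-8 bytes, worst-case bytes at 4 per character
    print(ascii(s), n_chars, utf8_width(s), 4 * n_chars)

# A 2-byte field cannot hold "a" plus the 3-byte euro sign, so only "a" fits.
print(truncate_utf8("a\u20ac", 2))
```

A field declared as N bytes therefore holds anywhere from N/4 to N
characters depending on the data, which is why a width in bytes and a width
in characters are not interchangeable.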

--
Robert Kern

