On Tue, Apr 25, 2017 at 9:21 PM Robert Kern <robert.k...@gmail.com> wrote:

> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris <
> charlesr.har...@gmail.com> wrote:
>
> > The maximum length of a UTF-8 character is 4 bytes, so we could use
> > that to size arrays by character length. The advantage over UTF-32 is that
> > it is easily compressible, probably by a factor of 4 in many cases. That
> > doesn't solve the in memory problem, but does have some advantages on disk
> > as well as making for easy display. We could compress it ourselves after
> > encoding by truncation.
>
> The major use case that we have for a UTF-8 array is HDF5, and it
> specifies the width in bytes, not Unicode characters.
>

It's not just HDF5. Counting bytes is the Right Way to measure the size of
UTF-8 encoded text:
http://utf8everywhere.org/#myths
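
To make the bytes-versus-characters distinction concrete, here is a rough
sketch in plain Python. The helper names `utf8_width` and `fits` are made up
for illustration; they are not NumPy or HDF5 APIs, and the sample strings are
arbitrary.

```python
# Hypothetical helpers for illustration only; not part of NumPy or HDF5.

def utf8_width(text: str) -> int:
    """Number of bytes `text` occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits(text: str, width_in_bytes: int) -> bool:
    """True if `text` fits in a fixed-width field measured in bytes."""
    return utf8_width(text) <= width_in_bytes


# ASCII, Latin-1 range, CJK, and a code point outside the BMP.
samples = ["numpy", "na\u00efve", "\u65e5\u672c\u8a9e", "\U0001F600"]
for s in samples:
    # len(s) counts code points; utf8_width(s) counts encoded bytes.
    print(ascii(s), len(s), utf8_width(s))
```

The same number of code points can need anywhere from 1 to 4 bytes each, so
a field sized in bytes is the only width that tells you whether a given
string will actually fit.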

I also firmly believe (though clearly this is not universally agreed upon)
that UTF-8 is the Right Way to encode strings for *non-legacy*
applications. So if we're adding any new string encodings, UTF-8 needs to be
one of them.
