On Tue, Apr 25, 2017 at 9:21 PM Robert Kern <robert.k...@gmail.com> wrote:
> On Tue, Apr 25, 2017 at 6:27 PM, Charles R Harris
> <charlesr.har...@gmail.com> wrote:
>
> > The maximum length of an UTF-8 character is 4 bytes, so we could use
> > that to size arrays by character length. The advantage over UTF-32 is
> > that it is easily compressible, probably by a factor of 4 in many cases.
> > That doesn't solve the in memory problem, but does have some advantages
> > on disk as well as making for easy display. We could compress it
> > ourselves after encoding by truncation.
>
> The major use case that we have for a UTF-8 array is HDF5, and it
> specifies the width in bytes, not Unicode characters.

It's not just HDF5. Counting bytes is the Right Way to measure the size of
UTF-8 encoded text: http://utf8everywhere.org/#myths

I also firmly believe (though clearly this is not universally agreed upon)
that UTF-8 is the Right Way to encode strings for *non-legacy* applications.
So if we're adding any new string encodings, it needs to be one of them.
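To make the bytes-versus-characters distinction concrete, here is a rough sketch in plain Python (not NumPy or HDF5 code). The helper name utf8_truncate is hypothetical, made up for this illustration. It shows that character count and encoded byte count differ, and one way a fixed byte-width field might be filled without cutting a multi-byte character in half.

```python
def utf8_truncate(text: str, width: int) -> bytes:
    """Encode `text` as UTF-8 and cut it to at most `width` bytes,
    never splitting a multi-byte character.

    Hypothetical helper for illustration only; not a NumPy or HDF5 API.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= width:
        return encoded
    end = width
    # UTF-8 continuation bytes look like 0b10xxxxxx. If the first byte we
    # would drop is a continuation byte, the cut lands inside a character,
    # so back up until the cut sits on a character boundary.
    while end > 0 and (encoded[end] & 0xC0) == 0x80:
        end -= 1
    return encoded[:end]


# Character count and byte count are different measurements.
word = "naïve"
n_chars = len(word)                   # counts code points
n_bytes = len(word.encode("utf-8"))   # counts encoded bytes ("ï" takes 2)

# A code point outside the Basic Multilingual Plane needs the 4-byte maximum.
emoji_bytes = len("\U0001F600".encode("utf-8"))

# Fitting text into a 4-byte-wide field keeps only whole characters.
fitted = utf8_truncate("日本語", 4)
```

With a byte-sized field the number of characters that fit depends on the text: a 4-byte field holds four ASCII characters but only one of the three-byte characters above.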
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion