Re: UTF-8 question from Dive into Python 3

Tim Harig Wed, 19 Jan 2011 08:08:29 -0800

On 2011-01-19, Antoine Pitrou <[email protected]> wrote:
> On Wed, 19 Jan 2011 14:00:13 +0000 (UTC)
> Tim Harig <[email protected]> wrote:
>> UTF-8 has no apparent endianness if you only store it as a byte stream.
>> It does however have a byte order.  If you store it using multibytes
>> (six bytes for all UTF-8 possibilities), which is useful if you want
>> to have one storage container for each letter as opposed to one for
>> each byte(1)
>
> That's a ridiculous proposition. Why would you waste so much space?


Space is only one tradeoff.  There are many others to consider.  I have
created data structures with much higher overhead than that because
they happen to make the problem easier and significantly faster for the
operations that I am performing on the data.

For many operations, it is much faster and simpler to use one fixed-size
container per character than to scan an entire byte stream to work out
where each letter begins, or to use adaptive-size containers to store
the data.
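
To illustrate the tradeoff, here is a minimal sketch (not taken from any
real code base; the function name and sample string are made up for the
example).  Finding the n-th character in a UTF-8 byte stream means
walking the lead bytes from the start, while a one-container-per-character
list is a plain index.  The scanner assumes well-formed UTF-8 and does no
validation of continuation bytes.

```python
def nth_char_from_utf8(data: bytes, n: int) -> str:
    """Return the n-th character by scanning the byte stream from the start."""
    index = 0   # byte offset into the stream
    count = 0   # characters seen so far
    while index < len(data):
        lead = data[index]
        # The lead byte tells us how many bytes this character occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("unexpected byte at offset %d" % index)
        if count == n:
            return data[index:index + width].decode("utf-8")
        index += width
        count += 1
    raise IndexError("character index out of range")


# Hypothetical sample: 1-, 2-, 3- and 4-byte characters, then ASCII again.
text = "a\u00e9\u2603\U0001D11Ez"
stream = text.encode("utf-8")

# One container per character: a list of code points, indexed directly.
code_points = [ord(c) for c in text]

assert nth_char_from_utf8(stream, 3) == chr(code_points[3])
```

The list costs more memory per character, but the lookup no longer
depends on how far into the data the character sits.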

> UTF-8 exists *precisely* so that you can save space with most scripts.

UTF-8 has many reasons for existing.  One of the biggest is that it
is compatible with tools that were designed to process ASCII and other
8-bit encodings.
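
A small sketch of what that compatibility looks like in practice (the
function name and the sample record are invented for the example): pure
ASCII text encodes to the same bytes in UTF-8, and every byte of a
multibyte sequence is 0x80 or above, so byte-oriented code that splits on
ASCII delimiters keeps working without knowing about UTF-8 at all.

```python
def split_fields(line: bytes) -> dict:
    """Split b'key=value;key=value' on ASCII delimiters, byte by byte."""
    pairs = (field.split(b"=", 1) for field in line.split(b";"))
    return {key: value for key, value in pairs}


# ASCII text is byte-for-byte identical in ASCII and UTF-8.
assert "key=value".encode("ascii") == "key=value".encode("utf-8")

# No byte of a multibyte sequence can be mistaken for an ASCII character.
assert all(b >= 0x80 for b in "\u00e9\u2603".encode("utf-8"))

# Hypothetical record with non-ASCII values, processed purely as bytes.
record = "nom=caf\u00e9;ville=Z\u00fcrich".encode("utf-8")
fields = split_fields(record)
decoded = {k.decode("utf-8"): v.decode("utf-8") for k, v in fields.items()}
```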

> If you are ready to use 4+ bytes per character, just use UTF-32 which
> has much nicer properties.

I already mentioned UTF-32/UCS-4 as a possible alternative, but I might
not want to have to worry about converting the encodings back and forth
before and after processing them.  That said, and more importantly, many
variable-length byte streams may not have alternate representations the
way Unicode does.
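
For reference, the back-and-forth conversion in question is roughly the
following in Python 3.  This is only a sketch: the helper names are made
up, and little-endian UTF-32 without a BOM is an arbitrary choice for the
example.

```python
def utf8_to_utf32(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as fixed-width, little-endian UTF-32 (no BOM)."""
    return data.decode("utf-8").encode("utf-32-le")


def utf32_to_utf8(data: bytes) -> bytes:
    """Convert little-endian UTF-32 bytes back to UTF-8."""
    return data.decode("utf-32-le").encode("utf-8")


original = "a\u00e9\U0001D11E".encode("utf-8")  # 1 + 2 + 4 bytes
wide = utf8_to_utf32(original)                  # 4 bytes per character

assert len(wide) == 4 * 3
assert utf32_to_utf8(wide) == original
```

Note that the fixed-width form is where byte order becomes visible: the
same text encoded as "utf-32-be" gives different bytes.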
-- 
http://mail.python.org/mailman/listinfo/python-list