On 2011-01-19, Antoine Pitrou <solip...@pitrou.net> wrote:
> On Wed, 19 Jan 2011 14:00:13 +0000 (UTC)
> Tim Harig <user...@ilthio.net> wrote:
>> UTF-8 has no apparent endianness if you only store it as a byte stream.
>> It does however have a byte order.  If you store it using multibytes
>> (six bytes for all UTF-8 possibilities), which is useful if you want
>> to have one storage container for each letter as opposed to one for
>> each byte(1)
>
> That's a ridiculous proposition. Why would you waste so much space?

Space is only one tradeoff; there are many others to consider.  I have
created data structures with much higher overhead than that because they
made the problem easier and made the operations I was performing on the
data significantly faster.  For many operations, it is simply faster and
simpler to use one container per character than to process an entire
byte stream to work out which bytes belong to which letter, or to use
adaptive-size containers to store the data.

> UTF-8 exists *precisely* so that you can save space with most scripts.

UTF-8 has many reasons for existing.  One of the biggest is that it is
compatible with tools that were designed to process ASCII and other
8-bit encodings.

> If you are ready to use 4+ bytes per character, just use UTF-32 which
> has much nicer properties.

I already mentioned UTF-32/UCS-4 as a probable alternative; but I might
not want to have to worry about converting between encodings before and
after processing.  That said, and more importantly, many variable-length
byte streams may not have alternate representations as Unicode does.
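
To make the tradeoff concrete, here is a minimal Python 3 sketch.  The
helper name utf8_char_at and the sample string are made up for
illustration, and a plain list of code points stands in for "one
container per character"; it is not meant as the only or best way to
do this.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the index-th character by scanning a UTF-8 byte stream.

    Hypothetical helper: every lookup walks the bytes from the start,
    because character boundaries are not known in advance.
    """
    if index < 0:
        raise IndexError(index)
    count = -1
    start = None
    for pos, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; any other byte
        # begins a new character.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is not None:
        # The requested character was the last one in the stream.
        return data[start:].decode("utf-8")
    raise IndexError(index)


text = "h\u00e9llo \u20ac"          # 7 characters: h, e-acute, l, l, o, space, euro sign
encoded = text.encode("utf-8")     # variable width: 1, 2 or 3 bytes per character here

# One container per character: more space, but direct indexing.
code_points = [ord(c) for c in text]

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
ascii_compatible = "plain".encode("utf-8") == "plain".encode("ascii")
```

With the byte stream, utf8_char_at(encoded, 6) has to pass over every
earlier byte before it finds the euro sign; with the list,
chr(code_points[6]) is a single index operation.  The price is that
the list uses one full container per character where the byte stream
uses 10 bytes for all 7 characters.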