On 2011-01-19, Antoine Pitrou <solip...@pitrou.net> wrote:
> On Wed, 19 Jan 2011 18:02:22 +0000 (UTC)
> Tim Harig <user...@ilthio.net> wrote:
>> Converting to a fixed byte
>> representation (UTF-32/UCS-4) or separating all of the bytes for each
>> UTF-8 into 6 byte containers both make it possible to simply index the
>> letters by a constant size. You will note that Python does the
>> former.
>
> Indeed, Python chose the wise option. Actually, I'd be curious of any
> real-world software which successfully chose your proposed approach.

The point is basically the same. I created an example because it was simpler to follow for demonstration purposes than an actual UTF-8 conversion to any official multibyte format. You obviously have no other purpose than to be contrary, so we ended up following tangents.

As soon as you start to convert to a multibyte format, the endian issues occur. For UTF-8 on big endian hardware, this is anti-climactic because all of the bits are already stored in proper order. Little endian systems will probably convert to a native endian format.

If you choose to ignore that, that is your prerogative. Have a nice day.
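
For anyone following along, here is a minimal Python sketch of the endian point above. It is only an illustration, not code from this thread: the sample string and the helper name `char_at` are made up for the example. It assumes nothing beyond the standard `utf-8`, `utf-32-be` and `utf-32-le` codecs. UTF-8 has one byte serialization, while the fixed-width UTF-32 form can be indexed by a constant size but comes in two byte orders.

```python
# Sample text mixing 1-, 2- and 3-byte UTF-8 sequences (hypothetical example).
text = "h\u00e9llo \u20ac"

# UTF-8 is a byte stream: there is only one serialization.
utf8 = text.encode("utf-8")

# UTF-32 uses 4 bytes per code point, in either byte order.
utf32_be = text.encode("utf-32-be")
utf32_le = text.encode("utf-32-le")


def char_at(buf, i, byteorder):
    """Return character i of a BOM-less UTF-32 buffer (hypothetical helper).

    Fixed width means character i always occupies bytes [4*i, 4*i + 4),
    but the caller has to know which byte order the buffer uses.
    """
    return chr(int.from_bytes(buf[4 * i:4 * i + 4], byteorder))


print(len(text), len(utf8), len(utf32_be))   # characters vs. encoded sizes
print(utf32_be[4:8], utf32_le[4:8])          # same code point, bytes reversed
print(char_at(utf32_be, 1, "big"), char_at(utf32_le, 6, "little"))
```

Reading a buffer with the wrong `byteorder` yields a different integer for every non-zero code point, which is why the byte order has to be fixed or signalled once the text leaves UTF-8.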