Considering your post contained no information or evidence for your negations, I shouldn't even bother responding. I will bite once. Hopefully next time your arguments will contain some pith.

On 2011-01-19, Antoine Pitrou <solip...@pitrou.net> wrote:
> On Wed, 19 Jan 2011 11:34:53 +0000 (UTC)
> Tim Harig <user...@ilthio.net> wrote:
>> That is why the FAQ I linked to
>> says yes to the fact that you can consider UTF-8 to always be in big-endian
>> order.
>
> It certainly doesn't. Read better.

- Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
- yes, then can I still assume the remaining UTF-8 bytes are in big-endian
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- order?
  ^^^^^^
-
- A: Yes, UTF-8 can contain a BOM. However, it makes no difference as
     ^^^
- to the endianness of the byte stream. UTF-8 always has the same byte
                                   ^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- order. An initial BOM is only used as a signature -- an indication that
  ^^^^^^
- an otherwise unmarked text file is in UTF-8. Note that some recipients of
- UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently
- in 8-bit environments, the use of a BOM will interfere with any protocol
- or file format that expects specific ASCII characters at the beginning,
- such as the use of "#!" of at the beginning of Unix shell scripts.

The question that was not addressed was whether you can consider UTF-8 to be little-endian. I pointed out why you cannot always make that assumption in my previous post.

UTF-8 has no apparent endianness if you only store it as a byte stream. It does, however, have a byte order. If you store it in multibyte containers (six bytes for all UTF-8 possibilities), which is useful if you want one storage container for each letter as opposed to one for each byte(1), the bytes will still have the same order, but you have interrupted its sole existence as a byte stream and returned it to the underlying multibyte-oriented representation. If you attempt any numeric or binary operations on what is now a multibyte sequence, the processor will interpret the data using its own endian rules.
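
To make the multibyte-container point concrete, here is a minimal sketch in Python 3 (int.from_bytes needs 3.2 or later). The helper name utf8_as_int and the choice of the euro sign are only assumptions for illustration. It packs the UTF-8 bytes of one character into a single integer under each byte order, which is roughly what happens when a processor treats the container as a number:

```python
import sys

def utf8_as_int(ch, byteorder):
    """Interpret the UTF-8 bytes of one character as a single integer.

    Hypothetical helper: it stands in for "store the bytes in one
    multibyte container and let the processor read it as a number".
    """
    data = ch.encode("utf-8")
    return int.from_bytes(data, byteorder)

euro = "\u20ac"                  # EURO SIGN, three bytes in UTF-8
stream = euro.encode("utf-8")

# The byte stream itself is identical on every machine.
print(stream.hex())                      # e282ac

# Read as one number, the result depends on the byte order applied.
print(hex(utf8_as_int(euro, "big")))     # 0xe282ac, same order as the stream
print(hex(utf8_as_int(euro, "little")))  # 0xac82e2, bytes effectively reversed

# Byte order this machine uses natively: 'little' or 'big'.
print(sys.byteorder)
```

Only the big-endian reading keeps the bytes in the order they have in the stream; the little-endian reading is the reordering I am talking about.
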
If your processor is big-endian, then you don't have any problems. The processor will interpret the data in the order in which it is stored. If your processor is little-endian, then it will effectively change the order of the bytes for its own evaluation.

So, you can always assume big-endian and things will work out correctly, while you cannot always make the same assumption for little-endian without potential issues. The same holds true for any byte stream data. That is why I say that byte streams are essentially big-endian. It is all a matter of how you look at it.

I prefer to look at all data as endian even if it doesn't create endian issues, because it forces me to consider any endian issues that might arise. If none do, I haven't really lost anything. If you simply assume that a byte sequence cannot have endian issues, you ignore the possibility that such issues might arise. When an issue like the above does, you end up with a potential bug.

(1) For Unicode it is probably better to convert characters to UTF-32/UCS-4 for internal processing; but creating a container large enough to hold any length of UTF-8 character will work.

-- 
http://mail.python.org/mailman/listinfo/python-list