On 2011-01-19, Tim Roberts <t...@probo.com> wrote:
> Tim Harig <user...@ilthio.net> wrote:
>>On 2011-01-17, carlo <syseng...@gmail.com> wrote:
>>
>>> 2- If that were true, can you point me to some documentation about the
>>> math that, as Mark says, demonstrates this?
>>
>>It is true because UTF-8 is essentially an 8 bit encoding that resorts
>>to the next bit once it exhausts the addressible space of the current
>>byte it moves to the next one. Since the bytes are accessed and assessed
>>sequentially, they must be in big-endian order.
>
> You were doing excellently up to that last phrase. Endianness only applies
> when you treat a series of bytes as a larger entity. That doesn't apply to
> UTF-8. None of the bytes is more "significant" than any other, so by
> definition it is neither big-endian or little-endian.

It depends on how you process it, and it generally doesn't make much
difference in Python. Accessing UTF-8 data from C can be much trickier if
you use a multibyte type to store the data. In that case, if you happen to
be on a little-endian architecture, you may need to remember that the data
is not in the order your processor expects for numeric operations and
comparisons.

That is why the FAQ I linked to answers yes: you can consider UTF-8 to
always be in big-endian order. Essentially all byte-based data is
big-endian.
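
As a rough illustration of the comparison issue, here is a minimal sketch
using Python's struct module to mimic loading two UTF-8 bytes into a 16-bit
integer under either byte order. The helper load_u16 and the two sample
characters are my own choices for illustration, not anything taken from the
FAQ.

```python
import struct

def load_u16(two_bytes, byte_order):
    """Interpret two bytes as an unsigned 16-bit integer.

    byte_order is '>' for big-endian or '<' for little-endian.
    (Hypothetical helper, used only for this illustration.)
    """
    return struct.unpack(byte_order + "H", two_bytes)[0]

# Two adjacent code points whose UTF-8 forms are two bytes each.
a = "\u00ff".encode("utf-8")  # b'\xc3\xbf'
b = "\u0100".encode("utf-8")  # b'\xc4\x80'

# Comparing the raw byte strings matches code point order.
bytewise_less = a < b

# Loading the same bytes as big-endian integers keeps that order ...
big_less = load_u16(a, ">") < load_u16(b, ">")

# ... while loading them as little-endian integers reverses it.
little_less = load_u16(a, "<") < load_u16(b, "<")

print(bytewise_less, big_less, little_less)
```

Here the byte-wise and big-endian comparisons agree with the code point
order, and the little-endian one does not. That is the sort of surprise I
mean when the bytes are stored in a multibyte type.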