On Jan 19, 11:33 pm, Terry Reedy <tjre...@udel.edu> wrote:
> On 1/19/2011 1:02 PM, Tim Harig wrote:
>
> > Right, but I only have to do that once. After that, I can directly address
> > any piece of the stream that I choose. If I leave the information as a
> > simple UTF-8 stream, I would have to walk the stream again, I would have to
> > walk through the first byte of all the characters from the beginning to
> > make sure that I was only counting multibyte characters once until I found
> > the character that I actually wanted. Converting to a fixed byte
> > representation (UTF-32/UCS-4) or separating all of the bytes for each
> > UTF-8 into 6 byte containers both make it possible to simply index the
> > letters by a constant size. You will note that Python does the former.
>
> The idea of using a custom fixed-width padded version of a UTF-8 stream
> was initially shocking to me, but I can imagine that there are
> specialized applications, which slice-and-dice uninterpreted segments,
> for which that is appropriate. However, it is not germane to the folly
> of prefixing standard UTF-8 streams with a 3-byte magic number,
> mislabelled a 'byte-order-mark', thus making them non-standard.
>

Unicode Book, 5.2.0, Chapter 2, Section 14, Page 51 - Paragraph *Unicode Signature*.
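
For anyone who wants to poke at the two points discussed above, here is a rough sketch (Python 3, standard library only). The helper names nth_char_utf8 and nth_char_utf32 are made up for illustration and are not part of any library. It assumes well-formed UTF-8 input and is only meant to show the trade-off between walking a variable-width stream and indexing a fixed-width one, plus what the 3-byte signature looks like to a decoder.

```python
import codecs


def nth_char_utf8(data: bytes, n: int) -> str:
    """Return the n-th code point of well-formed UTF-8 by walking from the start."""
    count = -1
    start = None
    for i, b in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts a character.
        if (b & 0xC0) != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")  # the n-th character was the last one


def nth_char_utf32(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-32-LE data (no BOM) by direct offset."""
    chunk = data[4 * n:4 * n + 4]
    if len(chunk) != 4:
        raise IndexError(n)
    return chunk.decode("utf-32-le")


text = "na\u00efve caf\u00e9"  # contains two 2-byte characters in UTF-8
utf8_bytes = text.encode("utf-8")
utf32_bytes = text.encode("utf-32-le")

# Same answer either way; only the cost of getting there differs.
print(nth_char_utf8(utf8_bytes, 2), nth_char_utf32(utf32_bytes, 2), text[2])

# The "Unicode signature": EF BB BF in front of otherwise ordinary UTF-8.
signed = codecs.BOM_UTF8 + utf8_bytes
plain_decode = signed.decode("utf-8")    # keeps U+FEFF as the first character
sig_decode = signed.decode("utf-8-sig")  # strips the signature if present
print(repr(plain_decode[0]), sig_decode == text)
```

The UTF-8 lookup has to scan every lead byte up to the target, while the UTF-32 lookup is a single slice. The last two lines show that the plain "utf-8" codec passes the signature through as U+FEFF, whereas "utf-8-sig" removes it.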