On 1/19/2011 1:02 PM, Tim Harig wrote:
Right, but I only have to do that once. After that, I can directly address any piece of the stream that I choose. If I leave the information as a simple UTF-8 stream, I would have to walk the stream again: I would have to walk through the first byte of every character from the beginning, making sure that I counted each multibyte character only once, until I found the character that I actually wanted. Converting to a fixed-width representation (UTF-32/UCS-4) or separating the bytes of each UTF-8 character into 6-byte containers both make it possible to simply index the letters by a constant size. You will note that Python does the former.
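To make the trade-off described above concrete, here is a minimal sketch (not code from this thread) contrasting the two access patterns. The helper names `nth_char_utf8` and `nth_char_utf32` are made up for illustration, and UTF-32-LE is used only as one example of a fixed-width encoding.

```python
def nth_char_utf8(data: bytes, n: int) -> str:
    """Return the n-th character (0-based) by scanning UTF-8 from the start."""
    count = -1
    start = None
    for i, b in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts a character.
        if b & 0xC0 != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is not None:
        # The requested character was the last one in the stream.
        return data[start:].decode("utf-8")
    raise IndexError(n)


def nth_char_utf32(data: bytes, n: int) -> str:
    """Return the n-th character (0-based) from BOM-less UTF-32-LE by direct offset."""
    if not 0 <= n < len(data) // 4:
        raise IndexError(n)
    return data[4 * n:4 * n + 4].decode("utf-32-le")


text = "naïve €uro 𝄞 clef"        # mixes 1-, 2-, 3- and 4-byte UTF-8 sequences
utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")  # the "-le" codec writes no BOM

for i, ch in enumerate(text):
    assert nth_char_utf8(utf8, i) == ch    # O(n) walk per lookup
    assert nth_char_utf32(utf32, i) == ch  # O(1) offset arithmetic
```

The UTF-8 lookup has to inspect every byte before the target, while the fixed-width lookup is plain offset arithmetic, at the cost of four bytes per character.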
The idea of using a custom fixed-width padded version of a UTF-8 stream was initially shocking to me, but I can imagine that there are specialized applications, which slice and dice uninterpreted segments, for which that is appropriate. However, it is not germane to the folly of prefixing standard UTF-8 streams with a 3-byte magic number, mislabelled a 'byte-order mark', thus making them non-standard.
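For readers who have not met the 3-byte prefix in question, a small sketch using only the standard library's `codecs.BOM_UTF8` constant and the `utf-8-sig` codec shows what it looks like and how it leaks into decoded text when a plain UTF-8 decoder reads it. The sample payload is arbitrary.

```python
import codecs

payload = "hello".encode("utf-8")
with_bom = codecs.BOM_UTF8 + payload  # what some editors write as "UTF-8"

# The prefix is the UTF-8 encoding of U+FEFF: three bytes.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# A plain UTF-8 decoder keeps the prefix as a leading U+FEFF character.
plain = with_bom.decode("utf-8")

# The 'utf-8-sig' codec strips the prefix if present and is harmless if absent.
stripped = with_bom.decode("utf-8-sig")
unchanged = payload.decode("utf-8-sig")

print(repr(plain))
print(repr(stripped))
```

Since UTF-8 is a byte sequence with no byte-order ambiguity, the prefix carries no ordering information; it acts only as a signature that consumers must know to skip.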
--
Terry Jan Reedy