On 1/19/2011 1:02 PM, Tim Harig wrote:

Right, but I only have to do that once.  After that, I can directly address
any piece of the stream that I choose.  If I leave the information as a
simple UTF-8 stream, I would have to walk the stream again, examining the
first byte of every character from the beginning to make sure that I was
only counting multibyte characters once, until I found the character that I
actually wanted.  Converting to a fixed byte representation (UTF-32/UCS-4)
or separating all of the bytes for each UTF-8 character into 6 byte
containers both make it possible to simply index the letters by a constant
size.  You will note that Python does the former.
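
The trade-off described above can be sketched in a few lines of Python. This is only an illustration, not code from the thread: the helper names utf8_index and utf32_index and the sample string are made up for the example, and the UTF-8 helper assumes its input is already valid UTF-8.

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th character of valid UTF-8 bytes by scanning from the start."""
    if n < 0:
        raise IndexError(n)
    count = -1      # index of the character whose first byte was seen last
    start = None    # byte offset where the n-th character begins
    for i, b in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts a character.
        if b & 0xC0 != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")  # the n-th character was the last one


def utf32_index(data: bytes, n: int) -> str:
    """Return the n-th character of UTF-32-LE bytes by direct offset arithmetic."""
    if n < 0 or 4 * n + 4 > len(data):
        raise IndexError(n)
    return data[4 * n:4 * n + 4].decode("utf-32-le")


text = "naïve €uro"              # hypothetical sample with 2-byte and 3-byte characters
u8 = text.encode("utf-8")        # variable width: the scan cost grows with n
u32 = text.encode("utf-32-le")   # fixed width: every character occupies 4 bytes

print(utf8_index(u8, 6), utf32_index(u32, 6))
```

The UTF-8 lookup has to visit every byte before the target, while the UTF-32 lookup is a single slice at offset 4*n.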

The idea of using a custom fixed-width padded version of a UTF-8 stream was initially shocking to me, but I can imagine that there are specialized applications, which slice and dice uninterpreted segments, for which that is appropriate. However, it is not germane to the folly of prefixing standard UTF-8 streams with a 3-byte magic number, mislabelled a 'byte-order mark', thus making them non-standard.
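
For readers who have not met the 3-byte prefix in practice, here is a small sketch using Python's standard codecs module. The sample payload is made up for the example; the point is only that the plain "utf-8" codec treats the prefix as data, while "utf-8-sig" strips it.

```python
import codecs

payload = "héllo".encode("utf-8")        # hypothetical sample text
with_bom = codecs.BOM_UTF8 + payload     # the same bytes behind the 3-byte prefix

# The plain codec keeps the prefix as a leading U+FEFF character.
decoded_plain = with_bom.decode("utf-8")

# The "utf-8-sig" codec removes the prefix if it is present.
decoded_sig = with_bom.decode("utf-8-sig")

print(len(codecs.BOM_UTF8), repr(decoded_plain[0]), decoded_sig)
```

A consumer that expects standard UTF-8 and uses the plain codec ends up with one extra character at the front of the text.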

--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list
