On Sat, Dec 1, 2012 at 2:30 AM, Steven D'Aprano <st...@pearwood.info> wrote:
>
>> The length and order of the optional byte order mark (BOM)
>> distinguishes UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
>
> That's not quite right. The UTF-16BE and UTF-16LE character sets do
> not take BOMs, because the encoding already specifies the byte order:

Right, that was as clear as mud. What I meant is that the BOM is added
to distinguish UTF-16 from UTF-32 and little vs big endian in a
generic text stream. It's the nature of the stream itself to which I
was referring, not to specific names assigned in the Unicode standard.
For example, adding a BOM to a string encoded as UTF-16LE for a
Windows registry REG_SZ value would be redundant and wrong.
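
To illustrate the distinction, here is a minimal sketch in Python 3
syntax (an assumption on my part; the sample string is arbitrary). The
generic "utf-16" codec writes a BOM, while the explicit-endian codecs
do not:

```python
import codecs

text = "hello"  # arbitrary sample string, used only for illustration

# The generic codec prepends a BOM in the platform's native byte order.
generic = text.encode("utf-16")

# The explicit-endian codecs write no BOM; the name already fixes the order.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

has_bom = generic[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom)                          # True
print(len(generic), len(le), len(be))   # 12 10 10
```

The two extra bytes in the generic result are the BOM; what follows
them is identical to one of the explicit-endian encodings.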

The bytes produced by encoding U+FEFF (zero width no-break space)
identify the transform format as well as the byte order. So I do think
of it more like a signature than just a byte order mark.
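
As a rough sketch of the "signature" idea, the function below guesses
the transform format from the leading bytes of a stream. The name
sniff_bom and the SIGNATURES table are hypothetical, made up for this
example; only the codecs.BOM_* constants come from the standard
library:

```python
import codecs

# Check the 4-byte UTF-32 signatures first: BOM_UTF32_LE begins with the
# same two bytes as BOM_UTF16_LE, so the order of this table matters.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no signature is found."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom("\ufeffhi".encode("utf-16-be")))  # ('utf-16-be', 2)
print(sniff_bom(b"hi"))                           # (None, 0)
```

This is only a heuristic. UTF-16LE data that starts with a BOM
followed by U+0000 has the same first four bytes as the UTF-32LE
signature, so a real detector would need more context.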

Digressions about the UTF BOM aside, the more salient point I wanted
to make is that the transform formats are multibyte encodings (except
ASCII in UTF-8), which means the expression str(len(hello)) is using
the wrong length; it needs to use the length of the encoded string.
Also, UTF-16 and UTF-32 typically contain many null bytes. Together,
these two observations explain the error: "'unicode_internal' codec
can't decode byte 0x00 in position 12: truncated input".
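
Both observations can be checked with a short sketch. It is written in
Python 3 syntax and uses the utf-16-le and utf-32-le codecs as
stand-ins (my assumption, not the unicode_internal codec from the
error above); the variable name hello just mirrors the expression
under discussion:

```python
hello = "hello"

# Character count versus encoded byte count, plus the number of null bytes.
for name in ("utf-8", "utf-16-le", "utf-32-le"):
    encoded = hello.encode(name)
    print(name, len(hello), len(encoded), encoded.count(b"\x00"))
# utf-8 5 5 0
# utf-16-le 5 10 5
# utf-32-le 5 20 15

# Passing only len(hello) bytes of the UTF-16 data cuts a code unit in half.
truncated = hello.encode("utf-16-le")[:len(hello)]
try:
    truncated.decode("utf-16-le")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)  # False
```

Using len(hello.encode(...)) instead of len(hello) gives the byte count
that the consumer of the encoded string actually needs.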
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
