On Sat, Dec 1, 2012 at 2:30 AM, Steven D'Aprano <st...@pearwood.info> wrote:
>
>> The length and order of the optional byte order mark (BOM)
>> distinguishes UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
>
> That's not quite right. The UTF-16BE and UTF-16LE character sets do
> not take BOMs, because the encoding already specifies the byte order:

Right, that was as clear as mud. What I meant is that the BOM is added
to distinguish UTF-16 from UTF-32, and little endian from big endian, in
a generic text stream. I was referring to the nature of the stream
itself, not to the specific names assigned in the Unicode standard. For
example, adding a BOM to a string encoded as UTF-16LE for a Windows
registry REG_SZ value would be redundant and wrong.

Encoding U+FEFF (zero width no-break space) also determines the
transform format in addition to the byte order. So I do think of it more
as a signature than as just a byte order mark.

Digressions about the UTF BOM aside, the more salient point I wanted to
make is that the transform formats are multibyte encodings (except for
the ASCII range in UTF-8), which means the expression str(len(hello)) is
using the wrong length; it needs to use the length of the encoded
string. Also, UTF-16 and UTF-32 text typically contains a great many
null bytes. Together, these two observations explain the error:
"'unicode_internal' codec can't decode byte 0x00 in position 12:
truncated input".
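
To make the two points above concrete, here is a minimal sketch. It assumes Python 3 (where str.encode returns bytes) rather than the original poster's code, which is not shown in this message. The helper sniff_bom and the table _SIGNATURES are hypothetical names made up for illustration; only the codecs.BOM_* constants and str.encode come from the standard library.

```python
import codecs

# Check the 4-byte UTF-32 marks before the 2-byte UTF-16 marks, because
# BOM_UTF32_LE (FF FE 00 00) begins with BOM_UTF16_LE (FF FE).
# _SIGNATURES and sniff_bom are illustrative names, not a stdlib API.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (codec name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


hello = "hello"

# Codecs that name the byte order explicitly do not emit a BOM.
utf16le = hello.encode("utf-16-le")
utf32be = hello.encode("utf-32-be")

# A generic stream that carries the signature in front of the payload.
with_sig = codecs.BOM_UTF16_LE + utf16le

char_count = len(hello)            # number of characters
byte_count_16 = len(utf16le)       # number of bytes: two per character here
byte_count_32 = len(utf32be)       # number of bytes: four per character
nulls_16 = utf16le.count(b"\x00")  # ASCII text in UTF-16 is half null bytes

print(sniff_bom(with_sig), sniff_bom(utf16le))
print(char_count, byte_count_16, byte_count_32, nulls_16)
```

Any length field written alongside the encoded data should be byte_count_16 or byte_count_32, not char_count. Note that the sniffing is only a heuristic: a UTF-16LE stream whose first character after the BOM is U+0000 would be misread as UTF-32LE.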