Re: [Tutor] How does len() compute length of a string in UTF-8, 16, and 32?

eryk sun Mon, 07 Aug 2017 21:34:51 -0700

On Tue, Aug 8, 2017 at 3:20 AM, Cameron Simpson <c...@cskk.id.au> wrote:
>
> As you note, the 16 and 32 forms are (6 + 1) times 2 or 4 respectively. This
> is because each encoding has a leading byte order mark to indicate big
> endianness or little endianness. For big endian UTF-16 data that is \xfe\xff;
> for little endian data it would be \xff\xfe.


To avoid encoding a byte order mark (BOM), use an "le" or "be" suffix, e.g.

    >>> 'Hello!'.encode('utf-16le')
    b'H\x00e\x00l\x00l\x00o\x00!\x00'
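
As a quick check, here is a small sketch comparing the encoded lengths
with and without the BOM. The variable names are only for illustration,
and the byte order chosen by plain 'utf-16' depends on the platform.

```python
import codecs

s = 'Hello!'

# Plain 'utf-16' / 'utf-32' prepend a BOM in the platform's native byte order.
with_bom_16 = s.encode('utf-16')
with_bom_32 = s.encode('utf-32')

# An explicit "le" or "be" suffix fixes the byte order and omits the BOM.
le_16 = s.encode('utf-16le')
be_16 = s.encode('utf-16be')
le_32 = s.encode('utf-32le')

print(len(with_bom_16), len(le_16))   # 14 vs 12
print(len(with_bom_32), len(le_32))   # 28 vs 24

# The BOM is U+FEFF: FE FF when big endian, FF FE when little endian.
print(codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
```

The difference in each pair is exactly one code unit (2 or 4 bytes),
which is the BOM.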

Sometimes a data format specifies the byte order itself, which makes a
BOM redundant. For example, strings in the Windows registry use
UTF-16LE, without a BOM.
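
When reading such data back, the byte order has to be named explicitly
when decoding. A minimal sketch, where the bytes are made up to stand in
for BOM-less UTF-16LE data (they are not read from an actual registry):

```python
# Hypothetical raw bytes for a UTF-16LE string stored without a BOM;
# this value is made up for illustration.
raw = b'H\x00e\x00l\x00l\x00o\x00!\x00'

# The format already fixes the byte order, so name it explicitly.
text = raw.decode('utf-16le')
print(text, len(text))

# An explicit-order codec does not strip a BOM; it decodes it as U+FEFF.
with_bom = 'Hello!'.encode('utf-16')
native = 'utf-16le' if with_bom[:2] == b'\xff\xfe' else 'utf-16be'
kept = with_bom.decode(native)
print(repr(kept[0]), len(kept))
```

So the "le" and "be" codecs are best kept for data known to have no BOM,
while plain 'utf-16' consumes the BOM when decoding.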