[issue25325] UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove BOM on encode/decode

Daniel Blanchard Tue, 06 Oct 2015 09:39:50 -0700

New submission from Daniel Blanchard:

As I recently discovered when someone filed a PR on chardet (see 
https://github.com/chardet/chardet/issues/70), BOMs are not handled correctly 
by the endian-specific encodings UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE, 
but they are by the UTF-16 and UTF-32 encodings.


For example:

>>> 'foo'.encode('utf-16le')
b'f\x00o\x00o\x00'
>>> 'foo'.encode('utf-16')
b'\xff\xfef\x00o\x00o\x00'

You can see that when using UTF-16 (instead of UTF-16LE), you get the BOM 
correctly prepended to the bytes.
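
The same comparison can be run for all six codecs without depending on the 
machine's byte order. This is only a sketch: the variable names are mine, and 
it relies on codecs.BOM_UTF16 and codecs.BOM_UTF32 being the native-order 
BOMs, which is how the codecs module defines them.

```python
import codecs

text = 'foo'

# Generic codecs: the output starts with the native-order BOM.
utf16 = text.encode('utf-16')
utf32 = text.encode('utf-32')
has_bom_generic = (
    utf16.startswith(codecs.BOM_UTF16),
    utf32.startswith(codecs.BOM_UTF32),
)

# Endian-specific codecs: no BOM is written at all.
has_bom_specific = (
    text.encode('utf-16le').startswith(codecs.BOM_UTF16_LE),
    text.encode('utf-16be').startswith(codecs.BOM_UTF16_BE),
    text.encode('utf-32le').startswith(codecs.BOM_UTF32_LE),
    text.encode('utf-32be').startswith(codecs.BOM_UTF32_BE),
)

print(has_bom_generic)
print(has_bom_specific)
```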

If you were on a little endian system and purposefully wanted to create a 
UTF-16BE file that starts with a BOM, the only way to do it is:

>>> import codecs
>>> codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')
b'\xfe\xff\x00f\x00o\x00o'

This doesn't make a lot of sense to me.  Why is the BOM not prepended 
automatically when encoding with UTF-16BE?
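
Until that changes, the manual step can be wrapped in a small helper. This is 
a sketch of a workaround, not anything the codecs module provides: the names 
encode_with_bom and _BOMS are hypothetical, and the lookup assumes that 
codecs.lookup() normalizes the codec names to the hyphenated forms used as 
keys below.

```python
import codecs

# Hypothetical table: normalized codec name -> BOM for that byte order.
_BOMS = {
    'utf-16-le': codecs.BOM_UTF16_LE,
    'utf-16-be': codecs.BOM_UTF16_BE,
    'utf-32-le': codecs.BOM_UTF32_LE,
    'utf-32-be': codecs.BOM_UTF32_BE,
}


def encode_with_bom(text, encoding):
    """Encode text and prepend the BOM for an endian-specific codec.

    Hypothetical helper. Codecs not listed in _BOMS (including plain
    'utf-16' and 'utf-32', which already write a BOM) are left alone.
    """
    name = codecs.lookup(encoding).name  # e.g. 'UTF-16BE' -> 'utf-16-be'
    return _BOMS.get(name, b'') + text.encode(encoding)


print(encode_with_bom('foo', 'utf-16be'))
```

Called with 'utf-16be', this produces the same bytes as the manual 
concatenation above.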

Furthermore, if you were given a UTF-16BE file on a little endian system, you 
might think that this would be the correct way to decode it:

>>> (codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')).decode('utf-16be')
'\ufefffoo'

but as you can see, that leaves the BOM in the decoded string as U+FEFF.  
Strangely, decoding with UTF-16 works fine, however:

>>> (codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')).decode('utf-16')
'foo'
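
If the byte order is already known and the endian-specific codec has to be 
used, the BOM can be stripped by hand before decoding. Again this is only a 
sketch: decode_strip_bom and _BOMS are hypothetical names, and the helper 
assumes that a leading U+FEFF is always a BOM and never intended content.

```python
import codecs

# Hypothetical table: normalized codec name -> BOM for that byte order.
_BOMS = {
    'utf-16-le': codecs.BOM_UTF16_LE,
    'utf-16-be': codecs.BOM_UTF16_BE,
    'utf-32-le': codecs.BOM_UTF32_LE,
    'utf-32-be': codecs.BOM_UTF32_BE,
}


def decode_strip_bom(data, encoding):
    """Decode bytes, dropping a leading BOM that matches the codec.

    Hypothetical helper; assumes a leading U+FEFF is a BOM, not data.
    """
    bom = _BOMS.get(codecs.lookup(encoding).name)
    if bom and data.startswith(bom):
        data = data[len(bom):]
    return data.decode(encoding)


raw = codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')
print(repr(decode_strip_bom(raw, 'utf-16be')))
```

Input without a BOM passes through unchanged, so the helper can be applied 
whether or not the file was written with one.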

It seems to me that the endian-specific versions of UTF-16 and UTF-32 should be 
adding/removing the appropriate BOMs, and this is a long-standing bug.

----------
components: Unicode
messages: 252406
nosy: Daniel.Blanchard, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove 
BOM on encode/decode
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25325>
_______________________________________