New submission from Daniel Blanchard:

As I recently discovered when someone filed a PR on chardet (see https://github.com/chardet/chardet/issues/70), BOMs are not handled correctly by the endian-specific encodings UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE, but are by the UTF-16 and UTF-32 encodings.
For example:

>>> 'foo'.encode('utf-16le')
b'f\x00o\x00o\x00'
>>> 'foo'.encode('utf-16')
b'\xff\xfef\x00o\x00o\x00'

You can see that when using UTF-16 (instead of UTF-16LE), you get the BOM correctly prepended to the bytes. If you were on a little endian system and purposefully wanted to create a UTF-16BE file, the only way to do it is:

>>> codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')
b'\xfe\xff\x00f\x00o\x00o'

This doesn't make a lot of sense to me. Why is the BOM not prepended automatically when encoding with UTF-16BE?

Furthermore, if you were given a UTF-16BE file on a little endian system, you might think that this would be the correct way to decode it:

>>> (codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')).decode('utf-16be')
'\ufefffoo'

but as you can see that leaves the BOM on there. Strangely, decoding with UTF-16 works fine however:

>>> (codecs.BOM_UTF16_BE + 'foo'.encode('utf-16be')).decode('utf-16')
'foo'

It seems to me that the endian-specific versions of UTF-16 and UTF-32 should be adding/removing the appropriate BOMs, and this is a long-standing bug.

----------
components: Unicode
messages: 252406
nosy: Daniel.Blanchard, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE encodings don't add/remove BOM on encode/decode
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25325>
_______________________________________
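
For reference, the manual workaround shown in the report (prepending codecs.BOM_UTF16_BE by hand) can be generalized to all four endian-specific codecs. The following is only a minimal sketch: the helper names encode_with_bom and decode_strip_bom and the _BOMS table are hypothetical and not part of the codecs module; only codecs.lookup and the codecs.BOM_* constants come from the standard library. It also assumes that codecs.lookup() reports the canonical names in the hyphenated form used as dictionary keys below.

```python
import codecs

# Hypothetical lookup table: canonical codec name -> matching BOM bytes.
_BOMS = {
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}


def encode_with_bom(text, encoding):
    """Encode text with an endian-specific codec and prepend its BOM."""
    # Normalize aliases such as 'utf-16be' to the canonical codec name.
    name = codecs.lookup(encoding).name
    return _BOMS[name] + text.encode(name)


def decode_strip_bom(data, encoding):
    """Decode bytes with an endian-specific codec, dropping a leading BOM."""
    name = codecs.lookup(encoding).name
    bom = _BOMS[name]
    # Only strip the BOM that matches the requested byte order.
    if data.startswith(bom):
        data = data[len(bom):]
    return data.decode(name)


if __name__ == "__main__":
    raw = encode_with_bom("foo", "utf-16be")
    print(raw)
    print(decode_strip_bom(raw, "utf-16be"))
```

Both helpers raise KeyError for any codec that is not one of the four endian-specific encodings, which keeps the sketch from silently applying the wrong BOM.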