On Wed, Aug 10, 2016 at 04:40:31PM -0700, Steve Dower wrote:
> On 10Aug2016 1431, Chris Angelico wrote:
> >>* make the default open() encoding check for a BOM or else use utf-8
> >
> >-0.5. Is there any precedent for this kind of data-based detection
> >being the default?
There is precedent: the Python interpreter will accept a BOM instead of
an encoding cookie when importing .py files.

[Chris]
> >An explicit "utf-sig" could do a full detection,
> >but even then it's not perfect - how do you distinguish UTF-32LE from
> >UTF-16LE that starts with U+0000? BOMs are a heuristic, nothing more.

If you're reading arbitrary files that could start with anything, then
of course the detection can guess wrong. But then if I dumped a bunch
of arbitrary Unicode code points in your lap and asked you to guess the
language, you would likely get it wrong too :-)

[Chris]
> >Do you say "UTF-32 is rare so we'll
> >assume UTF-16", or do you say "files starting U+0000 are rare, so
> >we'll assume UTF-32"?

The way I have done auto-detection based on BOMs is to start by reading
four bytes from the file in binary mode. (If there are fewer than four
bytes, it cannot be a text file with a BOM.) Compare those first four
bytes against the UTF-32 BOMs first, and the UTF-16 BOMs *second*
(otherwise UTF-16 will shadow UTF-32, since the UTF-32LE BOM begins
with the same two bytes as the UTF-16LE BOM). Note that there are two
BOMs for each (big-endian and little-endian). Then check for UTF-8,
and if you're really keen, UTF-7 and UTF-1.

    def bom2enc(bom, default=None):
        """Return encoding name from a four-byte BOM."""
        if bom.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
            return 'utf_32'
        elif bom.startswith((b'\xFE\xFF', b'\xFF\xFE')):
            return 'utf_16'
        elif bom.startswith(b'\xEF\xBB\xBF'):
            return 'utf_8_sig'
        elif bom.startswith(b'\x2B\x2F\x76'):
            # The UTF-7 BOM is 2B 2F 76 followed by 38, 39, 2B or 2F.
            if len(bom) == 4 and bom[3] in b'\x2B\x2F\x38\x39':
                return 'utf_7'
        elif bom.startswith(b'\xF7\x64\x4C'):
            return 'utf_1'
        if default is None:
            raise ValueError('no recognisable BOM signature')
        return default

[Steve Dower]
> The BOM exists solely for data-based detection, and the UTF-8 BOM is
> different from the UTF-16 and UTF-32 ones. So we either find an exact
> BOM (which IIRC decodes as a no-op spacing character, though I have a
> feeling some version of Unicode redefined it exclusively for being the
> marker) or we use utf-8.

The Byte Order Mark is always U+FEFF encoded into whatever bytes your
encoding uses. You should never use U+FEFF except as a BOM, but of
course arbitrary Unicode strings might include it in the middle of the
string Just Because. In that case, it may be interpreted as a legacy
"ZERO WIDTH NO-BREAK SPACE" character. But new content should never do
that: you should use U+2060 "WORD JOINER" instead, and treat a U+FEFF
inside the body of your file or string as an unsupported character.

http://www.unicode.org/faq/utf_bom.html#BOM

[Steve]
> But the main reason for detecting the BOM is that currently opening
> files with 'utf-8' does not skip the BOM if it exists. I'd be quite
> happy with changing the default encoding to:
>
> * utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
> * utf-8 when writing (so the BOM is *not* written)

Sounds reasonable to me. Rather than hard-coding that behaviour, can we
have a new encoding that does that? "utf-8-readsig" perhaps.

[Steve]
> This provides the best compatibility when reading/writing files
> without making any guesses. We could reasonably extend this to read
> utf-16 and utf-32 if they have a BOM, but that's an extension and not
> necessary for the main change.

The use of a BOM is always a guess :-) Maybe I just happen to have a
Latin1 file that starts with "ï»¿", or a Mac Roman file that starts
with "Ôªø". Either case will be wrongly detected as UTF-8. That's the
risk you take when using a heuristic.
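To make that concrete, here is a quick, untested sketch of how the
bom2enc helper above might be used, along with the false positives just
described (open_with_bom is only an illustrative name, not a proposal):

    def open_with_bom(path, default='utf_8'):
        """Open a text file, guessing its encoding from any leading BOM."""
        # Read the first four bytes in binary mode and map them to an
        # encoding name; fall back to `default` if no BOM is recognised.
        # (Caveat: Python ships no 'utf_1' codec, so that branch of
        # bom2enc would raise LookupError at the open() below.)
        with open(path, 'rb') as f:
            enc = bom2enc(f.read(4), default)
        return open(path, encoding=enc)

    # The UTF-8 BOM bytes are perfectly legal text in other 8-bit
    # encodings, so the detection can misfire:
    bom = b'\xEF\xBB\xBF'
    print(bom.decode('latin-1'))    # ï»¿
    print(bom.decode('mac_roman'))  # Ôªø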
And if you don't want to use that heuristic, then you must specify the
actual encoding in use.

-- 
Steven D'Aprano
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/