On Thu, Aug 11, 2016 at 1:14 PM, Steven D'Aprano <st...@pearwood.info> wrote:
> On Wed, Aug 10, 2016 at 04:40:31PM -0700, Steve Dower wrote:
>
>> On 10Aug2016 1431, Chris Angelico wrote:
>> >>* make the default open() encoding check for a BOM or else use utf-8
>> >
>> >-0.5. Is there any precedent for this kind of data-based detection
>> >being the default?
>
> There is precedent: the Python interpreter will accept a BOM instead of
> an encoding cookie when importing .py files.

Okay, that's good enough for me.
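
Easily verified, too - a throwaway demo (module name made up):

    with open('bom_demo.py', 'wb') as f:
        f.write(b'\xef\xbb\xbfgreeting = "hello"\n')  # UTF-8 BOM, no coding cookie

    import bom_demo   # accepted: the BOM marks the source as UTF-8
    print(bom_demo.greeting)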

> [Chris]
>> >An explicit "utf-sig" could do a full detection,
>> >but even then it's not perfect - how do you distinguish UTF-32LE from
>> >UTF-16LE that starts with U+0000?
>
> BOMs are a heuristic, nothing more. If you're reading arbitrary files
> that could start with anything, then of course the heuristics can
> guess wrong. But then
> if I dumped a bunch of arbitrary Unicode codepoints in your lap and
> asked you to guess the language, you would likely get it wrong too :-)
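
True, and the UTF-16/UTF-32 case is exactly such a wrong guess waiting
to happen:

    >>> import codecs
    >>> codecs.BOM_UTF16_LE + '\x00'.encode('utf-16-le')
    b'\xff\xfe\x00\x00'
    >>> codecs.BOM_UTF32_LE
    b'\xff\xfe\x00\x00'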

I have my own mental heuristics, but I can't tell one Cyrillic-script
language from another. And some Slavic languages can be written in
either the Latin or the Cyrillic alphabet, just to further confuse
matters. Of course, "arbitrary Unicode codepoints" might not all come
from one language, and might not be any language at all.

(Do you wanna build a U+2603?)

> [Chris]
>> >Do you say "UTF-32 is rare so we'll
>> >assume UTF-16", or do you say "files starting U+0000 are rare, so
>> >we'll assume UTF-32"?
>
> The way I have done auto-detection based on BOMs is you start by reading
> four bytes from the file in binary mode. (If there are fewer than four
> bytes, it cannot be a text file with a BOM.)

Interesting. Are you assuming that a text file cannot be empty?
Because 0xFF 0xFE could represent an empty file in UTF-16, and 0xEF
0xBB 0xBF likewise for UTF-8. Or maybe you don't care about files with
less than one character in them?
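
(Current CPython does treat a lone BOM as a valid, empty text file:

    >>> ''.encode('utf-16')      # little-endian platform
    b'\xff\xfe'
    >>> ''.encode('utf-8-sig')
    b'\xef\xbb\xbf'

so a complete text file can be just two or three bytes long.)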

> Compare those first four
> bytes against the UTF-32 BOMs first, and the UTF-16 BOMs *second*
> (otherwise UTF-16 will shadow UTF-32). Note that there are two BOMs
> (big-endian and little-endian). Then check for UTF-8, and if you're
> really keen, UTF-7 and UTF-1.

For a default file-open encoding detection, I would minimize the
number of options. The UTF-7 BOM could be the beginning of a file
containing Base 64 data encoded in ASCII, which is a very real
possibility.
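
For instance, "+/v8" is a perfectly valid Base 64 group:

    >>> import base64
    >>> base64.b64decode('+/v8')
    b'\xfb\xfa\xfc'

and those four characters, as bytes (2B 2F 76 38), are exactly what
the UTF-7 check below accepts.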

>     elif bom.startswith(b'\x2B\x2F\x76'):
>         if len(bom) == 4 and bom[3] in b'\x2B\x2F\x38\x39':
>             return 'utf_7'

So I wouldn't include UTF-7 in the detection. Nor UTF-1. Both are
rare. Even UTF-32 doesn't necessarily have to be included. When was
the last time you saw a UTF-32LE-BOM file?
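
Something like this is all I'd keep - a minimal sketch (UTF-7 and
UTF-1 dropped; the no-BOM fallback shown as plain utf-8, standing in
for whatever the real default would be):

    import codecs

    def detect_bom(sample):
        # sample: the first four bytes of the file, read in binary mode.
        # Check UTF-32 before UTF-16, or the UTF-16LE BOM (FF FE) would
        # shadow the UTF-32LE one (FF FE 00 00).
        if sample.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
            return 'utf-32'      # this codec consumes its own BOM
        if sample.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return 'utf-16'      # likewise
        if sample.startswith(codecs.BOM_UTF8):
            return 'utf-8-sig'
        return 'utf-8'

The caller reads four bytes in binary mode, picks the codec, then
rewinds and decodes with it; utf-16, utf-32 and utf-8-sig all strip
the BOM themselves. As a bonus, a two-byte file of FF FE still comes
out as utf-16, which answers my empty-file quibble above.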

> [Steve]
>> But the main reason for detecting the BOM is that currently opening
>> files with 'utf-8' does not skip the BOM if it exists. I'd be quite
>> happy with changing the default encoding to:
>>
>> * utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
>> * utf-8 when writing (so the BOM is *not* written)
>
> Sounds reasonable to me.
>
> Rather than hard-coding that behaviour, can we have a new encoding that
> does that? "utf-8-readsig" perhaps.

+1. Makes the documentation simpler, too: the default value for
encoding no longer depends on the mode.
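
For reference, the current asymmetry:

    >>> b'\xef\xbb\xbfspam'.decode('utf-8')      # BOM leaks into the text
    '\ufeffspam'
    >>> b'\xef\xbb\xbfspam'.decode('utf-8-sig')  # BOM stripped
    'spam'
    >>> 'spam'.encode('utf-8-sig')               # but encoding adds one back
    b'\xef\xbb\xbfspam'

A "utf-8-readsig" would decode like utf-8-sig and encode like plain
utf-8.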

ChrisA