On Thu, Aug 11, 2016 at 9:40 AM, Steve Dower <steve.do...@python.org> wrote:
> On 10Aug2016 1431, Chris Angelico wrote:
>> I'd rather a single consistent default encoding.
>
> I'm proposing to make that single consistent default encoding utf-8. It
> sounds like we're in agreement?

Yes, we are. I was disagreeing with Random's suggestion that mbcs would
also serve. Defaulting to UTF-8 everywhere is (a) consistent on all
systems, regardless of settings; and (b) consistent with bytes.decode()
and str.encode(), both of which default to UTF-8.

>> -0.5. Is there any precedent for this kind of data-based detection
>> being the default? An explicit "utf-sig" could do a full detection,
>> but even then it's not perfect - how do you distinguish UTF-32LE from
>> UTF-16LE that starts with U+0000? Do you say "UTF-32 is rare so we'll
>> assume UTF-16", or do you say "files starting U+0000 are rare, so
>> we'll assume UTF-32"?
>
> The BOM exists solely for data-based detection, and the UTF-8 BOM is
> different from the UTF-16 and UTF-32 ones. So we either find an exact BOM
> (which IIRC decodes as a no-op spacing character, though I have a feeling
> some version of Unicode redefined it exclusively for being the marker) or we
> use utf-8.
>
> But the main reason for detecting the BOM is that currently opening files
> with 'utf-8' does not skip the BOM if it exists. I'd be quite happy with
> changing the default encoding to:
>
> * utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
> * utf-8 when writing (so the BOM is *not* written)
>
> This provides the best compatibility when reading/writing files without
> making any guesses. We could reasonably extend this to read utf-16 and
> utf-32 if they have a BOM, but that's an extension and not necessary for the
> main change.

AIUI the utf-8-sig encoding is happy to decode something that doesn't
have a signature, right? If so, then yes, I would definitely support
that mild mismatch in defaults. Chew up that UTF-8 aBOMination and just
use UTF-8 as is.

I've almost never seen files stored in UTF-32 (even UTF-16 isn't all
that common compared to UTF-8), so I wouldn't stress too much about
that.

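
On the utf-8-sig question above, here is a quick sketch of how the two
codecs appear to behave, assuming CPython 3's standard codecs. The sample
text and variable names are made up purely for illustration; this is not
code from the proposal itself.

```python
import codecs

# Hypothetical sample data: the same text with and without a UTF-8 BOM.
text = "h\u00e9llo"
without_bom = text.encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

# utf-8-sig skips a leading BOM if present, and accepts input without one.
decoded_with = with_bom.decode("utf-8-sig")
decoded_without = without_bom.decode("utf-8-sig")

# Plain utf-8 does not skip the BOM; it comes through as a leading U+FEFF.
decoded_plain = with_bom.decode("utf-8")

# Encoding with plain utf-8 does not write a BOM.
encoded = text.encode("utf-8")
```

So reading with utf-8-sig and writing with utf-8 should round-trip both
kinds of input to BOM-less UTF-8, which is the "mild mismatch" in
question.
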
Recognizing FE FF or FF FE and decoding as UTF-16 might be worth doing,
but it could easily be retrofitted (that byte sequence won't decode as
UTF-8).

>>> * force the console encoding to UTF-8 on initialize and revert on
>>> finalize
>>
>> -0 for Python itself; +1 for Python's interactive interpreter.
>> Programs that mess with console settings get annoying when they crash
>> out and don't revert properly. Unless there is *no way* that you could
>> externally kill the process without also bringing the terminal down,
>> there's the distinct possibility of messing everything up.
>
> The main problem here is that if the console is not forced to UTF-8 then it
> won't render any of the characters correctly.

Ehh, that's annoying. Is there a way to guarantee, at the process level,
that the console will be returned to "normal state" when Python exits?
If not, there's the risk that people run a Python program and then the
*next* program gets into trouble.

But if that happens only on abnormal termination ("I killed Python from
Task Manager, and it left stuff messed up so I had to close the
console"), it's probably an acceptable risk. And the benefit sounds well
worthwhile. Revising my recommendation to +0.9.

ChrisA
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/