Michael Foord writes:

 > This is why I'm keen that by *default* Python should honour the UTF8
 > signature when reading files;

Unfortunately, your caveat about "a lot of the time it will *seem* to work" applies to this as well. The only way that "honoring signatures" really works is if Python simply uses the UTF-8 codec on input and output by default, regardless of locale. Or perhaps Python should, by default, error out unless a signature is found.

Autodetection (i.e., doing something different depending on the presence or absence of the signature) does not really work, because for it to work correctly it needs to imply automatic resetting of the output codec as well. So what is your naive programmer supposed to expect when writing a cat program? Should the first encoding detected or defaulted determine the output codec? The last one? UTF-8 uber alles?

Such autodetection *can* be done fairly accurately. After 20 years of experimenting, Emacs has it pretty much right. But ... Emacs almost never runs without a human watching it, and the code that handles this is a mess of special cases and heuristics, not to mention that it throws more than a few exceptions in practice. And in practice, any decisions that need to be made about disambiguating the output codec are left up to the user.

 > particularly given that programmers who don't/can't/won't
 > understand encodings are likely to read files without specifying an
 > encoding and a lot of the time it will *seem* to work.

But that's a different problem. If you want to fix that, you should require an explicit codec parameter on all text I/O. They'll still just memorize the magic incantation and grumble about the extra characters they have to type, but they'll have been warned.

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
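[Editor's note: a minimal sketch of the asymmetry described above, using only the stdlib `utf-8-sig` codec. The `cat_bytes` helper is purely illustrative, not anything proposed in the thread: honouring the signature on *input* is easy, but the helper's `out_encoding` parameter is exactly the question autodetection cannot answer.]

```python
import codecs

# Input side: the 'utf-8-sig' codec strips the three-byte BOM if
# present and decodes plain UTF-8 unchanged if it is absent.
data_with_bom = codecs.BOM_UTF8 + "naïve".encode("utf-8")
data_without = "naïve".encode("utf-8")

assert data_with_bom.decode("utf-8-sig") == "naïve"
assert data_without.decode("utf-8-sig") == "naïve"

# Output side: a naive "cat" over mixed inputs has no principled way
# to pick out_encoding -- first input's codec? last? always UTF-8?
def cat_bytes(chunks, out_encoding="utf-8"):
    """Concatenate byte chunks, honouring any UTF-8 signature on input."""
    text = "".join(chunk.decode("utf-8-sig") for chunk in chunks)
    return text.encode(out_encoding)

print(cat_bytes([data_with_bom, data_without]).decode("utf-8"))
```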