Michael Foord writes:

 > This is why I'm keen that by *default* Python should honour the UTF8  
 > signature when reading files;

Unfortunately, your caveat that "a lot of the time it will *seem* to
work" applies to this as well.  The only way that "honoring
signatures" really works is if Python simply uses the UTF-8 codec on
input and output by default, regardless of locale.  Or perhaps if
Python by default errors out unless a signature is found.
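
For concreteness, the read side of "honouring the signature" is
roughly what the stdlib's utf-8-sig codec already does; a minimal
sketch (the file name is just a placeholder):

    # 'utf-8-sig' strips a leading UTF-8 signature if one is present,
    # and decodes signature-less input as plain UTF-8 regardless.
    with open("data.txt", encoding="utf-8-sig") as f:
        text = f.read()

That says nothing about what to do on output, which is where the real
trouble starts.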

Autodetection (i.e., doing something different depending on the
presence or absence of the signature) does not really work, because
for it to work correctly it needs to imply automatic resetting of the
output codec as well.  So what is your naive programmer supposed to
expect when writing a cat program?  Should the first encoding
detected or defaulted determine the output codec?  The last one?
UTF-8 uber alles?
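
To make the ambiguity concrete, here is a naive cat as a sketch (the
sniffing helper and its fallback policy are purely illustrative):

    import sys, codecs

    def sniff(path):
        # Treat a leading UTF-8 signature as authoritative; otherwise
        # fall back to the locale's default encoding (encoding=None).
        with open(path, "rb") as f:
            head = f.read(3)
        return "utf-8-sig" if head.startswith(codecs.BOM_UTF8) else None

    for path in sys.argv[1:]:
        with open(path, encoding=sniff(path)) as f:
            # And now: which codec should stdout use?  The first
            # file's?  The last one's?  Always UTF-8?  Detection on
            # the input side gives no answer.
            sys.stdout.write(f.read())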

Such autodetection *can* be done fairly accurately.  After 20 years of
experimenting, Emacs has it pretty much right.  But ... Emacs almost
never runs without a human watching it.  The code that handles this is
a mess of special cases and heuristics, and it throws more than a few
exceptions in practice.  And any decisions that need to be made about
disambiguating the output codec are left up to the user.

 > particularly given that programmers who don't/can't/won't
 > understand encodings are likely to read files without specifying an
 > encoding and a lot of the time it will *seem* to work.

But that's a different problem.  If you want to fix that, you should
require an explicit codec parameter on all text I/O.  They'll still
just memorize the magic incantation and grumble about the extra
characters they have to type, but they'll have been warned.
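
The incantation itself is short enough; something like this (the file
name is, of course, a placeholder):

    # Explicit codec on every text open() -- no guessing, no signature
    # sniffing, no dependence on the locale.
    with open("data.txt", "r", encoding="utf-8") as f:
        text = f.read()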
