On 24/01/2010 18:41, "Martin v. Löwis" wrote:
>> However it is likely to be often wrong, and where the user's locale
>> specifies an encoding like CP1252 then it will result in silent
>> corruption rather than an immediate exception.
> Why do you say that? Why do you think it will likely be often wrong?
> Most likely, encoding text files with cp1252 will be exactly right,
> and what the end user wanted.


If the file has a UTF-8 signature then decoding it with CP1252 will almost always be wrong. I'm *not* suggesting switching to UTF-8 by default, which we can't do now that 3.1 stable is out with the current behavior.
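
To make that concrete, here is a minimal sketch of what happens to a file that starts with the UTF-8 signature when it is decoded as CP1252. The sample text is made up for illustration; `codecs.BOM_UTF8` and the `utf-8-sig` codec are standard library.

```python
import codecs

# Bytes of a small UTF-8 file that starts with the UTF-8 signature (BOM).
# The sample text is only an illustration.
data = codecs.BOM_UTF8 + "naïve café".encode("utf-8")

# What a CP1252 locale default does: no exception for these bytes,
# but the signature and every non-ASCII character turn into mojibake.
wrong = data.decode("cp1252")

# Honouring the signature: "utf-8-sig" strips the BOM and decodes as UTF-8.
right = data.decode("utf-8-sig")

print(repr(wrong))
print(repr(right))
```

The CP1252 result begins with three junk characters (the signature bytes read as text), and nothing in the program is told that anything went wrong.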

>> This is why I'm keen that by *default* Python should honour the UTF8
>> signature when reading files; particularly given that programmers who
>> don't/can't/won't understand encodings are likely to read files without
>> specifying an encoding and a lot of the time it will *seem* to work.
> That's probably a reasonable idea - but may also make things worse:
> on writing, you'd still use cp1252, so you may end up outputting the
> file in a different encoding. That would be particularly unfortunate
> if you were merely performing some simple text replacement.

Decoding a UTF-8 file with CP1252 will usually succeed (Python's cp1252 codec leaves only a handful of byte values undefined), but if the file contains non-ASCII characters then 'simple text replacement' will either not work or can corrupt the data. Reading as UTF-8 and then outputting as CP1252 (without data loss) is preferable in my opinion. If 'guessing' an encoding using the user's locale is acceptable then using another *very strong* indicator (i.e. the presence of the UTF-8 signature) should also be acceptable.
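
A small sketch of the 'simple text replacement' case, using made-up sample text. It compares a replacement run on UTF-8 bytes mis-decoded as CP1252 with the same replacement run on correctly decoded text that is then written out as CP1252.

```python
original = "Café menu: crème brûlée"   # illustrative sample text
utf8_bytes = original.encode("utf-8")

# Mis-decoded as CP1252: no exception for these bytes, but the text is mojibake,
# so a replacement that targets "é" finds nothing to replace.
mojibake = utf8_bytes.decode("cp1252")
replaced_wrong = mojibake.replace("é", "e")

# Decoded as UTF-8: the replacement behaves as intended ...
text = utf8_bytes.decode("utf-8")
replaced_right = text.replace("é", "e")

# ... and the result can still be written out as CP1252 without data loss,
# because every character in it exists in that code page.
output_bytes = replaced_right.encode("cp1252")

print(replaced_wrong == mojibake)   # the "replacement" silently did nothing
print(replaced_right)
```

The output encoding differs from the input encoding in the second path, but the text itself survives intact and can be read back with the locale default.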

In addition, there are many programs where the reading of data is separate from the writing of data (configuration files, XML, etc.), so the encoding of any files written is logically distinct. In my experience only a minority of programs destructively rewrite their input files. If the programmer never specifies an encoding but has an input file with a UTF-8 signature, writing output in the locale-specified encoding is the *right* thing to do. It may be different from the input encoding but it will be successfully read back in next time around.

> So whatever the API - there's always tradeoffs.

Sure. I think the presence of a UTF-8 signature indicates the encoding of the file strongly enough to make it a better choice than the locale preference. Only, of course, where an explicit encoding was not specified.
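
As a rough sketch of the rule I have in mind (this is not how `open()` currently behaves, and the names `choose_encoding` and `open_text` are hypothetical helpers invented for this example): an explicit encoding always wins, otherwise a UTF-8 signature wins, otherwise the locale preference is used.

```python
import codecs
import locale


def choose_encoding(first_bytes, explicit=None):
    """Pick an encoding for reading (hypothetical helper, not a stdlib API).

    An explicit encoding always wins; otherwise a UTF-8 signature wins;
    otherwise fall back to the locale's preferred encoding.
    """
    if explicit is not None:
        return explicit
    if first_bytes.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" decodes as UTF-8 and strips the signature.
        return "utf-8-sig"
    return locale.getpreferredencoding(False)


def open_text(path, encoding=None):
    """Open a file for reading text using the rule above (hypothetical)."""
    with open(path, "rb") as f:
        head = f.read(len(codecs.BOM_UTF8))
    return open(path, "r", encoding=choose_encoding(head, encoding))
```

Writing is deliberately left alone here: output would still go through the locale-specified encoding unless the programmer says otherwise.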


> Regards,
> Martin


--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog


