Re: [Python-Dev] Proposed downstream change to site.py in Fedora (sys.defaultencoding)

Michael Foord Sun, 24 Jan 2010 10:56:56 -0800

On 24/01/2010 18:41, "Martin v. Löwis" wrote:

However it is likely to be often wrong, and where the user's locale
specifies an encoding like CP1252 then it will result in silent
corruption rather than an immediate exception.

Why do you say that? Why do you think it will likely be often wrong?
Most likely, encoding text files with cp1252 will be exactly right,
and what the end user wanted.

If the file has a UTF-8 signature then decoding the file with CP1252
will almost always be wrong. I'm *not* suggesting switching to UTF8 by
default, which we can't do as 3.1 stable is now out with the current
behavior.
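
To make the failure mode concrete, here is a minimal sketch of what happens when signature-prefixed UTF-8 bytes are decoded as CP1252. The sample string is my own illustration; "utf-8-sig" is the standard library codec that strips the signature.

```python
import codecs

text = "café"
# Bytes of a UTF-8 file that starts with the signature (BOM).
data = codecs.BOM_UTF8 + text.encode("utf-8")

# Decoding with CP1252 raises no exception here, but the result is not
# the original text: the signature and the accented letter both turn
# into unrelated characters.
as_cp1252 = data.decode("cp1252")

# Decoding with utf-8-sig strips the signature and recovers the text.
as_utf8 = data.decode("utf-8-sig")

print(repr(as_cp1252))
print(repr(as_utf8))
```

Nothing fails at read time in the CP1252 case, which is exactly why the problem tends to go unnoticed until the data has been passed on.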

This is why I'm keen that by *default* Python should honour the UTF8
signature when reading files; particularly given that programmers who
don't/can't/won't understand encodings are likely to read files without
specifying an encoding and a lot of the time it will *seem* to work.
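
A rough sketch of the behaviour being asked for, written as a wrapper rather than a change to open() itself. The name open_text is hypothetical and not a proposed API; the fallback uses locale.getpreferredencoding(), which is what the default is based on today.

```python
import codecs
import locale


def open_text(path, encoding=None):
    """Open a text file for reading (hypothetical helper).

    An explicit encoding always wins. Without one, a UTF-8 signature
    at the start of the file is honoured; otherwise the locale's
    preferred encoding is used.
    """
    if encoding is None:
        # Peek at the first three bytes without decoding anything.
        with open(path, "rb") as f:
            head = f.read(len(codecs.BOM_UTF8))
        if head == codecs.BOM_UTF8:
            encoding = "utf-8-sig"  # decodes UTF-8 and strips the signature
        else:
            encoding = locale.getpreferredencoding(False)
    return open(path, "r", encoding=encoding)
```

The signature check only runs when no encoding was given, so code that already specifies one behaves exactly as before.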

That's probably a reasonable idea - but may also make things worse:
on writing, you'd still use cp1252, so you may end up outputting the
file in a different encoding. That would be particularly unfortunate
if you were merely performing some simple text replacement.

Decoding a UTF-8 file with CP1252 will always succeed, but if it
contains non-ascii characters then 'simple text replacement' will either
not work or can corrupt the data. Reading as UTF-8 and then outputting
as CP1252 (without data loss) is preferable in my opinion. If 'guessing'
an encoding using the user's locale is acceptable then using another
*very strong* indicator (i.e. the presence of the UTF8 signature) should
also be acceptable.
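
As an illustration of the 'simple text replacement' case, a small sketch with sample strings of my own choosing. One caveat on "always succeed": Python's cp1252 codec leaves a few byte values (0x81 among them) unmapped, so some UTF-8 input does raise an exception instead of being silently misread.

```python
import codecs

raw = codecs.BOM_UTF8 + "café au lait".encode("utf-8")

# Read with the locale encoding: the replacement silently does nothing,
# because the decoded text no longer contains "café".
as_locale = raw.decode("cp1252")
unchanged = as_locale.replace("café", "thé")

# Read honouring the signature: the replacement works, and the result
# can still be written out as CP1252 without loss.
as_utf8 = raw.decode("utf-8-sig")
replaced = as_utf8.replace("café", "thé")
output = replaced.encode("cp1252")


def cp1252_can_decode(data):
    """Return True if Python's cp1252 codec accepts these bytes."""
    try:
        data.decode("cp1252")
    except UnicodeDecodeError:
        return False
    return True


# "Á" is C3 81 in UTF-8, and 0x81 has no mapping in Python's cp1252.
print(cp1252_can_decode("Á".encode("utf-8")))
```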

In addition there are many programs where the reading of data is
separate from the writing of data (configuration files, XML etc.) so
that the encoding of any files written is logically distinct. In my
experience only a minority of programs have destructively rewritten
their input files. If the programmer is never specifying an encoding but
has an input file with a UTF8 signature, writing output in the locale
specified encoding is the *right* thing to do. It may be different from
the input encoding but it will be successfully read back in next time
around.

So whatever the API - there's always tradeoffs.

Sure. I think the presence of a UTF-8 signature strongly enough
indicates the encoding of the file to make it a better choice than using
the locale preference. Only of course where an explicit encoding was not
specified.

Regards,
Martin



--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog