On 24/01/2010 18:41, "Martin v. Löwis" wrote:
However it is likely to be often wrong, and where the user's locale
specifies an encoding like CP1252 then it will result in silent
corruption rather than an immediate exception.
Why do you say that? Why do you think it will likely be often wrong?
Most likely, encoding text files with cp1252 will be exactly right,
and what the end user wanted.
If the file has a UTF-8 signature then decoding the file with CP1252
will almost always be wrong. I'm *not* suggesting switching to UTF-8 by
default, which we can't do now that 3.1 stable is out with the current
behavior.
This is why I'm keen that by *default* Python should honour the UTF-8
signature when reading files; particularly given that programmers who
don't/can't/won't understand encodings are likely to read files without
specifying an encoding and a lot of the time it will *seem* to work.
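
To make the proposal concrete, here is a minimal sketch of what "honour the
signature by default" could look like. The names sniff_encoding and open_text
are hypothetical helpers invented for illustration, not an existing or proposed
API: an explicit encoding wins, then the signature, then the locale preference.

```python
import codecs
import locale


def sniff_encoding(path, default=None):
    """Return 'utf-8-sig' if the file starts with the UTF-8 signature,
    otherwise the given default or the locale's preferred encoding."""
    with open(path, "rb") as f:
        head = f.read(len(codecs.BOM_UTF8))
    if head == codecs.BOM_UTF8:
        return "utf-8-sig"
    return default or locale.getpreferredencoding(False)


def open_text(path, encoding=None):
    """Hypothetical wrapper around open(): only guess when the caller
    did not specify an encoding."""
    if encoding is None:
        encoding = sniff_encoding(path)
    return open(path, "r", encoding=encoding)
```

The 'utf-8-sig' codec strips the signature on decoding, so the caller never
sees it as a stray character at the start of the text.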
That's probably a reasonable idea - but may also make things worse:
on writing, you'd still use cp1252, so you may end up outputting the
file in a different encoding. That would be particularly unfortunate
if you were merely performing some simple text replacement.
Decoding a UTF-8 file with CP1252 will almost always succeed (only the
few byte values that CP1252 leaves undefined raise an error), but if it
contains non-ASCII characters then 'simple text replacement' will either
not work or can corrupt the data. Reading as UTF-8 and then outputting
as CP1252 (without data loss) is preferable in my opinion. If 'guessing'
an encoding using the user's locale is acceptable then using another
*very strong* indicator (i.e. the presence of the UTF-8 signature) should
also be acceptable.
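
A small illustration of the silent corruption, using only the standard codecs
(the sample word is just an arbitrary example with one non-ASCII character):

```python
import codecs

original = "café"
raw = codecs.BOM_UTF8 + original.encode("utf-8")

# Decoding with CP1252 raises no exception, but the result is mojibake:
# the three signature bytes and the two bytes of 'é' each become
# separate characters.
as_cp1252 = raw.decode("cp1252")

# Honouring the signature recovers the text that was written.
as_utf8 = raw.decode("utf-8-sig")

# A 'simple text replacement' silently does nothing on the mojibake...
missed = as_cp1252.replace("é", "e")
# ...but works on the correctly decoded text.
fixed = as_utf8.replace("é", "e")

print(repr(as_cp1252))       # 'ï»¿cafÃ©'
print(repr(as_utf8))         # 'café'
print(missed == as_cp1252)   # True: nothing was replaced
print(fixed)                 # cafe
```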
In addition there are many programs where the reading of data is
separate from the writing of data (configuration files, XML, etc.), so
that the encoding of any files written is logically distinct. In my
experience only a minority of programs destructively rewrite their
input files. If the programmer never specifies an encoding but has an
input file with a UTF-8 signature, writing output in the
locale-specified encoding is the *right* thing to do. It may be
different from the input encoding but it will be successfully read back
in next time around.
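
As a rough sketch of that read/write split (read_then_write and its
out_encoding parameter are hypothetical, added here only so the behaviour can
be tried with a fixed encoding instead of whatever the local machine uses):

```python
import codecs
import locale


def read_then_write(src, dst, out_encoding=None):
    """Read src honouring a UTF-8 signature, write dst in the locale
    encoding (or out_encoding, if given). Returns (text, encoding used)."""
    with open(src, "rb") as f:
        raw = f.read()
    if raw.startswith(codecs.BOM_UTF8):
        text = raw.decode("utf-8-sig")
    else:
        text = raw.decode(locale.getpreferredencoding(False))
    out_encoding = out_encoding or locale.getpreferredencoding(False)
    # Characters the output encoding cannot represent raise
    # UnicodeEncodeError here rather than being silently lost.
    with open(dst, "w", encoding=out_encoding, newline="") as f:
        f.write(text)
    return text, out_encoding
```

The output has no signature and is in a different encoding from the input,
but a later read that falls back to the same locale encoding gets the same
text back.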
So whatever the API, there are always tradeoffs.
Sure. I think the presence of a UTF-8 signature indicates the encoding
of the file strongly enough to make it a better choice than the locale
preference, though of course only where no explicit encoding was
specified.
Regards,
Martin
--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog