Nick Coghlan <ncogh...@gmail.com> added the comment:

Pondering it further (and reading subsequent comments here and in the thread), 
I agree an open_ascii() builtin would be a step backwards, not forwards.

So, morphing this issue into a documentation one to work out:
- the bare minimum we think Python 3 users should be learning about Unicode
- where to document that (with a reference to the Unicode HOWTO for 
anyone who wants to know more)

Some ideas specifically in the context of text files (for readers already 
familiar with the basic concept of text encodings):

1. The world is moving towards standardising on UTF-8 as the binary encoding 
used to store text files. However, we're a long way from living in that world 
right now. Other encodings (many, but far from all, ASCII-compatible) will be 
encountered quite often, either as the default encoding on a particular 
platform, or as the encoding of a particular text file. Dealing with these 
correctly requires additional work.

2. To maximise the chance of correct local interoperability, Python 3's default 
choice of encoding is actually taken from the underlying platform rather than 
being forced to UTF-8. While it is becoming more and more common for platforms 
to set their preferred encoding to UTF-8, this is not yet universal (notably, 
Windows still does not use UTF-8 as the default encoding for text files in 
order to preserve compatibility with various Unicode-unaware legacy 
applications).

To handle this correctly in cross-platform applications and libraries, it is 
often necessary to explicitly pass "encoding='utf-8'" when opening a UTF-8 
encoded text file.
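
As a rough sketch of what that looks like in practice (the helper name 
roundtrip_utf8 and the file name example.txt are made up for illustration):

```python
import os
import tempfile


def roundtrip_utf8(text):
    """Write text as UTF-8 and read it back, regardless of platform default."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name, used only for this illustration.
        path = os.path.join(tmp, "example.txt")

        # Explicit encoding: the bytes on disk no longer depend on the locale.
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)

        # Look at the raw bytes that were actually stored.
        with open(path, "rb") as f:
            raw = f.read()

        # Read back with the same explicit encoding.
        with open(path, "r", encoding="utf-8") as f:
            return raw, f.read()


raw, text = roundtrip_utf8("café")
print(raw, text)
```

Without the explicit argument, the same write could produce different bytes 
on platforms whose preferred encoding is not UTF-8.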

The default encoding on a given platform can be checked by running "import 
locale; locale.getpreferredencoding()" at the interactive prompt.
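
A small sketch of that check as a script, assuming it is also useful to see 
what an open() call without an encoding argument actually picked (the 
variable names are illustrative only, and the printed values vary by 
platform and configuration):

```python
import locale
import tempfile

# Ask the platform for its preferred text encoding. Passing False avoids
# temporarily changing the process locale as a side effect of the query.
preferred = locale.getpreferredencoding(False)

# Open a temporary text file without an encoding argument and inspect
# which encoding Python selected for it.
with tempfile.TemporaryFile(mode="w+") as f:
    default_for_open = f.encoding

print("preferred encoding:", preferred)
print("open() default:    ", default_for_open)
```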

3. Currently, it is still fairly common to encounter text files that are known 
to be stored in an ASCII-compatible text encoding without knowing precisely 
*which* encoding is used. The Python 2 text model allowed such files to be 
processed naively simply by assuming they were in an ASCII-compatible encoding 
and passing any non-ASCII bytes faithfully through to the result. This 
permissive behaviour can be requested explicitly in Python 3 by passing 
"encoding='ascii'" and "errors='surrogateescape'" when opening a text file.
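
A minimal sketch of that pass-through, assuming a latin-1 input file whose 
encoding the program does not know (the file names and the replace() edit 
are made up for illustration):

```python
import os
import tempfile

# Pretend this came from somewhere else: latin-1 bytes, encoding "unknown".
raw = "café\n".encode("latin-1")  # b"caf\xe9\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")    # hypothetical file names
    dst = os.path.join(tmp, "out.txt")
    with open(src, "wb") as f:
        f.write(raw)

    # newline="" disables newline translation so bytes round-trip exactly.
    options = dict(encoding="ascii", errors="surrogateescape", newline="")
    with open(src, "r", **options) as fin, open(dst, "w", **options) as fout:
        for line in fin:
            # The 0xE9 byte is smuggled through as the lone surrogate U+DCE9;
            # only the ASCII part of the line is edited here.
            fout.write(line.replace("caf", "CAF"))

    with open(dst, "rb") as f:
        result = f.read()

print(result)
```

The non-ASCII byte comes out unchanged, but only if the output is also 
written with errors='surrogateescape'; encoding the escaped text strictly 
(for example to UTF-8) raises UnicodeEncodeError.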

This approach parallels the behaviour of Python 2 and works correctly so long 
as it is fed data solely in ASCII-compatible encodings (such as UTF-8 and 
latin-1). Behaviour when fed data that uses other encodings is unpredictable - 
common symptoms include Unicode encoding and decoding errors at unexpected 
points in a program, as well as silent corruption of the output text.
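
To illustrate the failure mode, here is a small sketch that feeds UTF-16 
data (which is not ASCII compatible) through the same settings. The sample 
string and the appended "!" are arbitrary choices for the example:

```python
# UTF-16-LE stores each of these characters as two bytes.
raw = "héllo".encode("utf-16-le")

# No error is raised: the NUL bytes are valid ASCII and 0xE9 is escaped.
text = raw.decode("ascii", errors="surrogateescape")

# The "text" has one code point per byte, not one per character.
char_count = len(text)

# A harmless-looking edit adds a single byte, which breaks the two-byte
# structure of the original encoding.
modified = (text + "!").encode("ascii", errors="surrogateescape")

try:
    modified.decode("utf-16-le")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

print(char_count, still_valid)
```

Nothing fails at the point where the data is read or edited; the damage 
only shows up later, when something tries to interpret the output using 
the original encoding.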

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13997>
_______________________________________