On Sun, Dec 5, 2010 at 5:25 PM, Victor Stinner
<victor.stin...@haypocalc.com> wrote:
> On Saturday 04 December 2010 09:31:04 you wrote:
>> Alexander Belopolsky writes:
>>  > In fact, once the language moratorium is over, I will argue that
>>  > str.encode() and byte.decode() should deprecate encoding argument and
>>  > just do UTF-8 encoding/decoding.  Hopefully by that time most people
>>  > will forget that other encodings exist.  (I can dream, right?)
>>
>> It's just a dream.  There's a pile of archival material, often on R/O
>> media, out there that won't be transcoded any more quickly than the
>> inscriptions on Tutankhamun's tomb.
>
> Not only, many libraries expect use bytes arguments encoded to a specific
> encoding (eg. locale encoding). Said differently, only few libraries written
> in C accept wchar* strings.
>
My proposal has nothing to do with the C-API.  It only concerns the
Python API of the builtin str type.

> The Linux kernel (or many, or all, UNIX/BSD kernels) only manipulate byte
> strings. The libc only accept wide characters for a few operations. I don't
> know how to open a file with an unicode path with the Linux libc: you have to
> encode it...
>

Yes, but hopefully the encoding used by the filesystem will be UTF-8.
For Python users, however, encoding details will hopefully be hidden
by the open() call.  Yes, I am aware of the many problems with
divining the filesystem encoding, but instructing application
developers to supply their own fsencoding in
open(filepath.encode(fsencoding)) calls is not very helpful.

> Alexander: you should first patch all UNIX/BSD kernels to use unicode
> everywhere, then patch all libc implementations, and then all libraries
> (written in C). After that, you can have a break.
>

As Martin explained later in this thread with respect to the
transform() method, removing the codec argument from the str.encode()
method does not imply removing the codecs themselves.  If I need a
method to encode strings to, say, the koi8_r encoding, I can easily
access it directly:

>>> from encodings import koi8_r
>>> to_koi8_r = koi8_r.Codec().encode
>>> to_koi8_r('код')
(b'\xcb\xcf\xc4', 3)

More likely, however, I will only need en/decoding to read/write
legacy files, and rather than encoding the strings explicitly before
writing them into a file, I will just open that file with the correct
encoding.

Having all encodings accessible in a str method only promotes a
programming style where bytes objects can contain differently encoded
strings in different parts of the program.  Instead, well-written
programs should decode bytes on input, do all processing with the str
type, and encode on output.  When strings need to be passed to char* C
APIs, they should be encoded in UTF-8.  Many C APIs originally
designed for ASCII actually produce meaningful results when given
UTF-8 bytes.
(Supporting such usage was one of the design goals of UTF-8.)
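
To make the "decode on input, process as str, encode on output" style
concrete, here is a minimal sketch.  The function names
(transcode_legacy_file, split_utf8_fields), the koi8_r default and the
upper() call standing in for real processing are all made up for
illustration; only open() with its encoding argument and the standard
codecs are real.  The second function illustrates one reason UTF-8
bytes tend to work in code written for ASCII: a byte below 0x80 never
occurs inside a multi-byte UTF-8 sequence.

```python
import os
import tempfile


def transcode_legacy_file(src_path, dst_path, legacy_encoding="koi8_r"):
    """Decode on input, work with str, encode on output (UTF-8).

    Hypothetical helper; the names and the default encoding are
    assumptions made for this example only.
    """
    # open() decodes the legacy bytes; no explicit bytes.decode() call.
    with open(src_path, encoding=legacy_encoding) as src:
        text = src.read()
    # All processing happens on str; upper() stands in for real work.
    processed = text.upper()
    # open() encodes on the way out; newline="" disables newline translation.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(processed)
    return processed


def split_utf8_fields(raw, delimiter=b":"):
    """Split UTF-8 bytes on an ASCII delimiter, as byte-oriented code would.

    Hypothetical helper.  An ASCII delimiter cannot cut a UTF-8 encoded
    character in half, so each field decodes cleanly.
    """
    return [field.decode("utf-8") for field in raw.split(delimiter)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "legacy.txt")
        dst = os.path.join(tmp, "utf8.txt")
        with open(src, "wb") as f:
            f.write(b"\xcb\xcf\xc4")  # the KOI8-R bytes from the session above
        print(transcode_legacy_file(src, dst))
    print(split_utf8_fields("код:code".encode("utf-8")))
```

Note that the encoding is named exactly once, at the point where the
file is opened; nothing else in the program handles bytes in a legacy
encoding.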