Hi Petr,

2017-01-11 12:22 GMT+01:00 Petr Viktorin <encu...@gmail.com>:

>> For example, this may mean that a built-in Python string sort will give you
>> a different ordering than invoking the external "sort" command.
>> I have been bitten by this kind of issues, leading to spurious "diffs" if
>> you try to use sorting to put strings into a canonical order.
>
> AFAIK, this would not be a problem under PEP 538, which effectively treats
> the "C" locale as "C.UTF-8". Strings of Unicode codepoints and the
> corresponding UTF-8-encoded bytes sort the same way.
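
The claim about sort order is easy to check. Here is a minimal sketch: the
sample strings are arbitrary ones I made up, and the only point is that
sorting str objects (compared by code point) and sorting by their UTF-8
bytes gives the same order.

```python
# Illustrative sample only: ASCII, accented Latin, CJK and an astral
# (non-BMP) character, plus the empty string.
samples = ["zebra", "Zebra", "éclair", "apple", "日本語", "\U0001F600 smile", "ñandú", ""]

# Python compares str objects code point by code point.
by_codepoint = sorted(samples)

# Sorting by the UTF-8 encoding compares raw bytes instead, which is
# what a byte-wise sort of UTF-8 text does.
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))

same_order = by_codepoint == by_utf8_bytes
print(same_order)
```

This only covers plain code point ordering. A locale such as en_US.UTF-8
applies its own collation rules, so it can still order things differently.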

...and this is also something new I learned.

> Is that wrong, or do you have a better example of trouble with using
> "C.UTF-8" instead of "C"?

After long deliberation, it seems I cannot find any source of trouble.
+1

>> So my feeling is that people are ultimately not being helped by
>> Python trying to be "nice", since they will be bitten by locale issues
>> anyway. IMHO ultimately better to educate them to configure the locale.
>> (I realise that people may reasonably disagree with this assessment ;-) )
>>
>> I would then recommend to set to en_US.UTF-8, which is slower and
>> less elegant but at least more widely supported.
>
> What about the spurious diffs you'd get when switching from "C" to
> "en_US.UTF-8"?

That taught me to invoke "sort" with an explicit locale:

    LANG=en_US.UTF-8 sort

> I believe the main problem is that the "C" locale really means two very
> different things:
>
> a) Text is encoded as 7-bit ASCII; higher codepoints are an error
> b) No encoding was specified
>
> In both cases, treating "C" as "C.UTF-8" is not bad:
> a) For 7-bit "text", there's no real difference between these locales
> b) UTF-8 is a much better default than ASCII

A "C" locale also means that a program should not *output* non-ASCII
characters of its own accord. It may pass them through when they were
explicitly fed in (as in the case of "cat" or "sort" or the "ls"
equivalent from PEP 540).

So for example, a program might want to print fancy Unicode box
characters to show progress, and check sys.stderr.encoding to see if it
can do so. However, under a "C" locale it should not do so, since the
terminal is unlikely to display the fancy box characters properly.

Note that under the current PEP 540 proposal, sys.stderr would use
UTF-8/backslashreplace under the "C" locale, so checking
sys.stderr.encoding would no longer answer that question.

I think this may be a minor concern ultimately, but it would be nice if
we had some API to at least reliably answer the question "can I safely
output non-ASCII myself?"

Stephan
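
PS: To make the "can I safely output non-ASCII myself?" question concrete,
here is a rough sketch of the kind of check I have in mind. The helper
name can_output_non_ascii is hypothetical (no such API exists in Python),
and the rule it applies is only my own heuristic: refuse if the stream
encoding is ASCII or unknown, and also refuse if the LC_CTYPE locale is
"C" or "POSIX", even when the stream itself claims UTF-8.

```python
import codecs
import locale
import sys


def can_output_non_ascii(stream_encoding, ctype_locale):
    """Hypothetical heuristic, not an existing Python API.

    stream_encoding: e.g. sys.stderr.encoding (may be None)
    ctype_locale:    e.g. the current LC_CTYPE locale name
    """
    if not stream_encoding:
        return False
    try:
        # Normalise aliases such as "US-ASCII" to the codec's canonical name.
        codec_name = codecs.lookup(stream_encoding).name
    except LookupError:
        return False
    if codec_name == "ascii":
        return False
    # Assumption: a "C"/"POSIX" locale means nobody told us the terminal
    # can display non-ASCII, whatever encoding the stream reports.
    if ctype_locale in ("C", "POSIX"):
        return False
    return True


if __name__ == "__main__":
    # Calling setlocale() with only a category queries the current setting.
    current = locale.setlocale(locale.LC_CTYPE)
    print(can_output_non_ascii(sys.stderr.encoding, current))
```

If the interpreter itself switches the locale to "C.UTF-8", a check like
this can no longer tell that apart from a locale the user chose, which is
why I would prefer a proper API for it.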

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/