While I'm not strongly convinced that the open() error handler must be changed to surrogateescape, first I would like to make sure that it's really a very bad idea before changing it :-)
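For context, here is a minimal sketch of what the surrogateescape error handler does; the byte values are only illustrative:

```python
# 0xFF is never valid in UTF-8; strict decoding rejects it.
data = b"caf\xc3\xa9 \xff pass-through"

try:
    data.decode("utf-8")  # errors="strict" is the default
except UnicodeDecodeError as exc:
    print(exc)

# surrogateescape smuggles the undecodable byte into the string
# as a lone surrogate (here U+DCFF) instead of failing.
text = data.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes:
# this is the "pass through" behaviour discussed below.
assert text.encode("utf-8", errors="surrogateescape") == data
```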
2017-12-07 7:49 GMT+01:00 INADA Naoki <songofaca...@gmail.com>:
> I just came up with crazy idea; changing default error handler of open()
> to "surrogateescape" only when open mode is "w" or "a".

The idea is tempting, but I'm not sure that it's a good one. Moreover, what about the "r+" and "w+" modes? I dislike getting a different behaviour for inputs and outputs. The motivation for surrogateescape is to "pass through" undecodable bytes: you need to handle them on the input side and on the output side. That's why I decided to change not only the sys.stdin error handler to surrogateescape for the POSIX locale, but also sys.stdout:
https://bugs.python.org/issue19977

> When reading, "surrogateescape" error handler is dangerous because
> it can produce arbitrary broken unicode string by mistake.

I'm fine with that. I wouldn't say that it's the purpose of the PEP, but sadly it's an expected, known and documented side effect. You get the same behaviour with Unix command line tools and most Python 2 applications (which process data as bytes). Nothing new under the sun. PEP 540 allows users to write applications behaving like Unix tools/Python 2, with the power of the Python 3 language and stdlib. Again, use the Strict UTF-8 mode if you prioritize *correctness* over *usability*.

Honestly, I'm not even sure that the Strict UTF-8 mode is *usable* in practice, since we are all surrounded by old documents encoded in various "legacy" encodings (where "legacy" means "not UTF-8", like Latin-1 or Shift JIS). The first non-ASCII character which is not encoded as UTF-8 is going to "crash" the application (big traceback with a Unicode error).

Maybe the problem is the feature name: "UTF-8 mode". Users may think "strict" when they read "UTF-8", since UTF-8 is known to be a strict encoding. For example, UTF-8 is much stricter than Latin-1, which is unable to tell whether a document was encoded in Latin-1 or something else.
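To illustrate this point about strictness (the byte values are chosen for the example): Latin-1 happily decodes any byte sequence, so it cannot detect a wrongly encoded document, whereas UTF-8 rejects invalid sequences:

```python
# Bytes that are NOT valid UTF-8 (0xE9 is "é" in Latin-1).
data = b"caf\xe9"

# Latin-1 maps every byte 0x00-0xFF to a code point, so decoding
# never fails -- even if the document was not Latin-1 at all.
print(data.decode("latin-1"))  # café

# UTF-8 is strict: an invalid sequence raises an error, so it can
# detect that a document was not actually encoded as UTF-8.
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```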
UTF-8 is able to tell whether a document was actually encoded as UTF-8 or not, thanks to the design of the encoding itself.

> And it doesn't allow following code:
>
> with open("image.jpg", "r") as f:  # Binary data, not UTF-8
>     return f.read()

Using a JPEG image, the example is obviously wrong: binary data should be opened in binary mode. But surrogateescape on open() is meant for reading *text files* which are mostly correctly encoded as UTF-8, except for a few bytes.

I'm not sure how to explain the issue. The Mercurial wiki page has a good example of it, which they call the "Makefile problem":
https://www.mercurial-scm.org/wiki/EncodingStrategy#The_.22makefile_problem.22

While it's not exactly the issue discussed here, it gives you an idea of the kind of issues that you get when you use open(filename, encoding="utf-8", errors="strict") versus open(filename, encoding="utf-8", errors="surrogateescape").

> I'm not sure about this is good idea. And I don't know when is good for
> changing write error handler; only when PEP 538 or PEP 540 is used?
> Or always when os.fsencoding() is UTF-8?
>
> Any thoughts?

PEP 538 doesn't affect the error handler. PEP 540 only changes the error handler for the POSIX locale; it's a deliberate choice. PEP 538 is only enabled for the POSIX locale, and PEP 540 will also be enabled by default for this locale.

I dislike the idea of changing the error handler when the filesystem encoding is UTF-8. The UTF-8 mode must be enabled explicitly, on purpose. That reduces any risk of regression, and prepares users who enable it for any potential issue.

Victor
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com