On Sat, 23 Aug 2014 21:08:29 +1000, Steven D'Aprano <st...@pearwood.info> wrote:
> When I started this email, I originally began to say that the actual
> problem was with byte file names that cannot be decoded into Unicode
> using the system encoding (typically UTF-8 on Linux systems). But I've
> actually had difficulty demonstrating that it actually is a problem. I
> started with a byte sequence which is invalid UTF-8, namely:
>
> b'ZZ\xdb\xdf\xfa\xff'
>
> created a file with that name, and then tried listing it with
> os.listdir. Even in Python 3.1 it worked fine. I was able to list the
> directory and open the file, so I'm not entirely sure where the problem
> lies exactly. Can somebody demonstrate the failure mode?

The "failure" happens only when you try to cross from the domain of POSIX binary filenames into the domain of text streams (that is, streams with a consistent encoding). If you stick with os interfaces that handle filenames, Python 3 handles POSIX bytes filenames just fine (though there may be a few corner-case rough edges yet to be fixed, and the standard streams were one of them).

The difficulty comes if you try to use a filename that contains undecodable bytes in a non-os-interface text context (such as writing it to a text file that you have declared to be UTF-8 encoded): there you will get an error...not completely unlike the old "your code works until your user uses unicode" problem we had in Python 2, but in this case happening only in a very narrow set of circumstances that involve translating between one domain (POSIX binary filenames) and another (io streams with a consistent declared encoding). This is not a common operation, but it appears to be the one Oleg is concerned about. The old unicode-blowup errors would happen almost any time someone with a non-ASCII language tried to use a program written by an ASCII-only programmer (which was most of us).

The same problem existed in Python 2 if your goal was to produce a stream with a consistent encoding, but now Python 3 treats that as an error. If you really want a stream with an inconsistent encoding, open it as binary and use the surrogateescape error handler to recover the bytes in the filenames. That is, *be explicit* about your intentions.

So yes, we've shifted a burden from those who want non-ASCII text to work consistently to those who wanted inconsistently encoded text to "just work" (or rather *appear* to "just work"). The number of people who benefit from the improved text model *greatly* outweighs the number of people inconvenienced by the new strictness when the domain line (POSIX binary filenames to consistently encoded text stream) is crossed.
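
To make the boundary concrete, here is a minimal sketch, not a definitive recipe. It assumes a UTF-8 filesystem encoding and spells out the codec and error handler explicitly so the result does not depend on the local locale. The in-memory streams stand in for real files, and all variable names are illustrative only.

```python
import io

# The byte sequence from the quoted message; it is not valid UTF-8.
raw_name = b'ZZ\xdb\xdf\xfa\xff'

# Decode it the way the os interfaces would on a UTF-8 POSIX system
# (assumption: UTF-8 filesystem encoding). Each undecodable byte becomes
# a lone surrogate in the range U+DC80..U+DCFF.
text_name = raw_name.decode('utf-8', 'surrogateescape')

# Crossing into a text stream with a declared encoding: the strict UTF-8
# encoder refuses the lone surrogates.
text_stream = io.TextIOWrapper(io.BytesIO(), encoding='utf-8')
try:
    text_stream.write(text_name)
    text_stream.flush()
    strict_write_failed = False
except UnicodeEncodeError:
    strict_write_failed = True

# Being explicit instead: use a binary stream and recover the original
# bytes with the surrogateescape error handler.
binary_stream = io.BytesIO()
binary_stream.write(text_name.encode('utf-8', 'surrogateescape'))
recovered = binary_stream.getvalue()

print(strict_write_failed, recovered == raw_name)
```

With a real file the same pattern would apply to a stream opened in 'wb' mode, and os.fsencode() performs the equivalent encoding step using whatever the filesystem encoding actually is.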
And the result is more *valid* programs, and fewer unexpected errors overall, with no inconvenience unless that domain line is crossed, and even then the inconvenience is limited to the open call that creates the binary stream.

--David