On Thu, Jan 20, 2011 at 01:26:01AM +0100, Victor Stinner wrote:
> On Wednesday, 19 January 2011 at 15:44 -0800, Toshio Kuratomi wrote:
> > Additionally, many Unix filesystems don't specify a filesystem encoding for
> > filenames; they deal in legal and illegal bytes which could lead to
> > troubles.  This problem of which encoding to use is a problem that can be
> > seen on UNIX systems even now.
>
> If the system is not correctly configured, it is not a bug in Python,
> but a bug in the system config. Python relies on the locale to choose
> the filesystem encoding (sys.getfilesystemencoding()). Python uses this
> encoding to decode and encode all filenames.
>
Repeating, every time it comes up, that multiple encodings on a single system
is a misconfiguration does not make it true.  Past threads have listed
multiple examples of how you can end up with filenames in several encodings
on a single system: multiple users with different encodings for their
locales, mounting remote filesystems, downloading a file....

To the existing list I'd add getting a package from PyPI -- neither tar nor
zip files contain encoding information about the filenames.  Therefore, if I
create an sdist of a Python module with non-ASCII filenames under a latin1
locale and then upload it to PyPI, people downloading it on a system with
a utf-8 locale will end up not being able to use the module.

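A minimal sketch of the PyPI scenario above, in case it helps make the failure concrete. The filename `café.py` is hypothetical, and the helper `decode_strict` is something I made up for illustration; it is not part of Python or distutils. The sketch only compares bytes in memory, it does not touch a real filesystem or archive.

```python
# Hypothetical module filename; not taken from any real package.
module_name = "café.py"

# Bytes as a packager with a latin1 locale would write them into the archive.
on_disk = module_name.encode("latin-1")

# Bytes a system with a utf-8 locale would look for under the same name.
looked_up = module_name.encode("utf-8")

print(on_disk)    # b'caf\xe9.py'
print(looked_up)  # b'caf\xc3\xa9.py'


def decode_strict(raw, encoding):
    """Return the decoded name, or None if the bytes are illegal in `encoding`.

    Illustrative helper only (an assumption of this sketch).
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# The latin1 bytes are not legal utf-8, so a strict decode fails outright.
strict_result = decode_strict(on_disk, "utf-8")

# With the surrogateescape error handler the bytes round-trip, but the
# resulting name is not the name written in the import statement.
escaped = on_disk.decode("utf-8", "surrogateescape")
```

Either way the utf-8 user loses: the strict decode gives no usable name, and the escaped name does not compare equal to the one in the source code.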
> > * Specify an encoding per platform and stick to that.
>
> It doesn't work: on UNIX/BSD, the user chooses its own encoding and all
> programs will use it.
>
The proposal is that you ignore that when talking about loading and creating
python modules.  (I mentioned distutils because my thought was that distutils
could grow the ability to translate from the system locale to a chosen
neutral encoding when running any of setup.py's dist commands, but that
doesn't address the issue when testing a module that you've just written, so
perhaps that's not necessary.)  Python modules would have a set of defined
filesystem encodings per system.  This prevents getting a mixture of
encodings of modules and having things work in one location but fail when
used somewhere else.  Instead, you get an upfront failure until you correct
the encoding.

> Anyway, I don't see why it is a problem to have different encodings on
> different systems. Each system can use its own encoding. The bug that
> I'm trying to solve is a Python bug, not an OS bug.
>
There is no OS bug here.  There is perhaps an OS design flaw, but it's not
a flaw that will be going away soon (in part because the present OS designers
do not see it as an OS flaw... to them it's a bug in code that attempts to
build a simpler interface on top of it.)

> > * Change import semantics to allow specifying the encoding of the module on
> >   the filesystem (seems really icky).
>
> This is a very bad idea. I introduced PYTHONFSENCODING environment
> variable in Python 3.2, but then quickly removed it, because it
> introduced a lot of inconsistencies.
>
Thanks for getting rid of that; PYTHONFSENCODING is a bad idea because it
doesn't solve the underlying issues.  However, when I say specifying the
encoding of the module on the filesystem, I don't mean something global like
PYTHONFSENCODING -- I mean something at the python code level::

    import café encoded_as('latin1')

After thinking about this one, though, I don't think it will work either.
This takes care of importing modules where the fs encoding of the module is
known, but it doesn't help where the fs encoding may be translated between
platforms.  I believe that this could arise when untarring a module on
windows using winzip or similar that gives you the option of translating from
utf-8 bytes into bytes that have meaning as characters on that platform, for
instance.

Do you have a solution to the problem?  I haven't looked at your patch, so
perhaps you have an ingenious method of translating from the unicode
representation of the module in the import statement to the bytes in
arbitrary encodings on the filesystem that I haven't thought of.  If you
don't, however, then really, ASCII-only seems like the sanest of the three
solutions I can think of.

-Toshio