Thanks for writing this PEP 383, MvL. I recently ran into this problem in Python 2.x in the Tahoe project [1]. The Tahoe project should be considered a good use case showing what some people need. For example, the assumption that a file will later be written back into the same local filesystem (and thus luckily use the same encoding) from which it originally came doesn't hold for us, because Tahoe is used for file-sharing as well as for backup-and-restore.

One of my first conclusions in pursuing this issue is that we can never use the Python 2.x unicode APIs on Linux, just as we can never use the Python 2.x str APIs on Windows [2]. (You mentioned this ugliness in your PEP.) My next conclusion was that the Linux way of encoding filenames really sucks compared to, for example, the Mac OS X way. I'm heartened to see that David Wheeler is trying to persuade the maintainers of Linux filesystems to improve some of this [3].

My final conclusion was that we needed two kinds of workaround for the Linux suckage. First, if decoding using the suggested filesystem encoding fails, then we fall back to mojibake [4] by decoding with iso-8859-1 (or else with windows-1252 -- I'm not sure if it matters and I haven't yet understood if utf-8b offers another alternative for this case). Second, if decoding using the suggested filesystem encoding succeeds on Linux, then we write down the encoding that we used and include that with the filename. This expands the size of our filenames significantly, but it is the only way to allow some future programmer to undo the damage of a falsely-successful decoding. Here's our whole plan: [5].
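
The two workarounds above can be sketched roughly as follows. This is only an illustration, not the Tahoe code: it assumes Python 3 bytes/str semantics, and the function name decode_filename, its parameters, and the shape of its return value are all hypothetical.

```python
import sys


def decode_filename(raw, fs_encoding=None):
    """Decode a raw filename (bytes) and report how it was decoded.

    Hypothetical helper. Returns (name, encoding_used, failed_decode).
    """
    if fs_encoding is None:
        # The "suggested" filesystem encoding for this platform.
        fs_encoding = sys.getfilesystemencoding() or "utf-8"
    try:
        # Strict decoding raises if the bytes are invalid in fs_encoding.
        name = raw.decode(fs_encoding, "strict")
        # Success may still be a false success, so the encoding is
        # returned for the caller to store next to the filename.
        return name, fs_encoding, False
    except UnicodeDecodeError:
        # Fallback: iso-8859-1 maps every byte value to a code point,
        # so this cannot fail and the original bytes stay recoverable.
        return raw.decode("iso-8859-1"), "iso-8859-1", True


# Valid UTF-8 bytes: decoded normally, encoding recorded.
print(decode_filename(b"caf\xc3\xa9.txt", "utf-8"))
# Invalid UTF-8 bytes: falls back to iso-8859-1.
print(decode_filename(b"caf\xe9.txt", "utf-8"))
```

The sketch uses iso-8859-1 for the fallback because it assigns a character to all 256 byte values, so name.encode("iso-8859-1") gives back the original bytes. Python's cp1252 codec leaves a few byte values undefined and would raise on them.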

Regards,

Zooko

[1] http://allmydata.org
[2] http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html # see the footnote of this message
[3] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
[4] http://en.wikipedia.org/wiki/Mojibake
[5] http://allmydata.org/trac/tahoe/ticket/534#comment:47