Re: PEP 383: Non-decodable Bytes in System Character Interfaces

Zooko O'Whielacronx Sat, 25 Apr 2009 08:33:42 -0700

Thanks for writing this PEP 383, MvL. I recently ran into this problem in Python 2.x in the Tahoe project [1]. The Tahoe project should be considered a good use case showing what some people need. For example, the assumption that a file will later be written back into the same local filesystem (and thus luckily use the same encoding) from which it originally came doesn't hold for us, because Tahoe is used for file-sharing as well as for backup-and-restore.

One of my first conclusions in pursuing this issue is that we can never use the Python 2.x unicode APIs on Linux, just as we can never use the Python 2.x str APIs on Windows [2]. (You mentioned this ugliness in your PEP.) My next conclusion was that the Linux way of doing encoding of filenames really sucks compared to, for example, the Mac OS X way. I'm heartened to see that David Wheeler is trying to persuade the maintainers of Linux filesystems to improve some of this: [3].

My final conclusion was that we needed to have two kinds of workaround for the Linux suckage: first, if decoding using the suggested filesystem encoding fails, then we fall back to mojibake [4] by decoding with iso-8859-1 (or else with windows-1252 -- I'm not sure if it matters and I haven't yet understood if utf-8b offers another alternative for this case). Second, if decoding succeeds using the suggested filesystem encoding on Linux, then write down the encoding that we used and include that with the filename. This expands the size of our filenames significantly, but it is the only way to allow some future programmer to undo the damage of a falsely-successful decoding. Here's our whole plan: [5].
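
To make the two workarounds concrete, here is a rough sketch of the decode step. It is only an illustration, not the actual Tahoe code: the function name decode_filename, its arguments, and the returned tuple are all made up for this example, and it is written for Python 3 bytes/str rather than the Python 2.x str/unicode types discussed above.

```python
def decode_filename(raw, suggested_encoding):
    """Hypothetical helper: decode a raw filename from the filesystem.

    Returns (text, encoding_used, decode_failed) so that the encoding
    can be stored alongside the name and the decoding undone later.
    """
    try:
        # Workaround 2: decoding "succeeded", but it may be falsely
        # successful, so report which encoding was used.
        return raw.decode(suggested_encoding), suggested_encoding, False
    except UnicodeDecodeError:
        # Workaround 1: fall back to mojibake. iso-8859-1 maps every
        # byte value to a code point, so this decode cannot fail and
        # the original bytes can be recovered by re-encoding.
        return raw.decode("iso-8859-1"), "iso-8859-1", True


# A name that is valid utf-8, and one that is not.
print(decode_filename(b"caf\xc3\xa9", "utf-8"))
print(decode_filename(b"caf\xe9", "utf-8"))
```

On the iso-8859-1 versus windows-1252 question, at least in Python it does seem to matter: the cp1252 codec leaves a few byte values (0x81, for example) undefined, so the fallback decode could itself raise, whereas iso-8859-1 accepts all 256 byte values.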


Regards,

Zooko

[1] http://allmydata.org

[2] http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html #see the footnote of this message

[3] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
[4] http://en.wikipedia.org/wiki/Mojibake
[5] http://allmydata.org/trac/tahoe/ticket/534#comment:47