On Sat, 20 Oct 2012 13:43:16 -0700, Julien Phalip wrote:

> I've noticed that the encoding of non-ascii filenames can be inconsistent
> between platforms when using the built-in open() function to create files.
>
> For example, on a Ubuntu 10.04.4 LTS box, the character u'ş' (u'\u015f')
> gets encoded as u'ş' (u's\u0327'). Note how the two characters look
> exactly the same but are encoded differently. The original character uses
> only one code (u'\u015f'), but the resulting character that is saved on
> the file system will be made of a combination of two codes: the letter 's'
> followed by a diacritical cedilla (u's\u0327'). (You can learn more about
> diacritics in [1]). On the Mac, however, the original encoding is always
> preserved.
>
> This issue was also discussed in a blog post by Ned Batchelder [2].

You are conflating two distinct issues here: representation (how a given
"character" is represented as a Unicode string) and encoding (how a given
Unicode string is represented as a byte string).

E.g. you state:

> For example, on a Ubuntu 10.04.4 LTS box, the character u'ş' (u'\u015f')
> gets encoded as u'ş' (u's\u0327').

which is incorrect. The latter isn't an "encoding" of the former. They are
alternate Unicode representations of the same character. The former uses a
pre-composed character (LATIN SMALL LETTER S WITH CEDILLA) while the latter
uses a letter 's' with a combining accent (COMBINING CEDILLA).

Unlike the Mac, neither Unix nor Windows will automatically normalise
Unicode strings. A Unix filename is a sequence of bytes, nothing more and
nothing less. This is part of the reason why Unix filenames are case
sensitive: case applies to characters, and the kernel doesn't know which
characters, if any, those bytes are meant to represent.

Python will convert a Unicode string to a sequence of bytes using the
filesystem encoding. If the encoding is UTF-8, then u'\u015f' will be
encoded as b'\xc5\x9f', while u's\u0327' will be encoded as b's\xcc\xa7'.

If you want to convert a Unicode string to a given normalisation, you can
use unicodedata.normalize(), e.g.:

    >>> unicodedata.normalize('NFC', 's\u0327')
    '\u015f'
    >>> unicodedata.normalize('NFD', '\u015f')
    's\u0327'

However: if you want to access an existing file, you must use the filename
as it appears on disc. On Unix and Windows, it's perfectly possible to have
two files named e.g. '\u015f.txt' and 's\u0327.txt' in the same directory.
Which one gets opened depends upon the exact sequence of Unicode codepoints
passed to open().

The situation is different on the Mac, where system libraries
automatically impose a specific representation on filenames, and will
normalise Unicode strings to that representation.
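
As a rough illustration of the points above (Python 3 syntax; the helper
names matching_names() and find_on_disk() are made up for this sketch and
are not part of any library), one way to locate an existing file without
knowing which representation was used to create it is to compare names
after normalising both sides, and then pass the name exactly as listed to
open():

```python
import os
import unicodedata

PRECOMPOSED = "\u015f"   # LATIN SMALL LETTER S WITH CEDILLA
DECOMPOSED = "s\u0327"   # 's' followed by COMBINING CEDILLA

# Two representations of one character: different codepoint sequences,
# and therefore different byte strings once encoded as UTF-8.
assert PRECOMPOSED != DECOMPOSED
assert PRECOMPOSED.encode("utf-8") == b"\xc5\x9f"
assert DECOMPOSED.encode("utf-8") == b"s\xcc\xa7"


def matching_names(names, wanted, form="NFC"):
    """Return every entry of `names` that equals `wanted` once both are
    converted to the same normalisation form (hypothetical helper)."""
    target = unicodedata.normalize(form, wanted)
    return [n for n in names if unicodedata.normalize(form, n) == target]


def find_on_disk(directory, wanted, form="NFC"):
    """Return the names in `directory`, exactly as stored, that match
    `wanted` up to normalisation (hypothetical helper)."""
    return matching_names(os.listdir(directory), wanted, form)
```

On Unix or Windows the result may contain more than one name, since both
representations can coexist in a directory; the caller has to decide which
of them it wants. Whatever it picks, it should open the name as returned,
not a re-normalised copy of it.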