On Fri, 12 Dec 2008 23:32:27 +1300, Michal Ludvig wrote:

> is there any way to determine what's the charset of filenames returned
> by os.walk()?

No. Especially on *nix file systems, file names are just strings of bytes, not characters. It is possible to have file names in different encodings in the same directory.

> The trouble is, if I pass <type 'str'> argument to os.walk() I get the
> filenames as byte-strings. Possibly UTF-8 encoded Unicode, who knows.

Nobody knows. :-)

> What's the right and safe way to walk the filesystem and get some
> meaningful filenames?

The safe way is to use `str`, i.e. byte strings.

> Related question - if the directory name is given on a command line
> what's the right way to preprocess the argument before passing it down
> to os.walk()?

Pass it as is.

> For instance with LANG=en_NZ.UTF-8 (i.e. UTF-8 system):
> * directory is called 'smile☺'
> * sys.argv[1] will be 'smile\xe2\x98\xba' (type str)
> * after .decode("utf-8") I get u'smile\u263a' (type unicode)
>
> But how should I decode() it when running on a system where $LANG
> doesn't end with "UTF-8"? Apparently some locales have non-ascii default
> charsets. For instance zh_TW is BIG5 charset by default, ru_RU is
> ISO-8859-5, etc. How do I detect that to get the right charset for
> decode()?

You can't. Even if you know the preferred encoding of the system, e.g. via $LANG, there is no guarantee that all file names are encoded this way.

> I tend to have everything internally in Unicode but it's often unclear
> how to convert some inputs to Unicode in the first place. What are the
> best practices for dealing with these charset issues in Python?

I usually use UTF-8 as the default but offer the user ways, e.g. command line switches, to change that. If I have to display file names in a GUI, I use a decoded version of the byte string file name, but keep the byte string for operations on the file.

Ciao,
	Marc 'BlackJack' Rintsch
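
A minimal sketch of the last suggestion above (decode only for display, keep the byte string for file operations). It is written for Python 3, where passing a bytes path to os.walk() returns bytes names. The helper names `display_name` and `walk_bytes` and the UTF-8 default are assumptions for illustration, not something from this thread.

```python
import os


def display_name(raw_name, encoding="utf-8"):
    """Return a printable text version of a byte-string file name.

    Undecodable bytes are replaced with U+FFFD, so the result is for
    display only and must not be used to open the file.
    """
    return raw_name.decode(encoding, errors="replace")


def walk_bytes(top, encoding="utf-8"):
    """Yield (raw_path, shown_path) pairs for every file below `top`.

    `encoding` is a guess supplied by the caller (e.g. from a command
    line switch); nothing guarantees the names really use it.
    """
    # A bytes argument makes os.walk() return bytes names unchanged.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(top)):
        for name in filenames:
            raw_path = os.path.join(dirpath, name)
            # raw_path is what you pass to open(), os.lstat(), etc.
            yield raw_path, display_name(raw_path, encoding)
```

A caller would show `shown_path` to the user and pass `raw_path` to calls such as `open()` or `os.lstat()`, so a wrong guess about the encoding only affects what is displayed.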