Michal Ludvig wrote:
> Hi all,
>
> is there any way to determine the charset of filenames returned
> by os.walk()?
>
> The trouble is, if I pass a <type 'str'> argument to os.walk() I get the
> filenames as byte-strings. Possibly UTF-8 encoded Unicode, who knows.
>
> OTOH if I pass <type 'unicode'> to os.walk() all the filenames I get in
> the loop are already unicode()d.
>
> However with some locale settings os.walk() dies with for example:
> Traceback (most recent call last):
>   File "tst.py", line 10, in <module>
>     for root, dirs, files in filelist:
>   File "/usr/lib/python2.5/os.py", line 303, in walk
>     for x in walk(path, topdown, onerror):
>   File "/usr/lib/python2.5/os.py", line 293, in walk
>     if isdir(join(top, name)):
>   File "/usr/lib/python2.5/posixpath.py", line 65, in join
>     path += '/' + b
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1:
> ordinal not in range(128)
>
> I can't even skip over these files with 'os.walk(..., onerror=handler)';
> the handler() is never called.
>
> That happens for instance when the file names have some non-ascii
> characters and the locale is set to ascii, but reportedly in some other
> cases as well.
>
> What's the right and safe way to walk the filesystem and get some
> meaningful filenames?
>
>
> Related question - if the directory name is given on a command line
> what's the right way to preprocess the argument before passing it down
> to os.walk()?
>
> For instance with LANG=en_NZ.UTF-8 (i.e. UTF-8 system):
> * directory is called 'smile☺'
> * sys.argv[1] will be 'smile\xe2\x98\xba' (type str)
> * after .decode("utf-8") I get u'smile\u263a' (type unicode)
>
> But how should I decode() it when running on a system where $LANG
> doesn't end with "UTF-8"? Apparently some locales have non-ascii default
> charsets. For instance zh_TW is BIG5 charset by default, ru_RU is
> ISO-8859-5, etc. How do I detect that to get the right charset for decode()?
>
> I tend to have everything internally in Unicode but it's often unclear
> how to convert some inputs to Unicode in the first place. What are the
> best practices for dealing with these charset issues in Python?
>
There's currently a huge thread on python-dev dealing with (or rather
discussing) this very tortuous issue. Look for "Python-3.0, unicode, and
os.environ" in the archives. (The same issue, by the way, also applies to
environment variables.)
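For what it's worth, here is a minimal sketch of one defensive approach:
walk with byte strings, so that no implicit ASCII decoding can happen inside
os.walk(), and decode names only for display, using the encoding Python
reports for the filesystem or locale. The helper names guess_encoding(),
to_text() and walk_bytes() are made up for illustration; they are not part of
the standard library. The sketch is written against a current Python, where
byte strings are the bytes type; on 2.x the same idea applies with str and
unicode. Note that the encoding is only a guess: nothing forces the names on
disk to match the locale. As far as I can tell, the onerror handler is never
called in the traceback above because it is only invoked for errors raised
while listing a directory, not for the UnicodeDecodeError raised later in
join().

```python
import locale
import os
import sys


def guess_encoding():
    """Best guess at the charset used for file names (an assumption, not a fact)."""
    # What Python believes the filesystem encoding is; may be empty/None
    # on some platforms and versions.
    enc = sys.getfilesystemencoding()
    if not enc:
        # Fall back to the locale's preferred encoding (derived from LANG/LC_*).
        enc = locale.getpreferredencoding()
    return enc or "ascii"


def to_text(raw, encoding=None):
    """Decode a byte-string name; undecodable bytes become U+FFFD."""
    if not isinstance(raw, bytes):
        return raw  # already text, nothing to do
    return raw.decode(encoding or guess_encoding(), "replace")


def walk_bytes(top, encoding=None):
    """Walk `top` using byte-string names.

    Yields (raw_path, display_path): use raw_path for open()/stat(),
    and display_path only for showing to the user.
    """
    if not isinstance(top, bytes):
        top = top.encode(encoding or guess_encoding())
    for root, dirs, files in os.walk(top):
        for name in files:
            raw = os.path.join(root, name)
            yield raw, to_text(raw, encoding)
```

The point of keeping the raw byte path around is that it always round-trips
to the file, even when the decoded form contains replacement characters. A
command-line argument that arrives as a byte string can be passed through
to_text() in the same way before being shown to the user.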
In a nutshell, this is likely to cause pain until all file systems are
standardized on a particular encoding of Unicode. Probably only about
another fifteen years to go ...

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/