Kenneth Pronovici wrote:
1) Why LC_ALL has any effect on the os.listdir() result?

The operating system (POSIX) does not have the inherent notion that file names are character strings. Instead, in POSIX, file names are primarily byte strings. There are some bytes which are interpreted as characters (e.g. '\x2e', which is '.', or '\x2f', which is '/'), but apart from that, most OS layers think these are just bytes.

Now, most *people* think that file names are character strings.
To interpret a file name as a character string, you need to know
which encoding to use to map its bytes to characters.

There is, unfortunately, no operating system API to carry
the notion of a file system encoding. By convention, the locale
settings should be used to establish this encoding, in particular
the LC_CTYPE facet of the locale. This is defined in the
environment variables LC_ALL, LC_CTYPE, and LANG (searched
in this order).
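
To make the lookup concrete: POSIX gives LC_ALL precedence, then LC_CTYPE, then LANG, and treats an empty value like an unset one. Here is a minimal sketch of that rule. The helper name effective_ctype is made up for this example; real programs would normally leave this to setlocale() rather than reimplement it.

```python
import os


def effective_ctype(env=None):
    """Return the locale name that governs LC_CTYPE.

    Hypothetical helper, for illustration only. It follows the
    conventional lookup order LC_ALL, LC_CTYPE, LANG.
    """
    if env is None:
        env = os.environ
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:          # unset or empty variables are skipped
            return value
    return "C"             # nothing set: the "C" locale applies


print(effective_ctype({"LANG": "en_US.UTF-8"}))
print(effective_ctype({"LANG": "en_US.UTF-8", "LC_ALL": "C"}))
print(effective_ctype({}))
```

This is why setting LC_ALL changes what you see: it overrides whatever LC_CTYPE and LANG say.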

2) Why only 3 of the 4 files come back as unicode strings?

If LANG is not set, the "C" locale is assumed, which uses ASCII as its file system encoding. In this locale, '\xe2\x99\xaa\xe2\x99\xac' is not a valid file name (at least, it cannot be interpreted as characters, and hence cannot be converted to Unicode).
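
You can see the effect without touching the file system. A small sketch (the variable names are mine, not from your script): the same bytes fail to decode as ASCII but decode fine as UTF-8.

```python
name = b'\xe2\x99\xaa\xe2\x99\xac'   # the raw bytes of the file name

try:
    name.decode('ascii')             # what the "C" locale implies
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False                 # bytes >= 0x80 are not ASCII

as_utf8 = name.decode('utf-8')       # two characters: U+266A and U+266C

print(ascii_ok)
print(len(as_utf8))
```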

Now, your Python script has requested that all file names
*should* be returned as character (i.e. Unicode) strings, but
Python cannot comply, since there is no way to find out what
this byte string means, in terms of characters.

So we have three options:
1. skip this string, only return the ones that can be
   converted to Unicode. Give the user the impression
   the file does not exist.
2. return the string as a byte string
3. refuse to listdir altogether, raising an exception
   (i.e. return nothing)

Python has chosen alternative 2, allowing the application
to implement 1 or 3 on top of that if it wants to (or
come up with other strategies, such as user feedback).
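
As a rough sketch of what "on top of that" could look like, assuming a listing that mixes character strings and byte strings as described above. The function names and the sample listing are made up for illustration.

```python
def only_decoded(names):
    """Option 1: keep only the names that came back as character strings."""
    return [n for n in names if not isinstance(n, bytes)]


def require_decoded(names):
    """Option 3: refuse the whole listing if any name is still bytes."""
    bad = [n for n in names if isinstance(n, bytes)]
    if bad:
        raise ValueError("undecodable file names: %r" % (bad,))
    return list(names)


# A made-up listing of the mixed kind described above.
listing = [u'a.txt', u'b.txt', b'\xe2\x99\xaa\xe2\x99\xac', u'c.txt']
print(only_decoded(listing))
```

Calling require_decoded(listing) on the same data raises ValueError instead of silently dropping the file.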

3) The proper "general" way to deal with this situation?

You can choose option 1 or 3; you could tell the user about it and then ignore the file; or you could try to guess the encoding (UTF-8 would be a reasonable guess).
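
Guessing could look roughly like this. It is only a sketch: guess_decode is a hypothetical helper, and the candidate list is an assumption you would adapt to your users.

```python
def guess_decode(raw, candidates=('utf-8',)):
    """Try each candidate encoding in turn.

    Returns the decoded name, or None if no candidate fits.
    Hypothetical helper, for illustration only.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


print(guess_decode(b'\xe2\x99\xaa\xe2\x99\xac') is not None)
print(guess_decode(b'\xff\xfe') is None)
```

Be careful with candidates such as ISO-8859-1: that encoding accepts every byte sequence, so the guess always "succeeds" even when it is wrong.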

My goal is to build generalized code that consistently works with all
kinds of filenames.

Then it is best to drop the notion that file names are character strings (because some file names aren't). You do so by converting your path variable into a byte string. To do that, you could try

path = path.encode(sys.getfilesystemencoding())

This should work in most cases; Python will try to
determine the file system encoding from the environment,
and try to encode the path name with it. Notice, however:

- on some systems, getfilesystemencoding may return None,
  if the encoding could not be determined. Fall back
  to sys.getdefaultencoding in this case.
- depending on where you got path from, this may
  raise a UnicodeError, if the user has entered a
  path name which cannot be encoded in the file system
  encoding (the user may well believe that she has
  such a file on disk).

So your code would read

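import sys

try:
    # Fall back to the default encoding if none could be determined.
    path = path.encode(sys.getfilesystemencoding() or
                       sys.getdefaultencoding())
except UnicodeError:
    # The path cannot be expressed in the file system encoding.
    sys.stderr.write("Invalid path name %r\n" % (path,))
    sys.exit(1)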

Ultimately, all I'm trying to do is copy some files
around.  I'd really prefer to find a programmatic way to make this work
that was independent of the user's configured locale, if possible.

As long as you manage to get a byte string from the path entered, all should be fine.
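
For the copying itself, here is a minimal sketch that stays in byte strings from start to finish. It assumes both directories are already byte strings (e.g. encoded as shown above). copy_all is a made-up helper, and the scratch directories exist only for the demonstration.

```python
import os
import shutil
import sys
import tempfile


def copy_all(src, dst):
    """Copy every regular file from src to dst.

    Hypothetical helper. Both arguments must already be byte strings,
    so no file name is ever decoded to characters.
    """
    copied = []
    for name in os.listdir(src):      # byte path in, byte names out
        source = os.path.join(src, name)
        if os.path.isfile(source):
            shutil.copy(source, os.path.join(dst, name))
            copied.append(name)
    return copied


# Demonstration in scratch directories, using the file name from this thread.
fs_enc = sys.getfilesystemencoding() or sys.getdefaultencoding()
src = tempfile.mkdtemp().encode(fs_enc)
dst = tempfile.mkdtemp().encode(fs_enc)
with open(os.path.join(src, b'\xe2\x99\xaa\xe2\x99\xac'), 'wb') as f:
    f.write(b'data')

copied = copy_all(src, dst)
result = os.listdir(dst)
with open(os.path.join(dst, result[0]), 'rb') as f:
    content = f.read()

# Clean up the scratch directories.
shutil.rmtree(src)
shutil.rmtree(dst)

print(result == copied)
```

Since the names are never interpreted as characters, the user's locale does not enter into it.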

Regards,
Martin
