On Jul 18, 2007, at 9:55 AM, Wilfredo Sánchez Vega wrote:

On Jul 18, 2007, at 2:11 AM, Joe Orton wrote:

- it is convention on all modern Unixes I'm aware of that filename
charset/encoding follows LC_CTYPE; not just Linux. It may derive from
Solaris, I think that's where the locale APIs originate.

I guess I don't know how that works in practice. When you have an encoded string, you need to know its encoding. On a file system, there is (typically) no metadata to indicate the encoding of the file name string.

So I set my locale settings to correspond to encoding A and write a file. Yours is encoding B. On Linux, one expects the file name to display differently for the other user?
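
To make the question concrete, here is a minimal sketch of what happens when a name is written under one encoding and listed under another. The two helper functions and the sample name are hypothetical, chosen only for illustration; the only assumption is that the file system stores the raw bytes and nothing else.

```python
# Hypothetical illustration: the file system keeps only bytes, so the
# reader's locale decides how those bytes are shown.

def stored_bytes(name: str, writer_encoding: str) -> bytes:
    """Bytes that end up on disk; the encoding itself is not recorded."""
    return name.encode(writer_encoding)


def displayed_name(raw: bytes, reader_encoding: str) -> str:
    """What a reader sees when decoding the bytes under its own locale."""
    return raw.decode(reader_encoding, errors="replace")


name = "Communiqué.txt"
raw = stored_bytes(name, "utf-8")           # writer uses a UTF-8 locale

same = displayed_name(raw, "utf-8")         # reader with the same locale
other = displayed_name(raw, "iso-8859-1")   # reader with a Latin-1 locale

print(same)
print(other)
```

So yes, under this convention the second user would see a different (mangled) name for the same file, even though the bytes on disk never changed.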

On Solaris, it is only documented within "man -s 5 environ"

         LC_CTYPE

             This category  specifies  character  classification,
             character  conversion, and widths of multibyte char-
             acters. When LC_CTYPE is set to a valid  value,  the
             calling utility can display and handle text and file
             names containing valid characters for  that  locale;
             Extended  Unix Code (EUC) characters where any indi-
             vidual character can be 1, 2, or 3 bytes  wide;  and
             EUC  characters  of  1,  2,  or 3 column widths. The
             default "C" locale corresponds to  the  7-bit  ASCII
             character  set;  only characters from ISO 8859-1 are
             valid.  The  information   corresponding   to   this
             category  is  stored  in  a  database created by the
             localedef() command.  This environment  variable  is
             used  by  ctype(3C),  mblen(3C),  and many commands,
             such as cat(1), ed(1), ls(1), and vi(1).

POSIX does not recognize the use of LC_CTYPE for filenames because
the locale is supposed to be set on a per-process basis.

It doesn't work in practice.  The hack was added on Solaris in order
to give the appearance of internationalization without changing the
existing filesystems.  A better implementation would define it once
per mount point, with iso-8859-1 as the pre-existing default, and
allow that to be overridden by directory (where the names are stored).
I think that is why this use of locale was never standardized.

A system less concerned with backwards compatibility is better off
with a requirement of utf-8, though OS X should have made the filename
encoding a mount option.  I assume that the ISO9660-Joliet (CD-ROM)
driver does some form of filename translation automatically from UCS-2.

In any case, even with the convention, it is left to the application
to determine how it will treat encoded filenames.  The OS X decision
to treat them all as utf-8 is at least consistent.  OTOH, this
is just a display convention -- OS X apps should have been designed
to treat the filename internally as an opaque nul-terminated array,
rather than barfing on non-utf8 encodings.
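
As a rough sketch of the "opaque array" approach: keep the raw bytes as the real name, and decode only when something has to be shown to a person. The function names here are hypothetical; `surrogateescape` is the standard Python error handler for carrying undecodable bytes through a string and back.

```python
def display_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Best-effort text for display only; never used to reopen the file."""
    return raw.decode(encoding, errors="replace")


def round_trip(raw: bytes) -> bytes:
    """Pass arbitrary bytes through a str and back without loss."""
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


# A Latin-1 encoded name, which is not valid UTF-8.
latin1_name = "Communiqué.txt".encode("iso-8859-1")

shown = display_name(latin1_name)    # undecodable byte becomes U+FFFD
kept = round_trip(latin1_name)       # the bytes survive unchanged

print(shown)
print(kept == latin1_name)
```

The display string may be lossy, but any operation on the file keeps using the original bytes, so a non-utf8 name is ugly rather than fatal.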

One thing I miss in OS X is an automated way for file archivers
(like unzip) to recognize and convert non-utf-8 filenames
when they are unarchived.  I frequently have to do that by hand
after unzipping something from China or Switzerland. Subversion
breaks on OS X whenever someone commits a filename with an e-acute,
which is a problem when your main product name is Communiqué.
I wonder if this change in APR would fix that error?
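
The by-hand conversion after unzipping amounts to something like the sketch below. It is only an illustration: `to_utf8_name` is a hypothetical helper, the source encoding has to be supplied by whoever knows where the archive came from, and treating "decodes as UTF-8" as "is already UTF-8" is a heuristic, not a guarantee.

```python
def to_utf8_name(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode a file name to UTF-8, assuming a caller-supplied source encoding.

    Names that already decode cleanly as UTF-8 are returned unchanged.
    """
    try:
        raw.decode("utf-8")
        return raw                      # heuristic: looks like UTF-8 already
    except UnicodeDecodeError:
        pass
    # Raises UnicodeDecodeError if the guess about source_encoding is wrong.
    return raw.decode(source_encoding).encode("utf-8")


swiss = "Communiqué.txt".encode("iso-8859-1")
chinese = "中文.txt".encode("gb18030")

print(to_utf8_name(swiss, "iso-8859-1").decode("utf-8"))
print(to_utf8_name(chinese, "gb18030").decode("utf-8"))
```

An archiver would still need some way to learn the source encoding, which is exactly the piece of metadata that is missing.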

....Roy
