On Jul 18, 2007, at 9:55 AM, Wilfredo Sánchez Vega wrote:
On Jul 18, 2007, at 2:11 AM, Joe Orton wrote:
- it is convention on all modern Unixes I'm aware of that filename
charset/encoding follows LC_CTYPE; not just Linux. It may derive
from Solaris, I think that's where the locale APIs originate.
I guess I don't know how that works in practice. When you have
an encoded string, you need to know its encoding. On a file
system, there is (typically) no metadata to indicate the encoding
of the file name string.
So I set my locale settings to correspond to encoding A and write
a file. Yours is encoding B. On Linux, one expects the file name
to display differently for the other user?
On Solaris, it is only documented within "man -s 5 environ":
LC_CTYPE
This category specifies character classification,
character conversion, and widths of multibyte char-
acters. When LC_CTYPE is set to a valid value, the
calling utility can display and handle text and file
names containing valid characters for that locale;
Extended Unix Code (EUC) characters where any indi-
vidual character can be 1, 2, or 3 bytes wide; and
EUC characters of 1, 2, or 3 column widths. The
default "C" locale corresponds to the 7-bit ASCII
character set; only characters from ISO 8859-1 are
valid. The information corresponding to this
category is stored in a database created by the
localedef() command. This environment variable is
used by ctype(3C), mblen(3C), and many commands,
such as cat(1), ed(1), ls(1), and vi(1).
POSIX does not recognize the use of LC_CTYPE for filenames because
the locale is supposed to be set on a per-process basis.
It doesn't work in practice. The hack was added on Solaris in order
to give the appearance of internationalization without changing the
existing filesystems. A better implementation would define it once
per mount point, with iso-8859-1 as the pre-existing default, and
allow that to be overridden by directory (where the names are stored).
I think that is why this use of locale was never standardized.
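To make the problem concrete, here is a minimal Python sketch (not
anything from APR or Solaris) of what happens when the writer and the
reader of a directory entry have different LC_CTYPE settings. The
variable names are purely illustrative; the only assumption is that the
filesystem stores the name as raw bytes with no encoding tag.

```python
# Bytes as stored in the directory entry by a writer in a UTF-8 locale.
stored = "Communiqué".encode("utf-8")

# Each reader decodes the very same bytes according to its own locale.
as_utf8 = stored.decode("utf-8")         # reader in a UTF-8 locale
as_latin1 = stored.decode("iso-8859-1")  # reader in an ISO 8859-1 locale

print(as_utf8)
print(as_latin1)
```

The second reader sees two garbage characters where the accented
letter should be, even though nothing on disk has changed.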
A system less concerned with backwards compatibility is better off
with a requirement of utf-8, though OS X should have made the filename
encoding a mount option. I assume that the ISO9660-Joliet (CD-ROM)
driver does some form of filename translation automatically from UCS-2.
In any case, even with the convention, it is left to the application
to determine how it will treat encoded filenames. The OS X decision
to treat them all as utf-8 is at least consistent. OTOH, this
is just a display convention -- OS X apps should have been designed
to treat the filename internally as an opaque nul-terminated array,
rather than barfing on non-utf8 encodings.
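As a rough illustration of the "opaque array" approach, the following
Python sketch keeps names as bytes internally and decodes them only for
display. The helper names (list_raw, to_display, to_raw) are
hypothetical; the "surrogateescape" error handler is a standard Python
feature that lets undecodable bytes survive a round trip. It is one
possible way to do this, not a description of how any OS X API behaves.

```python
import os


def list_raw(directory=b"."):
    """Return directory entries as raw bytes (bytes in, bytes out)."""
    return os.listdir(directory)


def to_display(raw):
    """Decode a name for display only; bytes that are not valid UTF-8
    are kept as lone surrogates instead of raising an error."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_raw(shown):
    """Inverse of to_display: recover the exact original bytes."""
    return shown.encode("utf-8", errors="surrogateescape")


# An ISO 8859-1 encoded name that is not valid UTF-8.
latin1_name = b"Communiqu\xe9"
round_tripped = to_raw(to_display(latin1_name))
```

The point of the design is that the application never loses the
original bytes, so it can still open, rename, or delete the file even
when it cannot render the name nicely.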
One thing I miss in OS X is an automated way for file archivers
(like unzip) to recognize and convert non-utf-8 filenames
when they are unarchived. I frequently have to do that by hand
after unzipping something from China or Switzerland. Subversion
breaks on OS X whenever someone commits a filename with an e-acute,
which is a problem when your main product name is Communiqué.
I wonder if this change in APR would fix that error?
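For what it's worth, the by-hand conversion mentioned above could be
sketched roughly as follows in Python. The function name to_utf8_name
and the fallback list are assumptions made for illustration; guessing a
legacy encoding is inherently heuristic (the same bytes can be valid in
several encodings), so a real tool would want a user-supplied hint
rather than a fixed list.

```python
def to_utf8_name(raw, fallbacks=("gb18030", "iso-8859-1")):
    """Return raw re-encoded as UTF-8, guessing the source encoding.

    Names that are already valid UTF-8 are returned unchanged.
    Otherwise each fallback encoding is tried in order.
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave it alone
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return raw.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    return raw  # nothing matched; keep the original bytes


# A name as it might arrive from an archive created in a GB18030 locale.
chinese = "中文.txt".encode("gb18030")
converted = to_utf8_name(chinese)
```

An unarchiver could apply something like this to each member name
before creating the file, renaming only when the result differs from
the input.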
....Roy