On 7/17/07, Wilfredo Sánchez Vega <[EMAIL PROTECTED]> wrote:
On Jul 17, 2007, at 5:25 AM, Joe Orton wrote:

> On Tue, Jul 17, 2007 at 02:14:25PM +0200, Erik Huelsmann wrote:
>> Reading [1], I conclude that applications should pass UTF-8 to BSD
>> functions such as stat() at all times. This suggests to me that
>> apr_filepath_encoding() should return APR_FILEPATH_ENCODING_UTF8.
>>
>> Yet, looking at the sources, on any Unixy system, it returns
>> APR_FILEPATH_ENCODING_LOCALE.
>>
>> Is this an oversight, or am I missing something else?

   My understanding is that in Darwin/Mac OS, all file names, when
accessed above the VFS layer, are, by convention, decomposed UTF-8.
This is confirmed by the Tech Note:

     http://developer.apple.com/qa/qa2001/qa1173.html

   At the top:

        In Mac OS X's VFS API file names are, by definition,
        canonically decomposed Unicode, encoded using UTF-8.

Which suggests that apr_filepath_encoding should return
APR_FILEPATH_ENCODING_UTF8, if I'm not mistaken.

   Under "Returning Names", it is clear that the file system
implementation is expected to convert the on-disk file name encoding
(if known) to decomposed UTF-8:

        When returning names to higher layers (for example,
        from your VOP_READDIR entry point), you should always
        return decomposed names. If your underlying volume
        format uses precomposed names, you should convert
        any precomposed characters to their decomposed
        equivalents before returning them to the system.
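
To make the precomposed/decomposed distinction concrete, here is a small
sketch using Python's standard unicodedata module. It assumes standard
Unicode NFD, which may differ in detail from what a given file system
implementation actually does; the helper name to_decomposed_utf8 is
made up for illustration.

```python
import unicodedata


def to_decomposed_utf8(name: str) -> bytes:
    """Return the canonically decomposed (NFD) form of name as UTF-8 bytes."""
    return unicodedata.normalize("NFD", name).encode("utf-8")


# The same visible name, spelled two ways:
precomposed = "caf\u00e9"   # 'e with acute' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

precomposed_bytes = precomposed.encode("utf-8")
decomposed_bytes = to_decomposed_utf8(precomposed)

# The two byte strings differ, so a byte-wise comparison treats them as
# different names unless one side is normalized first.
names_differ = precomposed_bytes != decomposed_bytes
```

A byte-oriented call such as stat() sees only the bytes, so which of
the two spellings an application passes in matters.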

   Note that the above is considerably easier for a file system like
HFS+, where we know the on-disk encodings.  It's trickier for any file
system which doesn't specify the file name encoding, which
unfortunately is most.  It's particularly tricky when the volume
format is shared across different operating systems, since other
systems do not, AFAICT, have well-established conventions for file
name encoding (*).

   Note also that the convention is not enforced per se (**).  As a
result, you aren't guaranteed, even on Mac OS, that file names are
valid UTF-8 (***).  That poses interesting problems.  For example,
CFString (and therefore basically all Mac apps) has been known to
barf (and crash) when given a file name which isn't UTF-8, since it is
typically told that the name is UTF-8.
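
One defensive approach, sketched here under the assumption that file
names are handled as raw bytes, is to check that a name is valid UTF-8
before handing it to code that assumes it is. The function name
is_valid_utf8 is hypothetical, not part of APR or any Apple API.

```python
def is_valid_utf8(raw_name: bytes) -> bool:
    """Return True if raw_name decodes cleanly as strict UTF-8."""
    try:
        raw_name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Decomposed UTF-8 name: 'e' followed by U+0301 (bytes 0xCC 0x81).
utf8_name_ok = is_valid_utf8(b"cafe\xcc\x81")

# The same visible name written in ISO-8859-1: a lone 0xE9 byte is the
# start of a multi-byte UTF-8 sequence with nothing after it.
latin1_name_ok = is_valid_utf8(b"caf\xe9")
```

This only tells you that the bytes are well-formed UTF-8, not that they
are decomposed, or that UTF-8 is what the creator of the file intended.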

But generally, applications that are themselves well-behaved Mac OS X
apps won't work with these non-UTF-8 paths, right? That reduces the
chances of being fed garbage. Other OSes can't guarantee UTF-8ness
either, though: LANG (and LC_CTYPE) are per-user settings that can
differ between users, while path names are the same for all users. So
on Linux you can't be too sure either.
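
A small illustration of that last point, assuming two users whose
locales imply UTF-8 and ISO-8859-1 respectively (the encodings are
picked for the example; the helper name decode_for_locale is made up):

```python
def decode_for_locale(raw_name: bytes, locale_encoding: str) -> str:
    """Decode the on-disk bytes the way a user with that locale would."""
    return raw_name.decode(locale_encoding)


# One file name on disk; the bytes are identical for every user.
raw_name = b"caf\xc3\xa9"

seen_by_utf8_user = decode_for_locale(raw_name, "utf-8")
seen_by_latin1_user = decode_for_locale(raw_name, "iso-8859-1")

# The two users end up with different strings for the same file.
views_differ = seen_by_utf8_user != seen_by_latin1_user
```

The first user sees a four-character name ending in U+00E9; the second
sees five characters, with U+00C3 and U+00A9 where the accented letter
should be.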

Thanks for the extensive explanation.

bye,

Erik.
