On Jul 17, 2007, at 5:25 AM, Joe Orton wrote:

On Tue, Jul 17, 2007 at 02:14:25PM +0200, Erik Huelsmann wrote:
Reading [1], I conclude that applications should pass UTF-8 to BSD
functions such as stat() at all times. This suggests to me that
apr_filepath_encoding() should return APR_FILEPATH_ENCODING_UTF8.

Yet, looking at the sources, on any Unixy system, it returns
APR_FILEPATH_ENCODING_LOCALE.

Is this an oversight, or am I missing something else?

My understanding is that in Darwin/Mac OS, all file names, when accessed above the VFS layer, are, by convention, decomposed UTF-8. This is confirmed by the Tech Note:

    http://developer.apple.com/qa/qa2001/qa1173.html

  At the top:

        In Mac OS X's VFS API file names are, by definition,
        canonically decomposed Unicode, encoded using UTF-8.

Under "Returning Names", it is clear that the file system implementation is expected to convert the on-disk file name encoding (if known) to decomposed UTF-8:

        When returning names to higher layers (for example,
        from your VOP_READDIR entry point), you should always
        return decomposed names. If your underlying volume
        format uses precomposed names, you should convert
        any precomposed characters to their decomposed
        equivalents before returning them to the system.
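
To make the precomposed/decomposed distinction concrete, here is a minimal Python sketch. It uses the standard Unicode NFD form from `unicodedata`, which is an assumption made for illustration: the decomposition a given Mac OS file system applies may differ in detail. The helper name `to_decomposed_utf8` is hypothetical.

```python
import unicodedata

def to_decomposed_utf8(name: str) -> bytes:
    """Return name in canonically decomposed form (NFD), encoded as UTF-8."""
    return unicodedata.normalize("NFD", name).encode("utf-8")

# 'é' stored as a single precomposed code point, U+00E9.
precomposed = "caf\u00e9"
decomposed = to_decomposed_utf8(precomposed)

# After decomposition, 'é' becomes 'e' followed by U+0301
# (COMBINING ACUTE ACCENT), which is the bytes 0xCC 0x81 in UTF-8.
print(precomposed.encode("utf-8"))  # b'caf\xc3\xa9'
print(decomposed)                   # b'cafe\xcc\x81'
```

The two byte strings name the "same" file to a human but differ byte for byte, which is why a file system that returns decomposed names can surprise callers that compare names as raw bytes.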

Note that the above is considerably easier for a file system like HFS+, where we know the on-disk encodings. It's trickier for any file system which doesn't specify the file name encoding, which unfortunately is most. It's particularly tricky when the volume format is shared across different operating systems, since other systems do not, AFAICT, have well-established conventions for file name encoding (*).

Note also that the convention is not enforced per se (**). As a result, you aren't guaranteed, even on Mac OS, that file names are valid UTF-8 (***). That poses interesting problems. For example, CFString (and therefore basically all Mac apps) has been known to barf (and crash) when given a file name which isn't UTF-8, since it is typically told that the name is UTF-8.

        pathname = <some illegal UTF-8 string>
        [NSFileHandle handleWithPath:
          [NSString stringWithUTF8String: pathname]]; // boom!

This is rare in practice, since Mac apps don't produce "illegal" file names for the same reason that they can't read them.
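
One defensive pattern is to validate the bytes before handing them to a string API that assumes UTF-8. The sketch below is in Python rather than Objective-C, and the helper name `decode_filename` is hypothetical; it only illustrates the check, not how any Apple toolkit behaves.

```python
from typing import Optional

def decode_filename(raw: bytes) -> Optional[str]:
    """Decode raw as UTF-8, returning None instead of raising if it is not valid."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

# Valid UTF-8: 'é' encoded as 0xC3 0xA9.
print(decode_filename(b"caf\xc3\xa9"))   # café

# Latin-1 style 'é' (a lone 0xE9) is not a valid UTF-8 sequence.
print(decode_filename(b"caf\xe9"))       # None
```

The caller then has to decide what to do with a `None` result (skip the entry, show an escaped form, and so on) instead of crashing.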

This is deliberate; on Unix the character set used for filenames is
dictated by the locale settings (e.g. LC_CTYPE), by convention.

Do you have a reference on this? I'm unaware of this convention. Perhaps by "Unix", you mean "Linux" (***)? I was around when we decided the above nonsense for Darwin, and I remember trying to find some such reference, so I'd love to see it.

There is certainly no Unix standard which dictates that all filenames
must be UTF-8-encoded Unicode, so APR cannot enforce that.

  No, but as I mention above, such a standard does exist in Darwin.

        -wsv



(*) I'm getting a vibe from Joe that Linux does, but I'm going to bet that more software on Linux is unaware of the convention there than Mac apps are on Mac OS (especially since most use our Toolkits, which are aware of it).

(**) It is, sort of, on HFS+. But not really; if your file name string is not UTF-8 but is a legal byte sequence in UTF-8, it'll be stored as given. (Unless it looks like precomposed UTF-8 and gets decomposed for you, which may look like corruption if that's not what you expect, which is likely if you weren't thinking UTF-8.)

(***) The Single Unix Specification, which is probably the only "authority" left regarding Unix standards, has little useful to say on file name encodings. Here is everything I could find on the subject:

        For a filename to be portable across implementations
        conforming to IEEE Std 1003.1-2001, it shall consist
        only of the portable filename character set as defined
        in Portable Filename Character Set.

        The hyphen character shall not be used as the first
        character of a portable filename. Uppercase and
        lowercase letters shall retain their unique identities
        between conforming implementations. In the case of a
        portable pathname, the slash character may also be used.

So here we know that case-insensitive file systems are non-conforming. Oops.

  The Portable Filename Character Set is impressively weak:

        A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
        a b c d e f g h i j k l m n o p q r s t u v w x y z
        0 1 2 3 4 5 6 7 8 9 . _ -

This omits space and most punctuation, which makes sense if poorly written shell scripts (an unfortunate majority) are in the portability target.
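
As a rough illustration of how restrictive the quoted rules are, the following Python sketch checks a name against the Portable Filename Character Set and the leading-hyphen rule. The function name `is_portable_filename` is hypothetical, and rejecting the empty string is an added assumption, not something stated in the quoted text.

```python
import string

# The 65 characters listed in the Portable Filename Character Set.
PORTABLE_CHARS = frozenset(string.ascii_letters + string.digits + "._-")

def is_portable_filename(name: str) -> bool:
    """True if name uses only portable characters and does not start with '-'."""
    if not name or name.startswith("-"):
        return False
    return all(ch in PORTABLE_CHARS for ch in name)

print(is_portable_filename("README.txt"))   # True
print(is_portable_filename("my file.txt"))  # False (space is not portable)
print(is_portable_filename("-rf"))          # False (leading hyphen)
```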

  File names are defined thusly:

        A name consisting of 1 to {NAME_MAX} bytes used to name
        a file. The characters composing the name may be selected
        from the set of all character values excluding the slash
        character and the null byte. The filenames dot and dot-dot
        have special meaning. A filename is sometimes referred to
        as a "pathname component".

  Clearly this allows for byte sequences that are not legal UTF-8.
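
A small Python sketch of that gap, assuming a `NAME_MAX` of 255 bytes (a common value, but implementation-defined and used here only as a placeholder). Both helper names are hypothetical.

```python
def is_valid_filename(raw: bytes, name_max: int = 255) -> bool:
    """Check raw against the quoted definition: 1..NAME_MAX bytes, no '/' or NUL."""
    return 1 <= len(raw) <= name_max and b"/" not in raw and b"\x00" not in raw

def is_valid_utf8(raw: bytes) -> bool:
    """True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

name = b"\xff\xfe"  # 0xFF and 0xFE never appear in well-formed UTF-8
print(is_valid_filename(name))  # True
print(is_valid_utf8(name))      # False
```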

  And we note that PATH is fairly ill-conceived:

        Filenames should be constructed from the portable filename
        character set because the use of other characters can be
        confusing or ambiguous in certain contexts. (For example,
        the use of a colon ( ':' ) in a pathname could cause
        ambiguity if that pathname were included in a PATH
        definition.)
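
The ambiguity is easy to reproduce. In this Python sketch the directory names are made up for illustration, and `split_path` is a hypothetical helper that mimics the usual colon-separated interpretation of PATH.

```python
from typing import List

def split_path(path_value: str) -> List[str]:
    """Split a PATH-style string on ':' the way a typical consumer would."""
    return path_value.split(":")

# Two directories, one of which contains a colon in its name.
dirs = ["/usr/bin", "/opt/odd:name/bin"]
joined = ":".join(dirs)

# The round trip yields three entries, none of which is the second directory.
print(split_path(joined))  # ['/usr/bin', '/opt/odd', 'name/bin']
```

Since PATH has no escaping mechanism for the separator, a directory whose name contains a colon cannot be represented in it unambiguously.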

  This is all I could find on file names in the specification.

—
Wilfredo Sánchez - [EMAIL PROTECTED]
