On Jul 17, 2007, at 5:25 AM, Joe Orton wrote:
> On Tue, Jul 17, 2007 at 02:14:25PM +0200, Erik Huelsmann wrote:
>> Reading [1], I conclude that applications should pass UTF-8 to BSD
>> functions such as stat() at all times. This suggests to me that
>> apr_filepath_encoding() should return APR_FILEPATH_ENCODING_UTF8.
>> Yet, looking at the sources, on any Unixy system, it returns
>> APR_FILEPATH_ENCODING_LOCALE.
>> Is this an oversight, or am I missing something else?
My understanding is that in Darwin/Mac OS, all file names, when
accessed above the VFS layer, are, by convention, decomposed UTF-8.
This is confirmed by the Tech Note:
http://developer.apple.com/qa/qa2001/qa1173.html
At the top:
    In Mac OS X's VFS API file names are, by definition,
    canonically decomposed Unicode, encoded using UTF-8.
Under "Returning Names", it is clear that the file system
implementation is expected to convert the on-disk file name encoding
(if known) to decomposed UTF-8:
    When returning names to higher layers (for example,
    from your VOP_READDIR entry point), you should always
    return decomposed names. If your underlying volume
    format uses precomposed names, you should convert
    any precomposed characters to their decomposed
    equivalents before returning them to the system.
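
To make "precomposed" versus "decomposed" concrete, here is a minimal Python sketch. It assumes standard Unicode NFD as a stand-in for whatever decomposition a given file system actually applies (HFS+ is known to differ from plain NFD in some ranges), and the helper name to_decomposed_utf8 is hypothetical.

```python
import unicodedata

def to_decomposed_utf8(name: str) -> bytes:
    """Return the name as canonically decomposed (NFD) Unicode, encoded as UTF-8."""
    # Assumption: plain NFD approximates the file system's decomposition.
    return unicodedata.normalize("NFD", name).encode("utf-8")

precomposed = "caf\u00e9"                 # 'é' as the single code point U+00E9
print(precomposed.encode("utf-8"))        # b'caf\xc3\xa9'
print(to_decomposed_utf8(precomposed))    # b'cafe\xcc\x81' ('e' + combining acute U+0301)
```

The two byte strings name the same text but do not compare equal, which is why a convention about which form to return matters.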
Note that the above is considerably easier for a file system like
HFS+, where we know the on-disk encodings. It's trickier for any file
system which doesn't specify the file name encoding, which
unfortunately is most. It's particularly tricky when the volume
format is shared across different operating systems, since other
systems do not, AFAICT, have well-established conventions for file
name encoding (*).
Note also that the convention is not enforced per se (**). As a
result, you aren't guaranteed, even on Mac OS, that file names are
valid UTF-8 (***). That poses interesting problems. For example,
CFString (and therefore basically every Mac app) has been known to
barf (and crash) when given a file name which isn't valid UTF-8, since
it is typically told that the name is UTF-8.
    const char *pathname = "\xff\xfe";  /* some illegal UTF-8 string */
    [NSFileHandle fileHandleForReadingAtPath:
        [NSString stringWithUTF8String: pathname]];  // boom!
This is rare in practice, since Mac apps don't produce "illegal"
file names for the same reason that they can't read them.
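
A defensive way to handle names that may not be UTF-8 is to check before trusting the bytes. The following is only a sketch in Python; the helper names is_valid_utf8 and display_name are hypothetical and not part of any API discussed in this thread.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Report whether raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def display_name(raw: bytes) -> str:
    """Decode for display; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode("utf-8", errors="replace")

print(is_valid_utf8(b"caf\xc3\xa9"))   # True: well-formed UTF-8
print(is_valid_utf8(b"caf\xe9"))       # False: a lone Latin-1 byte
print(display_name(b"caf\xe9"))        # 'caf' followed by U+FFFD
```

The replacement form is only suitable for showing the name to a user; the original bytes are still needed to open the file.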
> This is deliberate; on Unix the character set used for filenames is
> dictated by the locale settings (e.g. LC_CTYPE), by convention.
Do you have a reference on this? I'm unaware of this convention.
Perhaps by "Unix", you mean "Linux" (***)? I was around when we
decided the above nonsense for Darwin, and I remember trying to find
some such reference, so I'd love to see it.
> There is certainly no Unix standard which dictates that all filenames
> must be UTF-8-encoded Unicode, so APR cannot enforce that.
No, but as I mention above, such a standard does exist in Darwin.
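
The two positions in this exchange can be put side by side in a short Python sketch: UTF-8 on Darwin, the locale's encoding elsewhere. This is not APR's implementation; the function name filepath_encoding and the platform test are assumptions made for illustration.

```python
import locale
import sys

def filepath_encoding(platform: str = sys.platform) -> str:
    """Guess the file name encoding: UTF-8 on Darwin, locale codeset elsewhere."""
    if platform == "darwin":
        # Darwin convention: names above the VFS layer are decomposed UTF-8.
        return "utf-8"
    # Elsewhere, fall back to the encoding implied by the locale (e.g. LC_CTYPE).
    return locale.getpreferredencoding(False)

print(filepath_encoding("darwin"))   # utf-8
print(filepath_encoding())           # depends on the running system and locale
```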
-wsv
(*) I'm getting a vibe from Joe that Linux does, but I'm going to bet
that more software on Linux is unaware of the convention there than Mac
apps are on Mac OS (especially since most use our Toolkits, which are).
(**) It is, sort of, on HFS+. But not really; if your file name
string is not UTF-8 but is a legal byte sequence in UTF-8, it'll be
stored as given. (Unless it looks like precomposed UTF-8 and gets
decomposed for you, which may look like corruption if that's not what
you expect, which is likely if you weren't thinking UTF-8.)
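
The "looks like corruption" case can be illustrated with a small Python sketch. It assumes a program that writes a name as Latin-1 and a file system that applies plain NFD to anything that parses as UTF-8; both are simplifications of what footnote (**) describes.

```python
import unicodedata

# A Latin-1 program writes the two characters 'Ã©' as these bytes.
raw = b"\xc3\xa9"
print(raw.decode("latin-1"))

# The same bytes are also legal UTF-8 for the single character 'é'.
as_utf8 = raw.decode("utf-8")

# Assumption: the file system decomposes it (plain NFD) before storing.
stored = unicodedata.normalize("NFD", as_utf8).encode("utf-8")
print(stored)                      # b'e\xcc\x81'

# The Latin-1 program now reads back a different name than it wrote.
print(stored == raw)               # False
```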
(***) The Single Unix Specification, which is probably the only
"authority" left regarding Unix standards, has little useful to say on
file name encodings. Here is everything I could find on the subject:
    For a filename to be portable across implementations
    conforming to IEEE Std 1003.1-2001, it shall consist
    only of the portable filename character set as defined
    in Portable Filename Character Set.
    The hyphen character shall not be used as the first
    character of a portable filename. Uppercase and
    lowercase letters shall retain their unique identities
    between conforming implementations. In the case of a
    portable pathname, the slash character may also be used.
So here we know that case-insensitive file systems are
non-conforming. Oops.
The Portable Filename Character Set is impressively weak:
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    0 1 2 3 4 5 6 7 8 9 . _ -
This omits space and most punctuation, which makes sense if poorly
written shell scripts (an unfortunate majority) are in the portability
target.
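
For reference, a check against the portable set quoted above, plus the no-leading-hyphen rule, fits in a few lines of Python. The function name is_portable_filename is hypothetical; this is a sketch of the quoted rules, not a conformance test.

```python
import re

# Letters, digits, period, underscore, hyphen: the quoted portable set.
_PORTABLE = re.compile(r"[A-Za-z0-9._-]+")

def is_portable_filename(name: str) -> bool:
    """True if name uses only the portable set and does not start with '-'."""
    return _PORTABLE.fullmatch(name) is not None and not name.startswith("-")

print(is_portable_filename("README.txt"))    # True
print(is_portable_filename("my file.txt"))   # False: space is not in the set
print(is_portable_filename("-rf"))           # False: leading hyphen
```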
File names are defined thusly:
    A name consisting of 1 to {NAME_MAX} bytes used to name
    a file. The characters composing the name may be selected
    from the set of all character values excluding the slash
    character and the null byte. The filenames dot and dot-dot
    have special meaning. A filename is sometimes referred to
    as a "pathname component".
Clearly this allows for byte sequences that are not legal UTF-8.
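
A Python sketch of that definition shows how little it constrains. NAME_MAX = 255 is an assumed value (the real limit is per file system), and is_legal_filename is a hypothetical helper.

```python
NAME_MAX = 255  # assumption: a common value; the real limit varies by file system

def is_legal_filename(raw: bytes, name_max: int = NAME_MAX) -> bool:
    """1 to name_max bytes, with no slash and no null byte."""
    return 1 <= len(raw) <= name_max and b"/" not in raw and b"\x00" not in raw

name = b"\xff\xfe"                 # can never appear in well-formed UTF-8
print(is_legal_filename(name))     # True: legal by the quoted definition
try:
    name.decode("utf-8")
except UnicodeDecodeError:
    print("not UTF-8")             # reached: the name does not decode
```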
And we note that PATH is fairly ill-conceived:
    Filenames should be constructed from the portable filename
    character set because the use of other characters can be
    confusing or ambiguous in certain contexts. (For example,
    the use of a colon ( ':' ) in a pathname could cause
    ambiguity if that pathname were included in a PATH
    definition.)
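
The colon ambiguity is easy to reproduce. In this Python sketch the directory names are made up for illustration, and split_path is a hypothetical helper that mimics splitting a PATH-style value on ':'.

```python
def split_path(value: str) -> list[str]:
    """Split a PATH-style string on ':' the way a shell would."""
    return value.split(":")

# The second directory has a colon in its name (a made-up example).
dirs = ["/usr/bin", "/opt/tools:v2/bin"]
joined = ":".join(dirs)

print(joined)               # /usr/bin:/opt/tools:v2/bin
print(split_path(joined))   # ['/usr/bin', '/opt/tools', 'v2/bin']
```

Two directories go in and three come out, since nothing in the joined string marks which colon was a separator.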
This is all I could find on file names in the specification.
—
Wilfredo Sánchez - [EMAIL PROTECTED]