On Aug 6, 2007, at 5:39 PM, Wilfredo Sánchez Vega wrote:
> On Aug 6, 2007, at 5:11 PM, Roy T. Fielding wrote:
>> Actually, it also crashes on valid utf-8 in normal form, because OS X
>> doesn't follow the standard on normalization. See "man -s 5 utf8":
>>
>>     If more than a single representation of a value exists (for example,
>>     0x00; 0xC0 0x80; 0xE0 0x80 0x80) the shortest representation is
>>     always used. Longer ones are detected as an error as they pose a
>>     potential security risk, and destroy the 1:1 character:octet
>>     sequence mapping.
>>
>> but OS X requires the longer composition characters over shorter ones.
>> My guess is that choice was driven by the way the UI allows such
>> characters to be composed (like "alt-u u" for u-umlaut).
> Above the VFS layer, we always use decomposed UTF-8.
Er, yeah, did I say that backwards? The man page says that equivalent
characters will use the shortest representation, which would mean
always using the composed form of UTF-8, whereas OS X actually hands
back the decomposed form. Right? So the utf8(5) man page (from BSD)
should be updated to explain the OS X quirks.
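A quick way to see the difference from the shell (plain sh syntax, not
csh; od(1) to dump the bytes; the filename is just an example):

    $ printf '\303\274' | od -An -tx1     # composed u-umlaut (NFC): c3 bc
    $ printf 'u\314\210' | od -An -tx1    # decomposed (NFD): 75 cc 88
    $ touch "$(printf '\303\274')"        # create a file with the composed name
    $ ls | od -An -tx1                    # HFS+ should hand back 75 cc 88 (+ 0a)

So the composed form is the shorter one, but the filesystem hands back
the longer decomposed sequence.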
I learned something new today -- use the -v option with ls to display
non-ASCII filenames.
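With the decomposed file from above still lying around, it looks
something like this (a sketch; the exact output depends on the terminal
and locale settings):

    $ ls        # non-ASCII bytes print as '?' on a terminal by default
    u??
    $ ls -v     # pass the raw UTF-8 through so Terminal.app can render it
    ü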
>> What I do currently is define
>>
>>     setenv MM_CHARSET "utf-8"
>>     setenv LANG "en_US.utf-8"
>>
>> in my shell init file.
> On Mac OS (at least), that isn't relevant with respect to
> filenames, which is what the patch that Erik proposed fixes.
Yeah, but it is relevant on Solaris, which is why Subversion attempts
to use it. *shrug* I'll commit the patch if I ever get a chance to
compile it.
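As I understand it, Subversion just asks the C library for the locale's
codeset, which is what LANG feeds into; "locale charmap" shows the same
answer (assuming LC_ALL isn't set; exact names vary by platform):

    $ LANG=en_US.UTF-8 locale charmap     # the codeset a CLI app would derive
    UTF-8
    $ LANG=C locale charmap               # US-ASCII here, 646 on Solaris, etc.
    US-ASCII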
> It is, however, relevant to how a CLI application encodes data
> sent to the terminal. That is, the above means that Terminal.app
> expects to see UTF-8 English text. (I think; again, I don't really
> know much about BSD locale settings.)
Terminal.app has its own Preferences setting that defines the encoding
it uses. I don't think that can be overridden by the environment
variables.
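An easy way to convince yourself: emit raw UTF-8 with the locale set to
something else entirely, and note that whether it renders depends only
on Terminal's encoding preference:

    $ LANG=C printf '\303\274\n'    # raw UTF-8 u-umlaut, regardless of locale
    ü                               # renders only if Terminal.app is set to UTF-8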
....Roy