On Aug 6, 2007, at 5:39 PM, Wilfredo Sánchez Vega wrote:
On Aug 6, 2007, at 5:11 PM, Roy T. Fielding wrote:
Actually, it also crashes on valid utf-8 in normal form, because OS X
doesn't follow the standard on normalization.  See "man -s 5 utf8":

  If more than a single representation of a value exists (for
  example, 0x00; 0xC0 0x80; 0xE0 0x80 0x80) the shortest
  representation is always used.  Longer ones are detected as an
  error as they pose a potential security risk, and destroy the
  1:1 character:octet sequence mapping.

but OS X requires the longer, decomposed character sequences over the
shorter composed ones.  My guess is that choice was driven by the way
the UI allows such characters to be composed (like "alt-u u" for
u-umlaut).
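For example, the two forms of u-umlaut differ like this (a quick
Python sketch; the octet values are straight from the Unicode tables,
not from the man page):

  import unicodedata
  s = u"\u00fc"                                           # u-umlaut
  print(unicodedata.normalize("NFC", s).encode("utf-8"))  # 0xC3 0xBC (composed, 2 octets)
  print(unicodedata.normalize("NFD", s).encode("utf-8"))  # 'u' 0xCC 0x88 (decomposed, 3 octets)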

  Above the VFS layer, we always use decomposed UTF-8.

Er, yeah, did I say that backwards?  The man page says that equivalent
characters will use the shortest representation, which would mean
always using the composed form of UTF-8.  Right?  So the man page
for utf8 (from BSD) should be updated to explain the OS X quirks.
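The quirk is easy to confirm, assuming an HFS+ volume and Python at
hand (the filename here is just an example):

  import os, unicodedata
  name = unicodedata.normalize("NFC", u"\u00fc")   # create with the composed form
  open(name, "w").close()
  stored = [n for n in os.listdir(u".")
            if unicodedata.normalize("NFC", n) == name][0]
  print(stored.encode("utf-8"))                    # on HFS+: 'u' 0xCC 0x88, decomposed
  os.remove(stored)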

I learned something new today -- use the -v option with ls to display
non-ASCII filenames.
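For example, something like this (the exact output of plain ls will
depend on your locale settings; the combining-mark octets of the
decomposed name show up as '?'):

  $ touch ü
  $ ls
  u??
  $ ls -v
  ü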

What I do currently is define

  setenv  MM_CHARSET "utf-8"
  setenv  LANG       "en_US.utf-8"

in my shell init file.
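Bourne-style shells would use the equivalent exports:

  export MM_CHARSET="utf-8"
  export LANG="en_US.utf-8"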

On Mac OS (at least), that isn't relevant with respect to filenames, which is what the patch that Erik proposed fixes.

Yeah, but it is relevant on Solaris, which is why subversion attempts
to use it. *shrug*  I'll commit the patch if I ever get a chance to
compile it.

It is, however, relevant to how a CLI application encodes data sent to the terminal. That is, the above means that Terminal.app expects to see UTF-8 English text. (I think; again, I don't really know much about BSD locale settings.)
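As far as I know, the usual mechanism is that the program adopts the
locale from the environment and then asks for its codeset -- e.g., in
Python:

  import locale
  locale.setlocale(locale.LC_ALL, "")        # pick up LANG/LC_* from the environment
  print(locale.nl_langinfo(locale.CODESET))  # e.g. "UTF-8" given the setenv lines above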

Terminal.app has its own Preferences that define the encoding used.
I don't think that can be overridden by the environment variables.

....Roy
