[reordering the conversation flow slightly]

[Peter Samuelson]
> > That's the implementation I would like to see, to be honest. Start
> > with the observation that we can treat Mac OS X NFD paths as a
> > client character encoding. Now observe that it is lossy. But
> > ... almost all non-Unicode client charsets are equally lossy, for
> > exactly the same reason!

[Branko Cibej]
> I don't see what you mean by "lossy" though. NFD and NFC can
> represent exactly the same set of characters, it's just that the
> representations of some of them are different.

By "lossy" I just mean that if you convert to UTF-8 NFD, you can't
reliably convert _back_ to the original bytes. I'm assuming here that
we continue to do _no_ n11n on the server side - pathnames from
libsvn_(ra|repos|fs) are just UTF-8 with unspecified n11n. Thus, if
the "client encoding" is UTF-8 NFD, you can't reliably convert that to
the "server encoding".

And this is also true of most legacy (non-Unicode) encodings: they
know nothing about Unicode's n11n forms, so they are "lossy" in the
same way: you can't reliably take a pathname in, e.g., ISO-8859-1, and
convert to the encoding found in the repository, because you don't
know the n11n form used by the original committer. This is why I
suggested the mapping table in wc.db.

Actually, the fact that the mapping table works around the inherent
lossiness of character encoding conversion suggests that it _could_,
in the future, also account for lossiness for other reasons. If we
wished, we could have libsvn_wc mangle checked-out filenames on
platforms with arbitrary limitations - escaping "<" and ":" characters
on Windows, e.g. - using this same mechanism. Even if the conversion
is lossy, the mapping table in wc.db knows the original filename.

Of course you couldn't _create_ filenames with platform limitations on
the same platform, but being able to check out the file at all is an
improvement over today. Probably 'svn status' would show some
indication that a name has been mangled in a way users would actually
care about (i.e., not just NFC/NFD).

> > The implementation on OS X might be a bit hairy, if there isn't an
> > easy way to retrieve the real pathname of the file you just
> > created. Anywhere else, we just store the pathname we just
> > calculated.

> Afaik the OSX API normalizes everything to NFD automagically. So at
> least on that platform there's no chance of having more than one form
> for the same filename at the same time. Likewise on Windows, which
> normalizes to NFC.

Right. The question is, if libsvn_wc tells OS X to store a given
path, with unknown n11n, is there an easy way to retrieve the pathname
that was _actually_ stored on disk? That's what I mean by "might be a
bit hairy". It sounds like the thing to do on OS X is for libsvn_wc
to pre-normalize to NFD before writing the file, and just assume the
OS will (re-)normalize to the same byte array.

-- 
Peter Samuelson | org-tld!p12n!peter | http://p12n.org/
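
P.S. A rough sketch of the idea, in Python rather than C, just to make
the "lossy" point and the mapping table concrete. Everything named here
is made up for illustration: `to_disk_name` and `NameMap` are not
libsvn_wc functions, the dict only stands in for a table in wc.db, and
the sketch assumes the filesystem stores exactly what
`unicodedata.normalize("NFD", ...)` produces, which a real filesystem
may not do in every case.

```python
import unicodedata


def to_disk_name(repo_name: str) -> str:
    """Pre-normalize a repository path to NFD before writing it.

    Assumption: the OS would (re-)normalize to this same form.
    """
    return unicodedata.normalize("NFD", repo_name)


class NameMap:
    """Hypothetical stand-in for a wc.db table: on-disk name -> repo name."""

    def __init__(self):
        self._by_disk = {}

    def checkout(self, repo_name: str) -> str:
        # Remember the original name, since it cannot be recomputed
        # from the normalized one.
        disk_name = to_disk_name(repo_name)
        self._by_disk[disk_name] = repo_name
        return disk_name

    def repo_name_for(self, disk_name: str) -> str:
        return self._by_disk[disk_name]


# Two repository paths that differ only in normalization form.
nfc_name = "caf\u00e9.txt"    # e-acute as one precomposed code point
nfd_name = "cafe\u0301.txt"   # 'e' followed by a combining acute accent

# Lossy: distinct repository names collapse to one on-disk name, so the
# on-disk name alone cannot tell us which one the committer used.
assert nfc_name != nfd_name
assert to_disk_name(nfc_name) == to_disk_name(nfd_name)

# The mapping table recovers the original name anyway.
names = NameMap()
disk = names.checkout(nfc_name)
assert disk != nfc_name
assert names.repo_name_for(disk) == nfc_name
```

The same lookup would serve for other kinds of mangling (escaped "<"
or ":" on Windows, say): only `to_disk_name` would change, and the
table would still hold the original name.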