> -----Original Message-----
> From: Branko Čibej [mailto:br...@xbc.nu]
> Sent: Monday, 30 January 2012 16:11
> To: dev@subversion.apache.org
> Subject: Re: Let's discuss about unicode compositions for filenames!
>
> On 31.01.2012 00:14, Peter Samuelson wrote:
> > [Stefan Sperling]
> >> It is indeed harder because we are passing paths verbatim to sqlite.
> >> I doubt having more than one form of a given path in wc.db is fun...
> > That's the implementation I would like to see, to be honest. Start
> > with the observation that we can treat Mac OS X NFD paths as a client
> > character encoding. Now observe that it is lossy. But ... almost all
> > non-Unicode client charsets are equally lossy, for exactly the same
> > reason!
> >
> > This suggests maintaining a mapping table in wc.db between server paths
> > (UTF-8, unspecified NF) and wc paths (local charset, which is
> > occasionally UTF-8 with NFD).
> >
> > This mapping table would be maintained any time we write to the wc.
> > It would be consulted any time we search for files in the wc.
> >
> > It's not really extra work - we have to do those UTF-8 <-> local
> > charset conversions all the time anyway. This would in fact cache
> > those conversions.
> >
> > The implementation on OS X might be a bit hairy, if there isn't an easy
> > way to retrieve the real pathname of the file you just created.
> > Anywhere else, we just store the pathname we just calculated.
> >
> Afaik the OSX API normalizes everything to NFD automagically. So at
> least on that platform there's no chance of having more than one form
> for the same filename at the same time. Likewise on Windows, which
> normalizes to NFC.
>
> I don't see what you mean by "lossy" though. NFD and NFC can represent
> exactly the same set of characters, it's just that the representations
> of some of them are different. Thus, this does not preclude normalizing
> the paths in wc.db, and that's even easily automated. If such a
> conversion finds a name collision ... the user is in serious trouble
> already. :)
>
> It's more likely to find such a collision on Unix than either Mac OS or
> Windows (both of which normalize on the FS API level). But this case is
> probably so rare that I wouldn't worry about it.

Last time we discussed this in depth (a few years ago), Windows didn't perform the normalization you describe here. Was this added later? (Any documentation pointers?) I think the keyboard/editor support performs some normalization, so users are unlikely to create non-normalized sequences, but our old documents say that it just stores whatever it gets passed. (Probably for the same reason Subversion does: compatibility with the time when we didn't know about these problems.)

	Bert

>
> -- Brane
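
As a rough illustration of the point under discussion (this is not Subversion code, and the helper name normalization_collisions is hypothetical), the Python sketch below shows that NFC and NFD spell the same file name with different code points, and how normalizing a set of paths would surface the name collisions that can exist on a file system that stores names verbatim.

```python
import unicodedata
from collections import defaultdict


def normalization_collisions(paths, form="NFC"):
    """Group paths that become identical after Unicode normalization.

    Hypothetical helper for illustration only. Returns a dict mapping each
    normalized path to the list of original paths that collapse onto it,
    keeping only entries with more than one original path.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[unicodedata.normalize(form, path)].append(path)
    return {key: names for key, names in groups.items() if len(names) > 1}


# The same visible name in two normalization forms.
nfc_name = "caf\u00e9.txt"   # precomposed U+00E9 (NFC)
nfd_name = "cafe\u0301.txt"  # 'e' followed by U+0301 combining acute (NFD)

# Different code point sequences, so a verbatim comparison treats them as distinct.
different_bytes = nfc_name != nfd_name

# Converting between the forms round-trips for this name.
to_nfc = unicodedata.normalize("NFC", nfd_name)
to_nfd = unicodedata.normalize("NFD", nfc_name)

# Both spellings present at once (possible where names are stored verbatim).
collisions = normalization_collisions([nfc_name, nfd_name, "README"])
```

If both spellings exist in the same directory, normalizing the stored paths to either form maps them to a single key, which is the collision case mentioned in the quoted message.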