RE: Let's discuss about unicode compositions for filenames!

Bert Huijben Mon, 30 Jan 2012 17:48:29 -0800


> -----Original Message-----
> From: Branko Čibej [mailto:[email protected]]
> Sent: maandag 30 januari 2012 16:11
> To: [email protected]
> Subject: Re: Let's discuss about unicode compositions for filenames!
> 
> On 31.01.2012 00:14, Peter Samuelson wrote:
> > [Stefan Sperling]
> >> It is indeed harder because we are passing paths verbatim to sqlite.
> >> I doubt having more than one form of a given path in wc.db is fun...
> > That's the implementation I would like to see, to be honest.  Start
> > with the observation that we can treat Mac OS X NFD paths as a client
> > character encoding.  Now observe that it is lossy.  But ... almost all
> > non-Unicode client charsets are equally lossy, for exactly the same
> > reason!
> >
> > This suggests maintaining a mapping table in wc.db between server paths
> > (UTF-8, unspecified NF) and wc paths (local charset, which is
> > occasionally UTF-8 with NFD).
> >
> > This mapping table would be maintained any time we write to the wc.
> > It would be consulted any time we search for files in the wc.
> >
> > It's not really extra work - we have to do those UTF-8 <-> local
> > charset conversions all the time anyway.  This would in fact cache
> > those conversions.
> >
> > The implementation on OS X might be a bit hairy, if there isn't an easy
> > way to retrieve the real pathname of the file you just created.
> > Anywhere else, we just store the pathname we just calculated.
> >
> 
> Afaik the OSX API normalizes everything to NFD automagically. So at
> least on that platform there's no chance of having more than one form
> for the same filename at the same time. Likewise on Windows, which
> normalizes to NFC.
> 
> I don't see what you mean by "lossy" though. NFD and NFC can represent
> exactly the same set of characters, it's just that the representations
> of some of them are different. Thus, this does not preclude normalizing
> the paths in wc.db, and that's even easily automated. If such a
> conversion finds a name collision ... the user is in serious trouble
> already. :)
> 
> It's more likely to find such a collision on Unix than either Mac OS or
> Windows (both of which normalize on the FS API level). But this case is
> probably so rare that I wouldn't worry about it.
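
To make the NFC/NFD point concrete, here is a minimal Python sketch (not Subversion code; `find_normalization_collisions` is a hypothetical helper name). It shows that both forms encode the same text with different code points, and how a pass over a list of paths could detect names that collapse to the same normalized form:

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(names, form="NFC"):
    """Group names by their normalized form and return only the groups
    that contain more than one distinct original spelling."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize(form, name)].append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}


# The same name in both forms: U+010C (precomposed) versus
# U+0043 followed by U+030C (combining caron).
nfc_name = "Branko \u010cibej"
nfd_name = unicodedata.normalize("NFD", nfc_name)

# The two strings differ code point by code point, yet they round-trip.
same_text = unicodedata.normalize("NFC", nfd_name) == nfc_name

# Both spellings present side by side, as could happen on a filesystem
# that stores names verbatim.
collisions = find_normalization_collisions([nfc_name, nfd_name, "README"])
```

Here `collisions` holds a single entry, keyed by the NFC spelling, listing both original names; a real conversion of wc.db would have to decide what to do with such a pair.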


The last time we discussed this in depth (a few years ago), Windows did not
perform the normalization you describe here.
Was this added later? (Any pointers to documentation?)

I think the keyboard/editor support performs some normalization, so users are
unlikely to create non-normalized sequences, but our old documents say that
Windows just stores whatever it is passed.
(Probably for the same reason Subversion does: compatibility with the
time when we didn't know about these problems.)
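
One way to settle this empirically rather than from old documents would be a small probe like the following. This is only a sketch in Python; `classify` and `probe_filesystem` are hypothetical names, and the result depends on the filesystem holding the temporary directory, not only on the operating system:

```python
import os
import tempfile
import unicodedata


def classify(requested, listed):
    """Describe how a listed name relates to the name we asked to create."""
    if listed == requested:
        return "verbatim"
    for form in ("NFC", "NFD"):
        if listed == unicodedata.normalize(form, requested):
            return form
    return "other"


def probe_filesystem():
    """Create one file per normalization form and report what the
    directory listing hands back for each of them."""
    results = {}
    for form in ("NFC", "NFD"):
        requested = unicodedata.normalize(form, "caf\u00e9.txt")
        with tempfile.TemporaryDirectory() as tmp:
            # Create an empty file under the requested spelling.
            open(os.path.join(tmp, requested), "w").close()
            # The fresh directory contains exactly this one entry.
            (listed,) = os.listdir(tmp)
            results[form] = classify(requested, listed)
    return results
```

If `probe_filesystem()` reports "verbatim" for both forms, the filesystem stored exactly what it was given; any other answer means the name was rewritten somewhere between creation and listing.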

        Bert
> 
> -- Brane
