Re: [RFC] Non-normalizing Unicode Composition Awareness

Julian Foad Tue, 14 Feb 2012 03:55:56 -0800

Hi Thomas.  It's fantastic that you're taking the trouble to write up this 
proposal.  That's just what we need.  Just a few initial comments below...

Thomas Åkesson wrote:

> Context
> ===
> 
> [...] A unicode string (e.g. a file name) can be represented
> in 2 normalized forms (NFC/NFD) or mixed, i.e. multiple such
> characters where some are composed and others decomposed (rare).

What's "rare"?  We have to assume that input is in mixed composition in any 
system that doesn't explicitly normalize it, which (I think) includes most 
operating systems.  While it may be rare for any single string to contain 
characters in both compositions, it is very common to be processing a string 
that *might* have characters in both compositions -- in other words, that is 
not guaranteed to be normalized.  I think it would be clearer to drop the 
"(rare)" and just say "... normalized forms (NFC/NFD) or mixed (not 
normalized).".
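To make the three states concrete, here is a minimal sketch using Python's standard unicodedata module.  The helper name composition_state and the sample strings are only illustrative; they are not part of the proposal.

```python
import unicodedata

def composition_state(s: str) -> str:
    """Classify a string by which normalization forms leave it unchanged."""
    is_nfc = unicodedata.normalize("NFC", s) == s
    is_nfd = unicodedata.normalize("NFD", s) == s
    if is_nfc and is_nfd:
        return "both"    # no composable characters at all, e.g. plain ASCII
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "mixed"       # not normalized in either form

composed = "\u00c5kesson"       # 'Å' as a single code point
decomposed = "A\u030akesson"    # 'A' followed by COMBINING RING ABOVE
mixed = "\u00c5" + "o\u0308"    # composed 'Å' followed by decomposed 'ö'

for name in (composed, decomposed, mixed, "plain.txt"):
    print(repr(name), composition_state(name))
```

The point is that a caller cannot know which state it holds without checking, so any string that has not been explicitly normalized has to be treated as potentially mixed.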

> A minority of file systems (currently Mac OS X HFS+ only) will
> normalize the paths. In the case of HFS+, the path will be
> normalized into NFD and it will even be given back that way when
> listing the filesystem. 

Drop the word "even"?  The statement is not surprising.

[...]

> Similarities to case-sensitivity
> ===
> 
>  - If two Unicode strings differ only by letter case/composition,

Drop "/composition" -- it's the subject of the following sentence.

> on some computer systems they refer to the same file, while on
> other systems they refer to different files.  The same applies
> if two Unicode strings differ only by composition.

> [...]

> Client Changes
> ===
> 
> [...] An abstraction between the repository path and the file
> system path can be achieved by ensuring that there is a column
> in wc.db that contains the file system path in exactly the same
> form that the file system gives back. APIs in wc needs to be
> extended to ensure that all interaction with the file system is
> performed with the file system path.

[...]

This part seems to be the heart of the whole proposal.  You describe the data 
that we need, but the behaviour will also need to be described in detail.  
Presumably much of the behaviour is boring and obvious (when we check out a new 
path and create it on disk, we store the disk path), but I'm sure there will be 
some less obvious parts (for example, do we need to find out what the disk path 
of an 'excluded' node would be, even though we're not actually creating it on 
disk?).
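As a rough illustration of the "boring and obvious" case, here is a sketch under loud assumptions: the table path_map, its columns, and the function names are hypothetical and are not the real wc.db schema or any Subversion API, and the normalizing file system is faked with plain NFD rather than reproducing what HFS+ really does.

```python
import sqlite3
import unicodedata

def fake_normalizing_fs_create(name: str) -> str:
    """Stand-in for a normalizing file system: return the name as the
    disk would list it afterwards (approximated here by plain NFD)."""
    return unicodedata.normalize("NFD", name)

# Hypothetical table, NOT the real wc.db layout.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE path_map ("
    " repos_path TEXT PRIMARY KEY,"
    " disk_path  TEXT NOT NULL)"
)

def record_checkout(repos_path: str) -> None:
    """Create the path 'on disk' and store exactly what the disk gave back."""
    disk_path = fake_normalizing_fs_create(repos_path)
    db.execute("INSERT INTO path_map VALUES (?, ?)", (repos_path, disk_path))

def disk_path_for(repos_path: str) -> str:
    """Look up the form to use for every file system interaction."""
    row = db.execute(
        "SELECT disk_path FROM path_map WHERE repos_path = ?", (repos_path,)
    ).fetchone()
    return row[0]

record_checkout("\u00c5.txt")              # repository path is in NFC
print(repr(disk_path_for("\u00c5.txt")))   # disk path comes back decomposed
```

The repository path is never rewritten; only the lookup for disk access goes through the stored disk form.  The sketch says nothing about the less obvious cases, such as a node that is never created on disk.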

> Use Cases
> ===
> 
> This change will only affect use cases which rely on creating
> paths that look like duplicates but use different unicode
> composition. It is highly unlikely anyone is relying on this..

Uh... it sounds like you are saying there are no interesting use cases for this 
proposal!  No, on the contrary, this proposal also affects checking out and 
using a WC on Mac HFS+ where the repository paths were created on another 
system and are not in NFD, and it allows that case to work.  That's the more 
interesting use case, is it not?  It's definitely worth writing out the 
interesting case in full, including steps like checkout (or update) that brings 
in a non-NFD path, create a new file on the Mac, and commit.

- Julian
