Re: [RFC] Non-normalizing Unicode Composition Awareness

Thomas Åkesson Sun, 25 Mar 2012 19:14:37 -0700

Hi,
Sorry about the delay, had a release to sort out...

I have moved the proposal into the wiki:
http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness


I have incorporated the comments from Julian and Markus, added more 
information to the "Client Changes" section, and added more structure and 
TODO notes. 

I would really appreciate it if someone with more insight into WC-NG could 
provide input on some of the TODO items (or on things that have been 
completely overlooked).

Thanks,
Thomas Å.


On 21 feb 2012, at 09:55, Daniel Shahaf wrote:

> I've granted you write access to the wiki.
> 
> Thomas Åkesson wrote on Tue, Feb 14, 2012 at 12:36:23 +0100:
>> Thanks Julian and Markus for providing feedback. 
>> 
>> I am not commenting below because all the feedback is very good and I will 
>> try to address it as best I can in the next iteration. Describing the 
>> behaviour changes to the WC is the most challenging since I lack that kind 
>> of detailed knowledge. I will instead try to draft the structure of that 
>> section to make it easier for someone with that level of detail to assist.
>> 
>> Regarding use cases, what can I say... it was towards the end of a long 
>> stretch.
>> 
>> I think it would help with the upcoming iterations if I could move this 
>> "document" into the wiki. If you find that this first draft shows promise, 
>> please consider granting edit access in the wiki. My user name is "Thomas 
>> Åkesson", which exercises the Unicode awareness of MoinMoin...
>> 
>> /Thomas Å.
>> 
>> 
>> On 14 feb 2012, at 11:25, Julian Foad wrote:
>> 
>>> Hi Thomas.  It's fantastic that you're taking the trouble to write up this 
>>> proposal.  That's just what we need.  Just a few initial comments below...
>>> 
>>> Thomas Åkesson wrote:
>>> 
>>>> Context
>>>> ===
>>>> 
>>>> [...] A Unicode string (e.g. a file name) can be represented
>>>> in 2 normalized forms (NFC/NFD) or mixed, i.e. multiple such
>>>> characters where some are composed and others decomposed (rare).
>>> 
>>> 
>>> What's "rare"?  We have to assume that input is in mixed composition in any 
>>> system that doesn't explicitly normalize it, which (I think) includes most 
>>> operating systems.  While it may be rare for any single string to contain 
>>> characters in both compositions, it is very common to be processing a 
>>> string that *might* have characters in both compositions -- in other words, 
>>> that is not guaranteed to be normalized.  I think it would be clearer to 
>>> drop the "(rare)" and just say "... normalized forms (NFC/NFD) or mixed 
>>> (not normalized).".
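To make the three states concrete, here is a minimal sketch in Python using only the standard `unicodedata` module. The helper names `composition_state` and `same_ignoring_composition` are made up for illustration and are not part of Subversion or of the proposal.

```python
import unicodedata


def composition_state(name: str) -> str:
    """Classify a string as NFC, NFD, mixed, or unaffected by composition."""
    is_nfc = unicodedata.normalize("NFC", name) == name
    is_nfd = unicodedata.normalize("NFD", name) == name
    if is_nfc and is_nfd:
        return "unaffected"  # e.g. plain ASCII: both forms are identical
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "mixed"  # neither form, i.e. not normalized


def same_ignoring_composition(a: str, b: str) -> bool:
    """Compare two names without modifying either of them."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00c5kesson.txt"           # A-ring as a single code point
decomposed = "A\u030akesson.txt"        # 'A' followed by combining ring above
mixed = "\u00c5" + "a\u0308" + ".txt"   # composed A-ring, decomposed a-umlaut

for name in (composed, decomposed, mixed, "README"):
    print(ascii(name), composition_state(name))
```

The two spellings of the same file name compare unequal as raw strings but equal under `same_ignoring_composition`, which is the kind of comparison a composition-aware (but non-normalizing) client would need.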
>>> 
>>> 
>>>> A minority of file systems (currently Mac OS X HFS+ only) will
>>>> normalize the paths. In the case of HFS+, the path will be
>>>> normalized into NFD and it will even be given back that way when
>>>> listing the filesystem. 
>>> 
>>> 
>>> Drop the word "even"?  The statement is not surprising.
>>> 
>>> 
>>> [...]
>>> 
>>>> Similarities to case-sensitivity
>>>> ===
>>>> 
>>>>  - If two Unicode strings differ only by letter case/composition,
>>> 
>>> Drop "/composition" -- it's the subject of the following sentence.
>>> 
>>>> on some computer systems they refer to the same file, while on
>>>> other systems they refer to different files.  The same applies
>>>> if two Unicode strings differ only by composition. 
>>> 
>>> 
>>>> [...]
>>> 
>>>> Client Changes
>>>> ===
>>>> 
>>>> [...] An abstraction between the repository path and the file
>>>> system path can be achieved by ensuring that there is a column
>>>> in wc.db that contains the file system path in exactly the same
>>>> form that the file system gives back. APIs in wc need to be
>>>> extended to ensure that all interaction with the file system is
>>>> performed with the file system path.
>>> 
>>> [...]
>>> 
>>> This part seems to be the heart of the whole proposal.  You describe the 
>>> data that we need, but the behaviour will also need to be described in 
>>> detail.  Presumably much of the behaviour is boring and obvious (when we 
>>> check out a new path and create it on disk, we store the disk path), but 
>>> I'm sure there will be some less obvious parts (do we need to find out what 
>>> the disk path of an 'excluded' node would be, even though we're not 
>>> actually creating it on disk, for example).
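As a rough illustration of the "boring and obvious" part, the following Python sketch stores the disk path next to the repository path at checkout time and looks it up in both directions. Everything here is an assumption made for the example: the table `node_sketch` and its columns `repos_path` and `disk_path` are hypothetical and do not reflect the real wc.db schema, and `simulate_hfs_plus` approximates HFS+ with standard NFD, whereas the real file system uses its own decomposition variant.

```python
import sqlite3
import unicodedata


def simulate_hfs_plus(name: str) -> str:
    """Stand-in for what a normalizing file system hands back (assumption: plain NFD)."""
    return unicodedata.normalize("NFD", name)


def open_sketch_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical two-column mapping."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE node_sketch ("
        " repos_path TEXT PRIMARY KEY,"
        " disk_path TEXT NOT NULL)"
    )
    return db


def record_checkout(db, repos_path, fs_normalize=simulate_hfs_plus):
    """Store the path exactly as the file system gives it back."""
    disk_path = fs_normalize(repos_path)
    db.execute("INSERT INTO node_sketch VALUES (?, ?)", (repos_path, disk_path))
    return disk_path


def disk_path_for(db, repos_path):
    """Path to use for all file system access to this node."""
    row = db.execute(
        "SELECT disk_path FROM node_sketch WHERE repos_path = ?", (repos_path,)
    ).fetchone()
    return row[0] if row else None


def repos_path_for(db, disk_path):
    """Path to use when talking to the repository about this node."""
    row = db.execute(
        "SELECT repos_path FROM node_sketch WHERE disk_path = ?", (disk_path,)
    ).fetchone()
    return row[0] if row else None


db = open_sketch_db()
record_checkout(db, "docs/\u00c5.txt")  # repository path is NFC
print(ascii(disk_path_for(db, "docs/\u00c5.txt")))
```

The repository path is never rewritten; only the lookup decides which spelling is used where. The less obvious cases raised above, such as 'excluded' nodes, are not covered by this sketch.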
>>> 
>>> 
>>>> Use Cases
>>>> ===
>>>> 
>>>> This change will only affect use cases which rely on creating
>>>> paths that look like duplicates but use different Unicode
>>>> composition. It is highly unlikely anyone is relying on this.
>>> 
>>> 
>>> Uh... it sounds like you are saying there are no interesting use cases for 
>>> this proposal!  No, on the contrary, this proposal also affects checking 
>>> out and using a WC on Mac HFS+ where the repository paths were created on 
>>> another system and are not in NFD, and it allows that case to work.  That's 
>>> the more interesting use case, is it not?  It's definitely worth writing 
>>> out the interesting case in full, including steps like checkout (or update) 
>>> that brings in a non-NFD path, create a new file on the Mac, and commit.
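A small Python sketch of that interesting case, under stated assumptions: the repository holds an NFC name, the directory listing is simulated as NFD, and one new file has been created locally. The functions `naive_status` and `composition_aware_status` are hypothetical and only model the comparison step, not any actual Subversion code path.

```python
import unicodedata


def naive_status(versioned, listing):
    """Compare names code point for code point, with no composition awareness."""
    versioned_set, listing_set = set(versioned), set(listing)
    return {
        "missing": sorted(versioned_set - listing_set),
        "unversioned": sorted(listing_set - versioned_set),
    }


def composition_aware_status(versioned, listing):
    """Match names via a normalized key, but report the names unchanged."""
    def key(name):
        return unicodedata.normalize("NFC", name)

    versioned_by_key = {key(n): n for n in versioned}
    listing_by_key = {key(n): n for n in listing}
    missing = versioned_by_key.keys() - listing_by_key.keys()
    unversioned = listing_by_key.keys() - versioned_by_key.keys()
    return {
        "missing": sorted(versioned_by_key[k] for k in missing),
        "unversioned": sorted(listing_by_key[k] for k in unversioned),
    }


repos_names = ["\u00c5.txt"]               # created on another system, NFC
listing = ["A\u030a.txt", "notes.txt"]     # simulated NFD listing plus a new file

print(naive_status(repos_names, listing))
print(composition_aware_status(repos_names, listing))
```

With the naive comparison the checked-out file shows up as both missing and unversioned; with the composition-aware comparison only the genuinely new file remains to be added and committed.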
>>> 
>>> - Julian
>>> 
>>
