Hi,

Sorry about the delay, had a release to sort out...

I have moved the proposal into the wiki:
http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness

The comments from Julian and Markus have been incorporated, and I have added
more information to the "Client Changes" section as well as more structure
and TODO notes.

I would really appreciate it if someone with more insight into WC-NG could
provide input on some of the TODO items (or on things that have been
completely overlooked).

Thanks,
Thomas Å.


On 21 feb 2012, at 09:55, Daniel Shahaf wrote:

> I've granted you write access to the wiki.
>
> Thomas Åkesson wrote on Tue, Feb 14, 2012 at 12:36:23 +0100:
>> Thanks Julian and Markus for providing feedback.
>>
>> I am not commenting below because all the feedback is very good and I will
>> try to address it as best I can in the next iteration. Describing the
>> behaviour changes to the WC is the most challenging since I lack that kind
>> of detailed knowledge. I will instead try to draft the structure of that
>> section to make it easier for someone with that level of detail to assist.
>>
>> Regarding use cases, what can I say... it was towards the end of a long
>> stretch.
>>
>> I think it would help with the upcoming iterations if I could move this
>> "document" into the wiki. If you find that this first draft shows promise,
>> please consider granting edit access in the wiki. My user name is "Thomas
>> Åkesson", which exercises the Unicode awareness of MoinMoin...
>>
>> /Thomas Å.
>>
>>
>> On 14 feb 2012, at 11:25, Julian Foad wrote:
>>
>>> Hi Thomas. It's fantastic that you're taking the trouble to write up this
>>> proposal. That's just what we need. Just a few initial comments below...
>>>
>>> Thomas Åkesson wrote:
>>>
>>>> Context
>>>> ===
>>>>
>>>> [...] A unicode string (e.g. a file name) can be represented
>>>> in 2 normalized forms (NFC/NFD) or mixed, i.e. multiple such
>>>> characters where some are composed and others decomposed (rare).
>>>
>>>
>>> What's "rare"? We have to assume that input is in mixed composition in any
>>> system that doesn't explicitly normalize it, which (I think) includes most
>>> operating systems. While it may be rare for any single string to contain
>>> characters in both compositions, it is very common to be processing a
>>> string that *might* have characters in both compositions -- in other words,
>>> that is not guaranteed to be normalized. I think it would be clearer to
>>> drop the "(rare)" and just say "... normalized forms (NFC/NFD) or mixed
>>> (not normalized).".
>>>
>>>
>>>> A minority of file systems (currently Mac OS X HFS+ only) will
>>>> normalize the paths. In the case of HFS+, the path will be
>>>> normalized into NFD and it will even be given back that way when
>>>> listing the filesystem.
>>>
>>>
>>> Drop the word "even"? The statement is not surprising.
>>>
>>>
>>> [...]
>>>
>>>> Similarities to case-sensitivity
>>>> ===
>>>>
>>>> - If two Unicode strings differ only by letter case/composition,
>>>
>>> Drop "/composition" -- it's the subject of the following sentence.
>>>
>>>>   on some computer systems they refer to the same file, while on
>>>>   other systems they refer to different files. The same applies
>>>>   if two Unicode strings differ only by composition.
>>>
>>>
>>>> [...]
>>>
>>>> Client Changes
>>>> ===
>>>>
>>>> [...] An abstraction between the repository path and the file
>>>> system path can be achieved by ensuring that there is a column
>>>> in wc.db that contains the file system path in exactly the same
>>>> form that the file system gives back. APIs in wc need to be
>>>> extended to ensure that all interaction with the file system is
>>>> performed with the file system path.
>>>
>>> [...]
>>>
>>> This part seems to be the heart of the whole proposal. You describe the
>>> data that we need, but the behaviour will also need to be described in
>>> detail. Presumably much of the behaviour is boring and obvious (when we
>>> check out a new path and create it on disk, we store the disk path), but
>>> I'm sure there will be some less obvious parts (do we need to find out what
>>> the disk path of an 'excluded' node would be, even though we're not
>>> actually creating it on disk, for example).
>>>
>>>
>>>> Use Cases
>>>> ===
>>>>
>>>> This change will only affect use cases which rely on creating
>>>> paths that look like duplicates but use different unicode
>>>> composition. It is highly unlikely anyone is relying on this.
>>>
>>>
>>> Uh... it sounds like you are saying there are no interesting use cases for
>>> this proposal! No, on the contrary, this proposal also affects checking
>>> out and using a WC on Mac HFS+ where the repository paths were created on
>>> another system and are not in NFD, and it allows that case to work. That's
>>> the more interesting use case, is it not? It's definitely worth writing
>>> out the interesting case in full, including steps like checkout (or update)
>>> that brings in a non-NFD path, create a new file on the Mac, and commit.
>>>
>>> - Julian
>>>
>>
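To make the composition forms discussed in the thread concrete, here is a
minimal Python sketch. It is only an illustration under stated assumptions,
not Subversion code: composition_key, same_name, simulated_disk_name and
disk_name_for are hypothetical names, the dictionary merely stands in for the
proposed wc.db column, and HFS+ is approximated with plain Unicode NFD (the
real file system uses its own variant of decomposition).

```python
import unicodedata


def composition_key(name: str) -> str:
    """Return a key used only for comparing names.

    The name itself is never rewritten; this mirrors the idea of being
    composition-aware without normalizing what is stored.
    """
    return unicodedata.normalize("NFC", name)


def same_name(a: str, b: str) -> bool:
    """True if two names differ at most by Unicode composition."""
    return composition_key(a) == composition_key(b)


# 'Å' as one precomposed code point (NFC form).
nfc_name = "\u00c5kesson.txt"
# 'A' followed by COMBINING RING ABOVE (NFD form).
nfd_name = "A\u030akesson.txt"
# Mixed: composed 'å' plus decomposed 'ö' in the same string.
mixed_name = "\u00e5ngstro\u0308m.txt"


def simulated_disk_name(repo_name: str) -> str:
    """Hypothetical stand-in for what a normalizing file system gives back.

    Assumption: plain NFD approximates the HFS+ behaviour described above.
    """
    return unicodedata.normalize("NFD", repo_name)


# Hypothetical stand-in for the proposed wc.db column: the repository path
# maps to the path in exactly the form the file system returned.
disk_name_for = {nfc_name: simulated_disk_name(nfc_name)}
```

With this sketch, nfc_name and nfd_name are different strings that
same_name() treats as the same file, mixed_name is in neither normalized
form, and disk_name_for keeps the repository spelling and the on-disk
spelling side by side so that file system calls can use the latter.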