Re: [RFC] Non-normalizing Unicode Composition Awareness

2012-11-09 Thread Thomas Åkesson
On 9 nov 2012, at 14:28, "C. Michael Pilato" wrote:

> On 11/09/2012 07:49 AM, Branko Čibej wrote:
>> On 09.11.2012 12:28, Thomas Åkesson wrote:
>> I'm currently doing the grunt work of implementing the collation (done)
>> and the LIKE and GLOB operators that we'll need (in progress). The next,
>> and biggest, step will be to review the client and WC libraries to make
>> sure that paths sent to the server always come from the wc.db, not from
>> disk.
> 
> I'm not closely following this problem or solution, but how does the above
> play out for "svn import", "svn mkdir IRI", "svn delete IRI", etc?  (If this
> is documented somewhere, a reference by way of response would suffice.)

http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness

The draft proposes that the server does not discriminate between compositions,
apart from ensuring that no new name collisions can be created.

Ensuring that paths come from wc.db applies to existing objects. We can discuss
whether the Mac client should normalize to NFC, but in my opinion that would be
optional.
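
To make the collision rule concrete, here is a minimal sketch in Python. It
assumes that two names collide when they differ only in Unicode composition,
and it uses the NFC form purely as a comparison key. The helper names
(nfc_key, find_collision) are hypothetical and are not Subversion functions.

```python
import unicodedata

def nfc_key(name):
    """Comparison key only: the stored name itself is never rewritten."""
    return unicodedata.normalize("NFC", name)

def find_collision(existing_names, new_name):
    """Return the existing name that new_name would collide with, or None.

    A byte-identical name is an ordinary "already exists" case and is
    deliberately not reported as a composition collision here.
    """
    key = nfc_key(new_name)
    for name in existing_names:
        if name != new_name and nfc_key(name) == key:
            return name
    return None

existing = ["A\u030akesson.txt", "README"]  # "A" + U+030A COMBINING RING ABOVE
composed = "\u00c5kesson.txt"               # precomposed U+00C5

print(find_collision(existing, composed) is not None)  # composition collision
print(find_collision(existing, "notes.txt"))           # no collision
```

Existing names stay untouched in whatever composition they were committed
with; only the creation of a second, look-alike name is refused.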

/Thomas Å.

Re: [RFC] Non-normalizing Unicode Composition Awareness

2012-11-09 Thread Branko Čibej
On 09.11.2012 14:28, C. Michael Pilato wrote:
> On 11/09/2012 07:49 AM, Branko Čibej wrote:
>> On 09.11.2012 12:28, Thomas Åkesson wrote:
>> I'm currently doing the grunt work of implementing the collation (done)
>> and the LIKE and GLOB operators that we'll need (in progress). The next,
>> and biggest, step will be to review the client and WC libraries to make
>> sure that paths sent to the server always come from the wc.db, not from
>> disk.
> I'm not closely following this problem or solution, but how does the above
> play out for "svn import", "svn mkdir IRI", "svn delete IRI", etc?  (If this
> is documented somewhere, a reference by way of response would suffice.)

Since these are server-side operations with no working copy involvement,
and I'm doing this strictly client-side for now, nothing would change.
This is a problem that we'll eventually have to solve on the server as
well. I don't believe it would be correct for the client to verify that
such operations do not create normalization conflicts on the server.

As a matter of interest, a server-side solution is one of the features
we identified for FSv2, although there's no reason to wait for that. In
FSv2, I envision all names being stored twice, once in their original
form, and once NFC-normalized, for indexing. The reason for that is that
I expect server CPU cycles to be more expensive than server storage, and
it therefore makes sense to avoid using a relatively expensive
normalizing collation in the server metadata index.
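
The "store every name twice" idea can be sketched with SQLite from Python.
This is only an illustration under assumptions: the table and column names
(entries, name, name_nfc) are made up, and nothing here reflects an actual
FSv2 design. The point is that the index on the normalized column can use
the cheap default binary comparison.

```python
import sqlite3
import unicodedata

def nfc(name):
    return unicodedata.normalize("NFC", name)

db = sqlite3.connect(":memory:")
# Hypothetical table: 'name' keeps the name exactly as committed, while
# 'name_nfc' is the normalized copy used for indexing and uniqueness.
db.execute(
    "CREATE TABLE entries (name TEXT NOT NULL, name_nfc TEXT NOT NULL UNIQUE)"
)

def add_entry(name):
    db.execute(
        "INSERT INTO entries (name, name_nfc) VALUES (?, ?)", (name, nfc(name))
    )

def lookup(name):
    row = db.execute(
        "SELECT name FROM entries WHERE name_nfc = ?", (nfc(name),)
    ).fetchone()
    return row[0] if row else None

add_entry("re\u0301sume\u0301.txt")        # stored decomposed, as sent
found = lookup("r\u00e9sum\u00e9.txt")     # looked up in composed form
print(found == "re\u0301sume\u0301.txt")   # original form is returned

try:
    add_entry("r\u00e9sum\u00e9.txt")      # same name, other composition
except sqlite3.IntegrityError:
    print("collision rejected by the index")
```

Normalization happens once per write and once per lookup key, so the index
itself never has to run a normalizing comparison.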

This /may/ turn out to be an issue for client working copy performance,
too; but for now I've elected to assume that collation won't have a
noticeable effect. If it does, we'll look at other solutions.

-- Brane

-- 
Branko Čibej
Director of Subversion | WANdisco | www.wandisco.com



Re: [RFC] Non-normalizing Unicode Composition Awareness

2012-11-09 Thread C. Michael Pilato
On 11/09/2012 07:49 AM, Branko Čibej wrote:
> On 09.11.2012 12:28, Thomas Åkesson wrote:
> I'm currently doing the grunt work of implementing the collation (done)
> and the LIKE and GLOB operators that we'll need (in progress). The next,
> and biggest, step will be to review the client and WC libraries to make
> sure that paths sent to the server always come from the wc.db, not from
> disk.

I'm not closely following this problem or solution, but how does the above
play out for "svn import", "svn mkdir IRI", "svn delete IRI", etc?  (If this
is documented somewhere, a reference by way of response would suffice.)


-- 
C. Michael Pilato 
CollabNet   <>   www.collab.net   <>   Enterprise Cloud Development





Re: [RFC] Non-normalizing Unicode Composition Awareness

2012-11-09 Thread Branko Čibej
On 09.11.2012 12:28, Thomas Åkesson wrote:
> Today, I noticed that Branko started some implementation in a branch. Looks 
> like a collation based on utf8proc is in the making? I think that would make 
> a lot of sense because the ICU extension poses some challenges in the build 
> process and we might not need all that functionality that it provides.

Hi Thomas,

Yes, I started a branch that's intended to fix the normalization
problem. I selected utf8proc because we really don't need ICU (I can't
see a serious need for language-specific case folding, for example, nor
for Unicode regular expressions). Furthermore, utf8proc can be easily
embedded into Subversion so it doesn't present another dependency that
users would have to worry about.

I'm currently doing the grunt work of implementing the collation (done)
and the LIKE and GLOB operators that we'll need (in progress). The next,
and biggest, step will be to review the client and WC libraries to make
sure that paths sent to the server always come from the wc.db, not from
disk.
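
As an illustration of what a composition-aware LIKE could look like, here is
a rough sketch using Python's sqlite3 module rather than utf8proc and C. It
is not the code on the branch. The table name "nodes" is a toy stand-in, and
unlike SQLite's built-in LIKE this version is case-sensitive and ignores
ESCAPE clauses.

```python
import re
import sqlite3
import unicodedata

def nfc(text):
    return unicodedata.normalize("NFC", text)

def normalizing_like(pattern, value):
    """SQL 'value LIKE pattern', with both sides compared in NFC form."""
    if pattern is None or value is None:
        return None
    # Translate LIKE wildcards to a regex: % is any run, _ is one code point.
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in nfc(pattern)
    )
    return int(re.fullmatch(regex, nfc(value), re.DOTALL) is not None)

db = sqlite3.connect(":memory:")
# SQLite evaluates "Y LIKE X" by calling like(X, Y), so a two-argument
# user function named "like" replaces the built-in operator.
db.create_function("like", 2, normalizing_like)
db.execute("CREATE TABLE nodes (local_relpath TEXT)")
db.execute("INSERT INTO nodes VALUES (?)", ("docs/re\u0301sume\u0301.txt",))

rows = db.execute(
    "SELECT local_relpath FROM nodes WHERE local_relpath LIKE ?",
    ("docs/r\u00e9sum%",),  # composed pattern against a decomposed row
).fetchall()
print(len(rows))
```

Normalizing both sides also means that "_" matches one accented letter
regardless of whether the stored path spells it as one or two code points.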

One open question is what to do about (historical) collisions in
existing repositories, but I don't think that issue is important enough
to resolve now.

It'll take a while, but I hope to be able to finish the work in time for
1.8. If not ... well then, it'll be in 1.9.

-- Brane




Re: [RFC] Non-normalizing Unicode Composition Awareness

2012-11-09 Thread Thomas Åkesson
Revisiting this thread after a few months. Last spring, I did some work in the
wiki on a proposal for resolving the Mac Unicode issues in a non-normalizing
manner. I ran out of time, but the thought process has been ongoing.

A couple of weeks ago at Subversion Live in London, I had the opportunity to
discuss this with a number of people. Although opinions differed on some
points, I think we concluded that we are actually relatively well aligned on
the core idea.

The proposal I drafted this spring (in the wiki) suggested adding a couple of
columns to the WC in order to store normalized paths. Over the past couple of
months, the concept of using an SQLite collation has seemed more appealing.
Last week, I did a test with the SQLite ICU extension (available in the SQLite
source repository), which turned out to be quite encouraging. With such a
collation, equality comparisons in SQL statements can match paths in a Unicode
composition-aware manner and therefore return rows regardless of which
composition the paths use.

This would be very useful, for instance, when attempting to locate the node in
wc.db that corresponds to a given filesystem path. That is basically half the
issue with Mac working copies.
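
The lookup described above can be sketched with a custom collation registered
through Python's sqlite3 module. This stands in for the ICU extension (or a
utf8proc-based collation) and is only an approximation: the collation name
NFC_AWARE and the one-column "nodes" table are assumptions made for the
example, not the real wc.db schema.

```python
import sqlite3
import unicodedata

def nfc_compare(a, b):
    """Collation callback: compare two strings by their NFC forms."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    return (a > b) - (a < b)

db = sqlite3.connect(":memory:")
db.create_collation("NFC_AWARE", nfc_compare)  # hypothetical collation name
db.execute(
    "CREATE TABLE nodes (local_relpath TEXT COLLATE NFC_AWARE PRIMARY KEY)"
)

stored = "\u00c5kesson.txt"      # composed, as it came from the repository
from_disk = "A\u030akesson.txt"  # decomposed, as an NFD file system lists it
db.execute("INSERT INTO nodes VALUES (?)", (stored,))

row = db.execute(
    "SELECT local_relpath FROM nodes WHERE local_relpath = ?", (from_disk,)
).fetchone()
print(row[0] == stored)  # the stored spelling is returned, not the disk one
```

The query is given the disk spelling but returns the spelling recorded in the
database, which is the form that would then be sent to the server.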

Today, I noticed that Branko started some implementation in a branch. Looks
like a collation based on utf8proc is in the making? I think that would make a
lot of sense because the ICU extension poses some challenges in the build
process and we might not need all the functionality it provides.

I started a wiki page about Unicode collation. I will append more info:
http://wiki.apache.org/subversion/UnicodeCollation

Also note the tiny test repo attached to:
http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness

Cheers,
Thomas Å.
 

Re: [RFC] Non-normalizing Unicode Composition Awareness

2012-04-23 Thread Philip Martin
Thomas Åkesson writes:

> If you, or someone else with WC insight, could provide some details on
> when/how conversions in the opposite direction is performed (e.g. svn
> stat and most commands taking path arguments), that would be
> incredibly useful to me. I would like to explore the option to somehow
> work around the "irreversible problem".

The Subversion libraries generally deal with UTF-8 paths exclusively.
When the user runs

   $ svn stat foo

the path 'foo' is typed in some local encoding. The program converts the
path from the local encoding to UTF-8 before passing it to the client or
working copy libraries.  These libraries then treat the path as UTF-8 in
almost all cases.  When making a system call such as stat() they pass
the UTF-8 path to the Subversion low-level IO functions in libsvn_subr.
These low-level functions convert the UTF-8 to the local encoding before
making the system calls.
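
The two conversions can be pictured with a small Python sketch. The function
names to_utf8 and to_local are hypothetical stand-ins, not the libsvn_subr
API, and Latin-1 is just an example of a local encoding.

```python
def to_utf8(arg_bytes, local_encoding):
    """Command-line boundary: local encoding in, UTF-8 out."""
    return arg_bytes.decode(local_encoding).encode("utf-8")

def to_local(path_utf8, local_encoding):
    """System-call boundary: UTF-8 in, local encoding out."""
    return path_utf8.decode("utf-8").encode(local_encoding)

# What a Latin-1 terminal would hand to the program for a typed path.
typed = "r\u00e9sum\u00e9.txt".encode("latin-1")

internal = to_utf8(typed, "latin-1")     # form used inside the libraries
on_disk = to_local(internal, "latin-1")  # form passed to stat() and friends

print(on_disk == typed)  # the encoding round trip is lossless here
```

Neither step changes the Unicode composition of the path; both are pure
character-set conversions.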



Re: [RFC] Non-normalizing Unicode Composition Awareness

2012-04-23 Thread Thomas Åkesson
Hi Philip,

Thanks for your comments in the wiki article. They raised some important points 
and potentially an idea that might simplify the solution.

>   All three paths are in UTF-8 but NFC/NFD is not currently specified. 
> local_relpath/parent_relpath get converted from UTF-8 to whatever locale 
> encoding is in use whenever they are used to access the filesystem.


This is not unlike what we need to do for HFS+. We could consider UTF8-MAC to 
be a distinct encoding. There is the major caveat that this conversion is 
irreversible (since the normalization is not specified in the repo/wc.db).
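
A short Python sketch shows why the conversion cannot be reversed. It uses
plain NFD as a rough stand-in for what HFS+ does (the real file system uses
its own variant of decomposition), and the helper name hfs_like is made up
for the example.

```python
import unicodedata

def hfs_like(name):
    """Approximation of the name an NFD-normalizing file system stores."""
    return unicodedata.normalize("NFD", name)

composed = "\u00c5ke.txt"           # precomposed A-ring
decomposed = "A\u030ake.txt"        # "A" + combining ring
mixed = "\u00c5ke-A\u030ake.txt"    # both spellings in one name

# Two different repository names end up as the same name on disk.
print(hfs_like(composed) == hfs_like(decomposed))

# Converting back with NFC does not recover a mixed original.
print(unicodedata.normalize("NFC", hfs_like(mixed)) == mixed)
```

Once a name has gone through the file system, the original composition can
only be recovered from a record kept elsewhere, such as wc.db.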

If you, or someone else with WC insight, could provide some details on when/how
conversions in the opposite direction are performed (e.g. svn stat and most
commands taking path arguments), that would be incredibly useful to me. I would
like to explore the option to somehow work around the "irreversible problem".

It would also be useful if someone could point me to where in the WC code the 
conversion from UTF-8 to locale encoding is performed.

Thanks!

/Thomas Å.




Re: [RFC] Non-normalizing Unicode Composition Awareness

2012-04-17 Thread Stefan Sperling
On Tue, Apr 17, 2012 at 05:24:53AM +0200, Thomas Åkesson wrote:
> I intend to use this script to take the design to the next level of detail. 
> First, I would like some feedback from people with in-depth knowledge of the 
> WC and preferably get some idea on what the community thinks about the 
> approach. 

I am interested but I don't have time to digest the details right now.
If you don't get any responses within a couple of days please ping me
and I'll take a look. Thanks!


Re: [RFC] Non-normalizing Unicode Composition Awareness

2012-04-16 Thread Thomas Åkesson
Hi,
A bit of a status update on the wiki article:
http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness

Received some comments from Daniel, which I have tried to address. Thanks. 

I have written a bash script which demonstrates the concept of "Alternative 1"
with regard to how the local_relpath column is handled by checkout/update.

From the wiki:
---
This alternative can be simulated using the attached script 
localrelpath2nfd.sh. This provides a Working Copy equivalent to what a checkout 
should produce if this alternative was implemented in Subversion itself:

svn co ...
svn stat #Shows any problematic items as missing and unversioned
localrelpath2nfd.sh
svn stat #Should be clean apart from misperception that some items are switched
---
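
For readers without the attachment, the general idea of rewriting stored
paths to NFD can be sketched in Python as follows. This is not the attached
script and it does not operate on a real wc.db: the two-column "nodes" table
is a toy stand-in, and treating parent_relpath the same way as local_relpath
is an assumption.

```python
import sqlite3
import unicodedata

def nfd(text):
    return None if text is None else unicodedata.normalize("NFD", text)

db = sqlite3.connect(":memory:")
# Toy stand-in for the working copy database; not the real wc.db schema.
db.execute(
    "CREATE TABLE nodes (local_relpath TEXT PRIMARY KEY, parent_relpath TEXT)"
)
db.execute("INSERT INTO nodes VALUES (?, ?)", ("docs/\u00c5ke.txt", "docs"))

# Rewrite every stored path to the form an NFD file system would list.
rows = db.execute("SELECT local_relpath, parent_relpath FROM nodes").fetchall()
for local_relpath, parent_relpath in rows:
    db.execute(
        "UPDATE nodes SET local_relpath = ?, parent_relpath = ? "
        "WHERE local_relpath = ?",
        (nfd(local_relpath), nfd(parent_relpath), local_relpath),
    )

stored = db.execute("SELECT local_relpath FROM nodes").fetchone()[0]
print(stored == "docs/A\u030ake.txt")  # now matches the on-disk spelling
```

After such a rewrite, the stored path and the path read back from disk agree
byte for byte, which is why a status check would stop reporting the items as
missing and unversioned.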

This script can be used to investigate how other subcommands are affected and 
determine what needs to be done. It is possible to make commits but updates to 
normalisation-dependent nodes will fail since this script is not inside the 
update code. 

I intend to use this script to take the design to the next level of detail. 
First, I would like some feedback from people with in-depth knowledge of the WC 
and preferably get some idea on what the community thinks about the approach. 

/Thomas Å. 



Re: [RFC] Non-normalizing Unicode Composition Awareness

2012-03-25 Thread Thomas Åkesson
Hi,
Sorry about the delay, had a release to sort out...

I have moved the proposal into the wiki:
http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness

The comments from Julian and Markus have been implemented and I have added more 
information to the "Client Changes" section as well as more structure and 
TODO-notes. 

I would really appreciate it if someone with more insight into WC-NG could
provide input on some of the TODO items (or things that have been completely
overlooked).

Thanks,
Thomas Å.


On 21 feb 2012, at 09:55, Daniel Shahaf wrote:

> I've granted you write access to the wiki.
> 
> Thomas Åkesson wrote on Tue, Feb 14, 2012 at 12:36:23 +0100:
>> Thanks Julian and Markus for providing feedback. 
>> 
>> I am not commenting below because all the feedback is very good and I will 
>> try to address it as best I can in the next iteration. Describing the 
>> behaviour changes to the WC is the most challenging since I lack that kind 
>> of detailed knowledge. I will instead try to draft the structure of that 
>> section to make it easier for someone with that level of detail to assist.
>> 
>> Regarding use cases, what can I say... it was towards the end of a long 
>> stretch.
>> 
>> I think it would help with the upcoming iterations if I could move this 
>> "document" into the wiki. If you find that this first draft shows promise, 
>> please consider granting edit access in the wiki. My user name is "Thomas 
>> Åkesson", which exercises the Unicode awareness of MoinMoin...
>> 
>> /Thomas Å.
>> 
>> 
>> On 14 feb 2012, at 11:25, Julian Foad wrote:
>> 
>>> Hi Thomas.  It's fantastic that you're taking the trouble to write up this 
>>> proposal.  That's just what we need.  Just a few initial comments below...
>>> 
>>> Thomas Åkesson wrote:
>>> 
>>>> Context
>>>> ===
>>>> 
>>>> [...] A unicode string (e.g. a file name) can be represented
>>>> in 2 normalized forms (NFC/NFD) or mixed, i.e. multiple such
>>>> characters where some are composed and others decomposed (rare).
>>> 
>>> 
>>> What's "rare"?  We have to assume that input is in mixed composition in any 
>>> system that doesn't explicitly normalize it, which (I think) includes most 
>>> operating systems.  While it may be rare for any single string to contain 
>>> characters in both compositions, it is very common to be processing a 
>>> string that *might* have characters in both compositions -- in other words, 
>>> that is not guaranteed to be normalized.  I think it would be clearer to 
>>> drop the "(rare)" and just say "... normalized forms (NFC/NFD) or mixed 
>>> (not normalized).".
>>> 
>>> 
>>>> A minority of file systems (currently Mac OS X HFS+ only) will
>>>> normalize the paths. In the case of HFS+, the path will be
>>>> normalized into NFD and it will even be given back that way when
>>>> listing the filesystem. 
>>> 
>>> 
>>> Drop the word "even"?  The statement is not surprising.
>>> 
>>> 
>>> [...]
>>> 
>>>> Similarities to case-sensitivity
>>>> ===
>>>> 
>>>>   - If two Unicode strings differ only by letter case/composition,
>>> 
>>> Drop "/composition" -- it's the subject of the following sentence.
>>> 
>>>> on some computer systems they refer to the same file, while on
>>>> other systems they refer to different files.  The same applies
>>>> if two Unicode strings differ only by composition. 
>>> 
>>> 
>>>> [...]
>>> 
>>>> Client Changes
>>>> ===
>>>> 
>>>> [...] An abstraction between the repository path and the file
>>>> system path can be achieved by ensuring that there is a column
>>>> in wc.db that contains the file system path in exactly the same
>>>> form that the file system gives back. APIs in wc needs to be
>>>> extended to ensure that all interaction with the file system is
>>>> performed with the file system path.
>>> 
>>> [...]
>>> 
>>> This part seems to be the heart of the whole proposal.  You describe the 
>>> data that we need, but the behaviour will also need to be described in 
>>> detail.  Presumably much of the behaviour is boring and obvious (when we 
>>> check out a new path and create it on disk, we store the disk path), but 
>>> I'm sure there will be some less obvious parts (do we need to find out what 
>>> the disk path of an 'excluded' node would be, even though we're not 
>>> actually creating it on disk, for example).
>>> 
>>> 
>>>> Use Cases
>>>> ===
>>>> 
>>>> This change will only affect use cases which rely on creating
>>>> paths that look like duplicates but use different unicode
>>>> composition. It is highly unlikely anyone is relying on this..
>>> 
>>> 
>>> Uh... it sounds like you are saying there are no interesting use cases for 
>>> this proposal!  No, on the contrary, this proposal also affects checking 
>>> out and using a WC on Mac HFS+ where the repository paths were created on 
>>> another system and are not in NFD, and it allows that case to work.  That's 
>>> the more interesting use case, is it not?  It's definitely worth writing 
>>> out the int

Re: [RFC] Non-normalizing Unicode Composition Awareness

2012-02-21 Thread Daniel Shahaf
I've granted you write access to the wiki.

Thomas Åkesson wrote on Tue, Feb 14, 2012 at 12:36:23 +0100:
> Thanks Julian and Markus for providing feedback. 
> 
> I am not commenting below because all the feedback is very good and I will 
> try to address it as best I can in the next iteration. Describing the 
> behaviour changes to the WC is the most challenging since I lack that kind of 
> detailed knowledge. I will instead try to draft the structure of that section 
> to make it easier for someone with that level of detail to assist.
> 
> Regarding use cases, what can I say... it was towards the end of a long 
> stretch.
> 
> I think it would help with the upcoming iterations if I could move this 
> "document" into the wiki. If you find that this first draft shows promise, 
> please consider granting edit access in the wiki. My user name is "Thomas 
> Åkesson", which exercises the Unicode awareness of MoinMoin...
> 
> /Thomas Å.
> 
> 
> On 14 feb 2012, at 11:25, Julian Foad wrote:
> 
> > Hi Thomas.  It's fantastic that you're taking the trouble to write up this 
> > proposal.  That's just what we need.  Just a few initial comments below...
> > 
> > Thomas Åkesson wrote:
> > 
> >> Context
> >> ===
> >> 
> >> [...] A unicode string (e.g. a file name) can be represented
> >> in 2 normalized forms (NFC/NFD) or mixed, i.e. multiple such
> >> characters where some are composed and others decomposed (rare).
> > 
> > 
> > What's "rare"?  We have to assume that input is in mixed composition in any 
> > system that doesn't explicitly normalize it, which (I think) includes most 
> > operating systems.  While it may be rare for any single string to contain 
> > characters in both compositions, it is very common to be processing a 
> > string that *might* have characters in both compositions -- in other words, 
> > that is not guaranteed to be normalized.  I think it would be clearer to 
> > drop the "(rare)" and just say "... normalized forms (NFC/NFD) or mixed 
> > (not normalized).".
> > 
> > 
> >> A minority of file systems (currently Mac OS X HFS+ only) will
> >> normalize the paths. In the case of HFS+, the path will be
> >> normalized into NFD and it will even be given back that way when
> >> listing the filesystem. 
> > 
> > 
> > Drop the word "even"?  The statement is not surprising.
> > 
> > 
> > [...]
> > 
> >> Similarities to case-sensitivity
> >> ===
> >> 
> >>   - If two Unicode strings differ only by letter case/composition,
> > 
> > Drop "/composition" -- it's the subject of the following sentence.
> > 
> >> on some computer systems they refer to the same file, while on
> >> other systems they refer to different files.  The same applies
> >> if two Unicode strings differ only by composition. 
> > 
> > 
> >> [...]
> > 
> >> Client Changes
> >> ===
> >> 
> >> [...] An abstraction between the repository path and the file
> >> system path can be achieved by ensuring that there is a column
> >> in wc.db that contains the file system path in exactly the same
> >> form that the file system gives back. APIs in wc needs to be
> >> extended to ensure that all interaction with the file system is
> >> performed with the file system path.
> > 
> > [...]
> > 
> > This part seems to be the heart of the whole proposal.  You describe the 
> > data that we need, but the behaviour will also need to be described in 
> > detail.  Presumably much of the behaviour is boring and obvious (when we 
> > check out a new path and create it on disk, we store the disk path), but 
> > I'm sure there will be some less obvious parts (do we need to find out what 
> > the disk path of an 'excluded' node would be, even though we're not 
> > actually creating it on disk, for example).
> > 
> > 
> >> Use Cases
> >> ===
> >> 
> >> This change will only affect use cases which rely on creating
> >> paths that look like duplicates but use different unicode
> >> composition. It is highly unlikely anyone is relying on this..
> > 
> > 
> > Uh... it sounds like you are saying there are no interesting use cases for 
> > this proposal!  No, on the contrary, this proposal also affects checking 
> > out and using a WC on Mac HFS+ where the repository paths were created on 
> > another system and are not in NFD, and it allows that case to work.  That's 
> > the more interesting use case, is it not?  It's definitely worth writing 
> > out the interesting case in full, including steps like checkout (or update) 
> > that brings in a non-NFD path, create a new file on the Mac, and commit.
> > 
> > - Julian
> > 
> 


Re: [RFC] Non-normalizing Unicode Composition Awareness

2012-02-14 Thread Thomas Åkesson
Thanks Julian and Markus for providing feedback. 

I am not commenting below because all the feedback is very good and I will try 
to address it as best I can in the next iteration. Describing the behaviour 
changes to the WC is the most challenging since I lack that kind of detailed 
knowledge. I will instead try to draft the structure of that section to make it 
easier for someone with that level of detail to assist.

Regarding use cases, what can I say... it was towards the end of a long stretch.

I think it would help with the upcoming iterations if I could move this 
"document" into the wiki. If you find that this first draft shows promise, 
please consider granting edit access in the wiki. My user name is "Thomas 
Åkesson", which exercises the Unicode awareness of MoinMoin...

/Thomas Å.


On 14 feb 2012, at 11:25, Julian Foad wrote:

> Hi Thomas.  It's fantastic that you're taking the trouble to write up this 
> proposal.  That's just what we need.  Just a few initial comments below...
> 
> Thomas Åkesson wrote:
> 
>> Context
>> ===
>> 
>> [...] A unicode string (e.g. a file name) can be represented
>> in 2 normalized forms (NFC/NFD) or mixed, i.e. multiple such
>> characters where some are composed and others decomposed (rare).
> 
> 
> What's "rare"?  We have to assume that input is in mixed composition in any 
> system that doesn't explicitly normalize it, which (I think) includes most 
> operating systems.  While it may be rare for any single string to contain 
> characters in both compositions, it is very common to be processing a string 
> that *might* have characters in both compositions -- in other words, that is 
> not guaranteed to be normalized.  I think it would be clearer to drop the 
> "(rare)" and just say "... normalized forms (NFC/NFD) or mixed (not 
> normalized).".
> 
> 
>> A minority of file systems (currently Mac OS X HFS+ only) will
>> normalize the paths. In the case of HFS+, the path will be
>> normalized into NFD and it will even be given back that way when
>> listing the filesystem. 
> 
> 
> Drop the word "even"?  The statement is not surprising.
> 
> 
> [...]
> 
>> Similarities to case-sensitivity
>> ===
>> 
>>   - If two Unicode strings differ only by letter case/composition,
> 
> Drop "/composition" -- it's the subject of the following sentence.
> 
>> on some computer systems they refer to the same file, while on
>> other systems they refer to different files.  The same applies
>> if two Unicode strings differ only by composition. 
> 
> 
>> [...]
> 
>> Client Changes
>> ===
>> 
>> [...] An abstraction between the repository path and the file
>> system path can be achieved by ensuring that there is a column
>> in wc.db that contains the file system path in exactly the same
>> form that the file system gives back. APIs in wc needs to be
>> extended to ensure that all interaction with the file system is
>> performed with the file system path.
> 
> [...]
> 
> This part seems to be the heart of the whole proposal.  You describe the data 
> that we need, but the behaviour will also need to be described in detail.  
> Presumably much of the behaviour is boring and obvious (when we check out a 
> new path and create it on disk, we store the disk path), but I'm sure there 
> will be some less obvious parts (do we need to find out what the disk path of 
> an 'excluded' node would be, even though we're not actually creating it on 
> disk, for example).
> 
> 
>> Use Cases
>> ===
>> 
>> This change will only affect use cases which rely on creating
>> paths that look like duplicates but use different unicode
>> composition. It is highly unlikely anyone is relying on this.
> 
> 
> Uh... it sounds like you are saying there are no interesting use cases for 
> this proposal!  No, on the contrary, this proposal also affects checking out 
> and using a WC on Mac HFS+ where the repository paths were created on another 
> system and are not in NFD, and it allows that case to work.  That's the more 
> interesting use case, is it not?  It's definitely worth writing out the 
> interesting case in full, including steps like checkout (or update) that 
> brings in a non-NFD path, create a new file on the Mac, and commit.
> 
> - Julian
> 



Re: [RFC] Non-normalizing Unicode Composition Awareness

2012-02-14 Thread Julian Foad
Hi Thomas.  It's fantastic that you're taking the trouble to write up this 
proposal.  That's just what we need.  Just a few initial comments below...

Thomas Åkesson wrote:

> Context
> ===
> 
> [...] A unicode string (e.g. a file name) can be represented
> in 2 normalized forms (NFC/NFD) or mixed, i.e. multiple such
> characters where some are composed and others decomposed (rare).


What's "rare"?  We have to assume that input is in mixed composition in any 
system that doesn't explicitly normalize it, which (I think) includes most 
operating systems.  While it may be rare for any single string to contain 
characters in both compositions, it is very common to be processing a string 
that *might* have characters in both compositions -- in other words, that is 
not guaranteed to be normalized.  I think it would be clearer to drop the 
"(rare)" and just say "... normalized forms (NFC/NFD) or mixed (not 
normalized).".


> A minority of file systems (currently Mac OS X HFS+ only) will
> normalize the paths. In the case of HFS+, the path will be
> normalized into NFD and it will even be given back that way when
> listing the filesystem. 


Drop the word "even"?  The statement is not surprising.


[...]

> Similarities to case-sensitivity
> ===
> 
>  - If two Unicode strings differ only by letter case/composition,

Drop "/composition" -- it's the subject of the following sentence.

> on some computer systems they refer to the same file, while on
> other systems they refer to different files.  The same applies
> if two Unicode strings differ only by composition.


> [...]

> Client Changes
> ===
> 
> [...] An abstraction between the repository path and the file
> system path can be achieved by ensuring that there is a column
> in wc.db that contains the file system path in exactly the same
> form that the file system gives back. APIs in wc need to be
> extended to ensure that all interaction with the file system is
> performed with the file system path.

[...]

This part seems to be the heart of the whole proposal.  You describe the data 
that we need, but the behaviour will also need to be described in detail.  
Presumably much of the behaviour is boring and obvious (when we check out a new 
path and create it on disk, we store the disk path), but I'm sure there will be 
some less obvious parts (do we need to find out what the disk path of an 
'excluded' node would be, even though we're not actually creating it on disk, 
for example).
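
To make the "boring and obvious" part concrete, here is a minimal sketch in
Python of recording both forms of a path at checkout time. It is only an
illustration under stated assumptions: the table name node_sketch, its columns
repos_path and disk_path, and the helper names are hypothetical and are not
the real wc.db schema or any Subversion API. Plain NFD stands in for whatever
form a normalizing file system actually hands back.

```python
import sqlite3
import unicodedata

# Hypothetical stand-in for wc.db: one row per node, holding the path as the
# repository knows it and the path as the file system reports it.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE node_sketch ("
    " repos_path TEXT PRIMARY KEY,"
    " disk_path  TEXT NOT NULL)"
)


def to_disk_form(path):
    """Assumption: model a normalizing file system by converting to NFD."""
    return unicodedata.normalize("NFD", path)


def record_checkout(repos_path):
    """Store the repository path together with the form the disk gives back."""
    disk_path = to_disk_form(repos_path)
    db.execute(
        "INSERT INTO node_sketch (repos_path, disk_path) VALUES (?, ?)",
        (repos_path, disk_path),
    )
    return disk_path


def repos_path_for(disk_path):
    """Map a name read from disk back to the path to send to the server."""
    row = db.execute(
        "SELECT repos_path FROM node_sketch WHERE disk_path = ?",
        (disk_path,),
    ).fetchone()
    return row[0] if row else None
```

With a mapping like this, file system calls would use disk_path while anything
sent to the server would use repos_path. The less obvious cases, such as
excluded nodes that never reach the disk, are not covered by the sketch.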


> Use Cases
> ===
> 
> This change will only affect use cases which rely on creating
> paths that look like duplicates but use different unicode
> composition. It is highly unlikely anyone is relying on this.


Uh... it sounds like you are saying there are no interesting use cases for this 
proposal!  No, on the contrary, this proposal also affects checking out and 
using a WC on Mac HFS+ where the repository paths were created on another 
system and are not in NFD, and it allows that case to work.  That's the more 
interesting use case, is it not?  It's definitely worth writing out the 
interesting case in full, including steps like checkout (or update) that brings 
in a non-NFD path, create a new file on the Mac, and commit.

- Julian



Re: [RFC] Non-normalizing Unicode Composition Awareness (was: Let's discuss about unicode compositions for filenames!)

2012-02-14 Thread Markus Schaber
Hi, Thomas,

Just a little bit of nitpicking: For some characters, there exist more than 2 
different ways: Sometimes, there are different codepoints for the same 
character, or characters can be partially composed or they can be decomposed, 
but with non-canonical ordering. Those cases are rare in practice, however.
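
A small Python sketch of these cases, using only the standard unicodedata
module. The sample characters are my own choice for illustration and do not
come from the thread: "e" with circumflex and dot below has several canonically
equivalent spellings, and the ANGSTROM SIGN is a separate code point for the
same character as "A" with ring above.

```python
import unicodedata

# Five different code point sequences for the same visible character.
spellings = {
    "fully composed": "\u1ec7",
    "partially composed, dot below first": "\u1eb9\u0302",
    "partially composed, circumflex first": "\u00ea\u0323",
    "decomposed, canonical order": "e\u0323\u0302",
    "decomposed, non-canonical order": "e\u0302\u0323",
}

# Normalizing collapses all of them to a single form.
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings.values()}
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings.values()}

# Two distinct code points for what is canonically the same character.
angstrom_sign = "\u212b"  # ANGSTROM SIGN
a_with_ring = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE

for label, text in spellings.items():
    print(label, [hex(ord(ch)) for ch in text])
print(len(nfc_forms), len(nfd_forms))
```

So a comparison that only distinguishes "the NFC form" from "the NFD form"
would miss some inputs. Comparing after normalizing both sides to the same
form covers all of them.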



Best regards

Markus Schaber

From: Thomas Åkesson [mailto:tho...@akesson.cc]
Sent: Tuesday, 14 February 2012 01:35
To: Subversion Development
Cc: Hiroaki Nakamura; Stefan Sperling
Subject: [RFC] Non-normalizing Unicode Composition Awareness (was: Let's 
discuss about unicode compositions for filenames!)

Title: Non-normalizing Unicode Composition Awareness
Version: 0.1 (2012-02-14)


Context
===

In the Unicode standard, some characters can be represented in 2 different 
ways (composed/decomposed) while being rendered identically on screen or in 
print. A unicode string (e.g. a file name) can be represented in 2 normalized 
forms (NFC/NFD) or mixed, i.e. multiple such characters where some are composed 
and others decomposed (rare).

The majority of file systems (e.g. NTFS, Ext3) will accept a unicode filename 
in any form, store it, and give it back in the form in which it was input. 
These file systems will typically even accept multiple files where the path 
looks identical on screen but the unicode string is different due to character 
composition.

A minority of file systems (currently Mac OS X HFS+ only) will normalize the 
paths. In the case of HFS+, the path will be normalized into NFD and it will 
even be given back that way when listing the filesystem.

Most significant differences from the majority of filesystems:
 - A file that is stored in NFC or mixed form will not be returned with an 
identical name. This is generally considered a negative effect of the HFS+ 
unicode implementation.
 - Multiple files whose name is rendered equally cannot be stored in the same 
directory. Often considered an advantage.


The topic has been described here:
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames

 - This RFC is not as complete in all areas, and depends on this note for 
additional context and issue description.
 - This RFC proposes a solution very similar to the note's solution 4, "Client 
and server-side path comparison routines". However, here it is proposed as a 
long term solution.
 - This RFC is essentially identical to what Erik H. proposes in this thread:
http://svn.haxx.se/dev/archive-2010-09/0319.shtml



Issue Description
===

 - Subversion and most file systems currently allow creation of multiple paths 
that are identical in normalized form, hereafter referred to as 
"normalized-name collisions". This could cause significant upgrade issues for 
repositories containing such collisions, depending on which solution is 
implemented. See section "Legacy Data".

 - Users have difficulty understanding and managing "normalized-name 
collisions". It is difficult to know which file is which and one of the paths 
is typically not possible to type on a keyboard.

 - Mac OS X clients cannot interoperate with non-OSX clients when paths 
contain composed characters (added by a non-OSX client). The working copies are 
broken directly after checkout/update on OSX. Tracked by: 
http://subversion.tigris.org/issues/show_bug.cgi?id=2464



Differences to case-sensitivity
===

 - NFC/NFD look the same when rendered on screen.
 - Different case can be controlled with the keyboard, while unicode 
composition is more difficult.
 - Most modern case-insensitive file systems are case-preserving, i.e. they do 
not normalize to a preferred form and always return the same form that was 
stored. Normalizing file systems do not preserve the paths.



Similarities to case-sensitivity
===

 - If two Unicode strings differ only by letter case/composition, on some 
computer systems they refer to the same file, while on other systems they refer 
to different files.  The same applies if two Unicode strings differ only by 
composition. The rules are set by each file system.

 - Subversion interoperates with different systems.  When two file names that 
differ only by letter case are transferred from a case-sensitive system to a 
case-insensitive system, they will collide and Subvers

[RFC] Non-normalizing Unicode Composition Awareness (was: Let's discuss about unicode compositions for filenames!)

2012-02-13 Thread Thomas Åkesson
Title: Non-normalizing Unicode Composition Awareness
Version: 0.1 (2012-02-14)


Context
===

In the Unicode standard, some characters can be represented in 2 different 
ways (composed/decomposed) while being rendered identically on screen or in 
print. A unicode string (e.g. a file name) can be represented in 2 normalized 
forms (NFC/NFD) or mixed, i.e. multiple such characters where some are composed 
and others decomposed (rare).
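
As a minimal illustration, assuming Python and its standard unicodedata module
(the helper name composition_form is hypothetical), the forms can be told apart
like this:

```python
import unicodedata

composed = "\u00c5kesson"      # "Å" as a single code point (U+00C5)
decomposed = "A\u030akesson"   # "A" followed by COMBINING RING ABOVE (U+030A)
mixed = "\u00c5A\u030a"        # one composed and one decomposed character


def composition_form(name):
    """Classify a string as 'NFC', 'NFD', 'mixed', or 'unaffected'."""
    is_nfc = unicodedata.normalize("NFC", name) == name
    is_nfd = unicodedata.normalize("NFD", name) == name
    if is_nfc and is_nfd:
        return "unaffected"    # e.g. pure ASCII: both forms are identical
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "mixed"


for name in (composed, decomposed, mixed, "readme.txt"):
    print(repr(name), composition_form(name))
```

The two spellings of the name compare as different strings even though they
render the same, and each converts to the other through normalization.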

The majority of file systems (e.g. NTFS, Ext3) will accept a unicode filename 
in any form, store it, and give it back in the form in which it was input. 
These file systems will typically even accept multiple files where the path 
looks identical on screen but the unicode string is different due to character 
composition.

A minority of file systems (currently Mac OS X HFS+ only) will normalize the 
paths. In the case of HFS+, the path will be normalized into NFD and it will 
even be given back that way when listing the filesystem. 

Most significant differences from the majority of filesystems:
 - A file that is stored in NFC or mixed form will not be returned with an 
identical name. This is generally considered a negative effect of the HFS+ 
unicode implementation.
 - Multiple files whose name is rendered equally cannot be stored in the same 
directory. Often considered an advantage.   
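
The two differences above can be modelled with a toy in-memory directory. This
is a sketch only: the class names are hypothetical, and plain NFD is used as a
simplification, since real HFS+ applies its own variant of decomposition.

```python
import unicodedata


class PreservingDir:
    """Toy model of a directory that stores names exactly as given."""

    def __init__(self):
        self._names = []

    def create(self, name):
        if name in self._names:
            raise FileExistsError(name)
        self._names.append(name)

    def listdir(self):
        return list(self._names)


class NormalizingDir(PreservingDir):
    """Toy model of a directory that normalizes names on the way in."""

    def create(self, name):
        # Assumption: plain NFD stands in for the file system's own rules.
        super().create(unicodedata.normalize("NFD", name))


nfc_name = "\u00e9.txt"    # "é" composed
nfd_name = "e\u0301.txt"   # "e" followed by COMBINING ACUTE ACCENT
```

In this model a PreservingDir accepts both nfc_name and nfd_name as separate
entries, while a NormalizingDir lists nfc_name back in its NFD spelling and
refuses to create nfd_name next to it.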


The topic has been described here:
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames

 - This RFC is not as complete in all areas, and depends on this note for 
additional context and issue description.
 - This RFC proposes a solution very similar to the note's solution 4, "Client 
and server-side path comparison routines". However, here it is proposed as a 
long term solution.
 - This RFC is essentially identical to what Erik H. proposes in this thread:
http://svn.haxx.se/dev/archive-2010-09/0319.shtml



Issue Description
===

 - Subversion and most file systems currently allow creation of multiple paths 
that are identical in normalized form, hereafter referred to as 
"normalized-name collisions". This could cause significant upgrade issues for 
repositories containing such collisions, depending on which solution is 
implemented. See section "Legacy Data".

 - Users have difficulty understanding and managing "normalized-name 
collisions". It is difficult to know which file is which and one of the paths 
is typically not possible to type on a keyboard.

 - Mac OS X clients cannot interoperate with non-OSX clients when paths 
contain composed characters (added by a non-OSX client). The working copies are 
broken directly after checkout/update on OSX. Tracked by: 
http://subversion.tigris.org/issues/show_bug.cgi?id=2464
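
A "normalized-name collision" as defined above can be detected by grouping
names on their normalized form. The following Python sketch is illustrative
only: the function name is hypothetical, and using NFC as the grouping key is
an arbitrary choice (NFD would give the same groups).

```python
import unicodedata
from collections import defaultdict


def find_normalized_name_collisions(names):
    """Return groups of distinct names that are identical once normalized."""
    groups = defaultdict(list)
    for name in names:
        key = unicodedata.normalize("NFC", name)
        if name not in groups[key]:
            groups[key].append(name)
    return [group for group in groups.values() if len(group) > 1]


entries = [
    "r\u00e9sum\u00e9.txt",     # composed
    "re\u0301sume\u0301.txt",   # decomposed, renders the same
    "notes.txt",
]
for group in find_normalized_name_collisions(entries):
    print([name.encode("unicode_escape").decode("ascii") for name in group])
```

Nothing is renamed or normalized here. The names are only compared, which is
in line with the non-normalizing approach this RFC proposes.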



Differences to case-sensitivity
===

 - NFC/NFD look the same when rendered on screen.
 - Different case can be controlled with the keyboard, while unicode 
composition is more difficult.
 - Most modern case-insensitive file systems are case-preserving, i.e. they do 
not normalize to a preferred form and always return the same form that was 
stored. Normalizing file systems do not preserve the paths.



Similarities to case-sensitivity
===

 - If two Unicode strings differ only by letter case/composition, on some 
computer systems they refer to the same file, while on other systems they refer 
to different files.  The same applies if two Unicode strings differ only by 
composition. The rules are set by each file system.

 - Subversion interoperates with different systems.  When two file names that 
differ only by letter case are transferred from a case-sensitive system to a 
case-insensitive system, they will collide and Subversion should handle this in 
some friendly way. The same applies if two file names differ only by 
composition.



To Normalize or Not to Normalize
===

Whether or not to normalize within a Subversion repository (server-side) has 
been debated. The note (unicode-composition-for-filenames) considers 
normalization to NFC to be the long term (2.x) solution. This feature is 
referred to as "repository normalization".

There are implementation advantages with normalized paths which can simplify 
comparisons and storage. 

There are also reasons not to normalize:

 - A file system is generally expected to give back exactly what was stored, or 
refuse up-front. HFS+ has been criticized for not living up to this 
expectation, which is also the reason the Svn WC has issues on HFS+. Subversion 
can be considered a sort of file system, and could therefore be expected to 
live up to this expectation.

 - Compatibility is a high priority for Subversion. Introducing 
normalization/translation/etc. may well introduce compatibility issues, now or 
later. There is a principle that Subversion should not be a limiting factor or 
impose undue limitations on allowed characters, file names, etc.

 - Introducing normalization tends to complicate the upgrade process, 
especially for repositories that contain "normalized-name collisions". This is