[RFC] Non-normalizing Unicode Composition Awareness (was: Let's discuss about unicode compositions for filenames!)

2012-02-13 Thread Thomas Åkesson
Title: Non-normalizing Unicode Composition Awareness
Version: 0.1 (2012-02-14)


Context
===

Within Unicode, some characters can in the unicode standard be represented in 2 
different ways (composed/decomposed), while rendered equally on screen or in 
print. A unicode string (e.g. a file name) can be represented in 2 normalized 
forms (NFC/NFD) or mixed, i.e. multiple such characters where some are composed 
and others decomposed (rare).

The majority of file systems (e.g. NTFS, Ext3) will accept a unicode filename 
in any form, store and give back in the form it was input. These file systems 
will typically even accept multiple files where the path looks identical on 
screen but the unicode string is different due to character composition.

A minority of file systems (currently Mac OS X HFS+ only) will normalize the 
paths. In the case of HFS+, the path will be normalized into NFD and it will 
even be given back that way when listing the filesystem. 

Most significant differences from the majority of filesystems:
 - A file that is stored in NFC or mixed, will not be returned with an 
identical name. Generally considered a negative effect of the HFS+ unicode 
implementation.
 - Multiple files whose name is rendered equally cannot be stored in the same 
directory. Often considered an advantage.   


The topic has been described here:
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames

 - This RFC is not as complete in all areas, and depend on this note for 
additional context and issue description.
 - This RFC proposes a solution very similar to the note's solution 4, "Client 
and server-side path comparison routines". However, here it is proposed as a 
long term solution.
 - This RFC is essentially identical to what Erik H. proposes in this thread:
http://svn.haxx.se/dev/archive-2010-09/0319.shtml



Issue Description
===

 - Subversion and most file systems currently allow creation of multiple paths, 
which in normalized form are identical. Hereafter referred to as 
"normalized-name collisions". This could cause significant upgrade issues for 
repositories containing such collisions, depending on which solution is 
implemented. See section "Legacy Data".

 - Users have difficulty understanding and managing "normalized-name 
collisions". It is difficult to know which file is which and one of the paths 
is typically not possible to type on a keyboard.

 - Mac OS X clients can not interoperate with non-OSX clients when paths 
contain composed characters (added by a non-OSX client). The working copies are 
broken directly after checkout/update on OSX. Tracked by: 
http://subversion.tigris.org/issues/show_bug.cgi?id=2464



Differences to case-sensitivity
===

 - NFC/NFD look the same when rendered on screen.
 - Different case can be controlled with the keyboard, while unicode 
composition is more difficult.
 - Most modern case-insensitive file systems are case-preserving, i.e. they do 
not normalize to a preferred form and always return the same form that was 
stored. Normalizing file systems do not preserve the paths.



Similarities to case-sensitivity
===

 - If two Unicode strings differ only by letter case/composition, on some 
computer systems they refer to the same file, while on other systems they refer 
to different files.  The same applies if two Unicode strings differ only by 
composition. The rules are set by each file system.

 - Subversion interoperates with different systems.  When two file names that 
differ only by letter case are transferred from a 
case-sensitive system to a case-insensitive system, they will collide and 
Subversion should handle this in some friendly way. The same applies if two 
file names differ only by composition.



To Normalize or Not to Normalize
===

Whether or not to normalize within a Subversion repository (server-side) has 
been debated. The note (unicode-composition-for-filenames) considers 
normalization to NFC to be the long term (2.x) solution. Referring to this 
feature as "repository normalization".

There are implementation advantages with normalized paths which can simplify 
comparisons and storage. 

There are also reasons not to normalize:

 - A file system is generally expected to give back exactly what was stored, or 
refuse up-front. HFS+ has been criticized for not living up to this 
expectation, which is also the reason the Svn WC has issues on HFS+. Subversion 
can be considered a sort of file system, and could therefore be expected to 
live up to this expectation.

 - Compatibility is a high priority for Subversion. Introducing 
normalization/translation/etc is not unlikely to introduce compatibility 
issues, now or later. There is a principle that Subversion should not be a 
limiting factor or impose undue limitations on allowed characters, file names 
etc. 

 - Introducing normalization tends to complicate the upgrade process, 
especially for repositories that contain "normalized-name collisions". This is

AW: [RFC] Non-normalizing Unicode Composition Awareness (was: Let's discuss about unicode compositions for filenames!)

2012-02-14 Thread Markus Schaber
Hi, Thomas,

Just a little bit of nitpicking: For some characters, there exist more than 2 
different ways: Sometimes, there are different codepoints for the same 
character, or characters can be partially composed or they can be decomposed, 
but with non-canonical ordering. Those cases are rare in practice, however.



Best regards

Markus Schaber
--
___
We software Automation.

3S-Smart Software Solutions GmbH
Markus Schaber | Developer
Memminger Str. 151 | 87439 Kempten | Germany | Tel. +49-831-54031-0 | Fax 
+49-831-54031-50

Email: m.scha...@3s-software.com<mailto:m.scha...@3s-software.com> | Web: 
http://www.3s-software.com <http://www.3s-software.com/>
CoDeSys internet forum: 
http://forum.3s-software.com<http://forum-en.3s-software.com/>
Download CoDeSys sample projects: 
http://www.3s-software.com/index.shtml?sample_projects

Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade 
register: Kempten HRB 6186 | Tax ID No.: DE 167014915

Von: Thomas Åkesson [mailto:tho...@akesson.cc]
Gesendet: Dienstag, 14. Februar 2012 01:35
An: Subversion Development
Cc: Hiroaki Nakamura; Stefan Sperling
Betreff: [RFC] Non-normalizing Unicode Composition Awareness (was: Let's 
discuss about unicode compositions for filenames!)

Title: Non-normalizing Unicode Composition Awareness
Version: 0.1 (2012-02-14)


Context
===

Within Unicode, some characters can in the unicode standard be represented in 2 
different ways (composed/decomposed), while rendered equally on screen or in 
print. A unicode string (e.g. a file name) can be represented in 2 normalized 
forms (NFC/NFD) or mixed, i.e. multiple such characters where some are composed 
and others decomposed (rare).

The majority of file systems (e.g. NTFS, Ext3) will accept a unicode filename 
in any form, store and give back in the form it was input. These file systems 
will typically even accept multiple files where the path looks identical on 
screen but the unicode string is different due to character composition.

A minority of file systems (currently Mac OS X HFS+ only) will normalize the 
paths. In the case of HFS+, the path will be normalized into NFD and it will 
even be given back that way when listing the filesystem.

Most significant differences from the majority of filesystems:
 - A file that is stored in NFC or mixed, will not be returned with an 
identical name. Generally considered a negative effect of the HFS+ unicode 
implementation.
 - Multiple files whose name is rendered equally cannot be stored in the same 
directory. Often considered an advantage.


The topic has been described here:
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames

 - This RFC is not as complete in all areas, and depend on this note for 
additional context and issue description.
 - This RFC proposes a solution very similar to the note's solution 4, "Client 
and server-side path comparison routines". However, here it is proposed as a 
long term solution.
 - This RFC is essentially identical to what Erik H. proposes in this thread:
http://svn.haxx.se/dev/archive-2010-09/0319.shtml



Issue Description
===

 - Subversion and most file systems currently allow creation of multiple paths, 
which in normalized form are identical. Hereafter referred to as 
"normalized-name collisions". This could cause significant upgrade issues for 
repositories containing such collisions, depending on which solution is 
implemented. See section "Legacy Data".

 - Users have difficulty understanding and managing "normalized-name 
collisions". It is difficult to know which file is which and one of the paths 
is typically not possible to type on a keyboard.

 - Mac OS X clients can not interoperate with non-OSX clients when paths 
contain composed characters (added by a non-OSX client). The working copies are 
broken directly after checkout/update on OSX. Tracked by: 
http://subversion.tigris.org/issues/show_bug.cgi?id=2464



Differences to case-sensitivity
===

 - NFC/NFD look the same when rendered on screen.
 - Different case can be controlled with the keyboard, while unicode 
composition is more difficult.
 - Most modern case-insensitive file systems are case-preserving, i.e. they do 
not normalize to a preferred form and always return the same form that was 
stored. Normalizing file systems do not preserve the paths.



Similarities to case-sensitivity
===

 - If two Unicode strings differ only by letter case/composition, on some 
computer systems they refer to the same file, while on other systems they refer 
to different files.  The same applies if two Unicode strings differ only by 
composition. The rules are set by each file system.

 - Subversion interoperates with different systems.  When two file names that 
differ only by letter case are transferred from a
case-sensitive system to a case-insensitive system, they will collide and 
Subvers