Title: Non-normalizing Unicode Composition Awareness
Version: 0.1 (2012-02-14)


Context
===

In Unicode, some characters can be represented in two different ways 
(composed or decomposed) that render identically on screen and in print. A 
Unicode string (e.g. a file name) can therefore be in one of two normalized 
forms (NFC or NFD) or, rarely, in a mixed form where some characters are 
composed and others decomposed.
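
As a minimal illustration of the forms described above, the following Python 
sketch uses the standard unicodedata module. The sample strings are chosen 
here for illustration only and do not come from any real repository.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point (NFC form)
decomposed = "e\u0301"   # "e" followed by a combining acute accent (NFD form)

# The two strings render the same but are different code point sequences.
print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True

# A "mixed" string: the first "é" is composed, the second is decomposed.
mixed = "caf\u00e9-caf" + "e\u0301"
nfc = unicodedata.normalize("NFC", mixed)
nfd = unicodedata.normalize("NFD", mixed)
print(len(mixed), len(nfc), len(nfd))                         # 10 9 11
```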

The majority of file systems (e.g. NTFS, Ext3) accept a Unicode file name in 
any form and store and return it exactly as it was input. These file systems 
will typically even accept multiple files whose paths look identical on 
screen but whose Unicode strings differ in character composition.

A minority of file systems (currently Mac OS X HFS+ only) will normalize the 
paths. In the case of HFS+, the path will be normalized into NFD and it will 
even be given back that way when listing the filesystem. 

Most significant differences from the majority of filesystems:
 - A file that is stored with an NFC or mixed name will not be returned with 
an identical name. This is generally considered a negative effect of the HFS+ 
Unicode implementation.
 - Multiple files whose name is rendered equally cannot be stored in the same 
directory. Often considered an advantage.   
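
The two differences listed above can be modelled with a small sketch. The 
classes PreservingDir and NormalizingDir are hypothetical and exist only for 
this illustration. Plain NFD is used as a stand-in for HFS+, which in reality 
applies its own variant of decomposition.

```python
import unicodedata


class PreservingDir:
    """Toy directory that stores names exactly as given (NTFS/Ext3-like)."""

    def __init__(self):
        self._names = set()

    def create(self, name):
        if name in self._names:
            raise FileExistsError(name)
        self._names.add(name)

    def listing(self):
        return sorted(self._names)


class NormalizingDir(PreservingDir):
    """Toy directory that converts every name to NFD before storing it.

    Assumption: standard NFD approximates what HFS+ does.
    """

    def create(self, name):
        super().create(unicodedata.normalize("NFD", name))


preserving = PreservingDir()
preserving.create("\u00e9.txt")    # composed name
preserving.create("e\u0301.txt")   # decomposed name, accepted as a second file

normalizing = NormalizingDir()
normalizing.create("\u00e9.txt")   # stored and listed in decomposed form
```

In this model, normalizing.listing() returns the decomposed name even though 
the composed one was passed in, and a second create() call with the decomposed 
name raises FileExistsError.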


The topic has been described here:
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames

 - This RFC is not as complete in all areas, and depends on this note for 
additional context and issue description.
 - This RFC proposes a solution very similar to the note's solution 4, "Client 
and server-side path comparison routines". However, here it is proposed as a 
long term solution.
 - This RFC is essentially identical to what Erik H. proposes in this thread:
http://svn.haxx.se/dev/archive-2010-09/0319.shtml



Issue Description
===

 - Subversion and most file systems currently allow creation of multiple paths 
that are identical in normalized form. These are hereafter referred to as 
"normalized-name collisions". They could cause significant upgrade issues for 
repositories containing such collisions, depending on which solution is 
implemented. See section "Legacy Data".

 - Users have difficulty understanding and managing "normalized-name 
collisions". It is difficult to know which file is which and one of the paths 
is typically not possible to type on a keyboard.

 - Mac OS X clients cannot interoperate with non-OSX clients when paths 
contain composed characters (added by a non-OSX client). The working copies are 
broken directly after checkout/update on OSX. Tracked by: 
http://subversion.tigris.org/issues/show_bug.cgi?id=2464



Differences to case-sensitivity
===

 - NFC/NFD look the same when rendered on screen.
 - Different case can be controlled with the keyboard, while unicode 
composition is more difficult.
 - Most modern case-insensitive file systems are case-preserving, i.e. they do 
not normalize to a preferred form and always return the same form that was 
stored. Normalizing file systems do not preserve the paths.



Similarities to case-sensitivity
===

 - If two Unicode strings differ only by letter case, on some computer systems 
they refer to the same file, while on other systems they refer to different 
files.  The same applies if two Unicode strings differ only by composition. 
The rules are set by each file system.

 - Subversion interoperates with different systems.  When two file names that 
differ only by letter case are transferred from a 
case-sensitive system to a case-insensitive system, they will collide and 
Subversion should handle this in some friendly way. The same applies if two 
file names differ only by composition.
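
The parallel between the two kinds of collision can be sketched with two 
comparison keys. The function names case_key and composition_key are 
hypothetical. Real file systems apply their own case-folding and normalization 
rules, which may differ from Python's str.casefold() and NFC.

```python
import unicodedata


def case_key(name):
    # Comparison key for a case-insensitive file system (approximation).
    return name.casefold()


def composition_key(name):
    # Comparison key for a composition-insensitive (normalizing) file system.
    return unicodedata.normalize("NFC", name)


# Names differing only by case collide under the case key.
print(case_key("README.txt") == case_key("readme.txt"))                # True
# Names differing only by composition collide under the composition key.
print(composition_key("caf\u00e9") == composition_key("cafe\u0301"))   # True
# Case folding alone does not detect a composition difference.
print(case_key("caf\u00e9") == case_key("cafe\u0301"))                 # False
```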



To Normalize or Not to Normalize
===

Whether or not to normalize within a Subversion repository (server-side) has 
been debated. The note (unicode-composition-for-filenames) considers 
normalization to NFC to be the long term (2.x) solution. This feature is 
referred to here as "repository normalization".

There are implementation advantages with normalized paths which can simplify 
comparisons and storage. 

There are also reasons not to normalize:

 - A file system is generally expected to give back exactly what was stored, or 
refuse up-front. HFS+ has been criticized for not living up to this 
expectation, which is also the reason the Svn WC has issues on HFS+. Subversion 
can be considered a sort of file system, and could therefore be expected to 
live up to this expectation.

 - Compatibility is a high priority for Subversion. Introducing 
normalization, translation or similar is likely to introduce compatibility 
issues, now or later. There is a principle that Subversion should not be a 
limiting factor or impose undue limitations on allowed characters, file names 
etc.

 - Introducing normalization tends to complicate the upgrade process, 
especially for repositories that contain "normalized-name collisions". This is 
one of the reasons this very issue has not been addressed.

However, there is very little reason to allow the creation of new 
"normalized-name collisions". There are no known use-cases for creating 
multiple files in the same directory that would have identical normalized 
paths. Subversion should preferably refuse such add operations as early as 
possible, at the latest during commit. This feature is referred to here as 
"uniqueness normalization".



Solution Overview
===

There are 2 components of this solution, one server side and one client side. 
These can be addressed individually, which is an important requirement for 
Subversion 1.x interoperability between client and server versions.

This solution does not normalize paths in the repository. Paths are only 
normalized for the purpose of comparisons.


Server Changes
===

The Subversion server should no longer accept 'add':ing paths that cause 
"normalized-name collisions". The comparison with existing paths (and other 
paths in the same txn) should be performed in normalized form. However, the 
paths created in the repository will keep the form input by the client.
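
The intended behavior can be sketched as follows. This is not Subversion 
code: the names comparison_key and add_path are hypothetical, NFC is an 
arbitrary choice of comparison form, and a plain dict stands in for the set of 
paths already present in the repository and the txn.

```python
import unicodedata


def comparison_key(path):
    # The normalized form is used only as a lookup key; it is never stored
    # as the path itself.
    return unicodedata.normalize("NFC", path)


def add_path(existing, new_path):
    """Record new_path in `existing` (comparison key -> path as input).

    Raises ValueError if the path collides in normalized form with a path
    that is already present.
    """
    key = comparison_key(new_path)
    if key in existing:
        raise ValueError(
            "normalized-name collision: %r vs existing %r"
            % (new_path, existing[key])
        )
    existing[key] = new_path   # keep the form input by the client


paths = {}
add_path(paths, "trunk/cafe\u0301.txt")    # decomposed form, stored as is
add_path(paths, "trunk/cafe.txt")          # a different name, accepted
```

A later add_path(paths, "trunk/caf\u00e9.txt") would raise ValueError, while 
the first path remains stored in its original decomposed form.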

There could be a performance impact. [Need more data] However, the 'add' 
operation is not one of the most frequent ones, in a typical installation.

It is not possible to rely on client behavior. A Subversion server can be 
accessed via mod_dav_svn, and by older Subversion clients.

The desired server behavior can be accomplished with Subversion 1.7 or earlier 
using a pre-commit hook, but it is desirable to have "uniqueness normalization" 
as the future default behavior.
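
A rough sketch of the core of such a hook is shown below. It assumes the hook 
has already obtained the output lines of "svnlook changed -t TXN REPOS" 
(a status column followed by the path starting at the fifth character) decoded 
as UTF-8. The function names are hypothetical. This sketch only checks the 
paths added within one txn against each other; a complete hook would also 
have to compare them with existing siblings, e.g. taken from "svnlook tree".

```python
import unicodedata
from collections import defaultdict


def added_paths(changed_lines):
    """Extract added paths from `svnlook changed` style lines."""
    result = []
    for line in changed_lines:
        if line.startswith("A"):
            # Path starts after the 4-character status prefix; directories
            # are printed with a trailing slash, which is dropped here.
            result.append(line[4:].rstrip("\n").rstrip("/"))
    return result


def collisions(paths):
    """Return groups of paths that are identical in normalized (NFC) form."""
    groups = defaultdict(list)
    for path in paths:
        groups[unicodedata.normalize("NFC", path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


sample = [
    "A   trunk/docs/\n",
    "A   trunk/docs/caf\u00e9.txt\n",
    "A   trunk/docs/cafe\u0301.txt\n",
    "U   trunk/README\n",
]
found = collisions(added_paths(sample))
```

A hook script would print the colliding groups to stderr and exit with a 
non-zero status when found is non-empty, which causes the commit to be 
rejected.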



Client Changes
===

The Working Copy needs an abstraction between the repository path provided by 
the server and the actual file system path. This is required for normalizing 
file systems (HFS+) regardless of whether the Subversion server performs 
normalization to NFC (repository normalization) or just enforces "uniqueness 
normalization".

It might be more feasible to implement such an abstraction now in wc-ng than it 
was in svn 1.6 and earlier.

[This section needs input from someone more familiar with wc-ng]

Columns of interest in wc.db:

 - The repository path as stored on server: repos_path (e.g. 
"project/dir/file.txt")
 - The local path from WC root to node: local_relpath (e.g. "dir/file.txt")
 - The local path from WC root to node parent: parent_relpath (e.g. "dir")


An abstraction between the repository path and the file system path can be 
achieved by ensuring that there is a column in wc.db that contains the file 
system path in exactly the same form that the file system gives back. APIs in 
wc need to be extended to ensure that all interaction with the file system is 
performed with the file system path.


Alternative 1:

Redefine the existing column local_relpath to contain the path as stored in the 
file system. Code that currently relies on local_relpath being a substring of 
repos_path needs to be adjusted. E.g. a node might be considered switched when 
this condition is not met.


Alternative 2:

A new column, local_relpath_fs, is added that contains the path as stored in 
the file system. This column will be used on all systems to interact with the 
file system. Currently, the content of columns local_relpath and  
local_relpath_fs will be identical on all file systems except HFS+.
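
Alternative 2 could look roughly like the sketch below. The table is a 
heavily simplified, hypothetical stand-in for the real wc.db schema, the 
helper names are invented for this example, and plain NFD is assumed to 
approximate the form that HFS+ gives back.

```python
import sqlite3
import unicodedata


def fs_form(path, normalizing_fs):
    # Stand-in for "the form the file system gives back".
    # Assumption: NFD on a normalizing file system, unchanged otherwise.
    return unicodedata.normalize("NFD", path) if normalizing_fs else path


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE nodes (
        local_relpath    TEXT PRIMARY KEY,  -- form received from the server
        local_relpath_fs TEXT NOT NULL,     -- form used for file system access
        repos_path       TEXT NOT NULL
    )
    """
)


def record_node(repos_path, local_relpath, normalizing_fs):
    conn.execute(
        "INSERT INTO nodes VALUES (?, ?, ?)",
        (local_relpath, fs_form(local_relpath, normalizing_fs), repos_path),
    )


def relpath_for_fs_name(name_from_fs):
    """Map a name returned by a directory listing back to local_relpath."""
    row = conn.execute(
        "SELECT local_relpath FROM nodes WHERE local_relpath_fs = ?",
        (name_from_fs,),
    ).fetchone()
    return row[0] if row else None


# The server sent a composed name; the working copy is on a normalizing FS.
record_node("project/dir/caf\u00e9.txt", "dir/caf\u00e9.txt", True)
```

With this row in place, relpath_for_fs_name("dir/cafe\u0301.txt") returns the 
composed local_relpath, so a name read from the file system can be matched to 
its node without changing the path stored for the repository.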


Normalized uniqueness:

Path uniqueness should be checked in normalized form during add operations, in 
order to prevent "normalized-name collisions" as early as possible. It might be 
acceptable to identify this later during commit, since it is a quite rare 
condition.

When an existing "normalized-name collision" arrives in a Working Copy on HFS+ 
via checkout or update, there will be a uniqueness issue in the column 
local_relpath/local_relpath_fs and a situation somewhat similar to an 
obstruction. This should be communicated in some friendly way, similar to 
conflicts on case-insensitive file systems.



Use Cases
===

This change will only affect use cases which rely on creating paths that look 
like duplicates but use different Unicode composition. It is highly unlikely 
that anyone is relying on this.



Legacy Data
===

 - This change will cause no problems when upgrading existing repositories even 
if they contain "normalized-name collisions".

 - If "normalized-name collisions" exist in HEAD, a check out on Mac OS X will 
still fail after an upgrade but potentially with a better error message. This 
is an issue that is very similar to case-collisions on case-insensitive file 
systems. The detection code is similar and the same friendly error message can 
potentially be used.

 - These "normalized-name collisions" can be resolved in HEAD via "svn mv 
SRC_URL DST_URL". Historical revisions will still be difficult to check out 
from Mac OS X.

 - Working Copies will be upgraded in the same way as any other wc-ng upgrade 
with SQL schema changes. Working Copies on Mac OS X that are broken before 
upgrade might require a fresh check out.




/Thomas Å.
