Re: Let's discuss about unicode compositions for filenames!

2012/2/3 Peter Samuelson pe...@p12n.org:

 [Hiroaki Nakamura]
 In option (2), we do n12n on all clients on all platforms, and we
 include web_dav_svn in clients. So we convert all input paths to
 the server encoding, which is NFC.

 Indeed.  But the very concept of a server encoding means we are
 involving the server side.  Which invokes a lot of difficult questions
 like what about existing 1.x clients, what about existing checkouts
 and what about existing repositories.

Svn 1.7 forces me to upgrade existing 1.6 working copies.
So we can let users upgrade working copies.

Existing repositories, I think it would be better to convert them too using
svndump/svnload. And we change svnload to convert filenames to NFC.
However in reality we cannot force users to convert every existing repository.
So we need to change servers too. When servers read filenames
from repositories, they first convert to NFC and then process commands.

We also need to change servers in order to deal with existing 1.x clients.
When web_dav_svn and svnserve receive filenames from clients, they must
first convert the filenames to NFC.


 By proposing a client-only solution, I hope to avoid _all_ those
 questions.  (Except what about existing checkouts - there would be a
 wc upgrade of some sort.)  No recoding of existing repository paths is
 necessary.  In my proposal, the only recoding that is done is on the
 client side, on a platform that does not support the original pathname
 (e.g., OS X HFS+ with a NFC path).

 All problems in computer science can be solved by another level of
 indirection.

 Mostly true.  I can't tell if you quoted that as a point of support for
 my proposal, or as a point against it.

 Yes, with the mapping table, you can mangle filenames. However I
 think it is too complex for novice users. Users must care about the
 original filenames and the mangled filenames all the time.

 Well, there is no need to use this same proposal to also work around
 other filesystem limitations like avoiding : on Windows.  It is just
 something that becomes _possible_.

 Also you must adapt all clients to use the mapping table. That is
 whole lot of work! Maybe you would create another version control
 system.

 By all clients I guess you mean all Subversion client libraries.
 Yes, that is the proposal.  It would touch libsvn_wc and probably
 libsvn_client and libsvn_subr.

Yes, like I said above, "clients" actually includes components that
run on servers, like web_dav_svn; it should be read as "any components
that access repositories and working copies".

We also need to change svnserve. So we'd better say all servers and clients.


 So even if Windows NTFS can have the same abstract filenames in both
 NFC and NFD simultaneously, we should avoid that, and we should only
 allow NFC filenames.

 This could be done, if we wanted to go to the trouble.  Or we could
 just say use a pre-commit hook, like we tell people who want to
 prevent README and Readme in a single dir.  It is not the same level of
 interoperability problem as the one this thread is about.

If you think in analogy to ASCII uppercase and lowercase examples,
you miss the point. Please reread the Unicode Standard Annex #15
UAX #15: Unicode Normalization Forms
http://unicode.org/reports/tr15/

  Canonical equivalence is a fundamental equivalency between
  characters or sequences of characters that represent the same
  abstract character, and when correctly displayed should always
  have the same visual appearance and behavior. Figure 1 illustrates
  this equivalence.

So, filenames in NFC and NFD are equivalent, the same.
README and readme are different.
NFC/NFD and uppercase/lowercase are two different stories.

Should we allow the same filenames in one directory?
Of course not! If we allow that, we get into real trouble and
confusion.

And OS X HFS+ does not allow that. So to support interoperability
with OS X, we should not allow it in Subversion either.

-- 
)Hiroaki Nakamura) hnaka...@gmail.com

Re: Let's discuss about unicode compositions for filenames!

2012-02-02 Thread Branko Čibej

On 02.02.2012 20:59, Hiroaki Nakamura wrote:
 So we need to change servers too. When servers read filenames from
 repositories, they first convert to NFC and then process commands.

That won't work. You have to do the initial lookup in a
normalization-agnostic way, and neither BDB nor FSFS makes that possible
without scanning whole directories.

 We also need to changes servers in order to deal with existing 1.x
 clients. We convert filenames to NFC when web_dav_svn and svnserve
 receive filenames from clients, they must first convert filenames to NFC. 

Actually, libsvn_repos; this has to work with ra_local as well. And it
would have to maintain a table for converting results back to how the
client knows them. This is the hard part to get right; imagine:

$ svn up
U čombe

How will the server know if the client represents the č in the same
encoding that the now-normalizing server sends? Will the client scan the
directory and normalize the names to find the local file that needs
updating?
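
The directory scan asked about here could look roughly like the following
Python sketch. The function name find_on_disk is made up for illustration
and is not a Subversion API; the example also shows where the scan becomes
ambiguous when both forms exist side by side.

```python
import unicodedata


def nfc(name):
    """Return the NFC form of a filename."""
    return unicodedata.normalize("NFC", name)


def find_on_disk(wanted, entries):
    """Return every directory entry whose NFC form equals that of `wanted`.

    `entries` would normally come from os.listdir(parent_dir).
    More than one result means an NFC/NFD collision on disk.
    """
    key = nfc(wanted)
    return [entry for entry in entries if nfc(entry) == key]


# The server sends the precomposed name, the local file is decomposed.
listing = ["README", "c\u030combe"]
matches = find_on_disk("\u010dombe", listing)

# Both spellings present in one directory: the lookup is ambiguous.
collision = find_on_disk("\u010dombe", ["\u010dombe", "c\u030combe"])
```

Note that the cost is one normalisation per directory entry for every
lookup, which is the "scanning whole directories" concern raised above.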

-- Brane

Re: Let's discuss about unicode compositions for filenames!

2012-02-02 Thread Daniel Shahaf

Branko Čibej wrote on Thu, Feb 02, 2012 at 21:03:47 +0100:
 On 02.02.2012 20:22, Peter Samuelson wrote:
  [Hiroaki Nakamura]
  In option (2), we do n12n on all clients on all platforms, and we
  include web_dav_svn in clients. So we convert all input paths to
  the server encoding, which is NFC.
  Indeed.  But the very concept of a server encoding means we are
  involving the server side.  Which invokes a lot of difficult questions
  like what about existing 1.x clients, what about existing checkouts
  and what about existing repositories.
 
  By proposing a client-only solution, I hope to avoid _all_ those
  questions.
 
 Can't see how that works, unless you either make the client-side
 solution optional, create a mapping table, or make name lookup on the
 server agnostic to character representation. I can't envision how any of
 those solutions would work all the time.
 
 It would be nice if we could normalize paths in the repository without
 having to perform a dump/reload cycle, but I don't know how that would
 work in FSFS

It won't.  Changing the encoding increases the length (in bytes) of the
string (in the dirents hash, for example), and thus changes the offsets
of the node-revs that are later in the file --- to which subsequent
revisions, and the id's of those node-revs, refer.

 (BDB would be fairly easy, modulo collisions, but I don't
 think those are very likely).
 
 -- Brane

Re: Let's discuss about unicode compositions for filenames!

2012/2/3 Branko Čibej br...@xbc.nu:
 On 02.02.2012 20:59, Hiroaki Nakamura wrote:
 So we need to change servers too. When servers read filenames from
 repositories, they first convert to NFC and then process commands.

 That won't work. You have to do the initial lookup in a
 normalization-agnostic way, and neither BDB nor FSFS makes that possible
 without scanning whole directories.

OK, then do scan whole directories. If you do not want that,
we force users to convert existing repositories. I think we must
choose one of the two. Tough choices, but I cannot think of a
better way at least right now.


 We also need to changes servers in order to deal with existing 1.x
 clients. We convert filenames to NFC when web_dav_svn and svnserve
 receive filenames from clients, they must first convert filenames to NFC.

 Actually, libsvn_repos; this has to work with ra_local as well. And it
 would have to maintain a table for converting results back to how the
 client knows them. This is the hard part to get right; imagine:

    $ svn up
    U čombe

 How will the server know if the client represents the č in the same
 encoding that the now-normalizing server sends? Will the client scan the
 directory and normalize the names to find the local file that needs
 updating?

Yes, without upgrading working copies, we must do that.

If there is a better way, I would like to know.
Please share a better solution if you have one, all.

-- 
)Hiroaki Nakamura) hnaka...@gmail.com

Re: Let's discuss about unicode compositions for filenames!

2012-02-02 Thread Peter Samuelson


[Hiroaki Nakamura]
 Existing repositories, I think it would be better to convert them too using
 svndump/svnload. And we change svnload to convert filenames to NFC.
 However in reality we cannot force users to convert every existing repository.

Also note that if you convert a repository (via dump/load or whatever),
all working copies based on the repository are invalidated and need to
be re-checked-out.  Avoiding _that_ problem would be really hairy, I
think, very similar to the sort of work that would be needed to support
obliterate without losing working copies.

 We also need to changes servers in order to deal with existing 1.x
 clients.  We convert filenames to NFC when web_dav_svn and svnserve
 receive filenames from clients, they must first convert filenames to
 NFC.

You keep saying what we must do on the server side.  I propose
something that is purely on the client side.  It will solve the OS X /
non-OS X interoperability problem.  It will not solve every problem
ever faced by a Subversion user.  That's a job for 2.0.

 Yes, like I said above, clients actually includes components that
 run on servers like web_dav_svn, and it should read as any components
 that access to repositories and working copies.

No.  By clients I mean components that run on the client side.  If my
proposal had required changes to mod_dav_svn, I would not have said
strictly client-side.  I do not propose any change to mod_dav_svn,
svnserve, svnadmin, libsvn_repos, libsvn_fs, the repository data, or
anything else on the server side.

 If you think in analogy to ASCII uppercase and lowercase examples,
 you miss the point. Please reread the Unicode Standard Annex #15
 UAX #15: Unicode Normalization Forms
 http://unicode.org/reports/tr15/

Thanks, I've read it.  The analogy stands.  We could prevent NFC/NFD
collisions as an additional service to users, something we have not
done for the past 10 years.  This would be along the lines of
preventing users from shooting themselves in the foot.

The actual _software_ problem that is solved by preventing collisions
is the same as the software problem solved by preventing upper/lower
case collisions: certain clients are unable to check out a folder that
has such collisions.  (Windows clients, in the case of upper/lower
collisions; OS X clients, in the case of NFC/NFD collisions.)

I think we are talking past each other.  You are trying to solve two
distinct but related problems: 1. OS X client-side confusion when faced
with a non-NFD repository path; 2. NFC/NFD collisions.  I am only
trying to solve problem 1.  I'm ignoring problem 2 for two reasons:

(a) Problem 2 requires server-side work and complex compatibility /
upgrade scenarios (dump/load, re-check-out all wcs, etc).

(b) Problem 2 can be worked around, for new repositories (or
repositories with no existing collisions), with a pre-commit hook.

...neither of which are true for my proposal to solve problem 1.

So long as you continue to insist that, to solve problem 1, we must
also solve problem 2, I'm pretty sure we will never come to any
agreement.

Peter

Re: Let's discuss about unicode compositions for filenames!

2012/2/3 Daniel Shahaf danie...@elego.de:
 Branko Čibej wrote on Thu, Feb 02, 2012 at 21:03:47 +0100:
 On 02.02.2012 20:22, Peter Samuelson wrote:
  [Hiroaki Nakamura]
  In option (2), we do n12n on all clients on all platforms, and we
  include web_dav_svn in clients. So we convert all input paths to
  the server encoding, which is NFC.
  Indeed.  But the very concept of a server encoding means we are
  involving the server side.  Which invokes a lot of difficult questions
  like what about existing 1.x clients, what about existing checkouts
  and what about existing repositories.
 
  By proposing a client-only solution, I hope to avoid _all_ those
  questions.

 Can't see how that works, unless you either make the client-side
 solution optional, create a mapping table, or make name lookup on the
 server agnostic to character representation. I can't envision how any of
 those solutions would work all the time.

 It would be nice if we could normalize paths in the repository without
 having to perform a dump/reload cycle, but I don't know how that would
 work in FSFS

 It won't.  Changing the encoding increase the length (in bytes) of the
 string (in the dirents hash, for example), and thus change the offsets
 of the node-revs that are later in the file --- to which subsequent
 revisions, and the id's of those node-revs, refer.

Converting from NFD to NFC does not increase the length.
The length will be the same or smaller, not larger.

Here I quote from
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
   The proposed internal 'normal form' should be NFC, if only if
   it were because it's the most compact form of the two:  when
   allocating memory to store a conversion result, it won't be
   necessary (ever) to allocate more than the size of the input buffer.
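
The size claim can be spot-checked for a few sample names. The following
is a minimal Python sketch using only the standard unicodedata module; the
sample names are arbitrary, and checking examples is of course not a proof.

```python
import unicodedata


def utf8_len(text, form):
    """Byte length of `text` after normalising to `form` and encoding as UTF-8."""
    return len(unicodedata.normalize(form, text).encode("utf-8"))


# Arbitrary sample names: ASCII, Latin with diacritics, Hangul, kana.
samples = ["README", "\u010dombe", "\u00e4\u00f6\u00fc.txt", "\ud55c\uae00", "\u304c"]

# Map each sample to (NFD byte length, NFC byte length).
sizes = {s: (utf8_len(s, "NFD"), utf8_len(s, "NFC")) for s in samples}

# For these samples the NFC form is never longer than the NFD form.
never_longer = all(nfc_len <= nfd_len for nfd_len, nfc_len in sizes.values())
```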


-- 
)Hiroaki Nakamura) hnaka...@gmail.com

Re: Let's discuss about unicode compositions for filenames!

2012-02-02 Thread Peter Samuelson


 On 02.02.2012 20:22, Peter Samuelson wrote:
  By proposing a client-only solution, I hope to avoid _all_ those
  questions.

[Branko Cibej]
 Can't see how that works, unless you either make the client-side
 solution optional, create a mapping table, or make name lookup on the
 server agnostic to character representation.

Yes, I did propose a mapping table in wc.db.

Old clients on OS X would continue to be confused; the solution is to
upgrade.

New clients on OS X (and elsewhere) would maintain a mapping table
between 'repository path' and 'local filesystem representation'.  We
already have these two concepts, given we support non-UTF-8 client
encodings.
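
As a rough illustration of the idea, here is a Python sketch of such a
mapping table using an in-memory SQLite database. The table name path_map,
its columns, and the helper functions are all hypothetical; nothing here
reflects the actual wc.db schema.

```python
import sqlite3
import unicodedata

db = sqlite3.connect(":memory:")
# Hypothetical table: one row per versioned path.
# SQLite compares TEXT byte-wise by default, so NFC and NFD stay distinct.
db.execute(
    "CREATE TABLE path_map ("
    " repos_path TEXT PRIMARY KEY,"
    " local_path TEXT NOT NULL UNIQUE)"
)


def record_checkout(repos_path, disk_form):
    """Store the repository path next to the form the filesystem will use."""
    local_path = unicodedata.normalize(disk_form, repos_path)
    db.execute("INSERT INTO path_map VALUES (?, ?)", (repos_path, local_path))
    return local_path


def repos_path_for(local_path):
    """Map a name read back from disk to the original repository path."""
    row = db.execute(
        "SELECT repos_path FROM path_map WHERE local_path = ?", (local_path,)
    ).fetchone()
    return row[0] if row else None


# On OS X the file lands on disk in NFD, whatever the repository used.
on_disk = record_checkout("\u010dombe", "NFD")
original = repos_path_for(on_disk)
```

The repository path is never recoded; only the local column changes with
the platform, which is why no server-side change is needed.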

 It would be nice if we could normalize paths in the repository
 without having to perform a dump/reload cycle, but I don't know how
 that would work in FSFS

Indeed, it's a problem similar to obliterate, and carries the same risk
of invalidating every wc.  This is why I don't think it's a reasonable
path to take in the short term (1.x).

Peter

Re: Let's discuss about unicode compositions for filenames!

2012-02-02 Thread Daniel Shahaf

Hiroaki Nakamura wrote on Fri, Feb 03, 2012 at 05:33:02 +0900:
 2012/2/3 Daniel Shahaf danie...@elego.de:
  Branko Čibej wrote on Thu, Feb 02, 2012 at 21:03:47 +0100:
  On 02.02.2012 20:22, Peter Samuelson wrote:
   [Hiroaki Nakamura]
   In option (2), we do n12n on all clients on all platforms, and we
   include web_dav_svn in clients. So we convert all input paths to
   the server encoding, which is NFC.
   Indeed.  But the very concept of a server encoding means we are
   involving the server side.  Which invokes a lot of difficult questions
   like what about existing 1.x clients, what about existing checkouts
   and what about existing repositories.
  
   By proposing a client-only solution, I hope to avoid _all_ those
   questions.
 
  Can't see how that works, unless you either make the client-side
  solution optional, create a mapping table, or make name lookup on the
  server agnostic to character representation. I can't envision how any of
  those solutions would work all the time.
 
  It would be nice if we could normalize paths in the repository without
  having to perform a dump/reload cycle, but I don't know how that would
  work in FSFS
 
  It won't.  Changing the encoding increase the length (in bytes) of the
  string (in the dirents hash, for example), and thus change the offsets
  of the node-revs that are later in the file --- to which subsequent
  revisions, and the id's of those node-revs, refer.
 
 Changes from NFD to NFC does not increase the length.
 The length will be same or smaller, not larger.
 

If the conversion is guaranteed to be monotone non-increasing (in
length) then I believe it could be made to work in place.

As to keeping concurrent readers and preexisting working copies sane ---
for now I'm LAAEFTR'ing that.

 Here I quote from
 http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
The proposed internal 'normal form' should be NFC, if only if
it were because it's the most compact form of the two:  when
allocating memory to store a conversion result, it won't be
necessary (ever) to allocate more than the size of the input buffer.
 
 
 -- 
 )Hiroaki Nakamura) hnaka...@gmail.com

Re: Let's discuss about unicode compositions for filenames!

2012-02-02 Thread Branko Čibej

On 02.02.2012 21:28, Hiroaki Nakamura wrote:
 2012/2/3 Branko Čibej br...@xbc.nu:
 On 02.02.2012 20:59, Hiroaki Nakamura wrote:
 So we need to change servers too. When servers read filenames from
 repositories, they first convert to NFC and then process commands.
 That won't work. You have to do the initial lookup in a
 normalization-agnostic way, and neither BDB nor FSFS makes that possible
 without scanning whole directories.
 OK, then do scan whole directories.

But we can't make old clients do that. So ... by normalizing paths that
come from the server, we're effectively killing off all old clients that
would otherwise work with said servers.

-- Brane

Re: Let's discuss about unicode compositions for filenames!

2012/2/3 Peter Samuelson pe...@p12n.org:

[Hiroaki Nakamura]
Existing repositories, I think it would be better to convert them too using
svndump/svnload. And we change svnload to convert filenames to NFC.
However in reality we cannot force users to convert every existing
repository.

Also note that if you convert a repository (via dump/load or whatever),
all working copies based on the repository are invalidated and need to
be re-checked-out. Avoiding _that_ problem would be really hairy, I
think, very similar to the sort of work that would be needed to support
obliterate without losing working copies.

We also need to changes servers in order to deal with existing 1.x
clients. We convert filenames to NFC when web_dav_svn and svnserve
receive filenames from clients, they must first convert filenames to
NFC.

You keep saying what we must do on the server side. I propose
something that is purely on the client side. It will solve the OS X /
non-OS X interoperability problem. It will not solve every problem
ever faced by a Subversion user. That's a job for 2.0.

OK. When I started this thread, I assumed we would focus on a
long term solution for 2.x. That's because the short term solution option (4)
described in
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
seems too difficult and complex to me.

But if a modification to my proposal will fit in the short term (1.x),
I will gladly modify it.

Yes, like I said above, clients actually includes components that
run on servers like web_dav_svn, and it should read as any components
that access to repositories and working copies.

No. By clients I mean components that run on the client side. If my
proposal had required changes to mod_dav_svn, I would not have said
strictly client-side. I do not propose any change to mod_dav_svn,
svnserve, svnadmin, libsvn_repos, libsvn_fs, the repository data, or
anything else on the server side.

If you think in analogy to ASCII uppercase and lowercase examples,
you miss the point. Please reread the Unicode Standard Annex #15
UAX #15: Unicode Normalization Forms
http://unicode.org/reports/tr15/

Thanks, I've read it. The analogy stands. We could prevent NFC/NFD
collisions as an additional service to users, something we have not
done for the past 10 years. This would be along the lines of
preventing users from shooting themselves in the foot.

The actual _software_ problem that is solved by preventing collisions
is the same as the software problem solved by preventing upper/lower
case collisions: certain clients are unable to check out a folder that
has such collisions. (Windows clients, in the case of upper/lower
collisions; OS X clients, in the case of NFC/NFD collisions.)

Yes, I agree with that.

I think we are talking past each other. You are trying to solve two
distinct but related problems: 1. OS X client-side confusion when faced
with a non-NFD repository path; 2. NFC/NFD collisions. I am only
trying to solve problem 1. I'm ignoring problem 2 for two reasons:

(a) Problem 2 requires server-side work and complex compatibility /
upgrade scenarios (dump/load, re-check-out all wcs, etc).

(b) Problem 2 can be worked around, for new repositories (or
repositories with no existing collisions), with a pre-commit hook.

...neither of which are true for my proposal to solve problem 1.

So long as you continue to insist that, to solve problem 1, we must
also solve problem 2, I'm pretty sure we will never come to any
agreement.

OK. So how about changing my proposal like:
(1) No server modification. Just modify svn_path_cstring_to_utf8 only.
(2) Let users install a pre-commit hook which rejects any non-NFC filenames.
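
A pre-commit hook along the lines of (2) might be sketched in Python as
follows. This is only a sketch under assumptions: it assumes that
`svnlook changed -t TXN REPOS` prints a four-column status prefix before
each path, and that svnlook emits UTF-8 (hooks run with an almost empty
environment, so a UTF-8 locale would likely have to be set explicitly).
The function names are made up for illustration.

```python
import subprocess
import sys
import unicodedata


def non_nfc_paths(paths):
    """Return the paths that are not already in NFC."""
    return [p for p in paths if unicodedata.normalize("NFC", p) != p]


def changed_paths(repos, txn):
    """List the paths touched by a transaction, via `svnlook changed`."""
    out = subprocess.run(
        ["svnlook", "changed", "-t", txn, repos],
        check=True,
        capture_output=True,
    ).stdout.decode("utf-8")  # assumption: svnlook output is UTF-8
    # Assumption: each line is a 4-column status prefix followed by the path.
    return [line[4:] for line in out.splitlines() if line]


def main(argv):
    """Hook entry point: argv is [script, REPOS, TXN]; non-zero rejects."""
    repos, txn = argv[1], argv[2]
    bad = non_nfc_paths(changed_paths(repos, txn))
    for path in bad:
        sys.stderr.write("Filename is not in NFC: %r\n" % path)
    return 1 if bad else 0
```

An actual hook script would end with `sys.exit(main(sys.argv))`.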

In this way, we only need to change one function. The modification is just like
the original OS X unicode path patch:
utf8precompose_macosx_2.patch
http://subversion.tigris.org/nonav/issues/showattachment.cgi/813/utf8precompose_macosx_2.patch
in
http://subversion.tigris.org/issues/show_bug.cgi?id=2464

The only difference between the original patch and mine will be that mine
uses utf8proc, so that we can use it on all platforms: Mac OS X, Windows
and Linux.

--
)Hiroaki Nakamura) hnaka...@gmail.com

Re: Let's discuss about unicode compositions for filenames!

2012-01-31 Thread Peter Samuelson


[reordering the conversation flow slightly]

  [Peter Samuelson]
  That's the implementation I would like to see, to be honest.  Start
  with the observation that we can treat Mac OS X NFD paths as a
  client character encoding.  Now observe that it is lossy.  But
  ... almost all non-Unicode client charsets are equally lossy, for
  exactly the same reason!

[Branko Cibej]
 I don't see what you mean by lossy though. NFD and NFC can
 represent exactly the same set of characters, it's just that the
 representations of some of them are different.

By lossy I just mean that if you convert to UTF-8 NFD, you can't
reliably convert _back_ to the original bytes.  I'm assuming here that
we continue to do _no_ n11n on the server side - pathnames from
libsvn_(ra|repos|fs) are just UTF-8 with unspecified n11n.  Thus, if
the client encoding is UTF-8 NFD, you can't reliably convert that to
the server encoding.

And this is also true of most legacy (non-Unicode) encodings: they know
nothing about Unicode's n11n forms, so they are lossy in the same
way: you can't reliably take a pathname in, e.g., ISO-8859-1, and
convert to the encoding found in the repository, because you don't know
the n11n form used by the original committer.

This is why I suggested the mapping table in wc.db.

Actually, the fact that the mapping table works around the inherent
lossiness of character encoding conversion suggests that it _could_, in
the future, also account for lossiness for other reasons.  If we
wished, we could have libsvn_wc mangle checked-out filenames on
platforms with arbitrary limitations - escaping  and : characters
on Windows, e.g. - using this same mechanism.  Even if the conversion
is lossy, the mapping table in wc.db knows the original filename.  Of
course you couldn't _create_ filenames with platform limitations on the
same platform, but being able to check out the file at all is an
improvement over today.  Probably 'svn status' would show some
indication that a name has been mangled in a way users would actually
care about (i.e., not just NFC/NFD).

  The implementation on OS X might be a bit hairy, if there isn't an
  easy way to retrieve the real pathname of the file you just
  created.  Anywhere else, we just store the pathname we just
  calculated.

 Afaik the OSX API normalizes everything to NFD automagically. So at
 least on that platform there's no chance of having more than one form
 for the same filename at the same time. Likewise on Windows, which
 normalizes to NFC.

Right.  The question is, if libsvn_wc tells OS X to store a given path,
with unknown n11n, is there an easy way to retrieve the pathname that
was _actually_ stored on disk?  That's what I mean by might be a bit
hairy.  It sounds like the thing to do on OS X is for libsvn_wc to
pre-normalize to NFD before writing the file, and just assume the OS
will (re-)normalize to the same byte array.
-- 
Peter Samuelson | org-tld!p12n!peter | http://p12n.org/

Re: Let's discuss about unicode compositions for filenames!

On Sun, Jan 29, 2012 at 07:38:44PM +0900, Hiroaki Nakamura wrote:
 Hi folks!
 
 I read the note about unicode compositions for filenames
 http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
 and would like to drive the discussion.

Hi,

I am very happy to hear that you want to work towards getting this
problem fixed. Thank you for your help!

I've just re-read the unicode-composition-for-filenames notes.
I think they are a bit outdated. For instance, they still talk about
the 1.6 working copy format. They also don't clearly explain the problems
with backwards compatibility we're facing here.

We won't be able to apply your patch as it is. The problem is that
it can break operation for some existing repositories and working
copies.

Generally, I think that writing code that implements a solution for
this problem is not hard, no matter what the solution is.
The real challenge lies in finding a solution that is backwards
compatible with existing repositories and working copies.

I will explain what I mean by giving examples below.
But first, let's recap the basic problem, if only so others can more
easily follow this discussion.

As you know, in Unicode, some characters can be represented in two distinct
ways: pre-composed form (NFC) and de-composed form (NFD).
For instance, the letter ä (a umlaut) can be represented by Unicode
code point 0x00E4 ( ä ), which is the pre-composed form, or by code
point 0x0061 ( a ) followed by code point 0x0308 ( ̈  ), which is the
de-composed form.
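
A minimal Python sketch of the two forms, using only the standard
unicodedata module:

```python
import unicodedata

precomposed = "\u00e4"   # a umlaut as a single code point
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS

# The two strings are different sequences of code points ...
assert precomposed != decomposed
# ... but each one normalises to the other form.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed

# Their UTF-8 encodings differ as well: 2 bytes versus 3 bytes.
nfc_bytes = precomposed.encode("utf-8")
nfd_bytes = decomposed.encode("utf-8")
```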

This is a basic property of Unicode. It simply contains both ways of
representing these characters in its character tables.
I.e. any byte-string representation of Unicode, be it UTF-8 or UTF-16,
must also be able to represent both ways of encoding such characters.
So when filenames are given in Unicode, a filename may contain any
combination of NFC and NFD characters.

Because Subversion never normalises filenames to one form or the other,
the space of all possible filenames in a Subversion repository or working
copy contains a large amount of redundancy. There are many filenames which
look the same to the user but differ in terms of the Unicode code points
used to represent them.

For instance, imagine a filename containing 3 a umlaut characters
and otherwise only characters from the ASCII set.
There are 8 (2^3) different ways of representing this filename in Unicode,
and hence 8 different UTF-8 byte strings which can be used in the repository
or working copy to represent what is, from the user's point of view,
the same filename.
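
The count can be reproduced with a short Python sketch. The filename
template is an arbitrary example with three a umlaut slots.

```python
import itertools
import unicodedata

NFC_A = "\u00e4"    # precomposed a umlaut
NFD_A = "a\u0308"   # decomposed a umlaut


def spellings(template, count):
    """Yield every NFC/NFD mix for a template with `count` '{}' slots."""
    for combo in itertools.product((NFC_A, NFD_A), repeat=count):
        yield template.format(*combo)


names = list(spellings("{}b{}c{}.txt", 3))

# Eight distinct code point sequences and eight distinct UTF-8 byte strings ...
unique_names = set(names)
unique_bytes = {name.encode("utf-8") for name in names}
# ... that all collapse to a single name once normalised.
normalised = {unicodedata.normalize("NFC", name) for name in names}
```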

The problem we have on Mac OS X is that when we write any of these
8 different byte strings to the filesystem to name the file, and later
read the filename back from the filesystem (e.g. by opening the parent
directory and asking for a list of files it contains), we will always
receive the name with all a umlaut characters expanded to de-composed
form.

Now, in the working copy meta data (.svn/wc.db) we can use any of 8 forms
of the filename. If we don't use NFD for all characters in the filename,
the filename read from disk may fail to match any name stored in meta data.

Let's simplify the discussion a bit by assuming only two possible ways
of encoding a filename: One with all characters normalised to NFC, and
one with all characters normalised to NFD. We don't really need to
consider the mixed forms for the purpose of this discussion (though it
helps to keep in mind that they exist).

So let's talk about what would happen if we applied your patch.

Let's say I have a working copy which contains filenames normalised
to NFD, as is the case on Mac OS X. The server gets upgraded to a new
release of Subversion which contains your patch. This means the server
will now send all paths as NFC. Let's say there are changes made to a
file which has 3 a umlaut characters in its name. When I run 'svn update'
my client will try to find the NFC form of the name in its meta-data,
and fail to locate it because the file was stored as NFD.

So this means your patch will break compatibility with the working copy.
Therefore, we would need to provide an upgrade path for those working
copies. E.g. 'svn upgrade' could be modified to normalise all filenames
stored in the DB to NFC. Problem solved.

But now comes the next problem. Given a filename in NFC which we read from
meta data, how can we locate the corresponding on-disk file if its form
is not NFC? We could of course rename the on-disk file. Except this
won't work on Mac OS X unless we decide to use NFD encoding. So we could
decide to also use NFD everywhere -- but this would break as soon as
some other operating system decides to normalise to NFC, so it's not a
good solution. We could also open the parent directory, read all the
filenames within it, normalise them all, and then search the resulting
list. This works, except if a name exists twice, once in NFC form and once
in NFD form. We'd somehow have to solve the name collision in

Re: Let's discuss about unicode compositions for filenames!

On 30.01.2012 13:30, Stefan Sperling wrote:
 On Sun, Jan 29, 2012 at 07:38:44PM +0900, Hiroaki Nakamura wrote:
 Hi folks!

 I read the note about unicode compositions for filenames
 http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
 and would like to drive the discussion.
 Hi,

 I am very happy to hear that you want to work towards getting this
 problem fixed. Thank you for your help!

 I've just re-read the unicode-composition-for-filenames notes.
 I think they are a bit outdated. For instance, they still talk about
 the 1.6 working copy format. They also don't clearly explain the problems
 with backwards compatibility we're facing here.

[...]

We have to track two distinct normalizations, the internal (wc.db,
repos) form, most likely NFC, and the working copy, on-disk form. This
last will depend on the host system; most likely NFD on Mac OS and NFC
everywhere else. The on-disk normalization needs to happen before
conversion to the system encoding, of course.
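
The ordering matters because a decomposed name may not be representable
in a legacy encoding at all. A minimal Python sketch, in which the helper
names and the platform test are assumptions made for illustration:

```python
import unicodedata


def disk_form_for(platform):
    """Assumed on-disk form: NFD on Mac OS, NFC everywhere else."""
    return "NFD" if platform == "darwin" else "NFC"


def to_disk_bytes(internal_name, platform, encoding="utf-8"):
    """Normalise to the on-disk form first, then convert to the system encoding."""
    normalised = unicodedata.normalize(disk_form_for(platform), internal_name)
    return normalised.encode(encoding)


# Mac OS: the name goes to disk decomposed.
mac_bytes = to_disk_bytes("\u00e4.txt", "darwin")

# A Latin-1 system: normalising to NFC first makes the name encodable.
latin1_bytes = to_disk_bytes("a\u0308.txt", "linux", "iso-8859-1")

# Encoding the decomposed name directly fails: U+0308 is not in Latin-1.
try:
    "a\u0308.txt".encode("iso-8859-1")
    direct_encode_ok = True
except UnicodeEncodeError:
    direct_encode_ok = False
```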

libsvn_repos should do its own normalization to NFC because we can't
trust old clients to do it right.
Doing a dump/reload cycle should then be sufficient to upgrade the
repository, and probably the only viable one, too.

For working copies, we may want to teach svn upgrade to do the on-disk
and wc.db normalization dance. Clearly, client-side normalization
requires a WC format bump, but it need not be automatic.

We should probably give serious thought to using the compatibility
normalisation forms (NFKC and NFKD) and tell people who want proper
Unicode Roman numerals in their file names to think again. :)
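
What that would mean for Roman numerals, as a small Python sketch: the
canonical forms keep the dedicated code point, while the NFKC/NFKD forms
fold it into plain letters.

```python
import unicodedata

roman_four = "\u2163"  # ROMAN NUMERAL FOUR, a single code point

# Canonical normalisation leaves the numeral untouched ...
canonical = unicodedata.normalize("NFC", roman_four)
# ... while the K forms replace it with the ASCII letters "IV".
compat = unicodedata.normalize("NFKC", roman_four)
```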

-- Brane

Re: Let's discuss about unicode compositions for filenames!

2012-01-30 Thread Peter Samuelson


  [Stefan Sperling]
  We could also open the parent directory, read all the filenames
  within it, normalise them all, and then search the resulting
  list. This works, except if a name exists twice, once in NFC form
  and once in NFD form. We'd somehow have to solve the name collision
  in the filesystem.

[Markus Schaber]
 This sounds astonishingly similar to the lower/upper case problem of
 UN*X vs. Mac/Win.

There are similarities, but there are some important differences:

- We have to support Mac OS X, which stores all files in NFD.  In the
  upper/lowercase analogy, think of OS X as MS-DOS, which does not
  preserve mixed case at all but always represents files in uppercase.
  Subversion doesn't support MS-DOS and I hope we never need to.  MS
  Windows, OTOH, at least preserves the upper/lowercase distinction
  presented to it when you create a file.  Big difference.

  (I'm not saying OS X is like MS-DOS in other respects.  Just for the
  purpose of the NFC/NFD vs. upper/lower analogy.)

- Also, the Subversion platform has chosen to support files like README
  and Readme that conflict on Windows.  Our reasoning is "if you have
  users on Windows, don't do that."  Most solutions to the NFC/NFD
  problem will affect all platforms, not just one, and we probably
  can't just say "well, don't do that" - we'll need to actually prevent
  it (and somehow deal with existing clients, WCs, and repositories).

Because of those differences, my gut feeling is that we can't treat the
two issues in the same way.

Peter

Re: Let's discuss about unicode compositions for filenames!

2012-01-30 Thread Julian Foad

Let me just note some of the main similarities and differences between this 
issue of Unicode compositions and the issue of case-sensitivity in file names.
Differences:

  * NFC and NFD look the same when displayed, and most users haven't heard of
them and don't expect that a computer might treat two identical-looking
filenames as different.  With letter case, most users are aware that some
systems treat upper and lower case letters as the same while other systems
treat them as different, and they learn to behave according to the system's
rules.

  * The main case-insensitive file systems are case-preserving with no normal
form, whereas the main system that treats NFC and NFD as equivalent (MacOS)
chooses one form as the normal form and always normalizes the given file name
to that form.


Similarities:
  * If two Unicode strings differ only by letter case, on some computer systems 
they refer to the same file, while on other systems they refer to different 
files.  The rules are created by the 
designers of the systems, sometimes explicitly and sometimes 
implicitly.  Different parts of a system can have different rules.  The 
same applies if two Unicode strings differ only by composition. 

  * Subversion interoperates with different systems.  When two file names that
differ only by letter case are transferred from a case-sensitive system to a
case-insensitive system, they will collide and Subversion should handle this
in some friendly way.  The same applies if two file names differ only by
composition.

The differences are important, but the similarities are enough that we should 
be looking for some commonality in the implementation.

- Julian

Re: Let's discuss about unicode compositions for filenames!

2012-01-30 Thread Neels J Hofmeyr

On 01/30/2012 02:00 PM, Markus Schaber wrote:
 Maybe the best solution to this issue is a client-only solution, in a similar 
 way the case sensitivity problem is tackled.

Spinning the client-only thought a bit: Imagine a repos with a un*x user
adding a file called föö. Now an OSX user checks it out and gets the path
normalized to fo:o:.

1. wc.db on OSX's HFS+ file systems has to be aware that the file föö is
stored locally as fo:o:.

2. Whenever the OSX user types in fo:o:, the client must remember that the
repos expects the path for this node to be sent as föö, or the repos will
reply that the node does not exist. It could be solved with a translation
table between the repos and the client, but it remains quite a messy
endeavor, because:

3. New files may be added remotely at any given moment. For example, a path
'föö/bar' is checked out to OSX's fs and becomes 'fo:o:/bar'. Then someone
else adds 'fo:o:/bar' to the repos as well -- we now have two distinct 'bar'
files in the repos that share the same normalized path. Now OSX potentially
mistakes its checked-out 'föö/bar' for the later added 'fo:o:/bar', as that
matches the local path without any de-normalisations... The OSX client
basically has no chance to show conflicting files to its user
simultaneously. Data is hidden.


Thus, OSX admins will want the repository to be able to disallow having
multiple representations of the same normalized path -- not that easy to
achieve, in fact: before accepting a path name from the client, the repos
needs to either cycle through all possible unicode representations or needs
to normalize and compare all existing paths. Normalizing a client's path
before storing in the repos is a no-go, as the client won't be able to find its
nodes later. Probably the best option is to define a given normalization per
repos and then refuse commits that add non-normalized paths, like a
pre-commit hook.
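
A sketch of the per-repository policy check, assuming the admin picked NFC. The function name and the list of added paths are hypothetical; a real hook would have to obtain the added paths from the transaction, which is not shown here.

```python
import unicodedata


def non_normalized_paths(added_paths, form="NFC"):
    """Return the added paths that are not already in the repository's form."""
    return [p for p in added_paths if unicodedata.normalize(form, p) != p]


added = ["trunk/f\u00f6\u00f6", "trunk/fo\u0308o\u0308", "trunk/README"]
rejected = non_normalized_paths(added)
if rejected:
    # Refuse instead of silently rewriting, so the client still finds its nodes.
    print("commit refused, non-NFC paths:", [ascii(p) for p in rejected])
```

Because the check only refuses and never rewrites, paths already in the repository stay byte-for-byte what the client sent.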

On the other hand, an all-un*x shop must be allowed to operate the way they
always did. Their OSs only see byte sequences and don't mess around with
normalization. Say they want to have a folder of differently normalized
representations of the same file for testing *their own* code for unicode
robustness. They should be able to do that. (They obviously can't use OSX's
HFS+ for that, though.)

So, on top of client-only fixes, it would be good to have ways to enforce
certain repository behavior, based on self-imposed policy -- I mean, we
won't have The Subversion Normalization, each admin decides alone.

On 01/30/2012 01:30 PM, Stefan Sperling wrote:
 I am not convinced that it is impossible to fix.

Nicely put :)

~Neels

[[[
fred@mac $ svn co http://svn/repos
A foo
A bar
*** Warning:
You are checking out to an HFS+ file system. Your WC may not accurately
represent this revision. Consider using a different file system!
Continue? (Y/n) Y
A föö
*** File name collision detected. Skipping 'föo:'
*** File name collision detected. Skipping 'fo:ö'
*** File name collision detected. Skipping 'fo:o:'
A baz
fred@mac $
]]]
:P




Re: Let's discuss about unicode compositions for filenames!

On Tue, Jan 31, 2012 at 01:42:21AM +0900, Hiroaki Nakamura wrote:
 2012/1/30 Stefan Sperling s...@elego.de:
  My friend is not willing to upgrade to a new client version yet, which
  is fine because all 1.x releases of Subversion clients are supposed
  to be compatible with all 1.y releases of Subversion servers. He should
  not have to upgrade his client just because the server was upgraded.
 
  In his working copy, the file name is also in NFD form. When he
  talks to the server, the server provides the name in NFC. Because he
  is using the old client the client has no way of knowing how to map
  the NFC name to its local NFD file. So we've broken backwards
  compatibility for my friend.
 
 I think we cannot avoid this. So this patch is for 2.x, which may
 break backward compatibility.

If we are ever going to break compatibility, this issue will
certainly be addressed by normalising all paths as you suggest.
It was an unfortunate oversight that no NFD/NFC normalisation
was implemented in the first place :(

However, we really do not want to break compatibility at this time.
A solution that does not require us to break compatibility would
be much better. Nobody knows yet when the time for 2.x will come.

As far as I know, HFS+ is the only filesystem that has this problem.
It is possible to use other filesystems on Mac OS X as a workaround.
For example UFS, ext2, or NTFS (via FUSE).

I think Subversion's backwards compatibility is very important and
should not be jeopardised because of the behaviour of one filesystem.
 
 If we have two files of the same filenames, one in NFC, the other in NFD,
 it is really a headache for us to normalize all paths to NFC. The only thing
 we can do is just keep one file of the two and throw away the other.
 
 In reality, I think this is rare case. If we find this collision when
 upgrading repositories, we should stop and provide the way for users to
 choose which one to save.

I agree that this is probably a rare case in practice. However, we must
be prepared to handle it. Users who run into this problem can lose the
ability to use newer versions of Subversion to read their data.
This cannot be allowed to happen because we want to stay compatible.

  As you can see, there is a lot of complexity involved in fixing this
  issue. I hope you aren't discouraged by this. Someone will need to
  explore the details of these problems to fix this issue. I am not convinced
  that it is impossible to fix. We'll need to be very careful about backwards
  compatibility when making decisions. But there might be ways to achieve a
  satisfying solution nonetheless.
 
 Like other people say, we should prohibit the NFC/NFD same filename collision,
 not in the subversion system, but in operational rules, just don't do that.

So far, don't do that has been the answer to this entire problem.
We've been telling people if they want to use non-ASCII characters
with both Windows/Linux and Mac OS X clients they should not be using HFS+.

And mixing various unicode forms works fine today if the filesystem
used by the client supports this. The use case Neels contrived, where
developers want to test their code with unicode filenames in various
NFD/NFC forms, and check those test files into Subversion, should still
be supported.

 Then, the rest of the problem seems rather simple. Convert *all* input paths to NFC
 first, then do the work. All input means paths passed to servers from clients,
 paths obtained by servers from repositories, paths obtained by clients from
 working copies. Is that correct?

Yes, that is correct. Also, paths obtained by clients from the local
filesystem, and paths sent by servers to clients.

Re: Let's discuss about unicode compositions for filenames!

2012-01-30 Thread Johan Corveleyn

On Mon, Jan 30, 2012 at 8:10 PM, Stefan Sperling s...@elego.de wrote:
 On Tue, Jan 31, 2012 at 01:42:21AM +0900, Hiroaki Nakamura wrote:
 2012/1/30 Stefan Sperling s...@elego.de:

[ ... ]

 And mixing various unicode forms works fine today if the filesystem
 used by the client supports this. The use case Neels contrived, where
 developers want to test their code with unicode filenames in various
 NFD/NFC forms, and check those test files into Subversion, should still
 be supported.

Indeed.

Though this means that unconditional NFC (or whatever) normalization
in the working copy database is not an option, since it precludes
representing multiple forms at the same time in the wc. Except maybe
dependent on the (filesystem of the) client platform.

Of course, if a repository needs to support also checkouts to OSX/HFS+
clients, it should be configured to disallow multiple (conflicting)
forms to enter the repository. This can be done with a pre-commit
hook, similar to case-insensitive.py [1], which does the same for
case-clashing files.

(BTW, case-insensitive.py works by comparing incoming adds with the
list of directory entries of the corresponding directory within the
txn (comparing their normalized forms))
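
The same comparison applied to composition rather than case might be sketched as follows. This is not taken from case-insensitive.py; the function name and inputs are hypothetical, and the directory entries of the txn are passed in as a plain list.

```python
import unicodedata


def composition_clashes(txn_entries, added_names):
    """Map each added name to the other entries it equals after NFC."""
    by_key = {}
    for entry in txn_entries:
        key = unicodedata.normalize("NFC", entry)
        by_key.setdefault(key, set()).add(entry)

    clashes = {}
    for name in added_names:
        key = unicodedata.normalize("NFC", name)
        # The added name is itself in the txn listing, so leave it out.
        others = by_key.get(key, set()) - {name}
        if others:
            clashes[name] = sorted(others)
    return clashes


entries = ["f\u00f6\u00f6", "fo\u0308o\u0308", "bar"]
print(composition_clashes(entries, ["fo\u0308o\u0308", "bar"]))
```

Unlike a check that insists on one form, this accepts either form for a new name and only objects when a second form of an existing name shows up.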

-- 
Johan

[1] 
http://svn.apache.org/repos/asf/subversion/trunk/contrib/hook-scripts/case-insensitive.py

Re: Let's discuss about unicode compositions for filenames!

On 30.01.2012 21:00, Johan Corveleyn wrote:
 On Mon, Jan 30, 2012 at 8:10 PM, Stefan Sperling s...@elego.de wrote:
 On Tue, Jan 31, 2012 at 01:42:21AM +0900, Hiroaki Nakamura wrote:
 2012/1/30 Stefan Sperling s...@elego.de:
 [ ... ]

 And mixing various unicode forms works fine today if the filesystem
 used by the client supports this. The use case Neels contrived, where
 developers want to test their code with unicode filenames in various
 NFD/NFC forms, and check those test files into Subversion, should still
 be supported.
 Indeed.

 Though this means that unconditional NFC (or whatever) normalization
 in the working copy database is not an option, since it precludes
 representing multiple forms at the same time in the wc. Except maybe
 dependent on the (filesystem of the) client platform.

Are you seriously proposing that we /support/ such broken, hackish
nonsense? How do you expect users to tell the difference between file
names that look identical on the character level, but are not on the
code point level?

Supporting such hacks would only be a source of bug reports. I don't see
this as a desirable feature.

And as for doing the server-side checks in pre-commit hooks ... I guess
you could write a whole libsvn_repos implementation merely as a set of
pre-commit hooks, but who would want to? Hooks aren't intended for
implementing core functionality.

-- Brane

Re: Let's discuss about unicode compositions for filenames!

2012-01-30 Thread Johan Corveleyn

On Mon, Jan 30, 2012 at 9:09 PM, Branko Čibej br...@xbc.nu wrote:
 On 30.01.2012 21:00, Johan Corveleyn wrote:
 On Mon, Jan 30, 2012 at 8:10 PM, Stefan Sperling s...@elego.de wrote:
 On Tue, Jan 31, 2012 at 01:42:21AM +0900, Hiroaki Nakamura wrote:
 2012/1/30 Stefan Sperling s...@elego.de:
 [ ... ]

 And mixing various unicode forms works fine today if the filesystem
 used by the client supports this. The use case Neels contrived, where
 developers want to test their code with unicode filenames in various
 NFD/NFC forms, and check those test files into Subversion, should still
 be supported.
 Indeed.

 Though this means that unconditional NFC (or whatever) normalization
 in the working copy database is not an option, since it precludes
 representing multiple forms at the same time in the wc. Except maybe
 dependent on the (filesystem of the) client platform.

 Are you seriously proposing that we /support/ such broken, hackish
 nonsense? How do you expect users to tell the difference between file
 names that look identical on the character level, but are not on the
 code point level?

Huh? I'm not proposing anything. Hiroaki suggested (with his patch and
followup discussion) to do normalization to NFC in wc.db (or something
like that, so that all paths that enter wc.db are in NFC form). All
I'm saying is that this conflicts with the use case
Neels contrived, to represent multiple forms in the working copy.
Except if you allow some clients to do it, and others not (either by a
client-side option, or by platform-specific behavior).

 Supporting such hacks would only be a source of bug reports. I don't see
 this as a desirable feature.

No problem, I don't either. I'm not really participating in this
discussion (got enough discussions going on already :-)). Just wanted
to point out the conflict ...

 And as for doing the server-side checks in pre-commit hooks ... i guess
 you could write a whole libsvn_repos implementation merely as a set of
 pre-commit hooks, but who would want to? Hooks aren't intended for
 implementing core functionality..

Ok, then I also propose that case-insensitive.py should be folded into
core functionality (server-side option). That would be vastly better
of course, more performant etc ...

So I totally agree.

-- 
Johan

Re: Let's discuss about unicode compositions for filenames!

On Mon, Jan 30, 2012 at 09:09:22PM +0100, Branko Čibej wrote:
 Are you seriously proposing that we /support/ such broken, hackish
 nonsense? How do you expect users to tell the difference between file
 names that look identical on the character level, but are not on the
 code point level?

 Supporting such hacks would only be a source of bug reports. I don't see
 this as a desirable feature.

The question is why you would want to break it now that it works.
Because of HFS+? Isn't what HFS+ does just as broken if you think
about it? Why normalise paths in the filesystem if nobody else does it?

I'd prefer a universe where svn normalises anything to NFC from the
1.0 release onwards. Alas, we're in the wrong one.
Compare http://www.qwantz.com/index.php?comic=34 and following.
. o O (Where's my goatee?)

Re: Let's discuss about unicode compositions for filenames!

On 30.01.2012 21:29, Stefan Sperling wrote:
 On Mon, Jan 30, 2012 at 09:09:22PM +0100, Branko Čibej wrote:
 Are you seriously proposing that we /support/ such broken, hackish
 nonsense? How do you expect users to tell the difference between file
 names that look identical on the character level, but are not on the
 code point level?

 Supporting such hacks would only be a source of bug reports. I don't see
 this as a desirable feature.
 The question is why you would want to break it now that it works.
 Because of HFS+? Isn't what HFS+ does just as broken if you think
 about it? Why normalise paths in the filesystem if nobody else does it?

You're aware that MacPorts subversion already has a hack to normalize
the other way, at least over the wire. :)

Sure, if you want to turn on such normalization, you pretty much have to
dump and reload the repository as well as upgrading all working copies
(again). Either that, or use form-independent comparison on the server,
which isn't such a bad idea anyway. Doing that in wc.db is probably harder.

-- Brane

Re: Let's discuss about unicode compositions for filenames!

On Mon, Jan 30, 2012 at 09:34:03PM +0100, Branko Čibej wrote:
 Sure, if you want to turn on such normalization, you pretty much have to
 dump and reload the repository as well as upgrading all working copies
 (again). Either that, or use form-independent comparison on the server,
 which isn't such a bad idea anyway. Doing that in wc.db is probably harder.

It is indeed harder because we are passing paths verbatim to sqlite.
I doubt having more than one form of a given path in wc.db is fun...
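
A small demonstration of what "verbatim" means for SQLite, using Python's sqlite3 module. The one-column table is a hypothetical stand-in; the real wc.db schema is much richer.

```python
import sqlite3
import unicodedata

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table; not the actual wc.db schema.
conn.execute("CREATE TABLE nodes (local_relpath TEXT PRIMARY KEY)")

nfc = unicodedata.normalize("NFC", "f\u00f6\u00f6")
nfd = unicodedata.normalize("NFD", "f\u00f6\u00f6")

# Both inserts succeed: to SQLite these are two different keys.
conn.execute("INSERT INTO nodes VALUES (?)", (nfc,))
conn.execute("INSERT INTO nodes VALUES (?)", (nfd,))

rows = conn.execute("SELECT count(*) FROM nodes").fetchone()[0]
hit = conn.execute(
    "SELECT count(*) FROM nodes WHERE local_relpath = ?", (nfc,)
).fetchone()[0]
print(rows, hit)  # 2 1
```

The default comparison is byte-wise, so a lookup with one form never finds the row stored under the other.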

Re: Let's discuss about unicode compositions for filenames!

2012-01-30 Thread Peter Samuelson


[Stefan Sperling]
 It is indeed harder because we are passing paths verbatim to sqlite.
 I doubt having more than one form of a given path in wc.db is fun...

That's the implementation I would like to see, to be honest.  Start
with the observation that we can treat Mac OS X NFD paths as a client
character encoding.  Now observe that it is lossy.  But ... almost all
non-Unicode client charsets are equally lossy, for exactly the same
reason!

This suggests maintaining a mapping table in wc.db between server paths
(UTF-8, unspecified NF) and wc paths (local charset, which is
occasionally UTF-8 with NFD).

This mapping table would be maintained any time we write to the wc.
It would be consulted any time we search for files in the wc.

It's not really extra work - we have to do those UTF-8 - local
charset conversions all the time anyway.  This would in fact cache
those conversions.
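
One way to picture such a mapping table is the in-memory sketch below. The class and function names are hypothetical, nothing here reflects an actual wc.db layout, and the HFS+ behavior is approximated by plain NFD.

```python
import unicodedata


class PathMap:
    """Two-way table between repository paths and on-disk paths."""

    def __init__(self, to_disk):
        self._to_disk = to_disk  # function: repository path -> on-disk path
        self._disk_by_repos = {}
        self._repos_by_disk = {}

    def record(self, repos_path):
        """Call when writing a node to the working copy."""
        disk_path = self._to_disk(repos_path)
        owner = self._repos_by_disk.get(disk_path)
        if owner is not None and owner != repos_path:
            raise ValueError("on-disk name already taken by %r" % owner)
        self._disk_by_repos[repos_path] = disk_path
        self._repos_by_disk[disk_path] = repos_path
        return disk_path

    def repos_path_for(self, disk_path):
        """Call when searching the working copy; None if unknown."""
        return self._repos_by_disk.get(disk_path)


def hfs_like(path):
    # Stand-in for what an HFS+ volume would store (assumption: plain NFD).
    return unicodedata.normalize("NFD", path)


table = PathMap(hfs_like)
stored = table.record("f\u00f6\u00f6")
print(table.repos_path_for(stored) == "f\u00f6\u00f6")  # True
```

A second repository path that maps to the same on-disk name is reported as an error here; what the client should actually do in that case is exactly the open question in this thread.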

The implementation on OS X might be a bit hairy, if there isn't an easy
way to retrieve the real pathname of the file you just created.
Anywhere else, we just store the pathname we just calculated.

Peter

Re: Let's discuss about unicode compositions for filenames!

On 31.01.2012 00:14, Peter Samuelson wrote:
 [Stefan Sperling]
 It is indeed harder because we are passing paths verbatim to sqlite.
 I doubt having more than one form of a given path in wc.db is fun...
 That's the implementation I would like to see, to be honest.  Start
 with the observation that we can treat Mac OS X NFD paths as a client
 character encoding.  Now observe that it is lossy.  But ... almost all
 non-Unicode client charsets are equally lossy, for exactly the same
 reason!

 This suggests maintaining a mapping table in wc.db between server paths
 (UTF-8, unspecified NF) and wc paths (local charset, which is
 occasionally UTF-8 with NFD).

 This mapping table would be maintained any time we write to the wc.
 It would be consulted any time we search for files in the wc.

 It's not really extra work - we have to do those UTF-8 - local
 charset conversions all the time anyway.  This would in fact cache
 those conversions.

 The implementation on OS X might be a bit hairy, if there isn't an easy
 way to retrieve the real pathname of the file you just created.
 Anywhere else, we just store the pathname we just calculated.


Afaik the OSX API normalizes everything to NFD automagically. So at
least on that platform there's no chance of having more than one form
for the same filename at the same time. Likewise on Windows, which
normalizes to NFC.

I don't see what you mean by lossy though. NFD and NFC can represent
exactly the same set of characters, it's just that the representations
of some of them are different. Thus, this does not preclude normalizing
the paths in wc.db, and that's even easily automated. If such a
conversion finds a name collision ... the user is in serious trouble
already. :)
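
Both readings of "lossy" can be checked with a few lines of Python. This is an illustration only; the sample strings are made up.

```python
import unicodedata

nfc = "f\u00f6\u00f6"     # fully composed
mixed = "fo\u0308\u00f6"  # first o-umlaut decomposed, second composed


def to_nfd(s):
    return unicodedata.normalize("NFD", s)


def to_nfc(s):
    return unicodedata.normalize("NFC", s)


# Same set of characters: an NFC name survives the trip through NFD.
print(to_nfc(to_nfd(nfc)) == nfc)      # True

# But different spellings collapse to one NFD string, so a name that was
# in no particular form cannot be recovered from its NFD form alone.
print(to_nfd(nfc) == to_nfd(mixed))    # True
print(to_nfc(to_nfd(mixed)) == mixed)  # False
```

So no characters are lost, but the exact code point sequence is, unless the stored paths are all in one known form to begin with.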

It's more likely to find such a collision on Unix than either Mac OS or
Windows (both of which normalize on the FS API level). But this case is
probably so rare that I wouldn't worry about it.

-- Brane

RE: Let's discuss about unicode compositions for filenames!

2012-01-30 Thread Bert Huijben

 -Original Message-
 From: Branko Čibej [mailto:br...@xbc.nu]
 Sent: Monday 30 January 2012 16:11
 To: dev@subversion.apache.org
 Subject: Re: Let's discuss about unicode compositions for filenames!

 On 31.01.2012 00:14, Peter Samuelson wrote:
  [Stefan Sperling]
  It is indeed harder because we are passing paths verbatim to sqlite.
  I doubt having more than one form of a given path in wc.db is fun...
  That's the implementation I would like to see, to be honest.  Start
  with the observation that we can treat Mac OS X NFD paths as a client
  character encoding.  Now observe that it is lossy.  But ... almost all
  non-Unicode client charsets are equally lossy, for exactly the same
  reason!

  This suggests maintaining a mapping table in wc.db between server paths
  (UTF-8, unspecified NF) and wc paths (local charset, which is
  occasionally UTF-8 with NFD).

  This mapping table would be maintained any time we write to the wc.
  It would be consulted any time we search for files in the wc.

  It's not really extra work - we have to do those UTF-8 - local
  charset conversions all the time anyway.  This would in fact cache
  those conversions.

  The implementation on OS X might be a bit hairy, if there isn't an easy
  way to retrieve the real pathname of the file you just created.
  Anywhere else, we just store the pathname we just calculated.

 Afaik the OSX API normalizes everything to NFD automagically. So at
 least on that platform there's no chance of having more than one form
 for the same filename at the same time. Likewise on Windows, which
 normalizes to NFC.

 I don't see what you mean by lossy though. NFD and NFC can represent
 exactly the same set of characters, it's just that the representations
 of some of them are different. Thus, this does not preclude normalizing
 the paths in wc.db, and that's even easily automated. If such a
 conversion finds a name collision ... the user is in serious trouble
 already. :)

 It's more likely to find such a collision on Unix than either Mac OS or
 Windows (both of which normalize on the FS API level). But this case is
 probably so rare that I wouldn't worry about it.

Last time we discussed this in depth (a few years ago), Windows didn't perform 
the normalization you describe here.
Was this added later? (Any documentation pointers?)

I think the keyboard/editor support performs some normalization so users are 
unlikely to create the sequences not-normalized, but our old documents say that 
it just stores whatever it gets passed.
(Probably for the same reason as Subversion does it: compatibility with the 
time where we didn't know about these problems)

Bert

 -- Brane

Re: Let's discuss about unicode compositions for filenames!