[no subject]

2015-02-04 Thread Hiroaki Nakamura
The Japan Post Bank ATM doesn't open until 9:00, so I'll wait until then, withdraw some money, and head home.

--


Re: Let's discuss about unicode compositions for filenames!

2012-02-16 Thread Hiroaki Nakamura
2012/2/17 Vincent Lefevre vincent-...@vinc17.net:
 On 2012-01-30 21:29:41 +0100, Stefan Sperling wrote:
 On Mon, Jan 30, 2012 at 09:09:22PM +0100, Branko Čibej wrote:
  Are you seriously proposing that we /support/ such broken, hackish
  nonsense? How do you expect users to tell the difference between file
  names that look identical on the character level, but are not on the
  code point level?
 
  Supporting such hacks would only be a source of bug reports. I don't see
  this as a desirable feature.

 The question is why you would want to break it now that it works.
 Because of HFS+? [...]

 I think you mean because of Mac OS X. Indeed, unless this has changed,
 with the Mac OS X Terminal, when a user types an accented character,
 it is in NFD at the command line level. So, even if the user uses a
 conventional file system that can store both NFC and NFD, the filename
 will be in NFD, which will annoy Linux users.

Actually, whether a filename is in NFC or NFD depends on how it is
entered.
If you type all the characters, it is in NFC.
If you use shell filename completion by hitting the Tab key, it is in NFD.
I tried this with Japanese filenames and confirmed it.

So it is HFS+ that returns the filenames in NFD.
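
To see which form a given name is in, something like the following can be used. This is only a sketch in Python using the standard unicodedata module; the function name normalization_form and the sample names are made up for illustration.

```python
import unicodedata


def normalization_form(name: str) -> str:
    """Report which canonical normalization form(s) a name is already in."""
    is_nfc = unicodedata.normalize("NFC", name) == name
    is_nfd = unicodedata.normalize("NFD", name) == name
    if is_nfc and is_nfd:
        return "both"      # e.g. pure ASCII names
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "neither"       # mixed composed and decomposed characters


# KATAKANA LETTER GA as one precomposed code point (as when typed directly)
typed = "\u30ac.txt"
# KA followed by the combining voiced sound mark (as a decomposed name)
completed = "\u30ab\u3099.txt"

print(normalization_form(typed))
print(normalization_form(completed))
```

Both names look identical on screen, but they differ at the code point level.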
-- 
(Hiroaki Nakamura) hnaka...@gmail.com


Re: Let's discuss about unicode compositions for filenames!

2012-02-11 Thread Hiroaki Nakamura
2012/2/9 Markus Schaber m.scha...@3s-software.com:
 Hi,

 Von: Stefan Sperling [mailto:s...@elego.de]

 On Thu, Feb 09, 2012 at 12:20:14AM +0900, Hiroaki Nakamura wrote:
  [Upgrade options / backwards compatibility for proposed unicode 
  normalization fix]

 - Need to re-checkout existing working copies of the repository?
   = Yes, but only if config is changed from the default.

 Maybe this could even be avoided if newer clients (or an utility script) can 
 upgrade the working copy to the normalized format.

Yes, if the working copy does not have filename collisions. However,
for compatibility, we cannot let newer clients upgrade working copies
automatically, because existing working copies may have filename
collisions.


 Best regards

 Markus Schaber
 --
 ___
 We software Automation.

 3S-Smart Software Solutions GmbH
 Markus Schaber | Developer
 Memminger Str. 151 | 87439 Kempten | Germany | Tel. +49-831-54031-0 | Fax 
 +49-831-54031-50

 Email: m.scha...@3s-software.com | Web: http://www.3s-software.com
 CoDeSys internet forum: http://forum.3s-software.com
 Download CoDeSys sample projects: 
 http://www.3s-software.com/index.shtml?sample_projects

 Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade 
 register: Kempten HRB 6186 | Tax ID No.: DE 167014915



-- 
中村 弘輝 (Hiroaki Nakamura) hnaka...@gmail.com


Re: Let's discuss about unicode compositions for filenames!

2012-02-11 Thread Hiroaki Nakamura
Hi,

2012/2/9 Thomas Åkesson tho...@akesson.cc:
 Hi,
 I have been interested in this issue for a couple of years and I remember it 
 was discussed briefly at Subconf in Germany a couple of years ago.

 Branching the thread here because I'd like to propose a different approach 
 than Hiroaki. This proposition is not very different from the note 
 unicode-composition-for-filenames or what Peter S, Neels and others 
 suggested, perhaps just combining 2 changes slightly differently.

 This is based on my limited understanding of WC-NG, please correct me if I 
 make incorrect assumptions.

 - Server will still accept both NFC and NFD, however, it will no longer 
 accept collisions. Enforced by normalising to NFD before uniqueness checks 
 during add operations (yes, might be more expensive). There will be no 
 unified normalisation, but the subversion server will work like most 
 filesystems; return what was given to it.

For compatibility, we cannot ignore existing repositories and working
copies that have filename collisions. So we cannot force Subversion
servers and clients to normalize filenames.
We must let users choose, per repository, whether filenames are
normalized or not.
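
As a rough illustration of a uniqueness check with a per-repository setting, here is a sketch in Python. The names normalized_key and can_add, and the mode values "none", "NFC" and "NFD", are assumptions for this example and not an existing Subversion interface.

```python
import unicodedata
from typing import Iterable


def normalized_key(path: str, mode: str) -> str:
    """Key used for uniqueness checks; 'none' compares code points as-is."""
    if mode == "none":
        return path
    return unicodedata.normalize(mode, path)  # mode is "NFC" or "NFD"


def can_add(existing: Iterable[str], new_path: str, mode: str = "NFD") -> bool:
    """Return True if new_path does not collide with any existing path.

    The stored paths themselves are left untouched; only the comparison
    key is normalized.
    """
    keys = {normalized_key(p, mode) for p in existing}
    return normalized_key(new_path, mode) not in keys


existing = ["caf\u00e9.txt"]                          # precomposed e-acute
print(can_add(existing, "cafe\u0301.txt", "NFD"))     # decomposed spelling
print(can_add(existing, "cafe\u0301.txt", "none"))    # legacy behaviour
```

With mode "none" the two spellings are treated as different names, which is what existing repositories with collisions would need.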

-- 
(Hiroaki Nakamura) hnaka...@gmail.com


Re: Let's discuss about unicode compositions for filenames!

2012-02-11 Thread Hiroaki Nakamura
2012/2/11 Branko Čibej br...@apache.org:
 On 11.02.2012 13:05, Hiroaki Nakamura wrote:
 2012/2/9 Markus Schaber m.scha...@3s-software.com:
 Von: Stefan Sperling [mailto:s...@elego.de]
 On Thu, Feb 09, 2012 at 12:20:14AM +0900, Hiroaki Nakamura wrote:
 [Upgrade options / backwards compatibility for proposed unicode 
 normalization fix]
 - Need to re-checkout existing working copies of the repository?
   = Yes, but only if config is changed from the default.
 Maybe this could even be avoided if newer clients (or an utility script) 
 can upgrade the working copy to the normalized format.
 Yes, if the working copy does not have filename collisions. However,
 for compatibility,
 we cannot let newer clients upgrade working copies automatically
 because existing
 working copies may have filename collisions.

 That's not entirely true, since we can detect the collisions in advance,
 and a partially upgraded working copy would still work

 From a practical point of view, it's very, very unlikely that there
 would be any such collisions in a valid working copy. People would tend
 to notice. :)

Yes, I agree wholeheartedly!
At work, I have noticed a few repositories that contain both NFC filenames
and NFD filenames. However, as far as I know, none of them has
collisions.

-- 
(Hiroaki Nakamura) hnaka...@gmail.com


Re: Let's discuss about unicode compositions for filenames!

2012-02-08 Thread Hiroaki Nakamura
Hi, thanks for your review.

2012/2/9 Stefan Sperling s...@elego.de:
 Open questions:

I'll try to answer these here. Of course, everyone else is welcome to answer too.


  - How can the client retrieve the configuration from the server?
   This is related to server-dictated configuration, see
   http://wiki.apache.org/subversion/ServerDictatedConfiguration
   and http://subversion.tigris.org/issues/show_bug.cgi?id=1974
   This issue would need to be solved first.

I read those two pages and I think it can be done with server-dictated
configuration.


  - What happens if NFC/NFD is enabled in repository config, but the
   repository contains non-normalised paths (i.e. did not go through
   a dump/load cycle to normalise all paths)?

I think we will provide a check command to find out:
- whether a repository contains the same filename in different Unicode
  normalized/unnormalized forms.
- whether all filenames in a repository are NFC.
- whether all filenames in a repository are NFD.
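
The three checks could look roughly like this. It is a sketch in Python that works on a plain list of paths; how the paths are read from a repository is left out, and the function name check_paths and the report keys are invented for this example.

```python
import unicodedata
from collections import defaultdict


def check_paths(paths):
    """Report collisions and whether every path is already NFC or NFD."""
    paths = list(paths)
    groups = defaultdict(set)
    for p in paths:
        # Canonically equivalent names end up under the same key.
        groups[unicodedata.normalize("NFC", p)].add(p)
    collisions = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    return {
        "collisions": collisions,
        "all_nfc": all(unicodedata.normalize("NFC", p) == p for p in paths),
        "all_nfd": all(unicodedata.normalize("NFD", p) == p for p in paths),
    }


report = check_paths(["a.txt", "caf\u00e9", "cafe\u0301"])
print(report)
```

A repository with a non-empty "collisions" list would have to stay at unicode-normalization 'none'.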

One idea is to allow this config to be changed only during a load cycle,
that is, to specify it as an option to the load command.
When the load command finishes, the option value is saved in the config.

However, administrators can cheat and change the config file without loading,
since the config file is a plain text file. So we cannot enforce that this
config is set only by the load command.

Therefore I think it should be the administrators' responsibility to ensure
this config matches the repository.


  - How do we handle name collisions if both NFC and NFD forms exist
   in a repository that sets the configuration to NFC or NFD?

   Is an upgrade not supported in this case?

No, I think we should not support changing this config to NFC/NFD in this case.
Only unicode-normalization 'none' would be allowed.


   Or will duplicate paths need to be discarded from history?
    How can the user filter the paths, and how can the user decide
    which path is kept?

I think we should not support these. Maybe repository admins could
remove one of the duplicated filenames from the repository history
and try to load again?


    Or will duplicate paths be renamed throughout history?
    How can the user rename the paths?

I think users can only normalize filenames during the load command.
They cannot rename filenames arbitrarily.


 Anything else? I cannot think of more questions but there might
 be more things to consider here.



-- 
(Hiroaki Nakamura) hnaka...@gmail.com


Re: Let's discuss about unicode compositions for filenames!

2012-02-07 Thread Hiroaki Nakamura
2012/2/7 Branko Čibej br...@apache.org:
 On 06.02.2012 22:26, Hiroaki Nakamura wrote:
 The Unicode Standard says canonical equivalent sequences should be
 interpreted the same way.
 * 1.1 Canonical and Compatibility Equivalence
   http://unicode.org/reports/tr15/#Canonical_Equivalence
 * 2.12 Equivalent Sequences and Normalization
   http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf

 So we should not have the same name multiple times in repositories
 and working copies. Therefore subversion servers and clients does
 not need to handle them.

 *sigh*

 I don't give a gnat's whisker what the Unicode Standard says. I'm only
 interested in real-world situations. Or are you implying that, e.g., the
 Unix VFS layer will magically detect file name equality of different
 (de)normalized forms? Because it won't.

 -- Brane


I'm interested in real-world situations, too. The reality is that
we need to avoid the same filename appearing in different forms,
because that confuses users so much.

I don't think we expect file systems to detect filename equality across
different forms. Mac OS X HFS+ can only hold NFD filenames
and we must cope with that. And as you say, standard file systems
on Linux and Windows do not magically detect filename equality
across different forms. It is also the reality that we cannot force users
to reformat their hard disks and change file systems.

So the communication layer must take care of this problem to provide
interoperability among Windows, Linux and Mac.
Subversion to the rescue!

-- 
(Hiroaki Nakamura) hnaka...@gmail.com


Re: Let's discuss about unicode compositions for filenames!

2012-02-06 Thread Hiroaki Nakamura
Hi, all.

It seems there is no further discussion.

I think the conclusion for the short term solution is:
we convert unnormalized paths to NFC-normalized paths on clients only,
that is, in svn_path_cstring_to_utf8.

It is the same approach as utf8precompose_macosx_2.patch in
http://subversion.tigris.org/issues/show_bug.cgi?id=2464

It is proven to work, as it is included in the MacPorts unicode_path variant
and the Homebrew --unicode-path option.

The difference is that this time we use utf8proc instead of Mac OS X APIs,
and we do the conversion not only on Mac but on all platforms.
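
The idea of the client-side conversion can be sketched as follows. This is Python with the standard unicodedata module standing in for utf8proc; the function name path_to_utf8_nfc and its parameters are hypothetical and only mirror the role of svn_path_cstring_to_utf8.

```python
import unicodedata


def path_to_utf8_nfc(path_native: bytes, native_encoding: str = "utf-8") -> bytes:
    """Decode a path from the native encoding and return it as NFC UTF-8.

    Every path entering the client goes through this one function, so the
    rest of the code only ever sees NFC.
    """
    text = path_native.decode(native_encoding)
    return unicodedata.normalize("NFC", text).encode("utf-8")


# A decomposed name, such as one returned by an HFS+ directory listing
nfd_name = "cafe\u0301.txt".encode("utf-8")
print(path_to_utf8_nfc(nfd_name))
```

Because normalization is idempotent, running an already normalized path through the function again returns it unchanged.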

Do you agree? If so, I will update my patch and post it to
http://subversion.tigris.org/issues/show_bug.cgi?id=2464

Best regards,

-- 
(Hiroaki Nakamura) hnaka...@gmail.com


Re: Let's discuss about unicode compositions for filenames!

2012-02-06 Thread Hiroaki Nakamura
2012/2/6 Stefan Sperling s...@elego.de:
 On Mon, Feb 06, 2012 at 02:28:40PM +0100, Branko Čibej wrote:
 On 06.02.2012 14:10, Hiroaki Nakamura wrote:
  Hi, all.
 
  It seems there is no further discussion.
 
  I think the conclusion for the short term solution is:
  We convert unnormalized paths to NFC normalized paths on clients only,
  that is, svn_path_cstring_to_utf8.
 
  It is the same approach as utf8precompose_macosx_2.patch in
  http://subversion.tigris.org/issues/show_bug.cgi?id=2464
 
  It is proven to work as it is included in MacPorts unicode_path variant
  and Homebrew --unicode-path option.

 You'll note that MacPorts also warns you that using this option may
 cause interoperability issues with other clients that aren't using it,
 right? So this is hardly a universal solution that will not affect
 existing users and repositories.

 Exactly. This is what I meant when I said that we cannot apply the
 submitted patch as it is, at the very beginning of this thread.
 The submitted patch simply copies the MacPorts solution and has
 the same compatibility problems.

 I think the discussion made clear that there are two ways
 to move forward:

  1) Implement a client-side mapping table which maps server-provided
    paths to local filesystem paths. It translates between one or more
    server-side and local representations of the same path. This could
    be done only on Mac OS X (or, preferably, only on HFS+ filesystems)
    because only Mac OS X has problems.
    The idea here is to not change existing paths in repositories at all,
    no matter which way they are encoded, and to teach Mac OS X clients
    to cope with the problem locally. This way, other existing clients
    won't notice a difference. The only thing that won't work is to create
    a working copy on Mac OS X which contains the same name multiple times,
    in NFD and in some other normalised or non-normalised form.
    This approach was suggested by Peter.
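
A very rough sketch of what such a client-side mapping table might look like, in Python. The class name PathMap and its methods are invented for illustration, and it assumes the local filesystem stores plain NFD, which is a simplification of what HFS+ actually does.

```python
import unicodedata


class PathMap:
    """Two-way table between repository paths and local (NFD) paths."""

    def __init__(self):
        self._to_local = {}
        self._to_repo = {}

    def add(self, repo_path: str) -> str:
        """Register a repository path and return the local name to use."""
        local = unicodedata.normalize("NFD", repo_path)
        known = self._to_repo.get(local)
        if known is not None and known != repo_path:
            # Two repository names would map to one local file.
            raise ValueError("collision on local filesystem: %r" % local)
        self._to_local[repo_path] = local
        self._to_repo[local] = repo_path
        return local

    def to_local(self, repo_path: str) -> str:
        return self._to_local[repo_path]

    def to_repo(self, local_path: str) -> str:
        return self._to_repo[local_path]


table = PathMap()
local = table.add("caf\u00e9.txt")      # repository stores the NFC form
print(table.to_repo(local) == "caf\u00e9.txt")
```

The repository path is never rewritten; only the local name differs, and the one case that fails is the collision case mentioned above.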

The Unicode Standard says canonical equivalent sequences should be
interpreted the same way.
* 1.1 Canonical and Compatibility Equivalence
  http://unicode.org/reports/tr15/#Canonical_Equivalence
* 2.12 Equivalent Sequences and Normalization
  http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf

So we should not have the same name multiple times in repositories
and working copies. Therefore Subversion servers and clients do
not need to handle them. Rather, I think we should fix Subversion to
reject the same name in a different form.

To handle existing repositories and working copies, maybe we should
create a tool that checks whether repositories and working copies
contain the same name multiple times.

If they do, users must rename the files manually. In reality, I think this
is extremely rare.

    We'd need either a working patch or a more detailed implementation
    design document to move forward here.

OK. Peter, or somebody else, please give us either one of them.


  2) Do something else that affects repositories, too, and provide
    a clean upgrade path for everyone (servers and clients).
    AFAIK nobody has made a suggestion as to what could be done here.

What do you mean by a clean upgrade?
Is it clean if we do a dump and load for repositories and a re-checkout
for working copies?

-- 
(Hiroaki Nakamura) hnaka...@gmail.com


Re: Let's discuss about unicode compositions for filenames!

2012-02-04 Thread Hiroaki Nakamura
2012/2/3 Julian Foad julianf...@btopenworld.com:
 You may well be correct that NFC is never longer than NFD, but that's not the 
 question.  The question is whether NFC may be longer than the current paths 
 (which are not normalized to normalization form C or to form D).  And the 
 answer is yes it may be longer.  See 
 http://unicode.org/faq/normalization.html#11.

Oh, I didn't know that. Thanks for letting me know.
I also read all other items in http://unicode.org/faq/normalization.html#11
and all of http://www.unicode.org/reports/tr15/ and learned more about
normalization.

Maybe we should revise the note.
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames



 Here I quote from
 http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
    The proposed internal 'normal form' should be NFC, if only if
    it were because it's the most compact form of the two:  when
    allocating memory to store a conversion result, it won't be
    necessary (ever) to allocate more than the size of the input buffer.

 That statement seems to be talking about converting between NFC and NFD, not 
 from un-normalized to normalized.

Yes, indeed.

So we need to normalize input paths before processing.
We choose NFC as the normalization form.
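
To make Julian's point concrete: some characters are excluded from composition, so NFC replaces them with a longer sequence. The sketch below uses Python's standard unicodedata module; the helper name nfc_growth is made up, and U+0958 (DEVANAGARI LETTER QA) is just one such character.

```python
import unicodedata


def nfc_growth(text: str):
    """Return (UTF-8 bytes before, UTF-8 bytes after) NFC normalization."""
    before = len(text.encode("utf-8"))
    after = len(unicodedata.normalize("NFC", text).encode("utf-8"))
    return before, after


# U+0958 is a composition exclusion: NFC turns it into U+0915 U+093C.
qa = "\u0958"
print(nfc_growth(qa))
```

So a buffer sized for the unnormalized input is not always large enough for the NFC result.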

-- 
(Hiroaki Nakamura) hnaka...@gmail.com


Re: Let's discuss about unicode compositions for filenames!

2012-02-02 Thread Hiroaki Nakamura
2012/2/3 Peter Samuelson pe...@p12n.org:

 [Hiroaki Nakamura]
 In option (2), we do n12n on all clients on all platforms, and we
 include web_dav_svn in clients. So we convert all input paths to
 the server encoding, which is NFC.

 Indeed.  But the very concept of a server encoding means we are
 involving the server side.  Which invokes a lot of difficult questions
 like what about existing 1.x clients, what about existing checkouts
 and what about existing repositories.

Svn 1.7 forced me to upgrade existing 1.6 working copies.
So we can ask users to upgrade working copies.

As for existing repositories, I think it would be better to convert them too,
using svnadmin dump/load, and we change svnadmin load to convert filenames
to NFC.
However, in reality we cannot force users to convert every existing repository.
So we need to change servers too. When servers read filenames
from repositories, they first convert them to NFC and then process commands.

We also need to change servers in order to deal with existing 1.x clients.
When mod_dav_svn and svnserve receive filenames from clients,
they must first convert the filenames to NFC.


 By proposing a client-only solution, I hope to avoid _all_ those
 questions.  (Except what about existing checkouts - there would be a
 wc upgrade of some sort.)  No recoding of existing repository paths is
 necessary.  In my proposal, the only recoding that is done is on the
 client side, on a platform that does not support the original pathname
 (e.g., OS X HFS+ with a NFC path).

 All problems in computer science can be solved by another level of
 indirection.

 Mostly true.  I can't tell if you quoted that as a point of support for
 my proposal, or as a point against it.

 Yes, with the mapping table, you can mangle filenames. However I
 think it is too complex for novice users. Users must care about the
 original filenames and the mangled filenames all the time.

 Well, there is no need to use this same proposal to also work around
 other filesystem limitations like avoiding : on Windows.  It is just
 something that becomes _possible_.

 Also you must adapt all clients to use the mapping table. That is
 whole lot of work! Maybe you would create another version control
 system.

 By all clients I guess you mean all Subversion client libraries.
 Yes, that is the proposal.  It would touch libsvn_wc and probably
 libsvn_client and libsvn_subr.

Yes, like I said above, "clients" actually includes components that
run on servers, like mod_dav_svn; it should be read as any component
that accesses repositories and working copies.

We also need to change svnserve. So we'd better say all servers and clients.


 So even if Windows NTFS can have the same abstract filenames in both
 NFC and NFD simultaneously, we should avoid that, and we should only
 allow NFC filenames.

 This could be done, if we wanted to go to the trouble.  Or we could
 just say use a pre-commit hook, like we tell people who want to
 prevent REAMDE and Reamde in a single dir.  It is not the same level of
 interoperability problem as the one this thread is about.

If you think of this by analogy with ASCII uppercase and lowercase,
you miss the point. Please reread the Unicode Standard Annex #15
UAX #15: Unicode Normalization Forms
http://unicode.org/reports/tr15/

  Canonical equivalence is a fundamental equivalency between
  characters or sequences of characters that represent the same
  abstract character, and when correctly displayed should always
  have the same visual appearance and behavior. Figure 1 illustrates
  this equivalence.

So filenames in NFC and NFD are equivalent: they are the same name.
README and readme are different.
NFC/NFD and uppercase/lowercase are two different stories.
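
The distinction can be shown in a few lines. This is a sketch in Python with the standard unicodedata module; the function name canonically_equivalent is made up for the example.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b are the same text under canonical equivalence."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


nfc_name = "caf\u00e9"     # precomposed e-acute
nfd_name = "cafe\u0301"    # e followed by a combining acute accent

print(canonically_equivalent(nfc_name, nfd_name))    # same abstract name
print(canonically_equivalent("README", "readme"))    # different names
```

Normalization alone never makes README and readme compare equal; that would need a separate case-folding step.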

Should we allow the same filename twice in one directory?
Of course not! If we allow that, we get into real trouble and
confusion.

And OS X HFS+ does not allow that. So to support interoperability
with OS X, we should not allow it in Subversion either.

-- 
(Hiroaki Nakamura) hnaka...@gmail.com


Re: Let's discuss about unicode compositions for filenames!

2012-02-02 Thread Hiroaki Nakamura
2012/2/3 Branko Čibej br...@xbc.nu:
 On 02.02.2012 20:59, Hiroaki Nakamura wrote:
 So we need to change servers too. When servers read filenames from
 repositories, they first convert to NFC and then process commands.

 That won't work. You have to do the initial lookup in a
 normalization-agnostic way, and neither BDB nor FSFS makes that possible
 without scanning whole directories.

OK, then we scan whole directories. If you do not want that,
we force users to convert existing repositories. I think we must
choose one of the two. These are tough choices, but I cannot think of a
better way, at least right now.


 We also need to changes servers in order to deal with existing 1.x
 clients. We convert filenames to NFC when web_dav_svn and svnserve
 receive filenames from clients, they must first convert filenames to NFC.

 Actually, libsvn_repos; this has to work with ra_local as well. And it
 would have to maintain a table for converting results back to how the
 client knows them. This is the hard part to get right; imagine:

    $ svn up
    U čombe

 How will the server know if the client represents the č in the same
 encoding that the now-normalizing server sends? Will the client scan the
 directory and normalize the names to find the local file that needs
 updating?

Yes, without upgrading working copies, we must do that.

If there is a better way, I would like to know.
If anyone has an idea, please share a better solution.
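
The directory scan in question might look roughly like this on the client. It is a Python sketch using the standard unicodedata module; the function name find_local_name is hypothetical, and the directory listing is passed in as a plain list.

```python
import unicodedata


def find_local_name(server_name: str, local_entries):
    """Find the local entry that is canonically equivalent to server_name.

    Returns the local spelling, or None if there is no such entry.
    """
    wanted = unicodedata.normalize("NFC", server_name)
    matches = [e for e in local_entries
               if unicodedata.normalize("NFC", e) == wanted]
    if len(matches) > 1:
        raise ValueError("ambiguous local names: %r" % matches)
    return matches[0] if matches else None


# The server sends the precomposed spelling; the local file is decomposed.
listing = ["readme", "c\u030combe"]
print(find_local_name("\u010dombe", listing) == "c\u030combe")
```

The cost is one normalization per directory entry on every lookup, which is the price of not upgrading the working copy.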

-- 
(Hiroaki Nakamura) hnaka...@gmail.com


Re: Let's discuss about unicode compositions for filenames!

2012-02-02 Thread Hiroaki Nakamura
2012/2/3 Daniel Shahaf danie...@elego.de:
 Branko Čibej wrote on Thu, Feb 02, 2012 at 21:03:47 +0100:
 On 02.02.2012 20:22, Peter Samuelson wrote:
  [Hiroaki Nakamura]
  In option (2), we do n12n on all clients on all platforms, and we
  include web_dav_svn in clients. So we convert all input paths to
  the server encoding, which is NFC.
  Indeed.  But the very concept of a server encoding means we are
  involving the server side.  Which invokes a lot of difficult questions
  like what about existing 1.x clients, what about existing checkouts
  and what about existing repositories.
 
  By proposing a client-only solution, I hope to avoid _all_ those
  questions.

 Can't see how that works, unless you either make the client-side
 solution optional, create a mapping table, or make name lookup on the
 server agnostic to character representation. I can't envision how any of
 those solutions would work all the time.

 It would be nice if we could normalize paths in the repository without
 having to perform a dump/reload cycle, but I don't know how that would
 work in FSFS

 It won't.  Changing the encoding increase the length (in bytes) of the
 string (in the dirents hash, for example), and thus change the offsets
 of the node-revs that are later in the file --- to which subsequent
 revisions, and the id's of those node-revs, refer.

Changing from NFD to NFC does not increase the length.
The length will be the same or smaller, not larger.

Here I quote from
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
   The proposed internal 'normal form' should be NFC, if only if
   it were because it's the most compact form of the two:  when
   allocating memory to store a conversion result, it won't be
   necessary (ever) to allocate more than the size of the input buffer.
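
A few sample names illustrate the size difference between the two forms. This Python sketch uses the standard unicodedata module; the helper name nfd_nfc_sizes is made up, and a handful of samples is of course not a proof.

```python
import unicodedata


def nfd_nfc_sizes(name: str):
    """Return (UTF-8 bytes in NFD, UTF-8 bytes after converting that to NFC)."""
    nfd = unicodedata.normalize("NFD", name)
    nfc = unicodedata.normalize("NFC", nfd)
    return len(nfd.encode("utf-8")), len(nfc.encode("utf-8"))


for sample in ["readme", "caf\u00e9", "\u30ac"]:
    print(repr(sample), nfd_nfc_sizes(sample))
```

For these samples the NFC form is the same size as the NFD form or smaller.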


-- 
(Hiroaki Nakamura) hnaka...@gmail.com


Re: Let's discuss about unicode compositions for filenames!

2012-02-02 Thread Hiroaki Nakamura
2012/2/3 Peter Samuelson pe...@p12n.org:

 [Hiroaki Nakamura]
 Existing repositories, I think it would be better to convert them too using
 svndump/svnload. And we change svnload to convert filenames to NFC.
 However in reality we cannot force users to convert every existing 
 repository.

 Also note that if you convert a repository (via dump/load or whatever),
 all working copies based on the repository are invalidated and need to
 be re-checked-out.  Avoiding _that_ problem would be really hairy, I
 think, very similar to the sort of work that would be needed to support
 obliterate without losing working copies.

 We also need to changes servers in order to deal with existing 1.x
 clients.  We convert filenames to NFC when web_dav_svn and svnserve
 receive filenames from clients, they must first convert filenames to
 NFC.

 You keep saying what we must do on the server side.  I propose
 something that is purely on the client side.  It will solve the OS X /
 non-OS X interoperability problem.  It will not solve every problem
 ever faced by a Subversion user.  That's a job for 2.0.

OK. When I started this thread, I thought we would focus on the
long term solution for 2.x. That's because the short term solution, option (4),
described in
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
seems too difficult and complex to me.

But if a modified version of my proposal would fit into the short term 1.x,
I will gladly modify it.


 Yes, like I said above, clients actually includes components that
 run on servers like web_dav_svn, and it should read as any components
 that access to repositories and working copies.

 No.  By clients I mean components that run on the client side.  If my
 proposal had required changes to mod_dav_svn, I would not have said
 strictly client-side.  I do not propose any change to mod_dav_svn,
 svnserve, svnadmin, libsvn_repos, libsvn_fs, the repository data, or
 anything else on the server side.

 If you think in analogy to ASCII uppercase and lowercase examples,
 you miss the point. Please reread the Unicode Standard Annex #15
 UAX #15: Unicode Normalization Forms
 http://unicode.org/reports/tr15/

 Thanks, I've read it.  The analogy stands.  We could prevent NFC/NFD
 collisions as an additional service to users, something we have not
 done for the past 10 years.  This would be along the lines of
 preventing users from shooting themselves in the foot.

 The actual _software_ problem that is solved by preventing collisions
 is the same as the software problem solved by preventing upper/lower
 case collisions: certain clients are unable to check out a folder that
 has such collisions.  (Windows clients, in the case of upper/lower
 collisions; OS X clients, in the case of NFC/NFD collisions.)

Yes, I agree with that.


 I think we are talking past each other.  You are trying to solve two
 distinct but related problems: 1. OS X client-side confusion when faced
 with a non-NFD repository path; 2. NFC/NFD collisions.  I am only
 trying to solve problem 1.  I'm ignoring problem 2 for two reasons:

    (a) Problem 2 requires server-side work and complex compatibility /
    upgrade scenarios (dump/load, re-check-out all wcs, etc).

    (b) Problem 2 can be worked around, for new repositories (or
    repositories with no existing collisions), with a pre-commit hook.

 ...neither of which are true for my proposal to solve problem 1.

 So long as you continue to insist that, to solve problem 1, we must
 also solve problem 2, I'm pretty sure we will never come to any
 agreement.

OK. So how about changing my proposal as follows:
(1) No server modification. Just modify svn_path_cstring_to_utf8 only.
(2) Let users install a pre-commit hook that rejects any non-NFC filenames.

This way, we only need to change one function. The modification is just like
the original OS X unicode path patch:
utf8precompose_macosx_2.patch
http://subversion.tigris.org/nonav/issues/showattachment.cgi/813/utf8precompose_macosx_2.patch
in
http://subversion.tigris.org/issues/show_bug.cgi?id=2464

The only difference between the original patch and mine is that mine uses
utf8proc, so that it works on all platforms: Mac OS X, Windows
and Linux.
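
A pre-commit hook for item (2) might be sketched like this in Python. It relies on "svnlook changed -t TXN REPOS", whose output lines carry a status in the first columns followed by the path; the function names non_nfc_paths and main are made up, and the output is assumed to be UTF-8.

```python
import subprocess
import sys
import unicodedata


def non_nfc_paths(changed_output: str) -> list:
    """Return the paths from 'svnlook changed' output that are not in NFC."""
    bad = []
    for line in changed_output.splitlines():
        path = line[4:]  # skip the status columns, e.g. "A   "
        if path and unicodedata.normalize("NFC", path) != path:
            bad.append(path)
    return bad


def main(argv) -> int:
    """Hook body: argv[1] is the repository path, argv[2] the transaction."""
    repos, txn = argv[1], argv[2]
    result = subprocess.run(["svnlook", "changed", "-t", txn, repos],
                            check=True, capture_output=True)
    bad = non_nfc_paths(result.stdout.decode("utf-8"))  # assumes UTF-8 output
    for path in bad:
        print("Path is not in NFC: " + path, file=sys.stderr)
    return 1 if bad else 0  # a non-zero exit code rejects the commit
```

An actual hook script would end with sys.exit(main(sys.argv)); this does not help repositories that already contain non-NFC names.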

-- 
(Hiroaki Nakamura) hnaka...@gmail.com


Let's discuss about unicode compositions for filenames!

2012-01-29 Thread Hiroaki Nakamura
Hi folks!

I read the note about unicode compositions for filenames
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
and would like to drive the discussion.

First, to me, the short term solution (4) seems too difficult to implement.
It is very complex and error-prone, so here I focus on the long term
solution (2).

It is simple. We convert all input paths into the 'normal' normal form (NFC),
using utf8proc.
http://www.public-software-group.org/utf8proc

I made a quick-and-dirty proof-of-concept patch for the further discussion.

If you run Apache + mod_dav_svn with this patch,
NFD filenames in commits from an svn client without this patch will be
converted to NFC.

This patch has the following limitations right now, but we can fix them.
- It does not handle all input paths, only two:
  one in mod_dav_svn open_stream, one in svn_path_cstring_to_utf8.
- Error handling is not yet implemented.
- The configure script should be modified to link against the
  utf8proc library.
  Currently it needs EXTRA_LDFLAGS=-lutf8proc when running make.


To test this patch, please follow the steps below.

(1) Build and install utf8proc.
The example below is for Scientific Linux 6.1 x86_64.
Currently I install utf8proc into system library locations (/usr/include
and /usr/lib64), not places like /usr/local/include and /usr/local/lib64,
just because I don't want to bother with modifying the configure script
right now.

wget http://www.public-software-group.org/pub/projects/utf8proc/v1.1.5/utf8proc-v1.1.5.tar.gz
tar xf utf8proc-v1.1.5.tar.gz
cd utf8proc-v1.1.5
make c-library
sudo install -m 644 libutf8proc.so /usr/lib64/libutf8proc.so.1.1.5
sudo ln -s libutf8proc.so.1.1.5 /usr/lib64/libutf8proc.so.1
sudo ln -s libutf8proc.so.1 /usr/lib64/libutf8proc.so
sudo install -m 644 utf8proc.h /usr/include

(2) build Subversion 1.7.2 with this patch.
cd subversion-1.7.2
patch -p1 < ../subversion-1.7.2-NFC.diff
./configure
EXTRA_LDFLAGS=-lutf8proc make
sudo make install

One thing I'd like to discuss is how we link to utf8proc.
There are two options.
(1) Install utf8proc as a shared library and modify the configure script to
 have a --with-utf8proc option.
(2) Copy the utf8proc source files into the Subversion source directories and
 link statically (like sqlite-amalgamation).

Option (1) needs a utf8proc package to be created for each OS distribution
and the dependencies of the subversion package to be modified. I think this
is the ideal way, but it is a lot of work. I think option (2) is easier:
put the utf8proc source files in the Subversion source tarballs.

Am I on the right track?
Let's discuss and fix this problem and we will be happy ever after!

-- 
(Hiroaki Nakamura) hnaka...@gmail.com


 subversion-1.7.2-NFC.diff
diff -ruN subversion-1.7.2.orig/subversion/include/svn_utf.h subversion-1.7.2/subversion/include/svn_utf.h
--- subversion-1.7.2.orig/subversion/include/svn_utf.h  2009-11-17 04:07:17.0 +0900
+++ subversion-1.7.2/subversion/include/svn_utf.h   2012-01-29 11:54:20.150665621 +0900
@@ -220,6 +220,14 @@
  const svn_string_t *src,
  apr_pool_t *pool);

+/** Set @a *dest to a NFC canonicalized C string from string @a src;
+ * allocate @a *dest in @a pool.
+ */
+svn_error_t *
+svn_utf_cstring_NFC(const char **dest,
+const char *src,
+apr_pool_t *pool);
+
 #ifdef __cplusplus
 }
 #endif /* __cplusplus */
diff -ruN subversion-1.7.2.orig/subversion/libsvn_subr/path.c subversion-1.7.2/subversion/libsvn_subr/path.c
--- subversion-1.7.2.orig/subversion/libsvn_subr/path.c 2011-01-18 06:45:39.0 +0900
+++ subversion-1.7.2/subversion/libsvn_subr/path.c  2012-01-29 18:01:06.900398904 +0900
@@ -1119,15 +1119,17 @@
  const char *path_apr,
  apr_pool_t *pool)
 {
+  char *path_nfc;
+  SVN_ERR(svn_utf_cstring_NFC(&path_nfc, path_apr, pool));
   svn_boolean_t path_is_utf8;
   SVN_ERR(get_path_encoding(&path_is_utf8, pool));
   if (path_is_utf8)
 {
-  *path_utf8 = apr_pstrdup(pool, path_apr);
+  *path_utf8 = apr_pstrdup(pool, path_nfc);
   return SVN_NO_ERROR;
 }
   else
-return svn_utf_cstring_to_utf8(path_utf8, path_apr, pool);
+return svn_utf_cstring_to_utf8(path_utf8, path_nfc, pool);
 }


diff -ruN subversion-1.7.2.orig/subversion/libsvn_subr/utf.c subversion-1.7.2/subversion/libsvn_subr/utf.c
--- subversion-1.7.2.orig/subversion/libsvn_subr/utf.c  2011-08-24 00:04:38.0 +0900
+++ subversion-1.7.2/subversion/libsvn_subr/utf.c   2012-01-29 17:55:33.643895922 +0900
@@ -42,6 +42,7 @@
 #include "private/svn_utf_private.h"
 #include "private/svn_dep_compat.h"
 #include "private/svn_string_private.h"
+#include <utf8proc.h>

 

@@ -1029,3 +1030,58 @@

   return err;
 }
+
+static ssize_t svn_utf_map(
+  const uint8_t *str, ssize_t len, uint8_t **dstptr, int options,
+  apr_pool_t *pool
+) {
+  int32_t