On Sun, Jan 29, 2012 at 07:38:44PM +0900, Hiroaki Nakamura wrote:
> Hi folks!
>
> I read the note about unicode compositions for filenames
> http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
> and would like to drive the discussion.

Hi,

I am very happy to hear that you want to work towards getting this
problem fixed. Thank you for your help!

I've just re-read the unicode-composition-for-filenames notes.
I think they are a bit outdated. For instance, they still talk about
the 1.6 working copy format. They also don't clearly explain the
problems with backwards compatibility we're facing here.

We won't be able to apply your patch as it is. The problem is that it
can break operation for some existing repositories and working copies.

Generally, I think that writing code that implements a solution for
this problem is not hard, no matter what the solution is.
The real challenge lies in finding a solution that is backwards
compatible with existing repositories and working copies.
I will explain what I mean by giving examples below.

But first, let's recap the basic problem, if only so others can more
easily follow this discussion.

As you know, in Unicode, some characters can be represented in two
distinct ways: pre-composed form (NFC) and de-composed form (NFD).
For instance, the letter ä (a umlaut) can be represented by Unicode
code point 0x00E4 (ä), which is the pre-composed form, or by
code point 0x0061 (a) followed by code point 0x0308 (the combining
diaeresis), which is the de-composed form.

This is a basic property of Unicode. It simply contains both ways of
representing these characters in its character tables. I.e. any
byte-string representation of Unicode, be it UTF-8 or UTF-16, must also
be able to represent both ways of encoding such characters.

So when filenames are given in Unicode, a filename may contain any
combination of NFC and NFD characters. Because Subversion never
normalises filenames to one form or the other, the space of all
possible filenames in a Subversion repository or working copy contains
a large amount of redundancy. There are many filenames which look the
same to the user but differ in terms of the Unicode code points used
to represent them.
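
To make the two representations concrete, here is a small Python
sketch using only the standard unicodedata module. It is not
Subversion code, and the helper name same_name is made up for
illustration.

```python
import unicodedata

# Two spellings of "a umlaut". Escapes are used so that an editor
# cannot silently normalise the literals.
composed = "\u00e4"      # pre-composed form (NFC): one code point
decomposed = "a\u0308"   # de-composed form (NFD): 'a' + combining diaeresis


def same_name(a, b):
    """Hypothetical helper: compare two names after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)          # False: the code points differ
print(composed.encode("utf-8"))        # b'\xc3\xa4'
print(decomposed.encode("utf-8"))      # b'a\xcc\x88'
print(same_name(composed, decomposed)) # True: same name to the user
```

A plain byte-wise or code-point-wise comparison treats the two
spellings as different names. Only a comparison made after
normalisation treats them as equal.
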
For instance, imagine a filename containing 3 "a umlaut" characters and
otherwise only characters from the ASCII set. There are 8 (2^3)
different ways of representing this filename in Unicode, and hence
8 different UTF-8 byte strings which can be used in the repository
or working copy to represent what is, from the user's point of view,
the same filename.

The problem we have on Mac OS X is that when we write any of these
8 different byte strings to the filesystem to name the file, and later
read the filename back from the filesystem (e.g. by opening the parent
directory and asking for a list of files it contains), we will always
receive the name with all "a umlaut" characters expanded to
de-composed form.

Now, in the working copy meta-data (.svn/wc.db) we can use any of the
8 forms of the filename. If we don't use NFD for all characters in the
filename, the filename read from disk may fail to match any name
stored in meta-data.

Let's simplify the discussion a bit by assuming only two possible ways
of encoding a filename: One with all characters normalised to NFC,
and one with all characters normalised to NFD. We don't really need to
consider the mixed forms for the purpose of this discussion (though it
helps to keep in mind that they exist).

So let's talk about what would happen if we applied your patch.

Let's say I have a working copy which contains filenames normalised
to NFD, as is the case on Mac OS X. The server gets upgraded to a new
release of Subversion which contains your patch. This means the server
will now send all paths as NFC. Let's say there are changes made to
a file which has 3 "a umlaut" characters in its name.
When I run 'svn update' my client will try to find the NFC form of the
name in its meta-data, and fail to locate it because the file was
stored as NFD.

So this means your patch will break compatibility with the working
copy. Therefore, we would need to provide an upgrade path for those
working copies. E.g.
'svn upgrade' could be modified to normalise all filenames stored
in the DB to NFC. Problem solved.

But now comes the next problem. Given a filename in NFC which we read
from meta-data, how can we locate the corresponding on-disk file if
its form is not NFC?

We could of course rename the on-disk file. Except this won't work
on Mac OS X unless we decide to use NFD encoding. So we could decide
to also use NFD everywhere -- but this would break as soon as some
other operating system decides to normalise to NFC, so it's not
a good solution.

We could also open the parent directory, read all the filenames within
it, normalise them all, and then search the resulting list.
This works, except if a name exists twice, once in NFC form and once
in NFD form. We'd somehow have to solve the name collision in the
filesystem.

But well, let's assume we had a way of storing NFC in meta-data and not
caring about the on-disk form. Now things get even more complicated.

My friend is not willing to upgrade to a new client version yet,
which is fine because all 1.x releases of Subversion clients are
supposed to be compatible with all 1.y releases of Subversion servers.
He should not have to upgrade his client just because the server was
upgraded.

In his working copy, the file name is also in NFD form. When he talks
to the server, the server provides the name in NFC. Because he is
using the old client, it has no way of knowing how to map the NFC name
to its local NFD file. So we've broken backwards compatibility for my
friend.

But it gets worse. Recall the filesystem name collision problem
mentioned above. This problem can also happen in the repository
filesystem! For instance, assume that in the repository there already
exist two filenames, one NFD, the other NFC, but they both are
actually the same name. This currently works fine, except on Mac OS X.
What should be done now when the server is upgraded to normalise all
paths to NFC?
How can we still access content of the file which has the name
in NFD form? Should one of the files be renamed in the HEAD revision?
Or all historic revisions? Or removed from history? How do we help
users carrying out such upgrades, without breaking existing working
copies used by older clients which do not know anything about the
NFC/NFD problem?

These are the questions which we'll need to answer to solve this issue.
I honestly do not have good answers. I hope that you will find ways of
solving these problems. There may even be more problems hidden here
which I haven't thought of yet. It will be quite hard to thoroughly
make sure that no unforeseen problems will arise when this issue gets
fixed one way or another. A good solution needs to be carefully
planned, implemented, and thoroughly tested.

I think the following caveats would be acceptable if they help with
fixing the issue:

- An upgrade path which optionally requires people to check all
  working copies out again, when either the server or the client is
  upgraded. Note again, this must be _optional_. Only people affected
  by the issue should have to make this choice, e.g. by changing
  configuration parameters from the default settings.
  By default, existing working copies should keep working after
  upgrading the client or server. Because imagine what would happen if
  an upgrade of the server broke many working copies checked out from
  a hosting service such as sourceforge.net -- not good.

- An upgrade path which requires everyone to run 'svn upgrade' on
  their working copies in order to use the new client version,
  but not the new server version.

- An upgrade path which requires people to dump/load their existing
  repositories in order to get rid of the problem. Existing
  repositories which are left alone should keep working as they do
  today, with problems on Mac OS X clients but no problems on other
  clients (anything else would cause too much breakage and confusion).
  E.g.
  this step could normalise all paths in all revisions.
  But keep in mind the problem of name collisions which can happen
  when the same name exists as both NFC and NFD. Something needs
  to happen in this case to resolve the problem, ideally giving
  users a choice about how to proceed.

As you can see, there is a lot of complexity involved in fixing this
issue. I hope you aren't discouraged by this. Someone will need to
explore the details of these problems to fix this issue.
I am not convinced that it is impossible to fix. We'll need to be very
careful about backwards compatibility when making decisions.
But there might be ways to achieve a satisfying solution nonetheless.
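
P.S. To illustrate the name collision case mentioned above, here is
a rough Python sketch of the check a normalising step would have to
make before rewriting any path. This is only an illustration under
my own assumptions, not Subversion code: the function name
find_collisions and the sample paths are made up.

```python
import unicodedata
from collections import defaultdict


def find_collisions(paths):
    """Group paths by their NFC form and return only the groups in
    which more than one distinct spelling maps to the same NFC name.
    (Hypothetical helper, for illustration only.)"""
    groups = defaultdict(set)
    for path in paths:
        groups[unicodedata.normalize("NFC", path)].add(path)
    return {nfc: sorted(forms)
            for nfc, forms in groups.items()
            if len(forms) > 1}


# Made-up sample paths: the first two look identical to the user.
paths = [
    "trunk/b\u00e4r.txt",    # name stored in NFC
    "trunk/ba\u0308r.txt",   # same name stored in NFD
    "trunk/README",          # pure ASCII, unaffected
]

collisions = find_collisions(paths)
for nfc, forms in collisions.items():
    # ascii() makes the otherwise invisible difference visible
    print(ascii(nfc), "<-", [ascii(form) for form in forms])
```

Any path reported here cannot simply be rewritten to NFC, because
the rewrite would merge two distinct entries into one. This is the
case where the user would have to be asked how to proceed.
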