Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
On 2014-05-05 23.46, Jeff King wrote: [] 2. Do all index filename comparisons under Mac OS X using a UTF-8 aware comparison function regardless if core.precomposeunicode is set. This would probably have bad performance, and somewhat defeats the point of converting the filenames at the readdir level in the first place. Right, I'm concerned about performance here, but I wonder if we can reuse the name-hash solutions from ignorecase. Some short question, (sorry being short) What do you think about this: diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c index 95fe849..c807f38 100644 --- a/compat/precompose_utf8.c +++ b/compat/precompose_utf8.c @@ -57,6 +57,22 @@ void probe_utf8_pathname_composition(char *path, int len) } +char *inverscompose_str_len(const char *in, size_t insz, int *outsz) +{ + char *prec_str = NULL; + + if (has_non_ascii(in, insz, NULL)) { + int old_errno = errno; + prec_str = reencode_string_len(in, (int)insz, + precomposed_unicode ? path_encoding : repo_encoding, + precomposed_unicode ? repo_encoding : path_encoding, + outsz); + errno = old_errno; + } + return prec_str; +} + + void precompose_argv(int argc, const char **argv) { int i = 0; diff --git a/compat/precompose_utf8.h b/compat/precompose_utf8.h index 3b73585..b9ac099 100644 --- a/compat/precompose_utf8.h +++ b/compat/precompose_utf8.h @@ -26,6 +26,7 @@ typedef struct { struct dirent_prec_psx *dirent_nfc; } PREC_DIR; +char *inverscompose_str_len(const char *in, size_t insz, int *outsz); void precompose_argv(int argc, const char **argv); void probe_utf8_pathname_composition(char *, int); diff --git a/git-compat-util.h b/git-compat-util.h index f6d3a46..5ae5f57 100644 --- a/git-compat-util.h +++ b/git-compat-util.h @@ -177,7 +177,7 @@ typedef unsigned long uintptr_t; #ifdef PRECOMPOSE_UNICODE #include compat/precompose_utf8.h #else -#define precompose_str(in,i_nfd2nfc) +#define inverscompose_str_len(s,i,o) NULL #define precompose_argv(c,v) #define probe_utf8_pathname_composition(a,b) #endif diff --git a/name-hash.c b/name-hash.c index 97444d0..3c4a4ee 100644 --- a/name-hash.c +++ b/name-hash.c @@ -210,7 +210,7 @@ struct cache_entry *index_dir_exists(struct index_state *istate, const char *nam return NULL; } -struct cache_entry *index_file_exists(struct index_state *istate, const char *name, int namelen, int icase) +static struct cache_entry *index_file_exists_int(struct index_state *istate, const char *name, int namelen, int icase) { struct cache_entry *ce; struct hashmap_entry key; @@ -227,6 +227,25 @@ struct cache_entry *index_file_exists(struct index_state *istate, const char *na return NULL; } + +struct cache_entry *index_file_exists(struct index_state *istate, const char *name, int namelen, int icase) +{ + struct cache_entry *ce; + ce = index_file_exists_int(istate, name, namelen, icase); + if (ce) + return ce; + else { + int prec_len; + char *prec_name = inverscompose_str_len(name, namelen, prec_len); + if (prec_name) { + ce = index_file_exists_int(istate, prec_name, prec_len, icase); + free(prec_name); + } + } + return ce; +} + + void free_name_hash(struct index_state *istate) { if (!istate-name_hash_initialized) diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh index e4ba601..4414658 100755 --- a/t/t3910-mac-os-precompose.sh +++ b/t/t3910-mac-os-precompose.sh @@ -140,6 +140,22 @@ test_expect_success Add long precomposed filename ' git add * git commit -m Long filename ' + +test_expect_success 'handle existing decomposed filenames' ' + git checkout master + git reset --hard + git checkout -b exist_decomposed + echo content verbatim.$Adiarnfd + git -c core.precomposeunicode=false add verbatim.$Adiarnfd + git commit -m existing decomposed file + echo \verbatim.A\\314\\210\ expect + git ls-files actual + test_cmp expect actual + expect + git status | grep verbatim || : actual + test_cmp expect actual +' + # Test if the global core.precomposeunicode stops autosensing # Must be the last test case test_expect_success respect git config --global core.precomposeunicode ' diff --git a/utf8.c b/utf8.c index 77c28d4..fb8a99a 100644 --- a/utf8.c +++ b/utf8.c @@ -519,7 +519,7 @@ char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv, int *outs char *out, *outpos; iconv_ibp cp; - outsz = insz; + outsz = 3*insz; /* for decompose */ outalloc = outsz + 1; /* for terminating NUL */
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
On Mon, May 5, 2014 at 11:46 PM, Jeff King p...@peff.net wrote: On Sun, May 04, 2014 at 08:13:15AM +0200, Torsten Bögershausen wrote: 1. Tell everyone that NFD in the git repo is wrong, and they should make a new commit to normalize all their in-repo files to be precomposed. This is probably not the right thing to do, because it still doesn't fix checkouts of old history. And it spreads the problem to people on byte-preserving filesystems (like ext4), because now they have to start precomposing their filenames as they are adde to git. (typo: ^added) I'm not sure if I follow. People running ext4 (or Linux in general, or Windows, or Unix) do not suffer from file system feature of Mac OS, which accepts precomposed/decomposed Unicode but returns decompomsed. What I mean by spreads the problem is that git on Linux does not need to care about utf8 at all. It treats filenames as a byte sequence. But if we were to start enforcing filenames should be precomposed utf8, then people adding files on Linux would want to enforce that, too. People on Linux could ignore the issue as they do now, but they would then create problems for OS X users if they add decomposed filenames. IOW, if the OS X code assumes all repo filenames are precomposed, then other systems become a possible vector for violating that assumption. FWIW, Git for Windows also doesn't deal with that filenames are just a byte-sequence-notion. We have patches in place that require filenames to *either* be valid UTF-8 or Windows-1252. Windows itself treats filenames as Unicode characters, not arbitrary byte sequences. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
On Sun, May 04, 2014 at 08:13:15AM +0200, Torsten Bögershausen wrote: 1. Tell everyone that NFD in the git repo is wrong, and they should make a new commit to normalize all their in-repo files to be precomposed. This is probably not the right thing to do, because it still doesn't fix checkouts of old history. And it spreads the problem to people on byte-preserving filesystems (like ext4), because now they have to start precomposing their filenames as they are adde to git. (typo: ^added) I'm not sure if I follow. People running ext4 (or Linux in general, or Windows, or Unix) do not suffer from file system feature of Mac OS, which accepts precomposed/decomposed Unicode but returns decompomsed. What I mean by spreads the problem is that git on Linux does not need to care about utf8 at all. It treats filenames as a byte sequence. But if we were to start enforcing filenames should be precomposed utf8, then people adding files on Linux would want to enforce that, too. People on Linux could ignore the issue as they do now, but they would then create problems for OS X users if they add decomposed filenames. IOW, if the OS X code assumes all repo filenames are precomposed, then other systems become a possible vector for violating that assumption. 3. Convert index filenames to their precomposed form when we read the index from disk. This would be efficient, but we would have to be careful not to write the precomposed forms back out to disk. Question: How could we be careful? Mac OS writes always decomposed Unicode to disk. (And all other OS tend to use precomposed forms, mainly because the keyboard driver generates it.) Sorry, I should have been more clear here. I meant do not write index entries using the precomposed forms out to the on-disk index. Because that would mean that git silently converts your filenames, and it would look like you have changes to commit whenever you read in a tree with a decomposed name. Looking over the patch you sent earlier, I suspect that is part of its problem (it stores the converted name in the index entry's name field). This is my understanding: Some possible fixes are: 1. Accept that NFD in a Git repo which is shared between Mac OS and Linux or Windows is problematic. Whenever core.precomposeunicode = true, do the following: Let Git under Mac OS change all file names in the index into the precomposed form when a new commit is done. This is probably not a wrong thing to do. When the index file is read into memory, precompose the file names and compare them with the precomposed form coming from precompose_utf8_readdir(). This avoids decomposed file names to be reported as untracked by git status. This is the case I was specifically thinking of above (and I think what your patch is doing). 2. Do all index filename comparisons under Mac OS X using a UTF-8 aware comparison function regardless if core.precomposeunicode is set. This would probably have bad performance, and somewhat defeats the point of converting the filenames at the readdir level in the first place. Right, I'm concerned about performance here, but I wonder if we can reuse the name-hash solutions from ignorecase. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
(Sorry for the late reply, I'm handicapped by some local IT problems) Peff, do you know if the fix below helps ? On 2014-04-28 18.16, Jeff King wrote: If you have existing decomposed filenames in your git repository (e.g., that were created with older versions of git that did not precompose unicode), a modern git with core.precomposeunicode set does not handle them well. Yes. The problem is that we normalize the paths coming from the disk into their precomposed form, and then compare them against the literal bytes in the index. This makes things better if you have the precomposed form in the index. It makes things worse if you actually have the decomposed form in the index. As a result, paths with decomposed filenames may have their precomposed variants listed as untracked files (even though the precomposed variants do not exist on-disk at all). Yes This patch just adds a test to demonstrate the breakage. Some possible fixes are: 1. Tell everyone that NFD in the git repo is wrong, and they should make a new commit to normalize all their in-repo files to be precomposed. This is probably not the right thing to do, because it still doesn't fix checkouts of old history. And it spreads the problem to people on byte-preserving filesystems (like ext4), because now they have to start precomposing their filenames as they are adde to git. (typo: ^added) I'm not sure if I follow. People running ext4 (or Linux in general, or Windows, or Unix) do not suffer from file system feature of Mac OS, which accepts precomposed/decomposed Unicode but returns decompomsed. 2. Do all index filename comparisons using a UTF-8 aware comparison function when core.precomposeunicode is set. This would probably have bad performance, and somewhat defeats the point of converting the filenames at the readdir level in the first place. 3. Convert index filenames to their precomposed form when we read the index from disk. This would be efficient, but we would have to be careful not to write the precomposed forms back out to disk. Question: How could we be careful? Mac OS writes always decomposed Unicode to disk. (And all other OS tend to use precomposed forms, mainly because the keyboard driver generates it.) 4. Introduce some infrastructure to efficiently match up the precomposed/decomposed forms. We already do something similar for case-insensitive files using name-hash.c. We might be able to adapt that strategy here. -- This is my understanding: Some possible fixes are: 1. Accept that NFD in a Git repo which is shared between Mac OS and Linux or Windows is problematic. Whenever core.precomposeunicode = true, do the following: Let Git under Mac OS change all file names in the index into the precomposed form when a new commit is done. This is probably not a wrong thing to do. When the index file is read into memory, precompose the file names and compare them with the precomposed form coming from precompose_utf8_readdir(). This avoids decomposed file names to be reported as untracked by git status. 2. Do all index filename comparisons under Mac OS X using a UTF-8 aware comparison function regardless if core.precomposeunicode is set. This would probably have bad performance, and somewhat defeats the point of converting the filenames at the readdir level in the first place. (The TC is good) Signed-off-by: Jeff King p...@peff.net --- t/t3910-mac-os-precompose.sh | 10 ++ 1 file changed, 10 insertions(+) diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh index e4ba601..23aa61e 100755 --- a/t/t3910-mac-os-precompose.sh +++ b/t/t3910-mac-os-precompose.sh @@ -140,6 +140,16 @@ test_expect_success Add long precomposed filename ' git add * git commit -m Long filename ' + +test_expect_failure 'handle existing decomposed filenames' ' + echo content verbatim.$Adiarnfd + git -c core.precomposeunicode=false add verbatim.$Adiarnfd + git commit -m existing decomposed file + expect + git ls-files --exclude-standard -o verbatim* untracked + test_cmp expect untracked +' + # Test if the global core.precomposeunicode stops autosensing # Must be the last test case test_expect_success respect git config --global core.precomposeunicode ' On top of the we need to following: diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c index 95fe849..5ed3d17 100644 --- a/compat/precompose_utf8.c +++ b/compat/precompose_utf8.c @@ -57,6 +57,19 @@ void probe_utf8_pathname_composition(char *path, int len) } +char *precompose_str_len(const char *in, size_t insz, int *outsz) +{ +char *prec_str = NULL; +if (precomposed_unicode != 1) +return NULL; +
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
On 2014-04-30 16.57, Torsten Bögershausen wrote: There is something wrong with the patch, (test suite hangs or TC fail), so I need to com back later. Sorry for the noise. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
On 29.04.14 20:02, Jeff King wrote: On Tue, Apr 29, 2014 at 10:12:52AM -0700, Junio C Hamano wrote: Jeff King p...@peff.net writes: This patch just adds a test to demonstrate the breakage. Some possible fixes are: 1. Tell everyone that NFD in the git repo is wrong, and they should make a new commit to normalize all their in-repo files to be precomposed. This is probably not the right thing to do, because it still doesn't fix checkouts of old history. And it spreads the problem to people on byte-preserving filesystems (like ext4), because now they have to start precomposing their filenames as they are adde to git. Hmm, have we taught the compare precomposed for codepaths that compare two trees and a tree and the index, too? Otherwise, we would have the same issue with commits in the old history. Ugh, yeah, I didn't think about that codepath. I think we would not want to precompose in that case. IOW, git works byte-wise internally, but it is only at the filesystem layer that we do such munging. The index straddles the line between the filesystem and git's internal representations. [snip] Please allow me to answer on this post- I made a suggestion here: https://github.com/tboegi/git/commit/85305ce306cb88a07dad6350d6ba8c5f2f817af6 The new test in t3910 passes, but the test suite hangs somewhere, there is a whitespace in precompose_utf8.c, so I don't know what to say. But in case someone wants to make a code review: commit 85305ce306cb88a07dad6350d6ba8c5f2f817af6 Author: Torsten Bögershausen tbo...@web.de Date: Wed Apr 30 10:30:04 2014 +0200 core.precomposeunicode with decomposed filenames Commit 750b2e4785e shows a failure of core.precomposeunicode when decomposed filenames are in the index. When decomposed file names are in the index and readdir() converts them into the decomposed form, Git status will report the precomposed file name as untracked. Solution: Precompose file names when reading the index file from disc into memory. diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c index 95fe849..40ebc2e 100644 --- a/compat/precompose_utf8.c +++ b/compat/precompose_utf8.c @@ -57,6 +57,19 @@ void probe_utf8_pathname_composition(char *path, int len) } +char *precompose_str_len(const char *in, size_t insz, int *outsz) +{ + char *prec_str = NULL; + if (precomposed_unicode != 1) + return NULL; + + if (has_non_ascii(in, insz, NULL)) + prec_str = reencode_string_len(in, insz, repo_encoding, path_encoding, outsz); + + return prec_str; +} + + void precompose_argv(int argc, const char **argv) { int i = 0; diff --git a/compat/precompose_utf8.h b/compat/precompose_utf8.h index 3b73585..28f6595 100644 --- a/compat/precompose_utf8.h +++ b/compat/precompose_utf8.h @@ -26,6 +26,7 @@ typedef struct { struct dirent_prec_psx *dirent_nfc; } PREC_DIR; +char *precompose_str_len(const char *in, size_t insz, int *outsz); void precompose_argv(int argc, const char **argv); void probe_utf8_pathname_composition(char *, int); diff --git a/git-compat-util.h b/git-compat-util.h index d493a8c..de117d1 100644 --- a/git-compat-util.h +++ b/git-compat-util.h @@ -180,7 +180,7 @@ typedef unsigned long uintptr_t; #ifdef PRECOMPOSE_UNICODE #include compat/precompose_utf8.h #else -#define precompose_str(in,i_nfd2nfc) +#define precompose_str_len(s,i,o) NULL #define precompose_argv(c,v) #define probe_utf8_pathname_composition(a,b) #endif diff --git a/read-cache.c b/read-cache.c index 4b4effd..0887835 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1330,7 +1330,7 @@ static inline uint32_t ntoh_l_force_align(void *p) #define ntoh_l(var) ntoh_l_force_align((var)) #endif -static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk, +static struct cache_entry *cache_entry_from_ondisk_int(struct ondisk_cache_entry *ondisk, unsigned int flags, const char *name, size_t len) @@ -1355,6 +1355,22 @@ static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *on return ce; } +static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk, +unsigned int flags, +const char *name, +size_t len) +{ + int prec_len; + char *prec_name = precompose_str_len(name, len, prec_len); + if (prec_name) { + struct cache_entry *ce; + ce = cache_entry_from_ondisk_int(ondisk, flags, prec_name, prec_len); + free(prec_name); + return ce; + } + return
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
On 04/29/2014 05:23 AM, Jeff King wrote: On Mon, Apr 28, 2014 at 10:49:30PM +0200, Torsten Bögershausen wrote: OK, thanks for the description. In theory we can make Git composition ignoring by changing index_file_exists() in name-hash.c. (Both names must be precomposed first and compared then) Yeah, we could perhaps get away without storing the extra precomposed form if we just stored the entries under their precomposed hash, and then taught same_name to use a slower precompose-aware comparison. But I also see that we do binary searches in index_name_pos (called by index_name_is_other). I don't know if we'd have to deal with this problem there, too. Just loud thinking: We precompose whenever we read file names from disc, that's done in readdir() We precompose whenever we get an argv into Git, that's done in precompose_argv() We precompose every time we read file names from the index file on disc(?) into memory. That we don't do today, and my suggestion to hack name-hash.c isn't a good one. Probably we need to go into read-cache.c, or more places. I'm not an expert here knowing all Git internal details. Basically all places where strings containing file names are put into memory are effected, and I wouldn't be too concerned about CPU cycles. I don't know how much people are using Git before 1.7.12 (the first version supporting precomposed unicode). Could we simply ask them to upgrade ? I'm not sure what you mean here. Upgrading won't help, because the values are baked into existing history created with the old versions forever. Any time I git checkout v1.0 on the broken project, a modern git will complain about the ghost untracked file. It depends if all file names in a certain repo are stored decomposed, (in this case everybody can set core.precomposeunicode false) or if there is a mixture having precomposed and decomposed in different comits/directories... In this case we can normalize the master branch. For older commit the users need to wait for an updated Git version, until that they need to live with the ghosts as good as they can. The next problem is that people need to agree if the repo should store names in pre- or decomposed form. (My voice is for precomposed) Unfortunatly the core.precomposeunicode is repo-local, so everybody needs to agree globally and configure locally. Yeah, that was sort of my point 1 from the patch. I'm a bit worried that it creates problems for people on other systems, though. Linux people do not need to care about precomposed/decomposed at all, and any attempt we make to automatically handle it is going to run afoul of non-utf8 encodings. Not to mention that it does not solve the git checkout v1.0 problem above. Not sure what is meant by non-utf8 encodings. Mac OS X can only handle Unicode filenames, and a single ISO-8859-1 will be returned as %XY from readdir(). So if you want to share a repo with Mac OS X (and/or Windows) Unicode should be used. Are you saying that there is a Linux station feeding in file names in e.g. 8859-1, EUC ? My experience/knowledge is that you can not do that (in a useful way). Side note: I which we had this config variable travelling with the repo, like .gitattributes does for text dealing with CRLF-LF. Yeah, I guess if we wanted to enforce it everywhere, you would have to use .gitattributes since we cannot safely turn on such a feature by default. I don't know how many reports you have, reading all this it feels as if the effected users could normalize their repos and run git config core.precomposeunicode true, followed by git config --global core.precomposeunicode true. Does that sound like a possible way forward ? I have just a handful of reports. Maybe 3-4? I didn't dig them all up today, as it would be a non-trivial effort. But I have no idea how good a sampling that is. The following could help, may be: git -c core.quotepath=false ls-files | iconv -f UTF-8-MAC -t UTF-8 expected git -c core.quotepath=false ls-files actual diff expected actual -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
Jeff King p...@peff.net writes: This patch just adds a test to demonstrate the breakage. Some possible fixes are: 1. Tell everyone that NFD in the git repo is wrong, and they should make a new commit to normalize all their in-repo files to be precomposed. This is probably not the right thing to do, because it still doesn't fix checkouts of old history. And it spreads the problem to people on byte-preserving filesystems (like ext4), because now they have to start precomposing their filenames as they are adde to git. Hmm, have we taught the compare precomposed for codepaths that compare two trees and a tree and the index, too? Otherwise, we would have the same issue with commits in the old history. Do we have a similar issue for older commit in a history under ignore-case as well? -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
On Tue, Apr 29, 2014 at 10:12:52AM -0700, Junio C Hamano wrote: Jeff King p...@peff.net writes: This patch just adds a test to demonstrate the breakage. Some possible fixes are: 1. Tell everyone that NFD in the git repo is wrong, and they should make a new commit to normalize all their in-repo files to be precomposed. This is probably not the right thing to do, because it still doesn't fix checkouts of old history. And it spreads the problem to people on byte-preserving filesystems (like ext4), because now they have to start precomposing their filenames as they are adde to git. Hmm, have we taught the compare precomposed for codepaths that compare two trees and a tree and the index, too? Otherwise, we would have the same issue with commits in the old history. Ugh, yeah, I didn't think about that codepath. I think we would not want to precompose in that case. IOW, git works byte-wise internally, but it is only at the filesystem layer that we do such munging. The index straddles the line between the filesystem and git's internal representations. I think my keep the normalized names alongside index entries approach might still work there. But it means that we compare against the real byte-wise names on the tree side, and against the normalized names on the path side. But that means having two comparison/lookup functions for the index, and always using the right one. And algorithms that rely on traversing two sorted lists cannot work in both directions. Do we have a similar issue for older commit in a history under ignore-case as well? I don't think so, because we handle ignorecase completely differently. There we use the name-hash with a case-insensitive hash and a case-insensitive comparison function. And we use strcasecmp liberally throughout the code. I don't think we have a str_utf8_cmp that ignores normalizations (or maybe strcoll will do this?). But in theory we could use it everywhere we use strcasecmp for ignore_case. And then we would not need to have our readdir wrapper, maybe? I admit I haven't thought that much about _either_ approach. But aside from some bugs in the hash system, I do not recall seeing any design problems in the ignorecase code. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
Jeff King p...@peff.net writes: I don't think we have a str_utf8_cmp that ignores normalizations (or maybe strcoll will do this?). But in theory we could use it everywhere we use strcasecmp for ignore_case. And then we would not need to have our readdir wrapper, maybe? I admit I haven't thought that much about _either_ approach. But aside from some bugs in the hash system, I do not recall seeing any design problems in the ignorecase code. Our diffs and merges depend on walking two (or more) sorted lists, and that sort order is baked in the tree objects when they are created. Using normalized comparison only when comparing the earliest elements picked from these sorted lists would not give you the correct comparison or merge results, would it? -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
On Tue, Apr 29, 2014 at 11:49:30AM -0700, Junio C Hamano wrote: Jeff King p...@peff.net writes: I don't think we have a str_utf8_cmp that ignores normalizations (or maybe strcoll will do this?). But in theory we could use it everywhere we use strcasecmp for ignore_case. And then we would not need to have our readdir wrapper, maybe? I admit I haven't thought that much about _either_ approach. But aside from some bugs in the hash system, I do not recall seeing any design problems in the ignorecase code. Our diffs and merges depend on walking two (or more) sorted lists, and that sort order is baked in the tree objects when they are created. Using normalized comparison only when comparing the earliest elements picked from these sorted lists would not give you the correct comparison or merge results, would it? Right, but we do not do normalized comparisons on that side. Not for precompose, and not for ignorecase. The entry in the index _is_ case-sensitive[1], and we compare it as such to the tree side. It is only when comparing the filesystem side to the index that we need to care about such normalizations. And there we have the name-hash code to handle ignorecase, but nothing to handle precompose. -Peff [1] This works because you typically get the case-sensitive entry via `git read-tree`, and then further update it from the filesystem. If you were to add a new entry makefile, and somebody else added Makefile, they would conflict. When you do git add makefile and Makefile _does_ exist, I am not sure, though, if it is git who handles making sure we find the correct entry, or if it is simply the fact that case insensitive filesystems are typically case-preserving (so you generally ask for Makefile anyway). If it is the latter, then that is a problem for precompose. HFS's NFD normalization is non-preserving. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
Jeff King p...@peff.net writes: This patch just adds a test to demonstrate the breakage. Some possible fixes are: ... 2. Do all index filename comparisons using a UTF-8 aware comparison function when core.precomposeunicode is set. This would probably have bad performance, and somewhat defeats the point of converting the filenames at the readdir level in the first place. As we do this only when core.precomposeunicode is set, projects that use pathnames encoded not in UTF-8 (e.g. 8859-1, EUC, etc.) will not be affected by getting their path mangled, as long as we won't flip the default to true (which I am not suggesting to do, by the way). 3. Convert index filenames to their precomposed form when we read the index from disk. This would be efficient, but we would have to be careful not to write the precomposed forms back out to disk. I think this may be the right approach, especially if you are going to do this only when core.precomposeunicode is set. the reasoning behind we would have to be careful not to write part, is unclear to me, though. Don't decomposing filesystems perform the manglig from the precomposed form without even being asked to do so, just like a case insensitive filesystem will overwrite an existing makefile on a request to write to Makefile? 4. Introduce some infrastructure to efficiently match up the precomposed/decomposed forms. We already do something similar for case-insensitive files using name-hash.c. We might be able to adapt that strategy here. Signed-off-by: Jeff King p...@peff.net --- t/t3910-mac-os-precompose.sh | 10 ++ 1 file changed, 10 insertions(+) diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh index e4ba601..23aa61e 100755 --- a/t/t3910-mac-os-precompose.sh +++ b/t/t3910-mac-os-precompose.sh @@ -140,6 +140,16 @@ test_expect_success Add long precomposed filename ' git add * git commit -m Long filename ' + +test_expect_failure 'handle existing decomposed filenames' ' + echo content verbatim.$Adiarnfd + git -c core.precomposeunicode=false add verbatim.$Adiarnfd + git commit -m existing decomposed file + expect + git ls-files --exclude-standard -o verbatim* untracked + test_cmp expect untracked +' + # Test if the global core.precomposeunicode stops autosensing # Must be the last test case test_expect_success respect git config --global core.precomposeunicode ' -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
On Mon, Apr 28, 2014 at 12:17:28PM -0700, Junio C Hamano wrote: 3. Convert index filenames to their precomposed form when we read the index from disk. This would be efficient, but we would have to be careful not to write the precomposed forms back out to disk. I think this may be the right approach, especially if you are going to do this only when core.precomposeunicode is set. the reasoning behind we would have to be careful not to write part, is unclear to me, though. Don't decomposing filesystems perform the manglig from the precomposed form without even being asked to do so, just like a case insensitive filesystem will overwrite an existing makefile on a request to write to Makefile? Sorry, I meant do not write the precomposed forms back out to the on-disk index. And by extension, do not update cache-tree and write them out to git trees. IOW, it is not enough to just set cache_entry-name to the normalized form. You'd need to store both. Since such entries are in the minority, and because cache_entry is already a variable-length struct, I think you could get away with sticking it after the name field, and then comparing like: const char *ce_normalized_name(struct cache_entry *ce, size_t *len) { const char *ret; /* Normal, fast path */ if (!(ce-ce_flags CE_NORMALIZED_NAME)) { len = ce_namelen(ce); return ce-name; } /* Slow path for normalized names */ ret = ce-name + ce-namelen + 1; *len = strlen(name); return ret; } The strlen is probably OK since such paths are presumably in the minority (even for UTF-8 paths, we can avoid storing the extra copy if they do not need any normalization). Or we could get fancy and encode the length in front, but I am not sure it is worth the complexity. Anyway, the tricky part is then making sure that all cache_entry name comparisons use ce_normalized_name instead of ce-name. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
On 28.04.14 21:35, Jeff King wrote: On Mon, Apr 28, 2014 at 12:17:28PM -0700, Junio C Hamano wrote: 3. Convert index filenames to their precomposed form when we read the index from disk. This would be efficient, but we would have to be careful not to write the precomposed forms back out to disk. I think this may be the right approach, especially if you are going to do this only when core.precomposeunicode is set. the reasoning behind we would have to be careful not to write part, is unclear to me, though. Don't decomposing filesystems perform the manglig from the precomposed form without even being asked to do so, just like a case insensitive filesystem will overwrite an existing makefile on a request to write to Makefile? Sorry, I meant do not write the precomposed forms back out to the on-disk index. And by extension, do not update cache-tree and write them out to git trees. IOW, it is not enough to just set cache_entry-name to the normalized form. You'd need to store both. Since such entries are in the minority, and because cache_entry is already a variable-length struct, I think you could get away with sticking it after the name field, and then comparing like: const char *ce_normalized_name(struct cache_entry *ce, size_t *len) { const char *ret; /* Normal, fast path */ if (!(ce-ce_flags CE_NORMALIZED_NAME)) { len = ce_namelen(ce); return ce-name; } /* Slow path for normalized names */ ret = ce-name + ce-namelen + 1; *len = strlen(name); return ret; } The strlen is probably OK since such paths are presumably in the minority (even for UTF-8 paths, we can avoid storing the extra copy if they do not need any normalization). Or we could get fancy and encode the length in front, but I am not sure it is worth the complexity. Anyway, the tricky part is then making sure that all cache_entry name comparisons use ce_normalized_name instead of ce-name. -Peff To my knowledge repos with decomposed unicode should be rare in practice. I only can speak for european (or latin based) or cyrillic languages myself: - It is difficult (but not impossible) to enter decomposed unicode on the keyboard. - Some programs under Mac OS X do not handle decomposed code points well, an ä may be displayed as ¨a for example. - Pushing and pulling to Windows or Linux is possible, but the same problems here: the keyboard is not prepared to enter the decomposed form, and the display may be wrong. The only possible use case for decomposed unicode I am aware of is when you use git-bzr, because bzr does not do the precomposition (and neither hg to my knowledge). So for me the test case could sense, even if I think that nobody (TM) uses an old Git version under Mac OS X which is not able to handle precomposed unicode. Unless I have missed something. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
On Mon, Apr 28, 2014 at 09:52:07PM +0200, Torsten Bögershausen wrote: To my knowledge repos with decomposed unicode should be rare in practice. I only can speak for european (or latin based) or cyrillic languages myself: I've run across several cases in the past few months, but only just figured out what was going on. Most were tickets to GitHub support, but we actually have such a case in our github/github repository. In most cases, I think they were created on older versions of git on OS X, either before core.precomposeunicode existed, or before it was turned on by default. The decomposed form got baked into the tree (whatever the user originally typed, git probably found out about it via git add .). I think reports are just coming in now because we didn't start turning on core.precomposeunicode by default until v1.8.5, shipped in November. And then, a person working on the repository would not notice anything, since we only set the flag during clone. So it took time for people to upgrade _and_ to make fresh clones. So for me the test case could sense, even if I think that nobody (TM) uses an old Git version under Mac OS X which is not able to handle precomposed unicode. Even when they do not, the decomposed values are baked into history from those old versions. So it is a matter of history created with older versions not interacting well with newer versions. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
On 2014-04-28 22.03, Jeff King wrote: On Mon, Apr 28, 2014 at 09:52:07PM +0200, Torsten Bögershausen wrote: To my knowledge repos with decomposed unicode should be rare in practice. I only can speak for european (or latin based) or cyrillic languages myself: I've run across several cases in the past few months, but only just figured out what was going on. Most were tickets to GitHub support, but we actually have such a case in our github/github repository. In most cases, I think they were created on older versions of git on OS X, either before core.precomposeunicode existed, or before it was turned on by default. The decomposed form got baked into the tree (whatever the user originally typed, git probably found out about it via git add .). I think reports are just coming in now because we didn't start turning on core.precomposeunicode by default until v1.8.5, shipped in November. And then, a person working on the repository would not notice anything, since we only set the flag during clone. So it took time for people to upgrade _and_ to make fresh clones. OK, thanks for the description. In theory we can make Git composition ignoring by changing index_file_exists() in name-hash.c. (Both names must be precomposed first and compared then) I don't know how much people are using Git before 1.7.12 (the first version supporting precomposed unicode). Could we simply ask them to upgrade ? The next problem is that people need to agree if the repo should store names in pre- or decomposed form. (My voice is for precomposed) Unfortunatly the core.precomposeunicode is repo-local, so everybody needs to agree globally and configure locally. Side note: I which we had this config variable travelling with the repo, like .gitattributes does for text dealing with CRLF-LF. I don't know how many reports you have, reading all this it feels as if the effected users could normalize their repos and run git config core.precomposeunicode true, followed by git config --global core.precomposeunicode true. Does that sound like a possible way forward ? So for me the test case could sense, even if I think that nobody (TM) uses an old Git version under Mac OS X which is not able to handle precomposed unicode. Even when they do not, the decomposed values are baked into history from those old versions. So it is a matter of history created with older versions not interacting well with newer versions. I'm not sure if I understood all the details here, but I would be happy to help with suggestions/tests/reviews. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
On Mon, Apr 28, 2014 at 03:35:02PM -0400, Jeff King wrote: Since such entries are in the minority, and because cache_entry is already a variable-length struct, I think you could get away with sticking it after the name field, and then comparing like: const char *ce_normalized_name(struct cache_entry *ce, size_t *len) { const char *ret; /* Normal, fast path */ if (!(ce-ce_flags CE_NORMALIZED_NAME)) { len = ce_namelen(ce); return ce-name; } /* Slow path for normalized names */ ret = ce-name + ce-namelen + 1; *len = strlen(name); return ret; } That's the reading half. We would also need to create the normalized names for each cache_entry. I took a look at that this afternoon. It turns out we make cache_entry structs in quite a few places. So I thought I'd start with converting them all to a function like: struct cache_entry *cache_entry_alloc(const char *name, size_t len); And then once converted, we could teach it to normalize the name as appropriate. That interface does improve many of the callers, but there are a few tricky ones. For example, in checkout.c:update_some (and one or two other spots), we actually have the path broken into two parts, and we combine them while writing into the cache_entry. We could obviously combine them into a single buffer beforehand, but that means extra copying in reasonably hot code paths. It would be slightly ugly but perhaps reasonable to have: cache_entry_alloc_two(const char *one, size_t one_len, const char *two, size_t two_len); But then I got to unpack-trees. It has the whole path broken down across a linked list. I'm not sure what would be least terrible interface here. Again, we could format to a buffer and copy, but I'm hesitant to do it on this code path. I'm not sure if I'm just being paranoid about the extra memcpys (it's not _that_ tight a loop, since after all, we are generally zlib inflating to get the tree data, and the filenames are not all _that_ long). I dunno. I just hate the idea of tradeoffs for this OS-X-only fix permeating the rest of the code on other platforms. But maybe Knuth should be hitting me with his premature optimization clue-stick. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
On Mon, Apr 28, 2014 at 10:49:30PM +0200, Torsten Bögershausen wrote: OK, thanks for the description. In theory we can make Git composition ignoring by changing index_file_exists() in name-hash.c. (Both names must be precomposed first and compared then) Yeah, we could perhaps get away without storing the extra precomposed form if we just stored the entries under their precomposed hash, and then taught same_name to use a slower precompose-aware comparison. But I also see that we do binary searches in index_name_pos (called by index_name_is_other). I don't know if we'd have to deal with this problem there, too. I don't know how much people are using Git before 1.7.12 (the first version supporting precomposed unicode). Could we simply ask them to upgrade ? I'm not sure what you mean here. Upgrading won't help, because the values are baked into existing history created with the old versions forever. Any time I git checkout v1.0 on the broken project, a modern git will complain about the ghost untracked file. The next problem is that people need to agree if the repo should store names in pre- or decomposed form. (My voice is for precomposed) Unfortunatly the core.precomposeunicode is repo-local, so everybody needs to agree globally and configure locally. Yeah, that was sort of my point 1 from the patch. I'm a bit worried that it creates problems for people on other systems, though. Linux people do not need to care about precomposed/decomposed at all, and any attempt we make to automatically handle it is going to run afoul of non-utf8 encodings. Not to mention that it does not solve the git checkout v1.0 problem above. Side note: I which we had this config variable travelling with the repo, like .gitattributes does for text dealing with CRLF-LF. Yeah, I guess if we wanted to enforce it everywhere, you would have to use .gitattributes since we cannot safely turn on such a feature by default. I don't know how many reports you have, reading all this it feels as if the effected users could normalize their repos and run git config core.precomposeunicode true, followed by git config --global core.precomposeunicode true. Does that sound like a possible way forward ? I have just a handful of reports. Maybe 3-4? I didn't dig them all up today, as it would be a non-trivial effort. But I have no idea how good a sampling that is. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html