Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-05-07 Thread Torsten Bögershausen
On 2014-05-05 23.46, Jeff King wrote:
[]
   2. Do all index filename comparisons under Mac OS X using a UTF-8 aware
  comparison function regardless if core.precomposeunicode is set.
  This would probably have bad performance, and somewhat
  defeats the point of converting the filenames at the
  readdir level in the first place.
 
 Right, I'm concerned about performance here, but I wonder if we can
 reuse the name-hash solutions from ignorecase.
Some short question, (sorry being short)
What do you think about this:

diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
index 95fe849..c807f38 100644
--- a/compat/precompose_utf8.c
+++ b/compat/precompose_utf8.c
@@ -57,6 +57,22 @@ void probe_utf8_pathname_composition(char *path, int len)
 }
 
 
+char *inverscompose_str_len(const char *in, size_t insz, int *outsz)
+{
+   char *prec_str = NULL;
+
+   if (has_non_ascii(in, insz, NULL)) {
+   int old_errno = errno;
+   prec_str = reencode_string_len(in, (int)insz,
+  precomposed_unicode ? 
path_encoding : repo_encoding,
+  precomposed_unicode ? 
repo_encoding : path_encoding,
+  outsz);
+   errno = old_errno;
+   }
+   return prec_str;
+}
+
+
 void precompose_argv(int argc, const char **argv)
 {
int i = 0;
diff --git a/compat/precompose_utf8.h b/compat/precompose_utf8.h
index 3b73585..b9ac099 100644
--- a/compat/precompose_utf8.h
+++ b/compat/precompose_utf8.h
@@ -26,6 +26,7 @@ typedef struct {
struct dirent_prec_psx *dirent_nfc;
 } PREC_DIR;
 
+char *inverscompose_str_len(const char *in, size_t insz, int *outsz);
 void precompose_argv(int argc, const char **argv);
 void probe_utf8_pathname_composition(char *, int);
 
diff --git a/git-compat-util.h b/git-compat-util.h
index f6d3a46..5ae5f57 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -177,7 +177,7 @@ typedef unsigned long uintptr_t;
 #ifdef PRECOMPOSE_UNICODE
 #include compat/precompose_utf8.h
 #else
-#define precompose_str(in,i_nfd2nfc)
+#define inverscompose_str_len(s,i,o) NULL
 #define precompose_argv(c,v)
 #define probe_utf8_pathname_composition(a,b)
 #endif
diff --git a/name-hash.c b/name-hash.c
index 97444d0..3c4a4ee 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -210,7 +210,7 @@ struct cache_entry *index_dir_exists(struct index_state 
*istate, const char *nam
return NULL;
 }
 
-struct cache_entry *index_file_exists(struct index_state *istate, const char 
*name, int namelen, int icase)
+static struct cache_entry *index_file_exists_int(struct index_state *istate, 
const char *name, int namelen, int icase)
 {
struct cache_entry *ce;
struct hashmap_entry key;
@@ -227,6 +227,25 @@ struct cache_entry *index_file_exists(struct index_state 
*istate, const char *na
return NULL;
 }
 
+
+struct cache_entry *index_file_exists(struct index_state *istate, const char 
*name, int namelen, int icase)
+{
+   struct cache_entry *ce;
+   ce = index_file_exists_int(istate, name, namelen, icase);
+   if (ce)
+   return ce;
+   else {
+   int prec_len;
+   char *prec_name = inverscompose_str_len(name, namelen, 
prec_len);
+   if (prec_name) {
+   ce = index_file_exists_int(istate, prec_name, prec_len, 
icase);
+   free(prec_name);
+   }
+   }
+   return ce;
+}
+
+
 void free_name_hash(struct index_state *istate)
 {
if (!istate-name_hash_initialized)
diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh
index e4ba601..4414658 100755
--- a/t/t3910-mac-os-precompose.sh
+++ b/t/t3910-mac-os-precompose.sh
@@ -140,6 +140,22 @@ test_expect_success Add long precomposed filename '
git add * 
git commit -m Long filename
 '
+
+test_expect_success 'handle existing decomposed filenames' '
+   git checkout master 
+   git reset --hard 
+   git checkout -b exist_decomposed 
+   echo content verbatim.$Adiarnfd 
+   git -c core.precomposeunicode=false add verbatim.$Adiarnfd 
+   git commit -m existing decomposed file 
+   echo \verbatim.A\\314\\210\ expect 
+   git ls-files actual 
+   test_cmp expect actual 
+   expect 
+   git status | grep verbatim || : actual 
+   test_cmp expect actual
+'
+
 # Test if the global core.precomposeunicode stops autosensing
 # Must be the last test case
 test_expect_success respect git config --global core.precomposeunicode '
diff --git a/utf8.c b/utf8.c
index 77c28d4..fb8a99a 100644
--- a/utf8.c
+++ b/utf8.c
@@ -519,7 +519,7 @@ char *reencode_string_iconv(const char *in, size_t insz, 
iconv_t conv, int *outs
char *out, *outpos;
iconv_ibp cp;
 
-   outsz = insz;
+   outsz = 3*insz; /* for decompose */
outalloc = outsz + 1; /* for terminating NUL */

Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-05-06 Thread Erik Faye-Lund
On Mon, May 5, 2014 at 11:46 PM, Jeff King p...@peff.net wrote:
 On Sun, May 04, 2014 at 08:13:15AM +0200, Torsten Bögershausen wrote:

1. Tell everyone that NFD in the git repo is wrong, and
   they should make a new commit to normalize all their
   in-repo files to be precomposed.
   This is probably not the right thing to do, because it
   still doesn't fix checkouts of old history. And it
   spreads the problem to people on byte-preserving
   filesystems (like ext4), because now they have to start
   precomposing their filenames as they are adde to git.
  (typo:  
 ^added)
 I'm not sure if I follow. People running ext4 (or Linux in general,
 or Windows, or Unix) do not suffer from file system
 feature of Mac OS, which accepts precomposed/decomposed Unicode
 but returns decompomsed.

 What I mean by spreads the problem is that git on Linux does not need
 to care about utf8 at all. It treats filenames as a byte sequence. But
 if we were to start enforcing filenames should be precomposed utf8,
 then people adding files on Linux would want to enforce that, too.

 People on Linux could ignore the issue as they do now, but they would
 then create problems for OS X users if they add decomposed filenames.
 IOW, if the OS X code assumes all repo filenames are precomposed, then
 other systems become a possible vector for violating that assumption.

FWIW, Git for Windows also doesn't deal with that filenames are just
a byte-sequence-notion. We have patches in place that require
filenames to *either* be valid UTF-8 or Windows-1252. Windows itself
treats filenames as Unicode characters, not arbitrary byte sequences.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-05-05 Thread Jeff King
On Sun, May 04, 2014 at 08:13:15AM +0200, Torsten Bögershausen wrote:

1. Tell everyone that NFD in the git repo is wrong, and
   they should make a new commit to normalize all their
   in-repo files to be precomposed.
   This is probably not the right thing to do, because it
   still doesn't fix checkouts of old history. And it
   spreads the problem to people on byte-preserving
   filesystems (like ext4), because now they have to start
   precomposing their filenames as they are adde to git.
  (typo:  
 ^added)
 I'm not sure if I follow. People running ext4 (or Linux in general,
 or Windows, or Unix) do not suffer from file system
 feature of Mac OS, which accepts precomposed/decomposed Unicode
 but returns decompomsed.

What I mean by spreads the problem is that git on Linux does not need
to care about utf8 at all. It treats filenames as a byte sequence. But
if we were to start enforcing filenames should be precomposed utf8,
then people adding files on Linux would want to enforce that, too.

People on Linux could ignore the issue as they do now, but they would
then create problems for OS X users if they add decomposed filenames.
IOW, if the OS X code assumes all repo filenames are precomposed, then
other systems become a possible vector for violating that assumption.

3. Convert index filenames to their precomposed form when
   we read the index from disk. This would be efficient,
   but we would have to be careful not to write the
   precomposed forms back out to disk.
 Question:
 How could we be careful?
 Mac OS writes always decomposed Unicode to disk.
 (And all other OS tend to use precomposed forms, mainly because the keyboard
 driver generates it.)

Sorry, I should have been more clear here. I meant do not write index
entries using the precomposed forms out to the on-disk index. Because
that would mean that git silently converts your filenames, and it would
look like you have changes to commit whenever you read in a tree with a
decomposed name.

Looking over the patch you sent earlier, I suspect that is part of its
problem (it stores the converted name in the index entry's name field).

 This is my understanding:
 Some possible fixes are:
 
   1. Accept that NFD in a Git repo which is shared between Mac OS
  and Linux or Windows is problematic.
  Whenever core.precomposeunicode = true, do the following:
  Let Git under Mac OS change all file names in the index
  into the precomposed form when a new commit is done.
  This is probably not a wrong thing to do.
 
  When the index file is read into memory, precompose the file names and 
 compare
  them with the precomposed form coming from precompose_utf8_readdir().
  This avoids decomposed file names to be reported as untracked by git 
 status.

This is the case I was specifically thinking of above (and I think what
your patch is doing).

   2. Do all index filename comparisons under Mac OS X using a UTF-8 aware
  comparison function regardless if core.precomposeunicode is set.
  This would probably have bad performance, and somewhat
  defeats the point of converting the filenames at the
  readdir level in the first place.

Right, I'm concerned about performance here, but I wonder if we can
reuse the name-hash solutions from ignorecase.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-05-04 Thread Torsten Bögershausen
(Sorry for the late reply, I'm handicapped by some local IT problems)
Peff, do you know if the fix below helps ?

On 2014-04-28 18.16, Jeff King wrote:
 If you have existing decomposed filenames in your git
 repository (e.g., that were created with older versions of
 git that did not precompose unicode), a modern git with
 core.precomposeunicode set does not handle them well.
Yes.
 The problem is that we normalize the paths coming from the
 disk into their precomposed form, and then compare them
 against the literal bytes in the index. This makes things
 better if you have the precomposed form in the index. It
 makes things worse if you actually have the decomposed form
 in the index.

 As a result, paths with decomposed filenames may have their
 precomposed variants listed as untracked files (even though
 the precomposed variants do not exist on-disk at all).
Yes
 This patch just adds a test to demonstrate the breakage.
 Some possible fixes are:

   1. Tell everyone that NFD in the git repo is wrong, and
  they should make a new commit to normalize all their
  in-repo files to be precomposed.
  This is probably not the right thing to do, because it
  still doesn't fix checkouts of old history. And it
  spreads the problem to people on byte-preserving
  filesystems (like ext4), because now they have to start
  precomposing their filenames as they are adde to git.
 (typo:  
^added)
I'm not sure if I follow. People running ext4 (or Linux in general,
or Windows, or Unix) do not suffer from file system
feature of Mac OS, which accepts precomposed/decomposed Unicode
but returns decompomsed.

   2. Do all index filename comparisons using a UTF-8 aware
  comparison function when core.precomposeunicode is set.
  This would probably have bad performance, and somewhat
  defeats the point of converting the filenames at the
  readdir level in the first place.

   3. Convert index filenames to their precomposed form when
  we read the index from disk. This would be efficient,
  but we would have to be careful not to write the
  precomposed forms back out to disk.
Question:
How could we be careful?
Mac OS writes always decomposed Unicode to disk.
(And all other OS tend to use precomposed forms, mainly because the keyboard
driver generates it.)

   4. Introduce some infrastructure to efficiently match up
  the precomposed/decomposed forms. We already do
  something similar for case-insensitive files using
  name-hash.c. We might be able to adapt that strategy
  here.
--

This is my understanding:
Some possible fixes are:

  1. Accept that NFD in a Git repo which is shared between Mac OS
 and Linux or Windows is problematic.
 Whenever core.precomposeunicode = true, do the following:
 Let Git under Mac OS change all file names in the index
 into the precomposed form when a new commit is done.
 This is probably not a wrong thing to do.

 When the index file is read into memory, precompose the file names and 
compare
 them with the precomposed form coming from precompose_utf8_readdir().
 This avoids decomposed file names to be reported as untracked by git 
status.

  2. Do all index filename comparisons under Mac OS X using a UTF-8 aware
 comparison function regardless if core.precomposeunicode is set.
 This would probably have bad performance, and somewhat
 defeats the point of converting the filenames at the
 readdir level in the first place.

(The TC is good)


 Signed-off-by: Jeff King p...@peff.net
 ---
  t/t3910-mac-os-precompose.sh | 10 ++
  1 file changed, 10 insertions(+)

 diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh
 index e4ba601..23aa61e 100755
 --- a/t/t3910-mac-os-precompose.sh
 +++ b/t/t3910-mac-os-precompose.sh
 @@ -140,6 +140,16 @@ test_expect_success Add long precomposed filename '
   git add * 
   git commit -m Long filename
  '
 +
 +test_expect_failure 'handle existing decomposed filenames' '
 + echo content verbatim.$Adiarnfd 
 + git -c core.precomposeunicode=false add verbatim.$Adiarnfd 
 + git commit -m existing decomposed file 
 + expect 
 + git ls-files --exclude-standard -o verbatim* untracked 
 + test_cmp expect untracked
 +'
 +
  # Test if the global core.precomposeunicode stops autosensing
  # Must be the last test case
  test_expect_success respect git config --global core.precomposeunicode '
On top of the we need to following:

diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
index 95fe849..5ed3d17 100644
--- a/compat/precompose_utf8.c
+++ b/compat/precompose_utf8.c
@@ -57,6 +57,19 @@ void probe_utf8_pathname_composition(char *path, int len)
 }
 
 
+char *precompose_str_len(const char *in, size_t insz, int *outsz)
+{
+char *prec_str = NULL;
+if (precomposed_unicode != 1)
+return NULL;
+

Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-05-04 Thread Torsten Bögershausen
On 2014-04-30 16.57, Torsten Bögershausen wrote:
There is something wrong with the patch, (test suite hangs or TC fail),
so I need to com back later. Sorry for the noise.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-04-30 Thread Torsten Bögershausen
On 29.04.14 20:02, Jeff King wrote:
 On Tue, Apr 29, 2014 at 10:12:52AM -0700, Junio C Hamano wrote:
 
 Jeff King p...@peff.net writes:

 This patch just adds a test to demonstrate the breakage.
 Some possible fixes are:

   1. Tell everyone that NFD in the git repo is wrong, and
  they should make a new commit to normalize all their
  in-repo files to be precomposed.

  This is probably not the right thing to do, because it
  still doesn't fix checkouts of old history. And it
  spreads the problem to people on byte-preserving
  filesystems (like ext4), because now they have to start
  precomposing their filenames as they are adde to git.

 Hmm, have we taught the compare precomposed for codepaths that
 compare two trees and a tree and the index, too?  Otherwise, we
 would have the same issue with commits in the old history.
 
 Ugh, yeah, I didn't think about that codepath. I think we would not want
 to precompose in that case. IOW, git works byte-wise internally, but it
 is only at the filesystem layer that we do such munging. The index
 straddles the line between the filesystem and git's internal
 representations.

[snip]
Please allow me to answer on this post-
I made a suggestion here:
https://github.com/tboegi/git/commit/85305ce306cb88a07dad6350d6ba8c5f2f817af6

The new test in t3910 passes, but the test suite hangs somewhere, there is a 
whitespace
in precompose_utf8.c, so I don't know what to say.
But in case someone wants to make a code review:


commit 85305ce306cb88a07dad6350d6ba8c5f2f817af6
Author: Torsten Bögershausen tbo...@web.de
Date:   Wed Apr 30 10:30:04 2014 +0200

core.precomposeunicode with decomposed filenames

Commit 750b2e4785e shows a failure of core.precomposeunicode
when decomposed filenames are in the index.

When decomposed file names are in the index and readdir()
converts them into the decomposed form, Git status will report
the precomposed file name as untracked.

Solution:
Precompose file names when reading the index file from disc into memory.

diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
index 95fe849..40ebc2e 100644
--- a/compat/precompose_utf8.c
+++ b/compat/precompose_utf8.c
@@ -57,6 +57,19 @@ void probe_utf8_pathname_composition(char *path, int len)
 }
 
 
+char *precompose_str_len(const char *in, size_t insz, int *outsz)
+{
+   char *prec_str = NULL;
+   if (precomposed_unicode != 1)
+   return NULL;
+
+   if (has_non_ascii(in, insz, NULL))
+   prec_str = reencode_string_len(in, insz, repo_encoding, 
path_encoding, outsz);
+
+   return prec_str;
+}
+
+
 void precompose_argv(int argc, const char **argv)
 {
int i = 0;
diff --git a/compat/precompose_utf8.h b/compat/precompose_utf8.h
index 3b73585..28f6595 100644
--- a/compat/precompose_utf8.h
+++ b/compat/precompose_utf8.h
@@ -26,6 +26,7 @@ typedef struct {
struct dirent_prec_psx *dirent_nfc;
 } PREC_DIR;
 
+char *precompose_str_len(const char *in, size_t insz, int *outsz);
 void precompose_argv(int argc, const char **argv);
 void probe_utf8_pathname_composition(char *, int);
 
diff --git a/git-compat-util.h b/git-compat-util.h
index d493a8c..de117d1 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -180,7 +180,7 @@ typedef unsigned long uintptr_t;
 #ifdef PRECOMPOSE_UNICODE
 #include compat/precompose_utf8.h
 #else
-#define precompose_str(in,i_nfd2nfc)
+#define precompose_str_len(s,i,o) NULL
 #define precompose_argv(c,v)
 #define probe_utf8_pathname_composition(a,b)
 #endif
diff --git a/read-cache.c b/read-cache.c
index 4b4effd..0887835 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1330,7 +1330,7 @@ static inline uint32_t ntoh_l_force_align(void *p)
 #define ntoh_l(var) ntoh_l_force_align((var))
 #endif
 
-static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry 
*ondisk,
+static struct cache_entry *cache_entry_from_ondisk_int(struct 
ondisk_cache_entry *ondisk,
   unsigned int flags,
   const char *name,
   size_t len)
@@ -1355,6 +1355,22 @@ static struct cache_entry 
*cache_entry_from_ondisk(struct ondisk_cache_entry *on
return ce;
 }
 
+static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry 
*ondisk,
+unsigned int flags,
+const char *name,
+size_t len)
+{
+   int prec_len;
+   char *prec_name = precompose_str_len(name, len, prec_len);
+   if (prec_name) {
+   struct cache_entry *ce;
+   ce = cache_entry_from_ondisk_int(ondisk, flags, prec_name, 
prec_len);
+   free(prec_name);
+   return ce;
+   }
+   return 

Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-04-29 Thread Torsten Bögershausen

On 04/29/2014 05:23 AM, Jeff King wrote:

On Mon, Apr 28, 2014 at 10:49:30PM +0200, Torsten Bögershausen wrote:


OK, thanks for the description.
In theory we can make Git composition ignoring by changing
index_file_exists() in name-hash.c.
(Both names must be precomposed first and compared then)

Yeah, we could perhaps get away without storing the extra precomposed
form if we just stored the entries under their precomposed hash, and
then taught same_name to use a slower precompose-aware comparison. But I
also see that we do binary searches in index_name_pos (called by
index_name_is_other). I don't know if we'd have to deal with this
problem there, too.

Just loud thinking:
We precompose whenever we read file names from disc, that's done in 
readdir()
We precompose whenever we get an argv into Git, that's done in 
precompose_argv()
We precompose every time we read file names from the index file on 
disc(?) into memory.
That we don't do today, and my suggestion to hack name-hash.c isn't a 
good one.


Probably we need to go into read-cache.c, or more places. I'm not an 
expert here knowing

all Git internal details.
Basically all places where strings containing file names are put into 
memory are effected,

and I wouldn't be too concerned about CPU cycles.


I don't know how much people are using Git before 1.7.12 (the
first version supporting precomposed unicode).

Could we simply ask them to upgrade ?

I'm not sure what you mean here. Upgrading won't help, because the
values are baked into existing history created with the old versions
forever. Any time I git checkout v1.0 on the broken project, a modern
git will complain about the ghost untracked file.

It depends if all file names in a certain repo are stored decomposed,
(in this case everybody can set core.precomposeunicode false)
or if there is a mixture having precomposed and decomposed
in different comits/directories...
In this case we can normalize the master branch.
For older commit the users need to wait for an updated Git version,
until that they need to live with the ghosts as good as they can.




The next problem is that people need to agree if the repo should store
names in pre- or decomposed form.
(My voice is for precomposed)
Unfortunatly the core.precomposeunicode is repo-local, so everybody
needs to agree globally and configure locally.

Yeah, that was sort of my point 1 from the patch. I'm a bit worried
that it creates problems for people on other systems, though. Linux
people do not need to care about precomposed/decomposed at all, and any
attempt we make to automatically handle it is going to run afoul of
non-utf8 encodings. Not to mention that it does not solve the git
checkout v1.0 problem above.

Not sure what is meant by non-utf8 encodings.
Mac OS X can only handle Unicode filenames,
and a single ISO-8859-1 will be returned as %XY from readdir().
So if you want to share a repo with Mac OS X (and/or Windows)
Unicode should be used.
Are you saying that there is a Linux station feeding in file names in  
e.g. 8859-1, EUC ?

My experience/knowledge is that you can not do that (in a useful way).



Side note:
I which we had this config variable travelling with the repo, like 
.gitattributes does
for text dealing with CRLF-LF.

Yeah, I guess if we wanted to enforce it everywhere, you would have to
use .gitattributes since we cannot safely turn on such a feature by
default.


I don't know how many reports you have, reading all this it feels as if the 
effected users
could normalize their repos and run git config core.precomposeunicode true, 
followed
by git config --global core.precomposeunicode true.
Does that sound like a possible way forward ?

I have just a handful of reports. Maybe 3-4? I didn't dig them all up
today, as it would be a non-trivial effort. But I have no idea how good
a sampling that is.

The following could help, may be:
git -c core.quotepath=false ls-files | iconv -f UTF-8-MAC -t UTF-8 expected
git -c core.quotepath=false ls-files actual
diff expected actual


-Peff


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-04-29 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 This patch just adds a test to demonstrate the breakage.
 Some possible fixes are:

   1. Tell everyone that NFD in the git repo is wrong, and
  they should make a new commit to normalize all their
  in-repo files to be precomposed.

  This is probably not the right thing to do, because it
  still doesn't fix checkouts of old history. And it
  spreads the problem to people on byte-preserving
  filesystems (like ext4), because now they have to start
  precomposing their filenames as they are adde to git.

Hmm, have we taught the compare precomposed for codepaths that
compare two trees and a tree and the index, too?  Otherwise, we
would have the same issue with commits in the old history.

Do we have a similar issue for older commit in a history under
ignore-case as well?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-04-29 Thread Jeff King
On Tue, Apr 29, 2014 at 10:12:52AM -0700, Junio C Hamano wrote:

 Jeff King p...@peff.net writes:
 
  This patch just adds a test to demonstrate the breakage.
  Some possible fixes are:
 
1. Tell everyone that NFD in the git repo is wrong, and
   they should make a new commit to normalize all their
   in-repo files to be precomposed.
 
   This is probably not the right thing to do, because it
   still doesn't fix checkouts of old history. And it
   spreads the problem to people on byte-preserving
   filesystems (like ext4), because now they have to start
   precomposing their filenames as they are adde to git.
 
 Hmm, have we taught the compare precomposed for codepaths that
 compare two trees and a tree and the index, too?  Otherwise, we
 would have the same issue with commits in the old history.

Ugh, yeah, I didn't think about that codepath. I think we would not want
to precompose in that case. IOW, git works byte-wise internally, but it
is only at the filesystem layer that we do such munging. The index
straddles the line between the filesystem and git's internal
representations.

I think my keep the normalized names alongside index entries approach
might still work there. But it means that we compare against the real
byte-wise names on the tree side, and against the normalized names on
the path side. But that means having two comparison/lookup functions for
the index, and always using the right one. And algorithms that rely on
traversing two sorted lists cannot work in both directions.

 Do we have a similar issue for older commit in a history under
 ignore-case as well?

I don't think so, because we handle ignorecase completely differently.
There we use the name-hash with a case-insensitive hash and a
case-insensitive comparison function. And we use strcasecmp liberally
throughout the code.

I don't think we have a str_utf8_cmp that ignores normalizations (or
maybe strcoll will do this?). But in theory we could use it everywhere
we use strcasecmp for ignore_case. And then we would not need to have
our readdir wrapper, maybe? I admit I haven't thought that much about
_either_ approach. But aside from some bugs in the hash system, I do not
recall seeing any design problems in the ignorecase code.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-04-29 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 I don't think we have a str_utf8_cmp that ignores normalizations (or
 maybe strcoll will do this?). But in theory we could use it everywhere
 we use strcasecmp for ignore_case. And then we would not need to have
 our readdir wrapper, maybe? I admit I haven't thought that much about
 _either_ approach. But aside from some bugs in the hash system, I do not
 recall seeing any design problems in the ignorecase code.

Our diffs and merges depend on walking two (or more) sorted lists,
and that sort order is baked in the tree objects when they are
created.  Using normalized comparison only when comparing the
earliest elements picked from these sorted lists would not give you
the correct comparison or merge results, would it?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-04-29 Thread Jeff King
On Tue, Apr 29, 2014 at 11:49:30AM -0700, Junio C Hamano wrote:

 Jeff King p...@peff.net writes:
 
  I don't think we have a str_utf8_cmp that ignores normalizations (or
  maybe strcoll will do this?). But in theory we could use it everywhere
  we use strcasecmp for ignore_case. And then we would not need to have
  our readdir wrapper, maybe? I admit I haven't thought that much about
  _either_ approach. But aside from some bugs in the hash system, I do not
  recall seeing any design problems in the ignorecase code.
 
 Our diffs and merges depend on walking two (or more) sorted lists,
 and that sort order is baked in the tree objects when they are
 created.  Using normalized comparison only when comparing the
 earliest elements picked from these sorted lists would not give you
 the correct comparison or merge results, would it?

Right, but we do not do normalized comparisons on that side. Not for
precompose, and not for ignorecase. The entry in the index _is_
case-sensitive[1], and we compare it as such to the tree side.

It is only when comparing the filesystem side to the index that we need
to care about such normalizations. And there we have the name-hash code
to handle ignorecase, but nothing to handle precompose.

-Peff

[1] This works because you typically get the case-sensitive entry via
`git read-tree`, and then further update it from the filesystem. If
you were to add a new entry makefile, and somebody else added
Makefile, they would conflict.

When you do git add makefile and Makefile _does_ exist, I am not
sure, though, if it is git who handles making sure we find the
correct entry, or if it is simply the fact that case insensitive
filesystems are typically case-preserving (so you generally ask for
Makefile anyway). If it is the latter, then that is a problem for
precompose. HFS's NFD normalization is non-preserving.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-04-28 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 This patch just adds a test to demonstrate the breakage.
 Some possible fixes are:
 ...
   2. Do all index filename comparisons using a UTF-8 aware
  comparison function when core.precomposeunicode is set.
  This would probably have bad performance, and somewhat
  defeats the point of converting the filenames at the
  readdir level in the first place.

As we do this only when core.precomposeunicode is set, projects that
use pathnames encoded not in UTF-8 (e.g. 8859-1, EUC, etc.) will not
be affected by getting their path mangled, as long as we won't flip
the default to true (which I am not suggesting to do, by the way).

   3. Convert index filenames to their precomposed form when
  we read the index from disk. This would be efficient,
  but we would have to be careful not to write the
  precomposed forms back out to disk.

I think this may be the right approach, especially if you are going
to do this only when core.precomposeunicode is set.

the reasoning behind we would have to be careful not to write
part, is unclear to me, though.  Don't decomposing filesystems
perform the manglig from the precomposed form without even being
asked to do so, just like a case insensitive filesystem will
overwrite an existing makefile on a request to write to
Makefile?

   4. Introduce some infrastructure to efficiently match up
  the precomposed/decomposed forms. We already do
  something similar for case-insensitive files using
  name-hash.c. We might be able to adapt that strategy
  here.

 Signed-off-by: Jeff King p...@peff.net
 ---
  t/t3910-mac-os-precompose.sh | 10 ++
  1 file changed, 10 insertions(+)

 diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh
 index e4ba601..23aa61e 100755
 --- a/t/t3910-mac-os-precompose.sh
 +++ b/t/t3910-mac-os-precompose.sh
 @@ -140,6 +140,16 @@ test_expect_success Add long precomposed filename '
   git add * 
   git commit -m Long filename
  '
 +
 +test_expect_failure 'handle existing decomposed filenames' '
 + echo content verbatim.$Adiarnfd 
 + git -c core.precomposeunicode=false add verbatim.$Adiarnfd 
 + git commit -m existing decomposed file 
 + expect 
 + git ls-files --exclude-standard -o verbatim* untracked 
 + test_cmp expect untracked
 +'
 +
  # Test if the global core.precomposeunicode stops autosensing
  # Must be the last test case
  test_expect_success respect git config --global core.precomposeunicode '
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-04-28 Thread Jeff King
On Mon, Apr 28, 2014 at 12:17:28PM -0700, Junio C Hamano wrote:

3. Convert index filenames to their precomposed form when
   we read the index from disk. This would be efficient,
   but we would have to be careful not to write the
   precomposed forms back out to disk.
 
 I think this may be the right approach, especially if you are going
 to do this only when core.precomposeunicode is set.
 
 the reasoning behind we would have to be careful not to write
 part, is unclear to me, though.  Don't decomposing filesystems
 perform the manglig from the precomposed form without even being
 asked to do so, just like a case insensitive filesystem will
 overwrite an existing makefile on a request to write to
 Makefile?

Sorry, I meant do not write the precomposed forms back out to the
on-disk index. And by extension, do not update cache-tree and write
them out to git trees.

IOW, it is not enough to just set cache_entry-name to the normalized
form. You'd need to store both.

Since such entries are in the minority, and because cache_entry is
already a variable-length struct, I think you could get away with
sticking it after the name field, and then comparing like:

  const char *ce_normalized_name(struct cache_entry *ce, size_t *len)
  {
const char *ret;

/* Normal, fast path */
if (!(ce-ce_flags  CE_NORMALIZED_NAME)) {
len = ce_namelen(ce);
return ce-name;
}

/* Slow path for normalized names */
ret = ce-name + ce-namelen + 1;
*len = strlen(name);
return ret;
  }

The strlen is probably OK since such paths are presumably in the
minority (even for UTF-8 paths, we can avoid storing the extra copy if
they do not need any normalization). Or we could get fancy and encode
the length in front, but I am not sure it is worth the complexity.

Anyway, the tricky part is then making sure that all cache_entry name
comparisons use ce_normalized_name instead of ce-name.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-04-28 Thread Torsten Bögershausen

On 28.04.14 21:35, Jeff King wrote:
 On Mon, Apr 28, 2014 at 12:17:28PM -0700, Junio C Hamano wrote:

   3. Convert index filenames to their precomposed form when
  we read the index from disk. This would be efficient,
  but we would have to be careful not to write the
  precomposed forms back out to disk.
 I think this may be the right approach, especially if you are going
 to do this only when core.precomposeunicode is set.

 the reasoning behind we would have to be careful not to write
 part, is unclear to me, though.  Don't decomposing filesystems
 perform the manglig from the precomposed form without even being
 asked to do so, just like a case insensitive filesystem will
 overwrite an existing makefile on a request to write to
 Makefile?
 Sorry, I meant do not write the precomposed forms back out to the
 on-disk index. And by extension, do not update cache-tree and write
 them out to git trees.

 IOW, it is not enough to just set cache_entry-name to the normalized
 form. You'd need to store both.

 Since such entries are in the minority, and because cache_entry is
 already a variable-length struct, I think you could get away with
 sticking it after the name field, and then comparing like:

   const char *ce_normalized_name(struct cache_entry *ce, size_t *len)
   {
   const char *ret;

   /* Normal, fast path */
   if (!(ce-ce_flags  CE_NORMALIZED_NAME)) {
   len = ce_namelen(ce);
   return ce-name;
   }

   /* Slow path for normalized names */
   ret = ce-name + ce-namelen + 1;
   *len = strlen(name);
   return ret;
   }

 The strlen is probably OK since such paths are presumably in the
 minority (even for UTF-8 paths, we can avoid storing the extra copy if
 they do not need any normalization). Or we could get fancy and encode
 the length in front, but I am not sure it is worth the complexity.

 Anyway, the tricky part is then making sure that all cache_entry name
 comparisons use ce_normalized_name instead of ce-name.

 -Peff
To my knowledge repos with decomposed unicode should be rare in practice.
I only can speak for european (or latin based) or cyrillic languages myself:

- It is difficult (but not impossible) to enter decomposed unicode on the 
keyboard.
- Some programs under Mac OS X do not handle decomposed code points well,
  an ä may be displayed as ¨a for example.
- Pushing and pulling to Windows or Linux is possible, but the same problems 
here:
  the keyboard is not prepared to enter the decomposed form, and the display 
may be wrong.

The only possible use case for decomposed unicode I am aware of is when you use 
git-bzr,
because bzr does not do the precomposition (and neither hg to my knowledge).

So for me the test case could sense, even if I think that nobody (TM) uses an 
old Git version
under Mac OS X which is not able to handle precomposed unicode.

Unless I have missed something.






--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-04-28 Thread Jeff King
On Mon, Apr 28, 2014 at 09:52:07PM +0200, Torsten Bögershausen wrote:

 To my knowledge repos with decomposed unicode should be rare in
 practice.  I only can speak for european (or latin based) or cyrillic
 languages myself:

I've run across several cases in the past few months, but only just
figured out what was going on. Most were tickets to GitHub support, but
we actually have such a case in our github/github repository. In most
cases, I think they were created on older versions of git on OS X,
either before core.precomposeunicode existed, or before it was turned on
by default. The decomposed form got baked into the tree (whatever the
user originally typed, git probably found out about it via git add .).

I think reports are just coming in now because we didn't start turning
on core.precomposeunicode by default until v1.8.5, shipped in November.
And then, a person working on the repository would not notice anything,
since we only set the flag during clone. So it took time for people to
upgrade _and_ to make fresh clones.

 So for me the test case could sense, even if I think that nobody (TM)
 uses an old Git version under Mac OS X which is not able to handle
 precomposed unicode.

Even when they do not, the decomposed values are baked into history from
those old versions. So it is a matter of history created with older
versions not interacting well with newer versions.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-04-28 Thread Torsten Bögershausen
On 2014-04-28 22.03, Jeff King wrote:
 On Mon, Apr 28, 2014 at 09:52:07PM +0200, Torsten Bögershausen wrote:

 To my knowledge repos with decomposed unicode should be rare in
 practice.  I only can speak for european (or latin based) or cyrillic
 languages myself:
 I've run across several cases in the past few months, but only just
 figured out what was going on. Most were tickets to GitHub support, but
 we actually have such a case in our github/github repository. In most
 cases, I think they were created on older versions of git on OS X,
 either before core.precomposeunicode existed, or before it was turned on
 by default. The decomposed form got baked into the tree (whatever the
 user originally typed, git probably found out about it via git add .).

 I think reports are just coming in now because we didn't start turning
 on core.precomposeunicode by default until v1.8.5, shipped in November.
 And then, a person working on the repository would not notice anything,
 since we only set the flag during clone. So it took time for people to
 upgrade _and_ to make fresh clones.
OK, thanks for the description.
In theory we can make Git composition ignoring by changing
index_file_exists() in name-hash.c.
(Both names must be precomposed first and compared then)

I don't know how much people are using Git before 1.7.12 (the
first version supporting precomposed unicode).

Could we simply ask them to upgrade ?
The next problem is that people need to agree if the repo should store
names in pre- or decomposed form.
(My voice is for precomposed)
Unfortunatly the core.precomposeunicode is repo-local, so everybody
needs to agree globally and configure locally.

Side note:
I which we had this config variable travelling with the repo, like 
.gitattributes does
for text dealing with CRLF-LF.

I don't know how many reports you have, reading all this it feels as if the 
effected users
could normalize their repos and run git config core.precomposeunicode true, 
followed
by git config --global core.precomposeunicode true.
Does that sound like a possible way forward ?
 So for me the test case could sense, even if I think that nobody (TM)
 uses an old Git version under Mac OS X which is not able to handle
 precomposed unicode.
 Even when they do not, the decomposed values are baked into history from
 those old versions. So it is a matter of history created with older
 versions not interacting well with newer versions.
I'm not sure if I understood all the details here, but I would be happy to help
with suggestions/tests/reviews.

 -Peff

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-04-28 Thread Jeff King
On Mon, Apr 28, 2014 at 03:35:02PM -0400, Jeff King wrote:

 Since such entries are in the minority, and because cache_entry is
 already a variable-length struct, I think you could get away with
 sticking it after the name field, and then comparing like:
 
   const char *ce_normalized_name(struct cache_entry *ce, size_t *len)
   {
   const char *ret;
 
   /* Normal, fast path */
   if (!(ce-ce_flags  CE_NORMALIZED_NAME)) {
   len = ce_namelen(ce);
   return ce-name;
   }
 
   /* Slow path for normalized names */
   ret = ce-name + ce-namelen + 1;
   *len = strlen(name);
   return ret;
   }

That's the reading half. We would also need to create the normalized
names for each cache_entry. I took a look at that this afternoon. It
turns out we make cache_entry structs in quite a few places.  So I
thought I'd start with converting them all to a function like:

  struct cache_entry *cache_entry_alloc(const char *name, size_t len);

And then once converted, we could teach it to normalize the name as
appropriate. That interface does improve many of the callers, but there
are a few tricky ones.

For example, in checkout.c:update_some (and one or two other spots), we
actually have the path broken into two parts, and we combine them while
writing into the cache_entry. We could obviously combine them into a
single buffer beforehand, but that means extra copying in reasonably hot
code paths. It would be slightly ugly but perhaps reasonable to have:

  cache_entry_alloc_two(const char *one, size_t one_len,
const char *two, size_t two_len);

But then I got to unpack-trees. It has the whole path broken down across
a linked list.  I'm not sure what would be least terrible interface
here. Again, we could format to a buffer and copy, but I'm hesitant to
do it on this code path.

I'm not sure if I'm just being paranoid about the extra memcpys (it's
not _that_ tight a loop, since after all, we are generally zlib
inflating to get the tree data, and the filenames are not all _that_
long).

I dunno. I just hate the idea of tradeoffs for this OS-X-only fix
permeating the rest of the code on other platforms. But maybe Knuth
should be hitting me with his premature optimization clue-stick.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

2014-04-28 Thread Jeff King
On Mon, Apr 28, 2014 at 10:49:30PM +0200, Torsten Bögershausen wrote:

 OK, thanks for the description.
 In theory we can make Git composition ignoring by changing
 index_file_exists() in name-hash.c.
 (Both names must be precomposed first and compared then)

Yeah, we could perhaps get away without storing the extra precomposed
form if we just stored the entries under their precomposed hash, and
then taught same_name to use a slower precompose-aware comparison. But I
also see that we do binary searches in index_name_pos (called by
index_name_is_other). I don't know if we'd have to deal with this
problem there, too.

 I don't know how much people are using Git before 1.7.12 (the
 first version supporting precomposed unicode).
 
 Could we simply ask them to upgrade ?

I'm not sure what you mean here. Upgrading won't help, because the
values are baked into existing history created with the old versions
forever. Any time I git checkout v1.0 on the broken project, a modern
git will complain about the ghost untracked file.

 The next problem is that people need to agree if the repo should store
 names in pre- or decomposed form.
 (My voice is for precomposed)
 Unfortunatly the core.precomposeunicode is repo-local, so everybody
 needs to agree globally and configure locally.

Yeah, that was sort of my point 1 from the patch. I'm a bit worried
that it creates problems for people on other systems, though. Linux
people do not need to care about precomposed/decomposed at all, and any
attempt we make to automatically handle it is going to run afoul of
non-utf8 encodings. Not to mention that it does not solve the git
checkout v1.0 problem above.

 Side note:
 I which we had this config variable travelling with the repo, like 
 .gitattributes does
 for text dealing with CRLF-LF.

Yeah, I guess if we wanted to enforce it everywhere, you would have to
use .gitattributes since we cannot safely turn on such a feature by
default.

 I don't know how many reports you have, reading all this it feels as if the 
 effected users
 could normalize their repos and run git config core.precomposeunicode 
 true, followed
 by git config --global core.precomposeunicode true.
 Does that sound like a possible way forward ?

I have just a handful of reports. Maybe 3-4? I didn't dig them all up
today, as it would be a non-trivial effort. But I have no idea how good
a sampling that is.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html