Re: Git clone and case sensitivity

2018-07-31 Thread Jeff Hostetler




On 7/29/2018 5:28 AM, Jeff King wrote:

On Sun, Jul 29, 2018 at 07:26:41AM +0200, Duy Nguyen wrote:


strcasecmp() will only catch a subset of the cases. We really need to
follow the same folding rules that the filesystem would.


True. But that's how we handle case insensitivity internally. If a
filesytem has more sophisticated folding rules then git will not work
well on that one anyway.


Hrm. Yeah, I guess that's the best we can do for the actual in-memory
checks. Everything else depends on doing an actual filesystem operation,
and our icase stuff kicks in way before then. I was mostly thinking of
HFS+ utf8 normalization weirdness, but I guess people are accustomed to
that by now.


For the case of clone, I actually wonder if we could detect during the
checkout step that a file already exists. Since we know that the
directory we started with was empty, then if it does, either:

   - there's some funny case-folding going on that means two paths in the
 repository map to the same name in the filesystem; or

   - somebody else is writing to the directory at the same time as us


This is exactly what my first patch does (minus the sparse checkout
part).


Right, sorry, I should have read that one more carefully.


But without knowing the exact folding rules, I don't think we can
locate this "somebody else" who wrote the first path. So if N paths
are treated the same by this filesystem, we could only report N-1 of
them.

If we want to report just one path when this happens though, then this
works quite well.


Hmm. Since most such systems are case-preserving, would it be possible
to report the name of the existing file? Doing it via opendir/readdir is
hacky, and anyway puts the burden on us to find the matching name. Doing
it via fstat() on the opened file doesn't work because at that the
filesystem has resolved the name to an inode.

So yeah, perhaps strcasecmp() is the best we can do (I do agree that
being able to mention all of the conflicting names is a benefit).

I guess we should be using fspathcmp(), though, in case it later learns
to be smarter.

-Peff



As has already been mentioned, this gets into weird territory really
fast, between case folding, final space/dot on windows, utf8 NFC/NFD
weirdness on the mac, utf8 invisible chars on the mac, long/short names
on windows, and etc.

And that's just for filenames.  Things really get weird if directory
names have these ambiguities.

Perhaps just print the problematic paths (where the collision is
detected) and let the user decide how to correct them.

Perhaps we could have a separate tool that could scan the index or
commit for potential conflicts and warn them in advance (granted, it
might not be perfect and may report a few false positives).

Forcing them into a sparse-checkout situation might be over their
skill level.

Jeff


Re: Git clone and case sensitivity

2018-07-29 Thread Jeff King
On Sun, Jul 29, 2018 at 07:26:41AM +0200, Duy Nguyen wrote:

> > strcasecmp() will only catch a subset of the cases. We really need to
> > follow the same folding rules that the filesystem would.
> 
> True. But that's how we handle case insensitivity internally. If a
> filesytem has more sophisticated folding rules then git will not work
> well on that one anyway.

Hrm. Yeah, I guess that's the best we can do for the actual in-memory
checks. Everything else depends on doing an actual filesystem operation,
and our icase stuff kicks in way before then. I was mostly thinking of
HFS+ utf8 normalization weirdness, but I guess people are accustomed to
that by now.

> > For the case of clone, I actually wonder if we could detect during the
> > checkout step that a file already exists. Since we know that the
> > directory we started with was empty, then if it does, either:
> >
> >   - there's some funny case-folding going on that means two paths in the
> > repository map to the same name in the filesystem; or
> >
> >   - somebody else is writing to the directory at the same time as us
> 
> This is exactly what my first patch does (minus the sparse checkout
> part).

Right, sorry, I should have read that one more carefully.

> But without knowing the exact folding rules, I don't think we can
> locate this "somebody else" who wrote the first path. So if N paths
> are treated the same by this filesystem, we could only report N-1 of
> them.
> 
> If we want to report just one path when this happens though, then this
> works quite well.

Hmm. Since most such systems are case-preserving, would it be possible
to report the name of the existing file? Doing it via opendir/readdir is
hacky, and anyway puts the burden on us to find the matching name. Doing
it via fstat() on the opened file doesn't work because at that the
filesystem has resolved the name to an inode.

So yeah, perhaps strcasecmp() is the best we can do (I do agree that
being able to mention all of the conflicting names is a benefit).

I guess we should be using fspathcmp(), though, in case it later learns
to be smarter.

-Peff


Re: Git clone and case sensitivity

2018-07-28 Thread Duy Nguyen
On Sat, Jul 28, 2018 at 11:57 AM Jeff King  wrote:
> > +static int has_duplicate_icase_entries(struct index_state *istate)
> > +{
> > + struct string_list list = STRING_LIST_INIT_NODUP;
> > + int i;
> > + int found = 0;
> > +
> > + for (i = 0; i < istate->cache_nr; i++)
> > + string_list_append(, istate->cache[i]->name);
> > +
> > + list.cmp = strcasecmp;
> > + string_list_sort();
> > +
> > + for (i = 1; i < list.nr; i++) {
> > + if (strcasecmp(list.items[i-1].string,
> > +list.items[i].string))
> > + continue;
> > + found = 1;
> > + break;
> > + }
> > + string_list_clear(, 0);
> > +
> > + return found;
> > +}
>
> strcasecmp() will only catch a subset of the cases. We really need to
> follow the same folding rules that the filesystem would.

True. But that's how we handle case insensitivity internally. If a
filesytem has more sophisticated folding rules then git will not work
well on that one anyway.

> For the case of clone, I actually wonder if we could detect during the
> checkout step that a file already exists. Since we know that the
> directory we started with was empty, then if it does, either:
>
>   - there's some funny case-folding going on that means two paths in the
> repository map to the same name in the filesystem; or
>
>   - somebody else is writing to the directory at the same time as us

This is exactly what my first patch does (minus the sparse checkout
part).  But without knowing the exact folding rules, I don't think we
can locate this "somebody else" who wrote the first path. So if N
paths are treated the same by this filesystem, we could only report
N-1 of them.

If we want to report just one path when this happens though, then this
works quite well.
-- 
Duy


Re: Git clone and case sensitivity

2018-07-28 Thread brian m. carlson
On Sat, Jul 28, 2018 at 05:56:59AM -0400, Jeff King wrote:
> strcasecmp() will only catch a subset of the cases. We really need to
> follow the same folding rules that the filesystem would.
> 
> For the case of clone, I actually wonder if we could detect during the
> checkout step that a file already exists. Since we know that the
> directory we started with was empty, then if it does, either:
> 
>   - there's some funny case-folding going on that means two paths in the
> repository map to the same name in the filesystem; or
> 
>   - somebody else is writing to the directory at the same time as us
> 
> Either of which I think would be worth warning about. I'm not sure if we
> already lstat() the paths we're writing anyway as part of the checkout,
> so we might even get the feature "for free".

This is possible to do.  From the bug I accidentally introduced in 2.16,
we know that on clone, there is a code path that is only traversed when
we hit this case and only on case-insensitive file systems.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204


signature.asc
Description: PGP signature


Re: Git clone and case sensitivity

2018-07-28 Thread Jeff King
On Sat, Jul 28, 2018 at 07:11:05AM +0200, Duy Nguyen wrote:

> > It might be enough to just issue a warning and give an advise() hint
> > that tells the user what's going on. Then they can decide what to do
> > (hide both paths, or just work in the index, or move to a different fs,
> > or complain to upstream).
> 
> Yeah that may be the best option. Something like this perhaps? Not
> sure how much detail the advice should be here.

Yeah, something along these lines.  I agree with Simon's comment
elsewhere that this should probably mention the names. I don't know if
we'd want to offer advice pointing them to using the sparse feature to
work around it.

> +static int has_duplicate_icase_entries(struct index_state *istate)
> +{
> + struct string_list list = STRING_LIST_INIT_NODUP;
> + int i;
> + int found = 0;
> +
> + for (i = 0; i < istate->cache_nr; i++)
> + string_list_append(, istate->cache[i]->name);
> +
> + list.cmp = strcasecmp;
> + string_list_sort();
> +
> + for (i = 1; i < list.nr; i++) {
> + if (strcasecmp(list.items[i-1].string,
> +list.items[i].string))
> + continue;
> + found = 1;
> + break;
> + }
> + string_list_clear(, 0);
> +
> + return found;
> +}

strcasecmp() will only catch a subset of the cases. We really need to
follow the same folding rules that the filesystem would.

For the case of clone, I actually wonder if we could detect during the
checkout step that a file already exists. Since we know that the
directory we started with was empty, then if it does, either:

  - there's some funny case-folding going on that means two paths in the
repository map to the same name in the filesystem; or

  - somebody else is writing to the directory at the same time as us

Either of which I think would be worth warning about. I'm not sure if we
already lstat() the paths we're writing anyway as part of the checkout,
so we might even get the feature "for free".

-Peff


Re: Git clone and case sensitivity

2018-07-28 Thread Simon Ruderich
On Sat, Jul 28, 2018 at 07:11:05AM +0200, Duy Nguyen wrote:
>  static int checkout(int submodule_progress)
>  {
>   struct object_id oid;
> @@ -761,6 +785,11 @@ static int checkout(int submodule_progress)
>   if (write_locked_index(_index, _file, COMMIT_LOCK))
>   die(_("unable to write new index file"));
>
> + if (ignore_case && has_duplicate_icase_entries(_index))
> + warning(_("This repository has paths that only differ in case\n"
> +   "and you have a case-insenitive filesystem which 
> will\n"
> +   "cause problems."));
> +
>   err |= run_hook_le(NULL, "post-checkout", sha1_to_hex(null_sha1),
>  oid_to_hex(), "1", NULL);

I think the advice message should list the problematic file
names. Even though this might be quite verbose it will help those
affected to quickly find the problematic files to either fix this
on their own or report to upstream (unless there's already an
easy way to find those files - if so it should be mentioned in
the message).

Regards
Simon
-- 
+ privacy is necessary
+ using gnupg http://gnupg.org
+ public key id: 0x92FEFDB7E44C32F9


Re: Git clone and case sensitivity

2018-07-27 Thread Duy Nguyen
On Sat, Jul 28, 2018 at 12:48:57AM -0400, Jeff King wrote:
> On Sat, Jul 28, 2018 at 06:45:43AM +0200, Duy Nguyen wrote:
> 
> > > I agree throwing a real exception would be bad. But how about detecting
> > > the problem and trying our best to keep the repo in somewhat usable
> > > state like this?
> > >
> > > This patch uses sparse checkout to hide all those paths that we fail
> > > to checkout, so you can still have a clean worktree to do things, as
> > > long as you don't touch those paths.
> > 
> > Side note. There may still be problems with this patch. Let's use
> > vim-colorschemes.git as an example, which has darkBlue.vim and
> > darkblue.vim.
> > 
> > Say we have checked out darkBlue.vim and hidden darkblue.vim. When you
> > update darkBlue.vim on worktree and then update the index, are we sure
> > we will update darkBlue.vim entry and not (hidden) darkblue.vim? I am
> > not sure. I don't think our lookup function is prepared to deal with
> > this. Maybe it's best to hide both of them.
> 
> It might be enough to just issue a warning and give an advise() hint
> that tells the user what's going on. Then they can decide what to do
> (hide both paths, or just work in the index, or move to a different fs,
> or complain to upstream).

Yeah that may be the best option. Something like this perhaps? Not
sure how much detail the advice should be here.

-- 8< --
diff --git a/builtin/clone.c b/builtin/clone.c
index 1d939af9d8..b47ad5877b 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -711,6 +711,30 @@ static void update_head(const struct ref *our, const 
struct ref *remote,
}
 }
 
+static int has_duplicate_icase_entries(struct index_state *istate)
+{
+   struct string_list list = STRING_LIST_INIT_NODUP;
+   int i;
+   int found = 0;
+
+   for (i = 0; i < istate->cache_nr; i++)
+   string_list_append(, istate->cache[i]->name);
+
+   list.cmp = strcasecmp;
+   string_list_sort();
+
+   for (i = 1; i < list.nr; i++) {
+   if (strcasecmp(list.items[i-1].string,
+  list.items[i].string))
+   continue;
+   found = 1;
+   break;
+   }
+   string_list_clear(, 0);
+
+   return found;
+}
+
 static int checkout(int submodule_progress)
 {
struct object_id oid;
@@ -761,6 +785,11 @@ static int checkout(int submodule_progress)
if (write_locked_index(_index, _file, COMMIT_LOCK))
die(_("unable to write new index file"));
 
+   if (ignore_case && has_duplicate_icase_entries(_index))
+   warning(_("This repository has paths that only differ in case\n"
+ "and you have a case-insenitive filesystem which 
will\n"
+ "cause problems."));
+
err |= run_hook_le(NULL, "post-checkout", sha1_to_hex(null_sha1),
   oid_to_hex(), "1", NULL);
 
-- 8< --


Re: Git clone and case sensitivity

2018-07-27 Thread Jeff King
On Sat, Jul 28, 2018 at 06:45:43AM +0200, Duy Nguyen wrote:

> > I agree throwing a real exception would be bad. But how about detecting
> > the problem and trying our best to keep the repo in somewhat usable
> > state like this?
> >
> > This patch uses sparse checkout to hide all those paths that we fail
> > to checkout, so you can still have a clean worktree to do things, as
> > long as you don't touch those paths.
> 
> Side note. There may still be problems with this patch. Let's use
> vim-colorschemes.git as an example, which has darkBlue.vim and
> darkblue.vim.
> 
> Say we have checked out darkBlue.vim and hidden darkblue.vim. When you
> update darkBlue.vim on worktree and then update the index, are we sure
> we will update darkBlue.vim entry and not (hidden) darkblue.vim? I am
> not sure. I don't think our lookup function is prepared to deal with
> this. Maybe it's best to hide both of them.

It might be enough to just issue a warning and give an advise() hint
that tells the user what's going on. Then they can decide what to do
(hide both paths, or just work in the index, or move to a different fs,
or complain to upstream).

-Peff


Re: Git clone and case sensitivity

2018-07-27 Thread Duy Nguyen
On Sat, Jul 28, 2018 at 6:36 AM Duy Nguyen  wrote:
>
> On Fri, Jul 27, 2018 at 08:59:09PM +, brian m. carlson wrote:
> > On Fri, Jul 27, 2018 at 11:59:33AM +0200, Paweł Paruzel wrote:
> > > Hi,
> > >
> > > Lately, I have been wondering why my test files in repo are modified
> > > after I clone it. It turned out to be two files: boolStyle_t_f and
> > > boolStyle_T_F.
> > > The system that pushed those files was case sensitive while my mac
> > > after High Sierra update had APFS which is by default
> > > case-insensitive. I highly suggest that git clone threw an exception
> > > when files are case sensitive and being cloned to a case insensitive
> > > system. This has caused problems with overriding files for test cases
> > > without any warning.
> >
> > If we did what you proposed, it would be impossible to clone such a
> > repository on a case-insensitive system.
>
> I agree throwing a real exception would be bad. But how about detecting
> the problem and trying our best to keep the repo in somewhat usable
> state like this?
>
> This patch uses sparse checkout to hide all those paths that we fail
> to checkout, so you can still have a clean worktree to do things, as
> long as you don't touch those paths.

Side note. There may still be problems with this patch. Let's use
vim-colorschemes.git as an example, which has darkBlue.vim and
darkblue.vim.

Say we have checked out darkBlue.vim and hidden darkblue.vim. When you
update darkBlue.vim on worktree and then update the index, are we sure
we will update darkBlue.vim entry and not (hidden) darkblue.vim? I am
not sure. I don't think our lookup function is prepared to deal with
this. Maybe it's best to hide both of them.
-- 
Duy


Re: Git clone and case sensitivity

2018-07-27 Thread Duy Nguyen
On Fri, Jul 27, 2018 at 08:59:09PM +, brian m. carlson wrote:
> On Fri, Jul 27, 2018 at 11:59:33AM +0200, Paweł Paruzel wrote:
> > Hi,
> > 
> > Lately, I have been wondering why my test files in repo are modified
> > after I clone it. It turned out to be two files: boolStyle_t_f and
> > boolStyle_T_F.
> > The system that pushed those files was case sensitive while my mac
> > after High Sierra update had APFS which is by default
> > case-insensitive. I highly suggest that git clone threw an exception
> > when files are case sensitive and being cloned to a case insensitive
> > system. This has caused problems with overriding files for test cases
> > without any warning.
> 
> If we did what you proposed, it would be impossible to clone such a
> repository on a case-insensitive system.

I agree throwing a real exception would be bad. But how about detecting
the problem and trying our best to keep the repo in somewhat usable
state like this?

This patch uses sparse checkout to hide all those paths that we fail
to checkout, so you can still have a clean worktree to do things, as
long as you don't touch those paths.

-- 8< --
diff --git a/builtin/clone.c b/builtin/clone.c
index 1d939af9d8..a6b5e2c948 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -711,6 +711,30 @@ static void update_head(const struct ref *our, const 
struct ref *remote,
}
 }
 
+static int enable_sparse_checkout_on_icase_fs(struct index_state *istate)
+{
+   int i;
+   int skip_count = 0;
+   FILE *fp = fopen(git_path("info/sparse"), "a+");
+
+   for (i = 0; i < istate->cache_nr; i++) {
+   struct cache_entry *ce = istate->cache[i];
+   if (!ce_skip_worktree(ce))
+   continue;
+   if (!skip_count) {
+   git_config_set_multivar_gently("core.sparseCheckout",
+  "true",
+  CONFIG_REGEX_NONE, 0);
+   fprintf(fp, "# List of paths hidden by 'git clone'\n");
+   }
+   fprintf(fp, "/%s\n", ce->name);
+   skip_count++;
+   }
+   fclose(fp);
+
+   return skip_count;
+}
+
 static int checkout(int submodule_progress)
 {
struct object_id oid;
@@ -751,6 +775,7 @@ static int checkout(int submodule_progress)
opts.verbose_update = (option_verbosity >= 0);
opts.src_index = _index;
opts.dst_index = _index;
+   opts.clone_checkout = 1;
 
tree = parse_tree_indirect();
parse_tree(tree);
@@ -761,6 +786,12 @@ static int checkout(int submodule_progress)
if (write_locked_index(_index, _file, COMMIT_LOCK))
die(_("unable to write new index file"));
 
+   if (enable_sparse_checkout_on_icase_fs(_index))
+   warning("Paths that differ only in case are detected "
+   "and will not work correctly on this case-insensitive "
+   "filesystem. Sparse checkout has been enabled to hide "
+   "these paths.");
+
err |= run_hook_le(NULL, "post-checkout", sha1_to_hex(null_sha1),
   oid_to_hex(), "1", NULL);
 
diff --git a/cache.h b/cache.h
index 8b447652a7..9ecf7ad952 100644
--- a/cache.h
+++ b/cache.h
@@ -1455,6 +1455,7 @@ struct checkout {
unsigned force:1,
 quiet:1,
 not_new:1,
+set_skipworktree_on_updated:1,
 refresh_cache:1;
 };
 #define CHECKOUT_INIT { NULL, "" }
diff --git a/entry.c b/entry.c
index b5d1d3cf23..ba21db63e7 100644
--- a/entry.c
+++ b/entry.c
@@ -447,6 +447,11 @@ int checkout_entry(struct cache_entry *ce,
 
if (!changed)
return 0;
+   if (state->set_skipworktree_on_updated) {
+   ce->ce_flags |= CE_SKIP_WORKTREE;
+   state->istate->cache_changed |= CE_ENTRY_CHANGED;
+   return 0;
+   }
if (!state->force) {
if (!state->quiet)
fprintf(stderr,
diff --git a/unpack-trees.c b/unpack-trees.c
index 66741130ae..a8a24e0b13 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -358,6 +358,7 @@ static int check_updates(struct unpack_trees_options *o)
state.quiet = 1;
state.refresh_cache = 1;
state.istate = index;
+   state.set_skipworktree_on_updated = o->clone_checkout;
 
progress = get_progress(o);
 
diff --git a/unpack-trees.h b/unpack-trees.h
index c2b434c606..8ebe2e2ec5 100644
--- a/unpack-trees.h
+++ b/unpack-trees.h
@@ -49,6 +49,7 @@ struct unpack_trees_options {
 aggressive,
 skip_unmerged,
 initial_checkout,
+clone_checkout,
 diff_index_cached,
 debug_unpack,
 

Re: Git clone and case sensitivity

2018-07-27 Thread brian m. carlson
On Fri, Jul 27, 2018 at 11:59:33AM +0200, Paweł Paruzel wrote:
> Hi,
> 
> Lately, I have been wondering why my test files in repo are modified
> after I clone it. It turned out to be two files: boolStyle_t_f and
> boolStyle_T_F.
> The system that pushed those files was case sensitive while my mac
> after High Sierra update had APFS which is by default
> case-insensitive. I highly suggest that git clone threw an exception
> when files are case sensitive and being cloned to a case insensitive
> system. This has caused problems with overriding files for test cases
> without any warning.

If we did what you proposed, it would be impossible to clone such a
repository on a case-insensitive system.  While this might be fine for a
closed system such as inside a company, this would make many open source
repositories unusable, even when the files differing in case are
nonfunctional (like README and readme).

This is actually one of a few ways people can make repositories that
will show as modified on Windows or macOS systems, due to limitations in
those OSes.  If you want to be sure that your repository is unmodified
after clone, you can ensure that the output of git status --porcelain is
empty, such as by checking for a zero exit from
"test $(git status --porcelain | wc -l) -eq 0".
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204


signature.asc
Description: PGP signature


Git clone and case sensitivity

2018-07-27 Thread Paweł Paruzel
Hi,

Lately, I have been wondering why my test files in repo are modified after I 
clone it. It turned out to be two files: boolStyle_t_f and boolStyle_T_F.
The system that pushed those files was case sensitive while my mac after High 
Sierra update had APFS which is by default case-insensitive. I highly suggest 
that git clone threw an exception when files are case sensitive and being 
cloned to a case insensitive system. This has caused problems with overriding 
files for test cases without any warning.

Thanks in advance.
Regards,
Pawel Paruzel