Re: [PATCH v7 19/31] merge-recursive: add get_directory_renames()

2018-02-05 Thread Stefan Beller
> /*
>  * For
>  *"a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
>  * the "e/foo.c" part is the same, we just want to know that
>  *"a/b/c/d" was renamed to "a/b/some/thing/else"
>  * so, for this example, this function returns "a/b/c/d" in
>  * *old_dir and "a/b/some/thing/else" in *new_dir.
>  *
>  * Also, if the basename of the file changed, we don't care.  We
>  * want to know which portion of the directory, if any, changed.
>  */
>
> Is that better?

yes, that helps me understanding pretty much.

>
>>> +*
>>> +* Also, if the basename of the file changed, we don't care.  We
>>> +* want to know which portion of the directory, if any, changed.
>>> +*/
>>> +   end_of_old = strrchr(old_path, '/');
>>> +   end_of_new = strrchr(new_path, '/');
>>> +
>>> +   if (end_of_old == NULL || end_of_new == NULL)
>>> +   return;
>>
>> return early as the files are in the top level, and apparently we do
>> not support top level renaming?
>>
>> What about a commit like 81b50f3ce4 (Move 'builtin-*' into
>> a 'builtin/' subdirectory, 2010-02-22) ?
>>
>> Well that specific commit left many files outside the new builtin dir,
>> but conceptually we could see a directory rename of
>>
>> /* => /src/*
>
> We had some internal usecases for want to support a "rename" of the
> toplevel directory into a subdirectory of itself.  However, attempting
> to support that opened much too big a can of worms for me.  We'd open
> up some big surprises somewhere.


ok, I was just trying to understand the code.
(As said before, I was not asking for having this implemented.)

> In particular, note that not supporting a "rename" of the toplevel
> directory is a special case of not supporting a "rename" of any
> directory to a subdirectory below itself, which in turn is a very
> special case of excluding partial directory renames.  I addressed this
> in the cover letter of my original submission, as follows:
>
> """
> Further, there's a basic question about when directory rename detection
> should be applied at all.  I have a simple rule:
>
>   3) If a given directory still exists on both sides of a merge, we do
>  not consider it to have been renamed.
>
> Rule 3 may sound obvious at first, but it will probably arise as a
> question for some users -- what if someone "mostly" moved a directory but
> still left some files around, or, equivalently (from the perspective of the
> three-way merge that merge-recursive performs), fully renamed a directory
> in one commmit and then recreated that directory in a later commit adding
> some new files and then tried to merge?  See the big comment in section 4
> of the new t6043 for further discussion of this rule.
> """
>
> Patch 04/31 is the one that adds that big comment with further discussion.
>
> Maybe there's a way to support toplevel renames, but I think it'd make
> this series longer and more complicated...and may cause more edge
> cases that confuse users.

Thanks for reminding!

>
>>> +   while (*--end_of_new == *--end_of_old &&
>>> +  end_of_old != old_path &&
>>> +  end_of_new != new_path)
>>> +   ; /* Do nothing; all in the while loop */
>>
>> We have to compare manually as we'd want to find
>> the first non-equal and there doesn't seem to be a good
>> library function for that.
>>
>> Assuming many repos are UTF8 (including in their paths),
>> how does this work with display characters longer than one char?
>> It should be fine as we cut at the slash?
>
> Oh, UTF-8.  Ugh.
> Can UTF-8 characters, other than '/', have a byte whose value matches
> (unsigned char)('/')?  If so, then I'll need to figure out how to do
> utf-8 character parsing.  Anyone have pointers?

Oh right we are always cutting at '/', which hex 2F, so we cannot split
a codepoint in half accidentally (finding the slash inside a codepoint).

And renaming a directory from one utf8 codepoint to another, which
have the same prefix is not an issue at this layer, too.
(Think of abc -> abd just all of abc/d in one code point, but there
but even for multi code points/ASCII we repeat the prefix when
printing the rename)

>> So the first loop is about counting the number of files in each directory
>> that are renamed and the later while loop is about mapping them?
>
> Close; would adding the following comment at the top of the function help?
>
> /*
>  * Typically, we think of a directory rename as all files from a
>  * certain directory being moved to a target directory.  However,
>  * what if someone first moved two files from the original
>  * directory in one commit, and then renamed the directory
>  * somewhere else in a later commit?  At merge time, we just know
>  * that files from the original directory went to two different
>  * places, and that the bulk of them ended up in the same place.
>  * We want each directory rename to follow the bulk

Re: [PATCH v7 19/31] merge-recursive: add get_directory_renames()

2018-02-03 Thread Eric Sunshine
On Sat, Feb 3, 2018 at 11:42 PM, Eric Sunshine  wrote:
> [2]: https://en.wikipedia.org/wiki/Diaeresis_(diacritic)

Correction:
[2]: https://en.wikipedia.org/wiki/Precomposed_character


Re: [PATCH v7 19/31] merge-recursive: add get_directory_renames()

2018-02-03 Thread Eric Sunshine
On Sat, Feb 3, 2018 at 9:04 PM, Elijah Newren  wrote:
> On Sat, Feb 3, 2018 at 2:32 PM, Elijah Newren  wrote:
>> On Fri, Feb 2, 2018 at 5:02 PM, Stefan Beller  wrote:
>>> On Tue, Jan 30, 2018 at 3:25 PM, Elijah Newren  wrote:
 +   while (*--end_of_new == *--end_of_old &&
 +  end_of_old != old_path &&
 +  end_of_new != new_path)
 +   ; /* Do nothing; all in the while loop */
>>>
>>> Assuming many repos are UTF8 (including in their paths),
>>> how does this work with display characters longer than one char?
>>> It should be fine as we cut at the slash?
>>
>> Can UTF-8 characters, other than '/', have a byte whose value matches
>> (unsigned char)('/')?  If so, then I'll need to figure out how to do
>> utf-8 character parsing.  Anyone have pointers?
>
> Well, after digging around for a while, I found this claim on the
> Wikipedia page for UTF-8:
>
>   Since ASCII bytes do not occur when encoding non-ASCII code points
> into UTF-8, UTF-8 is safe to use within most programming and document
> languages that interpret certain ASCII characters in a special way,
> such as "/" in filenames, "\" in escape sequences, and "%" in printf.
>
> So, unless I'm reading something wrong here, I think that means this
> code is just fine as it is.

You're reading it correctly. Unicode values greater than \U007f
encoded with UTF-8 will never contain bytes which can be confused with
any 7-bit ASCII character.

It's possible that Stefan was thinking of "combining characters"[1]
which may be "precomposed" and "decomposed"[2], but which appear the
same when rendered. For instance, "รถ" might be a single Unicode
codepoint or two codepoints, such as "o" combined with a diaeresis.
It's a potential problem if you're comparing, byte by byte, two
filenames which look the same. However, Git takes pains[3] to avoid
this problem by ensuring (if possible) that filenames are precomposed
within Git even if they happen to be decomposed on the actual
filesystem. So, most likely, your code is okay as-is.

[1]: https://en.wikipedia.org/wiki/Combining_character
[2]: https://en.wikipedia.org/wiki/Diaeresis_(diacritic)
[3]: https://github.com/git/git/blob/master/compat/precompose_utf8.c


Re: [PATCH v7 19/31] merge-recursive: add get_directory_renames()

2018-02-03 Thread Elijah Newren
On Sat, Feb 3, 2018 at 2:32 PM, Elijah Newren  wrote:
> On Fri, Feb 2, 2018 at 5:02 PM, Stefan Beller  wrote:
>> On Tue, Jan 30, 2018 at 3:25 PM, Elijah Newren  wrote:

>>> +   while (*--end_of_new == *--end_of_old &&
>>> +  end_of_old != old_path &&
>>> +  end_of_new != new_path)
>>> +   ; /* Do nothing; all in the while loop */
>>
>> We have to compare manually as we'd want to find
>> the first non-equal and there doesn't seem to be a good
>> library function for that.
>>
>> Assuming many repos are UTF8 (including in their paths),
>> how does this work with display characters longer than one char?
>> It should be fine as we cut at the slash?
>
> Oh, UTF-8.  Ugh.
> Can UTF-8 characters, other than '/', have a byte whose value matches
> (unsigned char)('/')?  If so, then I'll need to figure out how to do
> utf-8 character parsing.  Anyone have pointers?

Well, after digging around for a while, I found this claim on the
Wikipedia page for UTF-8:

  Since ASCII bytes do not occur when encoding non-ASCII code points
into UTF-8, UTF-8 is safe to use within most programming and document
languages that interpret certain ASCII characters in a special way,
such as "/" in filenames, "\" in escape sequences, and "%" in printf.

So, unless I'm reading something wrong here, I think that means this
code is just fine as it is.


Re: [PATCH v7 19/31] merge-recursive: add get_directory_renames()

2018-02-03 Thread Elijah Newren
On Fri, Feb 2, 2018 at 5:02 PM, Stefan Beller  wrote:
> On Tue, Jan 30, 2018 at 3:25 PM, Elijah Newren  wrote:

>> +   /* For
>
> comment style.

Fixed it and looked through the file for any other violations and
fixed the ones introduced in this series.  (The ones I added years ago
in other places of the file I just left.)

>> +*"a/b/c/d/foo.c" -> "a/b/something-else/d/foo.c"
>> +* the "d/foo.c" part is the same, we just want to know that
>> +*"a/b/c" was renamed to "a/b/something-else"
>> +* so, for this example, this function returns "a/b/c" in
>> +* *old_dir and "a/b/something-else" in *new_dir.
>
> So we would not see multi-directory renames?
>
> "a/b/c/d/foo.c" -> "a/b/something-else/e/foo.c"
>
> could be detected as
>
> "a/b/{c/d/ => something-else/e}/foo.c"
>
> I assume this patch series is not bringing that to the table.
> (which is fine, I am just wondering)

I fully intended to support that, and believe the code does.  I
changed the comment as follows to try to make it clearer:

/*
 * For
 *"a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
 * the "e/foo.c" part is the same, we just want to know that
 *"a/b/c/d" was renamed to "a/b/some/thing/else"
 * so, for this example, this function returns "a/b/c/d" in
 * *old_dir and "a/b/some/thing/else" in *new_dir.
 *
 * Also, if the basename of the file changed, we don't care.  We
 * want to know which portion of the directory, if any, changed.
 */

Is that better?

>> +*
>> +* Also, if the basename of the file changed, we don't care.  We
>> +* want to know which portion of the directory, if any, changed.
>> +*/
>> +   end_of_old = strrchr(old_path, '/');
>> +   end_of_new = strrchr(new_path, '/');
>> +
>> +   if (end_of_old == NULL || end_of_new == NULL)
>> +   return;
>
> return early as the files are in the top level, and apparently we do
> not support top level renaming?
>
> What about a commit like 81b50f3ce4 (Move 'builtin-*' into
> a 'builtin/' subdirectory, 2010-02-22) ?
>
> Well that specific commit left many files outside the new builtin dir,
> but conceptually we could see a directory rename of
>
> /* => /src/*

We had some internal usecases for want to support a "rename" of the
toplevel directory into a subdirectory of itself.  However, attempting
to support that opened much too big a can of worms for me.  We'd open
up some big surprises somewhere.

In particular, note that not supporting a "rename" of the toplevel
directory is a special case of not supporting a "rename" of any
directory to a subdirectory below itself, which in turn is a very
special case of excluding partial directory renames.  I addressed this
in the cover letter of my original submission, as follows:

"""
Further, there's a basic question about when directory rename detection
should be applied at all.  I have a simple rule:

  3) If a given directory still exists on both sides of a merge, we do
 not consider it to have been renamed.

Rule 3 may sound obvious at first, but it will probably arise as a
question for some users -- what if someone "mostly" moved a directory but
still left some files around, or, equivalently (from the perspective of the
three-way merge that merge-recursive performs), fully renamed a directory
in one commmit and then recreated that directory in a later commit adding
some new files and then tried to merge?  See the big comment in section 4
of the new t6043 for further discussion of this rule.
"""

Patch 04/31 is the one that adds that big comment with further discussion.

Maybe there's a way to support toplevel renames, but I think it'd make
this series longer and more complicated...and may cause more edge
cases that confuse users.

>> +   while (*--end_of_new == *--end_of_old &&
>> +  end_of_old != old_path &&
>> +  end_of_new != new_path)
>> +   ; /* Do nothing; all in the while loop */
>
> We have to compare manually as we'd want to find
> the first non-equal and there doesn't seem to be a good
> library function for that.
>
> Assuming many repos are UTF8 (including in their paths),
> how does this work with display characters longer than one char?
> It should be fine as we cut at the slash?

Oh, UTF-8.  Ugh.
Can UTF-8 characters, other than '/', have a byte whose value matches
(unsigned char)('/')?  If so, then I'll need to figure out how to do
utf-8 character parsing.  Anyone have pointers?

>> +   /*
>> +* We've found the first non-matching character in the directory
>> +* paths.  That means the current directory we were comparing
>> +* represents the rename.  Move end_of_old and end_of_new back
>> +* to the full directory name.
>> +*/
>> +   if (*end_of_old == '/')
>> +   end_of_old++;
>> +   if (*end_of_old != '/')
>> +   end_of_new++;
>> +   end_of

Re: [PATCH v7 19/31] merge-recursive: add get_directory_renames()

2018-02-02 Thread Stefan Beller
On Tue, Jan 30, 2018 at 3:25 PM, Elijah Newren  wrote:
> This populates a list of directory renames for us.  The list of
> directory renames is not yet used, but will be in subsequent commits.
>
> Signed-off-by: Elijah Newren 
> ---
>  merge-recursive.c | 155 
> --
>  1 file changed, 152 insertions(+), 3 deletions(-)
>
> diff --git a/merge-recursive.c b/merge-recursive.c
> index 40ed8e1f39..c75d3a5139 100644
> --- a/merge-recursive.c
> +++ b/merge-recursive.c
> @@ -1393,6 +1393,138 @@ static struct diff_queue_struct *get_diffpairs(struct 
> merge_options *o,
> return ret;
>  }
>
> +static void get_renamed_dir_portion(const char *old_path, const char 
> *new_path,
> +   char **old_dir, char **new_dir)
> +{
> +   char *end_of_old, *end_of_new;
> +   int old_len, new_len;
> +
> +   *old_dir = NULL;
> +   *new_dir = NULL;
> +
> +   /* For

comment style.

/*
 * we prefer to keep the beginning
 * and ending line of a comment free.
 */
/* unless you write single line comments */

> +*"a/b/c/d/foo.c" -> "a/b/something-else/d/foo.c"
> +* the "d/foo.c" part is the same, we just want to know that
> +*"a/b/c" was renamed to "a/b/something-else"
> +* so, for this example, this function returns "a/b/c" in
> +* *old_dir and "a/b/something-else" in *new_dir.

So we would not see multi-directory renames?

"a/b/c/d/foo.c" -> "a/b/something-else/e/foo.c"

could be detected as

"a/b/{c/d/ => something-else/e}/foo.c"

I assume this patch series is not bringing that to the table.
(which is fine, I am just wondering)

> +*
> +* Also, if the basename of the file changed, we don't care.  We
> +* want to know which portion of the directory, if any, changed.
> +*/
> +   end_of_old = strrchr(old_path, '/');
> +   end_of_new = strrchr(new_path, '/');
> +
> +   if (end_of_old == NULL || end_of_new == NULL)
> +   return;

return early as the files are in the top level, and apparently we do
not support top level renaming?

What about a commit like 81b50f3ce4 (Move 'builtin-*' into
a 'builtin/' subdirectory, 2010-02-22) ?

Well that specific commit left many files outside the new builtin dir,
but conceptually we could see a directory rename of

/* => /src/*

> +   while (*--end_of_new == *--end_of_old &&
> +  end_of_old != old_path &&
> +  end_of_new != new_path)
> +   ; /* Do nothing; all in the while loop */

We have to compare manually as we'd want to find
the first non-equal and there doesn't seem to be a good
library function for that.

Assuming many repos are UTF8 (including in their paths),
how does this work with display characters longer than one char?
It should be fine as we cut at the slash?

> +   /*
> +* We've found the first non-matching character in the directory
> +* paths.  That means the current directory we were comparing
> +* represents the rename.  Move end_of_old and end_of_new back
> +* to the full directory name.
> +*/
> +   if (*end_of_old == '/')
> +   end_of_old++;
> +   if (*end_of_old != '/')
> +   end_of_new++;
> +   end_of_old = strchr(end_of_old, '/');
> +   end_of_new = strchr(end_of_new, '/');
> +
> +   /*
> +* It may have been the case that old_path and new_path were the same
> +* directory all along.  Don't claim a rename if they're the same.
> +*/
> +   old_len = end_of_old - old_path;
> +   new_len = end_of_new - new_path;
> +
> +   if (old_len != new_len || strncmp(old_path, new_path, old_len)) {

How often are we going to see this string-is-equal case?
Would it make sense to do that first in the function?

> +   *old_dir = xstrndup(old_path, old_len);
> +   *new_dir = xstrndup(new_path, new_len);
> +   }
> +}
> +
> +static struct hashmap *get_directory_renames(struct diff_queue_struct *pairs,
> +struct tree *tree)
> +{
> +   struct hashmap *dir_renames;
> +   struct hashmap_iter iter;
> +   struct dir_rename_entry *entry;
> +   int i;
> +
> +   dir_renames = malloc(sizeof(struct hashmap));

xmalloc

> +   dir_rename_init(dir_renames);
> +   for (i = 0; i < pairs->nr; ++i) {
> +   struct string_list_item *item;
> +   int *count;
> +   struct diff_filepair *pair = pairs->queue[i];
> +   char *old_dir, *new_dir;
> +
> +   /* File not part of directory rename if it wasn't renamed */
> +   if (pair->status != 'R')
> +   continue;
> +
> +   get_renamed_dir_portion(pair->one->path, pair->two->path,
> +   &old_dir,&new_dir);
> +   if (!old_dir)
> 

[PATCH v7 19/31] merge-recursive: add get_directory_renames()

2018-01-30 Thread Elijah Newren
This populates a list of directory renames for us.  The list of
directory renames is not yet used, but will be in subsequent commits.

Signed-off-by: Elijah Newren 
---
 merge-recursive.c | 155 --
 1 file changed, 152 insertions(+), 3 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index 40ed8e1f39..c75d3a5139 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -1393,6 +1393,138 @@ static struct diff_queue_struct *get_diffpairs(struct 
merge_options *o,
return ret;
 }
 
+static void get_renamed_dir_portion(const char *old_path, const char *new_path,
+   char **old_dir, char **new_dir)
+{
+   char *end_of_old, *end_of_new;
+   int old_len, new_len;
+
+   *old_dir = NULL;
+   *new_dir = NULL;
+
+   /* For
+*"a/b/c/d/foo.c" -> "a/b/something-else/d/foo.c"
+* the "d/foo.c" part is the same, we just want to know that
+*"a/b/c" was renamed to "a/b/something-else"
+* so, for this example, this function returns "a/b/c" in
+* *old_dir and "a/b/something-else" in *new_dir.
+*
+* Also, if the basename of the file changed, we don't care.  We
+* want to know which portion of the directory, if any, changed.
+*/
+   end_of_old = strrchr(old_path, '/');
+   end_of_new = strrchr(new_path, '/');
+
+   if (end_of_old == NULL || end_of_new == NULL)
+   return;
+   while (*--end_of_new == *--end_of_old &&
+  end_of_old != old_path &&
+  end_of_new != new_path)
+   ; /* Do nothing; all in the while loop */
+   /*
+* We've found the first non-matching character in the directory
+* paths.  That means the current directory we were comparing
+* represents the rename.  Move end_of_old and end_of_new back
+* to the full directory name.
+*/
+   if (*end_of_old == '/')
+   end_of_old++;
+   if (*end_of_old != '/')
+   end_of_new++;
+   end_of_old = strchr(end_of_old, '/');
+   end_of_new = strchr(end_of_new, '/');
+
+   /*
+* It may have been the case that old_path and new_path were the same
+* directory all along.  Don't claim a rename if they're the same.
+*/
+   old_len = end_of_old - old_path;
+   new_len = end_of_new - new_path;
+
+   if (old_len != new_len || strncmp(old_path, new_path, old_len)) {
+   *old_dir = xstrndup(old_path, old_len);
+   *new_dir = xstrndup(new_path, new_len);
+   }
+}
+
+static struct hashmap *get_directory_renames(struct diff_queue_struct *pairs,
+struct tree *tree)
+{
+   struct hashmap *dir_renames;
+   struct hashmap_iter iter;
+   struct dir_rename_entry *entry;
+   int i;
+
+   dir_renames = malloc(sizeof(struct hashmap));
+   dir_rename_init(dir_renames);
+   for (i = 0; i < pairs->nr; ++i) {
+   struct string_list_item *item;
+   int *count;
+   struct diff_filepair *pair = pairs->queue[i];
+   char *old_dir, *new_dir;
+
+   /* File not part of directory rename if it wasn't renamed */
+   if (pair->status != 'R')
+   continue;
+
+   get_renamed_dir_portion(pair->one->path, pair->two->path,
+   &old_dir,&new_dir);
+   if (!old_dir)
+   /* Directory didn't change at all; ignore this one. */
+   continue;
+
+   entry = dir_rename_find_entry(dir_renames, old_dir);
+   if (!entry) {
+   entry = xmalloc(sizeof(struct dir_rename_entry));
+   dir_rename_entry_init(entry, old_dir);
+   hashmap_put(dir_renames, entry);
+   } else {
+   free(old_dir);
+   }
+   item = string_list_lookup(&entry->possible_new_dirs, new_dir);
+   if (!item) {
+   item = string_list_insert(&entry->possible_new_dirs,
+ new_dir);
+   item->util = xcalloc(1, sizeof(int));
+   } else {
+   free(new_dir);
+   }
+   count = item->util;
+   *count += 1;
+   }
+
+   hashmap_iter_init(dir_renames, &iter);
+   while ((entry = hashmap_iter_next(&iter))) {
+   int max = 0;
+   int bad_max = 0;
+   char *best = NULL;
+
+   for (i = 0; i < entry->possible_new_dirs.nr; i++) {
+   int *count = entry->possible_new_dirs.items[i].util;
+
+   if (*count == max)
+   bad_max = max;
+   else if (*count > max) {
+