Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-29 Thread Jeff King
On Sun, May 19, 2013 at 02:17:35PM +0100, John Keeping wrote:

 When using git cherry or git log --cherry-pick we often have a small
 number of commits on one side and a large number on the other.  In
 revision.c::cherry_pick_list we store the patch IDs for the small side
 before comparing the large side to this.
 
 In this case, it is likely that only a small number of paths are touched
 by the commits on the smaller side of the range and by storing these we
 can ignore many commits on the other side before unpacking blobs and
 diffing them.

There are two observations that make your scheme work:

  1. For something like --cherry-pick, we do not even care about the
 actual patch-ids, only whether there are matches from the left and
 right sides.

  2. Comparing the sets of paths touched by two commits is often
 sufficient to realize they do not have the same patch-ids.

But I think those observations mean we can do even better than
calculating the real patch-id for the small side at all.

Imagine we have both loose and strict patch-ids, where the loose one
is much faster to compute, and has that property that a match _might_
mean we have the same strict patch-id, but a non-match means we
definitely do not have the same strict patch-id.

I think such a loose patch-id could just be a hash of the filenames that
were changed by the patch (e.g., the first 32-bits of the sha1 of the
concatenated filenames). Computing that should be about as expensive as
a tree-diff. Per observation 2 above, if two commits do not have the
same loose id, we know that they cannot possibly have the same strict
id.

Then we can forget about the smaller-side and bigger-side entirely, and
just do something like:

  1. Make a sorted list (or hash table) of loose ids for one side.

  2. For each commit on the other side, calculate its loose id and look
 that up in the sorted list. If no hits, we know that there is no
 match. For any hits, lazily calculate (and cache) the strict patch
 id for both sides and compare as usual.

In the best case, we compute no patch-ids at all. And even for the
average case, I'd expect our lazy calculation to only have to compute a
handful of ids.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-29 Thread Jeff King
On Wed, May 29, 2013 at 02:20:07AM -0400, Jeff King wrote:

 In the best case, we compute no patch-ids at all. And even for the
 average case, I'd expect our lazy calculation to only have to compute a
 handful of ids.

Here is a not-well-tested version of the idea. I tried to contain the
changes to patch-ids.c, though we may also be able to simplify the
--cherry code, too.

Here are my timings compared to stock git and to yours
(all are best-of-five, git log --cherry $revs):

  revs=origin/master...origin/jk/submodule-subdirectory-ok
stock|you   |me
---
  real  0m0.501s | 0m0.078s | 0m0.098s
  user  0m0.480s | 0m0.056s | 0m0.084s
  sys   0m0.016s | 0m0.020s | 0m0.012s

  revs=origin/next...origin/pu
stock|you   |me
---
  real  0m0.857s | 0m0.847s | 0m0.519s
  user  0m0.828s | 0m0.812s | 0m0.488s
  sys   0m0.024s | 0m0.028s | 0m0.024s

So it performs slightly less well on the small case, and a bit better on
the large case. I think part of the problem is that when we do have a
loose hit, we end up re-doing the tree diff to find the strict, which
involves re-inflating the trees. It's unavoidable for the lazy entries
unless we want to cache the whole diff_queued_diff, but for the non-lazy
entries we should be able to do the strict patch-id incrementally. We
just need to improve the function interfaces.

---
 diff.c  |  15 -
 diff.h  |   2 +-
 patch-ids.c | 117 +++
 patch-ids.h |  11 ++--
 4 files changed, 82 insertions(+), 63 deletions(-)

diff --git a/diff.c b/diff.c
index f0b3e7c..3b55788 100644
--- a/diff.c
+++ b/diff.c
@@ -4233,7 +4233,8 @@ static void patch_id_consume(void *priv, char *line, 
unsigned long len)
 }
 
 /* returns 0 upon success, and writes result into sha1 */
-static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1)
+static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1,
+int loose)
 {
struct diff_queue_struct *q = diff_queued_diff;
int i;
@@ -4266,6 +4267,12 @@ static int diff_get_patch_id(struct diff_options 
*options, unsigned char *sha1)
if (DIFF_PAIR_UNMERGED(p))
continue;
 
+   if (loose) {
+   git_SHA1_Update(ctx, p-one-path, 
strlen(p-one-path));
+   git_SHA1_Update(ctx, p-two-path, 
strlen(p-two-path));
+   continue;
+   }
+
diff_fill_sha1_info(p-one);
diff_fill_sha1_info(p-two);
if (fill_mmfile(mf1, p-one)  0 ||
@@ -4323,11 +4330,13 @@ int diff_flush_patch_id(struct diff_options *options, 
unsigned char *sha1)
return 0;
 }
 
-int diff_flush_patch_id(struct diff_options *options, unsigned char *sha1)
+int diff_flush_patch_id(struct diff_options *options,
+   unsigned char *sha1,
+   int loose)
 {
struct diff_queue_struct *q = diff_queued_diff;
int i;
-   int result = diff_get_patch_id(options, sha1);
+   int result = diff_get_patch_id(options, sha1, loose);
 
for (i = 0; i  q-nr; i++)
diff_free_filepair(q-queue[i]);
diff --git a/diff.h b/diff.h
index 78b4091..b018aaf 100644
--- a/diff.h
+++ b/diff.h
@@ -320,7 +320,7 @@ extern int do_diff_cache(const unsigned char *, struct 
diff_options *);
 extern int run_diff_index(struct rev_info *revs, int cached);
 
 extern int do_diff_cache(const unsigned char *, struct diff_options *);
-extern int diff_flush_patch_id(struct diff_options *, unsigned char *);
+extern int diff_flush_patch_id(struct diff_options *, unsigned char *, int 
loose);
 
 extern int diff_result_code(struct diff_options *, int);
 
diff --git a/patch-ids.c b/patch-ids.c
index bc8a28f..3a83ee6 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -5,7 +5,7 @@ static int commit_patch_id(struct commit *commit, struct 
diff_options *options,
 #include patch-ids.h
 
 static int commit_patch_id(struct commit *commit, struct diff_options *options,
-   unsigned char *sha1)
+  unsigned char *sha1, int loose)
 {
if (commit-parents)
diff_tree_sha1(commit-parents-item-object.sha1,
@@ -13,27 +13,9 @@ static int commit_patch_id(struct commit *commit, struct 
diff_options *options,
else
diff_root_tree_sha1(commit-object.sha1, , options);
diffcore_std(options);
-   return diff_flush_patch_id(options, sha1);
+   return diff_flush_patch_id(options, sha1, loose);
 }
 
-static const unsigned char *patch_id_access(size_t index, void *table)
-{
-   struct patch_id **id_table = table;
-   return id_table[index]-patch_id;
-}
-
-static int patch_pos(struct patch_id **table, int nr, const unsigned char *id)
-{
-   return sha1_pos(id, table, nr, patch_id_access);
-}
-
-#define 

Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-29 Thread Jeff King
On Wed, May 29, 2013 at 03:22:25AM -0400, Jeff King wrote:

   revs=origin/master...origin/jk/submodule-subdirectory-ok
 stock|you   |me
 ---
   real  0m0.501s | 0m0.078s | 0m0.098s
   user  0m0.480s | 0m0.056s | 0m0.084s
   sys   0m0.016s | 0m0.020s | 0m0.012s
 
   revs=origin/next...origin/pu
 stock|you   |me
 ---
   real  0m0.857s | 0m0.847s | 0m0.519s
   user  0m0.828s | 0m0.812s | 0m0.488s
   sys   0m0.024s | 0m0.028s | 0m0.024s
 
 So it performs slightly less well on the small case, and a bit better on
 the large case. I think part of the problem is that when we do have a
 loose hit, we end up re-doing the tree diff to find the strict, which
 involves re-inflating the trees. It's unavoidable for the lazy entries
 unless we want to cache the whole diff_queued_diff, but for the non-lazy
 entries we should be able to do the strict patch-id incrementally. We
 just need to improve the function interfaces.

The (somewhat hacky) patch below, on top of my previous one, does that
optimization.  But the timings aren't improved by much. My best-of-five
for the second case went down to:

  real0m0.495s
  user0m0.488s
  sys 0m0.004s

However, the actual time to just do git log --raw $revs is:

  real0m0.333s
  user0m0.292s
  sys 0m0.032s

which provides a lower-bound for how well we can do, as it is just doing
a single tree diff for each commit. So I think we have reaped most of
the benefits of this approach already (and we will generally have to do
_some_ true patch-id calculations, so we can never meet that lower
bound).

---
 diff.c  | 11 +++---
 diff.h  |  3 ++-
 patch-ids.c | 39 ++-
 3 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/diff.c b/diff.c
index 3b55788..161c5bf 100644
--- a/diff.c
+++ b/diff.c
@@ -4233,8 +4233,8 @@ static void patch_id_consume(void *priv, char *line, 
unsigned long len)
 }
 
 /* returns 0 upon success, and writes result into sha1 */
-static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1,
-int loose)
+int diff_get_patch_id(struct diff_options *options, unsigned char *sha1,
+ int loose)
 {
struct diff_queue_struct *q = diff_queued_diff;
int i;
@@ -4330,21 +4330,16 @@ int diff_flush_patch_id(struct diff_options *options,
return 0;
 }
 
-int diff_flush_patch_id(struct diff_options *options,
-   unsigned char *sha1,
-   int loose)
+void diff_queue_clear(void)
 {
struct diff_queue_struct *q = diff_queued_diff;
int i;
-   int result = diff_get_patch_id(options, sha1, loose);
 
for (i = 0; i  q-nr; i++)
diff_free_filepair(q-queue[i]);
 
free(q-queue);
DIFF_QUEUE_CLEAR(q);
-
-   return result;
 }
 
 static int is_summary_empty(const struct diff_queue_struct *q)
diff --git a/diff.h b/diff.h
index b018aaf..7207d4b 100644
--- a/diff.h
+++ b/diff.h
@@ -320,7 +320,8 @@ extern int do_diff_cache(const unsigned char *, struct 
diff_options *);
 extern int run_diff_index(struct rev_info *revs, int cached);
 
 extern int do_diff_cache(const unsigned char *, struct diff_options *);
-extern int diff_flush_patch_id(struct diff_options *, unsigned char *, int 
loose);
+extern int diff_get_patch_id(struct diff_options *, unsigned char *, int 
loose);
+extern void diff_queue_clear(void);
 
 extern int diff_result_code(struct diff_options *, int);
 
diff --git a/patch-ids.c b/patch-ids.c
index 3a83ee6..83cda92 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -4,8 +4,7 @@
 #include sha1-lookup.h
 #include patch-ids.h
 
-static int commit_patch_id(struct commit *commit, struct diff_options *options,
-  unsigned char *sha1, int loose)
+static void start_patch_id(struct commit *commit, struct diff_options *options)
 {
if (commit-parents)
diff_tree_sha1(commit-parents-item-object.sha1,
@@ -13,7 +12,6 @@ static int commit_patch_id(struct commit *commit, struct 
diff_options *options,
else
diff_root_tree_sha1(commit-object.sha1, , options);
diffcore_std(options);
-   return diff_flush_patch_id(options, sha1, loose);
 }
 
 int init_patch_ids(struct patch_ids *ids)
@@ -50,12 +48,16 @@ static struct patch_id *add_commit(struct commit *commit,
   int no_add)
 {
struct patch_id *p;
-   unsigned char loose[20], strict[20];
+   unsigned char loose[20], strict[20] = {0};
unsigned long hash;
void **pos;
 
-   if (commit_patch_id(commit, ids-diffopts, loose, 1))
+   start_patch_id(commit, ids-diffopts);
+   if (diff_get_patch_id(ids-diffopts, loose, 1)) {
+   diff_queue_clear();
return NULL;
+   }
+
hash = hash_sha1(loose);
 
p = 

Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-29 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 I think such a loose patch-id could just be a hash of the filenames that
 were changed by the patch (e.g., the first 32-bits of the sha1 of the
 concatenated filenames). Computing that should be about as expensive as
 a tree-diff. Per observation 2 above, if two commits do not have the
 same loose id, we know that they cannot possibly have the same strict
 id.

Because the strict one already hashes the filenames, if files that
are touched by a patch is different from that of another patch, we
judge them being different.

 Then we can forget about the smaller-side and bigger-side entirely, and
 just do something like:

   1. Make a sorted list (or hash table) of loose ids for one side.

   2. For each commit on the other side, calculate its loose id and look
  that up in the sorted list. If no hits, we know that there is no
  match. For any hits, lazily calculate (and cache) the strict patch
  id for both sides and compare as usual.

 In the best case, we compute no patch-ids at all. And even for the
 average case, I'd expect our lazy calculation to only have to compute a
 handful of ids.

Correct.

This has rather interesting ramifications on cherry-pick and rebase,
though.  Both command can handle changes that come from an old tree
before some paths were renamed, but strict patch-id would not spot
equivalent changes we already have in our history if our change
happened after a rename, i.e.

   Z
  /
 O---R---X---Y

where Z updates path F, R moves F to G and X changes G the same way
as Z changes F, and we are trying to cherry-pick Z on top of Y.  The
cherry-pick filter will see different patch-id for Z and X.

We will likely to notice that patch already applied (if using am-3
machinery) or already up-to-date (if using merge machinery) even
when we missed this equivalency and drop the duplicate from the
result, so it is not a big loss, but we might want to consider
removing the filename from patch-id computation, at least for the
ones we internally use and discard for revs-cherry_pick filtering.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-29 Thread Jeff King
On Wed, May 29, 2013 at 11:08:46AM -0700, Junio C Hamano wrote:

 This has rather interesting ramifications on cherry-pick and rebase,
 though.  Both command can handle changes that come from an old tree
 before some paths were renamed, but strict patch-id would not spot
 equivalent changes we already have in our history if our change
 happened after a rename, i.e.
 
Z
   /
  O---R---X---Y
 
 where Z updates path F, R moves F to G and X changes G the same way
 as Z changes F, and we are trying to cherry-pick Z on top of Y.  The
 cherry-pick filter will see different patch-id for Z and X.

True. That is a problem with the current patch-id system, no? Though my
proposal would make it harder to change it in the future (as does
John's).

With mine, I wonder if you could use a different loose definition that
does not take the filenames into account.  Using something like the
changes in file sizes would be unique, but would not properly map to
strict cases (a patch-id that adds 5 lines to the end of a 100-byte file
may match one that adds the same five lines to the end of a 200-byte
file).

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-29 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 On Wed, May 29, 2013 at 11:08:46AM -0700, Junio C Hamano wrote:

 This has rather interesting ramifications on cherry-pick and rebase,
 though.  Both command can handle changes that come from an old tree
 before some paths were renamed, but strict patch-id would not spot
 equivalent changes we already have in our history if our change
 happened after a rename, i.e.
 
Z
   /
  O---R---X---Y
 
 where Z updates path F, R moves F to G and X changes G the same way
 as Z changes F, and we are trying to cherry-pick Z on top of Y.  The
 cherry-pick filter will see different patch-id for Z and X.

 True. That is a problem with the current patch-id system, no?

Correct.  That is why I suggested not to change the external
interface (i.e. what is shown from patch-id command) as people may
have kept them, but wondered if a possible improvement may be to
exclude the name from ids when we internally generate two sets of
Ids and intersect them, i.e. log --cherry.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-20 Thread John Keeping
On Sun, May 19, 2013 at 11:36:23PM -0700, Jonathan Nieder wrote:
 I don't know what it should mean to try to use --cherry without
 --no-merges or --first-parent, so I guess this is harmless.

Currently --no-merges doesn't actually get passed down this far.  We do
the patch ID calculations without taking that into account.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-19 Thread John Keeping
When using git cherry or git log --cherry-pick we often have a small
number of commits on one side and a large number on the other.  In
revision.c::cherry_pick_list we store the patch IDs for the small side
before comparing the large side to this.

In this case, it is likely that only a small number of paths are touched
by the commits on the smaller side of the range and by storing these we
can ignore many commits on the other side before unpacking blobs and
diffing them.

This improves performance in every example I have tried (times are best
of three, using git.git):

Before:
$ time git log --cherry master...jk/submodule-subdirectory-ok /dev/null

real0m0.373s
user0m0.341s
sys 0m0.031s

After:
$ time git log --cherry master...jk/submodule-subdirectory-ok /dev/null

real0m0.060s
user0m0.055s
sys 0m0.005s

Before:
$ time git log --cherry next...pu /dev/null

real0m0.661s
user0m0.617s
sys 0m0.044s

After:
$ time git log --cherry next...pu /dev/null

real0m0.509s
user0m0.478s
sys 0m0.030s

Signed-off-by: John Keeping j...@keeping.me.uk
---
 patch-ids.c | 142 
 patch-ids.h |   3 ++
 2 files changed, 145 insertions(+)

diff --git a/patch-ids.c b/patch-ids.c
index bc8a28f..912f40c 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -1,5 +1,6 @@
 #include cache.h
 #include diff.h
+#include diffcore.h
 #include commit.h
 #include sha1-lookup.h
 #include patch-ids.h
@@ -16,6 +17,137 @@ static int commit_patch_id(struct commit *commit, struct 
diff_options *options,
return diff_flush_patch_id(options, sha1);
 }
 
+struct collect_paths_info {
+   struct string_list *paths;
+   int searching;
+};
+
+static int collect_paths_recursive(int n, struct tree_desc *t,
+   const char *base, struct pathspec *pathspec,
+   struct collect_paths_info *data);
+
+static int same_entry(struct name_entry *a, struct name_entry *b)
+{
+   if (!a-sha1 || !b-sha1)
+   return a-sha1 == b-sha1;
+   return  !hashcmp(a-sha1, b-sha1) 
+   a-mode == b-mode;
+}
+
+static char *traverse_path(const struct traverse_info *info,
+   const struct name_entry *n)
+{
+   char *path = xmalloc(traverse_path_len(info, n) + 1);
+   return make_traverse_path(path, info, n);
+}
+
+static int add_touched_path(struct collect_paths_info *info, const char *path)
+{
+   if (info-searching) {
+   if (!string_list_has_string(info-paths, path))
+   return -1;
+   }
+   string_list_insert(info-paths, path);
+   return 0;
+}
+
+static inline const unsigned char *dir_sha1(struct name_entry *e)
+{
+   if (S_ISDIR(e-mode))
+   return e-sha1;
+   return NULL;
+}
+
+static int collect_touched_paths_cb(int n, unsigned long mask,
+   unsigned long dirmask, struct name_entry *entry,
+   struct traverse_info *info)
+{
+   struct collect_paths_info *collect_info = info-data;
+   if (n == 1) {
+   /* We're handling a root commit - add all the paths. */
+   if (entry[0].sha1  !S_ISDIR(entry[0].mode)) {
+   if (add_touched_path(collect_info, entry[0].path))
+   return -1;
+   } else if (S_ISDIR(entry[0].mode)) {
+   char *newbase = traverse_path(info, entry);
+   struct tree_desc t[1];
+   void *buf0 = fill_tree_descriptor(t, entry[0].sha1);
+   int error = collect_paths_recursive(1, t, newbase,
+   info-pathspec, collect_info);
+   free(buf0);
+   free(newbase);
+   if (error  0)
+   return error;
+   }
+   return mask;
+   }
+
+   if (same_entry(entry+0, entry+1))
+   return mask;
+
+   if (entry[0].mode  !S_ISDIR(entry[0].mode))
+   if (add_touched_path(collect_info, entry[0].path))
+   return -1;
+   if (entry[1].mode  !S_ISDIR(entry[1].mode))
+   if (add_touched_path(collect_info, entry[1].path))
+   return -1;
+
+   if ((entry[0].sha1  S_ISDIR(entry[0].mode)) ||
+   (entry[1].sha1  S_ISDIR(entry[1].mode))) {
+   char *newbase = traverse_path(info,
+   S_ISDIR(entry[0].mode) ? entry+0 : entry+1);
+   struct tree_desc t[2];
+   void *buf0 = fill_tree_descriptor(t+0, dir_sha1(entry+0));
+   void *buf1 = fill_tree_descriptor(t+1, dir_sha1(entry+1));
+   int error = collect_paths_recursive(2, t, newbase,
+   info-pathspec, collect_info);
+   free(buf0);
+   free(buf1);
+   free(newbase);
+   if (error  0)
+   return 

Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-19 Thread Jonathan Nieder
John Keeping wrote:

 In this case, it is likely that only a small number of paths are touched
 by the commits on the smaller side of the range and by storing these we
 can ignore many commits on the other side before unpacking blobs and
 diffing them.

I like this idea a lot.

 --- a/patch-ids.c
 +++ b/patch-ids.c
 @@ -1,5 +1,6 @@
  #include cache.h
  #include diff.h
 +#include diffcore.h
  #include commit.h
  #include sha1-lookup.h
  #include patch-ids.h
 @@ -16,6 +17,137 @@ static int commit_patch_id(struct commit *commit, struct 
 diff_options *options,
   return diff_flush_patch_id(options, sha1);
  }
  
 +struct collect_paths_info {
 + struct string_list *paths;
 + int searching;
 +};
 +
 +static int collect_paths_recursive(int n, struct tree_desc *t,
 + const char *base, struct pathspec *pathspec,
 + struct collect_paths_info *data);

Can you say a little about how this function works (e.g., in a
comment)?  What are the preconditions and postconditions?  How does
the state evolve (e.g, when is searching set)?

 +
 +static int same_entry(struct name_entry *a, struct name_entry *b)
 +{
 + if (!a-sha1 || !b-sha1)
 + return a-sha1 == b-sha1;
 + return  !hashcmp(a-sha1, b-sha1) 
 + a-mode == b-mode;

Style: unusual whitespace (the tab after return).  I'd just do

if (!a-sha1 || ...)
return ...
return !hashcmp(a-sha1, b-sha1)  a-mode == b-mode;

since it is not too long.

[...]
 +static char *traverse_path(const struct traverse_info *info,
 + const struct name_entry *n)
 +{
 + char *path = xmalloc(traverse_path_len(info, n) + 1);
 + return make_traverse_path(path, info, n);
 +}

This function seems to be the same as one of the same name in
builtin/merge-tree.c.  Should it be put somewhere more public, for
example as part of the tree-walk API?  Who is responsible for freeing
path?  Would it make sense to use a strbuf here?

strbuf_setlen(buf, traverse_path_len(info, n));
make_traverse_path(buf.buf, info, n);

 +
 +static int add_touched_path(struct collect_paths_info *info, const char 
 *path)
 +{
 + if (info-searching) {
 + if (!string_list_has_string(info-paths, path))
 + return -1;
 + }
 + string_list_insert(info-paths, path);
 + return 0;
 +}

Same question about internal API: when I see

add_touched_path(info, path)

what should I expect it to do?

In the info-searching case, string_list_insert(info-paths, path)
will always be a no-op, right?  What does it mean that this is adding
a touched path in that case?

[...]
 +static int collect_touched_paths_cb(int n, unsigned long mask,
 + unsigned long dirmask, struct name_entry *entry,
 + struct traverse_info *info)
 +{

Same question about contracts.  Ideally a reader in a rush should be
able to read an opening comment about what the function does and
then return to the caller without delving into the details of how
it does its work.

 + struct collect_paths_info *collect_info = info-data;
 + if (n == 1) {
 + /* We're handling a root commit - add all the paths. */
[...]
 + if ((entry[0].sha1  S_ISDIR(entry[0].mode)) ||
 + (entry[1].sha1  S_ISDIR(entry[1].mode))) {

At this point I'm not sure what two trees are being traversed in
parallel, so it's not obvious to me what the code's about.  Are
they the before and after trees for commits being rebased?

[...]
 +static int collect_touched_paths(struct commit *commit, struct patch_ids 
 *ids,
 + int searching)
 +{
 + struct tree_desc trees[2];
 + struct collect_paths_info info = { ids-touched_paths, searching };
 + void *commitbuf;
 + int result;
 +
 + commitbuf = fill_tree_descriptor(trees + 1, commit-object.sha1);
 + if (commit-parents) {
 + void *parentbuf = fill_tree_descriptor(trees + 0,
 + commit-parents-item-object.sha1);

Looks like yes.

What should happen for commits with more than 1 parent?  If they're
supposed to not enter this codepath because of a previous check, a
die(BUG: ...) could be a good way to make reading easier.

[...]
 @@ -40,6 +172,7 @@ int init_patch_ids(struct patch_ids *ids)
   diff_setup(ids-diffopts);
   DIFF_OPT_SET(ids-diffopts, RECURSIVE);
   diff_setup_done(ids-diffopts);
 + ids-touched_paths.strdup_strings = 1;

Good.

[...]
 @@ -64,6 +199,13 @@ static struct patch_id *add_commit(struct commit *commit,
   unsigned char sha1[20];
   int pos;
  
 + if (no_add) {
 + if (collect_touched_paths(commit, ids, 1)  0)
 + return NULL;

Ah, so this is what the searching does.

The existing no_add argument is very confusing (what does it mean to
add a commit without adding?), but at least the confusion is
self-contained in a small, simple set of functions:

static struct patch_id *add_commit(struct commit 

Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff

2013-05-19 Thread Junio C Hamano
Jonathan Nieder jrnie...@gmail.com writes:

 @@ -64,6 +199,13 @@ static struct patch_id *add_commit(struct commit *commit,
  unsigned char sha1[20];
  int pos;
  
 +if (no_add) {
 +if (collect_touched_paths(commit, ids, 1)  0)
 +return NULL;

 Ah, so this is what the searching does.

 The existing no_add argument is very confusing (what does it mean to
 add a commit without adding?),

Such a mode of operation is usually called dry-run, no?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html