Re: [PATCH] revision: quit pruning diff more quickly when possible

2017-10-13 Thread Junio C Hamano
Jeff King  writes:

> Here it is cleaned up and with a commit message. There's another case
> that can be optimized, too: --remove-empty with an all-deletions commit.
> That's probably even more obscure and pathological, but it was easy to
> cover in the same breath.

This one looks good.  It appears that again you guys had all the fun
while I was offline ;-).  And I am happy to see that we didn't veer
in the direction to optimize for a wrong case by keeping track of
what trees we already saw and things like that, of course.

Thanks.

> Subject: revision: quit pruning diff more quickly when possible
>
> When the revision traversal machinery is given a pathspec,
> we must compute the parent-diff for each commit to determine
> which ones are TREESAME. We set the QUICK diff flag to avoid
> looking at more entries than we need; we really just care
> whether there are any changes at all.
>
> But there is one case where we want to know a bit more: if
> --remove-empty is set, we care about finding cases where the
> change consists only of added entries (in which case we may
> prune the parent in try_to_simplify_commit()). To cover that
> case, our file_add_remove() callback does not quit the diff
> upon seeing an added entry; it keeps looking for other types
> of entries.
>
> But this means when --remove-empty is not set (and it is not
> by default), we compute more of the diff than is necessary.
> You can see this in a pathological case where a commit adds
> a very large number of entries, and we limit based on a
> broad pathspec. E.g.:
>
>   perl -e '
> chomp(my $blob = `git hash-object -w --stdin  for my $a (1..1000) {
>   for my $b (1..1000) {
> print "100644 $blob\t$a/$b\n";
>   }
> }
>   ' | git update-index --index-info
>   git commit -qm add
>
>   git rev-list HEAD -- .
>
> This case takes about 100ms now, but after this patch only
> needs 6ms. That's not a huge improvement, but it's easy to
> get and it protects us against even more pathological cases
> (e.g., going from 1 million to 10 million files would take
> ten times as long with the current code, but not increase at
> all after this patch).
>
> This is reported to minorly speed-up pathspec limiting in
> real world repositories (like the 100-million-file Windows
> repository), but probably won't make a noticeable difference
> outside of pathological setups.
>
> This patch actually covers the case without --remove-empty,
> and the case where we see only deletions. See the in-code
> comment for details.
>
> Note that we have to add a new member to the diff_options
> struct so that our callback can see the value of
> revs->remove_empty_trees. This callback parameter could be
> passed to the "add_remove" and "change" callbacks, but
> there's not much point. They already receive the
> diff_options struct, and doing it this way avoids having to
> update the function signature of the other callbacks
> (arguably the format_callback and output_prefix functions
> could benefit from the same simplification).
>
> Signed-off-by: Jeff King 
> ---
>  diff.h |  1 +
>  revision.c | 16 +---
>  2 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/diff.h b/diff.h
> index 7dcfcfbef7..4a34d256f1 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -180,6 +180,7 @@ struct diff_options {
>   pathchange_fn_t pathchange;
>   change_fn_t change;
>   add_remove_fn_t add_remove;
> + void *change_fn_data;
>   diff_format_fn_t format_callback;
>   void *format_callback_data;
>   diff_prefix_fn_t output_prefix;
> diff --git a/revision.c b/revision.c
> index 8fd222f3bf..a3f245e2cc 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -399,8 +399,16 @@ static struct commit *one_relevant_parent(const struct 
> rev_info *revs,
>   * if the whole diff is removal of old data, and otherwise
>   * REV_TREE_DIFFERENT (of course if the trees are the same we
>   * want REV_TREE_SAME).
> - * That means that once we get to REV_TREE_DIFFERENT, we do not
> - * have to look any further.
> + *
> + * The only time we care about the distinction is when
> + * remove_empty_trees is in effect, in which case we care only about
> + * whether the whole change is REV_TREE_NEW, or if there's another type
> + * of change. Which means we can stop the diff early in either of these
> + * cases:
> + *
> + *   1. We're not using remove_empty_trees at all.
> + *
> + *   2. We saw anything except REV_TREE_NEW.
>   */
>  static int tree_difference = REV_TREE_SAME;
>  
> @@ -411,9 +419,10 @@ static void file_add_remove(struct diff_options *options,
>   const char *fullpath, unsigned dirty_submodule)
>  {
>   int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
> + struct rev_info *revs = options->change_fn_data;
>  
>   tree_difference |= diff;
> - if (tree_difference == REV_TREE_DIFFERENT)
> + if (!revs->remove_empty_trees || tree_difference != REV_TREE_NEW)
>   

Re: [PATCH] revision: quit pruning diff more quickly when possible

2017-10-13 Thread Jeff King
On Fri, Oct 13, 2017 at 11:37:50AM -0400, Derrick Stolee wrote:

> Thanks, Peff. This patch looks good to me.
> 
> I tried a few other things like adding a flag DIFF_OPT_HAS_ANY_CHANGE next
> to DIFF_OPT_HAS_CHANGES that we could check in diff_can_quit_early() but it
> had side-effects that broke existing tests. From this exploration, it does
> seem necessary to be aware of 'remove_empty_trees'.

Keep in mind that the regular diff_change callbacks already handle this
case[1].

The file_change callbacks are specific to the revision machinery's
pruning diff, and intentionally hold back the HAS_CHANGES flag.

-Peff

[1] I tried "git diff-tree --root -r --quiet 45546f17e" on the bomb
repo, and it went quickly. Dropping --quiet makes it take a really
long time.


Re: [PATCH] revision: quit pruning diff more quickly when possible

2017-10-13 Thread Derrick Stolee

On 10/13/2017 11:27 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 10:26:46AM -0400, Jeff King wrote:


On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote:


This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
causing diff_can_quit_early() to return false. Due to the corner-case of the
bug it seems it will not be a huge performance improvement in most cases.
Still worth fixing and I'm looking at your suggestions to try and learn this
area better.

Yeah, I just timed some pathspec limits on linux.git, and it makes at
best a fraction of a percent improvement (but any improvement is well
within run-to-run noise). Which is not surprising.

I agree it's worth fixing, though.

Here it is cleaned up and with a commit message. There's another case
that can be optimized, too: --remove-empty with an all-deletions commit.
That's probably even more obscure and pathological, but it was easy to
cover in the same breath.

I didn't bother making a perf script, since this really isn't indicative
of real-world performance. If we wanted to do perf regression tests
here, I think the best path forward would be:

   1. Make sure there the perf tests cover pathspecs (maybe in p0001?).

   2. Make it easy to run the whole perf suite against a "bomb" repo.
  This surely isn't the only slow thing of interest.

-- >8 --
Subject: revision: quit pruning diff more quickly when possible

When the revision traversal machinery is given a pathspec,
we must compute the parent-diff for each commit to determine
which ones are TREESAME. We set the QUICK diff flag to avoid
looking at more entries than we need; we really just care
whether there are any changes at all.

But there is one case where we want to know a bit more: if
--remove-empty is set, we care about finding cases where the
change consists only of added entries (in which case we may
prune the parent in try_to_simplify_commit()). To cover that
case, our file_add_remove() callback does not quit the diff
upon seeing an added entry; it keeps looking for other types
of entries.

But this means when --remove-empty is not set (and it is not
by default), we compute more of the diff than is necessary.
You can see this in a pathological case where a commit adds
a very large number of entries, and we limit based on a
broad pathspec. E.g.:

   perl -e '
 chomp(my $blob = `git hash-object -w --stdin remove_empty_trees. This callback parameter could be
passed to the "add_remove" and "change" callbacks, but
there's not much point. They already receive the
diff_options struct, and doing it this way avoids having to
update the function signature of the other callbacks
(arguably the format_callback and output_prefix functions
could benefit from the same simplification).

Signed-off-by: Jeff King 
---
  diff.h |  1 +
  revision.c | 16 +---
  2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/diff.h b/diff.h
index 7dcfcfbef7..4a34d256f1 100644
--- a/diff.h
+++ b/diff.h
@@ -180,6 +180,7 @@ struct diff_options {
pathchange_fn_t pathchange;
change_fn_t change;
add_remove_fn_t add_remove;
+   void *change_fn_data;
diff_format_fn_t format_callback;
void *format_callback_data;
diff_prefix_fn_t output_prefix;
diff --git a/revision.c b/revision.c
index 8fd222f3bf..a3f245e2cc 100644
--- a/revision.c
+++ b/revision.c
@@ -399,8 +399,16 @@ static struct commit *one_relevant_parent(const struct 
rev_info *revs,
   * if the whole diff is removal of old data, and otherwise
   * REV_TREE_DIFFERENT (of course if the trees are the same we
   * want REV_TREE_SAME).
- * That means that once we get to REV_TREE_DIFFERENT, we do not
- * have to look any further.
+ *
+ * The only time we care about the distinction is when
+ * remove_empty_trees is in effect, in which case we care only about
+ * whether the whole change is REV_TREE_NEW, or if there's another type
+ * of change. Which means we can stop the diff early in either of these
+ * cases:
+ *
+ *   1. We're not using remove_empty_trees at all.
+ *
+ *   2. We saw anything except REV_TREE_NEW.
   */
  static int tree_difference = REV_TREE_SAME;
  
@@ -411,9 +419,10 @@ static void file_add_remove(struct diff_options *options,

const char *fullpath, unsigned dirty_submodule)
  {
int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
+   struct rev_info *revs = options->change_fn_data;
  
  	tree_difference |= diff;

-   if (tree_difference == REV_TREE_DIFFERENT)
+   if (!revs->remove_empty_trees || tree_difference != REV_TREE_NEW)
DIFF_OPT_SET(options, HAS_CHANGES);
  }
  
@@ -1351,6 +1360,7 @@ void init_revisions(struct rev_info *revs, const char *prefix)

DIFF_OPT_SET(>pruning, QUICK);
revs->pruning.add_remove = file_add_remove;
revs->pruning.change = file_change;
+   revs->pruning.change_fn_data = revs;
revs->sort_order =