Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-13 Thread John Keeping
On Sun, May 12, 2013 at 03:19:49PM -0700, Junio C Hamano wrote:
 John Keeping j...@keeping.me.uk writes:
 
  But it is not a big problem.  Either 3-way merge notices that there
  is nothing new, or you get a conflict and have chance to inspect
  what is going on.
 
  It's not a problem here, but false negatives would be annoying if you're
  looking at git log --cherry-mark.
 
 The primary thing to notice is that it is not a new problem with or
 without the caching layer.  As Linus mentioned how patch-ids are
 computed by ignoring offsets and whitespaces, the filtering is done
 as a crude approximation and false negatives are part of design, so
 making the cache more complex by recording hash of the binary and/or
 options used to compute misses the fundamental.

The caching layer could also introduce false positives though, which is
more serious.  If you cache patch IDs with a pathspec restriction and
then run a command that uses the cache with no such restriction you
could hit a patch that is the same for those paths but also touches
other paths that you don't want to ignore and mark it as patch identical
even though it is not.

Adding a hash of the diffopts fixes that.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-13 Thread Junio C Hamano
John Keeping j...@keeping.me.uk writes:

 The caching layer could also introduce false positives though, which is
 more serious.  If you cache patch IDs with a pathspec restriction ...

What?  What business does patch-id have with pathspec-limited diff
generation?  You do not rebase or cherry-pick with pathspec, so
unless you are populating the patch-id cache at a wrong point (like,
say whenevern git show $commit is run), I am not sure why pathspec
limit becomes even an issue.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-13 Thread John Keeping
On Mon, May 13, 2013 at 06:53:29AM -0700, Junio C Hamano wrote:
 John Keeping j...@keeping.me.uk writes:
 
  The caching layer could also introduce false positives though, which is
  more serious.  If you cache patch IDs with a pathspec restriction ...
 
 What?  What business does patch-id have with pathspec-limited diff
 generation?  You do not rebase or cherry-pick with pathspec, so
 unless you are populating the patch-id cache at a wrong point (like,
 say whenevern git show $commit is run), I am not sure why pathspec
 limit becomes even an issue.

revision.c::cherry_pick_list() sets the pathspec to what was specified
in the revision options.  It's done that since commit 36d56de (Fix
--cherry-pick with given paths, 2007-07-10) and t6007 tests that it
works.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-13 Thread Junio C Hamano
John Keeping j...@keeping.me.uk writes:

 On Mon, May 13, 2013 at 06:53:29AM -0700, Junio C Hamano wrote:
 John Keeping j...@keeping.me.uk writes:
 
  The caching layer could also introduce false positives though, which is
  more serious.  If you cache patch IDs with a pathspec restriction ...
 
 What?  What business does patch-id have with pathspec-limited diff
 generation?  You do not rebase or cherry-pick with pathspec, so
 unless you are populating the patch-id cache at a wrong point (like,
 say whenevern git show $commit is run), I am not sure why pathspec
 limit becomes even an issue.

 revision.c::cherry_pick_list() sets the pathspec to what was specified
 in the revision options.  It's done that since commit 36d56de (Fix
 --cherry-pick with given paths, 2007-07-10) and t6007 tests that it
 works.

Then the caching should be automatically turned off when pathspec is
given.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-13 Thread John Keeping
On Mon, May 13, 2013 at 07:46:09AM -0700, Junio C Hamano wrote:
 John Keeping j...@keeping.me.uk writes:
 
  On Mon, May 13, 2013 at 06:53:29AM -0700, Junio C Hamano wrote:
  John Keeping j...@keeping.me.uk writes:
  
   The caching layer could also introduce false positives though, which is
   more serious.  If you cache patch IDs with a pathspec restriction ...
  
  What?  What business does patch-id have with pathspec-limited diff
  generation?  You do not rebase or cherry-pick with pathspec, so
  unless you are populating the patch-id cache at a wrong point (like,
  say whenevern git show $commit is run), I am not sure why pathspec
  limit becomes even an issue.
 
  revision.c::cherry_pick_list() sets the pathspec to what was specified
  in the revision options.  It's done that since commit 36d56de (Fix
  --cherry-pick with given paths, 2007-07-10) and t6007 tests that it
  works.
 
 Then the caching should be automatically turned off when pathspec is
 given.

That was my first thought, but since we can be affected by other diff
options set in the user's config as well it ended up being simpler to
include it in the options hash and use that.

This has the advantage that you get the benefit of the cache if you run
git log --cherry-mark with the same paths more than once.  In my
testing the cache is beneficial as soon as you examine more than one
similar range (e.g. master...feature-A and then master...feature-B).
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-13 Thread Junio C Hamano
John Keeping j...@keeping.me.uk writes:

 This has the advantage that you get the benefit of the cache if you run
 git log --cherry-mark with the same paths more than once.  In my
 testing the cache is beneficial as soon as you examine more than one
 similar range (e.g. master...feature-A and then master...feature-B).

OK, so perhaps the notes that are keyed with commit ID will record
multiple entries, one for each invocation pattern (i.e. all pathspec
given, possibly with nonstandard options)?

git diff -- t Documentation and git diff -- Docu\* t, even
though they use different pathspec, would produce the same diff;
instead of pathspec you may need to key with the actual list of
paths in the patch, though.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-13 Thread John Keeping
On Mon, May 13, 2013 at 08:45:21AM -0700, Junio C Hamano wrote:
 John Keeping j...@keeping.me.uk writes:
 
  This has the advantage that you get the benefit of the cache if you run
  git log --cherry-mark with the same paths more than once.  In my
  testing the cache is beneficial as soon as you examine more than one
  similar range (e.g. master...feature-A and then master...feature-B).
 
 OK, so perhaps the notes that are keyed with commit ID will record
 multiple entries, one for each invocation pattern (i.e. all pathspec
 given, possibly with nonstandard options)?

That would be possible, but I didn't do it in the current version of the
patch.

 git diff -- t Documentation and git diff -- Docu\* t, even
 though they use different pathspec, would produce the same diff;
 instead of pathspec you may need to key with the actual list of
 paths in the patch, though.

Maybe, but I think that would be overkill.

I'm interested to see how much of a benefit we could get by not
calculating the patch ID of any commits on the larger side of a
symmetric range that touch paths outside the set touched by the smaller
side.  (revision.c::cherry_pick_list() remembers patch IDs for the
smaller side of a symmetric range and then checks if anything on the
larger side matches so this fits in naturally.)

In my usage I generally compare a relatively small topic branch against
whatever has happened in some upstream branch so I think this could give
a big improvement but I haven't had time to try it yet.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-12 Thread John Keeping
On Sat, May 11, 2013 at 08:00:44PM -0700, Junio C Hamano wrote:
 Linus Torvalds torva...@linux-foundation.org writes:
 
  On Sat, May 11, 2013 at 2:49 PM, John Keeping j...@keeping.me.uk wrote:
 
  Hmm... I hadn't realised that.  Looking a bit closer, it looks like
  init_patch_ids sets up its own diffopts so its not affected by the
  command line (except for pathspecs which would be easy to check for).
  Of course that still means it can be affected by settings in the user's
  configuration.
 
  .. and in the actual diff algorithm.
 
 As to the objection side of the argument, I already said
 essentially the same thing several months ago:
 
   http://thread.gmane.org/gmane.comp.version-control.git/202654/focus=202898
 
 and do not have much to add [*1*].
 
 However.
 
 The use of patch-id in cherry and rebase is to facilitate avoiding
 to replay commits that are obviously identical to the ones you have
 in your history.  The cached patch id for an existing old commit may
 differ from a patch id you freshly compute for a new commit you are
 trying to see if it truly new, even though they may represent the
 same change.  So we may incorrectly think such a new commit is not
 yet in your history and attempt to replay it.
 
 But it is not a big problem.  Either 3-way merge notices that there
 is nothing new, or you get a conflict and have chance to inspect
 what is going on.

It's not a problem here, but false negatives would be annoying if you're
looking at git log --cherry-mark.

 A conceptually much larger and more problematic issue is that we may
 discard a truly new change that you still need as an old one you
 already have due to a hash collision and discard it.  Because the
 hash space of SHA-1 is so large, however, it is not a problem in
 practice, and more importantly, that hash space is just as large as
 the hash space used by Git to reduce a patch to a patch id, the
 filtering done with patch-id in cherry and rebase _already_ have
 that exact problem with or without this additional cache layer. A
 stale cache may make the possibility of lost change due to such a
 hash collision merely twice as likely.
 
  ... it's a the patch ID actually ignores a lot of data in order
  to give the same ID even if lins have been added above it, and the
  patch is at different line numbers etc.
 
 Yes.
 
  So maybe it doesn't matter. But at the same time, I really think
  caching patch ID's should be something people should be aware of is
  fundamentally wrong, even if it might work.
 
 I do not think it is caching patch ID that people should be aware
 of is fundamentally wrong.  What is fundamentally wrong, even if it
 might work, is using patch ID itself.
 
  And quite frankly, if you do rebases etc so much that you think patch
  ID's are so important that they need to be cached, you may be doing
  odd/wrong things.
 
 And that, too ;-)

I've never noticed a problem with rebases, it's when I use git log
--cherry master... to see if patches I've sent to a mailing list have
been picked up.

To take Git as an example (albeit a bad one because What's Cooking is
a more useful way to track patch state here), if I compare this patch to
pu I have:

$ git rev-list --left-right --count pu...
234 1

and caching patch IDs takes that from ~0.6s to ~0.1s.  When doing that
over several branches consecutively that makes a big difference to the
overall runtime, especially because most of the commits of interest will
be cached during the first one.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-12 Thread John Keeping
On Sat, May 11, 2013 at 06:57:02PM -0500, Johannes Schindelin wrote:
 On Sat, 11 May 2013, Linus Torvalds wrote:
 
  [...] I really think caching patch ID's should be something people
  should be aware of is fundamentally wrong, even if it might work.
 
 Given the incredible performance win, I would say it is still worth
 looking into.
 
 If you store also a hash of Git version and diff options (may even be the
 hash of the raw bytes of the diff options if you do not plan to share the
 ref between machines) with the patch ID, you can make it safe.
 
 That hash would be generated at patch_id init time and
 load_cached_patch_id() would check this hash in addition to the return
 value of get_sha1() (and ignore the note if the version/diff options
 differ).

I was thinking about this overnight, glad to see someone else had the
same idea :-)

It's slightly annoying because the diff options can be customized after
we return from init_patch_ids() so we either need a new
setup_patch_ids() function to be run after init once diff options have
been set or to set it lazily.  I'll try introducing a setup function.

 If you are following git.git slavishly, maybe hashing just the major/minor
 Git version would be in order to avoid frequent regeneration of identical
 patch IDs.

I think just storing the version is quite good here, and it avoids pain
when a topic that affects patch IDs is working its way through pu and
next.

  And quite frankly, if you do rebases etc so much that you think patch
  ID's are so important that they need to be cached, you may be doing
  odd/wrong things.
 
 AFAICT John actually gave a very valid scenario that validates his use
 case: git-gui patches are best tested in the git.git scenario but have to
 be contributed via git-gui.git. It's not John's fault that this typically
 requires a lot of rebasing between vastly divergent histories.

Actually, I don't think that use case is valid.  Because it's a subtree
merge I can be absolutely certain that nothing on the LHS of
master...git-gui/master is patch identical to anything on the RHS since
all the paths are different.  So doing git log --cherry-mark in that
case is completely useless.  I think my script should be able to learn
that, which gets rid of the really horrible case I was seeing, but it
would be nice to improve the fast enough cases as well if it can be
done without too much effort.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-12 Thread Junio C Hamano
John Keeping j...@keeping.me.uk writes:

 But it is not a big problem.  Either 3-way merge notices that there
 is nothing new, or you get a conflict and have chance to inspect
 what is going on.

 It's not a problem here, but false negatives would be annoying if you're
 looking at git log --cherry-mark.

The primary thing to notice is that it is not a new problem with or
without the caching layer.  As Linus mentioned how patch-ids are
computed by ignoring offsets and whitespaces, the filtering is done
as a crude approximation and false negatives are part of design, so
making the cache more complex by recording hash of the binary and/or
options used to compute misses the fundamental.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-11 Thread John Keeping
This adds a new configuration variable patchid.cacheRef which controls
whether (and where) patch IDs will be cached in a notes tree.

Caching patch IDs in this way results in a performance improvement in
every case I have tried, for example when comparing the git-gui tree to
git.git (where git-gui has been merged into git.git but most git.git
commits do not appear in git-gui):

Without patch ID caching:
$ time git log --cherry master...git-gui/master /dev/null
real0m32.981s
user0m32.364s
sys 0m0.621s

Prime the cache:
$ time git -c patchid.cacheref=patchids log --cherry \
master...git-gui/master /dev/null
real0m33.860s
user0m32.832s
sys 0m0.986s

With all patch IDs cached:
$ time git -c patchid.cacheref=patchids log --cherry \
master...git-gui/master /dev/null
real0m1.041s
user0m0.679s
sys 0m0.363s

Signed-off-by: John Keeping j...@keeping.me.uk
---
This is another approach to fixing the log --cherry takes a long time
issue I encountered comparing commits built on git-gui to those in
git.git [1][2].

I think this is a useful feature even outside that use case.  I measured
a small improvement (when the cache is primed) even comparing two
branches with 1 and 2 different commits respectively.

[1] http://article.gmane.org/gmane.comp.version-control.git/223959
[2] http://article.gmane.org/gmane.comp.version-control.git/223956

 Documentation/config.txt |  7 +++
 patch-ids.c  | 91 +++-
 patch-ids.h  |  1 +
 t/t6007-rev-list-cherry-pick-file.sh | 16 +++
 4 files changed, 114 insertions(+), 1 deletion(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 6e53fc5..e36585c 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1798,6 +1798,13 @@ pager.cmd::
precedence over this option.  To disable pagination for all
commands, set `core.pager` or `GIT_PAGER` to `cat`.
 
+patchid.cacheRef::
+   The name of a notes ref in which to store patch IDs for commits.
+   The ref is taken to be in `refs/notes/` if it is not qualified.
++
+Note that the patch ID notes are never pruned automatically, so you may
+wish to periodically run `git notes --ref ref prune` against this ref.
+
 pretty.name::
Alias for a --pretty= format string, as specified in
linkgit:git-log[1]. Any aliases defined here can be used just
diff --git a/patch-ids.c b/patch-ids.c
index bc8a28f..cb05eec 100644
--- a/patch-ids.c
+++ b/patch-ids.c
@@ -1,6 +1,9 @@
 #include cache.h
+#include blob.h
 #include diff.h
 #include commit.h
+#include notes.h
+#include refs.h
 #include sha1-lookup.h
 #include patch-ids.h
 
@@ -34,12 +37,38 @@ struct patch_id_bucket {
struct patch_id bucket[BUCKET_SIZE];
 };
 
+static int patch_id_config(const char *var, const char *value, void *cb)
+{
+   const char **cacheref = cb;
+
+   if (!strcmp(var, patchid.cacheref))
+   return git_config_string(cacheref, var, value);
+
+   return 0;
+}
+
 int init_patch_ids(struct patch_ids *ids)
 {
+   char *cacheref = NULL;
+
memset(ids, 0, sizeof(*ids));
diff_setup(ids-diffopts);
DIFF_OPT_SET(ids-diffopts, RECURSIVE);
diff_setup_done(ids-diffopts);
+
+   git_config(patch_id_config, cacheref);
+   if (cacheref) {
+   struct strbuf sb = STRBUF_INIT;
+   strbuf_addstr(sb, cacheref);
+   expand_notes_ref(sb);
+
+   ids-cache = xcalloc(1, sizeof(*ids-cache));
+   init_notes(ids-cache, sb.buf, combine_notes_overwrite, 0);
+
+   strbuf_release(sb);
+   free(cacheref);
+   }
+
return 0;
 }
 
@@ -52,9 +81,67 @@ int free_patch_ids(struct patch_ids *ids)
next = patches-next;
free(patches);
}
+
+   if (ids-cache) {
+   unsigned char notes_sha1[20];
+   if (write_notes_tree(ids-cache, notes_sha1) ||
+   update_ref(patch-id: update cache, ids-cache-ref,
+   notes_sha1, NULL, 0, QUIET_ON_ERR))
+   error(_(failed to write patch ID cache));
+
+   free_notes(ids-cache);
+   ids-cache = NULL;
+   }
+
return 0;
 }
 
+static int load_cached_patch_id(struct commit *commit,
+   struct notes_tree *cache, unsigned char *sha1)
+{
+   const unsigned char *note_sha1;
+   char *note;
+   enum object_type type;
+   unsigned long notelen;
+   int result = 1;
+
+   if (!cache)
+   return 1;
+
+   note_sha1 = get_note(cache, commit-object.sha1);
+   if (!note_sha1)
+   return 1;
+
+   if (!(note = read_sha1_file(note_sha1, type, notelen)) || !notelen ||
+   type != OBJ_BLOB)
+   goto out;
+
+   if (get_sha1_hex(note, sha1))
+   goto out;
+
+   

Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-11 Thread Linus Torvalds
On Sat, May 11, 2013 at 12:54 PM, John Keeping j...@keeping.me.uk wrote:
 This adds a new configuration variable patchid.cacheRef which controls
 whether (and where) patch IDs will be cached in a notes tree.

Patch ID's aren't stable wrt different diff options, so I think this
is potentially a very bad idea.

   Linus
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-11 Thread John Keeping
On Sat, May 11, 2013 at 02:10:01PM -0700, Linus Torvalds wrote:
 On Sat, May 11, 2013 at 12:54 PM, John Keeping j...@keeping.me.uk wrote:
  This adds a new configuration variable patchid.cacheRef which controls
  whether (and where) patch IDs will be cached in a notes tree.
 
 Patch ID's aren't stable wrt different diff options, so I think this
 is potentially a very bad idea.

Hmm... I hadn't realised that.  Looking a bit closer, it looks like
init_patch_ids sets up its own diffopts so its not affected by the
command line (except for pathspecs which would be easy to check for).
Of course that still means it can be affected by settings in the user's
configuration.

It's a pity that this can't be done since it gives a significant
performance improvement for some tasks that I perform relatively
frequently.  Is there a reason patch IDs couldn't be generated using
fixed diff options?  Since there's no way to control it from the command
line it seems surprising that the results of log --cherry might be
different based on what's in your config.

That could go either way I suppose - is it useful to be able to change
patch IDs based on command line arguments or is it wrong that as we add
persistent diff settings to the configuration we've been silently
changing the behaviour of git patch-id and git cherry.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-11 Thread Linus Torvalds
On Sat, May 11, 2013 at 2:49 PM, John Keeping j...@keeping.me.uk wrote:

 Hmm... I hadn't realised that.  Looking a bit closer, it looks like
 init_patch_ids sets up its own diffopts so its not affected by the
 command line (except for pathspecs which would be easy to check for).
 Of course that still means it can be affected by settings in the user's
 configuration.

.. and in the actual diff algorithm.

The thing is - patches ARE NOT STABLE. There are many valid ways to
get a patch from one version to another, and even without command line
changes, we've had different versions of git generating different
patches. There are heuristics in xdiff to avoid some nasty use up
tons of CPU-time things that have been tweaked over time. And even
for simple cases there are ambiguous ways to describe the patch.

Now, maybe we don't care, because in practice, most of the time this
just doesn't much matter. And anybody who uses patch-ID's had better
not rely on them being guaranteed to be unique anyway, so it's more of
a if the patch ID is the same, it's almost guaranteed to be the same
patch, but that's a big almost. And no, it's not some theoretical
SHA1 collisions are very unlikely kind of thing, it's a the patch
ID actually ignores a lot of data in order to give the same ID even if
lins have been added above it, and the patch is at different line
numbers etc.

So maybe it doesn't matter. But at the same time, I really think
caching patch ID's should be something people should be aware of is
fundamentally wrong, even if it might work.

And quite frankly, if you do rebases etc so much that you think patch
ID's are so important that they need to be cached, you may be doing
odd/wrong things.

 Linus
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-11 Thread Johannes Schindelin
Hi John  Linus,

On Sat, 11 May 2013, Linus Torvalds wrote:

 [...] I really think caching patch ID's should be something people
 should be aware of is fundamentally wrong, even if it might work.

Given the incredible performance win, I would say it is still worth
looking into.

If you store also a hash of Git version and diff options (may even be the
hash of the raw bytes of the diff options if you do not plan to share the
ref between machines) with the patch ID, you can make it safe.

That hash would be generated at patch_id init time and
load_cached_patch_id() would check this hash in addition to the return
value of get_sha1() (and ignore the note if the version/diff options
differ).

If you are following git.git slavishly, maybe hashing just the major/minor
Git version would be in order to avoid frequent regeneration of identical
patch IDs.

 And quite frankly, if you do rebases etc so much that you think patch
 ID's are so important that they need to be cached, you may be doing
 odd/wrong things.

AFAICT John actually gave a very valid scenario that validates his use
case: git-gui patches are best tested in the git.git scenario but have to
be contributed via git-gui.git. It's not John's fault that this typically
requires a lot of rebasing between vastly divergent histories.

John, do you think you can incorporate that finger-print of the patch ID
generation and store it in the same note?

Ciao,
Johannes
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] patch-ids.c: cache patch IDs in a notes tree

2013-05-11 Thread Junio C Hamano
Linus Torvalds torva...@linux-foundation.org writes:

 On Sat, May 11, 2013 at 2:49 PM, John Keeping j...@keeping.me.uk wrote:

 Hmm... I hadn't realised that.  Looking a bit closer, it looks like
 init_patch_ids sets up its own diffopts so its not affected by the
 command line (except for pathspecs which would be easy to check for).
 Of course that still means it can be affected by settings in the user's
 configuration.

 .. and in the actual diff algorithm.

As to the objection side of the argument, I already said
essentially the same thing several months ago:

  http://thread.gmane.org/gmane.comp.version-control.git/202654/focus=202898

and do not have much to add [*1*].

However.

The use of patch-id in cherry and rebase is to facilitate avoiding
to replay commits that are obviously identical to the ones you have
in your history.  The cached patch id for an existing old commit may
differ from a patch id you freshly compute for a new commit you are
trying to see if it truly new, even though they may represent the
same change.  So we may incorrectly think such a new commit is not
yet in your history and attempt to replay it.

But it is not a big problem.  Either 3-way merge notices that there
is nothing new, or you get a conflict and have chance to inspect
what is going on.

A conceptually much larger and more problematic issue is that we may
discard a truly new change that you still need as an old one you
already have due to a hash collision and discard it.  Because the
hash space of SHA-1 is so large, however, it is not a problem in
practice, and more importantly, that hash space is just as large as
the hash space used by Git to reduce a patch to a patch id, the
filtering done with patch-id in cherry and rebase _already_ have
that exact problem with or without this additional cache layer. A
stale cache may make the possibility of lost change due to such a
hash collision merely twice as likely.

 ... it's a the patch ID actually ignores a lot of data in order
 to give the same ID even if lins have been added above it, and the
 patch is at different line numbers etc.

Yes.

 So maybe it doesn't matter. But at the same time, I really think
 caching patch ID's should be something people should be aware of is
 fundamentally wrong, even if it might work.

I do not think it is caching patch ID that people should be aware
of is fundamentally wrong.  What is fundamentally wrong, even if it
might work, is using patch ID itself.

 And quite frankly, if you do rebases etc so much that you think patch
 ID's are so important that they need to be cached, you may be doing
 odd/wrong things.

And that, too ;-)


[Footnote]

*1* For people listening from the sidelines, the fact that Git
algorithm can improve over time is a real issue, and and has caused
one issue that still hasn't been solved in the k.org upload process.
Somebody who has a repository there could *theoretically*:

 - push her v1.1 release via Git (git push origin v1.1);
 - create a tarball (git archive -o v1.1.tar v1.1) and diff since the
   last release (git diff v1.0 v1.1 v1.0-v1.1.diff) locally;
 - GPG sign them (gpg -b v1.1.tar, gpg -b v1.0-v1.1.diff); and
 - upload only the signature files

and have k.org create the tarballs and diff to save bandwidth of
uploading logically derivable stuff over and over again.  But that
can be done only when output from git archive and git diff are
stable, which is not the case.  We changed how extended header fields
are used in the tar archive for a long pathname recently, and also
we changed use of XDF_NEED_MINIMAL a couple of years ago in git diff;
both of these affect the output.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html