Re: [PATCH v3 1/3] patch-id: make it stable against hunk reordering

2014-03-31 Thread Junio C Hamano
"Michael S. Tsirkin"  writes:

> + memcpy(&ctx, &header_ctx, 
> sizeof ctx);
> + } else {
> + /* Save header ctx for next 
> hunk.  */
> + memcpy(&header_ctx, &ctx, 
> sizeof ctx);

I'll fix these to sizeof(ctx) when queuing.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/3] patch-id: make it stable against hunk reordering

2014-03-31 Thread Junio C Hamano
"Michael S. Tsirkin"  writes:

> though it does look like an unrelated enhancement to me.
> Agree?

Yes, that is exactly why I said "opens interesting opportunity" and
"making it possible" ;-) They are all very related, but they do not
have to graduate as parts of the same series.

The important thing is that the basic infrastructure of the new
codepath is designed in such a way that it later allows us to reuse
it in these changes.

I'm queuing these three patches almost as-is (I think I tweaked the
"sizeof" thing) on 'pu', without fixing the test styles and other
issues I may have pointed out in my reviews myself, expecting
further clean-ups.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/3] patch-id: make it stable against hunk reordering

2014-03-31 Thread Michael S. Tsirkin
On Mon, Mar 31, 2014 at 10:59:50AM -0700, Junio C Hamano wrote:
> "Michael S. Tsirkin"  writes:
> 
> > Patch id changes if users
> > 1. reorder file diffs that make up a patch
> > or
> > 2. split a patch up to multiple diffs that touch the same path
> > (keeping hunks within a single diff ordered to make patch valid).
> >
> > As the result is functionally equivalent, a different patch id is
> > surprising to many users.
> > In particular, reordering files using diff -O is helpful to make patches
> > more readable (e.g. API header diff before implementation diff).
> >
> > Change patch-id behaviour making it stable against these two kinds
> > of patch change:
> > 1. calculate SHA1 hash for each hunk separately and sum all hashes
> > (using a symmetrical sum) to get patch id
> > 2. hash the file-level headers together with each hunk (not just the
> > first hunk)
> >
> > We use a 20byte sum and not xor - since xor would give 0 output
> > for patches that have two identical diffs, which isn't all that
> > unlikely (e.g. append the same line in two places).
> >
> > Add a new flag --unstable to get the historical behaviour.
> >
> > Add --stable which is a nop, for symmetry.
> >
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >
> > changes from v2:
> > several bugfixes
> > changes from v1:
> > hanges from v1: documented motivation for supporting
> > diff splitting (and not just file reordering).
> > No code changes.
> >
> >  builtin/patch-id.c | 72 
> > ++
> >  1 file changed, 56 insertions(+), 16 deletions(-)
> 
> Does this have to interact or be consistent with patch-ids.c which
> is the real patch-id machinery used to filter like-changes out by
> "cherry-pick" and "log --cherry-pick"?

I don't know off-hand.

Specifically, this is diff_flush_patch_id and in diff.c, isn't it?

> This series opens a very interesting opportunity by making it
> possible to introduce the equivalence between two patches that touch
> the same file and a single patch that concatenates hunks from these
> two patches.
> 
> One example I am wondering about is perhaps this could be used to
> detect two branches, both sides with many patches cherry-picked from
> the same source, but some patches squashed together on one branch
> but not on the other.  It would be very nice if you can detect that
> two sets of patches are equivalent taken as a whole in such a
> situation while rebasing one on top of the other.
> 
> Another example is that another mode that gives a set of broken-up
> patch-ids for each hunk contained in the input.  Suppose there is a
> patch that is only meant to be used on the proprietary fork of an
> open source project, and the project releases the open source
> portion by cherry-picking topics from the development tree used for
> the proprietary "trunk".  The integration service of such a project
> used to prepare the open source branch may want to have a
> pre-receive hook that says "do not merge any commit to cause this
> this hunk appear in the result, no matter what other changes the
> patches in the commit may bring in", and broken-down patch-ids
> (e.g. "diff HEAD...$commit | patch-id --individual") may be an
> ingredient to implement such a hook.  There may be interesting
> applications other than such a "broken-down patch-ids" that can be
> based on the enhancement you are presenting here.

OK sure, I can tweak that to use the same algorithm if desired,
though it does look like an unrelated enhancement to me.
Agree?

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/3] patch-id: make it stable against hunk reordering

2014-03-31 Thread Junio C Hamano
"Michael S. Tsirkin"  writes:

> Patch id changes if users
> 1. reorder file diffs that make up a patch
> or
> 2. split a patch up to multiple diffs that touch the same path
> (keeping hunks within a single diff ordered to make patch valid).
>
> As the result is functionally equivalent, a different patch id is
> surprising to many users.
> In particular, reordering files using diff -O is helpful to make patches
> more readable (e.g. API header diff before implementation diff).
>
> Change patch-id behaviour making it stable against these two kinds
> of patch change:
> 1. calculate SHA1 hash for each hunk separately and sum all hashes
> (using a symmetrical sum) to get patch id
> 2. hash the file-level headers together with each hunk (not just the
> first hunk)
>
> We use a 20byte sum and not xor - since xor would give 0 output
> for patches that have two identical diffs, which isn't all that
> unlikely (e.g. append the same line in two places).
>
> Add a new flag --unstable to get the historical behaviour.
>
> Add --stable which is a nop, for symmetry.
>
> Signed-off-by: Michael S. Tsirkin 
> ---
>
> changes from v2:
>   several bugfixes
> changes from v1:
>   hanges from v1: documented motivation for supporting
>   diff splitting (and not just file reordering).
>   No code changes.
>
>  builtin/patch-id.c | 72 
> ++
>  1 file changed, 56 insertions(+), 16 deletions(-)

Does this have to interact or be consistent with patch-ids.c which
is the real patch-id machinery used to filter like-changes out by
"cherry-pick" and "log --cherry-pick"?

This series opens a very interesting opportunity by making it
possible to introduce the equivalence between two patches that touch
the same file and a single patch that concatenates hunks from these
two patches.

One example I am wondering about is perhaps this could be used to
detect two branches, both sides with many patches cherry-picked from
the same source, but some patches squashed together on one branch
but not on the other.  It would be very nice if you can detect that
two sets of patches are equivalent taken as a whole in such a
situation while rebasing one on top of the other.

Another example is that another mode that gives a set of broken-up
patch-ids for each hunk contained in the input.  Suppose there is a
patch that is only meant to be used on the proprietary fork of an
open source project, and the project releases the open source
portion by cherry-picking topics from the development tree used for
the proprietary "trunk".  The integration service of such a project
used to prepare the open source branch may want to have a
pre-receive hook that says "do not merge any commit to cause this
this hunk appear in the result, no matter what other changes the
patches in the commit may bring in", and broken-down patch-ids
(e.g. "diff HEAD...$commit | patch-id --individual") may be an
ingredient to implement such a hook.  There may be interesting
applications other than such a "broken-down patch-ids" that can be
based on the enhancement you are presenting here.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 1/3] patch-id: make it stable against hunk reordering

2014-03-30 Thread Michael S. Tsirkin
Patch id changes if users
1. reorder file diffs that make up a patch
or
2. split a patch up to multiple diffs that touch the same path
(keeping hunks within a single diff ordered to make patch valid).

As the result is functionally equivalent, a different patch id is
surprising to many users.
In particular, reordering files using diff -O is helpful to make patches
more readable (e.g. API header diff before implementation diff).

Change patch-id behaviour making it stable against these two kinds
of patch change:
1. calculate SHA1 hash for each hunk separately and sum all hashes
(using a symmetrical sum) to get patch id
2. hash the file-level headers together with each hunk (not just the
first hunk)

We use a 20byte sum and not xor - since xor would give 0 output
for patches that have two identical diffs, which isn't all that
unlikely (e.g. append the same line in two places).

Add a new flag --unstable to get the historical behaviour.

Add --stable which is a nop, for symmetry.

Signed-off-by: Michael S. Tsirkin 
---

changes from v2:
several bugfixes
changes from v1:
hanges from v1: documented motivation for supporting
diff splitting (and not just file reordering).
No code changes.

 builtin/patch-id.c | 72 ++
 1 file changed, 56 insertions(+), 16 deletions(-)

diff --git a/builtin/patch-id.c b/builtin/patch-id.c
index 3cfe02d..7fd7007 100644
--- a/builtin/patch-id.c
+++ b/builtin/patch-id.c
@@ -1,17 +1,14 @@
 #include "builtin.h"
 
-static void flush_current_id(int patchlen, unsigned char *id, git_SHA_CTX *c)
+static void flush_current_id(int patchlen, unsigned char *id, unsigned char 
*result)
 {
-   unsigned char result[20];
char name[50];
 
if (!patchlen)
return;
 
-   git_SHA1_Final(result, c);
memcpy(name, sha1_to_hex(id), 41);
printf("%s %s\n", sha1_to_hex(result), name);
-   git_SHA1_Init(c);
 }
 
 static int remove_space(char *line)
@@ -56,10 +53,31 @@ static int scan_hunk_header(const char *p, int *p_before, 
int *p_after)
return 1;
 }
 
-static int get_one_patchid(unsigned char *next_sha1, git_SHA_CTX *ctx, struct 
strbuf *line_buf)
+static void flush_one_hunk(unsigned char *result, git_SHA_CTX *ctx)
 {
-   int patchlen = 0, found_next = 0;
+   unsigned char hash[20];
+   unsigned short carry = 0;
+   int i;
+
+   git_SHA1_Final(hash, ctx);
+   git_SHA1_Init(ctx);
+   /* 20-byte sum, with carry */
+   for (i = 0; i < 20; ++i) {
+   carry += result[i] + hash[i];
+   result[i] = carry;
+   carry >>= 8;
+   }
+}
+
+static int get_one_patchid(unsigned char *next_sha1, unsigned char *result,
+  struct strbuf *line_buf, int stable)
+{
+   int patchlen = 0, found_next = 0, hunks = 0;
int before = -1, after = -1;
+   git_SHA_CTX ctx, header_ctx;
+
+   git_SHA1_Init(&ctx);
+   hashclr(result);
 
while (strbuf_getwholeline(line_buf, stdin, '\n') != EOF) {
char *line = line_buf->buf;
@@ -98,7 +116,19 @@ static int get_one_patchid(unsigned char *next_sha1, 
git_SHA_CTX *ctx, struct st
if (before == 0 && after == 0) {
if (!memcmp(line, "@@ -", 4)) {
/* Parse next hunk, but ignore line numbers.  */
+   if (stable) {
+   /* Hash the file-level headers together 
with each hunk. */
+   if (hunks) {
+   flush_one_hunk(result, &ctx);
+   /* Prepend saved header ctx for 
next hunk.  */
+   memcpy(&ctx, &header_ctx, 
sizeof ctx);
+   } else {
+   /* Save header ctx for next 
hunk.  */
+   memcpy(&header_ctx, &ctx, 
sizeof ctx);
+   }
+   }
scan_hunk_header(line, &before, &after);
+   hunks++;
continue;
}
 
@@ -107,7 +137,10 @@ static int get_one_patchid(unsigned char *next_sha1, 
git_SHA_CTX *ctx, struct st
break;
 
/* Else we're parsing another header.  */
+   if (stable && hunks)
+   flush_one_hunk(result, &ctx);
before = after = -1;
+   hunks = 0;
}
 
/* If we get here, we're inside a hunk.  */
@@ -119,39 +152,46 @@ static int get_one_patchid(unsigned char *next_sha1, 
git_SHA_CTX *ctx, struct st
/* Compute the sha without whitespace