Re: [PATCH 3/4] combine-diff: Optimize combine_diff_path sets intersection
On Tue, Jan 28, 2014 at 01:55:09PM -0800, Junio C Hamano wrote: > Kirill Smelkov writes: > > > diff --git a/combine-diff.c b/combine-diff.c > > index 3b92c448..98c2562 100644 > > --- a/combine-diff.c > > +++ b/combine-diff.c > > @@ -15,8 +15,8 @@ > > ... > > + while (1) { > > ... > > + if (cmp < 0) { > > + if (pprev) > > + pprev->next = p->next; > > + ptmp = p; > > + p = p->next; > > + free(ptmp); > > + if (curr == ptmp) > > + curr = p; > > continue; > > ... > > + if (cmp > 0) { > > + i++; > > + continue; > > } > > ... > > + > > + pprev = p; > > + p = p->next; > > + i++; > > } > > return curr; > > } > > Thanks. I very much like the approach. > > I was staring at the above part of the code, but couldn't help > recalling this gem (look for "understand pointers" in the article): > > > http://meta.slashdot.org/story/12/10/11/0030249/linus-torvalds-answers-your-questions > > How about doing it this way (on top of your patch)? It reduces 7 > lines even though it adds two comment lines ;-) > > combine-diff.c | 37 +++-- > 1 file changed, 15 insertions(+), 22 deletions(-) Thanks, this is sound approach and adding guiding comments is good, and also now some of us with self-taught heritage understand (or at least they think so) pointers a bit better :) Now some nitpicks: > diff --git a/combine-diff.c b/combine-diff.c > index 2d79312..0809e79 100644 > --- a/combine-diff.c > +++ b/combine-diff.c > @@ -15,11 +15,10 @@ > static struct combine_diff_path *intersect_paths(struct combine_diff_path > *curr, int n, int num_parent) > { > struct diff_queue_struct *q = &diff_queued_diff; > - struct combine_diff_path *p, *pprev, *ptmp; > + struct combine_diff_path *p, **tail = &curr; > int i, cmp; > > if (!n) { > - struct combine_diff_path *list = NULL, **tail = &list; > for (i = 0; i < q->nr; i++) { > int len; > const char *path; > @@ -43,35 +42,30 @@ static struct combine_diff_path *intersect_paths(struct > combine_diff_path *curr, > *tail = p; > tail = &p->next; > } > - return list; > + return curr; > } > > /* > - * NOTE paths are coming sorted here (= in tree order) > + * paths in curr (linked list) and q->queue[] (array) are > + * both sorted in the tree order. >*/ > - > - pprev = NULL; > - p = curr; > i = 0; > + while ((p = *tail) != NULL) { > + cmp = ((i >= q->nr) > +? -1 : strcmp(p->path, q->queue[i]->two->path)); I liked cmp assignment being the original way - when "-1" is on one line and strcmp is on another - to me it reads better. I'm not insisting on it though. > - while (1) { > - if (!p) > - break; > - > - cmp = (i >= q->nr) ? -1 > -: strcmp(p->path, q->queue[i]->two->path); > if (cmp < 0) { > - if (pprev) > - pprev->next = p->next; > - ptmp = p; > - p = p->next; > - free(ptmp); > - if (curr == ptmp) > - curr = p; > + /* p->path not in q->queue[]; drop it */ > + struct combine_diff_path *next = p->next; > + > + if ((*tail = next) != NULL) > + tail = &next->next; > + free(p); > continue; > } A bug crept in here - if we are removing the first element, i.e. when p=curr, we have to advance curr as well - as we are returning curr back as new intersected paths set list start. That's why there was curr change. Now curr stays the same, and if we'll remove the first element, curr will be pointing to freed memory -> oops. A possible fixup could be: 8< diff --git a/combine-diff.c b/combine-diff.c index 0809e79..6a61a25 100644 --- a/combine-diff.c +++ b/combine-diff.c @@ -60,6 +60,8 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr, if ((*tail = next) != NULL) tail = &next->next; + if (curr == p) + curr = next; free(p); continue; } 8< but this is blind code, as I had not tested it. > > if (cmp > 0) { > + /* q->queue[i] not in p->path; skip it */ > i++; >
Re: [PATCH 3/4] combine-diff: Optimize combine_diff_path sets intersection
Kirill Smelkov writes: > diff --git a/combine-diff.c b/combine-diff.c > index 3b92c448..98c2562 100644 > --- a/combine-diff.c > +++ b/combine-diff.c > @@ -15,8 +15,8 @@ > ... > + while (1) { > ... > + if (cmp < 0) { > + if (pprev) > + pprev->next = p->next; > + ptmp = p; > + p = p->next; > + free(ptmp); > + if (curr == ptmp) > + curr = p; > continue; > ... > + if (cmp > 0) { > + i++; > + continue; > } > ... > + > + pprev = p; > + p = p->next; > + i++; > } > return curr; > } Thanks. I very much like the approach. I was staring at the above part of the code, but couldn't help recalling this gem (look for "understand pointers" in the article): http://meta.slashdot.org/story/12/10/11/0030249/linus-torvalds-answers-your-questions How about doing it this way (on top of your patch)? It reduces 7 lines even though it adds two comment lines ;-) combine-diff.c | 37 +++-- 1 file changed, 15 insertions(+), 22 deletions(-) diff --git a/combine-diff.c b/combine-diff.c index 2d79312..0809e79 100644 --- a/combine-diff.c +++ b/combine-diff.c @@ -15,11 +15,10 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr, int n, int num_parent) { struct diff_queue_struct *q = &diff_queued_diff; - struct combine_diff_path *p, *pprev, *ptmp; + struct combine_diff_path *p, **tail = &curr; int i, cmp; if (!n) { - struct combine_diff_path *list = NULL, **tail = &list; for (i = 0; i < q->nr; i++) { int len; const char *path; @@ -43,35 +42,30 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr, *tail = p; tail = &p->next; } - return list; + return curr; } /* -* NOTE paths are coming sorted here (= in tree order) +* paths in curr (linked list) and q->queue[] (array) are +* both sorted in the tree order. */ - - pprev = NULL; - p = curr; i = 0; + while ((p = *tail) != NULL) { + cmp = ((i >= q->nr) + ? -1 : strcmp(p->path, q->queue[i]->two->path)); - while (1) { - if (!p) - break; - - cmp = (i >= q->nr) ? -1 - : strcmp(p->path, q->queue[i]->two->path); if (cmp < 0) { - if (pprev) - pprev->next = p->next; - ptmp = p; - p = p->next; - free(ptmp); - if (curr == ptmp) - curr = p; + /* p->path not in q->queue[]; drop it */ + struct combine_diff_path *next = p->next; + + if ((*tail = next) != NULL) + tail = &next->next; + free(p); continue; } if (cmp > 0) { + /* q->queue[i] not in p->path; skip it */ i++; continue; } @@ -80,8 +74,7 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr, p->parent[n].mode = q->queue[i]->one->mode; p->parent[n].status = q->queue[i]->status; - pprev = p; - p = p->next; + tail = &p->next; i++; } return curr; -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] combine-diff: Optimize combine_diff_path sets intersection
On Mon, Jan 20, 2014 at 08:20:40PM +0400, Kirill Smelkov wrote: [...] > @@ -1343,6 +1374,26 @@ void diff_tree_combined(const unsigned char *sha1, > if (p->len) > num_paths++; > } > + > + /* order paths according to diffcore_order */ > + if (opt->orderfile && num_paths) { > + struct obj_order *o; > + > + o = xmalloc(sizeof(*o) * num_paths); > + for (i = 0, p = paths; p; p = p->next, i++) > + o[i].obj = p; > + order_objects(opt->orderfile, path_path, o, num_paths); > + for (i = 0; i < num_paths - 1; i++) { > + p = o[i].obj; > + p->next = o[i+1].obj; > + } > + > + p = o[num_paths-1].obj; > + p->next = NULL; > + paths = o[0].obj; > + } I found I've introduced memory leak here (malloc without free). Please apply the fix. Thanks, Kirill. 8< From: Kirill Smelkov Date: Tue, 28 Jan 2014 19:39:16 +0400 Subject: [PATCH] fixup! combine-diff: Optimize combine_diff_path sets intersection Plug a memory leak. --- combine-diff.c | 1 + 1 file changed, 1 insertion(+) diff --git a/combine-diff.c b/combine-diff.c index 07faa96..2d79312 100644 --- a/combine-diff.c +++ b/combine-diff.c @@ -1383,6 +1383,7 @@ void diff_tree_combined(const unsigned char *sha1, p = o[num_paths-1].obj; p->next = NULL; paths = o[0].obj; + free(o); } -- 1.9.rc1.181.g641f458 -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] combine-diff: Optimize combine_diff_path sets intersection
Currently, when generating combined diff, for a commit, we intersect diff paths from diff(parent_0,commit) to diff(parent_i,commit) comparing all paths pairs, i.e. doing it the quadratic way. That is correct, but could be optimized: Paths come from trees in sorted (= tree) order, and so does diff_tree() emits resulting paths in that order too. Now if we look at diffcore transformations, all of them, except diffcore_order, preserve resulting path ordering: - skip_stat_unmatch, grep, pickaxe, filter -- just skip elements -> order stays preserved - break -- just breaks diff for a path, adding path dup after the path -> order stays preserved - detect rename/copy-- resulting paths are emitted sorted (verified empirically) So only diffcore_order changes diff paths ordering. But diffcore_order meaning affects only presentation - i.e. only how to show the diff, so we could do all the internal computations without paths reordering, and order only resultant paths set. This is faster, since, if we know two paths sets are all ordered, their intersection could be done in linear time. This patch does just that. Timings for `git log --raw --no-abbrev --no-renames` without `-c` ("git log") and with `-c` ("git log -c") before and after the patch are as follows: linux.git v3.10..v3.11 log log -c before 1.9s20.4s after 1.9s16.6s navy.git(private repo) log log -c before 0.83s 15.6s after 0.83s2.1s P.S. I think linux.git case is sped up not so much as the second one, since in navy.git, there are more exotic (subtree, etc) merges. P.P.S. My tracing showed that the rest of the time (16.6s vs 1.9s) is usually spent in computing huge diffs from commit to second parent. Will try to deal with it, if I'll have time. P.P.P.S. For combine_diff_path, ->len is not needed anymore - will remove it in the next noisy cleanup path, to maintain good signal/noise ratio here. Signed-off-by: Kirill Smelkov --- combine-diff.c | 93 +- 1 file changed, 72 insertions(+), 21 deletions(-) diff --git a/combine-diff.c b/combine-diff.c index 3b92c448..98c2562 100644 --- a/combine-diff.c +++ b/combine-diff.c @@ -15,8 +15,8 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr, int n, int num_parent) { struct diff_queue_struct *q = &diff_queued_diff; - struct combine_diff_path *p; - int i; + struct combine_diff_path *p, *pprev, *ptmp; + int i, cmp; if (!n) { struct combine_diff_path *list = NULL, **tail = &list; @@ -47,28 +47,43 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr, return list; } - for (p = curr; p; p = p->next) { - int found = 0; - if (!p->len) + /* +* NOTE paths are coming sorted here (= in tree order) +*/ + + pprev = NULL; + p = curr; + i = 0; + + while (1) { + if (!p) + break; + + cmp = (i >= q->nr) ? -1 + : strcmp(p->path, q->queue[i]->two->path); + if (cmp < 0) { + if (pprev) + pprev->next = p->next; + ptmp = p; + p = p->next; + free(ptmp); + if (curr == ptmp) + curr = p; continue; - for (i = 0; i < q->nr; i++) { - const char *path; - int len; + } - if (diff_unmodified_pair(q->queue[i])) - continue; - path = q->queue[i]->two->path; - len = strlen(path); - if (len == p->len && !memcmp(path, p->path, len)) { - found = 1; - hashcpy(p->parent[n].sha1, q->queue[i]->one->sha1); - p->parent[n].mode = q->queue[i]->one->mode; - p->parent[n].status = q->queue[i]->status; - break; - } + if (cmp > 0) { + i++; + continue; } - if (!found) - p->len = 0; + + hashcpy(p->parent[n].sha1, q->queue[i]->one->sha1); + p->parent[n].mode = q->queue[i]->one->mode; + p->parent[n].status = q->queue[i]->status; + + pprev = p; + p = p->next; + i++; } return curr; } @@ -1295,6 +1310,13 @@ sta