[PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch
The commit-graph feature is not useful to end users until the commit-graph file is maintained automatically by Git during normal upkeep operations. One natural place to trigger this write is during 'git gc'. Before automatically generating a commit-graph, we need to be able to verify the contents of a commit-graph file. Integrate commit-graph checks into 'fsck' that check the commit-graph contents against commits in the object database. Thanks, Jakub, for the feedback on the RFC. I think there are still some things to decide at a high-level before we dig too far into the review. Specifically, the integration points in 'fsck', 'gc', and 'fetch' are worth considering our alternatives. For 'fsck', the current integration is minimal: if core.commitGraph is true, then 'git fsck' calls 'git commit-graph verify --object-dir=X' for the objects directory and every alternate. There are a few options to consider here: 1. Keep this behavior: we should always check the commit-graph if it exists. 2. Add a --[no-]commit-graph argument to 'fsck' that toggles the commit-graph verification. 3. Remove all direct integration between 'fsck' and 'commit-graph' and instead rely on users checking 'git commit-graph verify' manually when they suspect a problem with the commit-graph file. While this option is worth considering, it is my least favorite since it requires more from users. For 'gc' and 'fetch', these seemed like natural places to update the commit-graph file. Relative to the other maintenance that occurs during these commands, the 'git commit-graph write' command is fast, especially for incremental updates; only the "new" commits are walked when computing generation numbers and other metadata for the commit-graph file. The behavior in this patch series does the following: 1. Near the end of 'git gc', run 'git commit-graph write'. The location of this code assumes that a 'git gc --auto' has not terminated early due to not meeting the auto threshold. 2. At the end of 'git fetch', run 'git commit-graph write'. This means that every reachable commit will be in the commit-graph after a a successful fetch, which seems a reasonable frequency. Then, the only times we would be missing a reachable commit is after creating one locally. There is a problem with the current patch, though: every 'git fetch' call runs 'git commit-graph write', even if there were no ref updates or objects downloaded. Is there a simple way to detect if the fetch was non-trivial? One obvious problem with this approach: if we compute this during 'gc' AND 'fetch', there will be times where a 'fetch' calls 'gc' and triggers two commit-graph writes. If I were to abandon one of these patches, it would be the 'fetch' integration. A 'git gc' really wants to delete all references to unreachable commits, and without updating the commit-graph we may still have commit data in the commit-graph file that is not in the object database. In fact, deleting commits from the object database but not from the commit-graph will cause 'git commit-graph verify' to fail! I welcome discussion on these ideas, as we are venturing out of the "pure data structure" world and into the "user experience" world. I am less confident in my skills in this world, but the feature is worthless if it does not improve the user experience. Thanks, -Stolee Derrick Stolee (12): Commits 01-07 focus on the 'git commit-graph verify' subcommand. These are ready for full, rigorous review. commit-graph: add 'verify' subcommand commit-graph: verify file header information commit-graph: parse commit from chosen graph commit-graph: verify fanout and lookup table commit: force commit to parse from object database commit-graph: load a root tree from specific graph commit-graph: verify commit contents against odb Commit 08 integrates 'git commit-graph verify' into fsck. The work here is the minimum integration possible. (See above discussion of options.) fsck: verify commit-graph Commit 09 introduces a new '--reachable' option only to make the calls from 'gc' and 'fetch' simpler. Commits 10-11 integrate writing the commit-graph into 'gc' and 'fetch', respectively. (See above disucssion.) commit-graph: add '--reachable' option gc: automatically write commit-graph files fetch: compute commit-graph by default Commit 12 simply deletes sections from the "Future Work" section of the commit-graph design document. commit-graph: update design document Documentation/config.txt | 10 ++ Documentation/git-commit-graph.txt | 14 ++- Documentation/git-fsck.txt | 3 + Documentation/git-gc.txt | 4 + Documentation/technical/commit-graph.txt | 22 builtin/commit-graph.c | 79 +- builtin/fetch.c | 13 +++ builtin/fsck.c | 21 builtin/gc.c
Re: [PATCH 2/2] replace-object.c: remove the_repository from prepare_replace_object
On 5/10/2018 7:56 AM, Jeff King wrote: On Thu, May 10, 2018 at 07:23:13PM +0900, Junio C Hamano wrote: This one was doing ptr = xmalloc(sizeof(*another_ptr)) and it was OK because ptr and another_ptr happened to be of the same type. I wonder if we are making it safer, or making it more obscure to seasoned C programmers, if we introduced a pair of helper macros, perhaps like these: #define ALLOCATE(ptr) (ptr) = xmalloc(sizeof(*(ptr))) #define CALLOCATE(ptr,cnt) (ptr) = xcalloc((cnt), sizeof(*(ptr))) I've often wondered that, too. It's the natural endgame of the ALLOC_ARRAY() road we've been going down. I'll chime in that I like this idea. Because I'm trying to learn more about Coccinelle, I added the diff below and ran 'make coccicheck'. It resulted in a 1000-line patch on 'next'. I'll refrain from sending that patch series, but it's nice to know a "simple" transformation is possible. Using `git grep` I see 230 instances of 'xmalloc' and 261 instances of 'xcalloc'. After the Coccinelle transformation, these are down to 194 and 190, respectively, because the rest allocate in the same line as the definition. It's worth thinking about the macro pattern for those cases. diff --git a/contrib/coccinelle/alloc.cocci b/contrib/coccinelle/alloc.cocci new file mode 100644 index 00..95f426c4dc --- /dev/null +++ b/contrib/coccinelle/alloc.cocci @@ -0,0 +1,13 @@ +@@ +expression p; +@@ +- p = xmalloc(sizeof(*p)); ++ ALLOCATE(p); + +@@ +expression c; +expression p; +@@ +- p = xcalloc(c, sizeof(*p)); ++ CALLOCATE(p,c); + diff --git a/git-compat-util.h b/git-compat-util.h index f9e4c5f9bc..5398f217d7 100644 --- a/git-compat-util.h +++ b/git-compat-util.h @@ -843,6 +843,9 @@ extern FILE *fopen_or_warn(const char *path, const char *mode); */ #define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0) +#define ALLOCATE(ptr) (ptr) = xmalloc(sizeof(*(ptr))) +#define CALLOCATE(ptr,cnt) (ptr) = xcalloc((cnt), sizeof(*(ptr))) + #define ALLOC_ARRAY(x, alloc) (x) = xmalloc(st_mult(sizeof(*(x)), (alloc))) #define REALLOC_ARRAY(x, alloc) (x) = xrealloc((x), st_mult(sizeof(*(x)), (alloc)))
Re: [PATCH 0/1] Fix UX issue with commit-graph.lock
On 5/10/2018 3:00 AM, Junio C Hamano wrote: Derrick Stolee <dsto...@microsoft.com> writes: We use the lockfile API to avoid multiple Git processes from writing to the commit-graph file in the .git/objects/info directory. In some cases, this directory may not exist, so we check for its existence. The existing code does the following when acquiring the lock: 1. Try to acquire the lock. 2. If it fails, try to create the .git/object/info directory. 3. Try to acquire the lock, failing if necessary. The problem is that if the lockfile exists, then the mkdir fails, giving an error that doesn't help the user: "fatal: cannot mkdir .git/objects/info: File exists" Isn't a better immediate fix to make the second step pay attention to errno? If mkdir() failed due to EEXIST, then we know we tried to aquire the lock already, so we can die with an appropriate message. That way, we can keep the "optimize for the normal case" that the approach to assume object/info/ directory is already there, instead of always checking its existence which is almost always true beforehand. This "optimize for the normal case" is why the existing code is organized the way it is. Since this code is only for writing a commit-graph file, this "check the directory first" option is a very small portion of the full time to write the file, so the "optimization" has very little effect, relatively. My personal opinion is to make the code cleaner when the performance difference is negligible. I'm willing to concede this point and use the steps you suggest below, if we think this is the best way forward. Also, can't we tell why we failed to acquire the lock at step #1? Do we only get a NULL that says "I am not telling you why, but we failed to lock"? To tell why we failed to acquire the lock, we could inspect "errno". However, this requires whitebox knowledge of both the lockfile API and the tempfile API to know that the last system call to set errno was an open() or adjust_shared_perm(). To cleanly make decisions based on the reason the lock failed to acquire, I think we would need to modify the lockfile and tempfile APIs to return a failure reason. This could be done by passing an `int *reason`, but the extra noise in these APIs is likely not worth the change. What I am getting at is that the ideal sequence would be more like: 1. Try to acquire the lock. 2-a. if #1 succeeds, we are happy. ignore the rest and return the lock. 2-b. if #1 failed because object/info/ did not exist, mkdir() it, and die if we cannot, saying "cannot mkdir". if mkdir() succeeds, jump t 3. 2-c. if #1 failed but that is not due to missing object/info/, die saying "cannot lock". 3. Try to acquire the lock. 4-a. if #3 succeeds, we are happy.ignore the rest and return the lock. 4-b. die saying "cannot lock". Thanks, -Stolee
Re: [PATCH 1/1] commit-graph: fix UX issue when .lock file exists
On 5/9/2018 10:42 AM, Jeff King wrote: On Wed, May 09, 2018 at 02:15:38PM +, Derrick Stolee wrote: The commit-graph file lives in the .git/objects/info directory. Previously, a failure to acquire the commit-graph.lock file was assumed to be due to the lack of the info directory, so a mkdir() was called. This gave incorrect messaging if instead the lockfile was open by another process: "fatal: cannot mkdir .git/objects/info: File exists" Rearrange the expectations of this directory existing to avoid this error, and instead show the typical message when a lockfile already exists. Sounds like a reasonable bug fix. Your cover letter is way longer than this description. Should some of that background perhaps go in the commit message? I did want a place to include the full die() message in the new behavior, but that seemed like overkill for the commit message. (I would go so far as to say that sending a cover letter for a single patch is an anti-pattern, because the commit message should be able to stand on its own). @@ -754,23 +755,14 @@ void write_commit_graph(const char *obj_dir, compute_generation_numbers(); graph_name = get_commit_graph_filename(obj_dir); - fd = hold_lock_file_for_update(, graph_name, 0); - if (fd < 0) { - struct strbuf folder = STRBUF_INIT; - strbuf_addstr(, graph_name); - strbuf_setlen(, strrchr(folder.buf, '/') - folder.buf); - - if (mkdir(folder.buf, 0777) < 0) - die_errno(_("cannot mkdir %s"), folder.buf); - strbuf_release(); - - fd = hold_lock_file_for_update(, graph_name, LOCK_DIE_ON_ERROR); - - if (fd < 0) - die_errno("unable to create '%s'", graph_name); - } + strbuf_addstr(, graph_name); + strbuf_setlen(, strrchr(folder.buf, '/') - folder.buf); + if (!file_exists(folder.buf) && mkdir(folder.buf, 0777) < 0) + die_errno(_("cannot mkdir %s"), folder.buf); + strbuf_release(); The result is racy if somebody else is trying to create the directory at the same time. For that you'd want to notice EEXIST coming from mkdir and consider that a success. I think you probably ought to be calling adjust_shared_perm() on the result, too, in case core.sharedrepository is configured. If you use safe_create_leading_directories(), it should handle both. Something like: if (safe_create_leading_directories(graph_name)) die_errno(_("unable to create leading directories of %s"), graph_name)); I think I'm holding it right; that function is a little short on documentation, but it's the standard way to do this in Git's codebase, and you can find lots of example callers. Thanks for this method. I was unfamiliar with it. Saves the effort of creating the strbuf, too. Thanks, -Stolee
Re: Implementing reftable in Git
On 5/9/2018 10:33 AM, Christian Couder wrote: Hi, I might start working on implementing reftable in Git soon. During the last Git Merge conference last March Stefan talked about reftable. In Alex Vandiver's notes [1] it is asked that people announce it on the list when they start working on it, and it appears that there is a reference implementation in JGit. Thanks for starting on this! In addition to the performance gains, this will help a lot of users with case-insensitive file systems from getting case-errors on refnames. Looking it up, there is indeed some documentation [2], code [3], tests [4] and other related stuff [5] in the JGit repo. It looks like the JGit repo and the reftable code there are licensed under the Eclipse Distribution License - v 1.0 [7] which is very similar to the 3-Clause BSD License also called Modified BSD License which is GPL compatible according to gnu.org [9]. So from a quick look it appears that I should be able to port the JGit to Git if I just keep the copyright and license header comments in all the related files. So I think the most straightforward and compatible way to do it would be to port the JGit implementation. Thanks in advance for any suggestion or comment about this. Reftable was first described by Shawn and then discussed last July on the list [6]. The hope is that such a direct port should be possible, but someone else should comment on the porting process. This is also something that could be created independently based on the documentation you mention. I was planning to attempt that during a hackathon in July, but I'm happy you are able to start earlier (and that you are announcing your intentions). I would be happy to review your patch series, so please keep me posted. Thanks, -Stolee
[PATCH 1/1] commit-graph: fix UX issue when .lock file exists
The commit-graph file lives in the .git/objects/info directory. Previously, a failure to acquire the commit-graph.lock file was assumed to be due to the lack of the info directory, so a mkdir() was called. This gave incorrect messaging if instead the lockfile was open by another process: "fatal: cannot mkdir .git/objects/info: File exists" Rearrange the expectations of this directory existing to avoid this error, and instead show the typical message when a lockfile already exists. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 24 1 file changed, 8 insertions(+), 16 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index a8c337dd77..8399194da1 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -1,5 +1,6 @@ #include "cache.h" #include "config.h" +#include "dir.h" #include "git-compat-util.h" #include "lockfile.h" #include "pack.h" @@ -640,13 +641,13 @@ void write_commit_graph(const char *obj_dir, struct hashfile *f; uint32_t i, count_distinct = 0; char *graph_name; - int fd; struct lock_file lk = LOCK_INIT; uint32_t chunk_ids[5]; uint64_t chunk_offsets[5]; int num_chunks; int num_extra_edges; struct commit_list *parent; + struct strbuf folder = STRBUF_INIT; oids.nr = 0; oids.alloc = approximate_object_count() / 4; @@ -754,23 +755,14 @@ void write_commit_graph(const char *obj_dir, compute_generation_numbers(); graph_name = get_commit_graph_filename(obj_dir); - fd = hold_lock_file_for_update(, graph_name, 0); - if (fd < 0) { - struct strbuf folder = STRBUF_INIT; - strbuf_addstr(, graph_name); - strbuf_setlen(, strrchr(folder.buf, '/') - folder.buf); - - if (mkdir(folder.buf, 0777) < 0) - die_errno(_("cannot mkdir %s"), folder.buf); - strbuf_release(); - - fd = hold_lock_file_for_update(, graph_name, LOCK_DIE_ON_ERROR); - - if (fd < 0) - die_errno("unable to create '%s'", graph_name); - } + strbuf_addstr(, graph_name); + strbuf_setlen(, strrchr(folder.buf, '/') - folder.buf); + if (!file_exists(folder.buf) && mkdir(folder.buf, 0777) < 0) + die_errno(_("cannot mkdir %s"), folder.buf); + strbuf_release(); + hold_lock_file_for_update(, graph_name, LOCK_DIE_ON_ERROR); f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf); hashwrite_be32(f, GRAPH_SIGNATURE); -- 2.17.0.39.g685157f7fb
[PATCH 0/1] Fix UX issue with commit-graph.lock
We use the lockfile API to avoid multiple Git processes from writing to the commit-graph file in the .git/objects/info directory. In some cases, this directory may not exist, so we check for its existence. The existing code does the following when acquiring the lock: 1. Try to acquire the lock. 2. If it fails, try to create the .git/object/info directory. 3. Try to acquire the lock, failing if necessary. The problem is that if the lockfile exists, then the mkdir fails, giving an error that doesn't help the user: "fatal: cannot mkdir .git/objects/info: File exists" While technically this honors the lockfile, it does not help the user. Instead, do the following: 1. Check for existence of .git/objects/info; create if necessary. 2. Try to acquire the lock, failing if necessary. The new output looks like: fatal: Unable to create '/home/stolee/git/.git/objects/info/commit-graph.lock': File exists. Another git process seems to be running in this repository, e.g. an editor opened by 'git commit'. Please make sure all processes are terminated then try again. If it still fails, a git process may have crashed in this repository earlier: remove the file manually to continue. This patch is based on ds/generation-numbers Derrick Stolee (1): commit-graph: fix UX issue when .lock file exists commit-graph.c | 24 1 file changed, 8 insertions(+), 16 deletions(-) base-commit: 7547b95b4fbb8591726b1d9381c176cc27fc6aea -- 2.17.0.39.g685157f7fb
Re: What's cooking in git.git (May 2018, #01; Mon, 7)
On 5/7/2018 10:58 AM, Junio C Hamano wrote: * ds/generation-numbers (2018-05-02) 11 commits - commit-graph.txt: update design document - merge: check config before loading commits - commit: use generation number in remove_redundant() - commit: add short-circuit to paint_down_to_common() - commit: use generation numbers for in_merge_bases() - ref-filter: use generation number for --contains - commit-graph: always load commit-graph information - commit: use generations in paint_down_to_common() - commit-graph: compute generation numbers - commit: add generation number to struct commmit - ref-filter: fix outdated comment on in_commit_list (this branch uses ds/commit-graph and ds/lazy-load-trees.) A recently added "commit-graph" datafile has learned to store pre-computed generation numbers to speed up the decisions to stop history traversal. Is this ready for 'next'? I see that you squashed the fix from [1], so I think this is ready to go. Thanks! [1] https://public-inbox.org/git/1cfe38f6-925b-d36b-53ae-6b586eed1...@gmail.com/ * ds/lazy-load-trees (2018-05-02) 6 commits (merged to 'next' on 2018-05-02 at d54016d9e3) + coccinelle: avoid wrong transformation suggestions from commit.cocci (merged to 'next' on 2018-04-25 at b90813f421) + commit-graph: lazy-load trees for commits + treewide: replace maybe_tree with accessor methods + commit: create get_commit_tree() method + treewide: rename tree to maybe_tree + Merge branch 'bw/c-plus-plus' into ds/lazy-load-trees (this branch is used by ds/generation-numbers; uses ds/commit-graph.) The code has been taught to use the duplicated information stored in the commit-graph file to learn the tree object name for a commit to avoid opening and parsing the commit object when it makes sense to do so. Will merge to 'master'. * ds/commit-graph (2018-04-11) 16 commits (merged to 'next' on 2018-04-25 at 18af3d28d9) + commit-graph: implement "--append" option + commit-graph: build graph from starting commits + commit-graph: read only from specific pack-indexes + commit: integrate commit graph with commit parsing + commit-graph: close under reachability + commit-graph: add core.commitGraph setting + commit-graph: implement git commit-graph read + commit-graph: implement git-commit-graph write + commit-graph: implement write_commit_graph() + commit-graph: create git-commit-graph builtin + graph: add commit graph design document + commit-graph: add format document + csum-file: refactor finalize_hashfile() method + csum-file: rename hashclose() to finalize_hashfile() + Merge branch 'jk/cached-commit-buffer' into HEAD + Merge branch 'jt/binsearch-with-fanout' into HEAD (this branch is used by ds/generation-numbers and ds/lazy-load-trees.) Precompute and store information necessary for ancestry traversal in a separate file to optimize graph walking. Will merge to 'master'. These have been queued for master for a few weeks. Is anything delaying them? I'd love to see the community dogfood this feature by running the following on their local repos: git config core.commitGraph true git show-ref -s | git commit-graph write --stdin-commits Thanks, -Stolee
Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
On 5/4/2018 3:40 PM, Jakub Narebski wrote: Hello, With early parts of commit-graph feature (ds/commit-graph and ds/lazy-load-trees) close to being merged into "master", see https://public-inbox.org/git/xmqq4ljtz87g@gitster-ct.c.googlers.com/ I think it would be good idea to think what other data could be added there to make Git even faster. Before thinking about adding more data to the commit-graph, I think instead we need to finish taking advantage of the data that is already there. This means landing the generation number patch [1] (I think this is close, so I'll send a v6 this week if there is no new feedback.) and the auto-compute patch [2] (this could use more feedback, but I'll send a v1 based on the RFC feedback if no one chimes in). [1] https://public-inbox.org/git/20180501124652.155781-1-dsto...@microsoft.com/ [PATCH v5 00/11] Compute and consume generation numbers [2] https://public-inbox.org/git/20180417181028.198397-1-dsto...@microsoft.com/ [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' The big wins remaining from this data are `git tag --merged` and `git log --graph`. The `tag` scenario is probably easier: this can be done by replacing the revision-walk underlying the call to use paint_down_to_common() instead. Requires adding an external method to commit.c, but not too much code. The tougher challenge is `git log --graph`. The revision walk machinery currently uses two precompute phases before iterating results to the pager: limit_list() and sort_in_topological_order(); these correspond to two phases of Kahn's algorithm for topo-sort (compute in-degrees, then walk by peeling commits with in-degree zero). This requires O(N) time, where N is the number of reachable commits. Instead, we could make this be O(W) time to output one page of results, where W is (roughly) the number of reachable commits with generation number above the last reported result. In order to take advantage of this approach, the two phases of Kahn's algorithm need to be done in-line with reporting results to the pager. This means keeping two queues: one is a priority queue by generation number that computes in-degrees, the other is a priority queue (by commit-date or a visit-order value to do the --topo-order priority) that peels the in-degree-zero commits (and decrements the in-degree of their parents). I have not begun this refactoring effort because appears complicated to me, and it will be hard to tease out the logic without affecting other consumers of the revision-walk machinery. I would love it if someone picked up the `git log --graph` task, since it will be a few weeks before I have the time to focus on it. Without completing the benefits we get from generation numbers, these investigations into other reachability indexes will be incomplete as they are comparing benefits without all consumers taking advantage of a reachability index. [...] Bloom filter for changed paths -- The goal of this chunk is to speed up checking if the file or directory was changed in given commit, for queries such as "git log -- " or "git blame ". This is something that according to "Git Merge contributor summit notes" [2] is already present in VSTS (Visual Studio Team Services - the server counterpart of GVFS: Git Virtual File System) at Microsoft: AV> - VSTS adds bloom filters to know which paths have changed on the commit AV> - tree-same check in the bloom filter is fast; speeds up file history checks AV> - might be useful in the client as well, since limited-traversal is common AV> - if the file history is _very_ sparse, then bloom filter is useful AV> - but needs pre-compute, so useful to do once AV> - first make the client do it, then think about how to serve it centrally [2]: https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/ I think it was what Derrick Stolee was talking about at the end of his part of "Making Git for Windows" presentation at Git Merge 2018: https://youtu.be/oOMzi983Qmw?t=1835 This was also mentioned in subthread of "Re: [PATCH v2 0/4] Lazy-load trees when reading commit-graph", starting from [3] [3]: https://public-inbox.org/git/86y3hyeu6c@gmail.com/ Again, the benefits of Bloom filters should only be measured after already taking advantage of a reachability index during `git log`. However, you could get performance benefits from Bloom filters in a normal `git log` (no topo-order). The tricky part about this feature is that the decisions we made in our C# implementation for the VSTS Git server may be very different than the needs for the C implementation of the Git client. Questions like "how do we handle merge commits?" may have different answers, which can only be discovered by implementing the feature. (The answer for VSTS is that we only store Bloom filters containin
Re: [PATCH v3 00/12] get_short_oid UI improvements
On 5/2/2018 8:42 AM, Derrick Stolee wrote: On 5/1/2018 2:40 PM, Ævar Arnfjörð Bjarmason wrote: The biggest change in v3 is the no change at all to the code, but a lengthy explanation of why I didn't go for Derrick's simpler implementation. Maybe I'm wrong about that, but I felt uneasy offloading undocumented (or if I documented it, it would only be for this one edge-case) magic on the oid_array API. Instead I'm just making this patch a bit more complex. I think that's fair. Thanks for going along with me on the thought experiment. Also, v3 looks good to me. Reviewed-by: Derrick Stolee <dsto...@microsoft.com>
Re: [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common()
On 5/2/2018 9:05 AM, Jakub Narebski wrote: Derrick Stolee <sto...@gmail.com> writes: For a copy of the Linux repository, we measured the following performance improvements: git merge-base v3.3 v4.5 Before: 234 ms After: 208 ms Rel %: -11% git merge-base v4.3 v4.5 Before: 102 ms After: 83 ms Rel %: -19% The experiments above were chosen to demonstrate that we are improving the filtering of the merge-base set. In the first example, more time is spent walking the history to find the set of merge bases before the remove_redundant() call. The starting commits are closer together in the second example, therefore more time is spent in remove_redundant(). The relative change in performance differs as expected. Nice. I was not expecting as much performance improvements as we got for --contains tests because remove_redundant() is a final step in longer process, dominated by man calculations. Still, nothing to sneeze about. One reason these numbers are not too surprising is that remove_redundant() can demonstrate quadratic behavior. It is calculating pair-wise reachability by starting a walk at each of the candidates (in the worst case). In typical cases, the first walk marks many of the other candidates as redundant and we don't need to start walks from those commits. A possible optimization could be to sort the candidates by descending generation so we find the first walk is likely to mark the rest as redundant. But this may already be the case if the candidates are added to the list in order of "discovery" which is already simulating this behavior. Thanks, -Stolee
Re: [PATCH v3 00/12] get_short_oid UI improvements
On 5/1/2018 2:40 PM, Ævar Arnfjörð Bjarmason wrote: The biggest change in v3 is the no change at all to the code, but a lengthy explanation of why I didn't go for Derrick's simpler implementation. Maybe I'm wrong about that, but I felt uneasy offloading undocumented (or if I documented it, it would only be for this one edge-case) magic on the oid_array API. Instead I'm just making this patch a bit more complex. I think that's fair. Thanks for going along with me on the thought experiment. -Stolee
Re: [PATCH v5 09/11] commit: use generation number in remove_redundant()
On 5/1/2018 8:47 AM, Derrick Stolee wrote: The static remove_redundant() method is used to filter a list of commits by removing those that are reachable from another commit in the list. This is used to remove all possible merge- bases except a maximal, mutually independent set. To determine these commits are independent, we use a number of paint_down_to_common() walks and use the PARENT1, PARENT2 flags to determine reachability. Since we only care about reachability and not the full set of merge-bases between 'one' and 'twos', we can use the 'min_generation' parameter to short-circuit the walk. When no commit-graph exists, there is no change in behavior. For a copy of the Linux repository, we measured the following performance improvements: git merge-base v3.3 v4.5 Before: 234 ms After: 208 ms Rel %: -11% git merge-base v4.3 v4.5 Before: 102 ms After: 83 ms Rel %: -19% The experiments above were chosen to demonstrate that we are improving the filtering of the merge-base set. In the first example, more time is spent walking the history to find the set of merge bases before the remove_redundant() call. The starting commits are closer together in the second example, therefore more time is spent in remove_redundant(). The relative change in performance differs as expected. Reported-by: Jakub Narebski <jna...@gmail.com> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 9875feec01..5064db4e61 100644 --- a/commit.c +++ b/commit.c @@ -949,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt) parse_commit(array[i]); for (i = 0; i < cnt; i++) { struct commit_list *common; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; This initialization should be uint32_t min_generation = array[i]->generation; since the assignment (using j) below skips the ith commit. if (redundant[i]) continue; @@ -957,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt) continue; filled_index[filled] = j; work[filled++] = array[j]; + + if (array[j]->generation < min_generation) + min_generation = array[j]->generation; } - common = paint_down_to_common(array[i], filled, work, 0); + common = paint_down_to_common(array[i], filled, work, + min_generation); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++)
Re: [PATCH v2 06/11] get_short_oid: sort ambiguous objects by type, then SHA-1
On 5/1/2018 10:10 AM, Ævar Arnfjörð Bjarmason wrote: Actually I'm having second thoughts about that and thinking I might keep my original approach (with a better explanation). A few more lines of code seems worthwhile in order to not break the assumptions a documented API is making, no matter how briefly, so I set about documenting this case and supporting it, since e.g. oid_array_lookup() will completely fail with the hack of setting the .sorted member, and came up with this: diff --git a/Documentation/technical/api-oid-array.txt b/Documentation/technical/api-oid-array.txt index b0c11f868d..ff87260220 100644 --- a/Documentation/technical/api-oid-array.txt +++ b/Documentation/technical/api-oid-array.txt @@ -16,6 +16,20 @@ Data Structures the actual data. The `nr` member contains the number of items in the set. The `alloc` and `sorted` members are used internally, and should not be needed by API callers. ++ +Both the `oid_array_lookup` and `oid_array_for_each_unique` functions +rely on the array being sorted. For the former it's an absolute +requirenment that the internal `oid_array_sort` function has been +called on it, bu for the latter it's enough that the elements are +ordered in such a way as to guarantee that identical object IDs are +adjacent in the array. s/bu/but/ ++ +This is useful e.g. to print output where commits, tags etc. are +grouped together (barring a hash collision they won't have the same +object ID), in such cases the `custom_sorted` member can be set to `1` +before calling `oid_array_for_each_unique`, and it'll skip its own +sorting. Once it's been set calling e.g. `oid_array_lookup` without it +being cleared again will cause an internal panic, so use it carefully. Functions - diff --git a/sha1-array.c b/sha1-array.c index 466a926aa3..cbae07ff78 100644 --- a/sha1-array.c +++ b/sha1-array.c @@ -18,6 +18,7 @@ static void oid_array_sort(struct oid_array *array) { QSORT(array->oid, array->nr, void_hashcmp); array->sorted = 1; + array->custom_sorted = 0; } static const unsigned char *sha1_access(size_t index, void *table) @@ -28,6 +29,13 @@ static const unsigned char *sha1_access(size_t index, void *table) int oid_array_lookup(struct oid_array *array, const struct object_id *oid) { + if (array->custom_sorted) + /* +* We could also just clear custom_sorted here, but if +* the caller is custom sorting and then calling this +* that's likely something they'd like to know about. +*/ + BUG("PANIC: Cannot lookup OIDs in arrays with a custom sort!"); Probably don't need the "PANIC: " here. if (!array->sorted) oid_array_sort(array); return sha1_pos(oid->hash, array->oid, array->nr, sha1_access); @@ -39,6 +47,7 @@ void oid_array_clear(struct oid_array *array) array->nr = 0; array->alloc = 0; array->sorted = 0; + array->custom_sorted = 0; } int oid_array_for_each_unique(struct oid_array *array, @@ -47,7 +56,7 @@ int oid_array_for_each_unique(struct oid_array *array, { int i; - if (!array->sorted) + if (!array->sorted && !array->custom_sorted) oid_array_sort(array); for (i = 0; i < array->nr; i++) { diff --git a/sha1-array.h b/sha1-array.h index 1e1d24b009..bfa77ba1e4 100644 --- a/sha1-array.h +++ b/sha1-array.h @@ -6,6 +6,7 @@ struct oid_array { int nr; int alloc; int sorted; + int custom_sorted; }; #define OID_ARRAY_INIT { NULL, 0, 0, 0 } diff --git a/sha1-name.c b/sha1-name.c index b81e07adbb..d190800db0 100644 --- a/sha1-name.c +++ b/sha1-name.c @@ -490,9 +490,11 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn fn, void *cb_data) find_short_packed_object(); QSORT(collect.oid, collect.nr, sort_ambiguous); - collect.sorted = 1; + collect.custom_sorted = 1; ret = oid_array_for_each_unique(, fn, cb_data); + collect.custom_sorted = 0; + oid_array_clear(); return ret; } So maybe I should just stop worrying and YOLO it, it just seems wrong to leave such a fragile setup in place where we set .sorted=1 and some future refactoring reasonably tries to call oid_array_lookup() on it and silently fails. What do you think? I think this extra custom_sort check is worth keeping the API stable to future changes. Thanks, -Stolee
Re: [PATCH v2 06/11] get_short_oid: sort ambiguous objects by type, then SHA-1
On 5/1/2018 9:39 AM, Ævar Arnfjörð Bjarmason wrote: On Tue, May 01 2018, Derrick Stolee wrote: From: Ævar Arnfjörð Bjarmason <ava...@gmail.com> Here is what I mean by sorting during for_each_abbrev(). This seems to work for me, so I don't know what the issue is with this one-pass approach. [...] +static int sort_ambiguous(const void *a, const void *b) +{ + int a_type = oid_object_info(a, NULL); + int b_type = oid_object_info(b, NULL); + int a_type_sort; + int b_type_sort; + + /* +* Sorts by hash within the same object type, just as +* oid_array_for_each_unique() would do. +*/ + if (a_type == b_type) + return oidcmp(a, b); + + /* +* Between object types show tags, then commits, and finally +* trees and blobs. +* +* The object_type enum is commit, tree, blob, tag, but we +* want tag, commit, tree blob. Cleverly (perhaps too +* cleverly) do that with modulus, since the enum assigns 1 to +* commit, so tag becomes 0. +*/ + a_type_sort = a_type % 4; + b_type_sort = b_type % 4; + return a_type_sort > b_type_sort ? 1 : -1; +} + static int get_short_oid(const char *name, int len, struct object_id *oid, unsigned flags) { @@ -451,6 +479,9 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn fn, void *cb_data) find_short_object_filename(); find_short_packed_object(); + QSORT(collect.oid, collect.nr, sort_ambiguous); + collect.sorted = 1; + Yes this works. You're right. I wasn't trying to intentionally omit stuff in my recent 878t93zh60@evledraar.gmail.com, I'd just written this code some days ago and forgotten why I did what I was doing (and this is hard to test for), but it's all coming back to me now. The actual requirement for oid_array_for_each_unique() working properly is that you've got to feed it in hash order, To work properly, duplicate entries must be consecutive. Since duplicate entries have the same type, our sort satisfies this condition. but my new sort_ambiguous() still does that (barring any SHA-1 collisions, at which point we have bigger problems), so two passes aren't needed. So yes, this apporoach works and is one-pass. But that's just an implementation detail of the current sort method, when I wrote this I was initially playing with other sort orders, e.g. sorting SHAs regardless of type by the mtime of the file I found them in. With this approach I'd start printing duplicates if I changed the internals of sort_ambiguous() like that. That makes sense. But I think it's extremely implausible that we'll start sorting things like that, so I'll just take this method of doing it and add some comment saying we must hashcmp() the entries in our own sort function for the de-duplication to work, I don't see us ever changing that. Sounds good. Thanks, -Stolee
Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1
On 5/1/2018 8:36 AM, Ævar Arnfjörð Bjarmason wrote: On Tue, May 01 2018, Derrick Stolee wrote: How would sorting in our custom order before de-duplicating fail the de-duplication? We will still pair identical OIDs as consecutive elements and oid_array_for_each_unique only cares about consecutive elements having distinct OIDs, not lex-ordered OIDs. Because there's no de-duplication without the array first being sorted in oidcmp() order, which oid_array_for_each_unique() checks for and re-sorts if !array->sorted. I.e. its de-duplication is just a state machine where it won't call the callback if the currently processed element has the same SHA1 as the last one. Perhaps the noise is because we rely on oid_array_sort() to mark the array as sorted inside oid_array_for_each_unique(), but that could be remedied by calling our QSORT() inside for_each_abbrev() and marking the array as sorted before calling oid_array_for_each_unique(). As noted above this won't work, because the function inherently relies on the array being sorted to be able to de-duplicate. Doing this will yield duplicate entries. I'm confused as to why my suggestion doesn't work, so I made it concrete. I sent an alternate commit 6/12 to your v2 series [1]. Thanks, -Stolee [1] https://public-inbox.org/git/20180501130318.58251-1-dsto...@microsoft.com/T/#u
[PATCH v2 06/11] get_short_oid: sort ambiguous objects by type, then SHA-1
From: Ævar Arnfjörð Bjarmason <ava...@gmail.com> Here is what I mean by sorting during for_each_abbrev(). This seems to work for me, so I don't know what the issue is with this one-pass approach. Thanks, -Stolee -- >8 -- Change the output emitted when an ambiguous object is encountered so that we show tags first, then commits, followed by trees, and finally blobs. Within each type we show objects in hashcmp() order. Before this change the objects were only ordered by hashcmp(). The reason for doing this is that the output looks better as a result, e.g. the v2.17.0 tag before this change on "git show e8f2" would display: hint: The candidates are: hint: e8f2093055 tree hint: e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached HEAD abbreviated object name hint: e8f21d02f7 blob hint: e8f21d577c blob hint: e8f25a3a50 tree hint: e8f26250fa commit 2017-02-03 - Merge pull request #996 from jeffhostetler/jeffhostetler/register_rename_src hint: e8f2650052 tag v2.17.0 hint: e8f2867228 blob hint: e8f28d537c tree hint: e8f2a35526 blob hint: e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for multiple remote.url entries hint: e8f2cf6ec0 tree Now we'll instead show: hint: e8f2650052 tag v2.17.0 hint: e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached HEAD abbreviated object name hint: e8f26250fa commit 2017-02-03 - Merge pull request #996 from jeffhostetler/jeffhostetler/register_rename_src hint: e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for multiple remote.url entries hint: e8f2093055 tree hint: e8f25a3a50 tree hint: e8f28d537c tree hint: e8f2cf6ec0 tree hint: e8f21d02f7 blob hint: e8f21d577c blob hint: e8f2867228 blob hint: e8f2a35526 blob Since we show the commit data in the output that's nicely aligned once we sort by object type. The decision to show tags before commits is pretty arbitrary. I don't want to order by object_type since there tags come last after blobs, which doesn't make sense if we want to show the most important things first. I could display them after commits, but it's much less likely that we'll display a tag, so if there is one it makes sense to show it prominently at the top. Signed-off-by: Ævar Arnfjörð Bjarmason <ava...@gmail.com> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- sha1-name.c | 31 + t/t1512-rev-parse-disambiguation.sh | 21 +++ 2 files changed, 52 insertions(+) diff --git a/sha1-name.c b/sha1-name.c index 5deebab56d..0336630c64 100644 --- a/sha1-name.c +++ b/sha1-name.c @@ -385,6 +385,34 @@ static int collect_ambiguous(const struct object_id *oid, void *data) return 0; } +static int sort_ambiguous(const void *a, const void *b) +{ + int a_type = oid_object_info(a, NULL); + int b_type = oid_object_info(b, NULL); + int a_type_sort; + int b_type_sort; + + /* +* Sorts by hash within the same object type, just as +* oid_array_for_each_unique() would do. +*/ + if (a_type == b_type) + return oidcmp(a, b); + + /* +* Between object types show tags, then commits, and finally +* trees and blobs. +* +* The object_type enum is commit, tree, blob, tag, but we +* want tag, commit, tree blob. Cleverly (perhaps too +* cleverly) do that with modulus, since the enum assigns 1 to +* commit, so tag becomes 0. +*/ + a_type_sort = a_type % 4; + b_type_sort = b_type % 4; + return a_type_sort > b_type_sort ? 1 : -1; +} + static int get_short_oid(const char *name, int len, struct object_id *oid, unsigned flags) { @@ -451,6 +479,9 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn fn, void *cb_data) find_short_object_filename(); find_short_packed_object(); + QSORT(collect.oid, collect.nr, sort_ambiguous); + collect.sorted = 1; + ret = oid_array_for_each_unique(, fn, cb_data); oid_array_clear(); return ret; diff --git a/t/t1512-rev-parse-disambiguation.sh b/t/t1512-rev-parse-disambiguation.sh index c7ceda2f21..74e7d9c178 100755 --- a/t/t1512-rev-parse-disambiguation.sh +++ b/t/t1512-rev-parse-disambiguation.sh @@ -364,4 +364,25 @@ test_expect_success 'core.disambiguate does not override context' ' git -c core.disambiguate=committish rev-parse $sha1^{tree} ' +test_expect_success C_LOCALE_OUTPUT 'ambiguous commits are printed by type first, then hash order' ' + test_must_fail git rev-parse 2>stderr && + grep ^hint: stderr >hints && + grep hints >objects && + cat >expected <<-\EOF && + tag +
[PATCH v5 07/11] commit: use generation numbers for in_merge_bases()
The containment algorithm for 'git branch --contains' is different from that for 'git tag --contains' in that it uses is_descendant_of() instead of contains_tag_algo(). The expensive portion of the branch algorithm is computing merge bases. When a commit-graph file exists with generation numbers computed, we can avoid this merge-base calculation when the target commit has a larger generation number than the initial commits. Performance tests were run on a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag. Also, all tags were copied to branches and 'git branch --contains' was tested: Before: 60.0s After: 0.4s Rel %: -99.3% Reported-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 39a3749abd..3ecdc13356 100644 --- a/commit.c +++ b/commit.c @@ -1056,12 +1056,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * { struct commit_list *bases; int ret = 0, i; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; if (parse_commit(commit)) return ret; - for (i = 0; i < nr_reference; i++) + for (i = 0; i < nr_reference; i++) { if (parse_commit(reference[i])) return ret; + if (reference[i]->generation < min_generation) + min_generation = reference[i]->generation; + } + + if (commit->generation > min_generation) + return ret; bases = paint_down_to_common(commit, nr_reference, reference); if (commit->object.flags & PARENT2) -- 2.17.0.39.g685157f7fb
[PATCH v5 01/11] ref-filter: fix outdated comment on in_commit_list
The in_commit_list() method does not check the parents of the candidate for containment in the list. Fix the comment that incorrectly states that it does. Reported-by: Jakub Narebski <jna...@gmail.com> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- ref-filter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ref-filter.c b/ref-filter.c index cffd8bf3ce..aff24d93be 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1582,7 +1582,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) } /* - * Test whether the candidate or one of its parents is contained in the list. + * Test whether the candidate is contained in the list. * Do not recurse to find out, though, but return -1 if inconclusive. */ static enum contains_result contains_test(struct commit *candidate, -- 2.17.0.39.g685157f7fb
[PATCH v5 08/11] commit: add short-circuit to paint_down_to_common()
When running 'git branch --contains', the in_merge_bases_many() method calls paint_down_to_common() to discover if a specific commit is reachable from a set of branches. Commits with lower generation number are not needed to correctly answer the containment query of in_merge_bases_many(). Add a new parameter, min_generation, to paint_down_to_common() that prevents walking commits with generation number strictly less than min_generation. If 0 is given, then there is no functional change. For in_merge_bases_many(), we can pass commit->generation as the cutoff, and this saves time during 'git branch --contains' queries that would otherwise walk "around" the commit we are inspecting. For a copy of the Linux repository, where HEAD is checked out at v4.13~100, we get the following performance improvement for 'git branch --contains' over the previous commit: Before: 0.21s After: 0.13s Rel %: -38% Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 20 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/commit.c b/commit.c index 3ecdc13356..9875feec01 100644 --- a/commit.c +++ b/commit.c @@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue) } /* all input commits in one and twos[] must have been parsed! */ -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) +static struct commit_list *paint_down_to_common(struct commit *one, int n, + struct commit **twos, + int min_generation) { struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; + uint32_t last_gen = GENERATION_NUMBER_INFINITY; one->object.flags |= PARENT1; if (!n) { @@ -831,6 +834,15 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc struct commit_list *parents; int flags; + if (commit->generation > last_gen) + BUG("bad generation skip %8x > %8x at %s", + commit->generation, last_gen, + oid_to_hex(>object.oid)); + last_gen = commit->generation; + + if (commit->generation < min_generation) + break; + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -879,7 +891,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co return NULL; } - list = paint_down_to_common(one, n, twos); + list = paint_down_to_common(one, n, twos, 0); while (list) { struct commit *commit = pop_commit(); @@ -946,7 +958,7 @@ static int remove_redundant(struct commit **array, int cnt) filled_index[filled] = j; work[filled++] = array[j]; } - common = paint_down_to_common(array[i], filled, work); + common = paint_down_to_common(array[i], filled, work, 0); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) @@ -1070,7 +1082,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * if (commit->generation > min_generation) return ret; - bases = paint_down_to_common(commit, nr_reference, reference); + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); if (commit->object.flags & PARENT2) ret = 1; clear_commit_marks(commit, all_flags); -- 2.17.0.39.g685157f7fb
[PATCH v5 03/11] commit-graph: compute generation numbers
While preparing commits to be written into a commit-graph file, compute the generation numbers using a depth-first strategy. The only commits that are walked in this depth-first search are those without a precomputed generation number. Thus, computation time will be relative to the number of new commits to the commit-graph file. If a computed generation number would exceed GENERATION_NUMBER_MAX, then use GENERATION_NUMBER_MAX instead. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 43 +++ 1 file changed, 43 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 9ad21c3ffb..36d765e10a 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -439,6 +439,8 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; + packedDate[0] |= htonl((*list)->generation << 2); + packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); @@ -571,6 +573,45 @@ static void close_reachable(struct packed_oid_list *oids) } } +static void compute_generation_numbers(struct packed_commit_list* commits) +{ + int i; + struct commit_list *list = NULL; + + for (i = 0; i < commits->nr; i++) { + if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY && + commits->list[i]->generation != GENERATION_NUMBER_ZERO) + continue; + + commit_list_insert(commits->list[i], ); + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; + uint32_t max_generation = 0; + + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_INFINITY || + parent->item->generation == GENERATION_NUMBER_ZERO) { + all_parents_computed = 0; + commit_list_insert(parent->item, ); + break; + } else if (parent->item->generation > max_generation) { + max_generation = parent->item->generation; + } + } + + if (all_parents_computed) { + current->generation = max_generation + 1; + pop_commit(); + + if (current->generation > GENERATION_NUMBER_MAX) + current->generation = GENERATION_NUMBER_MAX; + } + } + } +} + void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, @@ -694,6 +735,8 @@ void write_commit_graph(const char *obj_dir, if (commits.nr >= GRAPH_PARENT_MISSING) die(_("too many commits to write graph")); + compute_generation_numbers(); + graph_name = get_commit_graph_filename(obj_dir); fd = hold_lock_file_for_update(, graph_name, 0); -- 2.17.0.39.g685157f7fb
[PATCH v5 05/11] commit-graph: always load commit-graph information
Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). With generation numbers in the commit-graph, we need to ensure that any commit that exists in the commit-graph file has its generation number loaded. Create new load_commit_graph_info() method to fill in the information for a commit that exists only in the commit-graph file. Call it from parse_commit_buffer() after loading the other commit information from the given buffer. Only fill this information when specified by the 'check_graph' parameter. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 46 +++--- commit-graph.h | 8 commit.c | 7 +-- commit.h | 2 +- object.c | 2 +- sha1_file.c| 2 +- 6 files changed, 47 insertions(+), 20 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 36d765e10a..a8c337dd77 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -245,6 +245,13 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, return _list_insert(c, pptr)->next; } +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + item->graph_pos = pos; + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; +} + static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) { uint32_t edge_value; @@ -292,31 +299,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin return 1; } +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) +{ + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { + *pos = item->graph_pos; + return 1; + } else { + return bsearch_graph(g, &(item->object.oid), pos); + } +} + int parse_commit_in_graph(struct commit *item) { + uint32_t pos; + if (!core_commit_graph) return 0; if (item->object.parsed) return 1; - prepare_commit_graph(); - if (commit_graph) { - uint32_t pos; - int found; - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { - pos = item->graph_pos; - found = 1; - } else { - found = bsearch_graph(commit_graph, &(item->object.oid), ); - } - - if (found) - return fill_commit_in_graph(item, commit_graph, pos); - } - + if (commit_graph && find_commit_in_graph(item, commit_graph, )) + return fill_commit_in_graph(item, commit_graph, pos); return 0; } +void load_commit_graph_info(struct commit *item) +{ + uint32_t pos; + if (!core_commit_graph) + return; + prepare_commit_graph(); + if (commit_graph && find_commit_in_graph(item, commit_graph, )) + fill_commit_graph_info(item, commit_graph, pos); +} + static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) { struct object_id oid; diff --git a/commit-graph.h b/commit-graph.h index 260a468e73..96cccb10f3 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); */ int parse_commit_in_graph(struct commit *item); +/* + * It is possible that we loaded commit contents from the commit buffer, + * but we also want to ensure the commit-graph content is correctly + * checked and filled. Fill the graph_pos and generation members of + * the given commit. + */ +void load_commit_graph_info(struct commit *item); + struct tree *get_commit_tree_in_graph(const struct commit *c); struct commit_graph { diff --git a/commit.c b/commit.c index 4d00b0a1d6..39a3749abd 100644 --- a/commit.c +++ b/commit.c @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) return ret; } -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size) +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph) { const char *tail = buffer; const char *bufptr = buffer; @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s } item->date = parse_commit_date(bufptr, tail); + if (check_graph) + load_commit_graph_info(item); + return 0; } @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item,
[PATCH v5 11/11] commit-graph.txt: update design document
We now calculate generation numbers in the commit-graph file and use them in paint_down_to_common(). Expand the section on generation numbers to discuss how the three special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and _MAX interact with other generation numbers. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph.txt | 29 1 file changed, 24 insertions(+), 5 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index 0550c6d0dc..e1a883eb46 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite" generation number and walk until reaching commits with known generation number. +We use the macro GENERATION_NUMBER_INFINITY = 0x to mark commits not +in the commit-graph file. If a commit-graph file was written by a version +of Git that did not compute generation numbers, then those commits will +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. + +Since the commit-graph file is closed under reachability, we can guarantee +the following weaker condition on all commits: + +If A and B are commits with generation numbers N amd M, respectively, +and N < M, then A cannot reach B. + +Note how the strict inequality differs from the inequality when we have +fully-computed generation numbers. Using strict inequality may result in +walking a few extra commits, but the simplicity in dealing with commits +with generation number *_INFINITY or *_ZERO is valuable. + +We use the macro GENERATION_NUMBER_MAX = 0x3FFF to for commits whose +generation numbers are computed to be at least this value. We limit at +this value since it is the largest value that can be stored in the +commit-graph file using the 30 bits available to generation numbers. This +presents another case where a commit can have generation number equal to +that of a parent. + Design Details -- @@ -98,18 +121,14 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following operations are important candidates: -- paint_down_to_common() - 'log --topo-order' +- 'tag --merged' - Currently, parse_commit_gently() requires filling in the root tree object for a commit. This passes through lookup_tree() and consequently -- 2.17.0.39.g685157f7fb
[PATCH v5 00/11] Compute and consume generation numbers
t_list *common; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; if (redundant[i]) continue; @@ -955,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt) continue; filled_index[filled] = j; work[filled++] = array[j]; + + if (array[j]->generation < min_generation) + min_generation = array[j]->generation; } - common = paint_down_to_common(array[i], filled, work, 0); + common = paint_down_to_common(array[i], filled, work, + min_generation); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) @@ -1073,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * for (i = 0; i < nr_reference; i++) { if (parse_commit(reference[i])) return ret; - if (min_generation > reference[i]->generation) + if (reference[i]->generation < min_generation) min_generation = reference[i]->generation; } -- >8 -- Derrick Stolee (11): ref-filter: fix outdated comment on in_commit_list commit: add generation number to struct commmit commit-graph: compute generation numbers commit: use generations in paint_down_to_common() commit-graph: always load commit-graph information ref-filter: use generation number for --contains commit: use generation numbers for in_merge_bases() commit: add short-circuit to paint_down_to_common() commit: use generation number in remove_redundant() merge: check config before loading commits commit-graph.txt: update design document Documentation/technical/commit-graph.txt | 30 ++-- alloc.c | 1 + builtin/merge.c | 7 +- commit-graph.c | 91 commit-graph.h | 8 +++ commit.c | 61 +--- commit.h | 7 +- object.c | 2 +- ref-filter.c | 26 +-- sha1_file.c | 2 +- t/t5318-commit-graph.sh | 9 +++ 11 files changed, 204 insertions(+), 40 deletions(-) base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707 -- 2.17.0.39.g685157f7fb
[PATCH v5 06/11] ref-filter: use generation number for --contains
A commit A can reach a commit B only if the generation number of A is strictly larger than the generation number of B. This condition allows significantly short-circuiting commit-graph walks. Use generation number for '--contains' type queries. On a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag, the command 'git tag --contains HEAD' had the following peformance improvement: Before: 0.81s After: 0.04s Rel %: -95% Helped-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- ref-filter.c | 24 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/ref-filter.c b/ref-filter.c index aff24d93be..fb35067fc9 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -16,6 +16,7 @@ #include "trailer.h" #include "wt-status.h" #include "commit-slab.h" +#include "commit-graph.h" static struct ref_msg { const char *gone; @@ -1587,7 +1588,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1603,6 +1605,10 @@ static enum contains_result contains_test(struct commit *candidate, /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation < cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1618,8 +1624,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_INFINITY; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + load_commit_graph_info(c); + if (c->generation < cutoff) + cutoff = c->generation; + } + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1637,7 +1653,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. */ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1651,7 +1667,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit, -- 2.17.0.39.g685157f7fb
[PATCH v5 09/11] commit: use generation number in remove_redundant()
The static remove_redundant() method is used to filter a list of commits by removing those that are reachable from another commit in the list. This is used to remove all possible merge- bases except a maximal, mutually independent set. To determine these commits are independent, we use a number of paint_down_to_common() walks and use the PARENT1, PARENT2 flags to determine reachability. Since we only care about reachability and not the full set of merge-bases between 'one' and 'twos', we can use the 'min_generation' parameter to short-circuit the walk. When no commit-graph exists, there is no change in behavior. For a copy of the Linux repository, we measured the following performance improvements: git merge-base v3.3 v4.5 Before: 234 ms After: 208 ms Rel %: -11% git merge-base v4.3 v4.5 Before: 102 ms After: 83 ms Rel %: -19% The experiments above were chosen to demonstrate that we are improving the filtering of the merge-base set. In the first example, more time is spent walking the history to find the set of merge bases before the remove_redundant() call. The starting commits are closer together in the second example, therefore more time is spent in remove_redundant(). The relative change in performance differs as expected. Reported-by: Jakub Narebski <jna...@gmail.com> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 9875feec01..5064db4e61 100644 --- a/commit.c +++ b/commit.c @@ -949,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt) parse_commit(array[i]); for (i = 0; i < cnt; i++) { struct commit_list *common; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; if (redundant[i]) continue; @@ -957,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt) continue; filled_index[filled] = j; work[filled++] = array[j]; + + if (array[j]->generation < min_generation) + min_generation = array[j]->generation; } - common = paint_down_to_common(array[i], filled, work, 0); + common = paint_down_to_common(array[i], filled, work, + min_generation); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) -- 2.17.0.39.g685157f7fb
[PATCH v5 10/11] merge: check config before loading commits
Now that we use generation numbers from the commit-graph, we must ensure that all commits that exist in the commit-graph are loaded from that file instead of from the object database. Since the commit-graph file is only checked if core.commitGraph is true, we must check the default config before we load any commits. In the merge builtin, the config was checked after loading the HEAD commit. This was due to the use of the global 'branch' when checking merge-specific config settings. Move the config load to be between the initialization of 'branch' and the commit lookup. Without this change, a fast-forward merge would hit a BUG("bad generation skip") statement in commit.c during paint_down_to_common(). This is because the HEAD commit would be loaded with "infinite" generation but then reached by commits with "finite" generation numbers. Add a test to t5318-commit-graph.sh that exercises this code path to prevent a regression. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- builtin/merge.c | 7 --- t/t5318-commit-graph.sh | 9 + 2 files changed, 13 insertions(+), 3 deletions(-) diff --git a/builtin/merge.c b/builtin/merge.c index 5e5e4497e3..b819756946 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", ); + + init_diff_ui_defaults(); + git_config(git_merge_config, NULL); + if (!branch || is_null_oid(_oid)) head_commit = NULL; else head_commit = lookup_commit_or_die(_oid, "HEAD"); - init_diff_ui_defaults(); - git_config(git_merge_config, NULL); - if (branch_mergeoptions) parse_branch_merge_options(branch_mergeoptions); argc = parse_options(argc, argv, prefix, builtin_merge_options, diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index a380419b65..77d85aefe7 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' ' graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2 +test_expect_success 'perform fast-forward merge in full repo' ' + cd "$TRASH_DIRECTORY/full" && + git checkout -b merge-5-to-8 commits/5 && + git merge commits/8 && + git show-ref -s merge-5-to-8 >output && + git show-ref -s commits/8 >expect && + test_cmp expect output +' + test_done -- 2.17.0.39.g685157f7fb
[PATCH v5 02/11] commit: add generation number to struct commmit
The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use three special values to signal the generation number is invalid: GENERATION_NUMBER_INFINITY 0x GENERATION_NUMBER_MAX 0x3FFF GENERATION_NUMBER_ZERO 0 The first (_INFINITY) means the generation number has not been loaded or computed. The second (_MAX) means the generation number is too large to store in the commit-graph file. The third (_ZERO) means the generation number was loaded from a commit graph file that was written by a version of git that did not support generation numbers. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- alloc.c| 1 + commit-graph.c | 2 ++ commit.h | 4 3 files changed, 7 insertions(+) diff --git a/alloc.c b/alloc.c index cf4f8b61e1..e8ab14f4a1 100644 --- a/alloc.c +++ b/alloc.c @@ -94,6 +94,7 @@ void *alloc_commit_node(void) c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); c->graph_pos = COMMIT_NOT_FROM_GRAPH; + c->generation = GENERATION_NUMBER_INFINITY; return c; } diff --git a/commit-graph.c b/commit-graph.c index 70fa1b25fd..9ad21c3ffb 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin date_low = get_be32(commit_data + g->hash_len + 12); item->date = (timestamp_t)((date_high << 32) | date_low); + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; + pptr = >parents; edge_value = get_be32(commit_data + g->hash_len); diff --git a/commit.h b/commit.h index 23a3f364ed..aac3b8c56f 100644 --- a/commit.h +++ b/commit.h @@ -10,6 +10,9 @@ #include "pretty.h" #define COMMIT_NOT_FROM_GRAPH 0x +#define GENERATION_NUMBER_INFINITY 0x +#define GENERATION_NUMBER_MAX 0x3FFF +#define GENERATION_NUMBER_ZERO 0 struct commit_list { struct commit *item; @@ -30,6 +33,7 @@ struct commit { */ struct tree *maybe_tree; uint32_t graph_pos; + uint32_t generation; }; extern int save_commit_buffer; -- 2.17.0.39.g685157f7fb
[PATCH v5 04/11] commit: use generations in paint_down_to_common()
Define compare_commits_by_gen_then_commit_date(), which uses generation numbers as a primary comparison and commit date to break ties (or as a comparison when both commits do not have computed generation numbers). Since the commit-graph file is closed under reachability, we know that all commits in the file have generation at most GENERATION_NUMBER_MAX which is less than GENERATION_NUMBER_INFINITY. This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. In the case that commit dates violate topological order (i.e. a parent is "newer" than a child), the previous code could walk a commit twice: if a commit is reached with the PARENT1 bit, but later is re-visited with the PARENT2 bit, then that PARENT2 bit must be propagated to its parents. Using generation numbers avoids this extra effort, even if it is somewhat rare. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 20 +++- commit.h | 1 + 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 711f674c18..4d00b0a1d6 100644 --- a/commit.c +++ b/commit.c @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) +{ + const struct commit *a = a_, *b = b_; + + /* newer commits first */ + if (a->generation < b->generation) + return 1; + else if (a->generation > b->generation) + return -1; + + /* use date as a heuristic when generations are equal */ + if (a->date < b->date) + return 1; + else if (a->date > b->date) + return -1; + return 0; +} + int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) /* all input commits in one and twos[] must have been parsed! */ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) { - struct prio_queue queue = { compare_commits_by_commit_date }; + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; diff --git a/commit.h b/commit.h index aac3b8c56f..64436ff44e 100644 --- a/commit.h +++ b/commit.h @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); LAST_ARG_MUST_BE_NULL extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); -- 2.17.0.39.g685157f7fb
Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1
On 5/1/2018 7:27 AM, Ævar Arnfjörð Bjarmason wrote: On Tue, May 01 2018, Derrick Stolee wrote: On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote: Since we show the commit data in the output that's nicely aligned once we sort by object type. The decision to show tags before commits is pretty arbitrary, but it's much less likely that we'll display a tag, so if there is one it makes sense to show it first. Here's a non-arbitrary reason: the object types are ordered topologically (ignoring self-references): tag -> commit, tree, blob commit -> tree tree -> blob Thanks. I'll add a patch with that comment to v2. @@ -421,7 +451,12 @@ static int get_short_oid(const char *name, int len, struct object_id *oid, ds.fn = NULL; advise(_("The candidates are:")); - for_each_abbrev(ds.hex_pfx, show_ambiguous_object, ); + for_each_abbrev(ds.hex_pfx, collect_ambiguous, ); + QSORT(collect.oid, collect.nr, sort_ambiguous); I was wondering how the old code sorted by SHA even when the ambiguous objects were loaded from different sources (multiple pack-files, loose objects). Turns out that for_each_abbrev() does its own sort after collecting the SHAs and then calls the given function pointer only once per distinct object. This avoids multiple instances of the same object, which may appear multiple times across pack-files. I only ask because now we are doing two sorts. I wonder if it would be more elegant to provide your sorting algorithm to for_each_abbrev() and let it call show_ambiguous_object as before. Another question is if we should use this sort generally for all calls to for_each_abbrev(). The only other case I see is in builtin/revparse.c. When preparing v2 I realized how confusing this was, so I'd added this to the commit message of my WIP re-roll which should explain this: A note on the implementation: I started out with something much simpler which just replaced oid_array_sort() in sha1-array.c with a custom sort function before calling oid_array_for_each_unique(). But then dumbly noticed that it doesn't work because the output function was tangled up with the code added in fad6b9e590 ("for_each_abbrev: drop duplicate objects", 2016-09-26) to ensure we don't display duplicate objects. That's why we're doing two passes here, first we need to sort the list and de-duplicate the objects, then sort them in our custom order, and finally output them without re-sorting them. I suppose we could also make oid_array_for_each_unique() maintain a hashmap of emitted objects, but that would increase its memory profile and wouldn't be worth the complexity for this one-off use-case, oid_array_for_each_unique() is used in many other places. How would sorting in our custom order before de-duplicating fail the de-duplication? We will still pair identical OIDs as consecutive elements and oid_array_for_each_unique only cares about consecutive elements having distinct OIDs, not lex-ordered OIDs. Perhaps the noise is because we rely on oid_array_sort() to mark the array as sorted inside oid_array_for_each_unique(), but that could be remedied by calling our QSORT() inside for_each_abbrev() and marking the array as sorted before calling oid_array_for_each_unique(). (Again, my comments are not meant to block this series.) Thanks, -Stolee
Re: [PATCH v4 05/10] commit-graph: always load commit-graph information
On 4/29/2018 6:14 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: Most code paths load commits using lookup_commit() and then parse_commit(). And this automatically loads commit graph if needed, thanks to changes in parse_commit_gently(), which parse_commit() uses. In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). I guess the problem is that we cannot just add parse_commit_in_graph() like we did in parse_commit_gently(), for some reason? Like for example that parse_commit_gently() uses parse_commit_buffer() - which could have been handled by moving parse_commit_in_graph() down the call chain from parse_commit_gently() to parse_commit_buffer()... if not the fact that check_commit() also uses parse_commit_buffer(), but it does not want to load commit graph. Am I right? If a caller uses parse_commit_buffer() directly, then we will guarantee that all values in the struct commit that would be loaded from the buffer are loaded from the buffer. This means we do NOT load the root tree id or commit date from the commit-graph file. We do still need to load the data that is not available in the buffer, such as graph_pos and generation. With generation numbers in the commit-graph, we need to ensure that any commit that exists in the commit-graph file has its generation number loaded. Is it generation number, or generation number and position in commit graph? We don't need to ensure the graph_pos (the commit will never be re-parsed, so we will not try to find it in the commit-graph file again), but we DO need to ensure the generation (or our commit walks will be incorrect). We get the graph_pos as a side-effect. Create new load_commit_graph_info() method to fill in the information for a commit that exists only in the commit-graph file. Call it from parse_commit_buffer() after loading the other commit information from the given buffer. Only fill this information when specified by the 'check_graph' parameter. I think this commit would be easier to review if it was split into pure refactoring part (extracting fill_commit_graph_info() and find_commit_in_graph()). On the other hand the refactoring was needed to reduce code duplication betweem existing parse_commit_in_graph() and new load_commit_graph_info() functions. I guess that the difference between parse_commit_in_graph() and load_commit_graph_info() is that the former cares only about having just enough information that is needed for parse_commit_gently() - and does not load graph data if commit is parsed, while the latter is about loading commit-graph data like generation numbers. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 45 ++--- commit-graph.h | 8 commit.c | 7 +-- commit.h | 2 +- object.c | 2 +- sha1_file.c| 2 +- 6 files changed, 46 insertions(+), 20 deletions(-) I wonder if it would be possible to add tests for this feature, for example that commit-graph is read when it should (including those branch lookups), and is not read when the feature should be disabled. But the only way to test it I can think of is a stupid one: create invalid commit graph, and check that git fails as expected (trying to read said malformed file), and does not fail if commit graph feature is disabled. Let me reorder files (BTW, is there a way for Git to put *.h files before *.c files in diff?) for easier review: diff --git a/commit-graph.h b/commit-graph.h index 260a468e73..96cccb10f3 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); */ int parse_commit_in_graph(struct commit *item); +/* + * It is possible that we loaded commit contents from the commit buffer, + * but we also want to ensure the commit-graph content is correctly + * checked and filled. Fill the graph_pos and generation members of + * the given commit. + */ +void load_commit_graph_info(struct commit *item); + struct tree *get_commit_tree_in_graph(const struct commit *c); struct commit_graph { diff --git a/commit-graph.c b/commit-graph.c index 047fa9fca5..aebd242def 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -245,6 +245,12 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, return _list_insert(c, pptr)->next; } +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; +} The comment in the header file commit-graph.h talks about filling graph_pos and generation members of the given commit, but I don't see filling graph_pos member here. We
Re: [PATCH v4 03/10] commit-graph: compute generation numbers
On 4/29/2018 5:08 AM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: While preparing commits to be written into a commit-graph file, compute the generation numbers using a depth-first strategy. Sidenote: for generation numbers it does not matter if we use depth-first or breadth-first strategy, but it is more natural to use depth-first search because generation numbers need post-order processing (parents before child). The only commits that are walked in this depth-first search are those without a precomputed generation number. Thus, computation time will be relative to the number of new commits to the commit-graph file. A question: what happens if the existing commit graph is from older version of git and has _ZERO for generation numbers? Answer: I see that we treat both _INFINITY (not in commit-graph) and _ZERO (in commit graph but not computed) as not computed generation numbers. All right. If a computed generation number would exceed GENERATION_NUMBER_MAX, then use GENERATION_NUMBER_MAX instead. All right, though I guess this would remain theoretical for a long while. We don't have any way of testing this, at least not without recompiling Git with lower value of GENERATION_NUMBER_MAX -- which means not automatically, isn't it? Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 45 + 1 file changed, 45 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 9ad21c3ffb..047fa9fca5 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; + if ((*list)->generation != GENERATION_NUMBER_INFINITY) + packedDate[0] |= htonl((*list)->generation << 2); + If we stumble upon commit marked as "not in commit-graph" while writing commit graph, it is a BUG(), isn't it? (Problem noticed by Junio.) Since we are computing the values for all commits in the list, this condition is not important and will be removed. It is a bit strange to me that the code uses get_be32 for reading, but htonl for writing. Is Git tested on non little-endian machines, like big-endian ppc64 or s390x, or on mixed-endian machines (or selectable-endian machines with data endianness set to non little-endian, like ia64)? If not, could we use for example openSUSE Build Service (https://build.opensuse.org/) for this? Since we are packing two values into 64 bits, I am using htonl() here to arrange the 30-bit generation number alongside the 34-bit commit date value, then writing with hashwrite(). The other 32-bit integers are written with hashwrite_be32() to avoid translating this data in-memory. packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids) } } +static void compute_generation_numbers(struct commit** commits, + int nr_commits) +{ + int i; + struct commit_list *list = NULL; All right, commit_list will work as stack. + + for (i = 0; i < nr_commits; i++) { + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && + commits[i]->generation != GENERATION_NUMBER_ZERO) + continue; All right, we consider _INFINITY and _SERO as not computed. If generation number is computed (by 'recursion' or from commit graph), we (re)use it. This means that generation number calculation is incremental, as intended -- good. + + commit_list_insert(commits[i], ); Start depth-first walks from commits given. + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; Here all_parents_computed is a boolean flag. I see that it is easier to start with assumption that all parents will have computed generation numbers. + uint32_t max_generation = 0; The generation number value of 0 functions as sentinel; generation numbers start from 1. Not that it matters much, as lowest possible generation number is 1, and we could have started from that value. Except that for a commit with no parents, we want it to receive generation number max_generation + 1 = 1, so this value of 0 is important. + + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_INFINITY || + parent->item->generation == GENERATION_NUMBER_ZERO) { + all_parents_computed =
Re: [PATCH v4 10/10] commit-graph.txt: update design document
On 4/30/2018 7:32 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: We now calculate generation numbers in the commit-graph file and use them in paint_down_to_common(). Expand the section on generation numbers to discuss how the three special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and _MAX interact with other generation numbers. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> Looks good. --- Documentation/technical/commit-graph.txt | 30 +++- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index 0550c6d0dc..d9f2713efa 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite" generation number and walk until reaching commits with known generation number. +We use the macro GENERATION_NUMBER_INFINITY = 0x to mark commits not +in the commit-graph file. If a commit-graph file was written by a version +of Git that did not compute generation numbers, then those commits will +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. + +Since the commit-graph file is closed under reachability, we can guarantee +the following weaker condition on all commits: + +If A and B are commits with generation numbers N amd M, respectively, +and N < M, then A cannot reach B. + +Note how the strict inequality differs from the inequality when we have +fully-computed generation numbers. Using strict inequality may result in +walking a few extra commits, The linux kernel commit graph has maximum of 513 commits sharing the same generation number, but is is 5.43 commits sharing the same generation number on average, with standard deviation 10.70; median is even lower: it is 2, with 5.35 median absolute deviation (MAD). So on average it would be a few extra commits. Right. but the simplicity in dealing with commits +with generation number *_INFINITY or *_ZERO is valuable. As I wrote before, handling those corner cases in more complicated, but not that complicated. We could simply use stronger condition if both generation numbers are ordinary generation numbers, and weaker condition when at least one generation number has one of those special values. + +We use the macro GENERATION_NUMBER_MAX = 0x3FFF to for commits whose +generation numbers are computed to be at least this value. We limit at +this value since it is the largest value that can be stored in the +commit-graph file using the 30 bits available to generation numbers. This +presents another case where a commit can have generation number equal to +that of a parent. Ordinary generation numbers, where stronger condition holds, are those between GENERATION_NUMBER_ZERO < gen(C) < GENERATION_NUMBER_MAX. + Design Details -- @@ -98,17 +121,12 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - Good. - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following - operations are important candidates: + operation is an important candidate: -- paint_down_to_common() - 'log --topo-order' Another possible candidates: - remove_redundant() - see comment in previous patch - still_interesting() - where Git uses date slop to stop walking too far remove_redundant() will be included in v5, thanks. Instead of "still_interesting()" I'll add "git tag --merged" as the candidate to consider, as discussed in [1]. [1] https://public-inbox.org/git/87fu3g67ry@lant.ki.iif.hu/t/#u "branch --contains / tag --merged inconsistency" - Currently, parse_commit_gently() requires filling in the root tree One important issue left is handling features that change view of project history, and their interaction with commit-graph feature. What would happen, if we turn on commit-graph feature, generate commit graph file, and then: * use graft file or remove graft entries to cut history, or remove cut or join two [independent] histories. * use git-replace mechanims to do the same * in shallow clone, deepen or shorten the clone What would happen if without re-generating commit-graph file (assuming tha Git wouldn't do it f
Re: [PATCH v4 09/10] merge: check config before loading commits
On 4/30/2018 6:54 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: Now that we use generation numbers from the commit-graph, we must ensure that all commits that exist in the commit-graph are loaded from that file instead of from the object database. Since the commit-graph file is only checked if core.commitGraph is true, we must check the default config before we load any commits. In the merge builtin, the config was checked after loading the HEAD commit. This was due to the use of the global 'branch' when checking merge-specific config settings. Move the config load to be between the initialization of 'branch' and the commit lookup. Sidenote: I wonder why reading config was postponed to later in the command lifetime... I guess it was to avoid having to read config if HEAD was invalid. The 'branch' does need to be loaded before the call to git_config (as I found out after moving the config call too early), so I suppose it was natural to pair that with resolving head_commit. Without this change, a fast-forward merge would hit a BUG("bad generation skip") statement in commit.c during paint_down_to_common(). This is because the HEAD commit would be loaded with "infinite" generation but then reached by commits with "finite" generation numbers. I guess this is because we avoid re-parsing objects at all costs; we want to avoid re-reading commit graph too. Add a test to t5318-commit-graph.sh that exercises this code path to prevent a regression. I would prefer if this commit was put earlier in the series, to avoid having broken Git (and thus a possibility of problems when bisecting) in between those two commits. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- builtin/merge.c | 7 --- t/t5318-commit-graph.sh | 9 + 2 files changed, 13 insertions(+), 3 deletions(-) diff --git a/builtin/merge.c b/builtin/merge.c index 5e5e4497e3..b819756946 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", ); + + init_diff_ui_defaults(); + git_config(git_merge_config, NULL); + if (!branch || is_null_oid(_oid)) head_commit = NULL; else head_commit = lookup_commit_or_die(_oid, "HEAD"); - init_diff_ui_defaults(); - git_config(git_merge_config, NULL); - Good. if (branch_mergeoptions) parse_branch_merge_options(branch_mergeoptions); argc = parse_options(argc, argv, prefix, builtin_merge_options, diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index a380419b65..77d85aefe7 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' ' graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2 +test_expect_success 'perform fast-forward merge in full repo' ' + cd "$TRASH_DIRECTORY/full" && + git checkout -b merge-5-to-8 commits/5 && + git merge commits/8 && + git show-ref -s merge-5-to-8 >output && + git show-ref -s commits/8 >expect && + test_cmp expect output +' All right. (though I wonder if this tests catches all problems where BUG("bad generation skip") could have been encountered. We will never know until we have this series running in the wild (and even then, some features are very obscure) and enough people turn on the config setting. One goal of the "fsck and gc" series is to get this feature running during the rest of the test suite as much as possible, so we can get additional coverage. Also to get more experience from the community dogfooding the feature. + test_done Best,
Re: [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common()
On 4/30/2018 6:19 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: When running 'git branch --contains', the in_merge_bases_many() method calls paint_down_to_common() to discover if a specific commit is reachable from a set of branches. Commits with lower generation number are not needed to correctly answer the containment query of in_merge_bases_many(). Add a new parameter, min_generation, to paint_down_to_common() that prevents walking commits with generation number strictly less than min_generation. If 0 is given, then there is no functional change. This is thanks to the fact that generation numbers start at zero (as special case, though, with _ZERO), and we use strict inequality to avoid handling _ZERO etc. in a special way. As you wrote in response in previous version of this series, because paint_down_to_common() is file-local, there is no need to come up with symbolic name for GENERATION_NO_CUTOFF case. All right. For in_merge_bases_many(), we can pass commit->generation as the cutoff, and this saves time during 'git branch --contains' queries that would otherwise walk "around" the commit we are inspecting. All right, and when using paint_down_to_common() to actually find merge bases, and not only check for containment, we cannot use cutoff. Therefore at least one call site needs to run it without functional change... which we can do. Good. For a copy of the Linux repository, where HEAD is checked out at v4.13~100, we get the following performance improvement for 'git branch --contains' over the previous commit: Before: 0.21s After: 0.13s Rel %: -38% Nice. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) Let me reorder chunks a bit to make it easier to review. diff --git a/commit.c b/commit.c index 7bb007f56a..e2e16ea1a7 100644 --- a/commit.c +++ b/commit.c @@ -1070,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * if (commit->generation > min_generation) return ret; - bases = paint_down_to_common(commit, nr_reference, reference); + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); if (commit->object.flags & PARENT2) ret = 1; clear_commit_marks(commit, all_flags); @@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue) } /* all input commits in one and twos[] must have been parsed! */ -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) +static struct commit_list *paint_down_to_common(struct commit *one, int n, + struct commit **twos, + int min_generation) { struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; + uint32_t last_gen = GENERATION_NUMBER_INFINITY; one->object.flags |= PARENT1; if (!n) { @@ -831,6 +834,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc struct commit_list *parents; int flags; + if (commit->generation > last_gen) + BUG("bad generation skip"); + last_gen = commit->generation; Shouldn't we provide more information about where the problem is to the user, to make it easier to debug the repository / commit-graph data? Good to have this sanity check here. This BUG() _should_ only be seen by developers who add callers which do not load commits from the commit-graph file. There is a chance that there are cases not covered by this patch and the added tests, though. Hopefully we catch them all by dogfooding the feature before turning it on by default. I can add the following to help debug these bad situations: + BUG("bad generation skip %d > %d at %s", + commit->generation, last_gen, + oid_to_hex(>object.oid)); + + if (commit->generation < min_generation) + break; So the reasoning for this, as far as I understand, is the following. Please correct me if I am wrong. The callsite with non-zero min_generation, in_merge_bases_many(), tries to find out if "commit" is an ancestor of one of the "references". At least one of "references" is above "commit", so in_merge_bases_many() uses paint_down_to_common() - but is interested only if "commit" was painted as reachable from one of "references". Thus we can interrupt the walk if we know that none of [considered] commits in the queue can reach "commit"/"one" - as if they were all STALE. The
Re: [PATCH 0/9] get_short_oid UI improvements
On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote: I started out just wanting to do 04/09 so I'd get prettier output, but then noticed that ^{tag}, ^{commit}< ^{blob} and ^{tree} didn't behave as expected with the disambiguation output, and that core.disambiguate had never been documented. Ævar Arnfjörð Bjarmason (9): sha1-name.c: remove stray newline sha1-array.h: align function arguments sha1-name.c: move around the collect_ambiguous() function get_short_oid: sort ambiguous objects by type, then SHA-1 get_short_oid: learn to disambiguate by ^{tag} get_short_oid: learn to disambiguate by ^{blob} get_short_oid / peel_onion: ^{tree} should mean tree, not treeish get_short_oid / peel_onion: ^{tree} should mean commit, not commitish config doc: document core.disambiguate Documentation/config.txt| 14 ++ cache.h | 5 ++- sha1-array.c| 15 +++ sha1-array.h| 7 ++- sha1-name.c | 69 - t/t1512-rev-parse-disambiguation.sh | 32 ++--- 6 files changed, 120 insertions(+), 22 deletions(-) This is a good series. Please take a look at my suggestion in Patch 4/9, but feel free to keep this series as written. Reviewed-by: Derrick Stolee <dsto...@microsoft.com>
Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1
On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote: Change the output emitted when an ambiguous object is encountered so that we show tags first, then commits, followed by trees, and finally blobs. Within each type we show objects in hashcmp(). Before this change the objects were only ordered by hashcmp(). The reason for doing this is that the output looks better as a result, e.g. the v2.17.0 tag before this change on "git show e8f2" would display: hint: The candidates are: hint: e8f2093055 tree hint: e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached HEAD abbreviated object name hint: e8f21d02f7 blob hint: e8f21d577c blob hint: e8f25a3a50 tree hint: e8f26250fa commit 2017-02-03 - Merge pull request #996 from jeffhostetler/jeffhostetler/register_rename_src hint: e8f2650052 tag v2.17.0 hint: e8f2867228 blob hint: e8f28d537c tree hint: e8f2a35526 blob hint: e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for multiple remote.url entries hint: e8f2cf6ec0 tree Now we'll instead show: hint: e8f2650052 tag v2.17.0 hint: e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached HEAD abbreviated object name hint: e8f26250fa commit 2017-02-03 - Merge pull request #996 from jeffhostetler/jeffhostetler/register_rename_src hint: e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for multiple remote.url entries hint: e8f2093055 tree hint: e8f25a3a50 tree hint: e8f28d537c tree hint: e8f2cf6ec0 tree hint: e8f21d02f7 blob hint: e8f21d577c blob hint: e8f2867228 blob hint: e8f2a35526 blob Since we show the commit data in the output that's nicely aligned once we sort by object type. The decision to show tags before commits is pretty arbitrary, but it's much less likely that we'll display a tag, so if there is one it makes sense to show it first. Here's a non-arbitrary reason: the object types are ordered topologically (ignoring self-references): tag -> commit, tree, blob commit -> tree tree -> blob Signed-off-by: Ævar Arnfjörð Bjarmason--- sha1-array.c | 15 +++ sha1-array.h | 3 +++ sha1-name.c | 37 - 3 files changed, 54 insertions(+), 1 deletion(-) diff --git a/sha1-array.c b/sha1-array.c index 838b3bf847..48bd9e9230 100644 --- a/sha1-array.c +++ b/sha1-array.c @@ -41,6 +41,21 @@ void oid_array_clear(struct oid_array *array) array->sorted = 0; } + +int oid_array_for_each(struct oid_array *array, + for_each_oid_fn fn, + void *data) +{ + int i; + + for (i = 0; i < array->nr; i++) { + int ret = fn(array->oid + i, data); + if (ret) + return ret; + } + return 0; +} + int oid_array_for_each_unique(struct oid_array *array, for_each_oid_fn fn, void *data) diff --git a/sha1-array.h b/sha1-array.h index 1e1d24b009..232bf95017 100644 --- a/sha1-array.h +++ b/sha1-array.h @@ -16,6 +16,9 @@ void oid_array_clear(struct oid_array *array); typedef int (*for_each_oid_fn)(const struct object_id *oid, void *data); +int oid_array_for_each(struct oid_array *array, + for_each_oid_fn fn, + void *data); int oid_array_for_each_unique(struct oid_array *array, for_each_oid_fn fn, void *data); diff --git a/sha1-name.c b/sha1-name.c index 9d7bbd3e96..46d8b1afa6 100644 --- a/sha1-name.c +++ b/sha1-name.c @@ -378,6 +378,34 @@ static int collect_ambiguous(const struct object_id *oid, void *data) return 0; } +static int sort_ambiguous(const void *a, const void *b) +{ + int a_type = oid_object_info(a, NULL); + int b_type = oid_object_info(b, NULL); + int a_type_sort; + int b_type_sort; + + /* +* Sorts by hash within the same object type, just as +* oid_array_for_each_unique() would do. +*/ + if (a_type == b_type) + return oidcmp(a, b); + + /* +* Between object types show tags, then commits, and finally +* trees and blobs. +* +* The object_type enum is commit, tree, blob, tag, but we +* want tag, commit, tree blob. Cleverly (perhaps too +* cleverly) do that with modulus, since the enum assigns 1 to +* commit, so tag becomes 0. +*/ I appreciate this comment. Clever things should be marked as such. + a_type_sort = a_type % 4; + b_type_sort = b_type % 4; + return a_type_sort > b_type_sort ? 1 : -1; +} + static int get_short_oid(const char *name, int len, struct object_id *oid, unsigned flags) { @@ -409,6 +437,8 @@
Re: branch --contains / tag --merged inconsistency
On 4/27/2018 12:03 PM, SZEDER Gábor wrote: Szia Feri, I'm moving the IRC discussion here, because this might be a bug report in the end. So, kindly try these steps (103 MB free space required): $ git clone https://github.com/ClusterLabs/pacemaker.git && cd pacemaker [...] $ git branch --contains Pacemaker-0.6.1 * master $ git tag --merged master | fgrep Pacemaker-0.6 Pacemaker-0.6.0 Pacemaker-0.6.2 Pacemaker-0.6.3 Pacemaker-0.6.4 Pacemaker-0.6.5 Pacemaker-0.6.6 Notice that Pacemaker-0.6.1 is missing from the output. Kind people on IRC didn't find a quick explanation, and we all had to go eventually. Is this expected behavior? Reproduced with git 2.11.0 and 2.17.0. The commit pointed to by the tag Pacemaker-0.6.1 and its parent have a serious clock skew, i.e. they are a few months older than their parents: $ git log --format='%h %ad %cd%d%n%s' --date=short Pacemaker-0.6.1^..47a8ef4c 47a8ef4ce 2008-02-15 2008-02-15 Low: TE: Logging - display the op's magic field for unexpected and foreign events b9cfcd6b4 2007-12-10 2007-12-10 (tag: Pacemaker-0.6.2) haresources2cib.py: set default-action-timeout to the default (20s) 52e7793e0 2007-12-10 2007-12-10 haresources2cib.py: update ra parameters lists dea277271 2008-02-14 2008-02-14 Medium: Build: Turn on snmp support in rpm packages (patch from MATSUDA, Daiki) f418742fe 2008-02-14 2008-02-14 Low: Build: Update the .spec file with the one used by build service ccfa716a5 2008-02-14 2008-02-14 Medium: SNMP: Allow the snmp subagent to be built (patch from MATSUDA, Daiki) 50f0ade2d 2008-02-14 2008-02-14 Low: Build: Update last release number 90f11667f 2008-02-14 2008-02-14 Medium: Tools: Make sure the autoconf variables in haresources2cib are expanded 9d2383c46 2008-02-11 2008-02-11 (tag: Pacemaker-0.6.1) High: cib: Ensure the archived file hits the disk before returning (branch|tag|describe|...) (--contains|--merged) use the commit timestamp information as a heuristic to avoid traversing parts of the history, which makes these operations, especially on big histories, an order of magnitude or two faster. Yeah, commit timestamps can't always be trusted, but skewed commits are rare, and skewed commits with this much skew are even rarer. I'm not sure how (or if it's at all possible) to turn off this timestamp-based optimisation. This is actually a bit more complicated. The "--merged" check in 'git tag' uses a different mechanism to detect which tags are reachable. It uses a revision walk starting at the "merge commit" (master in your case) and all tags with the "limited" option (to ensure the walk happens during prepare_revision_walk()) but marks the merge commit as UNINTERESTING. The limit_list() method stops when all commits are marked UNINTERESTING - minus some "slop" related to the commits that start the walk. One important note: the set of tags is important here. If you add a new tag to the root commit (git tag MyTag a2d71961f) then the walk succeeds by ensuring it walks until MyTag. This gets around the clock skew issue. There may be other more-recent tags with a clock-skew issue, but since Pacemaker-0.6.0 is the oldest tag, that requires the walk to continue until at least that date. The commit-walk machinery in revision.c is rather complicated, and is used for a lot of different reasons, such as "git log" and this application in "git tag". It is on my list to refactor this code to use the commit-graph and generation numbers, but as we can see by this example, it is not easy to tease out what is happening in the code. In a world where generation numbers are expected to be available, we could rewrite do_merge_filter() in ref-filter.c to call into paint_down_to_common() in commit.c using the new "min_generation" marker. By assigning the tags to be in the "twos" list and the merge commit in the "one" commit, we can check if the tags have the PARENT1 flag after the walk in paint_down_to_common(). Due to the static nature of paint_down_to_common(), we will likely want to abstract this into an external method in commit.c, say can_reach_many(struct commit *from, struct commit_list *to). FWIW, much work is being done on a cached commit graph including commit generation numbers, which will solve this issue both correctly and more efficiently. Perhaps it will already be included in the next release. The work in ds/generation-numbers is focused on the "git tag --contains" method, which does return correctly here (it is the reverse of the --merged condition): Which tags can reach Pacemaker-0.6.1? $ git tag --contains Pacemaker-0.6.1 (returns a big list) This is the actual reverse lookup (which branches contain this tag?) $ git branch --contains Pacemaker-0.6.1 | grep master * master These commands work despite clock skew. The commit-graph feature makes them faster. Thanks, -Stolee
Re: [PATCH v4 02/10] commit: add generation number to struct commmit
On 4/28/2018 6:35 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Very minor nitpick: it would be more readable wrapped differently: * If a commit A has parents, then the generation number of A is one more than the maximum generation number among parents of A. Very minor nitpick: possibly "parents", not "the parents", but I am not native English speaker. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use three special values to signal the generation number is invalid: GENERATION_NUMBER_INFINITY 0x GENERATION_NUMBER_MAX 0x3FFF GENERATION_NUMBER_ZERO 0 The first (_INFINITY) means the generation number has not been loaded or computed. The second (_MAX) means the generation number is too large to store in the commit-graph file. The third (_ZERO) means the generation number was loaded from a commit graph file that was written by a version of git that did not support generation numbers. Good explanation; I wonder if we want to have it in some shortened form also in comments, and not only in the commit message. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- alloc.c| 1 + commit-graph.c | 2 ++ commit.h | 4 3 files changed, 7 insertions(+) I have reordered patches to make it easier to review. diff --git a/commit.h b/commit.h index 23a3f364ed..aac3b8c56f 100644 --- a/commit.h +++ b/commit.h @@ -10,6 +10,9 @@ #include "pretty.h" #define COMMIT_NOT_FROM_GRAPH 0x +#define GENERATION_NUMBER_INFINITY 0x +#define GENERATION_NUMBER_MAX 0x3FFF +#define GENERATION_NUMBER_ZERO 0 I wonder if it wouldn't be good to have some short in-line comments explaining those constants, or a block comment above them. struct commit_list { struct commit *item; @@ -30,6 +33,7 @@ struct commit { */ struct tree *maybe_tree; uint32_t graph_pos; + uint32_t generation; }; extern int save_commit_buffer; All right, simple addition of the new field. Nothing to go wrong here. Sidenote: With 0x7FFF being (if I am not wrong) maximum graph_pos and maximum number of nodes in commit graph, we won't hit 0x3FFF generation number limit for all except very, very linear histories. Both of these limits are far away from being realistic. But we could extend the maximum graph_pos independently from the maximum generation number now that we have the "capped" logic. diff --git a/alloc.c b/alloc.c index cf4f8b61e1..e8ab14f4a1 100644 --- a/alloc.c +++ b/alloc.c @@ -94,6 +94,7 @@ void *alloc_commit_node(void) c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); c->graph_pos = COMMIT_NOT_FROM_GRAPH; + c->generation = GENERATION_NUMBER_INFINITY; return c; } All right, start with initializing it with "not from commit-graph" value after allocation. diff --git a/commit-graph.c b/commit-graph.c index 70fa1b25fd..9ad21c3ffb 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin date_low = get_be32(commit_data + g->hash_len + 12); item->date = (timestamp_t)((date_high << 32) | date_low); + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; + I guess we should not worry about these "magical constants" sprinkled here, like "+ 8" above. Let's examine how it goes, taking a look at commit-graph-format.txt in Documentation/technical/commit-graph-format.txt * The first H (g->hash_len) bytes are for the OID of the root tree. * The next 8 bytes are for the positions of the first two parents [...] So 'commit_data + g->hash_len + 8' is our offset from the start of commit data. All right. * The next 8 bytes store the generation number of the commit and the commit time in seconds since EPOCH. The generation number uses the higher 30 bits of the first 4 bytes. [...] The higher 30 bits of the 4 bytes, which is 32 bits, means that we need to shift 32-bit value 2 bits right, so that we get lower 30 bits of 32-bit value. All right. All 4-byte numbers are in network order. Shouldn't it be ntohl() to convert from network order to host order, and not get_be32()? I guess they are the same (network order is big-endian order), and get_be32() is what rest of git uses... ntohl() takes a 32-bit value, while get_be32() takes a pointer. This makes pulling network-bytes out of streams much cleaner with get_be32(), so I try to use that whenever possible.
Re: [PATCH] coccinelle: avoid wrong transformation suggestions from commit.cocci
On 4/30/2018 5:31 AM, SZEDER Gábor wrote: The semantic patch 'contrib/coccinelle/commit.cocci' added in 2e27bd7731 (treewide: replace maybe_tree with accessor methods, 2018-04-06) is supposed to "ensure that all references to the 'maybe_tree' member of struct commit are either mutations or accesses through get_commit_tree()". So get_commit_tree() clearly must be able to directly access the 'maybe_tree' member, and 'commit.cocci' has a bit of a roundabout workaround to ensure that get_commit_tree()'s direct access in its return statement is not transformed: after all references to 'maybe_tree' have been transformed to a call to get_commit_tree(), including the reference in get_commit_tree() itself, the last rule transforms back a 'return get_commit_tree()' statement, back then found only in get_commit_tree() itself, to a direct access. Unfortunately, already the very next commit shows that this workaround is insufficient: 7b8a21dba1 (commit-graph: lazy-load trees for commits, 2018-04-06) extends get_commit_tree() with a condition directly accessing the 'maybe_tree' member, and Coccinelle with 'commit.cocci' promptly detects it and suggests a transformation to avoid it. This transformation is clearly wrong, because calling get_commit_tree() to access 'maybe_tree' _in_ get_commit_tree() would obviously lead to recursion. Furthermore, the same commit added another, more specialized getter function get_commit_tree_in_graph(), whose legitimate direct access to 'maybe_tree' triggers a similar wrong transformation suggestion. Thanks for catching this, Szeder. Sorry for the noise. Exclude both of these getter functions from the general rule in 'commit.cocci' that matches their direct accesses to 'maybe_tree'. Also exclude load_tree_for_commit(), which, as static helper funcion of get_commit_tree_in_graph(), has legitimate direct access to 'maybe_tree' as well. This is an interesting feature of Coccinelle. Happy to learn it. The last rule transforming back 'return get_commit_tree()' statements to direct accesses thus became unnecessary, remove it. Signed-off-by: SZEDER Gábor <szeder@gmail.com> I applied this locally on 'next' and ran the check. I succeeded with no changes. Thanks! Reviewed-by: Derrick Stolee <dsto...@microsoft.com> --- contrib/coccinelle/commit.cocci | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/contrib/coccinelle/commit.cocci b/contrib/coccinelle/commit.cocci index ac38525941..a7e9215ffc 100644 --- a/contrib/coccinelle/commit.cocci +++ b/contrib/coccinelle/commit.cocci @@ -10,11 +10,15 @@ expression c; - c->maybe_tree->object.oid.hash + get_commit_tree_oid(c)->hash +// These excluded functions must access c->maybe_tree direcly. @@ +identifier f !~ "^(get_commit_tree|get_commit_tree_in_graph|load_tree_for_commit)$"; expression c; @@ + f(...) {... - c->maybe_tree + get_commit_tree(c) + ...} @@ expression c; @@ -22,9 +26,3 @@ expression s; @@ - get_commit_tree(c) = s + c->maybe_tree = s - -@@ -expression c; -@@ -- return get_commit_tree(c); -+ return c->maybe_tree;
Re: [PATCH v4 03/10] commit-graph: compute generation numbers
On 4/26/2018 8:58 AM, Derrick Stolee wrote: n 4/25/2018 10:35 PM, Junio C Hamano wrote: Derrick Stolee <dsto...@microsoft.com> writes: @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; + if ((*list)->generation != GENERATION_NUMBER_INFINITY) + packedDate[0] |= htonl((*list)->generation << 2); + packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); The ones that have infinity are written as zero here. The code that reads the generation field off of a file in fill_commit_graph_info() and fill_commit_in_graph() both leave such a record in file as-is, so the reader of what we write out will think it is _ZERO, not _INF. Not that it matters, as it seems that most of the code being added by this series treat _ZERO and _INF more or less interchangeably. But it does raise another question, i.e. do we need both _ZERO and _INF, or is it sufficient to have just a single _UNKNOWN? This code is confusing. The 'if' condition is useless, since at this point every commit should be finite (since we computed generation numbers for everyone). We should just write the value always. For the sake of discussion, the value _INFINITY means not in the graph and _ZERO means in the graph without a computed generation number. It's a small distinction, but it gives a single boundary to use for reachability queries that test generation number. @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids) } } +static void compute_generation_numbers(struct commit** commits, + int nr_commits) +{ + int i; + struct commit_list *list = NULL; + + for (i = 0; i < nr_commits; i++) { + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && + commits[i]->generation != GENERATION_NUMBER_ZERO) + continue; + + commit_list_insert(commits[i], ); + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; + uint32_t max_generation = 0; + + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_INFINITY || + parent->item->generation == GENERATION_NUMBER_ZERO) { + all_parents_computed = 0; + commit_list_insert(parent->item, ); + break; + } else if (parent->item->generation > max_generation) { + max_generation = parent->item->generation; + } + } + + if (all_parents_computed) { + current->generation = max_generation + 1; + pop_commit(); + } If we haven't computed all parents' generations yet, current->generation is undefined (or at least "left as initialized"), so it does not make much sense to attempt to clip it at _MAX at this point. At leat not yet. IOW, shouldn't the following two lines be inside the "we now know genno of all parents, so we can compute genno for commit" block above? You're right! Good catch. This code sets every merge commit to _MAX. It should be in the block above. + if (current->generation > GENERATION_NUMBER_MAX) + current->generation = GENERATION_NUMBER_MAX; + } + } This bothered me: why didn't I catch a bug here? I rebased my "fsck" RFC onto this branch and it succeeded. Then, I realized that this does not actually write incorrect values, since we re-visit this commit again after we pop the stack down to this commit. However, there is time in the middle where we have set the generation (in memory) incorrectly and that could easily turn into a real bug by a later change. I'll stick the _MAX check in the if above to prevent confusion. Thanks, -Stolee
Re: [PATCH v4 03/10] commit-graph: compute generation numbers
n 4/25/2018 10:35 PM, Junio C Hamano wrote: Derrick Stolee <dsto...@microsoft.com> writes: @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; + if ((*list)->generation != GENERATION_NUMBER_INFINITY) + packedDate[0] |= htonl((*list)->generation << 2); + packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); The ones that have infinity are written as zero here. The code that reads the generation field off of a file in fill_commit_graph_info() and fill_commit_in_graph() both leave such a record in file as-is, so the reader of what we write out will think it is _ZERO, not _INF. Not that it matters, as it seems that most of the code being added by this series treat _ZERO and _INF more or less interchangeably. But it does raise another question, i.e. do we need both _ZERO and _INF, or is it sufficient to have just a single _UNKNOWN? This code is confusing. The 'if' condition is useless, since at this point every commit should be finite (since we computed generation numbers for everyone). We should just write the value always. For the sake of discussion, the value _INFINITY means not in the graph and _ZERO means in the graph without a computed generation number. It's a small distinction, but it gives a single boundary to use for reachability queries that test generation number. @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids) } } +static void compute_generation_numbers(struct commit** commits, + int nr_commits) +{ + int i; + struct commit_list *list = NULL; + + for (i = 0; i < nr_commits; i++) { + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && + commits[i]->generation != GENERATION_NUMBER_ZERO) + continue; + + commit_list_insert(commits[i], ); + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; + uint32_t max_generation = 0; + + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_INFINITY || + parent->item->generation == GENERATION_NUMBER_ZERO) { + all_parents_computed = 0; + commit_list_insert(parent->item, ); + break; + } else if (parent->item->generation > max_generation) { + max_generation = parent->item->generation; + } + } + + if (all_parents_computed) { + current->generation = max_generation + 1; + pop_commit(); + } If we haven't computed all parents' generations yet, current->generation is undefined (or at least "left as initialized"), so it does not make much sense to attempt to clip it at _MAX at this point. At leat not yet. IOW, shouldn't the following two lines be inside the "we now know genno of all parents, so we can compute genno for commit" block above? You're right! Good catch. This code sets every merge commit to _MAX. It should be in the block above. + if (current->generation > GENERATION_NUMBER_MAX) + current->generation = GENERATION_NUMBER_MAX; + } + } +} + void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, @@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir, if (commits.nr >= GRAPH_PARENT_MISSING) die(_("too many commits to write graph")); + compute_generation_numbers(commits.list, commits.nr); + graph_name = get_commit_graph_filename(obj_dir); fd = hold_lock_file_for_update(, graph_name, 0);
Re: What's cooking in git.git (Apr 2018, #03; Wed, 25)
On 4/25/2018 1:43 PM, Brandon Williams wrote: On 04/25, Ævar Arnfjörð Bjarmason wrote: * bw/protocol-v2 (2018-03-15) 35 commits (merged to 'next' on 2018-04-11 at 23ee234a2c) + remote-curl: don't request v2 when pushing + remote-curl: implement stateless-connect command + http: eliminate "# service" line when using protocol v2 + http: don't always add Git-Protocol header + http: allow providing extra headers for http requests + remote-curl: store the protocol version the server responded with + remote-curl: create copy of the service name + pkt-line: add packet_buf_write_len function + transport-helper: introduce stateless-connect + transport-helper: refactor process_connect_service + transport-helper: remove name parameter + connect: don't request v2 when pushing + connect: refactor git_connect to only get the protocol version once + fetch-pack: support shallow requests + fetch-pack: perform a fetch using v2 + upload-pack: introduce fetch server command + push: pass ref prefixes when pushing + fetch: pass ref prefixes when fetching + ls-remote: pass ref prefixes when requesting a remote's refs + transport: convert transport_get_remote_refs to take a list of ref prefixes + transport: convert get_refs_list to take a list of ref prefixes + connect: request remote refs using v2 + ls-refs: introduce ls-refs server command + serve: introduce git-serve + test-pkt-line: introduce a packet-line test helper + protocol: introduce enum protocol_version value protocol_v2 + transport: store protocol version + connect: discover protocol version outside of get_remote_heads + connect: convert get_remote_heads to use struct packet_reader + transport: use get_refs_via_connect to get refs + upload-pack: factor out processing lines + upload-pack: convert to a builtin + pkt-line: add delim packet support + pkt-line: allow peeking a packet line without consuming it + pkt-line: introduce packet_read_with_status (this branch is used by bw/server-options.) The beginning of the next-gen transfer protocol. Will cook in 'next'. With a month & 10 days of no updates & this looking stable it would be great to have it in master sooner than later to build on top of it in the 2.18 window. I personally think that this series is ready to graduate to master but I mentioned to Junio off-list that it might be a good idea to wait until it has undergone a little more stress testing. We've been in the process of trying to get this rolled out to our internal server but due to a few roadblocks and people being out of the office its taken a bit longer than I would have liked to get it up and running. I'm hoping in another few days/a week I'll have some more data on live traffic. At that point I think I'll be more convinced that it will be safe to merge it. I may be overly cautions but I'm hoping that we can get this right without needing to do another protocol version bump for a very long time. Technically using v2 is hidden behind an "experimental" config flag so we should still be able to tweak it after the fact if we absolutely needed to, but I'd like to avoid that if necessary. There's no testing better than production. I think if we have an opportunity to test with realistic traffic, we should take advantage. But I also agree that this series looks stable. I realize it's hard to communicate both "this series is ready to merge" and "I appreciate your caution." Thanks, -Stolee
Re: [PATCH v4 00/10] Compute and consume generation numbers
As promised, here is the diff from v3. Thanks, -Stolee -- >8 -- diff --git a/builtin/merge.c b/builtin/merge.c index 7e1da6c6ea..b819756946 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1148,6 +1148,7 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", ); + init_diff_ui_defaults(); git_config(git_merge_config, NULL); @@ -1156,7 +1157,6 @@ int cmd_merge(int argc, const char **argv, const char *prefix) else head_commit = lookup_commit_or_die(_oid, "HEAD"); - if (branch_mergeoptions) parse_branch_merge_options(branch_mergeoptions); argc = parse_options(argc, argv, prefix, builtin_merge_options, diff --git a/commit-graph.c b/commit-graph.c index 21e853c21a..aebd242def 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -257,7 +257,7 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin uint32_t *parent_data_ptr; uint64_t date_low, date_high; struct commit_list **pptr; - const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; item->object.parsed = 1; item->graph_pos = pos; @@ -304,7 +304,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin *pos = item->graph_pos; return 1; } else { - return bsearch_graph(commit_graph, &(item->object.oid), pos); + return bsearch_graph(g, &(item->object.oid), pos); } } @@ -312,10 +312,10 @@ int parse_commit_in_graph(struct commit *item) { uint32_t pos; - if (item->object.parsed) - return 0; if (!core_commit_graph) return 0; + if (item->object.parsed) + return 1; prepare_commit_graph(); if (commit_graph && find_commit_in_graph(item, commit_graph, )) return fill_commit_in_graph(item, commit_graph, pos); @@ -454,9 +454,8 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; - if ((*list)->generation != GENERATION_NUMBER_INFINITY) { + if ((*list)->generation != GENERATION_NUMBER_INFINITY) packedDate[0] |= htonl((*list)->generation << 2); - } packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); diff --git a/commit.c b/commit.c index 9ef6f699bd..e2e16ea1a7 100644 --- a/commit.c +++ b/commit.c @@ -653,7 +653,7 @@ int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void else if (a->generation > b->generation) return -1; - /* use date as a heuristic when generataions are equal */ + /* use date as a heuristic when generations are equal */ if (a->date < b->date) return 1; else if (a->date > b->date) @@ -1078,7 +1078,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * } if (commit->generation > min_generation) - return 0; + return ret; bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); if (commit->object.flags & PARENT2) diff --git a/ref-filter.c b/ref-filter.c index e2fea6d635..fb35067fc9 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -16,6 +16,7 @@ #include "trailer.h" #include "wt-status.h" #include "commit-slab.h" +#include "commit-graph.h" static struct ref_msg { const char *gone; @@ -1582,7 +1583,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) } /* - * Test whether the candidate or one of its parents is contained in the list. + * Test whether the candidate is contained in the list. * Do not recurse to find out, though, but return -1 if inconclusive. */ static enum contains_result contains_test(struct commit *candidate, @@ -1629,7 +1630,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, for (p = want; p; p = p->next) { struct commit *c = p->item; - parse_commit_or_die(c); + load_commit_graph_info(c); if (c->generation < cutoff) cutoff = c->generation; }
[PATCH v4 09/10] merge: check config before loading commits
Now that we use generation numbers from the commit-graph, we must ensure that all commits that exist in the commit-graph are loaded from that file instead of from the object database. Since the commit-graph file is only checked if core.commitGraph is true, we must check the default config before we load any commits. In the merge builtin, the config was checked after loading the HEAD commit. This was due to the use of the global 'branch' when checking merge-specific config settings. Move the config load to be between the initialization of 'branch' and the commit lookup. Without this change, a fast-forward merge would hit a BUG("bad generation skip") statement in commit.c during paint_down_to_common(). This is because the HEAD commit would be loaded with "infinite" generation but then reached by commits with "finite" generation numbers. Add a test to t5318-commit-graph.sh that exercises this code path to prevent a regression. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- builtin/merge.c | 7 --- t/t5318-commit-graph.sh | 9 + 2 files changed, 13 insertions(+), 3 deletions(-) diff --git a/builtin/merge.c b/builtin/merge.c index 5e5e4497e3..b819756946 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", ); + + init_diff_ui_defaults(); + git_config(git_merge_config, NULL); + if (!branch || is_null_oid(_oid)) head_commit = NULL; else head_commit = lookup_commit_or_die(_oid, "HEAD"); - init_diff_ui_defaults(); - git_config(git_merge_config, NULL); - if (branch_mergeoptions) parse_branch_merge_options(branch_mergeoptions); argc = parse_options(argc, argv, prefix, builtin_merge_options, diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index a380419b65..77d85aefe7 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' ' graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2 +test_expect_success 'perform fast-forward merge in full repo' ' + cd "$TRASH_DIRECTORY/full" && + git checkout -b merge-5-to-8 commits/5 && + git merge commits/8 && + git show-ref -s merge-5-to-8 >output && + git show-ref -s commits/8 >expect && + test_cmp expect output +' + test_done -- 2.17.0.39.g685157f7fb
[PATCH v4 10/10] commit-graph.txt: update design document
We now calculate generation numbers in the commit-graph file and use them in paint_down_to_common(). Expand the section on generation numbers to discuss how the three special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and _MAX interact with other generation numbers. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph.txt | 30 +++- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index 0550c6d0dc..d9f2713efa 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite" generation number and walk until reaching commits with known generation number. +We use the macro GENERATION_NUMBER_INFINITY = 0x to mark commits not +in the commit-graph file. If a commit-graph file was written by a version +of Git that did not compute generation numbers, then those commits will +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. + +Since the commit-graph file is closed under reachability, we can guarantee +the following weaker condition on all commits: + +If A and B are commits with generation numbers N amd M, respectively, +and N < M, then A cannot reach B. + +Note how the strict inequality differs from the inequality when we have +fully-computed generation numbers. Using strict inequality may result in +walking a few extra commits, but the simplicity in dealing with commits +with generation number *_INFINITY or *_ZERO is valuable. + +We use the macro GENERATION_NUMBER_MAX = 0x3FFF to for commits whose +generation numbers are computed to be at least this value. We limit at +this value since it is the largest value that can be stored in the +commit-graph file using the 30 bits available to generation numbers. This +presents another case where a commit can have generation number equal to +that of a parent. + Design Details -- @@ -98,17 +121,12 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following - operations are important candidates: + operation is an important candidate: -- paint_down_to_common() - 'log --topo-order' - Currently, parse_commit_gently() requires filling in the root tree -- 2.17.0.39.g685157f7fb
[PATCH v4 04/10] commit: use generations in paint_down_to_common()
Define compare_commits_by_gen_then_commit_date(), which uses generation numbers as a primary comparison and commit date to break ties (or as a comparison when both commits do not have computed generation numbers). Since the commit-graph file is closed under reachability, we know that all commits in the file have generation at most GENERATION_NUMBER_MAX which is less than GENERATION_NUMBER_INFINITY. This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. In the case that commit dates violate topological order (i.e. a parent is "newer" than a child), the previous code could walk a commit twice: if a commit is reached with the PARENT1 bit, but later is re-visited with the PARENT2 bit, then that PARENT2 bit must be propagated to its parents. Using generation numbers avoids this extra effort, even if it is somewhat rare. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 20 +++- commit.h | 1 + 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 711f674c18..4d00b0a1d6 100644 --- a/commit.c +++ b/commit.c @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) +{ + const struct commit *a = a_, *b = b_; + + /* newer commits first */ + if (a->generation < b->generation) + return 1; + else if (a->generation > b->generation) + return -1; + + /* use date as a heuristic when generations are equal */ + if (a->date < b->date) + return 1; + else if (a->date > b->date) + return -1; + return 0; +} + int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) /* all input commits in one and twos[] must have been parsed! */ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) { - struct prio_queue queue = { compare_commits_by_commit_date }; + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; diff --git a/commit.h b/commit.h index aac3b8c56f..64436ff44e 100644 --- a/commit.h +++ b/commit.h @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); LAST_ARG_MUST_BE_NULL extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); -- 2.17.0.39.g685157f7fb
[PATCH v4 07/10] commit: use generation numbers for in_merge_bases()
The containment algorithm for 'git branch --contains' is different from that for 'git tag --contains' in that it uses is_descendant_of() instead of contains_tag_algo(). The expensive portion of the branch algorithm is computing merge bases. When a commit-graph file exists with generation numbers computed, we can avoid this merge-base calculation when the target commit has a larger generation number than the initial commits. Performance tests were run on a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag. Also, all tags were copied to branches and 'git branch --contains' was tested: Before: 60.0s After: 0.4s Rel %: -99.3% Reported-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 39a3749abd..7bb007f56a 100644 --- a/commit.c +++ b/commit.c @@ -1056,12 +1056,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * { struct commit_list *bases; int ret = 0, i; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; if (parse_commit(commit)) return ret; - for (i = 0; i < nr_reference; i++) + for (i = 0; i < nr_reference; i++) { if (parse_commit(reference[i])) return ret; + if (min_generation > reference[i]->generation) + min_generation = reference[i]->generation; + } + + if (commit->generation > min_generation) + return ret; bases = paint_down_to_common(commit, nr_reference, reference); if (commit->object.flags & PARENT2) -- 2.17.0.39.g685157f7fb
[PATCH v4 05/10] commit-graph: always load commit-graph information
Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). With generation numbers in the commit-graph, we need to ensure that any commit that exists in the commit-graph file has its generation number loaded. Create new load_commit_graph_info() method to fill in the information for a commit that exists only in the commit-graph file. Call it from parse_commit_buffer() after loading the other commit information from the given buffer. Only fill this information when specified by the 'check_graph' parameter. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 45 ++--- commit-graph.h | 8 commit.c | 7 +-- commit.h | 2 +- object.c | 2 +- sha1_file.c| 2 +- 6 files changed, 46 insertions(+), 20 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 047fa9fca5..aebd242def 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -245,6 +245,12 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, return _list_insert(c, pptr)->next; } +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; +} + static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) { uint32_t edge_value; @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin return 1; } +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) +{ + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { + *pos = item->graph_pos; + return 1; + } else { + return bsearch_graph(g, &(item->object.oid), pos); + } +} + int parse_commit_in_graph(struct commit *item) { + uint32_t pos; + if (!core_commit_graph) return 0; if (item->object.parsed) return 1; - prepare_commit_graph(); - if (commit_graph) { - uint32_t pos; - int found; - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { - pos = item->graph_pos; - found = 1; - } else { - found = bsearch_graph(commit_graph, &(item->object.oid), ); - } - - if (found) - return fill_commit_in_graph(item, commit_graph, pos); - } - + if (commit_graph && find_commit_in_graph(item, commit_graph, )) + return fill_commit_in_graph(item, commit_graph, pos); return 0; } +void load_commit_graph_info(struct commit *item) +{ + uint32_t pos; + if (!core_commit_graph) + return; + prepare_commit_graph(); + if (commit_graph && find_commit_in_graph(item, commit_graph, )) + fill_commit_graph_info(item, commit_graph, pos); +} + static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) { struct object_id oid; diff --git a/commit-graph.h b/commit-graph.h index 260a468e73..96cccb10f3 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); */ int parse_commit_in_graph(struct commit *item); +/* + * It is possible that we loaded commit contents from the commit buffer, + * but we also want to ensure the commit-graph content is correctly + * checked and filled. Fill the graph_pos and generation members of + * the given commit. + */ +void load_commit_graph_info(struct commit *item); + struct tree *get_commit_tree_in_graph(const struct commit *c); struct commit_graph { diff --git a/commit.c b/commit.c index 4d00b0a1d6..39a3749abd 100644 --- a/commit.c +++ b/commit.c @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) return ret; } -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size) +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph) { const char *tail = buffer; const char *bufptr = buffer; @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s } item->date = parse_commit_date(bufptr, tail); + if (check_graph) + load_commit_graph_info(item); + return 0; } @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return error("
[PATCH v4 08/10] commit: add short-circuit to paint_down_to_common()
When running 'git branch --contains', the in_merge_bases_many() method calls paint_down_to_common() to discover if a specific commit is reachable from a set of branches. Commits with lower generation number are not needed to correctly answer the containment query of in_merge_bases_many(). Add a new parameter, min_generation, to paint_down_to_common() that prevents walking commits with generation number strictly less than min_generation. If 0 is given, then there is no functional change. For in_merge_bases_many(), we can pass commit->generation as the cutoff, and this saves time during 'git branch --contains' queries that would otherwise walk "around" the commit we are inspecting. For a copy of the Linux repository, where HEAD is checked out at v4.13~100, we get the following performance improvement for 'git branch --contains' over the previous commit: Before: 0.21s After: 0.13s Rel %: -38% Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/commit.c b/commit.c index 7bb007f56a..e2e16ea1a7 100644 --- a/commit.c +++ b/commit.c @@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue) } /* all input commits in one and twos[] must have been parsed! */ -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) +static struct commit_list *paint_down_to_common(struct commit *one, int n, + struct commit **twos, + int min_generation) { struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; + uint32_t last_gen = GENERATION_NUMBER_INFINITY; one->object.flags |= PARENT1; if (!n) { @@ -831,6 +834,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc struct commit_list *parents; int flags; + if (commit->generation > last_gen) + BUG("bad generation skip"); + last_gen = commit->generation; + + if (commit->generation < min_generation) + break; + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -879,7 +889,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co return NULL; } - list = paint_down_to_common(one, n, twos); + list = paint_down_to_common(one, n, twos, 0); while (list) { struct commit *commit = pop_commit(); @@ -946,7 +956,7 @@ static int remove_redundant(struct commit **array, int cnt) filled_index[filled] = j; work[filled++] = array[j]; } - common = paint_down_to_common(array[i], filled, work); + common = paint_down_to_common(array[i], filled, work, 0); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) @@ -1070,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * if (commit->generation > min_generation) return ret; - bases = paint_down_to_common(commit, nr_reference, reference); + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); if (commit->object.flags & PARENT2) ret = 1; clear_commit_marks(commit, all_flags); -- 2.17.0.39.g685157f7fb
[PATCH v4 06/10] ref-filter: use generation number for --contains
A commit A can reach a commit B only if the generation number of A is strictly larger than the generation number of B. This condition allows significantly short-circuiting commit-graph walks. Use generation number for '--contains' type queries. On a copy of the Linux repository where HEAD is containd in v4.13 but no earlier tag, the command 'git tag --contains HEAD' had the following peformance improvement: Before: 0.81s After: 0.04s Rel %: -95% Helped-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- ref-filter.c | 24 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/ref-filter.c b/ref-filter.c index aff24d93be..fb35067fc9 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -16,6 +16,7 @@ #include "trailer.h" #include "wt-status.h" #include "commit-slab.h" +#include "commit-graph.h" static struct ref_msg { const char *gone; @@ -1587,7 +1588,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1603,6 +1605,10 @@ static enum contains_result contains_test(struct commit *candidate, /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation < cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1618,8 +1624,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_INFINITY; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + load_commit_graph_info(c); + if (c->generation < cutoff) + cutoff = c->generation; + } + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1637,7 +1653,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. */ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1651,7 +1667,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit, -- 2.17.0.39.g685157f7fb
[PATCH v4 02/10] commit: add generation number to struct commmit
The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use three special values to signal the generation number is invalid: GENERATION_NUMBER_INFINITY 0x GENERATION_NUMBER_MAX 0x3FFF GENERATION_NUMBER_ZERO 0 The first (_INFINITY) means the generation number has not been loaded or computed. The second (_MAX) means the generation number is too large to store in the commit-graph file. The third (_ZERO) means the generation number was loaded from a commit graph file that was written by a version of git that did not support generation numbers. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- alloc.c| 1 + commit-graph.c | 2 ++ commit.h | 4 3 files changed, 7 insertions(+) diff --git a/alloc.c b/alloc.c index cf4f8b61e1..e8ab14f4a1 100644 --- a/alloc.c +++ b/alloc.c @@ -94,6 +94,7 @@ void *alloc_commit_node(void) c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); c->graph_pos = COMMIT_NOT_FROM_GRAPH; + c->generation = GENERATION_NUMBER_INFINITY; return c; } diff --git a/commit-graph.c b/commit-graph.c index 70fa1b25fd..9ad21c3ffb 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin date_low = get_be32(commit_data + g->hash_len + 12); item->date = (timestamp_t)((date_high << 32) | date_low); + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; + pptr = >parents; edge_value = get_be32(commit_data + g->hash_len); diff --git a/commit.h b/commit.h index 23a3f364ed..aac3b8c56f 100644 --- a/commit.h +++ b/commit.h @@ -10,6 +10,9 @@ #include "pretty.h" #define COMMIT_NOT_FROM_GRAPH 0x +#define GENERATION_NUMBER_INFINITY 0x +#define GENERATION_NUMBER_MAX 0x3FFF +#define GENERATION_NUMBER_ZERO 0 struct commit_list { struct commit *item; @@ -30,6 +33,7 @@ struct commit { */ struct tree *maybe_tree; uint32_t graph_pos; + uint32_t generation; }; extern int save_commit_buffer; -- 2.17.0.39.g685157f7fb
[PATCH v4 00/10] Compute and consume generation numbers
Thanks for the feedback on the previous version. I think this series is stabilizing nicely. I'll reply to this message with an inter-diff as it is not too large to share but would clutter this cover letter. Thanks, -Stolee -- >8 -- This is the one of several "small" patches that follow the serialized Git commit graph patch (ds/commit-graph) and lazy-loading trees (ds/lazy-load-trees). As described in Documentation/technical/commit-graph.txt, the generation number of a commit is one more than the maximum generation number among its parents (trivially, a commit with no parents has generation number one). This section is expanded to describe the interaction with special generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph file) and *_ZERO (commits in a commit-graph file written before generation numbers were implemented). This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a short-circuit mechanism to improve performance of `git branch --contains`. Further, use generation numbers for 'git tag --contains), providing a significant speedup (at least 95% for some cases). A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. This patch series is built on ds/lazy-load-trees. Derrick Stolee (10): ref-filter: fix outdated comment on in_commit_list commit: add generation number to struct commmit commit-graph: compute generation numbers commit: use generations in paint_down_to_common() commit-graph: always load commit-graph information ref-filter: use generation number for --contains commit: use generation numbers for in_merge_bases() commit: add short-circuit to paint_down_to_common() merge: check config before loading commits commit-graph.txt: update design document Documentation/technical/commit-graph.txt | 30 ++-- alloc.c | 1 + builtin/merge.c | 7 +- commit-graph.c | 92 commit-graph.h | 8 +++ commit.c | 54 +++--- commit.h | 7 +- object.c | 2 +- ref-filter.c | 26 +-- sha1_file.c | 2 +- t/t5318-commit-graph.sh | 9 +++ 11 files changed, 198 insertions(+), 40 deletions(-) base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707 -- 2.17.0.39.g685157f7fb
[PATCH v4 03/10] commit-graph: compute generation numbers
While preparing commits to be written into a commit-graph file, compute the generation numbers using a depth-first strategy. The only commits that are walked in this depth-first search are those without a precomputed generation number. Thus, computation time will be relative to the number of new commits to the commit-graph file. If a computed generation number would exceed GENERATION_NUMBER_MAX, then use GENERATION_NUMBER_MAX instead. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 45 + 1 file changed, 45 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 9ad21c3ffb..047fa9fca5 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; + if ((*list)->generation != GENERATION_NUMBER_INFINITY) + packedDate[0] |= htonl((*list)->generation << 2); + packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids) } } +static void compute_generation_numbers(struct commit** commits, + int nr_commits) +{ + int i; + struct commit_list *list = NULL; + + for (i = 0; i < nr_commits; i++) { + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && + commits[i]->generation != GENERATION_NUMBER_ZERO) + continue; + + commit_list_insert(commits[i], ); + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; + uint32_t max_generation = 0; + + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_INFINITY || + parent->item->generation == GENERATION_NUMBER_ZERO) { + all_parents_computed = 0; + commit_list_insert(parent->item, ); + break; + } else if (parent->item->generation > max_generation) { + max_generation = parent->item->generation; + } + } + + if (all_parents_computed) { + current->generation = max_generation + 1; + pop_commit(); + } + + if (current->generation > GENERATION_NUMBER_MAX) + current->generation = GENERATION_NUMBER_MAX; + } + } +} + void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, @@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir, if (commits.nr >= GRAPH_PARENT_MISSING) die(_("too many commits to write graph")); + compute_generation_numbers(commits.list, commits.nr); + graph_name = get_commit_graph_filename(obj_dir); fd = hold_lock_file_for_update(, graph_name, 0); -- 2.17.0.39.g685157f7fb
[PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list
The in_commit_list() method does not check the parents of the candidate for containment in the list. Fix the comment that incorrectly states that it does. Reported-by: Jakub Narebski <jna...@gmail.com> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- ref-filter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ref-filter.c b/ref-filter.c index cffd8bf3ce..aff24d93be 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1582,7 +1582,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) } /* - * Test whether the candidate or one of its parents is contained in the list. + * Test whether the candidate is contained in the list. * Do not recurse to find out, though, but return -1 if inconclusive. */ static enum contains_result contains_test(struct commit *candidate, -- 2.17.0.39.g685157f7fb
Re: [PATCH v3 5/9] ref-filter: use generation number for --contains
On 4/24/2018 2:56 PM, Jakub Narebski wrote: Derrick Stolee <sto...@gmail.com> writes: One way to fix this is to call 'prepare_commit_graph()' directly and then test that 'commit_graph' is non-null before performing any parses. I'm not thrilled with how that couples the commit-graph implementation to this feature, but that may be necessary to avoid regressions in the non-commit-graph case. Another possible solution (not sure if better or worse) would be to change the signature of contains_tag_algo() function to take parameter or flag that would decide whether to parse "wants". If I reorder commits so "commit-graph:always load commit-graph information" is before this one, then we can call load_commit_graph_info() which just fills the generation and graph_pos information. This will keep the coupling very light, instead of needing to call prepare_commit_graph() or checking the config setting. Thanks, -Stolee
Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
On 4/23/2018 5:38 PM, Jakub Narebski wrote: Derrick Stolee <sto...@gmail.com> writes: On 4/18/2018 7:19 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: [...] [...], and this saves time during 'git branch --contains' queries that would otherwise walk "around" the commit we are inspecting. If I understand the code properly, what happens is that we can now short-circuit if all commits that are left are lower than the target commit. This is because max-order priority queue is used: if the commit with maximum generation number is below generation number of target commit, then target commit is not reachable from any commit in the priority queue (all of which has generation number less or equal than the commit at head of queue, i.e. all are same level or deeper); compare what I have written in [1] [1]: https://public-inbox.org/git/866052dkju@gmail.com/ Do I have that right? If so, it looks all right to me. Yes, the priority queue needs to compare via generation number first or there will be errors. This is why we could not use commit time before. I was more concerned about getting right the order in the priority queue (does it return minimal or maximal generation number). I understand that the cutoff could not be used without generation numbers because of the possibility of clock skew - using cutoff on dates could lead to wrong results. Maximal generation number is important so we do not visit commits multiple times (say, once with PARENT1 set, and a second time when PARENT2 is set). A minimal generation number order would create a DFS order and walk until the cutoff every time. In cases without clock skew, maximal generation number order will walk the same set of commits as maximal commit time; the order may differ, but only between incomparable commits. For a copy of the Linux repository, where HEAD is checked out at v4.13~100, we get the following performance improvement for 'git branch --contains' over the previous commit: Before: 0.21s After: 0.13s Rel %: -38% [...] flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co return NULL; } -list = paint_down_to_common(one, n, twos); + list = paint_down_to_common(one, n, twos, 0); while (list) { struct commit *commit = pop_commit(); @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt) filled_index[filled] = j; work[filled++] = array[j]; } - common = paint_down_to_common(array[i], filled, work); + common = paint_down_to_common(array[i], filled, work, 0); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * if (commit->generation > min_generation) return 0; -bases = paint_down_to_common(commit, nr_reference, reference); + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); Is it the only case where we would call paint_down_to_common() with non-zero last parameter? Would we always use commit->generation where commit is the first parameter of paint_down_to_common()? If both are true and will remain true, then in my humble opinion it is not necessary to change the signature of this function. We need to change the signature some way, but maybe the way I chose is not the best. No, after taking longer I think the new signature is a good choice. To elaborate: paint_down_to_common() is used for multiple purposes. The caller here that supplies 'commit->generation' is used only to compute reachability (by testing if the flag PARENT2 exists on the commit, then clears all flags). The other callers expect the full walk down to the common commits, and keeps those PARENT1, PARENT2, and STALE flags for future use (such as reporting merge bases). Usually the call to paint_down_to_common() is followed by a revision walk that only halts when reaching root commits or commits with both PARENT1 and PARENT2 flags on, so always short-circuiting on generations would break the functionality; this is confirmed by the t5318-commit-graph.sh. Right. I have realized that just after sending the email. I'm sorry about this. An alternative to the signature change is to add a boolean parameter "use_cutoff" or something, that specifies "don't walk beyond the commit". This may give a more of a clear description of what it will do with the generation value, but since we are
Re: [PATCH v3 0/9] Compute and consume generation numbers
On 4/18/2018 8:04 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: -- >8 -- This is the one of several "small" patches that follow the serialized Git commit graph patch (ds/commit-graph) and lazy-loading trees (ds/lazy-load-trees). As described in Documentation/technical/commit-graph.txt, the generation number of a commit is one more than the maximum generation number among its parents (trivially, a commit with no parents has generation number one). This section is expanded to describe the interaction with special generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph file) and *_ZERO (commits in a commit-graph file written before generation numbers were implemented). This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a short-circuit mechanism to improve performance of `git branch --contains`. Further, use generation numbers for 'git tag --contains', providing a significant speedup (at least 95% for some cases). A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. This patch series is build on ds/lazy-load-trees. Derrick Stolee (9): commit: add generation number to struct commmit Nice and short patch. Looks good to me. commit-graph: compute generation numbers Another quite easy to understand patch. LGTM. commit: use generations in paint_down_to_common() Nice and short patch; minor typo in comment in code. Otherwise it looks good to me. commit-graph.txt: update design document I see that diagram got removed in this version; maybe it could be replaced with relationship table? Anyway, it looks good to me. The diagrams and tables seemed to cause more confusion than clarity. I think the reader should create their own mental model from the definitions and description and we should avoid trying to make a summary. ref-filter: use generation number for --contains A question: how performance looks like after the change if commit-graph is not available? The performance issue is minor, but will be fixed in v4. commit: use generation numbers for in_merge_bases() Possible typo in the commit message, and stylistic inconsistence in in_merge_bases() - though actually more clear than existing code. Short, simple, and gives good performance improvenebts. commit: add short-circuit to paint_down_to_common() Looks good to me; ignore [mostly] what I have written in response to the patch in question. commit-graph: always load commit-graph information Looks all right; question: parameter or one more global variable. I responded to say that the global variable approach is incorrect. Parameter is important to functionality and performance. merge: check config before loading commits This looks good to me. Documentation/technical/commit-graph.txt | 30 +-- alloc.c | 1 + builtin/merge.c | 5 +- commit-graph.c | 99 +++- commit-graph.h | 8 ++ commit.c | 54 +++-- commit.h | 7 +- object.c | 2 +- ref-filter.c | 23 +- sha1_file.c | 2 +- t/t5318-commit-graph.sh | 9 +++ 11 files changed, 199 insertions(+), 41 deletions(-) base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707
Re: [PATCH v3 8/9] commit-graph: always load commit-graph information
On 4/18/2018 8:02 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). With generation numbers in the commit-graph, we need to ensure that any commit that exists in the commit-graph file has its generation number loaded. All right, that is nice explanation of the why behind this change. Create new load_commit_graph_info() method to fill in the information for a commit that exists only in the commit-graph file. Call it from parse_commit_buffer() after loading the other commit information from the given buffer. Only fill this information when specified by the 'check_graph' parameter. This avoids duplicate work when we already checked the graph in parse_commit_gently() or when simply checking the buffer contents in check_commit(). Couldn't this 'check_graph' parameter be a global variable similar to the 'commit_graph' variable? Maybe I am not understanding it. See the two callers at the bottom of the patch. They have different purposes: one needs to fill in a valid commit struct, the other needs to check the commit buffer is valid (then throws away the struct). They have different values for 'check_graph'. Also, in parse_commit_gently() we check parse_commit_in_graph() before we call parse_commit_buffer, so we do not want to repeat work; in the case of a valid commit-graph file, but the commit is not in the commit-graph, we would repeat our binary search for the same commit. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 51 -- commit-graph.h | 8 commit.c | 7 +-- commit.h | 2 +- object.c | 2 +- sha1_file.c| 2 +- 6 files changed, 49 insertions(+), 23 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 688d5b1801..21e853c21a 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, return _list_insert(c, pptr)->next; } +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; +} + static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) { uint32_t edge_value; uint32_t *parent_data_ptr; uint64_t date_low, date_high; struct commit_list **pptr; - const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; I'm probably wrong, but isn't it unrelated change? You're right. I saw this while I was in here, and there was a similar comment on this change in a different patch. Probably best to keep these cleanup things in a separate commit. item->object.parsed = 1; item->graph_pos = pos; @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin return 1; } +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) +{ + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { + *pos = item->graph_pos; + return 1; + } else { + return bsearch_graph(commit_graph, &(item->object.oid), pos); + } +} All right (after the fix). + int parse_commit_in_graph(struct commit *item) { + uint32_t pos; + + if (item->object.parsed) + return 0; if (!core_commit_graph) return 0; - if (item->object.parsed) - return 1; Hmmm... previously the function returned 1 if item->object.parsed, now it returns 0 for this situation. I don't understand this change. The good news is that this change is unimportant (the only caller is parse_commit_gently() which checks item->object.parsed before calling parse_commit_in_graph()). I wonder why I reordered those things, anyway. I'll revert to simplify the patch. - prepare_commit_graph(); - if (commit_graph) { - uint32_t pos; - int found; - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { - pos = item->graph_pos; - found = 1; - } else { - found = bsearch_graph(commit_graph, &(item->object.oid), ); - } - - if (found) - return fill_commit_in_graph(item, commit_graph, po
Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
On 4/18/2018 7:19 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: [...] [...], and this saves time during 'git branch --contains' queries that would otherwise walk "around" the commit we are inspecting. If I understand the code properly, what happens is that we can now short-circuit if all commits that are left are lower than the target commit. This is because max-order priority queue is used: if the commit with maximum generation number is below generation number of target commit, then target commit is not reachable from any commit in the priority queue (all of which has generation number less or equal than the commit at head of queue, i.e. all are same level or deeper); compare what I have written in [1] [1]: https://public-inbox.org/git/866052dkju@gmail.com/ Do I have that right? If so, it looks all right to me. Yes, the priority queue needs to compare via generation number first or there will be errors. This is why we could not use commit time before. For a copy of the Linux repository, where HEAD is checked out at v4.13~100, we get the following performance improvement for 'git branch --contains' over the previous commit: Before: 0.21s After: 0.13s Rel %: -38% [...] flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co return NULL; } - list = paint_down_to_common(one, n, twos); + list = paint_down_to_common(one, n, twos, 0); while (list) { struct commit *commit = pop_commit(); @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt) filled_index[filled] = j; work[filled++] = array[j]; } - common = paint_down_to_common(array[i], filled, work); + common = paint_down_to_common(array[i], filled, work, 0); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * if (commit->generation > min_generation) return 0; - bases = paint_down_to_common(commit, nr_reference, reference); + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); Is it the only case where we would call paint_down_to_common() with non-zero last parameter? Would we always use commit->generation where commit is the first parameter of paint_down_to_common()? If both are true and will remain true, then in my humble opinion it is not necessary to change the signature of this function. We need to change the signature some way, but maybe the way I chose is not the best. To elaborate: paint_down_to_common() is used for multiple purposes. The caller here that supplies 'commit->generation' is used only to compute reachability (by testing if the flag PARENT2 exists on the commit, then clears all flags). The other callers expect the full walk down to the common commits, and keeps those PARENT1, PARENT2, and STALE flags for future use (such as reporting merge bases). Usually the call to paint_down_to_common() is followed by a revision walk that only halts when reaching root commits or commits with both PARENT1 and PARENT2 flags on, so always short-circuiting on generations would break the functionality; this is confirmed by the t5318-commit-graph.sh. An alternative to the signature change is to add a boolean parameter "use_cutoff" or something, that specifies "don't walk beyond the commit". This may give a more of a clear description of what it will do with the generation value, but since we are already performing generation comparisons before calling paint_down_to_common() I find this simple enough. Thanks, -Stolee
Re: [PATCH v3 6/9] commit: use generation numbers for in_merge_bases()
On 4/18/2018 6:15 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: The containment algorithm for 'git branch --contains' is different from that for 'git tag --contains' in that it uses is_descendant_of() instead of contains_tag_algo(). The expensive portion of the branch algorithm is computing merge bases. When a commit-graph file exists with generation numbers computed, we can avoid this merge-base calculation when the target commit has a larger generation number than the target commits. You have "target" twice in above paragraph; one of those should probably be something else. Thanks. Second "target" should be "initial". [...] + + if (commit->generation > min_generation) + return 0; Why not use "return ret;" instead of "return 0;", like the rest of the code [cryptically] does, that is: +if (commit->generation > min_generation) +return ret; Sure. Thanks, -Stolee
Re: [PATCH v3 5/9] ref-filter: use generation number for --contains
On 4/18/2018 5:02 PM, Jakub Narebski wrote: Here I can offer only the cursory examination, as I don't know this area of code in question. Derrick Stolee <dsto...@microsoft.com> writes: A commit A can reach a commit B only if the generation number of A is larger than the generation number of B. This condition allows significantly short-circuiting commit-graph walks. Use generation number for 'git tag --contains' queries. On a copy of the Linux repository where HEAD is containd in v4.13 but no earlier tag, the command 'git tag --contains HEAD' had the following peformance improvement: Before: 0.81s After: 0.04s Rel %: -95% A question: what is the performance after if the "commit-graph" feature is disabled, or there is no commit-graph file? Is there performance regression in this case, or is the difference negligible? Negligible, since we are adding a small number of integer comparisons and the main cost is in commit parsing. More on commit parsing in response to your comments below. Helped-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- ref-filter.c | 23 +++ 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/ref-filter.c b/ref-filter.c index cffd8bf3ce..e2fea6d635 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) /* * Test whether the candidate or one of its parents is contained in the list. ^ Sidenote: when examining the code after the change, I have noticed that the above part of commit header for the comtains_test() function is no longer entirely correct, as the function only checks the candidate commit, and in no place it access its parents. But that is not your problem. I'll add a commit in the next version that fixes this comment before I make any changes to the method. * Do not recurse to find out, though, but return -1 if inconclusive. */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate, /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation < cutoff) + return CONTAINS_NO; + Looks good to me. The only [minor] question may be whether to define separate type for generation numbers, and whether to future proof the tests - though the latter would be almost certainly overengineering, and the former probablt too. If we have multiple notions of generation, then we can refactor all references to the "generation" member. return CONTAINS_UNKNOWN; } @@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_INFINITY; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + parse_commit_or_die(c); + if (c->generation < cutoff) + cutoff = c->generation; + } Sholdn't the above be made conditional on the ability to get generation numbers from the commit-graph file (feature is turned on and file exists)? Otherwise here after the change contains_tag_algo() now parses each commit in 'want', which I think was not done previously. With commit-graph file parsing is [probably] cheap. Without it, not necessary. But I might be worrying about nothing. Not nothing. This parses the "wants" when we previously did not parse the wants. Further: this parsing happens before we do the simple check of comparing the OID of the candidate against the wants. The question is: are these parsed commits significant compared to the walk that will parse many more commits? It is certainly possible. One way to fix this is to call 'prepare_commit_graph()' directly and then test that 'commit_graph' is non-null before performing any parses. I'm not thrilled with how that couples the commit-graph implementation to this feature, but that may be necessary to avoid regressions in the non-commit-graph case. + result = contains_test(candidate, want, cache, cutoff); O
Re: [PATCH 0/6] Compute and consume generation numbers
On 4/21/2018 4:44 PM, Jakub Narebski wrote: Jakub Narebski <jna...@gmail.com> writes: Derrick Stolee <sto...@gmail.com> writes: On 4/11/2018 3:32 PM, Jakub Narebski wrote: What would you suggest as a good test that could imply performance? The Google Colab notebook linked to above includes a function to count number of commits (nodes / vertices in the commit graph) walked, currently in the worst case scenario. The two main questions to consider are: 1. Can X reach Y? That is easy to do. The function generic_is_reachable() does that... though using direct translation of the pseudocode for "Algorithm 3: Reachable" from FELINE paper, which is recursive and doesn't check if vertex was already visited was not good idea for large graphs such as Linux kernel commit graph, oops. That is why generic_is_reachable_large() was created. [...] And the thing to measure is a commit count. If possible, it would be good to count commits walked (commits whose parent list is enumerated) and commits inspected (commits that were listed as a parent of some walked commit). Walked commits require a commit parse -- albeit from the commit-graph instead of the ODB now -- while inspected commits only check the in-memory cache. [...] For git.git and Linux, I like to use the release tags as tests. They provide a realistic view of the linear history, and maintenance releases have their own history from the major releases. Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the FELINE index does not bring any improvements over using just level (generation number) filter. But that may be caused by narrowing od commit DAG around releases. I try do do the same between commits in wide part, with many commits with the same level (same generation number) both for source and for target commit. Though this may be unfair to level filter, though... Note however that FELINE index is not unabiguous, like generation numbers are (modulo decision whether to start at 0 or at 1); it depends on the topological ordering chosen for the X elements. One can now test reachability on git.git repository; there is a form where one can plug source and destination revisions at https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK=2=1 I have tried the case that is quite unfair to the generation numbers filter, namely the check between one of recent tags, and one commit that shares generation number among largest number of other commits. Here level = generation number-1 (as it starts at 0 for root commit, not 1). The results are: * src = 468165c1d = v2.17.0 * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec * 468165c1d has level 18418 which it shares with 6 commits * 66d2e04ec has level 14776 which it shares with 93 commits * gen(468165c1d) - gen(66d2e04ec) = 3642 algorithm | access | walk | maxdepth | visited | level-f | FELINE-f | ---+-++--+-+--+---+ naive | 48865 | 39599 | 244 | 9200| | | level | 3086 | 2492 | 113 | 528| 285 | | FELINE | 283 | 216 | 68 |0| | 25 | lev+FELINE | 282 | 215 | 68 |0| 5 | 24 | ---+-++--+-+--+---+ lev+FEL+mpi|79 |59 | 21 |0| 0 | 0 | Here we have: * 'naive' implementation means simple DFS walk, without any filters (cut-offs) * 'level' means using levels / generation numbers based negative-cut filter * 'FELINE' means using FELINE index based negative-cut filter * 'lev+FELINE' means combining generation numbers filter with FELINE filter * 'mpi' means min-post [smanning-tree] intervals for positive-cut filter; note that the code does not walk the path after cut, but it is easy to do The stats have the following meaning: * 'access' means accessing the node * 'walk' is actual walking the node * 'maxdepth' is maximum depth of the stack used for DFS * 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter were used for negative-cut; note that those are not disjoint; node can be rejected by both level filter and FELINE filter For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor: the results are the same, with 5 commits accessed and 6 walked compared to 61574 accessed in naive algorithm. The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads (different child-less commits) and 9 roots, and has average clustering coefficient 0.000409217. Thanks for these results. Now, write a patch. I'm sticking to generation numbers for my patch because of the simplified computation, but you can contribute a FELINE implementation. P.S. Would it be better to move the discussion about possible extensions to the commit-graph in the form of new chunks (topological order,
Re: [PATCH v10 00/36] Add directory rename detection to git
On 4/19/2018 2:41 PM, Stefan Beller wrote: On Thu, Apr 19, 2018 at 11:35 AM, Elijah Newrenwrote: On Thu, Apr 19, 2018 at 10:57 AM, Elijah Newren wrote: This series is a reboot of the directory rename detection series that was merged to master and then reverted due to the final patch having a buggy can-skip-update check, as noted at https://public-inbox.org/git/xmqqmuya43cs@gitster-ct.c.googlers.com/ This series based on top of master. ...and merges cleanly to next but apparently has some minor conflicts with both ds/lazy-load-trees and ps/test-chmtime-get from pu. What's the preferred way to resolve this? Rebase and resubmit my series on pu, or something else? If you were to base it off of pu, this series would depend on all other series that pu contains. This is bad for the progress of this series. (If it were to be merged to next, all other series would automatically merge to next as well) If the conflicts are minor, then Junio resolves them; if you want to be nice, pick your merge point as git checkout origin/master git merge ds/lazy-load-trees git merge ps/test-chmtime-get git tag my-anchor and put the series on top of that anchor. If you do this, you'd want to be reasonably sure that those two series are not in too much flux. I believe ds/lazy-load-trees is queued for 'next'. I'm not surprised that there are some conflicts here. Any reference to the 'tree' member of a commit should be replaced with 'get_commit_tree(c)', or 'get_commit_tree_oid(c)' if you only care about the tree's object id. I think Stefan's suggestion is the best approach to get the right conflicts out of the way. Thanks, -Stolee
Re: [PATCH v3 3/9] commit: use generations in paint_down_to_common()
On 4/18/2018 10:31 AM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: Define compare_commits_by_gen_then_commit_date(), which uses generation numbers as a primary comparison and commit date to break ties (or as a comparison when both commits do not have computed generation numbers). Since the commit-graph file is closed under reachability, we know that all commits in the file have generation at most GENERATION_NUMBER_MAX which is less than GENERATION_NUMBER_INFINITY. This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. In the case that commit dates violate topological order (i.e. a parent is "newer" than a child), the previous code could walk a commit twice: if a commit is reached with the PARENT1 bit, but later is re-visited with the PARENT2 bit, then that PARENT2 bit must be propagated to its parents. Using generation numbers avoids this extra effort, even if it is somewhat rare. Does it mean that it gives no measureable performance improvements for typical test cases? Not in this commit. When we add the `min_generation` parameter in a later commit, we do get a significant performance boost (when we can supply a non-zero value to `min_generation`). This step of using generation numbers for the priority is important for that commit, but on its own has limited value outside of the clock-skew case mentioned above. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 20 +++- commit.h | 1 + 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 711f674c18..a44899c733 100644 --- a/commit.c +++ b/commit.c @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) +{ + const struct commit *a = a_, *b = b_; + + /* newer commits first */ + if (a->generation < b->generation) + return 1; + else if (a->generation > b->generation) + return -1; + + /* use date as a heuristic when generataions are equal */ Very minor typo in above comment: s/generataions/generations/ Good catch! + if (a->date < b->date) + return 1; + else if (a->date > b->date) + return -1; + return 0; +} + int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) /* all input commits in one and twos[] must have been parsed! */ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) { - struct prio_queue queue = { compare_commits_by_commit_date }; + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; diff --git a/commit.h b/commit.h index aac3b8c56f..64436ff44e 100644 --- a/commit.h +++ b/commit.h @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); LAST_ARG_MUST_BE_NULL extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);
Re: What's cooking in git.git (Apr 2018, #02; Tue, 17)
On 4/17/2018 9:04 PM, Junio C Hamano wrote: Stefan Beller <sbel...@google.com> writes: What's the doneness of this thing? I didn't recall seeing any response, especially ones that demonstrated the reviewer carefully read and thought about the issues surrounding the code. Not that I spotted any problems in these patches myself, though. Stolee and Brandon provided a "quick LGTM" type of review https://public-inbox.org/git/20180409232536.gb102...@google.com/ https://public-inbox.org/git/9ddfee7e-025a-79c9-8d6b-700c65a14...@gmail.com/ Yup. Giving positive reviews is harder than giving constructive criticism. Much harder. As readers cannot tell from a "quick LGTM" between "I didn't read it but it did not smell foul" and "I read it thoroughly, understood how the solution works, it was presented well, and agree with the design and implementation---there is nothing to add", the reviewers need to come up with some way to express that it is the latter case rather than the former. I would not claim that I've perfected my technique to do so, but when responding to such a "good" series, I rephrase the main idea in the series in my own words to show that I as a reviewer read the series well enough to be able to do so, perhaps with comparison with possible alternatives I could think of and dicussion to argue that the solution presented in the series is better, in an attempt to demonstrate that I am qualified to say "this one is good" with good enough understanding of both the issue the series addresses and the solution in the series. I'm sorry that my second message was terse. My response to v1 [1] was > I looked through these patches and only found one set of whitespace > errors. Compiles and tests fine on my machine. > > Reviewed-by: Derrick Stolee <dsto...@microsoft.com> So, I pulled the code, went through it patch-by-patch, and saw that the transformations were made using the established pattern. The second review was to chime in that my v1 comments had been addressed. Thanks, -Stolee [1] https://public-inbox.org/git/6c319100-df47-3b8d-8661-24e4643ad...@gmail.com/
Re: [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'
On 4/17/2018 2:10 PM, Derrick Stolee wrote: The commit-graph feature is not useful to end users until the commit-graph file is maintained automatically by Git during normal upkeep operations. One natural place to trigger this write is during 'git gc'. Before automatically generating a commit-graph, we need to be able to verify the contents of a commit-graph file. Integrate commit-graph checks into 'fsck' that check the commit-graph contents against commits in the object database. Things to think about: * Are these the right integration points? * gc.commitGraph defaults to true right now for the purpose of testing, but may not be required to start. The goal is to have this default to true eventually, but we may want to delay that until the feature is stable. * I implement a "--reachable" option to 'git commit-graph write' that iterates over all refs. This does the same as git show-ref -s | git commit-graph write --stdin-commits but I don't know how to pipe two child processes together inside of Git. Perhaps this is a better solution, anyway. What other things should I be considering in this case? I'm unfamiliar with the inner-workings of 'fsck' and 'gc', so this is a new space for me. This RFC is based on v3 of ds/generation-numbers, and the first commit is a fixup! based on a bug in that version that I caught while prepping this series. Thanks, -Stolee Derrick Stolee (12): fixup! commit-graph: always load commit-graph information commit-graph: add 'check' subcommand commit-graph: check file header information commit-graph: parse commit from chosen graph commit-graph: check fanout and lookup table commit: force commit to parse from object database commit-graph: load a root tree from specific graph commit-graph: verify commit contents against odb fsck: check commit-graph commit-graph: add '--reachable' option gc: automatically write commit-graph files commit-graph: update design document Documentation/git-commit-graph.txt | 15 +- Documentation/git-gc.txt | 4 + Documentation/technical/commit-graph.txt | 9 -- builtin/commit-graph.c | 79 +- builtin/fsck.c | 13 ++ builtin/gc.c | 8 + commit-graph.c | 178 ++- commit-graph.h | 2 + commit.c | 14 +- commit.h | 1 + t/t5318-commit-graph.sh | 15 ++ 11 files changed, 311 insertions(+), 27 deletions(-) This RFC is also available as a GitHub pull request [1] [1] https://github.com/derrickstolee/git/pull/6
[RFC PATCH 05/12] commit-graph: check fanout and lookup table
While running 'git commit-graph check', verify that the object IDs are listed in lexicographic order and that the fanout table correctly navigates into that list of object IDs. In anticipation of checking the commits in the commit-graph file against the object database, parse the commits from that file in advance. We perform this parse now to ensure the object cache contains only commits from this commit-graph file. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 34 ++ 1 file changed, 34 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 6d0d303a7a..6e3c08cd5c 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -835,6 +835,9 @@ static int check_commit_graph_error; int check_commit_graph(struct commit_graph *g) { + uint32_t i, cur_fanout_pos = 0; + struct object_id prev_oid, cur_oid; + if (!g) { graph_report(_("no commit-graph file loaded")); return 1; @@ -859,5 +862,36 @@ int check_commit_graph(struct commit_graph *g) if (g->hash_len != GRAPH_OID_LEN) graph_report(_("commit-graph has incorrect hash length: %d"), g->hash_len); + for (i = 0; i < g->num_commits; i++) { + struct commit *graph_commit; + + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); + + if (i > 0 && oidcmp(_oid, _oid) >= 0) + graph_report(_("commit-graph has incorrect oid order: %s then %s"), + + oid_to_hex(_oid), + oid_to_hex(_oid)); + oidcpy(_oid, _oid); + + while (cur_oid.hash[0] > cur_fanout_pos) { + uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos); + if (i != fanout_value) + graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"), +cur_fanout_pos, fanout_value, i); + + cur_fanout_pos++; + } + + graph_commit = lookup_commit(_oid); + + if (!parse_commit_in_graph_one(g, graph_commit)) + graph_report(_("failed to parse %s from commit-graph"), oid_to_hex(_oid)); + + if (graph_commit->graph_pos != i) + graph_report(_("graph_pos for commit %s is %u != %u"), oid_to_hex(_oid), +graph_commit->graph_pos, i); + } + return check_commit_graph_error; } -- 2.17.0.39.g685157f7fb
[RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'
The commit-graph feature is not useful to end users until the commit-graph file is maintained automatically by Git during normal upkeep operations. One natural place to trigger this write is during 'git gc'. Before automatically generating a commit-graph, we need to be able to verify the contents of a commit-graph file. Integrate commit-graph checks into 'fsck' that check the commit-graph contents against commits in the object database. Things to think about: * Are these the right integration points? * gc.commitGraph defaults to true right now for the purpose of testing, but may not be required to start. The goal is to have this default to true eventually, but we may want to delay that until the feature is stable. * I implement a "--reachable" option to 'git commit-graph write' that iterates over all refs. This does the same as git show-ref -s | git commit-graph write --stdin-commits but I don't know how to pipe two child processes together inside of Git. Perhaps this is a better solution, anyway. What other things should I be considering in this case? I'm unfamiliar with the inner-workings of 'fsck' and 'gc', so this is a new space for me. This RFC is based on v3 of ds/generation-numbers, and the first commit is a fixup! based on a bug in that version that I caught while prepping this series. Thanks, -Stolee Derrick Stolee (12): fixup! commit-graph: always load commit-graph information commit-graph: add 'check' subcommand commit-graph: check file header information commit-graph: parse commit from chosen graph commit-graph: check fanout and lookup table commit: force commit to parse from object database commit-graph: load a root tree from specific graph commit-graph: verify commit contents against odb fsck: check commit-graph commit-graph: add '--reachable' option gc: automatically write commit-graph files commit-graph: update design document Documentation/git-commit-graph.txt | 15 +- Documentation/git-gc.txt | 4 + Documentation/technical/commit-graph.txt | 9 -- builtin/commit-graph.c | 79 +- builtin/fsck.c | 13 ++ builtin/gc.c | 8 + commit-graph.c | 178 ++- commit-graph.h | 2 + commit.c | 14 +- commit.h | 1 + t/t5318-commit-graph.sh | 15 ++ 11 files changed, 311 insertions(+), 27 deletions(-) -- 2.17.0.39.g685157f7fb
[RFC PATCH 03/12] commit-graph: check file header information
During a run of 'git commit-graph check', list the issues with the header information in the commit-graph file. Some of this information is inferred from the loaded 'struct commit_graph'. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 29 - 1 file changed, 28 insertions(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index cd0634bba0..c5e5a0f860 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -820,7 +820,34 @@ void write_commit_graph(const char *obj_dir, oids.nr = 0; } +static int check_commit_graph_error; +#define graph_report(...) { check_commit_graph_error = 1; printf(__VA_ARGS__); } + int check_commit_graph(struct commit_graph *g) { - return !g; + if (!g) { + graph_report(_("no commit-graph file loaded")); + return 1; + } + + check_commit_graph_error = 0; + + if (get_be32(g->data) != GRAPH_SIGNATURE) + graph_report(_("commit-graph file has incorrect header")); + + if (*(g->data + 4) != 1) + graph_report(_("commit-graph file version is not 1")); + if (*(g->data + 5) != GRAPH_OID_VERSION) + graph_report(_("commit-graph OID version is not 1 (SHA1)")); + + if (!g->chunk_oid_fanout) + graph_report(_("commit-graph is missing the OID Fanout chunk")); + if (!g->chunk_oid_lookup) + graph_report(_("commit-graph is missing the OID Lookup chunk")); + if (!g->chunk_commit_data) + graph_report(_("commit-graph is missing the Commit Data chunk")); + if (g->hash_len != GRAPH_OID_LEN) + graph_report(_("commit-graph has incorrect hash length: %d"), g->hash_len); + + return check_commit_graph_error; } -- 2.17.0.39.g685157f7fb
[RFC PATCH 11/12] gc: automatically write commit-graph files
The commit-graph file is a very helpful feature for speeding up git operations. In order to make it more useful, write the commit-graph file by default during standard garbage collection operations. Add a 'gc.commitGraph' config setting that triggers writing a commit-graph file after any non-trivial 'git gc' command. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-gc.txt | 4 builtin/gc.c | 8 2 files changed, 12 insertions(+) diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt index 571b5a7e3c..17dd654a59 100644 --- a/Documentation/git-gc.txt +++ b/Documentation/git-gc.txt @@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` determines if it within all non-bare repos or it can be set to a boolean value. This defaults to true. +The optional configuration variable 'gc.commitGraph' determines if +'git gc' runs 'git commit-graph write'. This can be set to a boolean +value. This defaults to false. + The optional configuration variable `gc.aggressiveWindow` controls how much time is spent optimizing the delta compression of the objects in the repository when the --aggressive option is specified. The larger diff --git a/builtin/gc.c b/builtin/gc.c index 77fa720bd0..070f2a7a3d 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -34,6 +34,7 @@ static int aggressive_depth = 50; static int aggressive_window = 250; static int gc_auto_threshold = 6700; static int gc_auto_pack_limit = 50; +static int gc_commit_graph = 1; static int detach_auto = 1; static timestamp_t gc_log_expire_time; static const char *gc_log_expire = "1.day.ago"; @@ -46,6 +47,7 @@ static struct argv_array repack = ARGV_ARRAY_INIT; static struct argv_array prune = ARGV_ARRAY_INIT; static struct argv_array prune_worktrees = ARGV_ARRAY_INIT; static struct argv_array rerere = ARGV_ARRAY_INIT; +static struct argv_array commit_graph = ARGV_ARRAY_INIT; static struct tempfile *pidfile; static struct lock_file log_lock; @@ -121,6 +123,7 @@ static void gc_config(void) git_config_get_int("gc.aggressivedepth", _depth); git_config_get_int("gc.auto", _auto_threshold); git_config_get_int("gc.autopacklimit", _auto_pack_limit); + git_config_get_bool("gc.commitgraph", _commit_graph); git_config_get_bool("gc.autodetach", _auto); git_config_get_expiry("gc.pruneexpire", _expire); git_config_get_expiry("gc.worktreepruneexpire", _worktrees_expire); @@ -374,6 +377,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix) argv_array_pushl(, "prune", "--expire", NULL); argv_array_pushl(_worktrees, "worktree", "prune", "--expire", NULL); argv_array_pushl(, "rerere", "gc", NULL); + argv_array_pushl(_graph, "commit-graph", "write", "--reachable", NULL); /* default expiry time, overwritten in gc_config */ gc_config(); @@ -480,6 +484,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix) if (pack_garbage.nr > 0) clean_pack_garbage(); + if (gc_commit_graph) + if (run_command_v_opt(commit_graph.argv, RUN_GIT_CMD)) + return error(FAILED_RUN, commit_graph.argv[0]); + if (auto_gc && too_many_loose_objects()) warning(_("There are too many unreachable loose objects; " "run 'git prune' to remove them.")); -- 2.17.0.39.g685157f7fb
[RFC PATCH 02/12] commit-graph: add 'check' subcommand
If the commit-graph file becomes corrupt, we need a way to verify its contents match the object database. In the manner of 'git fsck' we will implement a 'git commit-graph check' subcommand to report all issues with the file. Add the 'check' subcommand to the 'commit-graph' builtin and its documentation. Add a simple test that ensures the command returns a zero error code. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 7 +- builtin/commit-graph.c | 38 ++ commit-graph.c | 5 commit-graph.h | 2 ++ t/t5318-commit-graph.sh| 5 5 files changed, 56 insertions(+), 1 deletion(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 4c97b555cc..93c7841ba2 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -9,10 +9,10 @@ git-commit-graph - Write and verify Git commit graph files SYNOPSIS [verse] +'git commit-graph check' [--object-dir ] 'git commit-graph read' [--object-dir ] 'git commit-graph write' [--object-dir ] - DESCRIPTION --- @@ -52,6 +52,11 @@ existing commit-graph file. Read a graph file given by the commit-graph file and output basic details about the graph file. Used for debugging purposes. +'check':: + +Read the commit-graph file and verify its contents against the object +database. Used to check for corrupted data. + EXAMPLES diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 37420ae0fd..77c1a04932 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -7,11 +7,17 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), + N_("git commit-graph check [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), NULL }; +static const char * const builtin_commit_graph_check_usage[] = { + N_("git commit-graph check [--object-dir ]"), + NULL +}; + static const char * const builtin_commit_graph_read_usage[] = { N_("git commit-graph read [--object-dir ]"), NULL @@ -29,6 +35,36 @@ static struct opts_commit_graph { int append; } opts; + +static int graph_check(int argc, const char **argv) +{ + struct commit_graph *graph = 0; + char *graph_name; + + static struct option builtin_commit_graph_check_options[] = { + OPT_STRING(0, "object-dir", _dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + argc = parse_options(argc, argv, NULL, +builtin_commit_graph_check_options, +builtin_commit_graph_check_usage, 0); + + if (!opts.obj_dir) + opts.obj_dir = get_object_directory(); + + graph_name = get_commit_graph_filename(opts.obj_dir); + graph = load_commit_graph_one(graph_name); + + if (!graph) + die("graph file %s does not exist", graph_name); + FREE_AND_NULL(graph_name); + + return check_commit_graph(graph); +} + static int graph_read(int argc, const char **argv) { struct commit_graph *graph = NULL; @@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) PARSE_OPT_STOP_AT_NON_OPTION); if (argc > 0) { + if (!strcmp(argv[0], "check")) + return graph_check(argc, argv); if (!strcmp(argv[0], "read")) return graph_read(argc, argv); if (!strcmp(argv[0], "write")) diff --git a/commit-graph.c b/commit-graph.c index 3f0c142603..cd0634bba0 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -819,3 +819,8 @@ void write_commit_graph(const char *obj_dir, oids.alloc = 0; oids.nr = 0; } + +int check_commit_graph(struct commit_graph *g) +{ + return !g; +} diff --git a/commit-graph.h b/commit-graph.h index 96cccb10f3..e8c8d99dff 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir, int nr_commits, int append); +int check_commit_graph(struct commit_graph *g); + #endif diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 77d85aefe7..e91053271a 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -230,4 +230,9 @@ test_expect_success 'perform fast-forward merge in full repo' ' test_cmp expect output ' +test_expect_success 'git commit-graph check' ' + cd "$TRASH_DIRECTORY/full" && + git commit-graph check >output +' + test_done -- 2.17.0.39.g685157f7fb
[RFC PATCH 08/12] commit-graph: verify commit contents against odb
When running 'git commit-graph check', compare the contents of the commits that are loaded from the commit-graph file with commits that are loaded directly from the object database. This includes checking the root tree object ID, commit date, and parents. In addition, verify the generation number calculation is correct for all commits in the commit-graph file. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 82 ++ 1 file changed, 82 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 80a2ac2a6a..b5bce2bac4 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -899,5 +899,87 @@ int check_commit_graph(struct commit_graph *g) graph_commit->graph_pos, i); } + for (i = 0; i < g->num_commits; i++) { + struct commit *graph_commit, *odb_commit; + struct commit_list *graph_parents, *odb_parents; + int num_parents = 0; + + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); + + graph_commit = lookup_commit(_oid); + odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node()); + if (parse_commit_internal(odb_commit, 0, 0)) + graph_report(_("failed to parse %s from object database"), oid_to_hex(_oid)); + + if (oidcmp(_commit_tree_in_graph_one(g, graph_commit)->object.oid, + get_commit_tree_oid(odb_commit))) + graph_report(_("root tree object ID for commit %s in commit-graph is %s != %s"), +oid_to_hex(_oid), + oid_to_hex(get_commit_tree_oid(graph_commit)), + oid_to_hex(get_commit_tree_oid(odb_commit))); + + if (graph_commit->date != odb_commit->date) + graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime""), +oid_to_hex(_oid), +graph_commit->date, +odb_commit->date); + + + graph_parents = graph_commit->parents; + odb_parents = odb_commit->parents; + + while (graph_parents) { + num_parents++; + + if (odb_parents == NULL) + graph_report(_("commit-graph parent list for commit %s is too long (%d)"), +oid_to_hex(_oid), +num_parents); + + if (oidcmp(_parents->item->object.oid, _parents->item->object.oid)) + graph_report(_("commit-graph parent for %s is %s != %s"), +oid_to_hex(_oid), + oid_to_hex(_parents->item->object.oid), + oid_to_hex(_parents->item->object.oid)); + + graph_parents = graph_parents->next; + odb_parents = odb_parents->next; + } + + if (odb_parents != NULL) + graph_report(_("commit-graph parent list for commit %s terminates early"), +oid_to_hex(_oid)); + + if (graph_commit->generation) { + uint32_t max_generation = 0; + graph_parents = graph_commit->parents; + + while (graph_parents) { + if (graph_parents->item->generation == GENERATION_NUMBER_ZERO || + graph_parents->item->generation == GENERATION_NUMBER_INFINITY) + graph_report(_("commit-graph has valid generation for %s but not its parent, %s"), +oid_to_hex(_oid), + oid_to_hex(_parents->item->object.oid)); + if (graph_parents->item->generation > max_generation) + max_generation = graph_parents->item->generation; + graph_parents = graph_parents->next; + } + + if (graph_commit->generation != max_generation + 1) + graph_report(_("commit-graph has incorrect generation for %s"), +oid_to_hex(_oid)); + } else { + graph_parents = graph_commit->parents; + +
[RFC PATCH 09/12] fsck: check commit-graph
If a commit-graph file exists, check its contents during 'git fsck'. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- builtin/fsck.c | 13 + 1 file changed, 13 insertions(+) diff --git a/builtin/fsck.c b/builtin/fsck.c index ef78c6c00c..9712f230ba 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -16,6 +16,7 @@ #include "streaming.h" #include "decorate.h" #include "packfile.h" +#include "run-command.h" #define REACHABLE 0x0001 #define SEEN 0x0002 @@ -45,6 +46,7 @@ static int name_objects; #define ERROR_REACHABLE 02 #define ERROR_PACK 04 #define ERROR_REFS 010 +#define ERROR_COMMIT_GRAPH 020 static const char *describe_object(struct object *obj) { @@ -815,5 +817,16 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) } check_connectivity(); + + if (core_commit_graph) { + struct child_process commit_graph_check = CHILD_PROCESS_INIT; + const char *check_argv[] = { "commit-graph", "check", NULL, NULL }; + commit_graph_check.argv = check_argv; + commit_graph_check.git_cmd = 1; + + if (run_command(_graph_check)) + errors_found |= ERROR_COMMIT_GRAPH; + } + return errors_found; } -- 2.17.0.39.g685157f7fb
[RFC PATCH 06/12] commit: force commit to parse from object database
In anticipation of checking commit-graph file contents against the object database, create parse_commit_internal() to allow side-stepping the commit-graph file and parse directly from the object database. Due to the use of generation numbers, this method should not be called unless the intention is explicit in avoiding commits from the commit-graph file. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 14 ++ commit.h | 1 + 2 files changed, 11 insertions(+), 4 deletions(-) diff --git a/commit.c b/commit.c index 9ef6f699bd..07752d8503 100644 --- a/commit.c +++ b/commit.c @@ -392,7 +392,8 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s return 0; } -int parse_commit_gently(struct commit *item, int quiet_on_missing) + +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph) { enum object_type type; void *buffer; @@ -403,17 +404,17 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return -1; if (item->object.parsed) return 0; - if (parse_commit_in_graph(item)) + if (use_commit_graph && parse_commit_in_graph(item)) return 0; buffer = read_sha1_file(item->object.oid.hash, , ); if (!buffer) return quiet_on_missing ? -1 : error("Could not read %s", -oid_to_hex(>object.oid)); + oid_to_hex(>object.oid)); if (type != OBJ_COMMIT) { free(buffer); return error("Object %s not a commit", -oid_to_hex(>object.oid)); + oid_to_hex(>object.oid)); } ret = parse_commit_buffer(item, buffer, size, 0); if (save_commit_buffer && !ret) { @@ -424,6 +425,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return ret; } +int parse_commit_gently(struct commit *item, int quiet_on_missing) +{ + return parse_commit_internal(item, quiet_on_missing, 1); +} + void parse_commit_or_die(struct commit *item) { if (parse_commit(item)) diff --git a/commit.h b/commit.h index b5afde1ae9..5fde74fcd7 100644 --- a/commit.h +++ b/commit.h @@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char *name); struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name); int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph); +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph); int parse_commit_gently(struct commit *item, int quiet_on_missing); static inline int parse_commit(struct commit *item) { -- 2.17.0.39.g685157f7fb
[RFC PATCH 01/12] fixup! commit-graph: always load commit-graph information
--- commit-graph.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index 21e853c21a..3f0c142603 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -304,7 +304,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin *pos = item->graph_pos; return 1; } else { - return bsearch_graph(commit_graph, &(item->object.oid), pos); + return bsearch_graph(g, &(item->object.oid), pos); } } -- 2.17.0.39.g685157f7fb
[RFC PATCH 10/12] commit-graph: add '--reachable' option
When writing commit-graph files, it can be convenient to ask for all reachable commits (starting at the ref set) in the resulting file. This is particularly helpful when writing to stdin is complicated, such as a future integration with 'git gc' which will call 'git commit-graph write --reachable' after performing cleanup of the object database. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 8 -- builtin/commit-graph.c | 41 +++--- t/t5318-commit-graph.sh| 10 3 files changed, 53 insertions(+), 6 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 93c7841ba2..1b14d40590 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -37,12 +37,16 @@ Write a commit graph file based on the commits found in packfiles. + With the `--stdin-packs` option, generate the new commit graph by walking objects only in the specified pack-indexes. (Cannot be combined -with --stdin-commits.) +with --stdin-commits or --reachable.) + With the `--stdin-commits` option, generate the new commit graph by walking commits starting at the commits specified in stdin as a list of OIDs in hex, one OID per line. (Cannot be combined with ---stdin-packs.) +--stdin-packs or --reachable.) ++ +With the `--reachable` option, generate the new commit graph by walking +commits starting at all refs. (Cannot be combined with --stdin-commits +or --stind-packs.) + With the `--append` option, include all commits that are present in the existing commit-graph file. diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 77c1a04932..a89285ada8 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -3,13 +3,14 @@ #include "dir.h" #include "lockfile.h" #include "parse-options.h" +#include "refs.h" #include "commit-graph.h" static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph check [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), - N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--append] [--reachable|--stdin-packs|--stdin-commits]"), NULL }; @@ -24,12 +25,13 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--append] [--reachable|--stdin-packs|--stdin-commits]"), NULL }; static struct opts_commit_graph { const char *obj_dir; + int reachable; int stdin_packs; int stdin_commits; int append; @@ -113,6 +115,25 @@ static int graph_read(int argc, const char **argv) return 0; } +struct hex_list { + char **hex_strs; + int hex_nr; + int hex_alloc; +}; + +static int add_ref_to_list(const char *refname, + const struct object_id *oid, + int flags, void *cb_data) +{ + struct hex_list *list = (struct hex_list*)cb_data; + + ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc); + list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1); + strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid)); + list->hex_nr++; + return 0; +} + static int graph_write(int argc, const char **argv) { const char **pack_indexes = NULL; @@ -127,6 +148,8 @@ static int graph_write(int argc, const char **argv) OPT_STRING(0, "object-dir", _dir, N_("dir"), N_("The object directory to store the graph")), + OPT_BOOL(0, "reachable", , + N_("start walk at all refs")), OPT_BOOL(0, "stdin-packs", _packs, N_("scan pack-indexes listed by stdin for commits")), OPT_BOOL(0, "stdin-commits", _commits, @@ -140,8 +163,8 @@ static int graph_write(int argc, const char **argv) builtin_commit_graph_write_options, builtin_commit_graph_write_usage, 0); - if (opts.stdin_packs && opts.stdin_commits) - die(_("cannot use both --stdin-commits and --stdin-packs")); + if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1) + die(_("use at most one of --reachable, --stdin-commits, or --stdin-packs")); if (!opts.obj_dir)
[RFC PATCH 07/12] commit-graph: load a root tree from specific graph
When lazy-loading a tree for a commit, it will be important to select the tree from a specific struct commit_graph. Create a new method that specifies the commit-graph file and use that in get_commit_tree_in_graph(). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 6e3c08cd5c..80a2ac2a6a 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -354,14 +354,20 @@ static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit * return c->maybe_tree; } -struct tree *get_commit_tree_in_graph(const struct commit *c) +static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g, +const struct commit *c) { if (c->maybe_tree) return c->maybe_tree; if (c->graph_pos == COMMIT_NOT_FROM_GRAPH) BUG("get_commit_tree_in_graph called from non-commit-graph commit"); - return load_tree_for_commit(commit_graph, (struct commit *)c); + return load_tree_for_commit(g, (struct commit *)c); +} + +struct tree *get_commit_tree_in_graph(const struct commit *c) +{ + return get_commit_tree_in_graph_one(commit_graph, c); } static void write_graph_chunk_fanout(struct hashfile *f, -- 2.17.0.39.g685157f7fb
[RFC PATCH 12/12] commit-graph: update design document
The commit-graph feature is now integrated with 'fsck' and 'gc', so remove those items from the "Future Work" section of the commit-graph design document. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph.txt | 9 - 1 file changed, 9 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index d9f2713efa..d04657b781 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -118,9 +118,6 @@ Future Work - The commit graph feature currently does not honor commit grafts. This can be remedied by duplicating or refactoring the current graft logic. -- The 'commit-graph' subcommand does not have a "verify" mode that is - necessary for integration with fsck. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered @@ -142,12 +139,6 @@ Future Work such as "ensure_tree_loaded(commit)" that fully loads a tree before using commit->tree. -- The current design uses the 'commit-graph' subcommand to generate the graph. - When this feature stabilizes enough to recommend to most users, we should - add automatic graph writes to common operations that create many commits. - For example, one could compute a graph on 'clone', 'fetch', or 'repack' - commands. - - A server could provide a commit graph file as part of the network protocol to avoid extra calculations by clients. This feature is only of benefit if the user is willing to trust the file, because verifying the file is correct -- 2.17.0.39.g685157f7fb
[RFC PATCH 04/12] commit-graph: parse commit from chosen graph
Before checking a commit-graph file against the object database, we need to parse all commits from the given commit-graph file. Create parse_commit_in_graph_one() to target a given struct commit_graph. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index c5e5a0f860..6d0d303a7a 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -308,17 +308,27 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin } } -int parse_commit_in_graph(struct commit *item) +int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item) { uint32_t pos; if (item->object.parsed) - return 0; + return 1; + + if (find_commit_in_graph(item, g, )) + return fill_commit_in_graph(item, g, pos); + + return 0; +} + +int parse_commit_in_graph(struct commit *item) +{ if (!core_commit_graph) return 0; + prepare_commit_graph(); - if (commit_graph && find_commit_in_graph(item, commit_graph, )) - return fill_commit_in_graph(item, commit_graph, pos); + if (commit_graph) + return parse_commit_in_graph_one(commit_graph, item); return 0; } -- 2.17.0.39.g685157f7fb
Re: [PATCH v3 8/9] commit-graph: always load commit-graph information
On 4/17/2018 1:00 PM, Derrick Stolee wrote: Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). With generation numbers in the commit-graph, we need to ensure that any commit that exists in the commit-graph file has its generation number loaded. Create new load_commit_graph_info() method to fill in the information for a commit that exists only in the commit-graph file. Call it from parse_commit_buffer() after loading the other commit information from the given buffer. Only fill this information when specified by the 'check_graph' parameter. This avoids duplicate work when we already checked the graph in parse_commit_gently() or when simply checking the buffer contents in check_commit(). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 51 -- commit-graph.h | 8 commit.c | 7 +-- commit.h | 2 +- object.c | 2 +- sha1_file.c| 2 +- 6 files changed, 49 insertions(+), 23 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 688d5b1801..21e853c21a 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, return _list_insert(c, pptr)->next; } +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; +} + static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) { uint32_t edge_value; uint32_t *parent_data_ptr; uint64_t date_low, date_high; struct commit_list **pptr; - const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; item->object.parsed = 1; item->graph_pos = pos; @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin return 1; } +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) +{ + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { + *pos = item->graph_pos; + return 1; + } else { + return bsearch_graph(commit_graph, &(item->object.oid), pos); The reference to 'commit_graph' in the above line should be 'g'. Sorry! + } +} + int parse_commit_in_graph(struct commit *item) { + uint32_t pos; + + if (item->object.parsed) + return 0; if (!core_commit_graph) return 0; - if (item->object.parsed) - return 1; - prepare_commit_graph(); - if (commit_graph) { - uint32_t pos; - int found; - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { - pos = item->graph_pos; - found = 1; - } else { - found = bsearch_graph(commit_graph, &(item->object.oid), ); - } - - if (found) - return fill_commit_in_graph(item, commit_graph, pos); - } - + if (commit_graph && find_commit_in_graph(item, commit_graph, )) + return fill_commit_in_graph(item, commit_graph, pos); return 0; } +void load_commit_graph_info(struct commit *item) +{ + uint32_t pos; + if (!core_commit_graph) + return; + prepare_commit_graph(); + if (commit_graph && find_commit_in_graph(item, commit_graph, )) + fill_commit_graph_info(item, commit_graph, pos); +} + static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) { struct object_id oid; diff --git a/commit-graph.h b/commit-graph.h index 260a468e73..96cccb10f3 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); */ int parse_commit_in_graph(struct commit *item); +/* + * It is possible that we loaded commit contents from the commit buffer, + * but we also want to ensure the commit-graph content is correctly + * checked and filled. Fill the graph_pos and generation members of + * the given commit. + */ +void load_commit_graph_info(struct commit *item); + struct tree *get_commit_tree_in_graph(const struct commit *c); struct commit_graph { diff --git a/commit.c b/commit.c index a70f120878..9ef6f699bd 100644 --- a/commit.c +++ b/commit.c @@ -331,7 +331,7 @@ const void *detach_comm
[PATCH v3 9/9] merge: check config before loading commits
Now that we use generation numbers from the commit-graph, we must ensure that all commits that exist in the commit-graph are loaded from that file instead of from the object database. Since the commit-graph file is only checked if core.commitGraph is true, we must check the default config before we load any commits. In the merge builtin, the config was checked after loading the HEAD commit. This was due to the use of the global 'branch' when checking merge-specific config settings. Move the config load to be between the initialization of 'branch' and the commit lookup. Without this change, a fast-forward merge would hit a BUG("bad generation skip") statement in commit.c during paint_down_to_common(). This is because the HEAD commit would be loaded with "infinite" generation but then reached by commits with "finite" generation numbers. Add a test to t5318-commit-graph.sh that exercises this code path to prevent a regression. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- builtin/merge.c | 5 +++-- t/t5318-commit-graph.sh | 9 + 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/builtin/merge.c b/builtin/merge.c index 5e5e4497e3..7e1da6c6ea 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1148,13 +1148,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", ); + init_diff_ui_defaults(); + git_config(git_merge_config, NULL); + if (!branch || is_null_oid(_oid)) head_commit = NULL; else head_commit = lookup_commit_or_die(_oid, "HEAD"); - init_diff_ui_defaults(); - git_config(git_merge_config, NULL); if (branch_mergeoptions) parse_branch_merge_options(branch_mergeoptions); diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index a380419b65..77d85aefe7 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' ' graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2 +test_expect_success 'perform fast-forward merge in full repo' ' + cd "$TRASH_DIRECTORY/full" && + git checkout -b merge-5-to-8 commits/5 && + git merge commits/8 && + git show-ref -s merge-5-to-8 >output && + git show-ref -s commits/8 >expect && + test_cmp expect output +' + test_done -- 2.17.0.39.g685157f7fb
[PATCH v3 5/9] ref-filter: use generation number for --contains
A commit A can reach a commit B only if the generation number of A is larger than the generation number of B. This condition allows significantly short-circuiting commit-graph walks. Use generation number for 'git tag --contains' queries. On a copy of the Linux repository where HEAD is containd in v4.13 but no earlier tag, the command 'git tag --contains HEAD' had the following peformance improvement: Before: 0.81s After: 0.04s Rel %: -95% Helped-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- ref-filter.c | 23 +++ 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/ref-filter.c b/ref-filter.c index cffd8bf3ce..e2fea6d635 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate, /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation < cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_INFINITY; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + parse_commit_or_die(c); + if (c->generation < cutoff) + cutoff = c->generation; + } + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1637,7 +1652,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. */ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1651,7 +1666,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit, -- 2.17.0.39.g685157f7fb
[PATCH v3 4/9] commit-graph.txt: update design document
We now calculate generation numbers in the commit-graph file and use them in paint_down_to_common(). Expand the section on generation numbers to discuss how the three special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and _MAX interact with other generation numbers. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph.txt | 30 +++- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index 0550c6d0dc..d9f2713efa 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite" generation number and walk until reaching commits with known generation number. +We use the macro GENERATION_NUMBER_INFINITY = 0x to mark commits not +in the commit-graph file. If a commit-graph file was written by a version +of Git that did not compute generation numbers, then those commits will +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. + +Since the commit-graph file is closed under reachability, we can guarantee +the following weaker condition on all commits: + +If A and B are commits with generation numbers N amd M, respectively, +and N < M, then A cannot reach B. + +Note how the strict inequality differs from the inequality when we have +fully-computed generation numbers. Using strict inequality may result in +walking a few extra commits, but the simplicity in dealing with commits +with generation number *_INFINITY or *_ZERO is valuable. + +We use the macro GENERATION_NUMBER_MAX = 0x3FFF to for commits whose +generation numbers are computed to be at least this value. We limit at +this value since it is the largest value that can be stored in the +commit-graph file using the 30 bits available to generation numbers. This +presents another case where a commit can have generation number equal to +that of a parent. + Design Details -- @@ -98,17 +121,12 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following - operations are important candidates: + operation is an important candidate: -- paint_down_to_common() - 'log --topo-order' - Currently, parse_commit_gently() requires filling in the root tree -- 2.17.0.39.g685157f7fb
[PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
When running 'git branch --contains', the in_merge_bases_many() method calls paint_down_to_common() to discover if a specific commit is reachable from a set of branches. Commits with lower generation number are not needed to correctly answer the containment query of in_merge_bases_many(). Add a new parameter, min_generation, to paint_down_to_common() that prevents walking commits with generation number strictly less than min_generation. If 0 is given, then there is no functional change. For in_merge_bases_many(), we can pass commit->generation as the cutoff, and this saves time during 'git branch --contains' queries that would otherwise walk "around" the commit we are inspecting. For a copy of the Linux repository, where HEAD is checked out at v4.13~100, we get the following performance improvement for 'git branch --contains' over the previous commit: Before: 0.21s After: 0.13s Rel %: -38% Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/commit.c b/commit.c index bceb79c419..a70f120878 100644 --- a/commit.c +++ b/commit.c @@ -805,11 +805,14 @@ static int queue_has_nonstale(struct prio_queue *queue) } /* all input commits in one and twos[] must have been parsed! */ -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) +static struct commit_list *paint_down_to_common(struct commit *one, int n, + struct commit **twos, + int min_generation) { struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; + uint32_t last_gen = GENERATION_NUMBER_INFINITY; one->object.flags |= PARENT1; if (!n) { @@ -828,6 +831,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc struct commit_list *parents; int flags; + if (commit->generation > last_gen) + BUG("bad generation skip"); + last_gen = commit->generation; + + if (commit->generation < min_generation) + break; + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co return NULL; } - list = paint_down_to_common(one, n, twos); + list = paint_down_to_common(one, n, twos, 0); while (list) { struct commit *commit = pop_commit(); @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt) filled_index[filled] = j; work[filled++] = array[j]; } - common = paint_down_to_common(array[i], filled, work); + common = paint_down_to_common(array[i], filled, work, 0); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * if (commit->generation > min_generation) return 0; - bases = paint_down_to_common(commit, nr_reference, reference); + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); if (commit->object.flags & PARENT2) ret = 1; clear_commit_marks(commit, all_flags); -- 2.17.0.39.g685157f7fb
[PATCH v3 2/9] commit-graph: compute generation numbers
While preparing commits to be written into a commit-graph file, compute the generation numbers using a depth-first strategy. The only commits that are walked in this depth-first search are those without a precomputed generation number. Thus, computation time will be relative to the number of new commits to the commit-graph file. If a computed generation number would exceed GENERATION_NUMBER_MAX, then use GENERATION_NUMBER_MAX instead. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 46 ++ 1 file changed, 46 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 9ad21c3ffb..688d5b1801 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -439,6 +439,10 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; + if ((*list)->generation != GENERATION_NUMBER_INFINITY) { + packedDate[0] |= htonl((*list)->generation << 2); + } + packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); @@ -571,6 +575,46 @@ static void close_reachable(struct packed_oid_list *oids) } } +static void compute_generation_numbers(struct commit** commits, + int nr_commits) +{ + int i; + struct commit_list *list = NULL; + + for (i = 0; i < nr_commits; i++) { + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && + commits[i]->generation != GENERATION_NUMBER_ZERO) + continue; + + commit_list_insert(commits[i], ); + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; + uint32_t max_generation = 0; + + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_INFINITY || + parent->item->generation == GENERATION_NUMBER_ZERO) { + all_parents_computed = 0; + commit_list_insert(parent->item, ); + break; + } else if (parent->item->generation > max_generation) { + max_generation = parent->item->generation; + } + } + + if (all_parents_computed) { + current->generation = max_generation + 1; + pop_commit(); + } + + if (current->generation > GENERATION_NUMBER_MAX) + current->generation = GENERATION_NUMBER_MAX; + } + } +} + void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, @@ -694,6 +738,8 @@ void write_commit_graph(const char *obj_dir, if (commits.nr >= GRAPH_PARENT_MISSING) die(_("too many commits to write graph")); + compute_generation_numbers(commits.list, commits.nr); + graph_name = get_commit_graph_filename(obj_dir); fd = hold_lock_file_for_update(, graph_name, 0); -- 2.17.0.39.g685157f7fb
[PATCH v3 8/9] commit-graph: always load commit-graph information
Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). With generation numbers in the commit-graph, we need to ensure that any commit that exists in the commit-graph file has its generation number loaded. Create new load_commit_graph_info() method to fill in the information for a commit that exists only in the commit-graph file. Call it from parse_commit_buffer() after loading the other commit information from the given buffer. Only fill this information when specified by the 'check_graph' parameter. This avoids duplicate work when we already checked the graph in parse_commit_gently() or when simply checking the buffer contents in check_commit(). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 51 -- commit-graph.h | 8 commit.c | 7 +-- commit.h | 2 +- object.c | 2 +- sha1_file.c| 2 +- 6 files changed, 49 insertions(+), 23 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 688d5b1801..21e853c21a 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, return _list_insert(c, pptr)->next; } +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; +} + static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) { uint32_t edge_value; uint32_t *parent_data_ptr; uint64_t date_low, date_high; struct commit_list **pptr; - const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; item->object.parsed = 1; item->graph_pos = pos; @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin return 1; } +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) +{ + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { + *pos = item->graph_pos; + return 1; + } else { + return bsearch_graph(commit_graph, &(item->object.oid), pos); + } +} + int parse_commit_in_graph(struct commit *item) { + uint32_t pos; + + if (item->object.parsed) + return 0; if (!core_commit_graph) return 0; - if (item->object.parsed) - return 1; - prepare_commit_graph(); - if (commit_graph) { - uint32_t pos; - int found; - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { - pos = item->graph_pos; - found = 1; - } else { - found = bsearch_graph(commit_graph, &(item->object.oid), ); - } - - if (found) - return fill_commit_in_graph(item, commit_graph, pos); - } - + if (commit_graph && find_commit_in_graph(item, commit_graph, )) + return fill_commit_in_graph(item, commit_graph, pos); return 0; } +void load_commit_graph_info(struct commit *item) +{ + uint32_t pos; + if (!core_commit_graph) + return; + prepare_commit_graph(); + if (commit_graph && find_commit_in_graph(item, commit_graph, )) + fill_commit_graph_info(item, commit_graph, pos); +} + static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) { struct object_id oid; diff --git a/commit-graph.h b/commit-graph.h index 260a468e73..96cccb10f3 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); */ int parse_commit_in_graph(struct commit *item); +/* + * It is possible that we loaded commit contents from the commit buffer, + * but we also want to ensure the commit-graph content is correctly + * checked and filled. Fill the graph_pos and generation members of + * the given commit. + */ +void load_commit_graph_info(struct commit *item); + struct tree *get_commit_tree_in_graph(const struct commit *c); struct commit_graph { diff --git a/commit.c b/commit.c index a70f120878..9ef6f699bd 100644 --- a/commit.c +++ b/commit.c @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) return ret; } -int parse_commit_buffer(struct commit *item, const void *buffer
[PATCH v3 1/9] commit: add generation number to struct commmit
The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use three special values to signal the generation number is invalid: GENERATION_NUMBER_INFINITY 0x GENERATION_NUMBER_MAX 0x3FFF GENERATION_NUMBER_ZERO 0 The first (_INFINITY) means the generation number has not been loaded or computed. The second (_MAX) means the generation number is too large to store in the commit-graph file. The third (_ZERO) means the generation number was loaded from a commit graph file that was written by a version of git that did not support generation numbers. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- alloc.c| 1 + commit-graph.c | 2 ++ commit.h | 4 3 files changed, 7 insertions(+) diff --git a/alloc.c b/alloc.c index cf4f8b61e1..e8ab14f4a1 100644 --- a/alloc.c +++ b/alloc.c @@ -94,6 +94,7 @@ void *alloc_commit_node(void) c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); c->graph_pos = COMMIT_NOT_FROM_GRAPH; + c->generation = GENERATION_NUMBER_INFINITY; return c; } diff --git a/commit-graph.c b/commit-graph.c index 70fa1b25fd..9ad21c3ffb 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin date_low = get_be32(commit_data + g->hash_len + 12); item->date = (timestamp_t)((date_high << 32) | date_low); + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; + pptr = >parents; edge_value = get_be32(commit_data + g->hash_len); diff --git a/commit.h b/commit.h index 23a3f364ed..aac3b8c56f 100644 --- a/commit.h +++ b/commit.h @@ -10,6 +10,9 @@ #include "pretty.h" #define COMMIT_NOT_FROM_GRAPH 0x +#define GENERATION_NUMBER_INFINITY 0x +#define GENERATION_NUMBER_MAX 0x3FFF +#define GENERATION_NUMBER_ZERO 0 struct commit_list { struct commit *item; @@ -30,6 +33,7 @@ struct commit { */ struct tree *maybe_tree; uint32_t graph_pos; + uint32_t generation; }; extern int save_commit_buffer; -- 2.17.0.39.g685157f7fb
[PATCH v3 6/9] commit: use generation numbers for in_merge_bases()
The containment algorithm for 'git branch --contains' is different from that for 'git tag --contains' in that it uses is_descendant_of() instead of contains_tag_algo(). The expensive portion of the branch algorithm is computing merge bases. When a commit-graph file exists with generation numbers computed, we can avoid this merge-base calculation when the target commit has a larger generation number than the target commits. Performance tests were run on a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag. Also, all tags were copied to branches and 'git branch --contains' was tested: Before: 60.0s After: 0.4s Rel %: -99.3% Reported-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index a44899c733..bceb79c419 100644 --- a/commit.c +++ b/commit.c @@ -1053,12 +1053,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * { struct commit_list *bases; int ret = 0, i; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; if (parse_commit(commit)) return ret; - for (i = 0; i < nr_reference; i++) + for (i = 0; i < nr_reference; i++) { if (parse_commit(reference[i])) return ret; + if (min_generation > reference[i]->generation) + min_generation = reference[i]->generation; + } + + if (commit->generation > min_generation) + return 0; bases = paint_down_to_common(commit, nr_reference, reference); if (commit->object.flags & PARENT2) -- 2.17.0.39.g685157f7fb
[PATCH v3 0/9] Compute and consume generation numbers
Thanks for all the help on v2. Here are a few changes between versions: * Removed the constant-time check in queue_has_nonstale() due to the possibility of a performance hit and no evidence of a performance benefit in typical cases. * Reordered the commits about loading commits from the commit-graph. This way it is easier to demonstrate the incorrect checks. On my machine, every commit compiles and the test suite passes, but patches 6-8 have the bug that is fixed in patch 9 "merge: check config before loading commits". * The interaction with parse_commit_in_graph() from parse_object() is replaced with a new 'check_graph' parameter in parse_commit_buffer(). This allows us to fill in the graph_pos and generation values for commits that are parsed directly from a buffer. This keeps the existing behavior that a commit parsed this way should match its buffer. * There was discussion about making GENERATION_NUMBER_MAX assignable by an environment variable so we could add tests that exercise the behavior of capping a generation at that value. Perhaps the code around this is simple enough that we do not need to add that complexity. Thanks, -Stolee -- >8 -- This is the one of several "small" patches that follow the serialized Git commit graph patch (ds/commit-graph) and lazy-loading trees (ds/lazy-load-trees). As described in Documentation/technical/commit-graph.txt, the generation number of a commit is one more than the maximum generation number among its parents (trivially, a commit with no parents has generation number one). This section is expanded to describe the interaction with special generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph file) and *_ZERO (commits in a commit-graph file written before generation numbers were implemented). This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a short-circuit mechanism to improve performance of `git branch --contains`. Further, use generation numbers for 'git tag --contains), providing a significant speedup (at least 95% for some cases). A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. This patch series is build on ds/lazy-load-trees. Derrick Stolee (9): commit: add generation number to struct commmit commit-graph: compute generation numbers commit: use generations in paint_down_to_common() commit-graph.txt: update design document ref-filter: use generation number for --contains commit: use generation numbers for in_merge_bases() commit: add short-circuit to paint_down_to_common() commit-graph: always load commit-graph information merge: check config before loading commits Documentation/technical/commit-graph.txt | 30 +-- alloc.c | 1 + builtin/merge.c | 5 +- commit-graph.c | 99 +++- commit-graph.h | 8 ++ commit.c | 54 +++-- commit.h | 7 +- object.c | 2 +- ref-filter.c | 23 +- sha1_file.c | 2 +- t/t5318-commit-graph.sh | 9 +++ 11 files changed, 199 insertions(+), 41 deletions(-) base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707 -- 2.17.0.39.g685157f7fb
[PATCH v3 3/9] commit: use generations in paint_down_to_common()
Define compare_commits_by_gen_then_commit_date(), which uses generation numbers as a primary comparison and commit date to break ties (or as a comparison when both commits do not have computed generation numbers). Since the commit-graph file is closed under reachability, we know that all commits in the file have generation at most GENERATION_NUMBER_MAX which is less than GENERATION_NUMBER_INFINITY. This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. In the case that commit dates violate topological order (i.e. a parent is "newer" than a child), the previous code could walk a commit twice: if a commit is reached with the PARENT1 bit, but later is re-visited with the PARENT2 bit, then that PARENT2 bit must be propagated to its parents. Using generation numbers avoids this extra effort, even if it is somewhat rare. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 20 +++- commit.h | 1 + 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 711f674c18..a44899c733 100644 --- a/commit.c +++ b/commit.c @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) +{ + const struct commit *a = a_, *b = b_; + + /* newer commits first */ + if (a->generation < b->generation) + return 1; + else if (a->generation > b->generation) + return -1; + + /* use date as a heuristic when generataions are equal */ + if (a->date < b->date) + return 1; + else if (a->date > b->date) + return -1; + return 0; +} + int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) /* all input commits in one and twos[] must have been parsed! */ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) { - struct prio_queue queue = { compare_commits_by_commit_date }; + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; diff --git a/commit.h b/commit.h index aac3b8c56f..64436ff44e 100644 --- a/commit.h +++ b/commit.h @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); LAST_ARG_MUST_BE_NULL extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); -- 2.17.0.39.g685157f7fb
Re: [PATCHv3 00/15] replace_object.c: rename to use dash in file name
On 4/11/2018 8:21 PM, Stefan Beller wrote: v3: * interdiff below, the only changes are renaming the variable - struct ref_store *main_ref_store; + struct ref_store *refs; in struct repository. as well as dropping the file rename patch. * improved commit messages from discussion on the single patches. Looks good to me. Thanks! -Stolee
Re: [PATCH v2 07/10] commit-graph.txt: update future work
On 4/12/2018 5:12 AM, Junio C Hamano wrote: Derrick Stolee <dsto...@microsoft.com> writes: +Here is a diagram to visualize the shape of the full commit graph, and +how different generation numbers relate: + ++-+ +| GENERATION_NUMBER_INFINITY = 0x | ++-+ + || ^ + || | + |+--+ + | [gen(A) = gen(B)] + V ++-+ +| 0 < commit->generation < 0x4000 | ++-+ + || ^ + || | + |+--+ + |[gen(A) > gen(B)] + V ++-+ +| GENERATION_NUMBER_ZERO = 0 | ++-+ +| ^ +| | ++--+ +[gen(A) = gen(B)] It may be just me but all I can read out of the above is that commit->generation may store 0x, a value between 0 and 0x4000, or 0. I cannot quite tell what the notation [gen(A) gen(B)] is trying to say. I am guessing "Two generation numbers within the 'valid' range can be compared" is what the second one is trying to say, but it is much less interesting to know that two infinities compare equal than how generation numbers from different classes compare, which cannot be depicted in the above notation, I am afraid. For example, don't we want to say that a commit with INF can never be reached by a commit with a valid generation number, or something like that? My intention with the arrows was to demonstrate where parent relationships can go, and the generation-number relation between a commit A with parent B. Clearly, this diagram is less than helpful. Design Details -- @@ -98,17 +141,12 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following - operations are important candidates: + operation is an important candidate: Good.
Re: [PATCH v8 03/14] commit-graph: add format document
On 4/11/2018 4:58 PM, Jakub Narebski wrote: Derrick Stolee <sto...@gmail.com> writes: +CHUNK DATA: + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of commits (N). + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all commits in the graph, sorted in ascending order. + + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes) I think it is a typo, and it should be CDAT, not CGET (CDAT seem to me to stand for Commit DATa): + Commit Data (ID: {'C', 'D', 'A', 'T' }) (N * (H + 16) bytes) This is what you use in actual implementation, in PATCH v8 06/14 DS> +#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */ DS> +#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ DS> +#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */ DS> +#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */ DS> +#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */ Documentation bugs are hard to diagnose. Thanks for finding this. I double checked that the hex int "0x43444154" matches "CDAT". Here is a diff to make it match. diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt index ad6af8105c..af03501834 100644 --- a/Documentation/technical/commit-graph-format.txt +++ b/Documentation/technical/commit-graph-format.txt @@ -70,7 +70,7 @@ CHUNK DATA: OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) The OIDs for all commits in the graph, sorted in ascending order. - Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes) + Commit Data (ID: {'C', 'D', 'A', 'T' }) (N * (H + 16) bytes) * The first H bytes are for the OID of the root tree. * The next 8 bytes are for the positions of the first two parents of the ith commit. Stores value 0x if no parent in that
Re: [PATCH resend] SubmittingPatches: mention the git contacts command
On 4/11/2018 4:20 PM, Thomas Gummerer wrote: Instead of just mentioning 'git blame' and 'git shortlog', which make it quite hard for new contributors to pick out the appropriate list of people to cc on their patch series, mention the 'git contacts' utility, which makes it much easier to get a reasonable list of contacts for a change. This should help new contributors pick out a reasonable cc list by simply using a single command. Signed-off-by: Thomas Gummerer--- I've originally sent this at <20180316213323.GC2224@hank>, during an the rc period. Eric had some comments, which I interpreted as being okay with the change (hope I'm not mistaken there :)). As I still think this would be an improvement for new contributors, I'm resending it here. I didn't know about this tool, and it seems helpful. I plan to use it now. Thanks! Documentation/SubmittingPatches | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches index a1d0feca36..945f8edb46 100644 --- a/Documentation/SubmittingPatches +++ b/Documentation/SubmittingPatches @@ -260,8 +260,8 @@ that starts with `-BEGIN PGP SIGNED MESSAGE-`. That is not a text/plain, it's something else. Send your patch with "To:" set to the mailing list, with "cc:" listing -people who are involved in the area you are touching (the output from -`git blame $path` and `git shortlog --no-merges $path` would help to +people who are involved in the area you are touching (the `git +contacts` command in `contrib/contacts/` can help to identify them), to solicit comments and reviews. :1: footnote:[The current maintainer: gits...@pobox.com]
Re: [PATCH 0/6] Compute and consume generation numbers
On 4/11/2018 3:32 PM, Jakub Narebski wrote: What would you suggest as a good test that could imply performance? The Google Colab notebook linked to above includes a function to count number of commits (nodes / vertices in the commit graph) walked, currently in the worst case scenario. The two main questions to consider are: 1. Can X reach Y? 2. What is the set of merge-bases between X and Y? And the thing to measure is a commit count. If possible, it would be good to count commits walked (commits whose parent list is enumerated) and commits inspected (commits that were listed as a parent of some walked commit). Walked commits require a commit parse -- albeit from the commit-graph instead of the ODB now -- while inspected commits only check the in-memory cache. For git.git and Linux, I like to use the release tags as tests. They provide a realistic view of the linear history, and maintenance releases have their own history from the major releases. I have tried finding number of false positives for level (generation number) filter and for FELINE index, and number of false negatives for min-post intervals in the spanning tree (for DFS tree) for 1 randomly selected pairs of commits... but I don't think this is a good benchmark. What is a false-positive? A case where gen(X) < gen(Y) but Y cannot reach X? I do not think that is a great benchmark, but I guess it is something to measure. I Linux kernel sources (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) that has 750832 nodes and 811733 edges, and 563747941392 possible directed pairs, we have for 1 randomly selected pairs of commits: level-filter has91 = 0.91% [all] false positives FELINE index has78 = 0.78% [all] false positives FELINE index has 1.16667 less false positives than level filter min-post spanning-tree intervals has 3641 = 36.41% [all] false negatives Perhaps something you can do instead of sampling from N^2 commits in total is to select a pair of generations (say, G = 2, G' = 20100) or regions of generations ( 2 <= G <= 20050, 20100 <= G' <= 20150) and see how many false positives you see by testing all pairs (one from each level). The delta between the generations may need to be smaller to actually have a large proportion of unreachable pairs. Try different levels, since major version releases tend to "pinch" the commit graph to a common history. For git.git repository (https://github.com/git/git.git) that has 52950 nodes and 65887 edges the numbers are slighly more in FELINE index favor (also out of 1 random pairs): level-filter has 504 = 9.11% false positives FELINE index has 125 = 2.26% false positives FELINE index has 4.032 less false positives than level filter This is for FELINE which does not use level / generatio-numbers filter. Thanks, -Stolee
Re: [PATCH v2 06/10] commit.c: use generation to halt paint walk
On 4/10/2018 11:02 PM, Junio C Hamano wrote: Derrick Stolee <dsto...@microsoft.com> writes: @@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc return result; } prio_queue_put(, one); + if (one->generation < min_nonstale_gen) + min_nonstale_gen = one->generation; for (i = 0; i < n; i++) { twos[i]->object.flags |= PARENT2; prio_queue_put(, twos[i]); + if (twos[i]->generation < min_nonstale_gen) + min_nonstale_gen = twos[i]->generation; } - while (queue_has_nonstale()) { + while (queue_has_nonstale(, min_nonstale_gen)) { struct commit *commit = prio_queue_get(); struct commit_list *parents; int flags; + if (commit->generation > last_gen) + BUG("bad generation skip"); + + last_gen = commit->generation; + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc return NULL; p->object.flags |= flags; Hmph. Can a commit that used to be not stale (and contributed to the current value of min_nonstale_gen) become stale here by getting visited twice, invalidating the value in min_nonstale_gen? min_nonstale_gen can be "wrong" in the way you say, but fits the definition from the commit message: "To properly take advantage of this condition, track the minimum generation number of a commit that **enters the queue** with nonstale status." (Emphasis added) You make an excellent point about how this can be problematic. I was confused by the lack of clear performance benefits here, but I think that whatever benefits making queue_has_nonstale() be O(1) were removed by walking more commits than necessary. Consider the following commit graph, where M is a parent of both A and B, S is a parent of M and B, and there is a large set of commits reachable from M with generation number larger than gen(S). A B | __/| |/ | M | |\ | . | | . | | . |_/ |/ S Between A and B, the true merge base is M. Anything reachable from M is marked as stale. When S is added to the queue, it is only reachable from B, so it is non-stale. However, it is marked stale after M is walked. The old code would detect this as a termination condition, but the new code would not. I think this data shape is actually common (not exactly, as it may be that some ancestor of M provides a second path to S) especially in the world of pull requests and users merging master into their topic branches. I'll remove this commit in the next version, but use the new prototype for queue_has_nonstale() in "commit: add short-circuit to paint_down_to_common()" using the given 'min_generation' instead of 'min_nonstale_gen'. Thanks, -Stolee
Re: [PATCH v2 04/10] commit-graph: compute generation numbers
On 4/10/2018 10:51 PM, Junio C Hamano wrote: Derrick Stolee <dsto...@microsoft.com> writes: + if ((*list)->generation != GENERATION_NUMBER_INFINITY) { + if ((*list)->generation > GENERATION_NUMBER_MAX) + die("generation number %u is too large to store in commit-graph", + (*list)->generation); + packedDate[0] |= htonl((*list)->generation << 2); + } How serious do we want this feature to be? On one extreme, we could be irresponsible and say it will be a problem for our descendants in the future if their repositories have more than billion pearls on a single strand, and the above certainly is a reasonable way to punt. Those who actually encounter the problem will notice by Git dying somewhere rather deep in the callchain. Or we could say Git actually does support a history that is arbitrarily long, even though such a deep portion of history will not benefit from having generation numbers in commit-graph. I've been assuming that our stance is the latter and that is why I made noises about overflowing 30-bit generation field in my review of the previous step. In case we want to do the "we know this is very large, but we do not know the exact value", we may actually want a mode where we can pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and make sure that the code to handle overflow behaves sensibly. I agree. I wonder how we can effectively expose this value into a test. It's probably not sufficient to manually test using compiler flags ("-D GENERATION_NUMBER_MAX=8"). + for (i = 0; i < nr_commits; i++) { + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && + commits[i]->generation != GENERATION_NUMBER_ZERO) + continue; + + commit_list_insert(commits[i], ); + while (list) { +... + } + } So we go over the list of commits just _once_ and make sure each of them gets the generation assigned correctly by (conceptually recursively but iteratively in implementation by using a commit list) making sure that all its parents have generation assigned and compute the generation for the commit, before moving to the next one. Which sounds correct. Yes, we compute the generation number of a commit exactly once. We use the list as a stack so we do not have recursion limits during our depth-first search (DFS). We rely on the object cache to ensure we store the computed generation numbers, and computed generation numbers provide termination conditions to the DFS. Thanks, -Stolee
Re: [PATCH v2 03/10] commit: add generation number to struct commmit
On 4/10/2018 10:31 PM, Junio C Hamano wrote: Derrick Stolee <dsto...@microsoft.com> writes: The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use two special values to signal the generation number is invalid: GENERATION_NUMBER_ININITY 0x GENERATION_NUMBER_ZERO 0 The first (_INFINITY) means the generation number has not been loaded or computed. The second (_ZERO) means the generation number was loaded from a commit graph file that was stored before generation numbers were computed. Should it also be possible for a caller to tell if a given commit has too deep a history, i.e. we do not know its generation number exactly, but we know it is larger than 1<<30? It seems that we only have a 30-bit field in the file, so wouldn't we need a special value defined in (e.g. "0") so that we can tell that the commit has such a large generation number? E.g. + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; if (!item->generation) item->generation = GENERATION_NUMBER_OVERFLOW; when we read it from the file? We obviously need to do something similar when assigning a generation number to a child commit, perhaps like #define GENERATION_NUMBER_OVERFLOW (GENERATION_NUMBER_MAX + 1) commit->generation = 1; /* assume no parent */ for (p = commit->parents; p; p++) { uint32_t gen = p->item->generation + 1; if (gen >= GENERATION_NUMBER_OVERFLOW) { commit->generation = GENERATION_NUMBER_OVERFLOW; break; } else if (commit->generation < gen) commit->generation = gen; } or something? And then on the writing side you'd encode too large a generation as '0'. You raise a very good point. How about we do a slightly different arrangement for these overflow commits? Instead of storing the commits in the commit-graph file as "0" (which currently means "written by a version of git that did not compute generation numbers") we could let GENERATION_NUMBER_MAX be the maximum generation of a commit in the commit-graph, and if a commit would have larger generation, we collapse it down to that value. It slightly complicates the diagram I made in Documentation/technical/commit-graph.txt, but it was already a bit of a simplification. Here is an updated diagram, but likely we will want to limit discussion of the special-case GENERATION_NUMBER_MAX to the prose, since it is not a practical situation at the moment. +-+ | GENERATION_NUMBER_INFINITY = 0x | +-+ | | | ^ | | | | | | +--+ | | [gen(A) = gen(B)] | V | ++ | | GENERATION_NUMBER_MAX = 0x3FFF | | ++ | | | ^ | | | | | | +--+ | | [gen(A) = gen(B)] V V +-+ | 0 < commit->generation < 0x3FFF | +-+ | | ^ | | | | +--+ | [gen(A) > gen(B)] V +-+ | GENERATION_NUMBER_ZERO = 0 | +-+ | ^ | | +--+ [gen(A) = gen(B)] Thanks, -Stolee
Re: [PATCH v2 02/10] merge: check config before loading commits
On 4/10/2018 10:12 PM, Junio C Hamano wrote: Derrick Stolee <dsto...@microsoft.com> writes: diff --git a/builtin/merge.c b/builtin/merge.c index ee050a47f3..20897f8223 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1183,13 +1183,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", ); + init_diff_ui_defaults(); + git_config(git_merge_config, NULL); + if (!branch || is_null_oid(_oid)) head_commit = NULL; else head_commit = lookup_commit_or_die(_oid, "HEAD"); - init_diff_ui_defaults(); - git_config(git_merge_config, NULL); Wow, that's tricky. git_merge_config() wants to know which "branch" we are on, and this place is as early as we can move the call to without breaking things. Is this to allow parse_object() called in lookup_commit_reference_gently() to know if we can rely on the data cached in the commit-graph data? When I saw the bug on my machine, I tracked the issue down to a call to parse_commit_in_graph() that skipped the graph check since core_commit_graph was not set. The call stack from this call is as follows: * lookup_commit_or_die() * lookup_commit_reference() * lookup_commit_reference_gently() * parse_object() * parse_object_buffer() * parse_commit_in_graph() [as introduced in PATCH 01/10] Move the config load to be between the initialization of 'branch' and the commit lookup. Also add a test to t5318-commit-graph.sh that exercises this code path to prevent a regression. It is not clear to me how a successful merge of commits/8 demonstrates that reading the config earlier than before is regression free. I didn't want to introduce commits in an order that led to a commit failing tests, but if you drop the change to builtin/merge.c from this series, the tip commit will fail this test with "BUG: bad generation skip". The reason for this failure is that commits/5 is loaded from HEAD from the object database, so its generation is marked as GENERATION_NUMBER_INFINITY, and the commit is marked as parsed. Later, the commit at merges/3 is loaded from the graph with generation 4. This triggers the BUG statement in paint_down_to_common(). That is why it is important to check a fast-forward merge. In the 'graph_git_behavior' steps of t5318-commit-graph.sh, we were already testing 'git merge-base' to check the commit walk logic. Thanks, -Stolee