Re: [PATCH v4 05/10] commit-graph: always load commit-graph information
On 4/29/2018 6:14 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: Most code paths load commits using lookup_commit() and then parse_commit(). And this automatically loads commit graph if needed, thanks to changes in parse_commit_gently(), which parse_commit() uses. In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). I guess the problem is that we cannot just add parse_commit_in_graph() like we did in parse_commit_gently(), for some reason? Like for example that parse_commit_gently() uses parse_commit_buffer() - which could have been handled by moving parse_commit_in_graph() down the call chain from parse_commit_gently() to parse_commit_buffer()... if not the fact that check_commit() also uses parse_commit_buffer(), but it does not want to load commit graph. Am I right? If a caller uses parse_commit_buffer() directly, then we will guarantee that all values in the struct commit that would be loaded from the buffer are loaded from the buffer. This means we do NOT load the root tree id or commit date from the commit-graph file. We do still need to load the data that is not available in the buffer, such as graph_pos and generation. With generation numbers in the commit-graph, we need to ensure that any commit that exists in the commit-graph file has its generation number loaded. Is it generation number, or generation number and position in commit graph? We don't need to ensure the graph_pos (the commit will never be re-parsed, so we will not try to find it in the commit-graph file again), but we DO need to ensure the generation (or our commit walks will be incorrect). We get the graph_pos as a side-effect. Create new load_commit_graph_info() method to fill in the information for a commit that exists only in the commit-graph file. Call it from parse_commit_buffer() after loading the other commit information from the given buffer. 
Only fill this information when specified by the 'check_graph' parameter. I think this commit would be easier to review if it was split into pure refactoring part (extracting fill_commit_graph_info() and find_commit_in_graph()). On the other hand the refactoring was needed to reduce code duplication between existing parse_commit_in_graph() and new load_commit_graph_info() functions. I guess that the difference between parse_commit_in_graph() and load_commit_graph_info() is that the former cares only about having just enough information that is needed for parse_commit_gently() - and does not load graph data if commit is parsed, while the latter is about loading commit-graph data like generation numbers. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 45 ++--- commit-graph.h | 8 commit.c | 7 +-- commit.h | 2 +- object.c | 2 +- sha1_file.c| 2 +- 6 files changed, 46 insertions(+), 20 deletions(-) I wonder if it would be possible to add tests for this feature, for example that commit-graph is read when it should (including those branch lookups), and is not read when the feature should be disabled. But the only way to test it I can think of is a stupid one: create invalid commit graph, and check that git fails as expected (trying to read said malformed file), and does not fail if commit graph feature is disabled. Let me reorder files (BTW, is there a way for Git to put *.h files before *.c files in diff?) for easier review: diff --git a/commit-graph.h b/commit-graph.h index 260a468e73..96cccb10f3 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); */ int parse_commit_in_graph(struct commit *item); +/* + * It is possible that we loaded commit contents from the commit buffer, + * but we also want to ensure the commit-graph content is correctly + * checked and filled. Fill the graph_pos and generation members of + * the given commit. 
+ */ +void load_commit_graph_info(struct commit *item); + struct tree *get_commit_tree_in_graph(const struct commit *c); struct commit_graph { diff --git a/commit-graph.c b/commit-graph.c index 047fa9fca5..aebd242def 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -245,6 +245,12 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, return _list_insert(c, pptr)->next; } +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; +} The comment in the header file commit-graph.h talks about filling graph_pos and generation members of the given commit, but I don't see filling graph_pos member here. We
Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1
On 5/1/2018 7:27 AM, Ævar Arnfjörð Bjarmason wrote: On Tue, May 01 2018, Derrick Stolee wrote: On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote: Since we show the commit data in the output that's nicely aligned once we sort by object type. The decision to show tags before commits is pretty arbitrary, but it's much less likely that we'll display a tag, so if there is one it makes sense to show it first. Here's a non-arbitrary reason: the object types are ordered topologically (ignoring self-references): tag -> commit, tree, blob commit -> tree tree -> blob Thanks. I'll add a patch with that comment to v2. @@ -421,7 +451,12 @@ static int get_short_oid(const char *name, int len, struct object_id *oid, ds.fn = NULL; advise(_("The candidates are:")); - for_each_abbrev(ds.hex_pfx, show_ambiguous_object, ); + for_each_abbrev(ds.hex_pfx, collect_ambiguous, ); + QSORT(collect.oid, collect.nr, sort_ambiguous); I was wondering how the old code sorted by SHA even when the ambiguous objects were loaded from different sources (multiple pack-files, loose objects). Turns out that for_each_abbrev() does its own sort after collecting the SHAs and then calls the given function pointer only once per distinct object. This avoids multiple instances of the same object, which may appear multiple times across pack-files. I only ask because now we are doing two sorts. I wonder if it would be more elegant to provide your sorting algorithm to for_each_abbrev() and let it call show_ambiguous_object as before. Another question is if we should use this sort generally for all calls to for_each_abbrev(). The only other case I see is in builtin/revparse.c. When preparing v2 I realized how confusing this was, so I'd added this to the commit message of my WIP re-roll which should explain this: A note on the implementation: I started out with something much simpler which just replaced oid_array_sort() in sha1-array.c with a custom sort function before calling oid_array_for_each_unique(). 
But then dumbly noticed that it doesn't work because the output function was tangled up with the code added in fad6b9e590 ("for_each_abbrev: drop duplicate objects", 2016-09-26) to ensure we don't display duplicate objects. That's why we're doing two passes here, first we need to sort the list and de-duplicate the objects, then sort them in our custom order, and finally output them without re-sorting them. I suppose we could also make oid_array_for_each_unique() maintain a hashmap of emitted objects, but that would increase its memory profile and wouldn't be worth the complexity for this one-off use-case, oid_array_for_each_unique() is used in many other places. How would sorting in our custom order before de-duplicating fail the de-duplication? We will still pair identical OIDs as consecutive elements and oid_array_for_each_unique only cares about consecutive elements having distinct OIDs, not lex-ordered OIDs. Perhaps the noise is because we rely on oid_array_sort() to mark the array as sorted inside oid_array_for_each_unique(), but that could be remedied by calling our QSORT() inside for_each_abbrev() and marking the array as sorted before calling oid_array_for_each_unique(). (Again, my comments are not meant to block this series.) Thanks, -Stolee
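The point about consecutive duplicates can be illustrated with a small standalone sketch. It is not Git code: struct entry, entry_cmp() and sort_and_dedup() are hypothetical names made up for this illustration, and the type is stored as a precomputed rank instead of being looked up. It only shows that sorting by (type, hash) still places identical entries next to each other, so a single pass comparing neighbours is enough to drop repeats.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an object entry: a type rank plus a short hash. */
struct entry {
	int type_rank;	/* 0 = tag, 1 = commit, 2 = tree, 3 = blob */
	char hex[8];	/* abbreviated hash, NUL-terminated */
};

/* Order by type first, then by hash; identical entries compare as 0. */
static int entry_cmp(const void *a, const void *b)
{
	const struct entry *ea = a;
	const struct entry *eb = b;

	if (ea->type_rank != eb->type_rank)
		return ea->type_rank < eb->type_rank ? -1 : 1;
	return strcmp(ea->hex, eb->hex);
}

/*
 * Sort in the custom order, then keep only entries that differ from
 * the previously kept one. Returns the number of distinct entries,
 * which end up compacted at the front of the array.
 */
static size_t sort_and_dedup(struct entry *e, size_t nr)
{
	size_t i, out = 0;

	qsort(e, nr, sizeof(*e), entry_cmp);
	for (i = 0; i < nr; i++) {
		if (out && !entry_cmp(&e[out - 1], &e[i]))
			continue;	/* same object as the previous one */
		e[out++] = e[i];
	}
	return out;
}
```

The sketch relies on the same observation made in the message above: an object has exactly one type, so two copies of the same object can never be separated by the type-first ordering.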
Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1
On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote: Change the output emitted when an ambiguous object is encountered so that we show tags first, then commits, followed by trees, and finally blobs. Within each type we show objects in hashcmp(). Before this change the objects were only ordered by hashcmp(). The reason for doing this is that the output looks better as a result, e.g. the v2.17.0 tag before this change on "git show e8f2" would display: hint: The candidates are: hint: e8f2093055 tree hint: e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached HEAD abbreviated object name hint: e8f21d02f7 blob hint: e8f21d577c blob hint: e8f25a3a50 tree hint: e8f26250fa commit 2017-02-03 - Merge pull request #996 from jeffhostetler/jeffhostetler/register_rename_src hint: e8f2650052 tag v2.17.0 hint: e8f2867228 blob hint: e8f28d537c tree hint: e8f2a35526 blob hint: e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for multiple remote.url entries hint: e8f2cf6ec0 tree Now we'll instead show: hint: e8f2650052 tag v2.17.0 hint: e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached HEAD abbreviated object name hint: e8f26250fa commit 2017-02-03 - Merge pull request #996 from jeffhostetler/jeffhostetler/register_rename_src hint: e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for multiple remote.url entries hint: e8f2093055 tree hint: e8f25a3a50 tree hint: e8f28d537c tree hint: e8f2cf6ec0 tree hint: e8f21d02f7 blob hint: e8f21d577c blob hint: e8f2867228 blob hint: e8f2a35526 blob Since we show the commit data in the output that's nicely aligned once we sort by object type. The decision to show tags before commits is pretty arbitrary, but it's much less likely that we'll display a tag, so if there is one it makes sense to show it first. 
Here's a non-arbitrary reason: the object types are ordered topologically (ignoring self-references): tag -> commit, tree, blob commit -> tree tree -> blob Signed-off-by: Ævar Arnfjörð Bjarmason--- sha1-array.c | 15 +++ sha1-array.h | 3 +++ sha1-name.c | 37 - 3 files changed, 54 insertions(+), 1 deletion(-) diff --git a/sha1-array.c b/sha1-array.c index 838b3bf847..48bd9e9230 100644 --- a/sha1-array.c +++ b/sha1-array.c @@ -41,6 +41,21 @@ void oid_array_clear(struct oid_array *array) array->sorted = 0; } + +int oid_array_for_each(struct oid_array *array, + for_each_oid_fn fn, + void *data) +{ + int i; + + for (i = 0; i < array->nr; i++) { + int ret = fn(array->oid + i, data); + if (ret) + return ret; + } + return 0; +} + int oid_array_for_each_unique(struct oid_array *array, for_each_oid_fn fn, void *data) diff --git a/sha1-array.h b/sha1-array.h index 1e1d24b009..232bf95017 100644 --- a/sha1-array.h +++ b/sha1-array.h @@ -16,6 +16,9 @@ void oid_array_clear(struct oid_array *array); typedef int (*for_each_oid_fn)(const struct object_id *oid, void *data); +int oid_array_for_each(struct oid_array *array, + for_each_oid_fn fn, + void *data); int oid_array_for_each_unique(struct oid_array *array, for_each_oid_fn fn, void *data); diff --git a/sha1-name.c b/sha1-name.c index 9d7bbd3e96..46d8b1afa6 100644 --- a/sha1-name.c +++ b/sha1-name.c @@ -378,6 +378,34 @@ static int collect_ambiguous(const struct object_id *oid, void *data) return 0; } +static int sort_ambiguous(const void *a, const void *b) +{ + int a_type = oid_object_info(a, NULL); + int b_type = oid_object_info(b, NULL); + int a_type_sort; + int b_type_sort; + + /* +* Sorts by hash within the same object type, just as +* oid_array_for_each_unique() would do. +*/ + if (a_type == b_type) + return oidcmp(a, b); + + /* +* Between object types show tags, then commits, and finally +* trees and blobs. +* +* The object_type enum is commit, tree, blob, tag, but we +* want tag, commit, tree blob. 
Cleverly (perhaps too +* cleverly) do that with modulus, since the enum assigns 1 to +* commit, so tag becomes 0. +*/ I appreciate this comment. Clever things should be marked as such. + a_type_sort = a_type % 4; + b_type_sort = b_type % 4; + return a_type_sort > b_type_sort ? 1 : -1; +} + static int get_short_oid(const char *name, int len, struct object_id *oid, unsigned flags) { @@ -409,6 +437,8 @@
Re: [PATCH 0/9] get_short_oid UI improvements
On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote:

I started out just wanting to do 04/09 so I'd get prettier output, but then noticed that ^{tag}, ^{commit}, ^{blob} and ^{tree} didn't behave as expected with the disambiguation output, and that core.disambiguate had never been documented.

Ævar Arnfjörð Bjarmason (9):
  sha1-name.c: remove stray newline
  sha1-array.h: align function arguments
  sha1-name.c: move around the collect_ambiguous() function
  get_short_oid: sort ambiguous objects by type, then SHA-1
  get_short_oid: learn to disambiguate by ^{tag}
  get_short_oid: learn to disambiguate by ^{blob}
  get_short_oid / peel_onion: ^{tree} should mean tree, not treeish
  get_short_oid / peel_onion: ^{tree} should mean commit, not commitish
  config doc: document core.disambiguate

 Documentation/config.txt            | 14 ++
 cache.h                             |  5 ++-
 sha1-array.c                        | 15 +++
 sha1-array.h                        |  7 ++-
 sha1-name.c                         | 69 -
 t/t1512-rev-parse-disambiguation.sh | 32 ++---
 6 files changed, 120 insertions(+), 22 deletions(-)

This is a good series. Please take a look at my suggestion in Patch 4/9, but feel free to keep this series as written.

Reviewed-by: Derrick Stolee <dsto...@microsoft.com>
Re: [PATCH v4 10/10] commit-graph.txt: update design document
On 4/30/2018 7:32 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: We now calculate generation numbers in the commit-graph file and use them in paint_down_to_common(). Expand the section on generation numbers to discuss how the three special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and _MAX interact with other generation numbers. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> Looks good. --- Documentation/technical/commit-graph.txt | 30 +++- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index 0550c6d0dc..d9f2713efa 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite" generation number and walk until reaching commits with known generation number. +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not +in the commit-graph file. If a commit-graph file was written by a version +of Git that did not compute generation numbers, then those commits will +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. + +Since the commit-graph file is closed under reachability, we can guarantee +the following weaker condition on all commits: + +If A and B are commits with generation numbers N and M, respectively, +and N < M, then A cannot reach B. + +Note how the strict inequality differs from the inequality when we have +fully-computed generation numbers. Using strict inequality may result in +walking a few extra commits, The Linux kernel commit graph has a maximum of 513 commits sharing the same generation number, but it is 5.43 commits sharing the same generation number on average, with standard deviation 10.70; median is even lower: it is 2, with 5.35 median absolute deviation (MAD). So on average it would be a few extra commits. Right. 
but the simplicity in dealing with commits +with generation number *_INFINITY or *_ZERO is valuable. As I wrote before, handling those corner cases is more complicated, but not that complicated. We could simply use stronger condition if both generation numbers are ordinary generation numbers, and weaker condition when at least one generation number has one of those special values. + +We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF for commits whose +generation numbers are computed to be at least this value. We limit at +this value since it is the largest value that can be stored in the +commit-graph file using the 30 bits available to generation numbers. This +presents another case where a commit can have generation number equal to +that of a parent. Ordinary generation numbers, where stronger condition holds, are those with GENERATION_NUMBER_ZERO < gen(C) < GENERATION_NUMBER_MAX. + Design Details -- @@ -98,17 +121,12 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - Good. - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following - operations are important candidates: + operation is an important candidate: -- paint_down_to_common() - 'log --topo-order' Other possible candidates: - remove_redundant() - see comment in previous patch - still_interesting() - where Git uses date slop to stop walking too far remove_redundant() will be included in v5, thanks. 
Instead of "still_interesting()" I'll add "git tag --merged" as the candidate to consider, as discussed in [1]. [1] https://public-inbox.org/git/87fu3g67ry@lant.ki.iif.hu/t/#u "branch --contains / tag --merged inconsistency" - Currently, parse_commit_gently() requires filling in the root tree One important issue left is handling features that change view of project history, and their interaction with commit-graph feature. What would happen, if we turn on commit-graph feature, generate commit graph file, and then: * use graft file or remove graft entries to cut history, or remove cut or join two [independent] histories. * use git-replace mechanism to do the same * in shallow clone, deepen or shorten the clone What would happen if without re-generating commit-graph file (assuming that Git wouldn't do it f
Re: [PATCH v2 06/11] get_short_oid: sort ambiguous objects by type, then SHA-1
On 5/1/2018 9:39 AM, Ævar Arnfjörð Bjarmason wrote: On Tue, May 01 2018, Derrick Stolee wrote: From: Ævar Arnfjörð Bjarmason <ava...@gmail.com> Here is what I mean by sorting during for_each_abbrev(). This seems to work for me, so I don't know what the issue is with this one-pass approach. [...] +static int sort_ambiguous(const void *a, const void *b) +{ + int a_type = oid_object_info(a, NULL); + int b_type = oid_object_info(b, NULL); + int a_type_sort; + int b_type_sort; + + /* +* Sorts by hash within the same object type, just as +* oid_array_for_each_unique() would do. +*/ + if (a_type == b_type) + return oidcmp(a, b); + + /* +* Between object types show tags, then commits, and finally +* trees and blobs. +* +* The object_type enum is commit, tree, blob, tag, but we +* want tag, commit, tree blob. Cleverly (perhaps too +* cleverly) do that with modulus, since the enum assigns 1 to +* commit, so tag becomes 0. +*/ + a_type_sort = a_type % 4; + b_type_sort = b_type % 4; + return a_type_sort > b_type_sort ? 1 : -1; +} + static int get_short_oid(const char *name, int len, struct object_id *oid, unsigned flags) { @@ -451,6 +479,9 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn fn, void *cb_data) find_short_object_filename(); find_short_packed_object(); + QSORT(collect.oid, collect.nr, sort_ambiguous); + collect.sorted = 1; + Yes this works. You're right. I wasn't trying to intentionally omit stuff in my recent 878t93zh60@evledraar.gmail.com, I'd just written this code some days ago and forgotten why I did what I was doing (and this is hard to test for), but it's all coming back to me now. The actual requirement for oid_array_for_each_unique() working properly is that you've got to feed it in hash order, To work properly, duplicate entries must be consecutive. Since duplicate entries have the same type, our sort satisfies this condition. 
but my new sort_ambiguous() still does that (barring any SHA-1 collisions, at which point we have bigger problems), so two passes aren't needed. So yes, this approach works and is one-pass. But that's just an implementation detail of the current sort method, when I wrote this I was initially playing with other sort orders, e.g. sorting SHAs regardless of type by the mtime of the file I found them in. With this approach I'd start printing duplicates if I changed the internals of sort_ambiguous() like that. That makes sense. But I think it's extremely implausible that we'll start sorting things like that, so I'll just take this method of doing it and add some comment saying we must hashcmp() the entries in our own sort function for the de-duplication to work, I don't see us ever changing that. Sounds good. Thanks, -Stolee
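For readers puzzling over the modulus trick in the quoted sort_ambiguous(), here is a minimal standalone sketch of just that part. The enum below is a toy with hypothetical TOY_* names; its values follow the ordering described in the quoted comment (commit is 1, then tree, blob, tag), and nothing here calls into Git. Unlike the quoted comparator, this one also returns 0 for equal types, because it has no hash to fall back on.

```c
#include <stdlib.h>

/* Toy enum mirroring the order named in the comment: commit = 1 ... tag = 4. */
enum toy_object_type {
	TOY_COMMIT = 1,
	TOY_TREE = 2,
	TOY_BLOB = 3,
	TOY_TAG = 4
};

/* Map commit, tree, blob, tag (1, 2, 3, 4) to ranks 1, 2, 3, 0. */
static int type_rank(enum toy_object_type t)
{
	return (int)t % 4;
}

/* qsort() comparator ordering types as tag, commit, tree, blob. */
static int type_cmp(const void *a, const void *b)
{
	int ra = type_rank(*(const enum toy_object_type *)a);
	int rb = type_rank(*(const enum toy_object_type *)b);

	return (ra > rb) - (ra < rb);
}
```

The trick only works because tag happens to be the last of four consecutive values starting at 1; if the enum were ever renumbered, an explicit lookup table would be the safer choice.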
Re: [PATCH v5 09/11] commit: use generation number in remove_redundant()
On 5/1/2018 8:47 AM, Derrick Stolee wrote: The static remove_redundant() method is used to filter a list of commits by removing those that are reachable from another commit in the list. This is used to remove all possible merge-bases except a maximal, mutually independent set. To determine these commits are independent, we use a number of paint_down_to_common() walks and use the PARENT1, PARENT2 flags to determine reachability. Since we only care about reachability and not the full set of merge-bases between 'one' and 'twos', we can use the 'min_generation' parameter to short-circuit the walk. When no commit-graph exists, there is no change in behavior. For a copy of the Linux repository, we measured the following performance improvements: git merge-base v3.3 v4.5 Before: 234 ms After: 208 ms Rel %: -11% git merge-base v4.3 v4.5 Before: 102 ms After: 83 ms Rel %: -19% The experiments above were chosen to demonstrate that we are improving the filtering of the merge-base set. In the first example, more time is spent walking the history to find the set of merge bases before the remove_redundant() call. The starting commits are closer together in the second example, therefore more time is spent in remove_redundant(). The relative change in performance differs as expected. Reported-by: Jakub Narebski <jna...@gmail.com> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 9875feec01..5064db4e61 100644 --- a/commit.c +++ b/commit.c @@ -949,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt) parse_commit(array[i]); for (i = 0; i < cnt; i++) { struct commit_list *common; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; This initialization should be uint32_t min_generation = array[i]->generation; since the assignment (using j) below skips the ith commit. 
if (redundant[i]) continue; @@ -957,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt) continue; filled_index[filled] = j; work[filled++] = array[j]; + + if (array[j]->generation < min_generation) + min_generation = array[j]->generation; } - common = paint_down_to_common(array[i], filled, work, 0); + common = paint_down_to_common(array[i], filled, work, + min_generation); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++)
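The correction about the initial value can be sketched in isolation. The following is a toy, not the patch itself: struct toy_commit and min_generation_for() are hypothetical names, and the skip condition inside the loop (same index, or already marked redundant) is an assumption, since the quoted hunk shows only the bare "continue". The sketch shows that seeding the minimum with array[i]->generation makes the result cover the ith commit as well as the others.

```c
#include <stdint.h>

/* Hypothetical stand-in carrying only the field this sketch needs. */
struct toy_commit {
	uint32_t generation;
};

/*
 * Smallest generation among array[i] and every other commit that is
 * not marked redundant. Seeding with array[i]->generation matters
 * because the loop below never visits index i itself.
 */
static uint32_t min_generation_for(struct toy_commit **array,
				   const int *redundant, int cnt, int i)
{
	uint32_t min_generation = array[i]->generation;
	int j;

	for (j = 0; j < cnt; j++) {
		if (i == j || redundant[j])
			continue;	/* assumed skip condition */
		if (array[j]->generation < min_generation)
			min_generation = array[j]->generation;
	}
	return min_generation;
}
```

With generations 5, 9 and 7 and i = 0, an "infinite" seed would yield 7 while this version yields 5, so a walk limited by the result is not cut off above the ith commit.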
Re: branch --contains / tag --merged inconsistency
On 4/27/2018 12:03 PM, SZEDER Gábor wrote: Szia Feri, I'm moving the IRC discussion here, because this might be a bug report in the end. So, kindly try these steps (103 MB free space required): $ git clone https://github.com/ClusterLabs/pacemaker.git && cd pacemaker [...] $ git branch --contains Pacemaker-0.6.1 * master $ git tag --merged master | fgrep Pacemaker-0.6 Pacemaker-0.6.0 Pacemaker-0.6.2 Pacemaker-0.6.3 Pacemaker-0.6.4 Pacemaker-0.6.5 Pacemaker-0.6.6 Notice that Pacemaker-0.6.1 is missing from the output. Kind people on IRC didn't find a quick explanation, and we all had to go eventually. Is this expected behavior? Reproduced with git 2.11.0 and 2.17.0. The commit pointed to by the tag Pacemaker-0.6.1 and its parent have a serious clock skew, i.e. they are a few months older than their parents: $ git log --format='%h %ad %cd%d%n%s' --date=short Pacemaker-0.6.1^..47a8ef4c 47a8ef4ce 2008-02-15 2008-02-15 Low: TE: Logging - display the op's magic field for unexpected and foreign events b9cfcd6b4 2007-12-10 2007-12-10 (tag: Pacemaker-0.6.2) haresources2cib.py: set default-action-timeout to the default (20s) 52e7793e0 2007-12-10 2007-12-10 haresources2cib.py: update ra parameters lists dea277271 2008-02-14 2008-02-14 Medium: Build: Turn on snmp support in rpm packages (patch from MATSUDA, Daiki) f418742fe 2008-02-14 2008-02-14 Low: Build: Update the .spec file with the one used by build service ccfa716a5 2008-02-14 2008-02-14 Medium: SNMP: Allow the snmp subagent to be built (patch from MATSUDA, Daiki) 50f0ade2d 2008-02-14 2008-02-14 Low: Build: Update last release number 90f11667f 2008-02-14 2008-02-14 Medium: Tools: Make sure the autoconf variables in haresources2cib are expanded 9d2383c46 2008-02-11 2008-02-11 (tag: Pacemaker-0.6.1) High: cib: Ensure the archived file hits the disk before returning (branch|tag|describe|...) 
(--contains|--merged) use the commit timestamp information as a heuristic to avoid traversing parts of the history, which makes these operations, especially on big histories, an order of magnitude or two faster. Yeah, commit timestamps can't always be trusted, but skewed commits are rare, and skewed commits with this much skew are even rarer. I'm not sure how (or if it's at all possible) to turn off this timestamp-based optimisation. This is actually a bit more complicated. The "--merged" check in 'git tag' uses a different mechanism to detect which tags are reachable. It uses a revision walk starting at the "merge commit" (master in your case) and all tags with the "limited" option (to ensure the walk happens during prepare_revision_walk()) but marks the merge commit as UNINTERESTING. The limit_list() method stops when all commits are marked UNINTERESTING - minus some "slop" related to the commits that start the walk. One important note: the set of tags is important here. If you add a new tag to the root commit (git tag MyTag a2d71961f) then the walk succeeds by ensuring it walks until MyTag. This gets around the clock skew issue. There may be other more-recent tags with a clock-skew issue, but since Pacemaker-0.6.0 is the oldest tag, that requires the walk to continue until at least that date. The commit-walk machinery in revision.c is rather complicated, and is used for a lot of different reasons, such as "git log" and this application in "git tag". It is on my list to refactor this code to use the commit-graph and generation numbers, but as we can see by this example, it is not easy to tease out what is happening in the code. In a world where generation numbers are expected to be available, we could rewrite do_merge_filter() in ref-filter.c to call into paint_down_to_common() in commit.c using the new "min_generation" marker. 
By assigning the tags to be in the "twos" list and the merge commit in the "one" commit, we can check if the tags have the PARENT1 flag after the walk in paint_down_to_common(). Due to the static nature of paint_down_to_common(), we will likely want to abstract this into an external method in commit.c, say can_reach_many(struct commit *from, struct commit_list *to). FWIW, much work is being done on a cached commit graph including commit generation numbers, which will solve this issue both correctly and more efficiently. Perhaps it will already be included in the next release. The work in ds/generation-numbers is focused on the "git tag --contains" method, which does return correctly here (it is the reverse of the --merged condition): Which tags can reach Pacemaker-0.6.1? $ git tag --contains Pacemaker-0.6.1 (returns a big list) This is the actual reverse lookup (which branches contain this tag?) $ git branch --contains Pacemaker-0.6.1 | grep master * master These commands work despite clock skew. The commit-graph feature makes them faster. Thanks, -Stolee
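To make the idea of a generation-limited reachability check concrete, here is a small self-contained sketch on a toy DAG. It is not the proposed can_reach_many(): the names toy_commit, compute_generation() and toy_can_reach() are hypothetical, it handles a single target instead of a list, and it uses plain recursion where real code would use flags and a priority queue. It relies only on the property that a commit with a smaller generation number cannot reach one with a larger generation number, so no commit dates are consulted.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PARENTS 2

/* Hypothetical toy commit: parents plus a cached generation number. */
struct toy_commit {
	struct toy_commit *parents[TOY_MAX_PARENTS];
	int nr_parents;
	uint32_t generation;	/* 0 means "not computed yet" */
};

/* Generation is 1 for a root, else 1 + the largest parent generation. */
static uint32_t compute_generation(struct toy_commit *c)
{
	uint32_t max = 0;
	int i;

	if (c->generation)
		return c->generation;
	for (i = 0; i < c->nr_parents; i++) {
		uint32_t g = compute_generation(c->parents[i]);
		if (g > max)
			max = g;
	}
	c->generation = max + 1;
	return c->generation;
}

/*
 * Can 'from' reach 'to' by following parent links? Generations must
 * already be computed. The walk stops descending as soon as it drops
 * below the target's generation, whatever the commit dates say.
 */
static int toy_can_reach(const struct toy_commit *from,
			 const struct toy_commit *to)
{
	int i;

	if (from == to)
		return 1;
	if (from->generation < to->generation)
		return 0;
	for (i = 0; i < from->nr_parents; i++)
		if (toy_can_reach(from->parents[i], to))
			return 1;
	return 0;
}
```

Because the cutoff depends only on the shape of the graph, a clock-skewed commit such as the one in the Pacemaker example would not change the answer in this toy model.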
Re: [PATCH v4 02/10] commit: add generation number to struct commmit
On 4/28/2018 6:35 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Very minor nitpick: it would be more readable wrapped differently: * If a commit A has parents, then the generation number of A is one more than the maximum generation number among parents of A. Very minor nitpick: possibly "parents", not "the parents", but I am not native English speaker. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use three special values to signal the generation number is invalid: GENERATION_NUMBER_INFINITY 0xFFFFFFFF GENERATION_NUMBER_MAX 0x3FFFFFFF GENERATION_NUMBER_ZERO 0 The first (_INFINITY) means the generation number has not been loaded or computed. The second (_MAX) means the generation number is too large to store in the commit-graph file. The third (_ZERO) means the generation number was loaded from a commit graph file that was written by a version of git that did not support generation numbers. Good explanation; I wonder if we want to have it in some shortened form also in comments, and not only in the commit message. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- alloc.c| 1 + commit-graph.c | 2 ++ commit.h | 4 3 files changed, 7 insertions(+) I have reordered patches to make it easier to review. diff --git a/commit.h b/commit.h index 23a3f364ed..aac3b8c56f 100644 --- a/commit.h +++ b/commit.h @@ -10,6 +10,9 @@ #include "pretty.h" #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF +#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF +#define GENERATION_NUMBER_MAX 0x3FFFFFFF +#define GENERATION_NUMBER_ZERO 0 I wonder if it wouldn't be good to have some short in-line comments explaining those constants, or a block comment above them. 
struct commit_list { struct commit *item; @@ -30,6 +33,7 @@ struct commit { */ struct tree *maybe_tree; uint32_t graph_pos; + uint32_t generation; }; extern int save_commit_buffer; All right, simple addition of the new field. Nothing to go wrong here. Sidenote: With 0x7FFFFFFF being (if I am not wrong) maximum graph_pos and maximum number of nodes in commit graph, we won't hit 0x3FFFFFFF generation number limit for all except very, very linear histories. Both of these limits are far away from being realistic. But we could extend the maximum graph_pos independently from the maximum generation number now that we have the "capped" logic. diff --git a/alloc.c b/alloc.c index cf4f8b61e1..e8ab14f4a1 100644 --- a/alloc.c +++ b/alloc.c @@ -94,6 +94,7 @@ void *alloc_commit_node(void) c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); c->graph_pos = COMMIT_NOT_FROM_GRAPH; + c->generation = GENERATION_NUMBER_INFINITY; return c; } All right, start with initializing it with "not from commit-graph" value after allocation. diff --git a/commit-graph.c b/commit-graph.c index 70fa1b25fd..9ad21c3ffb 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin date_low = get_be32(commit_data + g->hash_len + 12); item->date = (timestamp_t)((date_high << 32) | date_low); + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; + I guess we should not worry about these "magical constants" sprinkled here, like "+ 8" above. Let's examine how it goes, taking a look at commit-graph-format.txt in Documentation/technical/commit-graph-format.txt * The first H (g->hash_len) bytes are for the OID of the root tree. * The next 8 bytes are for the positions of the first two parents [...] So 'commit_data + g->hash_len + 8' is our offset from the start of commit data. All right. * The next 8 bytes store the generation number of the commit and the commit time in seconds since EPOCH. 
The generation number uses the higher 30 bits of the first 4 bytes. [...] The higher 30 bits of the 4 bytes, which is 32 bits, means that we need to shift 32-bit value 2 bits right, so that we get lower 30 bits of 32-bit value. All right. All 4-byte numbers are in network order. Shouldn't it be ntohl() to convert from network order to host order, and not get_be32()? I guess they are the same (network order is big-endian order), and get_be32() is what rest of git uses... ntohl() takes a 32-bit value, while get_be32() takes a pointer. This makes pulling network-bytes out of streams much cleaner with get_be32(), so I try to use that whenever possible.
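To make the two ideas in this exchange concrete (the recursive definition of a generation number, and pulling it out of the upper 30 bits of a big-endian word), here is a minimal sketch in C. It is an illustration only: read_be32() is a hypothetical stand-in for git's get_be32(), and decode_generation() and generation_from_parents() are names invented for this note, not functions from the patch. The sketch does not model the capping at GENERATION_NUMBER_MAX, and it assumes the record layout quoted above (root tree OID, then 8 bytes of parent positions, then the generation/date word).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for git's get_be32(): read a 32-bit value in
 * network (big-endian) order from a byte pointer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Illustrative helper: the generation number occupies the upper 30 bits
 * of the 4 bytes that follow the root tree OID (hash_len bytes) and the
 * two 4-byte parent positions (8 bytes), so shift the low 2 bits away. */
static uint32_t decode_generation(const unsigned char *commit_data,
				  size_t hash_len)
{
	return read_be32(commit_data + hash_len + 8) >> 2;
}

/* Illustrative helper for the recursive definition: a commit with no
 * parents has generation 1; otherwise it is one more than the maximum
 * generation among its parents. Capping is deliberately not modeled. */
static uint32_t generation_from_parents(const uint32_t *parent_gens,
					size_t nr_parents)
{
	uint32_t max = 0;
	size_t i;

	for (i = 0; i < nr_parents; i++)
		if (parent_gens[i] > max)
			max = parent_gens[i];
	return max + 1;
}
```

Taking a pointer rather than a value is what makes the get_be32() style convenient here: the caller only does pointer arithmetic on the record and never has to copy the word into an aligned variable first, which ntohl() would require.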
Re: [PATCH] coccinelle: avoid wrong transformation suggestions from commit.cocci
On 4/30/2018 5:31 AM, SZEDER Gábor wrote: The semantic patch 'contrib/coccinelle/commit.cocci' added in 2e27bd7731 (treewide: replace maybe_tree with accessor methods, 2018-04-06) is supposed to "ensure that all references to the 'maybe_tree' member of struct commit are either mutations or accesses through get_commit_tree()". So get_commit_tree() clearly must be able to directly access the 'maybe_tree' member, and 'commit.cocci' has a bit of a roundabout workaround to ensure that get_commit_tree()'s direct access in its return statement is not transformed: after all references to 'maybe_tree' have been transformed to a call to get_commit_tree(), including the reference in get_commit_tree() itself, the last rule transforms back a 'return get_commit_tree()' statement, back then found only in get_commit_tree() itself, to a direct access. Unfortunately, already the very next commit shows that this workaround is insufficient: 7b8a21dba1 (commit-graph: lazy-load trees for commits, 2018-04-06) extends get_commit_tree() with a condition directly accessing the 'maybe_tree' member, and Coccinelle with 'commit.cocci' promptly detects it and suggests a transformation to avoid it. This transformation is clearly wrong, because calling get_commit_tree() to access 'maybe_tree' _in_ get_commit_tree() would obviously lead to recursion. Furthermore, the same commit added another, more specialized getter function get_commit_tree_in_graph(), whose legitimate direct access to 'maybe_tree' triggers a similar wrong transformation suggestion. Thanks for catching this, Szeder. Sorry for the noise. Exclude both of these getter functions from the general rule in 'commit.cocci' that matches their direct accesses to 'maybe_tree'. Also exclude load_tree_for_commit(), which, as static helper funcion of get_commit_tree_in_graph(), has legitimate direct access to 'maybe_tree' as well. This is an interesting feature of Coccinelle. Happy to learn it. 
The last rule transforming back 'return get_commit_tree()' statements to direct accesses thus became unnecessary, remove it. Signed-off-by: SZEDER Gábor <szeder@gmail.com> I applied this locally on 'next' and ran the check. I succeeded with no changes. Thanks! Reviewed-by: Derrick Stolee <dsto...@microsoft.com> --- contrib/coccinelle/commit.cocci | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/contrib/coccinelle/commit.cocci b/contrib/coccinelle/commit.cocci index ac38525941..a7e9215ffc 100644 --- a/contrib/coccinelle/commit.cocci +++ b/contrib/coccinelle/commit.cocci @@ -10,11 +10,15 @@ expression c; - c->maybe_tree->object.oid.hash + get_commit_tree_oid(c)->hash +// These excluded functions must access c->maybe_tree direcly. @@ +identifier f !~ "^(get_commit_tree|get_commit_tree_in_graph|load_tree_for_commit)$"; expression c; @@ + f(...) {... - c->maybe_tree + get_commit_tree(c) + ...} @@ expression c; @@ -22,9 +26,3 @@ expression s; @@ - get_commit_tree(c) = s + c->maybe_tree = s - -@@ -expression c; -@@ -- return get_commit_tree(c); -+ return c->maybe_tree;
[RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes
Commentary: This file format uses the large offsets from the pack-index version 2 format, but drops the CRC32 hashes from that format. Also: I included the HASH footer at the end only because it is already in the pack and pack-index formats, but not because it is particularly useful here. If possible, I'd like to remove it and speed up MIDX writes. -- >8 -- Add a technical documentation file describing the design for the multi-pack index (MIDX). Includes current limitations and future work. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/multi-pack-index.txt | 149 +++ 1 file changed, 149 insertions(+) create mode 100644 Documentation/technical/multi-pack-index.txt diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt new file mode 100644 index 00..d31b03dec5 --- /dev/null +++ b/Documentation/technical/multi-pack-index.txt @@ -0,0 +1,149 @@ +Multi-Pack-Index (MIDX) Design Notes + + +The Git object directory contains a 'pack' directory containing +packfiles (with suffix ".pack") and pack-indexes (with suffix +".idx"). The pack-indexes provide a way to lookup objects and +navigate to their offset within the pack, but these must come +in pairs with the packfiles. This pairing depends on the file +names, as the pack-index differs only in suffix with its pack- +file. While the pack-indexes provide fast lookup per packfile, +this performance degrades as the number of packfiles increases, +because abbreviations need to inspect every packfile and we are +more likely to have a miss on our most-recently-used packfile. +For some large repositories, repacking into a single packfile +is not feasible due to storage space or excessive repack times. + +The multi-pack-index (MIDX for short, with suffix ".midx") +stores a list of objects and their offsets into multiple pack- +files. It contains: + +- A list of packfile names. +- A sorted list of object IDs. 
+- A list of metadata for the ith object ID including: + - A value j referring to the jth packfile. + - An offset within the jth packfile for the object. +- If large offsets are required, we use another list of large + offsets similar to version 2 pack-indexes. + +Thus, we can provide O(log N) lookup time for any number +of packfiles. + +A new config setting 'core.midx' must be enabled before writing +or reading MIDX files. + +The MIDX files are updated by the 'midx' builtin with the +following common parameter combinations: + +- 'git midx' gives the hash of the current MIDX head. +- 'git midx --write --update-head --delete-expired' writes a new + MIDX file, points the MIDX head to that file, and deletes the + existing MIDX file if out-of-date. +- 'git midx --read' lists some basic information about the current + MIDX head. Used for basic tests. +- 'git midx --clear' deletes the current MIDX head. + +Design Details +-- + +- The MIDX file refers only to packfiles in the same directory + as the MIDX file. + +- A special file, 'midx-head', stores the hash of the latest + MIDX file so we can load the file without performing a dirstat. + This file is especially important with incremental MIDX files, + pointing to the newest file. + +- If a packfile exists in the pack directory but is not referenced + by the MIDX file, then the packfile is loaded into the packed_git + list and Git can access the objects as usual. This behavior is + necessary since other tools could add packfiles to the pack + directory without notifying Git. + +- The MIDX file should be only a supplemental structure. If a + user downgrades or disables the `core.midx` config setting, + then the existing .idx and .pack files should be sufficient + to operate correctly. + +- The file format includes parameters for the object id length + and hash algorithm, so a future change of hash algorithm does + not require a change in format. 
+ +- If an object appears in multiple packfiles, then only one copy + is stored in the MIDX. This has a possible performance issue: + If an object appears as the delta-base of multiple objects from + multiple packs, then cross-pack delta calculations may slow down. + This is currently only theoretical and has not been demonstrated + to be a measurable issue. + +Current Limitations +--- + +- MIDX files are managed only by the midx builtin and is not + automatically updated on clone or fetch. + +- There is no '--verify' option for the midx builtin to verify + the contents of the MIDX file against the pack contents. + +- Constructing a MIDX file currently requires the single-pack + index for every pack being added to the MIDX. + +- The fsck builtin does not check MIDX files, but should. + +- The repack builtin is not aware of the MIDX files, and may + invalidate the MIDX files by deleting existing packfiles. The + MIDX may also be e
[RFC PATCH 00/18] Multi-pack index (MIDX)
This RFC includes a new way to index the objects in multiple packs
using one file, called the multi-pack index (MIDX).

The commits are split into parts as follows:

[01] - A full design document.

[02] - The full file format for MIDX files.

[03] - Creation of core.midx config setting.

[04-12] - Creation of "midx" builtin that writes, reads, and deletes
          MIDX files.

[13-18] - Consume MIDX files for abbreviations and object loads.

The main goals of this RFC are:

* Determine interest in this feature.

* Find other use cases for the MIDX feature.

* Design a proper command-line interface for constructing and checking
  MIDX files. The current "midx" builtin is probably inadequate.

* Determine what additional changes are needed before the feature can
  be merged. Specifically, I'm interested in the interactions with
  repack and fsck. The current patch also does not update the MIDX on
  a fetch (which adds a packfile), although doing so would be valuable.
  Whenever possible, I tried to leave out features that could be added
  in a later patch.

* Consider splitting this patch into multiple patches, such as:

    i. The MIDX design document.
   ii. The command-line interface for building and reading MIDX files.
  iii. Integrations with abbreviations and object lookups.

Please do not send any style nits to this patch, as I expect the code to
change dramatically before we consider merging.

I created three copies of the Linux repo with 1, 24, and 127 packfiles
each using 'git repack -adfF --max-pack-size=[64m|16m]'. These copies
gave significant performance improvements on the following command:

    git log --oneline --raw --parents

Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
----------+-------------+------------+--------+----------
        1 |     35.64 s |    35.28 s |  -1.0% |    -1.0%
       24 |     90.81 s |    40.06 s | -55.9% |   +12.4%
      127 |    257.97 s |    42.25 s | -83.6% |   +18.6%

The last column is the relative difference between the MIDX-enabled repo
and the single-pack repo. The goal of the MIDX feature is to present the
ODB as if it were fully repacked, so there is still room for improvement.

Changing the command to

    git log --oneline --raw --parents --abbrev=40

has no observable difference (sub 1% change in all cases). This is likely
due to the repack I used putting commits and trees in a small number of
packfiles so the MRU cache works very well. On more naturally-created
lists of packfiles, there can be up to 20% improvement on this command.

We are using a version of this patch with an upcoming release of GVFS.
This feature is particularly important in that space since GVFS performs
a "prefetch" step that downloads a pack of commits and trees on a daily
basis. These packfiles are placed in an alternate that is shared by all
enlistments. Some users have 150+ packfiles and the MRU misses and
abbreviation computations are significant. Now, GVFS manages the MIDX file
after adding new prefetch packfiles using the following command:

    git midx --write --update-head --delete-expired --pack-dir=

As that release deploys we will gather more specific numbers on the
performance improvements and report them in this thread.
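The two percentage columns in the timing table above can be reproduced, up to rounding of the displayed timings, with a plain relative-change formula. The sketch below is only a reading aid: rel_change_pct() is a name invented for this note, and the choice of the single-pack "Before MIDX" time (35.64 s) as the baseline for the last column is an assumption inferred from the published numbers, not something the cover letter states.

```c
/* Illustrative helper: relative change from a baseline timing to a
 * measured timing, expressed as a percentage. Negative means faster. */
static double rel_change_pct(double baseline, double measured)
{
	return (measured - baseline) / baseline * 100.0;
}

/* "Rel %" column: compare a repo against itself, before and after MIDX. */
static double rel_column(double before_midx, double after_midx)
{
	return rel_change_pct(before_midx, after_midx);
}

/* "1 pack %" column (assumed baseline): compare the MIDX-enabled time
 * against the single-pack repo's time without MIDX. */
static double one_pack_column(double single_pack_before, double after_midx)
{
	return rel_change_pct(single_pack_before, after_midx);
}
```

For the 24-pack row, rel_column(90.81, 40.06) is roughly -55.9 and one_pack_column(35.64, 40.06) is roughly +12.4, matching the table.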
Derrick Stolee (18): docs: Multi-Pack Index (MIDX) Design Notes midx: specify midx file format midx: create core.midx config setting midx: write multi-pack indexes for an object list midx: create midx builtin with --write mode midx: add t5318-midx.sh test script midx: teach midx --write to update midx-head midx: teach git-midx to read midx file details midx: find details of nth object in midx midx: use existing midx when writing midx: teach git-midx to clear midx files midx: teach git-midx to delete expired files t5318-midx.h: confirm git actions are stable midx: load midx files when loading packs midx: use midx for approximate object count midx: nth_midxed_object_oid() and bsearch_midx() sha1_name: use midx for abbreviations packfile: use midx for object loads .gitignore | 1 + Documentation/config.txt | 3 + Documentation/git-midx.txt | 106 Documentation/technical/multi-pack-index.txt | 149 + Documentation/technical/pack-format.txt | 85 +++ Makefile | 2 + builtin.h| 1 + builtin/midx.c | 352 +++ cache.h | 1 + command-list.txt | 1 + config.c | 5 + environment.c| 2 + git.c| 1 + midx.c | 850 +++ midx.h | 136 + packfile.c | 79 ++- packfile.h
[RFC PATCH 02/18] midx: specify midx file format
A multi-pack-index (MIDX) file indexes the objects in multiple
packfiles in a single pack directory. After a simple fixed-size
header, the version 1 file format uses chunks to specify
different regions of the data that correspond to different types
of data, including:

- List of OIDs in lex-order
- A fanout table into the OID list
- List of packfile names (relative to pack directory)
- List of object metadata
- Large offsets (if needed)

By adding extra optional chunks, we can easily extend this format
without invalidating written v1 files.

One value in the header corresponds to a number of "base MIDX files"
and will always be zero until the value is used in a later patch.

We considered using a hashtable format instead of an ordered list
of objects to reduce the O(log N) lookups to O(1) lookups, but two
main issues arose that led us to abandon the idea:

- Extra space required to ensure collision counts were low.
- We need to identify the two lexicographically closest OIDs for
  fast abbreviations. Binary search allows this.

The current solution presents multiple packfiles as if they were
packed into a single packfile with one pack-index.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/pack-format.txt | 85 +
 1 file changed, 85 insertions(+)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 8e5bf60be3..ab459ef142 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -160,3 +160,88 @@ Pack file entry: <+
     corresponding packfile.

     20-byte SHA-1-checksum of all of the above.
+
+== midx-*.midx files have the following format:
+
+The multi-pack-index (MIDX) files refer to multiple pack-files.
+
+In order to allow extensions that add extra data to the MIDX format, we
+organize the body into "chunks" and provide a lookup table at the beginning
+of the body. The header includes certain length values, such as the number
+of packs, the number of base MIDX files, hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+    4-byte signature:
+        The signature is: {'M', 'I', 'D', 'X'}
+
+    4-byte version number:
+        Git currently only supports version 1.
+
+    1-byte Object Id Version (1 = SHA-1)
+
+    1-byte Object Id Length (H)
+
+    1-byte number (I) of base multi-pack-index files:
+        This value is currently always zero.
+
+    1-byte number (C) of "chunks"
+
+    4-byte number (P) of pack files
+
+CHUNK LOOKUP:
+
+    (C + 1) * 12 bytes providing the chunk offsets:
+        First 4 bytes describe chunk id. Value 0 is a terminating label.
+        Other 8 bytes provide offset in current file for chunk to start.
+        (Chunks are provided in file-order, so you can infer the length
+        using the next chunk position if necessary.)
+
+    The remaining data in the body is described one chunk at a time, and
+    these chunks may be given in any order. Chunks are required unless
+    otherwise specified.
+
+CHUNK DATA:
+
+    OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+        The ith entry, F[i], stores the number of OIDs with first
+        byte at most i. Thus F[255] stores the total
+        number of objects (N). The number of objects with first byte
+        value i is (F[i] - F[i-1]) for i > 0.
+
+    OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+        The OIDs for all objects in the MIDX are stored in lexicographic
+        order in this chunk.
+
+    Object Offsets (ID: {'O', 'O', 'F', 'F'}) (N * 8 bytes)
+        Stores two 4-byte values for every object.
+        1: The pack-int-id for the pack storing this object.
+        2: The offset within the pack.
+            If all offsets are less than 2^31, then the large offset chunk
+            will not exist and offsets are stored as in IDX v1.
+            If there is at least one offset value larger than 2^32-1, then
+            the large offset chunk must exist. If the large offset chunk
+            exists and the 31st bit is on, then removing that bit reveals
+            the row in the large offsets containing the 8-byte offset of
+            this object.
+
+    [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
+        8-byte offsets into large packfiles.
+
+    Packfile Name Lookup (ID: {'P', 'L', 'O', 'O'}) (P * 4 bytes)
+        P * 4 bytes storing the offset in the packfile name chunk for
+        the null-terminated string containing the filename for the
+        ith packfile. The filename is relative to the MIDX file's parent
+        directory.
+
+    Packfile Names (ID: {'P', 'N', 'A', 'M'})
+        Stores the packfile names as concatenated, null-terminated strings.
+        Packfiles must be list
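As a reading aid for the OID Fanout and OID Lookup chunks described above, here is a minimal sketch in C of how a reader might locate an object: the fanout table narrows the search to the OIDs sharing the first byte, and a binary search over the sorted OID list finishes the job. This is not code from the series. midx_find_oid() is a hypothetical name, and the sketch assumes the fanout entries have already been converted from network order to host order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical lookup over the two chunks:
 *   oids   - N sorted OIDs of hash_len bytes each (OID Lookup chunk)
 *   fanout - fanout[i] = number of OIDs with first byte <= i,
 *            already in host byte order (OID Fanout chunk)
 * Returns the position of oid in the list, or -1 if it is absent.
 */
static long midx_find_oid(const unsigned char *oids,
			  const uint32_t fanout[256],
			  size_t hash_len,
			  const unsigned char *oid)
{
	/* OIDs starting with byte b occupy [fanout[b-1], fanout[b]). */
	uint32_t lo = oid[0] ? fanout[oid[0] - 1] : 0;
	uint32_t hi = fanout[oid[0]];

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, oids + (size_t)mid * hash_len, hash_len);

		if (!cmp)
			return (long)mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -1;
}
```

The position returned would then index the Object Offsets chunk (8 bytes per object) to find the pack-int-id and offset. When the search misses, lo ends at the insertion point, which is where the lexicographically closest neighbours live; that is the property the commit message cites for fast abbreviations.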
[RFC PATCH 11/18] midx: teach git-midx to clear midx files
As a way to troubleshoot unforeseen problems with MIDX files, provide
a way to delete "midx-head" and the MIDX it references.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-midx.txt | 12 +++-
 builtin/midx.c             | 31 ++-
 t/t5318-midx.sh            |  9 +
 3 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index 3eeed1d969..c184d3a593 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -9,7 +9,7 @@ git-midx - Write and verify multi-pack-indexes (MIDX files).
 SYNOPSIS
 [verse]
-'git midx' [--write|--read] [--pack-dir ]
+'git midx' [--write|--read|--clear] [--pack-dir ]
 
 DESCRIPTION
 ---
@@ -22,6 +22,10 @@ OPTIONS
 	Use given directory for the location of packfiles, pack-indexes,
 	and MIDX files.
 
+--clear::
+	If specified, delete the midx file specified by midx-head, and
+	midx-head. (Cannot be combined with --write or --read.)
+
 --read::
 	If specified, read a midx file specified by the midx-head file
 	and output basic details about the midx file. (Cannot be combined
@@ -79,6 +83,12 @@ $ git midx --read
 $ git midx --read --midx-id 3e50d982a2257168c7fd0ff12ffe5cf6af38c74e
 
+* Delete the current midx-head and the file it references.
++
+---
+$ git midx --clear
+---
+
 CONFIGURATION
 -
 
diff --git a/builtin/midx.c b/builtin/midx.c
index aff2085771..b30ef36ff8 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -11,11 +11,13 @@
 static char const * const builtin_midx_usage[] = {
 	N_("git midx [--pack-dir ]"),
 	N_("git midx --write [--update-head] [--pack-dir ]"),
+	N_("git midx --clear [--pack-dir ]"),
 	NULL
 };
 
 static struct opts_midx {
 	const char *pack_dir;
+	int clear;
 	int read;
 	const char *midx_id;
 	int write;
@@ -24,6 +26,29 @@ static struct opts_midx {
 	struct object_id old_midx_oid;
 } opts;
 
+static int midx_clear(void)
+{
+	struct strbuf head_path = STRBUF_INIT;
+	char *old_path;
+
+	if (!opts.has_existing)
+		return 0;
+
+	strbuf_addstr(&head_path, opts.pack_dir);
+	strbuf_addstr(&head_path, "/");
+	strbuf_addstr(&head_path, "midx-head");
+	if (remove_path(head_path.buf))
+		die("failed to remove path %s", head_path.buf);
+	strbuf_release(&head_path);
+
+	old_path = get_midx_filename_oid(opts.pack_dir, &opts.old_midx_oid);
+	if (remove_path(old_path))
+		die("failed to remove path %s", old_path);
+	free(old_path);
+
+	return 0;
+}
+
 static int midx_read(void)
 {
 	struct object_id midx_oid;
@@ -263,6 +288,8 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
 			N_("dir"),
 			N_("The pack directory containing set of packfile and pack-index pairs.") },
+		OPT_BOOL('c', "clear", &opts.clear,
+			N_("clear midx file and midx-head")),
 		OPT_BOOL('r', "read", &opts.read,
 			N_("read midx file")),
 		{ OPTION_STRING, 'M', "midx-id", &opts.midx_id,
@@ -287,7 +314,7 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 			     builtin_midx_options,
 			     builtin_midx_usage, 0);
 
-	if (opts.write + opts.read > 1)
+	if (opts.write + opts.read + opts.clear > 1)
 		usage_with_options(builtin_midx_usage, builtin_midx_options);
 
 	if (!opts.pack_dir) {
@@ -299,6 +326,8 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 	opts.has_existing = !!get_midx_head_oid(opts.pack_dir, &opts.old_midx_oid);
 
+	if (opts.clear)
+		return midx_clear();
 	if (opts.read)
 		return midx_read();
 	if (opts.write)
diff --git a/t/t5318-midx.sh b/t/t5318-midx.sh
index 2e52389442..9337355ab3 100755
--- a/t/t5318-midx.sh
+++ b/t/t5318-midx.sh
@@ -143,4 +143,13 @@ test_expect_success 'write-midx in bare repo' \
 	git midx --read >output &&
 	cmp output expect'
 
+test_expect_success 'midx --clear' \
+	'git midx --clear &&
+	test_path_is_missing "${baredir}/midx-${midx4}.midx" &&
+	test_path_is_missing "${baredir}/midx-head" &&
+	cd ../full &&
+	git midx --clear &&
+	test_path_is_missing "${packdir}/midx-${midx4}.midx" &&
+	test_path_is_missing "${packdir}/midx-head"'
+
 test_done
-- 
2.15.0
[RFC PATCH 15/18] midx: use midx for approximate object count
The MIDX files contain a complete object count, so we can report the number
of objects in the MIDX. The count remains approximate as there may be overlap
between the packfiles not referenced by the MIDX.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 packfile.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/packfile.c b/packfile.c
index 1c0822878b..866a1f30dd 100644
--- a/packfile.c
+++ b/packfile.c
@@ -803,7 +803,8 @@ static void prepare_packed_git_one(char *objdir, int local)
 		if (ends_with(de->d_name, ".idx") ||
 		    ends_with(de->d_name, ".pack") ||
 		    ends_with(de->d_name, ".bitmap") ||
-		    ends_with(de->d_name, ".keep"))
+		    ends_with(de->d_name, ".keep") ||
+		    ends_with(de->d_name, ".midx"))
 			string_list_append(&garbage, path.buf);
 		else
 			report_garbage(PACKDIR_FILE_GARBAGE, path.buf);
@@ -828,9 +829,12 @@ unsigned long approximate_object_count(void)
 	static unsigned long count;
 	if (!approximate_object_count_valid) {
 		struct packed_git *p;
+		struct midxed_git *m;
 
-		prepare_packed_git();
+		prepare_packed_git_internal(1);
 		count = 0;
+		for (m = midxed_git; m; m = m->next)
+			count += m->num_objects;
 		for (p = packed_git; p; p = p->next) {
 			if (open_pack_index(p))
 				continue;
-- 
2.15.0
[RFC PATCH 13/18] t5318-midx.h: confirm git actions are stable
Perform some basic read-only operations that load objects and find abbreviations. As this functionality begins to reference MIDX files, ensure the output matches when using MIDX files and when not using them. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- t/t5318-midx.sh | 31 +++ 1 file changed, 31 insertions(+) diff --git a/t/t5318-midx.sh b/t/t5318-midx.sh index 42d103c879..00be852ed3 100755 --- a/t/t5318-midx.sh +++ b/t/t5318-midx.sh @@ -37,6 +37,19 @@ _midx_read_expect() { EOF } +_midx_git_two_modes() { + git -c core.midx=true $1 >output + git -c core.midx=false $1 >expect +} + +_midx_git_behavior() { + test_expect_success 'check normal git operations' \ + '_midx_git_two_modes "log --patch master" && +cmp output expect && +_midx_git_two_modes "rev-list --all --objects" && +cmp output expect' +} + test_expect_success 'write-midx from index version 1' \ 'pack1=$(git rev-list --all --objects | git pack-objects --index-version=1 ${packdir}/test-1) && midx1=$(git midx --write) && @@ -48,6 +61,8 @@ test_expect_success 'write-midx from index version 1' \ git midx --read --midx-id=${midx1} >output && cmp output expect' +_midx_git_behavior + test_expect_success 'write-midx from index version 2' \ 'rm "${packdir}/test-1-${pack1}.pack" && pack2=$(git rev-list --all --objects | git pack-objects --index-version=2 ${packdir}/test-2) && @@ -61,6 +76,8 @@ test_expect_success 'write-midx from index version 2' \ git midx --read> output && cmp output expect' +_midx_git_behavior + test_expect_success 'Create more objects' \ 'for i in $(test_seq 100) do @@ -71,6 +88,8 @@ test_expect_success 'Create more objects' \ git commit -m "test data 2" && git branch commit2 HEAD' +_midx_git_behavior + test_expect_success 'write-midx with two packs' \ 'pack3=$(git rev-list --objects commit2 ^commit1 | git pack-objects --index-version=2 ${packdir}/test-3) && midx3=$(git midx --write --update-head) && @@ -84,6 +103,8 @@ test_expect_success 'write-midx with two packs' \ git midx --read 
>output && cmp output expect' +_midx_git_behavior + test_expect_success 'Add more packs' \ 'for i in $(test_seq 10) do @@ -107,6 +128,8 @@ test_expect_success 'Add more packs' \ git pack-objects --index-version=2 ${packdir}/test-pack output && cmp output expect' +_midx_git_behavior + test_expect_success 'write-midx with no new packs' \ 'midx5=$(git midx --write --update-head --delete-expired) && test_path_is_file ${packdir}/midx-${midx5}.midx && @@ -127,6 +152,8 @@ test_expect_success 'write-midx with no new packs' \ test_path_is_file ${packdir}/midx-head && test $(cat ${packdir}/midx-head) = "$midx4"' +_midx_git_behavior + test_expect_success 'create bare repo' \ 'cd .. && git clone --bare full bare && @@ -146,6 +173,8 @@ test_expect_success 'write-midx in bare repo' \ git midx --read >output && cmp output expect' +_midx_git_behavior + test_expect_success 'midx --clear' \ 'git midx --clear && test_path_is_missing "${baredir}/midx-${midx4}.midx" && @@ -155,4 +184,6 @@ test_expect_success 'midx --clear' \ test_path_is_missing "${packdir}/midx-${midx4}.midx" && test_path_is_missing "${packdir}/midx-head"' +_midx_git_behavior + test_done -- 2.15.0
[RFC PATCH 03/18] midx: create core.midx config setting
As the multi-pack-index feature is being developed, we use a config setting 'core.midx' to enable all use of MIDX files. Since MIDX files are designed as a way to augment the existing data stores in Git, turning this setting off will revert to previous behavior without needing to downgrade. This can also be a repo- specific setting if the MIDX is misbehaving in only one repo. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/config.txt | 3 +++ cache.h | 1 + config.c | 5 + environment.c| 2 ++ 4 files changed, 11 insertions(+) diff --git a/Documentation/config.txt b/Documentation/config.txt index 64c1dbba94..dc7cb4b900 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -896,6 +896,9 @@ core.notesRef:: This setting defaults to "refs/notes/commits", and it can be overridden by the `GIT_NOTES_REF` environment variable. See linkgit:git-notes[1]. +core.midx:: + Enable "multi-pack-index" feature. Set to true to read and write MIDX files. + core.sparseCheckout:: Enable "sparse checkout" feature. See section "Sparse checkout" in linkgit:git-read-tree[1] for more information. 
diff --git a/cache.h b/cache.h index a2ec8c0b55..f4943d3136 100644 --- a/cache.h +++ b/cache.h @@ -820,6 +820,7 @@ extern int precomposed_unicode; extern int protect_hfs; extern int protect_ntfs; extern const char *core_fsmonitor; +extern int core_midx; /* * Include broken refs in all ref iterations, which will diff --git a/config.c b/config.c index e617c2018d..17f560ddc4 100644 --- a/config.c +++ b/config.c @@ -1223,6 +1223,11 @@ static int git_default_core_config(const char *var, const char *value) return 0; } + if (!strcmp(var, "core.midx")) { + core_midx = git_config_bool(var, value); + return 0; + } + if (!strcmp(var, "core.sparsecheckout")) { core_apply_sparse_checkout = git_config_bool(var, value); return 0; diff --git a/environment.c b/environment.c index 63ac38a46f..57a3943849 100644 --- a/environment.c +++ b/environment.c @@ -78,6 +78,8 @@ int protect_hfs = PROTECT_HFS_DEFAULT; int protect_ntfs = PROTECT_NTFS_DEFAULT; const char *core_fsmonitor; +int core_midx; + /* * The character that begins a commented line in user-editable file * that is subject to stripspace. -- 2.15.0
[RFC PATCH 04/18] midx: write multi-pack indexes for an object list
The write_midx_file() method takes a list of packfiles and indexed objects with offset information and writes according to the format in Documentation/technical/pack-format.txt. The chunks are separated into methods. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Makefile | 1 + midx.c | 412 +++ midx.h | 42 +++ 3 files changed, 455 insertions(+) create mode 100644 midx.c create mode 100644 midx.h diff --git a/Makefile b/Makefile index 2a81ae22e9..d0d810951f 100644 --- a/Makefile +++ b/Makefile @@ -827,6 +827,7 @@ LIB_OBJS += merge.o LIB_OBJS += merge-blobs.o LIB_OBJS += merge-recursive.o LIB_OBJS += mergesort.o +LIB_OBJS += midx.o LIB_OBJS += mru.o LIB_OBJS += name-hash.o LIB_OBJS += notes.o diff --git a/midx.c b/midx.c new file mode 100644 index 00..5c320726ed --- /dev/null +++ b/midx.c @@ -0,0 +1,412 @@ +#include "cache.h" +#include "git-compat-util.h" +#include "pack.h" +#include "packfile.h" +#include "midx.h" + +#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */ +#define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */ +#define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */ +#define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ +#define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */ +#define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */ +#define MIDX_CHUNKID_LARGEOFFSETS 0x4c4f4646 /* "LOFF" */ + +#define MIDX_VERSION_1 1 +#define MIDX_VERSION MIDX_VERSION_1 + +#define MIDX_OID_VERSION_SHA1 1 +#define MIDX_OID_LEN_SHA1 20 +#define MIDX_OID_VERSION MIDX_OID_VERSION_SHA1 +#define MIDX_OID_LEN MIDX_OID_LEN_SHA1 + +#define MIDX_LARGE_OFFSET_NEEDED 0x8000 + +char* get_midx_filename_oid(const char *pack_dir, + struct object_id *oid) +{ + struct strbuf head_path = STRBUF_INIT; + strbuf_addstr(_path, pack_dir); + strbuf_addstr(_path, "/midx-"); + strbuf_addstr(_path, oid_to_hex(oid)); + strbuf_addstr(_path, ".midx"); + + return strbuf_detach(_path, NULL); +} + +struct pack_midx_details_internal { + uint32_t pack_int_id; + uint32_t internal_offset; +}; + 
+static int midx_sha1_compare(const void *_a, const void *_b) +{ + struct pack_midx_entry *a = *(struct pack_midx_entry **)_a; + struct pack_midx_entry *b = *(struct pack_midx_entry **)_b; + return oidcmp(>oid, >oid); +} + +static void write_midx_chunk_packlookup( + struct sha1file *f, + const char **pack_names, uint32_t nr_packs) +{ + uint32_t i, cur_len = 0; + + for (i = 0; i < nr_packs; i++) { + uint32_t swap_len = htonl(cur_len); + sha1write(f, _len, 4); + cur_len += strlen(pack_names[i]) + 1; + } +} + +static void write_midx_chunk_packnames( + struct sha1file *f, + const char **pack_names, uint32_t nr_packs) +{ + uint32_t i; + for (i = 0; i < nr_packs; i++) + sha1write(f, pack_names[i], strlen(pack_names[i]) + 1); +} + +static void write_midx_chunk_oidfanout( + struct sha1file *f, + struct pack_midx_entry **objects, uint32_t nr_objects) +{ + struct pack_midx_entry **list = objects; + struct pack_midx_entry **last = objects + nr_objects; + uint32_t count_distinct = 0; + uint32_t i; + + /* + * Write the first-level table (the list is sorted, + * but we use a 256-entry lookup to be able to avoid + * having to do eight extra binary search iterations). 
+ */ + for (i = 0; i < 256; i++) { + struct pack_midx_entry **next = list; + struct pack_midx_entry *prev = 0; + uint32_t swap_distinct; + + while (next < last) { + struct pack_midx_entry *obj = *next; + if (obj->oid.hash[0] != i) + break; + + if (!prev || oidcmp(&(prev->oid), &(obj->oid))) + count_distinct++; + + prev = obj; + next++; + } + + swap_distinct = htonl(count_distinct); + sha1write(f, _distinct, 4); + list = next; + } +} + +static void write_midx_chunk_oidlookup( + struct sha1file *f, unsigned char hash_len, + struct pack_midx_entry **objects, uint32_t nr_objects) +{ + struct pack_midx_entry **list = objects; + struct object_id *last_oid = 0; + uint32_t i; + + for (i = 0; i < nr_objects; i++) { + struct pack_midx_entry *obj = *list++; + + if (last_oid && !oidcmp(last_oid, >oid)) + continue; + + last_oid = >oid; + sha1write(f, obj->
[RFC PATCH 05/18] midx: create midx builtin with --write mode
Commentary: As we extend the function of the midx builtin, I expand the SYNOPSIS row of "git-midx.txt" but do not create multiple rows. If this builtin doesn't change too much, I will rewrite the SYNOPSIS to be multi- lined, such as in "git-branch.txt". -- >8 -- Create, document, and implement the first ability of the midx builtin. The --write subcommand creates a multi-pack-index for all indexed packfiles within a given pack directory. If none is provided, the objects/pack directory is implied. The arguments allow specifying the pack directory so we can add MIDX files to alternates. The packfiles are expected to be paired with pack-indexes and are otherwise ignored. This simplifies the implementation and also keeps compatibility with older versions of Git (or changing core.midx to false). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- .gitignore | 1 + Documentation/git-midx.txt | 54 + Makefile | 1 + builtin.h | 1 + builtin/midx.c | 195 + command-list.txt | 1 + git.c | 1 + 7 files changed, 254 insertions(+) create mode 100644 Documentation/git-midx.txt create mode 100644 builtin/midx.c diff --git a/.gitignore b/.gitignore index 833ef3b0b7..545e195f2a 100644 --- a/.gitignore +++ b/.gitignore @@ -95,6 +95,7 @@ /git-merge-subtree /git-mergetool /git-mergetool--lib +/git-midx /git-mktag /git-mktree /git-name-rev diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt new file mode 100644 index 00..17464222c1 --- /dev/null +++ b/Documentation/git-midx.txt @@ -0,0 +1,54 @@ +git-midx(1) + + +NAME + +git-midx - Write and verify multi-pack-indexes (MIDX files). + + +SYNOPSIS + +[verse] +'git midx' --write [--pack-dir ] + +DESCRIPTION +--- +Write a MIDX file. + +OPTIONS +--- + +--pack-dir :: + Use given directory for the location of packfiles, pack-indexes, + and MIDX files. + +--write:: + If specified, write a new midx file to the pack directory using + the packfiles present. Outputs the hash of the result midx file. 
+ +EXAMPLES + + +* Write a MIDX file for the packfiles in your local .git folder. ++ + +$ git midx --write + + +* Write a MIDX file for the packfiles in a different folder ++ +- +$ git midx --write --pack-dir ../../alt/pack/ +- + +CONFIGURATION +- + +core.midx:: + The midx command will fail if core.midx is false. + Also, the written MIDX files will be ignored by other commands + unless core.midx is true. + +GIT +--- +Part of the linkgit:git[1] suite diff --git a/Makefile b/Makefile index d0d810951f..5c458705c1 100644 --- a/Makefile +++ b/Makefile @@ -980,6 +980,7 @@ BUILTIN_OBJS += builtin/merge-index.o BUILTIN_OBJS += builtin/merge-ours.o BUILTIN_OBJS += builtin/merge-recursive.o BUILTIN_OBJS += builtin/merge-tree.o +BUILTIN_OBJS += builtin/midx.o BUILTIN_OBJS += builtin/mktag.o BUILTIN_OBJS += builtin/mktree.o BUILTIN_OBJS += builtin/mv.o diff --git a/builtin.h b/builtin.h index 42378f3aa4..880383e341 100644 --- a/builtin.h +++ b/builtin.h @@ -188,6 +188,7 @@ extern int cmd_merge_ours(int argc, const char **argv, const char *prefix); extern int cmd_merge_file(int argc, const char **argv, const char *prefix); extern int cmd_merge_recursive(int argc, const char **argv, const char *prefix); extern int cmd_merge_tree(int argc, const char **argv, const char *prefix); +extern int cmd_midx(int argc, const char **argv, const char *prefix); extern int cmd_mktag(int argc, const char **argv, const char *prefix); extern int cmd_mktree(int argc, const char **argv, const char *prefix); extern int cmd_mv(int argc, const char **argv, const char *prefix); diff --git a/builtin/midx.c b/builtin/midx.c new file mode 100644 index 00..4aae14cf8e --- /dev/null +++ b/builtin/midx.c @@ -0,0 +1,195 @@ +#include "builtin.h" +#include "cache.h" +#include "config.h" +#include "dir.h" +#include "git-compat-util.h" +#include "lockfile.h" +#include "packfile.h" +#include "parse-options.h" +#include "midx.h" + +static char const * const builtin_midx_usage[] = { + N_("git midx --write [--pack-dir 
]"), + NULL +}; + +static struct opts_midx { + const char *pack_dir; + int write; +} opts; + +static int build_midx_from_packs( + const char *pack_dir, + const char **pack_names, uint32_t nr_packs, + const char **midx_id) +{ + struct packed_git **packs; + const char **installed_pack_names; + uint32_t i, j, nr_install
[RFC PATCH 07/18] midx: teach midx --write to update midx-head
There may be multiple MIDX files in a single pack directory. The primary file is pointed to by a pointer file "midx-head" that contains an OID. The MIDX file to load is then given by "midx-.midx". This head file will be especially important when the MIDX files are extended to be incremental and we expect multiple MIDX files at any point. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-midx.txt | 19 ++- builtin/midx.c | 32 ++-- midx.c | 31 +++ midx.h | 3 +++ t/t5318-midx.sh| 33 ++--- 5 files changed, 104 insertions(+), 14 deletions(-) diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt index 17464222c1..01f79cbba5 100644 --- a/Documentation/git-midx.txt +++ b/Documentation/git-midx.txt @@ -9,7 +9,7 @@ git-midx - Write and verify multi-pack-indexes (MIDX files). SYNOPSIS [verse] -'git midx' --write [--pack-dir ] +'git midx' --write [--pack-dir ] DESCRIPTION --- @@ -26,15 +26,32 @@ OPTIONS If specified, write a new midx file to the pack directory using the packfiles present. Outputs the hash of the result midx file. +--update-head:: + If specified with --write, update the midx-head file to point to + the written midx file. + EXAMPLES +* Read the midx-head file and output the OID of the head MIDX file. ++ + +$ git midx + + * Write a MIDX file for the packfiles in your local .git folder. + $ git midx --write +* Write a MIDX file for the packfiles in your local .git folder and +* update the midx-head file. 
++ + +$ git midx --write --update-head + + * Write a MIDX file for the packfiles in a different folder + - diff --git a/builtin/midx.c b/builtin/midx.c index 4aae14cf8e..84ce6588a2 100644 --- a/builtin/midx.c +++ b/builtin/midx.c @@ -9,13 +9,17 @@ #include "midx.h" static char const * const builtin_midx_usage[] = { - N_("git midx --write [--pack-dir ]"), + N_("git midx [--pack-dir ]"), + N_("git midx --write [--update-head] [--pack-dir ]"), NULL }; static struct opts_midx { const char *pack_dir; int write; + int update_head; + int has_existing; + struct object_id old_midx_oid; } opts; static int build_midx_from_packs( @@ -109,6 +113,22 @@ static int build_midx_from_packs( return 0; } +static void update_head_file(const char *pack_dir, const char *midx_id) +{ + int fd; + struct lock_file lk = LOCK_INIT; + char *head_path = get_midx_head_filename(pack_dir); + + fd = hold_lock_file_for_update(, head_path, LOCK_DIE_ON_ERROR); + FREE_AND_NULL(head_path); + + if (fd < 0) + die_errno("unable to open midx-head"); + + write_in_full(fd, midx_id, GIT_MAX_HEXSZ); + commit_lock_file(); +} + static int midx_write(void) { const char **pack_names = NULL; @@ -152,6 +172,9 @@ static int midx_write(void) printf("%s\n", midx_id); + if (opts.update_head) + update_head_file(opts.pack_dir, midx_id); + cleanup: if (pack_names) FREE_AND_NULL(pack_names); @@ -166,6 +189,8 @@ int cmd_midx(int argc, const char **argv, const char *prefix) N_("The pack directory containing set of packfile and pack-index pairs.") }, OPT_BOOL('w', "write", , N_("write midx file")), + OPT_BOOL('u', "update-head", _head, + N_("update midx-head to written midx file")), OPT_END(), }; @@ -187,9 +212,12 @@ int cmd_midx(int argc, const char **argv, const char *prefix) opts.pack_dir = strbuf_detach(, NULL); } + opts.has_existing = !!get_midx_head_oid(opts.pack_dir, _midx_oid); + if (opts.write) return midx_write(); - usage_with_options(builtin_midx_usage, builtin_midx_options); + if (opts.has_existing) + printf("%s\n", 
oid_to_hex(_midx_oid)); return 0; } diff --git a/midx.c b/midx.c index 5c320726ed..f4178c1b81 100644 --- a/midx.c +++ b/midx.c @@ -34,6 +34,37 @@ char* get_midx_filename_oid(const char *pack_dir, return strbuf_detach(_path, NULL); } +char *get_midx_head_filename(const char *pack_dir) +{ + struct strbuf head_filename =
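
As a rough illustration of the pointer-file scheme described in this message, the sketch below turns the contents of a midx-head file into the path of the MIDX file it names. It is not the patch's get_midx_head_filename() or get_midx_filename_oid(); midx_path_from_head() and HEXSZ are hypothetical names, and a 40-digit hex OID is assumed.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define HEXSZ 40 /* assumption: SHA-1 hex length */

/*
 * Build "<pack_dir>/midx-<oid>.midx" from the text found in midx-head.
 * Returns 0 on success, -1 if "head" does not start with HEXSZ hex
 * digits or if "out" is too small. Anything after the OID (such as a
 * trailing newline) is ignored.
 */
static int midx_path_from_head(const char *pack_dir, const char *head,
			       char *out, size_t outsz)
{
	size_t i;
	int n;

	if (strlen(head) < HEXSZ)
		return -1;
	for (i = 0; i < HEXSZ; i++)
		if (!isxdigit((unsigned char)head[i]))
			return -1;

	n = snprintf(out, outsz, "%s/midx-%.*s.midx", pack_dir, HEXSZ, head);
	return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

Keeping the OID in a separate head file means a writer can finish the new MIDX file first and switch readers over with a single small update, which is presumably why the patch goes through the lockfile API for midx-head.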
[RFC PATCH 17/18] sha1_name: use midx for abbreviations
Create unique_in_midx() to mimic behavior of unique_in_pack(). Create find_abbrev_len_for_midx() to mimic behavior of find_abbrev_len_for_pack(). Consume these methods when interacting with abbreviations.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 70 +++--
 1 file changed, 68 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 611c7d24dd..2f426e136e 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -10,6 +10,7 @@
 #include "dir.h"
 #include "sha1-array.h"
 #include "packfile.h"
+#include "midx.h"
 
 static int get_oid_oneline(const char *, struct object_id *, struct commit_list *);
 
@@ -190,11 +191,40 @@ static void unique_in_pack(struct packed_git *p,
 	}
 }
 
+static void unique_in_midx(struct midxed_git *m,
+			   struct disambiguate_state *ds)
+{
+	uint32_t num, i, first = 0;
+	const struct object_id *current = NULL;
+
+	if (!m->num_objects)
+		return;
+
+	num = m->num_objects;
+	bsearch_midx(m, ds->bin_pfx.hash, &first);
+
+	/*
+	 * At this point, "first" is the location of the lowest object
+	 * with an object name that could match "bin_pfx". See if we have
+	 * 0, 1 or more objects that actually match(es).
+	 */
+	for (i = first; i < num && !ds->ambiguous; i++) {
+		struct object_id oid;
+		current = nth_midxed_object_oid(&oid, m, i);
+		if (!match_sha(ds->len, ds->bin_pfx.hash, current->hash))
+			break;
+		update_candidates(ds, current);
+	}
+}
+
 static void find_short_packed_object(struct disambiguate_state *ds)
 {
 	struct packed_git *p;
+	struct midxed_git *m;
 
-	prepare_packed_git();
+	prepare_packed_git_internal(1);
+	for (m = midxed_git; m && !ds->ambiguous; m = m->next)
+		unique_in_midx(m, ds);
 	for (p = packed_git; p && !ds->ambiguous; p = p->next)
 		unique_in_pack(p, ds);
 }
@@ -508,6 +538,39 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 	return 0;
 }
 
+static void find_abbrev_len_for_midx(struct midxed_git *m,
+				     struct min_abbrev_data *mad)
+{
+	int match = 0;
+	uint32_t first = 0;
+	struct object_id oid;
+
+	if (!m->num_objects)
+		return;
+
+	match = bsearch_midx(m, mad->hash, &first);
+
+	/*
+	 * first is now the position in the packfile where we would insert
+	 * mad->hash if it does not exist (or the position of mad->hash if
+	 * it does exist). Hence, we consider a maximum of three objects
+	 * nearby for the abbreviation length.
+	 */
+	mad->init_len = 0;
+	if (!match) {
+		nth_midxed_object_oid(&oid, m, first);
+		extend_abbrev_len(&oid, mad);
+	} else if (first < m->num_objects - 1) {
+		nth_midxed_object_oid(&oid, m, first + 1);
+		extend_abbrev_len(&oid, mad);
+	}
+	if (first > 0) {
+		nth_midxed_object_oid(&oid, m, first - 1);
+		extend_abbrev_len(&oid, mad);
+	}
+	mad->init_len = mad->cur_len;
+}
+
 static void find_abbrev_len_for_pack(struct packed_git *p,
 				     struct min_abbrev_data *mad)
 {
@@ -563,8 +626,11 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
 static void find_abbrev_len_packed(struct min_abbrev_data *mad)
 {
 	struct packed_git *p;
+	struct midxed_git *m;
 
-	prepare_packed_git();
+	prepare_packed_git_internal(1);
+	for (m = midxed_git; m; m = m->next)
+		find_abbrev_len_for_midx(m, mad);
 	for (p = packed_git; p; p = p->next)
 		find_abbrev_len_for_pack(p, mad);
 }
-- 
2.15.0
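
The neighbor trick used by find_abbrev_len_for_midx() can be shown in isolation. The sketch below is a simplified stand-in, not the patch's code: it assumes a flat sorted array of distinct ids, assumes the object of interest is itself present at position pos, and uses the hypothetical names shared_hex_digits() and min_unique_abbrev(). Since the list is sorted, only the immediate predecessor and successor can share the longest prefix with a given id.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_LEN 20 /* assumption: SHA-1 sized object ids */

/* Count the leading hex digits shared by two object ids. */
static int shared_hex_digits(const unsigned char *a, const unsigned char *b)
{
	int i;

	for (i = 0; i < HASH_LEN; i++) {
		if (a[i] == b[i])
			continue;
		/* An equal high nibble means one more shared hex digit. */
		return 2 * i + (((a[i] ^ b[i]) & 0xf0) ? 0 : 1);
	}
	return 2 * HASH_LEN;
}

/*
 * Smallest number of hex digits that distinguishes the id at "pos"
 * from every other id in the sorted array. Only the two neighbors
 * need to be inspected.
 */
static int min_unique_abbrev(const unsigned char *oids, uint32_t nr,
			     uint32_t pos)
{
	const unsigned char *self = oids + (size_t)pos * HASH_LEN;
	int len = 1;
	int need;

	if (pos > 0) {
		need = shared_hex_digits(self, self - HASH_LEN) + 1;
		if (need > len)
			len = need;
	}
	if (pos + 1 < nr) {
		need = shared_hex_digits(self, self + HASH_LEN) + 1;
		if (need > len)
			len = need;
	}
	return len > 2 * HASH_LEN ? 2 * HASH_LEN : len;
}
```

With a MIDX this check runs once over the combined object list instead of once per packfile, which is the saving this patch is after.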
[RFC PATCH 14/18] midx: load midx files when loading packs
Replace prepare_packed_git() with prepare_packed_git_internal(use_midx) to give some consumers of prepare_packed_git() a way to load MIDX files. Consumers should only use the new method if they are prepared to use the midxed_git struct alongside the packed_git struct. If a packfile is found that is not referenced by the current MIDX, then add it to the packed_git struct. This is important to keep the MIDX useful after adding packs due to "fetch" commands and when third-party tools (such as libgit2) add packs directly to the repo. If prepare_packed_git_internal is called with use_midx = 0, then unload the MIDX file and reload the packfiles into the packed_git struct.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 midx.c | 57 +++
 midx.h | 6 --
 packfile.c | 64 +-
 packfile.h | 1 +
 4 files changed, 117 insertions(+), 11 deletions(-)

diff --git a/midx.c b/midx.c
index 3ce2b736ea..a66763b9e3 100644
--- a/midx.c
+++ b/midx.c
@@ -22,6 +22,9 @@
 
 #define MIDX_LARGE_OFFSET_NEEDED 0x80000000
 
+/* MIDX-git global storage */
+struct midxed_git *midxed_git = 0;
+
 char* get_midx_filename_oid(const char *pack_dir,
 			    struct object_id *oid)
 {
@@ -197,6 +200,45 @@ struct midxed_git *get_midxed_git(const char *pack_dir, struct object_id *oid)
 	return m;
 }
 
+static char* get_midx_filename_dir(const char *pack_dir)
+{
+	struct object_id oid;
+	if (!get_midx_head_oid(pack_dir, &oid))
+		return 0;
+
+	return get_midx_filename_oid(pack_dir, &oid);
+}
+
+static int prepare_midxed_git_head(char *pack_dir, int local)
+{
+	struct midxed_git *m = midxed_git;
+	char *midx_head_path = get_midx_filename_dir(pack_dir);
+
+	if (!core_midx)
+		return 1;
+
+	if (midx_head_path) {
+		midxed_git = load_midxed_git_one(midx_head_path, pack_dir);
+		midxed_git->next = m;
+		FREE_AND_NULL(midx_head_path);
+		return 1;
+	}
+
+	return 0;
+}
+
+int prepare_midxed_git_objdir(char *obj_dir, int local)
+{
+	int ret;
+	struct strbuf pack_dir = STRBUF_INIT;
+	strbuf_addstr(&pack_dir, obj_dir);
+	strbuf_add(&pack_dir, "/pack", 5);
+
+	ret = prepare_midxed_git_head(pack_dir.buf, local);
+	strbuf_release(&pack_dir);
+	return ret;
+}
+
 struct pack_midx_details_internal {
 	uint32_t pack_int_id;
 	uint32_t internal_offset;
@@ -677,3 +719,18 @@ int close_midx(struct midxed_git *m)
 
 	return 1;
 }
+
+void close_all_midx(void)
+{
+	struct midxed_git *m = midxed_git;
+	struct midxed_git *next;
+
+	while (m) {
+		next = m->next;
+		close_midx(m);
+		free(m);
+		m = next;
+	}
+
+	midxed_git = 0;
+}
diff --git a/midx.h b/midx.h
index 27d48163e9..d8ede8121c 100644
--- a/midx.h
+++ b/midx.h
@@ -27,7 +27,7 @@ struct pack_midx_header {
 	uint32_t num_packs;
 };
 
-struct midxed_git {
+extern struct midxed_git {
 	struct midxed_git *next;
 
 	int midx_fd;
@@ -81,9 +81,10 @@ struct midxed_git {
 	/* something like ".git/objects/pack" */
 	char pack_dir[FLEX_ARRAY]; /* more */
-};
+} *midxed_git;
 
 extern struct midxed_git *get_midxed_git(const char *pack_dir, struct object_id *oid);
+extern int prepare_midxed_git_objdir(char *obj_dir, int local);
 
 struct pack_midx_details {
 	uint32_t pack_int_id;
@@ -118,5 +119,6 @@ extern const char *write_midx_file(const char *pack_dir,
 				   uint32_t nr_objects);
 
 extern int close_midx(struct midxed_git *m);
+extern void close_all_midx(void);
 
 #endif
diff --git a/packfile.c b/packfile.c
index c36420b33f..1c0822878b 100644
--- a/packfile.c
+++ b/packfile.c
@@ -8,6 +8,7 @@
 #include "list.h"
 #include "streaming.h"
 #include "sha1-lookup.h"
+#include "midx.h"
 
 char *odb_pack_name(struct strbuf *buf,
 		    const unsigned char *sha1,
@@ -309,10 +310,22 @@ void close_pack(struct packed_git *p)
 void close_all_packs(void)
 {
 	struct packed_git *p;
+	struct midxed_git *m;
+
+	for (m = midxed_git; m; m = m->next) {
+		int i;
+		for (i = 0; i < m->num_packs; i++) {
+			p = m->packs[i];
+			if (p && p->do_not_close)
+				BUG("want to close pack marked 'do-not-close'");
+			else if (p)
+				close_pack(p);
+		}
+	}
 
 	for (p = packed_git; p; p = p->next)
 		if (p->do_not_close)
-			die("BUG: want to close pack marked 'do-not-close'");
+			BUG("wa
[RFC PATCH 12/18] midx: teach git-midx to delete expired files
As we write new MIDX files, the existing files are probably not needed. Supply the "--delete-expired" flag to remove these files during the "--write" subcommand.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-midx.txt | 4 
 builtin/midx.c | 15 ++-
 midx.c | 26 ++
 midx.h | 2 ++
 packfile.c | 2 +-
 packfile.h | 1 +
 t/t5318-midx.sh| 9 ++---
 7 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index c184d3a593..4635247d0d 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -43,6 +43,10 @@ OPTIONS
 	If specified with --write, update the midx-head file to point to
 	the written midx file.
 
+--delete-expired::
+	If specified with --write and --update-head, delete the midx file
+	previously pointed to by midx-head (if changed).
+
 EXAMPLES
 
diff --git a/builtin/midx.c b/builtin/midx.c
index b30ef36ff8..6f56f39390 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -10,7 +10,7 @@
 
 static char const * const builtin_midx_usage[] = {
 	N_("git midx [--pack-dir ]"),
-	N_("git midx --write [--update-head] [--pack-dir ]"),
+	N_("git midx --write [--update-head [--delete-expired]] [--pack-dir ]"),
 	N_("git midx --clear [--pack-dir ]"),
 	NULL
 };
@@ -22,6 +22,7 @@ static struct opts_midx {
 	const char *midx_id;
 	int write;
 	int update_head;
+	int delete_expired;
 	int has_existing;
 	struct object_id old_midx_oid;
 } opts;
@@ -276,6 +277,16 @@ static int midx_write(void)
 	if (opts.update_head)
 		update_head_file(opts.pack_dir, midx_id);
 
+	if (opts.delete_expired && opts.update_head && opts.has_existing &&
+	    strcmp(midx_id, oid_to_hex(&opts.old_midx_oid))) {
+		char *old_path = get_midx_filename_oid(opts.pack_dir, &opts.old_midx_oid);
+		close_midx(midx);
+		if (remove_path(old_path))
+			die("failed to remove path %s", old_path);
+
+		free(old_path);
+	}
+
 cleanup:
 	if (pack_names)
 		FREE_AND_NULL(pack_names);
@@ -300,6 +311,8 @@ int cmd_midx(int argc, const char **argv, const char *prefix)
 			N_("write midx file")),
 		OPT_BOOL('u', "update-head", &opts.update_head,
 			N_("update midx-head to written midx file")),
+		OPT_BOOL('d', "delete-expired", &opts.delete_expired,
+			N_("delete expired head midx file")),
 		OPT_END(),
 	};
 
diff --git a/midx.c b/midx.c
index 53eb29dac3..3ce2b736ea 100644
--- a/midx.c
+++ b/midx.c
@@ -651,3 +651,29 @@ const char *write_midx_file(const char *pack_dir,
 
 	return final_hex;
 }
+
+int close_midx(struct midxed_git *m)
+{
+	int i;
+	if (m->midx_fd < 0)
+		return 0;
+
+	for (i = 0; i < m->num_packs; i++) {
+		if (m->packs[i]) {
+			close_pack(m->packs[i]);
+			free(m->packs[i]);
+			m->packs[i] = NULL;
+		}
+	}
+
+	munmap((void *)m->data, m->data_len);
+	m->data = 0;
+
+	close(m->midx_fd);
+	m->midx_fd = -1;
+
+	free(m->packs);
+	free(m->pack_names);
+
+	return 1;
+}
diff --git a/midx.h b/midx.h
index 1e7a94651c..27d48163e9 100644
--- a/midx.h
+++ b/midx.h
@@ -117,4 +117,6 @@ extern const char *write_midx_file(const char *pack_dir,
 				   struct pack_midx_entry **objects,
 				   uint32_t nr_objects);
 
+extern int close_midx(struct midxed_git *m);
+
 #endif
diff --git a/packfile.c b/packfile.c
index 4a5fe7ab18..c36420b33f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -299,7 +299,7 @@ void close_pack_index(struct packed_git *p)
 	}
 }
 
-static void close_pack(struct packed_git *p)
+void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
diff --git a/packfile.h b/packfile.h
index 0cdeb54dcd..7cf4771029 100644
--- a/packfile.h
+++ b/packfile.h
@@ -61,6 +61,7 @@ extern void close_pack_index(struct packed_git *);
 
 extern unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
 extern void close_pack_windows(struct packed_git *);
+extern void close_pack(struct packed_git *p);
 extern void close_all_packs(void);
 extern void unuse_pack(struct pack_window **);
 extern void clear_delta_base_cache(void);
diff --git a/t/t5318-midx.sh b/t/t5318-midx.sh
index 9337355ab3..42d103c879 100755
--- a/t/t5318-midx.sh
+++
[RFC PATCH 10/18] midx: use existing midx when writing
When writing a new MIDX file, it is faster to use an existing MIDX file to load the object list and pack offsets and to only inspect pack-indexes for packs not already covered by the MIDX file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/midx.c | 34 +++---
 midx.c | 23 +++
 midx.h | 2 ++
 3 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/builtin/midx.c b/builtin/midx.c
index ee9234583d..aff2085771 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -73,7 +73,7 @@ static int midx_read(void)
 static int build_midx_from_packs(
 	const char *pack_dir,
 	const char **pack_names, uint32_t nr_packs,
-	const char **midx_id)
+	const char **midx_id, struct midxed_git *midx)
 {
 	struct packed_git **packs;
 	const char **installed_pack_names;
@@ -86,6 +86,9 @@ static int build_midx_from_packs(
 	struct strbuf pack_path = STRBUF_INIT;
 	int baselen;
 
+	if (midx)
+		nr_total_packs += midx->num_packs;
+
 	if (!nr_total_packs) {
 		*midx_id = NULL;
 		return 0;
@@ -94,6 +97,12 @@ static int build_midx_from_packs(
 	ALLOC_ARRAY(packs, nr_total_packs);
 	ALLOC_ARRAY(installed_pack_names, nr_total_packs);
 
+	if (midx) {
+		for (i = 0; i < midx->num_packs; i++)
+			installed_pack_names[nr_installed_packs++] = midx->pack_names[i];
+		pack_offset = midx->num_packs;
+	}
+
 	strbuf_addstr(&pack_path, pack_dir);
 	strbuf_addch(&pack_path, '/');
 	baselen = pack_path.len;
@@ -101,6 +110,9 @@ static int build_midx_from_packs(
 		strbuf_setlen(&pack_path, baselen);
 		strbuf_addstr(&pack_path, pack_names[i]);
 
+		if (midx && contains_pack(midx, pack_names[i]))
+			continue;
+
 		strbuf_strip_suffix(&pack_path, ".pack");
 		strbuf_addstr(&pack_path, ".idx");
 
@@ -120,13 +132,24 @@ static int build_midx_from_packs(
 	if (!nr_objects || !nr_installed_packs) {
 		FREE_AND_NULL(packs);
 		FREE_AND_NULL(installed_pack_names);
-		*midx_id = NULL;
+
+		if (opts.has_existing)
+			*midx_id = oid_to_hex(&opts.old_midx_oid);
+		else
+			*midx_id = NULL;
+
 		return 0;
 	}
 
+	if (midx)
+		nr_objects += midx->num_objects;
+
 	ALLOC_ARRAY(objects, nr_objects);
 	nr_objects = 0;
 
+	for (i = 0;
midx && i < midx->num_objects; i++)
+		nth_midxed_object_entry(midx, i, &objects[nr_objects++]);
+
 	for (i = pack_offset; i < nr_installed_packs; i++) {
 		struct packed_git *p = packs[i];
 
@@ -184,6 +207,10 @@ static int midx_write(void)
 	const char *midx_id = 0;
 	DIR *dir;
 	struct dirent *de;
+	struct midxed_git *midx = NULL;
+
+	if (opts.has_existing)
+		midx = get_midxed_git(opts.pack_dir, &opts.old_midx_oid);
 
 	dir = opendir(opts.pack_dir);
 	if (!dir) {
@@ -212,7 +239,8 @@ static int midx_write(void)
 	if (!nr_packs)
 		goto cleanup;
 
-	if (build_midx_from_packs(opts.pack_dir, pack_names, nr_packs, &midx_id))
+	if (build_midx_from_packs(opts.pack_dir, pack_names,
+				  nr_packs, &midx_id, midx))
 		die("failed to build MIDX");
 
 	if (midx_id == NULL)
diff --git a/midx.c b/midx.c
index 4e0df0285a..53eb29dac3 100644
--- a/midx.c
+++ b/midx.c
@@ -257,6 +257,29 @@ const struct object_id *nth_midxed_object_oid(struct object_id *oid,
 	return oid;
 }
 
+int contains_pack(struct midxed_git *m, const char *pack_name)
+{
+	uint32_t first = 0, last = m->num_packs;
+
+	while (first < last) {
+		uint32_t mid = first + (last - first) / 2;
+		const char *current;
+		int cmp;
+
+		current = m->pack_names[mid];
+		cmp = strcmp(pack_name, current);
+		if (!cmp)
+			return 1;
+		if (cmp > 0) {
+			first = mid + 1;
+			continue;
+		}
+		last = mid;
+	}
+
+	return 0;
+}
+
 static int midx_sha1_compare(const void *_a, const void *_b)
 {
 	struct pack_midx_entry *a = *(struct pack_midx_entry **)_a;
diff --git a/midx.h b/midx.h
index 9255909ae8..1e7a94651c 100644
--- a/midx.h
+++ b/midx.h
@@ -100,6 +100,8 @@ extern const struct object_id *nth_midxed_object_oid(struct object_id *oid,
 						     struct midxed_git *m,
 						     uint32_t n);
 
+extern int contains_pack(struct midxed_git *m, const char *pack_name);
+
 /*
  * Write a single MI
[RFC PATCH 09/18] midx: find details of nth object in midx
The MIDX file stores pack offset information for a list of objects. The nth_midxed_object_* methods provide ways to extract this information.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 midx.c | 55 +++
 midx.h | 15 +++
 2 files changed, 70 insertions(+)

diff --git a/midx.c b/midx.c
index c631be451f..4e0df0285a 100644
--- a/midx.c
+++ b/midx.c
@@ -202,6 +202,61 @@ struct pack_midx_details_internal {
 	uint32_t internal_offset;
 };
 
+struct pack_midx_details *nth_midxed_object_details(struct midxed_git *m,
+						    uint32_t n,
+						    struct pack_midx_details *d)
+{
+	struct pack_midx_details_internal *d_internal;
+	const unsigned char *details = m->chunk_object_offsets;
+
+	if (n >= m->num_objects)
+		return NULL;
+
+	d_internal = (struct pack_midx_details_internal*)(details + 8 * n);
+	d->pack_int_id = ntohl(d_internal->pack_int_id);
+	d->offset = ntohl(d_internal->internal_offset);
+
+	if (m->chunk_large_offsets && d->offset & MIDX_LARGE_OFFSET_NEEDED) {
+		uint32_t large_offset = d->offset ^ MIDX_LARGE_OFFSET_NEEDED;
+		const unsigned char *large_offsets = m->chunk_large_offsets + 8 * large_offset;
+
+		d->offset = (((uint64_t)ntohl(*((uint32_t *)(large_offsets + 0)))) << 32) |
+			    ntohl(*((uint32_t *)(large_offsets + 4)));
+	}
+
+	return d;
+}
+
+struct pack_midx_entry *nth_midxed_object_entry(struct midxed_git *m,
+						uint32_t n,
+						struct pack_midx_entry *e)
+{
+	struct pack_midx_details details;
+	const unsigned char *index = m->chunk_oid_lookup;
+
+	if (!nth_midxed_object_details(m, n, &details))
+		return NULL;
+
+	memcpy(e->oid.hash, index + m->hdr->hash_len * n, m->hdr->hash_len);
+	e->pack_int_id = details.pack_int_id;
+	e->offset = details.offset;
+
+	return e;
+}
+
+const struct object_id *nth_midxed_object_oid(struct object_id *oid,
+					      struct midxed_git *m,
+					      uint32_t n)
+{
+	struct pack_midx_entry e;
+
+	if (!nth_midxed_object_entry(m, n, &e))
+		return 0;
+
+	hashcpy(oid->hash, e.oid.hash);
+	return oid;
+}
+
 static int midx_sha1_compare(const void *_a, const void *_b)
 {
 	struct
pack_midx_entry *a = *(struct pack_midx_entry **)_a; diff --git a/midx.h b/midx.h index 92b74e49db..9255909ae8 100644 --- a/midx.h +++ b/midx.h @@ -85,6 +85,21 @@ struct midxed_git { extern struct midxed_git *get_midxed_git(const char *pack_dir, struct object_id *oid); +struct pack_midx_details { + uint32_t pack_int_id; + off_t offset; +}; + +extern struct pack_midx_details *nth_midxed_object_details(struct midxed_git *m, + uint32_t n, + struct pack_midx_details *d); +extern struct pack_midx_entry *nth_midxed_object_entry(struct midxed_git *m, + uint32_t n, + struct pack_midx_entry *e); +extern const struct object_id *nth_midxed_object_oid(struct object_id *oid, +struct midxed_git *m, +uint32_t n); + /* * Write a single MIDX file storing the given entries for the * given list of packfiles. If midx_name is null, then a temp -- 2.15.0
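
The large-offset indirection in nth_midxed_object_details() is easy to get wrong, so here is a small standalone sketch of the same decoding idea. It is an illustration rather than the patch's code: decode_offset(), read_be32() and LARGE_OFFSET_FLAG are hypothetical names, the flag is assumed to be the high bit of the 32-bit slot, and bytes are read one at a time instead of casting to uint32_t, which sidesteps alignment concerns.

```c
#include <stddef.h>
#include <stdint.h>

#define LARGE_OFFSET_FLAG 0x80000000u /* assumption: high bit marks an index */

/* Read a 32-bit big-endian value without assuming alignment. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * "slot" is the 32-bit offset field already converted to host order.
 * If the flag bit is set and a large-offset table exists, the remaining
 * 31 bits index 8-byte big-endian entries in that table; otherwise the
 * slot is the offset itself.
 */
static uint64_t decode_offset(uint32_t slot, const unsigned char *large_offsets)
{
	if (large_offsets && (slot & LARGE_OFFSET_FLAG)) {
		uint32_t idx = slot ^ LARGE_OFFSET_FLAG;
		const unsigned char *p = large_offsets + 8 * (size_t)idx;

		return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
	}
	return slot;
}
```

For example, a slot of 0x80000001 selects the second 8-byte entry of the table, while a slot of 0x1234 is returned unchanged.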
[RFC PATCH 08/18] midx: teach git-midx to read midx file details
Commentary: I included the pack directory of the MIDX file as a FLEX_ARRAY at the end of the midxed_git struct, similar to how the pack name appears at the end of the packed_git struct. A colleague mentioned this pattern is confusing and possibly dangerous so I should consider changing it. If there is no strong reason for this, then I will modify the struct before the v1 patch to use a char*. -- >8 -- Add a "--read" subcommand to the midx builtin to report summary information on the head MIDX file or a MIDX file specified by the supplied "--midx-id" parameter. This subcommand is used by t5318-midx.sh to verify the indexed objects are as expected. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-midx.txt | 23 +++- builtin/midx.c | 59 midx.c | 132 + midx.h | 58 t/t5318-midx.sh| 79 +++ 5 files changed, 328 insertions(+), 23 deletions(-) diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt index 01f79cbba5..3eeed1d969 100644 --- a/Documentation/git-midx.txt +++ b/Documentation/git-midx.txt @@ -9,7 +9,7 @@ git-midx - Write and verify multi-pack-indexes (MIDX files). SYNOPSIS [verse] -'git midx' --write [--pack-dir ] +'git midx' [--write|--read] [--pack-dir ] DESCRIPTION --- @@ -22,9 +22,18 @@ OPTIONS Use given directory for the location of packfiles, pack-indexes, and MIDX files. +--read:: + If specified, read a midx file specified by the midx-head file + and output basic details about the midx file. (Cannot be combined + with --write.) + +--midx-id :: + If specified with --read, use the given oid to read midx-[oid].midx + instead of using midx-head. --write:: If specified, write a new midx file to the pack directory using the packfiles present. Outputs the hash of the result midx file. + (Cannot be combined with --read.) 
 --update-head::
 	If specified with --write, update the midx-head file to point to
@@ -58,6 +67,18 @@ $ git midx --write --update-head
 $ git midx --write --pack-dir ../../alt/pack/
 -
 
+* Read the current midx-head.
++
+---
+$ git midx --read
+---
+
+* Read a specific MIDX file in the local .git folder.
++
+
+$ git midx --read --midx-id 3e50d982a2257168c7fd0ff12ffe5cf6af38c74e
+
+
 CONFIGURATION
 -
 
diff --git a/builtin/midx.c b/builtin/midx.c
index 84ce6588a2..ee9234583d 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -16,12 +16,60 @@ static char const * const builtin_midx_usage[] = {
 
 static struct opts_midx {
 	const char *pack_dir;
+	int read;
+	const char *midx_id;
 	int write;
 	int update_head;
 	int has_existing;
 	struct object_id old_midx_oid;
 } opts;
 
+static int midx_read(void)
+{
+	struct object_id midx_oid;
+	struct midxed_git *midx;
+	uint32_t i;
+
+	if (opts.midx_id && strlen(opts.midx_id) == GIT_MAX_HEXSZ)
+		get_oid_hex(opts.midx_id, &midx_oid);
+	else if (!get_midx_head_oid(opts.pack_dir, &midx_oid))
+		die("No midx-head exists.");
+
+	midx = get_midxed_git(opts.pack_dir, &midx_oid);
+
+	printf("header: %08x %x %d %d %d %d %d\n",
+		ntohl(midx->hdr->midx_signature),
+		ntohl(midx->hdr->midx_version),
+		midx->hdr->hash_version,
+		midx->hdr->hash_len,
+		midx->hdr->num_base_midx,
+		midx->hdr->num_chunks,
+		ntohl(midx->hdr->num_packs));
+	printf("num_objects: %d\n", midx->num_objects);
+	printf("chunks:");
+
+	if (midx->chunk_pack_lookup)
+		printf(" pack_lookup");
+	if (midx->chunk_pack_names)
+		printf(" pack_names");
+	if (midx->chunk_oid_fanout)
+		printf(" oid_fanout");
+	if (midx->chunk_oid_lookup)
+		printf(" oid_lookup");
+	if (midx->chunk_object_offsets)
+		printf(" object_offsets");
+	if (midx->chunk_large_offsets)
+		printf(" large_offsets");
+	printf("\n");
+
+	printf("pack_names:\n");
+	for (i = 0; i < midx->num_packs; i++)
+		printf("%s\n", midx->pack_names[i]);
+
+	printf("pack_dir: %s\n", midx->pack_dir);
+	return 0;
+}
+
 st
[RFC PATCH 18/18] packfile: use midx for object loads
When looking for a packed object, first check the MIDX for that object. This reduces thrashing in the MRU list of packfiles.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 midx.c | 84 ++
 midx.h | 3 +++
 packfile.c | 5 +++-
 3 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 8c643caa92..4b2398b3ee 100644
--- a/midx.c
+++ b/midx.c
@@ -329,6 +329,90 @@ int bsearch_midx(struct midxed_git *m, const unsigned char *sha1, uint32_t *pos)
 	return 0;
 }
 
+static int prepare_midx_pack(struct midxed_git *m, uint32_t pack_int_id)
+{
+	struct strbuf pack_name = STRBUF_INIT;
+
+	if (pack_int_id >= m->hdr->num_packs)
+		return 1;
+
+	if (m->packs[pack_int_id])
+		return 0;
+
+	strbuf_addstr(&pack_name, m->pack_dir);
+	strbuf_addstr(&pack_name, "/");
+	strbuf_addstr(&pack_name, m->pack_names[pack_int_id]);
+	strbuf_strip_suffix(&pack_name, ".pack");
+	strbuf_addstr(&pack_name, ".idx");
+
+	m->packs[pack_int_id] = add_packed_git(pack_name.buf, pack_name.len, 1);
+	strbuf_release(&pack_name);
+	return !m->packs[pack_int_id];
+}
+
+static int find_pack_entry_midx(const unsigned char *sha1,
+				struct midxed_git *m,
+				struct packed_git **p,
+				off_t *offset)
+{
+	uint32_t pos;
+	struct pack_midx_details d;
+
+	if (!bsearch_midx(m, sha1, &pos) ||
+	    !nth_midxed_object_details(m, pos, &d))
+		return 0;
+
+	if (d.pack_int_id >= m->num_packs)
+		die(_("bad pack-int-id %d"), d.pack_int_id);
+
+	/* load packfile, if necessary */
+	if (prepare_midx_pack(m, d.pack_int_id))
+		return 0;
+
+	*p = m->packs[d.pack_int_id];
+	*offset = d.offset;
+
+	return 1;
+}
+
+int fill_pack_entry_midx(const unsigned char *sha1,
+			 struct pack_entry *e)
+{
+	struct packed_git *p;
+	struct midxed_git *m;
+
+	if (!core_midx)
+		return 0;
+
+	m = midxed_git;
+	while (m)
+	{
+		off_t offset;
+		if (!find_pack_entry_midx(sha1, m, &p, &offset)) {
+			m = m->next;
+			continue;
+		}
+
+		/*
+		 * We are about to tell the caller where they can locate the
+		 * requested object.
We better make sure the packfile is + * still here and can be accessed before supplying that + * answer, as it may have been deleted since the MIDX was + * loaded! + */ + if (!is_pack_valid(p)) + return 0; + + e->offset = offset; + e->p = p; + hashcpy(e->sha1, sha1); + + return 1; + } + + return 0; +} + int contains_pack(struct midxed_git *m, const char *pack_name) { uint32_t first = 0, last = m->num_packs; diff --git a/midx.h b/midx.h index 5598799189..b7e8b15fe4 100644 --- a/midx.h +++ b/midx.h @@ -11,6 +11,9 @@ extern char *get_midx_head_filename(const char *pack_dir); extern struct object_id *get_midx_head_oid(const char *pack_dir, struct object_id *oid); +extern int fill_pack_entry_midx(const unsigned char *sha1, + struct pack_entry *e); + struct pack_midx_entry { struct object_id oid; uint32_t pack_int_id; diff --git a/packfile.c b/packfile.c index 866a1f30dd..9ec39a83e9 100644 --- a/packfile.c +++ b/packfile.c @@ -1883,7 +1883,10 @@ int find_pack_entry(const unsigned char *sha1, struct pack_entry *e) { struct mru_entry *p; - prepare_packed_git(); + prepare_packed_git_internal(1); + if (fill_pack_entry_midx(sha1, e)) + return 1; + if (!packed_git) return 0; -- 2.15.0
[RFC PATCH 06/18] midx: add t5318-midx.sh test script
Test interactions between the midx builtin and other Git operations.

Use both a full repo and a bare repo to ensure the pack directory
redirection works correctly.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 t/t5318-midx.sh | 100 
 1 file changed, 100 insertions(+)
 create mode 100755 t/t5318-midx.sh

diff --git a/t/t5318-midx.sh b/t/t5318-midx.sh
new file mode 100755
index 00..869bbea29c
--- /dev/null
+++ b/t/t5318-midx.sh
@@ -0,0 +1,100 @@
+#!/bin/sh
+
+test_description='multi-pack indexes'
+. ./test-lib.sh
+
+test_expect_success 'config' \
+'rm -rf .git &&
+    mkdir full &&
+    cd full &&
+    git init &&
+    git config core.midx true &&
+    git config pack.threads 1 &&
+    packdir=.git/objects/pack'
+
+test_expect_success 'write-midx with no packs' \
+'midx0=$(git midx --write) &&
+    test "a$midx0" = "a"'
+
+test_expect_success 'create objects' \
+'for i in $(test_seq 100)
+    do
+        echo $i >file-1-$i
+    done &&
+    git add file-* &&
+    test_tick &&
+    git commit -m "test data 1" &&
+    git branch commit1 HEAD'
+
+test_expect_success 'write-midx from index version 1' \
+'pack1=$(git rev-list --all --objects | git pack-objects --index-version=1 ${packdir}/test-1) &&
+    midx1=$(git midx --write) &&
+    test_path_is_file ${packdir}/midx-${midx1}.midx'
+
+test_expect_success 'write-midx from index version 2' \
+'rm "${packdir}/test-1-${pack1}.pack" &&
+    pack2=$(git rev-list --all --objects | git pack-objects --index-version=2 ${packdir}/test-2) &&
+    midx2=$(git midx --write) &&
+    test_path_is_file ${packdir}/midx-${midx2}.midx'
+
+test_expect_success 'Create more objects' \
+'for i in $(test_seq 100)
+    do
+        echo $i >file-2-$i
+    done &&
+    git add file-* &&
+    test_tick &&
+    git commit -m "test data 2" &&
+    git branch commit2 HEAD'
+
+test_expect_success 'write-midx with two packs' \
+'pack3=$(git rev-list --objects commit2 ^commit1 | git pack-objects --index-version=2 ${packdir}/test-3) &&
+    midx3=$(git midx --write) &&
+    test_path_is_file ${packdir}/midx-${midx3}.midx'
+
+test_expect_success 'Add more packs' \
+'for j in $(test_seq 10)
+    do
+        jjj=$(printf '%03i' $j)
+        test-genrandom "bar" 200 > wide_delta_$jjj &&
+        test-genrandom "baz $jjj" 50 >> wide_delta_$jjj &&
+        test-genrandom "foo"$j 100 > deep_delta_$jjj &&
+        test-genrandom "foo"$(expr $j + 1) 100 >> deep_delta_$jjj &&
+        test-genrandom "foo"$(expr $j + 2) 100 >> deep_delta_$jjj &&
+        echo $jjj >file_$jjj &&
+        test-genrandom "$jjj" 8192 >>file_$jjj &&
+        git update-index --add file_$jjj deep_delta_$jjj wide_delta_$jjj &&
+        { echo 101 && test-genrandom 100 8192; } >file_101 &&
+        git update-index --add file_101 &&
+        commit=$(git commit-tree $EMPTY_TREE -p HEADobj-list &&
+        echo commit_packs_$j = $commit &&
+git branch commit_packs_$j $commit &&
+        git update-ref HEAD $commit &&
+        git pack-objects --index-version=2 ${packdir}/test-pack
[RFC PATCH 16/18] midx: nth_midxed_object_oid() and bsearch_midx()
Using a binary search, we can navigate to the position n within a
MIDX file where an object appears in the ordered list of objects.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 midx.c | 30 ++
 midx.h |  9 +
 2 files changed, 39 insertions(+)

diff --git a/midx.c b/midx.c
index a66763b9e3..8c643caa92 100644
--- a/midx.c
+++ b/midx.c
@@ -299,6 +299,36 @@ const struct object_id *nth_midxed_object_oid(struct object_id *oid,
 	return oid;
 }
 
+int bsearch_midx(struct midxed_git *m, const unsigned char *sha1, uint32_t *pos)
+{
+	uint32_t last, first = 0;
+
+	if (sha1[0])
+		first = ntohl(*(uint32_t*)(m->chunk_oid_fanout + 4 * (sha1[0] - 1)));
+	last = ntohl(*(uint32_t*)(m->chunk_oid_fanout + 4 * sha1[0]));
+
+	while (first < last) {
+		uint32_t mid = first + (last - first) / 2;
+		const unsigned char *current;
+		int cmp;
+
+		current = m->chunk_oid_lookup + m->hdr->hash_len * mid;
+		cmp = hashcmp(sha1, current);
+		if (!cmp) {
+			*pos = mid;
+			return 1;
+		}
+		if (cmp > 0) {
+			first = mid + 1;
+			continue;
+		}
+		last = mid;
+	}
+
+	*pos = first;
+	return 0;
+}
+
 int contains_pack(struct midxed_git *m, const char *pack_name)
 {
 	uint32_t first = 0, last = m->num_packs;
diff --git a/midx.h b/midx.h
index d8ede8121c..5598799189 100644
--- a/midx.h
+++ b/midx.h
@@ -101,6 +101,15 @@ extern const struct object_id *nth_midxed_object_oid(struct object_id *oid,
 						      struct midxed_git *m,
 						      uint32_t n);
 
+/*
+ * Perform a binary search on the object list in a MIDX file for the given sha1.
+ *
+ * If the object exists, then return 1 and set *pos to the position of the sha1.
+ * Otherwise, return 0 and set *pos to the position of the lex-first object greater
+ * than the given sha1.
+ */
+extern int bsearch_midx(struct midxed_git *m, const unsigned char *sha1, uint32_t *pos);
+
 extern int contains_pack(struct midxed_git *m, const char *pack_name);
 
 /*
-- 
2.15.0
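For readers who want to try the lookup outside of Git, here is a minimal standalone sketch of the same idea as bsearch_midx(): a 256-entry fanout table narrows the range, then a binary search runs over the sorted OIDs. The names fanout_bsearch and HASH_LEN are made up for this sketch, and the fanout is assumed to be already converted to host byte order (the patch reads it from the file with ntohl()).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 20 /* assumption: binary SHA-1 length */

/*
 * Hypothetical stand-in for the MIDX lookup. 'fanout' has 256 host-order
 * entries, where fanout[b] is the number of OIDs whose first byte is <= b.
 * 'oids' is the sorted, contiguous list of HASH_LEN-byte OIDs.
 *
 * Returns 1 and sets *pos on a hit. Returns 0 and sets *pos to the
 * position where the OID would be inserted on a miss.
 */
static int fanout_bsearch(const uint32_t *fanout, const unsigned char *oids,
			  const unsigned char *want, uint32_t *pos)
{
	/* Only OIDs sharing the first byte of 'want' can match. */
	uint32_t first = want[0] ? fanout[want[0] - 1] : 0;
	uint32_t last = fanout[want[0]];

	assert(first <= last); /* a valid fanout never decreases */

	while (first < last) {
		uint32_t mid = first + (last - first) / 2;
		int cmp = memcmp(want, oids + (size_t)mid * HASH_LEN, HASH_LEN);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp > 0)
			first = mid + 1;
		else
			last = mid;
	}

	*pos = first;
	return 0;
}
```

With a fanout in front, the binary search only has to cover the OIDs that share the first byte of the target rather than the whole list.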
Re: merge-base --is-ancestor A B is unreasonably slow with unrelated history B
On 1/9/2018 10:17 AM, Ævar Arnfjörð Bjarmason wrote:

This is a pathological case I don't have time to dig into right now:

    git branch -D orphan; git checkout --orphan orphan && git reset --hard && touch foo && git add foo && git commit -m"foo" && time git merge-base --is-ancestor master orphan

This takes around 5 seconds on linux.git to return 1. Which is around the same time it takes to run current master against the first commit in linux.git:

    git merge-base --is-ancestor 1da177e4c3f4 master

This is obviously a pathological case, but maybe we should work slightly harder on the RHS of and discover that it itself is an orphan commit. I ran into this while writing a hook where we'd like to do:

    git diff $master...topic

Or not, depending on if the topic is an orphan or just something recently branched off, figured I could use --is-ancestor as an optimization, and then discovered it's not much of an optimization.

Ævar,

This is the same performance problem that we are trying to work around with Jeff's "Add --no-ahead-behind to status" patch [1]. For commits that are far apart, many commits need to be parsed. I think the right solution is to create a serialized commit graph that stores the adjacency information of the commits and can create commit structs quickly. This requires storing the commit id, commit date, parents, and root tree id to satisfy the needs of parse_commit_gently().

Once the framework for this data is constructed, it is simple to add generation numbers to that data and start consuming them in other algorithms (by adding the field to 'struct commit'). I'm working on such a patch right now, but it will be a few weeks before I'm ready.

Thanks,
-Stolee

[1] v5 of --no-ahead-behind https://public-inbox.org/git/20180109185018.69164-1-...@jeffhostetler.com/T/#t
[2] v4 of --no-ahead-behind https://public-inbox.org/git/nycvar.qro.7.76.6.1801091744540...@zvavag-6oxh6da.rhebcr.pbec.zvpebfbsg.pbz/T/#t
Re: [PATCH v4 0/4] Add --no-ahead-behind to status
On 1/9/2018 8:15 AM, Johannes Schindelin wrote: Hi Peff, On Tue, 9 Jan 2018, Jeff King wrote: On Mon, Jan 08, 2018 at 03:04:20PM -0500, Jeff Hostetler wrote: I was thinking about something similar to the logic we use today about whether to start reporting progress on other long commands. That would mean you could still get the ahead/behind values if you aren't that far behind but would only get "different" if that calculation gets too expensive (which implies the actual value isn't going to be all that helpful anyway). After an off-line conversation with the others I'm going to look into a version that is limited to n commits rather than be completely on or off. I think if you are for example less than 100 a/b then those numbers have value; if you are past n, then they have much less value. I'd rather do it by a fixed limit than by time to ensure that output is predictable on graph shape and not on system load. I like this direction a lot. I had hoped we could say "100+ commits ahead", How about "100+ commits apart" instead? Unfortunately, we can _never_ guarantee more than 1 commit ahead/behind unless we walk to the merge base (or have generation numbers). For example, suppose the 101st commit in each history has a parent that is in the recent history of the other commit. (There must be merge commits to make this work without creating cycles, but the ahead/behind counts could be much lower than the number of walked commits.) but I don't think we can do so accurately without generation numbers. Even with generation numbers, it is not possible to say whether two given commits reflect diverging branches or have an ancestor-descendant relationship (or in graph speak: whether they are comparable). If you walk commits using a priority queue where the priority is the generation number, then you can know that you have walked all reachable commits with generation greater than X, so you know among those commits which are comparable or not.
For this to work accurately, you must walk from both tips to a generation lower than each. It does not help the case where one branch is 100,000+ commits ahead, where most of those commits have higher generation number than the behind commit. It could potentially make it possible to cut off the commit traversal, but I do not even see how that would be possible. The only thing you could say for sure is that two different commits with the same generation number are for sure "uncomparable", i.e. reflect diverging branches. E.g., the case I mentioned at the bottom of this mail: https://public-inbox.org/git/20171224143318.gc23...@sigill.intra.peff.net/ I haven't thought too hard on it, but I suspect with generation numbers you could bound it and at least say "100+ ahead" or "100+ behind". If you have walked 100 commits and still have not found a merge base, there is no telling whether one start point is the ancestor of the other. All you can say is that there are more than 100 commits in the "difference". You would not even be able to say that the *shortest* path between those two start points is longer than 100 commits, you can construct pathological DAGs pretty easily. Even if you had generation numbers, and one commit's generation number was, say, 17, and the other one's was 17,171, you could not necessarily assume that the 17 one is the ancestor of the 17,171 one, all you can say that it is not possible the other way round. This is why we cannot _always_ use generation numbers, but they do help in some cases. But I don't think you can approximate both ahead and behind together without finding the actual merge base. But even still, finding small answers quickly and accurately and punting to "really far, I didn't bother to compute it" on the big ones would be an improvement over always punting. Indeed. The longer I think about it, the more I like the "100+ commits apart" idea. 
Again, I strongly suggest we drop this approach because it will be more pain than it is worth. Thanks, -Stolee
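As a small illustration of what generation numbers can and cannot prove in the discussion above, here is a sketch in C. It assumes the definition proposed elsewhere in this series (a commit with no parents has generation 1, any other commit has one more than the maximum generation of its parents, and 0 means "not computed"). The function and macro names are made up for this sketch and are not Git API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATION_UNKNOWN 0 /* zero is reserved as "not computed" */

/*
 * Generation of a commit given the generations of its parents:
 * 1 for a root commit, otherwise max(parent generations) + 1.
 */
static uint32_t generation_from_parents(const uint32_t *parent_gens, size_t n)
{
	uint32_t max = 0;

	for (size_t i = 0; i < n; i++) {
		assert(parent_gens[i] != GENERATION_UNKNOWN);
		if (parent_gens[i] > max)
			max = parent_gens[i];
	}
	return max + 1;
}

/*
 * Returns 1 when the generation numbers alone prove that 'anc' cannot be
 * a proper ancestor of 'desc', and 0 when they prove nothing and a walk
 * is still required.
 */
static int generation_rules_out_ancestor(uint32_t gen_anc, uint32_t gen_desc)
{
	if (gen_anc == GENERATION_UNKNOWN || gen_desc == GENERATION_UNKNOWN)
		return 0;
	/* A proper ancestor always has a strictly smaller generation. */
	return gen_anc >= gen_desc;
}
```

This matches the 17 versus 17,171 example: the commit with generation 17,171 cannot be an ancestor of the one with generation 17, but nothing can be concluded in the other direction without walking. Two distinct commits with equal generation numbers are always ruled out in both directions.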
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On 1/7/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote: On Sun, Jan 07 2018, Derrick Stolee jotted:

    git log --oneline --raw --parents

    Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
    ----------+-------------+------------+--------+----------
            1 |     35.64 s |    35.28 s |  -1.0% |    -1.0%
           24 |     90.81 s |    40.06 s | -55.9% |   +12.4%
          127 |    257.97 s |    42.25 s | -83.6% |   +18.6%

The last column is the relative difference between the MIDX-enabled repo and the single-pack repo. The goal of the MIDX feature is to present the ODB as if it was fully repacked, so there is still room for improvement.

Changing the command to

    git log --oneline --raw --parents --abbrev=40

has no observable difference (sub 1% change in all cases). This is likely due to the repack I used putting commits and trees in a small number of packfiles so the MRU cache works very well. On more naturally-created lists of packfiles, there can be up to 20% improvement on this command.

We are using a version of this patch with an upcoming release of GVFS. This feature is particularly important in that space since GVFS performs a "prefetch" step that downloads a pack of commits and trees on a daily basis. These packfiles are placed in an alternate that is shared by all enlistments. Some users have 150+ packfiles and the MRU misses and abbreviation computations are significant. Now, GVFS manages the MIDX file after adding new prefetch packfiles using the following command:

    git midx --write --update-head --delete-expired --pack-dir=

(Not a critique of this, just a (stupid) question) What's the practical use-case for this feature? Since it doesn't help with --abbrev=40 the speedup is all in the part that ensures we don't show an ambiguous SHA-1.

The point of including the --abbrev=40 is to point out that object lookups do not get slower with the MIDX feature. Using these "git log" options is a good way to balance object lookups and abbreviations with object parsing and diff machinery.
And while the public data shape I shared did not show a difference, our private testing of the Windows repository did show a valuable improvement when isolating to object lookups and ignoring abbreviation calculations. The reason we do that at all is because it makes for a prettier UI. We tried setting core.abbrev=40 on GVFS enlistments to speed up performance and the users rebelled against the hideous output. They would rather have slower speeds than long hashes. Are there things that both want the pretty SHA-1 and also care about the throughput? I'd have expected machine parsing to just use --no-abbrev-commit. The --raw flag outputs blob hashes, so the --abbrev=40 covers all hashes. If something cares about both throughput and e.g. is saving the abbreviated SHA-1s isn't it better off picking some arbitrary size (e.g. --abbrev=20), after all the default abbreviation is going to show something as small as possible, which may soon become ambiguous after the next commit. Unfortunately, with the way the abbreviation algorithms work, using --abbrev=20 will have similar performance problems because you still need to inspect all packfiles to ensure there isn't a collision in the first 20 hex characters. Thanks, -Stolee
Re: [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes
On 1/8/2018 2:32 PM, Jonathan Tan wrote: On Sun, 7 Jan 2018 13:14:42 -0500 Derrick Stolee <sto...@gmail.com> wrote: +Design Details +-- + +- The MIDX file refers only to packfiles in the same directory + as the MIDX file. + +- A special file, 'midx-head', stores the hash of the latest + MIDX file so we can load the file without performing a dirstat. + This file is especially important with incremental MIDX files, + pointing to the newest file. I presume that the actual MIDX files are named by hash? (You might have written this somewhere that I somehow missed.) Also, I notice that in the "Future Work" section, the possibility of multiple MIDX files is raised. Could this 'midx-head' file be allowed to store multiple such files? That way, we avoid a bit of file format churn (in that we won't need to define a new "chunk" in the future). I hadn't considered this idea, and I like it. I'm not sure this is a robust solution, since isolated MIDX files don't contain information that they could use other MIDX files, or what order they should be in. I think the "order" of incremental MIDX files is important in a few ways (such as the "stable object order" idea). I will revisit this idea when I come back with the incremental MIDX feature. For now, the only reference to "number of base MIDX files" is in one byte of the MIDX header. We should consider changing that byte for this patch. +- If a packfile exists in the pack directory but is not referenced + by the MIDX file, then the packfile is loaded into the packed_git + list and Git can access the objects as usual. This behavior is + necessary since other tools could add packfiles to the pack + directory without notifying Git. + +- The MIDX file should be only a supplemental structure. If a + user downgrades or disables the `core.midx` config setting, + then the existing .idx and .pack files should be sufficient + to operate correctly. 
Let me try to summarize: so, at this point, there are no backwards-incompatible changes to the repo disk format. Unupdated code paths (and old versions of Git) can just read the .idx and .pack files, as always. Updated code paths will look at the .midx and .idx files, and will sort them as follows:

- .midx files go into a data structure
- .idx files not referenced by any .midx files go into the existing packed_git data structure

A writer can either merely write a new packfile (like old versions of Git) or write a packfile and update the .midx file, and everything above will still work. In the event that a writer deletes an existing packfile referenced by a .midx (for example, old versions of Git during a repack), we will lose the advantages of the .midx file - we will detect that the .midx no longer works when attempting to read an object given its information, but in this case, we can recover by dropping the .midx file and loading all the .idx files it references that still exist.

As a reviewer, I think this is a very good approach, and this does make things easier to review (as opposed to, say, an approach where a lot of the code must be aware of .midx files).

Thanks! That is certainly the idea. If you know about MIDX, then you can benefit from it. If you do not, then you have all the same data available to you to do your work. Having a MIDX file will not break other tools (libgit2, JGit, etc.).

One thing I'd like to determine before this patch goes to v1 is how much we should make the other packfile-aware commands also midx-aware. My gut reaction right now is to have git-repack call 'git midx --clear' if core.midx=true and a packfile was deleted. However, this could easily be changed with 'git midx --clear' followed by 'git midx --write --update-head' if midx-head exists.

Thanks,
-Stolee
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On 1/10/2018 1:25 PM, Martin Fick wrote: On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee wrote: This RFC includes a new way to index the objects in multiple packs using one file, called the multi-pack index (MIDX). ... The main goals of this RFC are: * Determine interest in this feature. * Find other use cases for the MIDX feature. My interest in this feature would be to speed up fetches when there is more than one large pack-file with many of the same objects that are in other pack-files. What does your MIDX design do when it encounters multiple copies of the same object in different pack files? Does it index them all, or does it keep a single copy? The MIDX currently keeps only one reference to each object. Duplicates are dropped during writing. (See the care taken in commit 04/18 to avoid duplicates.) Since midx_sha1_compare() does not use anything other than the OID to order the objects, there is no decision being made about which pack is "better". The MIDX writes the first copy it finds and discards the others. It would not be difficult to include a check in midx_sha1_compare() to favor one packfile over another based on some measurement (size? mtime?). Since this would be a heuristic at best, I left it out of the current patch. In our Gerrit instance (Gerrit uses jgit), we have multiple copies of the linux kernel repos linked together via the alternatives file mechanism. GVFS also uses alternates for sharing packfiles across multiple copies of the repo. The MIDX is designed to cover all packfiles in the same directory, but is not designed to cover packfiles in multiple alternates; currently, each alternate would need its own MIDX file. Does that cause issues with your setup? These repos have many different references (mostly Gerrit change references), but they share most of the common objects from the mainline. 
I have found that during a large fetch such as a clone, jgit spends a significant amount of extra time by having the extra large pack-files from the other repos visible to it, usually around an extra minute per instance of these (without them, the clone takes around 7mins). This adds up easily with a few extra repos; it can almost double the time. My investigations have shown that this is due to jgit searching each of these pack files to decide which version of each object to send. I don't fully understand its selection criteria, however if I shortcut it to just pick the first copy of an object that it finds, I regain my lost time. I don't know if git suffers from a similar problem? If git doesn't suffer from this then it likely just uses the first copy of an object it finds (which may not be the best object to send?) It would be nice if this use case could be improved with MIDX. To do so, it seems that it would require that MIDX either only put "the best" version of an object (i.e. pre-select which one to use), or include the extra information to help make the selection process of which copy to use (perhaps based on the operation being performed) fast.

I'm not sure if there is sufficient value in storing multiple references to the same object stored in multiple packfiles. There could be value in carefully deciding which copy is "best" during the MIDX write, but during read is not a good time to make such a decision. It also increases the size of the file to store multiple copies.

This also leads me to ask, what other additional information (bitmaps?) for other operations, besides object location, might suddenly be valuable in an index that potentially points to multiple copies of objects? Would such information be appropriate in MIDX, or would it be better in another index?

For applications to bitmaps, it is probably best that we only include one copy of each object.
Otherwise, we need to include extra bits in the bitmaps for those copies (when asking "is this object in the bitmap?"). Thanks for the context with Gerrit's duplicate object problem. I'll try to incorporate it into the design document (commit 01/18) for the v1 patch. Thanks, -Stolee
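To make the "one reference per object" behavior described in this reply concrete, here is a hedged sketch of dropping duplicates from an OID-sorted entry list. It is not the code from commit 04/18; the struct and function names are invented for illustration, and it only assumes what the reply states: entries are ordered by OID alone, and the first copy found is the one that is kept.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OID_LEN 20 /* assumption: binary SHA-1 length */

/* Hypothetical entry layout, loosely modeled on the discussion. */
struct midx_entry_sketch {
	unsigned char oid[OID_LEN];
	uint32_t pack_int_id;
	uint64_t offset;
};

/*
 * 'entries' must already be sorted by OID. Keeps the first entry of each
 * run of equal OIDs, compacts the array in place, and returns the new
 * number of entries.
 */
static size_t drop_duplicate_oids(struct midx_entry_sketch *entries, size_t n)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		if (kept &&
		    !memcmp(entries[kept - 1].oid, entries[i].oid, OID_LEN))
			continue; /* same object seen in another pack */

		/* The input is required to be sorted by OID. */
		assert(!kept ||
		       memcmp(entries[kept - 1].oid, entries[i].oid, OID_LEN) < 0);
		entries[kept++] = entries[i];
	}
	return kept;
}
```

Because the comparison looks only at the OID, which pack "wins" for a duplicated object depends entirely on the order in which equal entries happen to arrive. A heuristic such as preferring a pack by size or mtime would have to be added to the sort, as the reply suggests.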
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On 1/8/2018 5:20 AM, Jeff King wrote: On Sun, Jan 07, 2018 at 07:08:54PM -0500, Derrick Stolee wrote: (Not a critique of this, just a (stupid) question) What's the practical use-case for this feature? Since it doesn't help with --abbrev=40 the speedup is all in the part that ensures we don't show an ambiguous SHA-1. The point of including the --abbrev=40 is to point out that object lookups do not get slower with the MIDX feature. Using these "git log" options is a good way to balance object lookups and abbreviations with object parsing and diff machinery. And while the public data shape I shared did not show a difference, our private testing of the Windows repository did show a valuable improvement when isolating to object lookups and ignoring abbreviation calculations. Just to make sure I'm parsing this correctly: normal lookups do get faster when you have a single index, given the right setup? I'm curious what that setup looked like. Is it just tons and tons of packs? Is it ones where the packs do not follow the mru patterns very well? The way I repacked the Linux repo creates an artificially good set of packs for the MRU cache. When the packfiles are partitioned instead by the time the objects were pushed to a remote, the MRU cache performs poorly. Improving these object lookups is a primary reason for the MIDX feature, and almost all commands improve because of it. 'git log' is just the simplest to use for demonstration. I think it's worth thinking a bit about, because... If something cares about both throughput and e.g. is saving the abbreviated SHA-1s isn't it better off picking some arbitrary size (e.g. --abbrev=20), after all the default abbreviation is going to show something as small as possible, which may soon become ambiguous after the next commit.
Unfortunately, with the way the abbreviation algorithms work, using --abbrev=20 will have similar performance problems because you still need to inspect all packfiles to ensure there isn't a collision in the first 20 hex characters.

...if what we primarily care about speeding up is abbreviations, is it crazy to consider disabling the disambiguation step entirely? The results of find_unique_abbrev are already a bit of a probability game. They're guaranteed at the moment of generation, but as more objects are added, ambiguities may be introduced. Likewise, what's unambiguous for you may not be for somebody else you're communicating with, if they have their own clone. Since we now scale the default abbreviation with the size of the repo, that gives us a bounded and pretty reasonable probability that we won't hit a collision at all[1]. I.e., what if we did something like this:

diff --git a/sha1_name.c b/sha1_name.c
index 611c7d24dd..04c661ba85 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 	if (len == GIT_SHA1_HEXSZ || !len)
 		return GIT_SHA1_HEXSZ;
 
+	/*
+	 * A default length of 10 implies a repository big enough that it's
+	 * getting expensive to double check the ambiguity of each object,
+	 * and the chance that any particular object of interest has a
+	 * collision is low.
+	 */
+	if (len >= 10)
+		return len;
+
 	mad.init_len = len;
 	mad.cur_len = len;
 	mad.hex = hex;

If I repack my linux.git with "--max-pack-size=64m", I get 67 packs. The patch above drops "git log --oneline --raw" on the resulting repo from ~150s to ~30s. With a single pack, it goes from ~33s to ~29s. Less impressive, but there's still some benefit. There may be other reasons to want MIDX or something like it, but I just wonder if we can do this much simpler thing to cover the abbreviation case. I guess the question is whether somebody is going to be annoyed in the off chance that they hit a collision.
Not only are users going to be annoyed when they hit collisions after copy-pasting an abbreviated hash, there are also a large number of tools that people build that use abbreviated hashes (either for presenting to users or because they didn't turn off abbreviations). Abbreviations cause performance issues in other commands, too (like 'fetch'!), so whatever short-circuit you put in, it would need to be global. A flag on one builtin would not suffice. -Peff [1] I'd have to check the numbers, but IIRC we've set the scaling so that the chance of having a _single_ collision in the repository is less than 50%, and rounding to the conservative side (since each hex char gives us 4 bits). And indeed, "git log --oneline --raw" on linux.git does not seem to have any collisions at its default of 12 characters, at least in my clone. We could also consider switching core.disambiguate to "commit", which makes even a collision less likely to annoy the user.
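The scaling rule recalled in footnote [1] can be approximated with a back-of-the-envelope birthday bound. The sketch below is not Git's actual abbreviation code; the function names are invented, and the formula is an assumption chosen to match the description "less than 50% chance of a single collision, rounded conservatively to whole hex characters". With n objects and a b-bit prefix, the expected number of colliding pairs is below n^2 / 2^(b+1), so choosing b >= 2 * ceil(log2(n)) keeps it under 1/2.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest k such that 2^k >= n, for n >= 1. */
static unsigned ceil_log2(uint64_t n)
{
	unsigned k = 0;

	assert(n >= 1);
	while (k < 64 && ((uint64_t)1 << k) < n)
		k++;
	return k;
}

/*
 * Hypothetical estimate of an abbreviation length (in hex characters)
 * that keeps the expected number of colliding prefix pairs below 1/2
 * for 'object_count' objects.
 */
static unsigned abbrev_len_for(uint64_t object_count)
{
	unsigned bits = 2 * ceil_log2(object_count);

	/* Each hex character carries 4 bits; round up. */
	return (bits + 3) / 4;
}
```

For example, about one million objects (2^20) gives 10 hex characters under this estimate, and a repository with six million objects gives 12.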
Re: [PATCH 03/14] packed-graph: create git-graph builtin
On 1/25/2018 6:01 PM, Junio C Hamano wrote: Derrick Stolee <sto...@gmail.com> writes: Teach Git the 'graph' builtin that will be used for writing and reading packed graph files. The current implementation is mostly empty, except for a check that the core.graph setting is enabled and a '--pack-dir' option. Just to set my expectation straight. Is it fair to say that in the ideal endgame state, this will be like "git pack-objects" in that end users won't have to know about it, but would serve as a crucial building block that is invoked by other front-end commands that are more familiar to end users (just like pack-objects are used behind the scenes by repack, push, etc.)? That is my hope. Leaving that integration for later, after this feature has proven itself.
Re: [PATCH 04/14] packed-graph: add format document
On 1/25/2018 5:07 PM, Stefan Beller wrote: On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee <sto...@gmail.com> wrote: Add document specifying the binary format for packed graphs. This format allows for: * New versions. * New hash functions and hash lengths. * Optional extensions. Basic header information is followed by a binary table of contents into "chunks" that include: * An ordered list of commit object IDs. * A 256-entry fanout into that list of OIDs. * A list of metadata for the commits. * A list of "large edges" to enable octopus merges. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/graph-format.txt | 88 So this is different from Documentation/technical/packed-graph.txt, which gives high level design and this gives the details on how to set bits. 1 file changed, 88 insertions(+) create mode 100644 Documentation/technical/graph-format.txt diff --git a/Documentation/technical/graph-format.txt b/Documentation/technical/graph-format.txt new file mode 100644 index 00..a15e1036d7 --- /dev/null +++ b/Documentation/technical/graph-format.txt @@ -0,0 +1,88 @@ +Git commit graph format +=== + +The Git commit graph stores a list of commit OIDs and some associated +metadata, including: + +- The generation number of the commit. Commits with no parents have + generation number 1; commits with parents have generation number + one more than the maximum generation number of its parents. We + reserve zero as special, and can be used to mark a generation + number invalid or as "not computed". + +- The root tree OID. + +- The commit date. + +- The parents of the commit, stored using positional references within + the graph file. + +== graph-*.graph files have the following format: + +In order to allow extensions that add extra data to the graph, we organize +the body into "chunks" and provide a binary lookup table at the beginning +of the body. The header includes certain values, such as number of chunks, +hash lengths and types. 
+ +All 4-byte numbers are in network order. + +HEADER: + + 4-byte signature: + The signature is: {'C', 'G', 'P', 'H'} + + 1-byte version number: + Currently, the only valid version is 1. + + 1-byte Object Id Version (1 = SHA-1) + + 1-byte Object Id Length (H) This is 20 or 40 for sha1 ? (binary or text representation?) 20 for binary. + 1-byte number (C) of "chunks" + +CHUNK LOOKUP: + + (C + 1) * 12 bytes listing the table of contents for the chunks: + First 4 bytes describe chunk id. Value 0 is a terminating label. + Other 8 bytes provide offset in current file for chunk to start. ... offset [in bytes/words/4k blocks?] in ... bytes. + (Chunks are ordered contiguously in the file, so you can infer + the length using the next chunk position if necessary.) + + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. + +CHUNK DATA: + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of commits (N). So F[0] > 0 for git.git for example. Or another way: To lookup a 01xxx, I need to look at entry(F[00] + 1 )...entry(F[01]). Makes sense. + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all commits in the graph. ... sorted ascending. + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes) + * The first H bytes are for the OID of the root tree. + * The next 8 bytes are for the int-ids of the first two parents of + the ith commit. Stores value 0x if no parent in that position. + If there are more than two parents, the second value has its most- + significant bit on and the other bits store an offset into the Large + Edge List chunk. s/an offset into/position in/ ? (otherwise offset in bytes?) + * The next 8 bytes store the generation number of the commit and the + commit time in seconds since EPOCH. 
The generation number uses the + higher 30 bits of the first 4 bytes, while the commit time uses the + 32 bits of the second 4 bytes, along with the lowest 2 bits of the + lowest byte, storing the 33rd and 34th bit of the commit time. This allows for a maximum generation number of 1.073.741.823 (2^30 -1) = 1 billion, and a max time stamp of later than 2100. Do you allow negative time stamps? + + [Optional] Large Edge List (ID: {'E', 'D', 'G', 'E'}) + This list of 4-byte values store the second through nth parents for +
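The 8-byte generation and commit-date field described in the Commit Data chunk above can be sketched as a pair of pack and unpack helpers. This is only an illustration of the bit layout as worded in the proposal (30 bits of generation in the high bits of the first word, the 33rd and 34th date bits in its low 2 bits, the low 32 date bits in the second word). The helper names are made up, and the values are kept in host order; the conversion to network order that the format requires is left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pack a 30-bit generation number and a 34-bit commit time into two
 * 32-bit words (host order; byte-swapping for the file is omitted).
 */
static void pack_gen_date(uint32_t generation, uint64_t date, uint32_t out[2])
{
	assert(generation < ((uint32_t)1 << 30)); /* must fit in 30 bits */
	assert(date < ((uint64_t)1 << 34));       /* must fit in 34 bits */

	out[0] = (generation << 2) | (uint32_t)(date >> 32);
	out[1] = (uint32_t)(date & 0xffffffffu);
}

/* Inverse of pack_gen_date(). */
static void unpack_gen_date(const uint32_t in[2],
			    uint32_t *generation, uint64_t *date)
{
	*generation = in[0] >> 2;
	*date = ((uint64_t)(in[0] & 0x3) << 32) | in[1];
}
```

Under this layout the largest representable generation is 2^30 - 1, which is the 1,073,741,823 figure mentioned in the review comment.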
Re: [PATCH 01/14] graph: add packed graph design document
On 1/25/2018 4:14 PM, Junio C Hamano wrote:
> Derrick Stolee <sto...@gmail.com> writes:
>
>> Add Documentation/technical/packed-graph.txt with details of the planned
>> packed graph feature, including future plans.
>>
>> Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
>> ---
>>  Documentation/technical/packed-graph.txt | 185 +++
>>  1 file changed, 185 insertions(+)
>>  create mode 100644 Documentation/technical/packed-graph.txt
>
> I really wanted to like having this patch at the beginning, but
> unfortunately I didn't see the actual file format description, which was
> a bit disappointing.
>
> An example of the things that I was curious about was how the "integer
> ID" is used to access into the file. If we could somehow use "integer
> ID" as an index into an array of fixed size elements, it would be ideal
> to gain "fast lookups", but because of the "list of parents" thing, it
> needs some trickery to do so, and that was among the things that I
> wanted to see how much thought went into the design, for example.

There is definitely a chicken-or-the-egg situation here. I'm happy to start with the format before the design document.

I can try to expand this "integer ID" concept, but you can see how I use it in the following method from patch 11/14:

+int parse_packed_commit(struct commit *item)
+{
+	if (!core_graph)
+		return 0;
+	if (item->object.parsed)
+		return 1;
+
+	prepare_packed_graph();
+	if (packed_graph) {
+		uint32_t pos;
+		int found;
+		if (item->graphId != 0xFFFFFFFF) {
+			pos = item->graphId;
+			found = 1;
+		} else {
+			found = bsearch_graph(packed_graph, &(item->object.oid), &pos);
+		}
+
+		if (found)
+			return fill_packed_commit(item, packed_graph, pos);
+	}
+
+	return 0;
+}

Note that if item->graphId has a "real" value (not 0xFFFFFFFF, which in hindsight should be a macro) then we navigate directly to that position in the graph. Otherwise, we use binary search to query the graph's commit list to find the position (if the commit is packed).
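To make the fallback path concrete, here is a minimal sketch of how a binary search over the sorted OID list could be narrowed with the 256-entry fanout table described in the format document. It is not the patch's bsearch_graph(); lookup_oid(), HASH_LEN and the flat oid_table layout are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 20	/* assumption: binary SHA-1, i.e. H = 20 */

/*
 * Hypothetical helper, not git's bsearch_graph(): find `oid` in a sorted
 * table of fanout[255] fixed-width OIDs. fanout[i] counts the OIDs whose
 * first byte is at most i, so the candidates for first byte b occupy the
 * zero-based positions [fanout[b - 1], fanout[b]).
 * Returns 1 and sets *pos on success, 0 if the OID is absent.
 */
static int lookup_oid(const uint32_t fanout[256],
		      const unsigned char *oid_table,
		      const unsigned char *oid,
		      uint32_t *pos)
{
	uint32_t lo = oid[0] ? fanout[oid[0] - 1] : 0;
	uint32_t hi = fanout[oid[0]];

	assert(lo <= hi);
	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, oid_table + (size_t)mid * HASH_LEN,
				 HASH_LEN);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}
```

Once a position has been found this way, it can be cached on the commit (as item->graphId is above) so that later visits skip the search entirely.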
diff --git a/Documentation/technical/packed-graph.txt b/Documentation/technical/packed-graph.txt new file mode 100644 index 00..fcc0c83874 --- /dev/null +++ b/Documentation/technical/packed-graph.txt @@ -0,0 +1,185 @@ +Git Packed Graph Design Notes += + +Git walks the commit graph for many reasons, including: + +1. Listing and filtering commit history. +2. Computing merge bases. + +These operations can become slow as the commit count grows above 100K. +The merge base calculation shows up in many user-facing commands, such +as 'status' and 'fetch' and can take minutes to compute depending on +data shape. There are two main costs here: s/data shape/history shape/ may make it even clearer. +1. The commit OID. +2. The list of parents. +3. The commit date. +4. The root tree OID. +5. An integer ID for fast lookups in the graph. +6. The generation number (see definition below). + +Values 1-4 satisfy the requirements of parse_commit_gently(). + +By providing an integer ID we can avoid lookups in the graph as we walk +commits. Specifically, we need to provide the integer ID of the parent +commits so we navigate directly to their information on request. Commits created after a packed graph file is built may of course not appear in a packed graph file, but that is OK because they never need to be listed as parents of commits in the file. So "list of parents" can always refer to the parents using the "integer ID for fast lookup". One thing I need to test locally is what happens with boundary commits of a shallow clone. The commit's parents are not in the repo, so they will not be in the graph. I think that parse_commit_buffer() drops the parents, so the graph will treat them as root commits. Makes sense. Item 2. might want to say "The list of parents, using the fast lookup integer ID (see 5.) as reference instead of OID", though. That will be more specific, thanks. 
+Define the "generation number" of a commit recursively as follows: + * A commit with no parents (a root commit) has generation number 1. + * A commit with at least one parent has generation number 1 more than + the largest generation number among its parents. +Equivalently, the generation number is one more than the length of a +longest path from the commit to a root commit. When a commit A can reach roots X and Y, and Y is further than X, the distance between Y and A becomes A's generation number. "One more than the length of the path from the commit to the furthest root commit it can reach", in other words. My "Equivalently,..." sentence
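As a small worked example of the recursive definition quoted above, the following sketch computes generation numbers for a toy history. The struct toy_commit type, the NO_PARENT sentinel and the requirement that parents precede children in the array are all assumptions of the sketch, not properties of the packed graph file.

```c
#include <assert.h>
#include <stdint.h>

#define NO_PARENT UINT32_MAX	/* hypothetical sentinel for "no parent" */

/* Hypothetical in-memory commit: up to two parents, by array position. */
struct toy_commit {
	uint32_t parent[2];
	uint32_t generation;	/* 0 means "not computed" */
};

/*
 * Fill in generation numbers for `n` commits. Assumption for this sketch:
 * every parent is stored at a smaller position than its children, so a
 * single forward pass is enough.
 */
static void compute_generations(struct toy_commit *commits, uint32_t n)
{
	for (uint32_t i = 0; i < n; i++) {
		uint32_t max_parent_gen = 0;

		for (int p = 0; p < 2; p++) {
			uint32_t pos = commits[i].parent[p];

			if (pos == NO_PARENT)
				continue;
			assert(pos < i);
			if (commits[pos].generation > max_parent_gen)
				max_parent_gen = commits[pos].generation;
		}
		/* A root has no parents, so it ends up with 0 + 1 = 1. */
		commits[i].generation = max_parent_gen + 1;
	}
}
```

With roots X and Y, a two-commit chain on top of Y, and a merge A of X with the tip of that chain, A gets generation 4: one more than the longest path (three edges) back to a root, which here is Y.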
Re: [PATCH 03/14] packed-graph: create git-graph builtin
On 1/25/2018 4:45 PM, Stefan Beller wrote:
> On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee <sto...@gmail.com> wrote:
>> Teach Git the 'graph' builtin that will be used for writing and
>> reading packed graph files. The current implementation is mostly
>> empty, except for a check that the core.graph setting is enabled
>> and a '--pack-dir' option.
>
> I wonder if this builtin should not respect the boolean core.graph,
> as this new builtin command's whole existence is to deal with these
> new files?
>
> As you assume this builtin as a plumbing command, I would expect it
> to pay less attention to config rather than more.

My thought was to alert the caller "This graph isn't going to be good for anything!" and fail quickly before doing work.

You do have a good point, and I think we can remove that condition here. When we integrate with other commands ('repack', 'fetch', 'clone') we will want a different setting that signals automatically writing the graph and we don't want those to fail because they are not aware of a second config setting.

>> @@ -408,6 +408,7 @@ static struct cmd_struct commands[] = {
>>  	{ "fsck-objects", cmd_fsck, RUN_SETUP },
>>  	{ "gc", cmd_gc, RUN_SETUP },
>>  	{ "get-tar-commit-id", cmd_get_tar_commit_id },
>> +	{ "graph", cmd_graph, RUN_SETUP_GENTLY },
>
> Why gently, though?
>
> From reading the docs (and assumptions on further patches) we'd want
> to abort if there is no .git dir to be found? Or is a future patch
> having manual logic? (e.g. if pack-dir is given, the command may be
> invoked from outside a git dir)

You are right. I inherited this from my MIDX patch which can operate on a list of IDX files without a .git folder. The commit graph operations need an ODB.

Thanks,
-Stolee
Re: [PATCH 06/14] packed-graph: implement git-graph --write
On 1/25/2018 6:28 PM, Stefan Beller wrote:
> On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee <sto...@gmail.com> wrote:
>> +
>> +$ git midx --write
>
> midx?

Looks like I missed some replacements as I was building this. Now you see how I hope the feedback from this patch will inform the MIDX patch. ;)

>> +test_done
>
> The tests basically test that there is no segfault? Makes sense.

They also check that files are written based on the output hash. The next commits give inspection ability.

Thanks,
-Stolee
Re: [PATCH 02/14] packed-graph: add core.graph setting
On 1/25/2018 4:43 PM, Junio C Hamano wrote:
> Derrick Stolee <sto...@gmail.com> writes:
>
>> The packed graph feature is controlled by the new core.graph config
>> setting. This defaults to 0, so the feature is opt-in.
>>
>> The intention of core.graph is that a user can always stop checking
>> for or parsing packed graph files if core.graph=0.
>>
>> Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
>> ---
>>  Documentation/config.txt | 3 +++
>>  cache.h                  | 1 +
>>  config.c                 | 5 +++++
>>  environment.c            | 1 +
>>  4 files changed, 10 insertions(+)
>
> Before you get too married to the name "graph", is it reasonable to
> assume that the commit ancestry graph is the primary "graph" that
> should come to users' minds when a simple word "graph" is used in the
> context of discussing Git? I suspect not.
>
> Let's not just call this "core.graph" and "packed-graph", and in
> addition give some adjective before "graph".

I was so focused on keeping the word "graph" that, because "graph.c" already existed in the source root, I came up with "packed-graph.c" just to have a separate filename. Clearly, "commit-graph" should be used instead. In v2, I'll use "/commit-graph.c" and "/builtin/commit-graph.c".

Thanks,
-Stolee
Re: [PATCH 04/14] packed-graph: add format document
On 1/25/2018 5:06 PM, Junio C Hamano wrote: Derrick Stolee <sto...@gmail.com> writes: Add document specifying the binary format for packed graphs. This format allows for: * New versions. * New hash functions and hash lengths. * Optional extensions. Basic header information is followed by a binary table of contents into "chunks" that include: * An ordered list of commit object IDs. * A 256-entry fanout into that list of OIDs. * A list of metadata for the commits. * A list of "large edges" to enable octopus merges. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/graph-format.txt | 88 1 file changed, 88 insertions(+) create mode 100644 Documentation/technical/graph-format.txt diff --git a/Documentation/technical/graph-format.txt b/Documentation/technical/graph-format.txt new file mode 100644 index 00..a15e1036d7 --- /dev/null +++ b/Documentation/technical/graph-format.txt @@ -0,0 +1,88 @@ +Git commit graph format +=== Good that this is not saying "graph format" but is explicit that it is about "commit". Do the same for the previous steps. Especially, builtin/graph.c that does not have much to do with graph.c is not a good way forward ;-) :+1: I do like the fact that later parents of octopus merges are moved out of way to make the majority of records fixed length, but I am not sure if the "up to two parents are recorded in line" is truly the best arrangement. Aren't majority of commits single-parent, thereby wasting 4 bytes almost always? Will 32-bit stay to be enough for everybody? Wouldn't it make sense to at least define them to be indices into arrays (i.e. scaled to element size), not "offsets", to recover a few lost bits? I incorrectly used the word "offset" when I mean "array position" for the edge values. What's the point of storing object id length? If you do not understand the object ID scheme, knowing only the length would not do you much good anyway, no? 
And if you know the hashing scheme specified by Object ID version, you already know the length, no? I'll go read the OID transition document to learn more, but I didn't know if there were plans for things like "Use SHA3 but with different hash lengths depending on user requirements". One side benefit is that we immediately know the width of our commit and tree references within the commit graph file without needing to consult a table of hash definitions. On 1/25/2018 5:18 PM, Stefan Beller wrote: git.git has ~37k non-merge commits and ~12k merge commits, (35 of them have 3 or more parents). So 75% would waste the 4 bytes of the second parent. However the first parent is still there, so any operation that only needs the first parent (git bisect --first-parent?) would still be fast. Not sure if that is common. The current API boundary does not support this, as parse_commit_gently() is not aware of the --first-parent option. The benefits of injecting that information are probably not worth the complication. On 1/25/2018 5:29 PM, Junio C Hamano wrote: Stefan Beller <sbel...@google.com> writes: The downside of just having one parent or pointer into the edge list would be to penalize 25% of the commit lookups with an indirection compared to ~0% (the 35 octopus'). I'd rather want to optimize for speed than disk size? (4 bytes for 37k is 145kB for git.git, which I find is not a lot). My comment is not about disk size but is about the size of working set (or "size of array element"). I do want to optimize for speed over space, at least for two-parent commits. Hopefully my clarification about offset/array position clarifies Junio's concerns here. Thanks, -Stolee
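The trade-off discussed above (two parents stored inline in every fixed-width record, with octopus merges spilling into a separate list) can be sketched as follows. The excerpted format document does not say how a commit's run of entries in the Large Edge List ends, so this sketch assumes, purely for illustration, that the last entry has its most-significant bit set. TOY_PARENT_NONE, TOY_HIGH_BIT and collect_parents() are likewise hypothetical names, and edge values are treated as array positions rather than byte offsets, as clarified in the reply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PARENT_NONE 0xffffffffu	/* assumption: "no parent" sentinel */
#define TOY_HIGH_BIT    0x80000000u	/* most-significant bit of a 4-byte value */

/*
 * Copy the parent positions of one commit into `out` (capacity `cap`).
 * `p1` and `p2` are the two 4-byte parent fields of the commit's record;
 * `edges` is the Large Edge List viewed as an array of 4-byte values.
 * Returns the number of parents written.
 */
static size_t collect_parents(uint32_t p1, uint32_t p2,
			      const uint32_t *edges, size_t nr_edges,
			      uint32_t *out, size_t cap)
{
	size_t n = 0;

	if (p1 == TOY_PARENT_NONE)
		return 0;		/* root commit */
	assert(n < cap);
	out[n++] = p1;

	/* The sentinel also has its high bit set, so test it first. */
	if (p2 == TOY_PARENT_NONE)
		return n;		/* single parent */
	if (!(p2 & TOY_HIGH_BIT)) {
		assert(n < cap);
		out[n++] = p2;		/* ordinary two-parent merge */
		return n;
	}

	/* Octopus merge: the low bits are a position in the edge list. */
	for (size_t i = p2 & ~TOY_HIGH_BIT; ; i++) {
		assert(i < nr_edges && n < cap);
		out[n++] = edges[i] & ~TOY_HIGH_BIT;
		if (edges[i] & TOY_HIGH_BIT)
			break;		/* assumed end-of-list marker */
	}
	return n;
}
```

Only commits with three or more parents pay for the extra indirection, which matches the "optimize for speed over space, at least for two-parent commits" preference stated in the thread.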
Re: [PATCH 00/14] Serialized Commit Graph
On 1/25/2018 6:06 PM, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Jan 25 2018, Derrick Stolee jotted:
>
>> Oops! This is my mistake. The correct command should be:
>>
>>     git show-ref -s | git graph --write --update-head --stdin-commits
>>
>> Without "--stdin-commits" the command will walk all packed objects
>> to look for commits and then build the graph. That's why it's taking
>> so long. That method takes several minutes on the Linux repo, but
>> with --stdin-commits it should take as long as "git log >/dev/null".
>
> Thanks, it took around 15m to finish with the command I initially ran
> on my test repo.
>
> Then the `merge-base --is-ancestor` performance problem I was
> complaining about in
> https://public-inbox.org/git/87608bawoa@evledraar.gmail.com/ takes
> around 1s with your series, 5s without it. Nice.

Thanks for testing this! May I ask how many commits are in your repo? One way to find out is to run 'git graph --read' and it will tell you how many commits are in the serialized graph.

Thanks,
-Stolee
Re: [PATCH 01/14] graph: add packed graph design document
On 1/25/2018 3:04 PM, Stefan Beller wrote: On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee <sto...@gmail.com> wrote: Add Documentation/technical/packed-graph.txt with details of the planned packed graph feature, including future plans. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/packed-graph.txt | 185 +++ 1 file changed, 185 insertions(+) create mode 100644 Documentation/technical/packed-graph.txt diff --git a/Documentation/technical/packed-graph.txt b/Documentation/technical/packed-graph.txt new file mode 100644 index 00..fcc0c83874 --- /dev/null +++ b/Documentation/technical/packed-graph.txt @@ -0,0 +1,185 @@ +Git Packed Graph Design Notes += + +Git walks the commit graph for many reasons, including: + +1. Listing and filtering commit history. +2. Computing merge bases. + +These operations can become slow as the commit count grows above 100K. How did you come up with that specific number? (Is it platform dependent?) I'd avoid a specific number to not derail the reader here into wondering how this got measured. Using a specific number was a mistake. Git can walk ~100K commits per second by parsing commits, in my tests on my machine. I'll instead say "commit count grows." +The merge base calculation shows up in many user-facing commands, such +as 'status' and 'fetch' and can take minutes to compute depending on +data shape. There are two main costs here: status needs the walk for the ahead/behind computation which is (1)? (I forget how status would need to compute a merge-base) 'status' computes the ahead/behind counts using paint_down_to_common(). This is a more robust method than simply computing merge bases, but the possible merge bases are found as a result. fetch is a networked command, which traditionally in Git is understood as "can be slow" because you might be in Australia, or the connection is slow otherwise. So giving this as an example it is not obvious that the DAG walking is the bottleneck. 
Maybe git-merge or "git show --remerge-diff" [1] are better examples for walk-intensive commands? [1] https://public-inbox.org/git/cover.1409860234.git...@thomasrast.ch/ never landed, so maybe that is a bad example. But for me that command is more obviously dependent on cheap walking the DAG compared to fetch. So, take my comments with a grain of salt! Actually, a 'fetch' command does the same ahead/behind calculation as 'status', and in GVFS repos we have seen that walk take 30s per branch when comparing local and remote copies a fast-moving branch. Yes, there are other (usually) more expensive things in 'fetch' so I'll drop that reference.. +1. Decompressing and parsing commits. +2. Walking the entire graph to avoid topological order mistakes. + +The packed graph is a file that stores the commit graph structure along +with some extra metadata to speed up graph walks. This format allows a +consumer to load the following info for a commit: + +1. The commit OID. +2. The list of parents. +3. The commit date. +4. The root tree OID. +5. An integer ID for fast lookups in the graph. +6. The generation number (see definition below). + +Values 1-4 satisfy the requirements of parse_commit_gently(). This new format is specifically removing the cost of decompression and parsing (1) completely, whereas (2) we still have to walk the entire graph for now as the generation numbers are not fully used as of yet, but provided. A major goal of this work is to provide a place to store computed generation numbers so we can not walk the entire graph. I mention this here because 'git log -' is O(n) (due to commit-date heuristics that prevent walking the entire graph) while 'git log --topo-order -' is O(T) where T is the total number of reachable commits. +By providing an integer ID we can avoid lookups in the graph as we walk +commits. Specifically, we need to provide the integer ID of the parent +commits so we navigate directly to their information on request. 
Does this mean we decrease the pressure on fast lookups in packfiles/loose objects? Yes, we do. In fact, when profiling 'git log --topo-order -1000', I noticed that 30-50% of the time (after this patch) is spent in lookup_tree(). If we can prevent checking the ODB for the existence of these trees until they are needed, we can get additional speedups. It is a bit wasteful that we are loading these trees even when we will never use them (such as computing merge bases). +Define the "generation number" of a commit recursively as follows: + * A commit with no parents (a root commit) has generation number 1. + * A commit with at least one parent has generation number 1 more than + the largest generation number among its parents. +Equivalently, the generation number is one more than the length of a +longest path from the commit to a roo
[PATCH] packfile: use get_be64() for large offsets
The pack-index version 2 format uses two 4-byte integers in
network-byte order to represent one 8-byte value. The current
implementation has several code clones for stitching these integers
together.

Use get_be64() to create an 8-byte integer from two 4-byte integers
represented this way.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 pack-revindex.c | 6 ++----
 packfile.c      | 3 +--
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/pack-revindex.c b/pack-revindex.c
index 1b7ebd8d7e..ff5f62c033 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -134,10 +134,8 @@ static void create_pack_revindex(struct packed_git *p)
 			if (!(off & 0x80000000)) {
 				p->revindex[i].offset = off;
 			} else {
-				p->revindex[i].offset =
-					((uint64_t)ntohl(*off_64++)) << 32;
-				p->revindex[i].offset |=
-					ntohl(*off_64++);
+				p->revindex[i].offset = get_be64(off_64);
+				off_64 += 2;
 			}
 			p->revindex[i].nr = i;
 		}
diff --git a/packfile.c b/packfile.c
index 4a5fe7ab18..228ed0d59a 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1702,8 +1702,7 @@ off_t nth_packed_object_offset(const struct packed_git *p, uint32_t n)
 			return off;
 		index += p->num_objects * 4 + (off & 0x7fffffff) * 8;
 		check_pack_index_ptr(p, index);
-		return (((uint64_t)ntohl(*((uint32_t *)(index + 0)))) << 32) |
-				   ntohl(*((uint32_t *)(index + 4)));
+		return get_be64(index);
 	}
 }
--
2.15.0
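For readers unfamiliar with the idiom being consolidated, a byte-wise sketch of stitching two network-order 4-byte integers into one 8-byte value might look like the following. read_be32() and read_be64() are illustrative stand-ins, not git's get_be64() implementation, which may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in: read a big-endian 32-bit value byte by byte. */
static uint32_t read_be32(const unsigned char *p)
{
	assert(p);
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Illustrative stand-in: the first 4 bytes form the high half and the
 * next 4 bytes the low half of the 8-byte result.
 */
static uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}
```

Reading through an unsigned char pointer avoids both the alignment assumptions of casting to uint32_t * and any dependence on host byte order.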
Re: [PATCH] sha1_file: remove static strbuf from sha1_file_name()
On 1/16/2018 2:18 AM, Christian Couder wrote: Using a static buffer in sha1_file_name() is error prone and the performance improvements it gives are not needed in most of the callers. So let's get rid of this static buffer and, if necessary or helpful, let's use one in the caller. First: this is a good change for preventing bugs in the future. Do not let my next thought deter you from making this change. Second: I wonder if there is any perf hit now that we are allocating buffers much more often. Also, how often does get_object_directory() change, so in some cases we could cache the buffer and only append the parts for the loose object (and not reallocate because the filenames will have equal length). I'm concerned about the perf implications when inspecting many loose objects (100k+) but these code paths seem to be involved with more substantial work, such as opening and parsing the objects, so keeping a buffer in-memory is probably unnecessary. --- cache.h | 8 +++- http-walker.c | 6 -- http.c| 16 ++-- sha1_file.c | 38 +- 4 files changed, 42 insertions(+), 26 deletions(-) diff --git a/cache.h b/cache.h index d8b975a571..6db565408e 100644 --- a/cache.h +++ b/cache.h @@ -957,12 +957,10 @@ extern void check_repository_format(void); #define TYPE_CHANGED0x0040 /* - * Return the name of the file in the local object database that would - * be used to store a loose object with the specified sha1. The - * return value is a pointer to a statically allocated buffer that is - * overwritten each time the function is called. + * Put in `buf` the name of the file in the local object database that + * would be used to store a loose object with the specified sha1. */ -extern const char *sha1_file_name(const unsigned char *sha1); +extern void sha1_file_name(struct strbuf *buf, const unsigned char *sha1); /* * Return an abbreviated sha1 unique within this repository's object database. 
diff --git a/http-walker.c b/http-walker.c
index 1ae8363de2..07c2b1af82 100644
--- a/http-walker.c
+++ b/http-walker.c
@@ -544,8 +544,10 @@ static int fetch_object(struct walker *walker, unsigned char *sha1)
 	} else if (hashcmp(obj_req->sha1, req->real_sha1)) {
 		ret = error("File %s has bad hash", hex);
 	} else if (req->rename < 0) {
-		ret = error("unable to write sha1 filename %s",
-			    sha1_file_name(req->sha1));
+		struct strbuf buf = STRBUF_INIT;
+		sha1_file_name(&buf, req->sha1);
+		ret = error("unable to write sha1 filename %s", buf.buf);
+		strbuf_release(&buf);
 	}

 	release_http_object_request(req);
diff --git a/http.c b/http.c
index 5977712712..5979305bc9 100644
--- a/http.c
+++ b/http.c
@@ -2168,7 +2168,7 @@ struct http_object_request *new_http_object_request(const char *base_url,
 	unsigned char *sha1)
 {
 	char *hex = sha1_to_hex(sha1);
-	const char *filename;
+	struct strbuf filename = STRBUF_INIT;
 	char prevfile[PATH_MAX];
 	int prevlocal;
 	char prev_buf[PREV_BUF_SIZE];
@@ -2180,14 +2180,15 @@ struct http_object_request *new_http_object_request(const char *base_url,
 	hashcpy(freq->sha1, sha1);
 	freq->localfile = -1;

-	filename = sha1_file_name(sha1);
+	sha1_file_name(&filename, sha1);
 	snprintf(freq->tmpfile, sizeof(freq->tmpfile),
-		 "%s.temp", filename);
+		 "%s.temp", filename.buf);

-	snprintf(prevfile, sizeof(prevfile), "%s.prev", filename);
+	snprintf(prevfile, sizeof(prevfile), "%s.prev", filename.buf);
 	unlink_or_warn(prevfile);
 	rename(freq->tmpfile, prevfile);
 	unlink_or_warn(freq->tmpfile);
+	strbuf_release(&filename);

 	if (freq->localfile != -1)
 		error("fd leakage in start: %d", freq->localfile);
@@ -2302,6 +2303,7 @@ void process_http_object_request(struct http_object_request *freq)
 int finish_http_object_request(struct http_object_request *freq)
 {
 	struct stat st;
+	struct strbuf filename = STRBUF_INIT;

 	close(freq->localfile);
 	freq->localfile = -1;
@@ -2327,8 +2329,10 @@ int finish_http_object_request(struct http_object_request *freq)
 		unlink_or_warn(freq->tmpfile);
 		return -1;
 	}
-	freq->rename =
-		finalize_object_file(freq->tmpfile, sha1_file_name(freq->sha1));
+
+	sha1_file_name(&filename, freq->sha1);
+	freq->rename = finalize_object_file(freq->tmpfile, filename.buf);
+	strbuf_release(&filename);

 	return freq->rename;
 }
diff --git a/sha1_file.c b/sha1_file.c
index 3da70ac650..f66c21b2da 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -321,15 +321,11 @@ static void fill_sha1_path(struct strbuf *buf, const unsigned char *sha1)
 	}
 }

-const char *sha1_file_name(const unsigned char *sha1)
+void sha1_file_name(struct
Re: [PATCH] describe: use strbuf_add_unique_abbrev() for adding short hashes
On 1/15/2018 12:10 PM, René Scharfe wrote:
> Call strbuf_add_unique_abbrev() to add an abbreviated hash to a strbuf
> instead of taking a detour through find_unique_abbrev() and its static
> buffer. This is shorter and a bit more efficient.
>
> Patch generated by Coccinelle (and contrib/coccinelle/strbuf.cocci).
>
> Signed-off-by: Rene Scharfe
> ---
> The changed line was added by 4dbc59a4cc (builtin/describe.c: factor
> out describe_commit). "make coccicheck" doesn't propose any other
> changes for current master.
>
>  builtin/describe.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/builtin/describe.c b/builtin/describe.c
> index 3b0b204b1e..21e37f5dae 100644
> --- a/builtin/describe.c
> +++ b/builtin/describe.c
> @@ -380,7 +380,7 @@ static void describe_commit(struct object_id *oid, struct strbuf *dst)
>  	if (!match_cnt) {
>  		struct object_id *cmit_oid = &cmit->object.oid;
>  		if (always) {
> -			strbuf_addstr(dst, find_unique_abbrev(cmit_oid->hash, abbrev));
> +			strbuf_add_unique_abbrev(dst, cmit_oid->hash, abbrev);
>  			if (suffix)
>  				strbuf_addstr(dst, suffix);
>  			return;

René,

Thanks for this cleanup! I just learned about strbuf_add_unique_abbrev() and like that it uses the reentrant find_unique_abbrev_r() instead.

Looks good to me.

-Stolee
Re: [PATCH 02/14] packed-graph: add core.graph setting
On 1/25/2018 3:17 PM, Stefan Beller wrote:
> On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee <sto...@gmail.com> wrote:
>> The packed graph feature is controlled by the new core.graph config
>> setting. This defaults to 0, so the feature is opt-in.
>>
>> The intention of core.graph is that a user can always stop checking
>> for or parsing packed graph files if core.graph=0.
>>
>> @@ -825,6 +825,7 @@ extern char *git_replace_ref_base;
>>  extern int fsync_object_files;
>>  extern int core_preload_index;
>>  extern int core_apply_sparse_checkout;
>> +extern int core_graph;
>
> Putting it here instead of say the_repository makes sense as you'd
> want to use this feature globally. However you can still have the
> config different per repository (e.g. version number of the graph
> setting, as one might be optimized for speed and the other for file
> size of the .graph file or such).
>
> So not sure if we'd rather want to put this into the repository
> struct. But then again the other core settings aren't there either
> and this feature sounds like it is repository specific only in the
> experimental phase; later it is expected to be on everywhere?

I do think that more things should go in the repository struct. Unfortunately, that is not the world we live in.

However, to make things clearer I'm following the pattern currently in master. You'll see the same with the global 'packed_graph' pointer, similar to 'packed_git'. I think these should be paired together until the repository absorbs them. If other 'core_*' variables move to the repository, I'm happy to move core_graph. If 'packed_git' moves to the repository, I'm happy to move 'packed_graph'.

However, if there is significant interest in moving all new state to the repository, then I'll move these values there. Let's have that discussion here instead of spread around the rest of the patch.

Thanks,
-Stolee
Re: [PATCH 00/14] Serialized Commit Graph
On 1/25/2018 10:46 AM, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Jan 25 2018, Derrick Stolee jotted:
>
>> * 'git log --topo-order -1000' walks all reachable commits to avoid
>>   incorrect topological orders, but only needs the commit message for
>>   the top 1000 commits.
>>
>> * 'git merge-base <A> <B>' may walk many commits to find the correct
>>   boundary between the commits reachable from A and those reachable
>>   from B. No commit messages are needed.
>>
>> * 'git branch -vv' checks ahead/behind status for all local branches
>>   compared to their upstream remote branches. This is essentially as
>>   hard as computing merge bases for each.
>
> This is great, spotted / questions so far:
>
> * git graph --blah says you need to enable the config, should say
>   "unknown option --blah". I.e. overzealous config guard.

This is a good point.

> * On a big repo (git show-ref -s | ~/g/git/git-graph --write
>   --update-head) is as of writing this still hanging for me, but
>   strace shows it's brk()-ing. Presumably just still busy, a progress
>   bar would be very nice.

Oops! This is my mistake. The correct command should be:

    git show-ref -s | git graph --write --update-head --stdin-commits

Without "--stdin-commits" the command will walk all packed objects to look for commits and then build the graph. That's why it's taking so long. That method takes several minutes on the Linux repo, but with --stdin-commits it should take as long as "git log >/dev/null".

> * Shouldn't there be a pack.useGraph option so this gets auto-updated
>   on repack? I understand this series is a WIP, so that's more a "is
>   that the UI" than "it needs now".

This will definitely be part of a follow-up patch.

Thanks,
-Stolee
[PATCH 00/14] Serialized Commit Graph
As promised [1], this patch contains a way to serialize the commit graph. The current implementation defines a new file format to store the graph structure (parent relationships) and basic commit metadata (commit date, root tree OID) in order to prevent parsing raw commits while performing basic graph walks. For example, we do not need to parse the full commit when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base <A> <B>' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.

The current patch speeds up these calculations by injecting a check in parse_commit_gently() to check if there is a graph file and using that to provide the required metadata to the struct commit.

The file format has room to store generation numbers, which will be provided as a patch after this framework is merged. Generation numbers are referenced by the design document but not implemented in order to make the current patch focus on the graph construction process. Once that is stable, it will be easier to add generation numbers and make graph walks aware of generation numbers one-by-one.

Here are some performance results for a copy of the Linux repository where 'master' has 704,766 reachable commits and is behind 'origin/master' by 19,610 commits.

| Command                          | Before | After | Rel % |
|----------------------------------|--------|-------|-------|
| log --oneline --topo-order -1000 |   5.9s |  0.7s | -88%  |
| branch -vv                       |  0.42s | 0.27s | -35%  |
| rev-list --all                   |   6.4s |  1.0s | -84%  |
| rev-list --all --objects         |  32.6s | 27.6s | -15%  |

To test this yourself, run the following on your repo:

    git config core.graph true
    git show-ref -s | git graph --write --update-head

The second command writes a commit graph file containing every commit reachable from your refs. Now, all git commands that walk commits will check your graph first before consulting the ODB. You can run your own performance comparisons by toggling the 'core.graph' setting.

[1] https://public-inbox.org/git/d154319e-bb9e-b300-7c37-27b1dcd2a...@jeffhostetler.com/
    Re: What's cooking in git.git (Jan 2018, #03; Tue, 23)

[2] https://github.com/derrickstolee/git/pull/2
    A GitHub pull request containing the latest version of this patch.

P.S. I'm sending this patch from my gmail address to avoid Outlook munging the URLs included in the design document.
Derrick Stolee (14):
  graph: add packed graph design document
  packed-graph: add core.graph setting
  packed-graph: create git-graph builtin
  packed-graph: add format document
  packed-graph: implement construct_graph()
  packed-graph: implement git-graph --write
  packed-graph: implement git-graph --read
  graph: implement git-graph --update-head
  packed-graph: implement git-graph --clear
  packed-graph: teach git-graph --delete-expired
  commit: integrate packed graph with commit parsing
  packed-graph: read only from specific pack-indexes
  packed-graph: close under reachability
  packed-graph: teach git-graph to read commits

 Documentation/config.txt                 |   3 +
 Documentation/git-graph.txt              | 102
 Documentation/technical/graph-format.txt |  88
 Documentation/technical/packed-graph.txt | 185 +++
 Makefile                                 |   2 +
 alloc.c                                  |   1 +
 builtin.h                                |   1 +
 builtin/graph.c                          | 231 +
 cache.h                                  |   1 +
 command-list.txt                         |   1 +
 commit.c                                 |  20 +-
 commit.h                                 |   2 +
 config.c                                 |   5 +
 environment.c                            |   1 +
 git.c                                    |   1 +
 log-tree.c                               |   3 +-
 packed-graph.c                           | 840 +++
 packed-graph.h                           |  65 +++
 packfile.c                               |   4 +-
 packfile.h                               |   2 +
 t/t5319-graph.sh                         | 271 ++
 21 files changed, 1822 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/git-graph.txt
 create mode 100644 Documentation/technical/graph-format.txt
 create mode 100644 Documentation/technical/packed-graph.txt
 create mode 100644 builtin/graph.c
 create mode 100644 packed-graph.c
 create mode 100644 packed-graph.h
 create mode 100755 t/t5319-graph.sh
[PATCH 04/14] packed-graph: add format document
Add document specifying the binary format for packed graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents
into "chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/graph-format.txt | 88
 1 file changed, 88 insertions(+)
 create mode 100644 Documentation/technical/graph-format.txt

diff --git a/Documentation/technical/graph-format.txt b/Documentation/technical/graph-format.txt
new file mode 100644
index 00..a15e1036d7
--- /dev/null
+++ b/Documentation/technical/graph-format.txt
@@ -0,0 +1,88 @@
+Git commit graph format
+===
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+  generation number 1; commits with parents have generation number
+  one more than the maximum generation number of their parents. We
+  reserve zero as special; it can be used to mark a generation
+  number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+  the graph file.
+
+== graph-*.graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks,
+hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+  4-byte signature:
+      The signature is: {'C', 'G', 'P', 'H'}
+
+  1-byte version number:
+      Currently, the only valid version is 1.
+
+  1-byte Object Id Version (1 = SHA-1)
+
+  1-byte Object Id Length (H)
+
+  1-byte number (C) of "chunks"
+
+CHUNK LOOKUP:
+
+  (C + 1) * 12 bytes listing the table of contents for the chunks:
+      First 4 bytes describe chunk id. Value 0 is a terminating label.
+      Other 8 bytes provide offset in current file for chunk to start.
+      (Chunks are ordered contiguously in the file, so you can infer
+      the length using the next chunk position if necessary.)
+
+  The remaining data in the body is described one chunk at a time, and
+  these chunks may be given in any order. Chunks are required unless
+  otherwise specified.
+
+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+      The ith entry, F[i], stores the number of OIDs with first
+      byte at most i. Thus F[255] stores the total
+      number of commits (N).
+
+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+      The OIDs for all commits in the graph.
+
+  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+    * The first H bytes are for the OID of the root tree.
+    * The next 8 bytes are for the int-ids of the first two parents of
+      the ith commit. Stores value 0x if no parent in that position.
+      If there are more than two parents, the second value has its most-
+      significant bit on and the other bits store an offset into the Large
+      Edge List chunk.
+    * The next 8 bytes store the generation number of the commit and the
+      commit time in seconds since EPOCH. The generation number uses the
+      higher 30 bits of the first 4 bytes, while the commit time uses the
+      32 bits of the second 4 bytes, along with the lowest 2 bits of the
+      lowest byte, storing the 33rd and 34th bit of the commit time.
+
+  [Optional] Large Edge List (ID: {'E', 'D', 'G', 'E'})
+      This list of 4-byte values stores the second through nth parents for
+      all octopus merges. The second parent value in the commit data is a
+      negative number pointing into this list. Then iterate through this
+      list starting at that position until reaching a value with the most-
+      significant bit on. The other bits correspond to the int-id of the
+      last parent.
+
+TRAILER:
+
+  H-byte HASH-checksum of all of the above.
-- 
2.16.0
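Two parts of the format above are easy to misread when consuming a graph file: bounding an OID search with the fanout table, and unpacking the generation number and commit time from the last 8 bytes of a Commit Data row. Here is a minimal sketch in C, not code from this series. It assumes H = 20, and it assumes the reading that bits 33 and 34 of the commit time sit in the lowest 2 bits of the first 4-byte word, below the 30-bit generation number. The names get_be32, lookup_oid, decode_gen_date and SKETCH_OID_LEN are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_OID_LEN 20 /* H for SHA-1; hypothetical name */

/* Read a 4-byte value stored in network (big-endian) order. */
static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Look up `want` in the OID Lookup chunk, using the OID Fanout chunk
 * to bound the binary search. Both chunks are passed as raw file bytes.
 * Returns 1 and sets *pos on a hit, 0 otherwise.
 */
static int lookup_oid(const unsigned char *fanout, const unsigned char *oids,
		      const unsigned char *want, uint32_t *pos)
{
	/* F[b-1] OIDs sort before every OID whose first byte is b. */
	uint32_t first = want[0] ? get_be32(fanout + 4 * (want[0] - 1)) : 0;
	uint32_t last = get_be32(fanout + 4 * want[0]);

	assert(first <= last); /* fanout entries must be non-decreasing */

	while (first < last) {
		uint32_t mid = first + (last - first) / 2;
		int cmp = memcmp(want, oids + (size_t)mid * SKETCH_OID_LEN,
				 SKETCH_OID_LEN);
		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp > 0)
			first = mid + 1;
		else
			last = mid;
	}
	return 0;
}

/*
 * Unpack the last 8 bytes of a Commit Data row.
 * Word 0: generation in the high 30 bits, date bits 33-34 in the low 2.
 * Word 1: the low 32 bits of the commit date.
 */
static void decode_gen_date(const unsigned char *in,
			    uint32_t *generation, uint64_t *date)
{
	uint32_t hi = get_be32(in);
	uint32_t lo = get_be32(in + 4);

	*generation = hi >> 2;
	*date = ((uint64_t)(hi & 0x3) << 32) | lo;
}
```

The position returned by lookup_oid() is the same kind of int-id that the Commit Data rows use for parent references, so a reader can go from an OID to a row in one bounded search and follow parents by index from there.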
[PATCH 02/14] packed-graph: add core.graph setting
The packed graph feature is controlled by the new core.graph config
setting. This defaults to 0, so the feature is opt-in.

The intention of core.graph is that a user can always stop checking
for or parsing packed graph files if core.graph=0.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/config.txt | 3 +++
 cache.h                  | 1 +
 config.c                 | 5 +
 environment.c            | 1 +
 4 files changed, 10 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..e7b98fa14f 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -898,6 +898,9 @@ core.notesRef::
 	This setting defaults to "refs/notes/commits", and it can be overridden by
 	the `GIT_NOTES_REF` environment variable. See linkgit:git-notes[1].
 
+core.graph::
+	Enable git commit graph feature. Allows writing and reading from .graph files.
+
 core.sparseCheckout::
 	Enable "sparse checkout" feature. See section "Sparse checkout" in
 	linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index d8b975a571..655a81ac90 100644
--- a/cache.h
+++ b/cache.h
@@ -825,6 +825,7 @@ extern char *git_replace_ref_base;
 extern int fsync_object_files;
 extern int core_preload_index;
 extern int core_apply_sparse_checkout;
+extern int core_graph;
 extern int precomposed_unicode;
 extern int protect_hfs;
 extern int protect_ntfs;
diff --git a/config.c b/config.c
index e617c2018d..fee90912d8 100644
--- a/config.c
+++ b/config.c
@@ -1223,6 +1223,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.graph")) {
+		core_graph = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.sparsecheckout")) {
 		core_apply_sparse_checkout = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 63ac38a46f..0c56a3d869 100644
--- a/environment.c
+++ b/environment.c
@@ -61,6 +61,7 @@ enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_apply_sparse_checkout;
+int core_graph;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
 unsigned long pack_size_limit_cfg;
-- 
2.16.0
[PATCH 01/14] graph: add packed graph design document
Add Documentation/technical/packed-graph.txt with details of the planned
packed graph feature, including future plans.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/packed-graph.txt | 185 +++
 1 file changed, 185 insertions(+)
 create mode 100644 Documentation/technical/packed-graph.txt

diff --git a/Documentation/technical/packed-graph.txt b/Documentation/technical/packed-graph.txt
new file mode 100644
index 00..fcc0c83874
--- /dev/null
+++ b/Documentation/technical/packed-graph.txt
@@ -0,0 +1,185 @@
+Git Packed Graph Design Notes
+=
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows above 100K.
+The merge base calculation shows up in many user-facing commands, such
+as 'status' and 'fetch' and can take minutes to compute depending on
+data shape. There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to avoid topological order mistakes.
+
+The packed graph is a file that stores the commit graph structure along
+with some extra metadata to speed up graph walks. This format allows a
+consumer to load the following info for a commit:
+
+1. The commit OID.
+2. The list of parents.
+3. The commit date.
+4. The root tree OID.
+5. An integer ID for fast lookups in the graph.
+6. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+By providing an integer ID we can avoid lookups in the graph as we walk
+commits. Specifically, we need to provide the integer ID of the parent
+commits so we navigate directly to their information on request.
+
+Define the "generation number" of a commit recursively as follows:
+ * A commit with no parents (a root commit) has generation number 1.
+ * A commit with at least one parent has generation number 1 more than
+   the largest generation number among its parents.
+Equivalently, the generation number is one more than the length of a
+longest path from the commit to a root commit. The recursive definition
+is easier to use for computation and the following property:
+
+If A and B are commits with generation numbers N and M, respectively,
+and N <= M, then A cannot reach B. That is, we know without searching
+that B is not an ancestor of A because it is further from a root commit
+than A.
+
+Conversely, when checking if A is an ancestor of B, then we only need
+to walk commits until all commits on the walk boundary have generation
+number at most N. If we walk commits using a priority queue seeded by
+generation numbers, then we always expand the boundary commit with highest
+generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+If A and B are commits with commit time X and Y, respectively, and
+X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation can make
+mistakes with topological orders (such as "git log" with default order),
+but is not used when the topological order is required (such as merge
+base calculations, "git log --graph").
+
+Design Details
+--
+
+- A graph file is stored in a file named 'graph-.graph' in the pack
+  directory. This could be stored in an alternate.
+
+- The most-recent graph file OID is stored in a 'graph-head' file for
+  immediate access and storing backup graphs. This could be stored in an
+  alternate, and refers to a 'graph-.graph' file in the same pack
+  directory.
+
+- The core.graph config setting must be on to create or consume graph files.
+
+- The graph file is only a supplemental structure. If a user downgrades
+  or disables the 'core.graph' config setting, then the existing ODB is
+  sufficient.
+
+- The file format includes parameters for the object id length
+  and hash algorithm, so a future change of hash algorithm does
+  not require a change in format.
+
+Current Limitations
+---
+
+- Only one graph file is used at one time. This allows the integer ID to
+  seek into the single graph file. It is possible to extend the model
+  for multiple graph files, but that is currently not part of the design.
+
+- .graph files are managed only by the 'graph' builtin. These are not
+  updated automatically during clone or fetch.
+
+- There is no '--verify' option for the 'graph' builtin to verify the
+  contents of the graph file.
+
+- The graph only considers commits existing in packfiles and does not
+  walk to fill in reachable commits. [Small]
+
+- When rewriting the graph, we do not check for a commit still existing
+  in
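To illustrate the generation number definition and the pruning property from the design notes, here is a toy sketch in C. It is not code from this series: struct toy_commit, compute_generation() and can_reach() are hypothetical names, commits are plain array indices, and the walk is a simple depth-first search rather than the priority queue the notes describe. It has no visited set, so it is only suitable for tiny graphs.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PARENTS 2
#define NO_PARENT (-1)

struct toy_commit {
	int parents[MAX_PARENTS]; /* indices into the array, or NO_PARENT */
	uint32_t generation;      /* 0 means "not computed" */
};

/* Generation is 1 for a root, else 1 + the largest parent generation. */
static uint32_t compute_generation(struct toy_commit *c, int idx)
{
	uint32_t max_parent = 0;
	int i;

	assert(idx >= 0);
	if (c[idx].generation)
		return c[idx].generation; /* already memoized */

	for (i = 0; i < MAX_PARENTS; i++) {
		int p = c[idx].parents[i];
		if (p != NO_PARENT) {
			uint32_t g = compute_generation(c, p);
			if (g > max_parent)
				max_parent = g;
		}
	}
	c[idx].generation = max_parent + 1;
	return c[idx].generation;
}

/*
 * Return 1 if `from` can reach `to` by following parent links.
 * A commit whose generation is <= that of `to` cannot reach it
 * (unless it is `to` itself), so that branch of the walk is cut.
 */
static int can_reach(struct toy_commit *c, int from, int to)
{
	int i;

	if (from == to)
		return 1;
	if (compute_generation(c, from) <= compute_generation(c, to))
		return 0;

	for (i = 0; i < MAX_PARENTS; i++) {
		int p = c[from].parents[i];
		if (p != NO_PARENT && can_reach(c, p, to))
			return 1;
	}
	return 0;
}
```

With a history where commit 4 merges a three-commit line (0, 1, 2) and a side commit 3 whose parent is 0, can_reach(c, 3, 2) returns 0 after comparing two integers, because generation 2 is not greater than generation 3.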
[PATCH 03/14] packed-graph: create git-graph builtin
Teach Git the 'graph' builtin that will be used for writing and reading packed graph files. The current implementation is mostly empty, except for a check that the core.graph setting is enabled and a '--pack-dir' option. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-graph.txt | 7 +++ Makefile| 1 + builtin.h | 1 + builtin/graph.c | 36 command-list.txt| 1 + git.c | 1 + 6 files changed, 47 insertions(+) create mode 100644 Documentation/git-graph.txt create mode 100644 builtin/graph.c diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt new file mode 100644 index 00..de5a3c07e6 --- /dev/null +++ b/Documentation/git-graph.txt @@ -0,0 +1,7 @@ +git-graph(1) + + +NAME + +git-graph - Write and verify Git commit graphs (.graph files) + diff --git a/Makefile b/Makefile index 1a9b23b679..d8b0d0457a 100644 --- a/Makefile +++ b/Makefile @@ -965,6 +965,7 @@ BUILTIN_OBJS += builtin/for-each-ref.o BUILTIN_OBJS += builtin/fsck.o BUILTIN_OBJS += builtin/gc.o BUILTIN_OBJS += builtin/get-tar-commit-id.o +BUILTIN_OBJS += builtin/graph.o BUILTIN_OBJS += builtin/grep.o BUILTIN_OBJS += builtin/hash-object.o BUILTIN_OBJS += builtin/help.o diff --git a/builtin.h b/builtin.h index 42378f3aa4..ae7e816908 100644 --- a/builtin.h +++ b/builtin.h @@ -168,6 +168,7 @@ extern int cmd_format_patch(int argc, const char **argv, const char *prefix); extern int cmd_fsck(int argc, const char **argv, const char *prefix); extern int cmd_gc(int argc, const char **argv, const char *prefix); extern int cmd_get_tar_commit_id(int argc, const char **argv, const char *prefix); +extern int cmd_graph(int argc, const char **argv, const char *prefix); extern int cmd_grep(int argc, const char **argv, const char *prefix); extern int cmd_hash_object(int argc, const char **argv, const char *prefix); extern int cmd_help(int argc, const char **argv, const char *prefix); diff --git a/builtin/graph.c b/builtin/graph.c new file mode 100644 index 00..a902dc8646 --- /dev/null +++ 
b/builtin/graph.c @@ -0,0 +1,36 @@ +#include "builtin.h" +#include "cache.h" +#include "config.h" +#include "dir.h" +#include "git-compat-util.h" +#include "lockfile.h" +#include "packfile.h" +#include "parse-options.h" + +static char const * const builtin_graph_usage[] ={ + N_("git graph [--pack-dir ]"), + NULL +}; + +static struct opts_graph { + const char *pack_dir; +} opts; + +int cmd_graph(int argc, const char **argv, const char *prefix) +{ + static struct option builtin_graph_options[] = { + { OPTION_STRING, 'p', "pack-dir", _dir, + N_("dir"), + N_("The pack directory to store the graph") }, + OPT_END(), + }; + + if (!core_graph) + die("core.graph is false"); + + if (argc == 2 && !strcmp(argv[1], "-h")) + usage_with_options(builtin_graph_usage, builtin_graph_options); + + return 0; +} + diff --git a/command-list.txt b/command-list.txt index a1fad28fd8..d9c17cb9f8 100644 --- a/command-list.txt +++ b/command-list.txt @@ -61,6 +61,7 @@ git-format-patchmainporcelain git-fsckancillaryinterrogators git-gc mainporcelain git-get-tar-commit-id ancillaryinterrogators +git-graph plumbingmanipulators git-grepmainporcelain info git-gui mainporcelain git-hash-object plumbingmanipulators diff --git a/git.c b/git.c index c870b9719c..29f8b6e7dd 100644 --- a/git.c +++ b/git.c @@ -408,6 +408,7 @@ static struct cmd_struct commands[] = { { "fsck-objects", cmd_fsck, RUN_SETUP }, { "gc", cmd_gc, RUN_SETUP }, { "get-tar-commit-id", cmd_get_tar_commit_id }, + { "graph", cmd_graph, RUN_SETUP_GENTLY }, { "grep", cmd_grep, RUN_SETUP_GENTLY }, { "hash-object", cmd_hash_object }, { "help", cmd_help }, -- 2.16.0
[PATCH 09/14] packed-graph: implement git-graph --clear
Teach Git to delete the current 'graph_head' file and the packed graph it references. This is a good safety valve if somehow the file is corrupted and needs to be recalculated. Since the packed graph is a summary of contents already in the ODB, it can be regenerated. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-graph.txt | 16 ++-- builtin/graph.c | 31 ++- t/t5319-graph.sh| 7 ++- 3 files changed, 50 insertions(+), 4 deletions(-) diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt index ac20aa67a9..f690699570 100644 --- a/Documentation/git-graph.txt +++ b/Documentation/git-graph.txt @@ -11,6 +11,7 @@ SYNOPSIS [verse] 'git graph' --write [--pack-dir ] 'git graph' --read [--pack-dir ] +'git graph' --clear [--pack-dir ] OPTIONS --- @@ -18,16 +19,21 @@ OPTIONS Use given directory for the location of packfiles, graph-head, and graph files. +--clear:: + Delete the graph-head file and the graph file it references. + (Cannot be combined with --read or --write.) + --read:: Read a graph file given by the graph-head file and output basic - details about the graph file. (Cannot be combined with --write.) + details about the graph file. (Cannot be combined with --clear + or --write.) --graph-id:: When used with --read, consider the graph file graph-.graph. --write:: Write a new graph file to the pack directory. (Cannot be combined - with --read.) + with --clear or --read.) --update-head:: When used with --write, update the graph-head file to point to @@ -61,6 +67,12 @@ $ git graph --write --update-head $ git graph --read --graph-id= +* Delete /graph-head and the file it references. 
++ + +$ git graph --clear --pack-dir= + + CONFIGURATION - diff --git a/builtin/graph.c b/builtin/graph.c index 0760d99f43..ac15febc46 100644 --- a/builtin/graph.c +++ b/builtin/graph.c @@ -10,6 +10,7 @@ static char const * const builtin_graph_usage[] ={ N_("git graph [--pack-dir ]"), + N_("git graph --clear [--pack-dir ]"), N_("git graph --read [--graph-id=]"), N_("git graph --write [--pack-dir ] [--update-head]"), NULL @@ -17,6 +18,7 @@ static char const * const builtin_graph_usage[] ={ static struct opts_graph { const char *pack_dir; + int clear; int read; const char *graph_id; int write; @@ -25,6 +27,29 @@ static struct opts_graph { struct object_id old_graph_oid; } opts; +static int graph_clear(void) +{ + struct strbuf head_path = STRBUF_INIT; + char *old_path; + + if (!opts.has_existing) + return 0; + + strbuf_addstr(_path, opts.pack_dir); + strbuf_addstr(_path, "/"); + strbuf_addstr(_path, "graph-head"); + if (remove_path(head_path.buf)) + die("failed to remove path %s", head_path.buf); + strbuf_release(_path); + + old_path = get_graph_filename_oid(opts.pack_dir, _graph_oid); + if (remove_path(old_path)) + die("failed to remove path %s", old_path); + free(old_path); + + return 0; +} + static int graph_read(void) { struct object_id graph_oid; @@ -105,6 +130,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix) { OPTION_STRING, 'p', "pack-dir", _dir, N_("dir"), N_("The pack directory to store the graph") }, + OPT_BOOL('c', "clear", , + N_("clear graph file and graph-head")), OPT_BOOL('r', "read", , N_("read graph file")), OPT_BOOL('w', "write", , @@ -129,7 +156,7 @@ int cmd_graph(int argc, const char **argv, const char *prefix) builtin_graph_options, builtin_graph_usage, 0); - if (opts.write + opts.read > 1) + if (opts.write + opts.read + opts.clear > 1) usage_with_options(builtin_graph_usage, builtin_graph_options); if (!opts.pack_dir) { @@ -141,6 +168,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix) opts.has_existing = 
!!get_graph_head_oid(opts.pack_dir, _graph_oid); + if (opts.clear) + return graph_clear(); if (opts.read) return graph_read(); if (opts.write) diff --git a/t/t5319-graph.sh b/t/t5319-graph.sh index 3919a3ad73..311fb9dd67 100755 --- a/t/t5319-graph.sh +++ b/t/t5319-graph.sh @@ -80,6 +80,11 @@ t
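Since graph-head is a small pointer file naming the current graph file, replacing it safely matters as much as deleting it. The sketch below shows the general write-to-lock-then-rename pattern in plain POSIX C. It is an illustration only, not Git's lockfile API and not code from this series; write_head_atomically() is a hypothetical name, and the "<path>.lock" naming is chosen here to resemble what Git's lockfile code does.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write `hex_oid` to `path` so that readers see either the old
 * contents or the new contents, never a partial file.
 * Returns 0 on success, -1 on failure.
 */
static int write_head_atomically(const char *path, const char *hex_oid)
{
	char lock_path[4096];
	size_t len;
	int n, fd;

	assert(path && hex_oid);
	len = strlen(hex_oid);

	n = snprintf(lock_path, sizeof(lock_path), "%s.lock", path);
	if (n < 0 || (size_t)n >= sizeof(lock_path))
		return -1;

	/* O_EXCL makes creation fail if another writer holds the lock. */
	fd = open(lock_path, O_WRONLY | O_CREAT | O_EXCL, 0666);
	if (fd < 0)
		return -1;

	/* A short write is treated as failure; fine for a tiny payload. */
	if (write(fd, hex_oid, len) != (ssize_t)len) {
		close(fd);
		unlink(lock_path);
		return -1;
	}

	/* rename() replaces the target in one step on POSIX filesystems. */
	if (close(fd) < 0 || rename(lock_path, path) < 0) {
		unlink(lock_path);
		return -1;
	}
	return 0;
}
```

A second writer that races with this one fails at open() instead of interleaving bytes into the same file, which is the property a pointer file like graph-head needs.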
[PATCH 08/14] graph: implement git-graph --update-head
It is possible to have multiple packed graph files in a pack directory, but only one is important at a time. Use a 'graph_head' file to point to the important file. Teach git-graph to write 'graph_head' upon writing a new packed graph file. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-graph.txt | 38 -- builtin/graph.c | 38 +++--- packed-graph.c | 25 + packed-graph.h | 1 + t/t5319-graph.sh| 12 ++-- 5 files changed, 107 insertions(+), 7 deletions(-) diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt index 0939c3f1be..ac20aa67a9 100644 --- a/Documentation/git-graph.txt +++ b/Documentation/git-graph.txt @@ -12,19 +12,53 @@ SYNOPSIS 'git graph' --write [--pack-dir ] 'git graph' --read [--pack-dir ] +OPTIONS +--- +--pack-dir:: + Use given directory for the location of packfiles, graph-head, + and graph files. + +--read:: + Read a graph file given by the graph-head file and output basic + details about the graph file. (Cannot be combined with --write.) + +--graph-id:: + When used with --read, consider the graph file graph-.graph. + +--write:: + Write a new graph file to the pack directory. (Cannot be combined + with --read.) + +--update-head:: + When used with --write, update the graph-head file to point to + the written graph file. + EXAMPLES +* Output the OID of the graph file pointed to by /graph-head. ++ + +$ git graph --pack-dir= + + * Write a graph file for the packed commits in your local .git folder. + -$ git midx --write +$ git graph --write + + +* Write a graph file for the packed commits in your local .git folder, +* and update graph-head. ++ + +$ git graph --write --update-head * Read basic information from a graph file. 
+ -$ git midx --read --graph-id= +$ git graph --read --graph-id= CONFIGURATION diff --git a/builtin/graph.c b/builtin/graph.c index bc66722924..0760d99f43 100644 --- a/builtin/graph.c +++ b/builtin/graph.c @@ -11,7 +11,7 @@ static char const * const builtin_graph_usage[] ={ N_("git graph [--pack-dir ]"), N_("git graph --read [--graph-id=]"), - N_("git graph --write [--pack-dir ]"), + N_("git graph --write [--pack-dir ] [--update-head]"), NULL }; @@ -20,6 +20,9 @@ static struct opts_graph { int read; const char *graph_id; int write; + int update_head; + int has_existing; + struct object_id old_graph_oid; } opts; static int graph_read(void) @@ -30,8 +33,8 @@ static int graph_read(void) if (opts.graph_id && strlen(opts.graph_id) == GIT_MAX_HEXSZ) get_oid_hex(opts.graph_id, _oid); - else - die("no graph id specified"); + else if (!get_graph_head_oid(opts.pack_dir, _oid)) + die("no graph-head exists."); graph_file = get_graph_filename_oid(opts.pack_dir, _oid); graph = load_packed_graph_one(graph_file, opts.pack_dir); @@ -62,10 +65,33 @@ static int graph_read(void) return 0; } +static void update_head_file(const char *pack_dir, const struct object_id *graph_id) +{ + struct strbuf head_path = STRBUF_INIT; + int fd; + struct lock_file lk = LOCK_INIT; + + strbuf_addstr(_path, pack_dir); + strbuf_addstr(_path, "/"); + strbuf_addstr(_path, "graph-head"); + + fd = hold_lock_file_for_update(, head_path.buf, LOCK_DIE_ON_ERROR); + strbuf_release(_path); + + if (fd < 0) + die_errno("unable to open graph-head"); + + write_in_full(fd, oid_to_hex(graph_id), GIT_MAX_HEXSZ); + commit_lock_file(); +} + static int graph_write(void) { struct object_id *graph_id = construct_graph(opts.pack_dir); + if (opts.update_head) + update_head_file(opts.pack_dir, graph_id); + if (graph_id) printf("%s\n", oid_to_hex(graph_id)); @@ -83,6 +109,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix) N_("read graph file")), OPT_BOOL('w', "write", , N_("write graph file")), + OPT_BOOL('u', 
"update-head", _head, + N_("update graph-head to written graph file")),
[PATCH 06/14] packed-graph: implement git-graph --write
Teach git-graph to write graph files. Create new test script to verify this command succeeds without failure. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-graph.txt | 26 ++ builtin/graph.c | 37 ++-- t/t5319-graph.sh| 83 + 3 files changed, 143 insertions(+), 3 deletions(-) create mode 100755 t/t5319-graph.sh diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt index de5a3c07e6..be6bc38814 100644 --- a/Documentation/git-graph.txt +++ b/Documentation/git-graph.txt @@ -5,3 +5,29 @@ NAME git-graph - Write and verify Git commit graphs (.graph files) + +SYNOPSIS + +[verse] +'git graph' --write [--pack-dir ] + +EXAMPLES + + +* Write a graph file for the packed commits in your local .git folder. ++ + +$ git midx --write + + +CONFIGURATION +- + +core.graph:: + The graph command will fail if core.graph is false. + Also, the written graph files will be ignored by other commands + unless core.graph is true. + +GIT +--- +Part of the linkgit:git[1] suite \ No newline at end of file diff --git a/builtin/graph.c b/builtin/graph.c index a902dc8646..09f5552338 100644 --- a/builtin/graph.c +++ b/builtin/graph.c @@ -6,31 +6,62 @@ #include "lockfile.h" #include "packfile.h" #include "parse-options.h" +#include "packed-graph.h" static char const * const builtin_graph_usage[] ={ N_("git graph [--pack-dir ]"), + N_("git graph --write [--pack-dir ]"), NULL }; static struct opts_graph { const char *pack_dir; + int write; } opts; +static int graph_write(void) +{ + struct object_id *graph_id = construct_graph(opts.pack_dir); + + if (graph_id) + printf("%s\n", oid_to_hex(graph_id)); + + free(graph_id); + return 0; +} + int cmd_graph(int argc, const char **argv, const char *prefix) { static struct option builtin_graph_options[] = { { OPTION_STRING, 'p', "pack-dir", _dir, N_("dir"), N_("The pack directory to store the graph") }, + OPT_BOOL('w', "write", , + N_("write graph file")), OPT_END(), }; - if (!core_graph) - die("core.graph is false"); - if 
(argc == 2 && !strcmp(argv[1], "-h")) usage_with_options(builtin_graph_usage, builtin_graph_options); + git_config(git_default_config, NULL); + if (!core_graph) + die("git-graph requires core.graph=true."); + + argc = parse_options(argc, argv, prefix, +builtin_graph_options, +builtin_graph_usage, 0); + + if (!opts.pack_dir) { + struct strbuf path = STRBUF_INIT; + strbuf_addstr(, get_object_directory()); + strbuf_addstr(, "/pack"); + opts.pack_dir = strbuf_detach(, NULL); + } + + if (opts.write) + return graph_write(); + return 0; } diff --git a/t/t5319-graph.sh b/t/t5319-graph.sh new file mode 100755 index 00..52e979dfd3 --- /dev/null +++ b/t/t5319-graph.sh @@ -0,0 +1,83 @@ +#!/bin/sh + +test_description='packed graph' +. ./test-lib.sh + +test_expect_success \ +'setup full repo' \ +'rm -rf .git && + mkdir full && + cd full && + git init && + git config core.graph true && + git config pack.threads 1 && + packdir=".git/objects/pack"' + +test_expect_success \ +'write graph with no packs' \ +'git graph --write --pack-dir .' + +test_expect_success \ +'create commits and repack' \ +'for i in $(test_seq 5) + do +echo $i >$i.txt && +git add $i.txt && +git commit -m "commit $i" && +git branch commits/$i + done && + git repack' + +test_expect_success \ +'write graph' \ +'graph1=$(git graph --write) && + test_path_is_file ${packdir}/graph-${graph1}.graph' + +test_expect_success \ +'Add more commits' \ +'git reset --hard commits/3 && + for i in $(test_seq 6 10) + do +echo $i >$i.txt && +git add $i.txt && +git commit -m "commit $i" && +git branch commits/$i + done && + git reset --hard commits/7 && + for i in $(test_seq 11 15) + do +echo $i >$i.txt && +git add $i.txt && +git commit -m "commit $i&qu
[PATCH 14/14] packed-graph: teach git-graph to read commits
Teach git-graph to read commits from stdin when the --stdin-commits flag is specified. Commits reachable from these commits are added to the graph. This is a much faster way to construct the graph than inspecting all packed objects, but is restricted to known tips. For the Linux repository, 700,000+ commits were added to the graph file starting from 'master' in 7-9 seconds, depending on the number of packfiles in the repo (1, 24, or 120). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- builtin/graph.c | 33 + packed-graph.c | 18 +++--- packed-graph.h | 3 ++- t/t5319-graph.sh | 18 ++ 4 files changed, 60 insertions(+), 12 deletions(-) diff --git a/builtin/graph.c b/builtin/graph.c index 3cace3a18c..708889677b 100644 --- a/builtin/graph.c +++ b/builtin/graph.c @@ -12,7 +12,7 @@ static char const * const builtin_graph_usage[] ={ N_("git graph [--pack-dir ]"), N_("git graph --clear [--pack-dir ]"), N_("git graph --read [--graph-id=]"), - N_("git graph --write [--pack-dir ] [--update-head] [--delete-expired] [--stdin-packs]"), + N_("git graph --write [--pack-dir ] [--update-head] [--delete-expired] [--stdin-packs|--stdin-commits]"), NULL }; @@ -25,6 +25,7 @@ static struct opts_graph { int update_head; int delete_expired; int stdin_packs; + int stdin_commits; int has_existing; struct object_id old_graph_oid; } opts; @@ -116,22 +117,36 @@ static int graph_write(void) { struct object_id *graph_id; char **pack_indexes = NULL; + char **commits = NULL; int num_packs = 0; - int size_packs = 0; + int num_commits = 0; + char **lines = NULL; + int num_lines = 0; + int size_lines = 0; - if (opts.stdin_packs) { + if (opts.stdin_packs || opts.stdin_commits) { struct strbuf buf = STRBUF_INIT; - size_packs = 128; - ALLOC_ARRAY(pack_indexes, size_packs); + size_lines = 128; + ALLOC_ARRAY(lines, size_lines); while (strbuf_getline(, stdin) != EOF) { - ALLOC_GROW(pack_indexes, num_packs + 1, size_packs); - pack_indexes[num_packs++] = buf.buf; + ALLOC_GROW(lines, num_lines + 1, 
size_lines); + lines[num_lines++] = buf.buf; strbuf_detach(, NULL); } + + if (opts.stdin_packs) { + pack_indexes = lines; + num_packs = num_lines; + } + if (opts.stdin_commits) { + commits = lines; + num_commits = num_lines; + } } - graph_id = construct_graph(opts.pack_dir, pack_indexes, num_packs); + graph_id = construct_graph(opts.pack_dir, pack_indexes, num_packs, + commits, num_commits); if (opts.update_head) update_head_file(opts.pack_dir, graph_id); @@ -170,6 +185,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix) N_("delete expired head graph file")), OPT_BOOL('s', "stdin-packs", _packs, N_("only scan packfiles listed by stdin")), + OPT_BOOL('C', "stdin-commits", _commits, + N_("start walk at commits listed by stdin")), { OPTION_STRING, 'G', "graph-id", _id, N_("oid"), N_("An OID for a specific graph file in the pack-dir."), diff --git a/packed-graph.c b/packed-graph.c index c93515f18e..94e1a97000 100644 --- a/packed-graph.c +++ b/packed-graph.c @@ -662,7 +662,8 @@ static void close_reachable(struct packed_oid_list *oids) } } -struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, int nr_packs) +struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, int nr_packs, + char **commit_hex, int nr_commits) { // Find a list of oids, adding the pointer to a list. struct packed_oid_list oids; @@ -719,10 +720,21 @@ struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, int for_each_object_in_pack(p, if_packed_commit_add_to_list, ); close_pack(p); } - } else { - for_each_packed_object(if_packed_commit_add_to_list, , 0); } + if (commit_hex) { + for (i = 0; i < nr_commits; i++) { + const char *end; + ALLOC_GROW(oids.list, oids.num + 1, oids.size); +
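The idea behind --stdin-commits, starting from known tips and adding everything reachable from them, can be sketched on a toy graph. This is a hedged illustration in C, not the series' implementation: struct toy_commit and close_under_reachability() are hypothetical names, and commits are array indices rather than object IDs.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_PARENTS 2
#define NO_PARENT (-1)

struct toy_commit {
	int parents[MAX_PARENTS]; /* indices into the array, or NO_PARENT */
};

/*
 * Mark every commit reachable from the given tips in `seen`, which
 * must hold nr_commits flags (zero means "not yet in the graph").
 * Returns the number of commits newly marked, or -1 if out of memory.
 */
static int close_under_reachability(const struct toy_commit *commits,
				    int nr_commits,
				    const int *tips, int nr_tips,
				    unsigned char *seen)
{
	int *stack;
	int top = 0, count = 0, i;

	/* Each commit is pushed at most once, so nr_commits slots suffice. */
	stack = malloc(sizeof(*stack) * (size_t)(nr_commits > 0 ? nr_commits : 1));
	if (!stack)
		return -1;

	for (i = 0; i < nr_tips; i++) {
		assert(tips[i] >= 0 && tips[i] < nr_commits);
		if (!seen[tips[i]]) {
			seen[tips[i]] = 1;
			stack[top++] = tips[i];
		}
	}

	while (top > 0) {
		int cur = stack[--top];

		count++;
		for (i = 0; i < MAX_PARENTS; i++) {
			int p = commits[cur].parents[i];
			if (p != NO_PARENT && !seen[p]) {
				seen[p] = 1;
				stack[top++] = p;
			}
		}
	}

	free(stack);
	return count;
}
```

Because the walk only touches commits reachable from the tips, its cost scales with the size of that history rather than with the number of packed objects, which matches the motivation given in the commit message.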
[PATCH 05/14] packed-graph: implement construct_graph()
Teach Git to write a packed graph file by checking all packed objects to see if they are commits, then store the file in the given pack directory. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Makefile | 1 + packed-graph.c | 375 + packed-graph.h | 20 +++ 3 files changed, 396 insertions(+) create mode 100644 packed-graph.c create mode 100644 packed-graph.h diff --git a/Makefile b/Makefile index d8b0d0457a..59439e13a1 100644 --- a/Makefile +++ b/Makefile @@ -841,6 +841,7 @@ LIB_OBJS += notes-utils.o LIB_OBJS += object.o LIB_OBJS += oidmap.o LIB_OBJS += oidset.o +LIB_OBJS += packed-graph.o LIB_OBJS += packfile.o LIB_OBJS += pack-bitmap.o LIB_OBJS += pack-bitmap-write.o diff --git a/packed-graph.c b/packed-graph.c new file mode 100644 index 00..9be9811667 --- /dev/null +++ b/packed-graph.c @@ -0,0 +1,375 @@ +#include "cache.h" +#include "config.h" +#include "git-compat-util.h" +#include "pack.h" +#include "packfile.h" +#include "commit.h" +#include "object.h" +#include "packed-graph.h" + +#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */ +#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ +#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */ +#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */ +#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */ + +#define GRAPH_DATA_WIDTH 36 + +#define GRAPH_VERSION_1 0x1 +#define GRAPH_VERSION GRAPH_VERSION_1 + +#define GRAPH_OID_VERSION_SHA1 1 +#define GRAPH_OID_LEN_SHA1 20 +#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1 +#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1 + +#define GRAPH_LARGE_EDGES_NEEDED 0x8000 +#define GRAPH_PARENT_MISSING 0x7fff +#define GRAPH_EDGE_LAST_MASK 0x7fff +#define GRAPH_PARENT_NONE 0x7000 + +#define GRAPH_LAST_EDGE 0x8000 + +char* get_graph_filename_oid(const char *pack_dir, + struct object_id *oid) +{ + size_t len; + struct strbuf head_path = STRBUF_INIT; + strbuf_addstr(_path, pack_dir); + strbuf_addstr(_path, "/graph-"); + strbuf_addstr(_path, oid_to_hex(oid)); + strbuf_addstr(_path, 
".graph"); + + return strbuf_detach(_path, ); +} + +static void write_graph_chunk_fanout( + struct sha1file *f, + struct commit **commits, int nr_commits) +{ + uint32_t i, count = 0; + struct commit **list = commits; + struct commit **last = commits + nr_commits; + + /* + * Write the first-level table (the list is sorted, + * but we use a 256-entry lookup to be able to avoid + * having to do eight extra binary search iterations). + */ + for (i = 0; i < 256; i++) { + uint32_t swap_count; + + while (list < last) { + if ((*list)->object.oid.hash[0] != i) + break; + count++; + list++; + } + + swap_count = htonl(count); + sha1write(f, _count, 4); + } +} + +static void write_graph_chunk_oids( + struct sha1file *f, int hash_len, + struct commit **commits, int nr_commits) +{ + struct commit **list = commits; + uint32_t i; + for (i = 0; i < nr_commits; i++) { + sha1write(f, (*list)->object.oid.hash, (int)hash_len); + list++; + } +} + +static int commit_pos(struct commit **commits, int nr_commits, const struct object_id *oid, uint32_t *pos) +{ + uint32_t first = 0, last = nr_commits; + + while (first < last) { + uint32_t mid = first + (last - first) / 2; + struct object_id *current; + int cmp; + + current = &(commits[mid]->object.oid); + cmp = oidcmp(oid, current); + if (!cmp) { + *pos = mid; + return 1; + } + if (cmp > 0) { + first = mid + 1; + continue; + } + last = mid; + } + + *pos = first; + return 0; +} + +static void write_graph_chunk_data( + struct sha1file *f, int hash_len, + struct commit **commits, int nr_commits) +{ + struct commit **list = commits; + struct commit **last = commits + nr_commits; + uint32_t num_large_edges = 0; + + while (list < last) { + struct commit_list *parent; + uint32_t intId, swapIntId; + uint32_t packedDate[2]; + + parse_commit(*list); + sha1write(f, (*list)->tree->object.oid.hash, hash_len); + + parent = (*list)->parents; + + if (!parent) + sw
[PATCH 07/14] packed-graph: implement git-graph --read
Teach git-graph to read packed graph files and summarize their contents. Use the --read option to verify the contents of a graph file in the graph tests. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-graph.txt | 7 +++ builtin/graph.c | 54 packed-graph.c | 147 +++- packed-graph.h | 25 t/t5319-graph.sh| 50 +-- 5 files changed, 260 insertions(+), 23 deletions(-) diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt index be6bc38814..0939c3f1be 100644 --- a/Documentation/git-graph.txt +++ b/Documentation/git-graph.txt @@ -10,6 +10,7 @@ SYNOPSIS [verse] 'git graph' --write [--pack-dir ] +'git graph' --read [--pack-dir ] EXAMPLES @@ -20,6 +21,12 @@ EXAMPLES $ git midx --write +* Read basic information from a graph file. ++ + +$ git midx --read --graph-id= + + CONFIGURATION - diff --git a/builtin/graph.c b/builtin/graph.c index 09f5552338..bc66722924 100644 --- a/builtin/graph.c +++ b/builtin/graph.c @@ -10,15 +10,58 @@ static char const * const builtin_graph_usage[] ={ N_("git graph [--pack-dir ]"), + N_("git graph --read [--graph-id=]"), N_("git graph --write [--pack-dir ]"), NULL }; static struct opts_graph { const char *pack_dir; + int read; + const char *graph_id; int write; } opts; +static int graph_read(void) +{ + struct object_id graph_oid; + struct packed_graph *graph = 0; + const char *graph_file; + + if (opts.graph_id && strlen(opts.graph_id) == GIT_MAX_HEXSZ) + get_oid_hex(opts.graph_id, _oid); + else + die("no graph id specified"); + + graph_file = get_graph_filename_oid(opts.pack_dir, _oid); + graph = load_packed_graph_one(graph_file, opts.pack_dir); + + if (!graph) + die("graph file %s does not exist.\n", graph_file); + + printf("header: %08x %02x %02x %02x %02x\n", + ntohl(graph->hdr->graph_signature), + graph->hdr->graph_version, + graph->hdr->hash_version, + graph->hdr->hash_len, + graph->hdr->num_chunks); + printf("num_commits: %u\n", graph->num_commits); + printf("chunks:"); + + if 
(graph->chunk_oid_fanout) + printf(" oid_fanout"); + if (graph->chunk_oid_lookup) + printf(" oid_lookup"); + if (graph->chunk_commit_data) + printf(" commit_metadata"); + if (graph->chunk_large_edges) + printf(" large_edges"); + printf("\n"); + + printf("pack_dir: %s\n", graph->pack_dir); + return 0; +} + static int graph_write(void) { struct object_id *graph_id = construct_graph(opts.pack_dir); @@ -36,8 +79,14 @@ int cmd_graph(int argc, const char **argv, const char *prefix) { OPTION_STRING, 'p', "pack-dir", _dir, N_("dir"), N_("The pack directory to store the graph") }, + OPT_BOOL('r', "read", , + N_("read graph file")), OPT_BOOL('w', "write", , N_("write graph file")), + { OPTION_STRING, 'M', "graph-id", _id, + N_("oid"), + N_("An OID for a specific graph file in the pack-dir."), + PARSE_OPT_OPTARG, NULL, (intptr_t) "" }, OPT_END(), }; @@ -52,6 +101,9 @@ int cmd_graph(int argc, const char **argv, const char *prefix) builtin_graph_options, builtin_graph_usage, 0); + if (opts.write + opts.read > 1) + usage_with_options(builtin_graph_usage, builtin_graph_options); + if (!opts.pack_dir) { struct strbuf path = STRBUF_INIT; strbuf_addstr(, get_object_directory()); @@ -59,6 +111,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix) opts.pack_dir = strbuf_detach(, NULL); } + if (opts.read) + return graph_read(); if (opts.write) return graph_write(); diff --git a/packed-graph.c b/packed-graph.c index 9be9811667..eaa656becb 100644 --- a/packed-graph.c +++ b/packed-graph.c @@ -30,6 +30,11 @@ #define GRAPH_LAST_EDGE 0x8000 +#define GRAP
[PATCH 12/14] packed-graph: read only from specific pack-indexes
Teach git-graph to inspect the objects only in a certain list of pack-indexes within the given pack directory. This allows updating the graph iteratively, since we add all commits stored in a previous packed graph. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-graph.txt | 12 builtin/graph.c | 26 +++--- packed-graph.c | 27 +++ packed-graph.h | 2 +- packfile.c | 4 ++-- packfile.h | 2 ++ t/t5319-graph.sh| 10 ++ 7 files changed, 69 insertions(+), 14 deletions(-) diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt index f4f1048d28..b68a61ddea 100644 --- a/Documentation/git-graph.txt +++ b/Documentation/git-graph.txt @@ -43,6 +43,11 @@ OPTIONS When used with --write and --update-head, delete the graph file previously referenced by graph-head. +--stdin-packs:: + When used with --write, generate the new graph by walking objects + only in the specified packfiles and any commits in the + existing graph-head. + EXAMPLES @@ -65,6 +70,13 @@ $ git graph --write $ git graph --write --update-head --delete-expired +* Write a graph file, extending the current graph file using commits +* in , update graph-head, and delete the old graph-.graph file. ++ + +$ echo | git graph --write --update-head --delete-expired --stdin-packs + + * Read basic information from a graph file. 
+ diff --git a/builtin/graph.c b/builtin/graph.c index adf779b601..3cace3a18c 100644 --- a/builtin/graph.c +++ b/builtin/graph.c @@ -12,7 +12,7 @@ static char const * const builtin_graph_usage[] ={ N_("git graph [--pack-dir ]"), N_("git graph --clear [--pack-dir ]"), N_("git graph --read [--graph-id=]"), - N_("git graph --write [--pack-dir ] [--update-head] [--delete-expired]"), + N_("git graph --write [--pack-dir ] [--update-head] [--delete-expired] [--stdin-packs]"), NULL }; @@ -24,6 +24,7 @@ static struct opts_graph { int write; int update_head; int delete_expired; + int stdin_packs; int has_existing; struct object_id old_graph_oid; } opts; @@ -113,7 +114,24 @@ static void update_head_file(const char *pack_dir, const struct object_id *graph static int graph_write(void) { - struct object_id *graph_id = construct_graph(opts.pack_dir); + struct object_id *graph_id; + char **pack_indexes = NULL; + int num_packs = 0; + int size_packs = 0; + + if (opts.stdin_packs) { + struct strbuf buf = STRBUF_INIT; + size_packs = 128; + ALLOC_ARRAY(pack_indexes, size_packs); + + while (strbuf_getline(, stdin) != EOF) { + ALLOC_GROW(pack_indexes, num_packs + 1, size_packs); + pack_indexes[num_packs++] = buf.buf; + strbuf_detach(, NULL); + } + } + + graph_id = construct_graph(opts.pack_dir, pack_indexes, num_packs); if (opts.update_head) update_head_file(opts.pack_dir, graph_id); @@ -150,7 +168,9 @@ int cmd_graph(int argc, const char **argv, const char *prefix) N_("update graph-head to written graph file")), OPT_BOOL('d', "delete-expired", _expired, N_("delete expired head graph file")), - { OPTION_STRING, 'M', "graph-id", _id, + OPT_BOOL('s', "stdin-packs", _packs, + N_("only scan packfiles listed by stdin")), + { OPTION_STRING, 'G', "graph-id", _id, N_("oid"), N_("An OID for a specific graph file in the pack-dir."), PARSE_OPT_OPTARG, NULL, (intptr_t) "" }, diff --git a/packed-graph.c b/packed-graph.c index 343b231973..0dc68a077e 100644 --- a/packed-graph.c +++ b/packed-graph.c @@ 
-401,7 +401,7 @@ static int fill_packed_commit(struct commit *item, struct packed_graph *g, uint3 * 2. date * 3. parents. * - * Returns 1 iff the commit was found in the packed graph. + * Returns 1 if and only if the commit was found in the packed graph. * * See parse_commit_buffer() for the fallback after this call. */ @@ -427,7 +427,7 @@ int parse_packed_commit(struct commit *item) return fill_packed_commit(item, packed_graph, pos); } - return parse_commit_internal(item, 0, 0); + return 0; } static void write_graph_chunk
[PATCH 13/14] packed-graph: close under reachability
Teach construct_graph() to walk all parents from the commits discovered in
packfiles. This prevents gaps given by loose objects or previously-missed
packfiles.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 packed-graph.c   | 26 ++
 t/t5319-graph.sh | 14 ++
 2 files changed, 40 insertions(+)

diff --git a/packed-graph.c b/packed-graph.c
index 0dc68a077e..c93515f18e 100644
--- a/packed-graph.c
+++ b/packed-graph.c
@@ -5,6 +5,7 @@
 #include "packfile.h"
 #include "commit.h"
 #include "object.h"
+#include "revision.h"
 #include "packed-graph.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
@@ -638,6 +639,29 @@ static int if_packed_commit_add_to_list(const struct object_id *oid,
 	return 0;
 }
 
+static void close_reachable(struct packed_oid_list *oids)
+{
+	int i;
+	struct rev_info revs;
+	struct commit *commit;
+	init_revisions(&revs, NULL);
+
+	for (i = 0; i < oids->num; i++) {
+		commit = lookup_commit(oids->list[i]);
+		if (commit && !parse_commit(commit))
+			revs.commits = commit_list_insert(commit, &revs.commits);
+	}
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+
+	while ((commit = get_revision(&revs)) != NULL) {
+		ALLOC_GROW(oids->list, oids->num + 1, oids->size);
+		oids->list[oids->num] = &(commit->object.oid);
+		(oids->num)++;
+	}
+}
+
 struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, int nr_packs)
 {
 	// Find a list of oids, adding the pointer to a list.
@@ -698,6 +722,8 @@ struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, int
 	} else {
 		for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
 	}
+
+	close_reachable(&oids);
 	QSORT(oids.list, oids.num, commit_compare);
 
 	count_distinct = 1;
diff --git a/t/t5319-graph.sh b/t/t5319-graph.sh
index 969150cd21..8bf5a0c993 100755
--- a/t/t5319-graph.sh
+++ b/t/t5319-graph.sh
@@ -212,6 +212,20 @@ test_expect_success 'clear graph' \
 _graph_git_behavior commits/20 merge/1
 _graph_git_behavior commits/20 merge/2
 
+test_expect_success 'build graph from latest pack with closure' \
+    'graph5=$(cat new-idx | git graph --write --update-head --stdin-packs) &&
+     test_path_is_file ${packdir}/graph-${graph5}.graph &&
+     test_path_is_file ${packdir}/graph-${graph1}.graph &&
+     test_path_is_file ${packdir}/graph-head &&
+     echo ${graph5} >expect &&
+     cmp -n 40 expect ${packdir}/graph-head &&
+     git graph --read --graph-id=${graph5} >output &&
+     _graph_read_expect "21" "${packdir}" &&
+     cmp expect output'
+
+_graph_git_behavior commits/20 merge/1
+_graph_git_behavior commits/20 merge/2
+
 test_expect_success 'setup bare repo' \
     'cd .. &&
      git clone --bare full bare &&
-- 
2.16.0
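For readers who want to see the closure idea without the revision-walk machinery, here is a minimal standalone sketch. It is not the code from this patch: the patch appends every walked commit and relies on the later sort and distinct count to drop duplicates, while this sketch marks each commit at most once. The names toy_commit and close_under_parents, and the use of array indices in place of OIDs, are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARENTS 2

/* Hypothetical stand-in for a commit: parents are indices, -1 = none. */
struct toy_commit {
	int parents[MAX_PARENTS];
};

/*
 * Extend in_set[] so that it contains every commit reachable through
 * parent links from the commits already marked. stack[] must have room
 * for n entries; each commit is pushed at most once.
 * Returns the number of commits in the closed set.
 */
size_t close_under_parents(const struct toy_commit *table, size_t n,
			   unsigned char *in_set, size_t *stack)
{
	size_t top = 0, count = 0, i;
	int p;

	/* seed the worklist with the commits found in the packfiles */
	for (i = 0; i < n; i++) {
		if (in_set[i]) {
			stack[top++] = i;
			count++;
		}
	}

	while (top) {
		size_t cur = stack[--top];

		for (p = 0; p < MAX_PARENTS; p++) {
			int parent = table[cur].parents[p];

			if (parent < 0)
				continue;
			assert((size_t)parent < n);
			if (!in_set[parent]) {
				in_set[parent] = 1;
				stack[top++] = (size_t)parent;
				count++;
			}
		}
	}
	return count;
}
```

Starting from a single merge commit, the set grows to include both parent chains down to the root, while an unrelated root commit stays out of the set.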
[PATCH 11/14] commit: integrate packed graph with commit parsing
Teach Git to inspect a packed graph to supply the contents of a struct commit when calling parse_commit_gently(). This implementation satisfies all post-conditions on the struct commit, including loading parents, the root tree, and the commit date. The only loosely-expected condition is that the commit buffer is loaded into the cache. This was checked in log-tree.c:show_log(), but the "return;" on failure produced unexpected results (i.e. the message line was never terminated). The new behavior of loading the buffer when needed prevents the unexpected behavior. If core.graph is false, then do not load the graph and behave as usual. In test script t5319-graph.sh, add output-matching conditions on read-only graph operations. By loading commits from the graph instead of parsing commit buffers, we save a lot of time on long commit walks. Here are some performance results for a copy of the Linux repository where 'master' has 704,766 reachable commits and is behind 'origin/master' by 19,610 commits.
| Command                          | Before | After | Rel % |
|----------------------------------|--------|-------|-------|
| log --oneline --topo-order -1000 |   5.9s |  0.7s | -88%  |
| branch -vv                       |  0.42s | 0.27s | -35%  |
| rev-list --all                   |   6.4s |  1.0s | -84%  |
| rev-list --all --objects         |  32.6s | 27.6s | -15%  |

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
--- alloc.c | 1 + commit.c | 20 - commit.h | 2 + log-tree.c | 3 +- packed-graph.c | 242 +++ packed-graph.h | 18 + t/t5319-graph.sh | 114 -- 7 files changed, 387 insertions(+), 13 deletions(-) diff --git a/alloc.c b/alloc.c index 12afadfacd..4a4dcfa2b7 100644 --- a/alloc.c +++ b/alloc.c @@ -93,6 +93,7 @@ void *alloc_commit_node(void) struct commit *c = alloc_node(_state, sizeof(struct commit)); c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); + c->graphId = 0x; return c; } diff --git a/commit.c b/commit.c index cab8d4455b..253c102808 100644 --- a/commit.c +++ b/commit.c @@ -12,6 +12,7 @@ #include "prio-queue.h" #include "sha1-lookup.h" #include "wt-status.h" +#include "packed-graph.h" static struct commit_extra_header *read_commit_extra_header_lines(const char *buf, size_t len, const char **); @@ -374,7 +375,7 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s return 0; } -int parse_commit_gently(struct commit *item, int quiet_on_missing) +int parse_commit_internal(struct commit *item, int quiet_on_missing, int check_packed) { enum object_type type; void *buffer; @@ -383,19 +384,27 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) if (!item) return -1; + + // If we already parsed, but got it from the graph, then keep going! if (item->object.parsed) return 0; + + if (check_packed && parse_packed_commit(item)) + return 0; + buffer = read_sha1_file(item->object.oid.hash, , ); if (!buffer) return quiet_on_missing ? 
-1 : error("Could not read %s", -oid_to_hex(>object.oid)); + oid_to_hex(>object.oid)); if (type != OBJ_COMMIT) { free(buffer); return error("Object %s not a commit", -oid_to_hex(>object.oid)); + oid_to_hex(>object.oid)); } + ret = parse_commit_buffer(item, buffer, size); + if (save_commit_buffer && !ret) { set_commit_buffer(item, buffer, size); return 0; @@ -404,6 +413,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return ret; } +int parse_commit_gently(struct commit *item, int quiet_on_missing) +{ + return parse_commit_internal(item, quiet_on_missing, 1); +} + void parse_commit_or_die(struct commit *item) { if (parse_commit(item)) diff --git a/commit.h b/commit.h index 8c68ca1a5a..02f5f2a182 100644 --- a/commit.h +++ b/commit.h @@ -21,6 +21,7 @@ struct commit { timestamp_t date; struct commit_list *parents; struct tree *tree; + uint32_t graphId; }; extern int save_commit_buffer; @@ -60,6 +61,7 @@ struct commit *lookup_commit_reference_by_name(const char *name); struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name); int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size); +extern int parse_commit_internal(struct commit *item, int quiet_on_missing, int check_packed); int parse_commit_gently(struct com
[PATCH 10/14] packed-graph: teach git-graph --delete-expired
Teach git-graph to delete the graph previously referenced by 'graph_head' when writing a new graph file and updating 'graph_head'. This keeps stale graph files from accumulating in the pack directory. Be careful not to delete the graph if the file did not change.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-graph.txt | 8 ++-- builtin/graph.c | 14 +- t/t5319-graph.sh| 37 +++-- 3 files changed, 54 insertions(+), 5 deletions(-) diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt index f690699570..f4f1048d28 100644 --- a/Documentation/git-graph.txt +++ b/Documentation/git-graph.txt @@ -39,6 +39,10 @@ OPTIONS When used with --write, update the graph-head file to point to the written graph file. +--delete-expired:: + When used with --write and --update-head, delete the graph file + previously referenced by graph-head. + EXAMPLES @@ -55,10 +59,10 @@ $ git graph --write * Write a graph file for the packed commits in your local .git folder, -* and update graph-head. +* update graph-head, and delete the old graph-.graph file. + -$ git graph --write --update-head +$ git graph --write --update-head --delete-expired * Read basic information from a graph file. 
diff --git a/builtin/graph.c b/builtin/graph.c index ac15febc46..adf779b601 100644 --- a/builtin/graph.c +++ b/builtin/graph.c @@ -12,7 +12,7 @@ static char const * const builtin_graph_usage[] ={ N_("git graph [--pack-dir ]"), N_("git graph --clear [--pack-dir ]"), N_("git graph --read [--graph-id=]"), - N_("git graph --write [--pack-dir ] [--update-head]"), + N_("git graph --write [--pack-dir ] [--update-head] [--delete-expired]"), NULL }; @@ -23,6 +23,7 @@ static struct opts_graph { const char *graph_id; int write; int update_head; + int delete_expired; int has_existing; struct object_id old_graph_oid; } opts; @@ -120,6 +121,15 @@ static int graph_write(void) if (graph_id) printf("%s\n", oid_to_hex(graph_id)); + if (opts.delete_expired && opts.update_head && opts.has_existing && + oidcmp(graph_id, _graph_oid)) { + char *old_path = get_graph_filename_oid(opts.pack_dir, _graph_oid); + if (remove_path(old_path)) + die("failed to remove path %s", old_path); + + free(old_path); + } + free(graph_id); return 0; } @@ -138,6 +148,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix) N_("write graph file")), OPT_BOOL('u', "update-head", _head, N_("update graph-head to written graph file")), + OPT_BOOL('d', "delete-expired", _expired, + N_("delete expired head graph file")), { OPTION_STRING, 'M', "graph-id", _id, N_("oid"), N_("An OID for a specific graph file in the pack-dir."), diff --git a/t/t5319-graph.sh b/t/t5319-graph.sh index 311fb9dd67..a70c7bbb02 100755 --- a/t/t5319-graph.sh +++ b/t/t5319-graph.sh @@ -80,9 +80,42 @@ test_expect_success 'write graph with merges' \ _graph_read_expect "18" "${packdir}" && cmp expect output' +test_expect_success 'Add more commits' \ +'git reset --hard commits/3 && + for i in $(test_seq 16 20) + do +git commit --allow-empty -m "commit $i" && +git branch commits/$i + done && + git repack' + +test_expect_success 'write graph with merges' \ +'graph3=$(git graph --write --update-head --delete-expired) && + test_path_is_file 
${packdir}/graph-${graph3}.graph && + test_path_is_missing ${packdir}/graph-${graph2}.graph && + test_path_is_file ${packdir}/graph-${graph1}.graph && + test_path_is_file ${packdir}/graph-head && + echo ${graph3} >expect && + cmp -n 40 expect ${packdir}/graph-head && + git graph --read --graph-id=${graph3} >output && + _graph_read_expect "23" "${packdir}" && + cmp expect output' + +test_expect_success 'write graph with nothing new' \ +'graph4=$(git graph --write --update-head --delete-expired) && + test_path_is_file ${packdir}/graph-${graph4}.graph && + test_path_is_file ${packdir}/graph-${graph1}.graph && + test_path_
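The condition guarding the deletion in graph_write() can be isolated as a small predicate. The sketch below is an illustration rather than code from the series: should_delete_old_graph and TOY_HASH_LEN are hypothetical names, and raw 20-byte arrays stand in for struct object_id. It shows why the "nothing new" test above expects the graph file to survive: when the newly written id equals the old one, nothing is removed.

```c
#include <assert.h>
#include <string.h>

#define TOY_HASH_LEN 20 /* assumed SHA-1 length, as in GRAPH_OID_LEN_SHA1 */

/*
 * Return 1 only when all three options/state flags are set AND the new
 * graph id differs from the one graph-head pointed at before the write.
 */
int should_delete_old_graph(int delete_expired, int update_head,
			    int has_existing,
			    const unsigned char *old_id,
			    const unsigned char *new_id)
{
	if (!delete_expired || !update_head || !has_existing)
		return 0;
	assert(old_id && new_id);
	/* identical ids mean the file did not change: keep it */
	return memcmp(old_id, new_id, TOY_HASH_LEN) != 0;
}
```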
[PATCH v2 01/14] commit-graph: add format document
Add document specifying the binary format for commit graphs. This format
allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents into
"chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.

The format automatically includes two parent positions for every commit.
This favors speed over space, since using only one position per commit
would cause an extra level of indirection for every merge commit. (Octopus
merges suffer from this indirection, but they are very rare.)

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph-format.txt | 89 +
 1 file changed, 89 insertions(+)
 create mode 100644 Documentation/technical/commit-graph-format.txt

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
new file mode 100644
index 00..8a987c7aa9
--- /dev/null
+++ b/Documentation/technical/commit-graph-format.txt
@@ -0,0 +1,89 @@
+Git commit graph format
+===
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+  generation number 1; commits with parents have generation number
+  one more than the maximum generation number of its parents. We
+  reserve zero as special, and can be used to mark a generation
+  number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+  the graph file.
+
+== graph-*.graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks,
+hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+  4-byte signature:
+      The signature is: {'C', 'G', 'P', 'H'}
+
+  1-byte version number:
+      Currently, the only valid version is 1.
+
+  1-byte Object Id Version (1 = SHA-1)
+
+  1-byte Object Id Length (H)
+
+  1-byte number (C) of "chunks"
+
+CHUNK LOOKUP:
+
+  (C + 1) * 12 bytes listing the table of contents for the chunks:
+      First 4 bytes describe chunk id. Value 0 is a terminating label.
+      Other 8 bytes provide offset in current file for chunk to start.
+      (Chunks are ordered contiguously in the file, so you can infer
+      the length using the next chunk position if necessary.)
+
+  The remaining data in the body is described one chunk at a time, and
+  these chunks may be given in any order. Chunks are required unless
+  otherwise specified.
+
+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+      The ith entry, F[i], stores the number of OIDs with first
+      byte at most i. Thus F[255] stores the total
+      number of commits (N).
+
+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+      The OIDs for all commits in the graph, sorted in ascending order.
+
+  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+* The first H bytes are for the OID of the root tree.
+* The next 8 bytes are for the int-ids of the first two parents
+  of the ith commit. Stores value 0x if no parent in that
+  position. If there are more than two parents, the second value
+  has its most-significant bit on and the other bits store an array
+  position into the Large Edge List chunk.
+* The next 8 bytes store the generation number of the commit and
+  the commit time in seconds since EPOCH. The generation number
+  uses the higher 30 bits of the first 4 bytes, while the commit
+  time uses the 32 bits of the second 4 bytes, along with the lowest
+  2 bits of the lowest byte, storing the 33rd and 34th bit of the
+  commit time.
+
+  Large Edge List (ID: {'E', 'D', 'G', 'E'})
+      This list of 4-byte values store the second through nth parents for
+      all octopus merges. The second parent value in the commit data is a
+      negative number pointing into this list. Then iterate through this
+      list starting at that position until reaching a value with the most-
+      significant bit on. The other bits correspond to the int-id of the
+      last parent. This chunk should always be present, but may be empty.
+
+TRAILER:
+
+  H-byte HASH-checksum of all of the above.
-- 
2.16.0.15.g9c3cf44.dirty
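The packed generation/date field in the Commit Data chunk is the least obvious part of the format, so a decoding sketch may help. It assumes one reading of the text above: the first big-endian 4-byte word carries the generation number in its upper 30 bits and bits 33 and 34 of the commit time in its lowest 2 bits, and the second word carries the low 32 bits of the commit time. The names read_be32, gen_date and decode_gen_date are hypothetical and not taken from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 4-byte network-order (big-endian) value. */
uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

struct gen_date {
	uint32_t generation; /* 30 significant bits */
	uint64_t date;       /* 34 significant bits, seconds since epoch */
};

/* Decode the 8 bytes that follow the two parent positions. */
struct gen_date decode_gen_date(const unsigned char field[8])
{
	uint32_t first = read_be32(field);
	uint32_t second = read_be32(field + 4);
	struct gen_date out;

	out.generation = first >> 2;
	/* the low 2 bits of the first word become bits 33 and 34 of the date */
	out.date = ((uint64_t)(first & 0x3) << 32) | second;
	assert(out.date < ((uint64_t)1 << 34));
	return out;
}
```

Under this reading, a field of 00 00 00 15 00 00 00 02 decodes to generation 5 and a commit time of 2^32 + 2 seconds.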
[PATCH v2 11/14] commit: integrate commit graph with commit parsing
Teach Git to inspect a commit graph file to supply the contents of a struct commit when calling parse_commit_gently(). This implementation satisfies all post-conditions on the struct commit, including loading parents, the root tree, and the commit date. The only loosely-expected condition is that the commit buffer is loaded into the cache. This was checked in log-tree.c:show_log(), but the "return;" on failure produced unexpected results (i.e. the message line was never terminated). The new behavior of loading the buffer when needed prevents the unexpected behavior. If core.commitgraph is false, then do not check graph files. In test script t5319-commit-graph.sh, add output-matching conditions on read-only graph operations. By loading commits from the graph instead of parsing commit buffers, we save a lot of time on long commit walks. Here are some performance results for a copy of the Linux repository where 'master' has 704,766 reachable commits and is behind 'origin/master' by 19,610 commits. 
| Command                          | Before | After | Rel % |
|----------------------------------|--------|-------|-------|
| log --oneline --topo-order -1000 |   5.9s |  0.7s | -88%  |
| branch -vv                       |  0.42s | 0.27s | -35%  |
| rev-list --all                   |   6.4s |  1.0s | -84%  |
| rev-list --all --objects         |  32.6s | 27.6s | -15%  |

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
--- alloc.c | 1 + commit-graph.c | 237 commit-graph.h | 20 +++- commit.c| 10 +- commit.h| 4 + log-tree.c | 3 +- t/t5318-commit-graph.sh | 47 ++ 7 files changed, 318 insertions(+), 4 deletions(-) diff --git a/alloc.c b/alloc.c index 12afadfacd..cf4f8b61e1 100644 --- a/alloc.c +++ b/alloc.c @@ -93,6 +93,7 @@ void *alloc_commit_node(void) struct commit *c = alloc_node(_state, sizeof(struct commit)); c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); + c->graph_pos = COMMIT_NOT_FROM_GRAPH; return c; } diff --git a/commit-graph.c b/commit-graph.c index 764e016ddb..fc816533c6 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -35,6 +35,9 @@ #define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \ GRAPH_OID_LEN + sizeof(struct commit_graph_header)) +/* global storage */ +struct commit_graph *commit_graph = 0; + struct object_id *get_graph_head_hash(const char *pack_dir, struct object_id *hash) { struct strbuf head_filename = STRBUF_INIT; @@ -209,6 +212,220 @@ struct commit_graph *load_commit_graph_one(const char *graph_file, const char *p return graph; } +static void prepare_commit_graph_one(const char *obj_dir) +{ + char *graph_file; + struct object_id oid; + struct strbuf pack_dir = STRBUF_INIT; + strbuf_addstr(_dir, obj_dir); + strbuf_add(_dir, "/pack", 5); + + if (!get_graph_head_hash(pack_dir.buf, )) + return; + + graph_file = get_commit_graph_filename_hash(pack_dir.buf, ); + + commit_graph = load_commit_graph_one(graph_file, pack_dir.buf); + strbuf_release(_dir); +} + +static int prepare_commit_graph_run_once = 0; +void prepare_commit_graph(void) +{ + struct alternate_object_database *alt; + char *obj_dir; + + if (prepare_commit_graph_run_once) + return; + 
prepare_commit_graph_run_once = 1; + + obj_dir = get_object_directory(); + prepare_commit_graph_one(obj_dir); + prepare_alt_odb(); + for (alt = alt_odb_list; !commit_graph && alt; alt = alt->next) + prepare_commit_graph_one(alt->path); +} + +static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t *pos) +{ + uint32_t last, first = 0; + + if (oid->hash[0]) + first = ntohl(*(uint32_t*)(g->chunk_oid_fanout + 4 * (oid->hash[0] - 1))); + last = ntohl(*(uint32_t*)(g->chunk_oid_fanout + 4 * oid->hash[0])); + + while (first < last) { + uint32_t mid = first + (last - first) / 2; + const unsigned char *current; + int cmp; + + current = g->chunk_oid_lookup + g->hdr->hash_len * mid; + cmp = hashcmp(oid->hash, current); + if (!cmp) { + *pos = mid; + return 1; + } + if (cmp > 0) { + first = mid + 1; + continue; + } + last = mid; + } + + *pos = first; + return 0; +} + +struct object_id *get_nth_commit_oid(struct commit_graph *g, +uint32_t n, +struct object_id *oid) +{ +
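To make the role of the fanout table in bsearch_graph() concrete, here is a self-contained sketch of both halves: building the cumulative fanout and using it to bound the binary search. It is an illustration under stated assumptions, not the patch's code: the fanout here is kept in host byte order (the on-disk chunk is network order and read through ntohl()), OIDs are a flat byte buffer of hash_len-byte entries, and the names build_fanout and lookup_oid are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* fanout[b] = number of OIDs whose first byte is <= b (host byte order). */
void build_fanout(const unsigned char *oids, size_t hash_len, uint32_t n,
		  uint32_t fanout[256])
{
	uint32_t pos = 0;
	int b;

	for (b = 0; b < 256; b++) {
		/* skip past every OID whose first byte equals b */
		while (pos < n && oids[hash_len * pos] == b)
			pos++;
		fanout[b] = pos;
	}
	assert(pos == n); /* fails if the OIDs were not sorted */
}

/*
 * Search the sorted OID list for 'want'. The fanout narrows the range
 * to the OIDs sharing want's first byte before the binary search runs.
 * Returns 1 and sets *pos on a hit; returns 0 and sets *pos to the
 * insertion point on a miss.
 */
int lookup_oid(const uint32_t fanout[256], const unsigned char *oids,
	       size_t hash_len, const unsigned char *want, uint32_t *pos)
{
	uint32_t first = want[0] ? fanout[want[0] - 1] : 0;
	uint32_t last = fanout[want[0]];

	assert(first <= last);
	while (first < last) {
		uint32_t mid = first + (last - first) / 2;
		int cmp = memcmp(want, oids + hash_len * mid, hash_len);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp > 0)
			first = mid + 1;
		else
			last = mid;
	}

	*pos = first;
	return 0;
}
```

When no OID shares the wanted first byte, first equals last and the loop body never runs, which is the saving the fanout is meant to provide.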
[PATCH v2 06/14] commit-graph: implement git-commit-graph --read
Teach git-commit-graph to read commit graph files and summarize their contents. Use the --read option to verify the contents of a commit graph file in the tests. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 7 ++ builtin/commit-graph.c | 55 +++ commit-graph.c | 138 - commit-graph.h | 25 +++ t/t5318-commit-graph.sh| 28 ++-- 5 files changed, 247 insertions(+), 6 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 3f3790d9a8..09aeaf6c82 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -10,6 +10,7 @@ SYNOPSIS [verse] 'git commit-graph' --write [--pack-dir ] +'git commit-graph' --read [--pack-dir ] EXAMPLES @@ -20,6 +21,12 @@ EXAMPLES $ git commit-graph --write +* Read basic information from a graph file. ++ + +$ git commit-graph --read --graph-hash= + + GIT --- Part of the linkgit:git[1] suite diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 7affd512f1..218740b1f8 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -10,15 +10,58 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--pack-dir ]"), + N_("git commit-graph --read [--graph-hash=]"), N_("git commit-graph --write [--pack-dir ]"), NULL }; static struct opts_commit_graph { const char *pack_dir; + int read; + const char *graph_hash; int write; } opts; +static int graph_read(void) +{ + struct object_id graph_hash; + struct commit_graph *graph = 0; + const char *graph_file; + + if (opts.graph_hash && strlen(opts.graph_hash) == GIT_MAX_HEXSZ) + get_oid_hex(opts.graph_hash, _hash); + else + die("no graph hash specified"); + + graph_file = get_commit_graph_filename_hash(opts.pack_dir, _hash); + graph = load_commit_graph_one(graph_file, opts.pack_dir); + + if (!graph) + die("graph file %s does not exist", graph_file); + + printf("header: %08x %02x %02x %02x %02x\n", + ntohl(graph->hdr->graph_signature), + 
graph->hdr->graph_version, + graph->hdr->hash_version, + graph->hdr->hash_len, + graph->hdr->num_chunks); + printf("num_commits: %u\n", graph->num_commits); + printf("chunks:"); + + if (graph->chunk_oid_fanout) + printf(" oid_fanout"); + if (graph->chunk_oid_lookup) + printf(" oid_lookup"); + if (graph->chunk_commit_data) + printf(" commit_metadata"); + if (graph->chunk_large_edges) + printf(" large_edges"); + printf("\n"); + + printf("pack_dir: %s\n", graph->pack_dir); + return 0; +} + static int graph_write(void) { struct object_id *graph_hash = construct_commit_graph(opts.pack_dir); @@ -36,8 +79,14 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) { OPTION_STRING, 'p', "pack-dir", _dir, N_("dir"), N_("The pack directory to store the graph") }, + OPT_BOOL('r', "read", , + N_("read graph file")), OPT_BOOL('w', "write", , N_("write commit graph file")), + { OPTION_STRING, 'H', "graph-hash", _hash, + N_("hash"), + N_("A hash for a specific graph file in the pack-dir."), + PARSE_OPT_OPTARG, NULL, (intptr_t) "" }, OPT_END(), }; @@ -49,6 +98,10 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) builtin_commit_graph_options, builtin_commit_graph_usage, 0); + if (opts.write + opts.read > 1) + usage_with_options(builtin_commit_graph_usage, + builtin_commit_graph_options); + if (!opts.pack_dir) { struct strbuf path = STRBUF_INIT; strbuf_addstr(, get_object_directory()); @@ -56,6 +109,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) opts.pack_dir = strbuf_detach(, NULL); } + if (opts.read) +
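The five values printed on the "header:" line correspond to the 8-byte header laid out in the format document (patch 01/14). As a rough sketch of how a reader might validate those bytes before trusting the rest of the file, assuming the layout given there: the struct toy_graph_header and the functions parse_graph_header and chunk_lookup_size are hypothetical names for illustration, and only the signature value 0x43475048 ("CGPH") and version 1 come from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_GRAPH_SIGNATURE 0x43475048u /* "CGPH" */
#define TOY_GRAPH_VERSION 1

struct toy_graph_header {
	uint32_t signature;
	uint8_t version;
	uint8_t hash_version;
	uint8_t hash_len;
	uint8_t num_chunks;
};

/* Parse and sanity-check the 8 header bytes; 0 on success, -1 on error. */
int parse_graph_header(const unsigned char *buf, size_t len,
		       struct toy_graph_header *out)
{
	assert(buf && out);
	if (len < 8)
		return -1;

	/* the signature is a 4-byte network-order value */
	out->signature = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
			 ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
	out->version = buf[4];
	out->hash_version = buf[5];
	out->hash_len = buf[6];
	out->num_chunks = buf[7];

	if (out->signature != TOY_GRAPH_SIGNATURE)
		return -1;
	if (out->version != TOY_GRAPH_VERSION)
		return -1;
	return 0;
}

/* Size of the chunk lookup table: (C + 1) entries of 12 bytes each. */
size_t chunk_lookup_size(const struct toy_graph_header *h)
{
	return ((size_t)h->num_chunks + 1) * 12;
}
```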
[PATCH v2 08/14] commit-graph: implement git-commit-graph --clear
Teach Git to delete the current 'graph_head' file and the commit graph
it references. This is a good safety valve if somehow the file is
corrupted and needs to be recalculated. Since the commit graph is a
summary of contents already in the ODB, it can be regenerated.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 16 ++--
 builtin/commit-graph.c             | 32 +++-
 t/t5318-commit-graph.sh            |  7 ++-
 3 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 99ced16ddc..33d6567f11 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -11,6 +11,7 @@ SYNOPSIS
 [verse]
 'git commit-graph' --write [--pack-dir ]
 'git commit-graph' --read [--pack-dir ]
+'git commit-graph' --clear [--pack-dir ]
 
 OPTIONS
 ---
@@ -18,16 +19,21 @@ OPTIONS
 	Use given directory for the location of packfiles, graph-head,
 	and graph files.
 
+--clear::
+	Delete the graph-head file and the graph file it references.
+	(Cannot be combined with --read or --write.)
+
 --read::
 	Read a graph file given by the graph-head file and output basic
-	details about the graph file. (Cannot be combined with --write.)
+	details about the graph file. (Cannot be combined with --clear
+	or --write.)
 
 --graph-id::
 	When used with --read, consider the graph file graph-.graph.
 
 --write::
 	Write a new graph file to the pack directory. (Cannot be combined
-	with --read.)
+	with --clear or --read.)
 
 --update-head::
 	When used with --write, update the graph-head file to point to
@@ -61,6 +67,12 @@ $ git commit-graph --write --update-head
 $ git commit-graph --read --graph-hash=
 
 
+* Delete /graph-head and the file it references.
++
+
+$ git commit-graph --clear --pack-dir=
+
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index d73cbc907d..4970dec133 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -10,6 +10,7 @@
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir ]"),
+	N_("git commit-graph --clear [--pack-dir ]"),
 	N_("git commit-graph --read [--graph-hash=]"),
 	N_("git commit-graph --write [--pack-dir ] [--update-head]"),
 	NULL
@@ -17,6 +18,7 @@ static char const * const builtin_commit_graph_usage[] = {
 
 static struct opts_commit_graph {
 	const char *pack_dir;
+	int clear;
 	int read;
 	const char *graph_hash;
 	int write;
@@ -25,6 +27,30 @@ static struct opts_commit_graph {
 	struct object_id old_graph_hash;
 } opts;
 
+static int graph_clear(void)
+{
+	struct strbuf head_path = STRBUF_INIT;
+	char *old_path;
+
+	if (!opts.has_existing)
+		return 0;
+
+	strbuf_addstr(&head_path, opts.pack_dir);
+	strbuf_addstr(&head_path, "/");
+	strbuf_addstr(&head_path, "graph-head");
+	if (remove_path(head_path.buf))
+		die("failed to remove path %s", head_path.buf);
+	strbuf_release(&head_path);
+
+	old_path = get_commit_graph_filename_hash(opts.pack_dir,
+						  &opts.old_graph_hash);
+	if (remove_path(old_path))
+		die("failed to remove path %s", old_path);
+	free(old_path);
+
+	return 0;
+}
+
 static int graph_read(void)
 {
 	struct object_id graph_hash;
@@ -105,6 +131,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
 			N_("dir"),
 			N_("The pack directory to store the graph") },
+		OPT_BOOL('c', "clear", &opts.clear,
+			N_("clear graph file and graph-head")),
 		OPT_BOOL('r', "read", &opts.read,
 			N_("read graph file")),
 		OPT_BOOL('w', "write", &opts.write,
@@ -126,7 +154,7 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			     builtin_commit_graph_options,
 			     builtin_commit_graph_usage, 0);
 
-	if (opts.write + opts.read > 1)
+	if (opts.write + opts.read + opts.clear > 1)
 		usage_with_options(builtin_commit_graph_usage,
 				   builtin_commit_graph_options);
 
@@ -139,6 +167,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 	opts.has_existing = !!get_graph_head_hash(opts.pack_dir,
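The mode check in this patch sums the flags and rejects any total above one. That works because OPT_BOOL leaves each flag at 0 or 1. A standalone sketch of the same pattern follows; struct mode_flags and modes_are_exclusive() are hypothetical names, not code from the series.

```c
#include <assert.h>

/* Hypothetical stand-in for the patch's opts struct: one int per mode. */
struct mode_flags {
	int clear;
	int read;
	int write;
};

/*
 * Return 1 when at most one mode is selected, 0 otherwise.
 * The !! normalization keeps the sum meaningful even if a caller
 * stores a value other than 0 or 1 in a flag.
 */
static int modes_are_exclusive(const struct mode_flags *m)
{
	return (!!m->clear + !!m->read + !!m->write) <= 1;
}
```

Selecting no mode at all passes this check, which matches the patch: the command then falls through to its default behavior.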
[PATCH v2 03/14] commit-graph: create git-commit-graph builtin
Teach git the 'commit-graph' builtin that will be used for writing and reading packed graph files. The current implementation is mostly empty, except for a '--pack-dir' option. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- .gitignore | 1 + Documentation/git-commit-graph.txt | 7 +++ Makefile | 1 + builtin.h | 1 + builtin/commit-graph.c | 33 + command-list.txt | 1 + git.c | 1 + 7 files changed, 45 insertions(+) create mode 100644 Documentation/git-commit-graph.txt create mode 100644 builtin/commit-graph.c diff --git a/.gitignore b/.gitignore index 833ef3b0b7..e82f90184d 100644 --- a/.gitignore +++ b/.gitignore @@ -34,6 +34,7 @@ /git-clone /git-column /git-commit +/git-commit-graph /git-commit-tree /git-config /git-count-objects diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt new file mode 100644 index 00..c8ea548dfb --- /dev/null +++ b/Documentation/git-commit-graph.txt @@ -0,0 +1,7 @@ +git-commit-graph(1) + + +NAME + +git-commit-graph - Write and verify Git commit graphs (.graph files) + diff --git a/Makefile b/Makefile index 1a9b23b679..aee5d3f7b9 100644 --- a/Makefile +++ b/Makefile @@ -965,6 +965,7 @@ BUILTIN_OBJS += builtin/for-each-ref.o BUILTIN_OBJS += builtin/fsck.o BUILTIN_OBJS += builtin/gc.o BUILTIN_OBJS += builtin/get-tar-commit-id.o +BUILTIN_OBJS += builtin/commit-graph.o BUILTIN_OBJS += builtin/grep.o BUILTIN_OBJS += builtin/hash-object.o BUILTIN_OBJS += builtin/help.o diff --git a/builtin.h b/builtin.h index 42378f3aa4..079855b6d4 100644 --- a/builtin.h +++ b/builtin.h @@ -149,6 +149,7 @@ extern int cmd_clone(int argc, const char **argv, const char *prefix); extern int cmd_clean(int argc, const char **argv, const char *prefix); extern int cmd_column(int argc, const char **argv, const char *prefix); extern int cmd_commit(int argc, const char **argv, const char *prefix); +extern int cmd_commit_graph(int argc, const char **argv, const char *prefix); extern int cmd_commit_tree(int argc, const char **argv, 
const char *prefix); extern int cmd_config(int argc, const char **argv, const char *prefix); extern int cmd_count_objects(int argc, const char **argv, const char *prefix); diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c new file mode 100644 index 00..2104550d25 --- /dev/null +++ b/builtin/commit-graph.c @@ -0,0 +1,33 @@ +#include "builtin.h" +#include "cache.h" +#include "config.h" +#include "dir.h" +#include "git-compat-util.h" +#include "lockfile.h" +#include "packfile.h" +#include "parse-options.h" + +static char const * const builtin_commit_graph_usage[] = { + N_("git commit-graph [--pack-dir ]"), + NULL +}; + +static struct opts_commit_graph { + const char *pack_dir; +} opts; + +int cmd_commit_graph(int argc, const char **argv, const char *prefix) +{ + static struct option builtin_commit_graph_options[] = { + { OPTION_STRING, 'p', "pack-dir", _dir, + N_("dir"), + N_("The pack directory to store the graph") }, + OPT_END(), + }; + + if (argc == 2 && !strcmp(argv[1], "-h")) + usage_with_options(builtin_commit_graph_usage, + builtin_commit_graph_options); + + return 0; +} diff --git a/command-list.txt b/command-list.txt index a1fad28fd8..835c5890be 100644 --- a/command-list.txt +++ b/command-list.txt @@ -34,6 +34,7 @@ git-clean mainporcelain git-clone mainporcelain init git-column purehelpers git-commit mainporcelain history +git-commit-graphplumbingmanipulators git-commit-tree plumbingmanipulators git-config ancillarymanipulators git-count-objects ancillaryinterrogators diff --git a/git.c b/git.c index c870b9719c..c7b5adae7b 100644 --- a/git.c +++ b/git.c @@ -388,6 +388,7 @@ static struct cmd_struct commands[] = { { "clone", cmd_clone }, { "column", cmd_column, RUN_SETUP_GENTLY }, { "commit", cmd_commit, RUN_SETUP | NEED_WORK_TREE }, + { "commit-graph", cmd_commit_graph, RUN_SETUP }, { "commit-tree", cmd_commit_tree, RUN_SETUP }, { "config", cmd_config, RUN_SETUP_GENTLY }, { "count-objects", cmd_count_objects, RUN_SETUP }, -- 2.16.0.15.g9c3cf44.dirty
[PATCH v2 00/14] Serialized Git Commit Graph
Thanks to everyone who gave comments on v1. I tried my best to respond
to all of the feedback, but may have missed some while I was doing
several renames, including:

* builtin/graph.c -> builtin/commit-graph.c
* packed-graph.[c|h] -> commit-graph.[c|h]
* t/t5319-graph.sh -> t/t5318-commit-graph.sh

Because of these renames (and several type/function renames) the diff
is too large to conveniently share here.

Some issues that came up and are addressed:

* Use instead of when referring to the graph-.graph filenames and the
  contents of graph-head.

* 32-bit timestamps will not cause undefined behavior.

* timestamp_t is unsigned, so they are never negative.

* The config setting "core.commitgraph" now only controls consuming the
  graph during normal operations and will not block the commit-graph
  plumbing command.

* The --stdin-commits is better about sanitizing the input for strings
  that do not parse to OIDs or are OIDs for non-commit objects.

One unresolved comment that I would like consensus on is the use of
globals to store the config setting and the graph state. I'm currently
using the pattern from packed_git instead of putting these values in
the_repository. However, we want to eventually remove globals like
packed_git. Should I deviate from the pattern _now_ in order to keep
the problem from growing, or should I keep to the known pattern?

Finally, I tried to clean up my incorrect style as I was recreating
these commits. Feel free to be merciless in style feedback now that the
architecture is more stable.

Thanks,
-Stolee

-- >8 --

As promised [1], this patch contains a way to serialize the commit
graph. The current implementation defines a new file format to store
the graph structure (parent relationships) and basic commit metadata
(commit date, root tree OID) in order to prevent parsing raw commits
while performing basic graph walks.
For example, we do not need to parse the full commit when performing
these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base <A> <B>' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.

The current patch speeds up these calculations by adding a check to
parse_commit_gently() that looks for a graph file and uses it to
provide the required metadata to the struct commit.

The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.

Here are some performance results for a copy of the Linux repository
where 'master' has 704,766 reachable commits and is behind
'origin/master' by 19,610 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv                       |  0.42s |  0.27s | -35%  |
| rev-list --all                   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

To test this yourself, run the following on your repo:

  git config core.commitgraph true
  git show-ref -s | git commit-graph --write --update-head --stdin-commits

The second command writes a commit graph file containing every commit
reachable from your refs. Now, all git commands that walk commits will
check your graph first before consulting the ODB.
You can run your own performance comparisons by toggling the
'core.commitgraph' setting.

[1] https://public-inbox.org/git/d154319e-bb9e-b300-7c37-27b1dcd2a...@jeffhostetler.com/
    Re: What's cooking in git.git (Jan 2018, #03; Tue, 23)

[2] https://github.com/derrickstolee/git/pull/2
    A GitHub pull request containing the latest version of this patch.

Derrick Stolee (14):
  commit-graph: add format document
  graph: add commit graph design document
  commit-graph: create git-commit-graph builtin
  commit-graph: implement construct_commit_graph()
  commit-graph: implement git-commit-graph --write
  commit-graph: implement git-commit-graph --read
  commit-graph: implement git-commit-graph --update-head
  commit-graph: implement git-commit-graph --clear
  commit-graph: teach git-commit-graph --delete-expired
  commit-graph: add core.commitgraph setting
  commit: integrate commit graph with commit parsing
  commit-graph: read only from specific pack-indexes
  commit-graph: close under
[PATCH v2 10/14] commit-graph: add core.commitgraph setting
The commit graph feature is controlled by the new core.commitgraph
config setting. This defaults to 0, so the feature is opt-in. The
intention of core.commitgraph is that a user can always stop checking
for or parsing commit graph files if core.commitgraph=0.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/config.txt | 3 +++
 cache.h                  | 1 +
 config.c                 | 5 +++++
 environment.c            | 1 +
 4 files changed, 10 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..5b63559a2b 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -898,6 +898,9 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable. See linkgit:git-notes[1].
 
+core.commitgraph::
+	Enable git commit graph feature. Allows reading from .graph files.
+
 core.sparseCheckout::
 	Enable "sparse checkout" feature. See section "Sparse checkout" in
 	linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index d8b975a571..e50e447a4f 100644
--- a/cache.h
+++ b/cache.h
@@ -825,6 +825,7 @@ extern char *git_replace_ref_base;
 extern int fsync_object_files;
 extern int core_preload_index;
 extern int core_apply_sparse_checkout;
+extern int core_commitgraph;
 extern int precomposed_unicode;
 extern int protect_hfs;
 extern int protect_ntfs;
diff --git a/config.c b/config.c
index e617c2018d..99153fcfdb 100644
--- a/config.c
+++ b/config.c
@@ -1223,6 +1223,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.commitgraph")) {
+		core_commitgraph = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.sparsecheckout")) {
 		core_apply_sparse_checkout = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 63ac38a46f..faa4323cc5 100644
--- a/environment.c
+++ b/environment.c
@@ -61,6 +61,7 @@ enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_apply_sparse_checkout;
+int core_commitgraph;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
 unsigned long pack_size_limit_cfg;
-- 
2.16.0.15.g9c3cf44.dirty
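The setting is opt-in because core_commitgraph is a plain global int, which starts at zero, and it only changes when the config parser sees the key. The sketch below is a simplified take on boolean config parsing. It is not git's git_config_bool(), which dies on bad input rather than returning an error. parse_bool_sketch() and core_commitgraph_sketch are made-up names.

```c
#include <assert.h>
#include <stdlib.h>
#include <strings.h>

/*
 * Simplified sketch of boolean config parsing; NOT git's implementation.
 * Returns 1 for true, 0 for false, -1 for an unrecognized value.
 */
static int parse_bool_sketch(const char *value)
{
	char *end;
	long n;

	if (!value)	/* key present with no "= value" part */
		return 1;
	if (!*value)	/* "key =" with an empty value */
		return 0;
	if (!strcasecmp(value, "true") || !strcasecmp(value, "yes") ||
	    !strcasecmp(value, "on"))
		return 1;
	if (!strcasecmp(value, "false") || !strcasecmp(value, "no") ||
	    !strcasecmp(value, "off"))
		return 0;

	/* Fall back to an integer: any nonzero number counts as true. */
	n = strtol(value, &end, 10);
	if (*end)
		return -1;
	return n != 0;
}

/* Hypothetical stand-in for the global the patch adds; zero means off. */
static int core_commitgraph_sketch;
```

With this shape, a repository that never sets the key keeps the zero default and never consults a graph file.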
[PATCH v2 02/14] graph: add commit graph design document
Add Documentation/technical/commit-graph.txt with details of the planned commit graph feature, including future plans. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph.txt | 189 +++ 1 file changed, 189 insertions(+) create mode 100644 Documentation/technical/commit-graph.txt diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt new file mode 100644 index 00..cbf88f7264 --- /dev/null +++ b/Documentation/technical/commit-graph.txt @@ -0,0 +1,189 @@ +Git Commit Graph Design Notes += + +Git walks the commit graph for many reasons, including: + +1. Listing and filtering commit history. +2. Computing merge bases. + +These operations can become slow as the commit count grows. The merge +base calculation shows up in many user-facing commands, such as 'merge-base' +or 'git show --remerge-diff' and can take minutes to compute depending on +history shape. + +There are two main costs here: + +1. Decompressing and parsing commits. +2. Walking the entire graph to avoid topological order mistakes. + +The commit graph file is a supplemental data structure that accelerates +commit graph walks. If a user downgrades or disables the 'core.commitgraph' +config setting, then the existing ODB is sufficient. The file is stored +next to packfiles either in the .git/objects/pack directory or in the pack +directory of an alternate. + +The commit graph file stores the commit graph structure along with some +extra metadata to speed up graph walks. By listing commit OIDs in lexi- +cographic order, we can identify an integer position for each commit and +refer to the parents of a commit using those integer positions. We use +binary search to find initial commits and then use the integer positions +for fast lookups during the walk. + +A consumer may load the following info for a commit from the graph: + +1. The commit OID. +2. The list of parents, along with their integer position. +3. The commit date. +4. 
The root tree OID. +5. The generation number (see definition below). + +Values 1-4 satisfy the requirements of parse_commit_gently(). + +Define the "generation number" of a commit recursively as follows: + + * A commit with no parents (a root commit) has generation number one. + + * A commit with at least one parent has generation number one more than + the largest generation number among its parents. + +Equivalently, the generation number of a commit A is one more than the +length of a longest path from A to a root commit. The recursive definition +is easier to use for computation and observing the following property: + +If A and B are commits with generation numbers N and M, respectively, +and N <= M, then A cannot reach B. That is, we know without searching +that B is not an ancestor of A because it is further from a root commit +than A. + +Conversely, when checking if A is an ancestor of B, then we only need +to walk commits until all commits on the walk boundary have generation +number at most N. If we walk commits using a priority queue seeded by +generation numbers, then we always expand the boundary commit with highest +generation number and can easily detect the stopping condition. + +This property can be used to significantly reduce the time it takes to +walk commits and determine topological relationships. Without generation +numbers, the general heuristic is the following: + +If A and B are commits with commit time X and Y, respectively, and +X < Y, then A _probably_ cannot reach B. + +This heuristic is currently used whenever the computation can make +mistakes with topological orders (such as "git log" with default order), +but is not used when the topological order is required (such as merge +base calculations, "git log --graph"). + +In practice, we expect some commits to be created recently and not stored +in the commit graph. 
We can treat these commits as having "infinite" +generation number and walk until reaching commits with known generation +number. + +Design Details +-- + +- A graph file is stored in a file named 'graph-.graph' in the pack + directory. This could be stored in an alternate. + +- The most-recent graph file hash is stored in a 'graph-head' file for + immediate access and storing backup graphs. This could be stored in an + alternate, and refers to a 'graph-.graph' file in the same pack + directory. + +- The core.commitgraph config setting must be on to consume graph files. + +- The file format includes parameters for the object id length and hash + algorithm, so a future change of hash algorithm does not require a change + in format. + +Current Limitations +--- + +- Only one graph file is used at one time. This allows the integer position + to seek into the single graph file. It is possible to extend the mode
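The recursive definition of generation numbers in the design document above can be sketched in a few lines. This is an illustration only: it assumes commits are addressed by integer position, caps the parent count at two, and uses recursion where a real implementation would need an explicit stack for deep histories. struct sketch_commit, compute_generation() and might_reach() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PARENTS 2	/* assumption for the sketch; octopus merges have more */
#define NO_PARENT (-1)

struct sketch_commit {
	int parents[MAX_PARENTS];	/* integer positions, NO_PARENT if unused */
	uint32_t generation;		/* 0 means "not computed yet" */
};

/*
 * A root commit has generation 1; any other commit has one more than
 * the largest generation among its parents. Results are memoized in
 * the generation field so each commit is computed once.
 */
static uint32_t compute_generation(struct sketch_commit *commits, int pos)
{
	struct sketch_commit *c = &commits[pos];
	uint32_t max_parent = 0;
	int i;

	if (c->generation)
		return c->generation;

	for (i = 0; i < MAX_PARENTS; i++) {
		uint32_t g;

		if (c->parents[i] == NO_PARENT)
			continue;
		g = compute_generation(commits, c->parents[i]);
		if (g > max_parent)
			max_parent = g;
	}
	c->generation = max_parent + 1;
	return c->generation;
}

/*
 * The pruning rule from the text: for distinct commits A and B with
 * gen(A) <= gen(B), A cannot reach B. Returns 1 when a walk from A
 * could still find B, 0 when the generation numbers rule it out.
 */
static int might_reach(const struct sketch_commit *commits, int a, int b)
{
	if (a == b)
		return 1;
	return commits[a].generation > commits[b].generation;
}
```

A return of 1 from might_reach() only means a walk is still needed. The generation numbers rule reachability out, they never prove it.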
[PATCH v2 12/14] commit-graph: read only from specific pack-indexes
Teach git-commit-graph to inspect the objects only in a certain list of pack-indexes within the given pack directory. This allows updating the commit graph iteratively, since we add all commits stored in a previous commit graph. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 13 + builtin/commit-graph.c | 25 ++--- commit-graph.c | 25 +++-- commit-graph.h | 4 +++- packfile.c | 4 ++-- packfile.h | 2 ++ t/t5318-commit-graph.sh| 6 -- 7 files changed, 69 insertions(+), 10 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 7b376e9212..d0571cd896 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -43,6 +43,11 @@ OPTIONS When used with --write and --update-head, delete the graph file previously referenced by graph-head. +--stdin-packs:: + When used with --write, generate the new graph by walking objects + only in the specified packfiles and any commits in the + existing graph-head. + EXAMPLES @@ -65,6 +70,14 @@ $ git commit-graph --write $ git commit-graph --write --update-head --delete-expired +* Write a graph file, extending the current graph file using commits +* in , update graph-head, and delete the old graph-.graph +* file. ++ + +$ echo | git commit-graph --write --update-head --delete-expired --stdin-packs + + * Read basic information from a graph file. 
+ diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 766f09e6fc..80a409e784 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -12,7 +12,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--pack-dir ]"), N_("git commit-graph --clear [--pack-dir ]"), N_("git commit-graph --read [--graph-hash=]"), - N_("git commit-graph --write [--pack-dir ] [--update-head] [--delete-expired]"), + N_("git commit-graph --write [--pack-dir ] [--update-head] [--delete-expired] [--stdin-packs]"), NULL }; @@ -24,6 +24,7 @@ static struct opts_commit_graph { int write; int update_head; int delete_expired; + int stdin_packs; int has_existing; struct object_id old_graph_hash; } opts; @@ -114,7 +115,24 @@ static void update_head_file(const char *pack_dir, const struct object_id *graph static int graph_write(void) { - struct object_id *graph_hash = construct_commit_graph(opts.pack_dir); + struct object_id *graph_hash; + char **pack_indexes = NULL; + int num_packs = 0; + int size_packs = 0; + + if (opts.stdin_packs) { + struct strbuf buf = STRBUF_INIT; + size_packs = 128; + ALLOC_ARRAY(pack_indexes, size_packs); + + while (strbuf_getline(, stdin) != EOF) { + ALLOC_GROW(pack_indexes, num_packs + 1, size_packs); + pack_indexes[num_packs++] = buf.buf; + strbuf_detach(, NULL); + } + } + + graph_hash = construct_commit_graph(opts.pack_dir, pack_indexes, num_packs); if (opts.update_head) update_head_file(opts.pack_dir, graph_hash); @@ -122,7 +140,6 @@ static int graph_write(void) if (graph_hash) printf("%s\n", oid_to_hex(graph_hash)); - if (opts.delete_expired && opts.update_head && opts.has_existing && oidcmp(graph_hash, _graph_hash)) { char *old_path = get_commit_graph_filename_hash(opts.pack_dir, @@ -153,6 +170,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) N_("update graph-head to written graph file")), OPT_BOOL('d', "delete-expired", _expired, N_("delete expired head graph file")), + OPT_BOOL('s', 
"stdin-packs", _packs, + N_("only scan packfiles listed by stdin")), { OPTION_STRING, 'H', "graph-hash", _hash, N_("hash"), N_("A hash for a specific graph file in the pack-dir."), diff --git a/commit-graph.c b/commit-graph.c index fc816533c6..e5a1d9ee8b 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -638,7 +638,9 @@ static int if_packed_commit_add_to_list(const struct object_id *oid, return 0; } -struct object_id
[PATCH v2 09/14] commit-graph: teach git-commit-graph --delete-expired
Teach git-commit-graph to delete the graph previously referenced by 'graph_head' when writing a new graph file and updating 'graph_head'. This prevents data creep by storing a list of useless graphs. Be careful to not delete the graph if the file did not change. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 8 +++-- builtin/commit-graph.c | 16 - t/t5318-commit-graph.sh| 66 +- 3 files changed, 86 insertions(+), 4 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 33d6567f11..7b376e9212 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -39,6 +39,10 @@ OPTIONS When used with --write, update the graph-head file to point to the written graph file. +--delete-expired:: + When used with --write and --update-head, delete the graph file + previously referenced by graph-head. + EXAMPLES @@ -55,10 +59,10 @@ $ git commit-graph --write * Write a graph file for the packed commits in your local .git folder, -* and update graph-head. +* update graph-head, and delete the old graph-.graph file. + -$ git commit-graph --write --update-head +$ git commit-graph --write --update-head --delete-expired * Read basic information from a graph file. 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 4970dec133..766f09e6fc 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -12,7 +12,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--pack-dir ]"), N_("git commit-graph --clear [--pack-dir ]"), N_("git commit-graph --read [--graph-hash=]"), - N_("git commit-graph --write [--pack-dir ] [--update-head]"), + N_("git commit-graph --write [--pack-dir ] [--update-head] [--delete-expired]"), NULL }; @@ -23,6 +23,7 @@ static struct opts_commit_graph { const char *graph_hash; int write; int update_head; + int delete_expired; int has_existing; struct object_id old_graph_hash; } opts; @@ -121,6 +122,17 @@ static int graph_write(void) if (graph_hash) printf("%s\n", oid_to_hex(graph_hash)); + + if (opts.delete_expired && opts.update_head && opts.has_existing && + oidcmp(graph_hash, _graph_hash)) { + char *old_path = get_commit_graph_filename_hash(opts.pack_dir, + _graph_hash); + if (remove_path(old_path)) + die("failed to remove path %s", old_path); + + free(old_path); + } + free(graph_hash); return 0; } @@ -139,6 +151,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) N_("write commit graph file")), OPT_BOOL('u', "update-head", _head, N_("update graph-head to written graph file")), + OPT_BOOL('d', "delete-expired", _expired, + N_("delete expired head graph file")), { OPTION_STRING, 'H', "graph-hash", _hash, N_("hash"), N_("A hash for a specific graph file in the pack-dir."), diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 6e3b62b754..b56a6d4217 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -101,9 +101,73 @@ test_expect_success 'write graph with merges' \ _graph_read_expect "18" "${packdir}" && cmp expect output' +test_expect_success 'Add more commits' \ +'for i in $(test_seq 16 20) + do +echo $i >$i.txt && +git add $i.txt && +git commit -m "commit $i" && +git branch commits/$i + done && + 
git repack' + +# Current graph structure: +# +# 20 +# | +# 19 +# | +# 18 +# | +# 17 +# | +# 16 +# | +# M3 +# / |\_ +#/ 10 15 +# / | | +# /9 M2 14 +# | |/ \ | +# | 8 M1 | 13 +# | |/ | \_| +# 5 7 | 12 +# | | \__| +# 4 6 11 +# |/__/ +# 3 +# | +# 2 +# | +# 1 + +test_expect_success 'write graph with merges' \ +'graph3=$(git commit-graph --write --update-head --delete-expired) && + test_path_is_file ${packdir}/graph-${graph3}.graph && + test_path_is_missing ${packdir}/graph-${graph2}.graph && +
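The commit message warns against deleting the graph when the file did not change. The condition in graph_write() can be read as a small predicate, sketched here on its own. HASH_LEN and should_delete_old_graph() are illustrative names, and raw byte arrays stand in for struct object_id.

```c
#include <assert.h>
#include <string.h>

#define HASH_LEN 20 /* assumption: SHA-1 sized graph hashes */

/*
 * Return 1 when the previously referenced graph file should be removed.
 * The first three parameters mirror the option fields in the patch.
 */
static int should_delete_old_graph(int delete_expired, int update_head,
				   int has_existing,
				   const unsigned char *old_hash,
				   const unsigned char *new_hash)
{
	if (!delete_expired || !update_head || !has_existing)
		return 0;
	/* Same hash means graph-head still points at this file: keep it. */
	return memcmp(old_hash, new_hash, HASH_LEN) != 0;
}
```

Without the hash comparison, rewriting an unchanged graph would delete the very file that graph-head was just updated to reference.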
[PATCH v2 05/14] commit-graph: implement git-commit-graph --write
Teach git-commit-graph to write graph files. Create new test script to
verify this command succeeds without failure.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 18 +++
 builtin/commit-graph.c             | 30 
 t/t5318-commit-graph.sh            | 96 ++
 3 files changed, 144 insertions(+)
 create mode 100755 t/t5318-commit-graph.sh

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index c8ea548dfb..3f3790d9a8 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -5,3 +5,21 @@ NAME
 
 git-commit-graph - Write and verify Git commit graphs (.graph files)
 
+
+SYNOPSIS
+
+[verse]
+'git commit-graph' --write [--pack-dir ]
+
+EXAMPLES
+
+
+* Write a commit graph file for the packed commits in your local .git folder.
++
+
+$ git commit-graph --write
+
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 2104550d25..7affd512f1 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -6,22 +6,38 @@
 #include "lockfile.h"
 #include "packfile.h"
 #include "parse-options.h"
+#include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir ]"),
+	N_("git commit-graph --write [--pack-dir ]"),
 	NULL
 };
 
 static struct opts_commit_graph {
 	const char *pack_dir;
+	int write;
 } opts;
 
+static int graph_write(void)
+{
+	struct object_id *graph_hash = construct_commit_graph(opts.pack_dir);
+
+	if (graph_hash)
+		printf("%s\n", oid_to_hex(graph_hash));
+
+	free(graph_hash);
+	return 0;
+}
+
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_commit_graph_options[] = {
 		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
 			N_("dir"),
 			N_("The pack directory to store the graph") },
+		OPT_BOOL('w', "write", &opts.write,
+			N_("write commit graph file")),
 		OPT_END(),
 	};
 
@@ -29,5 +45,19 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 		usage_with_options(builtin_commit_graph_usage,
 				   builtin_commit_graph_options);
 
+	argc = parse_options(argc, argv, prefix,
+			     builtin_commit_graph_options,
+			     builtin_commit_graph_usage, 0);
+
+	if (!opts.pack_dir) {
+		struct strbuf path = STRBUF_INIT;
+		strbuf_addstr(&path, get_object_directory());
+		strbuf_addstr(&path, "/pack");
+		opts.pack_dir = strbuf_detach(&path, NULL);
+	}
+
+	if (opts.write)
+		return graph_write();
+
 	return 0;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
new file mode 100755
index 0000000000..6bcd1cc264
--- /dev/null
+++ b/t/t5318-commit-graph.sh
@@ -0,0 +1,96 @@
+#!/bin/sh
+
+test_description='commit graph'
+. ./test-lib.sh
+
+test_expect_success 'setup full repo' \
+'rm -rf .git &&
+ mkdir full &&
+ cd full &&
+ git init &&
+ git config core.commitgraph true &&
+ git config pack.threads 1 &&
+ packdir=".git/objects/pack"'
+
+test_expect_success 'write graph with no packs' \
+'git commit-graph --write --pack-dir .'
+
+test_expect_success 'create commits and repack' \
+'for i in $(test_seq 5)
+ do
+        echo $i >$i.txt &&
+        git add $i.txt &&
+        git commit -m "commit $i" &&
+        git branch commits/$i
+ done &&
+ git repack'
+
+test_expect_success 'write graph' \
+'graph1=$(git commit-graph --write) &&
+ test_path_is_file ${packdir}/graph-${graph1}.graph'
+
+test_expect_success 'Add more commits' \
+'git reset --hard commits/3 &&
+ for i in $(test_seq 6 10)
+ do
+        echo $i >$i.txt &&
+        git add $i.txt &&
+        git commit -m "commit $i" &&
+        git branch commits/$i
+ done &&
+ git reset --hard commits/3 &&
+ for i in $(test_seq 11 15)
+ do
+        echo $i >$i.txt &&
+        git add $i.txt &&
+        git commit -m "commit $i" &&
+        git branch commits/$i
+ done &&
+ git reset --hard commits/7 &&
+ git merge commits/11 &&
+ git branch merge/1 &&
+ git reset --hard commits
[PATCH v2 07/14] commit-graph: implement git-commit-graph --update-head
It is possible to have multiple commit graph files in a pack directory, but only one is important at a time. Use a 'graph_head' file to point to the important file. Teach git-commit-graph to write 'graph_head' upon writing a new commit graph file. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 34 ++ builtin/commit-graph.c | 38 +++--- commit-graph.c | 25 + commit-graph.h | 2 ++ t/t5318-commit-graph.sh| 12 ++-- 5 files changed, 106 insertions(+), 5 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 09aeaf6c82..99ced16ddc 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -12,15 +12,49 @@ SYNOPSIS 'git commit-graph' --write [--pack-dir ] 'git commit-graph' --read [--pack-dir ] +OPTIONS +--- +--pack-dir:: + Use given directory for the location of packfiles, graph-head, + and graph files. + +--read:: + Read a graph file given by the graph-head file and output basic + details about the graph file. (Cannot be combined with --write.) + +--graph-id:: + When used with --read, consider the graph file graph-.graph. + +--write:: + Write a new graph file to the pack directory. (Cannot be combined + with --read.) + +--update-head:: + When used with --write, update the graph-head file to point to + the written graph file. + EXAMPLES +* Output the hash of the graph file pointed to by /graph-head. ++ + +$ git commit-graph --pack-dir= + + * Write a commit graph file for the packed commits in your local .git folder. + $ git commit-graph --write +* Write a graph file for the packed commits in your local .git folder, +* and update graph-head. ++ + +$ git commit-graph --write --update-head + + * Read basic information from a graph file. 
+ diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 218740b1f8..d73cbc907d 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -11,7 +11,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--pack-dir ]"), N_("git commit-graph --read [--graph-hash=]"), - N_("git commit-graph --write [--pack-dir ]"), + N_("git commit-graph --write [--pack-dir ] [--update-head]"), NULL }; @@ -20,6 +20,9 @@ static struct opts_commit_graph { int read; const char *graph_hash; int write; + int update_head; + int has_existing; + struct object_id old_graph_hash; } opts; static int graph_read(void) @@ -30,8 +33,8 @@ static int graph_read(void) if (opts.graph_hash && strlen(opts.graph_hash) == GIT_MAX_HEXSZ) get_oid_hex(opts.graph_hash, _hash); - else - die("no graph hash specified"); + else if (!get_graph_head_hash(opts.pack_dir, _hash)) + die("no graph-head exists"); graph_file = get_commit_graph_filename_hash(opts.pack_dir, _hash); graph = load_commit_graph_one(graph_file, opts.pack_dir); @@ -62,10 +65,33 @@ static int graph_read(void) return 0; } +static void update_head_file(const char *pack_dir, const struct object_id *graph_hash) +{ + struct strbuf head_path = STRBUF_INIT; + int fd; + struct lock_file lk = LOCK_INIT; + + strbuf_addstr(_path, pack_dir); + strbuf_addstr(_path, "/"); + strbuf_addstr(_path, "graph-head"); + + fd = hold_lock_file_for_update(, head_path.buf, LOCK_DIE_ON_ERROR); + strbuf_release(_path); + + if (fd < 0) + die_errno("unable to open graph-head"); + + write_in_full(fd, oid_to_hex(graph_hash), GIT_MAX_HEXSZ); + commit_lock_file(); +} + static int graph_write(void) { struct object_id *graph_hash = construct_commit_graph(opts.pack_dir); + if (opts.update_head) + update_head_file(opts.pack_dir, graph_hash); + if (graph_hash) printf("%s\n", oid_to_hex(graph_hash)); @@ -83,6 +109,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) N_("read graph file")), OPT_BOOL('w', "write", , 
N_("write commit graph file")), + OPT_BOOL('u', &quo
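Editorial sketch: update_head_file() above relies on Git's lockfile API to replace graph-head safely. The standalone POSIX C below illustrates the same write-then-rename idea under stated assumptions. It is not Git's lock_file implementation; the name write_head_file() and the fixed-size path buffers are hypothetical and chosen only for illustration.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Git): store `hex` in
 * "<pack_dir>/graph-head" by writing "<pack_dir>/graph-head.lock"
 * first and renaming it into place. Returns 0 on success, -1 on error.
 */
static int write_head_file(const char *pack_dir, const char *hex)
{
	char path[4096], lock[4096];
	size_t len = strlen(hex);
	int fd, ok;

	if (snprintf(path, sizeof(path), "%s/graph-head", pack_dir) >= (int)sizeof(path))
		return -1;
	if (snprintf(lock, sizeof(lock), "%s.lock", path) >= (int)sizeof(lock))
		return -1;

	/* O_EXCL makes a second concurrent writer fail instead of clobbering us. */
	fd = open(lock, O_WRONLY | O_CREAT | O_EXCL, 0644);
	if (fd < 0)
		return -1;

	/* A short write is treated as failure; the hash is only a few bytes. */
	ok = write(fd, hex, len) == (ssize_t)len;
	if (close(fd) < 0)
		ok = 0;
	if (!ok) {
		unlink(lock);
		return -1;
	}

	/* rename() swaps the new file in atomically on POSIX filesystems. */
	if (rename(lock, path) < 0) {
		unlink(lock);
		return -1;
	}
	return 0;
}
```

A reader of graph-head therefore sees either the old hash or the new one, never a partially written file, which is the property the patch gets from hold_lock_file_for_update() and commit_lock_file().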
[PATCH v2 13/14] commit-graph: close under reachability
Teach construct_commit_graph() to walk all parents from the commits
discovered in packfiles. This prevents gaps given by loose objects or
previously-missed packfiles.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c          | 26 ++
 t/t5318-commit-graph.sh | 14 ++
 2 files changed, 40 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index e5a1d9ee8b..cfa0415a21 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -5,6 +5,7 @@
 #include "packfile.h"
 #include "commit.h"
 #include "object.h"
+#include "revision.h"
 #include "commit-graph.h"

 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
@@ -638,6 +639,29 @@ static int if_packed_commit_add_to_list(const struct object_id *oid,
 	return 0;
 }

+static void close_reachable(struct packed_oid_list *oids)
+{
+	int i;
+	struct rev_info revs;
+	struct commit *commit;
+	init_revisions(&revs, NULL);
+
+	for (i = 0; i < oids->num; i++) {
+		commit = lookup_commit(oids->list[i]);
+		if (commit && !parse_commit(commit))
+			revs.commits = commit_list_insert(commit, &revs.commits);
+	}
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+
+	while ((commit = get_revision(&revs)) != NULL) {
+		ALLOC_GROW(oids->list, oids->num + 1, oids->size);
+		oids->list[oids->num] = &(commit->object.oid);
+		(oids->num)++;
+	}
+}
+
 struct object_id *construct_commit_graph(const char *pack_dir,
 					 char **pack_indexes,
 					 int nr_packs)
@@ -696,6 +720,8 @@ struct object_id *construct_commit_graph(const char *pack_dir,
 	} else {
 		for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
 	}
+
+	close_reachable(&oids);
 	QSORT(oids.list, oids.num, commit_compare);

 	count_distinct = 1;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index b9a73f398c..2001b0b5b5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -213,6 +213,20 @@ test_expect_success 'clear graph' \
 _graph_git_behavior commits/20 merge/1
 _graph_git_behavior commits/20 merge/2

+test_expect_success 'build graph from latest pack with closure' \
+	'graph5=$(cat new-idx | git commit-graph --write --update-head --stdin-packs) &&
+	test_path_is_file ${packdir}/graph-${graph5}.graph &&
+	test_path_is_file ${packdir}/graph-${graph1}.graph &&
+	test_path_is_file ${packdir}/graph-head &&
+	echo ${graph5} >expect &&
+	cmp -n 40 expect ${packdir}/graph-head &&
+	git commit-graph --read --graph-hash=${graph5} >output &&
+	_graph_read_expect "21" "${packdir}" &&
+	cmp expect output'
+
+_graph_git_behavior commits/20 merge/1
+_graph_git_behavior commits/20 merge/2
+
 test_expect_success 'setup bare repo' \
 	'cd .. &&
 	git clone --bare full bare &&
-- 
2.16.0.15.g9c3cf44.dirty
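Editorial sketch: the idea behind close_reachable() can be shown without the revision-walk machinery. The toy below closes a list of commits under the parent relation using a plain array as both result and work queue. Everything here (struct toy_commit, close_reachable_toy(), the two-parent limit) is hypothetical and not Git code. Unlike the patch, which appends every walked commit and relies on the later sort and distinct count, this sketch de-duplicates with a `seen` flag.

```c
#include <stddef.h>

#define MAX_PARENTS 2

/* Toy stand-in for struct commit: parents are indexes into one array, -1 means none. */
struct toy_commit {
	int parents[MAX_PARENTS];
	int seen;
};

/*
 * Extend list[0..nr) (indexes into `all`) so that every parent of a
 * listed commit is listed too. `list` must have room for every commit
 * in `all`. Returns the new length.
 */
static size_t close_reachable_toy(struct toy_commit *all, int *list, size_t nr)
{
	size_t i, out = 0;
	int p;

	/* Mark the seed commits, dropping duplicates in place. */
	for (i = 0; i < nr; i++) {
		if (all[list[i]].seen)
			continue;
		all[list[i]].seen = 1;
		list[out++] = list[i];
	}

	/* The list doubles as a queue: appended parents are scanned in turn. */
	for (i = 0; i < out; i++) {
		for (p = 0; p < MAX_PARENTS; p++) {
			int parent = all[list[i]].parents[p];

			if (parent < 0 || all[parent].seen)
				continue;
			all[parent].seen = 1;
			list[out++] = parent;
		}
	}
	return out;
}
```

Seeding the list with only the commits found in one packfile and then closing it is what lets the test above expect 21 commits even though the graph was built from the latest pack alone.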
[PATCH v2 04/14] commit-graph: implement construct_commit_graph()
Teach Git to write a commit graph file by checking all packed objects to see if they are commits, then store the file in the given pack directory. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Makefile | 1 + commit-graph.c | 376 + commit-graph.h | 20 +++ 3 files changed, 397 insertions(+) create mode 100644 commit-graph.c create mode 100644 commit-graph.h diff --git a/Makefile b/Makefile index aee5d3f7b9..894432b35b 100644 --- a/Makefile +++ b/Makefile @@ -773,6 +773,7 @@ LIB_OBJS += color.o LIB_OBJS += column.o LIB_OBJS += combine-diff.o LIB_OBJS += commit.o +LIB_OBJS += commit-graph.o LIB_OBJS += compat/obstack.o LIB_OBJS += compat/terminal.o LIB_OBJS += config.o diff --git a/commit-graph.c b/commit-graph.c new file mode 100644 index 00..db2b7390c7 --- /dev/null +++ b/commit-graph.c @@ -0,0 +1,376 @@ +#include "cache.h" +#include "config.h" +#include "git-compat-util.h" +#include "pack.h" +#include "packfile.h" +#include "commit.h" +#include "object.h" +#include "commit-graph.h" + +#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */ +#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ +#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */ +#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */ +#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */ + +#define GRAPH_DATA_WIDTH 36 + +#define GRAPH_VERSION_1 0x1 +#define GRAPH_VERSION GRAPH_VERSION_1 + +#define GRAPH_OID_VERSION_SHA1 1 +#define GRAPH_OID_LEN_SHA1 20 +#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1 +#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1 + +#define GRAPH_LARGE_EDGES_NEEDED 0x8000 +#define GRAPH_PARENT_MISSING 0x7fff +#define GRAPH_EDGE_LAST_MASK 0x7fff +#define GRAPH_PARENT_NONE 0x7000 + +#define GRAPH_LAST_EDGE 0x8000 + +#define GRAPH_FANOUT_SIZE (4*256) +#define GRAPH_CHUNKLOOKUP_SIZE (5 * 12) +#define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \ + GRAPH_OID_LEN + sizeof(struct commit_graph_header)) + +char* get_commit_graph_filename_hash(const char *pack_dir, 
+struct object_id *hash) +{ + size_t len; + struct strbuf head_path = STRBUF_INIT; + strbuf_addstr(_path, pack_dir); + strbuf_addstr(_path, "/graph-"); + strbuf_addstr(_path, oid_to_hex(hash)); + strbuf_addstr(_path, ".graph"); + + return strbuf_detach(_path, ); +} + +static void write_graph_chunk_fanout(struct sha1file *f, +struct commit **commits, +int nr_commits) +{ + uint32_t i, count = 0; + struct commit **list = commits; + struct commit **last = commits + nr_commits; + + /* +* Write the first-level table (the list is sorted, +* but we use a 256-entry lookup to be able to avoid +* having to do eight extra binary search iterations). +*/ + for (i = 0; i < 256; i++) { + uint32_t swap_count; + + while (list < last) { + if ((*list)->object.oid.hash[0] != i) + break; + count++; + list++; + } + + swap_count = htonl(count); + sha1write(f, _count, 4); + } +} + +static void write_graph_chunk_oids(struct sha1file *f, int hash_len, + struct commit **commits, int nr_commits) +{ + struct commit **list, **last = commits + nr_commits; + for (list = commits; list < last; list++) + sha1write(f, (*list)->object.oid.hash, (int)hash_len); +} + +static int commit_pos(struct commit **commits, int nr_commits, + const struct object_id *oid, uint32_t *pos) +{ + uint32_t first = 0, last = nr_commits; + + while (first < last) { + uint32_t mid = first + (last - first) / 2; + struct object_id *current; + int cmp; + + current = &(commits[mid]->object.oid); + cmp = oidcmp(oid, current); + if (!cmp) { + *pos = mid; + return 1; + } + if (cmp > 0) { + first = mid + 1; + continue; + } + last = mid; + } + + *pos = first; + return 0; +} + +static void write_graph_chunk_data(struct sha1file *f, int hash_len, + struct commit **commits, int nr_commits) +{ + struct commit **list = commits; + struct commit **last = commits + nr_commits; + uint32_t num_large_edges = 0; + + while (list < last) { + s
[PATCH v2 14/14] commit-graph: build graph from starting commits
Teach git-commit-graph to read commits from stdin when the --stdin-commits flag is specified. Commits reachable from these commits are added to the graph. This is a much faster way to construct the graph than inspecting all packed objects, but is restricted to known tips. For the Linux repository, 700,000+ commits were added to the graph file starting from 'master' in 7-9 seconds, depending on the number of packfiles in the repo (1, 24, or 120). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 7 ++- builtin/commit-graph.c | 34 +- commit-graph.c | 26 +++--- commit-graph.h | 4 +++- t/t5318-commit-graph.sh| 18 ++ 5 files changed, 75 insertions(+), 14 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index d0571cd896..3357c0cf8f 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -46,7 +46,12 @@ OPTIONS --stdin-packs:: When used with --write, generate the new graph by walking objects only in the specified packfiles and any commits in the - existing graph-head. + existing graph-head. (Cannot be combined with --stdin-commits.) + +--stdin-commits:: + When used with --write, generate the new graph by walking commits + starting at the commits specified in stdin as a list of OIDs in + hex, one OID per line. (Cannot be combined with --stdin-packs.) 
EXAMPLES diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 80a409e784..adc05f0582 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -12,7 +12,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--pack-dir ]"), N_("git commit-graph --clear [--pack-dir ]"), N_("git commit-graph --read [--graph-hash=]"), - N_("git commit-graph --write [--pack-dir ] [--update-head] [--delete-expired] [--stdin-packs]"), + N_("git commit-graph --write [--pack-dir ] [--update-head] [--delete-expired] [--stdin-packs|--stdin-commits]"), NULL }; @@ -25,6 +25,7 @@ static struct opts_commit_graph { int update_head; int delete_expired; int stdin_packs; + int stdin_commits; int has_existing; struct object_id old_graph_hash; } opts; @@ -117,23 +118,36 @@ static int graph_write(void) { struct object_id *graph_hash; char **pack_indexes = NULL; + char **commits = NULL; int num_packs = 0; - int size_packs = 0; + int num_commits = 0; + char **lines = NULL; + int num_lines = 0; + int size_lines = 0; - if (opts.stdin_packs) { + if (opts.stdin_packs || opts.stdin_commits) { struct strbuf buf = STRBUF_INIT; - size_packs = 128; - ALLOC_ARRAY(pack_indexes, size_packs); + size_lines = 128; + ALLOC_ARRAY(lines, size_lines); while (strbuf_getline(, stdin) != EOF) { - ALLOC_GROW(pack_indexes, num_packs + 1, size_packs); - pack_indexes[num_packs++] = buf.buf; + ALLOC_GROW(lines, num_lines + 1, size_lines); + lines[num_lines++] = buf.buf; strbuf_detach(, NULL); } - } - graph_hash = construct_commit_graph(opts.pack_dir, pack_indexes, num_packs); + if (opts.stdin_packs) { + pack_indexes = lines; + num_packs = num_lines; + } + if (opts.stdin_commits) { + commits = lines; + num_commits = num_lines; + } + } + graph_hash = construct_commit_graph(opts.pack_dir, pack_indexes, num_packs, + commits, num_commits); if (opts.update_head) update_head_file(opts.pack_dir, graph_hash); @@ -172,6 +186,8 @@ int cmd_commit_graph(int argc, const char **argv, const 
char *prefix) N_("delete expired head graph file")), OPT_BOOL('s', "stdin-packs", _packs, N_("only scan packfiles listed by stdin")), + OPT_BOOL('C', "stdin-commits", _commits, + N_("start walk at commits listed by stdin")), { OPTION_STRING, 'H', "graph-hash", _hash, N_("hash"), N_("A hash for a specific graph file in the pack-dir."), diff --git a/commit-graph.c b/commit-graph.c index cfa0415a21..7f31a6c795 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -66
Re: [PATCH v2 10/27] protocol: introduce enum protocol_version value protocol_v2
On 1/25/2018 6:58 PM, Brandon Williams wrote: Introduce protocol_v2, a new value for 'enum protocol_version'. Subsequent patches will fill in the implementation of protocol_v2. Signed-off-by: Brandon Williams--- builtin/fetch-pack.c | 3 +++ builtin/receive-pack.c | 6 ++ builtin/send-pack.c| 3 +++ builtin/upload-pack.c | 7 +++ connect.c | 3 +++ protocol.c | 2 ++ protocol.h | 1 + remote-curl.c | 3 +++ transport.c| 9 + 9 files changed, 37 insertions(+) diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c index 85d4faf76..f492e8abd 100644 --- a/builtin/fetch-pack.c +++ b/builtin/fetch-pack.c @@ -201,6 +201,9 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix) PACKET_READ_GENTLE_ON_EOF); switch (discover_version()) { + case protocol_v2: + die("support for protocol v2 not implemented yet"); + break; case protocol_v1: case protocol_v0: get_remote_heads(, , 0, NULL, ); diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c index b7ce7c7f5..3656e94fd 100644 --- a/builtin/receive-pack.c +++ b/builtin/receive-pack.c @@ -1963,6 +1963,12 @@ int cmd_receive_pack(int argc, const char **argv, const char *prefix) unpack_limit = receive_unpack_limit; switch (determine_protocol_version_server()) { + case protocol_v2: + /* +* push support for protocol v2 has not been implemented yet, +* so ignore the request to use v2 and fallback to using v0. 
+*/ + break; case protocol_v1: /* * v1 is just the original protocol with a version string, diff --git a/builtin/send-pack.c b/builtin/send-pack.c index 83cb125a6..b5427f75e 100644 --- a/builtin/send-pack.c +++ b/builtin/send-pack.c @@ -263,6 +263,9 @@ int cmd_send_pack(int argc, const char **argv, const char *prefix) PACKET_READ_GENTLE_ON_EOF); switch (discover_version()) { + case protocol_v2: + die("support for protocol v2 not implemented yet"); + break; case protocol_v1: case protocol_v0: get_remote_heads(, _refs, REF_NORMAL, diff --git a/builtin/upload-pack.c b/builtin/upload-pack.c index 2cb5cb35b..8d53e9794 100644 --- a/builtin/upload-pack.c +++ b/builtin/upload-pack.c @@ -47,6 +47,13 @@ int cmd_upload_pack(int argc, const char **argv, const char *prefix) die("'%s' does not appear to be a git repository", dir); switch (determine_protocol_version_server()) { + case protocol_v2: + /* +* fetch support for protocol v2 has not been implemented yet, +* so ignore the request to use v2 and fallback to using v0. 
+*/ + upload_pack(); + break; case protocol_v1: /* * v1 is just the original protocol with a version string, diff --git a/connect.c b/connect.c index db3c9d24c..f2157a821 100644 --- a/connect.c +++ b/connect.c @@ -84,6 +84,9 @@ enum protocol_version discover_version(struct packet_reader *reader) /* Maybe process capabilities here, at least for v2 */ switch (version) { + case protocol_v2: + die("support for protocol v2 not implemented yet"); + break; case protocol_v1: /* Read the peeked version line */ packet_reader_read(reader); diff --git a/protocol.c b/protocol.c index 43012b7eb..5e636785d 100644 --- a/protocol.c +++ b/protocol.c @@ -8,6 +8,8 @@ static enum protocol_version parse_protocol_version(const char *value) return protocol_v0; else if (!strcmp(value, "1")) return protocol_v1; + else if (!strcmp(value, "2")) + return protocol_v2; else return protocol_unknown_version; } diff --git a/protocol.h b/protocol.h index 1b2bc94a8..2ad35e433 100644 --- a/protocol.h +++ b/protocol.h @@ -5,6 +5,7 @@ enum protocol_version { protocol_unknown_version = -1, protocol_v0 = 0, protocol_v1 = 1, + protocol_v2 = 2, }; /* diff --git a/remote-curl.c b/remote-curl.c index 9f6d07683..dae8a4a48 100644 --- a/remote-curl.c +++ b/remote-curl.c @@ -185,6 +185,9 @@ static struct ref *parse_git_refs(struct discovery *heads, int for_push) PACKET_READ_GENTLE_ON_EOF); switch (discover_version()) { + case protocol_v2: + die("support for protocol v2 not implemented yet"); + break; case protocol_v1: case protocol_v0: get_remote_heads(, , for_push ? REF_NORMAL : 0, diff --git a/transport.c b/transport.c index 2378dcb38..83d9dd1df 100644 ---
Re: [PATCH v2 09/27] transport: store protocol version
On 1/25/2018 6:58 PM, Brandon Williams wrote:
> +	switch (data->version) {
> +	case protocol_v1:
> +	case protocol_v0:
> +		refs = fetch_pack(&args, data->fd, data->conn,
> +				  refs_tmp ? refs_tmp : transport->remote_refs,
> +				  dest, to_fetch, nr_heads, &data->shallow,
> +				  &transport->pack_lockfile);
> +		break;
> +	case protocol_unknown_version:
> +		BUG("unknown protocol version");
> +	}

After seeing this pattern a few times, I think it would be good to
convert it to a macro that calls a statement for protocol_v1/v0 (and
later calls a different one for protocol_v2). It would at minimum reduce
the code clones surrounding this handling of unknown_version, and we
could have one place that is clear this BUG() is due to an unexpected
response from discover_version().
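Editorial sketch of what such a macro could look like. This is only one possible shape, not something from the series: the macro name SWITCH_ON_PROTOCOL_VERSION and the helper handled_by_v0_v1() are hypothetical, BUG() is replaced by a simplified stand-in, and the enum mirrors protocol.h as it stood before protocol_v2 was added.

```c
#include <stdio.h>
#include <stdlib.h>

enum protocol_version {
	protocol_unknown_version = -1,
	protocol_v0 = 0,
	protocol_v1 = 1,
};

/* Simplified stand-in for Git's BUG(), which also reports file and line. */
#define BUG(msg) do { fprintf(stderr, "BUG: %s\n", msg); abort(); } while (0)

/*
 * Hypothetical macro: run `stmt` for protocol v0/v1 and treat every
 * other value as a programming error, so the BUG() lives in one place.
 * `stmt` must not contain an unparenthesized comma at top level.
 */
#define SWITCH_ON_PROTOCOL_VERSION(version, stmt) \
	do { \
		switch (version) { \
		case protocol_v1: \
		case protocol_v0: \
			stmt; \
			break; \
		case protocol_unknown_version: \
		default: \
			BUG("unknown protocol version"); \
		} \
	} while (0)

/* Example caller: returns 1 when the v0/v1 statement ran. */
static int handled_by_v0_v1(enum protocol_version version)
{
	int handled = 0;

	SWITCH_ON_PROTOCOL_VERSION(version, handled = 1);
	return handled;
}
```

A later protocol_v2 variant would need a second statement argument. The comma restriction noted in the comment is a real cost of the macro approach, since calls such as fetch_pack(...) with many arguments would have to be wrapped in an extra pair of parentheses or moved into a helper function.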
Re: [PATCH v2 05/27] upload-pack: factor out processing lines
On 1/26/2018 4:33 PM, Brandon Williams wrote:
> On 01/26, Stefan Beller wrote:
>> On Thu, Jan 25, 2018 at 3:58 PM, Brandon Williams wrote:
>>> Factor out the logic for processing shallow, deepen, deepen_since, and
>>> deepen_not lines into their own functions to simplify the
>>> 'receive_needs()' function in addition to making it easier to reuse
>>> some of this logic when implementing protocol_v2.
>>>
>>> Signed-off-by: Brandon Williams
>>> ---
>>>  upload-pack.c | 113 ++
>>>  1 file changed, 74 insertions(+), 39 deletions(-)
>>>
>>> diff --git a/upload-pack.c b/upload-pack.c
>>> index 2ad73a98b..42d83d5b1 100644
>>> --- a/upload-pack.c
>>> +++ b/upload-pack.c
>>> @@ -724,6 +724,75 @@ static void deepen_by_rev_list(int ac, const char **av,
>>>  	packet_flush(1);
>>>  }
>>>
>>> +static int process_shallow(const char *line, struct object_array *shallows)
>>> +{
>>> +	const char *arg;
>>> +	if (skip_prefix(line, "shallow ", &arg)) {
>>
>> stylistic nit:
>>
>> You could invert the condition in each of the process_* functions
>> to just have
>>
>>     if (!skip_prefix...))
>>         return 0
>>
>>     /* less indented code goes here */
>>
>>     return 1;
>>
>> That way we have less indentation as well as easier code.
>> (The reader doesn't need to keep in mind what the else
>> part is about; it is a rather local decision to bail out instead
>> of having the return at the end of the function.)
>
> I was trying to move the existing code into helper functions so
> rewriting them in transit may make it less reviewable?

I think the way you kept to the existing code as much as possible is
good and easier to review. Perhaps a style pass after the patch lands is
good for #leftoverbits.

>>> +		struct object_id oid;
>>> +		struct object *object;
>>> +		if (get_oid_hex(arg, &oid))
>>> +			die("invalid shallow line: %s", line);
>>> +		object = parse_object(&oid);
>>> +		if (!object)
>>> +			return 1;
>>> +		if (object->type != OBJ_COMMIT)
>>> +			die("invalid shallow object %s", oid_to_hex(&oid));
>>> +		if (!(object->flags & CLIENT_SHALLOW)) {
>>> +			object->flags |= CLIENT_SHALLOW;
>>> +			add_object_array(object, NULL, shallows);
>>> +		}
>>> +		return 1;
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int process_deepen(const char *line, int *depth)
>>> +{
>>> +	const char *arg;
>>> +	if (skip_prefix(line, "deepen ", &arg)) {
>>> +		char *end = NULL;
>>> +		*depth = (int) strtol(arg, &end, 0);
>>
>> nit: space between (int) and strtol?
>>
>>> +		if (!end || *end || *depth <= 0)
>>> +			die("Invalid deepen: %s", line);
>>> +		return 1;
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int process_deepen_since(const char *line, timestamp_t *deepen_since, int *deepen_rev_list)
>>> +{
>>> +	const char *arg;
>>> +	if (skip_prefix(line, "deepen-since ", &arg)) {
>>> +		char *end = NULL;
>>> +		*deepen_since = parse_timestamp(arg, &end, 0);
>>> +		if (!end || *end || !deepen_since ||
>>> +		    /* revisions.c's max_age -1 is special */
>>> +		    *deepen_since == -1)
>>> +			die("Invalid deepen-since: %s", line);
>>> +		*deepen_rev_list = 1;
>>> +		return 1;
>>> +	}
>>> +	return 0;
>>> +}
>>> +
>>> +static int process_deepen_not(const char *line, struct string_list *deepen_not, int *deepen_rev_list)
>>> +{
>>> +	const char *arg;
>>> +	if (skip_prefix(line, "deepen-not ", &arg)) {
>>> +		char *ref = NULL;
>>> +		struct object_id oid;
>>> +		if (expand_ref(arg, strlen(arg), &oid, &ref) != 1)
>>> +			die("git upload-pack: ambiguous deepen-not: %s", line);
>>> +		string_list_append(deepen_not, ref);
>>> +		free(ref);
>>> +		*deepen_rev_list = 1;
>>> +		return 1;
>>> +	}
>>> +	return 0;
>>> +}
>>> +
>>>  static void receive_needs(void)
>>>  {
>>>  	struct object_array shallows = OBJECT_ARRAY_INIT;
>>> @@ -745,49 +814,15 @@ static void receive_needs(void)
>>>  		if (!line)
>>>  			break;
>>>
>>> -		if (skip_prefix(line, "shallow ", &arg)) {
>>> -			struct object_id oid;
>>> -			struct object *object;
>>> -			if (get_oid_hex(arg, &oid))
>>> -				die("invalid shallow line: %s", line);
>>> -			object = parse_object(&oid);
>>> -			if (!object)
>>> -				continue;
>>> -			if (object->type != OBJ_COMMIT)
>>> -				die("invalid shallow object %s", oid_to_hex(&oid));
>>> -			if (!(object->flags & CLIENT_SHALLOW)) {
>>> -				object->flags |= CLIENT_SHALLOW;
>>> -				add_object_array(object, NULL, &shallows);
>>> -			}
>>> +		if (process_shallow(line, &shallows))
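Editorial sketch of the early-return style suggested in the review, applied to process_deepen(). It is a standalone illustration rather than a proposed replacement: skip_prefix_sketch() is a simplified stand-in for Git's skip_prefix(), the function names are hypothetical, and where the patch calls die() this version returns -1 so that it can run outside of Git.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for Git's skip_prefix(): on a match, *out points past the prefix. */
static int skip_prefix_sketch(const char *str, const char *prefix, const char **out)
{
	size_t len = strlen(prefix);

	if (strncmp(str, prefix, len))
		return 0;
	*out = str + len;
	return 1;
}

/*
 * Early-return variant of process_deepen(). Returns 0 if `line` is not
 * a "deepen" line, 1 if it was consumed, and -1 where the patch die()s.
 */
static int process_deepen_sketch(const char *line, int *depth)
{
	const char *arg;
	char *end = NULL;

	if (!skip_prefix_sketch(line, "deepen ", &arg))
		return 0;

	/* From here on the line is known to be ours; no extra nesting needed. */
	*depth = (int)strtol(arg, &end, 0);
	if (!end || *end || *depth <= 0)
		return -1;
	return 1;
}
```

The behavior matches the nested form; only the indentation and the position of the "not my line" exit change, which is why deferring it to a separate style pass keeps the code-movement patch easy to review.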
Re: [PATCH v2 08/27] connect: discover protocol version outside of get_remote_heads
On 1/25/2018 6:58 PM, Brandon Williams wrote: In order to prepare for the addition of protocol_v2 push the protocol version discovery outside of 'get_remote_heads()'. This will allow for keeping the logic for processing the reference advertisement for protocol_v1 and protocol_v0 separate from the logic for protocol_v2. Signed-off-by: Brandon Williams--- builtin/fetch-pack.c | 16 +++- builtin/send-pack.c | 17 +++-- connect.c| 27 ++- connect.h| 3 +++ remote-curl.c| 20 ++-- remote.h | 5 +++-- transport.c | 24 +++- 7 files changed, 83 insertions(+), 29 deletions(-) diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c index 366b9d13f..85d4faf76 100644 --- a/builtin/fetch-pack.c +++ b/builtin/fetch-pack.c @@ -4,6 +4,7 @@ #include "remote.h" #include "connect.h" #include "sha1-array.h" +#include "protocol.h" static const char fetch_pack_usage[] = "git fetch-pack [--all] [--stdin] [--quiet | -q] [--keep | -k] [--thin] " @@ -52,6 +53,7 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix) struct fetch_pack_args args; struct oid_array shallow = OID_ARRAY_INIT; struct string_list deepen_not = STRING_LIST_INIT_DUP; + struct packet_reader reader; packet_trace_identity("fetch-pack"); @@ -193,7 +195,19 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix) if (!conn) return args.diag_url ? 0 : 1; } - get_remote_heads(fd[0], NULL, 0, , 0, NULL, ); + + packet_reader_init(, fd[0], NULL, 0, + PACKET_READ_CHOMP_NEWLINE | + PACKET_READ_GENTLE_ON_EOF); + + switch (discover_version()) { + case protocol_v1: + case protocol_v0: + get_remote_heads(, , 0, NULL, ); + break; + case protocol_unknown_version: + BUG("unknown protocol version"); Is this really a BUG in the client, or a bug/incompatibility in the server? Perhaps I'm misunderstanding, but it looks like discover_version() will die() on an unknown version (the die() is in protocol.c:determine_protocol_version_client()). So maybe that's why this is a BUG()? 
If there is something to change here, this BUG() appears three more times. + } ref = fetch_pack(, fd, conn, ref, dest, sought, nr_sought, , pack_lockfile_ptr); diff --git a/builtin/send-pack.c b/builtin/send-pack.c index fc4f0bb5f..83cb125a6 100644 --- a/builtin/send-pack.c +++ b/builtin/send-pack.c @@ -14,6 +14,7 @@ #include "sha1-array.h" #include "gpg-interface.h" #include "gettext.h" +#include "protocol.h" static const char * const send_pack_usage[] = { N_("git send-pack [--all | --mirror] [--dry-run] [--force] " @@ -154,6 +155,7 @@ int cmd_send_pack(int argc, const char **argv, const char *prefix) int progress = -1; int from_stdin = 0; struct push_cas_option cas = {0}; + struct packet_reader reader; struct option options[] = { OPT__VERBOSITY(), @@ -256,8 +258,19 @@ int cmd_send_pack(int argc, const char **argv, const char *prefix) args.verbose ? CONNECT_VERBOSE : 0); } - get_remote_heads(fd[0], NULL, 0, _refs, REF_NORMAL, -_have, ); + packet_reader_init(, fd[0], NULL, 0, + PACKET_READ_CHOMP_NEWLINE | + PACKET_READ_GENTLE_ON_EOF); + + switch (discover_version()) { + case protocol_v1: + case protocol_v0: + get_remote_heads(, _refs, REF_NORMAL, +_have, ); + break; + case protocol_unknown_version: + BUG("unknown protocol version"); + } transport_verify_remote_names(nr_refspecs, refspecs); diff --git a/connect.c b/connect.c index 00e90075c..db3c9d24c 100644 --- a/connect.c +++ b/connect.c @@ -62,7 +62,7 @@ static void die_initial_contact(int unexpected) "and the repository exists.")); } -static enum protocol_version discover_version(struct packet_reader *reader) +enum protocol_version discover_version(struct packet_reader *reader) { enum protocol_version version = protocol_unknown_version; @@ -234,7 +234,7 @@ enum get_remote_heads_state { /* * Read all the refs from the other end */ -struct ref **get_remote_heads(int in, char *src_buf, size_t src_len, +struct ref **get_remote_heads(struct packet_reader *reader, struct ref **list, unsigned int flags, struct 
oid_array *extra_have, struct oid_array *shallow_points) @@ -242,24 +242,17 @@ struct ref **get_remote_heads(int in, char *src_buf, size_t
Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
On 2/2/2018 5:48 PM, Junio C Hamano wrote:
> Stefan Beller writes:
>
>> It is true for git-submodule and a few others (the minority of
>> commands IIRC) git-tag for example takes subcommands such as --list
>> or --verify.
>>
>> https://public-inbox.org/git/xmqqiomodkt9@gitster.dls.corp.google.com/
>
> Thanks. It refers to an article at gmane, which is not readily
> accessible unless you use newsreader. The original discussion it
> refers to appears at:
>
> https://public-inbox.org/git/7vbo5itjfl@alter.siamese.dyndns.org/
>
> for those who are interested.

Thanks for the links.

> I am still not sure if it is a good design to add a new command like
> this series does, though. I would naively have expected that this
> would be a new pack index format that is produced by pack-objects and
> index-pack, for example, in which case its maintenance would almost
> be invisible to end users (i.e. just like how the pack bitmap feature
> was added to the system).

I agree that the medium-term goal is to have this happen without user
intervention. Something like a "core.autoCommitGraph" setting to trigger
commit-graph writes during other cleanup activities, such as a repack or
a gc.

I don't think pairing this with pack-objects or index-pack is a good
direction, because the commit graph is not locked into a packfile the
way the bitmap is. In fact, the entire ODB could be replaced
independently and the graph is still valid (the commits in the graph may
no longer have their paired commits in the ODB due to a GC; you should
never navigate to those commits without having a ref pointing to them,
so this is not immediately a problem).

This sort of interaction with GC is one reason why I did not include the
automatic updates in this patch. The integration with existing
maintenance tasks will be worth discussion in its own right. I'd rather
demonstrate the value of having a graph (even if it is currently
maintained manually) and then follow up with a focus to integrate with
repack, gc, etc.

I plan to clean up this patch on Monday given the feedback I received the last two days (Thanks Jonathan and Szeder!). However, if the current builtin design will block merging, then I'll wait until we can find one that works. Thanks, -Stolee
Re: [PATCH 0/2] Refactor hash search with fanout table
On 2/2/2018 6:30 PM, Junio C Hamano wrote: Jonathan Tan <jonathanta...@google.com> writes: After reviewing Derrick's Serialized Git Commit Graph patches [1], I noticed that "[PATCH v2 11/14] commit: integrate commit graph with commit parsing" contains (in bsearch_graph) a repeat of some packfile functionality. Here is a pack that refactors that functionality out. Yay. I had exactly the same reaction to that part of the series. Thanks for doing this refactor. I'm a fan of reducing code clones, but also don't want to break well-worn code paths. Jonathan: While you are doing this, I'm guessing you could use your new method to replace (and maybe speed up) the binary search in sha1_name.c:find_abbrev_len_for_pack(). Otherwise, I can take a stab at it next week. Please add Reviewed-by: Derrick Stolee <dsto...@microsoft.com> Thanks, -Stolee
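Editorial sketch of the shared functionality under discussion: a 256-entry fanout table keyed on the first OID byte narrows the range that the binary search has to cover. This is an illustration under stated assumptions, not the refactored Git function. The names build_fanout() and bsearch_fanout() are hypothetical, OIDs are assumed to be 20-byte SHA-1 values stored back to back in sorted order, and the table is kept in host byte order (the on-disk formats store it in network byte order).

```c
#include <stdint.h>
#include <string.h>

#define OID_LEN 20

/*
 * Build a cumulative fanout: fanout[b] is the number of OIDs whose
 * first byte is <= b. `oids` holds nr sorted OIDs of OID_LEN bytes each.
 */
static void build_fanout(const unsigned char *oids, uint32_t nr,
			 uint32_t fanout[256])
{
	uint32_t i = 0;
	int b;

	for (b = 0; b < 256; b++) {
		while (i < nr && oids[(size_t)i * OID_LEN] == b)
			i++;
		fanout[b] = i;
	}
}

/*
 * Look up `oid`. Returns 1 and sets *pos to its index when found,
 * otherwise returns 0 and sets *pos to where it would be inserted.
 */
static int bsearch_fanout(const unsigned char *oid,
			  const uint32_t fanout[256],
			  const unsigned char *oids, uint32_t *pos)
{
	/* Only entries sharing the first byte can match. */
	uint32_t lo = oid[0] ? fanout[oid[0] - 1] : 0;
	uint32_t hi = fanout[oid[0]];

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, oids + (size_t)mid * OID_LEN, OID_LEN);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp > 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	*pos = lo;
	return 0;
}
```

With evenly distributed hashes the fanout step cuts the search range by a factor of 256, which is the "eight extra binary search iterations" saving mentioned in the commit-graph writer's comment.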
Re: [PATCH v2 12/27] serve: introduce git-serve
On 1/25/2018 6:58 PM, Brandon Williams wrote: Introduce git-serve, the base server for protocol version 2. Protocol version 2 is intended to be a replacement for Git's current wire protocol. The intention is that it will be a simpler, less wasteful protocol which can evolve over time. Protocol version 2 improves upon version 1 by eliminating the initial ref advertisement. In its place a server will export a list of capabilities and commands which it supports in a capability advertisement. A client can then request that a particular command be executed by providing a number of capabilities and command specific parameters. At the completion of a command, a client can request that another command be executed or can terminate the connection by sending a flush packet. Signed-off-by: Brandon Williams--- .gitignore | 1 + Documentation/technical/protocol-v2.txt | 117 +++ Makefile| 2 + builtin.h | 1 + builtin/serve.c | 30 git.c | 1 + serve.c | 249 serve.h | 15 ++ t/t5701-git-serve.sh| 56 +++ 9 files changed, 472 insertions(+) create mode 100644 Documentation/technical/protocol-v2.txt create mode 100644 builtin/serve.c create mode 100644 serve.c create mode 100644 serve.h create mode 100755 t/t5701-git-serve.sh diff --git a/.gitignore b/.gitignore index 833ef3b0b..2d0450c26 100644 --- a/.gitignore +++ b/.gitignore @@ -140,6 +140,7 @@ /git-rm /git-send-email /git-send-pack +/git-serve /git-sh-i18n /git-sh-i18n--envsubst /git-sh-setup diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt new file mode 100644 index 0..7f619a76c --- /dev/null +++ b/Documentation/technical/protocol-v2.txt @@ -0,0 +1,117 @@ + Git Wire Protocol, Version 2 +== + +This document presents a specification for a version 2 of Git's wire +protocol. Protocol v2 will improve upon v1 in the following ways: + + * Instead of multiple service names, multiple commands will be +supported by a single service. 
As someone unfamiliar with the old protocol code, this statement is underselling the architectural significance of your change. The new model allows a single service to handle all different wire protocols (git://, ssh://, https://) while being agnostic to the command-specific logic. It also hides the protocol negotiation away from these consumers. The ease with which you are adding new commands in later commits really demonstrates the value of this patch. To make that point here, you would almost need to document the old model to show how it was difficult to use and extend. Perhaps this document will not need expanding since the code speaks for itself. I just wanted to state for the record that the new architecture is a big improvement and will make more commands much easier to implement. + * Easily extendable as capabilities are moved into their own section +of the protocol, no longer being hidden behind a NUL byte and +limited by the size of a pkt-line (as there will be a single +capability per pkt-line). + * Separate out other information hidden behind NUL bytes (e.g. agent +string as a capability and symrefs can be requested using 'ls-refs') + * Reference advertisement will be omitted unless explicitly requested + * ls-refs command to explicitly request some refs + nit: some bullets have full stops (.) and others do not. + Detailed Design += + +A client can request to speak protocol v2 by sending `version=2` in the +side-channel `GIT_PROTOCOL` in the initial request to the server. + +In protocol v2 communication is command oriented. When first contacting a +server a list of capabilities will advertised. Some of these capabilities +will be commands which a client can request be executed. Once a command +has completed, a client can reuse the connection and request that other +commands be executed. 
+ + Special Packets +- + +In protocol v2 these special packets will have the following semantics: + + * '0000' Flush Packet (flush-pkt) - indicates the end of a message + * '0001' Delimiter Packet (delim-pkt) - separates sections of a message + + Capability Advertisement +-- + +A server which decides to communicate (based on a request from a client) +using protocol version 2, notifies the client by sending a version string +in its initial response followed by an advertisement of its capabilities. +Each capability is a key with an optional value. Clients must ignore all +unknown keys. Semantics of unknown values are left to the definition of +each key. Some capabilities will describe commands which can be
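For readers new to the pkt-line framing that the two special packets above belong to, here is a minimal sketch. It assumes the usual pkt-line layout (a 4-hex-digit length that counts its own four bytes, followed by the payload); the helper names `pkt_line_format` and `pkt_line_header` are made up for this illustration and are not functions from pkt-line.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PKT_HEADER_LEN 4
#define PKT_MAX_LEN 65520 /* largest total pkt-line size, header included */

/* Special packets carry no payload; their length field is a reserved value. */
static const char FLUSH_PKT[] = "0000"; /* indicates the end of a message */
static const char DELIM_PKT[] = "0001"; /* separates sections of a message */

/*
 * Hypothetical helper: frame `payload` as one pkt-line in `out`.
 * Returns the number of bytes written (not counting the trailing NUL),
 * or -1 if the payload is empty, too large, or does not fit in `out`.
 */
static int pkt_line_format(char *out, size_t outsz, const char *payload)
{
	size_t total;

	assert(out && payload);
	total = strlen(payload) + PKT_HEADER_LEN;
	if (total == PKT_HEADER_LEN || total > PKT_MAX_LEN || total + 1 > outsz)
		return -1;
	snprintf(out, outsz, "%04zx%s", total, payload);
	return (int)total;
}

/*
 * Hypothetical helper: decode a 4-hex-digit header.
 * Returns 0 for flush-pkt, 1 for delim-pkt, otherwise the total length
 * of the pkt-line, or -1 if the header is not valid hex.
 */
static int pkt_line_header(const char *hdr)
{
	unsigned int len;

	assert(hdr);
	if (strlen(hdr) < PKT_HEADER_LEN || sscanf(hdr, "%4x", &len) != 1)
		return -1;
	return (int)len;
}
```

Because 0000 and 0001 can never be the length of a real pkt-line (the header alone is four bytes), a reader can tell them apart from data just by looking at the header.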
Re: [PATCH v2 00/27] protocol version 2
Sorry for chiming in with mostly nitpicks so late since sending this version. Mostly, I tried to read it to see if I could understand the scope of the patch and how this code worked before. It looks very polished, so the nits were the best I could do. On 1/25/2018 6:58 PM, Brandon Williams wrote: Changes in v2: * Added documentation for fetch * changes #defines for state variables to be enums * couple code changes to pkt-line functions and documentation * Added unit tests for the git-serve binary as well as for ls-refs I'm a fan of more unit-level testing, and I think that will be more important as we go on with these multiple configuration options. Areas for improvement * Push isn't implemented, right now this is ok because if v2 is requested the server can just default to v0. Before this can be merged we may want to change how the client requests a new protocol, and not allow for sending "version=2" when pushing even though the user has it configured. Or maybe it's fine to just have an older client who doesn't understand how to push (and request v2) to die if the server tries to speak v2 at it. Fixing this essentially would just require piping through a bit more information to the function which ultimately runs connect (for both builtins and remote-curl) Definitely save push for a later patch. Getting 'fetch' online did require 'ls-refs' at the same time. Future reviews will be easier when adding one command at a time. * I want to make sure that the docs are well written before this gets merged so I'm hoping that someone can do a thorough review on the docs themselves to make sure they are clear. I made a comment in the docs about the architectural changes. While I think a discussion on that topic would be valuable, I'm not sure that's the point of the document (i.e. documenting what v2 does versus selling the value of the patch). I thought the docs were clear for how the commands work.
* Right now there is a capability 'stateless-rpc' which essentially makes sure that a server command completes after a single round (this is to make sure http works cleanly). After talking with some folks it may make more sense to just have v2 be stateless in nature so that all commands terminate after a single round trip. This makes things a bit easier if a server wants to have ssh just be a proxy for http. One potential thing would be to flip this so that by default the protocol is stateless and if a server/command has a state-full mode that can be implemented as a capability at a later point. Thoughts? At minimum, all commands should be designed with a "stateless first" philosophy since a large number of users communicate via HTTP[S] and any decisions that make stateless communication painful should be rejected. * Shallow repositories and shallow clones aren't supported yet. I'm working on it and it can be either added to v2 by default if people think it needs to be in there from the start, or we can add it as a capability at a later point. I'm happy to say the following: 1. Shallow repositories should not be used for servers, since they cannot service all requests. 2. Since v2 has easy capability features, I'm happy to leave shallow for later. We will want to verify that a shallow clone command reverts to v1. I fetched bw/protocol-v2 with tip 13c70148, built, set 'protocol.version=2' in the config, and tested fetches against GitHub and VSTS just as a compatibility test. Everything worked just fine. Is there an easy way to test the existing test suite for clone and fetch using protocol v2 to make sure there are no regressions with protocol.version=2 in the config? Thanks, -Stolee
Re: [PATCH v2 14/27] connect: request remote refs using v2
On 1/25/2018 6:58 PM, Brandon Williams wrote: Teach the client to be able to request a remote's refs using protocol v2. This is done by having a client issue a 'ls-refs' request to a v2 server. Signed-off-by: Brandon Williams--- builtin/upload-pack.c | 10 ++-- connect.c | 123 - remote.h | 4 ++ t/t5702-protocol-v2.sh | 28 +++ transport.c| 2 +- 5 files changed, 160 insertions(+), 7 deletions(-) create mode 100755 t/t5702-protocol-v2.sh diff --git a/builtin/upload-pack.c b/builtin/upload-pack.c index 8d53e9794..a757df8da 100644 --- a/builtin/upload-pack.c +++ b/builtin/upload-pack.c @@ -5,6 +5,7 @@ #include "parse-options.h" #include "protocol.h" #include "upload-pack.h" +#include "serve.h" static const char * const upload_pack_usage[] = { N_("git upload-pack [] "), @@ -16,6 +17,7 @@ int cmd_upload_pack(int argc, const char **argv, const char *prefix) const char *dir; int strict = 0; struct upload_pack_options opts = { 0 }; + struct serve_options serve_opts = SERVE_OPTIONS_INIT; struct option options[] = { OPT_BOOL(0, "stateless-rpc", _rpc, N_("quit after a single request/response exchange")), @@ -48,11 +50,9 @@ int cmd_upload_pack(int argc, const char **argv, const char *prefix) switch (determine_protocol_version_server()) { case protocol_v2: - /* -* fetch support for protocol v2 has not been implemented yet, -* so ignore the request to use v2 and fallback to using v0. 
-*/ - upload_pack(); + serve_opts.advertise_capabilities = opts.advertise_refs; + serve_opts.stateless_rpc = opts.stateless_rpc; + serve(_opts); break; case protocol_v1: /* diff --git a/connect.c b/connect.c index f2157a821..3c653b65b 100644 --- a/connect.c +++ b/connect.c @@ -12,9 +12,11 @@ #include "sha1-array.h" #include "transport.h" #include "strbuf.h" +#include "version.h" #include "protocol.h" static char *server_capabilities; +static struct argv_array server_capabilities_v2 = ARGV_ARRAY_INIT; static const char *parse_feature_value(const char *, const char *, int *); static int check_ref(const char *name, unsigned int flags) @@ -62,6 +64,33 @@ static void die_initial_contact(int unexpected) "and the repository exists.")); } +/* Checks if the server supports the capability 'c' */ +static int server_supports_v2(const char *c, int die_on_error) +{ + int i; + + for (i = 0; i < server_capabilities_v2.argc; i++) { + const char *out; + if (skip_prefix(server_capabilities_v2.argv[i], c, ) && + (!*out || *out == '=')) + return 1; + } + + if (die_on_error) + die("server doesn't support '%s'", c); + + return 0; +} + +static void process_capabilities_v2(struct packet_reader *reader) +{ + while (packet_reader_read(reader) == PACKET_READ_NORMAL) + argv_array_push(_capabilities_v2, reader->line); + + if (reader->status != PACKET_READ_FLUSH) + die("protocol error"); +} + enum protocol_version discover_version(struct packet_reader *reader) { enum protocol_version version = protocol_unknown_version; @@ -85,7 +114,7 @@ enum protocol_version discover_version(struct packet_reader *reader) /* Maybe process capabilities here, at least for v2 */ switch (version) { case protocol_v2: - die("support for protocol v2 not implemented yet"); + process_capabilities_v2(reader); break; case protocol_v1: /* Read the peeked version line */ @@ -293,6 +322,98 @@ struct ref **get_remote_heads(struct packet_reader *reader, return list; } +static int process_ref_v2(const char *line, struct ref 
***list) +{ + int ret = 1; + int i = 0; nit: you set 'i' here, but first use it in a for loop with blank initializer. Perhaps keep the first assignment closer to the first use? + struct object_id old_oid; + struct ref *ref; + struct string_list line_sections = STRING_LIST_INIT_DUP; + + if (string_list_split(_sections, line, ' ', -1) < 2) { + ret = 0; + goto out; + } + + if (get_oid_hex(line_sections.items[i++].string, _oid)) { + ret = 0; + goto out; + } + + ref = alloc_ref(line_sections.items[i++].string); + + oidcpy(>old_oid, _oid); + **list = ref; + *list = >next; + + for (; i < line_sections.nr; i++) { + const char *arg = line_sections.items[i].string; + if (skip_prefix(arg,
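The capability check in server_supports_v2() above relies on a prefix match followed by a test of the next character. A standalone sketch of that idea, without Git's skip_prefix() or argv_array, might look like the following; `after_prefix` and `caps_has` are names invented for this example, and the capability strings in the usage note are only illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Stand-in for the prefix test used in the patch: if `str` starts with
 * `prefix`, return a pointer to the remainder, otherwise NULL.
 */
static const char *after_prefix(const char *str, const char *prefix)
{
	size_t n = strlen(prefix);

	return strncmp(str, prefix, n) ? NULL : str + n;
}

/*
 * Return 1 if one of the `n` advertised entries (each "key" or
 * "key=value") has exactly the key `name`, else 0.
 */
static int caps_has(const char *const *caps, size_t n, const char *name)
{
	size_t i;

	assert(caps && name);
	for (i = 0; i < n; i++) {
		const char *rest = after_prefix(caps[i], name);

		/* The key must end here: "ls" must not match "ls-refs". */
		if (rest && (*rest == '\0' || *rest == '='))
			return 1;
	}
	return 0;
}
```

The check on the character after the prefix is what keeps a short key from accidentally matching a longer capability that merely begins with the same letters.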
Re: [PATCH v2 14/27] connect: request remote refs using v2
On 1/31/2018 3:10 PM, Eric Sunshine wrote: On Wed, Jan 31, 2018 at 10:22 AM, Derrick Stolee <sto...@gmail.com> wrote: On 1/25/2018 6:58 PM, Brandon Williams wrote: +static int process_ref_v2(const char *line, struct ref ***list) +{ + int ret = 1; + int i = 0; nit: you set 'i' here, but first use it in a for loop with blank initializer. Perhaps keep the first assignment closer to the first use? Hmm, I see 'i' being incremented a couple times before the loop... + if (string_list_split(_sections, line, ' ', -1) < 2) { + ret = 0; + goto out; + } + + if (get_oid_hex(line_sections.items[i++].string, _oid)) { here... + ret = 0; + goto out; + } + + ref = alloc_ref(line_sections.items[i++].string); and here... + + oidcpy(>old_oid, _oid); + **list = ref; + *list = >next; + + for (; i < line_sections.nr; i++) { then it is used in the loop. + const char *arg = line_sections.items[i].string; + if (skip_prefix(arg, "symref-target:", )) + ref->symref = xstrdup(arg); Thanks! Sorry I missed this. -Stolee
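To make the point about 'i' concrete, here is a self-contained sketch of the same parsing pattern: split on spaces, consume the first two fields with i++, then let the loop pick up at the third field. It assumes a 40-character lowercase hex object id and fixed-size buffers; `struct ref_line` and `parse_ref_line` are hypothetical names, not the code from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define OID_HEX_LEN 40 /* SHA-1 in hex; an assumption of this sketch */
#define MAX_FIELDS 8

struct ref_line {
	char oid[OID_HEX_LEN + 1];
	char name[256];
	char symref[256]; /* empty when no symref-target attribute is present */
};

/* Return 1 on success and 0 on a malformed line, as the patch does. */
static int parse_ref_line(const char *line, struct ref_line *out)
{
	static const char key[] = "symref-target:";
	char buf[1024];
	char *fields[MAX_FIELDS];
	size_t nr = 0, i = 0;
	char *p;

	assert(line && out);
	if (strlen(line) >= sizeof(buf))
		return 0;
	strcpy(buf, line);

	/* Split on single spaces; fields beyond MAX_FIELDS are ignored. */
	for (p = buf; p && nr < MAX_FIELDS; nr++) {
		fields[nr] = p;
		p = strchr(p, ' ');
		if (p)
			*p++ = '\0';
	}
	if (nr < 2)
		return 0;

	/* Fields 0 and 1 are consumed with i++ before the loop ... */
	if (strlen(fields[i]) != OID_HEX_LEN ||
	    strspn(fields[i], "0123456789abcdef") != OID_HEX_LEN)
		return 0;
	strcpy(out->oid, fields[i++]);
	snprintf(out->name, sizeof(out->name), "%s", fields[i++]);
	out->symref[0] = '\0';

	/* ... so the attribute loop starts at i == 2 with no initializer. */
	for (; i < nr; i++) {
		if (!strncmp(fields[i], key, strlen(key)))
			snprintf(out->symref, sizeof(out->symref), "%s",
				 fields[i] + strlen(key));
	}
	return 1;
}
```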
Re: [PATCH v2 10/27] protocol: introduce enum protocol_version value protocol_v2
On 2/2/2018 5:44 PM, Brandon Williams wrote: On 01/31, Derrick Stolee wrote: On 1/25/2018 6:58 PM, Brandon Williams wrote: Introduce protocol_v2, a new value for 'enum protocol_version'. Subsequent patches will fill in the implementation of protocol_v2. Signed-off-by: Brandon Williams <bmw...@google.com> --- builtin/fetch-pack.c | 3 +++ builtin/receive-pack.c | 6 ++ builtin/send-pack.c| 3 +++ builtin/upload-pack.c | 7 +++ connect.c | 3 +++ protocol.c | 2 ++ protocol.h | 1 + remote-curl.c | 3 +++ transport.c| 9 + 9 files changed, 37 insertions(+) diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c index 85d4faf76..f492e8abd 100644 --- a/builtin/fetch-pack.c +++ b/builtin/fetch-pack.c @@ -201,6 +201,9 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix) PACKET_READ_GENTLE_ON_EOF); switch (discover_version()) { + case protocol_v2: + die("support for protocol v2 not implemented yet"); + break; case protocol_v1: case protocol_v0: get_remote_heads(, , 0, NULL, ); diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c index b7ce7c7f5..3656e94fd 100644 --- a/builtin/receive-pack.c +++ b/builtin/receive-pack.c @@ -1963,6 +1963,12 @@ int cmd_receive_pack(int argc, const char **argv, const char *prefix) unpack_limit = receive_unpack_limit; switch (determine_protocol_version_server()) { + case protocol_v2: + /* +* push support for protocol v2 has not been implemented yet, +* so ignore the request to use v2 and fallback to using v0. 
+*/ + break; case protocol_v1: /* * v1 is just the original protocol with a version string, diff --git a/builtin/send-pack.c b/builtin/send-pack.c index 83cb125a6..b5427f75e 100644 --- a/builtin/send-pack.c +++ b/builtin/send-pack.c @@ -263,6 +263,9 @@ int cmd_send_pack(int argc, const char **argv, const char *prefix) PACKET_READ_GENTLE_ON_EOF); switch (discover_version()) { + case protocol_v2: + die("support for protocol v2 not implemented yet"); + break; case protocol_v1: case protocol_v0: get_remote_heads(, _refs, REF_NORMAL, diff --git a/builtin/upload-pack.c b/builtin/upload-pack.c index 2cb5cb35b..8d53e9794 100644 --- a/builtin/upload-pack.c +++ b/builtin/upload-pack.c @@ -47,6 +47,13 @@ int cmd_upload_pack(int argc, const char **argv, const char *prefix) die("'%s' does not appear to be a git repository", dir); switch (determine_protocol_version_server()) { + case protocol_v2: + /* +* fetch support for protocol v2 has not been implemented yet, +* so ignore the request to use v2 and fallback to using v0. 
+*/ + upload_pack(); + break; case protocol_v1: /* * v1 is just the original protocol with a version string, diff --git a/connect.c b/connect.c index db3c9d24c..f2157a821 100644 --- a/connect.c +++ b/connect.c @@ -84,6 +84,9 @@ enum protocol_version discover_version(struct packet_reader *reader) /* Maybe process capabilities here, at least for v2 */ switch (version) { + case protocol_v2: + die("support for protocol v2 not implemented yet"); + break; case protocol_v1: /* Read the peeked version line */ packet_reader_read(reader); diff --git a/protocol.c b/protocol.c index 43012b7eb..5e636785d 100644 --- a/protocol.c +++ b/protocol.c @@ -8,6 +8,8 @@ static enum protocol_version parse_protocol_version(const char *value) return protocol_v0; else if (!strcmp(value, "1")) return protocol_v1; + else if (!strcmp(value, "2")) + return protocol_v2; else return protocol_unknown_version; } diff --git a/protocol.h b/protocol.h index 1b2bc94a8..2ad35e433 100644 --- a/protocol.h +++ b/protocol.h @@ -5,6 +5,7 @@ enum protocol_version { protocol_unknown_version = -1, protocol_v0 = 0, protocol_v1 = 1, + protocol_v2 = 2, }; /* diff --git a/remote-curl.c b/remote-curl.c index 9f6d07683..dae8a4a48 100644 --- a/remote-curl.c +++ b/remote-curl.c @@ -185,6 +185,9 @@ static struct ref *parse_git_refs(struct discovery *heads, int for_push) PACKET_READ_GENTLE_ON_EOF); switch (discover_version()) { + case protocol_v2: + die("support for protocol v2 not implemented yet"); + break; case protocol
Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()
On 2/2/2018 10:32 AM, SZEDER Gábor wrote: Teach Git to write a commit graph file by checking all packed objects to see if they are commits, then store the file in the given pack directory. I'm afraid that scanning all packed objects is a bit of a roundabout way to approach this. In my git repo, with 9 pack files at the moment, i.e. not that big a repo and not that many pack files: $ time ./git commit-graph --write --update-head 4df41a3d1cc408b7ad34bea87b51ec4ccbf4b803 real 0m27.550s user 0m27.113s sys 0m0.376s In comparison, performing a good old revision walk to gather all the info that is written into the graph file: $ time git log --all --topo-order --format='%H %T %P %cd' |wc -l 52954 real 0m0.903s user 0m0.972s sys 0m0.058s Two reasons this is in here: (1) It's easier to get the write implemented this way and add the reachable closure later (which I do). (2) For GVFS, we want to add all commits that arrived in a "prefetch pack" to the graph even if we do not have a ref that points to the commit yet. We expect many commits to become reachable soon and having them in the graph saves a lot of time in merge-base calculations. So, (1) is for patch simplicity, and (2) is why I want it to be an option in the final version. See the --stdin-packs argument later for a way to do this incrementally. I expect almost all users to use the reachable closure method with --stdin-commits (and that's how I will integrate automatic updates with 'fetch', 'repack', and 'gc' in a later patch). +char* get_commit_graph_filename_hash(const char *pack_dir, +struct object_id *hash) +{ + size_t len; + struct strbuf head_path = STRBUF_INIT; + strbuf_addstr(&head_path, pack_dir); + strbuf_addstr(&head_path, "/graph-"); + strbuf_addstr(&head_path, oid_to_hex(hash)); + strbuf_addstr(&head_path, ".graph"); Nit: this is assembling the path of a graph file, not that of a graph-head, so the strbuf should be renamed accordingly. + + return strbuf_detach(&head_path, &len); +}
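For reference, the path that get_commit_graph_filename_hash() assembles has the shape "<pack_dir>/graph-<hex>.graph". A rough standalone equivalent that uses only the C library instead of strbuf is sketched below; `graph_filename` is a hypothetical name, and it takes the hash already converted to hex rather than a struct object_id.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GRAPH_PATH_FMT "%s/graph-%s.graph"

/*
 * Build "<pack_dir>/graph-<oid_hex>.graph" in a freshly allocated
 * string. The caller frees the result. Returns NULL on failure.
 */
static char *graph_filename(const char *pack_dir, const char *oid_hex)
{
	int len;
	char *path;

	assert(pack_dir && oid_hex);
	/* First pass only measures; snprintf(NULL, 0, ...) writes nothing. */
	len = snprintf(NULL, 0, GRAPH_PATH_FMT, pack_dir, oid_hex);
	if (len < 0)
		return NULL;
	path = malloc((size_t)len + 1);
	if (path)
		snprintf(path, (size_t)len + 1, GRAPH_PATH_FMT, pack_dir, oid_hex);
	return path;
}
```

Naming the buffer after what it holds (a graph path, not a graph-head path) is the rename the nit above asks for.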
Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
On 2/1/2018 6:48 PM, SZEDER Gábor wrote: Teach git-commit-graph to write graph files. Create new test script to verify this command succeeds without failure. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 18 +++ builtin/commit-graph.c | 30 t/t5318-commit-graph.sh| 96 ++ 3 files changed, 144 insertions(+) create mode 100755 t/t5318-commit-graph.sh diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index c8ea548dfb..3f3790d9a8 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -5,3 +5,21 @@ NAME git-commit-graph - Write and verify Git commit graphs (.graph files) + +SYNOPSIS + +[verse] +'git commit-graph' --write [--pack-dir ] + What do these options do and what is the command's output? IOW, an 'OPTIONS' section would be nice. +EXAMPLES + + +* Write a commit graph file for the packed commits in your local .git folder. ++ + +$ git commit-graph --write + + +GIT +--- +Part of the linkgit:git[1] suite diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh new file mode 100755 index 00..6bcd1cc264 --- /dev/null +++ b/t/t5318-commit-graph.sh @@ -0,0 +1,96 @@ +#!/bin/sh + +test_description='commit graph' +. ./test-lib.sh + +test_expect_success 'setup full repo' \ +'rm -rf .git && + mkdir full && + cd full && + git init && + git config core.commitgraph true && + git config pack.threads 1 && Does this pack.threads=1 make a difference? + packdir=".git/objects/pack"' We tend to put single quotes around tests like this: test_expect_success 'setup full repo' ' do-this && check-that ' This is not a mere style nit: those newlines before and after the test block make the test's output with '--verbose-log' slightly more readable. Furthermore, we prefer tabs for indentation. Oops! My bad for using t5302-pack-index.sh as my model for creating test scripts. It's pretty old, but I do see some of the newer tests using this newer style. 
Finally, 'cd'-ing around such that it affects subsequent tests is usually frowned upon. However, in this particular case (going into one repo, doing a bunch of tests there, then going into another repo, and doing another bunch of tests) I think it's better than changing directory in a subshell in every single test. + +test_expect_success 'write graph with no packs' \ +'git commit-graph --write --pack-dir .' + +test_expect_success 'create commits and repack' \ +'for i in $(test_seq 5) + do +echo $i >$i.txt && +git add $i.txt && +git commit -m "commit $i" && +git branch commits/$i + done && + git repack' + +test_expect_success 'write graph' \ +'graph1=$(git commit-graph --write) && + test_path_is_file ${packdir}/graph-${graph1}.graph' Style nit: those {} around the variable names are unnecessary, but I see you use them a lot. + +t_expect_success 'Add more commits' \ This must be test_expect_success. +'git reset --hard commits/3 && + for i in $(test_seq 6 10) + do +echo $i >$i.txt && +git add $i.txt && +git commit -m "commit $i" && +git branch commits/$i + done && + git reset --hard commits/3 && + for i in $(test_seq 11 15) + do +echo $i >$i.txt && +git add $i.txt && +git commit -m "commit $i" && +git branch commits/$i + done && + git reset --hard commits/7 && + git merge commits/11 && + git branch merge/1 && + git reset --hard commits/8 && + git merge commits/12 && + git branch merge/2 && + git reset --hard commits/5 && + git merge commits/10 commits/15 && + git branch merge/3 && + git repack' + +# Current graph structure: +# +# M3 +# / |\_ +#/ 10 15 +# / | | +# /9 M2 14 +# | |/ \ | +# | 8 M1 | 13 +# | |/ | \_| +# 5 7 | 12 +# | | \__| +# 4 6 11 +# |/__/ +# 3 +# | +# 2 +# | +# 1 + +test_expect_success 'write graph with merges' \ +'graph2=$(git commit-graph --write) && + test_path_is_file ${packdir}/graph-${graph2}.graph' + +test_expect_success 'setup bare repo' \ +'cd .. 
&& + git clone --bare full bare && + cd bare && + git config core.graph true && + git config pack.threads 1 && + baredir="objects/pack"' + +test_expect_success 'write graph in bare repo' \ +'graphbare=$(git commit-graph --write) && + test_path_is_file ${baredir}/graph-${graphbare}.graph' + +test_done -- 2.16.0.15.g9c3cf44.dirty
[PATCH v3 01/14] commit-graph: add format document
Add document specifying the binary format for commit graphs. This format allows for: * New versions. * New hash functions and hash lengths. * Optional extensions. Basic header information is followed by a binary table of contents into "chunks" that include: * An ordered list of commit object IDs. * A 256-entry fanout into that list of OIDs. * A list of metadata for the commits. * A list of "large edges" to enable octopus merges. The format automatically includes two parent positions for every commit. This favors speed over space, since using only one position per commit would cause an extra level of indirection for every merge commit. (Octopus merges suffer from this indirection, but they are very rare.) Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph-format.txt | 91 + 1 file changed, 91 insertions(+) create mode 100644 Documentation/technical/commit-graph-format.txt diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt new file mode 100644 index 00..349fa0c14c --- /dev/null +++ b/Documentation/technical/commit-graph-format.txt @@ -0,0 +1,91 @@ +Git commit graph format +=== + +The Git commit graph stores a list of commit OIDs and some associated +metadata, including: + +- The generation number of the commit. Commits with no parents have + generation number 1; commits with parents have generation number + one more than the maximum generation number of its parents. We + reserve zero as special, and can be used to mark a generation + number invalid or as "not computed". + +- The root tree OID. + +- The commit date. + +- The parents of the commit, stored using positional references within + the graph file. + +== graph-*.graph files have the following format: + +In order to allow extensions that add extra data to the graph, we organize +the body into "chunks" and provide a binary lookup table at the beginning +of the body. 
The header includes certain values, such as number of chunks, +hash lengths and types. + +All 4-byte numbers are in network order. + +HEADER: + + 4-byte signature: + The signature is: {'C', 'G', 'P', 'H'} + + 1-byte version number: + Currently, the only valid version is 1. + + 1-byte Object Id Version (1 = SHA-1) + + 1-byte Object Id Length (H) + + 1-byte number (C) of "chunks" + +CHUNK LOOKUP: + + (C + 1) * 12 bytes listing the table of contents for the chunks: + First 4 bytes describe chunk id. Value 0 is a terminating label. + Other 8 bytes provide offset in current file for chunk to start. + (Chunks are ordered contiguously in the file, so you can infer + the length using the next chunk position if necessary.) + + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. + +CHUNK DATA: + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of commits (N). + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all commits in the graph, sorted in ascending order. + + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes) +* The first H bytes are for the OID of the root tree. +* The next 8 bytes are for the int-ids of the first two parents + of the ith commit. Stores value 0x if no parent in that + position. If there are more than two parents, the second value + has its most-significant bit on and the other bits store an array + position into the Large Edge List chunk. +* The next 8 bytes store the generation number of the commit and + the commit time in seconds since EPOCH. The generation number + uses the higher 30 bits of the first 4 bytes, while the commit + time uses the 32 bits of the second 4 bytes, along with the lowest + 2 bits of the lowest byte, storing the 33rd and 34th bit of the + commit time. 
+ + Large Edge List (ID: {'E', 'D', 'G', 'E'}) + This list of 4-byte values store the second through nth parents for + all octopus merges. The second parent value in the commit data stores + an array position within this list along with the most-significant bit + on. Starting at that array position, iterate through this list of int-ids + for the parents until reaching a value with the most-significant bit on. + The other bits correspond to the int-id of the last parent. This chunk + should always be present, but may be empty. + +TRAILER: + + H-byte HASH-checksum of all of the above. + -- 2.15.1.45.g9b7079f
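The last 8 bytes of each Commit Data entry are the densest part of the format, so a small sketch may help. It follows one reading of the description above: the first 4-byte word holds the generation number in its upper 30 bits and bits 33 and 34 of the commit time in its lowest 2 bits, and the second word holds the low 32 bits of the commit time, both in network order. The function names are invented for this example and are not from the patch series.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 32-bit value in network (big-endian) byte order. */
static void put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/* Load a 32-bit value stored in network byte order. */
static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Pack a 30-bit generation number and a 34-bit commit time into 8 bytes. */
static void encode_gen_date(unsigned char out[8], uint32_t generation,
			    uint64_t date)
{
	assert(generation < (UINT32_C(1) << 30));
	assert(date < (UINT64_C(1) << 34));
	/* Upper 30 bits: generation. Lowest 2 bits: date bits 33 and 34. */
	put_be32(out, (generation << 2) | (uint32_t)(date >> 32));
	/* Second word: the low 32 bits of the date. */
	put_be32(out + 4, (uint32_t)(date & UINT32_C(0xffffffff)));
}

/* Inverse of encode_gen_date(). */
static void decode_gen_date(const unsigned char in[8], uint32_t *generation,
			    uint64_t *date)
{
	uint32_t hi = get_be32(in);

	*generation = hi >> 2;
	*date = ((uint64_t)(hi & 0x3) << 32) | get_be32(in + 4);
}
```

Under this reading a generation number of zero, which the document reserves for "not computed", simply leaves the upper 30 bits clear.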
Re: [PATCH v3 03/14] commit-graph: create git-commit-graph builtin
On 2/8/2018 4:27 PM, Junio C Hamano wrote: Derrick Stolee <sto...@gmail.com> writes: Teach git the 'commit-graph' builtin that will be used for writing and reading packed graph files. The current implementation is mostly empty, except for a '--pack-dir' option. Why do we want to use "pack" dir, when this is specifically designed not to be tied to packfiles? .git/objects/pack/ certainly is a possibility in the sense that anywhere inside .git/objects/ would make sense, but using the "pack" dir smells like signalling a wrong message to users. I wanted to have the smallest possible footprint in the objects directory, and the .git/objects directory currently only holds folders. I suppose this feature, along with the multi-pack-index (MIDX), extends the concept of the pack directory to be a "compressed data" directory, but keeps the "pack" name to be compatible with earlier versions. Another option is to create a .git/objects/graph directory instead, but then we need to worry about that directory being present. Thanks, -Stolee
[PATCH v3 09/14] commit-graph: implement --delete-expired
Teach git-commit-graph to delete the graph files in the pack directory that were not referenced by 'graph_head' during this process. This cleans up space for the user while not causing race conditions with other running Git processes that may be referencing the previous graph file. To delete old graph files, a user (or managing process) would call git commit-graph write --update-head --delete-expired but there is some responsibility that the caller must consider. Specifically, ensure that processes that started before a previous 'commit-graph write' command have completed. Otherwise, they may have an open handle on a graph file that will be deleted by the new call. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 11 -- builtin/commit-graph.c | 73 -- t/t5318-commit-graph.sh| 7 ++-- 3 files changed, 84 insertions(+), 7 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 8c2cbbc923..7ae8f7484d 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -37,6 +37,11 @@ checksum hash of the written file. + With `--update-head` option, update the graph-head file to point to the written graph file. ++ +With the `--delete-expired` option, delete the graph files in the pack +directory that are not referred to by the graph-head file. To avoid race +conditions, do not delete the file previously referred to by the +graph-head file if it is updated by the `--update-head` option. 'read':: @@ -60,11 +65,11 @@ EXAMPLES $ git commit-graph write -* Write a graph file for the packed commits in your local .git folder -* and update graph-head. +* Write a graph file for the packed commits in your local .git folder, +* update graph-head, and delete state graph files. + -$ git commit-graph write --update-head +$ git commit-graph write --update-head --delete-expired * Read basic information from a graph file. 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 529cb80de6..15f647fd81 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -9,7 +9,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--pack-dir ]"), N_("git commit-graph clear [--pack-dir ]"), N_("git commit-graph read [--graph-hash=]"), - N_("git commit-graph write [--pack-dir ] [--update-head]"), + N_("git commit-graph write [--pack-dir ] [--update-head] [--delete-expired]"), NULL }; @@ -24,7 +24,7 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--pack-dir ] [--update-head]"), + N_("git commit-graph write [--pack-dir ] [--update-head] [--delete-expired]"), NULL }; @@ -32,6 +32,7 @@ static struct opts_commit_graph { const char *pack_dir; const char *graph_hash; int update_head; + int delete_expired; } opts; static int graph_clear(int argc, const char **argv) @@ -153,9 +154,68 @@ static void update_head_file(const char *pack_dir, const struct object_id *graph commit_lock_file(); } +/* + * To avoid race conditions and deleting graph files that are being + * used by other processes, look inside a pack directory for all files + * of the form "graph-.graph" that do not match the old or new + * graph hashes and delete them. 
+ */ +static void do_delete_expired(const char *pack_dir, + struct object_id *old_graph_hash, + struct object_id *new_graph_hash) +{ + DIR *dir; + struct dirent *de; + int dirnamelen; + struct strbuf path = STRBUF_INIT; + char *old_graph_path, *new_graph_path; + + if (old_graph_hash) + old_graph_path = get_commit_graph_filename_hash(pack_dir, old_graph_hash); + else + old_graph_path = NULL; + new_graph_path = get_commit_graph_filename_hash(pack_dir, new_graph_hash); + + dir = opendir(pack_dir); + if (!dir) { + if (errno != ENOENT) + error_errno("unable to open object pack directory: %s", + pack_dir); + return; + } + + strbuf_addstr(, pack_dir); + strbuf_addch(, '/'); + + dirnamelen = path.len; + while ((de = readdir(dir)) != NULL) { + size_t base_len; + + if (is_dot_or_dotdot(de->d_name)) + continue; + + strbuf_setlen(, dirnamelen); + strbuf_addstr(, de-&
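The patch text is cut off above, so the following is only a sketch of the selection rule stated in its comment: a directory entry is a deletion candidate if it looks like a graph file and matches neither the old nor the new graph. It compares bare file names rather than full paths, which is a simplification, and `has_suffix` and `is_expired_graph` are names made up for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if `s` ends with `suffix`. */
static int has_suffix(const char *s, const char *suffix)
{
	size_t slen = strlen(s), xlen = strlen(suffix);

	return slen >= xlen && !strcmp(s + slen - xlen, suffix);
}

/*
 * Decide whether directory entry `name` is an expired graph file.
 * `old_name` may be NULL when there was no previous graph.
 */
static int is_expired_graph(const char *name, const char *old_name,
			    const char *new_name)
{
	assert(name && new_name);
	/* Only files shaped like graph-<hash>.graph are candidates. */
	if (strncmp(name, "graph-", 6) || !has_suffix(name, ".graph"))
		return 0;
	/* Keep the previous graph: a running process may still have it open. */
	if (old_name && !strcmp(name, old_name))
		return 0;
	/* Keep the graph that was just written. */
	if (!strcmp(name, new_name))
		return 0;
	return 1;
}
```

Requiring the ".graph" suffix also leaves the graph-head file alone, since it shares the "graph-" prefix but not the suffix.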
Re: [PATCH v3 01/14] commit-graph: add format document
On 2/8/2018 4:21 PM, Junio C Hamano wrote:
> Derrick Stolee <sto...@gmail.com> writes:
>
>> Add document specifying the binary format for commit graphs.
>>
>> This format allows for:
>>
>> * New versions.
>> * New hash functions and hash lengths.
>
> It still is unclear, at least to me, why OID and OID length are
> stored as if they can be independent. If a reader does not
> understand a new Object Id hash, is there anything the reader can
> still do by knowing how long the hash (which it cannot recompute to
> validate) is? And if a reader does know what OID hashing scheme is
> used to refer to the objects, it certainly would know how long the
> OIDs are.
>
> Giving length may make sense only when a reader can treat these OIDs
> as completely opaque identifiers, without having to (re)hash from
> the contents, but if that is the case, then there is no point saying
> what exact hash function is used to compute OID.
>
> So I'd understand storing only either one or the other, but not
> both. Am I missing something?

You're right that this data is redundant. It is easy to describe the
width of the tables using the OID length, so it is convenient to have
that part of the format. Also, it is good to have 4-byte alignment
here, so we are not wasting space.

There isn't a strong reason to put that here, but I don't have a great
reason to remove it, either. Perhaps leave a byte blank for possible
future use?

>> +The Git commit graph stores a list of commit OIDs and some associated
>> +metadata, including:
>> +
>> +- The generation number of the commit. Commits with no parents have
>> +  generation number 1; commits with parents have generation number
>> +  one more than the maximum generation number of its parents. We
>> +  reserve zero as special, and can be used to mark a generation
>> +  number invalid or as "not computed".
>
> This "most natural" definition of generation number is stricter than
> absolutely necessary (a looser definition that is sufficient is
> "gennum of a child is larger than all of its parents'").
>
> While I personally think that is OK, some people who floated
> different ideas in previous discussions on generation numbers may
> want to articulate their ideas again. One idea that I found clever
> was to use the total number of commits that are ancestors of a
> commit instead (it is far more expensive to compute than the most
> natural gennum, but doing so may help other topology math, like
> "describe").

It is more difficult to compute the number of reachable commits, since
you cannot learn that only by looking at the parents (you need to know
how many commits are in the intersection of their reachable sets for a
two-parent merge, or just walk all of the commits). This leads to a
quadratic computation to discover the value for N commits.

I define it this rigidly now because I will submit a patch soon after
this one lands that computes generation numbers and consumes them in
paint_down_to_common(). I've got it sitting in my local repo ready for
a rebase.

>> +CHUNK LOOKUP:
>> +
>> +  (C + 1) * 12 bytes listing the table of contents for the chunks:
>> +      First 4 bytes describe chunk id. Value 0 is a terminating label.
>> +      Other 8 bytes provide offset in current file for chunk to start.
>> +      (Chunks are ordered contiguously in the file, so you can infer
>> +      the length using the next chunk position if necessary.)
>
> Aren't chunks numbered contiguously, starting from #1, thereby
> making it unnecessary to store the 4-byte? How does a reader obtain
> the length of the last chunk? Ahh, that is why there are C+1 entries
> in this table, not just C, so that the reader knows where to stop
> while reading the last one. Does that mean that this table looks
> like this?
>
>	{ 1, offset_1 },
>	{ 2, offset_2 },
>	...
>	{ C, offset_C },
>	{ 0, offset_C+1 },
>
> where (offset_N+1 - offset_N) gives the length of chunk #N?

This is correct.

>> +  The remaining data in the body is described one chunk at a time, and
>> +  these chunks may be given in any order. Chunks are required unless
>> +  otherwise specified.
>> +
>> +CHUNK DATA:
>> +
>> +  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
>> +      The ith entry, F[i], stores the number of OIDs with first
>> +      byte at most i. Thus F[255] stores the total
>> +      number of commits (N).
>> +
>> +  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
>> +      The OIDs for all commits in the graph, sorted in ascending order.
>> +
>> +  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
>> +    * The first H bytes are for the OID of the root tree.
>> +    * The next 8 bytes are for the int-ids of the first two parents
>> +      of the ith commit. Stores value 0x if no parent in that
>> +      position. If there are more than two parents, the second value
>> +      has its most-significant bit on and the other bits store an array
>> +      position into the Large Edge List chunk.
>> +    * The next 8 bytes store the generation number of the commit and
>> +      the commit time in seconds
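
To make the OID Fanout description above concrete, here is a minimal sketch in C of how a reader could build the fanout table from the sorted OID Lookup data and then use it to narrow a binary search. It is only an illustration under stated assumptions, not the code from this series: the names build_fanout(), lookup_oid() and HASH_LEN are hypothetical, H is assumed to be 20 bytes, and the OIDs are held in one flat in-memory array rather than read from a file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 20 /* assumption: H = 20 bytes per OID */

/*
 * Hypothetical helper: fill fanout[256] from n OIDs stored back to back
 * in ascending order. Afterwards fanout[i] is the number of OIDs whose
 * first byte is at most i, so fanout[255] == n.
 */
static void build_fanout(const unsigned char *oids, size_t n,
			 uint32_t fanout[256])
{
	size_t i;
	int b;

	assert(n <= UINT32_MAX);
	memset(fanout, 0, 256 * sizeof(uint32_t));
	for (i = 0; i < n; i++)
		fanout[oids[i * HASH_LEN]]++;	/* count per first byte */
	for (b = 1; b < 256; b++)
		fanout[b] += fanout[b - 1];	/* turn counts into prefix sums */
}

/*
 * Hypothetical helper: return the position of `want` in the sorted OID
 * array, or -1 if it is absent. The fanout table limits the binary
 * search to the OIDs that share the first byte of `want`.
 */
static long lookup_oid(const unsigned char *oids, const uint32_t fanout[256],
		       const unsigned char *want)
{
	uint32_t lo = want[0] ? fanout[want[0] - 1] : 0;
	uint32_t hi = fanout[want[0]];

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(want, oids + (size_t)mid * HASH_LEN, HASH_LEN);

		if (!cmp)
			return (long)mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -1;
}
```

The position returned by such a lookup is what the Commit Data chunk would be indexed by, since both chunks list the commits in the same order.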
[PATCH v3 03/14] commit-graph: create git-commit-graph builtin
Teach git the 'commit-graph' builtin that will be used for writing and
reading packed graph files. The current implementation is mostly
empty, except for a '--pack-dir' option.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 .gitignore                         |  1 +
 Documentation/git-commit-graph.txt | 11 +++
 Makefile                           |  1 +
 builtin.h                          |  1 +
 builtin/commit-graph.c             | 37 +
 command-list.txt                   |  1 +
 git.c                              |  1 +
 7 files changed, 53 insertions(+)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 builtin/commit-graph.c

diff --git a/.gitignore b/.gitignore
index 833ef3b0b7..e82f90184d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -34,6 +34,7 @@
 /git-clone
 /git-column
 /git-commit
+/git-commit-graph
 /git-commit-tree
 /git-config
 /git-count-objects
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
new file mode 100644
index 0000000000..e1c3078ca1
--- /dev/null
+++ b/Documentation/git-commit-graph.txt
@@ -0,0 +1,11 @@
+git-commit-graph(1)
+===================
+
+NAME
+----
+git-commit-graph - Write and verify Git commit graphs (.graph files)
+
+GIT
+---
+Part of the linkgit:git[1] suite
+
diff --git a/Makefile b/Makefile
index ee9d5eb11e..fc40b816dc 100644
--- a/Makefile
+++ b/Makefile
@@ -932,6 +932,7 @@ BUILTIN_OBJS += builtin/clone.o
 BUILTIN_OBJS += builtin/column.o
 BUILTIN_OBJS += builtin/commit-tree.o
 BUILTIN_OBJS += builtin/commit.o
+BUILTIN_OBJS += builtin/commit-graph.o
 BUILTIN_OBJS += builtin/config.o
 BUILTIN_OBJS += builtin/count-objects.o
 BUILTIN_OBJS += builtin/credential.o
diff --git a/builtin.h b/builtin.h
index 42378f3aa4..079855b6d4 100644
--- a/builtin.h
+++ b/builtin.h
@@ -149,6 +149,7 @@ extern int cmd_clone(int argc, const char **argv, const char *prefix);
 extern int cmd_clean(int argc, const char **argv, const char *prefix);
 extern int cmd_column(int argc, const char **argv, const char *prefix);
 extern int cmd_commit(int argc, const char **argv, const char *prefix);
+extern int cmd_commit_graph(int argc, const char **argv, const char *prefix);
 extern int cmd_commit_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_config(int argc, const char **argv, const char *prefix);
 extern int cmd_count_objects(int argc, const char **argv, const char *prefix);
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
new file mode 100644
index 0000000000..a01c5d9981
--- /dev/null
+++ b/builtin/commit-graph.c
@@ -0,0 +1,37 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+
+static char const * const builtin_commit_graph_usage[] = {
+	N_("git commit-graph [--pack-dir ]"),
+	NULL
+};
+
+static struct opts_commit_graph {
+	const char *pack_dir;
+} opts;
+
+
+int cmd_commit_graph(int argc, const char **argv, const char *prefix)
+{
+	static struct option builtin_commit_graph_options[] = {
+		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
+			N_("dir"),
+			N_("The pack directory to store the graph") },
+		OPT_END(),
+	};
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_commit_graph_usage,
+				   builtin_commit_graph_options);
+
+	git_config(git_default_config, NULL);
+	argc = parse_options(argc, argv, prefix,
+			     builtin_commit_graph_options,
+			     builtin_commit_graph_usage,
+			     PARSE_OPT_STOP_AT_NON_OPTION);
+
+	usage_with_options(builtin_commit_graph_usage,
+			   builtin_commit_graph_options);
+}
+
diff --git a/command-list.txt b/command-list.txt
index a1fad28fd8..835c5890be 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -34,6 +34,7 @@ git-clean                               mainporcelain
 git-clone                               mainporcelain           init
 git-column                              purehelpers
 git-commit                              mainporcelain           history
+git-commit-graph                        plumbingmanipulators
 git-commit-tree                         plumbingmanipulators
 git-config                              ancillarymanipulators
 git-count-objects                       ancillaryinterrogators
diff --git a/git.c b/git.c
index 9e96dd4090..d4832c1e0d 100644
--- a/git.c
+++ b/git.c
@@ -388,6 +388,7 @@ static struct cmd_struct commands[] = {
 	{ "clone", cmd_clone },
 	{ "column", cmd_column, RUN_SETUP_GENTLY },
 	{ "commit", cmd_commit, RUN_SETUP | NEED_WORK_TREE },
+	{ "commit-graph", cmd_commit_graph, RUN_SETUP },
 	{ "commit-tree"
[PATCH v3 00/14] Serialized Git Commit Graph
Thanks to everyone who gave comments on v1 and v2. Hopefully the
following points have been addressed:

* Fixed inter-commit problems where certain fixes did not arrive until
  later commits.

* Converted from submode flags ("git commit-graph --write") to
  subcommands ("git commit-graph write").

* Fixed a bug where a non-commit OID would cause a segfault when using
  --stdin-commits. Added a test for an annotated tag.

* Numerous style issues, especially in the test script.

I also based my patches on the branch jt/binsearch-with-fanout to make
use of the bsearch_hash() method.

I look forward to your feedback.

Thanks,
-Stolee

-- >8 --

As promised [1], this patch contains a way to serialize the commit
graph. The current implementation defines a new file format to store
the graph structure (parent relationships) and basic commit metadata
(commit date, root tree OID) in order to prevent parsing raw commits
while performing basic graph walks. For example, we do not need to
parse the full commit when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base <A> <B>' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.

The current patch speeds up these calculations by injecting a check in
parse_commit_gently() to check if there is a graph file and using that
to provide the required metadata to the struct commit.

The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.

Here are some performance results for a copy of the Linux repository
where 'master' has 704,766 reachable commits and is behind
'origin/master' by 19,610 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |   5.9s |   0.7s | -88%  |
| branch -vv                       |  0.42s |  0.27s | -35%  |
| rev-list --all                   |   6.4s |   1.0s | -84%  |
| rev-list --all --objects         |  32.6s |  27.6s | -15%  |

To test this yourself, run the following on your repo:

  git config core.commitGraph true
  git show-ref -s | git commit-graph write --update-head --stdin-commits

The second command writes a commit graph file containing every commit
reachable from your refs. Now, all git commands that walk commits will
check your graph first before consulting the ODB. You can run your own
performance comparisons by toggling the 'core.commitgraph' setting.

[1] https://public-inbox.org/git/d154319e-bb9e-b300-7c37-27b1dcd2a...@jeffhostetler.com/
    Re: What's cooking in git.git (Jan 2018, #03; Tue, 23)

[2] https://github.com/derrickstolee/git/pull/2
    A GitHub pull request containing the latest version of this patch.

Derrick Stolee (14):
  commit-graph: add format document
  graph: add commit graph design document
  commit-graph: create git-commit-graph builtin
  commit-graph: implement write_commit_graph()
  commit-graph: implement 'git-commit-graph write'
  commit-graph: implement 'git-commit-graph read'
  commit-graph: update graph-head during write
  commit-graph: implement 'git-commit-graph clear'
  commit-graph: implement --delete-expired
  commit-graph: add core.commitGraph setting
  commit: integrate commit graph with commit parsing
  commit-graph: close under reachability
  commit-graph: read only from specific pack-indexes
  commit-graph: build graph from starting commits

 .gitignore                                      |   1 +
 Documentation/config.txt                        |   3 +
 Documentation/git-commit-graph.txt              | 115
 Documentation/technical/commit-graph-format.txt |  91 +++
 Documentation/technical/commit-graph.txt        | 189 ++
 Makefile                                        |   2 +
 alloc.c                                         |   1 +
 builtin.h                                       |   1 +
 builtin/commit-graph.c                          | 335 ++
 cache.h                                         |   1 +
 command-list.txt                                |   1 +
 commit-graph.c                                  | 828
 commit-graph.h                                  |  60 ++
 commit.c                                        |   3 +
 commit.h
[PATCH v3 07/14] commit-graph: update graph-head during write
It is possible to have multiple commit graph files in a pack
directory, but only one is important at a time. Use a 'graph_head'
file to point to the important file. Teach git-commit-graph to write
'graph_head' upon writing a new commit graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 11 ++-
 builtin/commit-graph.c             | 27 +--
 commit-graph.c                     |  8
 commit-graph.h                     |  1 +
 t/t5318-commit-graph.sh            | 25 +++--
 5 files changed, 63 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 67e107f06a..5e32c43b27 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -33,7 +33,9 @@ COMMANDS
 Write a commit graph file based on the commits found in packfiles.
 Includes all commits from the existing commit graph file. Outputs the
 checksum hash of the written file.
-
++
+With `--update-head` option, update the graph-head file to point
+to the written graph file.
 
 'read'::
 
@@ -53,6 +55,13 @@ EXAMPLES
 $ git commit-graph write
 ------------------------------------------------
 
+* Write a graph file for the packed commits in your local .git folder
+* and update graph-head.
++
+------------------------------------------------
+$ git commit-graph write --update-head
+------------------------------------------------
+
 * Read basic information from a graph file.
 +
 ------------------------------------------------
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 3ffa7ec433..776ca087e8 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -1,12 +1,13 @@
 #include "builtin.h"
 #include "config.h"
+#include "lockfile.h"
 #include "parse-options.h"
 #include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--pack-dir ]"),
 	N_("git commit-graph read [--graph-hash=]"),
-	N_("git commit-graph write [--pack-dir ]"),
+	N_("git commit-graph write [--pack-dir ] [--update-head]"),
 	NULL
 };
 
@@ -16,13 +17,14 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--pack-dir ]"),
+	N_("git commit-graph write [--pack-dir ] [--update-head]"),
 	NULL
 };
 
 static struct opts_commit_graph {
 	const char *pack_dir;
 	const char *graph_hash;
+	int update_head;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -87,6 +89,22 @@ static int graph_read(int argc, const char **argv)
 	return 0;
 }
 
+static void update_head_file(const char *pack_dir, const struct object_id *graph_hash)
+{
+	int fd;
+	struct lock_file lk = LOCK_INIT;
+	char *head_fname = get_graph_head_filename(pack_dir);
+
+	fd = hold_lock_file_for_update(&lk, head_fname, LOCK_DIE_ON_ERROR);
+	FREE_AND_NULL(head_fname);
+
+	if (fd < 0)
+		die_errno("unable to open graph-head");
+
+	write_in_full(fd, oid_to_hex(graph_hash), GIT_MAX_HEXSZ);
+	commit_lock_file(&lk);
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct object_id *graph_hash;
@@ -95,6 +113,8 @@ static int graph_write(int argc, const char **argv)
 		{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
 			N_("dir"),
 			N_("The pack directory to store the graph") },
+		OPT_BOOL('u', "update-head", &opts.update_head,
+			N_("update graph-head to written graph file")),
 		OPT_END(),
 	};
 
@@ -111,6 +131,9 @@ static int graph_write(int argc, const char **argv)
 
 	graph_hash = write_commit_graph(opts.pack_dir);
 
+	if (opts.update_head)
+		update_head_file(opts.pack_dir, graph_hash);
+
 	if (graph_hash) {
 		printf("%s\n", oid_to_hex(graph_hash));
 		FREE_AND_NULL(graph_hash);
diff --git a/commit-graph.c b/commit-graph.c
index 9a337cea4d..9789fe37f9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,6 +38,14 @@
 #define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \
 			GRAPH_OID_LEN + 8)
 
+char *get_graph_head_filename(const char *pack_dir)
+{
+	struct strbuf fname = STRBUF_INIT;
+	strbuf_addstr(&fname, pack_dir);
+	strbuf_addstr(&fname, "/graph-head");
+	return strbuf_detach(&fname, 0);
+}
+
 char* get_commit_graph_filename_hash(const char *pack_dir,
 				     struct object_id *hash)
 {
diff --git a/commit-graph.h b/commit-graph.h
index c1608976b3..726
[PATCH v3 12/14] commit-graph: close under reachability
Teach write_commit_graph() to walk all parents from the commits
discovered in packfiles. This prevents gaps given by loose objects or
previously-missed packfiles.

Also automatically add commits from the existing graph file, if it
exists.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index aff67c458e..d711a2cd81 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -633,6 +633,28 @@ static int if_packed_commit_add_to_list(const struct object_id *oid,
 	return 0;
 }
 
+static void close_reachable(struct packed_oid_list *oids)
+{
+	int i;
+	struct rev_info revs;
+	struct commit *commit;
+	init_revisions(&revs, NULL);
+	for (i = 0; i < oids->nr; i++) {
+		commit = lookup_commit(oids->list[i]);
+		if (commit && !parse_commit(commit))
+			revs.commits = commit_list_insert(commit, &revs.commits);
+	}
+
+	if (prepare_revision_walk(&revs))
+		die(_("revision walk setup failed"));
+
+	while ((commit = get_revision(&revs)) != NULL) {
+		ALLOC_GROW(oids->list, oids->nr + 1, oids->alloc);
+		oids->list[oids->nr] = &(commit->object.oid);
+		(oids->nr)++;
+	}
+}
+
 struct object_id *write_commit_graph(const char *pack_dir)
 {
 	struct packed_oid_list oids;
@@ -650,12 +672,27 @@ struct object_id *write_commit_graph(const char *pack_dir)
 	char *fname;
 	struct commit_list *parent;
 
+	prepare_commit_graph();
+
 	oids.nr = 0;
 	oids.alloc = 1024;
+
+	if (commit_graph && oids.alloc < commit_graph->num_commits)
+		oids.alloc = commit_graph->num_commits;
+
 	ALLOC_ARRAY(oids.list, oids.alloc);
 
+	if (commit_graph) {
+		for (i = 0; i < commit_graph->num_commits; i++) {
+			oids.list[i] = malloc(sizeof(struct object_id));
+			get_nth_commit_oid(commit_graph, i, oids.list[i]);
+		}
+		oids.nr = commit_graph->num_commits;
+	}
+
 	for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
+	close_reachable(&oids);
 	QSORT(oids.list, oids.nr, commit_compare);
 
 	count_distinct = 1;
-- 
2.15.1.45.g9b7079f
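
For readers who want to see the idea of "closing under reachability" in isolation, the following is a small self-contained sketch in C. It is a toy model under stated assumptions, not the code from this patch: commits are plain array indices with at most two parents, and struct toy_commit, NO_PARENT, MAX_COMMITS and close_reachable_toy() are hypothetical names. The patch itself relies on Git's revision walk rather than an explicit stack.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_COMMITS 16		/* toy limit, only for this sketch */
#define NO_PARENT (-1)

/* Hypothetical toy commit: parents are indices into the same array. */
struct toy_commit {
	int parents[2];
};

/*
 * Extend the set described by in_set[] (one flag per commit) so that it
 * also contains every ancestor of its members. Returns the number of
 * commits in the closed set.
 */
static size_t close_reachable_toy(const struct toy_commit *commits, size_t nr,
				  unsigned char *in_set)
{
	int stack[MAX_COMMITS];
	size_t top = 0, count = 0, i;

	assert(nr <= MAX_COMMITS);

	/* Seed the walk with the commits that are already in the set. */
	for (i = 0; i < nr; i++) {
		if (in_set[i]) {
			stack[top++] = (int)i;
			count++;
		}
	}

	/* Visit parents; each commit is pushed at most once. */
	while (top) {
		const struct toy_commit *c = &commits[stack[--top]];
		int p;

		for (p = 0; p < 2; p++) {
			int parent = c->parents[p];

			if (parent == NO_PARENT || in_set[parent])
				continue;
			in_set[parent] = 1;
			count++;
			stack[top++] = parent;
		}
	}
	return count;
}
```

Starting from a single merge commit, for example, the closed set ends up holding the merge, both of its parents and their shared root, while descendants and unrelated histories stay out. That is the property the graph file needs so that every stored parent reference points at a commit that is also stored.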