Re: Git Test Coverage Report (Saturday, Oct 27)
On 10/27/2018 9:55 AM, Junio C Hamano wrote:

Derrick Stolee writes:

    Uncovered in mater not in master@{1}

Does this typo indicate that some part of the process to produce and send out this report involves manual editing?

I kick off four builds on Azure Pipelines [1], wait for the builds to finish, then copy the appropriate sections out of the output (trimming the timestamps). If I were more familiar with the Pipelines features, I'm pretty sure I could make four parallel jobs inside the same build that run the tests on each branch, export the logs as artifacts, then auto-assemble the email at the end. I plan to do that soon, but I'm still keeping a close eye on the results to see when and where failures occur.

Thanks,
-Stolee

[1] https://dev.azure.com/git/git/_build?definitionId=5
Git Test Coverage Report (Saturday, Oct 27)
d not write '%s''"), todo_file);
6424061be4 110) return error(_("could not copy '%s' to '%s'."), todo_file,
b74a37a5a7 174) goto leave_check;

refs.c
3a3b9d8cde 657) return 0;

refs/files-backend.c

revision.c
a5dda2204a 726) if (repo_parse_commit(the_repository, p) < 0)

sequencer.c
a5dda2204a 1597) repo_unuse_commit_buffer(the_repository, head_commit,
a5dda2204a 3821) repo_unuse_commit_buffer(the_repository,
b5d6062402 4486) strbuf_insert(buf, todo_list->items[insert].offset_in_buf +
b5d6062402 4498) int sequencer_add_exec_commands(const char *commands)
06d8136126 4505) return error_errno(_("could not read '%s'."), todo_file);
b5d6062402 4507) if (todo_list_parse_insn_buffer(todo_list.buf.buf, &todo_list)) {
b5d6062402 4512) todo_list_add_exec_commands(&todo_list, commands);
b5d6062402 4513) res = write_message(todo_list.buf.buf, todo_list.buf.len, todo_file, 0);
0cce4a2756 4514) todo_list_release(&todo_list);
b5d6062402 4516) return res;
b74a37a5a7 4576) goto out;
b74a37a5a7 4581) goto out;
b8dac44d10 4721) todo_list_release(&new_todo);
009173ed7b 4726) todo_list_release(&new_todo);
009173ed7b 4727) return error_errno(_("could not write '%s'"), todo_file);
9787d17d40 4921) int rearrange_squash_in_todo_file(void)
9787d17d40 4923) const char *todo_file = rebase_path_todo();
9787d17d40 4924) struct todo_list todo_list = TODO_LIST_INIT;
9787d17d40 4925) int res = 0;
9787d17d40 4927) if (strbuf_read_file_or_whine(&todo_list.buf, todo_file) < 0)
9787d17d40 4928) return -1;
9787d17d40 4929) if (todo_list_parse_insn_buffer(todo_list.buf.buf, &todo_list) < 0) {
9787d17d40 4930) todo_list_release(&todo_list);
9787d17d40 4931) return -1;
9787d17d40 4934) res = todo_list_rearrange_squash(&todo_list);
9787d17d40 4935) if (!res)
9787d17d40 4936) res = rewrite_file(todo_file, todo_list.buf.buf, todo_list.buf.len);
9787d17d40 4938) todo_list_release(&todo_list);

setup.c
58b284a2e9 413) return config_error_nonbool(var);

sha1-array.c
7fdf90d541 91) oidcpy(&oids[dst], &oids[src]);
submodule-config.c
bcbc780d14 740) return CONFIG_INVALID_KEY;
45f5ef3d77 755) warning(_("Could not update .gitmodules entry %s"), key);

submodule.c
b303ef65e7 524) the_repository->submodule_prefix :
060675d4fc 1378) strbuf_release(&gitdir);
183be9660a 1501) struct get_next_submodule_task *task = task_cb;
183be9660a 1505) get_next_submodule_task_release(task);
183be9660a 1532) return 0;
183be9660a 1536) goto out;
183be9660a 1551) return 0;

tree.c
a5dda2204a 108) if (repo_parse_commit(the_repository, commit))

worktree.c
3a3b9d8cde 495) return -1;
3a3b9d8cde 508) return -1;
3a3b9d8cde 517) return -1;
ab3e1f78ae 537) break;

wrapper.c
5efde212fc 70) die("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)",
5efde212fc 73) error("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)",

Commits introducing uncovered code:

Alban Gruin 009173ed7: sequencer: change complete_action() to use the refactored functions
Alban Gruin 06d813612: sequencer: fix a call to error() in transform_todo_file()
Alban Gruin 6424061be: rebase-interactive: rewrite edit_todo_list() to handle the initial edit
Alban Gruin 7ccfac40b: rebase--interactive: move transform_todo_file() to rebase--interactive.c
Alban Gruin 9787d17d4: sequencer: refactor rearrange_squash() to work on a todo_list
Alban Gruin b5d606240: sequencer: refactor sequencer_add_exec_commands() to work on a todo_list
Alban Gruin b74a37a5a: sequencer: refactor check_todo_list() to work on a todo_list
Alban Gruin b8dac44d1: sequencer: refactor skip_unnecessary_picks() to work on a todo_list
Antonio Ospite 45f5ef3d7: submodule: factor out a config_set_in_gitmodules_file_gently function
Antonio Ospite bcbc780d1: submodule: add a print_config_from_gitmodules() helper
Antonio Ospite c22b82014: submodule: support reading .gitmodules when it's not in the working tree
Derrick Stolee 1dcd9f204: midx: close multi-pack-index on repack
Derrick Stolee dc7d66433: packfile: close multi-pack-index in close_all_packs
Elijah Newren 387361a6a: merge-recursive: improve rename/rename(1to2)/add[/add] handling
Elijah Newren 4cdc48e41: merge-recursive: new function for better colliding conflict resolutions
Elijah Newren b58ae691c: merge-recursive: fix rename/add conflict handling
Junio C Hamano a5dda2204: treewide: apply cocci patch
Liam Beguin 0cce4a275: rebase -i -x: add exec commands via the rebase--helper
Linus Torvalds 74e8221b5: Add 'human' date format
Martin Koegler 5efde212f: zlib.c: use size_t for size
Nguyễn Thái Ngọc Duy 3a3b9d8cd: refs: new ref types to make per-worktree refs visible to all worktrees
Nguyễn Thái Ngọc Duy 58b2
Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic
On 10/25/2018 5:43 AM, Jeff King wrote:

On Thu, Oct 11, 2018 at 12:21:44PM -0400, Derrick Stolee wrote:

    2. INDEGREE: using the indegree_queue priority queue (ordered by maximizing the generation number), add one to the in-degree of each parent for each commit that is walked. Since we walk in order of decreasing generation number, we know that discovering an in-degree value of 0 means the value for that commit was not initialized, so should be initialized to two. (Recall that in-degree value "1" is what we use to say a commit is ready for output.) As we iterate the parents of a commit during this walk, ensure the EXPLORE walk has walked beyond their generation numbers.

I wondered how this would work for INFINITY. We can't know the order of a bunch of INFINITY nodes at all, so we never know when their in-degree values are "done". But if I understand the EXPLORE walk, we'd basically walk all of INFINITY down to something with a real generation number. Is that right?

But after that, I'm not totally clear on why we need this INDEGREE walk.

The INDEGREE walk is an important element of Kahn's algorithm. The final output order is dictated by peeling commits of "in-degree zero" to ensure all children are output before their parents. (Note: since we use literal 0 to mean "uninitialized", we peel commits when the indegree slab has value 1.)

This walk replaces the in-degree logic from sort_in_topological_order(). That method performs one walk that fills the indegree slab, then another walk that peels the commits with in-degree 0 and inserts them into a list.

I guess my big question here was: if we have generation numbers, do we need Kahn's algorithm? That is, in a fully populated set of generation numbers (i.e., no INFINITY), we could always just pick a commit with the highest generation number to show. So if we EXPLORE down to a real generation number in phase 1, why do we need to care about INDEGREE anymore?
Or am I wrong that we always have a real generation number (i.e., not INFINITY) after EXPLORE? (And if so, why is exploring to a real generation number a bad idea; presumably it's due to a worst-case that goes deeper than we'd otherwise need to here.)

The issue is that our final order (used by level 3) is unrelated to generation number. Yes, if we prioritized by generation number then we could output the commits in _some_ order that doesn't violate topological constraints. However, we are asking for a different priority than the generation number priority.

In the case of "--topo-order", we want to output the commits reachable from the second parent of a merge before the commits reachable from the first parent. However, in most cases the generation number of the first parent is higher than the second parent's (there are more things in the merge chain than in a small topic that got merged). The INDEGREE walk is what allows us to know when we can peel these commits while still respecting the priority we want at the end.

Thanks,
-Stolee
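To make the in-degree peeling discussed above concrete, here is a minimal Python sketch of that step of Kahn's algorithm. The graph, commit names, and the simple FIFO tie-break are made up for illustration; Git's real code uses commit slabs, a priority queue keyed by the desired output priority, and the literal-0-means-uninitialized offset.

```python
from collections import defaultdict, deque

def topo_order(parents, tips):
    """Emit commits so that every child appears before its parents.

    'parents' maps a commit name to its list of parents; 'tips' are
    the starting commits. A commit becomes ready for output only once
    all of its children have been output -- the INDEGREE idea.
    """
    # First pass: count how many children (incoming edges) each
    # reachable commit has.
    indegree = defaultdict(int)
    seen = set(tips)
    stack = list(tips)
    while stack:
        c = stack.pop()
        for p in parents.get(c, []):
            indegree[p] += 1
            if p not in seen:
                seen.add(p)
                stack.append(p)

    # Second pass: peel commits whose in-degree has dropped to zero.
    ready = deque(t for t in tips if indegree[t] == 0)
    out = []
    while ready:
        c = ready.popleft()
        out.append(c)
        for p in parents.get(c, []):
            indegree[p] -= 1
            if indegree[p] == 0:
                ready.append(p)
    return out

# Hypothetical history: merge M has parents A (first parent) and
# T (a topic), both on top of base B. B must come last even though
# its children may be discovered in any order.
history = {"M": ["A", "T"], "A": ["B"], "T": ["B"], "B": []}
print(topo_order(history, ["M"]))  # M first, B last
```

Note how generation numbers alone cannot dictate this order: A and T may share a generation number, and "--topo-order" additionally wants one side of the merge emitted before the other, which is exactly the priority that peeling by in-degree preserves.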
[PATCH] packfile: close multi-pack-index in close_all_packs
Whenever we delete pack-files from the object directory, we need to also delete the multi-pack-index that may refer to those objects. Sometimes, this connection is obvious, like during a repack. Other times, this is less obvious, like when gc calls a repack command and then does other actions on the objects, like write a commit-graph file.

The pattern we use to avoid out-of-date in-memory packed_git structs is to call close_all_packs(). This should also call close_midx(). Since we already pass an object store to close_all_packs(), this is a nicely scoped operation.

This fixes a test failure when running t6500-gc.sh with GIT_TEST_MULTI_PACK_INDEX=1.

Reported-by: Szeder Gábor
Signed-off-by: Derrick Stolee
---
Thanks for the report, Szeder! I think this is the correct fix, and more likely to be permanent across all builtins that run auto-GC. I based it on ds/test-multi-pack-index so it could easily be added on top.

-Stolee

 packfile.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/packfile.c b/packfile.c
index 841b36182f..37fcd8f136 100644
--- a/packfile.c
+++ b/packfile.c
@@ -339,6 +339,11 @@ void close_all_packs(struct raw_object_store *o)
 			BUG("want to close pack marked 'do-not-close'");
 		else
 			close_pack(p);
+
+	if (o->multi_pack_index) {
+		close_midx(o->multi_pack_index);
+		o->multi_pack_index = NULL;
+	}
 }
 
 /*

base-commit: 0465a50506023df0932fe0534fe6ac6712c0d854
-- 
2.17.1
Recommended configurations (was Re: [PATCH v1 2/2] reset: add new reset.quietDefault config setting)
On 10/23/2018 4:03 PM, Ævar Arnfjörð Bjarmason wrote:

[snip]

The --ahead-behind config setting stalled on-list before: https://public-inbox.org/git/36e3a9c3-f7e2-4100-1bfc-647b809a0...@jeffhostetler.com/

Now we have this similarly themed thing. I think we need to be mindful of how changes like this can add up to a very confusing UI. I.e. in this case I can see a "how to make git fast on large repos" post on stackoverflow in our future where the answer is setting a bunch of seemingly irrelevant config options like reset.quiet and status.aheadbehind=false etc.

So maybe we should take a step back and consider if the real thing we want is just some way for the user to tell git "don't work so hard at coming up with these values". That can also be smart, e.g. some "auto" setting that tweaks it based on estimated repo size, so even with the same config your tiny dotfiles.git will get "ahead/behind" reporting, but not when you cd into windows.git.

Generally, there are a lot of config settings that are likely in the "if you have a big repo, then you should use this" category. However, there is rarely a one-size-fits-all solution to these problems, just like there are different ways a repo can be "big" (working directory? number of commits? submodules?). I would typically expect that users with _really_ big repos have the resources to have an expert tweak the settings that are best for that data shape and share those settings with all the users.

In VFS for Git, we turn certain config settings on by default when mounting the repo [1], but others are signaled through warning messages (like the status.aheadBehind config setting [2]). We never upstreamed the status.aheadBehind config setting [2], but we _did_ upstream the command-line option as fd9b544 "status: add --[no-]ahead-behind to status and commit for V2 format".
We didn't want to change the expected output permanently, so we didn't add the config setting to our list of "required" settings, but instead created a list of optional settings [3]; these settings don't override the existing settings, so users can opt out. (Now that we have the commit-graph enabled and kept up-to-date, it may be time to revisit the importance of this setting.)

All of this is to say: it is probably a good idea to have some "recommended configuration" for big repos, but there will always be power users who want to tweak each and every one of these settings. I'm open to design ideas of how to store a list of recommended configurations and how to set a group of config settings with one command (say, a "git recommended-config [small|large|submodules]" builtin that fills the local config with the important settings).

Thanks,
-Stolee

[1] https://github.com/Microsoft/VFSForGit/blob/7daa9f1133764a4e4bd87014833fc2091e6702c1/GVFS/GVFS/CommandLine/GVFSVerb.cs#L79-L104
Code in VFS for Git that enables "required" config settings.

[2] https://github.com/Microsoft/git/commit/0cbe9e6b23e4d9008d4a1676e1dd6a87bdcd6ed5
status: add status.aheadbehind setting

[3] https://github.com/Microsoft/VFSForGit/blob/7daa9f1133764a4e4bd87014833fc2091e6702c1/GVFS/GVFS/CommandLine/GVFSVerb.cs#L120-L123
Code in VFS for Git that enables "optional" config settings.
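For concreteness, the profile-to-settings idea above could look something like the following Python sketch. Everything here is hypothetical: the profile names and the specific settings chosen per profile are illustrative placeholders, not a proposed or agreed-upon list.

```python
# Hypothetical profiles for a "git recommended-config <profile>"
# command. Each profile expands to a group of config settings that
# would be written to the local config in one shot.
PROFILES = {
    "small": {},
    "large": {
        "core.commitGraph": "true",
        "gc.writeCommitGraph": "true",
        "reset.quiet": "true",
    },
    "submodules": {
        "fetch.recurseSubmodules": "on-demand",
    },
}

def config_commands(profile):
    """Return the 'git config' invocations a profile would run."""
    settings = PROFILES[profile]
    return ["git config %s %s" % (key, value)
            for key, value in sorted(settings.items())]

for cmd in config_commands("large"):
    print(cmd)
```

Storing the mapping as data keeps the "power user" escape hatch: a profile only supplies defaults, and any individual setting can still be overridden afterward with a plain git config call.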
Re: [PATCH v3 09/12] commit-graph: convert to using the_hash_algo
On 10/21/2018 10:43 PM, brian m. carlson wrote:

Instead of using hard-coded constants for object sizes, use the_hash_algo to look them up. In addition, use a function call to look up the object ID version and produce the correct value. For now, we use version 1, which means to use the default algorithm used in the rest of the repository.

This patch looks good to me. Later, we will want to delete the commit-graph file during the "upgrade the repo to newhash" process, but that can wait.

Thanks,
-Stolee
Re: [PATCH] commit-reach: fix sorting commits by generation
On 10/23/2018 4:32 PM, Thomas Gummerer wrote:

On 10/22, René Scharfe wrote:

Am 22.10.2018 um 23:10 schrieb Thomas Gummerer:

Anyway, your implied question was discussed back then. Derrick wrote: "The reason to sort is to hopefully minimize the amount we walk by exploring the 'lower' commits first. This is a performance-only thing, not a correctness issue (which is why the bug exists). Even then, it is just a heuristic."

Thanks for pointing that out! Does b6723e4671 in pu (commit-reach: fix first-parent heuristic) change that picture?

Did a quick test and found no performance difference with and without the fix on top, i.e. proper sorting didn't seem to matter.

I just gave 'test-tool reach can_all_from_reach' a try and got the same results; with or without the fix the times are very similar. I haven't had time to follow the commit-graph series though, so I'm not sure I used it correctly.

Thanks for your attention here. I've been thinking a lot about this heuristic and have concluded the following two things are true:

(1) When we return 1, the order that we explore the 'from' commits does not change the explored set of commits.

(2) When we return 0, the order that we explore the 'to' commits will change the explored set, but it is difficult to say that the heuristic helps more than it hurts.

Item (1) is contrary to what I had thought when I first created the heuristic. The details are tricky, but essentially each DFS starting at a 'from' commit may be short-circuited due to a prior walk, but swapping the order of those two 'from' commits would lead to the same set of commits being explored (with the short-circuit happening in the other commit). The only change is that we can terminate early if we fully explore a 'from' commit and do not find a commit marked with 'with_flag'. In this sense, it would be best to explore the commits that are "closest" to the generation number cutoff first, as we can maybe find a negative answer earlier in the search.
Given that, we could remove the sort entirely and probably not see much of a performance hit. But since the set of 'from' commits is probably much smaller than the set of commits to explore, the sort is likely inexpensive. In conclusion: I cannot say that this sort is super-important.

As for the potential benefits in (2), I'll leave that to people who run git as a server who may have telemetry around fetch negotiation. How often do we actually say we need more rounds of negotiation? What kinds of data shapes matter there?

Thanks,
-Stolee
Re: [PATCH v4 6/7] revision.c: generation-based topo-order algorithm
On 10/22/2018 9:37 AM, Jakub Narebski wrote:

"Derrick Stolee via GitGitGadget" writes:

From: Derrick Stolee

The current --topo-order algorithm requires walking all reachable commits up front, topo-sorting them, all before outputting the first value. This patch introduces a new algorithm which uses stored generation numbers to incrementally walk in topo-order, outputting commits as we go. This can dramatically reduce the computation time to write a fixed number of commits, such as when limiting with "-n " or filling the first page of a pager.

When running a command like 'git rev-list --topo-order HEAD', Git performed the following steps:

1. Run limit_list(), which parses all reachable commits, adds them to a linked list, and distributes UNINTERESTING flags. If all unprocessed commits are UNINTERESTING, then it may terminate without walking all reachable commits. This does not occur if we do not specify UNINTERESTING commits.

2. Run sort_in_topological_order(), which is an implementation of Kahn's algorithm. It first iterates through the entire set of important commits and computes the in-degree of each (plus one, as we use 'zero' as a special value here). Then, we walk the commits in priority order, adding them to the priority queue if and only if their in-degree is one. As we remove commits from this priority queue, we decrement the in-degree of their parents.

Because in-degree has the very specific defined meaning of "number of children", i.e. the number of _incoming_ edges, I would say "if and only if their in-degree-plus-one is one". It is more exact, even if it looks a bit funny.

3. While we are peeling commits for output, get_revision_1() uses pop_commit on the full list of commits computed by sort_in_topological_order().

All right, so those are separate steps (separate walks): prepare and parse commits, topologically sort the list of commits from the previous step, output the sorted list of commits from the previous step.
I would rephrase your explanation above as: prepare and parse commits, compute in-degrees, and peel commits of in-degree zero.

In the new algorithm, these three steps correspond to three different commit walks. We run these walks simultaneously, and advance each only as far as necessary to satisfy the requirements of the 'higher order' walk.

What does 'higher order' walk mean: steps 3, 2, 1, in this order, i.e. output being the highest order, or something different?

Yes. We only walk "level 2" in order to satisfy how far we are in "level 3".

Sidenote: the new algorithm looks a bit like a Unix pipeline, where each step of the pipeline does not output much more than the next step needs / requests.

That's essentially the idea.

We know when we can pause each walk by using generation numbers from the commit-graph feature.

Do I understand it correctly that this is mainly used in Kahn's algorithm to find out, through the negative-cut index of generation number, which commits in the to-be-sorted list cannot have an in-degree of zero (or otherwise cannot be the next commit to be shown in output)?

In each step of the algorithm, we operate under the assumption that certain vertices have "all necessary information". In the case of "level 3", we need to know that all descendants were walked and our in-degree calculation is correct. We guarantee this by ensuring that "level 2" has walked beyond that commit's generation number. In the case of "level 2", we need to know that we have parsed all descendants and determined their simplifications (if necessary, such as in file-history) and whether they are UNINTERESTING. We guarantee this by ensuring that "level 1" has walked beyond that commit's generation number.

In the previous algorithm, these guarantees were handled by doing each step on all reachable commits before moving to the next level.
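As a small illustration of why "walked beyond that commit's generation number" is a valid stopping condition, here is a Python sketch (with a made-up commit graph) of the recursive generation-number definition and the invariant it provides: a child's generation number is strictly greater than each of its parents', so once a walk has passed generation N, every commit with generation greater than N has been handled.

```python
def generation(parents, commit, memo=None):
    """gen(c) = 1 + max(gen(p) for each parent p); 1 for a root."""
    if memo is None:
        memo = {}
    if commit not in memo:
        gens = [generation(parents, p, memo) for p in parents.get(commit, [])]
        memo[commit] = 1 + max(gens) if gens else 1
    return memo[commit]

# Hypothetical history: A is the root; M merges the two lines on top.
history = {"A": [], "B": ["A"], "C": ["B"], "D": ["A"], "M": ["C", "D"]}

# The invariant that justifies pausing a lower-level walk: a child's
# generation always exceeds its parents' generations.
for child, ps in history.items():
    for p in ps:
        assert generation(history, child) > generation(history, p)

print(generation(history, "M"))  # one more than max(gen(C), gen(D))
```

In this sketch the "pause" rule becomes: if a walk has processed everything with generation above N, then any commit with generation above N already has all the information its higher-level walk needs.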
Recall that the generation number of a commit satisfies:

* If the commit has at least one parent, then the generation number is one more than the maximum generation number among its parents.

* If the commit has no parent, then the generation number is one.

There are two special generation numbers:

* GENERATION_NUMBER_INFINITY: this value is 0xFFFFFFFF and indicates that the commit is not stored in the commit-graph and the generation number was not previously calculated.

* GENERATION_NUMBER_ZERO: this value (0) is a special indicator to say that the commit-graph was generated by a version of Git that does not compute generation numbers (such as v2.18.0).

Since we use generation_numbers_enabled() before using the new algorithm, we do not need to worry about GENERATION_NUMBER_ZERO. However, the existence of GENERATION_NUMBER_INFINITY implies the following weaker statement t
[RFC PATCH] revision.c: use new algorithm in A..B case
Signed-off-by: Derrick Stolee
---
I just wanted to mention that, in order to use the new logic for 'git log --topo-order A..B', we just need the following patch. It removes the extra line that sets 'revs->limited' to 1, which triggers the old logic.

You can use this for comparison purposes, but I'm not ready to do this until more performance testing is done for this case. Since these comparison commands are already pretty fast when the diff is small, there is less urgency to improve performance here.

Thanks,
-Stolee

 revision.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/revision.c b/revision.c
index 472f3994e3..8e5656f7b4 100644
--- a/revision.c
+++ b/revision.c
@@ -278,10 +278,8 @@ static struct commit *handle_commit(struct rev_info *revs,
 		if (parse_commit(commit) < 0)
 			die("unable to parse commit %s", name);
-		if (flags & UNINTERESTING) {
+		if (flags & UNINTERESTING)
 			mark_parents_uninteresting(commit);
-			revs->limited = 1;
-		}
 		if (revs->sources) {
 			char **slot = revision_sources_at(revs->sources, commit);
-- 
2.19.1
Re: [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic
On 10/21/2018 9:12 PM, Junio C Hamano wrote:

Jakub Narebski writes:

So if revs->limited is set (but not because revs->topo_order is set), which means A..B queries, we will still be using the old algorithm. All right, though I wonder if it could be improved in the future (perhaps with the help of other graph labelling / indices than generation numbers, maybe a positive-cut index). Do you have an idea why there is no improvement with the new code in this case? I didn't get the impression that it would not be possible to improve the "--topo A..B" case by using generation numbers from this series. Isn't it just because the necessary code has not been written yet?

In addition to what is needed for "--topo P1 P2 P3..." (all positive), a limited walk needs to notice the bottom boundary and stop traversal. Having generation numbers would make it slightly easier than without, as you know that a positive commit you have will not be marked UNINTERESTING due to a negative commit whose ancestors have not been explored, as long as that negative commit has a higher generation number. But you still need to adjust the traversal logic to properly terminate upon hitting an UNINTERESTING one, and also propagate the bit down the history, which is not needed at all if you only want to support the "positive only" case.

Actually, the code has been written, but the problem is the same as the performance issue when I made paint_down_to_common() use generation numbers: the algorithm for deciding what is in the set "reachable from A but not reachable from B" uses commit-date order as a heuristic to avoid walking the entire graph. Yes, the revision parameters specify "limited" in order to call "limit_list()", but it uses the same algorithm to determine the reachable set difference.

You can test this yourself!
Run the following two commands in the Git repository using v2.19.1:

time git log --topo-order -10 master >/dev/null
time git log --topo-order -10 maint..master >/dev/null

I get 0.39s for the first call and 0.01s for the second. (Note: I specified "-10" to ensure we are only writing 10 commits and the output size does not factor into the time.) This is because the first walks the entire history, while the second uses the heuristic walk to identify a much smaller subgraph for the topo-order algorithm to use.

Just as before, by using this algorithm for the B..A case, we are adding an extra restriction on the algorithm: always be correct. This results in us walking a larger set (everything reachable from B or A with generation number at least the smallest generation number of a commit reachable from only one side).

I believe this can be handled by using a smarter generation number (one that relies on commit date as a heuristic, but still has enough information to guarantee topological relationships), and I've already started testing a few of these directions. It is possible now that we have concrete graph algorithms to use on real repositories. I hope to share a report on my findings in a couple weeks. I'll include how using this algorithm compares to the old algorithm in the B..A case.

Thanks,
-Stolee
Re: [PATCH v4 3/7] test-reach: add rev-list tests
On 10/21/2018 6:21 AM, Jakub Narebski wrote:

"Derrick Stolee via GitGitGadget" writes:

From: Derrick Stolee

The rev-list command is critical to Git's functionality. Ensure it works in the three commit-graph environments constructed in t6600-test-reach.sh. Here are a few important types of rev-list operations:

* Basic: git rev-list --topo-order HEAD
* Range: git rev-list --topo-order compare..HEAD
* Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
* Symmetric Difference: git rev-list --topo-order compare...HEAD

Could you remind us here which of those operations will be using generation numbers after this patch series?

For this series, we are focused only on --topo-order with a single start position. The versions that use a compare branch still use the old logic. In the future, I would like to use the new logic for these other modes.

+test_expect_success 'rev-list: basic topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
+		commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \
+		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
+		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
+	>expect &&
+	run_three_modes git rev-list --topo-order commit-6-6
+'

I wonder if this test could be made easier to write and less error-prone, e.g. creating it from ASCII-art graphics. But it is good enough.

I did lay out the branch names in a grid layout similar to the commit-graph layout. It's easier to see the purposeful layout in the comparison sections where some commits don't appear in the output.

Thanks,
-Stolee
Re: What's cooking in git.git (Oct 2018, #04; Fri, 19)
On 10/19/2018 2:02 AM, Junio C Hamano wrote:

* ds/ci-commit-graph-and-midx (2018-10-19) 1 commit
- ci: add optional test variables

One of our CI tests to run with "unusual/experimental/random" settings now also uses commti-graph and midx. Will merge to 'next'.

s/commti-graph/commit-graph/

* ds/test-multi-pack-index (2018-10-09) 3 commits
- multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX
- midx: close multi-pack-index on repack
- midx: fix broken free() in close_midx()

Tests for the recently introduced multi-pack index machinery. Expecting a reroll. cf. <8b5dbe3d-b382-bf48-b524-d9e8a074a...@gmail.com>

A reroll exists: https://public-inbox.org/git/pull.27.v2.git.gitgitgad...@gmail.com/

Thanks,
-Stolee
Re: [PATCH 1/1] commit-reach: fix first-parent heuristic
On 10/19/2018 1:24 AM, Junio C Hamano wrote:

"Derrick Stolee via GitGitGadget" writes:

We can also re-run the performance tests from commit 4fbcca4e "commit-reach: make can_all_from_reach... linear". Performance was measured on the Linux repository using 'test-tool reach can_all_from_reach'. The input included rows seeded by tag values. The "small" case included X-rows as v4.[0-9]* and Y-rows as v3.[0-9]*. This mimics a (very large) fetch that says "I have all major v3 releases and want all major v4 releases." The "large" case included X-rows as "v4.*" and Y-rows as "v3.*". This adds all release-candidate tags to the set, which does not greatly increase the number of objects that are considered, but does increase the number of 'from' commits, demonstrating the quadratic nature of the previous code.

These micro-benchmarks are interesting, but we should also remember to describe the impact of the bug being fixed in the larger picture. What end-user visible operations are affected? Computing merge-base? Finding if a push fast-forwards? Something else?

Sorry about that. Here is a description that could be inserted into the commit message:

The can_all_from_reach() method synthesizes the logic for one iteration of can_all_from_reach_with_flags(), which is used by the server during fetch negotiation to determine if more haves/wants are needed. The logic is also used by is_descendant_of(), which is used to check if a force-push is required and in 'git branch --contains'.

Thanks,
-Stolee
Re: [PATCH] multi-pack-index: avoid dead store for struct progress
On 10/18/2018 2:59 PM, Carlo Marcelo Arenas Belón wrote:

it is initialized unconditionally by a call to start_progress below.

Signed-off-by: Carlo Marcelo Arenas Belón
---
 midx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index ea2f3ffe2e..4fac0cd08a 100644
--- a/midx.c
+++ b/midx.c
@@ -941,7 +941,7 @@ static void midx_report(const char *fmt, ...)
 int verify_midx_file(const char *object_dir)
 {
 	uint32_t i;
-	struct progress *progress = NULL;
+	struct progress *progress;
 	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
 
 	verify_midx_error = 0;

This makes sense as a cleanup. Is there a tool that reports a wasted initialization that you used to find this?

-Stolee
[PATCH 1/1] commit-reach: fix first-parent heuristic
From: Derrick Stolee

The algorithm in can_all_from_reach_with_flags() performs a depth-first search, terminated by generation number, intending to use a heuristic that "important" commits are found in the first-parent history. This heuristic is valuable in scenarios like fetch negotiation.

However, there is a problem! After the search finds a target commit, it should pop all commits off the stack and mark them as "can reach". This logic is incorrect, so the algorithm instead walks all reachable commits above the generation-number cutoff. The existing algorithm is still an improvement over the previous algorithm, as the worst-case complexity went from quadratic to linear. The performance measurement at the time was good, but not dramatic. By fixing this heuristic, we reduce the number of walked commits.

We can also re-run the performance tests from commit 4fbcca4e "commit-reach: make can_all_from_reach... linear". Performance was measured on the Linux repository using 'test-tool reach can_all_from_reach'. The input included rows seeded by tag values. The "small" case included X-rows as v4.[0-9]* and Y-rows as v3.[0-9]*. This mimics a (very large) fetch that says "I have all major v3 releases and want all major v4 releases." The "large" case included X-rows as "v4.*" and Y-rows as "v3.*". This adds all release-candidate tags to the set, which does not greatly increase the number of objects that are considered, but does increase the number of 'from' commits, demonstrating the quadratic nature of the previous code.
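The effect of marking the popped stack entries can be modeled with a short Python sketch. This is a simplified illustration with hypothetical commit names: it drops the generation-number cutoff and the explicit stack of the real C code, and uses a memo table in the role of the RESULT flag, so that once a commit is known to reach a target, no later search re-walks its ancestry.

```python
def can_all_from_reach(parents, from_commits, targets):
    """Check that every 'from' commit can reach some target commit.

    'can' plays the role of the RESULT flag: a commit proven to reach
    a target short-circuits every later search that touches it,
    keeping the total work linear in the number of commits examined.
    """
    can = {}

    def reach(commit):
        if commit not in can:
            can[commit] = commit in targets or any(
                reach(p) for p in parents.get(commit, []))
        return can[commit]

    return all(reach(f) for f in from_commits)

# Hypothetical linear history A <- B <- C plus an unrelated root X.
history = {"C": ["B"], "B": ["A"], "A": [], "X": []}
print(can_all_from_reach(history, ["C", "B"], {"A"}))  # True
print(can_all_from_reach(history, ["C", "X"], {"A"}))  # False
```

In the buggy version, only the commit carrying 'with_flag' stopped a later search, so each of the many 'from' commits could re-explore the same ancestry, which is the quadratic behavior the measurements below eliminate.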
Small Case: 4fbcca4e~1: 0.85 s 4fbcca4e: 0.26 s (num_walked: 1,011,035) HEAD: 0.14 s (num_walked: 8,601) Large Case: 4fbcca4e~1: 24.0 s 4fbcca4e: 0.12 s (num_walked: 503,925) HEAD: 0.06 s (num_walked: 217,243) Signed-off-by: Derrick Stolee --- commit-reach.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/commit-reach.c b/commit-reach.c index 9f79ce0a22..79419be8af 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -593,8 +593,10 @@ int can_all_from_reach_with_flag(struct object_array *from, while (stack) { struct commit_list *parent; - if (stack->item->object.flags & with_flag) { + if (stack->item->object.flags & (with_flag | RESULT)) { pop_commit(&stack); + if (stack) + stack->item->object.flags |= RESULT; continue; } -- gitgitgadget
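For readers without the surrounding C file, the shape of the fixed walk can be sketched in Python. This is a hedged simplification of can_all_from_reach_with_flag() (single 'from' commit, flags as a dict, invented names), not the actual implementation:

```python
# Sketch of the fixed depth-first walk: when the stack's top commit is already
# known to reach a target (with_flag or RESULT), pop it and let the commit
# beneath it on the stack inherit RESULT, so the whole chain back to the
# starting commit is marked without re-walking its other descendants.
WITH_FLAG, RESULT = 1, 2

def can_reach(start, parents, flags):
    """parents: commit -> list of parents; flags: commit -> bitmask (mutated)."""
    stack = [start]
    visited = set()
    while stack:
        top = stack[-1]
        if flags.get(top, 0) & (WITH_FLAG | RESULT):
            stack.pop()
            if stack:  # the fix: propagate success down the stack
                flags[stack[-1]] = flags.get(stack[-1], 0) | RESULT
            continue
        visited.add(top)
        todo = [p for p in parents.get(top, []) if p not in visited]
        if todo:
            stack.append(todo[0])  # first-parent heuristic: try it first
        else:
            stack.pop()            # dead end: no target below this commit
    return bool(flags.get(start, 0) & (WITH_FLAG | RESULT))
```

Without the two propagation lines, each query re-walks everything above the generation cutoff; with them, a successful probe marks the entire stack in one cascade of pops.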
[PATCH 0/1] commit-reach: fix first-parent heuristic
I originally reported this fix [1] after playing around with the trace2 series for measuring performance. Since trace2 isn't merging quickly, I pulled the performance fix patch out and am sending it on its own. The only difference here is that we don't have the tracing to verify the performance fix in the test script. See the patch message for details about the fix. Thanks, -Stolee [1] https://public-inbox.org/git/20180906151309.66712-7-dsto...@microsoft.com/ [RFC PATCH 6/6] commit-reach: fix first-parent heuristic Derrick Stolee (1): commit-reach: fix first-parent heuristic commit-reach.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) base-commit: a4b8ab5363a32f283a61ef3a962853556d136c0e Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-51%2Fderrickstolee%2Ffirst-parent-heuristic-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-51/derrickstolee/first-parent-heuristic-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/51 -- gitgitgadget
Re: [PATCH v2 13/13] commit-graph: specify OID version for SHA-256
On 10/17/2018 8:06 PM, brian m. carlson wrote: On Wed, Oct 17, 2018 at 04:31:19PM +0200, Duy Nguyen wrote: On Wed, Oct 17, 2018 at 12:44 AM brian m. carlson wrote: Honestly, anything in the .git directory that is not the v3 pack indexes or the loose object file should be in exactly one hash algorithm. We could simply just leave this value at 1 all the time and ignore the field, since we already know what algorithm it will use. In this particular case, I agree, but not as a general principle. It's nice to have independence for fsck-like tools. I don't know if we have a tool that simply validates commit-graph file format (and not trying to access any real object). But for such a tool, I guess we can just pass the hash algorithm from command line. The user would have to guess a bit. I'm going to drop this patch for now. I'll send a follow-up series later which bumps the format version for this and the multi-pack index and serializes them with the four-byte value. I probably should have caught this earlier, but unfortunately I don't always have the time to look at every series that hits the list. We should coordinate before incrementing the version number. I was working on making the file formats incremental (think split-index) but couldn't come up with a way to do it without incrementing the file format. It would be best to combine these features into v2 of each format. Thanks, -Stolee
Re: [PATCH 0/3] Use commit-graph by default
On 10/17/2018 11:47 PM, Junio C Hamano wrote: If I recall correctly, one more task that was discussed but hasn't been addressed well is how the generation and incremental update of it should integrate with the normal repository maintenance workflow (perhaps "gc --auto"). If we are going to turn it on by default, it would be good to see if we can avoid multiple independent walks done over the same history graph by repack, prune and now commit-graph, before such a change happens. I don't remember a discussion on this, but it is an interesting point. I'll give it some thought to see if we can combine these walks. Thanks, -Stolee
Re: [PATCH 1/3] t6501: use --quiet when testing gc stderr
On 10/18/2018 1:23 AM, Junio C Hamano wrote: "Derrick Stolee via GitGitGadget" writes: From: Derrick Stolee The test script t6501-freshen-objects.sh has some tests that care if 'git gc' has any output to stderr. This is intended to say that no warnings occurred related to broken links. However, when we have operations that output progress (like writing the commit-graph) this causes the test to fail. I see that the descriptor #2 is redirected into a regular file. Why should we be writing the progress indicator in that case in the first place? Shouldn't we be doing the usual "are we showing this to an interactive terminal?" test? This code from builtin/gc.c makes it look like we are doing that: if (gc_write_commit_graph) write_commit_graph_reachable(get_object_directory(), 0, !quiet && !daemonized); But really, daemonized is only for when running in the background. Do you have an example of a builtin that checks this interactive terminal behavior? Thanks, -Stolee
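The "interactive terminal" test Junio alludes to is the usual isatty(2)-on-stderr check. A hedged sketch of that decision (hypothetical helper, not git's actual code):

```python
import os
import sys

# Hypothetical sketch: progress output is wanted only when a human will see
# it, i.e. stderr is a TTY, and the user did not ask for --quiet.
def want_progress(quiet=False):
    try:
        return not quiet and os.isatty(sys.stderr.fileno())
    except (OSError, ValueError):
        return False  # stderr replaced by something with no real descriptor
```

With stderr redirected into a regular file, as in the t6501 tests ('2>stderr'), such a check returns False, which is Junio's point: progress would then be suppressed without needing --quiet.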
[PATCH 3/3] commit-graph: Use commit-graph by default
From: Derrick Stolee The config setting "core.commitGraph" enables using the commit-graph file to accelerate commit walks through parsing speed and generation numbers. The setting "gc.writeCommitGraph" enables writing the commit-graph file on every non-trivial 'git gc' operation. Together, they help users automatically improve their performance. By setting these config variables to true by default, we make the commit-graph feature an "opt-out" feature instead of "opt-in". Signed-off-by: Derrick Stolee --- Documentation/config.txt | 4 ++-- builtin/gc.c | 2 +- commit-graph.c | 6 +++--- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index f6f4c21a54..dc5ee7c145 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -923,7 +923,7 @@ the `GIT_NOTES_REF` environment variable. See linkgit:git-notes[1]. core.commitGraph:: If true, then git will read the commit-graph file (if it exists) - to parse the graph structure of commits. Defaults to false. See + to parse the graph structure of commits. Defaults to true. See linkgit:git-commit-graph[1] for more information. core.useReplaceRefs:: @@ -1639,7 +1639,7 @@ gc.writeCommitGraph:: If true, then gc will rewrite the commit-graph file when linkgit:git-gc[1] is run. When using linkgit:git-gc[1] '--auto' the commit-graph will be updated if housekeeping is - required. Default is false. See linkgit:git-commit-graph[1] + required. Default is true. See linkgit:git-commit-graph[1] for details. 
gc.logExpiry:: diff --git a/builtin/gc.c b/builtin/gc.c index 871a56f1c5..77e7413f94 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -41,7 +41,7 @@ static int aggressive_depth = 50; static int aggressive_window = 250; static int gc_auto_threshold = 6700; static int gc_auto_pack_limit = 50; -static int gc_write_commit_graph; +static int gc_write_commit_graph = 1; static int detach_auto = 1; static timestamp_t gc_log_expire_time; static const char *gc_log_expire = "1.day.ago"; diff --git a/commit-graph.c b/commit-graph.c index a686758603..a459272466 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -232,15 +232,15 @@ static int prepare_commit_graph(struct repository *r) { struct alternate_object_database *alt; char *obj_dir; - int config_value; + int config_value = 1; if (r->objects->commit_graph_attempted) return !!r->objects->commit_graph; r->objects->commit_graph_attempted = 1; + repo_config_get_bool(r, "core.commitgraph", &config_value); if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) && - (repo_config_get_bool(r, "core.commitgraph", &config_value) || - !config_value)) + !config_value) /* * This repository is not configured to use commit graphs, so * do not load one. (But report commit_graph_attempted anyway -- gitgitgadget
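The commit-graph.c hunk above is the "default on unless configured off" pattern: config_value starts at 1, and repo_config_get_bool() overwrites it only when core.commitgraph is actually set. A hedged Python sketch of that pattern, with a dict standing in for the config machinery:

```python
# Sketch of the opt-out pattern: the default lives in the initializer, and the
# config lookup overwrites it only when the key is present. The test variable
# (GIT_TEST_COMMIT_GRAPH in the patch) still forces the feature on regardless.
def use_commit_graph(config, test_env=False):
    config_value = config.get("core.commitgraph", True)  # default flipped to on
    return bool(test_env or config_value)
```

Flipping the default this way means existing "core.commitgraph = false" settings keep working, while unset repositories become opt-out instead of opt-in.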
[PATCH 1/3] t6501: use --quiet when testing gc stderr
From: Derrick Stolee The test script t6501-freshen-objects.sh has some tests that care if 'git gc' has any output to stderr. This is intended to say that no warnings occurred related to broken links. However, when we have operations that output progress (like writing the commit-graph) this causes the test to fail. Use 'git gc --quiet' to avoid these progress indicators from causing a test failure. Signed-off-by: Derrick Stolee --- t/t6501-freshen-objects.sh | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/t/t6501-freshen-objects.sh b/t/t6501-freshen-objects.sh index 033871ee5f..0973130f06 100755 --- a/t/t6501-freshen-objects.sh +++ b/t/t6501-freshen-objects.sh @@ -137,7 +137,7 @@ test_expect_success 'do not complain about existing broken links (commit)' ' some message EOF commit=$(git hash-object -t commit -w broken-commit) && - git gc 2>stderr && + git gc --quiet 2>stderr && verbose git cat-file -e $commit && test_must_be_empty stderr ' @@ -147,7 +147,7 @@ test_expect_success 'do not complain about existing broken links (tree)' ' 100644 blob 0003foo EOF tree=$(git mktree --missing stderr && + git gc --quiet 2>stderr && git cat-file -e $tree && test_must_be_empty stderr ' @@ -162,7 +162,7 @@ test_expect_success 'do not complain about existing broken links (tag)' ' this is a broken tag EOF tag=$(git hash-object -t tag -w broken-tag) && - git gc 2>stderr && + git gc --quiet 2>stderr && git cat-file -e $tag && test_must_be_empty stderr ' -- gitgitgadget
[PATCH 0/3] Use commit-graph by default
The commit-graph feature is starting to stabilize. Based on what is in master right now, we have: Git 2.18: * Ability to write commit-graph (requires user interaction). * Commit parsing is faster when commit-graph exists. * Must have core.commitGraph true to use. Git 2.19: * Ability to write commit-graph on GC with gc.writeCommitGraph. * Generation numbers written in commit-graph * A few reachability algorithms make use of generation numbers. (queued for) master: * The test suite passes with GIT_TEST_COMMIT_GRAPH=1 * 'git commit-graph write' has progress indicators. * The commit-graph is automatically disabled when grafts or replace-objects exist. There are some other things coming that are in review (like 'git log --graph' speedups), but it is probably time to consider enabling the commit-graph by default. This series does that. For timing, I'm happy to leave this queued for a merge after the Git 2.20 release. There are enough things in master to justify not enabling this by default until that goes out and more people use it. PATCH 3/3 is rather simple, and is the obvious thing to do to achieve enabling these config values by default. PATCH 1/3 is a required change to make the test suite work with this change. This change isn't needed with GIT_TEST_COMMIT_GRAPH=1 because the commit-graph is up-to-date for these 'git gc' calls, so no progress is output. PATCH 2/3 is also a necessary evil, since we already had to disable GIT_TEST_COMMIT_GRAPH for some tests, we now also need to turn off core.commitGraph. 
Thanks, -Stolee Derrick Stolee (3): t6501: use --quiet when testing gc stderr t: explicitly turn off core.commitGraph as needed commit-graph: Use commit-graph by default Documentation/config.txt| 4 ++-- builtin/gc.c| 2 +- commit-graph.c | 6 +++--- t/t0410-partial-clone.sh| 3 ++- t/t5307-pack-missing-commit.sh | 3 ++- t/t6011-rev-list-with-bad-commit.sh | 3 ++- t/t6024-recursive-merge.sh | 3 ++- t/t6501-freshen-objects.sh | 6 +++--- 8 files changed, 17 insertions(+), 13 deletions(-) base-commit: a4b8ab5363a32f283a61ef3a962853556d136c0e Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-50%2Fderrickstolee%2Fcommit-graph-default-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-50/derrickstolee/commit-graph-default-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/50 -- gitgitgadget
[PATCH 2/3] t: explicitly turn off core.commitGraph as needed
From: Derrick Stolee There are a few tests that already require GIT_TEST_COMMIT_GRAPH=0 as they rely on an interaction with the commits in the object store that is circumvented by parsing commit information from the commit-graph instead. Before enabling core.commitGraph as true by default, explicitly turn the setting off for these tests. Signed-off-by: Derrick Stolee --- t/t0410-partial-clone.sh| 3 ++- t/t5307-pack-missing-commit.sh | 3 ++- t/t6011-rev-list-with-bad-commit.sh | 3 ++- t/t6024-recursive-merge.sh | 3 ++- 4 files changed, 8 insertions(+), 4 deletions(-) diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh index cfd0655ea1..f5277fafb1 100755 --- a/t/t0410-partial-clone.sh +++ b/t/t0410-partial-clone.sh @@ -193,7 +193,8 @@ test_expect_success 'rev-list stops traversal at missing and promised commit' ' git -C repo config core.repositoryformatversion 1 && git -C repo config extensions.partialclone "arbitrary string" && - GIT_TEST_COMMIT_GRAPH=0 git -C repo rev-list --exclude-promisor-objects --objects bar >out && + GIT_TEST_COMMIT_GRAPH=0 git -c core.commitGraph=false \ + -C repo rev-list --exclude-promisor-objects --objects bar >out && grep $(git -C repo rev-parse bar) out && ! 
grep $FOO out ' diff --git a/t/t5307-pack-missing-commit.sh b/t/t5307-pack-missing-commit.sh index dacb440b27..dc4c19d0aa 100755 --- a/t/t5307-pack-missing-commit.sh +++ b/t/t5307-pack-missing-commit.sh @@ -16,7 +16,8 @@ test_expect_success setup ' obj=$(git rev-parse --verify tag3) && fanout=$(expr "$obj" : "\(..\)") && remainder=$(expr "$obj" : "..\(.*\)") && - rm -f ".git/objects/$fanout/$remainder" + rm -f ".git/objects/$fanout/$remainder" && + git config core.commitGraph false ' test_expect_success 'check corruption' ' diff --git a/t/t6011-rev-list-with-bad-commit.sh b/t/t6011-rev-list-with-bad-commit.sh index 545b461e51..da6949743d 100755 --- a/t/t6011-rev-list-with-bad-commit.sh +++ b/t/t6011-rev-list-with-bad-commit.sh @@ -42,7 +42,8 @@ test_expect_success 'corrupt second commit object' \ ' test_expect_success 'rev-list should fail' ' - test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git rev-list --all > /dev/null + test_must_fail env GIT_TEST_COMMIT_GRAPH=0 \ + git -c core.commitGraph=false rev-list --all > /dev/null ' test_expect_success 'git repack _MUST_ fail' \ diff --git a/t/t6024-recursive-merge.sh b/t/t6024-recursive-merge.sh index 27c7de90ce..fccdf96f13 100755 --- a/t/t6024-recursive-merge.sh +++ b/t/t6024-recursive-merge.sh @@ -61,7 +61,8 @@ GIT_AUTHOR_DATE="2006-12-12 23:00:08" git commit -m F ' test_expect_success 'combined merge conflicts' ' - test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G + test_must_fail env GIT_TEST_COMMIT_GRAPH=0 \ + git -c core.commitGraph=false merge -m final G ' cat > expect << EOF -- gitgitgadget
Re: commit-graph is cool (overcoming add_missing_tags() perf issues)
On 10/17/2018 2:00 PM, Elijah Newren wrote: Hi, Just wanted to give a shout-out for the commit-graph work and how impressive it is. I had an internal report from a user that git pushes containing only one new tiny commit were taking over a minute (in a moderate size repo with good network connectivity). After digging for a while, I noticed three unusual things about the repo[1]: * he had push.followTags set to true * upstream repo had about 20k tags (despite only 55k commits) * his repo had an additional 2.5k tags, but none of these were in the history of the branches he was pushing and thus would not be included in any pushes. Digging in, almost all the time was CPU-bound and spent in add_missing_tags()[2]. If I'm reading the code correctly, it appears that function loops over each tag, calling in_merge_bases_many() once per tag. Thus, for his case, we were potentially walking all of history of the main branch 2.5k times. That seemed rather suboptimal. Thanks for the report. I made a note to inspect add_missing_tags() for more improvement in the future. Before attempting to optimize, I decided to try out the commit-graph with a version of git from pu. While I expected a speed-up, I was a bit surprised that it was a factor of over 100; dropping the time for local dry-run push[2] to sub-second. A quick look suggests that commit-graph doesn't fix the fact that we call in_merge_bases_many() N times from add_missing_tags() and thus likely need to do N merge base computations, it just makes each of the N much faster. So, perhaps there's still another scaling issue we'll eventually need to address, but for now, I'm pretty excited about commit-graph. Without the commit-graph, you are getting a quadratic problem (N commits * T tags), but with the commit-graph you are also getting the benefit of generation numbers, so the "N commits" is actually likely _zero_ for most tags, because the tags have strictly lower generation number. 
In those cases, we can terminate without any walk at all. Thanks! -Stolee
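The reason the walk is often zero-length: an ancestor always has a strictly smaller generation number than its descendants, so a reachability query can prune, or skip entirely, anything at or below the needle's generation. A hedged sketch of that cutoff (simplified; not the actual in_merge_bases_many() code):

```python
# If every tip has generation <= gen[needle], the needle cannot be an ancestor
# of any tip and the query answers "no" without walking a single commit.
# Otherwise, the walk from the tips never descends below gen[needle].
def reachable_from_tips(needle, tips, parents, gen):
    if all(gen[t] <= gen[needle] for t in tips):
        return False                      # cutoff: no walk at all
    stack, seen = list(tips), set()
    while stack:
        c = stack.pop()
        if c == needle:
            return True
        if c in seen or gen[c] <= gen[needle]:
            continue                      # below the cutoff, prune this branch
        seen.add(c)
        stack.extend(parents.get(c, []))
    return False
```

For the scenario above, the 2.5k extra tags tend to fail the first test immediately, turning 2.5k history walks into 2.5k constant-time comparisons.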
Git Test Coverage Report (Wednesday, Oct 17)
513) ieot->nr = nr; 3255089ada 3514) for (i = 0; i < nr; i++) { 3255089ada 3515) ieot->entries[i].offset = get_be32(index); 3255089ada 3516) index += sizeof(uint32_t); 3255089ada 3517) ieot->entries[i].nr = get_be32(index); 3255089ada 3518) index += sizeof(uint32_t); 3255089ada 3521) return ieot; 3255089ada 3524) static void write_ieot_extension(struct strbuf *sb, struct index_entry_offset_table *ieot) 3255089ada 3530) put_be32(&buffer, IEOT_VERSION); 3255089ada 3531) strbuf_add(sb, &buffer, sizeof(uint32_t)); 3255089ada 3534) for (i = 0; i < ieot->nr; i++) { 3255089ada 3537) put_be32(&buffer, ieot->entries[i].offset); 3255089ada 3538) strbuf_add(sb, &buffer, sizeof(uint32_t)); 3255089ada 3541) put_be32(&buffer, ieot->entries[i].nr); 3255089ada 3542) strbuf_add(sb, &buffer, sizeof(uint32_t)); 3255089ada 3544) } revision.c 2abf350385 1525) if (ce_path_match(istate, ce, &revs->prune_data, NULL)) { 2abf350385 1531) while ((i+1 < istate->cache_nr) && 2abf350385 1532) ce_same_name(ce, istate->cache[i+1])) split-index.c e3d837989e 335) ce->ce_flags |= CE_UPDATE_IN_BASE; transport-helper.c fb19d32f05 643) if (!data->connect && !data->stateless_connect) transport.c wt-status.c f3bd35fa0d 671) s->committable = 1; 73ba5d78b4 1953) if (s->state.rebase_in_progress || 73ba5d78b4 1954) s->state.rebase_interactive_in_progress) 73ba5d78b4 1955) branch_name = s->state.onto; 73ba5d78b4 1956) else if (s->state.detached_from) 73ba5d78b4 1957) branch_name = s->state.detached_from; Commits introducing uncovered code: Ben Peart 3255089ad: ieot: add Index Entry Offset Table (IEOT) extension Ben Peart 3b1d9e045: eoie: add End of Index Entry (EOIE) extension Ben Peart 77ff1127a: read-cache: load cache entries on worker threads Ben Peart abb4bb838: read-cache: load cache extensions on a worker thread Ben Peart c780b9cfe: config: add new index.threads config setting Josh Steadmon e001fd3a5: archive: implement protocol v2 archive command Josh Steadmon fb19d32f0: archive: allow archive over 
HTTP(S) with proto v2 Matthew DeVore 7c0fe330d: rev-list: handle missing tree objects properly Matthew DeVore bc5975d24: list-objects-filter: implement filter tree:0 Matthew DeVore cc0b05a4c: list-objects-filter-options: do not over-strbuf_init Matthew DeVore f447a499d: list-objects: store common func args in struct Nguyễn Thái Ngọc Duy 252d079cb: read-cache.c: optimize reading index format v4 Nguyễn Thái Ngọc Duy 26c7d0678: help -a: improve and make --verbose default Nguyễn Thái Ngọc Duy 2abf35038: revision.c: remove implicit dependency on the_index Nguyễn Thái Ngọc Duy a470beea3: blame.c: rename "repo" argument to "r" Nguyễn Thái Ngọc Duy ae9af1228: status: show progress bar if refreshing the index takes too long Nguyễn Thái Ngọc Duy b78ea5fc3: diff.c: reduce implicit dependency on the_index René Scharfe 8b2f8cbcb: oidset: use khash Stephen P. Smith 73ba5d78b: roll wt_status_state into wt_status and populate in the collect phase Stephen P. Smith f3bd35fa0: wt-status.c: set the committable flag in the collect phase SZEDER Gábor e3d837989: split-index: don't compare cached data of entries already marked for split index Uncovered Code in 'master' (a4b8ab5) but not in 5a0cc8a --- builtin/gc.c 3029970275 builtin/gc.c 461) ret = error_errno(_("cannot stat '%s'"), gc_log_path); 3029970275 builtin/gc.c 470) ret = error_errno(_("cannot read '%s'"), gc_log_path); fec2ed2187 builtin/gc.c 495) die(FAILED_RUN, pack_refs_cmd.argv[0]); fec2ed2187 builtin/gc.c 498) die(FAILED_RUN, reflog.argv[0]); 3029970275 builtin/gc.c 585) exit(128); fec2ed2187 builtin/gc.c 637) die(FAILED_RUN, repack.argv[0]); fec2ed2187 builtin/gc.c 647) die(FAILED_RUN, prune.argv[0]); fec2ed2187 builtin/gc.c 654) die(FAILED_RUN, prune_worktrees.argv[0]); fec2ed2187 builtin/gc.c 658) die(FAILED_RUN, rerere.argv[0]); builtin/pack-objects.c 2fa233a554 builtin/pack-objects.c 1512) hashcpy(base_oid.hash, base_sha1); 2fa233a554 builtin/pack-objects.c 1513) if (!in_same_island(&delta->idx.oid, &base_oid)) 
2fa233a554 builtin/pack-objects.c 1514) return 0; commit-graph.c 5cef295f28 67) return 0; 20fd6d5799 79) return 0; midx.c e43d2dcce1 285) if (hasheq(oid.hash, e43d2dcce1 286) p->bad_object_sha1 + the_hash_algo->rawsz * i)) refs.c 4a6067cda5 1405) return 0; Commits introducing uncovered code: Derrick Stolee 20fd6d579: commit-graph: not compatible with grafts Derrick Stolee 5cef295f2: commit-graph: not compatible with uninitialized repo Jeff King 2fa233a55: pack-objects: handle island check for "external" delta base Jeff King e43d2dcce: more oideq/hasheq conversions Jonathan Nieder 302997027: gc: do not return error for prior errors in daemonized mode Jonathan Nieder fec2ed218: gc: exit with status 128 on failure Stefan Beller 4a6067cda: refs.c: migrate internal ref iteration to pass thru repository argument
Re: [PATCH 0/1] Run GIT_TEST_COMMIT_GRAPH and GIT_TEST_MULTI_PACK_INDEX during CI
On 10/17/2018 9:00 AM, Derrick Stolee via GitGitGadget wrote: [1] https://git.visualstudio.com/git/_build?definitionId=4 Newlines are hard. Sorry for the formatting issues when translating from a PR description. Build definition that tests Git with different arrangements of GIT_TEST_* variables. Thanks, -Stolee
Re: [PATCH] test-tool: show tool list on error
On 10/17/2018 5:25 AM, Jeff King wrote: Before we switched to one big test-tool binary, if you forgot the name of a tool, you could use tab-completion in the shell to get a hint. But these days, all you get is: $ t/helper/test-tool approxidate fatal: There is no test named 'approxidate' and you're stuck reading the source code to find it. Let's print a list of the available tools in this case. Signed-off-by: Jeff King --- Not really user-facing, but this bugged me enough earlier to write the patch. ;) Some of the individual tools have nice help, too (try "t/helper/test-tool date", which shows the approxidate command I was looking for), but some of them could probably stand to improve their friendliness (try "t/helper/test-tool config"). I think it's fine for people to improve them over time if and when they get annoyed. This will improve contributors' lives! Thanks. Reviewed-by: Derrick Stolee
[PATCH 0/1] Run GIT_TEST_COMMIT_GRAPH and GIT_TEST_MULTI_PACK_INDEX during CI
Our CI scripts include a step to run the test suite with certain optional variables enabled. Now that all branches should build and have tests succeed with GIT_TEST_COMMIT_GRAPH and GIT_TEST_MULTI_PACK_INDEX enabled, add those variables to that stage. Note: the GIT_TEST_MULTI_PACK_INDEX variable has not merged all the way down, so will be ignored if this series is merged faster than that one. However, it is safe to make these changes orthogonally as all (known) test breaks with GIT_TEST_MULTI_PACK_INDEX=1 are fixed in the topic that introduces the variable. I also created a build definition on Azure Pipelines that runs the test suite with different subsets of the test variables, split by the following types: 1) Index options 2) Commit-graph 3) Multi-pack-index These builds are found at [1]. There are benefits to testing the variables all together but also separately. I didn't want to create new stages in the CI scripts to avoid consuming extra resources. This series is based on js/vsts-ci to avoid conflicts with that series, but is not necessarily a hard dependence. Thanks, -Stolee [1] https://git.visualstudio.com/git/_build?definitionId=4 (a build definition that tests Git with different arrangements of GIT_TEST_* variables) Derrick Stolee (1): ci: add optional test variables ci/run-build-and-tests.sh | 2 ++ 1 file changed, 2 insertions(+) base-commit: d82963f34cf6921ed29d1fc2d96b16acf9005159 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-49%2Fderrickstolee%2Fci-vars-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-49/derrickstolee/ci-vars-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/49 -- gitgitgadget
[PATCH 1/1] ci: add optional test variables
From: Derrick Stolee The commit-graph and multi-pack-index features introduce optional data structures that are not required for normal Git operations. It is important to run the normal test suite without them enabled, but it is helpful to also run the test suite using them. Our continuous integration scripts include a second test stage that runs with optional GIT_TEST_* variables enabled. Add the following two variables to that stage: GIT_TEST_COMMIT_GRAPH GIT_TEST_MULTI_PACK_INDEX This will slow down the operation, as we build a commit-graph file after every 'git commit' operation and build a multi-pack-index during every 'git repack' operation. However, it is important that future changes are compatible with these features. Signed-off-by: Derrick Stolee --- ci/run-build-and-tests.sh | 2 ++ 1 file changed, 2 insertions(+) diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh index e28ac2fb9a..db342bb6a8 100755 --- a/ci/run-build-and-tests.sh +++ b/ci/run-build-and-tests.sh @@ -15,6 +15,8 @@ then export GIT_TEST_FULL_IN_PACK_ARRAY=true export GIT_TEST_OE_SIZE=10 export GIT_TEST_OE_DELTA_SIZE=5 + export GIT_TEST_COMMIT_GRAPH=1 + export GIT_TEST_MULTI_PACK_INDEX=1 make --quiet test fi -- gitgitgadget
Re: [PATCH 00/19] Bring more repository handles into our code base
On 10/16/2018 7:35 PM, Stefan Beller wrote: This series takes another approach as it doesn't change the signature of functions, but introduces new functions that can deal with arbitrary repositories, keeping the old function signature around using a shallow wrapper. I think this is a good direction, and the changes look good to me. Additionally each patch adds a semantic patch, that would port from the old to the new function. These semantic patches are all applied in the very last patch, but we could omit applying the last patch if it causes too many merge conflicts and trickle in the semantic patches over time when there are no merge conflicts. The semantic patches are a good idea. At some point in the future, we can submit a series that applies the patches to any leftover calls and removes the old callers. We can hopefully rely on review (and the semantic patches warning that there is work to do) to prevent new callers from being introduced in future topics. That doesn't count topics that come around while this one isn't merged down. I had one high-level question: How are we testing that these "arbitrary repository" changes are safe? I just remember the issue we had with the commit-graph code relying on arbitrary repositories, but then adding a reference to the replace objects broke the code (because replace-objects wasn't using arbitrary repos correctly, but the commit-graph was tested with arbitrary repositories using "test-tool repository"). It would be nice to introduce more method calls in t/helper/test-repository.c that help us know these are safe conversions. Otherwise, we are essentially waiting until we try to take submodule things in-process and find out what breaks. As we discovered with the refstore, we can't just ensure that all references to the_repository are removed. Thanks, -Stolee
Re: [PATCH v2 13/13] commit-graph: specify OID version for SHA-256
On 10/14/2018 10:19 PM, brian m. carlson wrote: Since the commit-graph code wants to serialize the hash algorithm into the data store, specify a version number for each supported algorithm. Note that we don't use the values of the constants themselves, as they are internal and could change in the future. Signed-off-by: brian m. carlson --- commit-graph.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index 7a28fbb03f..e587c21bb6 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -45,7 +45,14 @@ char *get_commit_graph_filename(const char *obj_dir) static uint8_t oid_version(void) { - return 1; + switch (hash_algo_by_ptr(the_hash_algo)) { hash_algo_by_ptr is specified as an inline function in hash.h, so this leads to a linking error in jch and pu right now. I think the fix is simply to add '#include "hash.h"' to the list of includes. Thanks, -Stolee
[PATCH v4 4/7] revision.c: begin refactoring --topo-order logic
From: Derrick Stolee When running 'git rev-list --topo-order' and its kin, the topo_order setting in struct rev_info implies the limited setting. This means that the following things happen during prepare_revision_walk(): * revs->limited implies we run limit_list() to walk the entire reachable set. There are some short-cuts here, such as if we perform a range query like 'git rev-list COMPARE..HEAD' and we can stop limit_list() when all queued commits are uninteresting. * revs->topo_order implies we run sort_in_topological_order(). See the implementation of that method in commit.c. It implies that the full set of commits to order is in the given commit_list. These two methods imply that a 'git rev-list --topo-order HEAD' command must walk the entire reachable set of commits _twice_ before returning a single result. If we have a commit-graph file with generation numbers computed, then there is a better way. This patch introduces some necessary logic redirection when we are in this situation. In v2.18.0, the commit-graph file contains zero-valued bytes in the positions where the generation number is stored in v2.19.0 and later. Thus, we use generation_numbers_enabled() to check if the commit-graph is available and has non-zero generation numbers. When setting revs->limited only because revs->topo_order is true, only do so if generation numbers are not available. There is no reason to use the new logic as it will behave similarly when all generation numbers are INFINITY or ZERO. In prepare_revision_walk(), if we have revs->topo_order but not revs->limited, then we trigger the new logic. It breaks the logic into three pieces, to fit with the existing framework: 1. init_topo_walk() fills a new struct topo_walk_info in the rev_info struct. We use the presence of this struct as a signal to use the new methods during our walk. In this patch, this method simply calls limit_list() and sort_in_topological_order(). 
In the future, this method will set up a new data structure to perform that logic in-line. 2. next_topo_commit() provides get_revision_1() with the next topo- ordered commit in the list. Currently, this simply pops the commit from revs->commits. 3. expand_topo_walk() provides get_revision_1() with a way to signal walking beyond the latest commit. Currently, this calls add_parents_to_list() exactly like the old logic. While this commit presents method redirection for performing the exact same logic as before, it allows the next commit to focus only on the new logic. Signed-off-by: Derrick Stolee --- revision.c | 42 ++ revision.h | 4 2 files changed, 42 insertions(+), 4 deletions(-) diff --git a/revision.c b/revision.c index e18bd530e4..2dcde8a8ac 100644 --- a/revision.c +++ b/revision.c @@ -25,6 +25,7 @@ #include "worktree.h" #include "argv-array.h" #include "commit-reach.h" +#include "commit-graph.h" volatile show_early_output_fn_t show_early_output; @@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s if (revs->diffopt.objfind) revs->simplify_history = 0; - if (revs->topo_order) + if (revs->topo_order && !generation_numbers_enabled(the_repository)) revs->limited = 1; if (revs->prune_data.nr) { @@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid, return 0; } +struct topo_walk_info {}; + +static void init_topo_walk(struct rev_info *revs) +{ + struct topo_walk_info *info; + revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info)); + info = revs->topo_walk_info; + memset(info, 0, sizeof(struct topo_walk_info)); + + limit_list(revs); + sort_in_topological_order(&revs->commits, revs->sort_order); +} + +static struct commit *next_topo_commit(struct rev_info *revs) +{ + return pop_commit(&revs->commits); +} + +static void expand_topo_walk(struct rev_info *revs, struct commit *commit) +{ + if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) { + if (!revs->ignore_missing_links) 
+ die("Failed to traverse parents of commit %s", + oid_to_hex(&commit->object.oid)); + } +} + int prepare_revision_walk(struct rev_info *revs) { int i; @@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs) commit_list_sort_by_date(&revs->commits); if (revs->no_walk) return 0; - if (revs->limited) + if (revs->limited) { if (limit_list(revs) < 0) return -1; - if (revs->topo_order) - sort_in_topological_order(&a
[PATCH v4 6/7] revision.c: generation-based topo-order algorithm
From: Derrick Stolee The current --topo-order algorithm requires walking all reachable commits up front, topo-sorting them, all before outputting the first value. This patch introduces a new algorithm which uses stored generation numbers to incrementally walk in topo-order, outputting commits as we go. This can dramatically reduce the computation time to write a fixed number of commits, such as when limiting with "-n <N>" or filling the first page of a pager. When running a command like 'git rev-list --topo-order HEAD', Git performed the following steps: 1. Run limit_list(), which parses all reachable commits, adds them to a linked list, and distributes UNINTERESTING flags. If all unprocessed commits are UNINTERESTING, then it may terminate without walking all reachable commits. This does not occur if we do not specify UNINTERESTING commits. 2. Run sort_in_topological_order(), which is an implementation of Kahn's algorithm. It first iterates through the entire set of important commits and computes the in-degree of each (plus one, as we use 'zero' as a special value here). Then, we walk the commits in priority order, adding them to the priority queue if and only if their in-degree is one. As we remove commits from this priority queue, we decrement the in-degree of their parents. 3. While we are peeling commits for output, get_revision_1() uses pop_commit on the full list of commits computed by sort_in_topological_order(). In the new algorithm, these three steps correspond to three different commit walks. We run these walks simultaneously, and advance each only as far as necessary to satisfy the requirements of the 'higher order' walk. We know when we can pause each walk by using generation numbers from the commit-graph feature. Recall that the generation number of a commit satisfies: * If the commit has at least one parent, then the generation number is one more than the maximum generation number among its parents.
* If the commit has no parent, then the generation number is one. There are two special generation numbers: * GENERATION_NUMBER_INFINITY: this value is 0xFFFFFFFF and indicates that the commit is not stored in the commit-graph and the generation number was not previously calculated. * GENERATION_NUMBER_ZERO: this value (0) is a special indicator to say that the commit-graph was generated by a version of Git that does not compute generation numbers (such as v2.18.0). Since we use generation_numbers_enabled() before using the new algorithm, we do not need to worry about GENERATION_NUMBER_ZERO. However, the existence of GENERATION_NUMBER_INFINITY implies the following weaker statement than the usual one we expect from generation numbers: If A and B are commits with generation numbers gen(A) and gen(B) and gen(A) < gen(B), then A cannot reach B. Thus, we will walk in each of our stages until the "maximum unexpanded generation number" is strictly lower than the generation number of a commit we are about to use. The walks are as follows: 1. EXPLORE: using the explore_queue priority queue (ordered by maximizing the generation number), parse each reachable commit until all commits in the queue have generation number strictly lower than needed. During this walk, update the UNINTERESTING flags as necessary. 2. INDEGREE: using the indegree_queue priority queue (ordered by maximizing the generation number), add one to the in-degree of each parent for each commit that is walked. Since we walk in order of decreasing generation number, we know that discovering an in-degree value of 0 means the value for that commit was not initialized, so should be initialized to two. (Recall that in-degree value "1" is what we use to say a commit is ready for output.) As we iterate the parents of a commit during this walk, ensure the EXPLORE walk has walked beyond their generation numbers. 3.
TOPO: using the topo_queue priority queue (ordered based on the sort_order given, which could be commit-date, author-date, or typical topo-order which treats the queue as a LIFO stack), remove a commit from the queue and decrement the in-degree of each parent. If a parent has an in-degree of one, then we add it to the topo_queue. Before we decrement the in-degree, however, ensure the INDEGREE walk has walked beyond that generation number. The implementations of these walks are in the following methods: * explore_walk_step and explore_to_depth * indegree_walk_step and compute_indegrees_to_depth * next_topo_commit and expand_topo_walk These methods have some patterns that may seem strange at first, but they are probably carry-overs from their equivalents in limit_list and sort_in_topological_order. One thing that is missing from this implementation is a proper way to stop walking when the entire queue is UNINTERESTING, so this imp
[PATCH v4 3/7] test-reach: add rev-list tests
From: Derrick Stolee The rev-list command is critical to Git's functionality. Ensure it works in the three commit-graph environments constructed in t6600-test-reach.sh. Here are a few important types of rev-list operations: * Basic: git rev-list --topo-order HEAD * Range: git rev-list --topo-order compare..HEAD * Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD * Symmetric Difference: git rev-list --topo-order compare...HEAD Signed-off-by: Derrick Stolee --- t/t6600-test-reach.sh | 84 +++ 1 file changed, 84 insertions(+) diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index 9d65b8b946..288f703b7b 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' ' test_three_modes commit_contains --tag ' +test_expect_success 'rev-list: basic topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \ + commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \ + commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \ + commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \ + >expect && + run_three_modes git rev-list --topo-order commit-6-6 +' + +test_expect_success 'rev-list: first-parent topo-order' ' + git rev-parse \ + commit-6-6 \ + commit-6-5 \ + commit-6-4 \ + commit-6-3 \ + commit-6-2 \ + commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \ + >expect && + run_three_modes git rev-list --first-parent --topo-order commit-6-6 +' + +test_expect_success 'rev-list: range topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + 
commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes git rev-list --topo-order commit-3-3..commit-6-6 +' + +test_expect_success 'rev-list: range topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 \ + commit-6-5 commit-5-5 commit-4-5 \ + commit-6-4 commit-5-4 commit-4-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes git rev-list --topo-order commit-3-8..commit-6-6 +' + +test_expect_success 'rev-list: first-parent range topo-order' ' + git rev-parse \ + commit-6-6 \ + commit-6-5 \ + commit-6-4 \ + commit-6-3 \ + commit-6-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6 +' + +test_expect_success 'rev-list: ancestry-path topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + >expect && + run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6 +' + +test_expect_success 'rev-list: symmetric difference topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 \ + commit-6-5 commit-5-5 commit-4-5 \ + commit-6-4 commit-5-4 commit-4-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + commit-3-8 commit-2-8 commit-1-8 \ + commit-3-7 commit-2-7 commit-1-7 \ + >expect && + run_three_modes git rev-list --topo-order commit-3-8...commit-6-6 +' + test_done -- gitgitgadget
[PATCH v4 1/7] prio-queue: add 'peek' operation
From: Derrick Stolee When consuming a priority queue, it can be convenient to inspect the next object that will be dequeued without actually dequeueing it. Our existing library did not have such a 'peek' operation, so add it as prio_queue_peek(). Add a reference-level comparison in t/helper/test-prio-queue.c so this method is exercised by t0009-prio-queue.sh. Further, add a test that checks the behavior when the compare function is NULL (i.e. the queue becomes a stack). Signed-off-by: Derrick Stolee --- prio-queue.c | 9 + prio-queue.h | 6 ++ t/helper/test-prio-queue.c | 26 ++ t/t0009-prio-queue.sh | 14 ++ 4 files changed, 47 insertions(+), 8 deletions(-) diff --git a/prio-queue.c b/prio-queue.c index a078451872..d3f488cb05 100644 --- a/prio-queue.c +++ b/prio-queue.c @@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue) } return result; } + +void *prio_queue_peek(struct prio_queue *queue) +{ + if (!queue->nr) + return NULL; + if (!queue->compare) + return queue->array[queue->nr - 1].data; + return queue->array[0].data; +} diff --git a/prio-queue.h b/prio-queue.h index d030ec9dd6..682e51867a 100644 --- a/prio-queue.h +++ b/prio-queue.h @@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing); */ extern void *prio_queue_get(struct prio_queue *); +/* + * Gain access to the "thing" that would be returned by + * prio_queue_get, but do not remove it from the queue. 
+ */ +extern void *prio_queue_peek(struct prio_queue *); + extern void clear_prio_queue(struct prio_queue *); /* Reverse the LIFO elements */ diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c index 9807b649b1..5bc9c46ea5 100644 --- a/t/helper/test-prio-queue.c +++ b/t/helper/test-prio-queue.c @@ -22,14 +22,24 @@ int cmd__prio_queue(int argc, const char **argv) struct prio_queue pq = { intcmp }; while (*++argv) { - if (!strcmp(*argv, "get")) - show(prio_queue_get(&pq)); - else if (!strcmp(*argv, "dump")) { - int *v; - while ((v = prio_queue_get(&pq))) - show(v); - } - else { + if (!strcmp(*argv, "get")) { + void *peek = prio_queue_peek(&pq); + void *get = prio_queue_get(&pq); + if (peek != get) + BUG("peek and get results do not match"); + show(get); + } else if (!strcmp(*argv, "dump")) { + void *peek; + void *get; + while ((peek = prio_queue_peek(&pq))) { + get = prio_queue_get(&pq); + if (peek != get) + BUG("peek and get results do not match"); + show(get); + } + } else if (!strcmp(*argv, "stack")) { + pq.compare = NULL; + } else { int *v = malloc(sizeof(*v)); *v = atoi(*argv); prio_queue_put(&pq, v); diff --git a/t/t0009-prio-queue.sh b/t/t0009-prio-queue.sh index e56dfce668..3941ad2528 100755 --- a/t/t0009-prio-queue.sh +++ b/t/t0009-prio-queue.sh @@ -47,4 +47,18 @@ test_expect_success 'notice empty queue' ' test_cmp expect actual ' +cat >expect <<'EOF' +3 +2 +6 +4 +5 +1 +8 +EOF +test_expect_success 'stack order' ' + test-tool prio-queue stack 8 1 5 4 6 2 3 dump >actual && + test_cmp expect actual +' + test_done -- gitgitgadget
[PATCH v4 7/7] t6012: make rev-list tests more interesting
From: Derrick Stolee As we are working to rewrite some of the revision-walk machinery, there could easily be some interesting interactions between the options that force topological constraints (--topo-order, --date-order, and --author-date-order) along with specifying a path. Add extra tests to t6012-rev-list-simplify.sh to add coverage of these interactions. To ensure interesting things occur, alter the repo data shape to have different orders depending on topo-, date-, or author-date-order. When testing using GIT_TEST_COMMIT_GRAPH, this assists in covering the new logic for topo-order walks using generation numbers. The extra tests can be added independently. Signed-off-by: Derrick Stolee --- t/t6012-rev-list-simplify.sh | 45 1 file changed, 36 insertions(+), 9 deletions(-) diff --git a/t/t6012-rev-list-simplify.sh b/t/t6012-rev-list-simplify.sh index b5a1190ffe..a10f0df02b 100755 --- a/t/t6012-rev-list-simplify.sh +++ b/t/t6012-rev-list-simplify.sh @@ -12,6 +12,22 @@ unnote () { git name-rev --tags --stdin | sed -e "s|$OID_REGEX (tags/\([^)]*\)) |\1 |g" } +# +# Create a test repo with interesting commit graph: +# +# A--B--G--H--I--K--L +# \ \ / / +# \ \ / / +#C--E---F J +#\_/ +# +# The commits are laid out from left-to-right starting with +# the root commit A and terminating at the tip commit L. +# +# There are a few places where we adjust the commit date or +# author date to make the --topo-order, --date-order, and +# --author-date-order flags produce different output.
+ test_expect_success setup ' echo "Hi there" >file && echo "initial" >lost && @@ -21,10 +37,18 @@ test_expect_success setup ' git branch other-branch && + git symbolic-ref HEAD refs/heads/unrelated && + git rm -f "*" && + echo "Unrelated branch" >side && + git add side && + test_tick && git commit -m "Side root" && + note J && + git checkout master && + echo "Hello" >file && echo "second" >lost && git add file lost && - test_tick && git commit -m "Modified file and lost" && + test_tick && GIT_AUTHOR_DATE=$(($test_tick + 120)) git commit -m "Modified file and lost" && note B && git checkout other-branch && @@ -63,13 +87,6 @@ test_expect_success setup ' test_tick && git commit -a -m "Final change" && note I && - git symbolic-ref HEAD refs/heads/unrelated && - git rm -f "*" && - echo "Unrelated branch" >side && - git add side && - test_tick && git commit -m "Side root" && - note J && - git checkout master && test_tick && git merge --allow-unrelated-histories -m "Coolest" unrelated && note K && @@ -103,14 +120,24 @@ check_result () { check_outcome success "$@" } -check_result 'L K J I H G F E D C B A' --full-history +check_result 'L K J I H F E D C G B A' --full-history --topo-order +check_result 'L K I H G F E D C B J A' --full-history +check_result 'L K I H G F E D C B J A' --full-history --date-order +check_result 'L K I H G F E D B C J A' --full-history --author-date-order check_result 'K I H E C B A' --full-history -- file check_result 'K I H E C B A' --full-history --topo-order -- file check_result 'K I H E C B A' --full-history --date-order -- file +check_result 'K I H E B C A' --full-history --author-date-order -- file check_result 'I E C B A' --simplify-merges -- file +check_result 'I E C B A' --simplify-merges --topo-order -- file +check_result 'I E C B A' --simplify-merges --date-order -- file +check_result 'I E B C A' --simplify-merges --author-date-order -- file check_result 'I B A' -- file check_result 'I B A' --topo-order -- file +check_result 'I B A' 
--date-order -- file +check_result 'I B A' --author-date-order -- file check_result 'H' --first-parent -- another-file +check_result 'H' --first-parent --topo-order -- another-file check_result 'E C B A' --full-history E -- lost test_expect_success 'full history simplification without parent' ' -- gitgitgadget
[PATCH v4 5/7] commit/revisions: bookkeeping before refactoring
From: Derrick Stolee There are a few things that need to move around a little before making a big refactoring in the topo-order logic: 1. We need access to record_author_date() and compare_commits_by_author_date() in revision.c. These are used currently by sort_in_topological_order() in commit.c. 2. Moving these methods to commit.h requires adding the author_slab definition to commit.h. 3. The add_parents_to_list() method in revision.c performs logic around the UNINTERESTING flag and other special cases depending on the struct rev_info. Allow this method to ignore a NULL 'list' parameter, as we will not be populating the list for our walk. Also rename the method to the slightly more generic name process_parents() to make clear that this method does more than add to a list (and no list is required anymore). Helped-by: Jeff King Signed-off-by: Derrick Stolee --- commit.c | 11 +-- commit.h | 8 revision.c | 18 ++ 3 files changed, 23 insertions(+), 14 deletions(-) diff --git a/commit.c b/commit.c index d0f199e122..861a485e93 100644 --- a/commit.c +++ b/commit.c @@ -655,11 +655,10 @@ struct commit *pop_commit(struct commit_list **stack) /* count number of children that have not been emitted */ define_commit_slab(indegree_slab, int); -/* record author-date for each commit object */ -define_commit_slab(author_date_slab, timestamp_t); +implement_shared_commit_slab(author_date_slab, timestamp_t); -static void record_author_date(struct author_date_slab *author_date, - struct commit *commit) +void record_author_date(struct author_date_slab *author_date, + struct commit *commit) { const char *buffer = get_commit_buffer(commit, NULL); struct ident_split ident; @@ -684,8 +683,8 @@ fail_exit: unuse_commit_buffer(commit, buffer); } -static int compare_commits_by_author_date(const void *a_, const void *b_, - void *cb_data) +int compare_commits_by_author_date(const void *a_, const void *b_, + void *cb_data) { const struct commit *a = a_, *b = b_; struct author_date_slab *author_date 
= cb_data; diff --git a/commit.h b/commit.h index 2b1a734388..977d397356 100644 --- a/commit.h +++ b/commit.h @@ -8,6 +8,7 @@ #include "gpg-interface.h" #include "string-list.h" #include "pretty.h" +#include "commit-slab.h" #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF #define GENERATION_NUMBER_INFINITY 0xFFFFFFFF @@ -328,6 +329,13 @@ extern int remove_signature(struct strbuf *buf); */ extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); +/* record author-date for each commit object */ +define_shared_commit_slab(author_date_slab, timestamp_t); + +void record_author_date(struct author_date_slab *author_date, + struct commit *commit); + +int compare_commits_by_author_date(const void *a_, const void *b_, void *unused); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); diff --git a/revision.c b/revision.c index 2dcde8a8ac..36458265a0 100644 --- a/revision.c +++ b/revision.c @@ -768,8 +768,8 @@ static void commit_list_insert_by_date_cached(struct commit *p, struct commit_li *cache = new_entry; } -static int add_parents_to_list(struct rev_info *revs, struct commit *commit, - struct commit_list **list, struct commit_list **cache_ptr) +static int process_parents(struct rev_info *revs, struct commit *commit, + struct commit_list **list, struct commit_list **cache_ptr) { struct commit_list *parent = commit->parents; unsigned left_flag; @@ -808,7 +808,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit, if (p->object.flags & SEEN) continue; p->object.flags |= SEEN; - commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); + if (list) + commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); } return 0; } @@ -847,7 +848,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit, p->object.flags |= left_flag; if (!(p->object.flags & SEEN)) { p->object.flags |=
SEEN; - commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); + if (list) + commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); } if (revs->
[PATCH v4 0/7] Use generation numbers for --topo-order
This patch series performs a decently-sized refactoring of the revision-walk machinery. Well, "refactoring" is probably the wrong word, as I don't actually remove the old code. Instead, when we see certain options in the 'rev_info' struct, we redirect the commit-walk logic to a new set of methods that distribute the workload differently. By using generation numbers in the commit-graph, we can significantly improve 'git log --graph' commands (and the underlying 'git rev-list --topo-order'). On the Linux repository, I got the following performance results when comparing to the previous version with or without a commit-graph:

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s

If you want to read this series but are unfamiliar with the commit-graph and generation numbers, then I recommend reading Documentation/technical/commit-graph.txt or a blog post [1] I wrote on the subject. In particular, the three-part walk described in "revision.c: refactor basic topo-order logic" is present (but underexplained) as an animated PNG [2]. Since revision.c is an incredibly important (and old) portion of the codebase -- and because there are so many orthogonal options in 'struct rev_info' -- I consider this submission to be "RFC quality". That is, I am not confident that I am not missing anything, or that my solution is the best it can be. I did merge this branch with ds/commit-graph-with-grafts and the "DO-NOT-MERGE: write and read commit-graph always" commit that computes a commit-graph with every 'git commit' command. The test suite passed with that change, available on GitHub [3].
To ensure that I cover at least the cases I think are interesting, I added tests to t6600-test-reach.sh to verify the walks report the correct results for the three cases there (no commit-graph, full commit-graph, and a partial commit-graph so the walk starts at GENERATION_NUMBER_INFINITY). One notable case that is not included in this series is the case of a history comparison such as 'git rev-list --topo-order A..B'. The existing code in limit_list() has ways to cut the walk short when all pending commits are UNINTERESTING. Since this code depends on commit_list instead of the prio_queue we are using here, I chose to leave it untouched for now. We can revisit it in a separate series later. Since handle_commit() turns on revs->limited when a commit is UNINTERESTING, we do not hit the new code in this case. Removing this 'revs->limited = 1;' line yields correct results, but the performance is worse. This series was based on ds/reachable, but is now based on 'master' to not conflict with 182070 "commit: use timestamp_t for author_date_slab". There is a small conflict with md/filter-trees, because it renamed a flag in revisions.h in the line before I add new flags. Hopefully this conflict is not too difficult to resolve. Changes in V3: I added a new patch that updates the tab-alignment for flags in revision.h before adding new ones (Thanks, Ævar!). Also, I squashed the recommended changes to run_three_modes and test_three_modes from Szeder and Junio. Thanks! Changes in V4: I'm sending a V4 to respond to the feedback so far. Still looking forward to more on the really big commit! * Removed the whitespace changes to the flags in revision.c that caused merge pain. * The prio-queue peek function is now covered by tests when in "stack" mode.
* The "add_parents_to_list()" function is now renamed to "process_parents()" * Added a new commit that expands test coverage with alternate orders and file history (use GIT_TEST_COMMIT_GRAPH to have t6012-rev-list-simplify.sh cover the new logic). These tests found a problem with author dates (I forgot to record them during the explore walk). * Commit message edits. Thanks, -Stolee [1] https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/ Supercharging the Git Commit Graph III: Generations and Graph Algorithms [2] https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png Animation showing three-part walk [3] https://github.com/derrickstolee/git/tree/topo-order/test A branch containing this series along with commits to compute commit-graph in entire test suite. Cc: avarab@gmail.com Cc: szeder@gmail.com Derrick Stolee (7): prio-queue: add 'peek' operation test-reach: add run_three_modes method test-reach: add rev-list tests revision.c: begin refactoring --topo-order logic commit/revisions: bookkeeping before refactoring
[PATCH v4 2/7] test-reach: add run_three_modes method
From: Derrick Stolee The 'test_three_modes' method assumes we are using the 'test-tool reach' command for our test. However, we may want to use the data shape of our commit graph and the three modes (no commit-graph, full commit-graph, partial commit-graph) for other git commands. Split test_three_modes into a thin wrapper around a more general run_three_modes method that executes the given command and compares the actual output to the expected output. Signed-off-by: Derrick Stolee --- t/t6600-test-reach.sh | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index d139a00d1d..9d65b8b946 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -53,18 +53,22 @@ test_expect_success 'setup' ' git config core.commitGraph true ' -test_three_modes () { +run_three_modes () { test_when_finished rm -rf .git/objects/info/commit-graph && - test-tool reach $1 <input >actual && + "$@" >actual && test_cmp expect actual && cp commit-graph-full .git/objects/info/commit-graph && - test-tool reach $1 <input >actual && + "$@" >actual && test_cmp expect actual && cp commit-graph-half .git/objects/info/commit-graph && - test-tool reach $1 <input >actual && + "$@" >actual && test_cmp expect actual } +test_three_modes () { + run_three_modes test-tool reach "$@" <input +} + test_expect_success 'ref_newer:miss' ' cat >input <<-\EOF && A:commit-5-7 -- gitgitgadget
Re: [PATCH v2 13/13] commit-graph: specify OID version for SHA-256
On 10/16/2018 11:35 AM, Duy Nguyen wrote: On Mon, Oct 15, 2018 at 4:23 AM brian m. carlson wrote: Since the commit-graph code wants to serialize the hash algorithm into the data store, specify a version number for each supported algorithm. Note that we don't use the values of the constants themselves, as they are internal and could change in the future. Signed-off-by: brian m. carlson --- commit-graph.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index 7a28fbb03f..e587c21bb6 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -45,7 +45,14 @@ char *get_commit_graph_filename(const char *obj_dir) static uint8_t oid_version(void) { - return 1; + switch (hash_algo_by_ptr(the_hash_algo)) { + case GIT_HASH_SHA1: + return 1; + case GIT_HASH_SHA256: + return 2; Should we just increase this field to uint32_t and store format_id instead? That will keep oid version unique in all data formats. Both the commit-graph and multi-pack-index store a single byte for the hash version, so that ship has sailed (without incrementing the full file version number in each format). It may be good to make this method accessible to both formats. I'm not sure if Brian's branch is built on top of the multi-pack-index code. Probably best to see if ds/multi-pack-verify is in the history. Thanks, -Stolee
Re: [PATCH 0/4] Bloom filter experiment
On 10/16/2018 8:57 AM, Ævar Arnfjörð Bjarmason wrote: On Tue, Oct 16 2018, Derrick Stolee wrote: On 10/16/2018 12:45 AM, Junio C Hamano wrote: Derrick Stolee writes: 2. The filters are sized according to the number of changes in each commit, with a minimum of one 64-bit word. ... 6. When we compute the Bloom filters, we don't store a filter for commits whose first-parent diff has more than 512 paths. Just being curious but was 512 taken out of thin air or is there some math behind it, e.g. to limit false positive rate down to certain threshold? With a wide-enough bitset, you could store arbitrary large number of paths with low enough false positive, I guess, but is there a point where there is too many paths in the change that gives us diminishing returns and not worth having a filter in the first place? 512 is somewhat arbitrary, but having a maximum size is not. In a normal source-code-control context, the set of paths modified by any single commit ought to be a small subset of the entire paths, and whole-tree changes ought to be fairly rare. In a project for which that assumption does not hold, it might help to have a negative bloom filter (i.e. "git log -- A" asks "does the commit modify A?" and the filter would say "we know it does not, because we threw all the paths that are not touched to the bloom filter"), but I think that would optimize for a wrong case. A commit with many changed paths is very rare. The 512 I picked above is enough to cover 99% of commits in each of the repos I sampled when first investigating Bloom filters. When a Bloom filter response says "maybe yes" (in our case, "maybe not TREESAME"), then we need to verify that it is correct. In the extreme case that every path is changed, then the Bloom filter does nothing but add extra work. 
These extreme cases are also not unprecedented: in our Azure Repos codebase, we were using core.autocrlf to smudge CRLFs to LFs, but when it was time to dogfood VFS for Git, we needed to turn off the smudge filter. So, there is one commit that converts every LF to a CRLF in every text file. Storing a Bloom filter for those ~250,000 entries would take ~256KB for essentially no value. By not storing a filter for this commit, we go immediately to the regular TREESAME check, which would happen for most pathspecs. This is all to say: having a maximum size is good. 512 is big enough to cover _most_ commits, but not so big that we may store _really_ big filters. Makes sense. 512 is good enough to hardcode initially, but I couldn't tell from briefly skimming the patches if it was possible to make this size dynamic per-repo when the graph/filter is written. My proof-of-concept has it as a constant, but part of my plan is to make these all config options, as of this item in my message: >>> 2. We need config values for writing and consuming bloom filters, but also to override the default settings. I.e. we might later add some discovery step where we look at N number of commits at random, until we're satisfied that we've come up with some average/median number of total (recursive) tree entries & how many tend to be changed per-commit. I.e. I can imagine repositories (with some automated changes) where we have 10k files and tend to change 1k per commit, or ones with 10k files where we tend to change just 1-10 per commit, which would mean a larger/smaller filter would be needed / would do. I'm not sure a dynamic approach would be worth the effort, but I'm open to hearing the results of an experiment. Thanks, -Stolee
Re: [PATCH 0/4] Bloom filter experiment
On 10/16/2018 12:45 AM, Junio C Hamano wrote: Derrick Stolee writes: 2. The filters are sized according to the number of changes in each commit, with a minimum of one 64-bit word. ... 6. When we compute the Bloom filters, we don't store a filter for commits whose first-parent diff has more than 512 paths. Just being curious but was 512 taken out of thin air or is there some math behind it, e.g. to limit false positive rate down to certain threshold? With a wide-enough bitset, you could store arbitrary large number of paths with low enough false positive, I guess, but is there a point where there is too many paths in the change that gives us diminishing returns and not worth having a filter in the first place? 512 is somewhat arbitrary, but having a maximum size is not. In a normal source-code-control context, the set of paths modified by any single commit ought to be a small subset of the entire paths, and whole-tree changes ought to be fairly rare. In a project for which that assumption does not hold, it might help to have a negative bloom filter (i.e. "git log -- A" asks "does the commit modify A?" and the filter would say "we know it does not, because we threw all the paths that are not touched to the bloom filter"), but I think that would optimize for a wrong case. A commit with many changed paths is very rare. The 512 I picked above is enough to cover 99% of commits in each of the repos I sampled when first investigating Bloom filters. When a Bloom filter response says "maybe yes" (in our case, "maybe not TREESAME"), then we need to verify that it is correct. In the extreme case that every path is changed, then the Bloom filter does nothing but add extra work. These extreme cases are also not unprecedented: in our Azure Repos codebase, we were using core.autocrlf to smudge CRLFs to LFs, but when it was time to dogfood VFS for Git, we needed to turn off the smudge filter. So, there is one commit that converts every LF to a CRLF in every text file. 
Storing a Bloom filter for those ~250,000 entries would take ~256KB for essentially no value. By not storing a filter for this commit, we go immediately to the regular TREESAME check, which would happen for most pathspecs. This is all to say: having a maximum size is good. 512 is big enough to cover _most_ commits, but not so big that we may store _really_ big filters. Thanks, -Stolee
Re: Git Test Coverage Report (Monday, Oct 15)
On 10/15/2018 12:24 PM, Derrick Stolee wrote:

Uncovered code in 'jch' (22f2f0f) and not in 'next' (152ad8e) -

prio-queue.c
2d181390f3 94) return queue->array[queue->nr - 1].data;

(I have a fix to cover this in my private branch for this topic.)

revision.c
4943d28849 2931) return;
4943d28849 2934) return;
4943d28849 2937) c->object.flags |= UNINTERESTING;
4943d28849 2940) return;
4943d28849 2943) mark_parents_uninteresting(c);
4943d28849 2966) return;
4943d28849 2969) return;
4943d28849 2974) return;
4943d28849 3022) init_author_date_slab(&info->author_date);
4943d28849 3023) info->topo_queue.compare = compare_commits_by_author_date;
4943d28849 3024) info->topo_queue.cb_data = &info->author_date;
4943d28849 3025) break;
4943d28849 3038) continue;
4943d28849 3048) record_author_date(&info->author_date, c);
6c04ff3001 3086) if (!revs->ignore_missing_links)
6c04ff3001 3087) die("Failed to traverse parents of commit %s",
4943d28849 3088) oid_to_hex(&commit->object.oid));
4943d28849 3096) continue;

Looks like a number of these lines are important to cover, but are not covered by tests that _also_ specify '--topo-order'. I bet I can cover more of these by overriding the sort logic to use the new algorithm if GIT_TEST_COMMIT_GRAPH is specified. Or, should I create yet another test variable to cover these cases? (Note: I run these coverage reports with a variety of optional test variables.)

Uncovered code in 'next' (152ad8e) and not in 'master' (5a0cc8a)

builtin/rev-list.c
7c0fe330d5 builtin/rev-list.c 227) die("unexpected missing %s object '%s'",
7c0fe330d5 builtin/rev-list.c 228) type_name(obj->type), oid_to_hex(&obj->oid));

commit-graph.c
5cef295f28 67) return 0;
20fd6d5799 79) return 0;

These are two ways to say the commit-graph should not be used, but are not covered by tests currently. One is if we say "is the repo shallow?"
which happens to return when the repo has grafts (but we keep the check here in case the way shallows are implemented changes), and the other is if the repo is not initialized, but I fixed the test-helpers that had not initialized the repo yet.

Uncovered code in 'master' (5a0cc8a) and not in (fe8321ec05) -

builtin/fsck.c
66ec0390e7 builtin/fsck.c 862) midx_argv[2] = "--object-dir";
66ec0390e7 builtin/fsck.c 863) midx_argv[3] = alt->path;
66ec0390e7 builtin/fsck.c 864) if (run_command(&midx_verify))
66ec0390e7 builtin/fsck.c 865) errors_found |= ERROR_COMMIT_GRAPH;

These are underneath the "for all alternates" loop, and _should_ be covered with the coming GIT_TEST_MULTI_PACK_INDEX test variable.

Thanks,
-Stolee
Git Test Coverage Report (Monday, Oct 15)
not have "
9f630e7480 builtin/stash--helper.c 1137) ret = 1;
9f630e7480 builtin/stash--helper.c 1138) goto done;
c2cc69f192 builtin/stash--helper.c 1154) if (!quiet)
c2cc69f192 builtin/stash--helper.c 1155) fprintf_ln(stderr, _("Cannot save the current "
9f630e7480 builtin/stash--helper.c 1157) ret = -1;
9f630e7480 builtin/stash--helper.c 1158) goto done;
c2cc69f192 builtin/stash--helper.c 1163) if (!quiet)
c2cc69f192 builtin/stash--helper.c 1164) fprintf_ln(stderr, _("Cannot save "
9f630e7480 builtin/stash--helper.c 1166) ret = -1;
9f630e7480 builtin/stash--helper.c 1167) goto done;
c2cc69f192 builtin/stash--helper.c 1174) if (!quiet)
c2cc69f192 builtin/stash--helper.c 1175) fprintf_ln(stderr, _("Cannot save the current "
9f630e7480 builtin/stash--helper.c 1177) goto done;
c2cc69f192 builtin/stash--helper.c 1183) if (!quiet)
c2cc69f192 builtin/stash--helper.c 1184) fprintf_ln(stderr, _("Cannot save the current "
9f630e7480 builtin/stash--helper.c 1186) ret = -1;
9f630e7480 builtin/stash--helper.c 1187) goto done;
c2cc69f192 builtin/stash--helper.c 1213) if (!quiet)
c2cc69f192 builtin/stash--helper.c 1214) fprintf_ln(stderr, _("Cannot record "
9f630e7480 builtin/stash--helper.c 1216) ret = -1;
9f630e7480 builtin/stash--helper.c 1217) goto done;
1a0f0409a7 builtin/stash--helper.c 1289) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1290) goto done;
1a0f0409a7 builtin/stash--helper.c 1300) ret = -1;
c2cc69f192 builtin/stash--helper.c 1301) if (!quiet)
c2cc69f192 builtin/stash--helper.c 1302) fprintf_ln(stderr, _("Cannot initialize stash"));
1a0f0409a7 builtin/stash--helper.c 1303) goto done;
1a0f0409a7 builtin/stash--helper.c 1313) ret = -1;
c2cc69f192 builtin/stash--helper.c 1314) if (!quiet)
c2cc69f192 builtin/stash--helper.c 1315) fprintf_ln(stderr, _("Cannot save the current status"));
1a0f0409a7 builtin/stash--helper.c 1316) goto done;
1a0f0409a7 builtin/stash--helper.c 1333) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1352) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1353) goto done;
1a0f0409a7 builtin/stash--helper.c 1362) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1363) goto done;
1a0f0409a7 builtin/stash--helper.c 1371) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1380) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1391) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1392) goto done;
1a0f0409a7 builtin/stash--helper.c 1401) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1402) goto done;
1a0f0409a7 builtin/stash--helper.c 1410) ret = -1;
1a0f0409a7 builtin/stash--helper.c 1436) ret = -1;
193c3e3516 builtin/stash.c 1568) usage_msg_opt(xstrfmt(_("unknown subcommand: %s"), argv[0]),
193c3e3516 builtin/stash.c 1596) continue;

packfile.c
1127a98cce 117) return error("index file %s is too small", path);
1127a98cce 119) return error("empty data");

prio-queue.c
2d181390f3 94) return queue->array[queue->nr - 1].data;

rebase-interactive.c
64a43cbd5d 62) return error_errno(_("could not read '%s'."), todo_file);
64a43cbd5d 66) strbuf_release(&buf);
64a43cbd5d 67) return -1;
a9f5476fbc 75) return error_errno(_("could not read '%s'."), todo_file);
a9f5476fbc 79) strbuf_release(&buf);
a9f5476fbc 80) return -1;
64a43cbd5d 86) return -1;

revision.c
4943d28849 2931) return;
4943d28849 2934) return;
4943d28849 2937) c->object.flags |= UNINTERESTING;
4943d28849 2940) return;
4943d28849 2943) mark_parents_uninteresting(c);
4943d28849 2966) return;
4943d28849 2969) return;
4943d28849 2974) return;
4943d28849 3022) init_author_date_slab(&info->author_date);
4943d28849 3023) info->topo_queue.compare = compare_commits_by_author_date;
4943d28849 3024) info->topo_queue.cb_data = &info->author_date;
4943d28849 3025) break;
4943d28849 3038) continue;
4943d28849 3048) record_author_date(&info->author_date, c);
6c04ff3001 3086) if (!revs->ignore_missing_links)
6c04ff3001 3087) die("Failed to traverse parents of commit %s",
4943d28849 3088) oid_to_hex(&commit->object.oid));
4943d28849 3096) continue;

sequencer.c
65850686cf 2278) return;
65850686cf 2375) write_file(rebase_path_quiet(), "%s\n", quiet);
2c58483a59 3373) return error(_("could not checkout %s"), commit);
4df66c40b0 3387) return error(_("%s: not a valid OID"), orig_head);
71f82465b1 3407) fprintf(stderr, _("Stopped at HEAD\n"));
b97e187364 4771) return -1;
b97e187364 4774) return -1;
b97e187364 4780) return error_errno(_("could not read '%s'."), todo_file);
b97e187364 4783) todo_list_release(&todo_list);
b97e187364 4784) return error(_("unusable todo list: '%s'"), todo_file);
b97e187364 4803) todo_list_release(&todo_list);
b97e187364 4804) return -1;
b97e187364 4808) return error(_("could not copy '%s' to '%s'."), todo
Re: [PATCH 15/16] commit-reach: make can_all_from_reach... linear
On 10/14/2018 10:29 AM, René Scharfe wrote:

It still has some repetition, converted code is a bit longer than the current one, and I don't know how to build a Coccinelle rule that would do that conversion. Looked for a possibility to at least leave QSORT call-sites alone by enhancing that macro, but didn't find any. Found a few websites showing off mindblowing macros, though, this one in particular: https://github.com/pfultz2/Cloak/wiki/C-Preprocessor-tricks,-tips,-and-idioms

Anyway, drove the generative approach a bit further, and came up with the new DEFINE_SORT below. I'm unsure about the name; perhaps it should be called DEFINE_SORT_BY_COMPARE_FUNCTION_BODY, but that's a bit long. It handles casts and const attributes behind the scenes and avoids repetition, but looks a bit weird, as it is placed where a function signature would go. Apart from that the macro is simple and doesn't use any tricks or added checks. It just sets up boilerplate functions to offer type-safe sorting.

diffcore-rename.c and refs/packed-backend.c receive special treatment in the patch because their compare functions are used outside of sorting as well. I made them take typed pointers nevertheless and used them from DEFINE_SORT; the wrapper generated by that macro is supposed to be private. Given that such reuse is rare, I think we don't need a way to make it public.

What do y'all think about this direction?
---

[snip]

diff --git a/git-compat-util.h b/git-compat-util.h
index 5f2e90932f..491230fc57 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1066,6 +1066,21 @@ static inline void sane_qsort(void *base, size_t nmemb, size_t size,
 	qsort(base, nmemb, size, compar);
 }
 
+#define DECLARE_SORT(scope, name, elemtype) \
+scope void name(elemtype, size_t)
+
+#define DEFINE_SORT(scope, name, elemtype, one, two) \
+static int name##_compare(const elemtype, const elemtype); \
+static int name##_compare_void(const void *a, const void *b) \
+{ \
+	return name##_compare(a, b); \
+} \
+scope void name(elemtype base, size_t nmemb) \
+{ \
+	QSORT(base, nmemb, name##_compare_void); \
+} \
+static int name##_compare(const elemtype one, const elemtype two)
+

Since you were worried about the "private" name of the compare function, maybe split this macro into two: DEFINE_COMPARE and DEFINE_SORT. Then, if someone wants direct access to the compare function, they could use DEFINE_COMPARE to ensure the typing is correct, and use QSORT as normal with name##_compare_void.

As I think about this, though, it is less of a problem than the split is worth. The commit-slab definitions generate a lot of methods using the "name##" convention, so perhaps we should just trust developers using the macros to look up the macro definition or similar examples. In that sense, including a conversion that consumes the compare function directly can be a signpost for future callers.

I would say that maybe the places where you need to do something special should be pulled out into their own patches, so we can call attention to them directly.
[snip]

diff --git a/midx.c b/midx.c
index 713d6f9dde..4407db7949 100644
--- a/midx.c
+++ b/midx.c
@@ -419,10 +419,8 @@ struct pack_pair {
 	char *pack_name;
 };
 
-static int pack_pair_compare(const void *_a, const void *_b)
+DEFINE_SORT(static, sort_by_pack_name, struct pack_pair *, a, b)
 {
-	struct pack_pair *a = (struct pack_pair *)_a;
-	struct pack_pair *b = (struct pack_pair *)_b;
 	return strcmp(a->pack_name, b->pack_name);
 }
 
@@ -438,7 +436,7 @@ static void sort_packs_by_name(char **pack_names, uint32_t nr_packs, uint32_t *p
 		pairs[i].pack_name = pack_names[i];
 	}
 
-	QSORT(pairs, nr_packs, pack_pair_compare);
+	sort_by_pack_name(pairs, nr_packs);

I like this "sort_by_" convention.

 	for (i = 0; i < nr_packs; i++) {
 		pack_names[i] = pairs[i].pack_name;
@@ -455,10 +453,8 @@ struct pack_midx_entry {
 	uint64_t offset;
 };
 
-static int midx_oid_compare(const void *_a, const void *_b)
+DEFINE_SORT(static, sort_midx, struct pack_midx_entry *, a, b)
 {
-	const struct pack_midx_entry *a = (const struct pack_midx_entry *)_a;
-	const struct pack_midx_entry *b = (const struct pack_midx_entry *)_b;
 	int cmp = oidcmp(&a->oid, &b->oid);
 
 	if (cmp)
@@ -573,7 +569,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 		}
 	}
 
-	QSORT(entries_by_fanout, nr_fanout, midx_oid_compare)
Re: [PATCH v2 13/13] commit-graph: specify OID version for SHA-256
On 10/14/2018 10:19 PM, brian m. carlson wrote:

Since the commit-graph code wants to serialize the hash algorithm into the data store, specify a version number for each supported algorithm. Note that we don't use the values of the constants themselves, as they are internal and could change in the future.

Signed-off-by: brian m. carlson
---
 commit-graph.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7a28fbb03f..e587c21bb6 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -45,7 +45,14 @@ char *get_commit_graph_filename(const char *obj_dir)
 
 static uint8_t oid_version(void)
 {
-	return 1;
+	switch (hash_algo_by_ptr(the_hash_algo)) {
+	case GIT_HASH_SHA1:
+		return 1;
+	case GIT_HASH_SHA256:
+		return 2;
+	default:
+		BUG("unknown hash algorithm");
+	}
 }
 
 static struct commit_graph *alloc_commit_graph(void)

Simple and easy!

Thanks,
-Stolee
Re: [PATCH v2 09/13] commit-graph: convert to using the_hash_algo
On 10/14/2018 10:18 PM, brian m. carlson wrote: Instead of using hard-coded constants for object sizes, use the_hash_algo to look them up. In addition, use a function call to look up the object ID version and produce the correct value. This looks good and I can see already how this new organization will help make the hash size more flexible. Thanks! -Stolee
Re: [PATCH 0/4] Bloom filter experiment
On 10/9/2018 3:34 PM, SZEDER Gábor wrote:

To keep the ball rolling, here is my proof of concept in a somewhat cleaned-up form, with still plenty of rough edges.

Peff, Szeder, and Jonathan,

Thanks for giving me the kick in the pants to finally write a proof of concept for my personal take on how this should work. My implementation borrows things from both Szeder's and Jonathan's series. You can find my commits for all of the versions on GitHub (it's a bit too messy to share as a patch series right now, I think):

Repo: https://github.com/derrickstolee/git
Branches: bloom/* (includes bloom/stolee, bloom/peff, bloom/szeder, and bloom/tan for the respective implementations, and bloom/base as the common ancestor)

My implementation uses the following scheme:

1. Bloom filters are computed and stored on a commit-by-commit basis.

2. The filters are sized according to the number of changes in each commit, with a minimum of one 64-bit word.

3. The filters are stored in the commit-graph using two new optional chunks: one stores a single 32-bit integer for each commit that provides the end of its Bloom filter in the second "data" chunk. The data chunk also stores the magic constants (hash version, num hash keys, and num bits per entry).

4. We fill the Bloom filters as (const char *data, int len) pairs as "struct bloom_filter"s in a commit slab.

5. In order to evaluate containment, we need the struct bloom_filter, but also struct bloom_settings (stores the magic constants in one place), and struct bloom_key (stores the _k_ hash values). This allows us to hash a path once and test the same path against many Bloom filters.

6. When we compute the Bloom filters, we don't store a filter for commits whose first-parent diff has more than 512 paths.

7. When we compute the commit-graph, we can re-use the pre-existing filters without needing to recompute diffs. (Caveat: the current implementation will re-compute diffs for the commits with diffs that were too large.)
You can build the Bloom filters in my implementation this way:

	GIT_TEST_BLOOM_FILTERS=1 ./git commit-graph write --reachable

You can play around with it like this:

	$ GIT_USE_POC_BLOOM_FILTER=$((8*1024*1024*8)) git commit-graph write
	Computing commit graph generation numbers: 100% (52801/52801), done.
	Computing bloom filter: 100% (52801/52801), done.
	# Yeah, I even added a progress indicator! :)
	$ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y git rev-list --count --full-history HEAD -- t/valgrind/valgrind.sh
	886
	20:40:24.783699 revision.c:486 bloom filter total queries: 66095 definitely not: 64953 maybe: 1142 false positives: 256 fp ratio: 0.003873

Jonathan used this same test, so will I. Here is a summary table:

| Implementation | Queries | Maybe | FP # | FP %  |
|----------------|---------|-------|------|-------|
| Szeder         | 66095   | 1142  | 256  | 0.38% |
| Jonathan       | 66459   | 107   | 89   | 0.16% |
| Stolee         | 53025   | 492   | 479  | 0.90% |

(Note that we must have used different starting points, which is why my "Queries" count is so much smaller.)

The increase in false-positive percentage is expected in my implementation: I'm using the correct filter sizes to hit a <1% FP ratio. This could be lowered by changing the settings, and the size would grow dynamically.

For my Git repo (which contains git-for-windows/git and microsoft/git), this implementation grows the commit-graph file from 5.8 MB to 7.3 MB (1.5 MB total, compared to Szeder's 8 MB filter). For 105,260 commits, that rounds out to less than 20 bytes per commit (compared to Jonathan's 256 bytes per commit).

Related stats for my Linux repo: 781,756 commits, and the commit-graph grows from 43.8 MB to 55.6 MB (~12 MB additional, ~16 bytes per commit).

I haven't done a side-by-side performance test for these implementations, but it would be interesting to do so.

Despite writing a lot of code in a short amount of time, there is a lot of work to be done before this is submittable:

1. There are three different environment variables right now.
It would be better to have one GIT_TEST_ variable and rely on existing tracing for logs (trace2 values would be good here).

2. We need config values for writing and consuming bloom filters, but also to override the default settings.

3. My bloom.c/bloom.h is too coupled to the commit-graph. I want to harden that interface to let Bloom filters live as their own thing, but so that the commit-graph could load a bloom filter from the file instead of from the slab.

4. Tests, tests, and more tests.

We'll see how much time I have to do this polish, but I think the benefit is proven.

Thanks,
-Stolee
Re: Git Test Coverage Report (Friday, Oct 12)
On 10/15/2018 4:09 AM, Johannes Schindelin wrote:

Hi Stolee,

On Fri, 12 Oct 2018, Derrick Stolee wrote:

In an effort to ensure new code is reasonably covered by the test suite, we now have contrib/coverage-diff.sh to combine the gcov output from 'make coverage-test ; make coverage-report' with the output from 'git diff A B' to discover _new_ lines of code that are not covered.

Thanks for doing this. I do notice, though, that there are a few mentions of BUG() lines, e.g.

0af129b2ed builtin/rebase--interactive2.c 267) BUG("invalid command '%d'", command);

I do not think that we will ever want to ensure that all of these lines get code coverage ;-) Maybe it would be possible to exclude those lines from the analysis?

I'll incorporate that into my build script, but leave it out of contrib/coverage-diff.sh in case people really want to see those lines when they inspect a single topic.

Thanks,
-Stolee
Re: [PATCH v2 0/3] Add GIT_TEST_MULTI_PACK_INDEX environment variable
On 10/12/2018 1:34 PM, Derrick Stolee via GitGitGadget wrote:

To increase coverage of the multi-pack-index feature, add a GIT_TEST_MULTI_PACK_INDEX environment variable similar to other GIT_TEST_* variables. After creating the environment variable and running the test suite with it enabled, I found a few bugs in the multi-pack-index implementation. These are handled by the first two patches.

I have set up a CI build on Azure Pipelines [1] that runs the test suite with a few optional features enabled, including GIT_TEST_MULTI_PACK_INDEX and GIT_TEST_COMMIT_GRAPH. I'll use this to watch the features and ensure they work well with the rest of the ongoing work. Eventually, we can add these variables to the Travis CI scripts.

[1] https://git.visualstudio.com/git/_build?definitionId=4

Derrick Stolee (3):
  midx: fix broken free() in close_midx()
  midx: close multi-pack-index on repack
  multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX

 builtin/repack.c            |  7 +--
 midx.c                      | 26 --
 midx.h                      |  6 +-
 t/README                    |  4
 t/t5310-pack-bitmaps.sh     |  1 +
 t/t5319-multi-pack-index.sh |  2 +-
 t/t9300-fast-import.sh      |  2 +-
 7 files changed, 37 insertions(+), 11 deletions(-)

base-commit: 5a0cc8aca797dbd7d2be3b67458ff880ed45cddf

I should explicitly mention that this base commit is different, as otherwise I will conflict with ds/multi-pack-verify with the new prototype in midx.h.

Thanks,
-Stolee
[PATCH v2 2/3] midx: close multi-pack-index on repack
From: Derrick Stolee

When repacking, we may remove pack-files. This invalidates the multi-pack-index (if it exists). Previously, we removed the multi-pack-index file before removing any pack-file. In some cases, the repack command may load the multi-pack-index into memory. This may lead to later in-memory references to the non-existent pack-files.

Signed-off-by: Derrick Stolee
---
 builtin/repack.c |  3 +--
 midx.c           | 15 ---
 midx.h           |  4 +++-
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index c6a7943d5c..44965cbaa3 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -431,8 +431,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		char *fname, *fname_old;
 
 		if (!midx_cleared) {
-			/* if we move a packfile, it will invalidated the midx */
-			clear_midx_file(get_object_directory());
+			clear_midx_file(the_repository);
 			midx_cleared = 1;
 		}
diff --git a/midx.c b/midx.c
index bf1f511862..22247a30ab 100644
--- a/midx.c
+++ b/midx.c
@@ -176,9 +176,13 @@ cleanup_fail:
 	return NULL;
 }
 
-static void close_midx(struct multi_pack_index *m)
+void close_midx(struct multi_pack_index *m)
 {
 	uint32_t i;
+
+	if (!m)
+		return;
+
 	munmap((unsigned char *)m->data, m->data_len);
 	close(m->fd);
 	m->fd = -1;
@@ -914,9 +918,14 @@ cleanup:
 	return 0;
 }
 
-void clear_midx_file(const char *object_dir)
+void clear_midx_file(struct repository *r)
 {
-	char *midx = get_midx_filename(object_dir);
+	char *midx = get_midx_filename(r->objects->objectdir);
+
+	if (r->objects && r->objects->multi_pack_index) {
+		close_midx(r->objects->multi_pack_index);
+		r->objects->multi_pack_index = NULL;
+	}
 
 	if (remove_path(midx)) {
 		UNLEAK(midx);
diff --git a/midx.h b/midx.h
index ce80b91c68..0f68bccdd5 100644
--- a/midx.h
+++ b/midx.h
@@ -42,7 +42,9 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_name);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
 int write_midx_file(const char *object_dir);
-void clear_midx_file(const char *object_dir);
+void clear_midx_file(struct repository *r);
 
 int verify_midx_file(const char *object_dir);
 
+void close_midx(struct multi_pack_index *m);
+
 #endif
-- 
gitgitgadget
[PATCH v2 3/3] multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX
From: Derrick Stolee

The multi-pack-index feature is tested in isolation by t5319-multi-pack-index.sh, but there are many more interesting scenarios in the test suite surrounding pack-file data shapes and interactions. Since the multi-pack-index is an optional data structure, it does not make sense to include it by default in those tests.

Instead, add a new GIT_TEST_MULTI_PACK_INDEX environment variable that enables core.multiPackIndex and writes a multi-pack-index after each 'git repack' command. This adds extra test coverage when needed. There are a few spots in the test suite that need to react to this change:

* t5319-multi-pack-index.sh: there is a test that checks that 'git repack' deletes the multi-pack-index. Disable the environment variable to ensure this still happens.

* t5310-pack-bitmaps.sh: One test moves a pack-file from the object directory to an alternate. This breaks the multi-pack-index, so delete the multi-pack-index at this point, if it exists.

* t9300-fast-import.sh: One test verifies the number of files in the .git/objects/pack directory is exactly 8. Exclude the multi-pack-index from this count so it is still 8 in all cases.
Signed-off-by: Derrick Stolee
---
 builtin/repack.c            | 4
 midx.c                      | 9 +++--
 midx.h                      | 2 ++
 t/README                    | 4
 t/t5310-pack-bitmaps.sh     | 1 +
 t/t5319-multi-pack-index.sh | 2 +-
 t/t9300-fast-import.sh      | 2 +-
 7 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 44965cbaa3..26dcccdafc 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -553,6 +553,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (!no_update_server_info)
 		update_server_info(0);
 	remove_temporary_files();
+
+	if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0))
+		write_midx_file(get_object_directory());
+
 	string_list_clear(&names, 0);
 	string_list_clear(&rollback, 0);
 	string_list_clear(&existing_packs, 0);
diff --git a/midx.c b/midx.c
index 22247a30ab..02b2211e31 100644
--- a/midx.c
+++ b/midx.c
@@ -335,9 +335,14 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i
 	struct multi_pack_index *m;
 	struct multi_pack_index *m_search;
 	int config_value;
+	static int env_value = -1;
 
-	if (repo_config_get_bool(r, "core.multipackindex", &config_value) ||
-	    !config_value)
+	if (env_value < 0)
+		env_value = git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0);
+
+	if (!env_value &&
+	    (repo_config_get_bool(r, "core.multipackindex", &config_value) ||
+	     !config_value))
 		return 0;
 
 	for (m_search = r->objects->multi_pack_index; m_search; m_search = m_search->next)
diff --git a/midx.h b/midx.h
index 0f68bccdd5..f2bb7e681c 100644
--- a/midx.h
+++ b/midx.h
@@ -3,6 +3,8 @@
 
 #include "repository.h"
 
+#define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX"
+
 struct multi_pack_index {
 	struct multi_pack_index *next;
diff --git a/t/README b/t/README
index 5e48a043ce..9bfdd3004c 100644
--- a/t/README
+++ b/t/README
@@ -327,6 +327,10 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
 be written after every 'git commit' command, and overrides the
 'core.commitGraph' setting to true.
 
+GIT_TEST_MULTI_PACK_INDEX=<boolean>, when true, forces the multi-pack-
+index to be written after every 'git repack' command, and overrides the
+'core.multiPackIndex' setting to true.
+
 Naming Tests
 
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 1be3459c5b..82d7f7f6a5 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -191,6 +191,7 @@ test_expect_success 'pack-objects respects --honor-pack-keep (local bitmapped pa
 test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' '
 	mv .git/objects/pack/$packbitmap.* alt.git/objects/pack/ &&
+	rm -f .git/objects/pack/multi-pack-index &&
 	test_when_finished "mv alt.git/objects/pack/$packbitmap.* .git/objects/pack/" &&
 	echo HEAD | git pack-objects --local --stdout --revs >3b.pack &&
 	git index-pack 3b.pack &&
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index bd8e841b81..70926b5bc0 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -271,7 +271,7 @@ test_expect_success 'git-fsck incorrect offset' '
 test_expect_success 'repack removes multi-pack-index' '
 	test_path_is_file $objdir/pack/multi-pack-index &&
-	git repack -adf &&
+	GIT_TEST_MULTI_PA
[PATCH v2 1/3] midx: fix broken free() in close_midx()
From: Derrick Stolee

When closing a multi-pack-index, we intend to close each pack-file and free the struct packed_git that represents it. However, this line was previously freeing the array of pointers, not the pointer itself. This leads to a double-free issue.

Signed-off-by: Derrick Stolee
---
 midx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 713d6f9dde..bf1f511862 100644
--- a/midx.c
+++ b/midx.c
@@ -186,7 +186,7 @@ static void close_midx(struct multi_pack_index *m)
 	for (i = 0; i < m->num_packs; i++) {
 		if (m->packs[i]) {
 			close_pack(m->packs[i]);
-			free(m->packs);
+			free(m->packs[i]);
 		}
 	}
 	FREE_AND_NULL(m->packs);
-- 
gitgitgadget
[PATCH v2 0/3] Add GIT_TEST_MULTI_PACK_INDEX environment variable
To increase coverage of the multi-pack-index feature, add a GIT_TEST_MULTI_PACK_INDEX environment variable similar to other GIT_TEST_* variables. After creating the environment variable and running the test suite with it enabled, I found a few bugs in the multi-pack-index implementation. These are handled by the first two patches.

I have set up a CI build on Azure Pipelines [1] that runs the test suite with a few optional features enabled, including GIT_TEST_MULTI_PACK_INDEX and GIT_TEST_COMMIT_GRAPH. I'll use this to watch the features and ensure they work well with the rest of the ongoing work. Eventually, we can add these variables to the Travis CI scripts.

[1] https://git.visualstudio.com/git/_build?definitionId=4

Derrick Stolee (3):
  midx: fix broken free() in close_midx()
  midx: close multi-pack-index on repack
  multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX

 builtin/repack.c            |  7 +--
 midx.c                      | 26 --
 midx.h                      |  6 +-
 t/README                    |  4
 t/t5310-pack-bitmaps.sh     |  1 +
 t/t5319-multi-pack-index.sh |  2 +-
 t/t9300-fast-import.sh      |  2 +-
 7 files changed, 37 insertions(+), 11 deletions(-)

base-commit: 5a0cc8aca797dbd7d2be3b67458ff880ed45cddf
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-27%2Fderrickstolee%2Fmidx-test%2Fupstream-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-27/derrickstolee/midx-test/upstream-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/27

Range-diff vs v1:

 1:  9fcbbe336d = 1:  8bd672fe26 midx: fix broken free() in close_midx()
 2:  725ebadc92 !
 2:  2d8f26679d midx: close multi-pack-index on repack
    @@ -15,16 +15,15 @@
     --- a/builtin/repack.c
     +++ b/builtin/repack.c
     @@
    +		char *fname, *fname_old;
     		if (!midx_cleared) {
    -			/* if we move a packfile, it will invalidated the midx */
    -+			if (the_repository->objects) {
    -+				close_midx(the_repository->objects->multi_pack_index);
    -+				the_repository->objects->multi_pack_index = NULL;
    -+			}
    -			clear_midx_file(get_object_directory());
    +-			/* if we move a packfile, it will invalidated the midx */
    +-			clear_midx_file(get_object_directory());
    ++			clear_midx_file(the_repository);
     			midx_cleared = 1;
     		}
    +
     diff --git a/midx.c b/midx.c
     --- a/midx.c
    @@ -44,13 +43,34 @@
     	munmap((unsigned char *)m->data, m->data_len);
     	close(m->fd);
     	m->fd = -1;
    +@@
    +	return 0;
    + }
    +
    +-void clear_midx_file(const char *object_dir)
    ++void clear_midx_file(struct repository *r)
    + {
    +-	char *midx = get_midx_filename(object_dir);
    ++	char *midx = get_midx_filename(r->objects->objectdir);
    ++
    ++	if (r->objects && r->objects->multi_pack_index) {
    ++		close_midx(r->objects->multi_pack_index);
    ++		r->objects->multi_pack_index = NULL;
    ++	}
    +
    + 	if (remove_path(midx)) {
    + 		UNLEAK(midx);
     diff --git a/midx.h b/midx.h
     --- a/midx.h
     +++ b/midx.h
    @@
    + int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
    + int write_midx_file(const char *object_dir);
    - void clear_midx_file(const char *object_dir);
    +-void clear_midx_file(const char *object_dir);
    ++void clear_midx_file(struct repository *r);
    + int verify_midx_file(const char *object_dir);
    +void close_midx(struct multi_pack_index *m);
    +
 3:  04e3e91082 = 3:  57c64e814c multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX

-- 
gitgitgadget
Re: [PATCH v3 1/2] commit-graph write: add progress output
On 10/12/2018 11:07 AM, Ævar Arnfjörð Bjarmason wrote:

On Fri, Oct 12 2018, Junio C Hamano wrote:

Makes sense. If this second iteration were also time consuming, then it probably is a good idea to split these into two separate phases? "Counting 1...N" followed by "Inspecting 1...N" or something like that. Of course, if the latter does not take much time, then doing without any progress indicator is also fine.

That's a good point. Derrick: If the three loops in close_reachable() had to be split up into different progress stages and given different names, what do you think they should be? Now it's "Annotating commits in commit graph" for all of them.

The following is the best I can think of right now:

1. Loading known commits.
2. Expanding reachable commits.
3. Clearing commit marks.

-Stolee
Git Test Coverage Report (Friday, Oct 12)
i < nr; i++) {
3255089ada 3515) ieot->entries[i].offset = get_be32(index);
3255089ada 3516) index += sizeof(uint32_t);
3255089ada 3517) ieot->entries[i].nr = get_be32(index);
3255089ada 3518) index += sizeof(uint32_t);
3255089ada 3521) return ieot;
3255089ada 3524) static void write_ieot_extension(struct strbuf *sb, struct index_entry_offset_table *ieot)
3255089ada 3530) put_be32(&buffer, IEOT_VERSION);
3255089ada 3531) strbuf_add(sb, &buffer, sizeof(uint32_t));
3255089ada 3534) for (i = 0; i < ieot->nr; i++) {
3255089ada 3537) put_be32(&buffer, ieot->entries[i].offset);
3255089ada 3538) strbuf_add(sb, &buffer, sizeof(uint32_t));
3255089ada 3541) put_be32(&buffer, ieot->entries[i].nr);
3255089ada 3542) strbuf_add(sb, &buffer, sizeof(uint32_t));
3255089ada 3544) }

rebase-interactive.c
64a43cbd5d 61) return error_errno(_("could not read '%s'."), todo_file);
64a43cbd5d 65) strbuf_release(&buf);
64a43cbd5d 66) return -1;
a9f5476fbc 74) return error_errno(_("could not read '%s'."), todo_file);
a9f5476fbc 78) strbuf_release(&buf);
a9f5476fbc 79) return -1;
64a43cbd5d 85) return -1;

revision.c
4943d28849 2931) return;
4943d28849 2934) return;
4943d28849 2937) c->object.flags |= UNINTERESTING;
4943d28849 2940) return;
4943d28849 2943) mark_parents_uninteresting(c);
4943d28849 2966) return;
4943d28849 2969) return;
4943d28849 2974) return;
4943d28849 3019) info->topo_queue.compare = compare_commits_by_commit_date;
4943d28849 3020) break;
4943d28849 3022) init_author_date_slab(&info->author_date);
4943d28849 3023) info->topo_queue.compare = compare_commits_by_author_date;
4943d28849 3024) info->topo_queue.cb_data = &info->author_date;
4943d28849 3025) break;
4943d28849 3038) continue;
4943d28849 3048) record_author_date(&info->author_date, c);
6c04ff3001 3086) if (!revs->ignore_missing_links)
6c04ff3001 3087) die("Failed to traverse parents of commit %s",
4943d28849 3088) oid_to_hex(&commit->object.oid));
4943d28849 3096) continue;

sequencer.c
65850686cf 2276) return;
65850686cf 2373) write_file(rebase_path_quiet(), "%s\n", quiet);
2c58483a59 3371) return error(_("could not checkout %s"), commit);
4df66c40b0 3385) return error(_("%s: not a valid OID"), orig_head);
b97e187364 4748) return -1;
b97e187364 4751) return -1;
b97e187364 4757) return error_errno(_("could not read '%s'."), todo_file);
b97e187364 4760) todo_list_release(&todo_list);
b97e187364 4761) return error(_("unusable todo list: '%s'"), todo_file);
b97e187364 4780) todo_list_release(&todo_list);
b97e187364 4781) return -1;
b97e187364 4785) return error(_("could not copy '%s' to '%s'."), todo_file,
b97e187364 4789) return error(_("could not transform the todo list"));
b97e187364 4818) return error(_("could not transform the todo list"));
b97e187364 4821) return error(_("could not skip unnecessary pick commands"));
b97e187364 4827) return -1;

split-index.c
568f3a6073 310) const unsigned int ondisk_flags =
568f3a6073 314) ce_flags = ce->ce_flags;
568f3a6073 315) base_flags = base->ce_flags;
568f3a6073 317) ce->ce_flags &= ondisk_flags;
568f3a6073 318) base->ce_flags &= ondisk_flags;
568f3a6073 319) ret = memcmp(&ce->ce_stat_data, &base->ce_stat_data,
568f3a6073 322) ce->ce_flags = ce_flags;
568f3a6073 323) base->ce_flags = base_flags;
568f3a6073 324) if (ret)
568f3a6073 325) ce->ce_flags |= CE_UPDATE_IN_BASE;

strbuf.c
f95736288a 127) --sb->len;

transport-helper.c
fb19d32f05 643) if (!data->connect && !data->stateless_connect)

transport.c
99bcb883cb 299) BUG("buffer must be empty at the end of handshake()");

Commits introducing uncovered code:

Alban Gruin 0af129b2e: rebase--interactive2: rewrite the submodes of interactive rebase in C
Alban Gruin 2c58483a5: rebase -i: rewrite setup_reflog_action() in C
Alban Gruin 34b47315d: rebase -i: move rebase--helper modes to rebase--interactive
Alban Gruin 4df66c40b: rebase -i: rewrite checkout_onto() in C
Alban Gruin 53bbcfbde: rebase -i: implement the main part of interactive rebase as a builtin
Alban Gruin 64a43cbd5: rebase -i:
rewrite the edit-todo functionality in C Alban Gruin 65850686c: rebase -i: rewrite write_basic_state() in C Alban Gruin a9f5476fb: sequencer: refactor append_todo_help() to write its message to a buffer Alban Gruin b97e18736: rebase -i: rewrite complete_action() in C Ben Peart 3255089ad: ieot: add Index Entry Offset Table (IEOT) extension Ben Peart 3b1d9e045: eoie: add End of Index Entry (EOIE) extension Ben Peart 77ff1127a: read-cache: load cache entries on worker threads Ben
Re: [PATCH v3 4/7] revision.c: begin refactoring --topo-order logic
On 10/12/2018 2:33 AM, Junio C Hamano wrote: "Derrick Stolee via GitGitGadget" writes: * revs->limited implies we run limit_list() to walk the entire reachable set. There are some short-cuts here, such as if we perform a range query like 'git rev-list COMPARE..HEAD' and we can stop limit_list() when all queued commits are uninteresting. * revs->topo_order implies we run sort_in_topological_order(). See the implementation of that method in commit.c. It implies that the full set of commits to order is in the given commit_list. These two methods imply that a 'git rev-list --topo-order HEAD' command must walk the entire reachable set of commits _twice_ before returning a single result. With or without "--topo-order", running rev-list without any negative commit means we must dig down to the roots that can be reached from the positive commits we have. If we use default order in 'git log', we don't walk all the way to the root commits, and instead trust the commit-date. (This is different than --date-order, which does guarantee parents after children.) In this case, revs->limited is false. I am not so sure if having to run the "sort" of order N counts as "walk the entire reachable set once" (in addition to the enumeration that must be done to prepare that N commits, performed in limit_list()). sort_in_topological_order() does actually _two_ walks (the in-degree computation plus the walk that peels commits of in-degree zero), but those walks are cheaper because we've already parsed the commits in limit_list(). 3. expand_topo_walk() provides get_revision_1() with a way to signal walking beyond the latest commit. Currently, this calls add_parents_to_list() exactly like the old logic. "latest"? We dig down the history from newer to older, so at some point we hit an old commit and need to find the parents to keep walking towards even older parts of the history. Did you mean "earliest" instead? 
I mean "latest" in terms of the algorithm, so "the commit that was returned by get_revision_1() most recently". This could use some rewriting for clarity. @@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s if (revs->diffopt.objfind) revs->simplify_history = 0; - if (revs->topo_order) + if (revs->topo_order && !generation_numbers_enabled(the_repository)) revs->limited = 1; Are we expecting that this is always a bool? Can there be new commits for which generation numbers are not computed and stored while all the old, stable and packed commits have generation numbers? For this algorithm to work, we only care that _some_ commits have generation numbers. We expect that if a commit-graph file exists with generation numbers, then the majority of commits have generation numbers. The commits that were added or fetched since the commit-graph was written will have generation number INFINITY, but the topo-order algorithm will still work and be efficient in those cases. (This is also why we have the "half graph" case in test_three_modes.) @@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid, return 0; } +struct topo_walk_info {}; + +static void init_topo_walk(struct rev_info *revs) +{ + struct topo_walk_info *info; + revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info)); + info = revs->topo_walk_info; + memset(info, 0, sizeof(struct topo_walk_info)); There is no member in the struct at this point. Are we sure this is safe? Just being curious. I know xmalloc() gives us at least one byte and info won't be NULL. I just do not know offhand if we have a guarantee that memset() acts sensibly to fill the first 0 bytes. This is a good question. It seems to work for me when I check out your version of this commit (6c04ff30 "revision.c: begin refactoring --topo-order logic") and run all tests. 
+ limit_list(revs); + sort_in_topological_order(&revs->commits, revs->sort_order); +} + +static struct commit *next_topo_commit(struct rev_info *revs) +{ + return pop_commit(&revs->commits); +} + +static void expand_topo_walk(struct rev_info *revs, struct commit *commit) +{ + if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) { + if (!revs->ignore_missing_links) + die("Failed to traverse parents of commit %s", + oid_to_hex(&commit->object.oid)); + } +} + int prepare_revision_walk(struct rev_info *revs) { int i; @@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs) commit_list_sort_by_date(&revs->commits); if (revs->no_wa
Re: [PATCH v4 0/1] contrib: Add script to show uncovered "new" lines
On 10/11/2018 11:01 PM, Junio C Hamano wrote: "Derrick Stolee via GitGitGadget" writes: CHANGES IN V4: I reduced the blame output using -s which decreases the width. I include a summary of the commit authors at the end to help people see the lines they wrote. This version is also copied into a build definition in the public Git project on Azure Pipelines [1]. I'll use this build definition to generate the coverage report after each "What's Cooking" email. [1] https://git.visualstudio.com/git/_build?definitionId=5 Thanks. Let's move this forward by merging it down to 'next'. Sounds good. When it moves into 'master' I can update my build to call the script from source. Thanks, -Stolee
Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic
On 10/11/2018 11:35 AM, Jeff King wrote: On Fri, Sep 21, 2018 at 10:39:36AM -0700, Derrick Stolee via GitGitGadget wrote: From: Derrick Stolee When running a command like 'git rev-list --topo-order HEAD', Git performed the following steps: [...] In the new algorithm, these three steps correspond to three different commit walks. We run these walks simultaneously, A minor nit, but this commit message doesn't mention the most basic thing up front: that its main purpose is to introduce a new algorithm for topo-order. ;) It's obvious in the context of reviewing the series, but somebody reading "git log" later may want a bit more. Perhaps: revision.c: implement generation-based topo-order algorithm as a subject, and/or an introductory paragraph like: The current --topo-order algorithm requires walking all commits we are going to output up front, topo-sorting them, all before outputting the first value. This patch introduces a new algorithm which uses stored generation numbers to incrementally walk in topo-order, outputting commits as we go. Other than that, I find this to be a wonderfully explanatory commit message. :) Good idea. I'll make that change. The walks are as follows: 1. EXPLORE: using the explore_queue priority queue (ordered by maximizing the generation number), parse each reachable commit until all commits in the queue have generation number strictly lower than needed. During this walk, update the UNINTERESTING flags as necessary. OK, this makes sense. If we know that everybody else in our queue is at generation X, then it is safe to output a commit at generation greater than X. I think this by itself would allow us to implement "show no parents before all of its children are shown", right? But --topo-order promises a bit more: "avoid showing commits on multiple lines of history intermixed". I guess INFINITY generation numbers also need more care. 
For a real generation number, we know that "gen(A) == gen(B)" implies that there is no ancestry relationship between the two. But not so for INFINITY. Yeah, to deal with INFINITY (and ZERO, but that won't happen if generation_numbers_enabled() returns true), we treat gen(A) == gen(B) as a "no information" state. So, to output a commit at generation X, we need to have our maximum generation number in the unexplored area to be at most X - 1. You'll see strict inequality when checking generations. 2. INDEGREE: using the indegree_queue priority queue (ordered by maximizing the generation number), add one to the in-degree of each parent for each commit that is walked. Since we walk in order of decreasing generation number, we know that discovering an in-degree value of 0 means the value for that commit was not initialized, so should be initialized to two. (Recall that in-degree value "1" is what we use to say a commit is ready for output.) As we iterate the parents of a commit during this walk, ensure the EXPLORE walk has walked beyond their generation numbers. I wondered how this would work for INFINITY. We can't know the order of a bunch of INFINITY nodes at all, so we never know when their in-degree values are "done". But if I understand the EXPLORE walk, we'd basically walk all of INFINITY down to something with a real generation number. Is that right? But after that, I'm not totally clear on why we need this INDEGREE walk. The INDEGREE walk is an important element for Kahn's algorithm. The final output order is dictated by peeling commits of "indegree zero" to ensure all children are output before their parents. (Note: since we use literal 0 to mean "uninitialized", we peel commits when the indegree slab has value 1.) This walk replaces the indegree logic from sort_in_topological_order(). That method performs one walk that fills the indegree slab, then another walk that peels the commits with indegree 0 and inserts them into a list. 3. 
TOPO: using the topo_queue priority queue (ordered based on the sort_order given, which could be commit-date, author-date, or typical topo-order which treats the queue as a LIFO stack), remove a commit from the queue and decrement the in-degree of each parent. If a parent has an in-degree of one, then we add it to the topo_queue. Before we decrement the in-degree, however, ensure the INDEGREE walk has walked beyond that generation number. OK, this makes sense to make --author-date-order, etc, work. Potentially those numbers might have no relationship at all with the graph structure, but we promise "no parent before its children are shown", so this is really just a tie-breaker after the topo-sort anyway. As long as steps 1 and 2 are correct and produce a complete set of commits for one "layer", this should be OK. I guess
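Stripped of the generation-number cutoffs and lazy walks, the INDEGREE/TOPO interplay described in this thread is Kahn's algorithm. Here is a minimal Python sketch of that core idea (illustrative only — git's real code uses commit slabs and priority queues, and uses indegree value 1, not 0, to mean "ready"):

```python
def topo_order(parents):
    """Emit commits children-first, like `git rev-list --topo-order`.

    `parents` maps each commit to the list of its parents (a DAG).
    """
    # INDEGREE walk: count, for each commit, how many children point at it.
    indegree = {c: 0 for c in parents}
    for c in parents:
        for p in parents[c]:
            indegree[p] += 1

    # Seed the TOPO queue with the tips (commits no child points at).
    stack = [c for c, d in indegree.items() if d == 0]

    out = []
    while stack:
        c = stack.pop()           # plain LIFO stack, like git's topo_queue
        out.append(c)
        for p in parents[c]:
            indegree[p] -= 1
            if indegree[p] == 0:  # all children emitted; parent is ready
                stack.append(p)
    return out

# A small history:  D -> B -> A  and  D -> C -> A  (A is the root commit).
history = {'A': [], 'B': ['A'], 'C': ['A'], 'D': ['B', 'C']}
order = topo_order(history)
print(order)  # ['D', 'C', 'B', 'A'] -- every parent after all its children
```

The generation-number trick in the series makes exactly this peeling lazy: instead of counting all in-degrees up front, the INDEGREE walk only needs to have advanced past a commit's generation before that commit's counter can be trusted.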
Re: [PATCH 1/2] One filter per commit
On 10/10/2018 9:21 PM, Jonathan Tan wrote: diff --git a/commit-graph.c b/commit-graph.c index f415d3b41f..90b0b3df90 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -715,13 +715,11 @@ static int add_ref_to_list(const char *refname, static void add_changes_to_bloom_filter(struct bloom_filter *bf, struct commit *parent, struct commit *commit, + int index, struct diff_options *diffopt) { - unsigned char p_c_hash[GIT_MAX_RAWSZ]; int i; - hashxor(parent->object.oid.hash, commit->object.oid.hash, p_c_hash); - diff_tree_oid(&parent->object.oid, &commit->object.oid, "", diffopt); diffcore_std(diffopt); @@ -756,8 +754,8 @@ static void add_changes_to_bloom_filter(struct bloom_filter *bf, the_hash_algo->update_fn(&ctx, path, p - path); the_hash_algo->final_fn(name_hash, &ctx); - hashxor(name_hash, p_c_hash, hash); - bloom_filter_add_hash(bf, hash); + hashxor(name_hash, parent->object.oid.hash, hash); + bloom_filter_add_hash(bf, index, hash); } while (*p); diff_free_filepair(diff_queued_diff.queue[i]); [snip] @@ -768,11 +766,10 @@ static void add_changes_to_bloom_filter(struct bloom_filter *bf, } static void fill_bloom_filter(struct bloom_filter *bf, - struct progress *progress) + struct progress *progress, struct commit **commits, int commit_nr) { struct rev_info revs; const char *revs_argv[] = {NULL, "--all", NULL}; - struct commit *commit; int i = 0; /* We (re-)create the bloom filter from scratch every time for now. 
*/ @@ -783,18 +780,19 @@ static void fill_bloom_filter(struct bloom_filter *bf, if (prepare_revision_walk(&revs)) die("revision walk setup failed while preparing bloom filter"); - while ((commit = get_revision(&revs))) { + for (i = 0; i < commit_nr; i++) { + struct commit *commit = commits[i]; struct commit_list *parent; for (parent = commit->parents; parent; parent = parent->next) - add_changes_to_bloom_filter(bf, parent->item, commit, + add_changes_to_bloom_filter(bf, parent->item, commit, i, &revs.diffopt); [snip] - hashxor(pi->name_hash, p_c_hash, hash); - if (bloom_filter_check_hash(&bf, hash)) { + hashxor(pi->name_hash, parent->object.oid.hash, hash); + if (bloom_filter_check_hash(&bf, commit->graph_pos, hash)) { /* * At least one of the interesting pathspecs differs, * so we can return early and let the diff machinery One main benefit of storing one Bloom filter per commit is to avoid recomputing hashes at every commit. Currently, this patch only improves locality when checking membership at the cost of taking up more space. Drop the dependence on the parent oid and then we can save the time spent hashing during history queries. -Stolee
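To see why the parent-oid salt costs time during queries, here is a toy Python illustration (hypothetical helper names, nothing from git's code): with a salted key, every filter probed needs a fresh hash of the combined path-and-parent key, while an unsalted key can be hashed once and reused across all commits.

```python
import hashlib

CALLS = 0

def h32(data):
    """Toy 32-bit hash; counts invocations to expose the cost difference."""
    global CALLS
    CALLS += 1
    return int.from_bytes(hashlib.sha1(data).digest()[:4], 'little')

def query_salted(path, parent_oids):
    # Old scheme: the key mixes in the parent OID, so every commit/parent
    # pair we test needs a fresh hash computation.
    return [h32(path + b':' + oid) for oid in parent_oids]

def query_unsalted(path, parent_oids):
    # New scheme: hash the path once, reuse the result for every probe.
    base = h32(path)
    return [base for _ in parent_oids]

oids = [b'%040d' % i for i in range(1000)]   # 1000 fake parent OIDs

CALLS = 0
query_salted(b'Makefile', oids)
salted_hashes = CALLS

CALLS = 0
query_unsalted(b'Makefile', oids)
unsalted_hashes = CALLS

print(salted_hashes, unsalted_hashes)  # 1000 1
```

The probe itself (a modulo and a bit test per hash) is cheap either way; it is the per-commit rehashing that dominates a long history walk.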
Re: Bloom Filters
On 10/9/2018 7:12 PM, Jeff King wrote: On Tue, Oct 09, 2018 at 05:14:50PM -0400, Jeff King wrote: Hmph. It really sounds like we could do better with a custom RLE solution. But that makes me feel like I'm missing something, because surely I can't invent something better than the state of the art in a simple thought experiment, right? I know what I'm proposing would be quite bad for random access, but my impression is that EWAH is the same. For the scale of bitmaps we're talking about, I think linear/streaming access through the bitmap would be OK. Thinking on it more, what I was missing is that for truly dense random bitmaps, this will perform much worse. Because it will use a byte to say "there's one 1", rather than a bit. But I think it does OK in practice for the very sparse bitmaps we tend to see in this application. I was able to generate a complete output that can reproduce "log --name-status -t" for linux.git in 32MB. But: - 15MB of that is commit sha1s, which will be stored elsewhere in a "real" system - 5MB of that is path list (which should shrink by a factor of 10 with prefix compression, and is really a function of a tree size less than history depth) So the per-commit cost is not too bad. That's still not counting merges, though, which would add another 10-15% (or maybe more; their bitmaps are less sparse). I don't know if this is a fruitful path at all or not. I was mostly just satisfying my own curiosity on the bitmap encoding question. But I'll post the patches, just to show my work. The first one is the same initial proof of concept I showed earlier. [1/3]: initial tree-bitmap proof of concept [2/3]: test-tree-bitmap: add "dump" mode [3/3]: test-tree-bitmap: replace ewah with custom rle encoding Makefile| 1 + t/helper/test-tree-bitmap.c | 344 2 files changed, 345 insertions(+) create mode 100644 t/helper/test-tree-bitmap.c I'm trying to test this out myself, and am having trouble reverse engineering how I'm supposed to test it. 
Looks like running "t/helper/test-tree-bitmap gen" will output a lot of binary data. Where should I store that? Does any file work? Is this series just for the storage costs, assuming that we would replace all TREESAME checks with a query into this database? Or do you have a way to test how much this would improve a "git log -- " query? Thanks, -Stolee
Re: What's cooking in git.git (Oct 2018, #01; Wed, 10)
On 10/10/2018 1:43 AM, Junio C Hamano wrote: * ds/reachable-topo-order (2018-09-21) 7 commits - revision.c: refactor basic topo-order logic - revision.h: add whitespace in flag definitions - commit/revisions: bookkeeping before refactoring - revision.c: begin refactoring --topo-order logic - test-reach: add rev-list tests - test-reach: add run_three_modes method - prio-queue: add 'peek' operation The revision walker machinery learned to take advantage of the commit generation numbers stored in the commit-graph file. What's the status of this topic? I've reached out for review, especially on the rather large patch "revision.c: refactor basic topo-order logic" but have received no messages about it. I don't think anyone has even done a cursory style review. I really think that this is a valuable topic, and it would be nice to have in 2.20, but I'm not pushing to merge something that no one has reviewed. * ds/format-commit-graph-docs (2018-08-21) 2 commits - commit-graph.txt: improve formatting for asciidoc - Docs: Add commit-graph tech docs to Makefile Design docs for the commit-graph machinery is now made into HTML as well as text. Will discard. I am inclined to drop these, as I do not see much clarity in HTML output over the text source. Opinions? These have been marked for discard for a few weeks. I agree they should be discarded. Thanks, -Stolee
Re: [RFC PATCH 6/6] commit-reach: fix first-parent heuristic
On 10/10/2018 9:50 PM, Jonathan Nieder wrote: Hi, Derrick Stolee wrote: commit-reach.c | 4 +++- t/t6600-test-reach.sh | 2 +- 2 files changed, 4 insertions(+), 2 deletions(-) I like this testing technique, and the test passes for me. Except: if I put CC = cc -m32 NO_OPENSSL = YesPlease NO_CURL = YesPlease in config.mak (the first line to force 32-bit pointers, the others to avoid some dependencies on libs that I don't have 32-bit versions of), then the test fails for me: $ ./t6600-test-reach.sh -v -x -i [...] expecting success: cp commit-graph-full .git/objects/info/commit-graph && run_and_check_trace2 can_all_from_reach_with_flag num_walked 19 input \ "test-tool reach can_all_from_reach" ++ cp commit-graph-full .git/objects/info/commit-graph ++ run_and_check_trace2 can_all_from_reach_with_flag num_walked 19 input 'test-tool reach can_all_from_reach' ++ CATEGORY=can_all_from_reach_with_flag ++ KEY=num_walked ++ VALUE=19 ++ INPUT=input ++ COMMAND='test-tool reach can_all_from_reach' +++ pwd ++ GIT_TR2_PERFORMANCE='/usr/local/google/home/jrn/src/git/t/trash directory.t6600-test-reach/perf-log.txt' ++ test-tool reach can_all_from_reach can_all_from_reach(X,Y):1 ++ cat perf-log.txt ++ grep 'category:can_all_from_reach_with_flag key:num_walked value:19' error: last command exited with $?=1 not ok 11 - can_all_from_reach:perf # # cp commit-graph-full .git/objects/info/commit-graph && # run_and_check_trace2 can_all_from_reach_with_flag num_walked 19 input \ # "test-tool reach can_all_from_reach" # When I cat perf-log.txt, I see ..category:can_all_from_reach_with_flag key:num_walked value:20 instead of the expected 19. It is possible this is related to the sort-order problem reported and fixed by Rene [1]. I'll look into it in any case. [1] https://public-inbox.org/git/dca35e44-a763-bcf0-f457-b8dab5381...@web.de/
Re: [PATCH 0/4] Bloom filter experiment
On 10/9/2018 3:34 PM, SZEDER Gábor wrote: To keep the ball rolling, here is my proof of concept in a somewhat cleaned-up form, with still plenty of rough edges. You can play around with it like this: $ GIT_USE_POC_BLOOM_FILTER=$((8*1024*1024*8)) git commit-graph write Computing commit graph generation numbers: 100% (52801/52801), done. Computing bloom filter: 100% (52801/52801), done. # Yeah, I even added progress indicator! :) $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y git rev-list --count --full-history HEAD -- t/valgrind/valgrind.sh 886 20:40:24.783699 revision.c:486 bloom filter total queries: 66095 definitely not: 64953 maybe: 1142 false positives: 256 fp ratio: 0.003873 The value of $GIT_USE_POC_BLOOM_FILTER only really matters when writing the Bloom filter, and it specifies the number of bits in the filter's bitmap, IOW the above command creates an 8MB Bloom filter. To make use of the filter the variable can be anything non-empty. Writing the Bloom filter is very slow as it is (yeah, that's why I bothered with the progress indicator ;). I wrote about it in patch 2's commit message: the cause for about half of the slowness is rather obvious, but I don't (yet) know what's responsible for the other half. Not a single test... but I've run loops over all files in git.git comparing 'git rev-list HEAD -- $file's output with and without the Bloom filter, and, surprisingly, they match. My quick'n'dirty experiments usually don't fare this well... It's also available at: https://github.com/szeder/git bloom-filter-experiment Thanks! I will take a close look at this tomorrow and start playing with it. -Stolee
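The "fp ratio" in the trace above can be sanity-checked against the textbook Bloom-filter false-positive estimate (1 - e^(-kn/m))^k for m bits, n entries, and k hash functions. A small Python sketch of the generic formula (not tied to this PoC's actual parameters):

```python
import math

def bloom_fp_rate(m_bits, n_entries, k_hashes):
    """Textbook estimate of a Bloom filter's false-positive probability."""
    return (1 - math.exp(-k_hashes * n_entries / m_bits)) ** k_hashes

def optimal_k(m_bits, n_entries):
    """Hash-function count that minimizes the false-positive rate."""
    return max(1, round(m_bits / n_entries * math.log(2)))

# Rule of thumb quoted later in this thread: ~10 bits per entry with the
# optimal k (which works out to 7) gives roughly a 1% false-positive rate.
# A measured ratio like 0.003873 is consistent with a filter that has
# comfortably more than 10 bits per stored entry.
print(optimal_k(10, 1))                   # 7
print(round(bloom_fp_rate(10, 1, 7), 4))  # 0.0082
```

Doubling the bits per entry drops the estimated rate sharply, which is why sizing the filter against the number of changed paths matters more than the hash choice.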
Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
On 10/9/2018 2:46 PM, Jeff King wrote: On Tue, Oct 09, 2018 at 09:48:20AM -0400, Derrick Stolee wrote: [I snipped all of the parts about bloom filters that seemed entirely reasonable to me ;) ] Imagine we have that list. Is a bloom filter still the best data structure for each commit? At the point that we have the complete universe of paths, we could give each commit a bitmap of changed paths. That lets us ask "did this commit touch these paths" (collect the bits from the list of paths, then check for 1's), as well as "what paths did we touch" (collect the 1 bits, and then index the path list). Those bitmaps should compress very well via EWAH or similar (most of them would be huge stretches of 0's punctuated by short runs of 1's). I'm not convinced we would frequently have runs of 1's, and the bitmap would not compress much better than simply listing the positions. For example, a path "foo/bar" that resolves to a tree would only start a run if the next changes are the initial section of entries in that tree (sorted lexicographically) such as "foo/bar/a, foo/bar/b". If we deepen into a tree, then we will break the run of 1's unless we changed every path deeper than that tree. Yeah, I doubt we'd really have runs of 1's (by short, I just mean 1 or 2). I agree that listing the positions could work, though I sort of assumed that was more or less what a decent compressed bitmap would turn into. E.g., if bit N is set, we should be able to say "N-1 zeroes, 1 one" in about the same size as we could say "position N". EWAH seems pretty awful in that regard. Or at least its serialized format is (or maybe it's our implementation that is bad). The patch below generates a bitmap for each commit in a repository (it doesn't output the total list of paths; I've left that as an exercise for the reader). On linux.git, the result is 57MB. But when I look at the individual bitmap sizes (via GIT_TRACE), they're quite silly. Storing a single set bit takes 28 bytes in serialized form! 
There are only around 120k unique paths (including prefix trees). Naively using run-length encoding and varints, our worst case should be something like 18-20 bits to say "120k zeroes, then a 1, then all zeroes". And the average case should be better (you don't even need to say "120k", but some smaller number). I wonder if Roaring does better here. In these sparse cases, usually Roaring will organize the data as "array chunks" which are simply lists of the values. The thing that makes this still compressible is that we store two bytes per entry, as the entries are grouped by a common most-significant two bytes. Since you say ~120k unique paths, the Roaring bitmap would have two or three chunks per bitmap (and those chunks could be empty). The overhead to store the chunk positions, types, and lengths does come at a cost, but it's more like 32 bytes _per commit_. Gzipping the resulting bitmaps drops the total size to about 7.5MB. That's not a particularly important number, but I think it shows that the built-in ewah compression is far from ideal. Just listing the positions with a series of varints would generally be fine, since we expect sparse bitmaps. I just hoped that a good RLE scheme would degrade to roughly that for the sparse case, but also perform well for more dense cases. So at any rate, I do think it would not be out of the question to store bitmaps like this. I'm much more worried about the maintenance cost of adding new entries incrementally. I think it's only feasible if we give up sorting, and then I wonder what other problems that might cause. The patch below gives me a starting point to try the Bloom filter approach and see what the numbers are like. You did all the "git" stuff like computing the changed paths, so thanks!
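The "just list the positions as varints" fallback is easy to sketch. A Python illustration of the size argument (not a proposed on-disk format): delta-coding sorted bit positions and emitting LEB128-style varints stores a lone high bit in a few bytes, versus the ~28 bytes observed for a serialized one-bit EWAH bitmap.

```python
def encode_varint(value):
    """LEB128-style varint: 7 bits per byte, high bit marks continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7f
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_positions(positions):
    """Serialize a sparse bitmap as delta-coded varint bit positions."""
    out = bytearray()
    prev = 0
    for pos in sorted(positions):
        out += encode_varint(pos - prev)  # deltas stay tiny for clustered bits
        prev = pos
    return bytes(out)

# A single set bit near position 120k (the unique-path ballpark above)
# costs 3 bytes; clustered bits cost one byte each after the first.
print(len(encode_positions([120000])))        # 3
print(len(encode_positions([5, 6, 120000])))  # 5
```

Delta coding is what lets this degrade gracefully toward denser bitmaps: runs of adjacent set bits become runs of one-byte deltas.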
-- >8 -- diff --git a/Makefile b/Makefile index 13e1c52478..f6e823f2d6 100644 --- a/Makefile +++ b/Makefile @@ -751,6 +751,7 @@ TEST_PROGRAMS_NEED_X += test-parse-options TEST_PROGRAMS_NEED_X += test-pkt-line TEST_PROGRAMS_NEED_X += test-svn-fe TEST_PROGRAMS_NEED_X += test-tool +TEST_PROGRAMS_NEED_X += test-tree-bitmap TEST_PROGRAMS = $(patsubst %,t/helper/%$X,$(TEST_PROGRAMS_NEED_X)) diff --git a/t/helper/test-tree-bitmap.c b/t/helper/test-tree-bitmap.c new file mode 100644 index 00..bc5cf0e514 --- /dev/null +++ b/t/helper/test-tree-bitmap.c @@ -0,0 +1,167 @@ +#include "cache.h" +#include "revision.h" +#include "diffcore.h" +#include "argv-array.h" +#include "ewah/ewok.h" + +/* map of pathnames to bit positions */ +struct pathmap_entry { + struct hashmap_entry ent; + unsigned pos; + char path[FLEX_ARRAY]; +}; + +static int pathmap_entry_hashcmp(const void *unused_cmp_data, +
Re: [PATCH 2/3] midx: close multi-pack-index on repack
On 10/9/2018 5:10 AM, Junio C Hamano wrote: "Derrick Stolee via GitGitGadget" writes: diff --git a/builtin/repack.c b/builtin/repack.c index c6a7943d5c..7925bb976e 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -432,6 +432,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix) if (!midx_cleared) { /* if we move a packfile, it will invalidated the midx */ + if (the_repository->objects) { + close_midx(the_repository->objects->multi_pack_index); + the_repository->objects->multi_pack_index = NULL; + } clear_midx_file(get_object_directory()); midx_cleared = 1; It somehow looks like a bit of layering violation, doesn't it? When we are clearing a midx file, don't we always want to do this as well? You're right. It did feel a bit wrong. In v2, I'll replace this commit with a refactor of clear_midx_file() to do that. One tricky part is that we need to clear the file even if the in-memory struct hasn't been initialized, but I think passing a repository will suffice for that. CC Stefan: Is there a plan to make get_object_directory() take a repository parameter? Thanks, -Stolee
Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
(Changing title to reflect the new topic.) On 10/8/2018 11:08 PM, Jeff King wrote: On Mon, Oct 08, 2018 at 02:29:47PM -0400, Derrick Stolee wrote: There are two questions that I was hoping to answer by looking at your code: 1. How do you store your Bloom filter? Is it connected to the commit-graph and split on a commit-by-commit basis (storing "$path" as a key), or is it one huge Bloom filter (storing "$commitid:$path" as key)? I guess you've probably thought all of this through for your implementation, but let me pontificate. I'd have done it as one fixed-size filter per commit. Then you should be able to hash the path keys once, and apply the result as a bitwise query to each individual commit (I'm assuming that it's constant-time to access the filter for each, as an index into an mmap'd array, with the offset coming from a commit-graph entry we'd be able to look up anyway). You're right that we want to hash the path a constant number of times. Add in that we want to re-use information already serialized when updating the file, and we see that having a commit-by-commit Bloom filter is a good idea. Using (commit, path) pairs requires lots of re-hashing, repeated work when extending the filter, and poor locality when evaluating membership of a single key. The nice thing is that you can generate k 32-bit hash values based on two 32-bit hash values that are "independent enough" (see [1]). We used Murmur3 with two different seed values to generate hashes a & b, then used the arithmetic progression a, a + b, a + 2b, ..., a + (k-1)b as our k hash values. These can be computed up front and then dropped into any size filter using a simple modulo operation. This allows flexible sizes in our filters. I don't think fixed size filters are a good idea, and instead would opt for flex-sized filters with a maximum size. The typical parameters use 7 hash functions and a filter size of (at least) 10 bits per entry. For most commits (say 60-70%), 256 bits (32 bytes) is enough. 
Using a maximum of 512 bytes covers 99% of commits. We will want these bounds to be configurable via config. If we had a fixed size, then we either make it too small (and don't have sufficient coverage of commits) or too large (and waste a lot of space on the commits that change very little). We can store these flex-sized filters in the commit-graph using two columns of data (two new optional chunks): * Bloom filter data: stores the binary data of each commit's Bloom filter, concatenated together in the same order as the commits appear in the commit-graph. * Bloom filter positions: The ith position of this column stores the start of the (i+1)th Bloom filter (The 0th filter starts at byte 0). A Bloom filter of size 0 is intended to mean "we didn't store this filter because it would be too large". We can compute the lengths of the filter by inspecting adjacent values. In order to be flexible, we will want to encode some basic information into the Bloom filter data chunk, such as a tuple of (hash version, num hash bits, num bits per entry). This allows us to change the parameters in config but still be able to read a serialized filter. Here I assume that all filters share the same parameters. The "hash version" here is different than the_hash_algo, because we don't care about cryptographic security, only a uniform distribution (hence, Murmur3 is a good, fast option). [1] https://web.archive.org/web/20090131053735/http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf I think it would also be easier to deal with maintenance, since each filter is independent (IIRC, you cannot delete from a bloom filter without re-adding all of the other keys). The obvious downside is that it's O(commits) storage instead of O(1). It would always be O(changes), as the Bloom filter needs to grow in size as the number of entries grows. Now let me ponder a bit further afield. A bloom filter lets us answer the question "did this commit (probably) touch these paths?". 
But it does not let us answer "which paths did this commit touch?". That second one is less useful than you might think, because we almost always care about not just the names of the paths, but their actual object ids. Think about a --raw diff, or even traversing for reachability (where if we knew the tree-diff cheaply, we could avoid asking "have we seen this yet?" on most of the tree entries). The names alone can make that a bit faster, but in the worst case you still have to walk the whole tree to find their entries. But there's also a related question: how do we match pathspec patterns? For a changed path like "foo/bar/baz", I imagine a bloom filter would mark all of "foo", "foo/bar", and "foo/bar/baz". But what about "*.c"? I don't think a
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On 10/8/2018 2:10 PM, SZEDER Gábor wrote: On Mon, Oct 08, 2018 at 12:57:34PM -0400, Derrick Stolee wrote: Nice! These numbers make sense to me, in terms of how many TREESAME queries we actually need to perform for such a query. Yeah... because you didn't notice that I deliberately cheated :) As it turned out, it's not just about the number of diff queries that we can spare, but, for the speedup _ratio_, it's more about how expensive those diff queries are. git.git has a rather flat hierarchy, and 't/' is the 372nd entry in the current root tree object, while 'valgrind/' is the 923rd entry, and the diff machinery spends considerable time wading through the previous entries. Notice the "carefully chosen path" remark in my previous email; I think this particular path has the highest number of preceding tree entries, and, in addition, 't/' changes rather frequently, so the diff machinery often has to scan two relatively big tree objects. Had I chosen 'Documentation/RelNotes/1.5.0.1.txt' instead, i.e. another path two directories deep, but whose leading path components are both near the beginning of the tree objects, the speedup would be much less impressive: 0.282s vs. 0.049s, i.e. "only" ~5.7x instead of ~24.8x. This is expected. The performance ratio is better when the path is any of the following:

1. A very deep path (need to walk multiple trees to answer TREESAME)
2. An entry late in a very wide tree (need to spend extra time parsing the tree object)
3. A path that doesn't change very often (need to inspect many TREESAME pairs before finding enough interesting commits)
4. A path whose sub-paths change often (so the TREESAME comparison needs to parse beyond that sub-path often)

Our standard examples (Git and Linux repos) don't have many paths that have these properties. But: they do exist. In other projects, this is actually typical. Think about Java projects that frequently have ~5 levels of folders before actually touching a code file.
When I was implementing the Bloom filter feature for Azure Repos, I ran performance tests on the Linux repo using a random sampling of paths. The typical speedup was 5x while some outliers were in the 25x range. But I'm afraid it will take a while until I get around to turning it into something presentable... Do you have the code pushed somewhere public where one could take a look? I could provide some early feedback. Nah, definitely not... I know full well how embarrassingly broken this implementation is, I don't need others to tell me that ;) There are two questions that I was hoping to answer by looking at your code:

1. How do you store your Bloom filter? Is it connected to the commit-graph and split on a commit-by-commit basis (storing "$path" as a key), or is it one huge Bloom filter (storing "$commitid:$path" as key)?

2. Where does your Bloom filter check plug into the TREESAME logic? I haven't investigated this part, but hopefully it isn't too complicated.

Thanks, -Stolee
Re: [PATCH][Outreachy] remove all the inclusions of git-compat-util.h in header files
On 10/8/2018 1:05 PM, Ananya Krishna Maram wrote: Hi All, Hello, Ananya! Welcome. I was searching through #leftovers and found this. https://public-inbox.org/git/cabpp-bgvvxcbzx44er6to-pusfen_6gnyj1u5cuon9deaa4...@mail.gmail.com/ This patch addresses the task discussed in the above link. The discussion above seems to not be intended for your commit message, but it does show up when I run `git am` and provide your email as input. The typical way to avoid this is to place all commentary below the "---" that signifies the commit message is over.

From: Ananya Krishan Maram

skip the #include of git-compat-util.h since all .c files include it.

Signed-off-by: Ananya Krishna Maram
---
 advice.h             | 1 -
 commit-graph.h       | 1 -
 hash.h               | 1 -
 pkt-line.h           | 1 -
 t/helper/test-tool.h | 1 -
 5 files changed, 5 deletions(-)

diff --git a/advice.h b/advice.h
index ab24df0fd..09148baa6 100644
--- a/advice.h
+++ b/advice.h
@@ -1,7 +1,6 @@
 #ifndef ADVICE_H
 #define ADVICE_H
 
-#include "git-compat-util.h"
 
 extern int advice_push_update_rejected;
 extern int advice_push_non_ff_current;
diff --git a/commit-graph.h b/commit-graph.h
index b05047676..0e93c2bed 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -1,7 +1,6 @@
 #ifndef COMMIT_GRAPH_H
 #define COMMIT_GRAPH_H
 
-#include "git-compat-util.h"
 #include "repository.h"
 #include "string-list.h"
 #include "cache.h"
diff --git a/hash.h b/hash.h
index 7c8238bc2..9a4334c5d 100644
--- a/hash.h
+++ b/hash.h
@@ -1,7 +1,6 @@
 #ifndef HASH_H
 #define HASH_H
 
-#include "git-compat-util.h"
 
 #if defined(SHA1_PPC)
 #include "ppc/sha1.h"
diff --git a/pkt-line.h b/pkt-line.h
index 5b28d4347..fdd316494 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -1,7 +1,6 @@
 #ifndef PKTLINE_H
 #define PKTLINE_H
 
-#include "git-compat-util.h"
 #include "strbuf.h"
 
 /*
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index e07495727..24e0a1589 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -1,7 +1,6 @@
 #ifndef __TEST_TOOL_H__
 #define __TEST_TOOL_H__
 
-#include "git-compat-util.h"
 
 int cmd__chmtime(int argc, const char **argv);
 int cmd__config(int argc, const char **argv);

I applied these changes locally and confirmed the code compiles, so all .c files including these _do_ include git-compat-util.h properly. Thanks, -Stolee
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On 10/8/2018 12:41 PM, SZEDER Gábor wrote: On Wed, Oct 03, 2018 at 03:18:05PM -0400, Jeff King wrote: I'm still excited about the prospect of a bloom filter for paths which each commit touches. I think that's the next big frontier in getting things like "git log -- path" to a reasonable run-time. There is certainly potential there. With a (very) rough PoC experiment, an 8MB bloom filter, and a carefully chosen path I can achieve a nice, almost 25x speedup:

$ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
6

real    0m1.563s
user    0m1.519s
sys     0m0.045s

$ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- t/valgrind/valgrind.sh
6

real    0m0.063s
user    0m0.043s
sys     0m0.020s

bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false positives: 64 fp ratio: 0.003934

Nice! These numbers make sense to me, in terms of how many TREESAME queries we actually need to perform for such a query. But I'm afraid it will take a while until I get around to turning it into something presentable... Do you have the code pushed somewhere public where one could take a look? I could provide some early feedback. Thanks, -Stolee
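Those counts line up with the standard false-positive estimate for a Bloom filter with m bits, n entries, and k hash functions:

```latex
P_{\text{fp}} \approx \left(1 - e^{-kn/m}\right)^{k}
```

With the oft-used k = 7 and m/n = 10 bits per entry this gives roughly 0.008; the observed ratio of 64/16269 ≈ 0.0039 is lower still, which suggests the PoC's 8MB filter is sized a bit more generously than 10 bits per entry.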
[PATCH 1/3] midx: fix broken free() in close_midx()
From: Derrick Stolee

When closing a multi-pack-index, we intend to close each pack-file and free the struct packed_git that represents it. However, this line was previously freeing the array of pointers, not the pointer itself. This leads to a double-free issue.

Signed-off-by: Derrick Stolee
---
 midx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index f3e8dbc108..999717b96f 100644
--- a/midx.c
+++ b/midx.c
@@ -190,7 +190,7 @@ static void close_midx(struct multi_pack_index *m)
 	for (i = 0; i < m->num_packs; i++) {
 		if (m->packs[i]) {
 			close_pack(m->packs[i]);
-			free(m->packs);
+			free(m->packs[i]);
 		}
 	}
 	FREE_AND_NULL(m->packs);
-- 
gitgitgadget
[PATCH 0/3] Add GIT_TEST_MULTI_PACK_INDEX environment variable
To increase coverage of the multi-pack-index feature, add a GIT_TEST_MULTI_PACK_INDEX environment variable similar to other GIT_TEST_* variables. After creating the environment variable and running the test suite with it enabled, I found a few bugs in the multi-pack-index implementation. These are handled by the first two patches. I have set up a CI build on Azure Pipelines [1] that runs the test suite with a few optional features enabled, including GIT_TEST_MULTI_PACK_INDEX and GIT_TEST_COMMIT_GRAPH. I'll use this to watch the features and ensure they work well with the rest of the ongoing work. Eventually, we can add these variables to the Travis CI scripts. [1] https://git.visualstudio.com/git/_build?definitionId=4 Derrick Stolee (3): midx: fix broken free() in close_midx() midx: close multi-pack-index on repack multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX builtin/repack.c| 8 midx.c | 17 + midx.h | 4 t/README| 4 t/t5310-pack-bitmaps.sh | 1 + t/t5319-multi-pack-index.sh | 2 +- t/t9300-fast-import.sh | 2 +- 7 files changed, 32 insertions(+), 6 deletions(-) base-commit: f84b9b09d40408cf91bbc500d9f190a7866c3e0f Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-27%2Fderrickstolee%2Fmidx-test%2Fupstream-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-27/derrickstolee/midx-test/upstream-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/27 -- gitgitgadget
[PATCH 2/3] midx: close multi-pack-index on repack
From: Derrick Stolee

When repacking, we may remove pack-files. This invalidates the multi-pack-index (if it exists). Previously, we removed the multi-pack-index file before removing any pack-file. In some cases, the repack command may load the multi-pack-index into memory. This may lead to later in-memory references to the non-existent pack-files.

Signed-off-by: Derrick Stolee
---
 builtin/repack.c | 4 ++++
 midx.c           | 6 +++++-
 midx.h           | 2 ++
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index c6a7943d5c..7925bb976e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -432,6 +432,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (!midx_cleared) {
 			/* if we move a packfile, it will invalidated the midx */
+			if (the_repository->objects) {
+				close_midx(the_repository->objects->multi_pack_index);
+				the_repository->objects->multi_pack_index = NULL;
+			}
 			clear_midx_file(get_object_directory());
 			midx_cleared = 1;
 		}
diff --git a/midx.c b/midx.c
index 999717b96f..fe8532a9d1 100644
--- a/midx.c
+++ b/midx.c
@@ -180,9 +180,13 @@ cleanup_fail:
 	return NULL;
 }
 
-static void close_midx(struct multi_pack_index *m)
+void close_midx(struct multi_pack_index *m)
 {
 	uint32_t i;
+
+	if (!m)
+		return;
+
 	munmap((unsigned char *)m->data, m->data_len);
 	close(m->fd);
 	m->fd = -1;
diff --git a/midx.h b/midx.h
index a210f1af2a..af6b5cb58f 100644
--- a/midx.h
+++ b/midx.h
@@ -44,4 +44,6 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i
 int write_midx_file(const char *object_dir);
 void clear_midx_file(const char *object_dir);
 
+void close_midx(struct multi_pack_index *m);
+
 #endif
-- 
gitgitgadget
[PATCH 3/3] multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX
From: Derrick Stolee

The multi-pack-index feature is tested in isolation by t5319-multi-pack-index.sh, but there are many more interesting scenarios in the test suite surrounding pack-file data shapes and interactions. Since the multi-pack-index is an optional data structure, it does not make sense to include it by default in those tests. Instead, add a new GIT_TEST_MULTI_PACK_INDEX environment variable that enables core.multiPackIndex and writes a multi-pack-index after each 'git repack' command. This adds extra test coverage when needed. There are a few spots in the test suite that need to react to this change:

* t5319-multi-pack-index.sh: there is a test that checks that 'git repack' deletes the multi-pack-index. Disable the environment variable to ensure this still happens.

* t5310-pack-bitmaps.sh: One test moves a pack-file from the object directory to an alternate. This breaks the multi-pack-index, so delete the multi-pack-index at this point, if it exists.

* t9300-fast-import.sh: One test verifies the number of files in the .git/objects/pack directory is exactly 8. Exclude the multi-pack-index from this count so it is still 8 in all cases.
Signed-off-by: Derrick Stolee --- builtin/repack.c| 4 midx.c | 9 +++-- midx.h | 2 ++ t/README| 4 t/t5310-pack-bitmaps.sh | 1 + t/t5319-multi-pack-index.sh | 2 +- t/t9300-fast-import.sh | 2 +- 7 files changed, 20 insertions(+), 4 deletions(-) diff --git a/builtin/repack.c b/builtin/repack.c index 7925bb976e..418442bfe2 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -558,6 +558,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix) if (!no_update_server_info) update_server_info(0); remove_temporary_files(); + + if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0)) + write_midx_file(get_object_directory()); + string_list_clear(&names, 0); string_list_clear(&rollback, 0); string_list_clear(&existing_packs, 0); diff --git a/midx.c b/midx.c index fe8532a9d1..aeafb58fa3 100644 --- a/midx.c +++ b/midx.c @@ -338,9 +338,14 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i struct multi_pack_index *m; struct multi_pack_index *m_search; int config_value; + static int env_value = -1; - if (repo_config_get_bool(r, "core.multipackindex", &config_value) || - !config_value) + if (env_value < 0) + env_value = git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0); + + if (!env_value && + (repo_config_get_bool(r, "core.multipackindex", &config_value) || + !config_value)) return 0; for (m_search = r->objects->multi_pack_index; m_search; m_search = m_search->next) diff --git a/midx.h b/midx.h index af6b5cb58f..bec8f73d28 100644 --- a/midx.h +++ b/midx.h @@ -3,6 +3,8 @@ #include "repository.h" +#define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX" + struct multi_pack_index { struct multi_pack_index *next; diff --git a/t/README b/t/README index 3ea6c85460..9d0277c338 100644 --- a/t/README +++ b/t/README @@ -327,6 +327,10 @@ GIT_TEST_COMMIT_GRAPH=, when true, forces the commit-graph to be written after every 'git commit' command, and overrides the 'core.commitGraph' setting to true. 
+GIT_TEST_MULTI_PACK_INDEX=, when true, forces the multi-pack- +index to be written after every 'git repack' command, and overrides the +'core.multiPackIndex' setting to true. + Naming Tests diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh index 1be3459c5b..82d7f7f6a5 100755 --- a/t/t5310-pack-bitmaps.sh +++ b/t/t5310-pack-bitmaps.sh @@ -191,6 +191,7 @@ test_expect_success 'pack-objects respects --honor-pack-keep (local bitmapped pa test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' ' mv .git/objects/pack/$packbitmap.* alt.git/objects/pack/ && + rm -f .git/objects/pack/multi-pack-index && test_when_finished "mv alt.git/objects/pack/$packbitmap.* .git/objects/pack/" && echo HEAD | git pack-objects --local --stdout --revs >3b.pack && git index-pack 3b.pack && diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 6f56b38674..4024ff9a39 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -152,7 +152,7 @@ compare_results_with_midx "twelve packs" test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && - git repack -adf && + GIT_TEST_MULTI_PACK_INDEX=0 gi
Re: [PATCH 1/1] commit-graph: define GIT_TEST_COMMIT_GRAPH
On 10/8/2018 10:58 AM, Ævar Arnfjörð Bjarmason wrote: On Mon, Oct 08 2018, Derrick Stolee wrote: On 10/8/2018 9:43 AM, Ævar Arnfjörð Bjarmason wrote: On Tue, Aug 28 2018, Derrick Stolee via GitGitGadget wrote: From: Derrick Stolee The commit-graph feature is tested in isolation by t5318-commit-graph.sh and t6600-test-reach.sh, but there are many more interesting scenarios involving commit walks. Many of these scenarios are covered by the existing test suite, but we need to maintain coverage when the optional commit-graph structure is not present. To allow running the full test suite with the commit-graph present, add a new test environment variable, GIT_TEST_COMMIT_GRAPH. Similar to GIT_TEST_SPLIT_INDEX, this variable makes every Git command try to load the commit-graph when parsing commits, and writes the commit-graph file after every 'git commit' command. There are a few tests that rely on commits not existing in pack-files to trigger important events, so manually set GIT_TEST_COMMIT_GRAPH to false for the necessary commands. There is one test in t6024-recursive-merge.sh that relies on the merge-base algorithm picking one of two ambiguous merge-bases, and the commit-graph feature changes which merge-base is picked. The test feature itself seems fine, but this consistently fails ever since it got introduced (a reset --hard on the commit merged to master in git.git): GIT_TEST_COMMIT_GRAPH=true prove -j$(parallel --number-of-cores) t5500-fetch-pack.sh t6001-rev-list-graft.sh t6050-replace.sh Test Summary Report --- t6001-rev-list-graft.sh (Wstat: 256 Tests: 14 Failed: 6) Failed tests: 3, 5, 7, 9, 11, 13 Non-zero exit status: 1 t6050-replace.sh (Wstat: 256 Tests: 35 Failed: 9) Failed tests: 12-16, 24-25, 30, 35 Non-zero exit status: 1 t5500-fetch-pack.sh (Wstat: 256 Tests: 357 Failed: 1) Failed test: 351 Non-zero exit status: 1 This is on Linux/Debian 4.17.0-1-amd64. Can you reproduce this? If not I can provide more info (-x output etc..).
I see these failures, too, but I believe they are due to ds/commit-graph-with-grafts not being merged to 'next' yet. The purpose of that branch is to fix these test breaks. The environment variable got merged a lot faster. I just built & tested the 'jch' branch at 515d82d9 with GIT_TEST_COMMIT_GRAPH=1 and they all passed. I should have tested "pu" first. These failures are indeed fixed there. Thanks, and sorry about the noise. Thanks for testing with the optional features! It's good to keep them exercised.
[PATCH v4 1/1] contrib: add coverage-diff script
From: Derrick Stolee

We have coverage targets in our Makefile for using gcov to display line coverage based on our test suite. The way I like to do it is to run:

make coverage-test
make coverage-report

This leaves the repo in a state where every X.c file that was covered has an X.c.gcov file containing the coverage counts for every line, and "#####" at every uncovered line. There have been a few bugs in recent patches that would have been caught if the test suite covered those blocks (including a few of mine). I want to work towards a "sensible" amount of coverage on new topics. In my opinion, this means that any logic should be covered, but the 'die()' blocks covering very unlikely (or near-impossible) situations may not warrant coverage. It is important to not measure the coverage of the codebase by what old code is not covered. To help, I created the 'contrib/coverage-diff.sh' script. After creating the coverage statistics at a version (say, 'topic') you can then run

contrib/coverage-diff.sh base topic

to see the lines added between 'base' and 'topic' that are not covered by the test suite. The output uses 'git blame -s' format so you can find the commits responsible and view the line numbers for quick access to the context, but trims leading tabs in the file contents to reduce output width.

Signed-off-by: Derrick Stolee
---
 contrib/coverage-diff.sh | 108 +++
 1 file changed, 108 insertions(+)
 create mode 100755 contrib/coverage-diff.sh

diff --git a/contrib/coverage-diff.sh b/contrib/coverage-diff.sh
new file mode 100755
index 00..4ec419f900
--- /dev/null
+++ b/contrib/coverage-diff.sh
@@ -0,0 +1,108 @@
+#!/bin/sh
+
+# Usage: Run 'contrib/coverage-diff.sh <version1> <version2>' from source-root
+# after running
+#
+#     make coverage-test
+#     make coverage-report
+#
+# while checked out at <version2>. This script combines the *.gcov files
+# generated by the 'make' commands above with 'git diff <version1> <version2>'
+# to report new lines that are not covered by the test suite.
+
+V1=$1
+V2=$2
+
+diff_lines () {
+	perl -e '
+		my $line_num;
+		while (<>) {
+			# Hunk header? Grab the beginning in postimage.
+			if (/^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@/) {
+				$line_num = $1;
+				next;
+			}
+
+			# Have we seen a hunk? Ignore "diff --git" etc.
+			next unless defined $line_num;
+
+			# Deleted line? Ignore.
+			if (/^-/) {
+				next;
+			}
+
+			# Show only the line number of added lines.
+			if (/^\+/) {
+				print "$line_num\n";
+			}
+			# Either common context or added line appear in
+			# the postimage. Count it.
+			$line_num++;
+		}
+	'
+}
+
+files=$(git diff --name-only "$V1" "$V2" -- \*.c)
+
+# create empty file
+>coverage-data.txt
+
+for file in $files
+do
+	git diff "$V1" "$V2" -- "$file" |
+	diff_lines |
+	sort >new_lines.txt
+
+	if ! test -s new_lines.txt
+	then
+		continue
+	fi
+
+	hash_file=$(echo $file | sed "s/\//\#/")
+
+	if ! test -s "$hash_file.gcov"
+	then
+		continue
+	fi
+
+	sed -ne '/#####:/{
+			s/    #####://
+			s/:.*//
+			s/ //g
+			p
+		}' "$hash_file.gcov" |
+	sort >uncovered_lines.txt
+
+	comm -12 uncovered_lines.txt new_lines.txt |
+	sed -e 's/$/\)/' |
+	sed -e 's/^/ /' >uncovered_new_lines.txt
+
+	grep -q '[^[:space:]]' <uncovered_new_lines.txt &&
+	echo $file >>coverage-data.txt &&
+	git blame -s "$V2" -- "$file" |
+	sed 's/\t//g' |
+	grep -f uncovered_new_lines.txt >>coverage-data.txt &&
+	echo >>coverage-data.txt
+
+	rm -f new_lines.txt uncovered_lines.txt uncovered_new_lines.txt
+done
+
+cat coverage-data.txt
+
+echo "Commits introducing uncovered code:"
+
+commit_list=$(cat coverage-data.txt |
+	grep -E '^[0-9a-f]{7,} ' |
+	awk '{print $1;}' |
+	sort |
+	uniq)
+
+(
+	for commit in $commit_list
+	do
+		git log --no-decorate --pretty=format:'%an %h: %s' -1 $commit
+		echo
+	done
+) | sort
+
+rm coverage-data.txt
-- 
gitgitgadget
[PATCH v4 0/1] contrib: Add script to show uncovered "new" lines
We have coverage targets in our Makefile for using gcov to display line coverage based on our test suite. The way I like to do it is to run:

make coverage-test
make coverage-report

This leaves the repo in a state where every X.c file that was covered has an X.c.gcov file containing the coverage counts for every line, and "#####" at every uncovered line. There have been a few bugs in recent patches that would have been caught if the test suite covered those blocks (including a few of mine). I want to work towards a "sensible" amount of coverage on new topics. In my opinion, this means that any logic should be covered, but the 'die()' blocks in error cases do not need to be covered. It is important to not measure the coverage of the codebase by what old code is not covered. To help, I created the 'contrib/coverage-diff.sh' script. After creating the coverage statistics at a version (say, 'topic') you can then run

contrib/coverage-diff.sh base topic

to see the lines added between 'base' and 'topic' that are not covered by the test suite.
For example, I ran this against the 'next' branch (e82ca0) versus 'master' (f84b9b) and got the following output:

builtin/commit.c
76f2f5c1e3 builtin/commit.c 1657) write_commit_graph_reachable(get_object_directory(), 0, 0);

builtin/fsck.c
66ec0390e7 builtin/fsck.c 862) midx_argv[2] = "--object-dir";
66ec0390e7 builtin/fsck.c 863) midx_argv[3] = alt->path;
66ec0390e7 builtin/fsck.c 864) if (run_command(&midx_verify))
66ec0390e7 builtin/fsck.c 865) errors_found |= ERROR_COMMIT_GRAPH;

fsck.c
fb8952077d 214) die_errno("Could not read '%s'", path);

midx.c
56ee7ff156 949) return 0;
cc6af73c02 990) midx_report(_("failed to load pack-index for packfile %s"),
cc6af73c02 991) e.p->pack_name);
cc6af73c02 992) break;

Commits introducing uncovered code:
Derrick Stolee 56ee7ff15: multi-pack-index: add 'verify' verb
Derrick Stolee 66ec0390e: fsck: verify multi-pack-index
Derrick Stolee cc6af73c0: multi-pack-index: verify object offsets
Junio C Hamano 76f2f5c1e: Merge branch 'ab/commit-graph-progress' into next
René Scharfe fb8952077: fsck: use strbuf_getline() to read skiplist file

Thanks, -Stolee

CHANGES IN V3: I took Junio's perl script verbatim, which speeds up the performance greatly. Some of the other sed commands needed some massaging, but also added extra cleanup. Thanks for the help!

CHANGES IN V4: I reduced the blame output using -s which decreases the width. I include a summary of the commit authors at the end to help people see the lines they wrote. This version is also copied into a build definition in the public Git project on Azure Pipelines [1]. I'll use this build definition to generate the coverage report after each "What's Cooking" email.
[1] https://git.visualstudio.com/git/_build?definitionId=5 Derrick Stolee (1): contrib: add coverage-diff script contrib/coverage-diff.sh | 108 +++ 1 file changed, 108 insertions(+) create mode 100755 contrib/coverage-diff.sh base-commit: 1d4361b0f344188ab5eec6dcea01f61a3a3a1670 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-40%2Fderrickstolee%2Fcoverage-v4 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-40/derrickstolee/coverage-v4 Pull-Request: https://github.com/gitgitgadget/git/pull/40 Range-diff vs v3: 1: 21214cc321 ! 1: 6daf310a43 contrib: add coverage-diff script @@ -26,10 +26,10 @@ contrib/coverage-diff.sh base topic to see the lines added between 'base' and 'topic' that are not covered by the -test suite. The output uses 'git blame -c' format so you can find the commits -responsible and view the line numbers for quick access to the context. +test suite. The output uses 'git blame -s' format so you can find the commits +responsible and view the line numbers for quick access to the context, but +trims leading tabs in the file contents to reduce output width. -Helped-by: Junio C Hamano Signed-off-by: Derrick Stolee diff --git a/contrib/coverage-diff.sh b/contrib/coverage-diff.sh @@ -81,13 +81,16 @@ + ' +} + -+files=$(git diff --name-only $V1 $V2 -- *.c) ++files=$(git diff --name-only "$V1" "$V2" -- \*.c) ++ ++# create empty file ++>coverage-data.txt + +for file in $files +do -+ git diff $V1 $V2 -- $file \ -+ | diff_lines \ -+ | sort >new_lines.txt ++ git diff "$V1" "$V2" -- "$file" | ++ diff_lines | ++ sort >new_lines.txt + + if ! test -s new_lines.txt + then
Re: [PATCH 1/1] commit-graph: define GIT_TEST_COMMIT_GRAPH
On 10/8/2018 9:43 AM, Ævar Arnfjörð Bjarmason wrote: On Tue, Aug 28 2018, Derrick Stolee via GitGitGadget wrote: From: Derrick Stolee The commit-graph feature is tested in isolation by t5318-commit-graph.sh and t6600-test-reach.sh, but there are many more interesting scenarios involving commit walks. Many of these scenarios are covered by the existing test suite, but we need to maintain coverage when the optional commit-graph structure is not present. To allow running the full test suite with the commit-graph present, add a new test environment variable, GIT_TEST_COMMIT_GRAPH. Similar to GIT_TEST_SPLIT_INDEX, this variable makes every Git command try to load the commit-graph when parsing commits, and writes the commit-graph file after every 'git commit' command. There are a few tests that rely on commits not existing in pack-files to trigger important events, so manually set GIT_TEST_COMMIT_GRAPH to false for the necessary commands. There is one test in t6024-recursive-merge.sh that relies on the merge-base algorithm picking one of two ambiguous merge-bases, and the commit-graph feature changes which merge-base is picked. The test feature itself seems fine, but this consistently fails ever since it got introduced (a reset --hard on the commit merged to master in git.git): GIT_TEST_COMMIT_GRAPH=true prove -j$(parallel --number-of-cores) t5500-fetch-pack.sh t6001-rev-list-graft.sh t6050-replace.sh Test Summary Report --- t6001-rev-list-graft.sh (Wstat: 256 Tests: 14 Failed: 6) Failed tests: 3, 5, 7, 9, 11, 13 Non-zero exit status: 1 t6050-replace.sh (Wstat: 256 Tests: 35 Failed: 9) Failed tests: 12-16, 24-25, 30, 35 Non-zero exit status: 1 t5500-fetch-pack.sh (Wstat: 256 Tests: 357 Failed: 1) Failed test: 351 Non-zero exit status: 1 This is on Linux/Debian 4.17.0-1-amd64. Can you reproduce this? If not I can provide more info (-x output etc..). I see these failures, too, but I believe they are due to ds/commit-graph-with-grafts not being merged to 'next' yet.
The purpose of that branch is to fix these test breaks. The environment variable got merged a lot faster. I just built & tested the 'jch' branch at 515d82d9 with GIT_TEST_COMMIT_GRAPH=1 and they all passed. Thanks, -Stolee
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On 10/5/2018 3:47 PM, Jeff King wrote: On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote: So can we really just take (total_objects - commit_graph_objects) and compare it to some threshold? The commit-graph only stores the number of _commits_, not total objects. Oh, right, of course. That does throw a monkey wrench in that line of thought. ;) There's unfortunately not a fast way of doing that. One option would be to keep a counter of "ungraphed commit objects", and have callers update it. Anybody admitting a pack via index-pack or unpack-objects can easily get this information. Commands like fast-import can do likewise, and "git commit" obviously increments it by one. I'm not excited about adding a new global on-disk data structure (and the accompanying lock). If we want, then we can add an optional chunk to the commit-graph file that stores the object count.
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On 10/5/2018 3:21 PM, Jeff King wrote: On Fri, Oct 05, 2018 at 09:45:47AM -0400, Derrick Stolee wrote: My misunderstanding was that your proposed change to gc computes the commit-graph in either of these two cases: (1) The auto-GC threshold is met. (2) There is no commit-graph file. And what I hope to have instead of (2) is (3): (3) The commit-graph file is "sufficiently behind" the tip refs. This condition is intentionally vague at the moment. It could be that we hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a pack, and it probably contains a lot of new commits") or we could create some more complicated condition based on counting reachable commits with infinite generation number (the number of commits not in the commit-graph file). I like that you are moving forward to make the commit-graph be written more frequently, but I'm trying to push us in a direction of writing it even more often than your proposed strategy. We should avoid creating too many orthogonal conditions that trigger the commit-graph write, which is why I'm pushing on your design here. Anyone else have thoughts on this direction? Yes, I think measuring "sufficiently behind" is the right thing. Everything else is a proxy or heuristic, and will run into corner cases. E.g., I have some small number of objects and then do a huge fetch, and now my commit-graph only covers 5% of what's available. We know how many objects are in the graph already. And it's not too expensive to get the number of objects in the repository. We can do the same sampling for loose objects that "gc --auto" does, and counting packed objects just involves opening up the .idx files (that can be slow if you have a ton of packs, but you'd want to either repack or use a .midx in that case anyway, either of which would help here). So can we really just take (total_objects - commit_graph_objects) and compare it to some threshold? The commit-graph only stores the number of _commits_, not total objects. 
Azure Repos' commit-graph does store the total number of objects, and that is how we trigger updating the graph, so it is not unreasonable to use that as a heuristic. Thanks, -Stolee
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On 10/5/2018 9:05 AM, Ævar Arnfjörð Bjarmason wrote: On Fri, Oct 05 2018, Derrick Stolee wrote: On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote: I don't have time to polish this up for submission now, but here's a WIP patch that implements this, highlights: * There's a gc.clone.autoDetach=false default setting which overrides gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a --cloning option to indicate this). I'll repeat that it could make sense to do the same thing on clone _and_ fetch. Perhaps a "--post-fetch" flag would be good here to communicate that we just downloaded a pack from a remote. I don't think that makes sense, but let's talk about why, because maybe I've missed something, you're certainly more familiar with the commit-graph than I am. The reason to do it on clone as a special-case or when the file is missing, is because we know the file is desired (via the GC config), and presumably is expected to help performance, and we have 0% of it. So by going from 0% to 100% on clone we'll get fast --contains and other goodies the graph helps with. But when we're doing a fetch, or really anything else that runs "git gc --auto" we can safely assume that we have a recent enough graph, because it will have been run whenever auto-gc kicked in. I.e.: # Slow, if we assume background forked commit-graph generation # (which I'm avoiding) git clone x && cd x && git tag --contains # Fast enough, since we have an existing commit-graph cd x && git fetch && git tag --contains I *do* think it might make sense to in general split off parts of "gc --auto" that we'd like to be more aggressive about, simply because the ratio of how long it takes to do, and how much it helps with performance makes more sense than a full repack, which is what the current heuristic is based on. 
And maybe when we run in that mode we should run in the foreground, but I don't see why git-fetch should be a special case there, and in this regard, the gc.clone.autoDetach=false setting I've made doesn't make much sense. I.e. maybe we should also skip forking to the background in such a mode when we trigger such a "mini gc" via git-commit or whatever. My misunderstanding was that your proposed change to gc computes the commit-graph in either of these two cases: (1) The auto-GC threshold is met. (2) There is no commit-graph file. And what I hope to have instead of (2) is (3): (3) The commit-graph file is "sufficiently behind" the tip refs. This condition is intentionally vague at the moment. It could be that we hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a pack, and it probably contains a lot of new commits") or we could create some more complicated condition based on counting reachable commits with infinite generation number (the number of commits not in the commit-graph file). I like that you are moving forward to make the commit-graph be written more frequently, but I'm trying to push us in a direction of writing it even more often than your proposed strategy. We should avoid creating too many orthogonal conditions that trigger the commit-graph write, which is why I'm pushing on your design here. Anyone else have thoughts on this direction? Thanks, -Stolee
Re: [PATCH 15/16] commit-reach: make can_all_from_reach... linear
On 10/4/2018 6:59 PM, René Scharfe wrote: Am 01.10.2018 um 22:37 schrieb René Scharfe: Am 01.10.2018 um 21:26 schrieb Derrick Stolee: Good catch! I'm disappointed that we couldn't use type-checking here, as it is quite difficult to discover that the types are wrong here. Generics in C are hard, and type checking traditionally falls by the wayside. You could use macros for that, like klib [*] does, but that has its own downsides (more object text, debugging the sort macros themselves is harder). [*] https://github.com/attractivechaos/klib/blob/master/ksort.h We could also do something like this to reduce the amount of manual casting, but do we want to? (Macro at the bottom, three semi-random examples at the top.) I like the idea! It certainly can assist in some of the repeat work when preparing to QSORT, and make it less error-prone. --- bisect.c | 11 +++ commit-graph.c| 9 ++--- commit-reach.c| 12 +--- git-compat-util.h | 12 4 files changed, 22 insertions(+), 22 deletions(-) diff --git a/bisect.c b/bisect.c index e8b17cf7e1..06be3a3c15 100644 --- a/bisect.c +++ b/bisect.c @@ -192,16 +192,11 @@ struct commit_dist { int distance; }; -static int compare_commit_dist(const void *a_, const void *b_) -{ - struct commit_dist *a, *b; - - a = (struct commit_dist *)a_; - b = (struct commit_dist *)b_; +DEFINE_SORT(sort_by_commit_dist, struct commit_dist, a, b, { if (a->distance != b->distance) return b->distance - a->distance; /* desc sort */ return oidcmp(&a->commit->object.oid, &b->commit->object.oid); -} +}) static struct commit_list *best_bisection_sorted(struct commit_list *list, int nr) { @@ -223,7 +218,7 @@ static struct commit_list *best_bisection_sorted(struct commit_list *list, int n array[cnt].distance = distance; cnt++; } - QSORT(array, cnt, compare_commit_dist); + sort_by_commit_dist(array, cnt); for (p = list, i = 0; i < cnt; i++) { struct object *obj = &(array[i].commit->object); diff --git a/commit-graph.c b/commit-graph.c index 7f4519ec3b..a2202414e0 100644 --- 
a/commit-graph.c +++ b/commit-graph.c @@ -550,12 +550,7 @@ static void write_graph_chunk_large_edges(struct hashfile *f, } } -static int commit_compare(const void *_a, const void *_b) -{ - const struct object_id *a = (const struct object_id *)_a; - const struct object_id *b = (const struct object_id *)_b; - return oidcmp(a, b); -} +DEFINE_SORT(sort_oids, struct object_id, a, b, return oidcmp(a, b)) struct packed_commit_list { struct commit **list; @@ -780,7 +775,7 @@ void write_commit_graph(const char *obj_dir, close_reachable(&oids); - QSORT(oids.list, oids.nr, commit_compare); + sort_oids(oids.list, oids.nr); count_distinct = 1; for (i = 1; i < oids.nr; i++) { diff --git a/commit-reach.c b/commit-reach.c index 2f5e592d16..3aef47c3dd 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -527,17 +527,15 @@ int commit_contains(struct ref_filter *filter, struct commit *commit, return is_descendant_of(commit, list); } -static int compare_commits_by_gen(const void *_a, const void *_b) -{ - const struct commit *a = *(const struct commit * const *)_a; - const struct commit *b = *(const struct commit * const *)_b; - +DEFINE_SORT(sort_commits_by_gen, struct commit *, ap, bp, { + const struct commit *a = *ap; + const struct commit *b = *bp; if (a->generation < b->generation) return -1; if (a->generation > b->generation) return 1; return 0; -} +}) Here, to make the macro version compile you need to cast ap and bp, which gives us a level of type-safety that wasn't there before. That can help us find errors at compile-time! 
int can_all_from_reach_with_flag(struct object_array *from, unsigned int with_flag, @@ -580,7 +578,7 @@ int can_all_from_reach_with_flag(struct object_array *from, nr_commits++; } - QSORT(list, nr_commits, compare_commits_by_gen); + sort_commits_by_gen(list, nr_commits); for (i = 0; i < nr_commits; i++) { /* DFS from list[i] */ diff --git a/git-compat-util.h b/git-compat-util.h index 5f2e90932f..f9e78d69a2 100644 --- a/git-compat-util.h +++ b/git-compat-util.h @@ -1066,6 +1066,18 @@ static inline void sane_qsort(void *base, size_t nmemb, size_t size, qsort(base, nmemb, size, compar); } +#define DEFINE_SORT(name, elemtype, one, two, code) \ +static int name##_compare(const void *one##_v_, const void *two##_v_) \ +{ \ + elemtype const *one
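René's DEFINE_SORT macro is cut off above at its first line. For reference, here is a self-contained sketch of how such a macro can be completed; it is an illustration under stated assumptions (plain qsort() instead of git's QSORT(), a hypothetical demo struct), not necessarily the exact code from the patch:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Sketch of a DEFINE_SORT-style macro: the comparison 'code' is
 * written against typed pointers 'one' and 'two', and the macro
 * generates both the void-pointer comparator and a typed sort
 * function, so callers never cast by hand.
 */
#define DEFINE_SORT(name, elemtype, one, two, code)			\
static int name##_compare(const void *one##_v_, const void *two##_v_)	\
{									\
	elemtype const *one = one##_v_;					\
	elemtype const *two = two##_v_;					\
	code;								\
}									\
static void name(elemtype *array, size_t n)				\
{									\
	qsort(array, n, sizeof(*array), name##_compare);		\
}

/* Stand-in for the bisect example above. */
struct commit_dist_demo {
	int distance;
};

/* Descending by distance, as in compare_commit_dist(). */
DEFINE_SORT(sort_by_distance, struct commit_dist_demo, a, b,
	    return b->distance - a->distance)
```

Because the generated sort function takes `elemtype *`, passing an array of the wrong type fails at compile time, which is the type-safety win noted above.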
Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote: I don't have time to polish this up for submission now, but here's a WIP patch that implements this, highlights: * There's a gc.clone.autoDetach=false default setting which overrides gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a --cloning option to indicate this). I'll repeat that it could make sense to do the same thing on clone _and_ fetch. Perhaps a "--post-fetch" flag would be good here to communicate that we just downloaded a pack from a remote. * A clone of say git.git with gc.writeCommitGraph=true looks like: [...] Receiving objects: 100% (255262/255262), 100.49 MiB | 17.78 MiB/s, done. Resolving deltas: 100% (188947/188947), done. Computing commit graph generation numbers: 100% (55210/55210), done. This looks like good UX. Thanks for the progress here! * The 'git gc --auto' command also knows to (only) run the commit-graph (and space is left for future optimization steps) if general GC isn't needed, but we need "optimization": $ rm .git/objects/info/commit-graph; ~/g/git/git --exec-path=$PWD -c gc.writeCommitGraph=true -c gc.autoDetach=false gc --auto; Annotating commits in commit graph: 341229, done. Computing commit graph generation numbers: 100% (165969/165969), done. $ Will this also trigger a full commit-graph rewrite on every 'git commit' command? Or is there some way we can compute the staleness of the commit-graph in order to only update if we get too far ahead? Previously, this was solved by relying on the auto-GC threshold. * The patch to gc.c looks less scary with -w, most of it is indenting the existing pack-refs etc. with a "!auto_gc || should_gc" condition. * I added a commit_graph_exists() exists function and only care if I get ENOENT for the purposes of this gc mode. This would need to be tweaked for the incremental mode Derrick talks about, but if we just set "should_optimize" that'll also work as far as gc --auto is concerned (e.g. on fetch, am etc.) 
The incremental mode would operate the same as split-index, which means we will still look for .git/objects/info/commit-graph. That file may point us to more files. +int commit_graph_exists(const char *graph_file) +{ + struct stat st; + if (stat(graph_file, &st)) { + if (errno == ENOENT) + return 0; + else + return -1; + } + return 1; +} + This method serves a very similar purpose to generation_numbers_enabled(), except your method only cares about the file existing. It ignores information like `core.commitGraph`, which should keep us from doing anything with the commit-graph file if false. Nothing about your method is specific to the commit-graph file, since you provide a filename as a parameter. It could easily be "int file_exists(const char *filename)". Thanks, -Stolee
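Following the observation that nothing in the helper is specific to the commit-graph, the same logic can be written as a generic existence check. A standalone sketch mirroring the quoted code (git's dir.c already has its own file_exists() helper; this is just an illustration):

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>

/*
 * Generic version of the commit_graph_exists() helper quoted above:
 * 1 if the file exists, 0 if it does not, -1 on any other stat()
 * failure (e.g. EACCES on a component of the path).
 */
static int file_exists_sketch(const char *filename)
{
	struct stat st;

	if (stat(filename, &st)) {
		if (errno == ENOENT)
			return 0;
		return -1;
	}
	return 1;
}
```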
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On 10/3/2018 2:51 PM, Jeff King wrote: On Wed, Oct 03, 2018 at 08:47:11PM +0200, Ævar Arnfjörð Bjarmason wrote: On Wed, Oct 03 2018, Stefan Beller wrote: So we wouldn't be spending 5 minutes repacking linux.git right after cloning it, just ~10s generating the commit graph, and the same would happen if you rm'd .git/objects/info/commit-graph and ran "git commit", which would kick off "gc --auto" in the background and do the same thing. Or generating local bitmaps or pack idx files as well? I'm less familiar with this area, but when I clone I get a pack *.idx file, why does it need to be regenerated? But yeah, in principle this would be a sensible addition, but I'm not aware of cases where clients get significant benefits from bitmaps (see https://githubengineering.com/counting-objects/), and I don't turn it on for clients, but maybe I've missed something. They don't help yet, and there's no good reason to enable bitmaps for clients. I have a few patches that use bitmaps for things like ahead/behind and --contains checks, but the utility of those may be lessened quite a bit by Stolee's commit-graph work. And if it isn't, I'm mildly in favor of replacing the existing .bitmap format with something better integrated with commit-graphs (which would give us an opportunity to clean up some of the rough edges). If the commit-graph doesn't improve enough on those applications, then we could consider adding a commit-to-commit reachability bitmap inside the commit-graph. ;) -Stolee
[PATCH v2 1/3] commit-graph: clean up leaked memory during write
From: Derrick Stolee The write_commit_graph() method in commit-graph.c leaks some lists and strings during execution. In addition, a list of strings is leaked in write_commit_graph_reachable(). Clean these up so our memory checking is cleaner. Further, if we use a list of pack-files to find the commits, we can leak the packed_git structs after scanning them for commits. Running the following commands demonstrates the leak before and the fix after: * valgrind --leak-check=full ./git commit-graph write --reachable * valgrind --leak-check=full ./git commit-graph write --stdin-packs Signed-off-by: Martin Ågren Signed-off-by: Derrick Stolee --- commit-graph.c | 14 +- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 2a24eb8b5a..ceca6026b0 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -693,11 +693,12 @@ static int add_ref_to_list(const char *refname, void write_commit_graph_reachable(const char *obj_dir, int append, int report_progress) { - struct string_list list; + struct string_list list = STRING_LIST_INIT_DUP; - string_list_init(&list, 1); for_each_ref(add_ref_to_list, &list); write_commit_graph(obj_dir, NULL, &list, append, report_progress); + + string_list_clear(&list, 0); } void write_commit_graph(const char *obj_dir, @@ -764,6 +765,7 @@ void write_commit_graph(const char *obj_dir, die(_("error opening index for %s"), packname.buf); for_each_object_in_pack(p, add_packed_commits, &oids, 0); close_pack(p); + free(p); } stop_progress(&oids.progress); strbuf_release(&packname); @@ -846,9 +848,11 @@ void write_commit_graph(const char *obj_dir, compute_generation_numbers(&commits, report_progress); graph_name = get_commit_graph_filename(obj_dir); - if (safe_create_leading_directories(graph_name)) + if (safe_create_leading_directories(graph_name)) { + UNLEAK(graph_name); die_errno(_("unable to create leading directories of %s"), graph_name); + } hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR); f = 
hashfd(lk.tempfile->fd, lk.tempfile->filename.buf); @@ -893,9 +897,9 @@ void write_commit_graph(const char *obj_dir, finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC); commit_lock_file(&lk); + free(graph_name); + free(commits.list); free(oids.list); - oids.alloc = 0; - oids.nr = 0; } #define VERIFY_COMMIT_GRAPH_ERROR_HASH 2 -- gitgitgadget
[PATCH v2 3/3] commit-graph: reduce initial oid allocation
From: Derrick Stolee While writing a commit-graph file, we store the full list of commits in a flat list. We use this list for sorting and ensuring we are closed under reachability. The initial allocation assumed that (at most) one in four objects is a commit. This is a dramatic over-count for many repos, especially large ones. Since we grow the list dynamically, reduce this count by a factor of eight. We still set it to a minimum of 1024 before allocating. Signed-off-by: Derrick Stolee --- commit-graph.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index ceca6026b0..e773703e1d 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -720,7 +720,7 @@ void write_commit_graph(const char *obj_dir, struct progress *progress = NULL; oids.nr = 0; - oids.alloc = approximate_object_count() / 4; + oids.alloc = approximate_object_count() / 32; oids.progress = NULL; oids.progress_done = 0; -- gitgitgadget
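The effect of changing the divisor can be seen with a little arithmetic. A sketch of the sizing rule (initial_oid_alloc() is a hypothetical helper, not a git function; the 1024 floor is the minimum mentioned in the commit message):

```c
#include <assert.h>

/*
 * Hypothetical sketch of the initial-allocation estimate above:
 * guess the commit count as a fraction of the total object count,
 * but never allocate fewer than 1024 entries.
 */
static unsigned long initial_oid_alloc(unsigned long approx_object_count,
				       unsigned long divisor)
{
	unsigned long alloc = approx_object_count / divisor;

	return alloc < 1024 ? 1024 : alloc;
}
```

Since the list grows dynamically on demand, under-estimating only costs a few reallocations, while over-estimating wastes memory up front.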
[PATCH v2 0/3] Clean up leaks in commit-graph.c
While looking at the commit-graph code, I noticed some memory leaks. These can be found by running valgrind --leak-check=full ./git commit-graph write --reachable The impact of these leaks is small, as we never call write_commit_graph_reachable in a loop, but it is best to be diligent here. While looking at memory consumption within write_commit_graph(), I noticed that we initialize our oid list with "object count / 4", which seems to be a huge over-count. I reduce this by a factor of eight. I built off of ab/commit-graph-progress, because my patch involves lines close to those changes. V2 includes feedback from V1 along with Martin's additional patches. Thanks, -Stolee Derrick Stolee (2): commit-graph: clean up leaked memory during write commit-graph: reduce initial oid allocation Martin Ågren (1): builtin/commit-graph.c: UNLEAK variables builtin/commit-graph.c | 11 ++- commit-graph.c | 16 ++-- 2 files changed, 16 insertions(+), 11 deletions(-) base-commit: 6b89a34c89fc763292f06012318b852b74825619 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-42%2Fderrickstolee%2Fcommit-graph-leak-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-42/derrickstolee/commit-graph-leak-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/42 Range-diff vs v1: 1: 6906c25415 ! 1: ba65680b3d commit-graph: clean up leaked memory during write @@ -7,17 +7,29 @@ leaked in write_commit_graph_reachable(). Clean these up so our memory checking is cleaner. -Running 'valgrind --leak-check=full git commit-graph write ---reachable' demonstrates these leaks and how they are fixed after -this change. +Further, if we use a list of pack-files to find the commits, we +can leak the packed_git structs after scanning them for commits. 
+Running the following commands demonstrates the leak before and +the fix after: + +* valgrind --leak-check=full ./git commit-graph write --reachable +* valgrind --leak-check=full ./git commit-graph write --stdin-packs + +Signed-off-by: Martin Ågren Signed-off-by: Derrick Stolee diff --git a/commit-graph.c b/commit-graph.c --- a/commit-graph.c +++ b/commit-graph.c @@ - string_list_init(&list, 1); + void write_commit_graph_reachable(const char *obj_dir, int append, +int report_progress) + { +- struct string_list list; ++ struct string_list list = STRING_LIST_INIT_DUP; + +- string_list_init(&list, 1); for_each_ref(add_ref_to_list, &list); write_commit_graph(obj_dir, NULL, &list, append, report_progress); + @@ -25,6 +37,14 @@ } void write_commit_graph(const char *obj_dir, +@@ + die(_("error opening index for %s"), packname.buf); + for_each_object_in_pack(p, add_packed_commits, &oids, 0); + close_pack(p); ++ free(p); + } + stop_progress(&oids.progress); + strbuf_release(&packname); @@ compute_generation_numbers(&commits, report_progress); @@ -45,5 +65,8 @@ + free(graph_name); + free(commits.list); free(oids.list); - oids.alloc = 0; - oids.nr = 0; +- oids.alloc = 0; +- oids.nr = 0; + } + + #define VERIFY_COMMIT_GRAPH_ERROR_HASH 2 -: -- > 2: 13032d8475 builtin/commit-graph.c: UNLEAK variables 2: e29a0eaf03 = 3: 1002fd34fc commit-graph: reduce initial oid allocation -- gitgitgadget
Re: [PATCH 0/2] commit-graph: more leak fixes
On 10/3/2018 11:36 AM, Martin Ågren wrote: Hi Derrick, These two patches on top of yours make the test suite (i.e., the subset of it that I run) leak-free with respect to builtin/commit-graph.c and commit-graph.c. Thanks! The first could be squashed into your patch 1/2. It touches the same function, but it requires a different usage to trigger, so squashing it in would require broadening the scope. I understand if you don't want to do that. I'm fine with squashing it in with both our sign-offs. It is the same idea, it just requires a different set of arguments to hit it. I'll adjust the commit message as necessary. If you want to pick these up as part of your re-roll in any way, shape or form, go ahead. If not, they can go in separately, either in parallel or after your series lands. Whatever the destiny of this posting, I'll follow through as appropriate. I'll add your PATCH 2/2 to my v2. Thanks!
Re: We should add a "git gc --auto" after "git clone" due to commit graph
On 10/3/2018 9:36 AM, SZEDER Gábor wrote: On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote: Don't have time to patch this now, but thought I'd send a note / RFC about this. Now that we have the commit graph it's nice to be able to set e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or /etc/gitconfig to apply them to all repos. But when I clone e.g. linux.git stuff like 'tag --contains' will be slow until whenever my first "gc" kicks in, which may be quite some time if I'm just using it passively. So we should make "git gc --auto" be run on clone, There is no garbage after 'git clone'... And since there is no garbage, the gc will not write the commit-graph. and change the need_to_gc() / cmd_gc() behavior so that we detect that the gc.writeCommitGraph=true setting is on, but we have no commit graph, and then just generate that without doing a full repack. Or just teach 'git clone' to run 'git commit-graph write ...' I plan to add a 'fetch.writeCommitGraph' config setting. I was waiting until the file is incremental (on my to-do list soon), so the write is fast when only adding a few commits at a time. This would cover the clone case, too. Thanks, -Stolee
Re: [PATCH 1/2] commit-graph: clean up leaked memory during write
On 10/2/2018 6:44 PM, Stefan Beller wrote: My preference is to avoid them in the name of simplicity. If you're using "make SANITIZE=leak test" to check for leaks, it will skip these cases. If you're using valgrind, I think these may be reported as "reachable". But that number already isn't useful for finding real leaks, because it includes cases like the "foo" above as well as program-lifetime globals. The only argument (IMHO) for such an UNLEAK() is that it annotates the location for when somebody later changes the function to "return -1" instead of dying. But if we are going to do such annotation, we may as well actually call free(), which is what the "return" version will ultimately have to do. Heh, that was part of my reasoning why we'd want to have *something*. I'd actually be _more_ favorable to calling free() instead of UNLEAK() there, but I'm still mildly negative, just because it may go stale (and our leak-checking wouldn't usefully notice these cases). Anybody converting that die() to a return needs to re-analyze the function for what might need to be released (and that includes non-memory bits like descriptors, too). Sounds reasonable, so then the consensus (between Martin, you and me) is to drop the UNLEAK. Thanks for the discussion here. I'll drop the UNLEAK for now and think about how to remove the die() calls from commit-graph.c in a later series. Thanks, -Stolee
Re: [PATCH] more oideq/hasheq conversions
On 10/2/2018 5:19 PM, Jeff King wrote: We added faster equality-comparison functions for hashes in 14438c4497 (introduce hasheq() and oideq(), 2018-08-28). A few topics were in-flight at the time, and can now be converted. This covers all spots found by "make coccicheck" in master (the coccicheck results were tweaked by hand for style). Signed-off-by: Jeff King --- Jake: I was surprised that this was not a "patch 2" on top of your earlier coccicheck patch. Apologies if you were planning to send it out. This doesn't conflict with anything in "pu", so it's a reasonable time to apply it. There are a few lingering cases in pu, so another option is to wait a week or two and see if they get merged. These conversions look good to me! Reviewed-by: Derrick Stolee
[PATCH 2/2] commit-graph: reduce initial oid allocation
From: Derrick Stolee While writing a commit-graph file, we store the full list of commits in a flat list. We use this list for sorting and ensuring we are closed under reachability. The initial allocation assumed that (at most) one in four objects is a commit. This is a dramatic over-count for many repos, especially large ones. Since we grow the list dynamically, reduce this count by a factor of eight. We still set it to a minimum of 1024 before allocating. Signed-off-by: Derrick Stolee --- commit-graph.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index 7226bd6b58..a24cceb55f 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -721,7 +721,7 @@ void write_commit_graph(const char *obj_dir, struct progress *progress = NULL; oids.nr = 0; - oids.alloc = approximate_object_count() / 4; + oids.alloc = approximate_object_count() / 32; oids.progress = NULL; oids.progress_done = 0; -- gitgitgadget
[PATCH 1/2] commit-graph: clean up leaked memory during write
From: Derrick Stolee The write_commit_graph() method in commit-graph.c leaks some lists and strings during execution. In addition, a list of strings is leaked in write_commit_graph_reachable(). Clean these up so our memory checking is cleaner. Running 'valgrind --leak-check=full git commit-graph write --reachable' demonstrates these leaks and how they are fixed after this change. Signed-off-by: Derrick Stolee --- commit-graph.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index 2a24eb8b5a..7226bd6b58 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -698,6 +698,8 @@ void write_commit_graph_reachable(const char *obj_dir, int append, string_list_init(&list, 1); for_each_ref(add_ref_to_list, &list); write_commit_graph(obj_dir, NULL, &list, append, report_progress); + + string_list_clear(&list, 0); } void write_commit_graph(const char *obj_dir, @@ -846,9 +848,11 @@ void write_commit_graph(const char *obj_dir, compute_generation_numbers(&commits, report_progress); graph_name = get_commit_graph_filename(obj_dir); - if (safe_create_leading_directories(graph_name)) + if (safe_create_leading_directories(graph_name)) { + UNLEAK(graph_name); die_errno(_("unable to create leading directories of %s"), graph_name); + } hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR); f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf); @@ -893,6 +897,8 @@ void write_commit_graph(const char *obj_dir, finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC); commit_lock_file(&lk); + free(graph_name); + free(commits.list); free(oids.list); oids.alloc = 0; oids.nr = 0; -- gitgitgadget
[PATCH 0/2] Clean up leaks in commit-graph.c
While looking at the commit-graph code, I noticed some memory leaks. These can be found by running valgrind --leak-check=full ./git commit-graph write --reachable The impact of these leaks is small, as we never call write_commit_graph_reachable in a loop, but it is best to be diligent here. While looking at memory consumption within write_commit_graph(), I noticed that we initialize our oid list with "object count / 4", which seems to be a huge over-count. I reduce this by a factor of eight. I built off of ab/commit-graph-progress, because my patch involves lines close to those changes. Thanks, -Stolee Derrick Stolee (2): commit-graph: clean up leaked memory during write commit-graph: reduce initial oid allocation commit-graph.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) base-commit: 6b89a34c89fc763292f06012318b852b74825619 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-42%2Fderrickstolee%2Fcommit-graph-leak-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-42/derrickstolee/commit-graph-leak-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/42 -- gitgitgadget
Re: [PATCH 15/16] commit-reach: make can_all_from_reach... linear
On 10/1/2018 3:16 PM, René Scharfe wrote: Am 28.06.2018 um 14:31 schrieb Derrick Stolee via GitGitGadget: diff --git a/commit-reach.c b/commit-reach.c index c58e50fbb..ac132c8e4 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -513,65 +513,88 @@ int commit_contains(struct ref_filter *filter, struct commit *commit, return is_descendant_of(commit, list); } -int reachable(struct commit *from, int with_flag, int assign_flag, - time_t min_commit_date) +static int compare_commits_by_gen(const void *_a, const void *_b) { - struct prio_queue work = { compare_commits_by_commit_date }; + const struct commit *a = (const struct commit *)_a; + const struct commit *b = (const struct commit *)_b; This cast is bogus. QSORT gets handed a struct commit **, i.e. an array of pointers, and qsort(1) passes references to those pointers to the compare function, and not the pointer values. Good catch! I'm disappointed that we couldn't use type-checking here, as it is quite difficult to discover that the types are wrong here. As a result it's unlikely that the array is sorted in the intended order. Given that, a silly question: Is sorting even necessary here? The reason to sort is to hopefully minimize the amount we walk by exploring the "lower" commits first. This is a performance-only thing, not a correctness issue (which is why the bug exists). Even then, it is just a heuristic. Anyway, the patch below should fix it. -- >8 -- Subject: [PATCH] commit-reach: fix cast in compare_commits_by_gen() The elements of the array to be sorted are commit pointers, so the comparison function gets handed references to these pointers, not pointers to commit objects. Cast to the right type and dereference once to correctly get the commit reference. Found using Clang's ASan and t5500. Signed-off-by: Rene Scharfe --- Has this patch a performance impact? 
commit-reach.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/commit-reach.c b/commit-reach.c index 00e5ceee6f..2f5e592d16 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -529,8 +529,8 @@ int commit_contains(struct ref_filter *filter, struct commit *commit, static int compare_commits_by_gen(const void *_a, const void *_b) { - const struct commit *a = (const struct commit *)_a; - const struct commit *b = (const struct commit *)_b; + const struct commit *a = *(const struct commit * const *)_a; + const struct commit *b = *(const struct commit * const *)_b; I would expect s/* const */**/ here, but I'm guessing your formulation is a bit extra careful about types. Thanks! Reviewed-by: Derrick Stolee
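The fixed bug can be reproduced outside of git. In the standalone sketch below (with a stand-in struct), qsort() passes the comparator pointers to the array *elements*; since the elements are themselves pointers, the comparator must cast to a double pointer and dereference once:

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for struct commit, carrying only a generation number. */
struct commit_demo {
	unsigned int generation;
};

/*
 * Correct comparator for an array of 'struct commit_demo *'.
 * Casting _a directly to 'const struct commit_demo *' (the bug
 * fixed above) would reinterpret the pointer slot itself as a
 * struct and compare garbage.
 */
static int compare_by_gen(const void *_a, const void *_b)
{
	const struct commit_demo *a = *(const struct commit_demo * const *)_a;
	const struct commit_demo *b = *(const struct commit_demo * const *)_b;

	if (a->generation < b->generation)
		return -1;
	if (a->generation > b->generation)
		return 1;
	return 0;
}
```

The `* const *` in the cast matches the const-qualified void pointer exactly, which is the extra care noted in the review; a plain `**` cast works too.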
Re: [PATCH] read-cache: fix division by zero core-dump
On 9/28/2018 11:31 AM, Ramsay Jones wrote: Also, this is not the first time some multi-threaded code in git has 'failed' by assuming more than one cpu, so ... I wonder if this is a good time to create a GIT_TEST_CPU_COUNT variable so we can mock out single-processor environments instead of relying on old hardware or custom VMs. Thanks, -Stolee
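A sketch of how such a variable could be consulted (the variable name comes from the suggestion above; the helper is hypothetical and does not exist in git, which detects CPUs via online_cpus() in thread-utils.c):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper for the GIT_TEST_CPU_COUNT idea: let the
 * test suite override the detected CPU count so single-processor
 * environments can be mocked.  'detected' stands in for a real
 * detection call such as online_cpus().
 */
static int effective_cpu_count(int detected)
{
	const char *override = getenv("GIT_TEST_CPU_COUNT");

	if (override && *override) {
		int n = atoi(override);

		if (n > 0)
			return n;
	}
	return detected;
}
```

A test could then run with GIT_TEST_CPU_COUNT=1 to exercise the single-CPU code paths that the division-by-zero fix touches.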
Re: Git Evolve
On 9/29/2018 7:00 PM, Stefan Xenos wrote: Hello, List! Hello! Welcome. I'm interested in porting something like Mercurial's evolve command to Git. I'll be following up with a formal proposal shortly, but before I do I thought I'd introduce myself to the list and find out if anyone else is interested in this topic. I'm CC'ing some contributors who have also expressed interest in this topic. What is the evolve command? I'm snipping the rest of your thread because I'm vaguely familiar with how this is used in hg, but I haven't used it myself. Instead, I'm going to ask you the same questions I asked the last time I had a conversation about this with someone. In my opinion, these questions should have good answers before we start working on the solution, or else we could paint ourselves into a corner as we build the first pieces. --- What would the command-line experience look like for this workflow? Be specific, including examples! How does one create a commit that obsoletes another? Are we in the middle of an interactive rebase, or are we simply checking out the commit? How does a user keep track of their progress in a topic? How do I view which commits in my topic are obsolete, and to what commits? If I want to obsolete commits on one machine and then finish the work on another machine, how do I do that? Similarly: how can I share obsolete commits with other users so they can apply them (or not)? Do obsolescence markers live in the ref space? (This is one way to help answer the question above.) Can I make multiple commits obsolete into one commit (merge patches)? Can I make my commit obsolete in multiple ways (split a patch)? How is this communicated to the user? --- In my opinion, a good way to move forward is to create a patch that adds a design document to Documentation/technical that answers these questions. Then we can dig in more to make the vision clearer. I'm not a UX expert, but this seems like a place where we could use someone with expertise in the area. 
If we are not careful, we could end up with something even harder to use than interactive rebase. The main goal here is to make it easy to rewrite a topic (plus, the ability to do so in stages). I look forward to seeing what goes on in this space. Count me in to review the technical details. Thanks, -Stolee
Re: [PATCH v2 0/4] git-commit-graph.txt: various cleanups
On 9/27/2018 3:12 PM, Martin Ågren wrote: This v2 starts with the same two patches as v1 did, then goes on to change "[commit] graph file" to "commit-graph file" with a dash, to match other instances as well as Derrick's feedback. Thanks! This version satisfies my concerns and looks good to me. Reviewed-by: Derrick Stolee
Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic
On 9/21/2018 1:39 PM, Derrick Stolee via GitGitGadget wrote: From: Derrick Stolee

When running a command like 'git rev-list --topo-order HEAD', Git performed the following steps:

1. Run limit_list(), which parses all reachable commits, adds them to a linked list, and distributes UNINTERESTING flags. If all unprocessed commits are UNINTERESTING, then it may terminate without walking all reachable commits. This does not occur if we do not specify UNINTERESTING commits.

2. Run sort_in_topological_order(), which is an implementation of Kahn's algorithm. It first iterates through the entire set of important commits and computes the in-degree of each (plus one, as we use 'zero' as a special value here). Then, we walk the commits in priority order, adding them to the priority queue if and only if their in-degree is one. As we remove commits from this priority queue, we decrement the in-degree of their parents.

3. While we are peeling commits for output, get_revision_1() uses pop_commit on the full list of commits computed by sort_in_topological_order().

In the new algorithm, these three steps correspond to three different commit walks. We run these walks simultaneously, and advance each only as far as necessary to satisfy the requirements of the 'higher order' walk. We know when we can pause each walk by using generation numbers from the commit-graph feature.

Hello, Git contributors. I understand that this commit message and patch are pretty daunting. There is a lot to read and digest. I would like to see if anyone is willing to put the work in to review this patch, as I quite like what it does, and the performance numbers below. In my local testing, I used the following Git commands on the Linux repository in three modes: HEAD~1 with no commit-graph, HEAD~1 with a commit-graph, and HEAD with a commit-graph. This allows comparing the benefits we get from parsing commits from the commit-graph and then again the benefits we get by restricting the set of commits we walk.
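For readers less familiar with step 2, here is a small Python sketch of that in-degree walk (Kahn's algorithm with a date-ordered priority queue). This is an illustration of the technique only, not the revision.c implementation; the dict-based DAG and function name are invented for the example.

```python
from heapq import heappush, heappop

def topo_order(parents, dates):
    """parents: commit -> list of parent commits;
    dates: commit -> integer commit date.
    Returns commits in topological order, newest-first among ties."""
    # Compute the in-degree of each commit, plus one: 'zero' is
    # reserved as a special "not yet counted" value, as in the
    # description above.
    indegree = {c: 1 for c in parents}
    for c in parents:
        for p in parents[c]:
            if p in indegree:
                indegree[p] += 1

    # Seed the priority queue with commits whose in-degree is one,
    # i.e. commits with no unprocessed children. Negate dates so the
    # min-heap pops the newest commit first.
    queue = []
    for c in parents:
        if indegree[c] == 1:
            heappush(queue, (-dates[c], c))

    result = []
    while queue:
        _, c = heappop(queue)
        result.append(c)
        # Removing a commit decrements its parents' in-degrees; a
        # parent whose in-degree drops to one becomes available.
        for p in parents[c]:
            if p in indegree:
                indegree[p] -= 1
                if indegree[p] == 1:
                    heappush(queue, (-dates[p], p))
    return result
```

The point of the patch is that this whole pass no longer has to run over every reachable commit up front: generation numbers bound how far the in-degree walk must advance before the next commit can be emitted.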
Test: git rev-list --topo-order -100 HEAD
  HEAD~1, no commit-graph: 6.80 s
  HEAD~1, w/ commit-graph: 0.77 s
  HEAD,   w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
  HEAD~1, no commit-graph: 9.63 s
  HEAD~1, w/ commit-graph: 6.06 s
  HEAD,   w/ commit-graph: 0.06 s

If there is something I can do to make this easier to review, then please let me know. Thanks, -Stolee
Re: Git for Windows for Unix?
On 9/27/2018 12:01 PM, Ævar Arnfjörð Bjarmason wrote: I had an IRC conversation with Johannes saying I didn't know Git For Windows builds perfectly well for Linux, this just isn't advertised in the ANNOUNCE E-Mails, so I hadn't tried. We run CI to ensure it builds and tests on Mac OSX, too. This is important to the VFS for Git group, as we work on making that work for Mac clients. We have our fork of Git for Windows at https://github.com/microsoft/git. GFW is a "friendly fork", but a permanent one it seems. The diff between it and 2.19.0 proper is ~10k lines, and e.g. this last release had experimental stash/rebase in C that 2.19.0 didn't. Hopefully we can learn from having this experimental feature in the wild and improve the patches on-list by fixing issues. We do have a desire to move as much as possible upstream. It's difficult to find time to pay down that "fork debt". Thanks, -Stolee