Re: [RFC] Generation Number v2
Jakub Narebski writes: > Jakub Narebski writes: >> Stefan Beller writes: > [...] >>> How would this impact creation of a commit? >>> >>> The current generation numbers can be lazily updated or not >>> updated at all. In my understanding of the maximum generation >>> numbers, a new commit would make these maximum generation >>> numbers invalid (i.e. they have to be recomputed). > [...] >>> For the V2 maximum generation numbers, would we need to >>> rewrite the numbers for all commits once we recompute them? >>> Assuming that is true, it sounds like the benchmark doesn't >>> cover the whole costs associated with V2, which is why the >>> exceptional performance can be explained. >> >> Let's check it using a simple example >> >> First, (minimum) parent-based generation numbers before and after >> extending the commit graph: >> >> 1 2 3 4 5 6 7new >> 1 2 3 4 5 - -old >> .---.-.---.-.---*---* >>\ >> \ 3 4 5 6new >> \ 3 4 5 6old >> \-.---.-.---. >> \ >> \ 5new >>\ -old >> \-* > > Let me check yet another idea, using (minimum) parent-based V0 generation > numbers (counting distance from the sink / root) as a starting number > for source / heads commits. [...] > [...] but let's check another example > >1 2 3 4 5 6 7 8 9 new >1 2 3 4 5 6 7 8 - old >.---.-.---.---.---.-.---.---* > \ / > \ 3 4 / 5 6 7 8 new > \ 5 6 / - - - - old >\-.---.-/---*---*---*---* But let's do this correctly. 1 2 3 4 5 6 7 8 9 new 1 2 3 4 5 6 7 8 - old .---.-.---.--.---.-.---.---* \/ \ 3 4 /new \ 5 6 / old \-.---./ \ \5 6 7 8 new \ - - - - old \--*---*-*---* Well, it looks as if I draw it incorrectly, but performed calculations right. You may need to modify / change some data, but it looks as if it is not that much of a problem. The new version of the maximum generation numbers looks like it gives the same results as generation numbers for the "longest" path, and update may affect only the side-branches that were added to. All branches merged into the trunk, and not added to should be safe with respect to updating. Can anyone here prove a thing about update of those modified maximum generation numbers? Thanks in advance. Best, -- Jakub Narębski
Re: [RFC] Generation Number v2
Derrick Stolee writes: > On 11/1/2018 8:27 AM, Jakub Narebski wrote: >> Derrick Stolee writes: >> >>> Please also let me know about any additional tests that I could >>> run. Now that I've got a lot of test scripts built up, I can re-run >>> the test suite pretty quickly. >> >> I would add straighforward 1-to-N and N-to-1 reachability tests in the >> form of `git branch / tag --contains` and `git branch / tag --merged`, >> and perhaps also ahead-behind calculations used by `git status`, and >> N-to-M reachability tests used by tag following code in push and fetch. 1-to-N and N-to-1 can be done with `git branch --merged` / `--contains`. >> Possibly also A...B walk, if it is not implemented via calculating >> merge-base. > > I believe this uses paint_down_to_common(), but looks at the PARENT1 and > PARENT2 flags afterward to determine the left/right/both relationships. Right, so I guess this is the same internal mechanism that `git merge-base` utilizes, just used a bit differently. Thus benchmarking `git merge-base` should cover also A...B performance, I guess. >> Maybe even --ancestry-path walk, though I am not sure how important best >> performance for rhis case is (we would want good performance, but >> perhaps best is not needed for rarely used command). > > Currently, the implementation of --ancestry-path does not use a > reachability index. Well, using reachability index would certainly give it a boost. [...] >>> Generation Number Performance Test >>> == >>> >>> Git uses the commit history to answer many basic questions, such as >>> computing a merge-base. Some of these algorithms can benefit from a >>> _reachability index_, which is a condition that allows us to answer >>> "commit A cannot reach commit B" in some cases. These require pre- >>> computing some values that we can load and use at run-time. Git >>> already has a notion of _generation number_, stores it in the commit- >>> graph file, and uses it in several reachability algorithms. >> >> Note that there are other kinds of reachability indices. >> >> First, there are reachability indices that can answer the full >> reachability query (if commit A can reach commit B, or if commit A >> cannot reach commit B) directly, without walking the commit graph at >> all: so called label-only approach. For example one could store for >> each commit the compressed list of all commits reachable from it >> (transitive closure compression). >> >> Those, I think (but I have not checked), would be of not much use for >> Git, as the size of the index grows stronger than linear with the number >> of commits, as grows the time to compute such index. So probably of no >> use to Git, at least not directly (Git uses so called "bitmap index", >> see e.g. [1], which stores reachability bit-vector as compressed >> bitmap... but not for all commits, only for a small subset). >> >> >> Second, beside negative-cut reachability indexes, that can answer with >> certainity that "commit A cannot reach commit B", like the generation >> numbers (also known as level, or topological level), there also >> positive-cut indexes (usually if not always based on the spanning tree >> or trees for the DAG), that can answer when "commit A can reach commit >> B". >> >> I think that those can lead to significant speedups for at least some >> types of commit traversal and reachability queries that Git needs to >> answer. But currently no algorithms has a provision for using >> positive-cut filter index. Anyway, such index would be auxiliary thing, >> see the next point. >> >> >> Third, we require more than having reachability index in the sense of >> having some kind of _labelling_, often composite, that can answer either >> "commit A cannot reach commit B" or "commit A can reach commit B", >> depending on the type. Git for most operations needs more, namely an >> actual _ordering_, that is a weak order (or to be more exact a total >> preorder, i.e. there can be two different commits with the same >> "generation number" or index, but always either idx(A) ≲ idx(B) or >> idx(B) ≲ idx(A)) and not only partial ordering (where there can be >> incomparable elements, i.e neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)). >> This is because it needs to be used in priority queue to decide which >> commit to travel next; more on that later. This is also why there can >> be only one such "main" reachability _index_. >> >> [1]: https://githubengineering.com/counting-objects/ > > Thanks for the details. At the moment, I'm only interested in improving our > negative-cut reachability index, as we have algorithms that can take > advantage of them (and can compare their performance directly). Right, I can agree with that. Positive-cut reachability index use would need to be added separately. What I wanted to emphasize is the need for a _number_ (or a total preorder), not simply an _index_ (a label, perhaps a composite one). This means, as I wrote, that there would
Re: [RFC] Generation Number v2
Derrick Stolee writes: > On 10/31/2018 8:54 AM, Ævar Arnfjörð Bjarmason wrote: >> On Tue, Oct 30 2018, Junio C Hamano wrote: >>> Derrick Stolee writes: In contrast, maximum generation numbers and corrected commit dates both performed quite well. They are frequently the top two performing indexes, and rarely significantly different. The trade-off here now seems to be: which _property_ is more important, locally-computable or backwards-compatible? >>> >>> Nice summary. >>> >>> As I already said, I personally do not think being compatible with >>> currently deployed clients is important at all (primarily because I >>> still consider the whole thing experimental), and there is a clear >>> way forward once we correct the mistake of not having a version >>> number in the file format that tells the updated clients to ignore >>> the generation numbers. For longer term viability, we should pick >>> something that is immutable, reproducible, computable with minimum >>> input---all of which would lead to being incrementally computable, I >>> would think. >> >> I think it depends on what we mean by backwards compatibility. None of >> our docs are saying this is experimental right now, just that it's >> opt-in like so many other git-config(1) options. >> >> So if we mean breaking backwards compatibility in that we'll write a new >> file or clobber the existing one with a version older clients can't use >> as an optimization, fine. >> >> But it would be bad to produce a hard error on older clients, but >> avoiding that seems as easy as just creating a "commit-graph2" file in >> .git/objects/info/. > > Well, we have a 1-byte version number following the "CGPH" header in > the commit-graph file, and clients will ignore the commit-graph file > if that number is not "1". My hope for backwards-compatibility was > to avoid incrementing this value and instead use the unused 8th byte. How? Some of the considered new generation numbers were backwards compatibile in the sense that for graph traversal the old code for (minimum) generation numbers could be used. But that is for reading. If the old client tried to update new generation number using old generation number algorithm (assuming locality), it would write some chimaera that may or may not work. > However, it appears that we are destined to increment that version > number, anyway. Here is my list for what needs to be in the next > version of the commit-graph file format: As a reminder, here is how the commit-graph format looks now [1] HEADER: 4-byte signature: CGPH 1-byte version number: 1 1-byte Hash Version: 1 = SHA-1 1-byte number (C) of "chunks": 3 or 4 1-byte (reserved for later use) CHUNK LOOKUP: (C + 1) * 12 bytes listing the table of contents for the chunks CHUNK DATA: OIDF OIDL CDAT EDGE [optional, depends on graph structure] TRAILER: checksum [1]: Documentation/technical/commit-graph-format.txt > 1. A four-byte hash version. Why do you need a four-byte hash version? 255 possible hash functions is not enough? > 2. File incrementality (split commit-graph). This would probably require a segment lookup part, and one chunk of each type per segment (or even one chunk lookup table per segment). It may or may not need generation number which can support segmenting (here maximal generation numbers have the advantage); but you can assume that if segment(A) > segment(B) then reach.ix(A) > reach.ix(B), i.e. lexicographical-like sort. > 3. Reachability Index versioning I assume that you mean here versioning for the reachability index that is stored in the CDAT section (if it is in a separate chunk, it should be easy to version). The easiest thing to do when encountering unknown or not supported generation number (perhaps the commit-graph file was created in the past by earlier version of Git, perahs it came from server with newer Git, or perhaps the same repository is sometimes accessed using newer and sometimes older Git version), is to set commit->generation_number to GENERATION_NUMBER_ZERO, as you wrote in [2]. We could encode backward-compatibility (for using) somewhat, but I wonder how much it would be of use. [2]: https://public-inbox.org/git/61a829ce-0d29-81c9-880e-7aef1bec9...@gmail.com/ > Most of these changes will happen in the file header. The chunks > themselves don't need to change, but some chunks may be added that > only make sense in v2 commit-graphs. What I would like to see in v2 commit-graph format would be some kind convention for naming / categorizing chunks, so that Git would be able to handle unknown chunks gracefully. In PNG format there are various types of chunks. Quoting Wikipedia: Chunk types [in PNG format] are given a four-letter case sensitive ASCII type/name; compare FourCC. The case of the different letters in the name (bit 5 of the numeric value of the character) is a bit
Re: [RFC] Generation Number v2
Derrick Stolee writes: > On 10/29/2018 11:59 PM, Junio C Hamano wrote: >> Derrick Stolee writes: [...] >>> * **Compatible?** In our test implementation, we use a previously unused >>>byte of data in the commit-graph format to indicate which reachability >>>index version we are using. Existing clients ignore this value, so we >>>will want to consider if these new indexes are _backwards compatible_. >>>That is, will they still report correct values if they ignore this byte >>>and use the generation number column from the commit-graph file assuming >>>the values are minimum generation numbers? >> >> I personally consider that the current commit-graph with generation >> numbers experimental, so I am not sure how much we care about this. >> >> Having said that. >> >> By the above definition, any new index that is wider than the >> current generation number cannot be compatible (can we even tell the >> existing clients how wide each elements in the ix array is?) >> >> In any case, perhaps the first thing to do is to update the clients >> so that they stop ignoring the version number field, and instead >> work without generation number when there is no version of reach.ix >> available in the file? That way, a better reachablility index can >> be chosen freely without having to worry about the compatibility. > > I can work on that. It should be as simple as setting commit->generation to > GENERATION_NUMBER_ZERO in fill_commit_in_graph when the graph > has a different version. That is a very good idea, but we have here a chicken-and-egg problem. The only way to denote that the generation numbers / reachability index have changed (for reading/using: that it changed in backwards incompatibile way; for update: that it changed at all) is to change the format version: 1-byte version number: Currently, the only valid version is 1. But if we assume that different commit-graph format version simply means that commit->generation is to be set to GENERATION_NUMBER_ZERO, then we block possible changes to the structure of the commit-graph file. [Reads further in thread] Ah, I see that's why you want to introduce [1]: DS> DS> 3. Reachability Index versioning [1]: https://public-inbox.org/git/6902dbff-d9f6-e897-2c20-d0cb47a50...@gmail.com/ >>> * **Immutable?** Git objects are _immutable_. If you change an object you >>>actually create a new object with a new object ID. Are the values we >>> store >>>for these reachability indexes also immutable? >> >> Even if we do not embed the reachability ix in commit objects, >> having an immutable value is probably a must if we want to make them >> incrementally computable, so this is a very good property to have. >> Unless there is a clever idea to incrementally compute a mutable >> reach.ix, my gut instinct says that this property is a must. I think that the reachability index can be mutable and non-local, but still being able to be incrementally updated, though perhaps it would require not only calculations for new commits, but recalculations for some limited subset of commits existing in the commit-graph file. I'm trying to modify exact definition of maximal generation numbers to check if I can find out a version that exhibits nearly-incremental updates property. >> Another thing, perhaps related to "local" below, is if exactly the >> same reach.ix is computed by anybody, given an identical commit >> history graph (perhaps "reproducibility"?). I think most of the >> candidates you listed are reproducible without a fixed tie-breaker, >> but I am not sure about felineY() thing. felineY() is deterministic (or can be made deterministic) for given felineX() (i.e. given [reverse] topological order). >>> * **Local?** Are these values **locally computable**? That is, do we only >>>need to look at the parents of a commit (assuming those parents have >>>computed values) in order to determine the value at that commit? >> >> A subset of non-local reachability ix, for example, the ones that >> need to know what each commit's children are, cannot be immutable, >> as adding new objects to the graph (either with locally committing, >> or transferring objects from other repositories) would affect the >> ix; is this true for all non-local reachability ix, I wonder? > > As a thought experiment, we could define a function size(C) to be the > number of commits reachable from C. Let's call it reach(C), instead ;-) >This is not locally-computable > from the size values at C's parents due to the inclusion-exclusion > principle. We would need to compute it by walking the reachable set > and counting (resulting in quadratic performance overall) but is > immutable. Since the performance cost is so expensive (unlike the > linear costs in the other non-local versions) I didn't include it > in my comparison. Let's define Reach(C) as set of all commits reachable from commit C. Then reach(C) = ||Reach(C)||, where
Re: [RFC] Generation Number v2
Jakub Narebski writes: > Stefan Beller writes: [...] >> How would this impact creation of a commit? >> >> The current generation numbers can be lazily updated or not >> updated at all. In my understanding of the maximum generation >> numbers, a new commit would make these maximum generation >> numbers invalid (i.e. they have to be recomputed). [...] >> For the V2 maximum generation numbers, would we need to >> rewrite the numbers for all commits once we recompute them? >> Assuming that is true, it sounds like the benchmark doesn't >> cover the whole costs associated with V2, which is why the >> exceptional performance can be explained. > > Let's check it using a simple example > > First, (minimum) parent-based generation numbers before and after > extending the commit graph: > > 1 2 3 4 5 6 7new > 1 2 3 4 5 - -old > .---.-.---.-.---*---* >\ > \ 3 4 5 6new > \ 3 4 5 6old > \-.---.-.---. > \ > \ 5new >\ -old > \-* Let me check yet another idea, using (minimum) parent-based V0 generation numbers (counting distance from the sink / root) as a starting number for source / heads commits. 1 2 3 4 5 6 7new 1 2 3 4 5 - -old .---.-.---.-.---*---* \ \ 3 4 5 6new \ 3 4 5 6old \-.---.-.---. \ \ 5new \ -old \-* Well, on this example it looks like this variant of maximum generation numbers can be done incrementally, but let's check another example 1 2 3 4 5 6 7 8 9 new 1 2 3 4 5 6 7 8 - old .---.-.---.---.---.-.---.---* \ / \ 3 4 / 5 6 7 8 new \ 5 6 / - - - - old \-.---.-/---*---*---*---* It looks like it doesn; give as good results as I thought. Less values are changed, but you would still need to recalculate them, unless it can be proven that they do not need it. > > Next, maximum generation numbers. We start with 9 commits, and we end > up with 12 commits after the change > > 6 7 8 9 10 11 12 new > 4 5 7 8 9 - -old > .---.-.---.-.---*---* >\ > \ 9 1011 12 new > \ 6 7 8 9old > \-.---.-.---. > \ > \ 12 new >\ -old > \-* > > > As you can see all maximum generation numbers got rewritten. > > Though if instead using the number of commits, we use the maximum > generation number, or in other words the length of the longest path, we > get the following: > > 1 2 3 4 5 6 7new > 1 2 4 5 6 - -old > .---.-.---.-.---*---* >\ > \ 4 5 6 7new > \ 3 4 5 6old > \-.---.-.---. > \ > \ 7new >\ -old > \-* > > A bit better, but still much change in numbers. -- Jakub Narębski
Re: [RFC] Generation Number v2
Derrick Stolee writes: > Here is a re-formatted version of the tables I introduced earlier. > The tables were too wide for public-inbox to render correctly (when > paired with my email client). Hopefully this bulleted-list format > works better. Thanks, Stefan, for pointing out the rendering > problems! > > ### Test 1: `git log --topo-order -N` > > This test focuses on the number of commits that are parsed during > a `git log --topo-order` before writing `N` commits to output. > > You can reproduce this test using `topo-order-tests.sh` and > see all the data in `topo-order-summary.txt`. The values reported > here are a sampling of the data, ignoring tests where all values > were the same or extremely close in value. > > android-base, N = 100 > V0: 5487 > V1: 8534 > V2: 6937 > V3: 6419 > V4: 6453 [...] > ### Test 2: `git log --topo-order -10 A..B` [...] > android-base 53c1972bc8f 92f18ac3e39 > OLD: 39403 > V0: 1544 > V1: 6957 > V2: 26 > V3: 1015 > V4: 1098 Two minor issues. First, now that the data is not in the table format, where every bit of horizontal space matters, why not spell the names of new proposed "generation numbers" (or rather reachability orderings) in full, instead of using V0...V4 shortcuts? It is not as if we were short of space. Second, why OLD data is present in tests 2 and 3, but not in test 1? Best, -- Jakub Narębski
Re: [RFC] Generation Number v2
Stefan Beller writes: >> Based on the performance results alone, we should remove minimum >> generation numbers, (epoch, date) pairs, and FELINE index from >> consideration. There are enough examples of these indexes performing >> poorly. >> >> In contrast, maximum generation numbers and corrected commit >> dates both performed quite well. They are frequently the top >> two performing indexes, and rarely significantly different. >> >> The trade-off here now seems to be: which _property_ is more important, >> locally-computable or backwards-compatible? >> >> * Maximum generation number is backwards-compatible but not >>locally-computable or immutable. > > These maximum generation numbers sound like the reverse of > the generation numbers as they are currently implemented, i.e. > we count all commits between the commit A and all heads. Actually maximum generation number = number of commits - reverse generation number > How would this impact creation of a commit? > > The current generation numbers can be lazily updated or not > updated at all. In my understanding of the maximum generation > numbers, a new commit would make these maximum generation > numbers invalid (i.e. they have to be recomputed). > > Are there ways out by deferring the computation of maximum > generation numbers while still being able to work with some > commits that are un-accounted for? As I understand it, new commits added since commit-graph file was created would get simply INFINITY maximum/maximal generation numbers (if we were using reverse generation numbers, new commits would get ZERO for generation number). > When recomputing these numbers, the current generation number > (V0) has the property that already existing numbers can be re-used > as-is. We only need to compute the numbers for new commits, > and then insert this to the appropriate data structures (which is > a separate problem, one could imagine a split generation > numbers file like the split index) Sidenote: I wonder if there is some on-disk data structure with similarly fast read, but easier update. > For the V2 maximum generation numbers, would we need to > rewrite the numbers for all commits once we recompute them? > Assuming that is true, it sounds like the benchmark doesn't > cover the whole costs associated with V2, which is why the > exceptional performance can be explained. Let's check it using a simple example First, (minimum) parent-based generation numbers before and after extending the commit graph: 1 2 3 4 5 6 7new 1 2 3 4 5 - -old .---.-.---.-.---*---* \ \ 3 4 5 6new \ 3 4 5 6old \-.---.-.---. \ \ 5new \ -old \-* Next, maximum generation numbers. We start with 9 commits, and we end up with 12 commits after the change 6 7 8 9 10 11 12 new 4 5 7 8 9 - -old .---.-.---.-.---*---* \ \ 9 1011 12 new \ 6 7 8 9old \-.---.-.---. \ \ 12 new \ -old \-* As you can see all maximum generation numbers got rewritten. Though if instead using the number of commits, we use the maximum generation number, or in other words the length of the longest path, we get the following: 1 2 3 4 5 6 7new 1 2 4 5 6 - -old .---.-.---.-.---*---* \ \ 4 5 6 7new \ 3 4 5 6old \-.---.-.---. \ \ 7new \ -old \-* A bit better, but still much change in numbers. [...] >> >> * Corrected commit-date is locally-computable and immutable, >>but not backwards-compatible. > > How are these dates not backwards incompatible? > We don't have to expose these dates to the user, but > - just like generation numbers - could store them and use them > but not tell the user about it. > > We would need to be really careful to not expose them at all > as they look like the real dates, so that could make for an > awkward bug report. > > The approach of "corrected commit date" sounds like we could > have a very lazy approach, i.e. no extra data structures needed > for many commits (as the corrected date equals the real date) > and only need to store the corrections for some commits. > Such an approach however would not make it easy to know > if we operate on corrected dates, or if we even checked them > for correctness before. > > So if we'd have an additional row in the generation numbers file > telling the corrected date, then we should be able to be backwards > compatible? Here, from what I understand, being "backwards
Re: [RFC] Generation Number v2
On 11/1/2018 8:27 AM, Jakub Narebski wrote: [I have noticed that in some places I wrote A..B instead of B..A. Sorry about that] Derrick Stolee writes: Please also let me know about any additional tests that I could run. Now that I've got a lot of test scripts built up, I can re-run the test suite pretty quickly. I would add straighforward 1-to-N and N-to-1 reachability tests in the form of `git branch / tag --contains` and `git branch / tag --merged`, and perhaps also ahead-behind calculations used by `git status`, and N-to-M reachability tests used by tag following code in push and fetch. Possibly also A...B walk, if it is not implemented via calculating merge-base. I believe this uses paint_down_to_common(), but looks at the PARENT1 and PARENT2 flags afterward to determine the left/right/both relationships. Maybe even --ancestry-path walk, though I am not sure how important best performance for rhis case is (we would want good performance, but perhaps best is not needed for rarely used command). Currently, the implementation of --ancestry-path does not use a reachability index. See explanation below. Generation Number Performance Test == Git uses the commit history to answer many basic questions, such as computing a merge-base. Some of these algorithms can benefit from a _reachability index_, which is a condition that allows us to answer "commit A cannot reach commit B" in some cases. These require pre- computing some values that we can load and use at run-time. Git already has a notion of _generation number_, stores it in the commit- graph file, and uses it in several reachability algorithms. Note that there are other kinds of reachability indices. First, there are reachability indices that can answer the full reachability query (if commit A can reach commit B, or if commit A cannot reach commit B) directly, without walking the commit graph at all: so called label-only approach. For example one could store for each commit the compressed list of all commits reachable from it (transitive closure compression). Those, I think (but I have not checked), would be of not much use for Git, as the size of the index grows stronger than linear with the number of commits, as grows the time to compute such index. So probably of no use to Git, at least not directly (Git uses so called "bitmap index", see e.g. [1], which stores reachability bit-vector as compressed bitmap... but not for all commits, only for a small subset). Second, beside negative-cut reachability indexes, that can answer with certainity that "commit A cannot reach commit B", like the generation numbers (also known as level, or topological level), there also positive-cut indexes (usually if not always based on the spanning tree or trees for the DAG), that can answer when "commit A can reach commit B". I think that those can lead to significant speedups for at least some types of commit traversal and reachability queries that Git needs to answer. But currently no algorithms has a provision for using positive-cut filter index. Anyway, such index would be auxiliary thing, see the next point. Third, we require more than having reachability index in the sense of having some kind of _labelling_, often composite, that can answer either "commit A cannot reach commit B" or "commit A can reach commit B", depending on the type. Git for most operations needs more, namely an actual _ordering_, that is a weak order (or to be more exact a total preorder, i.e. there can be two different commits with the same "generation number" or index, but always either idx(A) ≲ idx(B) or idx(B) ≲ idx(A)) and not only partial ordering (where there can be incomparable elements, i.e neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)). This is because it needs to be used in priority queue to decide which commit to travel next; more on that later. This is also why there can be only one such "main" reachability _index_. [1]: https://githubengineering.com/counting-objects/ Thanks for the details. At the moment, I'm only interested in improving our negative-cut reachability index, as we have algorithms that can take advantage of them (and can compare their performance directly). You can read more about generation numbers and how to use them in algorithms in [this blog post](https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/). However, [generation numbers do not always improve our algorithms](https://public-inbox.org/git/pull.28.git.gitgitgad...@gmail.com/T/#u). Specifically, some algorithms in Git already use commit date as a heuristic reachability index. This has some problems, though, since commit date can be incorrect for several reasons (clock skew between machines, purposefully setting GIT_COMMIT_DATE to the author date, etc.). That's what we use the "slop" mechanism for: Git would walk a few commits more than necessary than if there were no clock skew (if
Re: [RFC] Generation Number v2
[I have noticed that in some places I wrote A..B instead of B..A. Sorry about that] Derrick Stolee writes: > We've discussed in several places how to improve upon generation > numbers. This RFC is a report based on my investigation into a > few new options, and how they compare for Git's purposes on > several existing open-source repos. > > You can find this report and the associated test scripts at > https://github.com/derrickstolee/gen-test. Which use prototype test implementation from the `reach-perf` branch in https://github.com/derrickstolee/git. > I have some more work to do in order to make this fully > reproducible (including a single script to clone all of the > repos I used, compute the commit-graphs using the different > index versions, and run the tests). > > TL;DR: I recommend we pursue one of the following two options: > > 1. Maximum Generation Number. > 2. Corrected Commit Date. > > Each of these perform well, but have different implementation > drawbacks. I'd like to hear what everyone thinks about those > drawbacks. I agree with Junio that incremental computation is more important than backwards compatibility (that the old clients can work with the new commit-graph without changes). We would have compatibility breaking anyway with the planned move from SHA-1 to NewHash aka. SHA-256. > Please also let me know about any additional tests that I could > run. Now that I've got a lot of test scripts built up, I can re-run > the test suite pretty quickly. I would add straighforward 1-to-N and N-to-1 reachability tests in the form of `git branch / tag --contains` and `git branch / tag --merged`, and perhaps also ahead-behind calculations used by `git status`, and N-to-M reachability tests used by tag following code in push and fetch. Possibly also A...B walk, if it is not implemented via calculating merge-base. Maybe even --ancestry-path walk, though I am not sure how important best performance for rhis case is (we would want good performance, but perhaps best is not needed for rarely used command). See explanation below. > Generation Number Performance Test > == > > Git uses the commit history to answer many basic questions, such as > computing a merge-base. Some of these algorithms can benefit from a > _reachability index_, which is a condition that allows us to answer > "commit A cannot reach commit B" in some cases. These require pre- > computing some values that we can load and use at run-time. Git > already has a notion of _generation number_, stores it in the commit- > graph file, and uses it in several reachability algorithms. Note that there are other kinds of reachability indices. First, there are reachability indices that can answer the full reachability query (if commit A can reach commit B, or if commit A cannot reach commit B) directly, without walking the commit graph at all: so called label-only approach. For example one could store for each commit the compressed list of all commits reachable from it (transitive closure compression). Those, I think (but I have not checked), would be of not much use for Git, as the size of the index grows stronger than linear with the number of commits, as grows the time to compute such index. So probably of no use to Git, at least not directly (Git uses so called "bitmap index", see e.g. [1], which stores reachability bit-vector as compressed bitmap... but not for all commits, only for a small subset). Second, beside negative-cut reachability indexes, that can answer with certainity that "commit A cannot reach commit B", like the generation numbers (also known as level, or topological level), there also positive-cut indexes (usually if not always based on the spanning tree or trees for the DAG), that can answer when "commit A can reach commit B". I think that those can lead to significant speedups for at least some types of commit traversal and reachability queries that Git needs to answer. But currently no algorithms has a provision for using positive-cut filter index. Anyway, such index would be auxiliary thing, see the next point. Third, we require more than having reachability index in the sense of having some kind of _labelling_, often composite, that can answer either "commit A cannot reach commit B" or "commit A can reach commit B", depending on the type. Git for most operations needs more, namely an actual _ordering_, that is a weak order (or to be more exact a total preorder, i.e. there can be two different commits with the same "generation number" or index, but always either idx(A) ≲ idx(B) or idx(B) ≲ idx(A)) and not only partial ordering (where there can be incomparable elements, i.e neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)). This is because it needs to be used in priority queue to decide which commit to travel next; more on that later. This is also why there can be only one such "main" reachability _index_. [1]: https://githubengineering.com/counting-objects/ > You can
Re: [RFC] Generation Number v2
On 10/31/2018 8:54 AM, Ævar Arnfjörð Bjarmason wrote: On Tue, Oct 30 2018, Junio C Hamano wrote: Derrick Stolee writes: In contrast, maximum generation numbers and corrected commit dates both performed quite well. They are frequently the top two performing indexes, and rarely significantly different. The trade-off here now seems to be: which _property_ is more important, locally-computable or backwards-compatible? Nice summary. As I already said, I personally do not think being compatible with currently deployed clients is important at all (primarily because I still consider the whole thing experimental), and there is a clear way forward once we correct the mistake of not having a version number in the file format that tells the updated clients to ignore the generation numbers. For longer term viability, we should pick something that is immutable, reproducible, computable with minimum input---all of which would lead to being incrementally computable, I would think. I think it depends on what we mean by backwards compatibility. None of our docs are saying this is experimental right now, just that it's opt-in like so many other git-config(1) options. So if we mean breaking backwards compatibility in that we'll write a new file or clobber the existing one with a version older clients can't use as an optimization, fine. But it would be bad to produce a hard error on older clients, but avoiding that seems as easy as just creating a "commit-graph2" file in .git/objects/info/. Well, we have a 1-byte version number following the "CGPH" header in the commit-graph file, and clients will ignore the commit-graph file if that number is not "1". My hope for backwards-compatibility was to avoid incrementing this value and instead use the unused 8th byte. However, it appears that we are destined to increment that version number, anyway. Here is my list for what needs to be in the next version of the commit-graph file format: 1. A four-byte hash version. 2. File incrementality (split commit-graph). 3. Reachability Index versioning Most of these changes will happen in the file header. The chunks themselves don't need to change, but some chunks may be added that only make sense in v2 commit-graphs. Thanks, -Stolee
Re: [RFC] Generation Number v2
On Tue, Oct 30 2018, Junio C Hamano wrote: > Derrick Stolee writes: >> In contrast, maximum generation numbers and corrected commit >> dates both performed quite well. They are frequently the top >> two performing indexes, and rarely significantly different. >> >> The trade-off here now seems to be: which _property_ is more important, >> locally-computable or backwards-compatible? > > Nice summary. > > As I already said, I personally do not think being compatible with > currently deployed clients is important at all (primarily because I > still consider the whole thing experimental), and there is a clear > way forward once we correct the mistake of not having a version > number in the file format that tells the updated clients to ignore > the generation numbers. For longer term viability, we should pick > something that is immutable, reproducible, computable with minimum > input---all of which would lead to being incrementally computable, I > would think. I think it depends on what we mean by backwards compatibility. None of our docs are saying this is experimental right now, just that it's opt-in like so many other git-config(1) options. So if we mean breaking backwards compatibility in that we'll write a new file or clobber the existing one with a version older clients can't use as an optimization, fine. But it would be bad to produce a hard error on older clients, but avoiding that seems as easy as just creating a "commit-graph2" file in .git/objects/info/.
Re: [RFC] Generation Number v2
On 10/29/2018 11:59 PM, Junio C Hamano wrote: Derrick Stolee writes: **V3: Corrected Commit Date.** For a commit C, let its _corrected commit date_ (denoted by cdate(C)) be the maximum of the commit date of C and the commit dates of its parents. "maximum of the commit date of C and the corrected commit dates of its parents"? That's what I mean. Thanks. We've talked about exactly this one in the past (long before any of Microsoft folks other than Dscho came to the scene) but without an implementation, or detailed design and analysis. I am very happy to see the idea among other possibilities to be considered again. This time around, we may finally come up with something better than the "commit dates with SLOP" thing ;-). Essentially, the felineY order is selected with the goal of swapping positions of topologically-independent commits relative to the felinX ordering. The resulting reachability index is as follows: If felineX(A) < felineY(B), then A cannot reach B. If felineY(A) < felineY(B), then A cannot reach B. I presume that the first line is a typo and you compare the same X index? Yes, sorry for the typos. I fixed them in the report on GitHub. * **Compatible?** In our test implementation, we use a previously unused byte of data in the commit-graph format to indicate which reachability index version we are using. Existing clients ignore this value, so we will want to consider if these new indexes are _backwards compatible_. That is, will they still report correct values if they ignore this byte and use the generation number column from the commit-graph file assuming the values are minimum generation numbers? I personally consider that the current commit-graph with generation numbers experimental, so I am not sure how much we care about this. Having said that. By the above definition, any new index that is wider than the current generation number cannot be compatible (can we even tell the existing clients how wide each elements in the ix array is?) In any case, perhaps the first thing to do is to update the clients so that they stop ignoring the version number field, and instead work without generation number when there is no version of reach.ix available in the file? That way, a better reachablility index can be chosen freely without having to worry about the compatibility. I can work on that. It should be as simple as setting commit->generation to GENERATION_NUMBER_ZERO in fill_commit_in_graph when the graph has a different version. * **Immutable?** Git objects are _immutable_. If you change an object you actually create a new object with a new object ID. Are the values we store for these reachability indexes also immutable? Even if we do not embed the reachability ix in commit objects, having an immutable value is probably a must if we want to make them incrementally computable, so this is a very good property to have. Unless there is a clever idea to incrementally compute a mutable reach.ix, my gut instinct says that this property is a must. Another thing, perhaps related to "local" below, is if exactly the same reach.ix is computed by anybody, given an identical commit history graph (perhaps "reproducibility"?). I think most of the candidates you listed are reproducible without a fixed tie-breaker, but I am not sure about felineY() thing. * **Local?** Are these values **locally computable**? That is, do we only need to look at the parents of a commit (assuming those parents have computed values) in order to determine the value at that commit? A subset of non-local reachability ix, for example, the ones that need to know what each commit's children are, cannot be immutable, as adding new objects to the graph (either with locally committing, or transferring objects from other repositories) would affect the ix; is this true for all non-local reachability ix, I wonder? As a thought experiment, we could define a function size(C) to be the numberof commits reachable from C. This is not locally-computable from the size values at C's parents due to the inclusion-exclusion principle. We would need to compute it by walking the reachable set and counting (resulting in quadratic performance overall) but is immutable. Since the performance cost is so expensive (unlike the linear costs in the other non-local versions) I didn't include it in my comparison. We focused on three types of performance tests that test the indexes in different ways. Each test lists the `git` command that is used, and the table lists which repository is used and which inputs. ### Test 1: `git log --topo-order -N` This test focuses on the number of commits that are parsed during a `git log --topo-order` before writing `N` commits to output. A devil's advocate comment. Your patches seem to be very focused on this "unlimited" case for the past few weeks, but I am not so sure if that is a case worth optimizing for. If "git log --topo-order -N HEAD~M.."
Re: [RFC] Generation Number v2
Derrick Stolee writes: > **V3: Corrected Commit Date.** > For a commit C, let its _corrected commit date_ (denoted by cdate(C)) > be the maximum of the commit date of C and the commit dates of its > parents. "maximum of the commit date of C and the corrected commit dates of its parents"? We've talked about exactly this one in the past (long before any of Microsoft folks other than Dscho came to the scene) but without an implementation, or detailed design and analysis. I am very happy to see the idea among other possibilities to be considered again. This time around, we may finally come up with something better than the "commit dates with SLOP" thing ;-). > Essentially, the felineY order is selected with the goal of swapping > positions of topologically-independent commits relative to the felinX > ordering. The resulting reachability index is as follows: > >If felineX(A) < felineY(B), then A cannot reach B. >If felineY(A) < felineY(B), then A cannot reach B. I presume that the first line is a typo and you compare the same X index? > * **Compatible?** In our test implementation, we use a previously unused > byte of data in the commit-graph format to indicate which reachability > index version we are using. Existing clients ignore this value, so we > will want to consider if these new indexes are _backwards compatible_. > That is, will they still report correct values if they ignore this byte > and use the generation number column from the commit-graph file assuming > the values are minimum generation numbers? I personally consider that the current commit-graph with generation numbers experimental, so I am not sure how much we care about this. Having said that. By the above definition, any new index that is wider than the current generation number cannot be compatible (can we even tell the existing clients how wide each elements in the ix array is?) In any case, perhaps the first thing to do is to update the clients so that they stop ignoring the version number field, and instead work without generation number when there is no version of reach.ix available in the file? That way, a better reachablility index can be chosen freely without having to worry about the compatibility. > * **Immutable?** Git objects are _immutable_. If you change an object you > actually create a new object with a new object ID. Are the values we > store > for these reachability indexes also immutable? Even if we do not embed the reachability ix in commit objects, having an immutable value is probably a must if we want to make them incrementally computable, so this is a very good property to have. Unless there is a clever idea to incrementally compute a mutable reach.ix, my gut instinct says that this property is a must. Another thing, perhaps related to "local" below, is if exactly the same reach.ix is computed by anybody, given an identical commit history graph (perhaps "reproducibility"?). I think most of the candidates you listed are reproducible without a fixed tie-breaker, but I am not sure about felineY() thing. > * **Local?** Are these values **locally computable**? That is, do we only > need to look at the parents of a commit (assuming those parents have > computed values) in order to determine the value at that commit? A subset of non-local reachability ix, for example, the ones that need to know what each commit's children are, cannot be immutable, as adding new objects to the graph (either with locally committing, or transferring objects from other repositories) would affect the ix; is this true for all non-local reachability ix, I wonder? > We focused on three types of performance tests that test the indexes > in different ways. Each test lists the `git` command that is used, > and the table lists which repository is used and which inputs. > > ### Test 1: `git log --topo-order -N` > > This test focuses on the number of commits that are parsed during > a `git log --topo-order` before writing `N` commits to output. A devil's advocate comment. Your patches seem to be very focused on this "unlimited" case for the past few weeks, but I am not so sure if that is a case worth optimizing for. If "git log --topo-order -N HEAD~M.." (for some number M) gives us a similar result as unlimited case but with much less effort, wouldn't it be good enough that lets us concentrate on other use cases instead? > Based on the performance results alone, we should remove minimum > generation numbers, (epoch, date) pairs, and FELINE index from > consideration. There are enough examples of these indexes performing > poorly. OK. > In contrast, maximum generation numbers and corrected commit > dates both performed quite well. They are frequently the top > two performing indexes, and rarely significantly different. > > The trade-off here now seems to be: which _property_ is more important, > locally-computable or backwards-compatible? Nice summary. As I already said, I personally do not think
Re: [RFC] Generation Number v2
Here is a re-formatted version of the tables I introduced earlier. The tables were too wide for public-inbox to render correctly (when paired with my email client). Hopefully this bulleted-list format works better. Thanks, Stefan, for pointing out the rendering problems! ### Test 1: `git log --topo-order -N` This test focuses on the number of commits that are parsed during a `git log --topo-order` before writing `N` commits to output. You can reproduce this test using `topo-order-tests.sh` and see all the data in `topo-order-summary.txt`. The values reported here are a sampling of the data, ignoring tests where all values were the same or extremely close in value. android-base, N = 100 V0: 5487 V1: 8534 V2: 6937 V3: 6419 V4: 6453 android-base, N = 1000 V0: 36029 V1: 44030 V2: 41493 V3: 41206 V4: 45431 chromium, N = 100 V0: 101 V1: 424406 V2: 101 V3: 101 V4: 101 gerrit, N = 100 V0: 8212 V1: 8533 V2: 164 V3: 159 V4: 162 gerrit, N = 1000 V0: 8512 V1: 8533 V2: 1990 V3: 1973 V4: 3766 Linux, N = 100 V0: 12458 V1: 12444 V2: 13683 V3: 13123 V4: 13124 Linux, N = 1000 V0: 24436 V1: 26247 V2: 27878 V3: 26430 V4: 27875 Linux, N = 1 V0: 30364 V1: 28891 V2: 27878 V3: 26430 V4: 27875 electron, N = 1000 V0: 19927 V1: 18733 V2: 1072 V3: 18214 V4: 18214 Ffmpeg, N = 1 V0: 32154 V1: 47429 V2: 10435 V3: 11054 V4: 11054 jgit, N = 1000 V0: 1550 V1: 6264 V2: 1067 V3: 1060 V4: 1233 julia, N = 1 V0: 43043 V1: 43043 V2: 10201 V3: 23505 V4: 23828 odoo, N = 1000 V0: 17175 V1: 9714 V2: 4043 V3: 4046 V4: 4111 php-src, N = 1000 V0: 19014 V1: 27530 V2: 1311 V3: 1305 V4: 1320 rails, N = 100 V0: 1420 V1: 2041 V2: 1757 V3: 1428 V4: 1441 rails, N = 1000 V0: 7952 V1: 10145 V2: 10053 V3: 8373 V4: 8373 swift, N = 1000 V0: 1914 V1: 4004 V2: 2071 V3: 1939 V4: 1940 tensorflow, N = 1000 V0: 10019 V1: 39221 V2: 6711 V3: 10051 V4: 10051 TypeScript, N = 1000 V0: 2873 V1: 12014 V2: 3049 V3: 2876 V4: 2876 ### Test 2: `git log --topo-order -10 A..B` This test focuses on the number of commits that are parsed during a `git log --topo-order A..B` before writing ten commits to output. Since we fix a very small set of output commits, we care more about the part of the walk that determines which commits are reachable from `B` but not reachable from `A`. This part of the walk uses commit date as a heuristic in the existing implementation. You can reproduce this test using `topo-compare-tests.sh` and see all the data in `topo-compare-summary.txt`. The values reported here are a sampling of the data, ignoring tests where all values were the same or extremely close in value. _Note:_ For some of the rows, the `A` and `B` values may be swapped from what is expected. This is due to (1) a bug in the reference implementation that doesn't short-circuit the walk when `A` can reach `B`, and (2) data-entry errors by the author. The bug can be fixed, but would have introduced strange-looking rows in this table. android-base 53c1972bc8f 92f18ac3e39 OLD: 39403 V0: 1544 V1: 6957 V2: 26 V3: 1015 V4: 1098 gerrit c4311f7642 777a8cd1e0 OLD: 6457 V0: 7836 V1: 10869 V2: 415 V3: 414 V4: 445 electron 7da7dd85e addf069f2 OLD: 18164 V0: 945 V1: 6528 V2: 17 V3: 17 V4: 18 julia 7faee1b201 e2022b9f0f OLD: 22800 V0: 4221 V1: 12710 V2: 377 V3: 213 V4: 213 julia ae69259cd9 c8b5402afc OLD: 1864 V0: 1859 V1: 13287 V2: 12 V3: 1859 V4: 1859 Linux 69973b830859 c470abd4fde4 OLD: 111692 V0: 77263 V1: 96598 V2: 80238 V3: 76332 V4: 76495 Linux c3b92c878736 19f949f52599 OLD: 167418 V0: 5736 V1: 4684 V2: 9675 V3: 3887 V4: 3923 Linux c8d2bc9bc39e 69973b830859 OLD: 44940 V0: 4056 V1: 16636 V2: 10405 V3: 3475 V4: 4022 odoo 4a31f55d0a0 93fb2b4a616 OLD: 25139 V0: 19528 V1: 20418 V2: 19874 V3: 19634 V4: 27247 swift 4046359efd b34b6a14c7 OLD: 13411 V0: 662 V1: 321 V2: 12 V3: 80 V4: 134 tensorflow ec6d17219c fa1db5eb0d OLD: 10373 V0: 4762 V1: 36272 V2: 174 V3: 3631 V4: 3632 TypeScript 35ea2bea76 123edced90 OLD: 3450 V0: 267 V1: 10386 V2: 27 V3: 259 V4: 259 ### Test 3: `git merge-base A B` This test focuses on the number of commits that are parsed during a `git merge-base A B`. This part of the walk uses commit date as a heuristic in the existing implementation. You can reproduce this test using `merge-base-tests.sh` and see all the data in `merge-base-summary.txt`. The values reported here are a
Re: [RFC] Generation Number v2
On 10/29/2018 3:22 PM, Stefan Beller wrote: Based on the performance results alone, we should remove minimum generation numbers, (epoch, date) pairs, and FELINE index from consideration. There are enough examples of these indexes performing poorly. In contrast, maximum generation numbers and corrected commit dates both performed quite well. They are frequently the top two performing indexes, and rarely significantly different. The trade-off here now seems to be: which _property_ is more important, locally-computable or backwards-compatible? * Maximum generation number is backwards-compatible but not locally-computable or immutable. These maximum generation numbers sound like the reverse of the generation numbers as they are currently implemented, i.e. we count all commits between the commit A and all heads. How would this impact creation of a commit? The indexes listed here would be computed at the same time as a commit-graph (and only change on commit-graph writes). This idea of having something depend on the "global state" isn't ideal (in my opinion) but it certainly works for the tests specified below. The current generation numbers can be lazily updated or not updated at all. In my understanding of the maximum generation numbers, a new commit would make these maximum generation numbers invalid (i.e. they have to be recomputed). New commits would change the value you would get by the definition (on rewrite of the commit-graph) but don't violate the reachability index property (maxgen(A) < maxgen(B) => A can't reach B). So that doesn't effect their usefulness. Are there ways out by deferring the computation of maximum generation numbers while still being able to work with some commits that are un-accounted for? Creating a commit that doesn't exist in the commit-graph file results in a commit with generation number INFINITY. The new commits can't be reached by commits with finite generation number. When recomputing these numbers, the current generation number (V0) has the property that already existing numbers can be re-used as-is. We only need to compute the numbers for new commits, and then insert this to the appropriate data structures (which is a separate problem, one could imagine a split generation numbers file like the split index) For the V2 maximum generation numbers, would we need to rewrite the numbers for all commits once we recompute them? Assuming that is true, it sounds like the benchmark doesn't cover the whole costs associated with V2, which is why the exceptional performance can be explained. You're right, we need to recompute the entire list of maximum generation number values. However, you are a bit hung up on "maxgen" being a fixed function that is correct at all times. We could compute the "maxgen" for the set of commits that are written to the "split" commit-graph while leaving the base the same. There is one tricky part here: we need to initialize the values starting at "num commits in the combined commit-graph" (commits in the base and split portion). The minimum possible value in the split portion will be larger than the maximum possible value in the base portion. Since the (future) implementation of split commit-graphs would have all commit-parent edges pointing down into the base and not from the base into the split, this will continue to satisfy our reachability index property. As for the cost: we need to do a topo-order walk and then a quick minimum-value check across all parents. This is O(N), and the constant is small. This may be a cost worth paying. (Note: The table as read in https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51ef...@gmail.com/ seems line broken, using gmails web interface is not good for ASCII art and patches, git-send-email would fare better) Sorry for that. I copy-pasted into Thunderbird. Hopefully most users can redirect to the GitHub-rendered markdown if they have trouble. * Corrected commit-date is locally-computable and immutable, but not backwards-compatible. How are these dates not backwards incompatible? We don't have to expose these dates to the user, but - just like generation numbers - could store them and use them but not tell the user about it. We would need to be really careful to not expose them at all as they look like the real dates, so that could make for an awkward bug report. By "backwards-compatible" I mean that we could store them in the commit-graph file without breaking existing clients (or by introducing extra data). We could easily add these corrected commit dates as an optional chunk, but that adds 8 bytes per commit, and we already use 8 bytes for the (generation, date) pairs. The solution I made in my prototype is to replace the generation number with an offset for how much to add to the commit-date to get the corrected commit-date. However, an old client would interpret these offsets as generations and have incorrect output. A
Re: [RFC] Generation Number v2
> Based on the performance results alone, we should remove minimum > generation numbers, (epoch, date) pairs, and FELINE index from > consideration. There are enough examples of these indexes performing > poorly. > > In contrast, maximum generation numbers and corrected commit > dates both performed quite well. They are frequently the top > two performing indexes, and rarely significantly different. > > The trade-off here now seems to be: which _property_ is more important, > locally-computable or backwards-compatible? > > * Maximum generation number is backwards-compatible but not >locally-computable or immutable. These maximum generation numbers sound like the reverse of the generation numbers as they are currently implemented, i.e. we count all commits between the commit A and all heads. How would this impact creation of a commit? The current generation numbers can be lazily updated or not updated at all. In my understanding of the maximum generation numbers, a new commit would make these maximum generation numbers invalid (i.e. they have to be recomputed). Are there ways out by deferring the computation of maximum generation numbers while still being able to work with some commits that are un-accounted for? When recomputing these numbers, the current generation number (V0) has the property that already existing numbers can be re-used as-is. We only need to compute the numbers for new commits, and then insert this to the appropriate data structures (which is a separate problem, one could imagine a split generation numbers file like the split index) For the V2 maximum generation numbers, would we need to rewrite the numbers for all commits once we recompute them? Assuming that is true, it sounds like the benchmark doesn't cover the whole costs associated with V2, which is why the exceptional performance can be explained. (Note: The table as read in https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51ef...@gmail.com/ seems line broken, using gmails web interface is not good for ASCII art and patches, git-send-email would fare better) > > * Corrected commit-date is locally-computable and immutable, >but not backwards-compatible. How are these dates not backwards incompatible? We don't have to expose these dates to the user, but - just like generation numbers - could store them and use them but not tell the user about it. We would need to be really careful to not expose them at all as they look like the real dates, so that could make for an awkward bug report. The approach of "corrected commit date" sounds like we could have a very lazy approach, i.e. no extra data structures needed for many commits (as the corrected date equals the real date) and only need to store the corrections for some commits. Such an approach however would not make it easy to know if we operate on corrected dates, or if we even checked them for correctness before. So if we'd have an additional row in the generation numbers file telling the corrected date, then we should be able to be backwards compatible?
[RFC] Generation Number v2
We've discussed in several places how to improve upon generation numbers. This RFC is a report based on my investigation into a few new options, and how they compare for Git's purposes on several existing open-source repos. You can find this report and the associated test scripts at https://github.com/derrickstolee/gen-test. I have some more work to do in order to make this fully reproducible (including a single script to clone all of the repos I used, compute the commit-graphs using the different index versions, and run the tests). TL;DR: I recommend we pursue one of the following two options: 1. Maximum Generation Number. 2. Corrected Commit Date. Each of these perform well, but have different implementation drawbacks. I'd like to hear what everyone thinks about those drawbacks. Please also let me know about any additional tests that I could run. Now that I've got a lot of test scripts built up, I can re-run the test suite pretty quickly. Thanks, -Stolee Generation Number Performance Test == Git uses the commit history to answer many basic questions, such as computing a merge-base. Some of these algorithms can benefit from a _reachability index_, which is a condition that allows us to answer "commit A cannot reach commit B" in some cases. These require pre- computing some values that we can load and use at run-time. Git already has a notion of _generation number_, stores it in the commit- graph file, and uses it in several reachability algorithms. You can read more about generation numbers and how to use them in algorithms in [this blog post](https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/). However, [generation numbers do not always improve our algorithms](https://public-inbox.org/git/pull.28.git.gitgitgad...@gmail.com/T/#u). Specifically, some algorithms in Git already use commit date as a heuristic reachability index. This has some problems, though, since commit date can be incorrect for several reasons (clock skew between machines, purposefully setting GIT_COMMIT_DATE to the author date, etc.). However, the speed boost by using commit date as a cutoff was so important in these cases, that the potential for incorrect answers was considered acceptable. When these algorithms were converted to use generation numbers, we _added_ the extra constraint that the algorithms are _never incorrect_. Unfortunately, this led to some cases where performance was worse than before. There are known cases where `git merge-base A B` or `git log --topo-order A..B` are worse when using generation numbers than when using commit dates. This report investigates four replacements for generation numbers, and compares the number of walked commits to the existing algorithms (both using generation numbers and not using them at all). We can use this data to make decisions for the future of the feature. ### Implementation The reachability indexes below are implemented in [the `reach-perf` branch in https://github.com/derrickstolee/git](https://github.com/derrickstolee/git/tree/reach-perf). This implementation is in a very rough condition, as it is intended to be a prototype, not production-quality. Using this implementation, you can compute commit-graph files for the specified reachability index using `git commit-graph write --reachable --version=`. The `git` client will read the version from the file, so our tests store each version as `.git/objects/info/commit-graph.` and copy the necessary file to `.git/objects/info/commit-graph` before testing. The algorithms count the number of commits walked, as to avoid the noise that would occur with time-based performance reporting. We use the (in progress) trace2 library for this. To find the values reported, use the `GIT_TR2_PERFORMANCE` environment variable. To ignore reachability indexes entirely and use the old algorithms (reported here as "OLD" values) use the environment variable `GIT_TEST_OLD_PAINT=1`. Reachability Index Versions --- **V0: (Minimum) Generation Number.** The _generation number_ of a commit is exactly one more than the maximum generation number of a parent (by convention, the generation number of a commit with no parents is 1). This is the definition currently used by Git (2.19.0 and later). Given two commits A and B, we can then use the following reachability index: If gen(A) < gen(B), then A cannot reach B. _Commentary:_ One issue with generation numbers is that some algorithms in Git use commit date as a heuristic, and allow incorrect answers if there is enough clock skew. Using that heuristic, the algorithms can walk fewer commits than the algorithms using generation number. The other reachability indexes introduced below attempt to fix this problem. **V1: (Epoch, Date) Pairs.** For each commit, store a pair of values called the _epoch_ and the _date_. The date is the normal commit date. The _epoch_ of a commit is