Re: Question about the ahead-behind computation and status
On 12/15/2017 1:30 PM, Junio C Hamano wrote: Derrick Stoleewrites: The biggest reason for the 20 seconds is not just the number of commits in the ahead/behind but how many commits are walked (including common to both branches) before paint_down_to_common() breaks its while loop due to queue_has_nonstale(). Hmm, queue_has_nonstale() looks to see if any element is not STALE (where the definition of STALE is "known to be a common ancestor") by potentially checking all elements in the queue. I wonder if we have an opportunity for a trivial optimization? When the caller knows that it dug one level and added the parents that are not stale, it does not have to ask queue_has_nonstale() if there is any non stale element, for example. I thought this, too, but my tracing did not show significant time spent in this method. 99% of the time is spent unzipping and parsing commits. If this was taking too long, then we could track a minimum timestamp for a commit that entered the queue in a non-stale state, but this will delay the termination condition a bit since commits can be marked stale after they enter the queue. What do you exactly mean by "not just the number of commits in the ahead/behind"? Aren't the number of these commits pretty much proportional to the number of commits we need to paint down to common ancestors? Is the latter a lot larger than the former (i.e. are we somehow not stopping when we _could_ notice that we can with better information)? With the wide history, there is actually a large set of commits that are in the common history but have newer commit times than the oldest commit in only one history. Consider the following ASCII art: A | 1 | 2 | 3 |\ 4 B |\| 5 C |\| 6 D |\| . . . |\| N Z |/ 0 Between A and B, A is ahead by commits {A,1,2,3,4,5,6,...,N}. Meanwhile, commits B,C,D,...,Z are in the common history, but still must be walked. Now imagine these two sets are actually much MUCH wider (thousands of commits that are pairwise non-ancestors). This causes the number of walked commits to be larger than just the number of commits in the symmetric difference of the histories. Unfortunately, generation numbers are not the only optimization needed to make this call be sub-100ms. A graph data structure that lists the edges between commits would prevent the time spent in zlib decompressing and parsing commits. I'm working on investigating how such a data structure (and corresponding file format) could integrate in the commit-walking code. Thanks, -Stolee
Re: Question about the ahead-behind computation and status
Derrick Stoleewrites: > The biggest reason for the 20 seconds is not just the number of > commits in the ahead/behind but how many commits are walked (including > common to both branches) before paint_down_to_common() breaks its > while loop due to queue_has_nonstale(). Hmm, queue_has_nonstale() looks to see if any element is not STALE (where the definition of STALE is "known to be a common ancestor") by potentially checking all elements in the queue. I wonder if we have an opportunity for a trivial optimization? When the caller knows that it dug one level and added the parents that are not stale, it does not have to ask queue_has_nonstale() if there is any non stale element, for example. What do you exactly mean by "not just the number of commits in the ahead/behind"? Aren't the number of these commits pretty much proportional to the number of commits we need to paint down to common ancestors? Is the latter a lot larger than the former (i.e. are we somehow not stopping when we _could_ notice that we can with better information)?
Re: Question about the ahead-behind computation and status
On 12/15/2017 10:08 AM, Jeff Hostetler wrote: On 12/15/2017 5:08 AM, Jeff King wrote: On Thu, Dec 14, 2017 at 04:49:31PM -0500, Jeff Hostetler wrote: [*] Sadly, the local repo was only about 20 days out of date (including the Thanksgiving holidays) Taking 20 seconds to traverse 20 days worth of history sounds pretty awful. How many commits is it? 150,000 commits, give or take. The graph is many thousands of lanes wide because of the merges and that isn't helping. (But I should give you folks lots of credit -- it *only* took 20 seconds to find, unzip and decode 150,000 commit objects and walk the commit graph.) To give a few more data points, I created similar situation by checking out a big repo I hadn't updated in three months and it was 16,000 commits behind. That took 1.5s to calculate the ahead/behind. Moving it 100,000 commits behind it took 5s. This repo has about 300,000 total commits versus the 1.5 million commits in the Windows repo. The biggest reason for the 20 seconds is not just the number of commits in the ahead/behind but how many commits are walked (including common to both branches) before paint_down_to_common() breaks its while loop due to queue_has_nonstale(). Thanks, -Stolee
Re: Question about the ahead-behind computation and status
On 12/15/2017 5:08 AM, Jeff King wrote: On Thu, Dec 14, 2017 at 04:49:31PM -0500, Jeff Hostetler wrote: I don't want to jump into the graph algorithm at this time, but was wondering about adding a --no-ahead-behind flag (or something similar or a config setting) that would disable the a/b computation during status. For status V2 output, we could omit the "# branch.ab x y" line. For normal status output, change the prose a/b message to say something like "are [not] up to date". Is it actually "status --porcelain=v2" that you care about here? Or is it normal "git status"? My primary concern is for porcelain V2 because Visual Studio uses --branch with V2 to get the other branch details with one process rather than two on inotify events. Fixing/Helping normal "git status" would be nice for interactive users. Do you care about seeing the branch at all? I.e., would "--no-branch" be a viable option (though I notice it seems to be a silent noop with the long-form, which should probably be fixed). VS does not use the a/b counts at all, so it is just overhead. VS does use some of the other branch fields, so it would be nice to still have them. It would be nice to minimize changes to the user experience. Not printing anything about the a/b in the normal/short/long forms might give the user a false sense of security. I was thinking if we changed the message to be "you are [not] up to date", they would still know they needed to push. As it is, at this scale, having the message give an exact count just isn't that useful -- who cares if you are 150,000 or 149,000 behind. Maybe we could do something like GCC and just give up at 100 and print "you are 100+ ...". If you really do want to see all branch details but just skip the ahead/behind, then yeah, I'd agree that adding an option to do so makes sense. Is "status" the only command that needs it? I think we do it unconditionally with "git checkout", too. ok. thanks. i'll look at doing that. Yeah, "checkout" and "branch -vv" could use it too. I haven't looked yet to see anywhere else we might want it, but these can wait. [*] Sadly, the local repo was only about 20 days out of date (including the Thanksgiving holidays) Taking 20 seconds to traverse 20 days worth of history sounds pretty awful. How many commits is it? 150,000 commits, give or take. The graph is many thousands of lanes wide because of the merges and that isn't helping. (But I should give you folks lots of credit -- it *only* took 20 seconds to find, unzip and decode 150,000 commit objects and walk the commit graph.) For a true solution, I think we want to wait for client-side generation numbers before we spend any time looking at the algorithm. Thanks, Jeff
Re: Question about the ahead-behind computation and status
On Thu, Dec 14, 2017 at 04:49:31PM -0500, Jeff Hostetler wrote: > I don't want to jump into the graph algorithm at this time, > but was wondering about adding a --no-ahead-behind flag (or > something similar or a config setting) that would disable > the a/b computation during status. > > For status V2 output, we could omit the "# branch.ab x y" > line. For normal status output, change the prose a/b > message to say something like "are [not] up to date". Is it actually "status --porcelain=v2" that you care about here? Or is it normal "git status"? Do you care about seeing the branch at all? I.e., would "--no-branch" be a viable option (though I notice it seems to be a silent noop with the long-form, which should probably be fixed). If you really do want to see all branch details but just skip the ahead/behind, then yeah, I'd agree that adding an option to do so makes sense. Is "status" the only command that needs it? I think we do it unconditionally with "git checkout", too. > [*] Sadly, the local repo was only about 20 days out of > date (including the Thanksgiving holidays) Taking 20 seconds to traverse 20 days worth of history sounds pretty awful. How many commits is it? -Peff