Re: Antw: Re: non-smooth progress indication for git fsck and git gc

2018-08-21 Thread Duy Nguyen
On Tue, Aug 21, 2018 at 3:13 AM Jeff King  wrote:
> I _think_ they should work together OK without further modification.
> Once upon a time the caller had to say "don't show if we're past N%
> after M seconds", but I think with the current code we'd just show it if
> we're not completely finished after 2 seconds.
>
> So it really should just be a simple:
>
>   progress = start_delayed_progress("Hashing packfile", 0);
>
> That said, counting bytes would probably be ugly (just because the
> numbers get really big). We'd probably prefer to show a throughput or
> something.

Or just an ascii throbber. I think the important thing is communicate
"I'm still doing something, not hanging up". "Hashing packfile"
satisfies the "something" part, the throbber the "hanging".
-- 
Duy


Re: Antw: Re: non-smooth progress indication for git fsck and git gc

2018-08-21 Thread Ulrich Windl
>>> Jeff King  schrieb am 21.08.2018 um 03:07 in Nachricht
<20180821010712.ga32...@sigill.intra.peff.net>:
> On Mon, Aug 20, 2018 at 10:57:13AM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
[...]
> So it really should just be a simple:
> 
>   progress = start_delayed_progress("Hashing packfile", 0);
> 
> That said, counting bytes would probably be ugly (just because the
> numbers get really big). We'd probably prefer to show a throughput or

Hi!

With big numbers you could apply scaling (the usual ISO suffixes), but after
having read GBs, there wouldn't be much change visible unless you add
significant fractional digits.
So scaling seems good for absolute numbers, but not for growing numbers. Of
course one could recursively scale the fractional parts again, but then the
result will be lengthy again, then.

Regards,
Ulrich

> something. And as you noted, there's some refactoring to be done with
> pack-check for it to show multiple progress meters.
> 
> (I still think in the long run we would want to scrap that code, but
> that's a much bigger job; I'm fine if somebody wants to do incremental
> improvements in the meantime).
> 
> -Peff





Re: Antw: Re: non-smooth progress indication for git fsck and git gc

2018-08-20 Thread Jeff King
On Mon, Aug 20, 2018 at 10:57:13AM +0200, Ævar Arnfjörð Bjarmason wrote:

> > That seems to apply. BTW: Is there a way go get some repository statistics
> > like a histogram of object sizes (or whatever that might be useful to help
> > making decisions)?
> 
> The git-sizer program is really helpful in this regard:
> https://github.com/github/git-sizer

Yeah, I'd very much agree that's the best tool for a general overview of
the repository stats.

Ulrich, if you want to more analysis (like a histogram), probably
something like:

  git cat-file --batch-all-objects --batch-check='%(objectsize)'

might be a good starting place to dump information (see the "BATCH
OUTPUT" section of "git help cat-file" for more format items).

> > If it's sorting, maybe add some code like (wild guess):
> >
> > if (objects_to_sort > magic_number)
> >message("Sorting something...");
> 
> I think a good solution to these cases is to just introduce something to
> the progress.c mode where it learns how to display a counter where we
> don't know what the end-state will be. Something like your proposed
> magic_number can just be covered under the more general case where we
> don't show the progress bar unless it's been 1 second (which I believe
> is the default).

Yeah.  We already have open-ended counters (e.g., "counting objects"),
and delayed meters (we actually normalized the default to 2s recently).

I _think_ they should work together OK without further modification.
Once upon a time the caller had to say "don't show if we're past N%
after M seconds", but I think with the current code we'd just show it if
we're not completely finished after 2 seconds.

So it really should just be a simple:

  progress = start_delayed_progress("Hashing packfile", 0);

That said, counting bytes would probably be ugly (just because the
numbers get really big). We'd probably prefer to show a throughput or
something. And as you noted, there's some refactoring to be done with
pack-check for it to show multiple progress meters.

(I still think in the long run we would want to scrap that code, but
that's a much bigger job; I'm fine if somebody wants to do incremental
improvements in the meantime).

-Peff


Re: Antw: Re: non-smooth progress indication for git fsck and git gc

2018-08-20 Thread Ulrich Windl
Hi!

Here are some stats from the repository. First the fast import ones (which had
good performance, but probably all cached, also):

% git fast-import <../git-stream
/usr/lib/git/git-fast-import statistics:
-
Alloc'd objects:  55000
Total objects:51959 (56 duplicates  )
  blobs  :47991 ( 0 duplicates  0 deltas of   
  0 attempts)
  trees  : 3946 (56 duplicates994 deltas of  
3929 attempts)
  commits:   22 ( 0 duplicates  0 deltas of   
  0 attempts)
  tags   :0 ( 0 duplicates  0 deltas of   
  0 attempts)
Total branches:  15 (15 loads )
  marks:1048576 ( 48013 unique)
  atoms:  43335
Memory total:  9611 KiB
   pools:  7033 KiB
 objects:  2578 KiB
-
pack_report: getpagesize()=   4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit  = 8589934592
pack_report: pack_used_ctr=   1780
pack_report: pack_mmap_calls  = 23
pack_report: pack_open_windows=  1 /  1
pack_report: pack_mapped  =2905751 /2905751
-

Then the output from git-sizer:

Processing blobs: 47991
Processing trees: 3946
Processing commits: 22
Matching commits to trees: 22
Processing annotated tags: 0
Processing references: 15
| Name | Value | Level of concern   |
|  | - | -- |
| Overall repository size  |   ||
| * Blobs  |   ||
|   * Total size   |  13.7 GiB | *  |
|  |   ||
| Biggest objects  |   ||
| * Trees  |   ||
|   * Maximum entries  [1] |  13.4 k   | *  |
| * Blobs  |   ||
|   * Maximum size [2] |   279 MiB | *  |
|  |   ||
| Biggest checkouts|   ||
| * Maximum path depth [3] |10 | *  |
| * Maximum path length[3] |   130 B   | *  |
| * Total size of files[3] |  8.63 GiB | *  |

[1]  b701345cbd4317276568b9d9890fd77f38933bdc
(refs/heads/master:Resources/default/CIFP)
[2]  19f54c4a7595667329c1be23200234f0fe50ac56
(refs/heads/master:Resources/default/apt.dat)
[3]  b0e3d3a2b7f2504117408f13486c905a8eb8fb1e (refs/heads/master^{tree})

Some notes:
[1] is a directory with many short (typically < 1kB) text files
[2] is a very large text file with many changes
[3] Yes, some paths are long

Regards,
Ulrich


>>> Ævar Arnfjörð Bjarmason  schrieb am 20.08.2018 um 10:57
in
Nachricht <87woslpg9i@evledraar.gmail.com>:

> On Mon, Aug 20 2018, Ulrich Windl wrote:
> 
> Jeff King  schrieb am 16.08.2018 um 22:55 in Nachricht
>> <20180816205556.ga8...@sigill.intra.peff.net>:
>>> On Thu, Aug 16, 2018 at 10:35:53PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>>
 This is all interesting, but I think unrelated to what Ulrich is talking
 about. Quote:

 Between the two phases of "git fsck" (checking directories and
 checking objects) there was a break of several seconds where no
 progress was indicated

 I.e. it's not about the pause you get with your testcase (which is
 certainly another issue) but the break between the two progress bars.
>>>
>>> I think he's talking about both. What I said responds to this:
>>
>> Hi guys!
>>
>> Yes, I was wondering what git does between the two visible phases, and 
> between
>> the lines I was suggesting another progress message between those phases.
At
>> least the maximum unspecific three-dot-message "Thinking..." could be 
> displayed
>> ;-) Of course anything more appropriate would be welcome.
>> Also that message should only be displayed if it's foreseeable that the
>> operation will take significant time. In my case (I just repeated it a few
>> minutes ago) the delay is significant (at least 10 seconds). As noted 
> earlier I
>> was hoping to capture the timing in a screencast, but it seems 

Re: Antw: Re: non-smooth progress indication for git fsck and git gc

2018-08-20 Thread Ævar Arnfjörð Bjarmason


On Mon, Aug 20 2018, Ulrich Windl wrote:

 Jeff King  schrieb am 16.08.2018 um 22:55 in Nachricht
> <20180816205556.ga8...@sigill.intra.peff.net>:
>> On Thu, Aug 16, 2018 at 10:35:53PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>>> This is all interesting, but I think unrelated to what Ulrich is talking
>>> about. Quote:
>>>
>>> Between the two phases of "git fsck" (checking directories and
>>> checking objects) there was a break of several seconds where no
>>> progress was indicated
>>>
>>> I.e. it's not about the pause you get with your testcase (which is
>>> certainly another issue) but the break between the two progress bars.
>>
>> I think he's talking about both. What I said responds to this:
>
> Hi guys!
>
> Yes, I was wondering what git does between the two visible phases, and between
> the lines I was suggesting another progress message between those phases. At
> least the maximum unspecific three-dot-message "Thinking..." could be 
> displayed
> ;-) Of course anything more appropriate would be welcome.
> Also that message should only be displayed if it's foreseeable that the
> operation will take significant time. In my case (I just repeated it a few
> minutes ago) the delay is significant (at least 10 seconds). As noted earlier 
> I
> was hoping to capture the timing in a screencast, but it seems all the delays
> were just optimized away in the recording.
>
>>
>>> >> During "git gc" the writing objects phase did not update for some
>>> >> seconds, but then the percentage counter jumped like from 15% to 42%.
>>
>> But yeah, I missed that the fsck thing was specifically about a break
>> between two meters. That's a separate problem, but also worth
>> discussing (and hopefully much easier to address).
>>
>>> If you fsck this repository it'll take around (on my spinning rust
>>> server) 30 seconds between 100% of "Checking object directories" before
>>> you get any output from "Checking objects".
>>>
>>> The breakdown of that is (this is from approximate eyeballing):
>>>
>>>  * We spend 1-3 seconds just on this:
>>>
>>
> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack
>
>> -check.c#L181
>>
>> OK, so that's checking the sha1 over the .idx file. We could put a meter
>> on that. I wouldn't expect it to generally be all that slow outside of
>> pathological cases, since it scales with the number of objects (and 1s
>> is our minimum update anyway, so that might be OK as-is). Your case has
>> 13M objects, which is quite large.
>
> Sometimes an oldish CPU could bring performance surprises, maybe. Anyway my
> CPU is question is an AMD Phenom2 quad-core with 3.2GHz nominal, and there is 
> a
> classic spinning disk with 5400RPM built in...
>
>>
>>>  * We spend the majority of the ~30s on this:
>>>
>>
> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack
>
>> -check.c#L70-L79
>>
>> This is hashing the actual packfile. This is potentially quite long,
>> especially if you have a ton of big objects.
>
> That seems to apply. BTW: Is there a way go get some repository statistics
> like a histogram of object sizes (or whatever that might be useful to help
> making decisions)?

The git-sizer program is really helpful in this regard:
https://github.com/github/git-sizer

>>
>> I wonder if we need to do this as a separate step anyway, though. Our
>> verification is based on index-pack these days, which means it's going
>> to walk over the whole content as part of the "Indexing objects" step to
>> expand base objects and mark deltas for later. Could we feed this hash
>> as part of that walk over the data? It's not going to save us 30s, but
>> it's likely to be more efficient. And it would fold the effort naturally
>> into the existing progress meter.
>>
>>>  * Wes spend another 3-5 seconds on this QSORT:
>>>
>>
> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack
>
>> -check.c#L105
>>
>> That's a tough one. I'm not sure how we'd count it (how many compares we
>> do?). And each item is doing so little work that hitting the progress
>> code may make things noticeably slower.
>
> If it's sorting, maybe add some code like (wild guess):
>
> if (objects_to_sort > magic_number)
>message("Sorting something...");

I think a good solution to these cases is to just introduce something to
the progress.c mode where it learns how to display a counter where we
don't know what the end-state will be. Something like your proposed
magic_number can just be covered under the more general case where we
don't show the progress bar unless it's been 1 second (which I believe
is the default).

>>
>> Again, your case is pretty big. Just based on the number of objects,
>> linux.git should be 1.5-2.5 seconds on your machine for the same
>> operation. Which I think may be small enough to ignore (or even just
>> print a generic before/after). It's really the 30s packfile hash that's
>> making the whole thing so terrible.
>>
>> -Peff