Re: [PATCH] gc: do not warn about too many loose objects
On Wed, Jul 18, 2018 at 03:11:38PM +0200, Ævar Arnfjörð Bjarmason wrote: > > Yeah, I agree that deferring repeated gc's based on time is going to run > > into pathological corner cases. OTOH, what we've found at GitHub is that > > "gc --auto" is quite insufficient for scheduling maintenance anyway > > (e.g., if a machine gets pushes to 100 separate repositories in quick > > succession, you probably want to queue and throttle any associated gc). > > I'm sure you have better solutions for this at GitHub, but as an aside > it might be interesting to add some sort of gc flock + retry setting for > this use-case, i.e. even if you had 100 concurrent gc's due to > too_many_*(), they'd wait + retry until they could get the flock on a > given file. > > Then again this is probably too specific, and could be done with a > pre-auto-gc hook too.. Yeah, I think any multi-repo solution is getting way too specific for Git, and the best thing we can do is provide a hook. I agree you could probably do this today with a pre-auto-gc hook (if it skips gc it would just queue itself and return non-zero). Or even just make a mark in a database that says "there was some activity here". Since we have so much other infrastructure sitting between the user and Git anyway, we do that marking at a separate layer which is already talking to a database. ;) Anyway, I do agree with your general notion that this isn't the right approach for many situations, and auto-gc is a better solution for many cases. -Peff
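Peff's aside about a gc flock + retry can be prototyped entirely outside git. Below is a minimal sketch, assuming util-linux flock(1) is available and a host-wide lock path is writable; the function name and the GC_LOCK_* variables are made up for illustration. If this logic lived in a pre-auto-gc hook instead, it would need to run plain `git gc` (not `git gc --auto`) under the lock, since `gc --auto` would re-run the hook.

```shell
# Serialize gc across all repositories on one host: take a shared lock
# (waiting up to GC_LOCK_WAIT seconds) and run gc for one repo under it,
# so 100 repos pushed to in quick succession gc one at a time.
serialized_gc() {
	repo=$1
	lock=${GC_LOCK_FILE:-/tmp/git-gc-host.lock}
	if flock -w "${GC_LOCK_WAIT:-300}" "$lock" git -C "$repo" gc --quiet
	then
		echo "gc done: $repo"
	else
		echo "gc skipped or failed: $repo" >&2
		return 1
	fi
}
```

A queueing daemon could of course do this more cleverly; the point is only that the "wait + retry for the lock" behavior needs nothing from git itself.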
Re: [PATCH] gc: do not warn about too many loose objects
On Tue, Jul 17 2018, Jeff King wrote: > On Tue, Jul 17, 2018 at 10:59:50AM +0200, Ævar Arnfjörð Bjarmason wrote: > >> That doesn't make sense to me. Just because git itself is happily >> ignoring the exit code I don't think that should mean there shouldn't be >> a meaningful exit code. Why don't you just ignore the exit code in the >> repo tool as well? >> >> Now e.g. automation I have to see if git-gc --auto is having issues >> can't just be 'git gc --auto || echo fail @ {hostname}' across a fleet >> of servers, but will need to start caring if stderr was emitted to. > > If you're daemonizing gc, though, there are a number of cases where the > exit code is not propagated. If you really care about the outcome, > you're probably better off either: In theory a lot of the other stuff can fail, but in practice I've only seen it get stuck due to running out of disk space (easily detected in other ways), and because of having too many loose objects (e.g. due to, but not limited to https://public-inbox.org/git/87fu6bmr0j@evledraar.gmail.com/). > - doing synchronous gc's, which will still return a meaningful code > after Jonathan's patches (Reply to this and "we've found at GitHub..." later in your mail) I'm targeting clients (dev machines, laptops, staging boxes) which have clones of repos, some put in place by automation, some manually cloned by users in ~. So it's much easier to rely on shipping a /etc/gitconfig than setting gc.auto=0 and having some system-wide job that goes and hunts for local git repos to manually gc. It's also a big advantage that it's driven by user activity, because it's an implicit guard of only_do_this_if_the_user_is_active_here() before "git-gc" on a fleet of machines, which when some of those get their disk space from shared NFS resources cuts down on overall I/O. > - inspecting the log yourself. I know that comes close to the un-unixy > stderr thing.
But in this case, the presence of a non-empty log is > literally the on-disk bit for "the previous run failed". And doing > so catches all of the daemonized cases, even the "first one" that > you'd miss by paying attention to the immediate exit code. > > This will treat the zero-exit-code "warning" case as an error. I'm > not against propagating the true original error code, if somebody > wants to work on it. But I think Jonathan's patches are a strict > improvement over the current situation, and a patch to propagate > could come on top. Yeah, I can check gc.log; if others are of a different opinion about this #3 case, at the end of the day I don't mind if it returns 0. It just doesn't make any conceptual sense to me. >> I think you're conflating two things here in a way that (if I'm reading >> this correctly) produces a pretty bad regression for users. >> >> a) If we have something in the gc.log we keep yelling at users until we >> reach the gc.logExpiry, this was the subject of my thread back in >> January. This sucks, and should be fixed somehow. >> >> Maybe we should only emit the warning once, e.g. creating a >> .git/gc.log.wasemitted marker and as long as its ctime is later than >> the mtime on .git/gc.log we don't emit it again). >> >> But that also creates the UX problem that it's easy to miss this >> message in the middle of some huge "fetch" output. Perhaps we should >> just start emitting this as part of "git status" or something (or >> solve the root cause, as Peff notes...). > > I kind of like that "git status" suggestion. Of course many users run > "git status" more often than "git commit", so it may actually spam them > more! Maybe piggy-backing on the advice facility ... >> b) We *also* use this presence of a gc.log as a marker for "we failed >> too recently, let's not try again until after a day".
>> >> The semantics of b) are very useful, and just because they're tied up >> with a) right now, let's not throw out b) just because we're trying to >> solve a). > > Yeah. In general my concern was the handling of (b), which I think this > last crop of patches is fine with. As far as showing the message > repeatedly or not, I don't have a strong opinion (it could even be > configurable, once it's split from the "marker" behavior). > >> Right now one thing we do right is we always try to look at the actual >> state of the repo with too_many_packs() and too_many_loose_objects(). >> >> We don't assume the state of your repo hasn't drastically changed >> recently. That means that we do the right thing and gc when the repo >> needs it, not just because we GC'd recently enough. >> >> With these proposed semantics we'd skip a needed GC (potentially for >> days, depending on the default) just because we recently ran one. > > Yeah, I agree that deferring repeated gc's based on time is going to run > into pathological corner cases. OTOH, what we've found at GitHub is that > "gc --auto" is quite insufficient for scheduling maintenance
Re: [PATCH] gc: do not warn about too many loose objects
On Tue, Jul 17, 2018 at 10:59:50AM +0200, Ævar Arnfjörð Bjarmason wrote: > That doesn't make sense to me. Just because git itself is happily > ignoring the exit code I don't think that should mean there shouldn't be > a meaningful exit code. Why don't you just ignore the exit code in the > repo tool as well? > > Now e.g. automation I have to see if git-gc --auto is having issues > can't just be 'git gc --auto || echo fail @ {hostname}' across a fleet > of servers, but will need to start caring if stderr was emitted to. If you're daemonizing gc, though, there are a number of cases where the exit code is not propagated. If you really care about the outcome, you're probably better off either: - doing synchronous gc's, which will still return a meaningful code after Jonathan's patches - inspecting the log yourself. I know that comes close to the un-unixy stderr thing. But in this case, the presence of a non-empty log is literally the on-disk bit for "the previous run failed". And doing so catches all of the daemonized cases, even the "first one" that you'd miss by paying attention to the immediate exit code. This will treat the zero-exit-code "warning" case as an error. I'm not against propagating the true original error code, if somebody wants to work on it. But I think Jonathan's patches are a strict improvement over the current situation, and a patch to propagate could come on top. > I think you're conflating two things here in a way that (if I'm reading > this correctly) produces a pretty bad regression for users. > > a) If we have something in the gc.log we keep yelling at users until we > reach the gc.logExpiry, this was the subject of my thread back in > January. This sucks, and should be fixed somehow. > > Maybe we should only emit the warning once, e.g. creating a > .git/gc.log.wasemitted marker and as long as its ctime is later than > the mtime on .git/gc.log we don't emit it again).
> > But that also creates the UX problem that it's easy to miss this > message in the middle of some huge "fetch" output. Perhaps we should > just start emitting this as part of "git status" or something (or > solve the root cause, as Peff notes...). I kind of like that "git status" suggestion. Of course many users run "git status" more often than "git commit", so it may actually spam them more! > b) We *also* use this presence of a gc.log as a marker for "we failed > too recently, let's not try again until after a day". > > The semantics of b) are very useful, and just because they're tied up > with a) right now, let's not throw out b) just because we're trying to > solve a). Yeah. In general my concern was the handling of (b), which I think this last crop of patches is fine with. As far as showing the message repeatedly or not, I don't have a strong opinion (it could even be configurable, once it's split from the "marker" behavior). > Right now one thing we do right is we always try to look at the actual > state of the repo with too_many_packs() and too_many_loose_objects(). > > We don't assume the state of your repo hasn't drastically changed > recently. That means that we do the right thing and gc when the repo > needs it, not just because we GC'd recently enough. > > With these proposed semantics we'd skip a needed GC (potentially for > days, depending on the default) just because we recently ran one. Yeah, I agree that deferring repeated gc's based on time is going to run into pathological corner cases. OTOH, what we've found at GitHub is that "gc --auto" is quite insufficient for scheduling maintenance anyway (e.g., if a machine gets pushes to 100 separate repositories in quick succession, you probably want to queue and throttle any associated gc). But there are probably more mid-size sites that are big enough to have corner cases, but not so big that "gc --auto" is hopeless. ;) -Peff
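The "inspect the log yourself" option is easy to script, since a non-empty gc.log is, as described here, the on-disk bit for "the previous run failed". A minimal sketch: the helper name is made up, and the git directory (from `git rev-parse --git-dir`) is passed in explicitly so the check itself needs nothing but the filesystem:

```shell
# last_gc_failed GITDIR: succeed (exit 0) when the previous daemonized
# "gc --auto" left a non-empty gc.log behind, i.e. it failed or warned.
last_gc_failed() {
	test -s "$1/gc.log"
}

# Fleet-style usage, in the spirit of 'git gc --auto || echo fail':
#   if last_gc_failed "$(git rev-parse --git-dir)"; then
#           echo "gc stuck on this host" >&2
#   fi
```

Note that, as Peff says, this treats the zero-exit-code "warning" case the same as a hard failure.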
Re: [PATCH] gc: do not warn about too many loose objects
Jonathan Nieder writes: > That doesn't really solve the problem: > > 1. "gc --auto" would produce noise *every time it runs* until gc.log > is removed, for example via expiry > > 2. "gc --auto" would not do any garbage collection until gc.log is > removed, for example by expiry > > 3. "gc --auto" would still not ratelimit itself, for example when > there are a large number of loose unreachable objects that is not > large enough to hit the loose object threshold. > > but maybe it's better than the present state. > > To solve (1) and (2), we could introduce a gc.warnings file that > behaves like gc.log except that it only gets printed once and then > self-destructs, allowing gc --auto to proceed. That makes it not rate-limit, no? > To solve (3), we could > introduce a gc.lastrun file that is touched whenever "gc --auto" runs > successfully and make "gc --auto" a no-op for a while after each run. Does absolute time-interval make sense here? Some repositories may be busily churning new objects and for them 5-minute interval may be too infrequent, while other repositories may be relatively slow and once a day may be sufficient for them. I wonder if we can somehow auto-tune this. > To avoid wasteful repeated fruitless gcs, when gc.log is present, the > subsequent "gc --auto" would die after print its contents. Most git s/print/&ing/ > commands, such as "git fetch", ignore the exit status from "git gc > --auto" so all is well at this point: the user gets to see the error > message, and the fetch succeeds, without a wasteful additional attempt > at an automatic gc. The above, by saying "Most git commands", leaves readers wanting to know "then what are minority git commands that do not correctly do so?" If you are not going to answer that question, it probably is better not to even say "Most". > External tools like repo[1], though, do care about the exit status > from "git gc --auto".
In non-daemonized mode, the exit status is > straightforward: if there is an error, it is nonzero, but after a > warning like the above, the status is zero. The daemonized mode, as a > side effect of the other properties provided, offers a very strange > exit code convention: > > - if no housekeeping was required, the exit status is 0 OK. > - the first real run, after forking into the background, returns exit >status 0 unconditionally. The parent process has no way to know >whether gc will succeed. Understandable; allowing the parent to exit before we know the outcome of the gc is the whole point of backgrounding. > - if there is any diagnostic output in gc.log, subsequent runs return >a nonzero exit status to indicate that gc was not triggered. This is unfortunate. > There's nothing for the calling program to act on on the basis of that > error. Use status 0 consistently instead, to indicate that we decided > not to run a gc (just like if no housekeeping was required). s/.$/ in the last case./ And I fully agree with the reasoning. > This > way, repo and similar tools can get the benefit of the same behavior > as tools like "git fetch" that ignore the exit status from gc --auto. > > Once the period of time described by gc.pruneExpire elapses, the > unreachable loose objects will be removed by "git gc --auto" > automatically. > > [1] https://gerrit-review.googlesource.com/c/git-repo/+/10598/ > > Reported-by: Andrii Dehtiarov > Helped-by: Jeff King > Signed-off-by: Jonathan Nieder > --- > Documentation/config.txt | 3 ++- > builtin/gc.c | 16 +++- > t/t6500-gc.sh| 6 +++--- > 3 files changed, 16 insertions(+), 9 deletions(-) > > diff --git a/Documentation/config.txt b/Documentation/config.txt > index 1cc18a828c..5eaa4aaa7d 100644 > --- a/Documentation/config.txt > +++ b/Documentation/config.txt > @@ -1649,7 +1649,8 @@ will be repacked. After this the number of packs should > go below > gc.autoPackLimit and gc.bigPackThreshold should be respected again. 
> > gc.logExpiry:: > - If the file gc.log exists, then `git gc --auto` won't run > + If the file gc.log exists, then `git gc --auto` will print > + its content and exit with status zero instead of running > unless that file is more than 'gc.logExpiry' old. Default is > "1.day". See `gc.pruneExpire` for more ways to specify its > value. > diff --git a/builtin/gc.c b/builtin/gc.c > index 2bebc52bda..484ab21b8c 100644 > --- a/builtin/gc.c > +++ b/builtin/gc.c > @@ -438,7 +438,7 @@ static const char *lock_repo_for_gc(int force, pid_t* > ret_pid) > return NULL; > } > > -static void report_last_gc_error(void) > +static int report_last_gc_error(void) > { > struct strbuf sb = STRBUF_INIT; > ssize_t ret; > @@ -449,7 +449,7 @@ static void report_last_gc_error(void) > if (errno == ENOENT) > goto done; > > - die_errno(_("cannot stat '%s'"), gc_log_path); > + return error_errno(_
Re: [PATCH] gc: do not warn about too many loose objects
On Tue, Jul 17, 2018 at 3:53 AM Jonathan Nieder wrote: > Subject: gc: do not return error for prior errors in daemonized mode > > Some build machines started failing to fetch updated source using > "repo sync", with error > > error: The last gc run reported the following. Please correct the root cause > and remove /build/.repo/projects/tools/git.git/gc.log. > Automatic cleanup will not be performed until the file is removed. > > warning: There are too many unreachable loose objects; run 'git prune' to > remove them. > > The cause takes some time to describe. > > In v2.0.0-rc0~145^2 (gc: config option for running --auto in > background, 2014-02-08), "git gc --auto" learned to run in the > background instead of blocking the invoking command. In this mode, it > closed stderr to avoid interleaving output with any subsequent > commands, causing warnings like the above to be swallowed; v2.6.3~24^2 > (gc: save log from daemonized gc --auto and print it next time, > 2015-09-19) addressed this by storing any diagnostic output in > .git/gc.log and allowing the next "git gc --auto" run to print it. > > To avoid wasteful repeated fruitless gcs, when gc.log is present, the > subsequent "gc --auto" would die after print its contents. Most git > commands, such as "git fetch", ignore the exit status from "git gc > --auto" so all is well at this point: the user gets to see the error > message, and the fetch succeeds, without a wasteful additional attempt > at an automatic gc. > > External tools like repo[1], though, do care about the exit status > from "git gc --auto". In non-daemonized mode, the exit status is > straightforward: if there is an error, it is nonzero, but after a > warning like the above, the status is zero. 
The daemonized mode, as a > side effect of the other properties provided, offers a very strange > exit code convention: > > - if no housekeeping was required, the exit status is 0 > > - the first real run, after forking into the background, returns exit > status 0 unconditionally. The parent process has no way to know > whether gc will succeed. > > - if there is any diagnostic output in gc.log, subsequent runs return > a nonzero exit status to indicate that gc was not triggered. > > There's nothing for the calling program to act on on the basis of that > error. Use status 0 consistently instead, to indicate that we decided > not to run a gc (just like if no housekeeping was required). This > way, repo and similar tools can get the benefit of the same behavior > as tools like "git fetch" that ignore the exit status from gc --auto. This background gc thing is pretty much designed for interactive use. When it's scripted, you probably should avoid it. Then you can fork a new process by yourself and have much better control if you still want "background" gc. So an alternative here is to turn on or off gc.autodetach based on interactiveness (i.e. having tty). We can add a new (and default) value "auto" to gc.autodetach for this purpose. In automated scripts, it will always run in non-daemonized mode. You will get non-zero exit code when real errors occur. You don't worry about hanging processes. Rate limit is also thrown out in this mode if I'm not mistaken, but it could be added back and more tailored for server needs. > Once the period of time described by gc.pruneExpire elapses, the > unreachable loose objects will be removed by "git gc --auto" > automatically.
> > [1] https://gerrit-review.googlesource.com/c/git-repo/+/10598/ > > Reported-by: Andrii Dehtiarov > Helped-by: Jeff King > Signed-off-by: Jonathan Nieder > --- > Documentation/config.txt | 3 ++- > builtin/gc.c | 16 +++- > t/t6500-gc.sh| 6 +++--- > 3 files changed, 16 insertions(+), 9 deletions(-) > > diff --git a/Documentation/config.txt b/Documentation/config.txt > index 1cc18a828c..5eaa4aaa7d 100644 > --- a/Documentation/config.txt > +++ b/Documentation/config.txt > @@ -1649,7 +1649,8 @@ will be repacked. After this the number of packs should > go below > gc.autoPackLimit and gc.bigPackThreshold should be respected again. > > gc.logExpiry:: > - If the file gc.log exists, then `git gc --auto` won't run > + If the file gc.log exists, then `git gc --auto` will print > + its content and exit with status zero instead of running > unless that file is more than 'gc.logExpiry' old. Default is > "1.day". See `gc.pruneExpire` for more ways to specify its > value. > diff --git a/builtin/gc.c b/builtin/gc.c > index 2bebc52bda..484ab21b8c 100644 > --- a/builtin/gc.c > +++ b/builtin/gc.c > @@ -438,7 +438,7 @@ static const char *lock_repo_for_gc(int force, pid_t* > ret_pid) > return NULL; > } > > -static void report_last_gc_error(void) > +static int report_last_gc_error(void) > { > struct strbuf sb = STRBUF_INIT; > ssize_t ret; > @@ -449,7 +449,7 @@ static void report_last_gc_error(void) > if (err
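The interactiveness idea above is only a suggestion (no "auto" value for gc.autoDetach exists), but a caller can emulate it today by checking whether stderr is a terminal and forcing synchronous mode otherwise, so scripts get a real exit code:

```shell
# Print the gc.autoDetach value to use: daemonize only for a human at a
# terminal; run synchronously (meaningful exit code) in scripts.
gc_auto_detach() {
	if [ -t 2 ]; then echo true; else echo false; fi
}

# Usage in a wrapper script (illustrative):
#   git -c gc.autoDetach=$(gc_auto_detach) gc --auto || echo "gc failed" >&2
```

As noted, the gc.log-based rate limit does not apply to synchronous runs, so a busy automated caller may want its own throttling on top.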
Re: [PATCH] gc: do not warn about too many loose objects
On Tue, Jul 17 2018, Jonathan Nieder wrote: > Hi Ævar, > > Ævar Arnfjörð Bjarmason wrote: >> On Tue, Jul 17 2018, Jonathan Nieder wrote: > >>> That suggests a possible improvement. If all callers should be >>> ignoring the exit code, can we make the exit code in daemonize mode >>> unconditionally zero in this kind of case? >> >> That doesn't make sense to me. Just because git itself is happily >> ignoring the exit code I don't think that should mean there shouldn't be >> a meaningful exit code. Why don't you just ignore the exit code in the >> repo tool as well? > > I don't maintain the repo tool. I am just trying to improve git to > make some users' lives better. > > repo worked fine for years, until gc's daemonized mode broke it. I > don't think it makes sense for us to declare that it has been holding > git gc wrong for all that time before then. Before the sequence of commits noted at the bottom of my https://public-inbox.org/git/87inc89j38@evledraar.gmail.com/ plus a831c06a2b ("gc: ignore old gc.log files", 2017-02-10) we wouldn't do that, that's true. We'd just happily run GC again pointlessly even though it wasn't going to accomplish anything, in cases where you really did have ~>6700 loose objects that weren't going to be pruned. I don't think it makes sense to revert those patches and go back to the much more naïve (and user waiting there twiddling his thumbs while it runs) method, but I also don't think making no distinction between the following states: 1. gc --auto has nothing to do 2. gc --auto has something to do, will fork and try to do it 3. gc --auto has something to do, but notices that gc has been failing before and can't do anything now. I think #3 should exit with non-zero. Yes, before this whole backgrounding etc. we wouldn't have exited with non-zero either, we'd just print a thing to STDERR. 
But with backgrounding we implicitly introduced a new state, which is we won't do *any* maintenance at all, including consolidating packfiles (very important for servers), so I think it makes sense to exit with non-zero since gc can't run at all. >> Now e.g. automation I have to see if git-gc ---auto is having issues >> can't just be 'git gc --auto || echo fail @ {hostname}' across a fleet >> of servers, but will need to start caring if stderr was emitted to. > > Thanks for bringing this up. The thing is, the current exit code is > not useful for that purpose. It doesn't report a failure until the > *next* `git gc --auto` run, when it is too late to be useful. > > See the commit message on the proposed patch > https://public-inbox.org/git/20180717065740.gd177...@aiede.svl.corp.google.com/ > for more on that subject. Right. I know. What I mean is now I can (and do) use it to run 'git gc --auto' across our server fleet and see whether I have any of #3, or whether it's all #1 or #2. If there's nothing to do in #1 that's fine, and it just so happens that I'll run gc due to #2 that's also fine, but I'd like to see if gc really is stuck. This of course relies on them having other users / scripts doing normal git commands which would trigger previous 'git gc --auto' runs. >> I don't care if we e.g. have a 'git gc --auto --exit-code' similar to >> what git-diff does, but it doesn't make sense to me that we *know* we >> can't background the gc due to a previous error and then always return >> 0. Having to parse STDERR to see if it *really* failed is un-unixy, >> let's use exit codes. That's what they're for. > > Do you know of anyone trying to use the exit code from gc --auto in > this way? > > It sounds to me like for what you're proposing, a separate > > git gc --auto --last-run-failed > > command that a tool could use to check the outcome from the previous > gc --auto run without triggering a new one would be best. 
Yeah this is admittedly a bit of a poweruser thing I'm doing, so I don't mind if it's hidden behind something like that in principle, but it seems wrong to exit with zero in this (#3) case: $ git gc --auto; echo $? Auto packing the repository in background for optimum performance. See "git help gc" for manual housekeeping. error: The last gc run reported the following. Please correct the root cause and remove .git/gc.log. Automatic cleanup will not be performed until the file is removed. warning: There are too many unreachable loose objects; run 'git prune' to remove them. 255 As a previous (good) patch in this series notes that shouldn't be 255, let's fix that, but I don't see how emitting an "error" and "warning" saying we're broken and can't gc at all here in conjunction with exiting with zero makes sense. > [...] >> I think you're conflating two things here in a way that (if I'm reading >> this correctly) produces a pretty bad regression for users. > > The patch doesn't touch the "ratelimiting" behavior at all, so I'm not > sure what I'm doing to regress users. Sorry about being unclear again, this is not a comm
Re: [PATCH] gc: do not warn about too many loose objects
Hi Ævar, Ævar Arnfjörð Bjarmason wrote: > On Tue, Jul 17 2018, Jonathan Nieder wrote: >> That suggests a possible improvement. If all callers should be >> ignoring the exit code, can we make the exit code in daemonize mode >> unconditionally zero in this kind of case? > > That doesn't make sense to me. Just because git itself is happily > ignoring the exit code I don't think that should mean there shouldn't be > a meaningful exit code. Why don't you just ignore the exit code in the > repo tool as well? I don't maintain the repo tool. I am just trying to improve git to make some users' lives better. repo worked fine for years, until gc's daemonized mode broke it. I don't think it makes sense for us to declare that it has been holding git gc wrong for all that time before then. > Now e.g. automation I have to see if git-gc --auto is having issues > can't just be 'git gc --auto || echo fail @ {hostname}' across a fleet > of servers, but will need to start caring if stderr was emitted to. Thanks for bringing this up. The thing is, the current exit code is not useful for that purpose. It doesn't report a failure until the *next* `git gc --auto` run, when it is too late to be useful. See the commit message on the proposed patch https://public-inbox.org/git/20180717065740.gd177...@aiede.svl.corp.google.com/ for more on that subject. > I don't care if we e.g. have a 'git gc --auto --exit-code' similar to > what git-diff does, but it doesn't make sense to me that we *know* we > can't background the gc due to a previous error and then always return > 0. Having to parse STDERR to see if it *really* failed is un-unixy, > let's use exit codes. That's what they're for. Do you know of anyone trying to use the exit code from gc --auto in this way? It sounds to me like for what you're proposing, a separate git gc --auto --last-run-failed command that a tool could use to check the outcome from the previous gc --auto run without triggering a new one would be best. [...]
> I think you're conflating two things here in a way that (if I'm reading > this correctly) produces a pretty bad regression for users. The patch doesn't touch the "ratelimiting" behavior at all, so I'm not sure what I'm doing to regress users. [...] >> To solve (3), we could >> introduce a gc.lastrun file that is touched whenever "gc --auto" runs >> successfully and make "gc --auto" a no-op for a while after each run. > > This would work around my concern with b) above in most cases by > introducing a purely time-based rate limit, but I'm uneasy about this > change in git-gc semantics. [..] > With these proposed semantics we'd skip a needed GC (potentially for > days, depending on the default) just because we recently ran one. > > In many environments, such as on busy servers, we could have tens of > thousands of packs by this point, since this facility would (presumably) > bypass both gc.autoPackLimit and gc.auto in favor of only running gc at > a maximum of every N minutes, similarly in a local checkout I could have > a crapload of loose objects because I ran a big rebase or a > filter-branch on one of my branches. I think we all agree that getting rid of the hack that 'explodes' unreachable objects into loose objects is the best way forward, long term. Even in that future, some ratelimiting may be useful, though. I also suspect that we're never going to achieve a perfect set of defaults that works well for both humans and servers, though we can try. Thanks and hope that helps, Jonathan
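The gc.lastrun idea being weighed here can be sketched in a few lines. To be clear, this is the proposal under discussion, not behavior git has; the marker filename and the use of stat(1) to read its mtime are illustrative assumptions:

```shell
# gc_ratelimited GITDIR INTERVAL: succeed when a gc.lastrun marker in
# GITDIR was touched less than INTERVAL seconds ago, i.e. a time-based
# limiter would skip this "gc --auto" run.
gc_ratelimited() {
	marker=$1/gc.lastrun
	test -e "$marker" || return 1   # never ran: do not limit
	now=$(date +%s)
	# mtime via GNU stat, falling back to BSD stat
	last=$(stat -c %Y "$marker" 2>/dev/null || stat -f %m "$marker")
	test $((now - last)) -lt "$2"
}

# A successful "gc --auto" run would then refresh the marker:
#   touch "$gitdir/gc.lastrun"
```

Ævar's objection is visible in the sketch: the decision consults only the clock, never too_many_packs() or too_many_loose_objects(), so a repository that degrades quickly after a run stays unmaintained until the interval expires.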
Re: [PATCH] gc: do not warn about too many loose objects
On Tue, Jul 17 2018, Jonathan Nieder wrote: > Hi, > > Jeff King wrote: >> On Mon, Jul 16, 2018 at 03:56:39PM -0700, Jonathan Nieder wrote: > >>> The calling command in the motivating example is Android's "repo" tool: >>> >>> bare_git.gc('--auto') >>> >>> from https://gerrit-review.googlesource.com/c/git-repo/+/10598/. I >>> think it's reasonable that it expects a status code of 0 in the normal >>> case. So life is less simple than I hoped. >> >> IMHO it should ignore the return code, since that's what Git does >> itself. And at any rate, you'd miss errors from daemonized gc's (at >> least until the _next_ one runs and propagates the error code). I've read this whole thread, and as Peff noted I've barked up this particular tree before[1] without coming up with a solution myself. So please don't take the following as critique of any way of moving forward, I'm just trying to poke holes in what you're doing to make sure we don't have regressions to the currently (sucky) logic. > That suggests a possible improvement. If all callers should be > ignoring the exit code, can we make the exit code in daemonize mode > unconditionally zero in this kind of case? That doesn't make sense to me. Just because git itself is happily ignoring the exit code I don't think that should mean there shouldn't be a meaningful exit code. Why don't you just ignore the exit code in the repo tool as well? Now e.g. automation I have to see if git-gc --auto is having issues can't just be 'git gc --auto || echo fail @ {hostname}' across a fleet of servers, but will need to start caring if stderr was emitted to. I don't care if we e.g. have a 'git gc --auto --exit-code' similar to what git-diff does, but it doesn't make sense to me that we *know* we can't background the gc due to a previous error and then always return 0. Having to parse STDERR to see if it *really* failed is un-unixy, let's use exit codes. That's what they're for. > That doesn't really solve the problem: > > 1.
"gc --auto" would produce noise *every time it runs* until gc.log > is removed, for example via expiry > > 2. "gc --auto" would not do any garbage collection until gc.log is > removed, for example by expiry > > 3. "gc --auto" would still not ratelimit itself, for example when > there are a large number of loose unreachable objects that is not > large enough to hit the loose object threshold. > > but maybe it's better than the present state. > > To solve (1) and (2), we could introduce a gc.warnings file that > behaves like gc.log except that it only gets printed once and then > self-destructs, allowing gc --auto to proceed. I think you're conflating two things here in a way that (if I'm reading this correctly) produces a pretty bad regression for users. a) If we have something in the gc.log we keep yelling at users until we reach the gc.logExpiry, this was the subject of my thread back in January. This sucks, and should be fixed somehow. Maybe we should only emit the warning once, e.g. creating a .git/gc.log.wasemitted marker and as long as its ctime is later than the mtime on .git/gc.log we don't emit it again). But that also creates the UX problem that it's easy to miss this message in the middle of some huge "fetch" output. Perhaps we should just start emitting this as part of "git status" or something (or solve the root cause, as Peff notes...). b) We *also* use this presence of a gc.log as a marker for "we failed too recently, let's not try again until after a day". The semantics of b) are very useful, and just because they're tied up with a) right now, let's not throw out b) just because we're trying to solve a). We have dev machines with limited I/O & CPU/memory that occasionally break the gc.auto limit, I really don't want those churning a background "git gc" on every fetch/commit etc. until we're finally below the gc.auto limit (possibly days later), which would be a side-effect of printing the .git/gc.log once and then removing it. 
> To solve (3), we could
> introduce a gc.lastrun file that is touched whenever "gc --auto" runs
> successfully and make "gc --auto" a no-op for a while after each run.

This would work around my concern with b) above in most cases by
introducing a purely time-based rate limit, but I'm uneasy about this
change in git-gc semantics.

Right now one thing we do right is we always try to look at the actual
state of the repo with too_many_packs() and too_many_loose_objects(). We
don't assume the state of your repo hasn't drastically changed recently.
That means that we do the right thing and gc when the repo needs it, not
just because we GC'd recently enough.

With these proposed semantics we'd skip a needed GC (potentially for
days, depending on the default) just because we recently ran one. In
many environments, such as on busy servers, we could have tens of
thousands of packs by this point, since this facility would (presumably)
bypass b
Re: [PATCH] gc: do not warn about too many loose objects
Hi,

Jeff King wrote:
> On Mon, Jul 16, 2018 at 03:56:39PM -0700, Jonathan Nieder wrote:
>> The calling command in the motivating example is Android's "repo" tool:
>>
>>     bare_git.gc('--auto')
>>
>> from https://gerrit-review.googlesource.com/c/git-repo/+/10598/. I
>> think it's reasonable that it expects a status code of 0 in the normal
>> case. So life is less simple than I hoped.
>
> IMHO it should ignore the return code, since that's what Git does
> itself. And at any rate, you'd miss errors from daemonized gc's (at
> least until the _next_ one runs and propagates the error code).

That suggests a possible improvement. If all callers should be
ignoring the exit code, can we make the exit code in daemonize mode
unconditionally zero in this kind of case?

That doesn't really solve the problem:

 1. "gc --auto" would produce noise *every time it runs* until gc.log
    is removed, for example via expiry

 2. "gc --auto" would not do any garbage collection until gc.log is
    removed, for example by expiry

 3. "gc --auto" would still not ratelimit itself, for example when
    there are a large number of loose unreachable objects that is not
    large enough to hit the loose object threshold.

but maybe it's better than the present state.

To solve (1) and (2), we could introduce a gc.warnings file that
behaves like gc.log except that it only gets printed once and then
self-destructs, allowing gc --auto to proceed. To solve (3), we could
introduce a gc.lastrun file that is touched whenever "gc --auto" runs
successfully and make "gc --auto" a no-op for a while after each run.

-- >8 --
Subject: gc: do not return error for prior errors in daemonized mode

Some build machines started failing to fetch updated source using
"repo sync", with error

    error: The last gc run reported the following. Please correct the root cause
    and remove /build/.repo/projects/tools/git.git/gc.log.
    Automatic cleanup will not be performed until the file is removed.
    warning: There are too many unreachable loose objects; run 'git prune' to remove them.

The cause takes some time to describe.

In v2.0.0-rc0~145^2 (gc: config option for running --auto in
background, 2014-02-08), "git gc --auto" learned to run in the
background instead of blocking the invoking command. In this mode, it
closed stderr to avoid interleaving output with any subsequent
commands, causing warnings like the above to be swallowed;
v2.6.3~24^2 (gc: save log from daemonized gc --auto and print it next
time, 2015-09-19) addressed this by storing any diagnostic output in
.git/gc.log and allowing the next "git gc --auto" run to print it.

To avoid wasteful repeated fruitless gcs, when gc.log is present, the
subsequent "gc --auto" would die after printing its contents. Most git
commands, such as "git fetch", ignore the exit status from "git gc
--auto" so all is well at this point: the user gets to see the error
message, and the fetch succeeds, without a wasteful additional attempt
at an automatic gc.

External tools like repo[1], though, do care about the exit status
from "git gc --auto". In non-daemonized mode, the exit status is
straightforward: if there is an error, it is nonzero, but after a
warning like the above, the status is zero. The daemonized mode, as a
side effect of the other properties provided, offers a very strange
exit code convention:

 - if no housekeeping was required, the exit status is 0

 - the first real run, after forking into the background, returns
   exit status 0 unconditionally. The parent process has no way to
   know whether gc will succeed.

 - if there is any diagnostic output in gc.log, subsequent runs
   return a nonzero exit status to indicate that gc was not
   triggered.

There's nothing for the calling program to act on on the basis of
that error. Use status 0 consistently instead, to indicate that we
decided not to run a gc (just like if no housekeeping was required).
This way, repo and similar tools can get the benefit of the same
behavior as tools like "git fetch" that ignore the exit status from
gc --auto.

Once the period of time described by gc.pruneExpire elapses, the
unreachable loose objects will be removed by "git gc --auto"
automatically.

[1] https://gerrit-review.googlesource.com/c/git-repo/+/10598/

Reported-by: Andrii Dehtiarov
Helped-by: Jeff King
Signed-off-by: Jonathan Nieder
---
 Documentation/config.txt |  3 ++-
 builtin/gc.c             | 16 +++-
 t/t6500-gc.sh            |  6 +++---
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1cc18a828c..5eaa4aaa7d 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1649,7 +1649,8 @@ will be repacked. After this the number of packs should go below
 gc.autoPackLimit and gc.bigPackThreshold should be respected again.

 gc.logExpiry::
-	If the file gc.log exists, then `git gc --auto` won't run
+	If the file gc.log exists, th
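The exit-status convention described in the patch above can be summarized in a small model (an illustration, not git's actual implementation; the -1 is what later wraps around to 255 via exit()):

```python
def daemonized_gc_exit_code(housekeeping_needed, gc_log_has_diagnostics,
                            patched=False):
    """Model of 'git gc --auto' exit status with gc.autoDetach,
    before and after the patch above (a simplification)."""
    if not housekeeping_needed:
        return 0                   # nothing to do
    if gc_log_has_diagnostics:
        # Before the patch: refuse to run and signal an error (-1,
        # which exit() turns into 255).  After the patch: report 0,
        # just as if no housekeeping had been required.
        return 0 if patched else -1
    return 0  # forked into the background; parent cannot know the outcome
```

Under the old convention, repo's `bare_git.gc('--auto')` sees the nonzero status on the second and later runs and fails the sync; under the new one it always sees 0.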
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 03:56:39PM -0700, Jonathan Nieder wrote:
> The calling command in the motivating example is Android's "repo" tool:
>
>     bare_git.gc('--auto')
>
> from https://gerrit-review.googlesource.com/c/git-repo/+/10598/. I
> think it's reasonable that it expects a status code of 0 in the normal
> case. So life is less simple than I hoped.

IMHO it should ignore the return code, since that's what Git does
itself. And at any rate, you'd miss errors from daemonized gc's (at
least until the _next_ one runs and propagates the error code).

> Interesting! It looks like that thread anticipated the problems we've
> seen here. Three years without having to have fixed it is a good run,
> I suppose.
>
> The discussion of stopping there appears to be primarily about
> stopping in the error case, not rate-limiting in the success or
> warning case.

I think the two are essentially the same. The point is that we cannot
make further progress by re-running the gc again, so we should not.

Amusingly, the warning we're talking about is the exact reason that the
logging in that thread was added. 329e6e8794 (gc: save log from
daemonized gc --auto and print it next time, 2015-09-19) says:

    The latest in this set is, as the result of daemonizing, stderr is
    closed and all warnings are lost. This warning at the end of cmd_gc()
    is particularly important because it tells the user how to avoid "gc
    --auto" running repeatedly. Because stderr is closed, the user does
    not know, naturally they complain about 'gc --auto' wasting CPU.

> -- >8 --
> Subject: gc: exit with status 128 on failure
>
> A value of -1 returned from cmd_gc gets propagated to exit(),
> resulting in an exit status of 255. Use die instead for a clearer
> error message and a controlled exit.

I agree it's better to not pass -1 to exit(). But I thought the
original motivation was the fact that we were returning non-zero in
this case at all? (And I thought you and I both agreed with that
motivation.)
So is this meant to be a preparatory fix until somebody is interested
in fixing that?

> -static int gc_before_repack(void)
> +static void gc_before_repack(void)
>  {
>  	if (pack_refs && run_command_v_opt(pack_refs_cmd.argv, RUN_GIT_CMD))
> -		return error(FAILED_RUN, pack_refs_cmd.argv[0]);
> +		die(FAILED_RUN, pack_refs_cmd.argv[0]);
>
>  	if (prune_reflogs && run_command_v_opt(reflog.argv, RUN_GIT_CMD))
> -		return error(FAILED_RUN, reflog.argv[0]);
> +		die(FAILED_RUN, reflog.argv[0]);

Dscho is going to yell at you about replacing error returns with
die(). ;)

I wonder if it would be simpler to just reinterpret the "-1" into "128"
in cmd_gc(). I thought we had talked about having run_builtin() do that
at one point, but I don't think we ever did.

I dunno. I'm fine with it either way.

-Peff
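As an illustration (not git's code), Peff's alternative of reinterpreting the -1 at the dispatch layer rather than converting every error() into die() would amount to something like this sketch; the point is that exit(-1) is taken modulo 256 and becomes 255, while die() conventionally exits with 128:

```python
def run_builtin(cmd):
    """Hypothetical sketch of translating a builtin's error() return
    (-1) into die()'s conventional exit status (128) at one central
    place, instead of letting exit(-1) wrap around to 255."""
    status = cmd()
    if status < 0:
        return 128  # what die() would have produced
    return status
```

This would let gc_before_repack() keep its `return error(...)` style while still giving the shell a sane exit status.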
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 3:55 PM, Jeff King wrote: > On Mon, Jul 16, 2018 at 03:07:34PM -0700, Elijah Newren wrote: > >> > If we were to delete those objects, wouldn't it be exactly the same as >> > running "git prune"? Or setting your gc.pruneExpire to "now" or even "5 >> > minutes"? Or are you concerned with taking other objects along for the >> > ride that weren't part of old reflogs? I think that's a valid concern, >> >> Yes, I was worried about taking other objects along for the ride that >> weren't part of old reflogs. >> >> > but it's also an issue for objects which were previously referenced in >> > a reflog, but are part of another current operation. >> >> I'm not certain what you're referring to here. > > I mean that an ongoing operation could refer to a blob that just > recently became unreachable via reflog pruning. And then we would delete > it, leaving the repository corrupt. > > One of the protections we have against that is that if I ask to write > blob XYZ and we find that we already have the object, Git will freshen > the mtime of the loose object or pack that contains it, protecting it > from pruning. But with your suggestion, we'd delete it immediately, > regardless of the mtime of the containing pack. > > Another way to think of it is this: a reflog mentioning an object does > not mean it is the exclusive user of that object. So when we expire it, > that does not mean that it is OK to delete it immediately; there may be > other users. > >> > Also, what do you do if there weren't reflogs in the repo? Or the >> > reflogs were deleted (e.g., because branch deletion drops the associated >> > reflog entirely)? >> >> Yes, there are issues this rule won't help with, but in my experience >> it was a primary (if not sole) actual cause in practice. (I believe I >> even said elsewhere in this thread that I knew there were unreachable >> objects for other reasons and they might also become large in number). 
>> At $DAYJOB we've had multiple people including myself hit the "too >> many unreachable loose objects" nasty loop issue (some of us multiple >> different times), and as far as I can tell, most (perhaps even all) of >> them would have been avoided by just "properly" deleting garbage as >> per my object-age-is-reflog-age-if-not-otherwise-referenced rule. > > I agree with you that this is a frequent cause, and probably even the > primary one. But my concern is that your loosening increases the risk of > corruption for other cases. > >> > I assume by "these objects" you mean ones which used to be reachable >> > from a reflog, but that reflog entry just expired. I think you'd be >> > sacrificing some race-safety in that case. >> >> Is that inherently any more race unsafe than 'git prune >> --expire=2.weeks.ago'? I thought it'd be racy in the same ways, and >> thus a tradeoff folks are already accepting (at least implicitly) when >> running git-gc. Since these objects are actually 90 days old rather >> than a mere two weeks, it actually seemed more safe to me. But maybe >> I'm overlooking something with the pack->loose transition that makes >> it more racy? > > I think it's worse in terms of race-safety because you're losing one of > the signals that users of the objects can provide to git-prune to tell > it the object is useful: updating the mtime. So yes, you think of the > objects as "90 days old", but we don't know if there are other users. > Has somebody else been accessing them in the meantime? > > To be honest, I'm not sure how meaningful that distinction is in > practice. The current scheme is still racy, even if the windows are > shorter in some circumstances. But it seems like cruft packfiles are > a similar amount of work to your scheme, cover more cases, and are > slightly safer. And possibly even give us a long-term route to true > race-free pruning (via the "garbage pack" mark that Jonathan mentioned). 
> > Assuming you buy into the idea that objects in a cruft-pack are not > hurting anything aside from a little wasted storage for up to 2 weeks > (which it sounds like you're at least partially on board with ;) ). Ah, I see now. Thanks (to you and Jonathan) for the thorough explanations. I'm totally on board now.
Re: [PATCH] gc: do not warn about too many loose objects
Jeff King wrote:
> On Mon, Jul 16, 2018 at 03:03:06PM -0700, Jonathan Nieder wrote:
>> Oh, good point. In non-daemon mode, we don't let "gc --auto" failure
>> cause the invoking command to fail, but in daemon mode we do. That
>> should be a straightforward fix; patch coming in a moment.
>
> OK, that definitely sounds like a bug. I'm still confused how that could
> happen, though, since from the caller's perspective they ignore git-gc's
> exit code either way. I guess I'll see in your patch. :)

Alas, I just misremembered. What I was remembering is that gc --auto
does

	if (auto_gc && too_many_loose_objects())
		warning(_("There are too many unreachable loose objects; "
			"run 'git prune' to remove them."));

	return 0;

which means that too_many_loose_objects is not an error in undaemonized
mode, while it is in daemonized mode. But we've already discussed that.

The calling command in the motivating example is Android's "repo" tool:

	bare_git.gc('--auto')

from https://gerrit-review.googlesource.com/c/git-repo/+/10598/. I
think it's reasonable that it expects a status code of 0 in the normal
case. So life is less simple than I hoped.

[...]
>> Can you point me to some discussion about building that rate-limiting?
>> The commit message for v2.12.2~17^2 (gc: ignore old gc.log files,
>> 2017-02-10) definitely doesn't describe that as its intent.
>
> I think that commit is a loosening of the rate-limiting (because we'd
> refuse to progress for something that was actually time-based). But the
> original stopping comes from this discussion, I think:
>
>   https://public-inbox.org/git/xmqqlhijznpm@gitster.dls.corp.google.com/

Interesting! It looks like that thread anticipated the problems we've
seen here. Three years without having to have fixed it is a good run,
I suppose.

The discussion of stopping there appears to be primarily about
stopping in the error case, not rate-limiting in the success or
warning case.

Here's a patch for the 'return -1' thing.
-- >8 --
Subject: gc: exit with status 128 on failure

A value of -1 returned from cmd_gc gets propagated to exit(),
resulting in an exit status of 255. Use die instead for a clearer
error message and a controlled exit.

Signed-off-by: Jonathan Nieder
---
 builtin/gc.c  | 36
 t/t6500-gc.sh |  2 +-
 2 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index ccfb1ceaeb..2bebc52bda 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -438,10 +438,10 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
 	return NULL;
 }

-static int report_last_gc_error(void)
+static void report_last_gc_error(void)
 {
 	struct strbuf sb = STRBUF_INIT;
-	int ret = 0;
+	ssize_t ret;
 	struct stat st;
 	char *gc_log_path = git_pathdup("gc.log");

@@ -449,16 +449,17 @@ static int report_last_gc_error(void)
 		if (errno == ENOENT)
 			goto done;

-		ret = error_errno(_("Can't stat %s"), gc_log_path);
-		goto done;
+		die_errno(_("cannot stat '%s'"), gc_log_path);
 	}

 	if (st.st_mtime < gc_log_expire_time)
 		goto done;

 	ret = strbuf_read_file(&sb, gc_log_path, 0);
+	if (ret < 0)
+		die_errno(_("cannot read '%s'"), gc_log_path);
 	if (ret > 0)
-		ret = error(_("The last gc run reported the following. "
+		die(_("The last gc run reported the following. "
 			      "Please correct the root cause\n"
 			      "and remove %s.\n"
 			      "Automatic cleanup will not be performed "
@@ -468,20 +469,18 @@
 	strbuf_release(&sb);
 done:
 	free(gc_log_path);
-	return ret;
 }

-static int gc_before_repack(void)
+static void gc_before_repack(void)
 {
 	if (pack_refs && run_command_v_opt(pack_refs_cmd.argv, RUN_GIT_CMD))
-		return error(FAILED_RUN, pack_refs_cmd.argv[0]);
+		die(FAILED_RUN, pack_refs_cmd.argv[0]);

 	if (prune_reflogs && run_command_v_opt(reflog.argv, RUN_GIT_CMD))
-		return error(FAILED_RUN, reflog.argv[0]);
+		die(FAILED_RUN, reflog.argv[0]);

 	pack_refs = 0;
 	prune_reflogs = 0;
-	return 0;
 }

 int cmd_gc(int argc, const char **argv, const char *prefix)
@@ -562,13 +561,11 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
 	}
 	if (detach_auto) {
-		if (report_last_gc_error())
-			return -1;
+		report_last_gc_error(); /* dies on error */

 		if (lock_repo_for_gc(force, &pid))
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 03:07:34PM -0700, Elijah Newren wrote: > > If we were to delete those objects, wouldn't it be exactly the same as > > running "git prune"? Or setting your gc.pruneExpire to "now" or even "5 > > minutes"? Or are you concerned with taking other objects along for the > > ride that weren't part of old reflogs? I think that's a valid concern, > > Yes, I was worried about taking other objects along for the ride that > weren't part of old reflogs. > > > but it's also an issue for objects which were previously referenced in > > a reflog, but are part of another current operation. > > I'm not certain what you're referring to here. I mean that an ongoing operation could refer to a blob that just recently became unreachable via reflog pruning. And then we would delete it, leaving the repository corrupt. One of the protections we have against that is that if I ask to write blob XYZ and we find that we already have the object, Git will freshen the mtime of the loose object or pack that contains it, protecting it from pruning. But with your suggestion, we'd delete it immediately, regardless of the mtime of the containing pack. Another way to think of it is this: a reflog mentioning an object does not mean it is the exclusive user of that object. So when we expire it, that does not mean that it is OK to delete it immediately; there may be other users. > > Also, what do you do if there weren't reflogs in the repo? Or the > > reflogs were deleted (e.g., because branch deletion drops the associated > > reflog entirely)? > > Yes, there are issues this rule won't help with, but in my experience > it was a primary (if not sole) actual cause in practice. (I believe I > even said elsewhere in this thread that I knew there were unreachable > objects for other reasons and they might also become large in number). 
> At $DAYJOB we've had multiple people including myself hit the "too > many unreachable loose objects" nasty loop issue (some of us multiple > different times), and as far as I can tell, most (perhaps even all) of > them would have been avoided by just "properly" deleting garbage as > per my object-age-is-reflog-age-if-not-otherwise-referenced rule. I agree with you that this is a frequent cause, and probably even the primary one. But my concern is that your loosening increases the risk of corruption for other cases. > > I assume by "these objects" you mean ones which used to be reachable > > from a reflog, but that reflog entry just expired. I think you'd be > > sacrificing some race-safety in that case. > > Is that inherently any more race unsafe than 'git prune > --expire=2.weeks.ago'? I thought it'd be racy in the same ways, and > thus a tradeoff folks are already accepting (at least implicitly) when > running git-gc. Since these objects are actually 90 days old rather > than a mere two weeks, it actually seemed more safe to me. But maybe > I'm overlooking something with the pack->loose transition that makes > it more racy? I think it's worse in terms of race-safety because you're losing one of the signals that users of the objects can provide to git-prune to tell it the object is useful: updating the mtime. So yes, you think of the objects as "90 days old", but we don't know if there are other users. Has somebody else been accessing them in the meantime? To be honest, I'm not sure how meaningful that distinction is in practice. The current scheme is still racy, even if the windows are shorter in some circumstances. But it seems like cruft packfiles are a similar amount of work to your scheme, cover more cases, and are slightly safer. And possibly even give us a long-term route to true race-free pruning (via the "garbage pack" mark that Jonathan mentioned). 
Assuming you buy into the idea that objects in a cruft-pack are not hurting anything aside from a little wasted storage for up to 2 weeks (which it sounds like you're at least partially on board with ;) ). -Peff
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 03:03:06PM -0700, Jonathan Nieder wrote: > Jeff King wrote: > > > I don't think any command should report failure of its _own_ operation > > if "gc --auto" failed. And grepping around the source code shows that we > > typically ignore it. > > Oh, good point. In non-daemon mode, we don't let "gc --auto" failure > cause the invoking command to fail, but in daemon mode we do. That > should be a straightforward fix; patch coming in a moment. OK, that definitely sounds like a bug. I'm still confused how that could happen, though, since from the caller's perspective they ignore git-gc's exit code either way. I guess I'll see in your patch. :) > > What I was trying to say earlier is that we _did_ build this > > rate-limiting, and I think it is a bug that the non-daemon case does not > > rate-limit (but nobody noticed, because the default is daemonizing). > > > > So the fix is not "rip out the rate-limiting in daemon mode", but rather > > "extend it to the non-daemon case". > > Can you point me to some discussion about building that rate-limiting? > The commit message for v2.12.2~17^2 (gc: ignore old gc.log files, > 2017-02-10) definitely doesn't describe that as its intent. I think that commit is a loosening of the rate-limiting (because we'd refuse to progress for something that was actually time-based). But the original stopping comes from this discussion, I think: https://public-inbox.org/git/xmqqlhijznpm@gitster.dls.corp.google.com/ (I didn't read the whole thread, but that was what I hit by blaming the log code and then tracing that back to the list). > This is the kind of review that Dscho often complains about, where > someone tries to fix something small but significant to users and gets > told to build something larger that was not their itch instead. I don't know how to say more emphatically that I am not asking anyone to build something larger (like cruft packfiles). 
I'm just trying to bring up an impact that the author didn't consider (and that IMHO would be a regression). Isn't that what reviews are for? I only mention packfiles because as the discussion turns to "well, all of these solutions are mediocre hacks" (because they absolutely are), it's important to realize that there _is_ a right solution, and we even already know about it. Even if we don't work on it now, knowing that it's there makes it easier to decide about the various hacks. > The comments about the "Why is 'git commit' so slow?" experience and > how having the warning helps with that are well taken. I think we > should be able to find a way to keep the warning in a v2 of this > patch. But the rest about rate-limiting and putting unreachable > objects in packs etc as a blocker for this are demoralizing, since > they gives the feeling that even if I handle the cases that are > handled today well, it will never be enough for the project unless I > solve the larger problems that were already there. I really don't know why we are having such trouble communicating. I've tried to make it clear several times that the pack thing is not something I expect your or Jonathan Tan to work on, but obviously I failed. I'd be _delighted_ if you wanted to work on it, but AFAICT this patch is purely motivated by: 1. there's a funny exit code thing going on (0 on the first run, -1 on the second) 2. the warning is not actionable by users I disagree with the second, and I think we've discussed easy solutions for the first. -Peff
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 2:21 PM, Jeff King wrote: > On Mon, Jul 16, 2018 at 02:09:43PM -0700, Elijah Newren wrote: >> The point of gc is to: expire reflogs, repack referenced objects, and >> delete loose objects that (1) are no longer referenced and (2) are >> "old enough". >> >> The "old enough" bit is the special part here. Before the gc, those >> objects were referenced (only by old reflog entries) and were found in >> a pack. git currently writes those objects out to disk and gives them >> the age of the packfile they are contained in, which I think is the >> source of the bug. We have a reference for those objects from the >> reflogs, know they aren't referenced anywhere else, so those objects >> to me are the age of those reflog entries: 90 days. As such, they are >> "old enough" and should be deleted. > > OK, I see what you're saying, but... > >> I never got around to fixing it properly, though, because 'git prune' >> is such a handy workaround that for now. Having people nuke all their >> loose objects is a bit risky, but the only other workaround people had >> was to re-clone (waiting the requisite 20+ minutes for the repo to >> download) and throw away their old clone. (Though some people even >> went that route, IIRC.) > > If we were to delete those objects, wouldn't it be exactly the same as > running "git prune"? Or setting your gc.pruneExpire to "now" or even "5 > minutes"? Or are you concerned with taking other objects along for the > ride that weren't part of old reflogs? I think that's a valid concern, Yes, I was worried about taking other objects along for the ride that weren't part of old reflogs. > but it's also an issue for objects which were previously referenced in > a reflog, but are part of another current operation. I'm not certain what you're referring to here. > Also, what do you do if there weren't reflogs in the repo? Or the > reflogs were deleted (e.g., because branch deletion drops the associated > reflog entirely)? 
Yes, there are issues this rule won't help with, but in my experience
it was a primary (if not sole) actual cause in practice. (I believe I
even said elsewhere in this thread that I knew there were unreachable
objects for other reasons and they might also become large in number.)

At $DAYJOB we've had multiple people including myself hit the "too
many unreachable loose objects" nasty loop issue (some of us multiple
different times), and as far as I can tell, most (perhaps even all) of
them would have been avoided by just "properly" deleting garbage as
per my object-age-is-reflog-age-if-not-otherwise-referenced rule.

>> With big repos, it's easy to get into situations where there are well
>> more than 10,000 objects satisfying these conditions. In fact, it's
>> not all that rare that the repo has far more loose objects after a git
>> gc than before.
>
> Yes, this is definitely a wart and I think is worth addressing.
>
>> I totally agree with your general plan to put unreferenced loose
>> objects into a pack. However, I don't think these objects should be
>> part of that pack; they should just be deleted instead.
>
> I assume by "these objects" you mean ones which used to be reachable
> from a reflog, but that reflog entry just expired. I think you'd be
> sacrificing some race-safety in that case.

Is that inherently any more race-unsafe than 'git prune
--expire=2.weeks.ago'? I thought it'd be racy in the same ways, and
thus a tradeoff folks are already accepting (at least implicitly) when
running git-gc. Since these objects are actually 90 days old rather
than a mere two weeks, it actually seemed more safe to me. But maybe
I'm overlooking something with the pack->loose transition that makes
it more racy?

> If the objects went into a pack under a race-proof scheme, would that
> actually bother you? Is it the 10,000 objects that's a problem, or is it
> the horrible I/O from exploding them coupled with the annoying auto-gc
> behavior?

Yeah, good point.
It's mostly the annoying auto-gc behavior and the horrible I/O of future git operations from having lots of loose objects. They've already been paying the cost of storing those objects in packed form for 90 days; a few more won't hurt much. I'd be slightly annoyed knowing that we're storing garbage we don't need to be, but I don't think it's a real issue.
Re: [PATCH] gc: do not warn about too many loose objects
Jeff King wrote:
> I don't think any command should report failure of its _own_ operation
> if "gc --auto" failed. And grepping around the source code shows that we
> typically ignore it.

Oh, good point. In non-daemon mode, we don't let "gc --auto" failure
cause the invoking command to fail, but in daemon mode we do. That
should be a straightforward fix; patch coming in a moment.

> On Mon, Jul 16, 2018 at 02:40:03PM -0700, Jonathan Nieder wrote:
>> For comparison, in non-daemon mode, there is nothing enforcing the
>> kind of ratelimiting you are talking about. It is an accident of
>> history. If we want this kind of ratelimiting, I'd rather we build it
>> deliberately instead of relying on this accident.
>
> What I was trying to say earlier is that we _did_ build this
> rate-limiting, and I think it is a bug that the non-daemon case does not
> rate-limit (but nobody noticed, because the default is daemonizing).
>
> So the fix is not "rip out the rate-limiting in daemon mode", but rather
> "extend it to the non-daemon case".

Can you point me to some discussion about building that rate-limiting?
The commit message for v2.12.2~17^2 (gc: ignore old gc.log files,
2017-02-10) definitely doesn't describe that as its intent.

This is the kind of review that Dscho often complains about, where
someone tries to fix something small but significant to users and gets
told to build something larger that was not their itch instead.

The comments about the "Why is 'git commit' so slow?" experience and
how having the warning helps with that are well taken. I think we
should be able to find a way to keep the warning in a v2 of this
patch. But the rest about rate-limiting and putting unreachable
objects in packs etc. as a blocker for this is demoralizing, since it
gives the feeling that even if I handle the cases that are handled
today well, it will never be enough for the project unless I solve the
larger problems that were already there.

Jonathan
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 02:40:03PM -0700, Jonathan Nieder wrote: > > The key thing is that the presence of the warning/log still suppress > > the actual gc for the gc.logExpiry period. > > Ah, now I think I see the source of the misunderstanding. > > When I initially read the docs, I also assumed that > > If the file gc.log exists, then git gc --auto won’t run unless > that file is more than gc.logExpiry old. > > meant "git gc --auto" would be skipped (and thus silently succeed) when > the file is less than gc.logExpiry old. But that assumption was wrong: > it errors out! Right. That's the mysterious "-1" error code, and I agree that part is confusing. > This makes scripting difficult, since arbitrary commands that > incidentally run "git gc --auto" will fail in this state, until gc.log > expires (but at that point they'll fail again anyway). I don't think any command should report failure of its _own_ operation if "gc --auto" failed. And grepping around the source code shows that we typically ignore it. > For comparison, in non-daemon mode, there is nothing enforcing the > kind of ratelimiting you are talking about. It is an accident of > history. If we want this kind of ratelimiting, I'd rather we build it > deliberately instead of relying on this accident. What I was trying to say earlier is that we _did_ build this rate-limiting, and I think it is a bug that the non-daemon case does not rate-limit (but nobody noticed, because the default is daemonizing). So the fix is not "rip out the rate-limiting in daemon mode", but rather "extend it to the non-daemon case". -Peff
Re: [PATCH] gc: do not warn about too many loose objects
Jeff King wrote: > On Mon, Jul 16, 2018 at 01:37:53PM -0700, Jonathan Nieder wrote: >> Jeff King wrote: >>> With the current code, that produces a bunch of annoying warnings for >>> every commit ("I couldn't gc because the last gc reported a warning"). >>> But after Jonathan's patch, every single commit will do a full gc of the >>> repository. In this tiny repository that's relatively quick, but in a >>> real repo it's a ton of CPU and I/O, all for nothing. >> >> I see. Do I understand correctly that if we find a way to print the >> warning but not error out, that would be good enough for you? > > Yes. I thought that's what I proposed originally. ;) > > The key thing is that the presence of the warning/log still suppresses > the actual gc for the gc.logExpiry period. Ah, now I think I see the source of the misunderstanding. When I initially read the docs, I also assumed that If the file gc.log exists, then git gc --auto won’t run unless that file is more than gc.logExpiry old. meant "git gc --auto" would be skipped (and thus silently succeed) when the file is less than gc.logExpiry old. But that assumption was wrong: it errors out! This makes scripting difficult, since arbitrary commands that incidentally run "git gc --auto" will fail in this state, until gc.log expires (but at that point they'll fail again anyway). For comparison, in non-daemon mode, there is nothing enforcing the kind of ratelimiting you are talking about. It is an accident of history. If we want this kind of ratelimiting, I'd rather we build it deliberately instead of relying on this accident. Jonathan
Re: [PATCH] gc: do not warn about too many loose objects
Elijah Newren wrote: > I totally agree with your general plan to put unreferenced loose > objects into a pack. However, I don't think these objects should be > part of that pack; they should just be deleted instead. This might be the wrong thread to discuss it, but did you follow the reference/prune race that Peff mentioned? The simplest cure for it that I'm aware of does involve writing those objects to a pack. The idea is to enforce a straightforward contract: There are two kinds of packs: GC and UNREACHABLE_GARBAGE. Every object in a GC pack has a minimum lifetime of (let's say "3 days") from the time they are read. If you start making use of an object from a GC pack (e.g. by creating a new object referencing it), you have three days to ensure it's referenced. Each UNREACHABLE_GARBAGE pack has a lifetime of (let's say "3 days") from the time it is created. Objects in an UNREACHABLE_GARBAGE pack have no minimum ttl from the time they are read. If you want to start making use of an object from an UNREACHABLE_GARBAGE pack (e.g. by creating a new object referencing it), then copy it and everything it references to a GC pack. To avoid a proliferation of UNREACHABLE_GARBAGE packs, there's a rule for coalescing them, but that's not relevant here. It is perfectly possible for an object in a GC pack to reference an object in an UNREACHABLE_GARBAGE pack via writes racing with gc, but that's fine --- on the next gc run, the unreachable garbage objects get copied to a GC pack. We've been using this on a JGit DfsRepository based server for > 2 years now and it's been working well. More details are in the "Loose objects and unreachable objects" section in Documentation/technical/ mentioned before. Thanks, Jonathan
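To make that contract concrete, a toy decision function (purely illustrative shell, not git or JGit code; the 3-day ttl mirrors the example numbers in the mail above, and in the real scheme GC-pack objects are only dropped by a later gc, so the toy never drops them):

```shell
# Toy model of the two-pack contract: whether an unreferenced object
# may be dropped depends on the kind of pack holding it and, for
# garbage packs, the age of the pack itself.
can_drop() {
  kind=$1 age_days=$2
  case "$kind" in
  GC) echo no ;;   # retained; a later gc decides its fate
  UNREACHABLE_GARBAGE)
    if [ "$age_days" -ge 3 ]; then echo yes; else echo no; fi ;;
  esac
}
can_drop GC 10                   # no
can_drop UNREACHABLE_GARBAGE 1   # no: within the pack's 3-day lifetime
can_drop UNREACHABLE_GARBAGE 5   # yes: the whole pack has expired
```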
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 02:09:43PM -0700, Elijah Newren wrote: > >> Um, except it doesn't actually do that. The testcase I provided shows > >> that it leaves around 10,000 objects that are totally deletable and > >> were only previously referenced by reflog entries -- entries that gc > >> removed without removing the corresponding objects. > > > > What's your definition of "totally deletable"? > > The point of gc is to: expire reflogs, repack referenced objects, and > delete loose objects that (1) are no longer referenced and (2) are > "old enough". > > The "old enough" bit is the special part here. Before the gc, those > objects were referenced (only by old reflog entries) and were found in > a pack. git currently writes those objects out to disk and gives them > the age of the packfile they are contained in, which I think is the > source of the bug. We have a reference for those objects from the > reflogs, know they aren't referenced anywhere else, so those objects > to me are the age of those reflog entries: 90 days. As such, they are > "old enough" and should be deleted. OK, I see what you're saying, but... > I never got around to fixing it properly, though, because 'git prune' > is such a handy workaround for now. Having people nuke all their > loose objects is a bit risky, but the only other workaround people had > was to re-clone (waiting the requisite 20+ minutes for the repo to > download) and throw away their old clone. (Though some people even > went that route, IIRC.) If we were to delete those objects, wouldn't it be exactly the same as running "git prune"? Or setting your gc.pruneExpire to "now" or even "5 minutes"? Or are you concerned with taking other objects along for the ride that weren't part of old reflogs? I think that's a valid concern, but it's also an issue for objects which were previously referenced in a reflog, but are part of another current operation. Also, what do you do if there weren't reflogs in the repo?
Or the reflogs were deleted (e.g., because branch deletion drops the associated reflog entirely)? > With big repos, it's easy to get into situations where there are well > more than 1 objects satisfying these conditions. In fact, it's > not all that rare that the repo has far more loose objects after a git > gc than before. Yes, this is definitely a wart and I think is worth addressing. > I totally agree with your general plan to put unreferenced loose > objects into a pack. However, I don't think these objects should be > part of that pack; they should just be deleted instead. I assume by "these objects" you mean ones which used to be reachable from a reflog, but that reflog entry just expired. I think you'd be sacrificing some race-safety in that case. If the objects went into a pack under a race-proof scheme, would that actually bother you? Is it the 10,000 objects that's a problem, or is it the horrible I/O from exploding them coupled with the annoying auto-gc behavior? -Peff
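Peff's warning about gc.pruneExpire="now" racing with other operations is easy to see directly: with no grace period, an unreachable object written a moment ago is deleted immediately. A minimal sketch (throwaway directory, stock git):

```shell
# With --expire=now there is no grace period at all: a just-written
# unreachable object is deleted right away, which is exactly why "now"
# can race with a concurrent commit that is about to reference it.
set -e
cd "$(mktemp -d)"
git init -q .
blob=$(echo scratch | git hash-object -w --stdin)
git cat-file -e "$blob"        # the loose object exists...
git prune --expire=now
! git cat-file -e "$blob"      # ...and is already gone
```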
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 01:56:46PM -0700, Jonathan Nieder wrote: > I don't believe we are very good at transitively propagating freshness > today, by the way. Perhaps I don't understand what you mean by transitive here. But we should be traversing from any fresh object and marking any non-fresh ones that are reachable from it to be saved. If you know of a case where we're not, there's a bug, and I'd be happy to look into it. The only problem I know about is the utter lack of atomicity. -Peff
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 1:38 PM, Jeff King wrote: > On Mon, Jul 16, 2018 at 01:16:45PM -0700, Elijah Newren wrote: > >> >> The basic problem here, at least for us, is that gc has enough >> >> information to know it could expunge some objects, but because of how >> >> it is structured in terms of several substeps (reflog expiration, >> >> repack, prune), the information is lost between the steps and it >> >> instead writes them out as unreachable objects. If we could prune (or >> >> avoid exploding) loose objects that are only reachable from reflog >> >> entries that we are expiring, then the problem goes away for us. (I >> >> totally understand that other repos may have enough unreachable >> >> objects for other reasons that Peff's suggestion to just pack up >> >> unreachable objects is still a really good idea. But on its own, it >> >> seems like a waste since it's packing stuff that we know we could just >> >> expunge.) >> > >> > No, we should have expunged everything that could be during the "repack" >> > and "prune" steps. We feed the expiration time to repack, so that it >> > knows to drop objects entirely instead of exploding them loose. >> >> Um, except it doesn't actually do that. The testcase I provided shows >> that it leaves around 10,000 objects that are totally deletable and >> were only previously referenced by reflog entries -- entries that gc >> removed without removing the corresponding objects. > > What's your definition of "totally deletable"? The point of gc is to: expire reflogs, repack referenced objects, and delete loose objects that (1) are no longer referenced and (2) are "old enough". The "old enough" bit is the special part here. Before the gc, those objects were referenced (only by old reflog entries) and were found in a pack. git currently writes those objects out to disk and gives them the age of the packfile they are contained in, which I think is the source of the bug.
We have a reference for those objects from the reflogs, know they aren't referenced anywhere else, so those objects to me are the age of those reflog entries: 90 days. As such, they are "old enough" and should be deleted. With big repos, it's easy to get into situations where there are well more than 10,000 objects satisfying these conditions. In fact, it's not all that rare that the repo has far more loose objects after a git gc than before. I never got around to fixing it properly, though, because 'git prune' is such a handy workaround for now. Having people nuke all their loose objects is a bit risky, but the only other workaround people had was to re-clone (waiting the requisite 20+ minutes for the repo to download) and throw away their old clone. (Though some people even went that route, IIRC.) > If the pack they were in is less than 2 weeks old, then they are > unreachable but "fresh", and are intentionally retained. If it was older > than 2 weeks, they should have been deleted. That's what's currently implemented, yes, but as above I think that logic is bad. > If you don't like that policy, you can set gc.pruneExpire to something > lower (including "now", but watch out, as that can race with other > operations!). But I don't think that's an implementation short-coming. > It's intentional to keep those objects. It's just that the format > they're in is inefficient and confuses later gc runs. I'm aware of pruneExpire, but I think it's the wrong way to handle this. I totally agree with your general plan to put unreferenced loose objects into a pack. However, I don't think these objects should be part of that pack; they should just be deleted instead.
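The explosion being discussed is easy to reproduce with plain repack. A minimal sketch (throwaway repo; gc itself runs `repack -A -d` with an additional `--unpack-unreachable=<time>` cutoff, which this sketch omits): an object packed while reachable comes back out as a loose file once its last ref goes away.

```shell
# An object that was packed while reachable gets exploded back out as
# a loose file once it becomes unreachable and "repack -A -d" runs.
set -e
cd "$(mktemp -d)"
git init -q .
git config user.email you@example.com   # throwaway identity
git config user.name you
git commit -q --allow-empty -m base
blob=$(echo cruft | git hash-object -w --stdin)
git tag keep "$blob"                    # make the blob reachable
git repack -a -d -q                     # now it lives in a pack
git tag -d keep >/dev/null              # ...and is unreachable again
git repack -A -d -q                     # -A explodes unreachable objects loose
loose=.git/objects/$(echo "$blob" | sed 's,^..,&/,')
test -f "$loose" && echo exploded
```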
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 01:37:53PM -0700, Jonathan Nieder wrote: > >and I think the > > right solution is to pack the unreachable objects instead of exploding > > them. > > That seems like a huge widening in scope relative to what this > original patch tackled. I'm not at all saying that Jonathan needs to work on that. It's a big topic that we've been putting off for years. I _am_ saying that I don't think his patch as-is should be merged, as it makes the daemonized behavior here worse. > I'm not aware of a way to do this without > breaking compatibility with the current (broken) race prevention. If > you're saying that breaking compatibility in that way is okay with > you, then great! It depends what you're trying to fix. If you're trying to fix auto-gc outputting an annoying message (or repeatedly doing useless invocations of pack-objects), you don't have to break compatibility at all. You just have to put the objects into a pack instead of exploding them loose. Worst case if you use an old implementation to run "git gc", it may accidentally freshen a cruft pack's timestamp. If you want to improve the raciness, then yes, I think you need stronger rules around marking packs as cruft/garbage, and making operations that want to use those objects copy them out to a non-cruft location (like you describe in the hash transition document). There you risk corruption if an old implementation fails to make the non-cruft copy. But that's really no worse than we are today. So I don't even really see it as a backwards-compatibility break. But even if it were, I'd probably still be fine with it. This is a local-repo thing, and in the absolute worst case we could bump core.repositoryformatversion (though again, I don't even think that would be necessary since it would degrade to behaving just as the current code does). > > With the current code, that produces a bunch of annoying warnings for > > every commit ("I couldn't gc because the last gc reported a warning"). 
> > But after Jonathan's patch, every single commit will do a full gc of the > > repository. In this tiny repository that's relatively quick, but in a > > real repo it's a ton of CPU and I/O, all for nothing. > > I see. Do I understand correctly that if we find a way to print the > warning but not error out, that would be good enough for you? Yes. I thought that's what I proposed originally. ;) The key thing is that the presence of the warning/log still suppresses the actual gc for the gc.logExpiry period. > >> Have you looked over the discussion in "Loose objects and unreachable > >> objects" in Documentation/technical/hash-function-transition.txt? Do > >> you have thoughts on it (preferably in a separate thread)? > > > > It seems to propose putting the unreachable objects into a pack. Which > > yes, I absolutely agree with (as I thought I'd been saying in every > > single email in this thread). > > I figured you were proposing something like > https://public-inbox.org/git/20180113100734.ga30...@sigill.intra.peff.net/, > which is still racy (because it does not handle the freshening in a safe > way). That message is just me telling somebody that their idea _won't_ work. ;) But I assume you mean the general idea of putting things in a cruft pack[1]. Yes, I was only suggesting going that far as a solution to the auto-gc woes. Solving races is another issue entirely (though obviously it may make sense to build on the single-pack work, if it exists). [1] The best discussion on that is, I think: https://public-inbox.org/git/20170610080626.sjujpmgkli4mu...@sigill.intra.peff.net/ which I think I linked earlier, so possibly that is even what you meant. :) > What decides it for me is that the user did not invoke "git gc --auto" > explicitly, so anything "git gc --auto" prints is tangential to what > the user was trying to do. If the gc failed, that is worth telling > them, but if e.g.
it encountered a disk I/O error while reading and > succeeded on retry (to make up a fake example), then that's likely > worth logging to syslog but it's not something the user asked to be > directly informed about. I'm not sure I agree. If a repack discovered that you had a bit corruption on disk, but you happened to have another copy of the object available that made the repack succeed, I'd like to know that I'm getting bit corruptions, and earlier is better. I think we need to surface that information to the user eventually, and I don't think we can count on syslog being universally available. By definition we're daemonized here, so we can't count on even having access to the user's terminal. But it would make more sense to me if the logic were: - at the end of a "gc --auto" run, gc should write a machine-readable indication of its exit code (either as a final line in the log file, or to a gc.exitcode file). The warnings/errors remain intermingled in the logfile. - on the
Re: [PATCH] gc: do not warn about too many loose objects
Jeff King wrote: > On Mon, Jul 16, 2018 at 01:21:40PM -0700, Elijah Newren wrote: >> Jonathan Nieder wrote: >>> My understanding is that exploding the objects is intentional behavior, >>> to avoid a race where objects are newly referenced while they are being >>> pruned. [...] >> Ah, that's good to know and at least makes sense. It seems somewhat >> odd, though; loose objects that are two weeks old are just as >> susceptible to being referenced anew by new commits, so the default of >> running 'git prune --expire=2.weeks.ago' as gc currently does would >> also be unsafe, wouldn't it? Why is that any more or less unsafe than >> pruning objects only referenced by reflog entries that are more than >> 90 days old? Part of the answer is that this safety feature applies even when reflogs are not in use. Another part is that as you say, you can start referencing an object at the same time as the reflog entry pointing to it is expiring. [...] > That's why we retain non-fresh > objects that are referenced from fresh ones (so as long as you made the > new commit recently, it transitively infers freshness on the old blob), > and why we freshen mtimes when we elide a write for an existing object. > > That's _still_ not race-proof, because none of these operations is > atomic. Indeed. One of the advantages of using a packfile for unreachable objects is that metadata associated with that packfile can be updated atomically. I don't believe we are very good at transitively propagating freshness today, by the way. Thanks, Jonathan
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 01:16:45PM -0700, Elijah Newren wrote: > >> The basic problem here, at least for us, is that gc has enough > >> information to know it could expunge some objects, but because of how > >> it is structured in terms of several substeps (reflog expiration, > >> repack, prune), the information is lost between the steps and it > >> instead writes them out as unreachable objects. If we could prune (or > >> avoid exploding) loose objects that are only reachable from reflog > >> entries that we are expiring, then the problem goes away for us. (I > >> totally understand that other repos may have enough unreachable > >> objects for other reasons that Peff's suggestion to just pack up > >> unreachable objects is still a really good idea. But on its own, it > >> seems like a waste since it's packing stuff that we know we could just > >> expunge.) > > > > No, we should have expunged everything that could be during the "repack" > > and "prune" steps. We feed the expiration time to repack, so that it > > knows to drop objects entirely instead of exploding them loose. > > Um, except it doesn't actually do that. The testcase I provided shows > that it leaves around 10,000 objects that are totally deletable and > were only previously referenced by reflog entries -- entries that gc > removed without removing the corresponding objects. What's your definition of "totally deletable"? If the pack they were in is less than 2 weeks old, then they are unreachable but "fresh", and are intentionally retained. If it was older than 2 weeks, they should have been deleted. If you don't like that policy, you can set gc.pruneExpire to something lower (including "now", but watch out, as that can race with other operations!). But I don't think that's an implementation short-coming. It's intentional to keep those objects. It's just that the format they're in is inefficient and confuses later gc runs. -Peff
Re: [PATCH] gc: do not warn about too many loose objects
Jeff King wrote: > On Mon, Jul 16, 2018 at 12:54:31PM -0700, Jonathan Nieder wrote: >> Even restricting attention to the warning, not the exit code, I think >> you are saying the current behavior is acceptable. I do not believe >> it is. > > I didn't say that at all. The current situation sucks, Thanks for clarifying. That helps. > and I think the > right solution is to pack the unreachable objects instead of exploding > them. That seems like a huge widening in scope relative to what this original patch tackled. I'm not aware of a way to do this without breaking compatibility with the current (broken) race prevention. If you're saying that breaking compatibility in that way is okay with you, then great! [...] >> I think you are saying Jonathan's patch makes the behavior >> worse, and I'm not seeing it. Can you describe an example user >> interaction where the current behavior would be better than the >> behavior after Jonathan's patch? That might help make this discussion >> more concrete. > > It makes it worse because there is nothing to throttle a "gc --auto" > that is making no progress. [...] > With the current code, that produces a bunch of annoying warnings for > every commit ("I couldn't gc because the last gc reported a warning"). > But after Jonathan's patch, every single commit will do a full gc of the > repository. In this tiny repository that's relatively quick, but in a > real repo it's a ton of CPU and I/O, all for nothing. I see. Do I understand correctly that if we find a way to print the warning but not error out, that would be good enough for you? [...] >> Have you looked over the discussion in "Loose objects and unreachable >> objects" in Documentation/technical/hash-function-transition.txt? Do >> you have thoughts on it (preferably in a separate thread)? > > It seems to propose putting the unreachable objects into a pack. Which > yes, I absolutely agree with (as I thought I'd been saying in every > single email in this thread).
I figured you were proposing something like https://public-inbox.org/git/20180113100734.ga30...@sigill.intra.peff.net/, which is still racy (because it does not handle the freshening in a safe way). [...] > Even if we were going to remove this message to help the > daemonized case, I think we'd want to retain it for the non-daemon case. Interesting. That should be doable, e.g. following the approach described below. [...] >> A simple way to do that without changing formats is to truncate the >> file when exiting with status 0. > > That's a different behavior than what we do now (and what was suggested > earlier), in that it assumes that anything written to stderr is OK for > gc to hide from the user if the process exits with code zero. > > That's probably OK in most cases, though I wonder if there are corner > cases. For example, if you have a corrupt ref, we used to say "error: > refs/heads/foo does not point to a valid object!" but otherwise ignore > it. These days git-gc sets REF_PARANOIA=1, so we'll actually barf on a > corrupt ref. But I wonder if there are other cases lurking. What decides it for me is that the user did not invoke "git gc --auto" explicitly, so anything "git gc --auto" prints is tangential to what the user was trying to do. If the gc failed, that is worth telling them, but if e.g. it encountered a disk I/O error while reading and succeeded on retry (to make up a fake example), then that's likely worth logging to syslog but it's not something the user asked to be directly informed about. Thanks, Jonathan
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 01:21:40PM -0700, Elijah Newren wrote: > > My understanding is that exploding the objects is intentional behavior, > > to avoid a race where objects are newly referenced while they are being > > pruned. > > > > I am not a fan of that behavior. It's still racy. But when we've > > brought it up in the past, the consensus seems to have been that it's > > better than nothing. Documentation/technical/hash-function-transition.txt > > section "Loose objects and unreachable objects" describes a way to > > eliminate the race. > > Ah, that's good to know and at least makes sense. It seems somewhat > odd, though; loose objects that are two weeks old are just as > susceptible to being referenced anew by new commits, so the default of > running 'git prune --expire=2.weeks.ago' as gc currently does would > also be unsafe, wouldn't it? Why is that any more or less unsafe than > pruning objects only referenced by reflog entries that are more than > 90 days old? The 2-week safety isn't primarily about things which just became unreferenced. It's about things which are in the act of being referenced. Imagine a "git commit" racing with a "git prune". The commit has to create an object, and then it will update a ref to point to it. But between those two actions, prune may racily delete the object! The mtime grace period is what makes that work. Using 2 weeks is sort of ridiculous for that. But it also helps with manual recovery (e.g., imagine a blob added to the index but never committed; 3 days later you may want to try to recover your old work). And you're correct that a new git-commit may still reference an old object (e.g., a blob that's 5 seconds shy of being 2 weeks old that you're including in a new commit). That's why we retain non-fresh objects that are referenced from fresh ones (so as long as you made the new commit recently, it transitively infers freshness on the old blob), and why we freshen mtimes when we elide a write for an existing object.
That's _still_ not race-proof, because none of these operations is atomic. git-prune can decide the blob is unfresh at the exact moment you're creating the commit object. -Peff
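The grace period Peff describes is driven purely by loose-object mtimes, which is easy to observe directly. A minimal sketch (throwaway repo; `touch -d` as used here is GNU coreutils): a backdated unreachable object is pruned, while an identical-but-fresh one survives.

```shell
# git prune honors mtimes: past the --expire cutoff an unreachable
# object is deleted, while a fresh one is kept for possible recovery
# or for a racing operation that is about to reference it.
set -e
cd "$(mktemp -d)"
git init -q .
old=$(echo old-data | git hash-object -w --stdin)
new=$(echo new-data | git hash-object -w --stdin)
oldpath=.git/objects/$(echo "$old" | sed 's,^..,&/,')
touch -d '3 weeks ago' "$oldpath"   # pretend it has been lying around
git prune --expire=2.weeks.ago
! git cat-file -e "$old"            # past the grace period: pruned
git cat-file -e "$new"              # still fresh: retained
```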
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 12:54:31PM -0700, Jonathan Nieder wrote: > Even restricting attention to the warning, not the exit code, I think > you are saying the current behavior is acceptable. I do not believe > it is. I didn't say that at all. The current situation sucks, and I think the right solution is to pack the unreachable objects instead of exploding them. > I think you are saying Jonathan's patch makes the behavior > worse, and I'm not seeing it. Can you describe an example user > interaction where the current behavior would be better than the > behavior after Jonathan's patch? That might help make this discussion > more concrete. It makes it worse because there is nothing to throttle a "gc --auto" that is making no progress. Try this:

-- >8 --
#!/bin/sh
rm -rf repo
# new mostly-empty repo
git init repo
cd repo
git commit --allow-empty -m base
# make a bunch of unreachable blobs
for i in $(seq 7000); do
	echo "blob"
	echo "data < [...]

> [...] > > See the thread I linked earlier on putting unreachable objects into a > pack, which I think is the real solution. > Have you looked over the discussion in "Loose objects and unreachable > objects" in Documentation/technical/hash-function-transition.txt? Do > you have thoughts on it (preferably in a separate thread)? It seems to propose putting the unreachable objects into a pack. Which yes, I absolutely agree with (as I thought I'd been saying in every single email in this thread). > > I mean that if you do not write a persistent log, then "gc --auto" will > > do an unproductive gc every time it is invoked. I.e., it will see "oh, > > there are too many loose objects", and then waste a bunch of CPU every > > time you commit. > > If so, then this would already be the behavior in non-daemon mode. > Are you saying this accidental effect of daemon mode is in fact > intentional? I'm not sure if I'd call it intentional, since I don't recall ever seeing this side effect discussed.
But since daemonizing is the default, I think that's the case people have focused on when they hit annoying cases. E.g., a831c06a2b (gc: ignore old gc.log files, 2017-02-10). I'd consider the fact that "gc --auto" with gc.autoDetach=false will repeatedly do useless work to be a minor bug. But I think Jonathan's patch makes it even worse because we do not even tell people that useless work is being done (while making them wait for each gc to finish!). Even if we were going to remove this message to help the daemonized case, I think we'd want to retain it for the non-daemon case. > > A daemonized git-gc runs a bunch of sub-commands (e.g., "git prune") > > with their stderr redirected into the logfile. If you want to have > > warnings go somewhere else, then either: > > > > - you need some way to tell those sub-commands to send the warnings > > elsewhere (i.e., _not_ stderr) > > > > or > > > > - you have to post-process the output they send to separate warnings > > from other errors. Right now that means scraping. If you are > > proposing a system of machine-readable output, it would need to work > > not just for git-gc, but for every sub-command it runs. > > or > > - you can simply record and trust the exit code > > A simple way to do that without changing formats is to truncate the > file when exiting with status 0. That's a different behavior than what we do now (and what was suggested earlier), in that it assumes that anything written to stderr is OK for gc to hide from the user if the process exits with code zero. That's probably OK in most cases, though I wonder if there are corner cases. For example, if you have a corrupt ref, we used to say "error: refs/heads/foo does not point to a valid object!" but otherwise ignore it. These days git-gc sets REF_PARANOIA=1, so we'll actually barf on a corrupt ref. But I wonder if there are other cases lurking. -Peff
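Jonathan's "truncate the file when exiting with status 0" idea can be sketched as a wrapper (hypothetical, not git behavior; the point is that only a failed gc leaves anything in gc.log for later runs to report):

```shell
# Hypothetical wrapper around "gc --auto": clear gc.log on success so
# that a stale log cannot suppress or fail future runs.
set -e
cd "$(mktemp -d)"
git init -q .
gitdir=$(git rev-parse --git-dir)
if git gc --auto --quiet; then
  : > "$gitdir/gc.log"     # success: clear any stale log
fi
test -s "$gitdir/gc.log" || echo "gc.log is empty"
```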
Re: [PATCH] gc: do not warn about too many loose objects
Hi, On Mon, Jul 16, 2018 at 12:19 PM, Jonathan Nieder wrote: > Elijah Newren wrote: > >> The basic problem here, at least for us, is that gc has enough >> information to know it could expunge some objects, but because of how >> it is structured in terms of several substeps (reflog expiration, >> repack, prune), the information is lost between the steps and it >> instead writes them out as unreachable objects. If we could prune (or >> avoid exploding) loose objects that are only reachable from reflog >> entries that we are expiring, then the problem goes away for us. > > My understanding is that exploding the objects is intentional behavior, > to avoid a race where objects are newly referenced while they are being > pruned. > > I am not a fan of that behavior. It's still racy. But when we've > brought it up in the past, the consensus seems to have been that it's > better than nothing. Documentation/technical/hash-function-transition.txt > section "Loose objects and unreachable objects" describes a way to > eliminate the race. Ah, that's good to know and at least makes sense. It seems somewhat odd, though; loose objects that are two weeks old are just as susceptible to being referenced anew by new commits, so the default of running 'git prune --expire=2.weeks.ago' as gc currently does would also be unsafe, wouldn't it? Why is that any more or less unsafe than pruning objects only referenced by reflog entries that are more than 90 days old? Elijah
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 12:52 PM, Jeff King wrote: > On Mon, Jul 16, 2018 at 12:15:05PM -0700, Elijah Newren wrote: > >> The basic problem here, at least for us, is that gc has enough >> information to know it could expunge some objects, but because of how >> it is structured in terms of several substeps (reflog expiration, >> repack, prune), the information is lost between the steps and it >> instead writes them out as unreachable objects. If we could prune (or >> avoid exploding) loose objects that are only reachable from reflog >> entries that we are expiring, then the problem goes away for us. (I >> totally understand that other repos may have enough unreachable >> objects for other reasons that Peff's suggestion to just pack up >> unreachable objects is still a really good idea. But on its own, it >> seems like a waste since it's packing stuff that we know we could just >> expunge.) > > No, we should have expunged everything that could be during the "repack" > and "prune" steps. We feed the expiration time to repack, so that it > knows to drop objects entirely instead of exploding them loose. Um, except it doesn't actually do that. The testcase I provided shows that it leaves around 10,000 objects that are totally deletable and were only previously referenced by reflog entries -- entries that gc removed without removing the corresponding objects. I will note that my testcase was slightly out-of-date; with current git it needs a call to 'wait_for_background_gc_to_finish' right before the 'git gc --quiet' to avoid erroring out. > You could literally just do:
>
>     find .git/objects/?? -type f |
>     perl -lne 'm{(..)/(.{38})$} and print "$1$2"' |
>     git pack-objects .git/objects/pack/cruft-pack
>
> But: > > - that will explode them out only to repack them, which is inefficient > (if they're already packed, you can probably reuse deltas, not to > mention the I/O savings) > > - there's the question of how to handle timestamps.
Some of those > objects may have been _about_ to expire, but now you've just put > them in a brand-new pack that adds another 2 weeks to their life > > - the find above is sloppy, and will race with somebody adding new > objects to the repo > > So probably you want to have pack-objects write out the list of objects > it _would_ explode, rather than exploding them. And then before > git-repack deletes the old packs, put those into a new cruft pack. That > _just_ leaves the timestamp issue (which is discussed at length in the > thread I linked earlier). > >> git_actual_garbage_collect() { >> GITDIR=$(git rev-parse --git-dir) >> >> # Record all revisions stored in reflog before and after gc >> git rev-list --no-walk --reflog >$GITDIR/gc.original-refs >> git gc --auto >> wait_for_background_gc_to_finish >> git rev-list --no-walk --reflog >$GITDIR/gc.final-refs >> >> # Find out which reflog entries were removed >> DELETED_REFS=$(comm -23 <(sort $GITDIR/gc.original-refs) <(sort >> $GITDIR/gc.final-refs)) > > This is too detailed, I think. There are other reasons to have > unreachable objects than expired reflogs. I think you really just want > to consider all unreachable objects (like the pack-objects thing I > mentioned above). Yes, like I said, coarse workaround and I never had time to create a real fix. But I thought the testcase might be useful as a demonstration of how git gc leaves around loose objects that were previously referenced by reflogs that gc itself pruned.
Re: [PATCH] gc: do not warn about too many loose objects
Jeff King wrote: > On Mon, Jul 16, 2018 at 12:09:49PM -0700, Jonathan Nieder wrote: >> Jeff King wrote: >>> Er, the action is to run "git prune", like the warning says. :) >> >> I don't think we want to recommend that, especially when "git gc --auto" >> does the right thing automatically. > > But that's the point. This warning is written literally after running > "git gc --auto" _didn't_ do the right thing. Yes, it would be nicer if > it could do the right thing. But it doesn't yet know how to. I think we fundamentally disagree, and I would like your help getting past this impasse. Even restricting attention to the warning, not the exit code, I think you are saying the current behavior is acceptable. I do not believe it is. I think you are saying Jonathan's patch makes the behavior worse, and I'm not seeing it. Can you describe an example user interaction where the current behavior would be better than the behavior after Jonathan's patch? That might help make this discussion more concrete. [...] > See the thread I linked earlier on putting unreachable objects into a > pack, which I think is the real solution. Have you looked over the discussion in "Loose objects and unreachable objects" in Documentation/technical/hash-function-transition.txt? Do you have thoughts on it (preferably in a separate thread)? [...] >> This sounds awful. It sounds to me like you're saying "git gc --auto" >> is saying "I just did the wrong thing, and here is how you can clean >> up after me". That's not how I want a program to behave. > > Sure, it would be nice if it did the right thing. Nobody has written > that yet. Until they do, we have to deal with the fallout. "git gc --auto" is already doing the right thing. [...] > I mean that if you do not write a persistent log, then "gc --auto" will > do an unproductive gc every time it is invoked. I.e., it will see "oh, > there are too many loose objects", and then waste a bunch of CPU every > time you commit. 
If so, then this would already be the behavior in non daemon mode. Are you saying this accidental effect of daemon mode is in fact intentional? [...] >>> But in practice, I think you have to >>> resort to scraping anyway, if you are interested in treating warnings >>> from sub-processes the same way. >> >> Can you say more about this for me? I am not understanding what >> you're saying necessitates scraping the output. I would strongly >> prefer to avoid scraping the output. > > A daemonized git-gc runs a bunch of sub-commands (e.g., "git prune") > with their stderr redirected into the logfile. If you want to have > warnings go somewhere else, then either: > > - you need some way to tell those sub-commands to send the warnings > elsewhere (i.e., _not_ stderr) > > or > > - you have to post-process the output they send to separate warnings > from other errors. Right now that means scraping. If you are > proposing a system of machine-readable output, it would need to work > not just for git-gc, but for every sub-command it runs. or - you can simply record and trust the exit code A simple way to do that without changing formats is to truncate the file when exiting with status 0. Thanks, Jonathan
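[Editorial note: Jonathan's "truncate the file when exiting with status 0" idea could look roughly like the sketch below. `run_logged` is a hypothetical wrapper invented for illustration, not anything git implements; the point is that a non-empty log then reliably means the last run failed, without scraping.]

```shell
# Sketch of truncate-on-success: run a command with stderr captured in
# a log file, and empty the log whenever the command exits 0, so that
# "log file is non-empty" means "last run actually failed".
run_logged() {
    logfile=$1; shift
    "$@" 2>"$logfile"
    rc=$?
    if [ "$rc" -eq 0 ]; then
        : >"$logfile"   # success: any warnings logged were not errors
    fi
    return $rc
}
```

A later reader (such as the next `gc --auto`) can then trust the file's emptiness instead of inspecting its contents.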
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 12:15:05PM -0700, Elijah Newren wrote: > The basic problem here, at least for us, is that gc has enough > information to know it could expunge some objects, but because of how > it is structured in terms of several substeps (reflog expiration, > repack, prune), the information is lost between the steps and it > instead writes them out as unreachable objects. If we could prune (or > avoid exploding) loose objects that are only reachable from reflog > entries that we are expiring, then the problem goes away for us. (I > totally understand that other repos may have enough unreachable > objects for other reasons that Peff's suggestion to just pack up > unreachable objects is still a really good idea. But on its own, it > seems like a waste since it's packing stuff that we know we could just > expunge.) No, we should have expunged everything that could be during the "repack" and "prune" steps. We feed the expiration time to repack, so that it knows to drop objects entirely instead of exploding them loose. So the leftovers really are objects that cannot be deleted yet. You could literally just do: find .git/objects/?? -type f | perl -lne 'm{(..)/(.{38})$} and print "$1$2"' | git pack-objects .git/objects/pack/cruft-pack But: - that will explode them out only to repack them, which is inefficient (if they're already packed, you can probably reuse deltas, not to mention the I/O savings) - there's the question of how to handle timestamps. Some of those objects may have been _about_ to expire, but now you've just put them in a brand-new pack that adds another 2 weeks to their life - the find above is sloppy, and will race with somebody adding new objects to the repo So probably you want to have pack-objects write out the list of objects it _would_ explode, rather than exploding them. And then before git-repack deletes the old packs, put those into a new cruft pack. 
That _just_ leaves the timestamp issue (which is discussed at length in the thread I linked earlier). > git_actual_garbage_collect() { > GITDIR=$(git rev-parse --git-dir) > > # Record all revisions stored in reflog before and after gc > git rev-list --no-walk --reflog >$GITDIR/gc.original-refs > git gc --auto > wait_for_background_gc_to_finish > git rev-list --no-walk --reflog >$GITDIR/gc.final-refs > > # Find out which reflog entries were removed > DELETED_REFS=$(comm -23 <(sort $GITDIR/gc.original-refs) <(sort > $GITDIR/gc.final-refs)) This is too detailed, I think. There are other reasons to have unreachable objects than expired reflogs. I think you really just want to consider all unreachable objects (like the pack-objects thing I mentioned above). -Peff
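[Editorial note: Peff's "consider all unreachable objects" could be approximated today by parsing `git fsck --unreachable`, whose output lines have the form `unreachable <type> <id>`. The `unreachable_ids` helper below is an illustrative sketch, not git's own code, and inherits fsck's cost on large repositories.]

```shell
# Sketch: extract object ids of all unreachable objects from fsck
# output lines of the form "unreachable <type> <id>". This catches
# every unreachable object, not just those freed by reflog expiry.
unreachable_ids() {
    awk '$1 == "unreachable" { print $3 }'
}

# git fsck --unreachable | unreachable_ids
```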
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 12:09:49PM -0700, Jonathan Nieder wrote: > >>> So while I completely agree that it's not a great thing to encourage > >>> users to blindly run "git prune", I think it _is_ actionable. > >> > >> To flesh this out a little more: what user action do you suggest? Could > >> we carry out that action automatically? > > > > Er, the action is to run "git prune", like the warning says. :) > > I don't think we want to recommend that, especially when "git gc --auto" > does the right thing automatically. But that's the point. This warning is written literally after running "git gc --auto" _didn't_ do the right thing. Yes, it would be nicer if it could do the right thing. But it doesn't yet know how to. See the thread I linked earlier on putting unreachable objects into a pack, which I think is the real solution. > > The warning that is deleted by this patch is: you _just_ ran gc, and hey > > we even did it automatically for you, but we're still in a funky state > > afterwards. You might need to clean up this state. > > This sounds awful. It sounds to me like you're saying "git gc --auto" > is saying "I just did the wrong thing, and here is how you can clean > up after me". That's not how I want a program to behave. Sure, it would be nice if it did the right thing. Nobody has written that yet. Until they do, we have to deal with the fallout. > > If you do that without anything further, then it will break the > > protection against repeatedly running auto-gc, as I described in the > > previous email. > > Do you mean ratelimiting for the message, or do you actually mean > repeatedly running auto-gc itself? > > If we suppress warnings, there would still be a gc.log while "git gc > --auto" is running, just as though there had been no warnings at all. > I believe this is close to the intended behavior; it's the same as > what you'd get without daemon mode, except that you lose the warning. 
I mean that if you do not write a persistent log, then "gc --auto" will do an unproductive gc every time it is invoked. I.e., it will see "oh, there are too many loose objects", and then waste a bunch of CPU every time you commit. > > Any of those would work similarly to the "just detect warnings" I > > suggested earlier, with respect to keeping the "1 day" expiration > > intact, so I'd be OK with them. In theory they'd be more robust than > > scraping the "warning:" prefix. But in practice, I think you have to > > resort to scraping anyway, if you are interested in treating warnings > > from sub-processes the same way. > > Can you say more about this for me? I am not understanding what > you're saying necessitates scraping the output. I would strongly > prefer to avoid scraping the output. A daemonized git-gc runs a bunch of sub-commands (e.g., "git prune") with their stderr redirected into the logfile. If you want to have warnings go somewhere else, then either: - you need some way to tell those sub-commands to send the warnings elsewhere (i.e., _not_ stderr) or - you have to post-process the output they send to separate warnings from other errors. Right now that means scraping. If you are proposing a system of machine-readable output, it would need to work not just for git-gc, but for every sub-command it runs. -Peff
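[Editorial note: the "post-process the output" option Peff describes — scraping the conventional `warning:` prefix out of a gc.log — might look like this sketch. `split_gc_log` is a hypothetical name, and the approach is fragile by design: it only works while every sub-command keeps using that prefix.]

```shell
# Sketch of the scraping option: split a gc.log into warning lines and
# everything else, keying on git's conventional "warning:" prefix.
split_gc_log() {
    logfile=$1
    grep '^warning:' "$logfile" > "$logfile.warnings" || true
    grep -v '^warning:' "$logfile" > "$logfile.errors" || true
}
```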
Re: [PATCH] gc: do not warn about too many loose objects
Hi, Elijah Newren wrote: > The basic problem here, at least for us, is that gc has enough > information to know it could expunge some objects, but because of how > it is structured in terms of several substeps (reflog expiration, > repack, prune), the information is lost between the steps and it > instead writes them out as unreachable objects. If we could prune (or > avoid exploding) loose objects that are only reachable from reflog > entries that we are expiring, then the problem goes away for us. My understanding is that exploding the objects is intentional behavior, to avoid a race where objects are newly referenced while they are being pruned. I am not a fan of that behavior. It's still racy. But when we've brought it up in the past, the consensus seems to have been that it's better than nothing. Documentation/technical/hash-function-transition.txt section "Loose objects and unreachable objects" describes a way to eliminate the race. Thanks and hope that helps, Jonathan
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 10:27 AM, Jonathan Tan wrote: > In a087cc9819 ("git-gc --auto: protect ourselves from accumulated > cruft", 2007-09-17), the user was warned if there were too many > unreachable loose objects. This made sense at the time, because gc > couldn't prune them safely. But subsequently, git prune learned the > ability to not prune recently created loose objects, making pruning able > to be done more safely, and gc was made to automatically prune old > unreachable loose objects in 25ee9731c1 ("gc: call "prune --expire > 2.weeks.ago" by default", 2008-03-12). ... > > --- > This was noticed when a daemonized gc run wrote this warning to the log > file, and returned 0; but a subsequent run merely read the log file, saw > that it is non-empty and returned -1 (which is inconsistent in that such > a run should return 0, as it did the first time). Yeah, we've hit this several times too. I even created a testcase and a workaround, though I never got it into proper submission form. The basic problem here, at least for us, is that gc has enough information to know it could expunge some objects, but because of how it is structured in terms of several substeps (reflog expiration, repack, prune), the information is lost between the steps and it instead writes them out as unreachable objects. If we could prune (or avoid exploding) loose objects that are only reachable from reflog entries that we are expiring, then the problem goes away for us. (I totally understand that other repos may have enough unreachable objects for other reasons that Peff's suggestion to just pack up unreachable objects is still a really good idea. But on its own, it seems like a waste since it's packing stuff that we know we could just expunge.) Anyway, my very rough testcase is below. The workaround is the git_actual_garbage_collect() function (minus the call to wait_for_background_gc_to_finish). 
Elijah --- wait_for_background_gc_to_finish() { while ( ps -ef | grep -v grep | grep --quiet git.gc.--auto ); do sleep 1; done } git_standard_garbage_collect() { # Current git gc sprays unreachable objects back in loose form; this is # fine in many cases, but is annoying when done with objects which # newly become unreachable because of something else git-gc does and # git-gc doesn't clean them up. git gc --auto wait_for_background_gc_to_finish } git_actual_garbage_collect() { GITDIR=$(git rev-parse --git-dir) # Record all revisions stored in reflog before and after gc git rev-list --no-walk --reflog >$GITDIR/gc.original-refs git gc --auto wait_for_background_gc_to_finish git rev-list --no-walk --reflog >$GITDIR/gc.final-refs # Find out which reflog entries were removed DELETED_REFS=$(comm -23 <(sort $GITDIR/gc.original-refs) <(sort $GITDIR/gc.final-refs)) # Get the list of objects which used to be reachable, but were made # unreachable due to gc's reflog expiration. To get these, I need # the intersection of things reachable from $DELETED_REFS and things # which are unreachable now. 
git rev-list --objects $DELETED_REFS --not --all --reflog | awk '{print $1}' >$GITDIR/gc.previously-reachable-objects git prune --expire=now --dry-run | awk '{print $1}' > $GITDIR/gc.unreachable-objects # Delete all the previously-reachable-objects made unreachable by the # reflog expiration done by git gc comm -12 <(sort $GITDIR/gc.unreachable-objects) <(sort $GITDIR/gc.previously-reachable-objects) | sed -e "s#^\(..\)#$GITDIR/objects/\1/#" | xargs rm } test -d testcase && { echo "testcase exists; exiting"; exit 1; } git init testcase/ cd testcase # Create a basic commit echo hello >world git add world git commit -q -m "Initial" # Create a commit with lots of files for i in {..}; do echo $i >$i; done git add [0-9]* git commit --quiet -m "Lots of numbers" # Pack it all up git gc --quiet # Stop referencing the garbage git reset --quiet --hard HEAD~1 # Pretend we did all the above stuff 30 days ago for rlog in $(find .git/logs -type f); do # Subtract 3E6 (just over 30 days) from every date (assuming dates have # exactly 10 digits, which just happens to be valid...right now at least) perl -i -ne '/(.*?)(\b[0-9]{10}\b)(.*)/ && print $1,$2-3000000,$3,"\n"' $rlog done # HOWEVER: note that the pack is new; if we make the pack old, the old objects # will get pruned for us. But it is quite common to have new packfiles with # old-and-soon-to-be-unreferenced-objects because frequent gc's mean moving # the objects to new packfiles often, and eventually the reflog is expired. # If you want to test them being part of an old packfile, uncomment this: # find .git/objects/pack -type f | xargs touch -t 21010101 # Create 50 packfiles in the current repo so that 'git gc --auto' will # trigger `git repack -A -d -l` instead of just `git repack -d -l` for i in {01..50}; do git fast-export master | sed -e s/Initial/In
Re: [PATCH] gc: do not warn about too many loose objects
Hi, Jeff King wrote: > On Mon, Jul 16, 2018 at 11:22:07AM -0700, Jonathan Nieder wrote: >> Jeff King wrote: >>> So while I completely agree that it's not a great thing to encourage >>> users to blindly run "git prune", I think it _is_ actionable. >> >> To flesh this out a little more: what user action do you suggest? Could >> we carry out that action automatically? > > Er, the action is to run "git prune", like the warning says. :) I don't think we want to recommend that, especially when "git gc --auto" does the right thing automatically. Can you convince me I'm wrong? [...] >> I mentally make a distinction between this warning from "git gc >> --auto" and the warning from commands like "git gui". [...] > I don't think those warnings are the same. The warning from "git gui" > is: you may benefit from running gc. > > The warning that is deleted by this patch is: you _just_ ran gc, and hey > we even did it automatically for you, but we're still in a funky state > afterwards. You might need to clean up this state. This sounds awful. It sounds to me like you're saying "git gc --auto" is saying "I just did the wrong thing, and here is how you can clean up after me". That's not how I want a program to behave. But there's a simpler explanation for what "git gc --auto" was trying to say, which Jonathan described. [...] >> Yes, this is a real problem, and it would affect any other warning >> (or even GIT_TRACE=1 output) produced by gc --auto as well. I think we >> should consider it a serious bug, separate from the symptom Jonathan is >> fixing. >> >> Unfortunately I don't have a great idea about how to fix it. Should >> we avoid writing warnings to gc.log in daemon mode? That would >> prevent the user from ever seeing the warnings, but because of the >> nature of a warning, that might be reasonable. > > If you do that without anything further, then it will break the > protection against repeatedly running auto-gc, as I described in the > previous email. 
Do you mean ratelimiting for the message, or do you actually mean repeatedly running auto-gc itself? If we suppress warnings, there would still be a gc.log while "git gc --auto" is running, just as though there had been no warnings at all. I believe this is close to the intended behavior; it's the same as what you'd get without daemon mode, except that you lose the warning. [...] >> Should we put warnings >> in a separate log file with different semantics? Should we change the >> format of gc.log to allow more structured information (include 'gc' >> exit code) and perform a two-stage migration to the new format (first >> learn to read the new format, then switch to writing it)? > > Any of those would work similarly to the "just detect warnings" I > suggested earlier, with respect to keeping the "1 day" expiration > intact, so I'd be OK with them. In theory they'd be more robust than > scraping the "warning:" prefix. But in practice, I think you have to > resort to scraping anyway, if you are interested in treating warnings > from sub-processes the same way. Can you say more about this for me? I am not understanding what you're saying necessitates scraping the output. I would strongly prefer to avoid scraping the output. Thanks, Jonathan
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 11:22:07AM -0700, Jonathan Nieder wrote: > > I'm not sure if this tells the whole story. You may still run into a > > case where auto-gc runs over and over during the grace period, because > > you have a bunch of objects which are not reachable and not yet ready to > > be expired. So a gc cannot make forward progress until the 2-week > > expiration, and the way to break the cycle is to run an immediate > > "prune". > > > > So while I completely agree that it's not a great thing to encourage > > users to blindly run "git prune", I think it _is_ actionable. > > To flesh this out a little more: what user action do you suggest? Could > we carry out that action automatically? Er, the action is to run "git prune", like the warning says. :) And no, I do not think we should run it automatically. It deletes objects without respect to the grace period, which may not be what the user wants. (And yes, the existing warning is terrible because it does not at all make clear that following its advice may be dangerous). > I mentally make a distinction between this warning from "git gc > --auto" and the warning from commands like "git gui". The former was > saying "Please run 'git prune', because I'm not going to do it", and > as Jonathan noticed, that's not true any more. The latter says > > This repository currently has approximately %i loose objects. > > To maintain optimal performance it is strongly recommended > that you compress the database. > > Compress the database now? > > which is relevant to the current operation (slowly reading the > repository) and easy to act on (choose "yes"). I don't think those warnings are the same. The warning from "git gui" is: you may benefit from running gc. The warning that is deleted by this patch is: you _just_ ran gc, and hey we even did it automatically for you, but we're still in a funky state afterwards. You might need to clean up this state. 
(As an aside, I think the git-gui warning pre-dates "gc --auto", and probably should just run that rather than implementing its own heuristic). > > I agree that the "-1" return is a little funny. Perhaps on the reading > > side we should detect that the log contains only a "warning" line and > > adjust our exit code accordingly. > > Yes, this is a real problem, and it would affect any other warning > (or even GIT_TRACE=1 output) produced by gc --auto as well. I think we > should consider it a serious bug, separate from the symptom Jonathan is > fixing. > > Unfortunately I don't have a great idea about how to fix it. Should > we avoid writing warnings to gc.log in daemon mode? That would > prevent the user from ever seeing the warnings, but because of the > nature of a warning, that might be reasonable. If you do that without anything further, then it will break the protection against repeatedly running auto-gc, as I described in the previous email. > Should we put warnings > in a separate log file with different semantics? Should we change the > format of gc.log to allow more structured information (include 'gc' > exit code) and perform a two-stage migration to the new format (first > learn to read the new format, then switch to writing it)? Any of those would work similarly to the "just detect warnings" I suggested earlier, with respect to keeping the "1 day" expiration intact, so I'd be OK with them. In theory they'd be more robust than scraping the "warning:" prefix. But in practice, I think you have to resort to scraping anyway, if you are interested in treating warnings from sub-processes the same way. -Peff
Re: [PATCH] gc: do not warn about too many loose objects
Hi, Jeff King wrote: > On Mon, Jul 16, 2018 at 10:27:17AM -0700, Jonathan Tan wrote: >> In a087cc9819 ("git-gc --auto: protect ourselves from accumulated >> cruft", 2007-09-17), the user was warned if there were too many >> unreachable loose objects. This made sense at the time, because gc >> couldn't prune them safely. But subsequently, git prune learned the >> ability to not prune recently created loose objects, making pruning able >> to be done more safely, and gc was made to automatically prune old >> unreachable loose objects in 25ee9731c1 ("gc: call "prune --expire >> 2.weeks.ago" by default", 2008-03-12). >> >> This makes the warning unactionable by the user, as any loose objects >> left are not deleted yet because of safety, and "git prune" is not a >> command that the user is recommended to run directly anyway. > > I'm not sure if this tells the whole story. You may still run into a > case where auto-gc runs over and over during the grace period, because > you have a bunch of objects which are not reachable and not yet ready to > be expired. So a gc cannot make forward progress until the 2-week > expiration, and the way to break the cycle is to run an immediate > "prune". > > So while I completely agree that it's not a great thing to encourage > users to blindly run "git prune", I think it _is_ actionable. To flesh this out a little more: what user action do you suggest? Could we carry out that action automatically? I mentally make a distinction between this warning from "git gc --auto" and the warning from commands like "git gui". The former was saying "Please run 'git prune', because I'm not going to do it", and as Jonathan noticed, that's not true any more. The latter says This repository currently has approximately %i loose objects. To maintain optimal performance it is strongly recommended that you compress the database. Compress the database now? 
which is relevant to the current operation (slowly reading the repository) and easy to act on (choose "yes"). [...] > I agree that the "-1" return is a little funny. Perhaps on the reading > side we should detect that the log contains only a "warning" line and > adjust our exit code accordingly. Yes, this is a real problem, and it would affect any other warning (or even GIT_TRACE=1 output) produced by gc --auto as well. I think we should consider it a serious bug, separate from the symptom Jonathan is fixing. Unfortunately I don't have a great idea about how to fix it. Should we avoid writing warnings to gc.log in daemon mode? That would prevent the user from ever seeing the warnings, but because of the nature of a warning, that might be reasonable. Should we put warnings in a separate log file with different semantics? Should we change the format of gc.log to allow more structured information (include 'gc' exit code) and perform a two-stage migration to the new format (first learn to read the new format, then switch to writing it)? Thanks, Jonathan
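[Editorial note: Jonathan's structured-gc.log idea could be sketched as below. The format — a first line recording the exit status — is entirely hypothetical, as are the helper names; a reader that tolerates both formats would be the first stage of the two-stage migration he describes.]

```shell
# Hypothetical structured gc.log: first line records the exit status,
# the remaining lines are the captured output.
write_gc_log() {   # write_gc_log <logfile> <status> <message...>
    logfile=$1; status=$2; shift 2
    { echo "status: $status"; printf '%s\n' "$@"; } > "$logfile"
}

read_gc_status() {  # print the recorded status; old-format logs yield 0
    case $(head -n 1 "$1" 2>/dev/null) in
        status:\ *) head -n 1 "$1" | cut -d' ' -f2 ;;
        *) echo 0 ;;  # old free-form format: no status was recorded
    esac
}
```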
Re: [PATCH] gc: do not warn about too many loose objects
On Mon, Jul 16, 2018 at 10:27:17AM -0700, Jonathan Tan wrote: > In a087cc9819 ("git-gc --auto: protect ourselves from accumulated > cruft", 2007-09-17), the user was warned if there were too many > unreachable loose objects. This made sense at the time, because gc > couldn't prune them safely. But subsequently, git prune learned the > ability to not prune recently created loose objects, making pruning able > to be done more safely, and gc was made to automatically prune old > unreachable loose objects in 25ee9731c1 ("gc: call "prune --expire > 2.weeks.ago" by default", 2008-03-12). > > This makes the warning unactionable by the user, as any loose objects > left are not deleted yet because of safety, and "git prune" is not a > command that the user is recommended to run directly anyway. I'm not sure if this tells the whole story. You may still run into a case where auto-gc runs over and over during the grace period, because you have a bunch of objects which are not reachable and not yet ready to be expired. So a gc cannot make forward progress until the 2-week expiration, and the way to break the cycle is to run an immediate "prune". So while I completely agree that it's not a great thing to encourage users to blindly run "git prune", I think it _is_ actionable. But... > This was noticed when a daemonized gc run wrote this warning to the log > file, and returned 0; but a subsequent run merely read the log file, saw > that it is non-empty and returned -1 (which is inconsistent in that such > a run should return 0, as it did the first time). Yes, this got much worse with daemonized gc. The warning blocks further runs. And then even after the 2-week period, you still cannot make further progress until the user steps in! I think we dealt with that in a831c06a2b (gc: ignore old gc.log files, 2017-02-10). So now we won't run gc for a day, but eventually we may make further progress. 
So the warning _is_ still doing something useful there (it's blocking immediate auto-gc and kicking in the 1-day threshold). I agree that the "-1" return is a little funny. Perhaps on the reading side we should detect that the log contains only a "warning" line and adjust our exit code accordingly. Ultimately, I think Git should avoid keeping unreachable objects as loose in the first place, which would make this whole problem go away. There's some discussion in this thread: https://public-inbox.org/git/87inc89j38@evledraar.gmail.com/ -Peff
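[Editorial note: Peff's reading-side suggestion — detect that the log contains only warning lines and adjust the exit code — could be sketched like this. `log_is_warnings_only` is an illustrative name; the actual detection git might use is not specified in the thread.]

```shell
# Sketch: succeed only if the gc.log is non-empty and every line
# carries the "warning:" prefix, i.e. the previous run warned but did
# not fail. A caller could then report the log yet still exit 0,
# reserving the failing exit code for real errors.
log_is_warnings_only() {
    [ -s "$1" ] && ! grep -qv '^warning:' "$1"
}
```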