On Mon, Jul 16, 2018 at 10:27 AM, Jonathan Tan <jonathanta...@google.com> wrote:
> In a087cc9819 ("git-gc --auto: protect ourselves from accumulated
> cruft", 2007-09-17), the user was warned if there were too many
> unreachable loose objects. This made sense at the time, because gc
> couldn't prune them safely. But subsequently, git prune learned the
> ability to not prune recently created loose objects, making pruning able
> to be done more safely, and gc was made to automatically prune old
> unreachable loose objects in 25ee9731c1 ("gc: call "prune --expire
> 2.weeks.ago" by default", 2008-03-12).
...
>
> ---
> This was noticed when a daemonized gc run wrote this warning to the log
> file, and returned 0; but a subsequent run merely read the log file, saw
> that it is non-empty and returned -1 (which is inconsistent in that such
> a run should return 0, as it did the first time).

Yeah, we've hit this several times too.  I even created a testcase and a
workaround, though I never got it into proper submission form.

The basic problem here, at least for us, is that gc has enough
information to know it could expunge some objects, but because of how
it is structured in terms of several substeps (reflog expiration,
repack, prune), the information is lost between the steps and it
instead writes them out as unreachable objects.  If we could prune (or
avoid exploding) loose objects that are only reachable from reflog
entries that we are expiring, then the problem goes away for us.  (I
totally understand that other repos may have enough unreachable
objects for other reasons that Peff's suggestion to just pack up
unreachable objects is still a really good idea.  But on its own, it
seems like a waste since it's packing stuff that we know we could just
expunge.)

Anyway, my very rough testcase is below.  The workaround is the
git_actual_garbage_collect() function (minus the call to
wait_for_background_gc_to_finish).
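
As a rough illustration of the set logic the workaround relies on (a toy
sketch, not part of the testcase): `comm -23` on two sorted lists keeps the
lines found only in the first, `comm -12` keeps the lines found in both, and
the sed step turns an object ID into the path of its loose object file.  The
helper names below (loose_object_path, only_in_first, in_both) are made up
for this example, and temp files stand in for the process substitution used
in the script so that it also runs under plain sh.

```shell
# Toy helpers; names are hypothetical and not part of git or the testcase.

# Map an object ID to its loose-object path under the given git dir,
# e.g. "0123abcd" -> "<gitdir>/objects/01/23abcd".
loose_object_path() {
    printf '%s\n' "$2" | sed -e "s#^\(..\)#$1/objects/\1/#"
}

# Run comm with the given option on sorted copies of two files.
sorted_comm() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort "$2" >"$a"
    sort "$3" >"$b"
    comm "$1" "$a" "$b"
    rm -f "$a" "$b"
}

# Lines present only in the first file (e.g. reflog entries gc removed).
only_in_first() { sorted_comm -23 "$1" "$2"; }

# Lines present in both files (e.g. unreachable AND previously reachable).
in_both() { sorted_comm -12 "$1" "$2"; }
```

With a "before" list and an "after" list, only_in_first gives what
disappeared; in_both then narrows a candidate list down to entries that also
appear in a second list, which is how the workaround picks the loose objects
to delete.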

Elijah

---


wait_for_background_gc_to_finish() {
    while ( ps -ef | grep -v grep | grep --quiet git.gc.--auto ); do
        sleep 1;
    done
}

git_standard_garbage_collect() {
    # Current git gc sprays unreachable objects back in loose form; this is
    # fine in many cases, but is annoying when the objects only became
    # unreachable because of something else git-gc did (e.g. reflog
    # expiration) and git-gc then doesn't clean them up.
    git gc --auto
    wait_for_background_gc_to_finish
}

git_actual_garbage_collect() {
    GITDIR=$(git rev-parse --git-dir)

    # Record all revisions stored in reflog before and after gc
    git rev-list --no-walk --reflog >$GITDIR/gc.original-refs
    git gc --auto
    wait_for_background_gc_to_finish
    git rev-list --no-walk --reflog >$GITDIR/gc.final-refs

    # Find out which reflog entries were removed
    DELETED_REFS=$(comm -23 <(sort $GITDIR/gc.original-refs) \
                            <(sort $GITDIR/gc.final-refs))

    # Get the list of objects which used to be reachable, but were made
    # unreachable due to gc's reflog expiration.  To get these, I need
    # the intersection of things reachable from $DELETED_REFS and things
    # which are unreachable now.
    git rev-list --objects $DELETED_REFS --not --all --reflog \
        | awk '{print $1}' >$GITDIR/gc.previously-reachable-objects
    git prune --expire=now --dry-run | awk '{print $1}' \
        >$GITDIR/gc.unreachable-objects

    # Delete all the previously-reachable-objects made unreachable by the
    # reflog expiration done by git gc
    comm -12 <(sort $GITDIR/gc.unreachable-objects) \
             <(sort $GITDIR/gc.previously-reachable-objects) \
        | sed -e "s#^\(..\)#$GITDIR/objects/\1/#" | xargs rm
}


test -d testcase && { echo "testcase exists; exiting"; exit 1; }
git init testcase/
cd testcase

# Create a basic commit
echo hello >world
git add world
git commit -q -m "Initial"

# Create a commit with lots of files
for i in {0000..9999}; do echo $i >$i; done
git add [0-9]*
git commit --quiet -m "Lots of numbers"

# Pack it all up
git gc --quiet

# Stop referencing the garbage
git reset --quiet --hard HEAD~1

# Pretend we did all the above stuff 30 days ago
for rlog in $(find .git/logs -type f); do
  # Subtract 3E6 (just over 30 days) from every date (assuming dates have
  # exactly 10 digits, which just happens to be valid...right now at least)
  perl -i -ne '/(.*?)(\b[0-9]{10}\b)(.*)/ && print $1,$2-3000000,$3,"\n"' $rlog
done

# HOWEVER: note that the pack is new; if we make the pack old, the old objects
# will get pruned for us.  But it is quite common to have new packfiles with
# old-and-soon-to-be-unreferenced-objects because frequent gc's mean moving
# the objects to new packfiles often, and eventually the reflog is expired.
# If you want to test them being part of an old packfile, uncomment this:
#   find .git/objects/pack -type f | xargs touch -t 200001010101

# Create 50 packfiles in the current repo so that 'git gc --auto' will
# trigger `git repack -A -d -l` instead of just `git repack -d -l`
for i in {01..50}; do
  git fast-export master | sed -e s/Initial/Initi$i/ \
    | git -c fastimport.unpacklimit=0 fast-import --quiet \
    |& grep -v "Not updating refs/heads/master"
done

echo "*** Before gc, reflog refers to garbage-collectible commits: ***"
git rev-list --no-walk --all --reflog
cat .git/logs/refs/heads/master
echo "*** Before gc, everything is packed with no loose objects: ***"
git count-objects -v

git_standard_garbage_collect  # Just `git gc --auto`
#git_actual_garbage_collect    # What I really want from `git gc --auto`

echo -e "\n\n*** After gc, commit garbage collected and objects made loose: ***"
git rev-list --no-walk --all --reflog
cat .git/logs/refs/heads/master
git count-objects -v

echo -e "\n\n*** Now we can trigger the 'too many unreachable' error: ***"
git gc --auto
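
For anyone reproducing this: when auto gc runs daemonized, its warning ends
up in a gc.log file inside the git dir, and later `git gc --auto` runs
report that file instead of doing any work for as long as it is present and
recent.  A small sketch for checking that state follows; report_gc_log is a
hypothetical helper name, and it takes the git dir as an argument (default
.git) rather than asking git for it.

```shell
# Sketch only: report whether a non-empty gc.log is sitting in the git dir.
# report_gc_log is a hypothetical helper, not a git command.
report_gc_log() {
    gitdir=${1:-.git}
    if [ -s "$gitdir/gc.log" ]; then
        echo "gc.log is non-empty:"
        cat "$gitdir/gc.log"
        return 1
    fi
    echo "no pending gc.log"
    return 0
}
```

Run it as `report_gc_log "$(git rev-parse --git-dir)"` after the last step
of the testcase.  Removing the file lets auto gc run again, although the
same warning comes back if the loose objects are still there.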
