Re: git pull git gc

2015-03-19 Thread Duy Nguyen
On Wed, Mar 18, 2015 at 9:58 PM, John Keeping j...@keeping.me.uk wrote:
 On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote:
 On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen pclo...@gmail.com wrote:
  If not, I made some mistake in analyzing this and we'll start again.

 I did make one mistake, the first gc should have reduced the number
 of loose objects to zero. Why didn't it? I'll come back to this
 tomorrow if nobody finds out first :)

 Most likely they are not referenced by anything but are younger than 2
 weeks.

 I saw a similar issue with automatic gc triggering after every operation
 when I did something equivalent to:

 git add <lots of files>
 git commit
 git reset --hard HEAD^

 which creates a lot of unreachable objects which are not old enough to
 be pruned.

And there's another problem caused by background gc. If it's not run
in background, it should print this

warning: There are too many unreachable loose objects; run 'git prune'
to remove them.

but because background gc does not have access to stdout/stderr
anymore, this is lost.
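
(For anyone bitten by this, a minimal way to surface the warning is to
keep gc in the foreground so it retains stderr; a sketch, assuming a
git new enough (2.0+) to have the gc.autoDetach knob:

  git -c gc.autoDetach=false gc --auto

With autodetach disabled, gc runs synchronously and the "too many
unreachable loose objects" warning lands on the terminal.)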
-- 
Duy


Re: git pull git gc

2015-03-18 Thread Дилян Палаузов

Hello,

# git gc --auto
Auto packing the repository in background for optimum performance.
See "git help gc" for manual housekeeping.

and calls in the background:

25618 1  0 32451   884   1 14:20 ?00:00:00 git gc --auto
25639 25618 51 49076 49428   0 14:20 ?00:00:07 git prune 
--expire 2.weeks.ago


# git count-objects -v
count: 6039
size: 65464
in-pack: 185432
packs: 1
size-pack: 46687
prune-packable: 0
garbage: 0
size-garbage: 0

Regards
  Dilian


On 18.03.2015 15:16, Duy Nguyen wrote:

On Wed, Mar 18, 2015 at 8:53 PM, Дилян Палаузов
dilyan.palau...@aegee.org wrote:

Hello,

I have a local folder with the git-repository (so that its .git/config
contains ([remote "origin"]\nurl = git://github.com/git/git.git\nfetch =
+refs/heads/*:refs/remotes/origin/* )

I do there git pull.

Usually the output is
   Already up-to-date.

but since today it prints
   Auto packing the repository in background for optimum performance.
   See "git help gc" for manual housekeeping.
   Already up-to-date.

and starts in the background a git gc --auto process.  This is all fine,
however, when the git gc process finishes, and I do git pull again, I get
the same message, as above (git gc is again started).


So if you do git gc --auto now, does it exit immediately or go
through the garbage collection process again (it'll print something)?
What does git count-objects -v show?




Re: git pull git gc

2015-03-18 Thread Duy Nguyen
On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen pclo...@gmail.com wrote:
 If not, I made some mistake in analyzing this and we'll start again.

I did make one mistake, the first gc should have reduced the number
of loose objects to zero. Why didn't it? I'll come back to this
tomorrow if nobody finds out first :)
-- 
Duy


Re: git pull git gc

2015-03-18 Thread Дилян Палаузов

Hello Duy,

# ls .git/objects/17/* | wc -l
30

30 * 256 = 7 680 > 6 700

And now? Do I have to run git gc --aggressive?

Kind regards
  Dilian


On 18.03.2015 15:33, Duy Nguyen wrote:

On Wed, Mar 18, 2015 at 9:23 PM, Дилян Палаузов
dilyan.palau...@aegee.org wrote:

Hello,

# git gc --auto
Auto packing the repository in background for optimum performance.
See "git help gc" for manual housekeeping.

and calls in the background:

25618 1  0 32451   884   1 14:20 ?00:00:00 git gc --auto
25639 25618 51 49076 49428   0 14:20 ?00:00:07 git prune --expire
2.weeks.ago

# git count-objects -v
count: 6039


The loose object threshold is 6700, unless you tweaked something. But
there's a tweak; we'll come back to this.


size: 65464
in-pack: 185432
packs: 1


Pack threshold is 50. You only have one pack, good.

OK back to the count 6039 above. You have that many loose objects.
But 'git gc' is lazier than 'git count-objects'. It assumes a flat
distribution, and only counts the number of objects in the
.git/objects/17 directory, then extrapolates for the total number.

So can you see how many files you have in this directory
.git/objects/17? That number, multiplied by 256, should be greater
than 6700. If that's the case, 'git gc' laziness is the problem. If
not, I made some mistake in analyzing this and we'll start again.




Re: git pull git gc

2015-03-18 Thread Duy Nguyen
On Wed, Mar 18, 2015 at 8:53 PM, Дилян Палаузов
dilyan.palau...@aegee.org wrote:
 Hello,

 I have a local folder with the git-repository (so that its .git/config
 contains ([remote "origin"]\nurl = git://github.com/git/git.git\nfetch =
 +refs/heads/*:refs/remotes/origin/* )

 I do there git pull.

 Usually the output is
   Already up-to-date.

 but since today it prints
   Auto packing the repository in background for optimum performance.
   See "git help gc" for manual housekeeping.
   Already up-to-date.

 and starts in the background a git gc --auto process.  This is all fine,
 however, when the git gc process finishes, and I do git pull again, I get
 the same message, as above (git gc is again started).

So if you do git gc --auto now, does it exit immediately or go
through the garbage collection process again (it'll print something)?
What does git count-objects -v show?
-- 
Duy


Re: git pull git gc

2015-03-18 Thread Duy Nguyen
On Wed, Mar 18, 2015 at 9:23 PM, Дилян Палаузов
dilyan.palau...@aegee.org wrote:
 Hello,

 # git gc --auto
 Auto packing the repository in background for optimum performance.
 See git help gc for manual housekeeping.

 and calls in the background:

 25618 1  0 32451   884   1 14:20 ?00:00:00 git gc --auto
 25639 25618 51 49076 49428   0 14:20 ?00:00:07 git prune --expire
 2.weeks.ago

 # git count-objects -v
 count: 6039

The loose object threshold is 6700, unless you tweaked something. But
there's a tweak; we'll come back to this.

 size: 65464
 in-pack: 185432
 packs: 1

Pack threshold is 50. You only have one pack, good.

OK back to the count 6039 above. You have that many loose objects.
But 'git gc' is lazier than 'git count-objects'. It assumes a flat
distribution, and only counts the number of objects in the
.git/objects/17 directory, then extrapolates for the total number.

So can you see how many files you have in this directory
.git/objects/17? That number, multiplied by 256, should be greater
than 6700. If that's the case, 'git gc' laziness is the problem. If
not, I made some mistake in analyzing this and we'll start again.
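
(To see the estimate gc makes, a rough sketch from the shell, assuming
the standard loose-object layout under .git/objects:

  n17=$(ls .git/objects/17 2>/dev/null | wc -l)
  echo "estimated loose objects: $(( n17 * 256 ))"

If that lands at or above the gc.auto threshold, 6700 by default,
'gc --auto' decides a repack is worth doing.)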
-- 
Duy


Re: git pull git gc

2015-03-18 Thread John Keeping
On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote:
 On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen pclo...@gmail.com wrote:
  If not, I made some mistake in analyzing this and we'll start again.
 
 I did make one mistake, the first gc should have reduced the number
 of loose objects to zero. Why didn't it? I'll come back to this
 tomorrow if nobody finds out first :)

Most likely they are not referenced by anything but are younger than 2
weeks.

I saw a similar issue with automatic gc triggering after every operation
when I did something equivalent to:

git add <lots of files>
git commit
git reset --hard HEAD^

which creates a lot of unreachable objects which are not old enough to
be pruned.
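
(A sketch for confirming that, for anyone who wants to check their own
repository:

  git fsck --unreachable --no-reflogs

lists objects reachable from neither refs nor reflogs; the recent ones
among them are what gc keeps around until they age past gc.pruneExpire,
2 weeks by default. fsck can take a while on a big repository.)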


Re: git pull git gc

2015-03-18 Thread Jeff King
On Thu, Mar 19, 2015 at 07:31:48AM +0700, Duy Nguyen wrote:

 Or we could count/estimate the number of loose objects again after
 repack/prune. Then we can maybe have a way to prevent the next gc that
 we know will not improve the situation anyway. One option is to pack
 unreachable objects in a second pack. This would stop the next gc,
 but that would screw prune up because st_mtime info is gone. Maybe we
 just save a file to tell gc to ignore the number of loose objects
 until after a specific date.

I don't think packing the unreachables is a good plan. They just end up
accumulating then, and they never expire, because we keep refreshing
their mtime at each pack (unless you pack them once and then leave them
to expire, but then you end up with a large number of packs).

Keeping a file that says "I ran gc at time T, and there were still N
objects left over" is probably the best bet. When the next gc --auto
runs, if T is recent enough, subtract N from the estimated number of
objects. I'm not sure of the right value for "recent enough" there,
though. If it is too far back, you will not gc when you could. If it is
too close, then you will end up running gc repeatedly, waiting for those
objects to leave the expiration window.

I guess leaving a bunch of loose objects around longer than necessary
isn't the end of the world. It wastes space, but it does not actively
make the rest of git slower (whereas having a large number of packs does
impact performance). So you could probably make "recent enough" be "T >
now - gc.pruneExpire / 4" or something. At most we would try to gc 4
times before dropping unreachable objects, and for the default period,
that's only once every couple days.

-Peff


Re: git pull git gc

2015-03-18 Thread Mike Hommey
On Wed, Mar 18, 2015 at 09:27:22PM -0400, Jeff King wrote:
 On Thu, Mar 19, 2015 at 07:31:48AM +0700, Duy Nguyen wrote:
 
  Or we could count/estimate the number of loose objects again after
  repack/prune. Then we can maybe have a way to prevent the next gc that
  we know will not improve the situation anyway. One option is to pack
  unreachable objects in a second pack. This would stop the next gc,
  but that would screw prune up because st_mtime info is gone. Maybe we
  just save a file to tell gc to ignore the number of loose objects
  until after a specific date.
 
 I don't think packing the unreachables is a good plan. They just end up
 accumulating then, and they never expire, because we keep refreshing
 their mtime at each pack (unless you pack them once and then leave them
 to expire, but then you end up with a large number of packs).

Note, sometimes I wish unreachables were packed. Recently, I ended up in
a situation where running gc created something like 3GB of data as per
du, because I suddenly had something like 600K unreachable objects, each
of them, as a loose object, taking at least 4K on disk. This made my
.git take 5GB instead of 2GB. That surely didn't feel like garbage
collection.

Mike


Re: git pull git gc

2015-03-18 Thread Junio C Hamano
On Wed, Mar 18, 2015 at 6:27 PM, Jeff King p...@peff.net wrote:

 Keeping a file that says "I ran gc at time T, and there were still N
 objects left over" is probably the best bet. When the next gc --auto
 runs, if T is recent enough, subtract N from the estimated number of
 objects. I'm not sure of the right value for "recent enough" there,
 though. If it is too far back, you will not gc when you could. If it is
 too close, then you will end up running gc repeatedly, waiting for those
 objects to leave the expiration window.

 I guess leaving a bunch of loose objects around longer than necessary
 isn't the end of the world. It wastes space, but it does not actively
 make the rest of git slower (whereas having a large number of packs does
 impact performance). So you could probably make "recent enough" be "T >
 now - gc.pruneExpire / 4" or something. At most we would try to gc 4
 times before dropping unreachable objects, and for the default period,
 that's only once every couple days.

We could simply prune unreachables more aggressively, and it would
solve this issue at the root cause, no?

We do keep things reachable from reflogs, so the only thing you are
getting by leaving the unreachables around is for an expert to perform
some forensic analysis---especially if there are so many loose objects
that are all unreachable, nobody sane can go through them one by one
and guess correctly whether each of them is something they would have
wished to keep, had their ancient reflog entries extended a few weeks more.

That is, unless there is some tool to analyse the unreachable loose
objects, collect them into meaningful islands, and present them in
some way that the end user can make sense of, which I do not think
exists (yet).


Re: git pull git gc

2015-03-18 Thread Duy Nguyen
On Thu, Mar 19, 2015 at 4:04 AM, Jeff King p...@peff.net wrote:
 On Wed, Mar 18, 2015 at 02:58:15PM +, John Keeping wrote:

 On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote:
  On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen pclo...@gmail.com wrote:
   If not, I made some mistake in analyzing this and we'll start again.
 
  I did make one mistake, the first gc should have reduced the number
  of loose objects to zero. Why didn't it? I'll come back to this
  tomorrow if nobody finds out first :)

 Most likely they are not referenced by anything but are younger than 2
 weeks.

 I saw a similar issue with automatic gc triggering after every operation
 when I did something equivalent to:

   git add <lots of files>
   git commit
   git reset --hard HEAD^

 which creates a lot of unreachable objects which are not old enough to
 be pruned.

 Yes, this is almost certainly the problem. Though to be pedantic, the
 command above will still have a reflog entry, so the objects will be
 reachable (and packed). But there are other variants that don't leave
 the objects reachable from even reflogs.

 I don't know if there is an easy way around this. Auto-gc's object count
 is making the assumption that running the gc will reduce the number of
 objects, but obviously it does not always do so. We could do a more
 thorough check and find the number of actual packable and prune-able
 objects. The prune-able part of that is easy; just omit objects from
 the count that are newer than 2 weeks. But packable is expensive. You
 would have to compute reachability by walking from the tips. That can
 take tens of seconds on a large repo.

Or we could count/estimate the number of loose objects again after
repack/prune. Then we can maybe have a way to prevent the next gc that
we know will not improve the situation anyway. One option is to pack
unreachable objects in a second pack. This would stop the next gc,
but that would screw prune up because st_mtime info is gone. Maybe we
just save a file to tell gc to ignore the number of loose objects
until after a specific date.
-- 
Duy


Re: git pull git gc

2015-03-18 Thread Jeff King
On Wed, Mar 18, 2015 at 07:27:46PM -0700, Junio C Hamano wrote:

  I guess leaving a bunch of loose objects around longer than necessary
  isn't the end of the world. It wastes space, but it does not actively
  make the rest of git slower (whereas having a large number of packs does
  impact performance). So you could probably make "recent enough" be "T >
  now - gc.pruneExpire / 4" or something. At most we would try to gc 4
  times before dropping unreachable objects, and for the default period,
  that's only once every couple days.
 
 We could simply prune unreachables more aggressively, and it would
 solve this issue at the root cause, no?

Yes, but not too aggressively. You mentioned object archaeology, but my
main interest is avoiding corruption. The mtime check is the thing that
prevents us from pruning objects being used for an operation-in-progress
that has not yet updated a ref.  For some long-running operations, like
adding files to a commit, we take into account references like a blob
being mentioned in the index. But I do not know offhand if there are
other long-running operations that would run into problems if we
shortened the expiration time drastically.  Anything building a
temporary index is potentially problematic.

But if we assume that operations like that tend to create and reference
their objects within a reasonable time period (say, seconds to minutes)
then the current default of 2 weeks is absurd for this purpose.  For
raciness within a single operation, a few seconds is probably enough
(e.g., we may write out a commit object and then update the ref a few
milliseconds later).

The potential for problems is exacerbated by the fact that object `X`
may exist in the filesystem with an old mtime, and then a new operation
wants to reference it. That's made somewhat better by 33d4221
(write_sha1_file: freshen existing objects, 2014-10-15), as before we
could silently turn a file write into a noop. But it's still racy to do:

  git cat-file -e $commit
  git update-ref refs/heads/foo $commit

as we do not update the mtime for a read-only operation like cat-file
(and even if we did, it's still somewhat racy as prune does not
atomically check the mtime and remove the file).

So I think there's definitely some possible danger with dropping the
default prune expiration time.

For a long time GitHub ran with it as 1.hour.ago. We definitely saw some
oddities and corruption over the years that were apparently caused by
over-aggressive pruning and/or raciness. I've fixed a number of bugs,
and things did get better as a result. But I could not say whether all
such problems are gone. These days we do our regular repacks with
--keep-unreachable and almost never prune anything.

It's also not clear whether GitHub represents anything close to normal
use. We have a much smaller array of operations that we perform (most
objects are either from a push, or from a test-merge between a topic
branch and HEAD). But we also have busy repos that are frequently doing
gc in the background (especially because we share object storage, so
activity on another fork can trigger a gc job that affects a whole
repository network). On workstations, I'd guess most git-gc jobs run
during a fairly quiescent period.

All of which is to say that I don't really know the answer, and there
may be dragons. I'd imagine that dropping the default expiration time
from 2 weeks to 1 day would probably be fine. A good way to experiment
would be for some brave souls to set gc.pruneexpire themselves, run with
it for a few weeks or months, and see if anything goes wrong.
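
(For those brave souls, the experiment is one config command per
repository; a sketch using the documented gc.pruneExpire setting:

  git config gc.pruneExpire 1.day.ago

and "git config --unset gc.pruneExpire" restores the 2-week default.)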

-Peff


Re: git pull git gc

2015-03-18 Thread Mike Hommey
On Thu, Mar 19, 2015 at 12:14:53AM -0400, Jeff King wrote:
 On Thu, Mar 19, 2015 at 11:01:17AM +0900, Mike Hommey wrote:
 
   I don't think packing the unreachables is a good plan. They just end up
   accumulating then, and they never expire, because we keep refreshing
   their mtime at each pack (unless you pack them once and then leave them
   to expire, but then you end up with a large number of packs).
  
  Note, sometimes I wish unreachables were packed. Recently, I ended up in
  a situation where running gc created something like 3GB of data as per
  du, because I suddenly had something like 600K unreachable objects, each
  of them, as a loose object, taking at least 4K on disk. This made my
  .git take 5GB instead of 2GB. That surely didn't feel like garbage
  collection.
 
 That's definitely a thing that happens, but it is a bit of a corner
 case. It's unusual to have such a large number of unreferenced objects
 all at once.
 
 I don't suppose you happen to remember the details, but would a lower
 expiration time (e.g., 1 day or 1 hour) have made all of those objects
 go away? Or were they really from some extremely recent event (of
  course, "event" here might just have been "I did a full repack right
  before rewriting history" which would freshen the mtimes on everything
 in the pack).

Unfortunately, I don't know the exact details. But yes, I guess a lower
expiration time might have helped.

 Certainly the loosening behavior for unreachable objects has corner
 cases like this, and they suck when you hit one. Leaving the objects
 packed would be better, but IMHO is not a viable alternative unless
 somebody comes up with a plan for segregating the old objects in a way
 that they actually expire eventually, and don't just keep getting
 repacked and freshened over and over.

It sure is a corner case; otoh, when it happens, every single git
operation calls git gc --auto, which happily spends 5 minutes sucking
CPU only to end up doing nothing in practice. And it adds more salt to
the injury if you are on battery...

6700 loose objects seems easy to reach on a repo with 6M objects...

Mike


Re: git pull git gc

2015-03-18 Thread Jeff King
On Thu, Mar 19, 2015 at 11:15:19AM +0700, Duy Nguyen wrote:

 On Thu, Mar 19, 2015 at 8:27 AM, Jeff King p...@peff.net wrote:
  Keeping a file that says "I ran gc at time T, and there were still N
  objects left over" is probably the best bet. When the next gc --auto
  runs, if T is recent enough, subtract N from the estimated number of
  objects. I'm not sure of the right value for "recent enough" there,
  though. If it is too far back, you will not gc when you could. If it is
  too close, then you will end up running gc repeatedly, waiting for those
  objects to leave the expiration window.
 
 And would not be hard to implement either. git-gc is already prepared
 to deal with stale gc.pid, which would stop git-gc for a day or so
 before it deletes gc.pid and starts anyway. All we need to do is check
 at the end of git-gc, if we know for sure the next 'gc --auto' is a
 waste, then leave gc.pid behind.

That omits the "N objects left over" information. Which I think may be
useful, because otherwise the rule is basically "don't do another gc at
all for X time units". That's OK for most use, but it has its own corner
cases. E.g., imagine you are doing an SVN import that does an auto-gc
check every 1000 commits. You have some unreferenced objects in your
repository. After the first 1000 commits, we do a gc, and then say "wow,
still a lot of cruft; let's block gc for a day". Five minutes later,
after another 1000 commits, we run gc --auto again. It doesn't run
because of the cruft-check, even though there are a _huge_ number of new
packable objects.

If the blocker file tells us "7000 extra objects" and we see that there
are 17,000 in the repo, then we know it's still worth doing the gc
(i.e., we know that we'll probably end up ignoring the 7000 cruft
that didn't get cleaned up last time, but we also know that there are
10,000 new objects).
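
(A sketch of that arithmetic in shell, purely for illustration; the
file name .git/gc-cruft and the one-day window are made up here, not
anything git writes today:

  T=0; N=0
  [ -f .git/gc-cruft ] && read T N < .git/gc-cruft  # "T N" from last gc
  est=$(( $(ls .git/objects/17 2>/dev/null | wc -l) * 256 ))
  if [ "$T" -gt $(( $(date +%s) - 86400 )) ]; then
    est=$(( est - N ))                              # discount known cruft
  fi
  [ "$est" -ge 6700 ] && echo "worth running gc"

With 17,000 estimated and 7,000 known cruft, est lands at 10,000, still
over the 6,700 default, so gc would run.)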

-Peff


Re: git pull git gc

2015-03-18 Thread Duy Nguyen
On Thu, Mar 19, 2015 at 11:20 AM, Jeff King p...@peff.net wrote:
 On Thu, Mar 19, 2015 at 11:15:19AM +0700, Duy Nguyen wrote:

 On Thu, Mar 19, 2015 at 8:27 AM, Jeff King p...@peff.net wrote:
  Keeping a file that says "I ran gc at time T, and there were still N
  objects left over" is probably the best bet. When the next gc --auto
  runs, if T is recent enough, subtract N from the estimated number of
  objects. I'm not sure of the right value for "recent enough" there,
  though. If it is too far back, you will not gc when you could. If it is
  too close, then you will end up running gc repeatedly, waiting for those
  objects to leave the expiration window.

 And would not be hard to implement either. git-gc is already prepared
 to deal with stale gc.pid, which would stop git-gc for a day or so
 before it deletes gc.pid and starts anyway. All we need to do is check
 at the end of git-gc, if we know for sure the next 'gc --auto' is a
 waste, then leave gc.pid behind.

 That omits the "N objects left over" information. Which I think may be
 useful, because otherwise the rule is basically "don't do another gc at
 all for X time units". That's OK for most use, but it has its own corner
 cases.

True. But saving "N objects left over" in a file also has a corner
case. If the user runs 'prune --expire=now' manually, the next 'gc
--auto' still thinks we have that many leftovers and keeps delaying gc
for some more time. Unless we make 'prune' (or any other command that
deletes leftovers) also delete this file. Yeah, maybe saving this
info in a file will work.

 E.g., imagine you are doing an SVN import that does an auto-gc
 check every 1000 commits. You have some unreferenced objects in your
 repository. After the first 1000 commits, we do a gc, and then say "wow,
 still a lot of cruft; let's block gc for a day". Five minutes later,
 after another 1000 commits, we run gc --auto again. It doesn't run
 because of the cruft-check, even though there are a _huge_ number of new
 packable objects.

 If the blocker file tells us "7000 extra objects" and we see that there
 are 17,000 in the repo, then we know it's still worth doing the gc
 (i.e., we know that we'll probably end up ignoring the 7000 cruft
 that didn't get cleaned up last time, but we also know that there are
 10,000 new objects).
-- 
Duy


Re: git pull git gc

2015-03-18 Thread Jeff King
On Thu, Mar 19, 2015 at 11:29:57AM +0700, Duy Nguyen wrote:

  That omits the "N objects left over" information. Which I think may be
  useful, because otherwise the rule is basically "don't do another gc at
  all for X time units". That's OK for most use, but it has its own corner
  cases.
 
 True. But saving "N objects left over" in a file also has a corner
 case. If the user runs 'prune --expire=now' manually, the next 'gc
 --auto' still thinks we have that many leftovers and keeps delaying gc
 for some more time. Unless we make 'prune' (or any other command that
 deletes leftovers) also delete this file. Yeah, maybe saving this
 info in a file will work.

I assumed that the user would not run prune manually, but would run
"git gc --prune=now". And yeah, definitely any time gc runs, it should
update the file (if there are fewer than `gc.auto` objects, I think it
could just delete the file).

We could also apply that rule to any run of "git prune", but my mental
model is that "git gc" is the magical porcelain that will do this stuff
for you, and "git prune" is the plumbing that users shouldn't need to
call themselves. I don't know if that model is shared by users, though. :)

-Peff


Re: git pull git gc

2015-03-18 Thread Jeff King
On Thu, Mar 19, 2015 at 11:01:17AM +0900, Mike Hommey wrote:

  I don't think packing the unreachables is a good plan. They just end up
  accumulating then, and they never expire, because we keep refreshing
  their mtime at each pack (unless you pack them once and then leave them
  to expire, but then you end up with a large number of packs).
 
 Note, sometimes I wish unreachables were packed. Recently, I ended up in
 a situation where running gc created something like 3GB of data as per
 du, because I suddenly had something like 600K unreachable objects, each
 of them, as a loose object, taking at least 4K on disk. This made my
 .git take 5GB instead of 2GB. That surely didn't feel like garbage
 collection.

That's definitely a thing that happens, but it is a bit of a corner
case. It's unusual to have such a large number of unreferenced objects
all at once.

I don't suppose you happen to remember the details, but would a lower
expiration time (e.g., 1 day or 1 hour) have made all of those objects
go away? Or were they really from some extremely recent event (of
course, "event" here might just have been "I did a full repack right
before rewriting history" which would freshen the mtimes on everything
in the pack).

Certainly the loosening behavior for unreachable objects has corner
cases like this, and they suck when you hit one. Leaving the objects
packed would be better, but IMHO is not a viable alternative unless
somebody comes up with a plan for segregating the old objects in a way
that they actually expire eventually, and don't just keep getting
repacked and freshened over and over.

-Peff


Re: git pull git gc

2015-03-18 Thread Duy Nguyen
On Thu, Mar 19, 2015 at 8:27 AM, Jeff King p...@peff.net wrote:
 Keeping a file that says "I ran gc at time T, and there were still N
 objects left over" is probably the best bet. When the next gc --auto
 runs, if T is recent enough, subtract N from the estimated number of
 objects. I'm not sure of the right value for "recent enough" there,
 though. If it is too far back, you will not gc when you could. If it is
 too close, then you will end up running gc repeatedly, waiting for those
 objects to leave the expiration window.

And would not be hard to implement either. git-gc is already prepared
to deal with stale gc.pid, which would stop git-gc for a day or so
before it deletes gc.pid and starts anyway. All we need to do is check
at the end of git-gc, if we know for sure the next 'gc --auto' is a
waste, then leave gc.pid behind.
-- 
Duy


Re: git pull git gc

2015-03-18 Thread Jeff King
On Wed, Mar 18, 2015 at 02:58:15PM +, John Keeping wrote:

 On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote:
  On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen pclo...@gmail.com wrote:
   If not, I made some mistake in analyzing this and we'll start again.
  
  I did make one mistake, the first gc should have reduced the number
  of loose objects to zero. Why didn't it? I'll come back to this
  tomorrow if nobody finds out first :)
 
 Most likely they are not referenced by anything but are younger than 2
 weeks.
 
 I saw a similar issue with automatic gc triggering after every operation
 when I did something equivalent to:
 
   git add <lots of files>
   git commit
   git reset --hard HEAD^
 
 which creates a lot of unreachable objects which are not old enough to
 be pruned.

Yes, this is almost certainly the problem. Though to be pedantic, the
command above will still have a reflog entry, so the objects will be
reachable (and packed). But there are other variants that don't leave
the objects reachable from even reflogs.

I don't know if there is an easy way around this. Auto-gc's object count
is making the assumption that running the gc will reduce the number of
objects, but obviously it does not always do so. We could do a more
thorough check and find the number of actual packable and prune-able
objects. The "prune-able" part of that is easy; just omit objects from
the count that are newer than 2 weeks. But "packable" is expensive. You
would have to compute reachability by walking from the tips. That can
take tens of seconds on a large repo.

You could perhaps cut off the walk early when you hit a packed commit
(this does not strictly imply that all of the related objects are
packed, but it would be good enough for a heuristic). But even that is
probably too expensive for gc --auto.
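
(The cheap half can be approximated from the shell; a sketch relying
only on the mtimes of loose objects in the 256 fan-out directories:

  find .git/objects/[0-9a-f][0-9a-f] -type f -mtime +14 | wc -l

counts the loose objects already older than the default 2-week expiry;
of those, the unreachable ones are what prune would remove.)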

-Peff

PS Note that in git v2.2.0 and up, prune will leave not only recent
   unreachable objects, but older objects which are reachable from those
   recent ones (so that we keep or prune whole chunks of history, rather
   than dropping part and leaving the rest broken). Technically this
   exacerbates the problem (we keep more objects), though I doubt it
   makes much difference in practice (most chunks of history were
   created at similar times, so the mtimes of the whole chunk will be
   close together).


Re: git pull git gc

2015-03-18 Thread Jeff King
On Wed, Mar 18, 2015 at 03:48:42PM +0100, Дилян Палаузов wrote:

 # ls .git/objects/17/* | wc -l
 30
 
 30 * 256 = 7 680 > 6 700
 
 And now? Do I have to run git gc --aggressive?

No, aggressive just controls the time we spend on repacking. If the
guess is correct that the objects are kept because they are unreachable
but recent, then shortening the prune expiration time would get rid of
them. E.g., git gc --prune=1.hour.ago.

That does not solve the underlying problem discussed elsewhere in the
thread, but it would make this particular instance of it go away. :)

-Peff