Re: git pull git gc
On Wed, Mar 18, 2015 at 9:58 PM, John Keeping <j...@keeping.me.uk> wrote:
> On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote:
> > On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen <pclo...@gmail.com> wrote:
> > > If not, I made some mistake in analyzing this and we'll start again.
> >
> > I did make one mistake: the first gc should have reduced the number of loose
> > objects to zero. Why didn't it? I'll come back to this tomorrow if nobody
> > finds out first :)
>
> Most likely they are not referenced by anything but are younger than 2 weeks.
> I saw a similar issue with automatic gc triggering after every operation when
> I did something equivalent to:
>
>     git add <lots of files>
>     git commit
>     git reset --hard HEAD^
>
> which creates a lot of unreachable objects which are not old enough to be
> pruned.

And there's another problem caused by background gc. If it's not run in the
background, it should print this warning:

    There are too many unreachable loose objects; run 'git prune' to remove them.

but because background gc no longer has access to stdout/stderr, this is lost.
--
Duy
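For anyone hitting this today, a minimal way to see that lost warning is to keep the auto-gc in the foreground; this assumes git 2.0 or newer, which added the gc.autoDetach knob:

    # run the auto-gc in the foreground so its warnings reach the terminal
    git -c gc.autodetach=false gc --auto
    # with the repository in the state described above, this should end with something like:
    #   warning: There are too many unreachable loose objects; run 'git prune' to remove them.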
Re: git pull git gc
Hello,

    # git gc --auto
    Auto packing the repository in background for optimum performance.
    See "git help gc" for manual housekeeping.

and it calls in the background:

    25618     1  0 32451   884   1 14:20 ?  00:00:00 git gc --auto
    25639 25618 51 49076 49428   0 14:20 ?  00:00:07 git prune --expire 2.weeks.ago

    # git count-objects -v
    count: 6039
    size: 65464
    in-pack: 185432
    packs: 1
    size-pack: 46687
    prune-packable: 0
    garbage: 0
    size-garbage: 0

Regards
  Dilian

On 18.03.2015 15:16, Duy Nguyen wrote:
> On Wed, Mar 18, 2015 at 8:53 PM, Дилян Палаузов <dilyan.palau...@aegee.org> wrote:
> > Hello,
> >
> > I have a local folder with the git repository (its .git/config contains
> > [remote "origin"]\nurl = git://github.com/git/git.git\nfetch = +refs/heads/*:refs/remotes/origin/*).
> > I do "git pull" there. Usually the output is "Already up to date", but since
> > today it prints
> >
> >     Auto packing the repository in background for optimum performance.
> >     See "git help gc" for manual housekeeping.
> >     Already up-to-date.
> >
> > and starts a "git gc --auto" process in the background. This is all fine;
> > however, when the git gc process finishes and I do "git pull" again, I get
> > the same message as above (git gc is started again).
>
> So if you do "git gc --auto" now, does it exit immediately or does it go
> through the garbage collection process again (it'll print something)? What
> does "git count-objects -v" show?
Re: git pull git gc
On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen <pclo...@gmail.com> wrote:
> If not, I made some mistake in analyzing this and we'll start again.

I did make one mistake: the first gc should have reduced the number of loose
objects to zero. Why didn't it? I'll come back to this tomorrow if nobody
finds out first :)
--
Duy
Re: git pull git gc
Hello Duy,

    # ls .git/objects/17/* | wc -l
    30

30 * 256 = 7 680 > 6 700

And now? Do I have to run "git gc --aggressive"?

Kind regards
  Dilian

On 18.03.2015 15:33, Duy Nguyen wrote:
> On Wed, Mar 18, 2015 at 9:23 PM, Дилян Палаузов <dilyan.palau...@aegee.org> wrote:
> > Hello,
> >
> >     # git gc --auto
> >     Auto packing the repository in background for optimum performance.
> >     See "git help gc" for manual housekeeping.
> >
> > and it calls in the background:
> >
> >     25618     1  0 32451   884   1 14:20 ?  00:00:00 git gc --auto
> >     25639 25618 51 49076 49428   0 14:20 ?  00:00:07 git prune --expire 2.weeks.ago
> >
> >     # git count-objects -v
> >     count: 6039
>
> The loose-object threshold is 6700, unless you tweaked something. But there's
> a catch; we'll come back to this.
>
> >     size: 65464
> >     in-pack: 185432
> >     packs: 1
>
> The pack threshold is 50. You only have one pack, good.
>
> OK, back to the count of 6039 above. You have that many loose objects. But
> 'git gc' is lazier than 'git count-objects': it assumes a flat distribution
> and only counts the objects in the .git/objects/17 directory, then
> extrapolates the total from that. So can you see how many files you have in
> .git/objects/17? That number, multiplied by 256, should be greater than 6700.
> If that's the case, git gc's laziness is the problem. If not, I made some
> mistake in analyzing this and we'll start again.
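The same back-of-the-envelope check, as a small sketch; it only mimics gc's sampling heuristic and is not gc's actual code:

    n=$(ls .git/objects/17 2>/dev/null | wc -l)            # sample one of the 256 fan-out directories
    echo "estimated loose objects: $((n * 256))"
    git config gc.auto || echo "gc.auto unset; default threshold is 6700"
    # gc --auto decides to repack when the estimate exceeds the threshold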
Re: git pull git gc
On Wed, Mar 18, 2015 at 8:53 PM, Дилян Палаузов <dilyan.palau...@aegee.org> wrote:
> Hello,
>
> I have a local folder with the git repository (its .git/config contains
> [remote "origin"]\nurl = git://github.com/git/git.git\nfetch = +refs/heads/*:refs/remotes/origin/*).
> I do "git pull" there. Usually the output is "Already up to date", but since
> today it prints
>
>     Auto packing the repository in background for optimum performance.
>     See "git help gc" for manual housekeeping.
>     Already up-to-date.
>
> and starts a "git gc --auto" process in the background. This is all fine;
> however, when the git gc process finishes and I do "git pull" again, I get
> the same message as above (git gc is started again).

So if you do "git gc --auto" now, does it exit immediately or does it go
through the garbage collection process again (it'll print something)? What
does "git count-objects -v" show?
--
Duy
Re: git pull git gc
On Wed, Mar 18, 2015 at 9:23 PM, Дилян Палаузов <dilyan.palau...@aegee.org> wrote:
> Hello,
>
>     # git gc --auto
>     Auto packing the repository in background for optimum performance.
>     See "git help gc" for manual housekeeping.
>
> and it calls in the background:
>
>     25618     1  0 32451   884   1 14:20 ?  00:00:00 git gc --auto
>     25639 25618 51 49076 49428   0 14:20 ?  00:00:07 git prune --expire 2.weeks.ago
>
>     # git count-objects -v
>     count: 6039

The loose-object threshold is 6700, unless you tweaked something. But there's
a catch; we'll come back to this.

>     size: 65464
>     in-pack: 185432
>     packs: 1

The pack threshold is 50. You only have one pack, good.

OK, back to the count of 6039 above. You have that many loose objects. But
'git gc' is lazier than 'git count-objects': it assumes a flat distribution
and only counts the objects in the .git/objects/17 directory, then
extrapolates the total from that. So can you see how many files you have in
.git/objects/17? That number, multiplied by 256, should be greater than 6700.
If that's the case, git gc's laziness is the problem. If not, I made some
mistake in analyzing this and we'll start again.
--
Duy
Re: git pull git gc
On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote:
> On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen <pclo...@gmail.com> wrote:
> > If not, I made some mistake in analyzing this and we'll start again.
>
> I did make one mistake: the first gc should have reduced the number of loose
> objects to zero. Why didn't it? I'll come back to this tomorrow if nobody
> finds out first :)

Most likely they are not referenced by anything but are younger than 2 weeks.
I saw a similar issue with automatic gc triggering after every operation when
I did something equivalent to:

    git add <lots of files>
    git commit
    git reset --hard HEAD^

which creates a lot of unreachable objects which are not old enough to be
pruned.
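A quick way to confirm that hypothesis in the affected repository, as a sketch; both commands are read-only:

    git fsck --unreachable --no-reflogs | head       # objects that no ref (and no reflog) points at
    git prune --dry-run --expire=2.weeks.ago         # what prune would delete: nothing, if the objects are too young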
Re: git pull git gc
On Thu, Mar 19, 2015 at 07:31:48AM +0700, Duy Nguyen wrote:
> Or we could count/estimate the number of loose objects again after
> repack/prune. Then we can maybe have a way to prevent the next gc that we
> know will not improve the situation anyway. One option is to pack unreachable
> objects into a second pack. This would stop the next gc, but it would screw
> prune up because the st_mtime info is gone. Maybe we just save a file to tell
> gc to ignore the number of loose objects until after a specific date.

I don't think packing the unreachables is a good plan. They just end up
accumulating then, and they never expire, because we keep refreshing their
mtime at each pack (unless you pack them once and then leave them to expire,
but then you end up with a large number of packs).

Keeping a file that says "I ran gc at time T, and there were still N objects
left over" is probably the best bet. When the next gc --auto runs, if T is
recent enough, subtract N from the estimated number of objects.

I'm not sure of the right value for "recent enough" there, though. If it is
too far back, you will not gc when you could. If it is too close, then you
will end up running gc repeatedly, waiting for those objects to leave the
expiration window.

I guess leaving a bunch of loose objects around longer than necessary isn't
the end of the world. It wastes space, but it does not actively make the rest
of git slower (whereas having a large number of packs does impact
performance). So you could probably make "recent enough" be "T > now -
gc.pruneExpire / 4" or something. At most we would try to gc 4 times before
dropping unreachable objects, and for the default period, that's only once
every couple of days.

-Peff
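To make the idea concrete, here is a rough shell sketch of that bookkeeping. The .git/gc-leftover file, its "timestamp count" format, and the hard-coded two-week figure are assumptions for illustration; nothing like this exists in git:

    # estimate loose objects the same lazy way gc does
    estimate=$(( $(ls .git/objects/17 2>/dev/null | wc -l) * 256 ))
    threshold=$(git config gc.auto || echo 6700)

    now=$(date +%s)
    last_gc=0 leftover=0
    [ -f .git/gc-leftover ] && read last_gc leftover < .git/gc-leftover

    quarter_expire=$(( 14 * 24 * 3600 / 4 ))           # a quarter of the 2-week gc.pruneExpire default
    if [ "$last_gc" -gt $(( now - quarter_expire )) ]; then
        estimate=$(( estimate - leftover ))            # discount cruft the last gc already failed to remove
    fi
    [ "$estimate" -gt "$threshold" ] && echo "would run: git gc --auto"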
Re: git pull git gc
On Wed, Mar 18, 2015 at 09:27:22PM -0400, Jeff King wrote:
> On Thu, Mar 19, 2015 at 07:31:48AM +0700, Duy Nguyen wrote:
> > Or we could count/estimate the number of loose objects again after
> > repack/prune. Then we can maybe have a way to prevent the next gc that we
> > know will not improve the situation anyway. One option is to pack
> > unreachable objects into a second pack. This would stop the next gc, but it
> > would screw prune up because the st_mtime info is gone. Maybe we just save
> > a file to tell gc to ignore the number of loose objects until after a
> > specific date.
>
> I don't think packing the unreachables is a good plan. They just end up
> accumulating then, and they never expire, because we keep refreshing their
> mtime at each pack (unless you pack them once and then leave them to expire,
> but then you end up with a large number of packs).

Note, sometimes I wish unreachables were packed. Recently, I ended up in a
situation where running gc created something like 3GB of data as per du,
because I suddenly had something like 600K unreachable objects, each of them,
as a loose object, taking at least 4K on disk. This made my .git take 5GB
instead of 2GB. That surely didn't feel like garbage collection.

Mike
Re: git pull git gc
On Wed, Mar 18, 2015 at 6:27 PM, Jeff King <p...@peff.net> wrote:
> Keeping a file that says "I ran gc at time T, and there were still N objects
> left over" is probably the best bet. When the next gc --auto runs, if T is
> recent enough, subtract N from the estimated number of objects.
>
> I'm not sure of the right value for "recent enough" there, though. If it is
> too far back, you will not gc when you could. If it is too close, then you
> will end up running gc repeatedly, waiting for those objects to leave the
> expiration window.
>
> I guess leaving a bunch of loose objects around longer than necessary isn't
> the end of the world. It wastes space, but it does not actively make the rest
> of git slower (whereas having a large number of packs does impact
> performance). So you could probably make "recent enough" be "T > now -
> gc.pruneExpire / 4" or something. At most we would try to gc 4 times before
> dropping unreachable objects, and for the default period, that's only once
> every couple of days.

We could simply prune unreachables more aggressively, and it would solve this
issue at the root cause, no?

We do keep things reachable from reflogs, so the only thing you gain by
leaving the unreachables around is the chance for an expert to perform some
forensic analysis. And when there are so many loose objects that are all
unreachable, nobody sane can go through them one by one and guess correctly
whether each of them is something they would have wished to keep, had their
ancient reflog entry lasted a few weeks longer.

That is, unless there is some tool to analyse the unreachable loose objects,
collect them into meaningful islands, and present them in some way that the
end user can make sense of, which I do not think exists (yet).
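For what it's worth, a very crude version of such a tool can be improvised from existing plumbing. This sketch only lists unreachable commits with their one-line summaries and makes no attempt at grouping them into islands:

    git fsck --unreachable --no-reflogs |
    awk '/^unreachable commit/ { print $3 }' |
    while read sha; do
        git show -s --oneline "$sha"       # one line per dangling commit, for a human to judge
    done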
Re: git pull git gc
On Thu, Mar 19, 2015 at 4:04 AM, Jeff King <p...@peff.net> wrote:
> On Wed, Mar 18, 2015 at 02:58:15PM +0000, John Keeping wrote:
> > On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote:
> > > On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen <pclo...@gmail.com> wrote:
> > > > If not, I made some mistake in analyzing this and we'll start again.
> > >
> > > I did make one mistake: the first gc should have reduced the number of
> > > loose objects to zero. Why didn't it? I'll come back to this tomorrow if
> > > nobody finds out first :)
> >
> > Most likely they are not referenced by anything but are younger than 2
> > weeks. I saw a similar issue with automatic gc triggering after every
> > operation when I did something equivalent to:
> >
> >     git add <lots of files>
> >     git commit
> >     git reset --hard HEAD^
> >
> > which creates a lot of unreachable objects which are not old enough to be
> > pruned.
>
> Yes, this is almost certainly the problem. Though to be pedantic, the command
> above will still have a reflog entry, so the objects will be reachable (and
> packed). But there are other variants that don't leave the objects reachable
> from even reflogs.
>
> I don't know if there is an easy way around this. Auto-gc's object count is
> making the assumption that running the gc will reduce the number of objects,
> but obviously it does not always do so.
>
> We could do a more thorough check and find the number of actual packable and
> prune-able objects. The prune-able part of that is easy; just omit objects
> from the count that are newer than 2 weeks. But "packable" is expensive. You
> would have to compute reachability by walking from the tips. That can take
> tens of seconds on a large repo.

Or we could count/estimate the number of loose objects again after
repack/prune. Then we can maybe have a way to prevent the next gc that we know
will not improve the situation anyway. One option is to pack unreachable
objects into a second pack. This would stop the next gc, but it would screw
prune up because the st_mtime info is gone. Maybe we just save a file to tell
gc to ignore the number of loose objects until after a specific date.
--
Duy
Re: git pull git gc
On Wed, Mar 18, 2015 at 07:27:46PM -0700, Junio C Hamano wrote:
> > I guess leaving a bunch of loose objects around longer than necessary isn't
> > the end of the world. It wastes space, but it does not actively make the
> > rest of git slower (whereas having a large number of packs does impact
> > performance). So you could probably make "recent enough" be "T > now -
> > gc.pruneExpire / 4" or something. At most we would try to gc 4 times before
> > dropping unreachable objects, and for the default period, that's only once
> > every couple of days.
>
> We could simply prune unreachables more aggressively, and it would solve this
> issue at the root cause, no?

Yes, but not too aggressively. You mentioned object archaeology, but my main
interest is avoiding corruption. The mtime check is the thing that prevents us
from pruning objects being used for an operation-in-progress that has not yet
updated a ref.

For some long-running operations, like adding files to a commit, we take into
account references like a blob being mentioned in the index. But I do not know
offhand if there are other long-running operations that would run into
problems if we shortened the expiration time drastically. Anything building a
temporary index is potentially problematic. But if we assume that operations
like that tend to create and reference their objects within a reasonable time
period (say, seconds to minutes), then the current default of 2 weeks is
absurd for this purpose. For raciness within a single operation, a few seconds
is probably enough (e.g., we may write out a commit object and then update the
ref a few milliseconds later).

The potential for problems is exacerbated by the fact that object `X` may
exist in the filesystem with an old mtime, and then a new operation wants to
reference it. That's made somewhat better by 33d4221 (write_sha1_file: freshen
existing objects, 2014-10-15), as before we could silently turn a file write
into a noop. But it's still racy to do:

    git cat-file -e $commit
    git update-ref refs/heads/foo $commit

as we do not update the mtime for a read-only operation like cat-file (and
even if we did, it's still somewhat racy, as prune does not atomically check
the mtime and remove the file).

So I think there's definitely some possible danger with dropping the default
prune expiration time. For a long time GitHub ran with it as 1.hour.ago. We
definitely saw some oddities and corruption over the years that were
apparently caused by over-aggressive pruning and/or raciness. I've fixed a
number of bugs, and things did get better as a result. But I could not say
whether all such problems are gone. These days we do our regular repacks with
--keep-unreachable and almost never prune anything.

It's also not clear whether GitHub represents anything close to normal use. We
have a much smaller array of operations that we perform (most objects are
either from a push, or from a test-merge between a topic branch and HEAD). But
we also have busy repos that are frequently doing gc in the background
(especially because we share object storage, so activity on another fork can
trigger a gc job that affects a whole repository network). On workstations,
I'd guess most git-gc jobs run during a fairly quiescent period.

All of which is to say that I don't really know the answer, and there may be
dragons. I'd imagine that dropping the default expiration time from 2 weeks to
1 day would probably be fine. A good way to experiment would be for some brave
souls to set gc.pruneExpire themselves, run with it for a few weeks or months,
and see if anything goes wrong.

-Peff
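As a usage example, that experiment only takes a couple of config commands; 1.day.ago is just a sample value matching the figure mentioned above, not a recommendation from this thread:

    git config gc.pruneExpire 1.day.ago    # try a shorter window than the 2-week default
    # ...use the repository normally for a few weeks...
    git count-objects -v                   # watch whether loose objects now stay below gc.auto
    git config --unset gc.pruneExpire      # go back to the default if anything looks wrong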
Re: git pull git gc
On Thu, Mar 19, 2015 at 12:14:53AM -0400, Jeff King wrote:
> On Thu, Mar 19, 2015 at 11:01:17AM +0900, Mike Hommey wrote:
> > > I don't think packing the unreachables is a good plan. They just end up
> > > accumulating then, and they never expire, because we keep refreshing
> > > their mtime at each pack (unless you pack them once and then leave them
> > > to expire, but then you end up with a large number of packs).
> >
> > Note, sometimes I wish unreachables were packed. Recently, I ended up in a
> > situation where running gc created something like 3GB of data as per du,
> > because I suddenly had something like 600K unreachable objects, each of
> > them, as a loose object, taking at least 4K on disk. This made my .git
> > take 5GB instead of 2GB. That surely didn't feel like garbage collection.
>
> That's definitely a thing that happens, but it is a bit of a corner case.
> It's unusual to have such a large number of unreferenced objects all at once.
>
> I don't suppose you happen to remember the details, but would a lower
> expiration time (e.g., 1 day or 1 hour) have made all of those objects go
> away? Or were they really from some extremely recent event (of course,
> "event" here might just have been "I did a full repack right before rewriting
> history", which would freshen the mtimes on everything in the pack)?

Unfortunately, I don't know the exact details. But yes, I guess a lower
expiration time might have helped.

> Certainly the loosening behavior for unreachable objects has corner cases
> like this, and they suck when you hit one. Leaving the objects packed would
> be better, but IMHO is not a viable alternative unless somebody comes up with
> a plan for segregating the old objects in a way that they actually expire
> eventually, and don't just keep getting repacked and freshened over and over.

It sure is a corner case. On the other hand, when it happens, every single git
operation calls git gc --auto, which happily spends 5 minutes sucking CPU to
end up doing nothing in practice. And that adds salt to the wound if you are
on battery. 6700 loose objects seems easy to reach on a repo with 6M
objects...

Mike
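A small sketch of measuring that overhead, comparing the logical size git reports with what the loose objects actually occupy on disk (du rounds each file up to a filesystem block, which is where the blow-up comes from):

    git count-objects -v | grep '^size:'               # logical size of loose objects, in KiB
    du -shc .git/objects/??/ 2>/dev/null | tail -n1    # on-disk size of the loose objects alone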
Re: git pull git gc
On Thu, Mar 19, 2015 at 11:15:19AM +0700, Duy Nguyen wrote:
> On Thu, Mar 19, 2015 at 8:27 AM, Jeff King <p...@peff.net> wrote:
> > Keeping a file that says "I ran gc at time T, and there were still N
> > objects left over" is probably the best bet. When the next gc --auto runs,
> > if T is recent enough, subtract N from the estimated number of objects.
> >
> > I'm not sure of the right value for "recent enough" there, though. If it is
> > too far back, you will not gc when you could. If it is too close, then you
> > will end up running gc repeatedly, waiting for those objects to leave the
> > expiration window.
>
> And it would not be hard to implement either. git-gc is already prepared to
> deal with a stale gc.pid, which would stop git-gc for a day or so before it
> deletes gc.pid and starts anyway. All we need to do is check at the end of
> git-gc: if we know for sure the next 'gc --auto' is a waste, then leave
> gc.pid behind.

That omits the "N objects left over" information, which I think may be useful,
because otherwise the rule is basically "don't do another gc at all for X time
units". That's OK for most use, but it has its own corner cases.

E.g., imagine you are doing an SVN import that does an auto-gc check every
1000 commits. You have some unreferenced objects in your repository. After the
first 1000 commits, we do a gc, and then say "wow, still a lot of cruft; let's
block gc for a day". Five minutes later, after another 1000 commits, we run gc
--auto again. It doesn't run because of the cruft-check, even though there are
a _huge_ number of new packable objects.

If the blocker file tells us "7000 extra objects" and we see that there are
17,000 in the repo, then we know it's still worth doing the gc (i.e., we know
that we'll probably end up ignoring the 7000 cruft that didn't get cleaned up
last time, but we also know that there are 10,000 new objects).

-Peff
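The decision rule in that example, spelled out as a sketch; all three numbers come from the paragraph above, and the blocker file itself is still hypothetical:

    threshold=6700        # gc.auto default
    known_cruft=7000      # recorded by the previous gc in the hypothetical blocker file
    estimate=17000        # loose objects estimated right now
    if [ $(( estimate - known_cruft )) -gt "$threshold" ]; then
        echo "run gc: ~10000 new packable objects despite the known cruft"
    fi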
Re: git pull git gc
On Thu, Mar 19, 2015 at 11:20 AM, Jeff King <p...@peff.net> wrote:
> On Thu, Mar 19, 2015 at 11:15:19AM +0700, Duy Nguyen wrote:
> > On Thu, Mar 19, 2015 at 8:27 AM, Jeff King <p...@peff.net> wrote:
> > > Keeping a file that says "I ran gc at time T, and there were still N
> > > objects left over" is probably the best bet. When the next gc --auto
> > > runs, if T is recent enough, subtract N from the estimated number of
> > > objects.
> > >
> > > I'm not sure of the right value for "recent enough" there, though. If it
> > > is too far back, you will not gc when you could. If it is too close, then
> > > you will end up running gc repeatedly, waiting for those objects to leave
> > > the expiration window.
> >
> > And it would not be hard to implement either. git-gc is already prepared to
> > deal with a stale gc.pid, which would stop git-gc for a day or so before it
> > deletes gc.pid and starts anyway. All we need to do is check at the end of
> > git-gc: if we know for sure the next 'gc --auto' is a waste, then leave
> > gc.pid behind.
>
> That omits the "N objects left over" information, which I think may be
> useful, because otherwise the rule is basically "don't do another gc at all
> for X time units". That's OK for most use, but it has its own corner cases.

True. But saving "N objects left over" in a file also has a corner case: if
the user runs "prune --expire=now" manually, the next 'gc --auto' still thinks
we have that many leftovers and keeps delaying gc for some more time. Unless
we make 'prune' (or any other command that deletes leftovers) also delete this
file. Yeah, maybe saving this info in a file will work.

> E.g., imagine you are doing an SVN import that does an auto-gc check every
> 1000 commits. You have some unreferenced objects in your repository. After
> the first 1000 commits, we do a gc, and then say "wow, still a lot of cruft;
> let's block gc for a day". Five minutes later, after another 1000 commits, we
> run gc --auto again. It doesn't run because of the cruft-check, even though
> there are a _huge_ number of new packable objects.
>
> If the blocker file tells us "7000 extra objects" and we see that there are
> 17,000 in the repo, then we know it's still worth doing the gc (i.e., we know
> that we'll probably end up ignoring the 7000 cruft that didn't get cleaned up
> last time, but we also know that there are 10,000 new objects).
--
Duy
Re: git pull git gc
On Thu, Mar 19, 2015 at 11:29:57AM +0700, Duy Nguyen wrote:
> > That omits the "N objects left over" information, which I think may be
> > useful, because otherwise the rule is basically "don't do another gc at all
> > for X time units". That's OK for most use, but it has its own corner cases.
>
> True. But saving "N objects left over" in a file also has a corner case: if
> the user runs "prune --expire=now" manually, the next 'gc --auto' still
> thinks we have that many leftovers and keeps delaying gc for some more time.
> Unless we make 'prune' (or any other command that deletes leftovers) also
> delete this file. Yeah, maybe saving this info in a file will work.

I assumed that the user would not run prune manually, but would run "git gc
--prune=now". And yeah, definitely any time gc runs, it should update the file
(if there are fewer than `gc.auto` objects, I think it could just delete the
file).

We could also apply that rule to any run of git prune, but my mental model is
that "git gc" is the magical porcelain that will do this stuff for you, and
"git prune" is the plumbing that users shouldn't need to call themselves. I
don't know if that model is shared by users, though. :)

-Peff
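For completeness, the porcelain-level cleanup referred to above, as a usage example; it deletes every unreachable object immediately, so only run it when nothing in flight could still need them:

    git gc --prune=now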
Re: git pull git gc
On Thu, Mar 19, 2015 at 11:01:17AM +0900, Mike Hommey wrote:
> > I don't think packing the unreachables is a good plan. They just end up
> > accumulating then, and they never expire, because we keep refreshing their
> > mtime at each pack (unless you pack them once and then leave them to
> > expire, but then you end up with a large number of packs).
>
> Note, sometimes I wish unreachables were packed. Recently, I ended up in a
> situation where running gc created something like 3GB of data as per du,
> because I suddenly had something like 600K unreachable objects, each of them,
> as a loose object, taking at least 4K on disk. This made my .git take 5GB
> instead of 2GB. That surely didn't feel like garbage collection.

That's definitely a thing that happens, but it is a bit of a corner case. It's
unusual to have such a large number of unreferenced objects all at once.

I don't suppose you happen to remember the details, but would a lower
expiration time (e.g., 1 day or 1 hour) have made all of those objects go
away? Or were they really from some extremely recent event (of course, "event"
here might just have been "I did a full repack right before rewriting
history", which would freshen the mtimes on everything in the pack)?

Certainly the loosening behavior for unreachable objects has corner cases like
this, and they suck when you hit one. Leaving the objects packed would be
better, but IMHO is not a viable alternative unless somebody comes up with a
plan for segregating the old objects in a way that they actually expire
eventually, and don't just keep getting repacked and freshened over and over.

-Peff
Re: git pull git gc
On Thu, Mar 19, 2015 at 8:27 AM, Jeff King <p...@peff.net> wrote:
> Keeping a file that says "I ran gc at time T, and there were still N objects
> left over" is probably the best bet. When the next gc --auto runs, if T is
> recent enough, subtract N from the estimated number of objects.
>
> I'm not sure of the right value for "recent enough" there, though. If it is
> too far back, you will not gc when you could. If it is too close, then you
> will end up running gc repeatedly, waiting for those objects to leave the
> expiration window.

And it would not be hard to implement either. git-gc is already prepared to
deal with a stale gc.pid, which would stop git-gc for a day or so before it
deletes gc.pid and starts anyway. All we need to do is check at the end of
git-gc: if we know for sure the next 'gc --auto' is a waste, then leave gc.pid
behind.
--
Duy
Re: git pull git gc
On Wed, Mar 18, 2015 at 02:58:15PM +0000, John Keeping wrote:
> On Wed, Mar 18, 2015 at 09:41:59PM +0700, Duy Nguyen wrote:
> > On Wed, Mar 18, 2015 at 9:33 PM, Duy Nguyen <pclo...@gmail.com> wrote:
> > > If not, I made some mistake in analyzing this and we'll start again.
> >
> > I did make one mistake: the first gc should have reduced the number of
> > loose objects to zero. Why didn't it? I'll come back to this tomorrow if
> > nobody finds out first :)
>
> Most likely they are not referenced by anything but are younger than 2 weeks.
> I saw a similar issue with automatic gc triggering after every operation when
> I did something equivalent to:
>
>     git add <lots of files>
>     git commit
>     git reset --hard HEAD^
>
> which creates a lot of unreachable objects which are not old enough to be
> pruned.

Yes, this is almost certainly the problem. Though to be pedantic, the command
above will still have a reflog entry, so the objects will be reachable (and
packed). But there are other variants that don't leave the objects reachable
from even reflogs.

I don't know if there is an easy way around this. Auto-gc's object count is
making the assumption that running the gc will reduce the number of objects,
but obviously it does not always do so.

We could do a more thorough check and find the number of actual packable and
prune-able objects. The prune-able part of that is easy; just omit objects
from the count that are newer than 2 weeks. But "packable" is expensive. You
would have to compute reachability by walking from the tips. That can take
tens of seconds on a large repo.

You could perhaps cut off the walk early when you hit a packed commit (this
does not strictly imply that all of the related objects are packed, but it
would be good enough for a heuristic). But even that is probably too expensive
for gc --auto.

-Peff

PS Note that in git v2.2.0 and up, prune will leave not only recent
unreachable objects, but also older objects which are reachable from those
recent ones (so that we keep or prune whole chunks of history, rather than
dropping part and leaving the rest broken). Technically this exacerbates the
problem (we keep more objects), though I doubt it makes much difference in
practice (most chunks of history were created at similar times, so the mtimes
of the whole chunk will be close together).
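One such variant, as a sketch for a throwaway repository: it stages a pile of blobs and then throws them away without ever committing, so not even a reflog entry references them.

    git init /tmp/gc-demo && cd /tmp/gc-demo
    git commit --allow-empty -m initial
    for i in $(seq 1 7000); do echo $i > file$i; done
    git add .                 # writes ~7000 loose blobs into .git/objects
    git reset --hard          # drops them from the index; nothing references them any more
    git count-objects -v      # the blobs are still there, and are too young for the 2-week expiry
    git gc --auto             # so this keeps being triggered without being able to help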
Re: git pull git gc
On Wed, Mar 18, 2015 at 03:48:42PM +0100, Дилян Палаузов wrote:
>     # ls .git/objects/17/* | wc -l
>     30
>
> 30 * 256 = 7 680 > 6 700
>
> And now? Do I have to run "git gc --aggressive"?

No, --aggressive just controls the time we spend on repacking. If the guess is
correct that the objects are kept because they are unreachable but recent,
then shortening the prune expiration time would get rid of them. E.g., "git gc
--prune=1.hour.ago". That does not solve the underlying problem discussed
elsewhere in the thread, but it would make this particular instance of it go
away. :)

-Peff
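Applied to the repository from the start of the thread, that would look something like this:

    git count-objects -v            # note the loose "count" beforehand
    git gc --prune=1.hour.ago
    git count-objects -v            # count should now be at or near zero,
                                    # and the next "git pull" should no longer kick off a gc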