Re: [gitorious] git gc'ing all repositories destroys hardlinked clones

Marc Guenther Tue, 08 Feb 2011 08:47:12 -0800

Hi Marius,

On 27.01.2011, at 15:33, Marius Mårnes Mathiesen wrote:

> On Wed, Jan 26, 2011 at 10:08 PM, Marc Guenther <y...@schli.ch> wrote:
>> Hi,
>> 
>> We have a local installation of Gitorious. As seems to be good practice with 
>> git, I wanted to regularly run "git gc" on all our repositories, so I added 
>> a small cronjob which does this.
> 
> Marc,
> First of all: there is already a script in the Gitorious distribution
> that does this for you, it is in script/repo_housekeeping. Gitorious
> already records the number of pushes to its repositories, and this
> script does some heuristics to find which repositories are due for a
> gc. Whenever a repository is gc-ed, we clear the counter which holds
> the push count and save how much disk space the repository takes up.

Ah, thanks, I didn't know this. I will try this.

>> And this caused our disk space to explode. We have a repository which is 
>> about 3.5GB in size. This is cloned 10 times inside of Gitorious. Which 
>> isn't a problem, since git uses hardlinks for clones, so the complete disk 
>> usage is still 3.5GB.
>> 
>> Turns out, that git gc --aggressive breaks these hardlinks. After running it 
>> everywhere, the size of the repository shrank down to 2.2 GB, but now I have 
>> 10 copies of them. So the situation is actually worse than before.
> 
> The script I mentioned above will use the Repository class's gc!
> method, which will call out to Git for you. I suppose a repack will
> regenerate the pack files, which will probably fill up your disk - do
> you have any suggestions on how alternates could be used in this
> setting?

Well, from what I understand, if repo2 has an alternates file
(objects/info/alternates) that points to repo1, then a "git gc" in repo2 will
remove all objects that also exist in repo1. That would solve the problem in
this particular situation. You could set this up by creating the clone with
git clone -s ...

The downside is that repo2 is now dependent on repo1, so you cannot delete
repo1 without first regenerating all the objects in repo2 (using something
like git repack -ad). Also, repo1 does not know which other repos depend on
it, so that information has to be stored somewhere else.
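
One way to recover that dependency information without storing it separately
might be to scan the alternates files themselves. The following is only a
rough sketch under a few assumptions: the repositories are bare (so the path
passed in is the git directory itself), each alternates entry is a plain path
on its own line, and relative entries are resolved against the objects
directory. The function names alternates_of and dependents_of are made up for
this example and are not part of Gitorious or git.

```python
import os


def alternates_of(repo_git_dir):
    """Return the absolute object directories listed in a repo's alternates file."""
    objects_dir = os.path.join(repo_git_dir, "objects")
    alt_file = os.path.join(objects_dir, "info", "alternates")
    if not os.path.isfile(alt_file):
        return []
    result = []
    with open(alt_file) as fh:
        for line in fh:
            line = line.strip()
            # Skip blank lines and comment lines.
            if not line or line.startswith("#"):
                continue
            # Relative entries are resolved against the objects directory
            # (assumption stated above); absolute entries pass through.
            result.append(os.path.realpath(os.path.join(objects_dir, line)))
    return result


def dependents_of(base_git_dir, all_git_dirs):
    """List the repos in all_git_dirs whose alternates point at base_git_dir."""
    base_objects = os.path.realpath(os.path.join(base_git_dir, "objects"))
    return [d for d in all_git_dirs if base_objects in alternates_of(d)]
```

Before deleting repo1, one could call dependents_of(repo1, all_repos), run
git repack -a -d in each repository it returns, and then remove that
repository's objects/info/alternates file so it no longer refers to repo1.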

I was also toying with the idea of a script that walks through all repos and
replaces identical files with hardlinks. But that is somewhat of a hack, and I
don't even know if it will work once the clones diverge and their pack files
become different.
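
To make the hardlinking idea concrete, here is a minimal sketch of what such a
script could look like. It is not something that ships with Gitorious, and the
function name hardlink_identical_packs is invented for this example. It
assumes bare repositories on the same filesystem, only looks at .pack and .idx
files under objects/pack, and assumes no git process is writing to the
repositories while it runs.

```python
import filecmp
import os


def hardlink_identical_packs(git_dirs):
    """Replace byte-identical pack files across repos with hardlinks.

    Returns the number of files that were re-linked.
    """
    seen = {}  # (device, file name, size) -> path of the first copy found
    relinked = 0
    for git_dir in git_dirs:
        pack_dir = os.path.join(git_dir, "objects", "pack")
        if not os.path.isdir(pack_dir):
            continue
        for name in sorted(os.listdir(pack_dir)):
            if not name.endswith((".pack", ".idx")):
                continue
            path = os.path.join(pack_dir, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            st = os.stat(path)
            # Hardlinks cannot cross filesystems, so the device is part of the key.
            key = (st.st_dev, name, st.st_size)
            first = seen.setdefault(key, path)
            if first == path or os.path.samefile(first, path):
                continue  # first copy seen, or already hardlinked
            if not filecmp.cmp(first, path, shallow=False):
                continue  # same name and size, but different content
            tmp = path + ".relink"
            os.link(first, tmp)    # create a second link to the first copy
            os.replace(tmp, path)  # atomically swap it in for the duplicate
            relinked += 1
    return relinked
```

This only helps while the clones still hold byte-identical packs. Once a
diverged clone is fully repacked, its new pack has different content, so there
is nothing left for the script to link.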

Marc

