A co-worker asked me today how to save space when you have multiple
checkouts of the same repository (at different revs) on the same
machine. I said that since these won't block-level de-duplicate well[1],
one way to do this is with alternates.

However, I didn't know how to get those gains for an existing clone
without a full re-clone, though I hadn't looked deeply into it. As it
turns out I was wrong about that, which I found out while writing the
following test case, which shows that it works:
(
	cd /tmp &&
	rm -rf /tmp/git-{master,pu,pu-alt}.git &&

	# Normal clones
	git clone --bare --no-tags --single-branch --branch master \
		https://github.com/git/git.git /tmp/git-master.git &&
	git clone --bare --no-tags --single-branch --branch pu \
		https://github.com/git/git.git /tmp/git-pu.git &&

	# An 'alternate' clone using 'master' objects from another repo
	git --bare init /tmp/git-pu-alt.git &&
	for git in git-pu.git git-pu-alt.git
	do
		echo /tmp/git-master.git/objects >"/tmp/$git/objects/info/alternates"
	done &&
	git -C git-pu-alt.git fetch --no-tags \
		https://github.com/git/git.git pu:pu &&

	# Respective sizes, 'alternate' clone much smaller
	du -shc /tmp/git-*.git &&

	# GC them all. Compacts the git-pu.git to git-pu-alt.git's size
	for repo in git-*.git
	do
		git -C "$repo" gc
	done &&
	du -shc /tmp/git-*.git &&

	# Add another big history (GFW) to git-{pu,master}.git (in that order!)
	for repo in $(ls -d /tmp/git-*.git | sort -r)
	do
		git -C "$repo" fetch --no-tags \
			https://github.com/git-for-windows/git master:master-gfw
	done &&
	du -shc /tmp/git-*.git &&

	# Another GC. The objects now in git-master.git will be de-duped by all
	for repo in git-*.git
	do
		git -C "$repo" gc
	done &&
	du -shc /tmp/git-*.git
)
This shows a scenario where we clone git.git at "master" and "pu" in
different places. After clone the relevant sizes are:
108M /tmp/git-master.git
3.2M /tmp/git-pu-alt.git
109M /tmp/git-pu.git
219M total
I.e. git-pu-alt.git is much smaller since it points via alternates to
git-master.git, and the history of "pu" shares most of its objects with
"master". But then how do you get those gains for git-pu.git? Turns out
you just run "git gc":
111M /tmp/git-master.git
2.1M /tmp/git-pu-alt.git
2.1M /tmp/git-pu.git
115M total
This is the thing I was wrong about, in retrospect probably because I'd
been putting PATH_TO_REPO in objects/info/alternates, but we actually
need PATH_TO_REPO/objects, and neither "git gc" nor "git fsck" warns
about this. It's probably a good idea to patch that at some point, i.e.
whine about paths in alternates that don't have objects, or at the very
least those that don't exist. #leftoverbits
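Until something like that exists in git itself, the check can be
approximated in shell. This is only a rough sketch: the function names
(looks_like_objdir, check_alternates) are made up for illustration, and
testing for a "pack" subdirectory or two-hex-digit fan-out directories
is just my heuristic for "looks like an object directory", not
something git does:

```shell
#!/bin/sh
# Hypothetical helpers, not part of git.

# Succeed if $1 looks like an object directory. Heuristic (an
# assumption): it has a "pack" subdirectory or at least one loose-object
# fan-out directory named 00..ff.
looks_like_objdir () {
	test -d "$1" || return 1
	test -d "$1/pack" && return 0
	for d in "$1"/[0-9a-f][0-9a-f]
	do
		test -d "$d" && return 0
	done
	return 1
}

# Warn about every suspicious entry in $1/objects/info/alternates,
# where $1 is a bare repository or a .git directory. Returns non-zero
# if any entry looks wrong.
check_alternates () {
	alt_file="$1/objects/info/alternates"
	test -f "$alt_file" || return 0
	status=0
	while IFS= read -r path
	do
		# Skip blank lines and '#' comments
		case "$path" in
		''|'#'*) continue ;;
		esac
		# Relative entries are relative to the object directory
		case "$path" in
		/*) ;;
		*) path="$1/objects/$path" ;;
		esac
		if ! looks_like_objdir "$path"
		then
			echo "warning: alternate '$path' is not an object directory" >&2
			status=1
		fi
	done <"$alt_file"
	return $status
}
```

With that, something like "check_alternates /tmp/git-pu.git && git -C
/tmp/git-pu.git gc" would have caught my PATH_TO_REPO mistake before
the "git gc" silently did nothing useful.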
Then when we fetch git-for-windows:master to all the repos they all grow
by the amount git-for-windows has diverged:
144M /tmp/git-master.git
36M /tmp/git-pu-alt.git
36M /tmp/git-pu.git
214M total
Note that the "sort -r" is critical here. If we fetched git-master.git
first (at this point the alternate for git-pu*.git) we wouldn't get the
duplication in the first place, but instead:
144M /tmp/git-master.git
2.1M /tmp/git-pu-alt.git
2.1M /tmp/git-pu.git
148M total
This shows the importance of keeping such an 'alternate' repo
up-to-date: if it is, we don't get the duplication in the first place.
Regardless, a "git gc" will coalesce the duplicates (this is from a run
with "sort -r"):
131M /tmp/git-master.git
2.1M /tmp/git-pu-alt.git
2.2M /tmp/git-pu.git
135M total
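In script form, the ordering point amounts to always fetching into the
alternate before fetching into the repos that borrow from it. A minimal
sketch, where fetch_alternate_first and its argument order are my own
invention, not an existing git command:

```shell
#!/bin/sh
# Hypothetical helper, not part of git.
# usage: fetch_alternate_first <url> <refspec> <alternate-repo> <dependent-repo>...
fetch_alternate_first () {
	url=$1 refspec=$2 alt=$3
	shift 3
	# The alternate goes first, so the new objects land there...
	git -C "$alt" fetch --no-tags "$url" "$refspec" || return 1
	# ...and the dependents can then find them via
	# objects/info/alternates instead of storing their own copies.
	for repo
	do
		git -C "$repo" fetch --no-tags "$url" "$refspec" || return 1
	done
}
```

For the test case above that would be:

    fetch_alternate_first https://github.com/git-for-windows/git \
        master:master-gfw /tmp/git-master.git \
        /tmp/git-pu.git /tmp/git-pu-alt.git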
If you find this interesting make sure to read my
https://public-inbox.org/git/[email protected]/ and
https://public-inbox.org/git/[email protected]/ for the
caveats, i.e. if this is something intended for users then no ref in the
alternate can ever be rewound, since that can potentially result in
repository corruption.
1. https://public-inbox.org/git/[email protected]/