Re: Repacking a repository uses up all available disk space

2016-06-12 Thread Jeff King
On Mon, Jun 13, 2016 at 07:24:51AM +0700, Duy Nguyen wrote:

> >> - git fsck --full
> >> - git repack -Adl -b --pack-kept-objects
> >> - git pack-refs --all
> >> - git prune
> >>
> >> The reason it's split into repack + prune instead of just gc is because
> >> we use alternates to save on disk space and try not to prune repos that
> >> are used as alternates by other repos in order to avoid potential
> >> corruption.
> 
> Isn't this what extensions.preciousObjects is for? It looks like prune
> just refuses to run in precious objects mode though, and repack is
> skipped by gc, but if that repack command works, maybe we should do
> something like that in git-gc?

Sort of. preciousObjects is a fail-safe so that you do not ever
accidentally run an object-deleting operation where you shouldn't (e.g.,
in the shared repository used by others as an alternate). So the
important step there is that before running "repack", you would want to
make sure you have taken into account the reachability of anybody
sharing from you.

So you could do something like (in your shared repository):

  git config core.repositoryFormatVersion 1
  git config extensions.preciousObjects true

  # this will fail, because it's dangerous!
  git gc

  # but we can do it safely if we take into account the other repos
  for repo in $(somehow_get_list_of_shared_repos); do
    git fetch $repo +refs/*:refs/shared/$repo/*
  done
  git config extensions.preciousObjects false
  git gc
  git config extensions.preciousObjects true

So it really is orthogonal to running the various gc commands yourself;
it's just here to prevent you shooting yourself in the foot.

It may still be useful in such a case to split up the commands in your
own script, though. In my case, you'll note that the commands above are
racy (what happens if somebody pushes a reference to a shared object
between your fetch and the gc invocation?). So we use a custom "repack
-k" to get around that (it just keeps everything).

You _could_ have gc automatically switch to "-k" in a preciousObjects
repository. That's at least safe. But note that it doesn't really solve
all of the problems (you do still want to have ref tips from the leaf
repositories, because it affects things like bitmaps, and packing
order).

> BTW Jeff, I think we need more documentation for
> extensions.preciousObjects. It's only documented in technical/ which
> is practically invisible to all users. Maybe
> include::repository-version.txt in config.txt, or somewhere close to
> alternates?

I'm a little hesitant to document it for end users because it's still
pretty experimental. In fact, even we are not using it at GitHub
currently. We don't have a big problem with "oops, I accidentally ran
something destructive in the shared repository", because nothing except
the maintenance script ever even goes into the shared repository.

The reason I introduced it in the first place is that I was
experimenting with the idea of actually symlinking "objects/" in the
leaf repos into the shared repository. That eliminates the object
writing in the "fetch" step above, which can be a bottleneck in some
cases (not just the I/O, but the shared repo ends up having a _lot_ of
refs, and fetch can be pretty slow).

But in that case, anything that deletes an object in one of the leaf
repos is very dangerous, as it has no idea that its object store is
shared with other leaf repos. So I really wanted a fail-safe so that
running "git gc" wasn't catastrophic.

I still think that's a viable approach, but my experiments got
side-tracked and I never produced anything worth looking at. So until
there's something end users can actually make use of, I'm hesitant to
push that stuff into the regular-user documentation. Anybody who is
playing with it at this point probably _should_ be familiar with what's
in Documentation/technical.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Repacking a repository uses up all available disk space

2016-06-12 Thread Nasser Grainawi
On Jun 12, 2016, at 4:13 PM, Jeff King  wrote:
> 
> At GitHub we actually have a patch to `repack` that keeps all
> objects, reachable or not, in the pack, and use it for all of our
> automated maintenance. Since we don't drop objects at all, we can't
> ever have such a race. Aside from some pathological cases, it wastes
> much less space than you'd expect. We turn the flag off for special
> cases (e.g., somebody has rewound history and wants to expunge a
> sensitive object).
> 
> I'm happy to share the "keep everything" patch if you're interested.

We have the same kind of patch actually (for the same reason), but back on the 
shell implementation of repack. It'd be great if you could share your modern 
version.

Nasser

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, 
a Linux Foundation Collaborative Project



Re: Repacking a repository uses up all available disk space

2016-06-12 Thread Duy Nguyen
On Mon, Jun 13, 2016 at 5:13 AM, Jeff King  wrote:
> On Sun, Jun 12, 2016 at 05:54:36PM -0400, Konstantin Ryabitsev wrote:
>
>> >   git gc --prune=now
>>
>> You are correct, this solves the problem, however I'm curious. The usual
>> maintenance for these repositories is a regular run of:
>>
>> - git fsck --full
>> - git repack -Adl -b --pack-kept-objects
>> - git pack-refs --all
>> - git prune
>>
>> The reason it's split into repack + prune instead of just gc is because
>> we use alternates to save on disk space and try not to prune repos that
>> are used as alternates by other repos in order to avoid potential
>> corruption.

Isn't this what extensions.preciousObjects is for? It looks like prune
just refuses to run in precious objects mode though, and repack is
skipped by gc, but if that repack command works, maybe we should do
something like that in git-gc?

BTW Jeff, I think we need more documentation for
extensions.preciousObjects. It's only documented in technical/ which
is practically invisible to all users. Maybe
include::repository-version.txt in config.txt, or somewhere close to
alternates?

> [2] It's unclear to me if you're passing any options to git-prune, but
> you may want to pass "--expire" with a short grace period. Without
> any options it prunes every unreachable thing, which can lead to
> races if the repository is actively being used.
>
> At GitHub we actually have a patch to `repack` that keeps all
> objects, reachable or not, in the pack, and use it for all of our
> automated maintenance. Since we don't drop objects at all, we can't
> ever have such a race. Aside from some pathological cases, it wastes
> much less space than you'd expect. We turn the flag off for special
> cases (e.g., somebody has rewound history and wants to expunge a
> sensitive object).
>
> I'm happy to share the "keep everything" patch if you're interested.

Ah OK, I guess this is why we just skip repack; it sounds like
'-Adl -b --pack-kept-objects' is not enough then.
-- 
Duy


Re: Repacking a repository uses up all available disk space

2016-06-12 Thread Jeff King
On Sun, Jun 12, 2016 at 05:54:36PM -0400, Konstantin Ryabitsev wrote:

> >   git gc --prune=now
> 
> You are correct, this solves the problem, however I'm curious. The usual
> maintenance for these repositories is a regular run of:
> 
> - git fsck --full
> - git repack -Adl -b --pack-kept-objects
> - git pack-refs --all
> - git prune
> 
> The reason it's split into repack + prune instead of just gc is because
> we use alternates to save on disk space and try not to prune repos that
> are used as alternates by other repos in order to avoid potential
> corruption.
> 
> Is there something I'm not doing that needs to be done in order to
> avoid the same problem?

Your approach makes sense; we do the same thing at GitHub for the same
reasons[1]. The main thing you are missing that gc will do is that it
knows the prune-time it is going to feed to git-prune[2], and passes
that along to repack. That's what enables the "don't bother ejecting
these, because I'm about to delete them" optimization.

That option is not documented, because it was always assumed to be an
internal thing to git-gc, but it is:

  git repack ... --unpack-unreachable=5.minutes.ago

or whatever.
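
Put together, the gc-equivalent split looks something like this (a sketch; as noted, `--unpack-unreachable` is an internal, undocumented option, so treat it as subject to change):

```shell
#!/bin/sh
# Use one expiry for both steps, so repack does not bother ejecting
# unreachable objects that the prune below would delete immediately.
expire=5.minutes.ago
git repack -Adl -b --pack-kept-objects --unpack-unreachable=$expire
git pack-refs --all
git prune --expire=$expire
```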

-Peff

[1] We don't run the fsck at the front, though, because it's really
expensive.  I'm not sure it buys you much, either. The repack
will do a full walk of the graph, so it gets you a connectivity
check, as well as a full content check of the commits and trees. The
blobs are copied as-is from the old pack, but there is a checksum on
the pack data (to catch any bit flips by the disk storage). So the
only thing the fsck is getting you is that it fully reconstructs the
deltas for each blob and checks their sha1. That's more robust than
a checksum, but it's a lot more expensive.

[2] It's unclear to me if you're passing any options to git-prune, but
you may want to pass "--expire" with a short grace period. Without
any options it prunes every unreachable thing, which can lead to
races if the repository is actively being used.

At GitHub we actually have a patch to `repack` that keeps all
objects, reachable or not, in the pack, and use it for all of our
automated maintenance. Since we don't drop objects at all, we can't
ever have such a race. Aside from some pathological cases, it wastes
much less space than you'd expect. We turn the flag off for special
cases (e.g., somebody has rewound history and wants to expunge a
sensitive object).

I'm happy to share the "keep everything" patch if you're interested.


Re: Repacking a repository uses up all available disk space

2016-06-12 Thread Konstantin Ryabitsev
On Sun, Jun 12, 2016 at 05:38:04PM -0400, Jeff King wrote:
> > - When attempting to repack, creates millions of files and eventually
> >   eats up all available disk space
> 
> That means these objects fall into the unreachable category. Git will
> prune unreachable loose objects after a grace period based on the
> filesystem mtime of the objects; the default is 2 weeks.
> 
> For unreachable packed objects, their mtime is jumbled in with the rest
> of the objects in the packfile.  So Git's strategy is to "eject" such
> objects from the packfiles into individual loose objects, and let them
> "age out" of the grace period individually.
> 
> Generally this works just fine, but there are corner cases where you
> might have a very large number of such objects, and the loose storage is
> much more expensive than the packed (e.g., because each object is stored
> individually, not as a delta).
> 
> It sounds like this is the case you're running into.
> 
> The solution is to lower the grace period time, with something like:
> 
>   git gc --prune=5.minutes.ago
> 
> or even:
> 
>   git gc --prune=now

You are correct, this solves the problem, however I'm curious. The usual
maintenance for these repositories is a regular run of:

- git fsck --full
- git repack -Adl -b --pack-kept-objects
- git pack-refs --all
- git prune

The reason it's split into repack + prune instead of just gc is because
we use alternates to save on disk space and try not to prune repos that
are used as alternates by other repos in order to avoid potential
corruption.

Is there something I'm not doing that needs to be done in order to
avoid the same problem?

Thanks for your help.

Regards,
-- 
Konstantin Ryabitsev
Linux Foundation Collab Projects
Montréal, Québec




Re: Repacking a repository uses up all available disk space

2016-06-12 Thread Jeff King
On Sun, Jun 12, 2016 at 05:25:14PM -0400, Konstantin Ryabitsev wrote:

> Hello:
> 
> I have a problematic repository that:
> 
> - Takes up 9GB on disk
> - Passes 'git fsck --full' with no errors
> - When cloned with --mirror, takes up 38M on the target system

Cloning will only copy the objects that are reachable from the refs. So
presumably the other 8.9GB is either reachable from reflogs, or not
reachable at all (due to rewinding history or deleting branches).
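
A quick way to see which of those two buckets the extra space falls into (a sketch; `git fsck --unreachable` lists objects not reachable from any ref, and adding `--no-reflogs` drops the reflog protection, so the difference between the two counts is what the reflogs alone are keeping alive):

```shell
#!/bin/sh
# Summary of loose vs. packed objects and their sizes.
git count-objects -v

# Unreachable objects, still treating reflog entries as reachable.
git fsck --unreachable | wc -l

# Unreachable objects, ignoring reflogs entirely.
git fsck --unreachable --no-reflogs | wc -l
```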

> - When attempting to repack, creates millions of files and eventually
>   eats up all available disk space

That means these objects fall into the unreachable category. Git will
prune unreachable loose objects after a grace period based on the
filesystem mtime of the objects; the default is 2 weeks.

For unreachable packed objects, their mtime is jumbled in with the rest
of the objects in the packfile.  So Git's strategy is to "eject" such
objects from the packfiles into individual loose objects, and let them
"age out" of the grace period individually.

Generally this works just fine, but there are corner cases where you
might have a very large number of such objects, and the loose storage is
much more expensive than the packed (e.g., because each object is stored
individually, not as a delta).

It sounds like this is the case you're running into.

The solution is to lower the grace period time, with something like:

  git gc --prune=5.minutes.ago

or even:

  git gc --prune=now

That will prune the unreachable objects immediately (and the packfile
ejector is smart enough to skip ejecting any file that would just get
deleted immediately anyway).
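
The ejection behavior and the --prune=now shortcut are easy to observe in a throwaway repository (a sketch; assumes your git identity is configured):

```shell
#!/bin/sh
git init -q demo && cd demo
git commit -q --allow-empty -m base
git commit -q --allow-empty -m doomed
git repack -a -d -q           # everything packed while still reachable
git reset -q --hard HEAD^     # "doomed" is now unreachable, but packed
git reflog expire --expire=now --all
git repack -A -d -q           # -A ejects the unreachable commit loose
git count-objects             # shows the ejected loose object
git gc -q --prune=now         # deletes it immediately, no grace period
git count-objects             # back to zero loose objects
```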

-Peff


Repacking a repository uses up all available disk space

2016-06-12 Thread Konstantin Ryabitsev
Hello:

I have a problematic repository that:

- Takes up 9GB on disk
- Passes 'git fsck --full' with no errors
- When cloned with --mirror, takes up 38M on the target system
- When attempting to repack, creates millions of files and eventually
  eats up all available disk space

Repacking the result of 'git clone --mirror' shows no problem, so it's
got to be something really weird with that particular instance of the
repository.

If anyone is interested in poking at this particular problem to figure
out what causes the repack process to eat up all available disk space,
you can find the tarball of the problematic repository here:

http://mricon.com/misc/src.git.tar.xz (warning: 6.6GB)

You can clone the non-problematic version of this repository from
git://codeaurora.org/quic/chrome4sdp/breakpad/breakpad/src.git

Best,
-- 
Konstantin Ryabitsev
Linux Foundation Collab Projects
Montréal, Québec

