Re: Repacking a repository uses up all available disk space
On Mon, Jun 13, 2016 at 07:24:51AM +0700, Duy Nguyen wrote:

> >> - git fsck --full
> >> - git repack -Adl -b --pack-kept-objects
> >> - git pack-refs --all
> >> - git prune
> >>
> >> The reason it's split into repack + prune instead of just gc is because
> >> we use alternates to save on disk space and try not to prune repos that
> >> are used as alternates by other repos in order to avoid potential
> >> corruption.
>
> Isn't this what extensions.preciousObjects is for? It looks like prune
> just refuses to run in precious objects mode though, and repack is
> skipped by gc, but if that repack command works, maybe we should do
> something like that in git-gc?

Sort of. preciousObjects is a fail-safe so that you do not ever
accidentally run an object-deleting operation where you shouldn't (e.g.,
in the shared repository used by others as an alternate). So the
important step there is that before running "repack", you would want to
make sure you have taken into account the reachability of anybody
sharing from you.

So you could do something like (in your shared repository):

  git config core.repositoryFormatVersion 1
  git config extensions.preciousObjects true

  # this will fail, because it's dangerous!
  git gc

  # but we can do it safely if we take into account the other repos
  for repo in $(somehow_get_list_of_shared_repos); do
    git fetch $repo +refs/*:refs/shared/$repo/*
  done
  git config extensions.preciousObjects false
  git gc
  git config extensions.preciousObjects true

So it really is orthogonal to running the various gc commands yourself;
it's just there to prevent you shooting yourself in the foot. It may
still be useful in such a case to split up the commands in your own
script, though.

In my case, you'll note that the commands above are racy (what happens
if somebody pushes a reference to a shared object between your fetch and
the gc invocation?). So we use a custom "repack -k" to get around that
(it just keeps everything).
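A minimal sketch of that "keep everything" approach, assuming a repack
that supports `-k`/`--keep-unreachable` (a custom patch at the time of
this thread; later versions of stock git grew an equivalent option):

```sh
# Repack everything into one pack, keeping unreachable objects too,
# instead of ejecting them as loose files -- nothing is ever deleted,
# so the fetch-vs-gc race described above cannot lose objects.
git repack -a -d -k
```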
You _could_ have gc automatically switch to "-k" in a preciousObjects
repository. That's at least safe. But note that it doesn't really solve
all of the problems (you do still want to have ref tips from the leaf
repositories, because it affects things like bitmaps, and packing
order).

> BTW Jeff, I think we need more documentation for
> extensions.preciousObjects. It's only documented in technical/ which
> is practically invisible to all users. Maybe
> include::repository-version.txt in config.txt, or somewhere close to
> alternates?

I'm a little hesitant to document it for end users because it's still
pretty experimental. In fact, even we are not using it at GitHub
currently. We don't have a big problem with "oops, I accidentally ran
something destructive in the shared repository", because nothing except
the maintenance script ever even goes into the shared repository.

The reason I introduced it in the first place is that I was
experimenting with the idea of actually symlinking "objects/" in the
leaf repos into the shared repository. That eliminates the object
writing in the "fetch" step above, which can be a bottleneck in some
cases (not just the I/O, but the shared repo ends up having a _lot_ of
refs, and fetch can be pretty slow). But in that case, anything that
deletes an object in one of the leaf repos is very dangerous, as it has
no idea that its object store is shared with other leaf repos. So I
really wanted a fail-safe so that running "git gc" wasn't catastrophic.

I still think that's a viable approach, but my experiments got
side-tracked and I never produced anything worth looking at. So until
there's something end users can actually make use of, I'm hesitant to
push that stuff into the regular-user documentation. Anybody who is
playing with it at this point probably _should_ be familiar with what's
in Documentation/technical.
-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Repacking a repository uses up all available disk space
On Jun 12, 2016, at 4:13 PM, Jeff King wrote:
>
> At GitHub we actually have a patch to `repack` that keeps all
> objects, reachable or not, in the pack, and use it for all of our
> automated maintenance. Since we don't drop objects at all, we can't
> ever have such a race. Aside from some pathological cases, it wastes
> much less space than you'd expect. We turn the flag off for special
> cases (e.g., somebody has rewound history and wants to expunge a
> sensitive object).
>
> I'm happy to share the "keep everything" patch if you're interested.

We have the same kind of patch actually (for the same reason), but back
on the shell implementation of repack. It'd be great if you could share
your modern version.

Nasser

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project
Re: Repacking a repository uses up all available disk space
On Mon, Jun 13, 2016 at 5:13 AM, Jeff King wrote:
> On Sun, Jun 12, 2016 at 05:54:36PM -0400, Konstantin Ryabitsev wrote:
>
>> > git gc --prune=now
>>
>> You are correct, this solves the problem, however I'm curious. The usual
>> maintenance for these repositories is a regular run of:
>>
>> - git fsck --full
>> - git repack -Adl -b --pack-kept-objects
>> - git pack-refs --all
>> - git prune
>>
>> The reason it's split into repack + prune instead of just gc is because
>> we use alternates to save on disk space and try not to prune repos that
>> are used as alternates by other repos in order to avoid potential
>> corruption.

Isn't this what extensions.preciousObjects is for? It looks like prune
just refuses to run in precious objects mode though, and repack is
skipped by gc, but if that repack command works, maybe we should do
something like that in git-gc?

BTW Jeff, I think we need more documentation for
extensions.preciousObjects. It's only documented in technical/ which
is practically invisible to all users. Maybe
include::repository-version.txt in config.txt, or somewhere close to
alternates?

> [2] It's unclear to me if you're passing any options to git-prune, but
>     you may want to pass "--expire" with a short grace period. Without
>     any options it prunes every unreachable thing, which can lead to
>     races if the repository is actively being used.
>
>     At GitHub we actually have a patch to `repack` that keeps all
>     objects, reachable or not, in the pack, and use it for all of our
>     automated maintenance. Since we don't drop objects at all, we can't
>     ever have such a race. Aside from some pathological cases, it wastes
>     much less space than you'd expect. We turn the flag off for special
>     cases (e.g., somebody has rewound history and wants to expunge a
>     sensitive object).
>
>     I'm happy to share the "keep everything" patch if you're interested.

Ah ok, I guess this is why we just skip repack. I guess '-Adl -b
--pack-kept-objects' is not enough then.
--
Duy
Re: Repacking a repository uses up all available disk space
On Sun, Jun 12, 2016 at 05:54:36PM -0400, Konstantin Ryabitsev wrote:

> > git gc --prune=now
>
> You are correct, this solves the problem, however I'm curious. The usual
> maintenance for these repositories is a regular run of:
>
> - git fsck --full
> - git repack -Adl -b --pack-kept-objects
> - git pack-refs --all
> - git prune
>
> The reason it's split into repack + prune instead of just gc is because
> we use alternates to save on disk space and try not to prune repos that
> are used as alternates by other repos in order to avoid potential
> corruption.
>
> Am I not doing something that needs to be done in order to avoid the
> same problem?

Your approach makes sense; we do the same thing at GitHub for the same
reasons[1]. The main thing you are missing that gc will do is that it
knows the prune-time it is going to feed to git-prune[2], and passes
that along to repack. That's what enables the "don't bother ejecting
these, because I'm about to delete them" optimization.

That option is not documented, because it was always assumed to be an
internal thing to git-gc, but it is:

  git repack ... --unpack-unreachable=5.minutes.ago

or whatever.

-Peff

[1] We don't run the fsck at the front, though, because it's really
    expensive. I'm not sure it buys you much, either. The repack will do
    a full walk of the graph, so it gets you a connectivity check, as
    well as a full content check of the commits and trees. The blobs are
    copied as-is from the old pack, but there is a checksum on the pack
    data (to catch any bit flips by the disk storage). So the only thing
    the fsck is getting you is that it fully reconstructs the deltas for
    each blob and checks their sha1. That's more robust than a checksum,
    but it's a lot more expensive.

[2] It's unclear to me if you're passing any options to git-prune, but
    you may want to pass "--expire" with a short grace period. Without
    any options it prunes every unreachable thing, which can lead to
    races if the repository is actively being used.
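Putting those suggestions together, the maintenance sequence with a
synchronized grace period might look like this (a sketch, not the exact
GitHub setup; the 5-minute default here is just an example):

```sh
#!/bin/sh
# Sketch: run the usual maintenance steps, but feed the same expiry to
# both repack and prune, as git-gc does internally. Objects already
# older than the grace period are dropped during the repack instead of
# being ejected as millions of loose files.
EXPIRE=${1:-5.minutes.ago}

git pack-refs --all
git repack -Adl -b --pack-kept-objects --unpack-unreachable="$EXPIRE"
git prune --expire="$EXPIRE"
```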
At GitHub we actually have a patch to `repack` that keeps all objects,
reachable or not, in the pack, and use it for all of our automated
maintenance. Since we don't drop objects at all, we can't ever have such
a race. Aside from some pathological cases, it wastes much less space
than you'd expect. We turn the flag off for special cases (e.g.,
somebody has rewound history and wants to expunge a sensitive object).

I'm happy to share the "keep everything" patch if you're interested.
Re: Repacking a repository uses up all available disk space
On Sun, Jun 12, 2016 at 05:38:04PM -0400, Jeff King wrote:

> > - When attempting to repack, creates millions of files and eventually
> >   eats up all available disk space
>
> That means these objects fall into the unreachable category. Git will
> prune unreachable loose objects after a grace period based on the
> filesystem mtime of the objects; the default is 2 weeks.
>
> For unreachable packed objects, their mtime is jumbled in with the rest
> of the objects in the packfile. So Git's strategy is to "eject" such
> objects from the packfiles into individual loose objects, and let them
> "age out" of the grace period individually.
>
> Generally this works just fine, but there are corner cases where you
> might have a very large number of such objects, and the loose storage is
> much more expensive than the packed (e.g., because each object is stored
> individually, not as a delta).
>
> It sounds like this is the case you're running into.
>
> The solution is to lower the grace period time, with something like:
>
>   git gc --prune=5.minutes.ago
>
> or even:
>
>   git gc --prune=now

You are correct, this solves the problem, however I'm curious. The usual
maintenance for these repositories is a regular run of:

- git fsck --full
- git repack -Adl -b --pack-kept-objects
- git pack-refs --all
- git prune

The reason it's split into repack + prune instead of just gc is because
we use alternates to save on disk space and try not to prune repos that
are used as alternates by other repos in order to avoid potential
corruption.

Am I not doing something that needs to be done in order to avoid the
same problem?

Thanks for your help.

Regards,
--
Konstantin Ryabitsev
Linux Foundation Collab Projects
Montréal, Québec
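The alternates setup described above boils down to a one-line file; a
minimal illustration (the repository names here are made up):

```sh
# A leaf repository can borrow objects from a shared one by listing the
# shared object directory in objects/info/alternates:
git init -q --bare shared.git
git init -q --bare leaf.git
echo "$PWD/shared.git/objects" > leaf.git/objects/info/alternates

# Every object in shared.git is now visible from leaf.git -- which is
# why pruning shared.git without checking its borrowers risks corruption.
```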
Re: Repacking a repository uses up all available disk space
On Sun, Jun 12, 2016 at 05:25:14PM -0400, Konstantin Ryabitsev wrote:

> Hello:
>
> I have a problematic repository that:
>
> - Takes up 9GB on disk
> - Passes 'git fsck --full' with no errors
> - When cloned with --mirror, takes up 38M on the target system

Cloning will only copy the objects that are reachable from the refs. So
presumably the other 8.9GB is either reachable from reflogs, or not
reachable at all (due to rewinding history or deleting branches).

> - When attempting to repack, creates millions of files and eventually
>   eats up all available disk space

That means these objects fall into the unreachable category. Git will
prune unreachable loose objects after a grace period based on the
filesystem mtime of the objects; the default is 2 weeks.

For unreachable packed objects, their mtime is jumbled in with the rest
of the objects in the packfile. So Git's strategy is to "eject" such
objects from the packfiles into individual loose objects, and let them
"age out" of the grace period individually.

Generally this works just fine, but there are corner cases where you
might have a very large number of such objects, and the loose storage is
much more expensive than the packed (e.g., because each object is stored
individually, not as a delta).

It sounds like this is the case you're running into.

The solution is to lower the grace period time, with something like:

  git gc --prune=5.minutes.ago

or even:

  git gc --prune=now

That will prune the unreachable objects immediately (and the packfile
ejector is smart enough to skip ejecting any file that would just get
deleted immediately anyway).

-Peff
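The eject-then-expire behavior is easy to observe in a throwaway
repository (a sketch using a toy history):

```sh
#!/bin/sh
# Sketch: create an unreachable commit, then compare a default gc
# (which keeps recent unreachable objects loose) with --prune=now.
set -e
git init -q demo && cd demo
git config user.email demo@example.com
git config user.name demo
echo one > f && git add f && git commit -qm one
echo two > f && git commit -qam two
git reset -q --hard HEAD^            # the second commit is now unreachable
git reflog expire --expire=now --all

git gc --quiet                       # grace period: unreachable stays loose
git count-objects -v | grep ^count   # nonzero loose-object count

git gc --quiet --prune=now           # no grace period: unreachable pruned
git count-objects -v | grep ^count   # loose-object count is now 0
```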
Repacking a repository uses up all available disk space
Hello:

I have a problematic repository that:

- Takes up 9GB on disk
- Passes 'git fsck --full' with no errors
- When cloned with --mirror, takes up 38M on the target system
- When attempting to repack, creates millions of files and eventually
  eats up all available disk space

Repacking the result of 'git clone --mirror' shows no problem, so it's
got to be something really weird with that particular instance of the
repository.

If anyone is interested in poking at this particular problem to figure
out what causes the repack process to eat up all available disk space,
you can find the tarball of the problematic repository here:

http://mricon.com/misc/src.git.tar.xz (warning: 6.6GB)

You can clone the non-problematic version of this repository from
git://codeaurora.org/quic/chrome4sdp/breakpad/breakpad/src.git

Best,
--
Konstantin Ryabitsev
Linux Foundation Collab Projects
Montréal, Québec
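For anyone poking at the tarball, a few stock commands show how much of
a repository is unreachable (a sketch; run inside the unpacked src.git):

```sh
# On-disk usage of loose and packed objects:
git count-objects -v

# Objects unreachable from every ref (the ones a repack would eject as
# loose files); --no-reflogs also ignores reflog entries as anchors:
git fsck --unreachable --no-reflogs | wc -l

# For comparison, the number of reachable objects:
git rev-list --objects --all | wc -l
```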