Re: [PATCH] git exproll: steps to tackle gc aggression
On Sat, Aug 10, 2013 at 5:05 PM, Ramkumar Ramachandra wrote: > Duy Nguyen wrote: >> Right. transport_get() is also run for push and it sets >> smart_options->thin = 1 unconditionally. > > So, it looks like smart http implies the thin capability. "smart_options" is a bit misleading. This applies to both smart http://, git:// and ssh:// > I think > sop's patch (6 years ago) was about turning off thin on dumb http No, dumb http walks through the remote repository object database, there's no temporary pack created for transferring data. Dumb http has nothing to do with this flag. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
Duy Nguyen wrote: > Right. transport_get() is also run for push and it sets > smart_options->thin = 1 unconditionally. So, it looks like smart http implies the thin capability. I think sop's patch (6 years ago) was about turning off thin on dumb http: can we check that before deciding that push --[no-]thin is historical cruft? Also, a documentation update is required. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
On Sat, Aug 10, 2013 at 4:43 PM, Jeff King wrote: >> push has --thin turned off by default favoring server resources over >> network traffic, see a4503a1 (Make --no-thin the default in git-push >> to save server resources - 2007-09-09) > > Hmm. I don't think that is the case anymore. > > If I do: > > git init parent && > (cd parent && seq 1 1 >file && >git add file && git commit -m base > ) && > git clone parent child && > cd child && seq 1 10001 >file && > git commit -a -m more && > GIT_TRACE=1 git push origin HEAD:foo > > I see: > > trace: run_command: 'pack-objects' '--all-progress-implied' '--revs' > '--stdout' '--thin' '--delta-base-offset' '--progress' > Right. transport_get() is also run for push and it sets smart_options->thin = 1 unconditionally. Thanks for correcting. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
On Sat, Aug 10, 2013 at 08:24:39AM +0700, Nguyen Thai Ngoc Duy wrote: > > the other end cannot use). You _might_ be able to get by with a kind of > > "two-level" hack: consider your main pack as "group A" and newly pushed > > packs as "group B". Allow storing thin deltas on disk from group B > > against group A, but never the reverse (nor within group B). That makes > > sure you don't have cycles, and it eliminates even more I/O than any > > repacking solution (because you never write the extra copy of Y to disk > > in the first place). But I can think of two problems: > [...] > > Some refinements on this idea > > - We could keep packs in group B ordered as the packs come in. The > new pack can depend on the previous ones. I think you could dispense with the two-level altogether and simply give a definite ordering to packs, whereby newer packs can only depend on older packs. Enforcing that with filesystem mtime feels a bit error-prone; I think you'd want to explicitly store a counter somewhere. > - A group index in addition to separate index for each pack would > solve linear search object lookup problem. Yeah. I do not even think it would be that much work. It is a pure optimization, so you can ignore issues like "what happens if I search for an object, but the pack it is supposed to be in went away?". The answer is "you fall back to a linear search through the packs", and assume it happens infrequently enough not to care. I'd wait to see how other proposed optimizations work out before doing a global index, though. The current wisdom is "don't have a ton of packs, for both the index issue and other reasons, like wasting space and on-the-fly deltas for fetches". If the "other reasons" go away, then a global index would make sense to solve the remaining issue. But if the solution for the other issues is to make it cheaper to repack so you can do it more often, then the index issue just goes away. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
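A crude approximation of that global index can already be built offline from the existing .idx files with stock plumbing; the sketch below is only an illustration of what such an index buys, where $oid stands for whatever object name is being looked up and look(1) is merely a stand-in for a binary search over the sorted file (grep would do the same job, just linearly):

    # one sorted (object name, pack) list covering every pack
    for idx in .git/objects/pack/pack-*.idx
    do
        git show-index <"$idx" | awk -v pack="$idx" '{print $2, pack}'
    done | sort >/tmp/group-index

    # a lookup is now a single binary search instead of one search per pack
    look "$oid" /tmp/group-index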
Re: [PATCH] git exproll: steps to tackle gc aggression
On Sat, Aug 10, 2013 at 04:24:48PM +0700, Nguyen Thai Ngoc Duy wrote: > > Yeah, I read about the --thin option. However, it's only for > > network-packs (i.e --stdout; why?). Also, is it turned on by default? > > The documentation says so, but I ran it and found that the value of > > thin is 0 in builtin/push.c:316. What is going on? > > --thin is enabled by default for fetch (see > transport.c:transport_get()) but it's only effective when the server > advertises "thin-pack" capability (see protocol-capabilities.txt). > push has --thin turned off by default favoring server resources over > network traffic, see a4503a1 (Make --no-thin the default in git-push > to save server resources - 2007-09-09) Hmm. I don't think that is the case anymore. If I do: git init parent && (cd parent && seq 1 1 >file && git add file && git commit -m base ) && git clone parent child && cd child && seq 1 10001 >file && git commit -a -m more && GIT_TRACE=1 git push origin HEAD:foo I see: trace: run_command: 'pack-objects' '--all-progress-implied' '--revs' '--stdout' '--thin' '--delta-base-offset' '--progress' -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
On Sat, Aug 10, 2013 at 02:12:46PM +0530, Ramkumar Ramachandra wrote: > First off, thanks for the fabulous writeup. I hope more contributors > read this, and get interested in tackling the problems. You're welcome. I'll just respond to a few questions/points you raised in your response: > Yeah, I read about the --thin option. However, it's only for > network-packs (i.e --stdout; why?). Also, is it turned on by default? > The documentation says so, but I ran it and found that the value of > thin is 0 in builtin/push.c:316. What is going on? I'm not sure whether "push --thin" actually does anything anymore. It looks like we always turn on thin-packs for the git transport in transport_get, and have ever since 9b28851 (Push code for transport library, 2007-09-10). And you cannot turn off thin-packing with "--no-thin". So I suspect it is historical cruft at this point. > > And if > > you think about it, that is likely where most of Martin's "317 packs > > turned into 8MB" space savings are coming from. > > Thin packs have nothing to do with git-exproll per-se, but I think I > understand why pack generations work. I think it matters in terms of understanding the workload, and why the repo benefits from repacking. In a normal git repo that accumulates packs via incremental "git repack" (e.g., by running "gc --auto" until we hit the pack limit), you will not have duplicate objects (only objects which could be delta'd across the pack boundary but are not). But if you accumulate packs by repeated thin-pack pushing, you will have good deltas, but extra copies of base objects. > > And that is why something like Junio's suggestion of "do not traverse > > history; just concatenate the packs and throw away duplicate objects" > > might be helpful. You would not find new deltas, but you would get rid > > of these duplicate objects, and doing so would be relatively cheap. > > ... but we've already determined that git pack-objects does a good job > of reusing deltas, so I'm not sure what we can do over and above this. If I understand the proposal correctly, the problem space is something like: A fully packed repository is the best way to store objects, both in terms of space efficiency and for delta reuse when clients fetch from us. But it is too expensive to do a full repack after each push. What makes it expensive, and how can we get around that? I think the answer to "why is it expensive" is two-fold: 1. We spend a lot of CPU on history traversal. Even if we have only 10 new commits to pack, we have to traverse the _whole_ history to generate the pack. 2. We spend some CPU on delta compression. This scales nicely (10 commits means we only do deltas for those 10 commits). 3. We do a lot of I/O, as we rewrite the whole of the old pack content into the new pack, along with the extra objects (so again, we might rewrite millions of objects just to add 10 more to the pack). I do not think (2) is that big a deal. It scales nicely already, and the deltas you get are worth it (and you would have to compute them on the fly for the next clone anyway). But (1) and (3) are bad, because they are based on the size of the repository, not the size of the new content. If we were to simply append the objects from the new packs onto the big, old pack, dropping any duplicates but retaining the deltas we have, we would make (1) and (3) go away. Our pack probably wouldn't be as good as a "real" repack, as it would potentially contain unreachable objects, and we might miss some delta opportunities. 
But we could cheaply run the "concatenate" repack after each push, and then do a real repack less frequently. I am not sure if that is exactly what Junio is proposing. There are some logistical complications (fixing up the pack header and footer, along with the filename, and updating the pack index). > > 1. You still want to repack more often than every 300 packs, because > > having many packs cost both in space, but also in object lookup > > time (we can do a log(N) search through each pack index, but have > > to search linearly through the set of indices). > > Once you narrow it down to a pack-index, you can bisect it since the > table is sorted. How do you find the right pack-index though? Do I > have to do the bisection-search on each one to see if the object > exists? Can this be improved by generating a global index for all > packs that contains packfile-name in addition to offset for each > object? Right. The bisection-search is the "log(N)" I mention above; you can do that on each pack index. But you have to search the packs linearly, since there is no indication of which pack the object might be found in. A global index would solve that. Right now, the way to create such an index is "git repack -ad". :) But there is nothing to say that we could not have a meta-index on top of the existing pack indices (however it does introduce complications with lifetimes, as rig
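As an illustration of the "no history traversal" part: pack-objects already accepts a bare object list on stdin when --revs is not given, so the union of what the current packs hold can be consolidated without walking any commits, with duplicates collapsing and on-disk deltas being reused. This is only a sketch of the idea; it happily keeps unreachable cruft, and deleting the superseded packs afterwards is left out (that is the part repack normally takes care of):

    # enumerate every packed object, dedup, and write one consolidated pack
    for idx in .git/objects/pack/pack-*.idx
    do
        git show-index <"$idx"
    done | awk '{print $2}' | sort -u |
    git pack-objects .git/objects/pack/pack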
Re: [PATCH] git exproll: steps to tackle gc aggression
On Sat, Aug 10, 2013 at 4:24 PM, Duy Nguyen wrote: > push has --thin turned off by default favoring server resources over > network traffic, see a4503a1 (Make --no-thin the default in git-push > to save server resources - 2007-09-09) Side note, I think the server should be able to control the default behavior. Maybe if the server advertises "thin-pack" capability and the user does not explicitly specify --no-thin, then --thin is automatically turned on at the client. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
On Sat, Aug 10, 2013 at 3:42 PM, Ramkumar Ramachandra wrote: >> As Junio mentioned, that is what "--thin" is about; the sender omits the >> base and the receiver adds it back in ("index-pack --fix-thin"). > > Yeah, I read about the --thin option. However, it's only for > network-packs (i.e --stdout; why?). Also, is it turned on by default? > The documentation says so, but I ran it and found that the value of > thin is 0 in builtin/push.c:316. What is going on? --thin is enabled by default for fetch (see transport.c:transport_get()) but it's only effective when the server advertises "thin-pack" capability (see protocol-capabilities.txt). push has --thin turned off by default favoring server resources over network traffic, see a4503a1 (Make --no-thin the default in git-push to save server resources - 2007-09-09) -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
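Both halves of that are easy to observe from the client side, assuming a git recent enough to have GIT_TRACE_PACKET: the thin-pack negotiation shows up in the packet trace of a fetch, and the --thin flag shows up in the pack-objects invocation traced during a push.

    # look for thin-pack in the protocol exchange (capability or request argument)
    GIT_TRACE_PACKET=1 git fetch origin 2>&1 | grep -o 'thin-pack' | head -n 1

    # is the pack we generate for a push built with --thin?
    GIT_TRACE=1 git push origin HEAD 2>&1 | grep 'pack-objects'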
Re: [PATCH] git exproll: steps to tackle gc aggression
Jeff King wrote: > [...] First off, thanks for the fabulous writeup. I hope more contributors read this, and get interested in tackling the problems. > You are missing a step in the preparation of the delta list. If we > already have a delta on-disk, we will check whether it is usable and > keep a note in the list (see check_object). Then when we > prepare the list of objects to delta, it is omitted from the list (see > prepare_pack). Ah, prepare_pack() calls get_object_details(), which then calls check_object(): reuse_delta is what I was looking for. > That is why you may see a much smaller number objects in the progress > eye candy for "Compressing objects..." than we are actually sending. I see. > As Junio mentioned, that is what "--thin" is about; the sender omits the > base and the receiver adds it back in ("index-pack --fix-thin"). Yeah, I read about the --thin option. However, it's only for network-packs (i.e --stdout; why?). Also, is it turned on by default? The documentation says so, but I ran it and found that the value of thin is 0 in builtin/push.c:316. What is going on? > And if > you think about it, that is likely where most of Martin's "317 packs > turned into 8MB" space savings are coming from. Thin packs have nothing to do with git-exproll per-se, but I think I understand why pack generations work. > The reason we do not store thin-packs on disk is that you > run into problems with cycles in the delta graph (e.g., A deltas against > B, which deltas against C, which deltas against A; at one point you had > a full copy of one object which let you create the cycle, but you later > deleted it as redundant with the delta, and now you cannot reconstruct > any of the objects). Ah, that's why. > Somebody pushes X, which deltas against Y. They send only the delta for > X->Y, but the receiver stores that delta plus a full copy of Y, even > though we already have Y in another pack. Repacking throws away the > extra copy of Y. Right, I got that. > And that is why something like Junio's suggestion of "do not traverse > history; just concatenate the packs and throw away duplicate objects" > might be helpful. You would not find new deltas, but you would get rid > of these duplicate objects, and doing so would be relatively cheap. ... but we've already determined that git pack-objects does a good job of reusing deltas, so I'm not sure what we can do over and above this. If I understand correctly, the proposal is to use existing (but sub-optimal) deltas more aggressively when repacking? > Another solution could involve not writing the duplicate of Y in the > first place. > [...] > You could possibly solve this with cycle detection, though it would be > complicated (you need to do it not just when getting rid of objects, but > when sending a pack, to make sure you don't send a cycle of deltas that > the other end cannot use). You _might_ be able to get by with a kind of > "two-level" hack: consider your main pack as "group A" and newly pushed > packs as "group B". Allow storing thin deltas on disk from group B > against group A, but never the reverse (nor within group B). That makes > sure you don't have cycles, and it eliminates even more I/O than any > repacking solution (because you never write the extra copy of Y to disk > in the first place). But I can think of two problems: Hm, having two kinds of packs: full packs and thin packs, and different codepaths to handle them. > 1. 
You still want to repack more often than every 300 packs, because > having many packs cost both in space, but also in object lookup > time (we can do a log(N) search through each pack index, but have > to search linearly through the set of indices). Once you narrow it down to a pack-index, you can bisect it since the table is sorted. How do you find the right pack-index though? Do I have to do the bisection-search on each one to see if the object exists? Can this be improved by generating a global index for all packs that contains packfile-name in addition to offset for each object? > 2. As you accumulate group B packs with new objects, the deltas that > people send will tend to be against objects in group B. They are > closer to the tip of history, and therefore make better deltas for > history built on top. Actually, I'm not sure it's worth the additional complexity at all. Keeping packs self-contained means that simpler code and easier isolation in the face of corruption: the space isn't a big trade-off, unless there are lots of really tiny packs (in which case we consolidate them into larger packs). The way I see it, we'll be saving on some IO by creating thin packs. But the trade-off seems to be quite heavy: self-contained packs means that I can just mmap() that pack-index, look for objects and apply deltas without bothering about any other packs. Thin packs means that I'll have to mmap() a lot more than one pack-index, and spend a lot more time searching for the base
Re: [PATCH] git exproll: steps to tackle gc aggression
Jeff King writes: > ... The reason we do not store thin-packs on disk is that you > run into problems with cycles in the delta graph (e.g., A deltas against > B, which deltas against C, which deltas against A; at one point you had > a full copy of one object which let you create the cycle, but you later > deleted it as redundant with the delta, and now you cannot reconstruct > any of the objects). As an extension to that is that a thin-pack will make the "validity" of a single pack dependent on its surroundings, making "verify-pack" useless. It may say "everything is fine", but corruption of another pack that holds a base object a delta in it depends on will render the pack unusable. With the current arrangement, if you grab a single pack and re-idx it, you know you can reconstruct all the objects in it. The original reason was not cycles; it was primarily because we did not want to have to map more than one packfile while reconstructing the delta chain for a single object (this was way before the "open small mmap windows into a large packfile" was invented). -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
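That self-containedness is easy to check with today's tooling: any single .pack can be copied out, re-indexed from nothing but its own bytes, and fully verified in isolation (pack-1234abcd below is just a placeholder name):

    cp .git/objects/pack/pack-1234abcd.pack /tmp/solo.pack
    git index-pack /tmp/solo.pack      # rebuilds /tmp/solo.idx from the pack data alone
    git verify-pack -v /tmp/solo.idx   # walks every object and delta chain in that one pack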
Re: [PATCH] git exproll: steps to tackle gc aggression
On Sat, Aug 10, 2013 at 5:16 AM, Jeff King wrote: > Another solution could involve not writing the duplicate of Y in the > first place. The reason we do not store thin-packs on disk is that you > run into problems with cycles in the delta graph (e.g., A deltas against > B, which deltas against C, which deltas against A; at one point you had > a full copy of one object which let you create the cycle, but you later > deleted it as redundant with the delta, and now you cannot reconstruct > any of the objects). > > You could possibly solve this with cycle detection, though it would be > complicated (you need to do it not just when getting rid of objects, but > when sending a pack, to make sure you don't send a cycle of deltas that > the other end cannot use). You _might_ be able to get by with a kind of > "two-level" hack: consider your main pack as "group A" and newly pushed > packs as "group B". Allow storing thin deltas on disk from group B > against group A, but never the reverse (nor within group B). That makes > sure you don't have cycles, and it eliminates even more I/O than any > repacking solution (because you never write the extra copy of Y to disk > in the first place). But I can think of two problems: > > 1. You still want to repack more often than every 300 packs, because > having many packs cost both in space, but also in object lookup > time (we can do a log(N) search through each pack index, but have > to search linearly through the set of indices). > > 2. As you accumulate group B packs with new objects, the deltas that > people send will tend to be against objects in group B. They are > closer to the tip of history, and therefore make better deltas for > history built on top. > > That's all just off the top of my head. There are probably other flaws, > too, as I haven't considered it too hard. Some refinements on this idea - We could keep packs in group B ordered as the packs come in. The new pack can depend on the previous ones. - A group index in addition to separate index for each pack would solve linear search object lookup problem. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
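The cycle problem quoted above is at bottom a topological-ordering constraint on the delta graph; as a toy illustration of the check that would be needed, tsort(1) flags exactly the A -> B -> C -> A situation (the pairs below are hypothetical "object deltas-against base" edges):

    printf '%s\n' 'A B' 'B C' 'C A' | tsort   # reports that the input contains a loop
    printf '%s\n' 'A B' 'B C' | tsort         # acyclic: prints a valid ordering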
Re: [PATCH] git exproll: steps to tackle gc aggression
On Fri, Aug 09, 2013 at 07:04:23PM +0530, Ramkumar Ramachandra wrote: > I'll raise some (hopefully interesting) points. Let's take the example > of a simple push: I start send-pack, which in turn starts receive_pack > on the server and connects its stdin/stdout to it (using git_connect). > Now, it reads the (sha1, ref) pairs it receives on stdin and spawns > pack-objects --stdout with the right arguments as the response, right? > Overall, nothing special: just pack-objects invoked with specific > arguments. > > How does pack-objects work? ll_find_deltas() spawns some worker > threads to find_deltas(). This is where this get fuzzy for me: it > calls try_delta() in a nested loop, trying to find the smallest delta, > right? I'm not sure whether the interfaces it uses to read objects > differentiates between packed deltas versus packed undeltified objects > versus loose objects at all. You are missing a step in the preparation of the delta list. If we already have a delta on-disk, we will check whether it is usable and keep a note in the list (see check_object). Then when we prepare the list of objects to delta, it is omitted from the list (see prepare_pack). That is why you may see a much smaller number objects in the progress eye candy for "Compressing objects..." than we are actually sending. > Now, the main problem I see is that a pack has to be self-contained: I > can't have an object "bar" which is a delta against an object that is > not present in the pack, right? Therefore no matter what the server > already has, I have to prepare deltas only against the data that I'm > sending across. As Junio mentioned, that is what "--thin" is about; the sender omits the base and the receiver adds it back in ("index-pack --fix-thin"). And if you think about it, that is likely where most of Martin's "317 packs turned into 8MB" space savings are coming from. Somebody pushes X, which deltas against Y. They send only the delta for X->Y, but the receiver stores that delta plus a full copy of Y, even though we already have Y in another pack. Repacking throws away the extra copy of Y. And that is why something like Junio's suggestion of "do not traverse history; just concatenate the packs and throw away duplicate objects" might be helpful. You would not find new deltas, but you would get rid of these duplicate objects, and doing so would be relatively cheap. Another solution could involve not writing the duplicate of Y in the first place. The reason we do not store thin-packs on disk is that you run into problems with cycles in the delta graph (e.g., A deltas against B, which deltas against C, which deltas against A; at one point you had a full copy of one object which let you create the cycle, but you later deleted it as redundant with the delta, and now you cannot reconstruct any of the objects). You could possibly solve this with cycle detection, though it would be complicated (you need to do it not just when getting rid of objects, but when sending a pack, to make sure you don't send a cycle of deltas that the other end cannot use). You _might_ be able to get by with a kind of "two-level" hack: consider your main pack as "group A" and newly pushed packs as "group B". Allow storing thin deltas on disk from group B against group A, but never the reverse (nor within group B). That makes sure you don't have cycles, and it eliminates even more I/O than any repacking solution (because you never write the extra copy of Y to disk in the first place). But I can think of two problems: 1. 
You still want to repack more often than every 300 packs, because having many packs cost both in space, but also in object lookup time (we can do a log(N) search through each pack index, but have to search linearly through the set of indices). 2. As you accumulate group B packs with new objects, the deltas that people send will tend to be against objects in group B. They are closer to the tip of history, and therefore make better deltas for history built on top. That's all just off the top of my head. There are probably other flaws, too, as I haven't considered it too hard. But if you are really concerned about I/O on a busy repo (and I think for hosting sites, it is a real problem), the best-performing solution will probably not involve packs at all, but some kind of delta-capable storage that you can incrementally add to without rewriting the whole repository. The downside is that it would be significantly more complicated. > > Of course we may have to reconstruct deltas for trees in order to find > > the correct set of objects (i.e., the history traversal). But that is a > > separate phase from generating the pack's object content, and we do not > > reuse any of the traversal work in later phases. > > I see. Are we talking about tree-walk.c here? This is not unique to > packing at all; we need to do this kind of traversal for any git > operation that digs into older history, no? I recall a disc
Re: [PATCH] git exproll: steps to tackle gc aggression
Ramkumar Ramachandra writes: > I'll raise some (hopefully interesting) points. Let's take the example > of a simple push: I start send-pack, which in turn starts receive_pack > on the server and connects its stdin/stdout to it (using git_connect). > Now, it reads the (sha1, ref) pairs it receives on stdin and spawns > pack-objects --stdout with the right arguments as the response, right? > Overall, nothing special: just pack-objects invoked with specific > arguments. > > How does pack-objects work? ... You've been here long enough to know that you can and should learn things yourself instead of repeating "tell me tell me" ;-) > threads to find_deltas(). This is where this get fuzzy for me: it > calls try_delta() in a nested loop, trying to find the smallest delta, > right? Yes. > I'm not sure whether the interfaces it uses to read objects > differentiates between packed deltas versus packed undeltified objects > versus loose objects at all. Yes, but that happens way before that. Start reading from pack-heuristics document to get the general overall feel of what goes on, and then go back to the source. > Now, the main problem I see is that a pack has to be self-contained: I > can't have an object "bar" which is a delta against an object that is > not present in the pack, right? Therefore no matter what the server > already has, I have to prepare deltas only against the data that I'm > sending across. There is a --thin option to allow deltas against base objects that are known to exist on the other end to be used in the pack without including the base objects. The receiving end then re-adds the base objects to the received data to complete the pack. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
Jeff King wrote: > It depends on what each side has it, doesn't it? We generally try to > reuse on-disk deltas when we can, since they require no computation. If > I have object A delta'd against B, and I know that the other side wants > A and has B (or I am also sending B), I can simply send what I have on > disk. So we do not just blit out the existing pack as-is, but we may > reuse portions of it as appropriate. I'll raise some (hopefully interesting) points. Let's take the example of a simple push: I start send-pack, which in turn starts receive_pack on the server and connects its stdin/stdout to it (using git_connect). Now, it reads the (sha1, ref) pairs it receives on stdin and spawns pack-objects --stdout with the right arguments as the response, right? Overall, nothing special: just pack-objects invoked with specific arguments. How does pack-objects work? ll_find_deltas() spawns some worker threads to find_deltas(). This is where this get fuzzy for me: it calls try_delta() in a nested loop, trying to find the smallest delta, right? I'm not sure whether the interfaces it uses to read objects differentiates between packed deltas versus packed undeltified objects versus loose objects at all. Now, the main problem I see is that a pack has to be self-contained: I can't have an object "bar" which is a delta against an object that is not present in the pack, right? Therefore no matter what the server already has, I have to prepare deltas only against the data that I'm sending across. > Of course we may have to reconstruct deltas for trees in order to find > the correct set of objects (i.e., the history traversal). But that is a > separate phase from generating the pack's object content, and we do not > reuse any of the traversal work in later phases. I see. Are we talking about tree-walk.c here? This is not unique to packing at all; we need to do this kind of traversal for any git operation that digs into older history, no? I recall a discussion about using generation numbers to speed up the walk: I tried playing with your series (where you use a cache to keep the generation numbers), but got nowhere. Does it make sense to think about speeding up the walk (how?). -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
On Fri, Aug 09, 2013 at 01:34:48AM +0530, Ramkumar Ramachandra wrote: > Certainly. A push will never use an existing pack as-is, as it's very > highly unlikely that the server requested exactly what gc --auto > packed for us locally. > > Sure, undeltified objects in the pack are probably better for push > performance, but I'm talking about the majority: deltified objects. > Don't you need to apply the deltas (ie. explode the pack), before you > can recompute the deltas for the information you're sending across for > the push? It depends on what each side has it, doesn't it? We generally try to reuse on-disk deltas when we can, since they require no computation. If I have object A delta'd against B, and I know that the other side wants A and has B (or I am also sending B), I can simply send what I have on disk. So we do not just blit out the existing pack as-is, but we may reuse portions of it as appropriate. Of course we may have to reconstruct deltas for trees in order to find the correct set of objects (i.e., the history traversal). But that is a separate phase from generating the pack's object content, and we do not reuse any of the traversal work in later phases. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
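That reuse is visible from the progress counters alone; in a repository that is already well packed, the sketch below should show "Compressing objects" counting far fewer objects than "Writing objects", because most objects simply keep the delta they already have on disk.

    # repack a repository that is already packed and compare the two counters:
    # "Compressing objects" = objects that need a fresh delta search,
    # "Writing objects"     = everything written out, reused deltas included
    git repack -a -d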
Re: [PATCH] git exproll: steps to tackle gc aggression
On Thu, Aug 8, 2013 at 4:04 PM, Ramkumar Ramachandra wrote: > You will need to educate your contributors and potential contributors > if you want these problems to be fixed. TBH, I think Junio is exceedingly kind and generous with his time. IME (and there's quite a few years of it :-) ), good contributors learn a little bit by asking, and a ton on their own. Your tone is quite demanding, which I personally find... rather out of place. m -- martin.langh...@gmail.com - ask interesting questions - don't get distracted with shiny stuff - working code first ~ http://docs.moodle.org/en/User:Martin_Langhoff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
Junio C Hamano wrote: >> 3. After a few days of working, the gc heuristics figure out that I >> have too much garbage and too many packs; a cleanup is required. The >> gc --auto which doesn't tolerate fragmentation: it tries to put >> everything into one large pack. > > Today's "gc --auto" skims the history shallowly and packs loose > objects into a single _new_ pack. If you start from a single pack > and enough loose objects and run "gc --auto", you will have two > packs, one intact from the original, one new. That is when I have lots of loose objects and few packs. We are discussing gc aggression when there are too many packs (how did we get there in the first place if new packs aren't created?): in which case it doesn't tolerate fragmentation, and tries to put everything in one pack. That is both IO and CPU intensive, and we've been trying to tackle that since the start of the thread. > And I see from "(or when ...)" part that you do not know what you > are talking about. "push" will not stream existing pack to the > other end without computation, and it never will. It will reuse > existing individual pieces when it can, and having data in a pack > (even without deltifying, or as a suboptimal delta) is much better > for push performance than having the same data in a loose object if > only for this reason, as you do not have to recompress and reencode > it in a different way (loose objects and undelitifed object in pack > are encoded differently). Certainly. A push will never use an existing pack as-is, as it's very highly unlikely that the server requested exactly what gc --auto packed for us locally. Sure, undeltified objects in the pack are probably better for push performance, but I'm talking about the majority: deltified objects. Don't you need to apply the deltas (ie. explode the pack), before you can recompute the deltas for the information you're sending across for the push? > I've treated this thread as a design discussion, not an education > session, but the above ended up having more education material than > anything that would advance the design. You will need to educate your contributors and potential contributors if you want these problems to be fixed. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
Ramkumar Ramachandra writes: > I asked you a very simple question: what happens when I do "git push"? You asked "does push send unreachable cruft?" and I said No. Does that answer your question? > 3. After a few days of working, the gc heuristics figure out that I > have too much garbage and too many packs; a cleanup is required. The > gc --auto which doesn't tolerate fragmentation: it tries to put > everything into one large pack. Today's "gc --auto" skims the history shallowly and packs loose objects into a single _new_ pack. If you start from a single pack and enough loose objects and run "gc --auto", you will have two packs, one intact from the original, one new. > We're talking about tackling the gc aggression problem in 3. > when I say "git push" (or when gc decides to repack), won't I need > to explode the young pack into loose objects, do a reachability > analysis, and repack anyway? You do not explode anything. "push" will read objects from anywhere, either from loose or packed, and send a newly created pack over the wire. And I see from "(or when ...)" part that you do not know what you are talking about. "push" will not stream existing pack to the other end without computation, and it never will. It will reuse existing individual pieces when it can, and having data in a pack (even without deltifying, or as a suboptimal delta) is much better for push performance than having the same data in a loose object if only for this reason, as you do not have to recompress and reencode it in a different way (loose objects and undelitifed object in pack are encoded differently). > ... are you willing to explain it to me? Hope that the above clarifies. I've treated this thread as a design discussion, not an education session, but the above ended up having more education material than anything that would advance the design. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
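The two-pack outcome can be reproduced directly with roughly the repack invocation that "gc --auto" uses when it is the loose-object count that triggers it ("repack -d -l", without -a); the numbers below are only placeholders:

    ls .git/objects/pack/*.pack | wc -l        # say this prints 1
    for i in $(seq 1 30); do git commit --allow-empty -m "c$i"; done
    git repack -d -l                           # packs only the loose objects
    ls .git/objects/pack/*.pack | wc -l        # now 2: the old pack plus one new pack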
Re: [PATCH] git exproll: steps to tackle gc aggression
Junio C Hamano wrote: > Martin Fick writes: >> Assuming I understand what you are suggesting, would these >> "young object" likely still get "deduped" in an efficient >> way without doing history traversal (it sounds like they >> would)? > > Yes. > > The very first thing pack-object machinery does is to get the list > of object names and sort them in a certain order to help producing > good deltas, and this initial input preprocessing will dedup them. So, the proposal is to create an index of young objects without doing reachability analysis (I still didn't get the point of packing them; as I pointed out, it seems to be rather counter-productive) to help the actual packing? From what I vaguely understood: 1. Index all the young objects to save a history traversal (?) 2. Perform the reachability analysis using the index in step 1, and then generate the pack. I'm not yet clear about what information 1 contains to help 2. Is it the rough ordering? (The big important objects come near the top of the pack, and the deltas are generated against them). I say "rough" because the ordering might change after the unreachable objects are pruned. *scratches head* -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
Martin Fick writes: > Assuming I understand what you are suggesting, would these > "young object" likely still get "deduped" in an efficient > way without doing history traversal (it sounds like they > would)? Yes. The very first thing the pack-object machinery does is to get the list of object names and sort them in a certain order to help produce good deltas, and this initial input preprocessing will dedup them. > If so, then yeah this might be nice, especially if the history > traversal is what would speed this up. That was the assumption behind the "it might help" suggestion. Whether that helps or not is not known yet, and since Ram started this subthread telling me not to talk about performance improvements, my time on this thread is _not_ spent on that (yet). -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
Junio C Hamano wrote: > So I do not see how that question is "obvious". The question > obviously pointless and misses the mark by wide margin? The > question makes it obvious that whoever asks it does not understand > how Git works? Shall we all sit and mourn over the fact that I don't understand how Git works, or are you willing to explain it to me? > And of course we do not transfer objects that are not asked for from > or to a repository over pack tranfer. > > Most importantly, it is not about butchering the pack machinery in > such a way that we can create _only_ such "non history traversal" > packs. I asked you a very simple question: what happens when I do "git push"? Instead of answering the question, you butchered the pack machinery to "only" create packs with garbage in them (aka. stripped out the reachability analysis code completely), and blamed me for doing it. Explain it to me in plain English without getting agitated: 1. I'm on my terminal doing various repository operations: constantly creating new objects and moving my refs around to create unreachable objects. I have lots of loose objects. 2. I say "git push". What happens? A reachability analysis is performed on my loose objects, and the ones reachable by the ref I'm sending are packed up and sent over the network. Now, I no longer have any loose objects. 3. After a few days of working, the gc heuristics figure out that I have too much garbage and too many packs; a cleanup is required. The gc --auto which doesn't tolerate fragmentation: it tries to put everything into one large pack. Loop. We're talking about tackling the gc aggression problem in 3. And you propose putting the young objects in a pack without performing reachability analysis: I'm asking how this is going to benefit me; when I say "git push" (or when gc decides to repack), won't I need to explode the young pack into loose objects, do a reachability analysis, and repack anyway? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
On Thursday, August 08, 2013 10:56:38 am Junio C Hamano wrote: > I thought the discussion was about making the local gc > cheaper, and the "Imagine we have a cheap way" was to > address it by assuming that the daily "pack young > objects into a single pack" can be sped up if we did not > have to traverse history. More permanent packs (the > older ones in "set of packs staggered by age" Martin > proposes) in the repository should go through the normal > history traversal route. Assuming I understand what you are suggesting, would these "young object" likely still get "deduped" in an efficient way without doing history traversal (it sounds like they would)? In other words, if I understand correctly, it would save time by not pruning unreferenced objects, but it would still be deduping things and delta compressing also, so you would still likely get a great benefit from creating these young object packs? In other words, is there still a good chance that my 317 new pack files which included a 33M pack file will still get consolidated down to something near 8M? If so, then yeah this might be nice, especially if the history traversal is what would speed this up. Because today, my solution mostly saves IO and not time. I think it still saves time, I believe I have seen up to a 50% savings, but that is nothing compared to massive, several orders of magnitude IO savings. But if what you suggest could also give massive time (orders of magnitude) savings along with the IO improvements I am seeing, then suddenly repacking regularly would become very cheap even on large repos. The only time consuming piece would be pruning then? Could bitmaps eventually help out there? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
Ramkumar Ramachandra writes: > Junio C Hamano wrote: >> it is >> not a problem for the pack that consolidates young objects into a >> single pack to contain some unreachable crufts. > > So far, we have never considered putting unreachable objects in packs. > Let me ask the obvious question first: what happens when I push? Do I > pack up all the loose objects quickly (without bothering about > reachability) and send unreachable cruft to the server? No. I thought the discussion was about making the local gc cheaper, and the "Imagine we have a cheap way" was to address it by assuming that the daily "pack young objects into a single pack" can be sped up if we did not have to traverse history. More permanent packs (the older ones in "set of packs staggered by age" Martin proposes) in the repository should go through the normal history traversal route. And of course we do not transfer objects that are not asked for from or to a repository over pack tranfer. Most importantly, it is not about butchering the pack machinery in such a way that we can create _only_ such "non history traversal" packs. So I do not see how that question is "obvious". The question obviously pointless and misses the mark by wide margin? The question makes it obvious that whoever asks it does not understand how Git works? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
Junio C Hamano wrote: > it is > not a problem for the pack that consolidates young objects into a > single pack to contain some unreachable crufts. So far, we have never considered putting unreachable objects in packs. Let me ask the obvious question first: what happens when I push? Do I pack up all the loose objects quickly (without bothering about reachability) and send unreachable cruft to the server? Who is ultimately going to clean up this cruft, and when? > Note that *1* is an assumption. I do not know offhand if such a > "consolidate young objects quickly into one to keep the number of > packs small" strategy is an overall win. The way I see it, you're just delaying the reachability analysis beyond the usual pack-time. The whole point of separating out loose objects from packfiles was to not make the user wait everytime she does common repository operations locally: delay the reachability analysis and compression (aka. packing) until a network operation kicks in. To see if introducing another stage is an overall win, think about what the bottlenecks are: in network operations, it's always the data being sent over. If I understand correctly, during a network transaction, there's very minimal computation where the server-client just quickly tell each other where their refs are: therefore, it's trivial to figure out what's missing and pack that up to send it over. The strategy of including unreachable cruft in these packs might make sense from the point of view of a local gc, but will ultimately slow down network operations, no? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
Ramkumar Ramachandra writes: > Junio C Hamano wrote: >> Imagine we have a cheap way to enumerate the young objects without >> the usual history traversal. > > Before we discuss the advantages, can you outline how we can possibly > get this data without actually walking downwards from the roots > (refs)? One way to do it is to pull data out of a log of ref updates > (aka. reflog), but we both know how unreliable that can be. My understanding of the topic is to come up with a way that is much cheaper than the current "gc --auto" that involves recent history walk to consolidate both loose objects and small young packs into one, so that we can use that logic for "gc --auto". The key phrase is "without the usual history traversal". We are talking about young objects, and they are likely to be reachable from something (like reflog entries, if not refs). We may include unreachable cruft in the result in the "let's be quick and collect them into a single young pack", and you will need to keep them while reflog entries are alive, and you will need periodic sweeps with the usual history walking to remove older crufts that recently have become unreachable due to reflog expiry from packs anyway, so it is not a problem for the pack that consolidates young objects into a single pack to contain some unreachable crufts. If you start from that assumption [*1*], the way to enumerate the young objects without the usual history traversal should be fairly obvious. By definition, loose objects are all young because they were created since the last "gc --auto". Also pack .idx files know their own creation timestamp to let you decide how old they are, you can see how many objects there are in the corresponding .pack and how big it is. By doing an equivalent of "find .git/objects/[0-9a-f][0-9a-f]/", you can enumerate the loose objects, and an equivalent of "show-ref" will enumerate the objects in the pack that the .idx file you determined to be small and young. Note that *1* is an assumption. I do not know offhand if such a "consolidate young objects quickly into one to keep the number of packs small" strategy is an overall win. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
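Concretely, both enumerations can be done with find(1) plus the show-index plumbing (git show-index reads an .idx on stdin and lists the objects it covers); "-mtime -1" below is just one arbitrary way of deciding that a pack counts as "young":

    # every loose object is young by definition
    find .git/objects/[0-9a-f][0-9a-f] -type f |
    sed 's|.*/\(..\)/\(.*\)|\1\2|'

    # objects sitting in packs whose .idx was written recently
    find .git/objects/pack -name '*.idx' -mtime -1 |
    while read idx
    do
        git show-index <"$idx" | awk '{print $2}'
    done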
Re: [PATCH] git exproll: steps to tackle gc aggression
On Wed, Aug 7, 2013 at 7:10 AM, Martin Fick wrote: >> I wonder if a simpler approach may be nearly efficient as >> this one: keep the largest pack out, repack the rest at >> fetch/push time so there are at most 2 packs at a time. >> Or we we could do the repack at 'gc --auto' time, but >> with lower pack threshold (about 10 or so). When the >> second pack is as big as, say half the size of the >> first, merge them into one at "gc --auto" time. This can >> be easily implemented in git-repack.sh. > > It would definitely be better than the current gc approach. > > However, I suspect it is still at least one to two orders of > magnitude off from where it should be. To give you a real > world example, on our server today when gitexproll ran on > our kernel/msm repo, it consolidated 317 pack files into one > almost 8M packfile (it compresses/dedupes shockingly well, > one of those new packs was 33M). Our largest packfile in > that repo is 1.5G! > > So let's now imagine that the second closest packfile is > only 100M, it would keep getting consolidated with 8M worth > of data every day (assuming the same conditions and no extra > compression). That would take (750M-100M)/8M ~ 81 days to > finally build up large enough to no longer consolidate the > new packs with the second largest pack file daily. During > those 80+ days, it will be on average writing 325M too much > per day (when it should be writing just 8M). > > So I can see the appeal of a simple solution, unfortunately > I think one layer would still "suck" though. And if you are > going to add even just one extra layer, I suspect that you > might as well go the full distance since you probably > already need to implement the logic to do so? I see. It looks like your way is the best way to go. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
Junio C Hamano wrote: > Imagine we have a cheap way to enumerate the young objects without > the usual history traversal. Before we discuss the advantages, can you outline how we can possibly get this data without actually walking downwards from the roots (refs)? One way to do it is to pull data out of a log of ref updates (aka. reflog), but we both know how unreliable that can be. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
Martin Fick wrote: > So, it has me wondering if there isn't a more accurate way > to estimate the new packfile without wasting a ton of time? I'm not sure there is. Adding the sizes of individual packs can be off by a lot, because your deltification will be more effective if you have more data to slide windows over and compress. For the purposes of illustration, take a simple example: packfile-1 has a 30M Makefile and several tiny deltas. Total = 40M. packfile-2 has a 31M Makefile.um and several tiny deltas. Total = 40M. Now, what is the size of packfile-3 which contains the contents of both packfile-1 and packfile-2? 80M is a bad estimate, because you can store deltas against just one Makefile. So, unless you do an in-depth analysis of the objects in the packfiles (which can be terribly expensive), I don't see how you can arrive at a better estimate. > If not, one approach which might be worth experimenting with > is to just assume that new packfiles have size 0! Then just > consolidate them with any other packfile which is ready for > consolidation, or if none are ready, with the smallest > packfile. I would not be surprised to see this work on > average better than the current summation, That is assuming that all fetches (or pushes) are small, which is probably a good rule; you might like to have a "smallness threshold", although I haven't thought hard about the problem. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
On Monday, August 05, 2013 08:38:47 pm Ramkumar Ramachandra wrote: > This is the rough explanation I wrote down after reading > it: > > So, the problem is that my .git/objects/pack is polluted > with little packs everytime I fetch (or push, if you're > the server), and this is problematic from the > perspective of a overtly (naively) aggressive gc that > hammers out all fragmentation. So, on the first run, > the little packfiles I have are all "consolidated" into > big packfiles; you also write .keep files to say that > "don't gc these big packs we just generated". In > subsequent runs, the little packfiles from the fetch are > absorbed into a pack that is immune to gc. You're also > using a size heuristic, to consolidate similarly sized > packfiles. You also have a --ratio to tweak the ratio > of sizes. > > From: Martin Fick > See: https://gerrit-review.googlesource.com/#/c/35215/ > Thread: > http://thread.gmane.org/gmane.comp.version-control.git/2 > 31555 (Martin's emails are missing from the archive) > --- After analyzing today's data, I recognize that in some circumstances the size estimation after consolidation can be off by huge amounts. The script naively just adds the current sizes together. This gives a very rough estimate, of the new packfile size, but sometimes it can be off by over 2 orders of magnitude. :( While many new packfiles are tiny (several K only), it seems like the larger new packfiles have a terrible tendency to throw the estimate way off (I suspect they simply have many duplicate objects). But despite this poor estimate, the script still offers drastic improvements over plain git gc. So, it has me wondering if there isn't a more accurate way to estimate the new packfile without wasting a ton of time? If not, one approach which might be worth experimenting with is to just assume that new packfiles have size 0! Then just consolidate them with any other packfile which is ready for consolidation, or if none are ready, with the smallest packfile. I would not be surprised to see this work on average better than the current summation, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
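For readers trying to picture the consolidation rule, the following is one rough reading of it (a hypothetical sketch, not the actual gitexproll script): walk the packs from smallest to largest, keep the naive summed size estimate described above, and keep absorbing the next pack while it stays within RATIO of that running estimate.

    RATIO=10
    est=0 absorb=
    for entry in $(du -k .git/objects/pack/pack-*.pack | sort -n | tr '\t' ':')
    do
        size=${entry%%:*} path=${entry#*:}
        if [ "$est" -eq 0 ] || [ "$size" -le $((est * RATIO)) ]
        then
            absorb="$absorb $path"      # close enough in size: roll it in
            est=$((est + size))         # naive summed estimate, as described above
        else
            break                       # the remaining packs are a generation larger
        fi
    done
    echo "would consolidate:$absorb"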
Re: [PATCH] git exproll: steps to tackle gc aggression
On Tuesday, August 06, 2013 06:24:50 am Duy Nguyen wrote: > On Tue, Aug 6, 2013 at 9:38 AM, Ramkumar Ramachandra wrote: > > + Garbage collect using a pseudo > > logarithmic packfile maintenance + > > approach. This approach attempts to minimize packfile > > churn + by keeping several generations > > of varying sized packfiles around + and > > only consolidating packfiles (or loose objects) which > > are + either new packfiles, or packfiles > > close to the same size as + another > > packfile. > > I wonder if a simpler approach may be nearly efficient as > this one: keep the largest pack out, repack the rest at > fetch/push time so there are at most 2 packs at a time. > Or we we could do the repack at 'gc --auto' time, but > with lower pack threshold (about 10 or so). When the > second pack is as big as, say half the size of the > first, merge them into one at "gc --auto" time. This can > be easily implemented in git-repack.sh. It would definitely be better than the current gc approach. However, I suspect it is still at least one to two orders of magnitude off from where it should be. To give you a real world example, on our server today when gitexproll ran on our kernel/msm repo, it consolidated 317 pack files into one almost 8M packfile (it compresses/dedupes shockingly well, one of those new packs was 33M). Our largest packfile in that repo is 1.5G! So let's now imagine that the second closest packfile is only 100M, it would keep getting consolidated with 8M worth of data every day (assuming the same conditions and no extra compression). That would take (750M-100M)/8M ~ 81 days to finally build up large enough to no longer consolidate the new packs with the second largest pack file daily. During those 80+ days, it will be on average writing 325M too much per day (when it should be writing just 8M). So I can see the appeal of a simple solution, unfortunately I think one layer would still "suck" though. And if you are going to add even just one extra layer, I suspect that you might as well go the full distance since you probably already need to implement the logic to do so? -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] git exproll: steps to tackle gc aggression
Duy Nguyen writes:
> On Tue, Aug 6, 2013 at 9:38 AM, Ramkumar Ramachandra wrote:
> > +    Garbage collect using a pseudo logarithmic packfile maintenance
> > +    approach.  This approach attempts to minimize packfile churn
> > +    by keeping several generations of varying sized packfiles around
> > +    and only consolidating packfiles (or loose objects) which are
> > +    either new packfiles, or packfiles close to the same size as
> > +    another packfile.
>
> I wonder if a simpler approach may be nearly as efficient as this
> one: keep the largest pack out, repack the rest at fetch/push time
> so there are at most 2 packs at a time. Or we could do the repack
> at 'gc --auto' time, but with a lower pack threshold (about 10 or
> so). When the second pack is as big as, say, half the size of the
> first, merge them into one at "gc --auto" time. This can be easily
> implemented in git-repack.sh.

Another random thought.

Imagine we have a cheap way to enumerate the young objects without
the usual history traversal, for example a list of all loose objects
plus whatever appears in the .idx files that are young. We can
reconstruct "names" for trees and blobs from such a list of object
names; if a commit in the list refers to a tree, that tree is the top
level, and a blob or a tree that appears in such a top-level tree can
be given a "name" for its place in the tree (recursively). I suspect
we would end up giving names to a large majority of the trees and
blobs in such a list by doing so.

If these suspicions turn out to be true, then we could:

 - run that enumeration algorithm to come up with a set of object
   names;

 - emit the tag objects in that set in the tagger timestamp order;

 - emit the commit objects in that set in the commit timestamp order,
   while noting the tree objects contained in the set, giving them
   name "";

 - "traverse" the trees and blobs in that set, giving the found ones
   names (do so without stepping outside the set);

 - emit the trees and blobs with their names.

Some objects may not have been given any name, but that is OK as long
as they are in the minority. We would then feed all of this to
pack-objects to produce a single pack, and prune away the source of
these young objects in the end.

The above could turn out to be much cheaper than the traditional
history traversal.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
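A minimal sketch of the "cheap enumeration" half of this idea, leaving
out the recursive naming step entirely. The one-day age cutoff and the
treatment of every recent .idx as "young" are assumptions made for the
example, not part of the proposal:

```bash
#!/bin/bash
# Sketch only: enumerate young objects without a history traversal,
# then hand them to pack-objects to build one consolidated pack.
cd "$(git rev-parse --git-dir)" || exit 1

{
    # all loose objects (fan-out directory + file name form the object id)
    find objects/[0-9a-f][0-9a-f] -type f 2>/dev/null |
        sed 's|.*/\(..\)/\(.*\)|\1\2|'

    # objects listed in pack indexes younger than one day (arbitrary cutoff)
    find objects/pack -name '*.idx' -mtime -1 2>/dev/null |
    while read -r idx; do
        git show-index <"$idx" | awk '{print $2}'
    done
} | sort -u |
    # pack-objects reads object names from stdin and writes
    # objects/pack/pack-<sha1>.{pack,idx}; kept packs are left alone
    git pack-objects --honor-pack-keep objects/pack/pack
```

Without the naming pass, the resulting pack would delta less well than
Junio's full proposal, but it shows how the young set could be gathered
without walking history.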
Re: [PATCH] git exproll: steps to tackle gc aggression
On Tue, Aug 6, 2013 at 9:38 AM, Ramkumar Ramachandra wrote:
> +    Garbage collect using a pseudo logarithmic packfile maintenance
> +    approach.  This approach attempts to minimize packfile churn
> +    by keeping several generations of varying sized packfiles around
> +    and only consolidating packfiles (or loose objects) which are
> +    either new packfiles, or packfiles close to the same size as
> +    another packfile.

I wonder if a simpler approach may be nearly as efficient as this
one: keep the largest pack out, repack the rest at fetch/push time so
there are at most 2 packs at a time. Or we could do the repack at 'gc
--auto' time, but with a lower pack threshold (about 10 or so). When
the second pack is as big as, say, half the size of the first, merge
them into one at "gc --auto" time. This can be easily implemented in
git-repack.sh.
--
Duy

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
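Something along these lines, perhaps. This is a rough sketch rather
than the actual git-repack.sh change; it relies on the fact that `git
repack -a -d` leaves packs marked with a .keep file (and the objects
in them) alone:

```bash
#!/bin/bash
# Sketch: keep the largest pack out of the repack by marking it with a
# temporary .keep file, then consolidate everything else into one pack.
packdir="$(git rev-parse --git-dir)/objects/pack"

largest=$(ls -S "$packdir"/pack-*.pack 2>/dev/null | head -n1)
[ -n "$largest" ] || exit 0            # nothing to repack yet

keep="${largest%.pack}.keep"
touch "$keep"
git repack -a -d                       # skips objects in .keep-marked packs
rm -f "$keep"

# When the second pack grows to roughly half the size of the largest,
# run one repack without the .keep so the two get merged into one.
```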
[PATCH] git exproll: steps to tackle gc aggression
This is the rough explanation I wrote down after reading it:

So, the problem is that my .git/objects/pack is polluted with little
packs every time I fetch (or push, if you're the server), and this is
problematic from the perspective of an overtly (naively) aggressive
gc that hammers out all fragmentation. So, on the first run, the
little packfiles I have are all "consolidated" into big packfiles;
you also write .keep files to say "don't gc these big packs we just
generated". In subsequent runs, the little packfiles from the fetch
are absorbed into a pack that is immune to gc. You're also using a
size heuristic to consolidate similarly sized packfiles, and you have
a --ratio to tweak the ratio of sizes.

From: Martin Fick
See: https://gerrit-review.googlesource.com/#/c/35215/
Thread: http://thread.gmane.org/gmane.comp.version-control.git/231555
(Martin's emails are missing from the archive)
---
Not a candidate for inclusion, given how difficult it is to convince
our conservative maintainer to add anything new to the tree. I'm
posting this in the interest of visibility: I've checked it into my
repo and started using it. I encourage everyone else to do the same,
so we can learn from it and discuss how to improve gc.

 contrib/exproll/git-exproll.sh | 566 +
 1 file changed, 566 insertions(+)
 create mode 100755 contrib/exproll/git-exproll.sh

diff --git a/contrib/exproll/git-exproll.sh b/contrib/exproll/git-exproll.sh
new file mode 100755
index 000..9526d9f
--- /dev/null
+++ b/contrib/exproll/git-exproll.sh
@@ -0,0 +1,566 @@
+#!/bin/bash
+# Copyright (c) 2012, Code Aurora Forum. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+## Redistributions of source code must retain the above copyright
+#  notice, this list of conditions and the following disclaimer.
+## Redistributions in binary form must reproduce the above
+#  copyright notice, this list of conditions and the following
+#  disclaimer in the documentation and/or other materials provided
+#  with the distribution.
+## Neither the name of Code Aurora Forum, Inc. nor the names of its
+#  contributors may be used to endorse or promote products derived
+#  from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT
+# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS
+# BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+# BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+# WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
+# OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
+# IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+usage() { # error_message
+
+    cat <<-EOF
+	usage: $(basename $0) [-unvt] [--noref] [--noloose] [-r|--ratio number]
+	       [git gc option...] git.repo
+
+	 -u|-h      usage/help
+	 -v         verbose
+	 -n         dry-run   don't actually repack anything
+	 -t         touch     treat repo as if it had been touched
+	 --noref    avoid extra ref packing timestamp checking
+	 --noloose  do not run just because there are loose object dirs
+	            (repacking may still run if they are referenced)
+	 -r ratio   packfile ratio to aim for (default 10)
+
+	 git gc option   will be passed as args to git gc
+
+	 git.repo        to run gc against
+
+	 Garbage collect using a pseudo logarithmic packfile maintenance
+	 approach.  This approach attempts to minimize packfile churn
+	 by keeping several generations of varying sized packfiles around
+	 and only consolidating packfiles (or loose objects) which are
+	 either new packfiles, or packfiles close to the same size as
+	 another packfile.
+
+	 An estimate is used to predict when rollups (one consolidation
+	 would cause another consolidation) would occur so that this
+	 rollup can be done all at once via a single repack.  This reduces
+	 both the runtime and the packfile churn in rollup cases.
+
+	 Approach: plan each consolidation by creating a table like this:
+
+	 Id Keep Size  Sha1(or consolidation list)  Actions(repack down