Re: git-clone causes out of memory
On 10/13/2017 10:26 AM, Jeff King wrote: On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote: This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is causing diff_can_quit_early() to return false. Due to the corner-case of the bug it seems it will not be a huge performance improvement in most cases. Still worth fixing and I'm looking at your suggestions to try and learn this area better. Yeah, I just timed some pathspec limits on linux.git, and it makes at best a fraction of a percent improvement (but any improvement is well within run-to-run noise). Which is not surprising. I agree it's worth fixing, though. -Peff I just ran a first-level folder history in the Windows repository and `git rev-list HEAD -- windows >/dev/null` went from 0.43s to 0.04s. So, a good percentage improvement but even on this enormous repo the bug doesn't present an issue to human users. Thanks, -Stolee
Re: git-clone causes out of memory
On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote: > This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is > causing diff_can_quit_early() to return false. Due to the corner-case of the > bug it seems it will not be a huge performance improvement in most cases. > Still worth fixing and I'm looking at your suggestions to try and learn this > area better. Yeah, I just timed some pathspec limits on linux.git, and it makes at best a fraction of a percent improvement (but any improvement is well within run-to-run noise). Which is not surprising. I agree it's worth fixing, though. -Peff
Re: git-clone causes out of memory
On 10/13/2017 10:20 AM, Jeff King wrote: On Fri, Oct 13, 2017 at 10:10:18AM -0400, Jeff King wrote: Hmm. So this patch makes it go fast: diff --git a/revision.c b/revision.c index d167223e69..b52ea4e9d8 100644 --- a/revision.c +++ b/revision.c @@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options, int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD; tree_difference |= diff; - if (tree_difference == REV_TREE_DIFFERENT) + if (tree_difference & REV_TREE_DIFFERENT) DIFF_OPT_SET(options, HAS_CHANGES); } But that essentially makes the conditional a noop (since we know we set either NEW or OLD above and DIFFERENT is the union of those flags). I'm not sure I understand why file_add_remove() would ever want to avoid setting HAS_CHANGES (certainly its companion file_change() always does). This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use diff-tree --quiet machinery., 2007-03-14). Maybe I am missing something, but AFAICT this was always buggy. But since it only affects adds and deletes, maybe nobody noticed? I'm also not sure if it only causes a slowdown, or if this could cause us to erroneously mark something as TREESAME which isn't (I _do_ think people would have noticed that). Answering my own question a little, there is a hint in the comment a few lines above: /* * The goal is to get REV_TREE_NEW as the result only if the * diff consists of all '+' (and no other changes), REV_TREE_OLD * if the whole diff is removal of old data, and otherwise * REV_TREE_DIFFERENT (of course if the trees are the same we * want REV_TREE_SAME). * That means that once we get to REV_TREE_DIFFERENT, we do not * have to look any further. */ So my patch above is breaking that. But it's not clear from that comment why we care about knowing the different between NEW, OLD, and DIFFERENT. Grepping around for REV_TREE_NEW and REV_TREE_OLD, I think the answer is in try_to_simplify_commit(): case REV_TREE_NEW: if (revs->remove_empty_trees && rev_same_tree_as_empty(revs, p)) { /* We are adding all the specified * paths from this parent, so the * history beyond this parent is not * interesting. Remove its parents * (they are grandparents for us). * IOW, we pretend this parent is a * "root" commit. */ if (parse_commit(p) < 0) die("cannot simplify commit %s (invalid %s)", oid_to_hex(>object.oid), oid_to_hex(>object.oid)); p->parents = NULL; } /* fallthrough */ case REV_TREE_OLD: case REV_TREE_DIFFERENT: So when --remove-empty is not in effect (and it's not by default), we don't care about OLD vs NEW, and we should be able to optimize further. -Peff This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is causing diff_can_quit_early() to return false. Due to the corner-case of the bug it seems it will not be a huge performance improvement in most cases. Still worth fixing and I'm looking at your suggestions to try and learn this area better. It will speed up natural cases of someone adding or renaming a folder with a lot of contents, including someone initializing a repository from an existing codebase. This git bomb case happens to add all of the folders at once, which is why the performance is so noticeable. I use a version of this "exponential file growth" for testing that increases the folder-depth one commit at a time, so the contents are rather small in the first "add", hence not noticing this issue. Thanks, -Stolee
Re: git-clone causes out of memory
On Fri, Oct 13, 2017 at 10:10:18AM -0400, Jeff King wrote: > Hmm. So this patch makes it go fast: > > diff --git a/revision.c b/revision.c > index d167223e69..b52ea4e9d8 100644 > --- a/revision.c > +++ b/revision.c > @@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options, > int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD; > > tree_difference |= diff; > - if (tree_difference == REV_TREE_DIFFERENT) > + if (tree_difference & REV_TREE_DIFFERENT) > DIFF_OPT_SET(options, HAS_CHANGES); > } > > > But that essentially makes the conditional a noop (since we know we set > either NEW or OLD above and DIFFERENT is the union of those flags). > > I'm not sure I understand why file_add_remove() would ever want to avoid > setting HAS_CHANGES (certainly its companion file_change() always does). > This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use > diff-tree --quiet machinery., 2007-03-14). > > Maybe I am missing something, but AFAICT this was always buggy. But > since it only affects adds and deletes, maybe nobody noticed? I'm also > not sure if it only causes a slowdown, or if this could cause us to > erroneously mark something as TREESAME which isn't (I _do_ think people > would have noticed that). Answering my own question a little, there is a hint in the comment a few lines above: /* * The goal is to get REV_TREE_NEW as the result only if the * diff consists of all '+' (and no other changes), REV_TREE_OLD * if the whole diff is removal of old data, and otherwise * REV_TREE_DIFFERENT (of course if the trees are the same we * want REV_TREE_SAME). * That means that once we get to REV_TREE_DIFFERENT, we do not * have to look any further. */ So my patch above is breaking that. But it's not clear from that comment why we care about knowing the different between NEW, OLD, and DIFFERENT. Grepping around for REV_TREE_NEW and REV_TREE_OLD, I think the answer is in try_to_simplify_commit(): case REV_TREE_NEW: if (revs->remove_empty_trees && rev_same_tree_as_empty(revs, p)) { /* We are adding all the specified * paths from this parent, so the * history beyond this parent is not * interesting. Remove its parents * (they are grandparents for us). * IOW, we pretend this parent is a * "root" commit. */ if (parse_commit(p) < 0) die("cannot simplify commit %s (invalid %s)", oid_to_hex(>object.oid), oid_to_hex(>object.oid)); p->parents = NULL; } /* fallthrough */ case REV_TREE_OLD: case REV_TREE_DIFFERENT: So when --remove-empty is not in effect (and it's not by default), we don't care about OLD vs NEW, and we should be able to optimize further. -Peff
Re: git-clone causes out of memory
On Fri, Oct 13, 2017 at 09:56:36AM -0400, Jeff King wrote: > On Fri, Oct 13, 2017 at 09:55:15AM -0400, Derrick Stolee wrote: > > > > We should be comparing an empty tree and d0/d0/d0/d0 (or however deep > > > your pathspec goes). We should be able to see immediately that the entry > > > is not present between the two and not bother descending. After all, > > > we've set the QUICK flag in init_revisions(). So the real question is > > > why QUICK is not kicking in. > > > > I'm struggling to understand your meaning. We want to walk from root to > > d0/d0/d0/d0, but there is no reason to walk beyond that tree. But maybe > > that's what the QUICK flag is supposed to do. > > Yes, that's exactly what it is for. When we see the first difference we > should say "aha, the caller only wanted to know whether there was a > difference, not what it was" and return immediately. See > diff_can_quit_early(). Hmm. So this patch makes it go fast: diff --git a/revision.c b/revision.c index d167223e69..b52ea4e9d8 100644 --- a/revision.c +++ b/revision.c @@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options, int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD; tree_difference |= diff; - if (tree_difference == REV_TREE_DIFFERENT) + if (tree_difference & REV_TREE_DIFFERENT) DIFF_OPT_SET(options, HAS_CHANGES); } But that essentially makes the conditional a noop (since we know we set either NEW or OLD above and DIFFERENT is the union of those flags). I'm not sure I understand why file_add_remove() would ever want to avoid setting HAS_CHANGES (certainly its companion file_change() always does). This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use diff-tree --quiet machinery., 2007-03-14). Maybe I am missing something, but AFAICT this was always buggy. But since it only affects adds and deletes, maybe nobody noticed? I'm also not sure if it only causes a slowdown, or if this could cause us to erroneously mark something as TREESAME which isn't (I _do_ think people would have noticed that). -Peff
Re: git-clone causes out of memory
On Fri, Oct 13, 2017 at 09:55:15AM -0400, Derrick Stolee wrote: > > We should be comparing an empty tree and d0/d0/d0/d0 (or however deep > > your pathspec goes). We should be able to see immediately that the entry > > is not present between the two and not bother descending. After all, > > we've set the QUICK flag in init_revisions(). So the real question is > > why QUICK is not kicking in. > > I'm struggling to understand your meaning. We want to walk from root to > d0/d0/d0/d0, but there is no reason to walk beyond that tree. But maybe > that's what the QUICK flag is supposed to do. Yes, that's exactly what it is for. When we see the first difference we should say "aha, the caller only wanted to know whether there was a difference, not what it was" and return immediately. See diff_can_quit_early(). -Peff
Re: git-clone causes out of memory
On 10/13/2017 9:50 AM, Jeff King wrote: On Fri, Oct 13, 2017 at 09:39:14AM -0400, Derrick Stolee wrote: Since I don't understand enough about the consumers to diff_tree_oid() (and the fact that the recursive behavior may be wanted in some cases), I think we can fix this in builtin/rev-list.c with this simple diff: --- diff --git a/builtin/rev-list.c b/builtin/rev-list.c index ded1577424..b2e8e02cc8 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) git_config(git_default_config, NULL); init_revisions(, prefix); + + revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE; + (Note: I'm running tests now and see that this breaks behavior. Definitely not the solution we want.) Hmm, this feels wrong, because we _do_ want to recurse down and follow the pathspec to see if there are real changes. We should be comparing an empty tree and d0/d0/d0/d0 (or however deep your pathspec goes). We should be able to see immediately that the entry is not present between the two and not bother descending. After all, we've set the QUICK flag in init_revisions(). So the real question is why QUICK is not kicking in. -Peff I'm struggling to understand your meaning. We want to walk from root to d0/d0/d0/d0, but there is no reason to walk beyond that tree. But maybe that's what the QUICK flag is supposed to do. Thanks, -Stolee
Re: git-clone causes out of memory
On Fri, Oct 13, 2017 at 09:39:14AM -0400, Derrick Stolee wrote: > Since I don't understand enough about the consumers to diff_tree_oid() (and > the fact that the recursive behavior may be wanted in some cases), I think > we can fix this in builtin/rev-list.c with this simple diff: > > --- > > diff --git a/builtin/rev-list.c b/builtin/rev-list.c > index ded1577424..b2e8e02cc8 100644 > --- a/builtin/rev-list.c > +++ b/builtin/rev-list.c > @@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char > *prefix) > > git_config(git_default_config, NULL); > init_revisions(, prefix); > + > + revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE; > + Hmm, this feels wrong, because we _do_ want to recurse down and follow the pathspec to see if there are real changes. We should be comparing an empty tree and d0/d0/d0/d0 (or however deep your pathspec goes). We should be able to see immediately that the entry is not present between the two and not bother descending. After all, we've set the QUICK flag in init_revisions(). So the real question is why QUICK is not kicking in. -Peff
Re: git-clone causes out of memory
On 10/13/2017 9:15 AM, Derrick Stolee wrote: On 10/13/2017 8:44 AM, Jeff King wrote: On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote: On 13.10.2017 15:04, Junio C Hamano wrote: Christian Couderwrites: Yeah, but perhaps Git could be smarter when rev-listing too and avoid processing files or directories it has already seen? Aren't you suggesting to optimize for a wrong case? Anything that is possible with a software should be considered as a possible usecase. It's in fact a DoS attack. Imagine there's a server that using git to process something, and now there's a way to knock down this server. It's also bad from a promotional stand point. But the point is that you'd have the same problem with any repository that had 10^7 files in it. Yes, it's convenient for the attacker that there are only 9 objects, but fundamentally it's pretty easy for an attacker to construct repositories that have large trees (or very deep trees -- that's what causes stack exhaustion in some cases). Note too that this attack almost always comes down to the diff code (which is why it kicks in for pathspec limiting) which has to actually expand the tree. Most "normal" server-side operations (like accepting pushes or serving fetches) operate only on the object graph and _do_ avoid processing already-seen objects. As soon as servers start trying to checkout or diff, though, the attack surface gets quite large. And you really need to start thinking about having resource limits and quotas for CPU and memory use of each process (and group by requesting user, IP, repo, etc). I think the main thing Git could be doing here is to limit the size of the tree (both width and depth). But arbitrary limits like that have a way of being annoying, and I think it just pushes resource-exhaustion attacks off a little (e.g., can you construct a blob that behaves badly with the "--patch"?). -Peff I'm particularly interested in why `git rev-list HEAD -- [path]` gets slower in this case, because I wrote the history algorithm used by VSTS. In our algorithm, we only walk the list of objects from commit to the tree containing the path item. For example, in the path d0/d0/d0, we would only walk: commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 entry] From this, we can determine the TREESAME relationship by parsing four objects without parsing all contents below d0/d0/d0. The reason we have exponential behavior in `git rev-list` is because we are calling diff_tree_oid() in tree-diff.c recursively without short-circuiting on equal OIDs. I will prepare a patch that adds this OID-equal short-circuit to avoid this exponential behavior. I'll model my patch against a similar patch in master: commit d12a8cf0af18804c2000efc7a0224da631e04cd1 unpack-trees: avoid duplicate ODB lookups during checkout It will also significantly speed up rev-list calls for short paths in deep repositories. It will not be very measurable in the git or Linux repos because their shallow folder structure. Thanks, -Stolee Since I don't understand enough about the consumers to diff_tree_oid() (and the fact that the recursive behavior may be wanted in some cases), I think we can fix this in builtin/rev-list.c with this simple diff: --- diff --git a/builtin/rev-list.c b/builtin/rev-list.c index ded1577424..b2e8e02cc8 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) git_config(git_default_config, NULL); init_revisions(, prefix); + + revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE; + revs.abbrev = DEFAULT_ABBREV; revs.commit_format = CMIT_FMT_UNSPECIFIED; argc = setup_revisions(argc, argv, , NULL);
Re: git-clone causes out of memory
On 10/13/2017 8:44 AM, Jeff King wrote: On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote: On 13.10.2017 15:04, Junio C Hamano wrote: Christian Couderwrites: Yeah, but perhaps Git could be smarter when rev-listing too and avoid processing files or directories it has already seen? Aren't you suggesting to optimize for a wrong case? Anything that is possible with a software should be considered as a possible usecase. It's in fact a DoS attack. Imagine there's a server that using git to process something, and now there's a way to knock down this server. It's also bad from a promotional stand point. But the point is that you'd have the same problem with any repository that had 10^7 files in it. Yes, it's convenient for the attacker that there are only 9 objects, but fundamentally it's pretty easy for an attacker to construct repositories that have large trees (or very deep trees -- that's what causes stack exhaustion in some cases). Note too that this attack almost always comes down to the diff code (which is why it kicks in for pathspec limiting) which has to actually expand the tree. Most "normal" server-side operations (like accepting pushes or serving fetches) operate only on the object graph and _do_ avoid processing already-seen objects. As soon as servers start trying to checkout or diff, though, the attack surface gets quite large. And you really need to start thinking about having resource limits and quotas for CPU and memory use of each process (and group by requesting user, IP, repo, etc). I think the main thing Git could be doing here is to limit the size of the tree (both width and depth). But arbitrary limits like that have a way of being annoying, and I think it just pushes resource-exhaustion attacks off a little (e.g., can you construct a blob that behaves badly with the "--patch"?). -Peff I'm particularly interested in why `git rev-list HEAD -- [path]` gets slower in this case, because I wrote the history algorithm used by VSTS. In our algorithm, we only walk the list of objects from commit to the tree containing the path item. For example, in the path d0/d0/d0, we would only walk: commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 entry] From this, we can determine the TREESAME relationship by parsing four objects without parsing all contents below d0/d0/d0. The reason we have exponential behavior in `git rev-list` is because we are calling diff_tree_oid() in tree-diff.c recursively without short-circuiting on equal OIDs. I will prepare a patch that adds this OID-equal short-circuit to avoid this exponential behavior. I'll model my patch against a similar patch in master: commit d12a8cf0af18804c2000efc7a0224da631e04cd1 unpack-trees: avoid duplicate ODB lookups during checkout It will also significantly speed up rev-list calls for short paths in deep repositories. It will not be very measurable in the git or Linux repos because their shallow folder structure. Thanks, -Stolee
Re: git-clone causes out of memory
On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote: > On 13.10.2017 15:04, Junio C Hamano wrote: > > Christian Couderwrites: > > > > > Yeah, but perhaps Git could be smarter when rev-listing too and avoid > > > processing files or directories it has already seen? > > > > Aren't you suggesting to optimize for a wrong case? > > > > Anything that is possible with a software should be considered as a possible > usecase. It's in fact a DoS attack. Imagine there's a server that using git > to process something, and now there's a way to knock down this server. It's > also bad from a promotional stand point. But the point is that you'd have the same problem with any repository that had 10^7 files in it. Yes, it's convenient for the attacker that there are only 9 objects, but fundamentally it's pretty easy for an attacker to construct repositories that have large trees (or very deep trees -- that's what causes stack exhaustion in some cases). Note too that this attack almost always comes down to the diff code (which is why it kicks in for pathspec limiting) which has to actually expand the tree. Most "normal" server-side operations (like accepting pushes or serving fetches) operate only on the object graph and _do_ avoid processing already-seen objects. As soon as servers start trying to checkout or diff, though, the attack surface gets quite large. And you really need to start thinking about having resource limits and quotas for CPU and memory use of each process (and group by requesting user, IP, repo, etc). I think the main thing Git could be doing here is to limit the size of the tree (both width and depth). But arbitrary limits like that have a way of being annoying, and I think it just pushes resource-exhaustion attacks off a little (e.g., can you construct a blob that behaves badly with the "--patch"?). -Peff
Re: git-clone causes out of memory
On Fri, Oct 13, 2017 at 07:06:03PM +0900, Mike Hommey wrote: > On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote: > > There's a gitbomb on github. It is undoubtedly creative and funny, but since > > this is a bug in git, I thought it'd be nice to report. The command: > > > > $ git clone https://github.com/x0rz/ShadowBrokersFiles > > What fills memory is actually the checkout part of the command. git > clone -n doesn't fail. > > Credit should go where it's due: https://kate.io/blog/git-bomb/ > (with the bonus that it comes with explanations) That is a nice explanation, and I think the dates point to the kate.io post as the parent there. I certainly have come across this type of "bomb" repo before, but I couldn't come up with any public references (somebody uploaded such a repository to GitHub in 2014). So I think this may be the first public discussion of the idea. -Peff
Re: git-clone causes out of memory
On 13.10.2017 15:04, Junio C Hamano wrote: Christian Couderwrites: Yeah, but perhaps Git could be smarter when rev-listing too and avoid processing files or directories it has already seen? Aren't you suggesting to optimize for a wrong case? Anything that is possible with a software should be considered as a possible usecase. It's in fact a DoS attack. Imagine there's a server that using git to process something, and now there's a way to knock down this server. It's also bad from a promotional stand point.
Re: git-clone causes out of memory
Christian Couderwrites: > Yeah, but perhaps Git could be smarter when rev-listing too and avoid > processing files or directories it has already seen? Aren't you suggesting to optimize for a wrong case?
Re: git-clone causes out of memory
On Fri, Oct 13, 2017 at 12:37 PM, Mike Hommeywrote: > On Fri, Oct 13, 2017 at 12:26:46PM +0200, Christian Couder wrote: >> >> After cloning it with -n, there is the following "funny" situation: >> >> $ time git rev-list HEAD >> 7af99c9e7d4768fa681f4fe4ff61259794cf719b >> 18ed56cbc5012117e24a603e7c072cf65d36d469 >> 45546f17e5801791d4bc5968b91253a2f4b0db72 >> >> real0m0.004s >> user0m0.000s >> sys 0m0.004s >> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0/f0 >> >> real0m0.004s >> user0m0.000s >> sys 0m0.000s >> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0 >> >> real0m0.004s >> user0m0.000s >> sys 0m0.000s >> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/ >> 45546f17e5801791d4bc5968b91253a2f4b0db72 >> >> real0m0.005s >> user0m0.008s >> sys 0m0.000s >> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/ >> 45546f17e5801791d4bc5968b91253a2f4b0db72 >> >> real0m0.203s >> user0m0.112s >> sys 0m0.088s >> $ time git rev-list HEAD -- d0/d0/d0/d0/ >> 45546f17e5801791d4bc5968b91253a2f4b0db72 >> >> real0m1.305s >> user0m0.720s >> sys 0m0.580s >> $ time git rev-list HEAD -- d0/d0/d0/ >> 45546f17e5801791d4bc5968b91253a2f4b0db72 >> >> real0m12.135s >> user0m6.700s >> sys 0m5.412s >> >> So `git rev-list` becomes exponentially more expensive when you run it >> on a shorter directory path, though it is fast if you run it without a >> path. > > That's because there are 10^7 files under d0/d0/d0, 10^6 under > d0/d0/d0/d0/, 10^5 under d0/d0/d0/d0/d0/ etc. > > So really, this is all about things being slower when there's a crazy > number of files. Picture me surprised. > > What makes it kind of special is that the repository contains a lot of > paths/files, but very few objects, because it's duplicating everything. > > All the 10^10 blobs have the same content, all the 10^9 trees that point > to them have the same content, all the 10^8 trees that point to those > trees have the same content, etc. > > If git wasn't effectively deduplicating identical content, the repository > would be multiple gigabytes large. Yeah, but perhaps Git could be smarter when rev-listing too and avoid processing files or directories it has already seen?
Re: git-clone causes out of memory
On Fri, Oct 13, 2017 at 12:26:46PM +0200, Christian Couder wrote: > On Fri, Oct 13, 2017 at 12:06 PM, Mike Hommeywrote: > > On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote: > >> There's a gitbomb on github. It is undoubtedly creative and funny, but > >> since > >> this is a bug in git, I thought it'd be nice to report. The command: > >> > >> $ git clone https://github.com/x0rz/ShadowBrokersFiles > > > > What fills memory is actually the checkout part of the command. git > > clone -n doesn't fail. > > > > Credit should go where it's due: https://kate.io/blog/git-bomb/ > > (with the bonus that it comes with explanations) > > Yeah, there is a thread on Hacker News about this too: > > https://news.ycombinator.com/item?id=15457076 > > The original repo on GitHub is: > > https://github.com/Katee/git-bomb.git > > After cloning it with -n, there is the following "funny" situation: > > $ time git rev-list HEAD > 7af99c9e7d4768fa681f4fe4ff61259794cf719b > 18ed56cbc5012117e24a603e7c072cf65d36d469 > 45546f17e5801791d4bc5968b91253a2f4b0db72 > > real0m0.004s > user0m0.000s > sys 0m0.004s > $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0/f0 > > real0m0.004s > user0m0.000s > sys 0m0.000s > $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0 > > real0m0.004s > user0m0.000s > sys 0m0.000s > $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/ > 45546f17e5801791d4bc5968b91253a2f4b0db72 > > real0m0.005s > user0m0.008s > sys 0m0.000s > $ time git rev-list HEAD -- d0/d0/d0/d0/d0/ > 45546f17e5801791d4bc5968b91253a2f4b0db72 > > real0m0.203s > user0m0.112s > sys 0m0.088s > $ time git rev-list HEAD -- d0/d0/d0/d0/ > 45546f17e5801791d4bc5968b91253a2f4b0db72 > > real0m1.305s > user0m0.720s > sys 0m0.580s > $ time git rev-list HEAD -- d0/d0/d0/ > 45546f17e5801791d4bc5968b91253a2f4b0db72 > > real0m12.135s > user0m6.700s > sys 0m5.412s > > So `git rev-list` becomes exponentially more expensive when you run it > on a shorter directory path, though it is fast if you run it without a > path. That's because there are 10^7 files under d0/d0/d0, 10^6 under d0/d0/d0/d0/, 10^5 under d0/d0/d0/d0/d0/ etc. So really, this is all about things being slower when there's a crazy number of files. Picture me surprised. What makes it kind of special is that the repository contains a lot of paths/files, but very few objects, because it's duplicating everything. All the 10^10 blobs have the same content, all the 10^9 trees that point to them have the same content, all the 10^8 trees that point to those trees have the same content, etc. If git wasn't effectively deduplicating identical content, the repository would be multiple gigabytes large. Mike
Re: git-clone causes out of memory
On Fri, Oct 13, 2017 at 12:06 PM, Mike Hommeywrote: > On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote: >> There's a gitbomb on github. It is undoubtedly creative and funny, but since >> this is a bug in git, I thought it'd be nice to report. The command: >> >> $ git clone https://github.com/x0rz/ShadowBrokersFiles > > What fills memory is actually the checkout part of the command. git > clone -n doesn't fail. > > Credit should go where it's due: https://kate.io/blog/git-bomb/ > (with the bonus that it comes with explanations) Yeah, there is a thread on Hacker News about this too: https://news.ycombinator.com/item?id=15457076 The original repo on GitHub is: https://github.com/Katee/git-bomb.git After cloning it with -n, there is the following "funny" situation: $ time git rev-list HEAD 7af99c9e7d4768fa681f4fe4ff61259794cf719b 18ed56cbc5012117e24a603e7c072cf65d36d469 45546f17e5801791d4bc5968b91253a2f4b0db72 real0m0.004s user0m0.000s sys 0m0.004s $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0/f0 real0m0.004s user0m0.000s sys 0m0.000s $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0 real0m0.004s user0m0.000s sys 0m0.000s $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/ 45546f17e5801791d4bc5968b91253a2f4b0db72 real0m0.005s user0m0.008s sys 0m0.000s $ time git rev-list HEAD -- d0/d0/d0/d0/d0/ 45546f17e5801791d4bc5968b91253a2f4b0db72 real0m0.203s user0m0.112s sys 0m0.088s $ time git rev-list HEAD -- d0/d0/d0/d0/ 45546f17e5801791d4bc5968b91253a2f4b0db72 real0m1.305s user0m0.720s sys 0m0.580s $ time git rev-list HEAD -- d0/d0/d0/ 45546f17e5801791d4bc5968b91253a2f4b0db72 real0m12.135s user0m6.700s sys 0m5.412s So `git rev-list` becomes exponentially more expensive when you run it on a shorter directory path, though it is fast if you run it without a path.
Re: git-clone causes out of memory
On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote: > There's a gitbomb on github. It is undoubtedly creative and funny, but since > this is a bug in git, I thought it'd be nice to report. The command: > > $ git clone https://github.com/x0rz/ShadowBrokersFiles What fills memory is actually the checkout part of the command. git clone -n doesn't fail. Credit should go where it's due: https://kate.io/blog/git-bomb/ (with the bonus that it comes with explanations) Mike