Re: git-clone causes out of memory

2017-10-13 Thread Derrick Stolee

On 10/13/2017 10:26 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote:


This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
causing diff_can_quit_early() to return false. Due to the corner-case nature
of the bug, it seems this will not be a huge performance improvement in most
cases. Still worth fixing, and I'm looking at your suggestions to try and
learn this area better.

Yeah, I just timed some pathspec limits on linux.git, and it makes at
best a fraction of a percent improvement (but any improvement is well
within run-to-run noise), which is not surprising.

I agree it's worth fixing, though.

-Peff


I just ran a first-level folder history in the Windows repository and
`git rev-list HEAD -- windows >/dev/null` went from 0.43s to 0.04s. That is
a good percentage improvement, but even on this enormous repo the bug
doesn't present an issue to human users.


Thanks,
-Stolee


Re: git-clone causes out of memory

2017-10-13 Thread Jeff King
On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote:

> This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
> causing diff_can_quit_early() to return false. Due to the corner-case nature
> of the bug, it seems this will not be a huge performance improvement in most
> cases. Still worth fixing, and I'm looking at your suggestions to try and
> learn this area better.

Yeah, I just timed some pathspec limits on linux.git, and it makes at
best a fraction of a percent improvement (but any improvement is well
within run-to-run noise), which is not surprising.

I agree it's worth fixing, though.

-Peff


Re: git-clone causes out of memory

2017-10-13 Thread Derrick Stolee

On 10/13/2017 10:20 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 10:10:18AM -0400, Jeff King wrote:


Hmm. So this patch makes it go fast:

diff --git a/revision.c b/revision.c
index d167223e69..b52ea4e9d8 100644
--- a/revision.c
+++ b/revision.c
@@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options,
 	int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
 
 	tree_difference |= diff;
-	if (tree_difference == REV_TREE_DIFFERENT)
+	if (tree_difference & REV_TREE_DIFFERENT)
 		DIFF_OPT_SET(options, HAS_CHANGES);
 }
  


But that essentially makes the conditional a noop (since we know we set
either NEW or OLD above and DIFFERENT is the union of those flags).

I'm not sure I understand why file_add_remove() would ever want to avoid
setting HAS_CHANGES (certainly its companion file_change() always does).
This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use
diff-tree --quiet machinery., 2007-03-14).

Maybe I am missing something, but AFAICT this was always buggy. But
since it only affects adds and deletes, maybe nobody noticed? I'm also
not sure if it only causes a slowdown, or if this could cause us to
erroneously mark something as TREESAME which isn't (I _do_ think people
would have noticed that).

Answering my own question a little, there is a hint in the comment
a few lines above:

   /*
    * The goal is to get REV_TREE_NEW as the result only if the
    * diff consists of all '+' (and no other changes), REV_TREE_OLD
    * if the whole diff is removal of old data, and otherwise
    * REV_TREE_DIFFERENT (of course if the trees are the same we
    * want REV_TREE_SAME).
    * That means that once we get to REV_TREE_DIFFERENT, we do not
    * have to look any further.
    */

So my patch above is breaking that. But it's not clear from that comment
why we care about knowing the difference between NEW, OLD, and DIFFERENT.

Grepping around for REV_TREE_NEW and REV_TREE_OLD, I think the answer is
in try_to_simplify_commit():

	case REV_TREE_NEW:
		if (revs->remove_empty_trees &&
		    rev_same_tree_as_empty(revs, p)) {
			/* We are adding all the specified
			 * paths from this parent, so the
			 * history beyond this parent is not
			 * interesting.  Remove its parents
			 * (they are grandparents for us).
			 * IOW, we pretend this parent is a
			 * "root" commit.
			 */
			if (parse_commit(p) < 0)
				die("cannot simplify commit %s (invalid %s)",
				    oid_to_hex(&commit->object.oid),
				    oid_to_hex(&p->object.oid));
			p->parents = NULL;
		}
		/* fallthrough */
	case REV_TREE_OLD:
	case REV_TREE_DIFFERENT:

So when --remove-empty is not in effect (and it's not by default), we
don't care about OLD vs NEW, and we should be able to optimize further.

-Peff


This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
causing diff_can_quit_early() to return false. Due to the corner-case nature
of the bug, it seems this will not be a huge performance improvement in most
cases. Still worth fixing, and I'm looking at your suggestions to try and
learn this area better.


It will speed up natural cases of someone adding or renaming a folder
with a lot of contents, including someone initializing a repository from
an existing codebase. This git bomb case happens to add all of the
folders at once, which is why the performance impact is so noticeable. I
use a version of this "exponential file growth" for testing that
increases the folder depth one commit at a time, so the contents are
rather small in the first "add", which is why I had not noticed this issue.


Thanks,
-Stolee


Re: git-clone causes out of memory

2017-10-13 Thread Jeff King
On Fri, Oct 13, 2017 at 10:10:18AM -0400, Jeff King wrote:

> Hmm. So this patch makes it go fast:
> 
> diff --git a/revision.c b/revision.c
> index d167223e69..b52ea4e9d8 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options,
>   int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
>  
>   tree_difference |= diff;
> - if (tree_difference == REV_TREE_DIFFERENT)
> + if (tree_difference & REV_TREE_DIFFERENT)
>   DIFF_OPT_SET(options, HAS_CHANGES);
>  }
>  
> 
> But that essentially makes the conditional a noop (since we know we set
> either NEW or OLD above and DIFFERENT is the union of those flags).
> 
> I'm not sure I understand why file_add_remove() would ever want to avoid
> setting HAS_CHANGES (certainly its companion file_change() always does).
> This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use
> diff-tree --quiet machinery., 2007-03-14).
> 
> Maybe I am missing something, but AFAICT this was always buggy. But
> since it only affects adds and deletes, maybe nobody noticed? I'm also
> not sure if it only causes a slowdown, or if this could cause us to
> erroneously mark something as TREESAME which isn't (I _do_ think people
> would have noticed that).

Answering my own question a little, there is a hint in the comment
a few lines above:

  /*
   * The goal is to get REV_TREE_NEW as the result only if the
   * diff consists of all '+' (and no other changes), REV_TREE_OLD
   * if the whole diff is removal of old data, and otherwise
   * REV_TREE_DIFFERENT (of course if the trees are the same we
   * want REV_TREE_SAME).
   * That means that once we get to REV_TREE_DIFFERENT, we do not
   * have to look any further.
   */

So my patch above is breaking that. But it's not clear from that comment
why we care about knowing the difference between NEW, OLD, and DIFFERENT.

Grepping around for REV_TREE_NEW and REV_TREE_OLD, I think the answer is
in try_to_simplify_commit():

	case REV_TREE_NEW:
		if (revs->remove_empty_trees &&
		    rev_same_tree_as_empty(revs, p)) {
			/* We are adding all the specified
			 * paths from this parent, so the
			 * history beyond this parent is not
			 * interesting.  Remove its parents
			 * (they are grandparents for us).
			 * IOW, we pretend this parent is a
			 * "root" commit.
			 */
			if (parse_commit(p) < 0)
				die("cannot simplify commit %s (invalid %s)",
				    oid_to_hex(&commit->object.oid),
				    oid_to_hex(&p->object.oid));
			p->parents = NULL;
		}
		/* fallthrough */
	case REV_TREE_OLD:
	case REV_TREE_DIFFERENT:

So when --remove-empty is not in effect (and it's not by default), we
don't care about OLD vs NEW, and we should be able to optimize further.
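
To make that concrete, here is a minimal sketch (not an actual git patch) of
the decision file_add_remove() would have to make if it could see whether
--remove-empty is in effect. The REV_TREE_* values mirror revision.h; the
remove_empty_trees parameter is hypothetical, since the real function has no
direct access to the rev_info, and wiring that up would be the actual work:

#define REV_TREE_SAME      0
#define REV_TREE_NEW       1   /* only additions seen so far */
#define REV_TREE_OLD       2   /* only deletions seen so far */
#define REV_TREE_DIFFERENT 3   /* REV_TREE_NEW | REV_TREE_OLD */

/*
 * Sketch only: without --remove-empty the NEW/OLD distinction is unused,
 * so any difference at all could set HAS_CHANGES and let QUICK bail out.
 */
static int should_set_has_changes(int tree_difference, int remove_empty_trees)
{
	if (!remove_empty_trees)
		return tree_difference != REV_TREE_SAME;
	return tree_difference == REV_TREE_DIFFERENT;
}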

-Peff


Re: git-clone causes out of memory

2017-10-13 Thread Jeff King
On Fri, Oct 13, 2017 at 09:56:36AM -0400, Jeff King wrote:

> On Fri, Oct 13, 2017 at 09:55:15AM -0400, Derrick Stolee wrote:
> 
> > > We should be comparing an empty tree and d0/d0/d0/d0 (or however deep
> > > your pathspec goes). We should be able to see immediately that the entry
> > > is not present between the two and not bother descending. After all,
> > > we've set the QUICK flag in init_revisions(). So the real question is
> > > why QUICK is not kicking in.
> > 
> > I'm struggling to understand your meaning. We want to walk from root to
> > d0/d0/d0/d0, but there is no reason to walk beyond that tree. But maybe
> > that's what the QUICK flag is supposed to do.
> 
> Yes, that's exactly what it is for. When we see the first difference we
> should say "aha, the caller only wanted to know whether there was a
> difference, not what it was" and return immediately. See
> diff_can_quit_early().

Hmm. So this patch makes it go fast:

diff --git a/revision.c b/revision.c
index d167223e69..b52ea4e9d8 100644
--- a/revision.c
+++ b/revision.c
@@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options,
 	int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
 
 	tree_difference |= diff;
-	if (tree_difference == REV_TREE_DIFFERENT)
+	if (tree_difference & REV_TREE_DIFFERENT)
 		DIFF_OPT_SET(options, HAS_CHANGES);
 }
 

But that essentially makes the conditional a noop (since we know we set
either NEW or OLD above and DIFFERENT is the union of those flags).
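
To spell out why the == test never fires for a pure add or delete, the flag
values from revision.h (SAME=0, NEW=1, OLD=2, DIFFERENT=3) make it easy to
check in a throwaway program; this is just an illustration, not git code:

#include <stdio.h>

#define REV_TREE_SAME      0
#define REV_TREE_NEW       1   /* only new files */
#define REV_TREE_OLD       2   /* only files removed */
#define REV_TREE_DIFFERENT 3   /* mixed changes: REV_TREE_NEW | REV_TREE_OLD */

int main(void)
{
	int tree_difference = REV_TREE_SAME;

	/* A pure add: file_add_remove() only ever ORs in REV_TREE_NEW. */
	tree_difference |= REV_TREE_NEW;

	/* The old test is true only for mixed changes, so HAS_CHANGES stays unset. */
	printf("==: %d\n", tree_difference == REV_TREE_DIFFERENT);       /* prints 0 */

	/* The patched test counts any add or delete. */
	printf("& : %d\n", (tree_difference & REV_TREE_DIFFERENT) != 0); /* prints 1 */
	return 0;
}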

I'm not sure I understand why file_add_remove() would ever want to avoid
setting HAS_CHANGES (certainly its companion file_change() always does).
This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use
diff-tree --quiet machinery., 2007-03-14).

Maybe I am missing something, but AFAICT this was always buggy. But
since it only affects adds and deletes, maybe nobody noticed? I'm also
not sure if it only causes a slowdown, or if this could cause us to
erroneously mark something as TREESAME which isn't (I _do_ think people
would have noticed that).

-Peff


Re: git-clone causes out of memory

2017-10-13 Thread Jeff King
On Fri, Oct 13, 2017 at 09:55:15AM -0400, Derrick Stolee wrote:

> > We should be comparing an empty tree and d0/d0/d0/d0 (or however deep
> > your pathspec goes). We should be able to see immediately that the entry
> > is not present between the two and not bother descending. After all,
> > we've set the QUICK flag in init_revisions(). So the real question is
> > why QUICK is not kicking in.
> 
> I'm struggling to understand your meaning. We want to walk from root to
> d0/d0/d0/d0, but there is no reason to walk beyond that tree. But maybe
> that's what the QUICK flag is supposed to do.

Yes, that's exactly what it is for. When we see the first difference we
should say "aha, the caller only wanted to know whether there was a
difference, not what it was" and return immediately. See
diff_can_quit_early().
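
For reference, diff_can_quit_early() in diff.h of that era looks roughly like
this, which is why a missing HAS_CHANGES bit keeps QUICK from ever triggering:

static inline int diff_can_quit_early(struct diff_options *opt)
{
	return (DIFF_OPT_TST(opt, QUICK) &&
		!opt->filter &&
		DIFF_OPT_TST(opt, HAS_CHANGES));
}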

-Peff


Re: git-clone causes out of memory

2017-10-13 Thread Derrick Stolee

On 10/13/2017 9:50 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 09:39:14AM -0400, Derrick Stolee wrote:


Since I don't understand enough about the consumers of diff_tree_oid() (and
the recursive behavior may be wanted in some cases), I think we can fix this
in builtin/rev-list.c with this simple diff:

---

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ded1577424..b2e8e02cc8 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 
 	git_config(git_default_config, NULL);
 	init_revisions(&revs, prefix);
+
+	revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE;
+


(Note: I'm running tests now and see that this breaks behavior. 
Definitely not the solution we want.)



Hmm, this feels wrong, because we _do_ want to recurse down and follow
the pathspec to see if there are real changes.

We should be comparing an empty tree and d0/d0/d0/d0 (or however deep
your pathspec goes). We should be able to see immediately that the entry
is not present between the two and not bother descending. After all,
we've set the QUICK flag in init_revisions(). So the real question is
why QUICK is not kicking in.

-Peff


I'm struggling to understand your meaning. We want to walk from root to 
d0/d0/d0/d0, but there is no reason to walk beyond that tree. But maybe 
that's what the QUICK flag is supposed to do.


Thanks,
-Stolee


Re: git-clone causes out of memory

2017-10-13 Thread Jeff King
On Fri, Oct 13, 2017 at 09:39:14AM -0400, Derrick Stolee wrote:

> Since I don't understand enough about the consumers of diff_tree_oid() (and
> the recursive behavior may be wanted in some cases), I think we can fix this
> in builtin/rev-list.c with this simple diff:
> 
> ---
> 
> diff --git a/builtin/rev-list.c b/builtin/rev-list.c
> index ded1577424..b2e8e02cc8 100644
> --- a/builtin/rev-list.c
> +++ b/builtin/rev-list.c
> @@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
>  
>  	git_config(git_default_config, NULL);
>  	init_revisions(&revs, prefix);
> +
> +	revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE;
> +

Hmm, this feels wrong, because we _do_ want to recurse down and follow
the pathspec to see if there are real changes.

We should be comparing an empty tree and d0/d0/d0/d0 (or however deep
your pathspec goes). We should be able to see immediately that the entry
is not present between the two and not bother descending. After all,
we've set the QUICK flag in init_revisions(). So the real question is
why QUICK is not kicking in.
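
For context, init_revisions() in revision.c sets up the pruning options along
these lines (an approximate excerpt, trimmed to the relevant bits), which is
where QUICK, RECURSIVE, and the file_add_remove() callback all meet:

	/* in init_revisions(), approximately: */
	DIFF_OPT_SET(&revs->pruning, QUICK);
	DIFF_OPT_SET(&revs->pruning, RECURSIVE);
	revs->pruning.add_remove = file_add_remove;
	revs->pruning.change = file_change;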

-Peff


Re: git-clone causes out of memory

2017-10-13 Thread Derrick Stolee

On 10/13/2017 9:15 AM, Derrick Stolee wrote:

On 10/13/2017 8:44 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote:


On 13.10.2017 15:04, Junio C Hamano wrote:

Christian Couder  writes:


Yeah, but perhaps Git could be smarter when rev-listing too and avoid
processing files or directories it has already seen?

Aren't you suggesting to optimize for a wrong case?

Anything that is possible with software should be considered a possible
use case. It's in fact a DoS attack. Imagine there's a server that uses git
to process something, and now there's a way to knock down this server. It's
also bad from a promotional standpoint.

But the point is that you'd have the same problem with any repository
that had 10^7 files in it. Yes, it's convenient for the attacker that
there are only 9 objects, but fundamentally it's pretty easy for an
attacker to construct repositories that have large trees (or very deep
trees -- that's what causes stack exhaustion in some cases).

Note too that this attack almost always comes down to the diff code
(which is why it kicks in for pathspec limiting) which has to actually
expand the tree. Most "normal" server-side operations (like accepting
pushes or serving fetches) operate only on the object graph and _do_
avoid processing already-seen objects.

As soon as servers start trying to checkout or diff, though, the attack
surface gets quite large. And you really need to start thinking about
having resource limits and quotas for CPU and memory use of each process
(and group by requesting user, IP, repo, etc).

I think the main thing Git could be doing here is to limit the size of
the tree (both width and depth). But arbitrary limits like that have a
way of being annoying, and I think it just pushes resource-exhaustion
attacks off a little (e.g., can you construct a blob that behaves badly
with the "--patch"?).

-Peff


I'm particularly interested in why `git rev-list HEAD -- [path]` gets 
slower in this case, because I wrote the history algorithm used by 
VSTS. In our algorithm, we only walk the list of objects from commit 
to the tree containing the path item. For example, in the path 
d0/d0/d0, we would only walk:


    commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 entry]


From this, we can determine the TREESAME relationship by parsing four 
objects without parsing all contents below d0/d0/d0.


The reason we have exponential behavior in `git rev-list` is that we are
calling diff_tree_oid() in tree-diff.c recursively without
short-circuiting on equal OIDs.


I will prepare a patch that adds this OID-equal short-circuit to avoid 
this exponential behavior. I'll model my patch against a similar patch 
in master:


    commit d12a8cf0af18804c2000efc7a0224da631e04cd1 unpack-trees: 
avoid duplicate ODB lookups during checkout


It will also significantly speed up rev-list calls for short paths in
deep repositories. It will not be very measurable in the git or Linux
repos because of their shallow folder structure.


Thanks,
-Stolee


Since I don't understand enough about the consumers of diff_tree_oid()
(and the recursive behavior may be wanted in some cases), I think we can
fix this in builtin/rev-list.c with this simple diff:


---

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ded1577424..b2e8e02cc8 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 
 	git_config(git_default_config, NULL);
 	init_revisions(&revs, prefix);
+
+	revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE;
+
 	revs.abbrev = DEFAULT_ABBREV;
 	revs.commit_format = CMIT_FMT_UNSPECIFIED;
 	argc = setup_revisions(argc, argv, &revs, NULL);



Re: git-clone causes out of memory

2017-10-13 Thread Derrick Stolee

On 10/13/2017 8:44 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote:


On 13.10.2017 15:04, Junio C Hamano wrote:

Christian Couder  writes:


Yeah, but perhaps Git could be smarter when rev-listing too and avoid
processing files or directories it has already seen?

Aren't you suggesting to optimize for a wrong case?


Anything that is possible with software should be considered a possible
use case. It's in fact a DoS attack. Imagine there's a server that uses git
to process something, and now there's a way to knock down this server. It's
also bad from a promotional standpoint.

But the point is that you'd have the same problem with any repository
that had 10^7 files in it. Yes, it's convenient for the attacker that
there are only 9 objects, but fundamentally it's pretty easy for an
attacker to construct repositories that have large trees (or very deep
trees -- that's what causes stack exhaustion in some cases).

Note too that this attack almost always comes down to the diff code
(which is why it kicks in for pathspec limiting) which has to actually
expand the tree. Most "normal" server-side operations (like accepting
pushes or serving fetches) operate only on the object graph and _do_
avoid processing already-seen objects.

As soon as servers start trying to checkout or diff, though, the attack
surface gets quite large. And you really need to start thinking about
having resource limits and quotas for CPU and memory use of each process
(and group by requesting user, IP, repo, etc).

I think the main thing Git could be doing here is to limit the size of
the tree (both width and depth). But arbitrary limits like that have a
way of being annoying, and I think it just pushes resource-exhaustion
attacks off a little (e.g., can you construct a blob that behaves badly
with the "--patch"?).

-Peff


I'm particularly interested in why `git rev-list HEAD -- [path]` gets 
slower in this case, because I wrote the history algorithm used by VSTS. 
In our algorithm, we only walk the list of objects from commit to the 
tree containing the path item. For example, in the path d0/d0/d0, we 
would only walk:


    commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 entry]


From this, we can determine the TREESAME relationship by parsing four 
objects without parsing all contents below d0/d0/d0.
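
As a rough illustration of that walk (a toy sketch with made-up structures,
not VSTS or git code), comparing the entry OIDs one path component at a time
lets you stop at the first equal or missing entry:

#include <stddef.h>
#include <string.h>

struct toy_tree;

struct toy_entry {
	const char *name;
	const char *oid;                 /* hex object id of this entry */
	const struct toy_tree *subtree;  /* NULL for blobs */
};

struct toy_tree {
	const struct toy_entry *entries;
	size_t nr;
};

static const struct toy_entry *toy_lookup(const struct toy_tree *t,
					  const char *name)
{
	for (size_t i = 0; t && i < t->nr; i++)
		if (!strcmp(t->entries[i].name, name))
			return &t->entries[i];
	return NULL;
}

/* Is the path (split into components) TREESAME between root trees a and b? */
static int path_treesame(const struct toy_tree *a, const struct toy_tree *b,
			 const char **components, size_t depth)
{
	for (size_t i = 0; i < depth; i++) {
		const struct toy_entry *ea = toy_lookup(a, components[i]);
		const struct toy_entry *eb = toy_lookup(b, components[i]);

		if (!ea && !eb)
			return 1;   /* path absent on both sides */
		if (!ea || !eb)
			return 0;   /* added or deleted somewhere along the path */
		if (!strcmp(ea->oid, eb->oid))
			return 1;   /* identical OIDs: no need to look deeper */
		a = ea->subtree;
		b = eb->subtree;
	}
	return 0;   /* reached the entry itself and its OIDs differ */
}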


The reason we have exponential behavior in `git rev-list` is that we are
calling diff_tree_oid() in tree-diff.c recursively without
short-circuiting on equal OIDs.


I will prepare a patch that adds this OID-equal short-circuit to avoid 
this exponential behavior. I'll model my patch against a similar patch 
in master:


    commit d12a8cf0af18804c2000efc7a0224da631e04cd1 unpack-trees: avoid 
duplicate ODB lookups during checkout


It will also significantly speed up rev-list calls for short paths in
deep repositories. It will not be very measurable in the git or Linux
repos because of their shallow folder structure.


Thanks,
-Stolee


Re: git-clone causes out of memory

2017-10-13 Thread Jeff King
On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote:

> On 13.10.2017 15:04, Junio C Hamano wrote:
> > Christian Couder  writes:
> > 
> > > Yeah, but perhaps Git could be smarter when rev-listing too and avoid
> > > processing files or directories it has already seen?
> > 
> > Aren't you suggesting to optimize for a wrong case?
> > 
> 
> Anything that is possible with software should be considered a possible
> use case. It's in fact a DoS attack. Imagine there's a server that uses git
> to process something, and now there's a way to knock down this server. It's
> also bad from a promotional standpoint.

But the point is that you'd have the same problem with any repository
that had 10^7 files in it. Yes, it's convenient for the attacker that
there are only 9 objects, but fundamentally it's pretty easy for an
attacker to construct repositories that have large trees (or very deep
trees -- that's what causes stack exhaustion in some cases).

Note too that this attack almost always comes down to the diff code
(which is why it kicks in for pathspec limiting) which has to actually
expand the tree. Most "normal" server-side operations (like accepting
pushes or serving fetches) operate only on the object graph and _do_
avoid processing already-seen objects.

As soon as servers start trying to checkout or diff, though, the attack
surface gets quite large. And you really need to start thinking about
having resource limits and quotas for CPU and memory use of each process
(and group by requesting user, IP, repo, etc).

I think the main thing Git could be doing here is to limit the size of
the tree (both width and depth). But arbitrary limits like that have a
way of being annoying, and I think it just pushes resource-exhaustion
attacks off a little (e.g., can you construct a blob that behaves badly
with the "--patch"?).

-Peff


Re: git-clone causes out of memory

2017-10-13 Thread Jeff King
On Fri, Oct 13, 2017 at 07:06:03PM +0900, Mike Hommey wrote:

> On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote:
> > There's a gitbomb on github. It is undoubtedly creative and funny, but since
> > this is a bug in git, I thought it'd be nice to report. The command:
> > 
> > $ git clone https://github.com/x0rz/ShadowBrokersFiles
> 
> What fills memory is actually the checkout part of the command. git
> clone -n doesn't fail.
> 
> Credit should go where it's due: https://kate.io/blog/git-bomb/
> (with the bonus that it comes with explanations)

That is a nice explanation, and I think the dates point to the kate.io
post as the parent there.

I certainly have come across this type of "bomb" repo before, but I
couldn't come up with any public references (somebody uploaded such a
repository to GitHub in 2014). So I think this may be the first public
discussion of the idea.

-Peff


Re: git-clone causes out of memory

2017-10-13 Thread Constantine

On 13.10.2017 15:04, Junio C Hamano wrote:

Christian Couder  writes:


Yeah, but perhaps Git could be smarter when rev-listing too and avoid
processing files or directories it has already seen?


Aren't you suggesting to optimize for a wrong case?



Anything that is possible with software should be considered a possible
use case. It's in fact a DoS attack. Imagine there's a server that uses git
to process something, and now there's a way to knock down this server. It's
also bad from a promotional standpoint.


Re: git-clone causes out of memory

2017-10-13 Thread Junio C Hamano
Christian Couder  writes:

> Yeah, but perhaps Git could be smarter when rev-listing too and avoid
> processing files or directories it has already seen?

Aren't you suggesting to optimize for a wrong case?


Re: git-clone causes out of memory

2017-10-13 Thread Christian Couder
On Fri, Oct 13, 2017 at 12:37 PM, Mike Hommey  wrote:
> On Fri, Oct 13, 2017 at 12:26:46PM +0200, Christian Couder wrote:
>>
>> After cloning it with -n, there is the following "funny" situation:
>>
>> $ time git rev-list HEAD
>> 7af99c9e7d4768fa681f4fe4ff61259794cf719b
>> 18ed56cbc5012117e24a603e7c072cf65d36d469
>> 45546f17e5801791d4bc5968b91253a2f4b0db72
>>
>> real    0m0.004s
>> user    0m0.000s
>> sys     0m0.004s
>> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0/f0
>>
>> real    0m0.004s
>> user    0m0.000s
>> sys     0m0.000s
>> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0
>>
>> real    0m0.004s
>> user    0m0.000s
>> sys     0m0.000s
>> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/
>> 45546f17e5801791d4bc5968b91253a2f4b0db72
>>
>> real    0m0.005s
>> user    0m0.008s
>> sys     0m0.000s
>> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/
>> 45546f17e5801791d4bc5968b91253a2f4b0db72
>>
>> real    0m0.203s
>> user    0m0.112s
>> sys     0m0.088s
>> $ time git rev-list HEAD -- d0/d0/d0/d0/
>> 45546f17e5801791d4bc5968b91253a2f4b0db72
>>
>> real    0m1.305s
>> user    0m0.720s
>> sys     0m0.580s
>> $ time git rev-list HEAD -- d0/d0/d0/
>> 45546f17e5801791d4bc5968b91253a2f4b0db72
>>
>> real    0m12.135s
>> user    0m6.700s
>> sys     0m5.412s
>>
>> So `git rev-list` becomes exponentially more expensive when you run it
>> on a shorter directory path, though it is fast if you run it without a
>> path.
>
> That's because there are 10^7 files under d0/d0/d0, 10^6 under
> d0/d0/d0/d0/, 10^5 under d0/d0/d0/d0/d0/ etc.
>
> So really, this is all about things being slower when there's a crazy
> number of files. Picture me surprised.
>
> What makes it kind of special is that the repository contains a lot of
> paths/files, but very few objects, because it's duplicating everything.
>
> All the 10^10 blobs have the same content, all the 10^9 trees that point
> to them have the same content, all the 10^8 trees that point to those
> trees have the same content, etc.
>
> If git wasn't effectively deduplicating identical content, the repository
> would be multiple gigabytes large.

Yeah, but perhaps Git could be smarter when rev-listing too and avoid
processing files or directories it has already seen?


Re: git-clone causes out of memory

2017-10-13 Thread Mike Hommey
On Fri, Oct 13, 2017 at 12:26:46PM +0200, Christian Couder wrote:
> On Fri, Oct 13, 2017 at 12:06 PM, Mike Hommey  wrote:
> > On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote:
>> There's a gitbomb on github. It is undoubtedly creative and funny, but since
>> this is a bug in git, I thought it'd be nice to report. The command:
> >>
> >>   $ git clone https://github.com/x0rz/ShadowBrokersFiles
> >
> > What fills memory is actually the checkout part of the command. git
> > clone -n doesn't fail.
> >
> > Credit should go where it's due: https://kate.io/blog/git-bomb/
> > (with the bonus that it comes with explanations)
> 
> Yeah, there is a thread on Hacker News about this too:
> 
> https://news.ycombinator.com/item?id=15457076
> 
> The original repo on GitHub is:
> 
> https://github.com/Katee/git-bomb.git
> 
> After cloning it with -n, there is the following "funny" situation:
> 
> $ time git rev-list HEAD
> 7af99c9e7d4768fa681f4fe4ff61259794cf719b
> 18ed56cbc5012117e24a603e7c072cf65d36d469
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m0.004s
> user    0m0.000s
> sys     0m0.004s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0/f0
> 
> real    0m0.004s
> user    0m0.000s
> sys     0m0.000s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0
> 
> real    0m0.004s
> user    0m0.000s
> sys     0m0.000s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m0.005s
> user    0m0.008s
> sys     0m0.000s
> $ time git rev-list HEAD -- d0/d0/d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m0.203s
> user    0m0.112s
> sys     0m0.088s
> $ time git rev-list HEAD -- d0/d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m1.305s
> user    0m0.720s
> sys     0m0.580s
> $ time git rev-list HEAD -- d0/d0/d0/
> 45546f17e5801791d4bc5968b91253a2f4b0db72
> 
> real    0m12.135s
> user    0m6.700s
> sys     0m5.412s
> 
> So `git rev-list` becomes exponentially more expensive when you run it
> on a shorter directory path, though it is fast if you run it without a
> path.

That's because there are 10^7 files under d0/d0/d0, 10^6 under
d0/d0/d0/d0/, 10^5 under d0/d0/d0/d0/d0/ etc.

So really, this is all about things being slower when there's a crazy
number of files. Picture me surprised.

What makes it kind of special is that the repository contains a lot of
paths/files, but very few objects, because it's duplicating everything.

All the 10^10 blobs have the same content, all the 10^9 trees that point
to them have the same content, all the 10^8 trees that point to those
trees have the same content, etc.

If git wasn't effectively deduplicating identical content, the repository
would be multiple gigabytes large.

Mike


Re: git-clone causes out of memory

2017-10-13 Thread Christian Couder
On Fri, Oct 13, 2017 at 12:06 PM, Mike Hommey  wrote:
> On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote:
>> There's a gitbomb on github. It is undoubtedly creative and funny, but since
>> this is a bug in git, I thought it'd be nice to report. The command:
>>
>>   $ git clone https://github.com/x0rz/ShadowBrokersFiles
>
> What fills memory is actually the checkout part of the command. git
> clone -n doesn't fail.
>
> Credit should go where it's due: https://kate.io/blog/git-bomb/
> (with the bonus that it comes with explanations)

Yeah, there is a thread on Hacker News about this too:

https://news.ycombinator.com/item?id=15457076

The original repo on GitHub is:

https://github.com/Katee/git-bomb.git

After cloning it with -n, there is the following "funny" situation:

$ time git rev-list HEAD
7af99c9e7d4768fa681f4fe4ff61259794cf719b
18ed56cbc5012117e24a603e7c072cf65d36d469
45546f17e5801791d4bc5968b91253a2f4b0db72

real    0m0.004s
user    0m0.000s
sys     0m0.004s
$ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0/f0

real    0m0.004s
user    0m0.000s
sys     0m0.000s
$ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/d0/d0

real    0m0.004s
user    0m0.000s
sys     0m0.000s
$ time git rev-list HEAD -- d0/d0/d0/d0/d0/d0/d0/d0/
45546f17e5801791d4bc5968b91253a2f4b0db72

real    0m0.005s
user    0m0.008s
sys     0m0.000s
$ time git rev-list HEAD -- d0/d0/d0/d0/d0/
45546f17e5801791d4bc5968b91253a2f4b0db72

real    0m0.203s
user    0m0.112s
sys     0m0.088s
$ time git rev-list HEAD -- d0/d0/d0/d0/
45546f17e5801791d4bc5968b91253a2f4b0db72

real    0m1.305s
user    0m0.720s
sys     0m0.580s
$ time git rev-list HEAD -- d0/d0/d0/
45546f17e5801791d4bc5968b91253a2f4b0db72

real    0m12.135s
user    0m6.700s
sys     0m5.412s

So `git rev-list` becomes exponentially more expensive when you run it
on a shorter directory path, though it is fast if you run it without a
path.


Re: git-clone causes out of memory

2017-10-13 Thread Mike Hommey
On Fri, Oct 13, 2017 at 12:51:58PM +0300, Constantine wrote:
> There's a gitbomb on github. It is undoubtedly creative and funny, but since
> this is a bug in git, I thought it'd be nice to report. The command:
> 
>   $ git clone https://github.com/x0rz/ShadowBrokersFiles

What fills memory is actually the checkout part of the command. git
clone -n doesn't fail.

Credit should go where it's due: https://kate.io/blog/git-bomb/
(with the bonus that it comes with explanations)

Mike