Re: [PATCH] blame.c: don't drop origin blobs as eagerly

2016-05-28 Thread David Kastrup
Johannes Schindelin  writes:

> Hi David,
>
> On Sat, 28 May 2016, David Kastrup wrote:
>
>> > The short version of your answer is that you will leave this patch in
>> > its current form and address none of my concerns because you moved on,
>> > correct? If so, that's totally okay, it just needs to be spelled out.
>> 
>> Yes, that's it.  You'll notice that the code change itself is both
>> minuscule as well purely functional, so it contains nothing
>> copyrightable.
>
> That is unfortunately an interpretation of the law that would need to
> come from a professional lawyer.

A professional lawyer would laugh at "Signed-off-by:" being of more
legal significance than a written record of intent but of course you
know that.  This is mere bluster.

> As it is, the patch was clearly authored by you, and anybody else who
> would claim authorship would most likely be breaking the law.

The _diff_ is not "clearly authored" by me but just the simplest
expression of the intent.  The commit message is clearly authored by me
but is not acceptable anyway.  Whoever gets to write an acceptable
commit message is up for all copyrightable credit in my book.  Feel free
to keep the authorship if you really want to, but when replacing the
commit message it is not a particularly accurate attribution.

> So I won't touch it.

Signed-off-by: David Kastrup 

You don't get more than that for other patches either and it's a few
bytes compared to a mail conversation.  Here is a PGP signature on top.

As I said: I am not going to put more work into it anyway and if it's an
occasion for theatralics, it has at least accomplished something.

-- 
David Kastrup


signature.asc
Description: PGP signature


Re: [PATCH] blame.c: don't drop origin blobs as eagerly

2016-05-28 Thread Johannes Schindelin
Hi David,

On Sat, 28 May 2016, David Kastrup wrote:

> > The short version of your answer is that you will leave this patch in
> > its current form and address none of my concerns because you moved on,
> > correct? If so, that's totally okay, it just needs to be spelled out.
> 
> Yes, that's it.  You'll notice that the code change itself is both
> minuscule as well purely functional, so it contains nothing
> copyrightable.

That is unfortunately an interpretation of the law that would need to come
from a professional lawyer. As it is, the patch was clearly authored by
you, and anybody else who would claim authorship would most likely be
breaking the law.

So I won't touch it.

Sorry,
Johannes
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] blame.c: don't drop origin blobs as eagerly

2016-05-28 Thread David Kastrup
Johannes Schindelin  writes:

> On Fri, 27 May 2016, David Kastrup wrote:
>
>> Johannes Schindelin  writes:
>> 
>> > On Fri, 27 May 2016, David Kastrup wrote:
>> >
>> >> pressure particularly when the history contains lots of merges from
>> >> long-diverged branches.  In practice, this optimization appears to
>> >> behave quite benignly,
>> >
>> > Why not just stop here?
>> 
>> Because there is a caveat.
>> 
>> > I say that because...
>> >
>> >> and a viable strategy for limiting the total amount of cached blobs in a
>> >> useful manner seems rather hard to implement.
>> >
>> > ... this sounds awfully handwaving.
>> 
>> Because it is.
>
> Hrm. Are you really happy with this, on public record?

Shrug.  git-blame has limits for its memory usage which it generally
heeds, throwing away potentially useful data to do so.  git-blame -C has
a store of data retained outside of those limits because nobody bothered
reining them in, and this patch similarly retains data outside of those
limits, though strictly limited to the pending priority queue size which
controls the descent in reverse commit time order.

In actual measurements on actual heavy-duty repositories I could not
discern a difference in memory usage after the patch.  The difference is
likely only noticeable when you have lots of old long branches which is
not a typical development model.

The current data structures don't make harder constraints on memory
pressure feasible, and enforcing such constraints would incur a lot of
swapping (not by the operating system which does it more efficiently
particularly since it does not need to decompress every time, but by the
application constantly rereading and recompressing data).

git-blame -C cares jack-shit about that (and it's used by git-gui's
second pass if I remember correctly) and nobody raised concerns about
its memory usage ever.

I've taken out a lot of hand-waving of the "eh, good enough" kind from
git-blame.  If one wanted to have the memory limits enforced more
stringently than they currently are, one would need a whole new layer of
FIFO and accounting.  This is not done in the current code base for
existing functionality.

And nobody bothered implementing it in decades or complaining about it
in spite of it affecting git-blame -C (and consequently git-gui blame).

Yes, this patch is another hand waving beside an already waving crowd.
I am not going to lie about it but I am also not going to invest the
time to fix yet another part of git-blame that has in decades not shown
to be a problem anybody considered worth reporting let alone fixing.

The limits for git-blame are set quite conservative: on actual developer
systems, exceeding them by a factor of 10 will not exhaust the available
memory.  And I could not even measure a difference in memory usage in a
small sample set of large examples.

That's good enough for me (and better than the shape most of git-blame
was in before I started on it and also after I finished), but it's still
hand-waving.  And it's not like it doesn't have a lot of company.

>> > Since we already have reference counting, it sounds fishy to claim
>> > that simply piggybacking a global counter on top of it would be
>> > "rather hard".
>> 
>> You'll see that the patch is from 2014.
>
> No I do not. There was no indication of that.

I thought that git-send-email/git-am retained most patch metadata as
opposed to git-patch, but you are entirely correct that the author date
is nowhere to be found.  Sorry for the presumption.

>> When I actively worked on it, I found no convincing/feasible way to
>> enforce a reasonable hard limit.  I am not picking up work on this
>> again but am merely flushing my queue so that the patch going to
>> waste is not on my conscience.
>
> Hmm. Speaking from my point of view as a maintainer, this raises a
> yellow alert. Sure, I do not maintain git.git, but if I were, I would
> want my confidence in the quality of this patch, and in the patch
> author being around when things go south with it, strengthened. This
> paragraph achieves the opposite.

It's completely reasonable to apply the same standards here as with an
author having terminal cancer.  I am not going to be around when things
go South with git-blame but I should be very much surprised if this
patch were significantly involved.

>> > Besides, -C is *supposed* to look harder.
>> 
>> We are not talking about "looking harder" but "taking more memory
>> than the set limit".
>
> I meant to say: *of course* it uses more memory, it has a much harder
> job.

Still a non-sequitur.  If you apply memory limits harder, it will still
achieve the same job but finish later due to (decompressing) swapping.

>> > Also: is there an easy way to reproduce your claims of better I/O
>> > characteristics? Something like a command-line, ideally with a file
>> > in git.git's own history, that demonstrates the I/O before and
>> > after the patch, would be an 

Re: [PATCH] blame.c: don't drop origin blobs as eagerly

2016-05-28 Thread Johannes Schindelin
Hi David,

On Fri, 27 May 2016, David Kastrup wrote:

> Johannes Schindelin  writes:
> 
> > On Fri, 27 May 2016, David Kastrup wrote:
> >
> >> pressure particularly when the history contains lots of merges from
> >> long-diverged branches.  In practice, this optimization appears to
> >> behave quite benignly,
> >
> > Why not just stop here?
> 
> Because there is a caveat.
> 
> > I say that because...
> >
> >> and a viable strategy for limiting the total amount of cached blobs in a
> >> useful manner seems rather hard to implement.
> >
> > ... this sounds awfully handwaving.
> 
> Because it is.

Hrm. Are you really happy with this, on public record?

> > Since we already have reference counting, it sounds fishy to claim
> > that simply piggybacking a global counter on top of it would be
> > "rather hard".
> 
> You'll see that the patch is from 2014.

No I do not. There was no indication of that.

> When I actively worked on it, I found no convincing/feasible way to
> enforce a reasonable hard limit.  I am not picking up work on this again
> but am merely flushing my queue so that the patch going to waste is not
> on my conscience.

Hmm. Speaking from my point of view as a maintainer, this raises a yellow
alert. Sure, I do not maintain git.git, but if I were, I would want my
confidence in the quality of this patch, and in the patch author being
around when things go south with it, strengthened. This paragraph achieves
the opposite.

> >> In addition, calling git-blame with -C leads to similar memory retention
> >> patterns.
> >
> > This is a red herring. Just delete it. I, for one, being a heavy user of
> > `git blame`, could count the number of times I used blame's -C option
> > without any remaining hands. Zero times.
> >
> > Besides, -C is *supposed* to look harder.
> 
> We are not talking about "looking harder" but "taking more memory than
> the set limit".

I meant to say: *of course* it uses more memory, it has a much harder job.

> > Also: is there an easy way to reproduce your claims of better I/O
> > characteristics? Something like a command-line, ideally with a file in
> > git.git's own history, that demonstrates the I/O before and after the
> > patch, would be an excellent addition to the commit message.
> 
> I've used it on the wortliste repository and system time goes down from
> about 70 seconds to 50 seconds (this is a flash drive).  User time from
> about 4:20 to 4:00.  It is a rather degenerate repository (predominantly
> small changes in one humongous text file) so savings for more typical
> cases might end up less than that.  But then it is degenerate
> repositories that are most costly to blame.

Well, obviously I did not mean to measure time, but I/O. Something like an
strace call that pipes into some grep that in turn pipes into wc.

> > Further: I would have at least expected some rudimentary discussion
> > why this patch -- which seems to at least partially contradict 7c3c796
> > (blame: drop blob data after passing blame to the parent, 2007-12-11)
> > -- is not regressing on the intent of said commit.
> 
> It is regressing on the intent of said commit by _retaining_ blob data
> that it is _sure_ to look at again because of already having this data
> asked for again in the priority queue.  That's the point.  It only drops
> the blob data for which it has no request queued yet.  But there is no
> handle on when the request is going to pop up until it actually leaves
> the priority queue: the priority queue is a heap IIRC and thus only
> provides partial orderings.

Again, this lowers my confidence in the patch. Mind you, the patch could
be totally sound! Yet its commit message, and unfortunately even more so
the discussion we are having right now, offer no adequate reason to assume
that. If you, as the patch author, state that you are not sure what this
thing does exactly, we must conservatively assume that it contains flaws.

> >> diff --git a/builtin/blame.c b/builtin/blame.c
> >> index 21f42b0..2596fbc 100644
> >> --- a/builtin/blame.c
> >> +++ b/builtin/blame.c
> >> @@ -1556,7 +1556,8 @@ finish:
> >>}
> >>for (i = 0; i < num_sg; i++) {
> >>if (sg_origin[i]) {
> >> -  drop_origin_blob(sg_origin[i]);
> >> +  if (!sg_origin[i]->suspects)
> >> +  drop_origin_blob(sg_origin[i]);
> >>origin_decref(sg_origin[i]);
> >>}
> >
> > It would be good to mention in the commit message that this patch does not
> > change anything for blobs with only one remaining reference (the current
> > one) because origin_decref() would do the same job as drop_origin_blob
> > when decrementing the reference counter to 0.
> 
> A sizable number of references but not blobs are retained and needed for
> producing the information in the final printed output (at least when
> using the default sequential output instead of --incremental or
> --porcelaine or similar).

Sorry, please 

Re: [PATCH] blame.c: don't drop origin blobs as eagerly

2016-05-27 Thread David Kastrup
Johannes Schindelin  writes:

> On Fri, 27 May 2016, David Kastrup wrote:
>
>> pressure particularly when the history contains lots of merges from
>> long-diverged branches.  In practice, this optimization appears to
>> behave quite benignly,
>
> Why not just stop here?

Because there is a caveat.

> I say that because...
>
>> and a viable strategy for limiting the total amount of cached blobs in a
>> useful manner seems rather hard to implement.
>
> ... this sounds awfully handwaving.

Because it is.

> Since we already have reference counting, it sounds fishy to claim
> that simply piggybacking a global counter on top of it would be
> "rather hard".

You'll see that the patch is from 2014.  When I actively worked on it,
I found no convincing/feasible way to enforce a reasonable hard limit.
I am not picking up work on this again but am merely flushing my queue
so that the patch going to waste is not on my conscience.

>> In addition, calling git-blame with -C leads to similar memory retention
>> patterns.
>
> This is a red herring. Just delete it. I, for one, being a heavy user of
> `git blame`, could count the number of times I used blame's -C option
> without any remaining hands. Zero times.
>
> Besides, -C is *supposed* to look harder.

We are not talking about "looking harder" but "taking more memory than
the set limit".

> Also: is there an easy way to reproduce your claims of better I/O
> characteristics? Something like a command-line, ideally with a file in
> git.git's own history, that demonstrates the I/O before and after the
> patch, would be an excellent addition to the commit message.

I've used it on the wortliste repository and system time goes down from
about 70 seconds to 50 seconds (this is a flash drive).  User time from
about 4:20 to 4:00.  It is a rather degenerate repository (predominantly
small changes in one humongous text file) so savings for more typical
cases might end up less than that.  But then it is degenerate
repositories that are most costly to blame.

> Further: I would have at least expected some rudimentary discussion
> why this patch -- which seems to at least partially contradict 7c3c796
> (blame: drop blob data after passing blame to the parent, 2007-12-11)
> -- is not regressing on the intent of said commit.

It is regressing on the intent of said commit by _retaining_ blob data
that it is _sure_ to look at again because of already having this data
asked for again in the priority queue.  That's the point.  It only drops
the blob data for which it has no request queued yet.  But there is no
handle on when the request is going to pop up until it actually leaves
the priority queue: the priority queue is a heap IIRC and thus only
provides partial orderings.

>> diff --git a/builtin/blame.c b/builtin/blame.c
>> index 21f42b0..2596fbc 100644
>> --- a/builtin/blame.c
>> +++ b/builtin/blame.c
>> @@ -1556,7 +1556,8 @@ finish:
>>  }
>>  for (i = 0; i < num_sg; i++) {
>>  if (sg_origin[i]) {
>> -drop_origin_blob(sg_origin[i]);
>> +if (!sg_origin[i]->suspects)
>> +drop_origin_blob(sg_origin[i]);
>>  origin_decref(sg_origin[i]);
>>  }
>
> It would be good to mention in the commit message that this patch does not
> change anything for blobs with only one remaining reference (the current
> one) because origin_decref() would do the same job as drop_origin_blob
> when decrementing the reference counter to 0.

A sizable number of references but not blobs are retained and needed for
producing the information in the final printed output (at least when
using the default sequential output instead of --incremental or
--porcelaine or similar).

> In fact, I suspect that simply removing the drop_origin_blob() call
> might result in the exact same I/O pattern.

It's been years since I actually worked on the code but I'm still pretty
sure you are wrong.

-- 
David Kastrup
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] blame.c: don't drop origin blobs as eagerly

2016-05-27 Thread Johannes Schindelin
Hi David,

it is good practice to Cc: the original author of the code in question, in
this case Junio. I guess he sees this anyway, but that is really just an
assumption.

On Fri, 27 May 2016, David Kastrup wrote:

> When a parent blob already has chunks queued up for blaming, dropping
> the blob at the end of one blame step will cause it to get reloaded
> right away, doubling the amount of I/O and unpacking when processing a
> linear history.

It is obvious from your commit message that you have studied the code
quite deeply. To make it easier for the reader (which might be your future
self), it is advisable to give at least a little bit of introduction, e.g.
what the "parent blob" is.

I would *guess* that it is the blob corresponding to the same path in the
parent of the current revision, but that should be spelled out explicitly.

> Keeping such parent blobs in memory seems like a reasonable
> optimization.  It's conceivable that this may incur additional memory

This sentence would be easier to read if "It's conceivable that" was
simply deleted.

> pressure particularly when the history contains lots of merges from
> long-diverged branches.  In practice, this optimization appears to
> behave quite benignly,

Why not just stop here? I say that because...

> and a viable strategy for limiting the total amount of cached blobs in a
> useful manner seems rather hard to implement.

... this sounds awfully handwaving. Since we already have reference
counting, it sounds fishy to claim that simply piggybacking a global
counter on top of it would be "rather hard".

> In addition, calling git-blame with -C leads to similar memory retention
> patterns.

This is a red herring. Just delete it. I, for one, being a heavy user of
`git blame`, could count the number of times I used blame's -C option
without any remaining hands. Zero times.

Besides, -C is *supposed* to look harder. By that argument, you could read
all blobs in rev-list even when the user did not specify --objects
"because --objects leads to similar memory retention patterns". So: let's
just forget about that statement.

The commit message is missing your sign-off.

Also: is there an easy way to reproduce your claims of better I/O
characteristics? Something like a command-line, ideally with a file in
git.git's own history, that demonstrates the I/O before and after the
patch, would be an excellent addition to the commit message.

Further: I would have at least expected some rudimentary discussion why
this patch -- which seems to at least partially contradict 7c3c796 (blame:
drop blob data after passing blame to the parent, 2007-12-11) -- is not
regressing on the intent of said commit.

> diff --git a/builtin/blame.c b/builtin/blame.c
> index 21f42b0..2596fbc 100644
> --- a/builtin/blame.c
> +++ b/builtin/blame.c
> @@ -1556,7 +1556,8 @@ finish:
>   }
>   for (i = 0; i < num_sg; i++) {
>   if (sg_origin[i]) {
> - drop_origin_blob(sg_origin[i]);
> + if (!sg_origin[i]->suspects)
> + drop_origin_blob(sg_origin[i]);
>   origin_decref(sg_origin[i]);
>   }

It would be good to mention in the commit message that this patch does not
change anything for blobs with only one remaining reference (the current
one) because origin_decref() would do the same job as drop_origin_blob
when decrementing the reference counter to 0.

In fact, I suspect that simply removing the drop_origin_blob() call might
result in the exact same I/O pattern.

Ciao,
Johannes
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] blame.c: don't drop origin blobs as eagerly

2016-05-27 Thread David Kastrup
When a parent blob already has chunks queued up for blaming, dropping
the blob at the end of one blame step will cause it to get reloaded
right away, doubling the amount of I/O and unpacking when processing a
linear history.

Keeping such parent blobs in memory seems like a reasonable
optimization.  It's conceivable that this may incur additional memory
pressure particularly when the history contains lots of merges from
long-diverged branches.  In practice, this optimization appears to
behave quite benignly, and a viable strategy for limiting the total
amount of cached blobs in a useful manner seems rather hard to
implement.  In addition, calling git-blame with -C leads to similar
memory retention patterns.
---
 builtin/blame.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/builtin/blame.c b/builtin/blame.c
index 21f42b0..2596fbc 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -1556,7 +1556,8 @@ finish:
}
for (i = 0; i < num_sg; i++) {
if (sg_origin[i]) {
-   drop_origin_blob(sg_origin[i]);
+   if (!sg_origin[i]->suspects)
+   drop_origin_blob(sg_origin[i]);
origin_decref(sg_origin[i]);
}
}
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html