Re: Question about get_cached_commit_buffer()

2018-02-21 Thread Junio C Hamano
Jeff King  writes:

> Out of curiosity, do you actually use --show-all for anything?

Absolutely not.  I'd actually love it if I could say "not anymore"
instead, but I haven't had an opportunity to debug the revision
traversal code for quite some time so I do not even remember when
was the last time I used it, which disqualifies me from saying even
that.

> So what I'm wondering is whether we should consider just ripping it out
> (but I'm OK with keeping it, as once the commit-buffer stuff is fixed,
> it's probably not hurting anybody).

I see no problem in removing it.  With more "interesting" features
relying on post-processing (like 'simplify-merges'), show_all whose
primary focus was how limit_list() behaves soft of outlived its
usefulness, I would think.



Re: Question about get_cached_commit_buffer()

2018-02-21 Thread Jeff King
On Wed, Feb 21, 2018 at 02:22:05PM -0800, Junio C Hamano wrote:

> > I think that repeating the oid is intentional; the point is to dump how
> > the traversal code is hitting the endpoints, even if we do so multiple
> > times.
> >
> > The --oneline behavior just looks like a bug. I think --format is broken
> > with --show-all, too (it does not show anything!).
> 
> I do not know about the --format thing,[...]

Hmm, maybe it is fine. I thought before that I got funny output out of
"git log --show-all --format", but I can't seem to reproduce it now.

> being a bug is correct.  I've known about the oneline that does not
> show anything other than the oid (not even end-of-line) for unparsed
> commits for a long time---I just didn't bother looking into fixing
> it exactly because this is only a debugging aid ;-)

Out of curiosity, do you actually use --show-all for anything? I didn't
even know it existed until this thread, and AFAICT nobody but Linus has
ever recommended its use. And it does not even seem that useful as a
general debugging aid.

So what I'm wondering is whether we should consider just ripping it out
(but I'm OK with keeping it, as once the commit-buffer stuff is fixed,
it's probably not hurting anybody).

-Peff


Re: Question about get_cached_commit_buffer()

2018-02-21 Thread Junio C Hamano
Jeff King  writes:

> I think that repeating the oid is intentional; the point is to dump how
> the traversal code is hitting the endpoints, even if we do so multiple
> times.
>
> The --oneline behavior just looks like a bug. I think --format is broken
> with --show-all, too (it does not show anything!).

I do not know about the --format thing, but the part about --oneline
being a bug is correct.  I've known about the oneline that does not
show anything other than the oid (not even end-of-line) for unparsed
commits for a long time---I just didn't bother looking into fixing
it exactly because this is only a debugging aid ;-)

> Though I think it would be equally correct to have set_commit_buffer()
> just throw away the existing cache entry and replace it with this one. I
> don't think there's a real reason to prefer the old to the new. And that
> might be worth doing if it would let us drop get_cached_commit_buffer()
> as a public function. But...
> ...
> In my opinion it's not really worth trying to make it private. The
> confusion you're fixing in the first two calls is not due to a bad API,
> but due to some subtly confusing logic in that code's use of the API. ;)

Yup.

> So I'd probably do this:
> ...

Makes sense to me.


Re: Question about get_cached_commit_buffer()

2018-02-21 Thread Jeff King
On Wed, Feb 21, 2018 at 09:13:22AM -0500, Derrick Stolee wrote:

> > So there it is. It does show commits multiple times, but suppresses the
> > verbose header after the first showing. If we do something like this:
> > 
> >git rev-list --show-all --pretty --boundary c93150cfb0^-
> > 
> > you'll see some boundary commits that _don't_ have their pretty headers
> > shown. And with your proposed patch, we'd show them again. To keep the
> > same behavior we need to store that "we've already seen this" boolean
> > somewhere else (e.g., in an object flag; possibly SEEN, but that might
> > run afoul of other logic).
> 
> What confuses me about this behavior is that the OID is still shown on the
> repeat (and in the case of `git log --oneline` will not actually have a line
> break between two short-OIDs). I don't believe this behavior is something to
> preserve.

I think that repeating the oid is intentional; the point is to dump how
the traversal code is hitting the endpoints, even if we do so multiple
times.

The --oneline behavior just looks like a bug. I think --format is broken
with --show-all, too (it does not show anything!).

> Unless I am misunderstanding, the current behavior on a repeated commit is
> already incorrect: some amount of output occurs before checking the buffer,
> so the output includes repeated records but with formatting that violates
> the expectation. By doing the simple change of swapping
> get_cached_commit_buffer() with get_commit_buffer(), we correct that format
> violation but have duplicate copies.

Yeah, I'd agree with that assessment.

> The most-correct thing to do (in my opinion) is to put the requirement of
> "no repeats" into the revision walk logic and stop having the formatting
> methods expect them. Then, however we change this boolean setting of "we
> have seen this before" it will not require the formatting methods to change.

But then you wouldn't show repeats at all. If I'm understanding you
correctly.

TBH, I do not think it is worth spending a lot of effort on this
--show-all feature. It seems mostly like forgotten debugging cruft to
me. That's why I'd be OK with showing the whole header as the simplest
fix (i.e., just removing those calls entirely, not even converting them
to get_commit_buffer).

> I can start working on a patch to move the duplicate-removal logic into
> revision.c instead of these three callers:
> 
> builtin/rev-list.c: if (revs->verbose_header &&
> get_cached_commit_buffer(commit, NULL)) {
> log-tree.c: if (!get_cached_commit_buffer(commit, NULL))
> object.c:   if (!get_cached_commit_buffer(commit, NULL))

Those first two are duplicate detection. The third one in object.c
should stay, though. We've been fed a commit buffer to parse, and we
want to know whether we should attach it as the cached buffer for that
commit. But if we already have a cached buffer, there's no point in
doing so. And that's what we're checking there.

Though I think it would be equally correct to have set_commit_buffer()
just throw away the existing cache entry and replace it with this one. I
don't think there's a real reason to prefer the old to the new. And that
might be worth doing if it would let us drop get_cached_commit_buffer()
as a public function. But...

> But this caller seems pretty important in pretty.c:
> 
>     /*
>  * Otherwise, we still want to munge the encoding header in the
>  * result, which will be done by modifying the buffer. If we
>  * are using a fresh copy, we can reuse it. But if we are using
>  * the cached copy from get_commit_buffer, we need to duplicate it
>  * to avoid munging the cached copy.
>  */
>     if (msg == get_cached_commit_buffer(commit, NULL))
>     out = xstrdup(msg);
>     else
>     out = (char *)msg

Like the one in object.c, this really does want to know about the cached
entry. And it should be unaffected by your patch, since we will have
called get_commit_buffer() at the top of that function.

If we wanted to write this one without get_cached_commit_buffer(), we'd
really need a function to ask "did this pointer come from the cache, or
was it freshly allocated?". That's the same thing we do for
unuse_commit_buffer(). So in theory we could have a boolean function
that would check that, and that would let us make
get_cached_commit_buffer() private.

In my opinion it's not really worth trying to make it private. The
confusion you're fixing in the first two calls is not due to a bad API,
but due to some subtly confusing logic in that code's use of the API. ;)

So I'd probably do this:

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index d94062bc84..3af56921c8 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -150,7 +150,7 @@ static void show_commit(struct commit *commit, void *data)
else
putchar('\n');
 
-   if (revs->verbose_header && get_cached_commit_buffer(commit, 

Re: Question about get_cached_commit_buffer()

2018-02-21 Thread Jeff King
On Wed, Feb 21, 2018 at 01:48:11PM -0500, Jeff King wrote:

> > What confuses me about this behavior is that the OID is still shown on the
> > repeat (and in the case of `git log --oneline` will not actually have a line
> > break between two short-OIDs). I don't believe this behavior is something to
> > preserve.
> 
> I think that repeating the oid is intentional; the point is to dump how
> the traversal code is hitting the endpoints, even if we do so multiple
> times.
> 
> The --oneline behavior just looks like a bug. I think --format is broken
> with --show-all, too (it does not show anything!).

I poked at one of the examples a little more closely. I actually think
these are not repeats, but simply UNINTERESTING parents that we never
needed to look at in our traversal (because we hit a point where
everything was UNINTERESTING).

So we are relying not on finish_commit() to have freed the buffer, but
on the traversal code to have never parsed those commits in the first
place. Which is doubly subtle.

I think the rest of my email stands, though: we should just show the
full headers for those commits.

-Peff


Re: Question about get_cached_commit_buffer()

2018-02-21 Thread Derrick Stolee

On 2/20/2018 5:57 PM, Jeff King wrote:

On Tue, Feb 20, 2018 at 05:12:50PM -0500, Derrick Stolee wrote:


In rev-list, the "--header" option outputs a value and expects the buffer to
be cached. It outputs the header info only if get_cached_commit_buffer()
returns a non-null buffer, giving incorrect output. If it called
get_commit_buffer() instead, it would immediately call
get_cached_commit_buffer() and on failure actually load the buffer.

This has not been a problem before, since the buffer was always loaded at
some point for each commit (and saved because of the save_commit_buffer
global).

I propose to make get_cached_commit_buffer() static to commit.c and convert
all callers to get_commit_buffer(). Is there any reason to _not_ do this? It
seems that there is no functional or performance change.

That helper was added in 152ff1cceb (provide helpers to access the
commit buffer, 2014-06-10). I think interesting part is the final
paragraph:

   Note that we also need to add a "get_cached" variant which
   returns NULL when we do not have a cached buffer. At first
   glance this seems to defeat the purpose of "get", which is
   to always provide a return value. However, some log code
   paths actually use the NULL-ness of commit->buffer as a
   boolean flag to decide whether to try printing the
   commit. At least for now, we want to continue supporting
   that use.

So I think a conversion to get_commit_buffer() would break the callers
that use the boolean nature for something useful. Unfortunately the
author did not enumerate exactly what those uses are, so we'll have to
dig. :)

My guess is that it has to do with showing some commits twice, since we
would normally have the buffer available the first time we hit the
commit, and then free it in finish_commit().

If we blame that rev-list line (and then walk back over a bunch of
uninteresting commits via parent re-blaming), it comes from 3131b71301
(Add "--show-all" revision walker flag for debugging, 2008-02-09).


Thanks for doing this digging. I appreciate the breadcrumbs, too, so I 
can do a better job of digging next time.



So there it is. It does show commits multiple times, but suppresses the
verbose header after the first showing. If we do something like this:

   git rev-list --show-all --pretty --boundary c93150cfb0^-

you'll see some boundary commits that _don't_ have their pretty headers
shown. And with your proposed patch, we'd show them again. To keep the
same behavior we need to store that "we've already seen this" boolean
somewhere else (e.g., in an object flag; possibly SEEN, but that might
run afoul of other logic).


What confuses me about this behavior is that the OID is still shown on 
the repeat (and in the case of `git log --oneline` will not actually 
have a line break between two short-OIDs). I don't believe this behavior 
is something to preserve.


You are right that we definitely don't want to show the full content twice.


It looks like the call in log-tree comes from the same commit, and
serves the same purpose.

Aside from storing the boolean "did we show it" in another way, the
other option is to simply change the behavior and accept that we might
pretty-print the commit twice. This is a backwards-incompatible change,
but I'm not sure if anybody would care. According to that commit,
--show-all was added explicitly for debugging, and it has never been
documented.  I couldn't find any reference to people actually using it
on the list (a grep of the whole archive turns up 32 messages, most of
which are just it appearing in context; the only person mentioning its
actual use was Linus in 2008.


Unless I am misunderstanding, the current behavior on a repeated commit 
is already incorrect: some amount of output occurs before checking the 
buffer, so the output includes repeated records but with formatting that 
violates the expectation. By doing the simple change of swapping 
get_cached_commit_buffer() with get_commit_buffer(), we correct that 
format violation but have duplicate copies.


The most-correct thing to do (in my opinion) is to put the requirement 
of "no repeats" into the revision walk logic and stop having the 
formatting methods expect them. Then, however we change this boolean 
setting of "we have seen this before" it will not require the formatting 
methods to change.


I can start working on a patch to move the duplicate-removal logic into 
revision.c instead of these three callers:


builtin/rev-list.c: if (revs->verbose_header && 
get_cached_commit_buffer(commit, NULL)) {

log-tree.c: if (!get_cached_commit_buffer(commit, NULL))
object.c:   if (!get_cached_commit_buffer(commit, 
NULL)) {


But this caller seems pretty important in pretty.c:

    /*
 * Otherwise, we still want to munge the encoding header in the
 * result, which will be done by modifying the buffer. If we
 * are using a fresh copy, we can reuse it. But if 

Re: Question about get_cached_commit_buffer()

2018-02-20 Thread Jeff King
On Tue, Feb 20, 2018 at 05:12:50PM -0500, Derrick Stolee wrote:

> In rev-list, the "--header" option outputs a value and expects the buffer to
> be cached. It outputs the header info only if get_cached_commit_buffer()
> returns a non-null buffer, giving incorrect output. If it called
> get_commit_buffer() instead, it would immediately call
> get_cached_commit_buffer() and on failure actually load the buffer.
> 
> This has not been a problem before, since the buffer was always loaded at
> some point for each commit (and saved because of the save_commit_buffer
> global).
> 
> I propose to make get_cached_commit_buffer() static to commit.c and convert
> all callers to get_commit_buffer(). Is there any reason to _not_ do this? It
> seems that there is no functional or performance change.

That helper was added in 152ff1cceb (provide helpers to access the
commit buffer, 2014-06-10). I think interesting part is the final
paragraph:

  Note that we also need to add a "get_cached" variant which
  returns NULL when we do not have a cached buffer. At first
  glance this seems to defeat the purpose of "get", which is
  to always provide a return value. However, some log code
  paths actually use the NULL-ness of commit->buffer as a
  boolean flag to decide whether to try printing the
  commit. At least for now, we want to continue supporting
  that use.

So I think a conversion to get_commit_buffer() would break the callers
that use the boolean nature for something useful. Unfortunately the
author did not enumerate exactly what those uses are, so we'll have to
dig. :)

My guess is that it has to do with showing some commits twice, since we
would normally have the buffer available the first time we hit the
commit, and then free it in finish_commit().

If we blame that rev-list line (and then walk back over a bunch of
uninteresting commits via parent re-blaming), it comes from 3131b71301
(Add "--show-all" revision walker flag for debugging, 2008-02-09).

So there it is. It does show commits multiple times, but suppresses the
verbose header after the first showing. If we do something like this:

  git rev-list --show-all --pretty --boundary c93150cfb0^-

you'll see some boundary commits that _don't_ have their pretty headers
shown. And with your proposed patch, we'd show them again. To keep the
same behavior we need to store that "we've already seen this" boolean
somewhere else (e.g., in an object flag; possibly SEEN, but that might
run afoul of other logic).

It looks like the call in log-tree comes from the same commit, and
serves the same purpose.

Aside from storing the boolean "did we show it" in another way, the
other option is to simply change the behavior and accept that we might
pretty-print the commit twice. This is a backwards-incompatible change,
but I'm not sure if anybody would care. According to that commit,
--show-all was added explicitly for debugging, and it has never been
documented.  I couldn't find any reference to people actually using it
on the list (a grep of the whole archive turns up 32 messages, most of
which are just it appearing in context; the only person mentioning its
actual use was Linus in 2008.

-Peff