Re: [PATCH] diff.c: increment buffer pointer in all code path

2017-10-19 Thread Jeff King
On Thu, Oct 19, 2017 at 12:53:03PM -0700, Stefan Beller wrote:

> > @@ -771,7 +771,7 @@ static unsigned get_string_hash(struct 
> > emitted_diff_symbol *es, struct diff_opti
> >  {
> > if (o->xdl_opts & XDF_WHITESPACE_FLAGS) {
> > static struct strbuf sb = STRBUF_INIT;
> > -   const char *ap = es->line, *ae = es->line + es->len;
> > +   const char *ap = es->line, *ae = es->line + es->len - 1;
> > int c;
> >
> > strbuf_reset();
> >
> > it does. It just adjusts our "end pointer" to point to the last valid
> > character in the string (rather than one past),
> 
> Thanks for spotting. I can send a proper patch with tests if you'd like.

I'm putting the finishing touches on a series that should be out in a
few minutes. Hold off until then.

-Peff


Re: [PATCH] diff.c: increment buffer pointer in all code path

2017-10-19 Thread Stefan Beller
On Wed, Oct 18, 2017 at 10:42 PM, Jeff King  wrote:

>
> So I think the right fix is this:
>
[...]
>
> It's late here, so I'll wait for comments from Stefan and then try to
> wrap it up with a commit message and test tomorrow.
>
> -Peff

I agree that this is better and looks correct.
Thanks for offering to send a patch. I'd be happy to review it.

Thanks,
Stefan


Re: [PATCH] diff.c: increment buffer pointer in all code path

2017-10-19 Thread Stefan Beller
On Wed, Oct 18, 2017 at 10:24 PM, Jeff King  wrote:
> On Thu, Oct 19, 2017 at 01:04:59AM -0400, Jeff King wrote:
>
>> So. That leaves me with:
>>
>>   - I'm unclear on whether next_byte() is meant to return that trailing
>> NUL or not. I don't think it causes any bugs, but it certainly
>> confused me for a function to take a cp/endp pair of pointers, and
>> then dereference endp. It might be worth either fixing or clarifying
>> with a comment.
>>
>>   - Those loops to eat trailing whitespace are doing nothing. I'm not
>> sure if that all works out because next_byte() eats whitespaces or
>> not (I think not, because it doesn't eat whitespace for the
>> IGNORE_WHITESPACE_AT_EOL case). But I'm not quite sure what a test
>> would look like.
>
> I had trouble constructing a test at first, but I think my test lines
> just weren't long enough to trigger the movement heuristics. If I switch
> to something besides seq, I can do:
>
>   # any input that has reasonably sized lines
>   look e | head -50 >file
>   git add file
>
>   perl -i -ne '
> # pick up lines 20-25 to move to line 40, and
> # add some trailing whitespace to them
> if ($. >= 20 && $. <= 25) {
>   s/$/ /;
>   $hold .= $_;
> } else {
>   print $hold if ($. == 40);
>   print;
> }
>   ' file
>
>   git diff --color-moved --ignore-space-at-eol
>
> I think that _should_ show the block as moved, but it doesn't. But if I
> apply this patch:
>
> diff --git a/diff.c b/diff.c
> index 93dccd1817..375d9cf447 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -743,8 +743,8 @@ static int moved_entry_cmp(const struct diff_options 
> *diffopt,
>const struct moved_entry *b,
>const void *keydata)
>  {
> -   const char *ap = a->es->line, *ae = a->es->line + a->es->len;
> -   const char *bp = b->es->line, *be = b->es->line + b->es->len;
> +   const char *ap = a->es->line, *ae = a->es->line + a->es->len - 1;
> +   const char *bp = b->es->line, *be = b->es->line + b->es->len - 1;
>
> if (!(diffopt->xdl_opts & XDF_WHITESPACE_FLAGS))
> return a->es->len != b->es->len  || memcmp(ap, bp, 
> a->es->len);
> @@ -771,7 +771,7 @@ static unsigned get_string_hash(struct 
> emitted_diff_symbol *es, struct diff_opti
>  {
> if (o->xdl_opts & XDF_WHITESPACE_FLAGS) {
> static struct strbuf sb = STRBUF_INIT;
> -   const char *ap = es->line, *ae = es->line + es->len;
> +   const char *ap = es->line, *ae = es->line + es->len - 1;
> int c;
>
> strbuf_reset();
>
> it does. It just adjusts our "end pointer" to point to the last valid
> character in the string (rather than one past),

Thanks for spotting. I can send a proper patch with tests if you'd like.


> which seems to be the
> convention that those loops (and next_byte) expect.

I'll look at that again.

Thanks for poking!
Stefan

>
> -Peff


Re: [PATCH] diff.c: increment buffer pointer in all code path

2017-10-18 Thread Jeff King
On Thu, Oct 19, 2017 at 02:30:08PM +0900, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > it does. It just adjusts our "end pointer" to point to the last valid
> > character in the string (rather than one past), which seems to be the
> > convention that those loops (and next_byte) expect.
> 
> Yeah I am not sure if I like this comparison at the beginning of the
> function:
> 
> static int next_byte(const char **cp, const char **endp,
>  const struct diff_options *diffopt)
> {
> int retval;
> 
> if (*cp > *endp)
> return -1;
> 
> but it says endp _is_ part of valid input, contrary to my intuition.

Actually, I think even this function is confused about its convention.
In the line you quote, we clearly treat *endp as part of the input. But
later we do:

  while (*cp < *endp && isspace(**cp))
  (*cp)++;

meaning that we'd fail to soak up whitespace at *endp. That wouldn't be
so bad if not for the other bug which fails to eat whitespace at endp in
the first place. :)

So I think the right fix is this:

diff --git a/diff.c b/diff.c
index 6fd288420b..09081a207c 100644
--- a/diff.c
+++ b/diff.c
@@ -712,7 +712,7 @@ static int next_byte(const char **cp, const char **endp,
 {
int retval;
 
-   if (*cp > *endp)
+   if (*cp >= *endp)
return -1;
 
if (isspace(**cp)) {
@@ -729,7 +729,12 @@ static int next_byte(const char **cp, const char **endp,
if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE)) {
while (*cp < *endp && isspace(**cp))
(*cp)++;
-   /* return the first non-ws character via the usual 
below */
+   /*
+* return the first non-ws character via the usual
+* below, unless we ate all of the bytes
+*/
+   if (*cp >= *endp)
+   return -1;
}
}
 
@@ -750,9 +755,9 @@ static int moved_entry_cmp(const struct diff_options 
*diffopt,
return a->es->len != b->es->len  || memcmp(ap, bp, a->es->len);
 
if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE_AT_EOL)) {
-   while (ae > ap && isspace(*ae))
+   while (ae > ap && isspace(ae[-1]))
ae--;
-   while (be > bp && isspace(*be))
+   while (be > bp && isspace(be[-1]))
be--;
}
 
@@ -775,7 +780,7 @@ static unsigned get_string_hash(struct emitted_diff_symbol 
*es, struct diff_opti
int c;
 
strbuf_reset();
-   while (ae > ap && isspace(*ae))
+   while (ae > ap && isspace(ae[-1]))
ae--;
while ((c = next_byte(, , o)) > 0)
strbuf_addch(, c);

It's late here, so I'll wait for comments from Stefan and then try to
wrap it up with a commit message and test tomorrow.

-Peff


Re: [PATCH] diff.c: increment buffer pointer in all code path

2017-10-18 Thread Jeff King
On Thu, Oct 19, 2017 at 02:32:04PM +0900, Junio C Hamano wrote:

> > Yeah I am not sure if I like this comparison at the beginning of the
> > function:
> >
> > static int next_byte(const char **cp, const char **endp,
> >  const struct diff_options *diffopt)
> > {
> > int retval;
> >
> > if (*cp > *endp)
> > return -1;
> >
> > but it says endp _is_ part of valid input, contrary to my intuition.
> >
> > And your change to the initialization of ae/be in moved_entry_cmp()
> > makes it consistent with it, I think.
> >
> > But doesn't it mean ae computation in get_string_hash() also needs a
> > massaging?
> 
> Ah, forget the last two lines.  You do do the massaging in your
> patch.

I was just replying so. :)

> My preference actually is to fix next_byte to follow the usual "when
> we end, it points one past the valid region", though.

Yeah, I think that is my preference, too.

-Peff


Re: [PATCH] diff.c: increment buffer pointer in all code path

2017-10-18 Thread Junio C Hamano
Junio C Hamano  writes:

> Jeff King  writes:
>
>> it does. It just adjusts our "end pointer" to point to the last valid
>> character in the string (rather than one past), which seems to be the
>> convention that those loops (and next_byte) expect.
>
> Yeah I am not sure if I like this comparison at the beginning of the
> function:
>
> static int next_byte(const char **cp, const char **endp,
>  const struct diff_options *diffopt)
> {
> int retval;
>
> if (*cp > *endp)
> return -1;
>
> but it says endp _is_ part of valid input, contrary to my intuition.
>
> And your change to the initialization of ae/be in moved_entry_cmp()
> makes it consistent with it, I think.
>
> But doesn't it mean ae computation in get_string_hash() also needs a
> massaging?

Ah, forget the last two lines.  You do do the massaging in your
patch.

My preference actually is to fix next_byte to follow the usual "when
we end, it points one past the valid region", though.

Thanks for digging it through.





Re: [PATCH] diff.c: increment buffer pointer in all code path

2017-10-18 Thread Junio C Hamano
Jeff King  writes:

> it does. It just adjusts our "end pointer" to point to the last valid
> character in the string (rather than one past), which seems to be the
> convention that those loops (and next_byte) expect.

Yeah I am not sure if I like this comparison at the beginning of the
function:

static int next_byte(const char **cp, const char **endp,
 const struct diff_options *diffopt)
{
int retval;

if (*cp > *endp)
return -1;

but it says endp _is_ part of valid input, contrary to my intuition.

And your change to the initialization of ae/be in moved_entry_cmp()
makes it consistent with it, I think.

But doesn't it mean ae computation in get_string_hash() also needs a
massaging?


Re: [PATCH] diff.c: increment buffer pointer in all code path

2017-10-18 Thread Jeff King
On Thu, Oct 19, 2017 at 01:04:59AM -0400, Jeff King wrote:

> So. That leaves me with:
> 
>   - I'm unclear on whether next_byte() is meant to return that trailing
> NUL or not. I don't think it causes any bugs, but it certainly
> confused me for a function to take a cp/endp pair of pointers, and
> then dereference endp. It might be worth either fixing or clarifying
> with a comment.
> 
>   - Those loops to eat trailing whitespace are doing nothing. I'm not
> sure if that all works out because next_byte() eats whitespaces or
> not (I think not, because it doesn't eat whitespace for the
> IGNORE_WHITESPACE_AT_EOL case). But I'm not quite sure what a test
> would look like.

I had trouble constructing a test at first, but I think my test lines
just weren't long enough to trigger the movement heuristics. If I switch
to something besides seq, I can do:

  # any input that has reasonably sized lines
  look e | head -50 >file
  git add file
  
  perl -i -ne '
# pick up lines 20-25 to move to line 40, and
# add some trailing whitespace to them
if ($. >= 20 && $. <= 25) {
  s/$/ /;
  $hold .= $_;
} else {
  print $hold if ($. == 40);
  print;
}
  ' file

  git diff --color-moved --ignore-space-at-eol

I think that _should_ show the block as moved, but it doesn't. But if I
apply this patch:

diff --git a/diff.c b/diff.c
index 93dccd1817..375d9cf447 100644
--- a/diff.c
+++ b/diff.c
@@ -743,8 +743,8 @@ static int moved_entry_cmp(const struct diff_options 
*diffopt,
   const struct moved_entry *b,
   const void *keydata)
 {
-   const char *ap = a->es->line, *ae = a->es->line + a->es->len;
-   const char *bp = b->es->line, *be = b->es->line + b->es->len;
+   const char *ap = a->es->line, *ae = a->es->line + a->es->len - 1;
+   const char *bp = b->es->line, *be = b->es->line + b->es->len - 1;
 
if (!(diffopt->xdl_opts & XDF_WHITESPACE_FLAGS))
return a->es->len != b->es->len  || memcmp(ap, bp, a->es->len);
@@ -771,7 +771,7 @@ static unsigned get_string_hash(struct emitted_diff_symbol 
*es, struct diff_opti
 {
if (o->xdl_opts & XDF_WHITESPACE_FLAGS) {
static struct strbuf sb = STRBUF_INIT;
-   const char *ap = es->line, *ae = es->line + es->len;
+   const char *ap = es->line, *ae = es->line + es->len - 1;
int c;
 
strbuf_reset();

it does. It just adjusts our "end pointer" to point to the last valid
character in the string (rather than one past), which seems to be the
convention that those loops (and next_byte) expect.

-Peff


Re: [PATCH] diff.c: increment buffer pointer in all code path

2017-10-18 Thread Jeff King
On Thu, Oct 12, 2017 at 08:20:57PM -0400, Jeff King wrote:

> On Thu, Oct 12, 2017 at 08:18:37PM -0400, Jeff King wrote:
> 
> > Fix this by entering the conditional only when we actually
> > see whitespace. We can apply this also to the
> > IGNORE_WHITESPACE change. That code path isn't buggy
> > (because it falls through to returning the next
> > non-whitespace byte), but it makes the logic more clear if
> > we only bother to look at whitespace flags after seeing that
> > the next byte is whitespace.
> 
> I think there actually _is_ a bug in that code path, but it's unrelated
> to this one. If you have whitespace at the end of the buffer, then we'd
> advance *cp until it matches *endp, and then return whatever is at *endp
> (which is nonsense, or probably a NUL) rather than returning "-1".
> 
> I'm out of time for tonight and not familiar enough with the color-moved
> code to come up with a reasonable test case quickly, but maybe you can
> see if that can trigger bad behavior?

I found a few moments to follow up on this. I'm not sure if it's a bug
or intentional behavior. It's definitely easy to trigger next_byte()
reading *endp (even without ignoring whitespace, we always dereference
it).

This function is always operating on the buffer from an
emitted_diff_symbol, which is NUL-terminated. So *endp is always NUL. So
we'll return "0" at the end of the string.

There are two callers. The first is get_string_hash(), which does:

  while ((c = next_byte(, , o)) > 0)

so we'll break out on the NUL byte. But it also sometimes shrinks the
end-pointer "ae" to skip trailing whitespace.  Or at least it tries to.
It does:

   const char *ap = es->line, *ae = es->line + es->len;
   ...
   while (ae > ap && isspace(*ae))
   ae--;

but AFAICT that loop will never trigger, since *ae will always point to
NUL to begin with. But it points to the expectation that our end-pointer
does not follow the usual one-past-the-end convention, but rather is
meant to point to the actual last character in the string.

The same problem is repeated in the other caller, moved_entry_cmp(),
which tries to eat trailing whitespace for IGNORE_WHITESPACE_AT_EOL. But
it uses the same loop that starts on NUL and will never trigger.

It then goes on to call next_byte(). The extra NUL there should not
cause any problem, because we are just checking for equality between the
two strings (so if they both return NUL at the same time, good; if not,
then we know they are different).

So. That leaves me with:

  - I'm unclear on whether next_byte() is meant to return that trailing
NUL or not. I don't think it causes any bugs, but it certainly
confused me for a function to take a cp/endp pair of pointers, and
then dereference endp. It might be worth either fixing or clarifying
with a comment.

  - Those loops to eat trailing whitespace are doing nothing. I'm not
sure if that all works out because next_byte() eats whitespaces or
not (I think not, because it doesn't eat whitespace for the
IGNORE_WHITESPACE_AT_EOL case). But I'm not quite sure what a test
would look like.

-Peff


Re: [PATCH] diff.c: increment buffer pointer in all code path

2017-10-12 Thread Stefan Beller
On Thu, Oct 12, 2017 at 5:20 PM, Jeff King  wrote:
> On Thu, Oct 12, 2017 at 08:18:37PM -0400, Jeff King wrote:
>
>> Fix this by entering the conditional only when we actually
>> see whitespace. We can apply this also to the
>> IGNORE_WHITESPACE change. That code path isn't buggy
>> (because it falls through to returning the next
>> non-whitespace byte), but it makes the logic more clear if
>> we only bother to look at whitespace flags after seeing that
>> the next byte is whitespace.
>
> I think there actually _is_ a bug in that code path, but it's unrelated
> to this one. If you have whitespace at the end of the buffer, then we'd
> advance *cp until it matches *endp, and then return whatever is at *endp
> (which is nonsense, or probably a NUL) rather than returning "-1".

Good catch! This plays together interestingly with
IGNORE_WHITESPACE_AT_EOL, too. If that is set the endp is just put
further to the front, so we'd actually compare white spaces after endp.

If that flag is not set, we'd get NUL or nonsense.

> I'm out of time for tonight and not familiar enough with the color-moved
> code to come up with a reasonable test case quickly, but maybe you can
> see if that can trigger bad behavior?
>
> -Peff


Re: [PATCH] diff.c: increment buffer pointer in all code path

2017-10-12 Thread Jeff King
On Thu, Oct 12, 2017 at 08:18:37PM -0400, Jeff King wrote:

> Fix this by entering the conditional only when we actually
> see whitespace. We can apply this also to the
> IGNORE_WHITESPACE change. That code path isn't buggy
> (because it falls through to returning the next
> non-whitespace byte), but it makes the logic more clear if
> we only bother to look at whitespace flags after seeing that
> the next byte is whitespace.

I think there actually _is_ a bug in that code path, but it's unrelated
to this one. If you have whitespace at the end of the buffer, then we'd
advance *cp until it matches *endp, and then return whatever is at *endp
(which is nonsense, or probably a NUL) rather than returning "-1".

I'm out of time for tonight and not familiar enough with the color-moved
code to come up with a reasonable test case quickly, but maybe you can
see if that can trigger bad behavior?

-Peff


Re: [PATCH] diff.c: increment buffer pointer in all code path

2017-10-12 Thread Jeff King
On Thu, Oct 12, 2017 at 04:33:22PM -0700, Stefan Beller wrote:

> Peff, feel free to take ownership here. I merely made it to a patch.

I think compared to my original, it makes sense to actually wrap the
whole thing with a check for the whitespace. You can do it just in the
IGNORE_WHITESPACE_CHANGE conditional, but I think it makes more sense to
cover both. Like the patch below (view with "-w" to see the logic more
clearly).

I also tweaked the test to remove "sed -i", which isn't portable, and
fixed up a few style nits.

-- >8 --
Subject: diff: fix infinite loop with --color-moved --ignore-space-change

The --color-moved code uses next_byte() to advance through
the blob contents. When the user has asked to ignore
whitespace changes, we try to collapse any whitespace change
down to a single space.

However, we enter the conditional block whenever we see the
IGNORE_WHITESPACE_CHANGE flag, even if the next byte isn't
whitespace.

This means that the combination of "--color-moved and
--ignore-space-change" was completely broken. Worse, because
we return from next_byte() without having advanced our
pointer, the function makes no forward progress in the
buffer and loops infinitely.

Fix this by entering the conditional only when we actually
see whitespace. We can apply this also to the
IGNORE_WHITESPACE change. That code path isn't buggy
(because it falls through to returning the next
non-whitespace byte), but it makes the logic more clear if
we only bother to look at whitespace flags after seeing that
the next byte is whitespace.

Reported-by: Orgad Shaneh 
Signed-off-by: Jeff King 
---
 diff.c | 28 +++-
 t/t4015-diff-whitespace.sh |  9 +
 2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/diff.c b/diff.c
index 69f03570ad..d76bb937c1 100644
--- a/diff.c
+++ b/diff.c
@@ -712,20 +712,22 @@ static int next_byte(const char **cp, const char **endp,
if (*cp > *endp)
return -1;
 
-   if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE_CHANGE)) {
-   while (*cp < *endp && isspace(**cp))
-   (*cp)++;
-   /*
-* After skipping a couple of whitespaces, we still have to
-* account for one space.
-*/
-   return (int)' ';
-   }
+   if (isspace(**cp)) {
+   if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE_CHANGE)) {
+   while (*cp < *endp && isspace(**cp))
+   (*cp)++;
+   /*
+* After skipping a couple of whitespaces,
+* we still have to account for one space.
+*/
+   return (int)' ';
+   }
 
-   if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE)) {
-   while (*cp < *endp && isspace(**cp))
-   (*cp)++;
-   /* return the first non-ws character via the usual below */
+   if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE)) {
+   while (*cp < *endp && isspace(**cp))
+   (*cp)++;
+   /* return the first non-ws character via the usual 
below */
+   }
}
 
retval = (unsigned char)(**cp);
diff --git a/t/t4015-diff-whitespace.sh b/t/t4015-diff-whitespace.sh
index bd0f75d9f7..87083f728f 100755
--- a/t/t4015-diff-whitespace.sh
+++ b/t/t4015-diff-whitespace.sh
@@ -1530,4 +1530,13 @@ test_expect_success 'move detection with submodules' '
test_cmp expect decoded_actual
 '
 
+test_expect_success 'move detection with whitespace changes' '
+   test_when_finished "git reset --hard" &&
+   test_seq 10 >test &&
+   git add test &&
+   sed s/3/42/ test.tmp &&
+   mv test.tmp test &&
+   git -c diff.colormoved diff --ignore-space-change -- test
+'
+
 test_done
-- 
2.15.0.rc1.381.g94fac273d4