Re: [PATCH] diff.c: increment buffer pointer in all code path
On Thu, Oct 19, 2017 at 12:53:03PM -0700, Stefan Beller wrote: > > @@ -771,7 +771,7 @@ static unsigned get_string_hash(struct > > emitted_diff_symbol *es, struct diff_opti > > { > > if (o->xdl_opts & XDF_WHITESPACE_FLAGS) { > > static struct strbuf sb = STRBUF_INIT; > > - const char *ap = es->line, *ae = es->line + es->len; > > + const char *ap = es->line, *ae = es->line + es->len - 1; > > int c; > > > > strbuf_reset(); > > > > it does. It just adjusts our "end pointer" to point to the last valid > > character in the string (rather than one past), > > Thanks for spotting. I can send a proper patch with tests if you'd like. I'm putting the finishing touches on a series that should be out in a few minutes. Hold off until then. -Peff
Re: [PATCH] diff.c: increment buffer pointer in all code path
On Wed, Oct 18, 2017 at 10:42 PM, Jeff Kingwrote: > > So I think the right fix is this: > [...] > > It's late here, so I'll wait for comments from Stefan and then try to > wrap it up with a commit message and test tomorrow. > > -Peff I agree that this is better and looks correct. Thanks for offering to send a patch. I'd be happy to review it. Thanks, Stefan
Re: [PATCH] diff.c: increment buffer pointer in all code path
On Wed, Oct 18, 2017 at 10:24 PM, Jeff Kingwrote: > On Thu, Oct 19, 2017 at 01:04:59AM -0400, Jeff King wrote: > >> So. That leaves me with: >> >> - I'm unclear on whether next_byte() is meant to return that trailing >> NUL or not. I don't think it causes any bugs, but it certainly >> confused me for a function to take a cp/endp pair of pointers, and >> then dereference endp. It might be worth either fixing or clarifying >> with a comment. >> >> - Those loops to eat trailing whitespace are doing nothing. I'm not >> sure if that all works out because next_byte() eats whitespaces or >> not (I think not, because it doesn't eat whitespace for the >> IGNORE_WHITESPACE_AT_EOL case). But I'm not quite sure what a test >> would look like. > > I had trouble constructing a test at first, but I think my test lines > just weren't long enough to trigger the movement heuristics. If I switch > to something besides seq, I can do: > > # any input that has reasonably sized lines > look e | head -50 >file > git add file > > perl -i -ne ' > # pick up lines 20-25 to move to line 40, and > # add some trailing whitespace to them > if ($. >= 20 && $. <= 25) { > s/$/ /; > $hold .= $_; > } else { > print $hold if ($. == 40); > print; > } > ' file > > git diff --color-moved --ignore-space-at-eol > > I think that _should_ show the block as moved, but it doesn't. But if I > apply this patch: > > diff --git a/diff.c b/diff.c > index 93dccd1817..375d9cf447 100644 > --- a/diff.c > +++ b/diff.c > @@ -743,8 +743,8 @@ static int moved_entry_cmp(const struct diff_options > *diffopt, >const struct moved_entry *b, >const void *keydata) > { > - const char *ap = a->es->line, *ae = a->es->line + a->es->len; > - const char *bp = b->es->line, *be = b->es->line + b->es->len; > + const char *ap = a->es->line, *ae = a->es->line + a->es->len - 1; > + const char *bp = b->es->line, *be = b->es->line + b->es->len - 1; > > if (!(diffopt->xdl_opts & XDF_WHITESPACE_FLAGS)) > return a->es->len != b->es->len || memcmp(ap, bp, > a->es->len); > @@ -771,7 +771,7 @@ static unsigned get_string_hash(struct > emitted_diff_symbol *es, struct diff_opti > { > if (o->xdl_opts & XDF_WHITESPACE_FLAGS) { > static struct strbuf sb = STRBUF_INIT; > - const char *ap = es->line, *ae = es->line + es->len; > + const char *ap = es->line, *ae = es->line + es->len - 1; > int c; > > strbuf_reset(); > > it does. It just adjusts our "end pointer" to point to the last valid > character in the string (rather than one past), Thanks for spotting. I can send a proper patch with tests if you'd like. > which seems to be the > convention that those loops (and next_byte) expect. I'll look at that again. Thanks for poking! Stefan > > -Peff
Re: [PATCH] diff.c: increment buffer pointer in all code path
On Thu, Oct 19, 2017 at 02:30:08PM +0900, Junio C Hamano wrote: > Jeff Kingwrites: > > > it does. It just adjusts our "end pointer" to point to the last valid > > character in the string (rather than one past), which seems to be the > > convention that those loops (and next_byte) expect. > > Yeah I am not sure if I like this comparison at the beginning of the > function: > > static int next_byte(const char **cp, const char **endp, > const struct diff_options *diffopt) > { > int retval; > > if (*cp > *endp) > return -1; > > but it says endp _is_ part of valid input, contrary to my intuition. Actually, I think even this function is confused about its convention. In the line you quote, we clearly treat *endp as part of the input. But later we do: while (*cp < *endp && isspace(**cp)) (*cp)++; meaning that we'd fail to soak up whitespace at *endp. That wouldn't be so bad if not for the other bug which fails to eat whitespace at endp in the first place. :) So I think the right fix is this: diff --git a/diff.c b/diff.c index 6fd288420b..09081a207c 100644 --- a/diff.c +++ b/diff.c @@ -712,7 +712,7 @@ static int next_byte(const char **cp, const char **endp, { int retval; - if (*cp > *endp) + if (*cp >= *endp) return -1; if (isspace(**cp)) { @@ -729,7 +729,12 @@ static int next_byte(const char **cp, const char **endp, if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE)) { while (*cp < *endp && isspace(**cp)) (*cp)++; - /* return the first non-ws character via the usual below */ + /* +* return the first non-ws character via the usual +* below, unless we ate all of the bytes +*/ + if (*cp >= *endp) + return -1; } } @@ -750,9 +755,9 @@ static int moved_entry_cmp(const struct diff_options *diffopt, return a->es->len != b->es->len || memcmp(ap, bp, a->es->len); if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE_AT_EOL)) { - while (ae > ap && isspace(*ae)) + while (ae > ap && isspace(ae[-1])) ae--; - while (be > bp && isspace(*be)) + while (be > bp && isspace(be[-1])) be--; } @@ -775,7 +780,7 @@ static unsigned get_string_hash(struct emitted_diff_symbol *es, struct diff_opti int c; strbuf_reset(); - while (ae > ap && isspace(*ae)) + while (ae > ap && isspace(ae[-1])) ae--; while ((c = next_byte(, , o)) > 0) strbuf_addch(, c); It's late here, so I'll wait for comments from Stefan and then try to wrap it up with a commit message and test tomorrow. -Peff
Re: [PATCH] diff.c: increment buffer pointer in all code path
On Thu, Oct 19, 2017 at 02:32:04PM +0900, Junio C Hamano wrote: > > Yeah I am not sure if I like this comparison at the beginning of the > > function: > > > > static int next_byte(const char **cp, const char **endp, > > const struct diff_options *diffopt) > > { > > int retval; > > > > if (*cp > *endp) > > return -1; > > > > but it says endp _is_ part of valid input, contrary to my intuition. > > > > And your change to the initialization of ae/be in moved_entry_cmp() > > makes it consistent with it, I think. > > > > But doesn't it mean ae computation in get_string_hash() also needs a > > massaging? > > Ah, forget the last two lines. You do do the massaging in your > patch. I was just replying so. :) > My preference actually is to fix next_byte to follow the usual "when > we end, it points one past the valid region", though. Yeah, I think that is my preference, too. -Peff
Re: [PATCH] diff.c: increment buffer pointer in all code path
Junio C Hamanowrites: > Jeff King writes: > >> it does. It just adjusts our "end pointer" to point to the last valid >> character in the string (rather than one past), which seems to be the >> convention that those loops (and next_byte) expect. > > Yeah I am not sure if I like this comparison at the beginning of the > function: > > static int next_byte(const char **cp, const char **endp, > const struct diff_options *diffopt) > { > int retval; > > if (*cp > *endp) > return -1; > > but it says endp _is_ part of valid input, contrary to my intuition. > > And your change to the initialization of ae/be in moved_entry_cmp() > makes it consistent with it, I think. > > But doesn't it mean ae computation in get_string_hash() also needs a > massaging? Ah, forget the last two lines. You do do the massaging in your patch. My preference actually is to fix next_byte to follow the usual "when we end, it points one past the valid region", though. Thanks for digging it through.
Re: [PATCH] diff.c: increment buffer pointer in all code path
Jeff Kingwrites: > it does. It just adjusts our "end pointer" to point to the last valid > character in the string (rather than one past), which seems to be the > convention that those loops (and next_byte) expect. Yeah I am not sure if I like this comparison at the beginning of the function: static int next_byte(const char **cp, const char **endp, const struct diff_options *diffopt) { int retval; if (*cp > *endp) return -1; but it says endp _is_ part of valid input, contrary to my intuition. And your change to the initialization of ae/be in moved_entry_cmp() makes it consistent with it, I think. But doesn't it mean ae computation in get_string_hash() also needs a massaging?
Re: [PATCH] diff.c: increment buffer pointer in all code path
On Thu, Oct 19, 2017 at 01:04:59AM -0400, Jeff King wrote: > So. That leaves me with: > > - I'm unclear on whether next_byte() is meant to return that trailing > NUL or not. I don't think it causes any bugs, but it certainly > confused me for a function to take a cp/endp pair of pointers, and > then dereference endp. It might be worth either fixing or clarifying > with a comment. > > - Those loops to eat trailing whitespace are doing nothing. I'm not > sure if that all works out because next_byte() eats whitespaces or > not (I think not, because it doesn't eat whitespace for the > IGNORE_WHITESPACE_AT_EOL case). But I'm not quite sure what a test > would look like. I had trouble constructing a test at first, but I think my test lines just weren't long enough to trigger the movement heuristics. If I switch to something besides seq, I can do: # any input that has reasonably sized lines look e | head -50 >file git add file perl -i -ne ' # pick up lines 20-25 to move to line 40, and # add some trailing whitespace to them if ($. >= 20 && $. <= 25) { s/$/ /; $hold .= $_; } else { print $hold if ($. == 40); print; } ' file git diff --color-moved --ignore-space-at-eol I think that _should_ show the block as moved, but it doesn't. But if I apply this patch: diff --git a/diff.c b/diff.c index 93dccd1817..375d9cf447 100644 --- a/diff.c +++ b/diff.c @@ -743,8 +743,8 @@ static int moved_entry_cmp(const struct diff_options *diffopt, const struct moved_entry *b, const void *keydata) { - const char *ap = a->es->line, *ae = a->es->line + a->es->len; - const char *bp = b->es->line, *be = b->es->line + b->es->len; + const char *ap = a->es->line, *ae = a->es->line + a->es->len - 1; + const char *bp = b->es->line, *be = b->es->line + b->es->len - 1; if (!(diffopt->xdl_opts & XDF_WHITESPACE_FLAGS)) return a->es->len != b->es->len || memcmp(ap, bp, a->es->len); @@ -771,7 +771,7 @@ static unsigned get_string_hash(struct emitted_diff_symbol *es, struct diff_opti { if (o->xdl_opts & XDF_WHITESPACE_FLAGS) { static struct strbuf sb = STRBUF_INIT; - const char *ap = es->line, *ae = es->line + es->len; + const char *ap = es->line, *ae = es->line + es->len - 1; int c; strbuf_reset(); it does. It just adjusts our "end pointer" to point to the last valid character in the string (rather than one past), which seems to be the convention that those loops (and next_byte) expect. -Peff
Re: [PATCH] diff.c: increment buffer pointer in all code path
On Thu, Oct 12, 2017 at 08:20:57PM -0400, Jeff King wrote: > On Thu, Oct 12, 2017 at 08:18:37PM -0400, Jeff King wrote: > > > Fix this by entering the conditional only when we actually > > see whitespace. We can apply this also to the > > IGNORE_WHITESPACE change. That code path isn't buggy > > (because it falls through to returning the next > > non-whitespace byte), but it makes the logic more clear if > > we only bother to look at whitespace flags after seeing that > > the next byte is whitespace. > > I think there actually _is_ a bug in that code path, but it's unrelated > to this one. If you have whitespace at the end of the buffer, then we'd > advance *cp until it matches *endp, and then return whatever is at *endp > (which is nonsense, or probably a NUL) rather than returning "-1". > > I'm out of time for tonight and not familiar enough with the color-moved > code to come up with a reasonable test case quickly, but maybe you can > see if that can trigger bad behavior? I found a few moments to follow up on this. I'm not sure if it's a bug or intentional behavior. It's definitely easy to trigger next_byte() reading *endp (even without ignoring whitespace, we always dereference it). This function is always operating on the buffer from an emitted_diff_symbol, which is NUL-terminated. So *endp is always NUL. So we'll return "0" at the end of the string. There are two callers. The first is get_string_hash(), which does: while ((c = next_byte(, , o)) > 0) so we'll break out on the NUL byte. But it also sometimes shrinks the end-pointer "ae" to skip trailing whitespace. Or at least it tries to. It does: const char *ap = es->line, *ae = es->line + es->len; ... while (ae > ap && isspace(*ae)) ae--; but AFAICT that loop will never trigger, since *ae will always point to NUL to begin with. But it points to the expectation that our end-pointer does not follow the usual one-past-the-end convention, but rather is meant to point to the actual last character in the string. The same problem is repeated in the other caller, moved_entry_cmp(), which tries to eat trailing whitespace for IGNORE_WHITESPACE_AT_EOL. But it uses the same loop that starts on NUL and will never trigger. It then goes on to call next_byte(). The extra NUL there should not cause any problem, because we are just checking for equality between the two strings (so if they both return NUL at the same time, good; if not, then we know they are different). So. That leaves me with: - I'm unclear on whether next_byte() is meant to return that trailing NUL or not. I don't think it causes any bugs, but it certainly confused me for a function to take a cp/endp pair of pointers, and then dereference endp. It might be worth either fixing or clarifying with a comment. - Those loops to eat trailing whitespace are doing nothing. I'm not sure if that all works out because next_byte() eats whitespaces or not (I think not, because it doesn't eat whitespace for the IGNORE_WHITESPACE_AT_EOL case). But I'm not quite sure what a test would look like. -Peff
Re: [PATCH] diff.c: increment buffer pointer in all code path
On Thu, Oct 12, 2017 at 5:20 PM, Jeff Kingwrote: > On Thu, Oct 12, 2017 at 08:18:37PM -0400, Jeff King wrote: > >> Fix this by entering the conditional only when we actually >> see whitespace. We can apply this also to the >> IGNORE_WHITESPACE change. That code path isn't buggy >> (because it falls through to returning the next >> non-whitespace byte), but it makes the logic more clear if >> we only bother to look at whitespace flags after seeing that >> the next byte is whitespace. > > I think there actually _is_ a bug in that code path, but it's unrelated > to this one. If you have whitespace at the end of the buffer, then we'd > advance *cp until it matches *endp, and then return whatever is at *endp > (which is nonsense, or probably a NUL) rather than returning "-1". Good catch! This plays together interestingly with IGNORE_WHITESPACE_AT_EOL, too. If that is set the endp is just put further to the front, so we'd actually compare white spaces after endp. If that flag is not set, we'd get NUL or nonsense. > I'm out of time for tonight and not familiar enough with the color-moved > code to come up with a reasonable test case quickly, but maybe you can > see if that can trigger bad behavior? > > -Peff
Re: [PATCH] diff.c: increment buffer pointer in all code path
On Thu, Oct 12, 2017 at 08:18:37PM -0400, Jeff King wrote: > Fix this by entering the conditional only when we actually > see whitespace. We can apply this also to the > IGNORE_WHITESPACE change. That code path isn't buggy > (because it falls through to returning the next > non-whitespace byte), but it makes the logic more clear if > we only bother to look at whitespace flags after seeing that > the next byte is whitespace. I think there actually _is_ a bug in that code path, but it's unrelated to this one. If you have whitespace at the end of the buffer, then we'd advance *cp until it matches *endp, and then return whatever is at *endp (which is nonsense, or probably a NUL) rather than returning "-1". I'm out of time for tonight and not familiar enough with the color-moved code to come up with a reasonable test case quickly, but maybe you can see if that can trigger bad behavior? -Peff
Re: [PATCH] diff.c: increment buffer pointer in all code path
On Thu, Oct 12, 2017 at 04:33:22PM -0700, Stefan Beller wrote: > Peff, feel free to take ownership here. I merely made it to a patch. I think compared to my original, it makes sense to actually wrap the whole thing with a check for the whitespace. You can do it just in the IGNORE_WHITESPACE_CHANGE conditional, but I think it makes more sense to cover both. Like the patch below (view with "-w" to see the logic more clearly). I also tweaked the test to remove "sed -i", which isn't portable, and fixed up a few style nits. -- >8 -- Subject: diff: fix infinite loop with --color-moved --ignore-space-change The --color-moved code uses next_byte() to advance through the blob contents. When the user has asked to ignore whitespace changes, we try to collapse any whitespace change down to a single space. However, we enter the conditional block whenever we see the IGNORE_WHITESPACE_CHANGE flag, even if the next byte isn't whitespace. This means that the combination of "--color-moved and --ignore-space-change" was completely broken. Worse, because we return from next_byte() without having advanced our pointer, the function makes no forward progress in the buffer and loops infinitely. Fix this by entering the conditional only when we actually see whitespace. We can apply this also to the IGNORE_WHITESPACE change. That code path isn't buggy (because it falls through to returning the next non-whitespace byte), but it makes the logic more clear if we only bother to look at whitespace flags after seeing that the next byte is whitespace. Reported-by: Orgad ShanehSigned-off-by: Jeff King --- diff.c | 28 +++- t/t4015-diff-whitespace.sh | 9 + 2 files changed, 24 insertions(+), 13 deletions(-) diff --git a/diff.c b/diff.c index 69f03570ad..d76bb937c1 100644 --- a/diff.c +++ b/diff.c @@ -712,20 +712,22 @@ static int next_byte(const char **cp, const char **endp, if (*cp > *endp) return -1; - if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE_CHANGE)) { - while (*cp < *endp && isspace(**cp)) - (*cp)++; - /* -* After skipping a couple of whitespaces, we still have to -* account for one space. -*/ - return (int)' '; - } + if (isspace(**cp)) { + if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE_CHANGE)) { + while (*cp < *endp && isspace(**cp)) + (*cp)++; + /* +* After skipping a couple of whitespaces, +* we still have to account for one space. +*/ + return (int)' '; + } - if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE)) { - while (*cp < *endp && isspace(**cp)) - (*cp)++; - /* return the first non-ws character via the usual below */ + if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE)) { + while (*cp < *endp && isspace(**cp)) + (*cp)++; + /* return the first non-ws character via the usual below */ + } } retval = (unsigned char)(**cp); diff --git a/t/t4015-diff-whitespace.sh b/t/t4015-diff-whitespace.sh index bd0f75d9f7..87083f728f 100755 --- a/t/t4015-diff-whitespace.sh +++ b/t/t4015-diff-whitespace.sh @@ -1530,4 +1530,13 @@ test_expect_success 'move detection with submodules' ' test_cmp expect decoded_actual ' +test_expect_success 'move detection with whitespace changes' ' + test_when_finished "git reset --hard" && + test_seq 10 >test && + git add test && + sed s/3/42/ test.tmp && + mv test.tmp test && + git -c diff.colormoved diff --ignore-space-change -- test +' + test_done -- 2.15.0.rc1.381.g94fac273d4