Re: [PATCH] t4062: stop using repetition in regex
René Scharfewrites: > Am 09.08.2017 um 19:47 schrieb Junio C Hamano: >> René Scharfe writes: >> >>> There could be any characters except NUL and LF between the 4096 zeros >>> and "0$" for the latter to match wrongly, no? So there are 4095 >>> opportunities for the misleading pattern in a page, with probabilities >>> like this: >>> >>>0$ 1/256 * 2/256 >>>.0$ 254/256 * 1/256 * 2/256 >>>..0$ (254/256)^2* 1/256 * 2/256 >>>.{3}0$ (254/256)^3* 1/256 * 2/256 >>> >>>.{4094}0$ (254/256)^4094 * 1/256 * 2/256 >>> >>> That sums up to ca. 1/256 (did that numerically). Does that make >>> sense? >> >> Yes, thanks. I think the number would be different for "^0*$" (the >> above is for "0$") and moves it down to ~1/3, but as I said, >> allowing additional false success rate is unnecessary (even if it is >> miniscule enough to be acceptable), so let's take the 64*64 patch. > > Ah, right, now I get your calculation in the email I replied to above. > "^0*$" has a probability of 2/255 to produce false positives. Yes, and that is larger than 2/256 we would have to accept with the original "^0{4096}$" or the updated "^(0{64}){64}$" by ~1/3, which is unnecessary additional false rate of success. Thanks.
Re: [PATCH] t4062: stop using repetition in regex
Am 09.08.2017 um 19:47 schrieb Junio C Hamano: > René Scharfewrites: > >> There could be any characters except NUL and LF between the 4096 zeros >> and "0$" for the latter to match wrongly, no? So there are 4095 >> opportunities for the misleading pattern in a page, with probabilities >> like this: >> >>0$ 1/256 * 2/256 >>.0$ 254/256 * 1/256 * 2/256 >>..0$ (254/256)^2* 1/256 * 2/256 >>.{3}0$ (254/256)^3* 1/256 * 2/256 >> >>.{4094}0$ (254/256)^4094 * 1/256 * 2/256 >> >> That sums up to ca. 1/256 (did that numerically). Does that make >> sense? > > Yes, thanks. I think the number would be different for "^0*$" (the > above is for "0$") and moves it down to ~1/3, but as I said, > allowing additional false success rate is unnecessary (even if it is > miniscule enough to be acceptable), so let's take the 64*64 patch. Ah, right, now I get your calculation in the email I replied to above. "^0*$" has a probability of 2/255 to produce false positives. Thanks, René
Re: [PATCH] t4062: stop using repetition in regex
Hi all, On Wed, 9 Aug 2017, David Coppa wrote: > On Wed, Aug 9, 2017 at 4:15 PM, René Scharfewrote: > > Am 09.08.2017 um 08:15 schrieb René Scharfe: > >> Am 09.08.2017 um 07:29 schrieb Junio C Hamano: > >>> René Scharfe writes: > >>> > Am 09.08.2017 um 00:26 schrieb Junio C Hamano: > > ... but in the meantime, I think replacing the test with "0$" to > > force the scanner to find either the end of line or the end of the > > buffer may be a good workaround. We do not have to care how many of > > random bytes are in front of the last "0" in order to ensure that > > the regexec_buf() does not overstep to 4097th byte, while seeing > > that regexec() that does not know how long the haystack is has to do > > so, no? > > Our regexec() calls strlen() (see my other reply). > > Using "0$" looks like the best option to me. > >>> > >>> Yeah, it seems that way. If we want to be close/faithful to the > >>> original, we could do "^0*$", but the part that is essential to > >>> trigger the old bug is not the "we have many zeroes" (or "we have > >>> 4096 zeroes") part, but "zero is at the end of the string" part, so > >>> "0$" would be the minimal pattern that also would work for OBSD. > >> > >> Thought about it a bit more. > >> > >> "^0{4096}$" checks if the byte after the buffer is \n or \0 in the > >> hope of triggering a segfault. On Linux I can access that byte just > >> fine; perhaps there is no guard page. Also there is a 2 in 256 > >> chance of the byte being \n or \0 (provided its value is random), > >> which would cause the test to falsely report success. > >> > >> "0$" effectively looks for "0\n" or "0\0", which can only occur > >> after the buffer. If that string is found close enough then we > >> may not trigger a segfault and report a false positive. > >> > >> In the face of unreliable segfaults we need to reverse our strategy, > >> I think. Searching for something not in the buffer (e.g. "1") and > >> considering matches and segfaults as confirmation that the bug is > >> still present should avoid any false positives. Right? > > > > And that's not it either. *sigh* > > > > If the 4097th byte is NUL or LF then we can only hope its access > > triggers a segfault -- there is no other way to distinguish the > > result from a legitimate match when limiting with REG_STARTEND. So > > we have to accept this flakiness. > > > > We can check the value of that byte with [^0] and interpret a > > match as failure, but that adds negation and makes the test more > > complex. > > > > ^0*$ would falsely match if the 4097th byte (and possibly later > > ones) is 0. We need to make sure we check for end-of-line after > > the 4096th byte, not later. > > > > Sorry, Dscho, I thought we could take a shortcut here, but -- as > > you wrote all along -- we can't. > > > > So how about this? > > > > -- >8 -- > > Subject: [PATCH] t4062: use less than 256 repetitions in regex > > > > OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255. > > That's the minimum acceptable value according to POSIX. In t4062 we use > > 4096 repetitions in the test "-G matches", though, causing it to fail. > > Combine two repetition operators, both less than 256, to arrive at 4096 > > zeros instead of using a single one, to fix the test on OpenBSD. > > > > Original-patch-by: David Coppa > > Signed-off-by: Rene Scharfe > > --- > > t/t4062-diff-pickaxe.sh | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/t/t4062-diff-pickaxe.sh b/t/t4062-diff-pickaxe.sh > > index 7c4903f497..1130c8019b 100755 > > --- a/t/t4062-diff-pickaxe.sh > > +++ b/t/t4062-diff-pickaxe.sh > > @@ -14,8 +14,10 @@ test_expect_success setup ' > > test_tick && > > git commit -m "A 4k file" > > ' > > + > > +# OpenBSD only supports up to 255 repetitions, so repeat twice for > > 64*64=4096. > > test_expect_success '-G matches' ' > > - git diff --name-only -G "^0{4096}$" HEAD^ >out && > > + git diff --name-only -G "^(0{64}){64}$" HEAD^ >out && > > test 4096-zeroes.txt = "$(cat out)" > > ' > > > > -- > > 2.14.0 > > I think this should work w/o problems on OpenBSD: > > $ uname -a > OpenBSD open.bsdgeek.it 6.1 GENERIC#54 amd64 > $ git diff --name-only -G "^0{4096}$" HEAD^ 1>/dev/null > fatal: invalid regex: invalid repetition count(s) > $ git diff --name-only -G "^(0{64}){64}$" HEAD^ 1>/dev/null > $ Perfect! Thank y'all for being so thorough, Dscho
Re: [PATCH] t4062: stop using repetition in regex
René Scharfewrites: > There could be any characters except NUL and LF between the 4096 zeros > and "0$" for the latter to match wrongly, no? So there are 4095 > opportunities for the misleading pattern in a page, with probabilities > like this: > > 0$ 1/256 * 2/256 > .0$ 254/256 * 1/256 * 2/256 > ..0$ (254/256)^2* 1/256 * 2/256 > .{3}0$ (254/256)^3* 1/256 * 2/256 > > .{4094}0$ (254/256)^4094 * 1/256 * 2/256 > > That sums up to ca. 1/256 (did that numerically). Does that make > sense? Yes, thanks. I think the number would be different for "^0*$" (the above is for "0$") and moves it down to ~1/3, but as I said, allowing additional false success rate is unnecessary (even if it is miniscule enough to be acceptable), so let's take the 64*64 patch. >> So we are saying that we accept ~1/100 false success rate, but >> additional ~1/3 is unacceptable. >> >> I do not know if I buy that argument, but I do think that additional >> false success rate, even if it is miniscule, is unnecessary. So as >> long as everybody's regexp library is happy with "^0{64}{64}$", >> let's use that. > > The parentheses are necessary ("^(0{64}){64}$"), at least on OpenBSD. Sorry, what I wrote was merely a typo; the one from you I applied did have the parens so we are good. Thanks.
Re: [PATCH] t4062: stop using repetition in regex
Am 09.08.2017 um 18:07 schrieb Junio C Hamano: > René Scharfewrites: > >>> In the face of unreliable segfaults we need to reverse our strategy, >>> I think. Searching for something not in the buffer (e.g. "1") and >>> considering matches and segfaults as confirmation that the bug is >>> still present should avoid any false positives. Right? >> >> And that's not it either. *sigh* >> >> If the 4097th byte is NUL or LF then we can only hope its access >> triggers a segfault -- there is no other way to distinguish the >> result from a legitimate match when limiting with REG_STARTEND. So >> we have to accept this flakiness. > >> We can check the value of that byte with [^0] and interpret a >> match as failure, but that adds negation and makes the test more >> complex. >> >> ^0*$ would falsely match if the 4097th byte (and possibly later >> ones) is 0. We need to make sure we check for end-of-line after >> the 4096th byte, not later. > > I do not have a strong objection against "^0{64}{64}$", but let me > make sure that I understand your rationale correctly. > > We assume that we do not have an unmapped guard page to reliably > trap us. So "^0{64}{64}$" will report a false success when that > 4097th byte is either NUL or LF. There is 2/256 probability of that > happening but we accept it. > > A pattern "0$" will give us a false success in the same case, but > the reason why it is worse is because in addition, we get a false > success if that second page begins with "0\0" or "0\n". The chance > of that happening is additional 2/256 * 1/256. Worse yet, the page > could start with "00\0" or "00\n", which is additional 2/256 * > 1/65536. We could have yet another "0" at the beginning of that > second page, which only increases the chance of us getting a false > sucess. I demonstrated a lack of logical thinking and now you bring on probabilities!? ;-) There could be any characters except NUL and LF between the 4096 zeros and "0$" for the latter to match wrongly, no? So there are 4095 opportunities for the misleading pattern in a page, with probabilities like this: 0$ 1/256 * 2/256 .0$ 254/256 * 1/256 * 2/256 ..0$ (254/256)^2* 1/256 * 2/256 .{3}0$ (254/256)^3* 1/256 * 2/256 .{4094}0$ (254/256)^4094 * 1/256 * 2/256 That sums up to ca. 1/256 (did that numerically). Does that make sense? > So we are saying that we accept ~1/100 false success rate, but > additional ~1/3 is unacceptable. > > I do not know if I buy that argument, but I do think that additional > false success rate, even if it is miniscule, is unnecessary. So as > long as everybody's regexp library is happy with "^0{64}{64}$", > let's use that. The parentheses are necessary ("^(0{64}){64}$"), at least on OpenBSD. It doesn't accept consecutive repetition operators without them. We could use "^(){128}$" instead if there's a problem with that. René
Re: [PATCH] t4062: stop using repetition in regex
René Scharfewrites: >> In the face of unreliable segfaults we need to reverse our strategy, >> I think. Searching for something not in the buffer (e.g. "1") and >> considering matches and segfaults as confirmation that the bug is >> still present should avoid any false positives. Right? > > And that's not it either. *sigh* > > If the 4097th byte is NUL or LF then we can only hope its access > triggers a segfault -- there is no other way to distinguish the > result from a legitimate match when limiting with REG_STARTEND. So > we have to accept this flakiness. > We can check the value of that byte with [^0] and interpret a > match as failure, but that adds negation and makes the test more > complex. > > ^0*$ would falsely match if the 4097th byte (and possibly later > ones) is 0. We need to make sure we check for end-of-line after > the 4096th byte, not later. I do not have a strong objection against "^0{64}{64}$", but let me make sure that I understand your rationale correctly. We assume that we do not have an unmapped guard page to reliably trap us. So "^0{64}{64}$" will report a false success when that 4097th byte is either NUL or LF. There is 2/256 probability of that happening but we accept it. A pattern "0$" will give us a false success in the same case, but the reason why it is worse is because in addition, we get a false success if that second page begins with "0\0" or "0\n". The chance of that happening is additional 2/256 * 1/256. Worse yet, the page could start with "00\0" or "00\n", which is additional 2/256 * 1/65536. We could have yet another "0" at the beginning of that second page, which only increases the chance of us getting a false sucess. We consider that 2/256 * (1/256 + 1/256^2 + 1/256^3 + ...) is too big an additional chance of getting false success to be acceptable, even though 2/256 which is the chance of false success that we have to and are willing to accept. Now I am not good at doing math on the keyboard and text terminal, but X= (1/256 + 1/256^2 + ...) 256X = 1 + X X= 1/255 so that additional rate of false success is 2/256 * 1/255. So we are saying that we accept ~1/100 false success rate, but additional ~1/3 is unacceptable. I do not know if I buy that argument, but I do think that additional false success rate, even if it is miniscule, is unnecessary. So as long as everybody's regexp library is happy with "^0{64}{64}$", let's use that. Thanks.
Re: [PATCH] t4062: stop using repetition in regex
On Wed, Aug 9, 2017 at 4:15 PM, René Scharfewrote: > Am 09.08.2017 um 08:15 schrieb René Scharfe: >> Am 09.08.2017 um 07:29 schrieb Junio C Hamano: >>> René Scharfe writes: >>> Am 09.08.2017 um 00:26 schrieb Junio C Hamano: > ... but in the meantime, I think replacing the test with "0$" to > force the scanner to find either the end of line or the end of the > buffer may be a good workaround. We do not have to care how many of > random bytes are in front of the last "0" in order to ensure that > the regexec_buf() does not overstep to 4097th byte, while seeing > that regexec() that does not know how long the haystack is has to do > so, no? Our regexec() calls strlen() (see my other reply). Using "0$" looks like the best option to me. >>> >>> Yeah, it seems that way. If we want to be close/faithful to the >>> original, we could do "^0*$", but the part that is essential to >>> trigger the old bug is not the "we have many zeroes" (or "we have >>> 4096 zeroes") part, but "zero is at the end of the string" part, so >>> "0$" would be the minimal pattern that also would work for OBSD. >> >> Thought about it a bit more. >> >> "^0{4096}$" checks if the byte after the buffer is \n or \0 in the >> hope of triggering a segfault. On Linux I can access that byte just >> fine; perhaps there is no guard page. Also there is a 2 in 256 >> chance of the byte being \n or \0 (provided its value is random), >> which would cause the test to falsely report success. >> >> "0$" effectively looks for "0\n" or "0\0", which can only occur >> after the buffer. If that string is found close enough then we >> may not trigger a segfault and report a false positive. >> >> In the face of unreliable segfaults we need to reverse our strategy, >> I think. Searching for something not in the buffer (e.g. "1") and >> considering matches and segfaults as confirmation that the bug is >> still present should avoid any false positives. Right? > > And that's not it either. *sigh* > > If the 4097th byte is NUL or LF then we can only hope its access > triggers a segfault -- there is no other way to distinguish the > result from a legitimate match when limiting with REG_STARTEND. So > we have to accept this flakiness. > > We can check the value of that byte with [^0] and interpret a > match as failure, but that adds negation and makes the test more > complex. > > ^0*$ would falsely match if the 4097th byte (and possibly later > ones) is 0. We need to make sure we check for end-of-line after > the 4096th byte, not later. > > Sorry, Dscho, I thought we could take a shortcut here, but -- as > you wrote all along -- we can't. > > So how about this? > > -- >8 -- > Subject: [PATCH] t4062: use less than 256 repetitions in regex > > OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255. > That's the minimum acceptable value according to POSIX. In t4062 we use > 4096 repetitions in the test "-G matches", though, causing it to fail. > Combine two repetition operators, both less than 256, to arrive at 4096 > zeros instead of using a single one, to fix the test on OpenBSD. > > Original-patch-by: David Coppa > Signed-off-by: Rene Scharfe > --- > t/t4062-diff-pickaxe.sh | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/t/t4062-diff-pickaxe.sh b/t/t4062-diff-pickaxe.sh > index 7c4903f497..1130c8019b 100755 > --- a/t/t4062-diff-pickaxe.sh > +++ b/t/t4062-diff-pickaxe.sh > @@ -14,8 +14,10 @@ test_expect_success setup ' > test_tick && > git commit -m "A 4k file" > ' > + > +# OpenBSD only supports up to 255 repetitions, so repeat twice for > 64*64=4096. > test_expect_success '-G matches' ' > - git diff --name-only -G "^0{4096}$" HEAD^ >out && > + git diff --name-only -G "^(0{64}){64}$" HEAD^ >out && > test 4096-zeroes.txt = "$(cat out)" > ' > > -- > 2.14.0 I think this should work w/o problems on OpenBSD: $ uname -a OpenBSD open.bsdgeek.it 6.1 GENERIC#54 amd64 $ git diff --name-only -G "^0{4096}$" HEAD^ 1>/dev/null fatal: invalid regex: invalid repetition count(s) $ git diff --name-only -G "^(0{64}){64}$" HEAD^ 1>/dev/null $ Ciao! David
Re: [PATCH] t4062: stop using repetition in regex
Am 09.08.2017 um 08:15 schrieb René Scharfe: > Am 09.08.2017 um 07:29 schrieb Junio C Hamano: >> René Scharfewrites: >> >>> Am 09.08.2017 um 00:26 schrieb Junio C Hamano: ... but in the meantime, I think replacing the test with "0$" to force the scanner to find either the end of line or the end of the buffer may be a good workaround. We do not have to care how many of random bytes are in front of the last "0" in order to ensure that the regexec_buf() does not overstep to 4097th byte, while seeing that regexec() that does not know how long the haystack is has to do so, no? >>> >>> Our regexec() calls strlen() (see my other reply). >>> >>> Using "0$" looks like the best option to me. >> >> Yeah, it seems that way. If we want to be close/faithful to the >> original, we could do "^0*$", but the part that is essential to >> trigger the old bug is not the "we have many zeroes" (or "we have >> 4096 zeroes") part, but "zero is at the end of the string" part, so >> "0$" would be the minimal pattern that also would work for OBSD. > > Thought about it a bit more. > > "^0{4096}$" checks if the byte after the buffer is \n or \0 in the > hope of triggering a segfault. On Linux I can access that byte just > fine; perhaps there is no guard page. Also there is a 2 in 256 > chance of the byte being \n or \0 (provided its value is random), > which would cause the test to falsely report success. > > "0$" effectively looks for "0\n" or "0\0", which can only occur > after the buffer. If that string is found close enough then we > may not trigger a segfault and report a false positive. > > In the face of unreliable segfaults we need to reverse our strategy, > I think. Searching for something not in the buffer (e.g. "1") and > considering matches and segfaults as confirmation that the bug is > still present should avoid any false positives. Right? And that's not it either. *sigh* If the 4097th byte is NUL or LF then we can only hope its access triggers a segfault -- there is no other way to distinguish the result from a legitimate match when limiting with REG_STARTEND. So we have to accept this flakiness. We can check the value of that byte with [^0] and interpret a match as failure, but that adds negation and makes the test more complex. ^0*$ would falsely match if the 4097th byte (and possibly later ones) is 0. We need to make sure we check for end-of-line after the 4096th byte, not later. Sorry, Dscho, I thought we could take a shortcut here, but -- as you wrote all along -- we can't. So how about this? -- >8 -- Subject: [PATCH] t4062: use less than 256 repetitions in regex OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255. That's the minimum acceptable value according to POSIX. In t4062 we use 4096 repetitions in the test "-G matches", though, causing it to fail. Combine two repetition operators, both less than 256, to arrive at 4096 zeros instead of using a single one, to fix the test on OpenBSD. Original-patch-by: David Coppa Signed-off-by: Rene Scharfe --- t/t4062-diff-pickaxe.sh | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/t/t4062-diff-pickaxe.sh b/t/t4062-diff-pickaxe.sh index 7c4903f497..1130c8019b 100755 --- a/t/t4062-diff-pickaxe.sh +++ b/t/t4062-diff-pickaxe.sh @@ -14,8 +14,10 @@ test_expect_success setup ' test_tick && git commit -m "A 4k file" ' + +# OpenBSD only supports up to 255 repetitions, so repeat twice for 64*64=4096. test_expect_success '-G matches' ' - git diff --name-only -G "^0{4096}$" HEAD^ >out && + git diff --name-only -G "^(0{64}){64}$" HEAD^ >out && test 4096-zeroes.txt = "$(cat out)" ' -- 2.14.0
Re: [PATCH] t4062: stop using repetition in regex
Am 09.08.2017 um 07:29 schrieb Junio C Hamano: > René Scharfewrites: > >> Am 09.08.2017 um 00:26 schrieb Junio C Hamano: >>> ... but in the meantime, I think replacing the test with "0$" to >>> force the scanner to find either the end of line or the end of the >>> buffer may be a good workaround. We do not have to care how many of >>> random bytes are in front of the last "0" in order to ensure that >>> the regexec_buf() does not overstep to 4097th byte, while seeing >>> that regexec() that does not know how long the haystack is has to do >>> so, no? >> >> Our regexec() calls strlen() (see my other reply). >> >> Using "0$" looks like the best option to me. > > Yeah, it seems that way. If we want to be close/faithful to the > original, we could do "^0*$", but the part that is essential to > trigger the old bug is not the "we have many zeroes" (or "we have > 4096 zeroes") part, but "zero is at the end of the string" part, so > "0$" would be the minimal pattern that also would work for OBSD. Thought about it a bit more. "^0{4096}$" checks if the byte after the buffer is \n or \0 in the hope of triggering a segfault. On Linux I can access that byte just fine; perhaps there is no guard page. Also there is a 2 in 256 chance of the byte being \n or \0 (provided its value is random), which would cause the test to falsely report success. "0$" effectively looks for "0\n" or "0\0", which can only occur after the buffer. If that string is found close enough then we may not trigger a segfault and report a false positive. In the face of unreliable segfaults we need to reverse our strategy, I think. Searching for something not in the buffer (e.g. "1") and considering matches and segfaults as confirmation that the bug is still present should avoid any false positives. Right? Thanks, René
Re: [PATCH] t4062: stop using repetition in regex
René Scharfewrites: > Am 09.08.2017 um 00:26 schrieb Junio C Hamano: >> ... but in the meantime, I think replacing the test with "0$" to >> force the scanner to find either the end of line or the end of the >> buffer may be a good workaround. We do not have to care how many of >> random bytes are in front of the last "0" in order to ensure that >> the regexec_buf() does not overstep to 4097th byte, while seeing >> that regexec() that does not know how long the haystack is has to do >> so, no? > > Our regexec() calls strlen() (see my other reply). > > Using "0$" looks like the best option to me. Yeah, it seems that way. If we want to be close/faithful to the original, we could do "^0*$", but the part that is essential to trigger the old bug is not the "we have many zeroes" (or "we have 4096 zeroes") part, but "zero is at the end of the string" part, so "0$" would be the minimal pattern that also would work for OBSD. Dscho?
Re: [PATCH] t4062: stop using repetition in regex
Am 09.08.2017 um 00:26 schrieb Junio C Hamano: > Junio C Hamanowrites: > >> So I find Dscho's concern quite valid, even though I do believe you >> when you say the code somehow segfaults. I just can not tell >> how/why it would segfault, though---it is possible that regexec() >> implementation is stupid and does not realize that it can return early >> reporting success without looking at the rest of the buffer, but >> somehow I find it unlikely. >> >> Puzzled. >> >>> You get different results? How is that possible? The search string is >>> NUL-terminated in each case, while the point of the test is that the >>> file contents isn't, right? > > Intellectual curiosity tells me we may want to find out why it > fails, but in the meantime, I think replacing the test with "0$" to > force the scanner to find either the end of line or the end of the > buffer may be a good workaround. We do not have to care how many of > random bytes are in front of the last "0" in order to ensure that > the regexec_buf() does not overstep to 4097th byte, while seeing > that regexec() that does not know how long the haystack is has to do > so, no? Our regexec() calls strlen() (see my other reply). Using "0$" looks like the best option to me. René
Re: [PATCH] t4062: stop using repetition in regex
Am 09.08.2017 um 00:09 schrieb Junio C Hamano: > René Scharfewrites: > >> Am 08.08.2017 um 16:49 schrieb Johannes Schindelin: >>> Hi René, >>> >>> On Tue, 8 Aug 2017, René Scharfe wrote: >>> OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255. That's the minimum acceptable value according to POSIX. In t4062 we use 4096 repetitions in the test "-G matches", though, causing it to fail. Do the same as the test "-S --pickaxe-regex" in the same file and search for a single zero instead. That still suffices to trigger the buffer overrun in older versions (checked with b7d36ffca02^ and --valgrind on Linux), simplifies the test a bit, and avoids exceeding OpenBSD's limit. >>> >>> I am afraid not. The 4096 is precisely the page size required to trigger >>> the bug on Windows against which this regression test tries to safeguard. >> >> Checked with b7d36ffca02^ on MinGW now as well and found that it >> segfaults with the proposed change ten out of ten times. > > That is a strange result but I can believe it. > > The reason why I find it strange is that the test wants to run > diff_grep() in diffcore-pickaxe.c with one == NULL (because we are > looking at difference between an initial empty commit and the > current commit that adds 4096-zeroes.txt file), which makes the > current blob (i.e. a page of '0' that may be mmap(2)ed without > trailing NUL to terminate it) scanned via regexec() to look for the > search string. > > I can understand why Dscho originally did "^0{4096}$"; it is to > ensure that the whole page is scanned for 4096 zeroes and then the > code would somehow make sure that there is no more byte until the > end of line, which will force regexec (but not regexec_buf that knows > where the buffer ends) to look at the 4097th byte that does not exist. > > If you change the pattern to just "0" that is not anchored, I'd expect > regexec() that does not know how big the haystack is to just find "0" > at the first byte and happily return without segfaulting (because it > does not even have to scan the remainder of the buffer). > > So I find Dscho's concern quite valid, even though I do believe you > when you say the code somehow segfaults. I just can not tell > how/why it would segfault, though---it is possible that regexec() > implementation is stupid and does not realize that it can return early > reporting success without looking at the rest of the buffer, but > somehow I find it unlikely. > > Puzzled. Good point. Valgrind reports: ==57466== Process terminating with default action of signal 11 (SIGSEGV): dumping core ==57466== Access not within mapped region at address 0x4027000 ==57466==at 0x4C2EDF4: strlen (vg_replace_strmem.c:458) ==57466==by 0x59D9F76: regexec@@GLIBC_2.3.4 (regexec.c:240) ==57466==by 0x54D96E: diff_grep (diffcore-pickaxe.c:0) ==57466==by 0x54DAC3: pickaxe_match (diffcore-pickaxe.c:149) And you can see in our version in compat/regex/regexec.c:241 that the first thing regexec() does is calling strlen(). So to avoid depending on that implementation detail we'd need to use a search string that won't be found (e.g. "1") or with unlimited repetition (e.g. "0*"), right? René
Re: [PATCH] t4062: stop using repetition in regex
Junio C Hamanowrites: > So I find Dscho's concern quite valid, even though I do believe you > when you say the code somehow segfaults. I just can not tell > how/why it would segfault, though---it is possible that regexec() > implementation is stupid and does not realize that it can return early > reporting success without looking at the rest of the buffer, but > somehow I find it unlikely. > > Puzzled. > >> You get different results? How is that possible? The search string is >> NUL-terminated in each case, while the point of the test is that the >> file contents isn't, right? Intellectual curiosity tells me we may want to find out why it fails, but in the meantime, I think replacing the test with "0$" to force the scanner to find either the end of line or the end of the buffer may be a good workaround. We do not have to care how many of random bytes are in front of the last "0" in order to ensure that the regexec_buf() does not overstep to 4097th byte, while seeing that regexec() that does not know how long the haystack is has to do so, no?
Re: [PATCH] t4062: stop using repetition in regex
René Scharfewrites: > Am 08.08.2017 um 16:49 schrieb Johannes Schindelin: >> Hi René, >> >> On Tue, 8 Aug 2017, René Scharfe wrote: >> >>> OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255. >>> That's the minimum acceptable value according to POSIX. In t4062 we use >>> 4096 repetitions in the test "-G matches", though, causing it to fail. >>> >>> Do the same as the test "-S --pickaxe-regex" in the same file and search >>> for a single zero instead. That still suffices to trigger the buffer >>> overrun in older versions (checked with b7d36ffca02^ and --valgrind on >>> Linux), simplifies the test a bit, and avoids exceeding OpenBSD's limit. >> >> I am afraid not. The 4096 is precisely the page size required to trigger >> the bug on Windows against which this regression test tries to safeguard. > > Checked with b7d36ffca02^ on MinGW now as well and found that it > segfaults with the proposed change ten out of ten times. That is a strange result but I can believe it. The reason why I find it strange is that the test wants to run diff_grep() in diffcore-pickaxe.c with one == NULL (because we are looking at difference between an initial empty commit and the current commit that adds 4096-zeroes.txt file), which makes the current blob (i.e. a page of '0' that may be mmap(2)ed without trailing NUL to terminate it) scanned via regexec() to look for the search string. I can understand why Dscho originally did "^0{4096}$"; it is to ensure that the whole page is scanned for 4096 zeroes and then the code would somehow make sure that there is no more byte until the end of line, which will force regexec (but not regexec_buf that knows where the buffer ends) to look at the 4097th byte that does not exist. If you change the pattern to just "0" that is not anchored, I'd expect regexec() that does not know how big the haystack is to just find "0" at the first byte and happily return without segfaulting (because it does not even have to scan the remainder of the buffer). So I find Dscho's concern quite valid, even though I do believe you when you say the code somehow segfaults. I just can not tell how/why it would segfault, though---it is possible that regexec() implementation is stupid and does not realize that it can return early reporting success without looking at the rest of the buffer, but somehow I find it unlikely. Puzzled. > You get different results? How is that possible? The search string is > NUL-terminated in each case, while the point of the test is that the > file contents isn't, right?
Re: [PATCH] t4062: stop using repetition in regex
Am 08.08.2017 um 16:49 schrieb Johannes Schindelin: > Hi René, > > On Tue, 8 Aug 2017, René Scharfe wrote: > >> OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255. >> That's the minimum acceptable value according to POSIX. In t4062 we use >> 4096 repetitions in the test "-G matches", though, causing it to fail. >> >> Do the same as the test "-S --pickaxe-regex" in the same file and search >> for a single zero instead. That still suffices to trigger the buffer >> overrun in older versions (checked with b7d36ffca02^ and --valgrind on >> Linux), simplifies the test a bit, and avoids exceeding OpenBSD's limit. > > I am afraid not. The 4096 is precisely the page size required to trigger > the bug on Windows against which this regression test tries to safeguard. Checked with b7d36ffca02^ on MinGW now as well and found that it segfaults with the proposed change ten out of ten times. You get different results? How is that possible? The search string is NUL-terminated in each case, while the point of the test is that the file contents isn't, right? > Maybe simply disable the test on OpenBSD instead? Or guard the {4096} > behind the MINGW prereq. It's easy to build a long search string with two repetitions or by using a longer string as the base, if necessary. But first we need to find out why regexec() doesn't overflow in your case. My build uses the version from compat/. Why would it stop before reaching a NUL? That sounds like a different and serious bug. Thanks, René
Re: [PATCH] t4062: stop using repetition in regex
Hi René, On Tue, 8 Aug 2017, René Scharfe wrote: > OpenBSD's regex library has a repetition limit (RE_DUP_MAX) of 255. > That's the minimum acceptable value according to POSIX. In t4062 we use > 4096 repetitions in the test "-G matches", though, causing it to fail. > > Do the same as the test "-S --pickaxe-regex" in the same file and search > for a single zero instead. That still suffices to trigger the buffer > overrun in older versions (checked with b7d36ffca02^ and --valgrind on > Linux), simplifies the test a bit, and avoids exceeding OpenBSD's limit. I am afraid not. The 4096 is precisely the page size required to trigger the bug on Windows against which this regression test tries to safeguard. Maybe simply disable the test on OpenBSD instead? Or guard the {4096} behind the MINGW prereq. Ciao, Dscho