Re: git grep -P fatal: pcre_exec failed with error code -8
On Mon, Nov 06 2017, Jeff King jotted: > On Mon, Nov 06, 2017 at 12:50:45PM +0100, Ævar Arnfjörð Bjarmason wrote: > >> Some replies to the thread in general, didn't want to spread this out >> into different replies. >> >> * Yes this sucks. >> >> * Just emitting a warning without an appropriate exit code would suck >>more, would break batch jobs & whatnot that expcept certain results >>from grep. > > That was my first thought, too, but something that does: > > git grep foo && echo found > > would behave basically the same. Do you mean here scripts that actually > do: > > git grep foo > case "$?" in > 0) echo found ;; > 1) echo not found ;; > *) echo wtf? ;; > esac > > I agree it would be nice to at least have _some_ way to distinguish > between those final two cases. Maybe we should emit a different exit code, but I just had in mind that we could continue but the same non-zero as when the grep fails now, but with an additional warning. > Though something like "git log --grep" is even more complicated. We > perhaps _would_ want to distinguish between: > > - match (in which case we print the commit) > > - no match (in which case we do not) > > - error (in which case we do not print, but also mark the exit code as > failing) > >> * As you point out it would be nice to print out the file name we >>didn't match on, we'd need to pass the grep_source struct down >>further, it goes as far as grep_source_1 but stops there and isn't >>passed to e.g. look_ahead(), which calls patmatch() which calls the >>engine-specific matcher and would need to report the error. We could >>just do this, would slow down things a bit (probably trivally) but we >>could emit better error messages in genreal. > > I'm not sure if the grep_source has enough information for all cases. > E.g., if you hit an error while grepping in commit headers, you'd > probably want to mention the oid of the commit. There's an "identifier" > field in the grep_source, but it's opaque. > > The caller may also want to do more things than just print an error > (like the exit code adjustment I mentioned above). Which implies to me > we should be passing the error information up, not trying to bring the > context down. Indeed, I was just thinking of the case where we're grepping a file in the tree, not when the engine is used for --grep et al. >> * You can adjust these limts in PCRE in Git, although it doesn't help >>in this case, you just add (*LIMIT_MATCH=NUM) or >>(*LIMIT_RECURSION=NUM) (or both) to the start of the pattern. See >>pcresyntax(3) or pcre2syntax(3) man pages depending on what version >>you have installed. > > I saw that in the pcre manual, but I had the impression you can't > actually raise the limits above the defaults with that, only lower them. You can, you just can't set them to anything less conservative than limits already set via the C API, but we don't set any of those. I.e. PCRE allows you to say expose a regex field in a web form (as we do with gitweb) with really conservative settings that can't be overriden via a (*LIMIT) set in the pattern. But since we don't use that C API PCRE runs in a mode where the user gets set whatever limit they want in the pattern (or other pattern-altering switch), which makes sense for interactive git-grep use. >> * While regexec() won't return an error its version of dealing with >>this is (at least under glibc) to balloon CPU/memory use until the >>OOMkiller kills git (although not on this particular pattern). > > So in a sense our current behavior with pcre is the same. We just have > to provoke the death ourselves. ;) Indeed :)
Re: git grep -P fatal: pcre_exec failed with error code -8
On Mon, Nov 06, 2017 at 12:50:45PM +0100, Ævar Arnfjörð Bjarmason wrote: > Some replies to the thread in general, didn't want to spread this out > into different replies. > > * Yes this sucks. > > * Just emitting a warning without an appropriate exit code would suck >more, would break batch jobs & whatnot that expcept certain results >from grep. That was my first thought, too, but something that does: git grep foo && echo found would behave basically the same. Do you mean here scripts that actually do: git grep foo case "$?" in 0) echo found ;; 1) echo not found ;; *) echo wtf? ;; esac I agree it would be nice to at least have _some_ way to distinguish between those final two cases. Though something like "git log --grep" is even more complicated. We perhaps _would_ want to distinguish between: - match (in which case we print the commit) - no match (in which case we do not) - error (in which case we do not print, but also mark the exit code as failing) > * As you point out it would be nice to print out the file name we >didn't match on, we'd need to pass the grep_source struct down >further, it goes as far as grep_source_1 but stops there and isn't >passed to e.g. look_ahead(), which calls patmatch() which calls the >engine-specific matcher and would need to report the error. We could >just do this, would slow down things a bit (probably trivally) but we >could emit better error messages in genreal. I'm not sure if the grep_source has enough information for all cases. E.g., if you hit an error while grepping in commit headers, you'd probably want to mention the oid of the commit. There's an "identifier" field in the grep_source, but it's opaque. The caller may also want to do more things than just print an error (like the exit code adjustment I mentioned above). Which implies to me we should be passing the error information up, not trying to bring the context down. > * You can adjust these limts in PCRE in Git, although it doesn't help >in this case, you just add (*LIMIT_MATCH=NUM) or >(*LIMIT_RECURSION=NUM) (or both) to the start of the pattern. See >pcresyntax(3) or pcre2syntax(3) man pages depending on what version >you have installed. I saw that in the pcre manual, but I had the impression you can't actually raise the limits above the defaults with that, only lower them. > * While regexec() won't return an error its version of dealing with >this is (at least under glibc) to balloon CPU/memory use until the >OOMkiller kills git (although not on this particular pattern). So in a sense our current behavior with pcre is the same. We just have to provoke the death ourselves. ;) -Peff
Re: git grep -P fatal: pcre_exec failed with error code -8
On Mon, Nov 06 2017, Jeff King jotted: > On Sun, Nov 05, 2017 at 10:41:17AM +0100, Дилян Палаузов wrote: > >> I understand that the PCRE's stack can get exhausted for some files, but in >> such cases, git grep shall proceed with the other files, and print at the >> end/stderr for which files the pattern was not applied. Such behaviour >> would be more usefull than the current one. > > Yes, I had a similar thought. It does feel a little funny for us to > basically treat an error as "no match" for non-interactive use, but then > the current behavior works out to be more or less the same (we return an > error code which most shell scripts would interpret as failure). > > IOW, I think something like this is probably the right direction: > > diff --git a/grep.c b/grep.c > index ce6a48e634..2c152e5908 100644 > --- a/grep.c > +++ b/grep.c > @@ -427,7 +427,7 @@ static int pcre1match(struct grep_pat *p, const char > *line, const char *eol, > } > > if (ret < 0 && ret != PCRE_ERROR_NOMATCH) > - die("pcre_exec failed with error code %d", ret); > + warning("pcre_exec failed with error code %d", ret); > if (ret > 0) { > ret = 0; > match->rm_so = ovector[0]; > > but possibly: > > 1. It would be nice to report the filename that we couldn't match on. > But we don't know it at this level of the code (and it might not be > a file at all that we are matching). So probably we'd want to pass > the error much further up the call stack. This is tricky as there > are multiple regex libraries we can use, and the return value gets > normalized to 1/0 for hit/not-hit long before we get as far as > something that knows the filename. > > We might need to do something invasive like adding an extra > parameter to hold the error message, and passing it through the > whole stack. > > 2. We should still try to exit with an exit code other than "1" to > indicate we hit an error besides "no lines were found". > > 3. Other regex libraries might need similar treatment. Probably > pcre2match() needs it. It doesn't look like regexec() can ever > return an error besides REG_NOMATCH. > > -Peff Some replies to the thread in general, didn't want to spread this out into different replies. * Yes this sucks. * Just emitting a warning without an appropriate exit code would suck more, would break batch jobs & whatnot that expcept certain results from grep. * As you point out it would be nice to print out the file name we didn't match on, we'd need to pass the grep_source struct down further, it goes as far as grep_source_1 but stops there and isn't passed to e.g. look_ahead(), which calls patmatch() which calls the engine-specific matcher and would need to report the error. We could just do this, would slow down things a bit (probably trivally) but we could emit better error messages in genreal. * You can adjust these limts in PCRE in Git, although it doesn't help in this case, you just add (*LIMIT_MATCH=NUM) or (*LIMIT_RECURSION=NUM) (or both) to the start of the pattern. See pcresyntax(3) or pcre2syntax(3) man pages depending on what version you have installed. * While regexec() won't return an error its version of dealing with this is (at least under glibc) to balloon CPU/memory use until the OOMkiller kills git (although not on this particular pattern).
Re: git grep -P fatal: pcre_exec failed with error code -8
On Sun, Nov 05, 2017 at 10:41:17AM +0100, Дилян Палаузов wrote: > I understand that the PCRE's stack can get exhausted for some files, but in > such cases, git grep shall proceed with the other files, and print at the > end/stderr for which files the pattern was not applied. Such behaviour > would be more usefull than the current one. Yes, I had a similar thought. It does feel a little funny for us to basically treat an error as "no match" for non-interactive use, but then the current behavior works out to be more or less the same (we return an error code which most shell scripts would interpret as failure). IOW, I think something like this is probably the right direction: diff --git a/grep.c b/grep.c index ce6a48e634..2c152e5908 100644 --- a/grep.c +++ b/grep.c @@ -427,7 +427,7 @@ static int pcre1match(struct grep_pat *p, const char *line, const char *eol, } if (ret < 0 && ret != PCRE_ERROR_NOMATCH) - die("pcre_exec failed with error code %d", ret); + warning("pcre_exec failed with error code %d", ret); if (ret > 0) { ret = 0; match->rm_so = ovector[0]; but possibly: 1. It would be nice to report the filename that we couldn't match on. But we don't know it at this level of the code (and it might not be a file at all that we are matching). So probably we'd want to pass the error much further up the call stack. This is tricky as there are multiple regex libraries we can use, and the return value gets normalized to 1/0 for hit/not-hit long before we get as far as something that knows the filename. We might need to do something invasive like adding an extra parameter to hold the error message, and passing it through the whole stack. 2. We should still try to exit with an exit code other than "1" to indicate we hit an error besides "no lines were found". 3. Other regex libraries might need similar treatment. Probably pcre2match() needs it. It doesn't look like regexec() can ever return an error besides REG_NOMATCH. -Peff
Re: git grep -P fatal: pcre_exec failed with error code -8
Hello, thanks for your answer. I understand that the PCRE's stack can get exhausted for some files, but in such cases, git grep shall proceed with the other files, and print at the end/stderr for which files the pattern was not applied. Such behaviour would be more usefull than the current one. Regards Dilian On 11/05/2017 03:16 AM, Jeff King wrote: On Sun, Nov 05, 2017 at 01:06:21AM +0100, Дилян Палаузов wrote: with git 2.14.3 linked with libpcre.so.1.2.9 when I do: git clone https://github.com/django/django cd django git grep -P "if.*([^\s])+\s+and\s+\1" django/contrib/admin/static/admin/js/vendor/select2/select2.full.min.js the output is: fatal: pcre_exec failed with error code -8 Code -8 is PCRE_ERROR_MATCHLIMIT. And "man pcreapi" has this to say: The match_limit field provides a means of preventing PCRE from using up a vast amount of resources when running patterns that are not going to match, but which have a very large number of possibilities in their search trees. The classic example is a pattern that uses nested unlimited repeats. Internally, pcre_exec() uses a function called match(), which it calls repeatedly (sometimes recursively). The limit set by match_limit is imposed on the number of times this function is called during a match, which has the effect of limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts from zero for each posi‐ tion in the subject string. When pcre_exec() is called with a pattern that was successfully studied with a JIT option, the way that the matching is exe‐ cuted is entirely different. However, there is still the pos‐ sibility of runaway matching that goes on for a very long time, and so the match_limit value is also used in this case (but in a different way) to limit how long the matching can continue. The default value for the limit can be set when PCRE is built; the default default is 10 million, which handles all but the most extreme cases. You can override the default by suppling pcre_exec() with a pcre_extra block in which match_limit is set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCH‐ LIMIT. So your pattern is just really expensive and is running afoul of pcre's backtracking limits (and it's not helped by the fact that the file is basically one giant line). There's no way to ask Git to specify a larger match_limit to pcre, but you might be able to construct your pattern in a way that involves less backtracking. It looks like you're trying to find things like "if foo and foo"? Should the captured term actually be "([^\s]+)" (with the "+" on the _inside_ of the capture? Or maybe I'm just misunderstanding your goal. -Peff
Re: git grep -P fatal: pcre_exec failed with error code -8
On Sun, Nov 05, 2017 at 01:06:21AM +0100, Дилян Палаузов wrote: > with git 2.14.3 linked with libpcre.so.1.2.9 when I do: > git clone https://github.com/django/django > cd django > git grep -P "if.*([^\s])+\s+and\s+\1" > django/contrib/admin/static/admin/js/vendor/select2/select2.full.min.js > the output is: > fatal: pcre_exec failed with error code -8 Code -8 is PCRE_ERROR_MATCHLIMIT. And "man pcreapi" has this to say: The match_limit field provides a means of preventing PCRE from using up a vast amount of resources when running patterns that are not going to match, but which have a very large number of possibilities in their search trees. The classic example is a pattern that uses nested unlimited repeats. Internally, pcre_exec() uses a function called match(), which it calls repeatedly (sometimes recursively). The limit set by match_limit is imposed on the number of times this function is called during a match, which has the effect of limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts from zero for each posi‐ tion in the subject string. When pcre_exec() is called with a pattern that was successfully studied with a JIT option, the way that the matching is exe‐ cuted is entirely different. However, there is still the pos‐ sibility of runaway matching that goes on for a very long time, and so the match_limit value is also used in this case (but in a different way) to limit how long the matching can continue. The default value for the limit can be set when PCRE is built; the default default is 10 million, which handles all but the most extreme cases. You can override the default by suppling pcre_exec() with a pcre_extra block in which match_limit is set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCH‐ LIMIT. So your pattern is just really expensive and is running afoul of pcre's backtracking limits (and it's not helped by the fact that the file is basically one giant line). There's no way to ask Git to specify a larger match_limit to pcre, but you might be able to construct your pattern in a way that involves less backtracking. It looks like you're trying to find things like "if foo and foo"? Should the captured term actually be "([^\s]+)" (with the "+" on the _inside_ of the capture? Or maybe I'm just misunderstanding your goal. -Peff
git grep -P fatal: pcre_exec failed with error code -8
Hello, with git 2.14.3 linked with libpcre.so.1.2.9 when I do: git clone https://github.com/django/django cd django git grep -P "if.*([^\s])+\s+and\s+\1" django/contrib/admin/static/admin/js/vendor/select2/select2.full.min.js the output is: fatal: pcre_exec failed with error code -8 (But not with git clone https://github.com/select2/select2 cd select2 git grep -P "if.*([^\s])+\s+and\s+\1" dist/js/select2.full.min.js as the two select2.full.min.js files differ slightly in their size) Any ideas? Regards Dilian