[Bug libstdc++/88947] regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88947 --- Comment #7 from Tomalak Geret'kal --- (In reply to Tim Shen from comment #5) > For the original test case, have you tried regex_match() with "what.*"? That behaves as I'd expect (http://quick-bench.com/AKdMnnhA03T1vwfN9sf53xlbD6s). > Do you have any non-trivial testcase in mind that is still unexpectedly slow > with regex_match()? The original real-world pattern that led to me discovering this was: /^[\x02-\x7f]\0..[\x01-\x0c]\0..\0\0/ Switching to regex_match() for that pattern also yields the expected result (http://quick-bench.com/g6lZj00gBswzvd-rjO7QwRE0Exg), so that's a good workaround here. But, adapting your earlier example to "(^abc|xyz)", this would require chaining a regex_match with a regex_search, which gets unwieldy quite quickly. Well, okay, I suppose in that example we could regex_match on "(?:(abc).*|.*(xyz).*)", but I really don't think we should have to rewrite patterns like this in order to get the behaviour that's common in other ecosystems' regex impls. So, although I'm open to being convinced otherwise, I still think we would reasonably expect regex_search to fail fast.
[Bug libstdc++/88947] regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88947 --- Comment #6 from Tim Shen --- (In reply to Tomalak Geret'kal from comment #4) > To be honest I'd expect this in less trivial circumstances too. If, at a > given stage of processing, the only possible paths towards a match all > require a prefix that's already been ruled out, that should be an immediate > return false. Thinking about this more, I think it's also easy to support the following case: regex_search(..., regex("{arbitrary_literal_string}...") where {arbitrary_literal_string} is a literal string, without other regex magic like "|" or "*". The literal prefix should be passed into a substring search. For implemention: The new algorithm for regex_search() would be: (1) prefix = find the longest deterministic prefix of the regex (2) pos = find the first occurance of the prefix in the target. If it fails, return false. (3) target = target[pos+prefix.size():] (4) try match target from start (as it's currently done) (5) if (4) fails, go to (2). (1) can be done at regex-compiling time. It'd require a new data member thus non-ABI compatible. (1) can also be done at matching time. Either way, we are looking for the longest sequence of literals from start. It stops with any of "*", "?", "(" or any other magic. (2) can be as plain as a substring search, but one might prefer the O(n) KMP algorithm if we happen to have one.
[Bug libstdc++/88947] regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88947 --- Comment #5 from Tim Shen --- (In reply to Tomalak Geret'kal from comment #4) > To be honest I'd expect this in less trivial circumstances too. If, at a > given stage of processing, the only possible paths towards a match all > require a prefix that's already been ruled out, that should be an immediate > return false. To the best of my knowledge this is commonly what happens in > regex engines (though again libstdc++ is far from alone in the C++ world in > not doing so!) For the original test case, have you tried regex_match() with "what.*"? Do you have any non-trivial testcase in mind that is still unexpectedly slow with regex_match()?
[Bug libstdc++/88947] regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88947 --- Comment #4 from Tomalak Geret'kal --- To be honest I'd expect this in less trivial circumstances too. If, at a given stage of processing, the only possible paths towards a match all require a prefix that's already been ruled out, that should be an immediate return false. To the best of my knowledge this is commonly what happens in regex engines (though again libstdc++ is far from alone in the C++ world in not doing so!)
[Bug libstdc++/88947] regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88947 --- Comment #3 from Tim Shen --- Thanks for reporting Tomalak. Yes, I agree that "match from the first character" should be expressable in the public interface, preferrably regex_search() with "^...". In fact, internally regex_search is implemented in terms of such semantic. As for other forms like "(^...)", or "(^abc|^xyz)", I'm sure we can go down the rabbit hole and support more of them, but IMO the gain is marginal if "^..." is supported.
[Bug libstdc++/88947] regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88947 Jonathan Wakely changed: What|Removed |Added Keywords||missed-optimization Status|UNCONFIRMED |NEW Last reconfirmed||2019-01-21 CC||timshen at gcc dot gnu.org Version|unknown |8.2.0 Ever confirmed|0 |1 Severity|normal |enhancement --- Comment #2 from Jonathan Wakely --- Confirmed for the latest release (and also present on trunk).
[Bug libstdc++/88947] regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88947 --- Comment #1 from Jonathan Wakely --- (In reply to Tomalak Geret'kal from comment #0) > (Apologies that I am not sufficiently familiar with libstdc++ version > history to select an appropriate version number for this bug.) Libstdc++ doesn't have version numbers and you don't need to know history. You're supposed to say what version of GCC you're reporting the bug against.