[Bug libstdc++/88947] regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor

2019-01-22 Thread tom at kera dot name
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88947

--- Comment #7 from Tomalak Geret'kal  ---
(In reply to Tim Shen from comment #5)
> For the original test case, have you tried regex_match() with "what.*"?

That behaves as I'd expect
(http://quick-bench.com/AKdMnnhA03T1vwfN9sf53xlbD6s).

> Do you have any non-trivial testcase in mind that is still unexpectedly slow
> with regex_match()?

The original real-world pattern that led to me discovering this was:

/^[\x02-\x7f]\0..[\x01-\x0c]\0..\0\0/

Switching to regex_match() for that pattern also yields the expected result
(http://quick-bench.com/g6lZj00gBswzvd-rjO7QwRE0Exg), so that's a good
workaround here.

But, adapting your earlier example to "(^abc|xyz)", this would require chaining
a regex_match with a regex_search, which gets unwieldy quite quickly.

Well, okay, I suppose in that example we could regex_match on
"(?:(abc).*|.*(xyz).*)", but I really don't think we should have to rewrite
patterns like this in order to get the behaviour that's common in other
ecosystems' regex impls. So, although I'm open to being convinced otherwise, I
still think we would reasonably expect regex_search to fail fast.

[Bug libstdc++/88947] regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor

2019-01-22 Thread timshen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88947

--- Comment #6 from Tim Shen  ---
(In reply to Tomalak Geret'kal from comment #4)
> To be honest I'd expect this in less trivial circumstances too. If, at a
> given stage of processing, the only possible paths towards a match all
> require a prefix that's already been ruled out, that should be an immediate
> return false.

Thinking about this more, I think it's also easy to support the following case:
regex_search(..., regex("{arbitrary_literal_string}...")
where {arbitrary_literal_string} is a literal string, without other regex magic
like "|" or "*". The literal prefix should be passed into a substring search.

For implemention:
The new algorithm for regex_search() would be:
(1) prefix = find the longest deterministic prefix of the regex
(2) pos = find the first occurance of the prefix in the target. If it fails,
return false.
(3) target = target[pos+prefix.size():]
(4) try match target from start (as it's currently done)
(5) if (4) fails, go to (2).

(1) can be done at regex-compiling time. It'd require a new data member thus
non-ABI compatible. (1) can also be done at matching time. Either way, we are
looking for the longest sequence of literals from start. It stops with any of
"*", "?", "(" or any other magic.

(2) can be as plain as a substring search, but one might prefer the O(n) KMP
algorithm if we happen to have one.

[Bug libstdc++/88947] regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor

2019-01-22 Thread timshen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88947

--- Comment #5 from Tim Shen  ---
(In reply to Tomalak Geret'kal from comment #4)
> To be honest I'd expect this in less trivial circumstances too. If, at a
> given stage of processing, the only possible paths towards a match all
> require a prefix that's already been ruled out, that should be an immediate
> return false. To the best of my knowledge this is commonly what happens in
> regex engines (though again libstdc++ is far from alone in the C++ world in
> not doing so!)

For the original test case, have you tried regex_match() with "what.*"?

Do you have any non-trivial testcase in mind that is still unexpectedly slow
with regex_match()?

[Bug libstdc++/88947] regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor

2019-01-22 Thread tom at kera dot name
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88947

--- Comment #4 from Tomalak Geret'kal  ---
To be honest I'd expect this in less trivial circumstances too. If, at a given
stage of processing, the only possible paths towards a match all require a
prefix that's already been ruled out, that should be an immediate return false.
To the best of my knowledge this is commonly what happens in regex engines
(though again libstdc++ is far from alone in the C++ world in not doing so!)

[Bug libstdc++/88947] regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor

2019-01-21 Thread timshen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88947

--- Comment #3 from Tim Shen  ---
Thanks for reporting Tomalak.

Yes, I agree that "match from the first character" should be expressable in the
public interface, preferrably regex_search() with "^...". In fact, internally
regex_search is implemented in terms of such semantic.

As for other forms like "(^...)", or "(^abc|^xyz)", I'm sure we can go down the
rabbit hole and support more of them, but IMO the gain is marginal if "^..." is
supported.

[Bug libstdc++/88947] regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor

2019-01-21 Thread redi at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88947

Jonathan Wakely  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2019-01-21
 CC||timshen at gcc dot gnu.org
Version|unknown |8.2.0
 Ever confirmed|0   |1
   Severity|normal  |enhancement

--- Comment #2 from Jonathan Wakely  ---
Confirmed for the latest release (and also present on trunk).

[Bug libstdc++/88947] regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor

2019-01-21 Thread redi at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88947

--- Comment #1 from Jonathan Wakely  ---
(In reply to Tomalak Geret'kal from comment #0)
> (Apologies that I am not sufficiently familiar with libstdc++ version
> history to select an appropriate version number for this bug.)

Libstdc++ doesn't have version numbers and you don't need to know history.
You're supposed to say what version of GCC you're reporting the bug against.