[Bug libstdc++/85472] Regex match bug

timshen at gcc dot gnu.org Wed, 25 Apr 2018 11:36:09 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472


--- Comment #11 from Tim Shen <timshen at gcc dot gnu.org> ---
> The problem is that #4 has an earlier capture, making it impossible to see
> that it is left undefined later.

I wouldn't say it's impossible. libc++ implements it correctly at a cost.

Specifically, see
https://github.com/llvm-mirror/libcxx/blob/e54a22f809bc741760e39ac78444f57f06117652/include/regex#L5566.

> So it would be better to capture it as an
> empty string here:
> 
> In the example I gave, if the transitions are marked (ignoring #2)
>   1: {a ↦ {1#3, 2#3, 3#3$4}};
>   2: {b ↦ {2#4, 3#4}};
>   3: {c ↦ {3#5, 2#5, 1#5, ∅#2}.
> Then in the "ac" part, 'a' in state 1 goes to 3 and gets a #3 match 'a' plus
> an empty which I have written $4. That is the idea, though.

I'm feeling that this is going to a quadratic space complexity here. Does
"(((...((a_0)|(a_1)|(a_2)|...(a_n))...)))" if the regex expands to n
surrounding parenthesis, and n characters in the middle? Would this regex match
taking up to n^2 time, given a n-sized input string?

If you think it's time and space efficiently doable, maybe you can develop a
prototype first.

> I have though not implemented the counted repetitions A{m, n}, which I think
> usually are implemented as loops, so I do not know how that fits into the
> picture.

It might also be helpful to think about back-references like "(a+)\1".

[Bug libstdc++/85472] Regex match bug

Reply via email to