[Bug libstdc++/85472] Regex match bug

haberg-1 at telia dot com Wed, 25 Apr 2018 11:43:23 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472


--- Comment #12 from Hans Åberg <haberg-1 at telia dot com> ---
(In reply to Tim Shen from comment #11)
> > The problem is that #4 has an earlier capture, making it impossible to see
> > that it is left undefined later.
> 
> I wouldn't say it's impossible. libc++ implements it correctly at a cost.
>
> Specifically, see
> https://github.com/llvm-mirror/libcxx/blob/
> e54a22f809bc741760e39ac78444f57f06117652/include/regex#L5566. 

I meant the way I do it now.

> > So it would be better to capture it as an
> > empty string here:
> > 
> > In the example I gave, if the transitions are marked (ignoring #2)
> >   1: {a ↦ {1#3, 2#3, 3#3$4}};
> >   2: {b ↦ {2#4, 3#4}};
> >   3: {c ↦ {3#5, 2#5, 1#5, ∅#2}.
> > Then in the "ac" part, 'a' in state 1 goes to 3 and gets a #3 match 'a' plus
> > an empty which I have written $4. That is the idea, though.
> 
> I'm feeling that this is going to a quadratic space complexity here. Does
> "(((...((a_0)|(a_1)|(a_2)|...(a_n))...)))" if the regex expands to n
> surrounding parenthesis, and n characters in the middle? Would this regex
> match taking up to n^2 time, given a n-sized input string?

It is all linear, in one traverse, just pick down the action form the
transitions instead of the states. Both are computed when checking the forward
transition for a character: I just record them.

> If you think it's time and space efficiently doable, maybe you can develop a
> prototype first.

Yes. I will have to think about it, and report back here.

> > I have though not implemented the counted repetitions A{m, n}, which I think
> > usually are implemented as loops, so I do not know how that fits into the
> > picture.
> 
> It might also be helpful to think about back-references like "(a+)\1".

What's this?

[Bug libstdc++/85472] Regex match bug

Reply via email to