[Bug libstdc++/85472] Regex match bug

2018-05-02 Thread haberg-1 at telia dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472

--- Comment #22 from Hans Åberg  ---
(In reply to Tim Shen from comment #16)
> ...I meant to observe whether the example triggers a O(k^2) behavior. It
> should be less than quadratic. ...

> Here is the updated example, to re-arrange the expression to a tree, not a
> chain of "|" operators.

The matching itself that I do is linear O(k): I used a version of your example
for my program, timing the part in question for various values, dividing with N
and N^2, where N is the difference of the input numbers, making it easy to see
if it is O(k) or O(k^2).

However, in your example, the DFA lexing is quadratic, probably because I use
unordered_set objects that here gets size proportional to N. Thus this ought to
be optimized. So you might check if that might be an issue in the C++ library.

[Bug libstdc++/85472] Regex match bug

2018-05-01 Thread haberg-1 at telia dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472

--- Comment #21 from Hans Åberg  ---
(In reply to Tim Shen from comment #20)
> (In reply to Hans Åberg from comment #18)
> > (In reply to Tim Shen from comment #14)
> > > I have a C++ test case for this:
> > 
> > I get segmentation fault on the regex_match line with g++ 7.3.0.
> 
> On Linux I set "ulimit -s unlimited".

On MacOS, I can change the value, default 8192, but not to unlimited.

> The stack overflow is a known issue
> because the implementation is recursive.

I have a loop, as in Flex and Bison generated parsers.

[Bug libstdc++/85472] Regex match bug

2018-05-01 Thread haberg-1 at telia dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472

--- Comment #19 from Hans Åberg  ---
(In reply to Tim Shen from comment #16)
> I set (kStart, kEnd) to be (1000, 2000), (1000, 3000), ..., (1000, 9000) to
> see how the time grows. ...

> Here are my numbers for end ranging 2000 to 9000, in ms:
> [79,150,262,427,620,868,1078,1310].

I get segmentation fault beyond 4000, but for this value, it takes 1.130 s, so
your computer is about 4 times faster it seems.

[Bug libstdc++/85472] Regex match bug

2018-05-01 Thread haberg-1 at telia dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472

--- Comment #18 from Hans Åberg  ---
(In reply to Tim Shen from comment #14)
> I have a C++ test case for this:

I get segmentation fault on the regex_match line with g++ 7.3.0.

[Bug libstdc++/85472] Regex match bug

2018-05-01 Thread haberg-1 at telia dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472

--- Comment #17 from Hans Åberg  ---
(In reply to Tim Shen from comment #16)
> (In reply to Hans Åberg from comment #15)
> > (In reply to Tim Shen from comment #14)
> > > How fast does your prototype run on the following example?
> > >   ((10)|(11)|(12)|...|(98)|(99))* matching "10111213141516...9899"
> > > how about this:
> > >   ((100)|(101)|(102)|...|(998)|(999))* matching "100101102...998999"
> > > 
> > > where "..." are the skipped strings that follow the surrounding pattern.
> > 
> > My program does not choke on limit cases, like the regex '(a?)^k a^k' on the
> > string string "a"^k, mentioned at [1]; for k = 100, it takes 0.393 s on the
> > NFA and 0.168 on the DFA on a not so powerful computer. Otherwise, it is not
> > optimized and fast in absolute terms: I use standard C++ containers such as
> > unordered_set for DFA state lookups, and not a number.
> > 
> > 1. https://swtch.com/~rsc/regexp/regexp1.html
> 
> Sorry, I meant to observe whether the example triggers a O(k^2) behavior. It
> should be less than quadratic. To measure this, you don't have to compare
> the time of your prototype against libstdc++.

I'll have to get back on this. But it shouldn't choke:

During the lexing, I just record the DFA states as function of the string
positions. When a final state appears, the reverse NFA data can be used to
compute the NFA states this match actually uses, and therefore, as the
characters are known, also the NFA transitions, and any data that depends on
it.

[Bug libstdc++/85472] Regex match bug

2018-05-01 Thread haberg-1 at telia dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472

--- Comment #15 from Hans Åberg  ---
(In reply to Tim Shen from comment #14)
> How fast does your prototype run on the following example?
>   ((10)|(11)|(12)|...|(98)|(99))* matching "10111213141516...9899"
> how about this:
>   ((100)|(101)|(102)|...|(998)|(999))* matching "100101102...998999"
> 
> where "..." are the skipped strings that follow the surrounding pattern.

My program does not choke on limit cases, like the regex '(a?)^k a^k' on the
string string "a"^k, mentioned at [1]; for k = 100, it takes 0.393 s on the NFA
and 0.168 on the DFA on a not so powerful computer. Otherwise, it is not
optimized and fast in absolute terms: I use standard C++ containers such as
unordered_set for DFA state lookups, and not a number.

1. https://swtch.com/~rsc/regexp/regexp1.html

[Bug libstdc++/85472] Regex match bug

2018-05-01 Thread haberg-1 at telia dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472

--- Comment #13 from Hans Åberg  ---
(In reply to Tim Shen from comment #11)
> If you think it's time and space efficiently doable, maybe you can develop a
> prototype first.

I put action number lists on the NFA transitions (no markup of the states). But
then the things I described before work.

Theoretically it may be simplest to start with a NFA with empty (ε)
transitions: For the sub-automatons one is interested in to find matches for,
put a unique action number on the start and final state transitions, marked
begin resp. end match. When compacted to a NFA without empty transitions, which
is what I use, they become lists.

I use bra-ket notation  to indicate the action number k begin-end match
of the automaton A; parentheses indicate grouping. Then the original example
can be written
  <1|z|1>(<2|<3|a*|3><4|b*|4><5|c|5>|2>)*

My program then produces the following result, first the NFA, and then the
matches:

ex = (ε ↦ {<1|0}, {
  0: {z ↦ {|1><2|<3||3><4||4><5|3, |1><2|<3||3><4|2, |1><2|<3|1, |1>∅}}
  1: {a ↦ {1, |3><4|2, |3><4||4><5|3}}
  2: {b ↦ {2, |4><5|3}}
  3: {c ↦ {|5>|2><2|<3||3><4||4><5|3, |5>|2><2|<3||3><4|2, |5>|2><2|<3|1,
|5>|2>∅}}}, {})

> zaacbbbcac
Full string is a valid match!
Matches: {zaacbbbcac, [@1@4@8@10];
<1|z|1>
<1|z|1> <2|<3|aa|3><4||4><5|c|5>|2>
<1|z|1> <2|<3|aa|3><4||4><5|c|5>|2> <2|<3||3><4|bbb|4><5|c|5>|2>
<1|z|1> <2|<3|aa|3><4||4><5|c|5>|2> <2|<3||3><4|bbb|4><5|c|5>|2>
<2|<3|a|3><4||4><5|c|5>|2>
}

In the first NFA description, transition actions are written after the action
that one pass over. So, for example, the original start state ε ↦ {<1|0} means
that one opens the match <1| to arrive at state 1; then all possibilities in
state 1 closes it with |1>, then opening some other matches.

In the matches description, after the matched string follows, the final state
match string lengths follows, so that eventual empty matches get number 0. The
longest match last then have at the end the sequence
  <2|<3|a|3><4||4><5|c|5>|2>
which is what is asked for: #2 "ac, #3 "a", #4 "", #5 "c", in addition to # 1
"z" first.

[Bug libstdc++/85472] Regex match bug

2018-04-25 Thread haberg-1 at telia dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472

--- Comment #12 from Hans Åberg  ---
(In reply to Tim Shen from comment #11)
> > The problem is that #4 has an earlier capture, making it impossible to see
> > that it is left undefined later.
> 
> I wouldn't say it's impossible. libc++ implements it correctly at a cost.
>
> Specifically, see
> https://github.com/llvm-mirror/libcxx/blob/
> e54a22f809bc741760e39ac78444f57f06117652/include/regex#L5566. 

I meant the way I do it now.

> > So it would be better to capture it as an
> > empty string here:
> > 
> > In the example I gave, if the transitions are marked (ignoring #2)
> >   1: {a ↦ {1#3, 2#3, 3#3$4}};
> >   2: {b ↦ {2#4, 3#4}};
> >   3: {c ↦ {3#5, 2#5, 1#5, ∅#2}.
> > Then in the "ac" part, 'a' in state 1 goes to 3 and gets a #3 match 'a' plus
> > an empty which I have written $4. That is the idea, though.
> 
> I'm feeling that this is going to a quadratic space complexity here. Does
> "(((...((a_0)|(a_1)|(a_2)|...(a_n))...)))" if the regex expands to n
> surrounding parenthesis, and n characters in the middle? Would this regex
> match taking up to n^2 time, given a n-sized input string?

It is all linear, in one traverse, just pick down the action form the
transitions instead of the states. Both are computed when checking the forward
transition for a character: I just record them.

> If you think it's time and space efficiently doable, maybe you can develop a
> prototype first.

Yes. I will have to think about it, and report back here.

> > I have though not implemented the counted repetitions A{m, n}, which I think
> > usually are implemented as loops, so I do not know how that fits into the
> > picture.
> 
> It might also be helpful to think about back-references like "(a+)\1".

What's this?

[Bug libstdc++/85472] Regex match bug

2018-04-25 Thread haberg-1 at telia dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472

--- Comment #10 from Hans Åberg  ---
(In reply to Tim Shen from comment #9)
> Ah with the example it's clear, thanks!

You are welcome.

> > The last line gives for #1 the sub-string "z" , and for #2 "aacbbbcac".
> 
> This is not what ECMAScript produces either. for capture #2, ECMAScriptn
> produces "ac", the last match of the loop.
> 
> Think about the difference between
> 
>   (z)((a+)?(b+)?(c))*
> 
> and
> 
>   (z)((?:(a+)?(b+)?(c))*)
> 
> Your #2 seems to capture the second case, which is different.

Yes, I thought about this difference in an earlier version that marked the
start and last states only, but switched to this, because I wanted to ignore
empty matches:

I think though that one might mark the transitions with action numbers and
capture it that way (see below).

> > For
> > #3 "a", and for #5 "c". But #4 is missing, indication there is no match. So
> > there might be problem here, as there are earlier matches:
> > 
> > Perhaps the intent is that it should be implemented as a loop, only
> > retaining the last #4, 
> 
> That's what the implementations (boost, libstdc++, python) actually do.
> That's not ECMAScript's intention. ECMAScript's intention is to leave #4
> undefined (*not* retaining the last non-empty #4), as in the last iteration
> of the loop, #4 (b+)? doesn't match any sub-string.

The problem is that #4 has an earlier capture, making it impossible to see that
it is left undefined later. So it would be better to capture it as an empty
string here:

In the example I gave, if the transitions are marked (ignoring #2)
  1: {a ↦ {1#3, 2#3, 3#3$4}};
  2: {b ↦ {2#4, 3#4}};
  3: {c ↦ {3#5, 2#5, 1#5, ∅#2}.
Then in the "ac" part, 'a' in state 1 goes to 3 and gets a #3 match 'a' plus an
empty which I have written $4. That is the idea, though.

> > but the problem is that that is not what the
> > underlying theory says.
> 
> I'm not sure if there is any theory around caputring groups. If we are about
> to create one, be aware that there are multiple plausible definitions.

I have though not implemented the counted repetitions A{m, n}, which I think
usually are implemented as loops, so I do not know how that fits into the
picture.

[Bug libstdc++/85472] Regex match bug

2018-04-25 Thread haberg-1 at telia dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472

--- Comment #8 from Hans Åberg  ---
(In reply to Tim Shen from comment #5)
> I'm not following the meaning of "action number" and "the partial reverse
> NFA is recorded".
> 
> How many actions numbers are recorded? for regex_match(s, regex(re)), is it
> on the order of O(s.size()), or O(re.size())? The goal is to allocate
> O(re.size()) of memory exactly once.

Let's check the OP example, to see if it is useful, which will also explain the
action numbers. So for regex '(z)((a+)?(b+)?(c))*', I used '(z)(a*b*(c))*' with
action numbers same as in the C++ library numbering. Here they are written out
with # in the NFA:
  ex = ({0}, {0, 3},
{0: {z ↦ {1, ∅}}[#1];1: {a ↦ {1, 2, 3}}[#2#3];
 2: {b ↦ {2, 3}}[#2#4];  3: {c ↦ {3, 2, 1, ∅}}[#2#5]}, {})
with start set {0}, and set of last (pointing to a final state ∅) states {0,
3}. (The last {} is the not yet computed DFA.)

Then lexing:

> zaacbbbcac
Full string is a valid match!
Matches: {zaacbbbcac, [@0@3@7@9]; 
[#1]
[#1][#2#3][#2#3][#2#5]
[#1][#2#3][#2#3][#2#5][#2#4][#2#4][#2#4][#2#5]
[#1][#2#3][#2#3][#2#5][#2#4][#2#4][#2#4][#2#5][#2#3][#2#5]
}
Time: 0.000555 s.

Each line gives the action numbers for the string positions for the successive
strings z, zaac, zaacbbbc, zaacbbbcac with final string positions @0, @3, @7,
@9.

The last line gives for #1 the sub-string "z" , and for #2 "aacbbbcac". For #3
"a", and for #5 "c". But #4 is missing, indication there is no match. So there
might be problem here, as there are earlier matches:

Perhaps the intent is that it should be implemented as a loop, only retaining
the last #4, but the problem is that that is not what the underlying theory
says.

[Bug libstdc++/85472] Regex match bug

2018-04-25 Thread haberg-1 at telia dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472

--- Comment #7 from Hans Åberg  ---
(In reply to Tim Shen from comment #5)
> (In reply to Hans Åberg from comment #4)
> > I wrote an NFA/DFA that computes all matches, it seems, in an efficient
> > manner: Action numbers are marked on the states of the sub-NFAs one is
> > interested in. While lexing a string, the partial reverse NFA is recorded,
> > which is searched when a final state is encountered. The string position
> > action numbers, as marked on the corresponding states, are recorded, and for
> > each action number, the contiguous string sequences are the possible
> > submatches.
> 
> I'm not following the meaning of "action number"...

Arbitrary numbers used identify the sub-automatons; in the C++ library, the
number of left parentheses in the regex expression from which it derives. 

> ...and "the partial reverse
> NFA is recorded".

The lexing states are sets of NFA states, which are also the DFA states. If a
lexing character c in the state p = {p_0, ...} has transition p_i -> {q_0,
...}, the reverse state q_0 -> {r_0, ...} gets p_i added at this position in
the string. This is to record the states which c actually uses.

> How many actions numbers are recorded?

They are sets, both in the states and the string matches. Each final state
position in the string produces a separate match, containing information about
the submatches via the action numbers.

> for regex_match(s, regex(re)), is it
> on the order of O(s.size()), or O(re.size())? The goal is to allocate
> O(re.size()) of memory exactly once.

The action numbers are just stamped onto the NFA states when it is created, and
then carried along. So it is the same as the number of states in the NFA. When
a string is lexed, one just gets the action numbers for each position that
produced a character match.

[Bug libstdc++/85472] Regex match bug

2018-04-25 Thread haberg-1 at telia dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472

--- Comment #6 from Hans Åberg  ---
(In reply to Tim Shen from comment #5)
> (In reply to Hans Åberg from comment #4)
> > I wrote an NFA/DFA that computes all matches, it seems, in an efficient
> > manner: Action numbers are marked on the states of the sub-NFAs one is
> > interested in. While lexing a string, the partial reverse NFA is recorded,
> > which is searched when a final state is encountered. The string position
> > action numbers, as marked on the corresponding states, are recorded, and for
> > each action number, the contiguous string sequences are the possible
> > submatches.
> 
> I'm not following the meaning of "action number"...

Arbitrary numbers used identify the sub-automatons; in the C++ library, the
number of left parentheses in the regex expression from which it derives. 

> ...and "the partial reverse
> NFA is recorded".

The lexing states are sets of NFA states, which are also the DFA states. If a
lexing character c in the state p = {p_0, ...} has transition p_i -> {q_0,
...}, the reverse state q_0 -> {r_0, ...} gets p_i added at this position in
the string. This is to record the states which c actually uses.

> How many actions numbers are recorded?

They are sets, both in the states and the string matches. Each final state
position in the string produces a separate match, containing information about
the submatches via the action numbers.

> for regex_match(s, regex(re)), is it
> on the order of O(s.size()), or O(re.size())? The goal is to allocate
> O(re.size()) of memory exactly once.

The action numbers are just stamped onto the NFA states when it is created, and
then carried along. So it is the same as the number of states in the NFA. When
a string is lexed, one just gets the action numbers for each position that
produced a character match.

[Bug libstdc++/85472] Regex match bug

2018-04-25 Thread haberg-1 at telia dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472

--- Comment #4 from Hans Åberg  ---
(In reply to Tim Shen from comment #1)
> I know exactly how libc++ produces this result. It creates an empty
> match_result during each repetition ("*" operator). It's less confusing but
> much slower.

I wrote an NFA/DFA that computes all matches, it seems, in an efficient manner:
Action numbers are marked on the states of the sub-NFAs one is interested in.
While lexing a string, the partial reverse NFA is recorded, which is searched
when a final state is encountered. The string position action numbers, as
marked on the corresponding states, are recorded, and for each action number,
the contiguous string sequences are the possible submatches.

[Bug c++/85472] New: Regex match bug

2018-04-19 Thread haberg-1 at telia dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472

Bug ID: 85472
   Summary: Regex match bug
   Product: gcc
   Version: 7.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: haberg-1 at telia dot com
  Target Milestone: ---

Created attachment 43993
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43993=edit
The regex example mentioned in the report.

In the regex "(z)((a+)?(b+)?(c))*" example at [1], also attached, applied to
the string "zaacbbbcac", g++ produces a pattern #4 match as "bbb", whereas it
says there it should be empty (the latter which clang++ does).

1. http://en.cppreference.com/w/cpp/regex/ecmascript