Re: [1003.1(2024)/Issue8 0001857]: Several problems with the new "lazy" regex quantifier.

Steffen Nurpmeso via austin-group-l at The Open Group Tue, 24 Sep 2024 16:06:17 -0700

Hans Åberg wrote in
 <33614aed-35b0-4589-9d13-8c0d42afd...@telia.com>:
 |
 |> On 24 Sep 2024, at 00:39, Steffen Nurpmeso <stef...@sdaoden.eu> wrote:
 |> 
 |> Hans Åberg wrote in
 |> <32c01fe0-b715-412c-809a-0581bf1b8...@telia.com>:
 |>|I made the most efficient regular expression algorithm possible that \
 |>|catches all called for sub-matches, described here:
 |>|https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85472
 |> 
 |> You have not posted an URL to your implementation in this thread?
 |
 |It is not published yet. I wrote it back then, and is now seeing if \
 |I can integrate into a program I am writing.


Well it seems that thread caused some performance boost if
i understood that correctly, so..

 |>|In short, backtracking is avoided by caches DFA states during the \
 |>|parsing, \
 |>|though all can be computed in advance, and transverse in a loop, like \
 |>|in Flex and Bison, with no stacking or function calls. DFA minimalizatio\
 |>|n \
 |>|proved to not work well with actions of subexpressions:
 |>|
 |>|Some sub-NFAs, corresponding to sub-regexes, are marked with labels, \
 |>|and when transitioning out of one or more, all labeled matches are \
 |>|computed using a partial RNFA, reverse NFA.
 |>|
 |>|Unicode UTF-8 and UTF-32 are treated by generating byte regular expressi\
 |>|ons.
 |>|
 |>|In the context of the discussion here, the task is to find a suitable \
 |>|selection of found matches. For example, backreferences would typically \
 |>|take the last computed value of a labeled match.
 |>|
 |>|My implementation is oriented towards use with a parser, such as Bison. 
 |> 
 |> I am too stupid for yacc/bison lex/flex.  I write parsers by hand.
 |
 |In the context of discussions here, features such as backreferences \
 |might be suitable as a replacement for a parser generator, like in \
 |the example of scanning HTML code given here:
 |https://www.regular-expressions.info/backref.html
 |
 |There I would prefer a parser generator that admits more advanced grammars \
 |than regular expressions, though I am not sure how well such works \
 |with HTML.
 |
 |Handwritten parsers might be preferred if the grammar is well known, \
 |such in a compiler for a well-known programming language.
 |
 |>|Tim Shen said he would make an implementation along these lines. So \
 |>|you might check with him for a reference implementation.
 |> 
 |> Oh!  It is not as if *i* would be writing a regular expression
 |> engine any time soon.  I still remember, however, being a little
 |> bit strained once a QT release came over with a ~30KB (iirc)
 |> regular expression implementation.  One always has an option to
 |> replace a self-written suboptimal thing with a really thought
 |> through optimized variant in the Open Source World!
 |> (Pointers to Russ Cox and Plan9 regex i have seen in the issue.)
 |
 |There is an interesting one here:
 |https://www.genivia.com/doc/reflex/html/
 |https://en.wikipedia.org/wiki/RE/flex

That is an interesting thing, thanks for this link.
(Pretty academical read, like the above thread of yours.)

 |Not to be confused with:
 |https://invisible-island.net/reflex/
 --End of <33614aed-35b0-4589-9d13-8c0d42afd...@telia.com>

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

Re: [1003.1(2024)/Issue8 0001857]: Several problems with the new "lazy" regex quantifier.

Reply via email to