practical application. now there are big books on `regular expressions'
mainly because they are no longer regular but a big collection of ad-hoc
I thought they were "regular" because they "regularly" occurred in the
target text. Turns out other interpretations are possible. Though, mine has
the advantage of being correct :-D
The set of "big books on regular expressions" includes Jeffrey Friedl's
"Mastering Regular Expressions" that happens to contain a chapter by the
title "NFA, DFA, and POSIX" wherein he says:
DFA Speed with NFA Capabilities: Regex Nirvana?
I've said several times that a DFA can't provide capturing parentheses or
backreferences. This is quite true, but it certainly doesn't preclude
hybrid approaches that mix technologies in an attempt to reach regex
nirvana. The sidebar in Section 4.6.3.1 told how NFAs have diverged from
the theoretical straight and narrow in search of more power, and it's
only natural that the same happens with DFAs. A DFA's construction makes
it more difficult, but that doesn't mean impossible.
GNU grep takes a simple but effective approach. It uses a DFA when
possible, reverting to an NFA when backreferences are used. GNU awk does
something similar—it uses GNU grep's fast shortest-leftmost DFA engine
for simple "does it match" checks, and reverts to a different engine for
checks where the actual extent of the match must be known. Since that
other engine is an NFA, GNU awk can conveniently offer capturing
parentheses, and it does via its special gensub function.
Tcl's regex engine is a true hybrid, custom built by Henry Spencer (whom
you may remember having played an important part in the early development
and popularization of regular expressions see Section 3.1.1.6). The Tcl
engine sometimes appears to be an NFA— it has lookaround, capturing
parentheses, backreferences, and lazy quantifiers. Yet, it has true POSIX
longest-leftmost match (see Section 4.6.1), and doesn't suffer from some
of the NFA problems that we'll see in Chapter 6. It really seems quite
wonderful.
Currently, this engine is available only to Tcl, but Henry tells me that
it's on his to-do list to break it out into a separate package that can
be used by others.
Again, turns out the "big books on regular expressions" can give the
lowlife--that's me--things "hackers" deny them.
--On Monday, October 27, 2008 10:18 AM +0000 Charles Forsyth
<[EMAIL PROTECTED]> wrote:
Both systems are complex enough that essentially no one completely
understands them.
this touches on an important point. the first introduction of regular
expressions to editors was great, because it took some formal language
theory and made it useful in an `every day' way. conversely, the theory
provided significant underpinning (in a structural sense) to that
practical application. now there are big books on `regular expressions'
mainly because they are no longer regular but a big collection of ad-hoc
additions. (i think there is a similarity to Perl's relationship to the
history of programming languages.) the result really is hard to reason
about ... because it hasn't got that underpinning and the algebra has
gone. it's worse than that: the code is a mess too, i supposed in
consequence.