[REBOL] Regular Expressions Re:(2)

ole_f Thu, 6 Jan 2000 12:33:33 -0800
Hi Eric, 5-Jan-2000 you wrote:

[...]
>I think backtracking is an essential part of regular expressions. To quote
>from Mastering Regular Expressions by Jeffrey Friedl (O'Reilly, p. 102),

>    The essence of an NFA engine [the regex matching engine of the kind
>    that Perl uses] is this: it considers each subexpression or component
>    in turn, and whenever it needs to decide between two equally viable
>    options, it selects one and remembers the other to return to later if
>    need be. ....

>Suppose you want to find a word ending in "ing". In Perl you could write:

>   /\b[a-z_]+ing\b/i

>Backtracking is crucial here, because "[a-z_]+" first matches to the end of
>the word, and if you didn't backtrack, "ing" would never match. With a parse
>rule, you have to do something like this:

>    pp: copy []
>    alpha: make bitset!  [#"a" - #"z" #"A" - #"Z" #"_"]
>    non-alpha: complement alpha

>    parse/all string [ some [
>        p: some [ "ing" [non-alpha | end ] (append pp p) | alpha ] |
>        some non-alpha
>    ]]

>This will leave you with the block PP holding pointers to the beginning of
>all the words ending in "ing". It isn't quite equivalent to the Perl regex,
>but it's close enough. Here backtracking is accomplished in the following
>way: Suppose you were scanning through the word "ringing". The string "ing"
>would match "ringing" first, and then attempt to match a non-alphabet
>              ~~~
>character or the string end. That wouldn't work, but the following | says
>that we should try to jump back to the second letter of "ringing" and just
>match any alphabet character, and keep going through the word. This is the
>kind of thing that regexes do for you automatically.

How about this:

>> a: [skip a | "ing"]
== [skip a | "ing"]
>> parse "ringin" a
== false
>> parse "ringing" a
== true

It should do the job. So actually backtracking in Parse can be achieved by use
of the "|" operator in Parse blocks.

>But if you want to search for a pattern in text, such as {a word begining
>with "t" and ending with "gh"},

>> a: ["t" b]
== ["t" b]
>> b: [skip b | "gh"]
== [skip b | "gh"]
>> parse "tough" a
== true
>> parse "tougher" a
== false

Or have I misunderstood you?

>or {the words "straw" and "camel" not
>separated by any punctuation}, and especially if you want to do that

Hmmm, what's the problem with this one?

>interactively without stopping to think how you're going to program it, the
>parse rules that would be needed are much too complex. This is the problem
>I've been trying to tackle with search-text.r.

[...]
>The big problem is speed. When processing the expression given, an optimized
>function is created that tries to quickly identify potential matches.
>Unfortunately for an expression that begins with ANY there's no simple way to
>do this, so the main matching function is called for every character in the
>string. Needless to say that is useless for most purposes, but in interactive
>searching I don't think most people would use such an expression. Replacing
>text is another matter, but then that won't be much use until I get text
>capture of portions of the expression implemented.

Converting "(string)*" and "(string)+", for example, into this:

some-string: ["string" some-string | "string"]
any-string: ["string" any-string | none]

should work, shouldn't it? But yes, I must admit, too, that 'any and 'some
ought to do some backtracking - in fact, I believed they did, so please
ignore the mail I sent a couple of days ago on how to easily build Parse
blocks from regular expressions ;-)

Rebol Tech., isn't the lack of backtracking in 'any and 'some a bug?

BTW, by building a DFA for a regular expression, you can get rid of the
backtracking. However, I don't know how fast a program can build a DFA, it
might not be that fast... but if you try to match one regular expression to a
lot of strings, it might be worth a go. Keep in mind, though, that then
you'll only be able to get a "matched" or "didn't match" answer from your
expression matching, it will probably be very difficult to implement copying
of the text inside subexpressions etc.

Kind regards,
--
Ole Friis <[EMAIL PROTECTED]>

"Ignorance is bliss"
(Cypher, The Matrix)
[REBOL] Regular Expressions Re:(2)

Reply via email to