On Wed, Aug 28, 2002 at 12:55:44PM -0400, Deven T. Corzine wrote:
> On Wed, 28 Aug 2002, Dan Sugalski wrote:
> > At 10:57 AM -0400 8/28/02, Deven T. Corzine wrote:
> 
> On the other hand, :, ::, ::: and <commit> don't necessarily need to be a 
> problem if they can be treated as hints that can be ignored.  If allowing 
> the normal engine to backtrack despite the hints would change the results, 
> that might be a problem.  I don't know; <cut> may pose special problems.

They do change the semantics.

 (June|Jun) ebug

matches the string "Junebug", but

 (June|Jun): ebug

does not. Similarly,

 (June|Jun) ebug
 (June|Jun) :: ebug
 (June|Jun) ::: ebug
 (June|Jun) <commit> ebug

all behave differently when embedded in a larger grammar.

However, it is very possible that in many (the majority?) of actual
uses, they may be intended purely as optimizations and so any
differing behavior is unintentional. It may be worth having a flag
noting that (maybe combined with a warning "you said this :: isn't
supposed to change what can match, but it does.")

> > That doesn't mean you can't write one for a specific subset of perl's 
> > regexes, though. A medium-term goal for the regex engine is to note 
> > where a DFA would give correct behaviour and use one, rather than 
> > going through the more expensive generalized regex engine we'd 
> > otherwise use.
> 
> I think this is a more realistic goal, and more or less what I had in mind.
> 
> I believe there are many subpatterns which might be beneficial to compile 
> to a DFA (or DFA-like) form, where runtime performance is important.  For 
> example, if a pattern is matching dates, a (Jan|Feb|Mar|Apr|...) subpattern
> would be more efficient to implement as a DFA than with backtracking.  With 
> a large amount of data to process, that represent significant savings...

I agree. I will reply to this in perl6-internals, though.

> (2) Would simple alternation impede DFA-style optimizations?
> 
> Currently, a pattern like (Jun|June) would never match "June" because the 
> "leftmost" match "Jun" would always take precedence, despite the normal 
> "longest-match" behavior of regexes in general.  This example could be 
> implemented in a DFA; would that always be the case?

You should read Friedl's Mastering Regular Expressions, if you haven't
already. A POSIX NFA would be required to find the longest match (it
has to work as if it were a DFA). A "traditional" NFA produces what
would result from the straightforward backtracking implementation,
which often gives an answer closer to what the user expects. IMO,
these are often different, and the DFA would surprise users fairly
often.

> Would it be better for the matching of (Jun|June) to be "undefined" and 
> implementation-dependent?  Or is it best to require "leftmost" semantics?

I'm voting "leftmost", because I've frequently seen people depend on
it. I'm not so sure that Larry's suggestion of adding a :dfa flag is
always the right approach, because I think this might actually be
something you'd want to set for subsets of a grammar or a single
expression. I don't think it's useful enough to go as far as proposing
that || mean "alternate without defining the order of preference", but
perhaps some <angle-bracketed thing> would work. (Or can you embed
flags in expressions, like perl5's (?imsx:R) thing? Then the :dfa flag
is of course adequate!)

Reply via email to