Re: Matching subpatterns in any order, conjunctions, negated matches

Peter Pentchev Sat, 16 May 2020 18:34:24 -0700

On Sat, May 16, 2020 at 05:53:04PM -0700, Joseph Brenner wrote:
>  Peter Pentchev <r...@ringlet.net> wrote:
> > On Fri, May 15, 2020 at 07:32:50PM -0700, Joseph Brenner wrote:
> >> Regex engines by their nature care a lot about order, but I
> >> occasionally want to relax that to match for multiple
> >> multicharacter subpatterns where the order of them doesn't
> >> matter.
> >>
> >> Frequently the simplest thing to do is just to do multiple
> >> matches.  Let's say you're looking for words that have a "qu", a
> >> "th", and also, say, an "ea".  This works:
> >>
> >>   my $DICT  = "/usr/share/dict/american-english";
> >>   my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({/ea/});
> >>   say @hits;
> >>   # [bequeath bequeathed bequeathing bequeaths earthquake earthquake's earthquakes]
> >
> > Would something like this work for you?
> >
> >   /^ <?before .* "qu" > <?before .* "th" > <?before .* "ea" > /
> >
> >> Where things get interesting is when you want a negated match of
> >> one of the subpatterns.  One of the things I like about the first
> >> approach using multiple chained greps is that it's easy to do a
> >> reverse match.  What if you want words with "qu" and "th" but
> >> want to *skip* ones with an "ea"?
> >>
> >>   my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({!/ea/});
> >>   # [Asquith discotheque discotheque's discotheques quoth]
> >
> > Maybe something like this? (note the "!" instead of "?")
> >
> >   /^ <?before .* "qu" > <?before .* "th" > <!before .* "ea" > /
> >
>
> Yes, both of those work, and arguably they're a little cleaner
> looking than my conjunction approach, though it's not necessarily any
> easier to think about.  It looks like a pattern that's matching
> for three things in order, but the zero-widthness of the "before"
> lets them all work on top of each other.
> 
> I keep thinking there's an edge case in these before/after tricks that
> might matter if we weren't matching the one-word-per-line format of
> the unix dictionaries, but I need to think about that a little more...


Actually, there is, and I conveniently did not mention it :) It's the
case when the patterns may overlap: if you do the '<?before' thing with
'the' and 'entrance', you might match 'thentrance', which, depending on
your use case, might not be ideal.
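
To make the overlap concrete, here is a minimal sketch in the same
lookahead style as the patterns earlier in the thread. The variable name
$overlapping is only for illustration.

```raku
# Every lookahead is tested from the start of the string, so the
# subpatterns are free to share characters with each other.
my $overlapping = / ^ <?before .* "the" > <?before .* "entrance" > /;

say so "thentrance"   ~~ $overlapping;   # True: "the" and "entrance" share the "e"
say so "the entrance" ~~ $overlapping;   # True: no shared characters here
```

Both words pass, so the lookaheads alone cannot tell a word that really
contains two separate substrings from one where they run into each other.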

I've thought a little about another method: split the string using one
of the patterns as a separator, then split each of the resulting
substrings using the next pattern, and so on until you get to the last
one, where you just check whether any of the ministrings contains it.

It would have to be done carefully, though, with a special split-like
function that finds every occurrence of the pattern and returns
a ("before", "after") tuple for each one. That avoids another kind of
overlap problem: if you split "the father" on all of the occurrences of
"the" at the same time, you *will* miss "father" :) So the function has
to split "the father" first as ("", " father"), then as ("the fa", "r"),
and return all of the non-empty results (" father", "the fa", "r")...

I'm not sure this will be very efficient. OK, so as a microoptimization
it may return only the results that are at least as long as the shortest
pattern remaining, but it still sounds weird.
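
A rough sketch of one way to read that idea, not a tuned implementation.
The names split-around, all-disjoint and contains-all-disjoint are made
up for illustration, and the patterns are assumed to be plain strings
rather than regexes. Instead of flattening all of the pieces into one
list, it keeps the pieces from each occurrence together and backtracks,
so that leftovers from two different occurrences are never mixed. The
microoptimization is left out.

```raku
# Hypothetical helper: one (before, after) pair per occurrence of $needle.
# "the father" split around "the" gives ("", " father") and ("the fa", "r").
sub split-around(Str $text, Str $needle) {
    $text.indices($needle, :overlap).map: -> $pos {
        ($text.substr(0, $pos), $text.substr($pos + $needle.chars))
    }
}

# True if every needle can be found somewhere in @pieces without two
# needles sharing a character. Tries each occurrence in turn and recurses
# on whatever text is left over around it.
sub all-disjoint(@pieces, @needles) {
    return True unless @needles;
    my ($needle, @rest) = @needles;
    for @pieces.kv -> $i, $piece {
        for split-around($piece, $needle) -> @pair {
            my @split = @pair.grep(*.chars);     # drop empty leftovers
            my @next  = |@pieces.head($i), |@split, |@pieces.skip($i + 1);
            return True if all-disjoint(@next, @rest);
        }
    }
    False
}

sub contains-all-disjoint(Str $text, *@needles) {
    all-disjoint([$text], @needles)
}

say contains-all-disjoint("thentrance",   "the", "entrance");   # False
say contains-all-disjoint("the entrance", "the", "entrance");   # True
```

The backtracking is what makes this potentially slow: in the worst case
it tries every combination of occurrences before giving up.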

G'luck,
Peter

-- 
Peter Pentchev  r...@ringlet.net r...@debian.org p...@storpool.com
PGP key:        http://people.FreeBSD.org/~roam/roam.key.asc
Key fingerprint 2EE7 A7A5 17FC 124C F115  C354 651E EFB0 2527 DF13
