Re: Variable character class

William Michels via perl6-users Tue, 03 Sep 2019 11:18:53 -0700

Someone might get a kick out of this ;-). Clearly regexes are built on
top of set theory, but as both Simon and Yary pointed out, my
set-based code didn't return the matching string "8420" present in the
target.


In Example A, Eirik's code uses an array to build a character class
of sorts (the array interpolates as an alternation of its elements),
and then tests it in a regex against the target ($_). In Example B,
all it takes is the addition of the "frugal" quantifier ("+?") to
Eirik's code (and presumably to Simon's code) to give almost the same
results as a set intersection.

The results aren't completely identical because the regex code
(Example A, Example B) is non-symmetric: it walks the "target" string
and reports every match it finds there. As shown below with the
"lorem ipsum" text, when Example B (array regex) is compared to
Example C (set intersection), a character that occurs several times in
the "target" string shows up several times in the regex match list
(Example B), while these duplicates are eliminated from the
"symmetrical" set-intersection result (Example C):

##_A____
sub contains( Str $chars, Str $_ ) {
  my @arr = $chars.comb.unique;
  return m:g/@arr+/
}

say contains("24680", "19584203").join("|"); # says 8420
say contains("19584203", "24680").join("|"); # says 24|80
say contains("Lorem ipsum dolor sit amet, consectetuer adipiscing elit.", "abcdefg").join("|"); # says a|cde|g
say contains("abcdefg", "Lorem ipsum dolor sit amet, consectetuer adipiscing elit.").join("|"); # says e|d|a|e|c|ec|e|e|ad|c|g|e

##_B____
sub nonsym_intersect( Str $chars, Str $_ ) {
  my @arr = $chars.comb.unique;
  return m:g/@arr+?/
}

say nonsym_intersect("24680", "19584203").join("|"); # says 8|4|2|0
say nonsym_intersect("19584203", "24680").join("|"); # says 2|4|8|0
say nonsym_intersect("Lorem ipsum dolor sit amet, consectetuer adipiscing elit.", "abcdefg").join("|"); # says a|c|d|e|g
say nonsym_intersect("abcdefg", "Lorem ipsum dolor sit amet, consectetuer adipiscing elit.").join("|"); # says e|d|a|e|c|e|c|e|e|a|d|c|g|e

##_C____
sub sym_intersect(Str $a, Str $b) {
   my @c = $a.comb.unique;
   my @d = $b.comb.unique;
   #return (~[@c (&) @d]).^name;
   return ~[@c (&) @d];
}

say sym_intersect("24680", "19584203").words.join("|"); # says 2|8|4|0
say sym_intersect("19584203", "24680").words.join("|"); # says 8|4|2|0
say sym_intersect("Lorem ipsum dolor sit amet, consectetuer adipiscing elit.", "abcdefg").words.join("|"); # says a|g|c|d|e
say sym_intersect("abcdefg", "Lorem ipsum dolor sit amet, consectetuer adipiscing elit.").words.join("|"); # says a|d|e|g|c
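
As a rough sketch of how Examples B and C relate (this is not from the
thread, and the name unique_matches is made up for illustration):
stringifying the matches from Example B and passing them through
.unique should drop the repeats while keeping the order in which the
characters first appear in the target.

```raku
# Hypothetical helper, not from the thread: Example B plus de-duplication.
sub unique_matches( Str $chars, Str $_ ) {
  my @arr = $chars.comb.unique;
  # Stringify each Match, then keep only the first occurrence of each string.
  return (m:g/@arr+?/).map(*.Str).unique;
}

say unique_matches("abcdefg", "Lorem ipsum dolor sit amet, consectetuer adipiscing elit.").join("|"); # e|d|a|c|g
```

Unlike the set intersection, this still depends on which argument is
the target, since the order follows the target string.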


One caveat about Example C above: because the sub stringifies the
set intersection before returning it, I can't just do a "join" on the
result, as in the previous two examples. I have to break the returned
string into ".words" and then ".join", to match the format of the
previous two examples.
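
One possible way around that caveat, as a minimal sketch: return the
members of the Set with .keys instead of a stringified Set, so the
caller can join them directly. The name sym_intersect_keys is
hypothetical, and the .sort is only there because a Set has no
guaranteed order.

```raku
# Hypothetical variant of Example C that returns the Set's members as a list.
sub sym_intersect_keys( Str $a, Str $b ) {
  # (&) coerces both character lists to Sets; .keys lists the members.
  return ($a.comb (&) $b.comb).keys;
}

say sym_intersect_keys("24680", "19584203").sort.join("|"); # 0|2|4|8
```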

HTH, Bill.

PS Eirik, I think people might be referring to <{...}> as "pointy
blocks", but I'm really not sure...





On Mon, Sep 2, 2019 at 11:25 AM Joseph Brenner <doom...@gmail.com> wrote:
>
> >   The "implicit" alternation comes from interpolating a list (of subrules,
> > see below).
>
> I see.  And that's discussed here (had to really look for it):
>
> https://docs.perl6.org/language/regexes#Quoted_lists_are_LTM_matches
>
> At first I was looking further down in the "Regex interpolation"
> section, where it's also touched on, though I kept missing it:
>
> > When an array variable is interpolated into a regex, the regex engine 
> > handles it like a | alternative of the regex elements (see the 
> > documentation on embedded lists, above).
>
>
> On 9/1/19, The Sidhekin <sidhe...@gmail.com> wrote:
> > On Mon, Sep 2, 2019 at 1:12 AM Joseph Brenner <doom...@gmail.com> wrote:
> >
> >> I was just trying to run Simon Proctor's solution, and I see it
> >> working for Yary's first case, but not his more complex one with
> >> problem characters like brackets included in the list of characters.
> >>
> >> I don't really see how to fix it, in part because I'm not that
> >> clear on what it's actually doing... there's some sort of
> >> implicit alternation going on?
> >>
> >>
> >> sub contains( Str $chars, Str $_ ) {
> >>   m:g/<{$chars.comb}>+/
> >> };
> >>
> >
> >   The "implicit" alternation comes from interpolating a list (of subrules,
> > see below).
> >
> > That works for this case:
> >>
> >>   say contains('24680', '19584203');
> >>   # (｢8420｣)
> >>
> >> But on something like this it errors out:
> >>
> >>   say contains('+\/\]\[', 'Apple ][+//e'); # says ][+//
> >>
> >
> >   … because it's trying to compile each (1-character) string as a subrule …
> >
> >   To have the (1-character) strings used as literals, rather than compiled
> > as subrules, put them in an array instead of a block wrapped in angle
> > brackets:
> >
> > sub contains( Str $chars, Str $_ ) {
> >   my @arr = $chars.comb;
> >   m:g/@arr+/
> > }
> >
> >
> >   (… hey, is there a word for "block wrapped in angle brackets"?)
> >
> >
> > Eirik
> >
