Re: [Jgeneral] rxmatches does not match empty regular expression at end of string

Rik Renich Sat, 09 Apr 2022 04:57:56 -0700

Hi Raul,

That is working well.


Thanks,
Rik

On Fri, Apr 8, 2022 at 11:04 AM Raul Miller wrote:

> Here's a fix for the double match from '$' rxmatches 'is'
>
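> rxmatches_jregex_=: 4 : 0
> 'p n'=. 2 {. boxopen x   NB. p: pattern, n: optional subexpression indices
> regcomp p
> m=. regmatch1 y          NB. first match
> if. _1 = {.{.m do. i.0 1 2 return. end.   NB. no match: empty result
> s=. +/0 1>.{.m           NB. next start: at least one past the match start
> r=. ,: m
> while. s <:#y do.        NB. s = #y is the empty position after the last char
>   if. _1 = {.{.m=. regmatch2 y;s do. break. end.
>   s=. (s+1) >. +/ {.m    NB. always advance at least one character
>   r=. r, m
> end.
> if. #n do. n{"2 r end.
> )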
>
> In other words:
>
> My first fix allowed us to search the empty position after the last
> character in the string.
>
> My second fix adopts a mild variation of the "next position" loop
> mechanism for dealing with the necessary shift from the first match
> (regmatch1) and the subsequent matches (regmatch2).
>
> The general problem with matching the empty string is that there are
> infinitely many such matches at any position. So, technically, there
> are two such matches at the end of the string (and three, and so on).
>
> But for a minimal result, we always advance at least one character
> beyond a match of an empty substring.
>
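For readers who do not use J, the "advance at least one character" rule can be sketched in Python. This is only an illustration built on Python's re module, not on the jregex internals; find_all is a hypothetical helper name, and the (start, length) pairs merely mirror the shape of the rows that rxmatches returns.

```python
import re

def find_all(pattern, text):
    """Return (start, length) pairs for successive matches of pattern in text."""
    rx = re.compile(pattern)
    out = []
    pos = 0
    # pos may equal len(text): that is the empty position after the last character.
    while pos <= len(text):
        m = rx.search(text, pos)
        if m is None:
            break
        out.append((m.start(), m.end() - m.start()))
        # Resume after the match, but always at least one character past its
        # start, so an empty match cannot be reported twice at the same position.
        pos = max(m.end(), m.start() + 1)
    return out

for pattern in ("$", "|$", ""):
    print(repr(pattern), find_all(pattern, "is"))
```

This loop does not attempt the retry-at-the-same-offset step described in the PCRE documentation, so it may differ from Perl for patterns that can match both an empty and a non-empty string at the same position.
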
> I hope this helps,
>
> --
> Raul
>
> On Fri, Apr 8, 2022 at 10:28 AM Rik Renich wrote:
> >
> > Hi Raul,
> >
> > Thanks for the quick work.  It does indeed fix the last two examples, but
> > it creates an extra match in the first one.  The pcre documentation did
> > warn us it was tricky!  Hopefully the test scripts I recently sent will
> > prove useful.
> >
> > Thanks again,
> > Rik
> >
> > P.S.  Raul, it's good to hear from you after such a long time.
> >
> > On Thu, Apr 7, 2022 at 11:44 PM Raul Miller wrote:
> >
> > > Here's a fixed version of rxmatches:
> > >
> > > rxmatches_jregex_=: 4 : 0
> > > 'p n'=. 2 {. boxopen x
> > > regcomp p
> > > m=. regmatch1 y
> > > if. _1 = {.{.m do. i.0 1 2 return. end.
> > > s=. 1 >. +/{.m
> > > r=. ,: m
> > > while. s <:#y do.
> > >   if. _1 = {.{.m=. regmatch2 y;s do. break. end.
> > >   s=. (s+1) >. +/ {.m
> > >   r=. r, m
> > > end.
> > > if. #n do. n{"2 r end.
> > > )
> > >
> > > FYI,
> > >
> > > --
> > > Raul
> > >
> > > On Wed, Apr 6, 2022 at 10:20 PM Rik Renich wrote:
> > > >
> > > > There seems to be a bug in rxmatches.  I expect
> > > > ('|$' rxmatches 'is') to have three matches, but the final one is
> > > > omitted.  Likewise for an empty regex.  For comparison with perl:
> > > >
> > > >     cat | perl
> > > >     $_= "is"; s/$/--/g; print "$_\n";
> > > >     $_= "is"; s/|$/--/g; print "$_\n";
> > > >     $_= "is"; s//--/g; print "$_\n";
> > > >
> > > >     is--
> > > >     --i--s--
> > > >     --i--s--
> > > >
> > > > Note that perl matches the end of the string for all 3 cases.
> > > >
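As a further cross-check outside Perl and J, the same three substitutions can be tried with Python's re module. This is a sketch for comparison only, assuming Python 3.7 or later (where the handling of empty matches in re.sub was revised); the variable names are illustrative.

```python
import re

subject = "is"
patterns = ["$", "|$", ""]

# Replace every match of each pattern with "--", as the perl s///g lines do.
results = [re.sub(p, "--", subject) for p in patterns]

for p, r in zip(patterns, results):
    print(repr(p), r)
```

Under those assumptions, Python also matches the end of the string in all three cases.
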
> > > >     jc
> > > >        s=: 'is'
> > > >        (<'--') ('$' rxmatches s) rxmerge s
> > > >     is--
> > > >        (<'--') ('|$' rxmatches s) rxmerge s
> > > >     --i--s
> > > >        (<'--') ('' rxmatches s) rxmerge s
> > > >     --i--s
> > > >        exit''
> > > >
> > > > Note that I have used rxmerge to mimic the example given in perl.
> > > > However, the unexpected result comes from rxmatches.
> > > >
> > > > As these examples show, rxmatches is not compatible with perl for the
> > > > last two cases.  Clearly the second case should match the end of the
> > > > string, as one clause of the regex is "end of string."  The third
> > > > case seems to be the same bug.
> > > >
> > > > If you visit https://www.pcre.org/original/doc/html/pcreapi.html and
> > > > search for "tricky" you will find:
> > > >
> > > > Finding all the matches in a subject is tricky when the pattern can
> > > > match an empty string. It is possible to emulate Perl's /g behaviour
> > > > by first trying the match again at the same offset, with the
> > > > PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
> > > > fails, advancing the starting offset and trying an ordinary match
> > > > again. There is some code that demonstrates how to do this in the
> > > > pcredemo sample program. In the most general case, you have to check
> > > > to see if the newline convention recognizes CRLF as a newline, and if
> > > > so, and the current character is CR followed by LF, advance the
> > > > starting offset by two characters instead of one.
> > > >
> > > > I have tried pcredemo and it provides results consistent with perl.
> > > >
> > > > I have provided test scripts in both perl and ijs, along with a
> > > > minimal test file.
> > > >
> > > > Thanks,
> > > > Rik
> > > >
> > > > ----------------------------------------------------------------------
> > > > For information about J forums see http://www.jsoftware.com/forums.htm
