Re: [Rd] Bug report: POSIX regular expression doesn't match for somewhat higher values of upper bound

2017-04-11 Thread dietmar.schindler
> From: Martin Maechler [mailto:maech...@stat.math.ethz.ch]
> Sent: Wednesday, 5 April 2017 11:15
>
> >   
> > on Tue, 4 Apr 2017 08:45:30 + writes:
>
> > Dear Sirs,
> > while
>
> >> regexpr('(.{1,2})\\1', 'foo')
> > [1] 2
> > attr(,"match.length")
> > [1] 2
> > attr(,"useBytes")
> > [1] TRUE
>
> > yields the correct match, an incremented upper bound in
>
> >> regexpr('(.{1,3})\\1', 'foo')
> > [1] -1
> > attr(,"match.length")
> > [1] -1
> > attr(,"useBytes")
> > [1] TRUE
>
> > incorrectly yields no match.
>
> Hmm, yes, I would also say that this is incorrect
> (though I'm always cautious: The  ?regex  help page explicitly
>  mentions greedy repetitions, and these can "bite you" ..)
>
> The behavior is also different from the  perl=TRUE one which is
> correct (according to the above understanding).
>
> ...

Shouldn't this then be submitted to R's Bugzilla (which I, as a non-member,
cannot do)?
--
Best regards,
Dietmar Schindler

manroland web systems GmbH -- Managing Director: Alexander Wassermann
Registered Office: Augsburg -- Trade Register: AG Augsburg -- HRB-No.: 26816 -- 
VAT: DE281389840

Confidentiality note:
This eMail and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you are not the intended recipient, you are hereby notified that any use or 
dissemination of this communication is strictly prohibited. If you have 
received this eMail in error, then please delete this eMail.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bug report: POSIX regular expression doesn't match for somewhat higher values of upper bound

2017-04-05 Thread Martin Maechler
>   
> on Tue, 4 Apr 2017 08:45:30 + writes:

> Dear Sirs,
> while

>> regexpr('(.{1,2})\\1', 'foo')
> [1] 2
> attr(,"match.length")
> [1] 2
> attr(,"useBytes")
> [1] TRUE

> yields the correct match, an incremented upper bound in

>> regexpr('(.{1,3})\\1', 'foo')
> [1] -1
> attr(,"match.length")
> [1] -1
> attr(,"useBytes")
> [1] TRUE

> incorrectly yields no match.

Hmm, yes, I would also say that this is incorrect
(though I'm always cautious: The  ?regex  help page explicitly
 mentions greedy repetitions, and these can "bite you" ..)

The behavior is also different from the  perl=TRUE one which is
correct (according to the above understanding).
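
The comparison can be reproduced directly. What follows is only a minimal
sketch, assuming nothing beyond base R; the helper name match_both is made up
for illustration. No result is claimed for the default engine, since that is
the behavior under discussion and may differ between R versions.

```r
# Hypothetical helper (not part of R): run regexpr() once with the default
# engine and once with perl = TRUE, and return both match positions.
match_both <- function(pattern, text) {
  c(default = as.integer(regexpr(pattern, text)),
    perl    = as.integer(regexpr(pattern, text, perl = TRUE)))
}

# The two upper bounds from the report, applied to 'foo'.
for (n in 2:3) {
  pattern <- sprintf("(.{1,%d})\\1", n)
  cat("pattern", pattern, "\n")
  print(match_both(pattern, "foo"))
}
```

A position of -1 means no match; with perl = TRUE both patterns should match
the "oo" starting at position 2.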

Using  grep() instead of regexpr() makes the behavior easier to parse.
The following code 
--

tx <- c("ab","abc", paste0("foo", c("", "b", "o", "bar", "oofy")))
setNames(nchar(tx), tx)
##      ab     abc     foo    foob    fooo  foobar foooofy
##       2       3       3       4       4       6       7

grep1r <- function(n, txt, ...) {
    pattern <- paste0('(.{1,', n, '})\\1', collapse="") ## can have empty n
    ans <- grep(pattern, txt, value=TRUE, ...)
    cat(sprintf("pattern '%s' : ", pattern)); print(ans, quote=FALSE)
    invisible(ans)
}

grep1r({}, tx)  # '.{1,}' : because of _greedy_ matching there is __no__ repetition!
grep1r(100, tx) # i.e., both of these give an empty result: character(0)

## matching at most once:
grep1r(1, tx)# matches all 5 starting with "foo"
grep1r(2, tx)# ditto: all have more than 2 chars
grep1r(3, tx)# not "foo": those with more than 3 chars
grep1r(4, tx)# .. those with more than 4 characters
grep1r(5, tx)# .. those with more than 5 characters
grep1r(6, tx)# .. those with more than 6 characters
grep1r(7, tx)# NONE (= those with more than 7 characters)

for(p in c(FALSE,TRUE)) {
    cat("\ngrep(*, perl =", p, ") :\n")
    for(n in c(list(NULL), 1:7))
        grep1r(n, tx, perl = p)
}

--

ends with

> for(p in c(FALSE,TRUE)) {
+     cat("\ngrep(*, perl =", p, ") :\n")
+     for(n in c(list(NULL), 1:7))
+         grep1r(n, tx, perl = p)
+ }

grep(*, perl = FALSE ) :
pattern '(.{1,})\1' : character(0)
pattern '(.{1,1})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,2})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,3})\1' : [1] foob    fooo    foobar  foooofy
pattern '(.{1,4})\1' : [1] foobar  foooofy
pattern '(.{1,5})\1' : [1] foobar  foooofy
pattern '(.{1,6})\1' : [1] foooofy
pattern '(.{1,7})\1' : character(0)

grep(*, perl = TRUE ) :
pattern '(.{1,})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,1})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,2})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,3})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,4})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,5})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,6})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,7})\1' : [1] foo     foob    fooo    foobar  foooofy
>
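
To judge which of the two listings is the expected one, the pattern can be
checked without any regex engine. The following is a sketch only: the function
has_adjacent_repeat is hypothetical (not part of R or of this thread), and it
assumes that '(.{1,n})\1' is meant as "some substring of 1 to n characters is
immediately followed by a copy of itself", for strings without newlines.

```r
# Hypothetical reference check: does `s` contain a substring of length
# 1..n that is immediately followed by an identical copy of itself?
has_adjacent_repeat <- function(s, n) {
  len <- nchar(s)
  for (k in seq_len(min(n, len %/% 2))) {    # candidate group length
    for (i in seq_len(len - 2 * k + 1)) {    # candidate start position
      first  <- substr(s, i, i + k - 1)
      second <- substr(s, i + k, i + 2 * k - 1)
      if (identical(first, second)) return(TRUE)
    }
  }
  FALSE
}

tx <- c("ab", "abc", paste0("foo", c("", "b", "o", "bar", "oofy")))
for (n in 1:7) {
  hits <- tx[vapply(tx, has_adjacent_repeat, logical(1), n = n)]
  cat(sprintf("n = %d : %s\n", n, paste(hits, collapse = " ")))
}
```

Under that reading every string containing "oo" qualifies for every n >= 1,
which agrees with the perl = TRUE listing above.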



[Rd] Bug report: POSIX regular expression doesn't match for somewhat higher values of upper bound

2017-04-04 Thread dietmar.schindler
Dear Sirs,

while

> regexpr('(.{1,2})\\1', 'foo')
[1] 2
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE

yields the correct match, an incremented upper bound in

> regexpr('(.{1,3})\\1', 'foo')
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE

incorrectly yields no match.

R versions tested:
2.11.1 on i486-pc-linux-gnu
2.15.1 on x86_64-pc-linux-gnu
3.2.1 on i386-w64-mingw32
3.2.1 on x86_64-w64-mingw32
3.3.3 on x86_64-w64-mingw32
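
For anyone adding their own setup to this list, the relevant version details
can be printed from within R. This is a small sketch and assumes
R >= 3.2.0, where extSoftVersion() is available in base R; it reports the
regex libraries R was built with (TRE for the default engine, PCRE for
perl = TRUE).

```r
# Report the running R version, the platform, and the versions of the
# regex libraries in use.
cat(R.version.string, "\n")
cat("platform:", R.version$platform, "\n")
print(extSoftVersion()[c("TRE", "PCRE")])
```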
--
Best regards,
Dietmar Schindler

