Re: [Rd] 'gsub' not perl compatible?

2017-07-24 Thread Robert McGehee
Hi Jean-Luc,
FWIW, you're pointing out a common discrepancy between regex engines: whether 
or not the engine accepts a zero-length match at the position where a 
non-zero-length match has just ended.

I think this article is especially helpful for understanding the nuances here, 
particularly the section "Advancing After a Zero-Length Regex Match". 
http://www.regular-expressions.info/zerolength.html

The article's test example was gsub("\\d*", "x", "x1"), which 
demonstrates the same difference as in your example (i.e. the answer can be 
either "xxx" or "" depending on the parser). They also include a note on 
R's gsub function that describes this discrepancy:

"The regexp functions in R and PHP are based on PCRE, so they avoid getting 
stuck on a zero-length match by backtracking like PCRE does. But the gsub() 
function to search-and-replace in R also skips zero-length matches at the 
position where the previous non-zero-length match ended, like Python does."
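
As a rough illustration of the rule quoted above, the following sketch scans a 
string by hand for the single pattern "b?" (no regex engine involved). The 
function name replace_optional_b and its allow_empty_after_match flag are 
hypothetical, made up for this example; it is not how gsub is implemented, 
only a model of the two policies being compared.

```r
# Hypothetical model of a global replace for the greedy pattern "b?".
# allow_empty_after_match = TRUE : an empty match is accepted right after a
#                                  non-empty match (the Perl-style result).
# allow_empty_after_match = FALSE: that empty match is skipped (the result
#                                  reported for gsub in this thread).
replace_optional_b <- function(x, replacement, allow_empty_after_match) {
  chars <- strsplit(x, "")[[1]]
  out <- character(0)
  i <- 1L
  just_matched <- FALSE              # TRUE right after consuming a "b"
  while (i <= length(chars) + 1L) {
    if (i <= length(chars) && chars[i] == "b") {
      out <- c(out, replacement)     # non-empty match: consume the "b"
      i <- i + 1L
      just_matched <- TRUE
      next
    }
    # Only an empty match is possible at this position.
    if (!just_matched || allow_empty_after_match) {
      out <- c(out, replacement)
    }
    just_matched <- FALSE
    if (i <= length(chars)) out <- c(out, chars[i])  # copy the unmatched character
    i <- i + 1L
  }
  paste(out, collapse = "")
}

replace_optional_b("abc", "!", allow_empty_after_match = FALSE)  # "!a!c!"
replace_optional_b("abc", "!", allow_empty_after_match = TRUE)   # "!a!!c!"
```

With the flag off, the sketch reproduces the "!a!c!" reported in this thread 
for gsub; with it on, it reproduces the "!a!!c!" reported for Perl and SAS.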

All that said, your larger point still seems valid, that we should expect to 
see behavior consistent with the PCRE parser when we specify perl=TRUE, even if 
that is a different answer than we get from R's default TRE parser when 
perl=FALSE. And to take perl out of the equation, I also verified your test 
directly with PCRE (8.39) on my Linux box using the `pcretest` command, and 
sure enough, pcretest shows four matches to your example, consistent with an 
answer of !a!!c! like you said. Perhaps at a minimum, the ?gsub or ?regex man 
page should add a blurb indicating that the perl=TRUE behavior differs from 
PCRE in the case of non-zero length matches adjacent to zero-length matches. 
Though I'm not sure if this difference is known and intentional or just a side 
effect of some other decision. R also supports adding perl options embedded in 
the pattern. For example '(?i)' makes the pattern case insensitive and '(?U)' 
turns off greedy matching. I could imagine having the behavior you noted 
depend on such an option as well, if someone was inclined to make a patch and 
didn't want to change existing behavior.
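
A small sketch of the two embedded options mentioned above, assuming 
perl = TRUE; the input strings and variable names are invented for 
illustration.

```r
x <- "aBbbc"

# Without options the match is case-sensitive, so "B" is left alone.
plain       <- gsub("b", "!", x, perl = TRUE)       # "aB!!c"

# (?i) makes the pattern case-insensitive.
insensitive <- gsub("(?i)b", "!", x, perl = TRUE)   # "a!!!c"

# By default b+ is greedy and swallows the whole run of b's.
greedy      <- sub("b+", "!", "abbbc", perl = TRUE)      # "a!c"

# (?U) inverts greediness, so b+ stops after the first "b".
ungreedy    <- sub("(?U)b+", "!", "abbbc", perl = TRUE)  # "a!bbc"
```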

However, to get the result you want, it seems you may unfortunately have to 
rewrite the query as two calls to gsub, something like this: 

> gsub("b?", "!", gsub("b", "bb", "abc"))
 [1] "!a!!c!"
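
One caveat with doubling each "b", as far as I can tell: for a run of k 
consecutive b's, Perl would produce k + 1 replacements (k non-empty matches 
plus one empty match at the end of the run), while doubling yields 2k, so the 
two agree only when k = 1. A hedged variant that appends one extra "b" per run 
instead could look like the sketch below. The helper name gsub_optional_b is 
hypothetical, and the approach is specific to this one pattern.

```r
# Hypothetical helper: emulate a Perl-style global replace of "b?" with two
# gsub calls. Not a general fix, only a sketch for this specific pattern.
gsub_optional_b <- function(x, replacement = "!") {
  padded <- gsub("(b+)", "\\1b", x)  # add one extra "b" after every run of b's
  gsub("b?", replacement, padded)    # the extra "b" stands in for the skipped empty match
}

single <- gsub_optional_b("abc")   # "!a!!c!", same as the two-call version above
run    <- gsub_optional_b("abbc")  # "!a!!!c!"
```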

--Robert


-----Original Message-----
From: R-devel [mailto:r-devel-boun...@r-project.org] On Behalf Of Lipatz 
Jean-Luc
Sent: Friday, July 21, 2017 5:27 AM
To: r-devel@r-project.org
Subject: [Rd] 'gsub' not perl compatible?

Hi all,

Working on some SAS program conversions, I was testing this (3.4.0 Windows, but 
also 2.10.1 MacOsX):
gsub("b?","!","abc",perl=T)

which returns
[1] "!a!c!"

that I didn't understand.

Unfortunately, given the same input, SAS 9.4 returns "!a!!c!", and so does 
Perl (Strawberry 5.26), which seems a more logical answer to me.
Is there some problem with PCRE, or some subtlety that I didn't catch?

Results are similar with * instead of ?,
and there is a similar issue with the lazy operator:
gsub("b??","!","abc",perl=T) gives "!a!b!c!", while the other programs give 
"!a!!!c!"


Thanks

Jean-Luc LIPATZ





__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


