Hi Jean-Luc,
FWIW, you're pointing out a common discrepancy between regex engines, which is
whether or not the engine reports a zero-length match at the position where
the previous non-zero-length match ended.
I think this article is especially helpful for understanding the nuances here,
particularly the section "Advancing After a Zero-Length Regex Match".
http://www.regular-expressions.info/zerolength.html
For this article, their test example was gsub("\\d*", "x", "x1"), which
demonstrates the same difference as in your example (i.e. the answer can be
either "xxx" or "xxxx" depending on the parser). They also provide a note
specifically on R's gsub function that describes this discrepancy:
"The regexp functions in R and PHP are based on PCRE, so they avoid getting
stuck on a zero-length match by backtracking like PCRE does. But the gsub()
function to search-and-replace in R also skips zero-length matches at the
position where the previous non-zero-length match ended, like Python does."
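To make the two behaviors concrete, here is a rough sketch of the replacement loop under each policy. It is written in Python only so the loop can be spelled out by hand; `gsub_sketch` and its `skip_empty_after_match` flag are hypothetical names for illustration, not part of R, PCRE, or Python's `re` module. It assumes greedy patterns and a literal replacement string, and it does not model the retry-with-non-empty step a real engine performs after an empty match, so lazy patterns such as b?? are out of scope.

```python
import re

def gsub_sketch(pattern, repl, text, skip_empty_after_match):
    """Replace every match of a greedy pattern with the literal string repl.

    skip_empty_after_match=True  : drop an empty match that starts exactly
                                   where the previous non-empty match ended
                                   (the behavior the article attributes to
                                   R's gsub).
    skip_empty_after_match=False : keep that empty match (the behavior
                                   reported for Perl and pcretest).
    """
    rx = re.compile(pattern)
    out = []
    pos = 0
    prev_end = -1  # end offset of the most recent non-empty match
    while pos <= len(text):
        m = rx.match(text, pos)  # anchored attempt at the current offset
        if m is not None and m.end() > pos:
            # Non-empty match: replace it and resume where it ended.
            out.append(repl)
            pos = m.end()
            prev_end = pos
            continue
        if m is not None:
            # Empty match: keep it unless the policy says to skip it here.
            if not (skip_empty_after_match and pos == prev_end):
                out.append(repl)
        # No match or empty match: copy one character and move on.
        if pos < len(text):
            out.append(text[pos])
        pos += 1
    return "".join(out)

print(gsub_sketch(r"b?", "!", "abc", skip_empty_after_match=True))
print(gsub_sketch(r"b?", "!", "abc", skip_empty_after_match=False))
print(gsub_sketch(r"\d*", "x", "x1", skip_empty_after_match=True))
print(gsub_sketch(r"\d*", "x", "x1", skip_empty_after_match=False))
```

With the flag on, the empty match right after "b" is dropped, which reproduces the "!a!c!" seen in R; with it off, all four matches are replaced. The same sketch also suggests why doubling the "b" first works as a workaround: the second "b" is a non-empty match at the position where an empty match would otherwise have been skipped.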
All that said, your larger point still seems valid, that we should expect to
see behavior consistent with the PCRE parser when we specify perl=TRUE, even if
that is a different answer than we get from R's default TRE parser when
perl=FALSE. And to take perl out of the equation, I also verified your test
directly with PCRE (8.39) on my Linux box using the `pcretest` command, and
sure enough, pcretest shows four matches to your example, consistent with an
answer of !a!!c! like you said. Perhaps at a minimum, the ?gsub or ?regex man
page should add a blurb indicating that the perl=TRUE behavior differs from
PCRE in the case of non-zero length matches adjacent to zero-length matches.
Though I'm not sure if this difference is known and intentional or just a side
effect of some other decision. R also supports Perl-style options embedded in
the pattern. For example, '(?i)' makes the pattern case-insensitive and '(?U)'
turns off greedy matching. I could imagine having the behavior you noted
depend on such an option as well, if someone was inclined to write a patch and
didn't want to change existing behavior.
However, to get the result you want, it seems you may unfortunately have to
use two calls to gsub, something like this:
> gsub("b?", "!", gsub("b", "bb", "abc"))
[1] "!a!!c!"
--Robert
-----Original Message-----
From: R-devel [mailto:r-devel-boun...@r-project.org] On Behalf Of Lipatz
Jean-Luc
Sent: Friday, July 21, 2017 5:27 AM
To: r-devel@r-project.org
Subject: [Rd] 'gsub' not perl compatible?
Hi all,
Working on some SAS program conversions, I was testing this (3.4.0 Windows, but
also 2.10.1 MacOsX):
gsub("b?","!","abc",perl=T)
which returns
[1] "!a!c!"
that I didn't understand.
Unfortunately, when asked for the same thing, SAS 9.4 replies "!a!!c!", and so
does Perl (Strawberry 5.26), which seems a more logical answer to me.
Is there some problem with PCRE or some subtlety that I didn't catch?
Results are similar with * instead of ?
and there is a similar issue with the lazy operator:
gsub("b??","!","abc",perl=T) gives "!a!b!c!", while the other software gives
"!a!!!c!"
Thanks
Jean-Luc LIPATZ
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel