I have the following two environment variables set:

LANGVAR=en_US.UTF-8
LANG=C

I don't know exactly what both of these mean, but I always deliberately set LANG=C in my .tcshrc files since that is necessary to get Acrobat Reader working on my Red Hat system. My guess is they were both set this way at build time.

When I run Brian's two alternatives, I *always* get 84, no matter how many times I repeat it. However, when I use \w+, I sometimes get 13 and sometimes get 84 (say, when repeated 1000 times).
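
A minimal sketch of that repetition check, for anyone who wants to try it on their own build. The vector `x_demo` is a made-up stand-in, not the posted data, so the counts will differ from 13 and 84; the point is only whether the count changes from run to run.

```r
# Hypothetical stand-in for the posted character vector (ASCII names only)
x_demo <- c("l1tmean", "lag3tmean", "ltmean", "tmean")

# Run the same grep() call 1000 times and record the match count each time
counts <- replicate(
  1000,
  length(grep("^l\\w+tmean", x_demo, perl = TRUE, value = TRUE))
)

# A build that behaves consistently should show a single distinct count here
print(table(counts))
```

If the table shows more than one distinct count, the result is varying between identical calls, which is the symptom described above.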

-roger

Prof Brian Ripley wrote:
This is actually PCRE.  Something is wrong with your build of R-patched
(1.9.1 alpha, I assume): I get 84 everywhere.  You are asking for a first
character l, then one or more characters of `word' then tmean.  In your
example this is the same as (in a suitable locale, including C)

length(grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE, value = TRUE))
length(grep("^l[[:alnum:]_]+tmean", x, perl = TRUE, value = TRUE))

which each give 84.
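
As a rough illustration of that equivalence, the sketch below runs the three patterns on a small made-up vector. `x_demo` and its entries are hypothetical (not the posted data) and deliberately contain only ASCII letters and digits, which is the case where the patterns should agree.

```r
# Hypothetical stand-in for the posted character vector
x_demo <- c("l1tmean", "l2tmean", "lag3tmean", "ltmean", "tmean", "xl1tmean")

# The original pattern and the two alternatives quoted above
pats <- c(
  word  = "^l\\w+tmean",
  ascii = "^l[A-Za-z0-9]+tmean",
  posix = "^l[[:alnum:]_]+tmean"
)

# Count the matches for each pattern using the PCRE engine
counts <- sapply(pats, function(p) {
  length(grep(p, x_demo, perl = TRUE, value = TRUE))
})

# All three counts should be identical for this input
print(counts)
```

Note that "ltmean" does not match any of the patterns, because at least one character is required between the leading l and tmean.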

One issue: PCRE is locale-dependent. Did you use the same locale for each? What happens if you force LANG=C?
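
One way to probe the locale question from inside a session is sketched below. This is an approximation: it switches LC_CTYPE with `Sys.setlocale()` rather than starting R with LANG=C in the environment, and `x_demo` is a made-up ASCII-only vector, not the posted data.

```r
# Hypothetical stand-in for the posted character vector (ASCII names only)
x_demo <- c("l1tmean", "l2tmean", "lag3tmean", "ltmean", "tmean")

count_matches <- function() {
  length(grep("^l\\w+tmean", x_demo, perl = TRUE, value = TRUE))
}

# Remember the current character-type locale so it can be restored
old_ctype <- Sys.getlocale("LC_CTYPE")
n_current <- count_matches()

# Switch to the C locale for character classification and count again
Sys.setlocale("LC_CTYPE", "C")
n_c <- count_matches()

# Restore the original setting
Sys.setlocale("LC_CTYPE", old_ctype)

cat("current locale:", n_current, " C locale:", n_c, "\n")
```

For ASCII-only names the two counts should agree, so a difference would point at the build rather than at the data.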

(I've just checked an R-devel Solaris system. This gave 13 on a build from Weds, and 84 when remade today. The result with 13 seems truncated, as they are the first 13. Might be coincidental, of course.)

On Fri, 11 Jun 2004, Roger D. Peng wrote:


I've noticed a change in the way grep() behaves between the 1.9.0 release and a recent R-patched. On 1.9.0 I get the following output:

> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
[1] 84

And on R-patched (2004-06-11) I get

> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
[1] 13

I can't come up with a simpler example which is why I've posted my actual character vector on the web (please let me know if there are problems downloading it).

I didn't find anything in the NEWS file that would indicate a change


No change is intended and the underlying C code is unchanged.


and another problem is that I'm not sure which behavior is correct. My knowledge of regular expressions is limited.



______________________________________________ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-devel
