Re: [Rd] grep

2016-10-03 Thread Prof Brian Ripley

On 02/10/2016 17:54, Pi wrote:

> Hello.
>
> It would be great if the grep function in R had the option to use the -m
> parameter as the linux command does.


I guess you mean the non-standard flag of the GNU version of grep 
(probably but not necessarily as used by Linux).


That the POSIX standard for grep does not have this (nor any other 
commonly used implementation I am aware of) indicates that your 
enthusiasm for this is not shared by grep experts.



> That would allow to stop a grep search as soon as something is found.
> It would make many operations much faster.


Those who would have to do the work to implement this will not be taking 
your word for that, but would expect convincing examples of real 
problems where it was so and grep was the bottleneck.


Your 'case' seems to be for a shortcut for any(grepl()) along the lines 
of anyDuplicated().
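
A minimal sketch of such a shortcut in plain R, assuming a chunked scan is acceptable. The names any_match and chunk are hypothetical and not part of base R, and the early exit only happens between chunks, not inside the C-level matcher:

```r
# Hypothetical helper (not in base R): return TRUE as soon as one chunk of
# 'x' contains a match, instead of running grepl() over the whole vector.
any_match <- function(pattern, x, chunk = 1000L, ...) {
  n <- length(x)
  start <- 1L
  while (start <= n) {
    end <- min(start + chunk - 1L, n)
    # '...' is passed on to grepl(), e.g. fixed = TRUE or ignore.case = TRUE
    if (any(grepl(pattern, x[start:end], ...))) return(TRUE)
    start <- end + 1L
  }
  FALSE
}

any_match("b", c("a", "b", "c"))
any_match("z", c("a", "b", "c"))
```

Whether this beats any(grepl()) depends on how early the first match occurs; when there is no match at all it does slightly more work.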



>   [[alternative HTML version deleted]]


This is a non-HTML list, as the posting guide told you.  And using a 
real name adds credibility.


--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] grep and PCRE fun

2011-09-30 Thread Prof Brian Ripley

On Fri, 30 Sep 2011, Simon Urbanek wrote:


> Jeff,
>
> this is really a bug in PCRE since the length (0) is a multiple of 3 as
> documented so PCRE should not be writing anything. Anyway, this has now been
> fixed (by Brian).

Only in R-devel: R-2-13-branch is now closed (and was by the time I
read the message).

> Cheers,
> Simon





--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [Rd] grep and PCRE fun

2011-09-30 Thread Simon Urbanek
Jeff,

this is really a bug in PCRE since the length (0) is a multiple of 3 as
documented so PCRE should not be writing anything. Anyway, this has now been
fixed (by Brian).

Cheers,
Simon




Re: [Rd] grep and PCRE fun

2011-09-29 Thread Henrik Bengtsson
On Thu, Sep 29, 2011 at 2:00 PM, Jeffrey Horner wrote:
> Hello,
>
> I think I've found a bug in the C function do_grep located in
> src/main/grep.c. It seems to affect both the latest revisions of
> R-2-13-branch and trunk when compiling R without optimizations and
> with its own version of pcre located in src/extra, at least on Ubuntu
> 10.04.
>
> According to the pcre_exec API (I presume the later versions), the
> ovecsize argument must be a multiple of 3, and the ovector argument
> must point to a location that can hold at least ovecsize integers. All
> the pcre_exec calls made by do_grep, save one, honor this. That one
> call seems to overwrite areas of the stack it shouldn't. Here's the
> smallest example I found that tickles the bug:
>
>> grep("[^[:blank][:cntrl]]","\\n",perl=TRUE)
> Error in grep("[^[:blank][:cntrl]]", "\\n", perl = TRUE) :
>  negative length vectors are not allowed
>
> As described above, this error occurs on Ubuntu 10.04 when R is
> compiled without optimizations (I typically use CFLAGS="-ggdb"
> CXXFLAGS="-ggdb" FFLAGS="-ggdb" ./configure --enable-R-shlib), and the
> pcre_exec call executed from do_grep overwrites the integer nmatches
> and sets it to -1. This has the effect of making do_grep try and
> allocate a results vector of length -1, which of course causes the
> error message above.
>
> I'd be interested to know if this bug happens on other platforms.

With R devel (2011-09-28 r57099) and R v2.13.1 patched (2011-09-05
r56953) on Windows 7 64-bit you get:

> grep("[^[:blank][:cntrl]]","\\n",perl=TRUE)
integer(0)

/Henrik

>
> Below is my simple fix for R-2-13-branch (a similar fix works for
> trunk as well).
>
> Jeff
>
> $ svn diff main/grep.c
> Index: main/grep.c
> ===================================================================
> --- main/grep.c (revision 57110)
> +++ main/grep.c (working copy)
> @@ -723,7 +723,7 @@
>  {
>      SEXP pat, text, ind, ans;
>      regex_t reg;
> -    int i, j, n, nmatches = 0, ov, rc;
> +    int i, j, n, nmatches = 0, ov[3], rc;
>      int igcase_opt, value_opt, perl_opt, fixed_opt, useBytes, invert;
>      const char *spat = NULL;
>      pcre *re_pcre = NULL /* -Wall */;
> @@ -882,7 +882,7 @@
>             if (fixed_opt)
>                 LOGICAL(ind)[i] = fgrep_one(spat, s, useBytes, use_UTF8, NULL) >= 0;
>             else if (perl_opt) {
> -               if (pcre_exec(re_pcre, re_pe, s, strlen(s), 0, 0, &ov, 0) >= 0)
> +               if (pcre_exec(re_pcre, re_pe, s, strlen(s), 0, 0, ov, 3) >= 0)
>                     INTEGER(ind)[i] = 1;
>             } else {
>                 if (!use_WC)
>


Re: [Rd] grep and PCRE fun

2011-09-29 Thread Jeffrey Horner
On Thu, Sep 29, 2011 at 4:00 PM, Jeffrey Horner wrote:
> Hello,
>
> I think I've found a bug in the C function do_grep located in
> src/main/grep.c. It seems to affect both the latest revisions of
> R-2-13-branch and trunk when compiling R without optimizations and
> with its own version of pcre located in src/extra, at least on Ubuntu
> 10.04.
>
> According to the pcre_exec API (I presume the later versions), the
> ovecsize argument must be a multiple of 3, and the ovector argument
> must point to a location that can hold at least ovecsize integers. All
> the pcre_exec calls made by do_grep, save one, honor this. That one
> call seems to overwrite areas of the stack it shouldn't. Here's the
> smallest example I found that tickles the bug:
>
>> grep("[^[:blank][:cntrl]]","\\n",perl=TRUE)
> Error in grep("[^[:blank][:cntrl]]", "\\n", perl = TRUE) :
>  negative length vectors are not allowed

As many of you know, that regex is invalid. It's just the one I
happened upon that tickled the bug. It actually came from an error
that occurred when building R itself. Here's a snippet of my make log:

make[1]: Leaving directory `/home/hornerj/R-sources/branches/R-2-13-branch/po'
you should 'make docs' now ...
make[1]: Entering directory `/home/hornerj/R-sources/branches/R-2-13-branch/doc'
Error in grep("[^[:blank:][:cntrl:]]", unlist(Rd[sections == "TEXT"]),  :
  negative length vectors are not allowed
Calls: saveRDS ->  -> prepare2_Rd -> grep
Execution halted
make[1]: *** [NEWS.rds] Error 1
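
For comparison, a small sketch of the well-formed pattern from the make log, where each POSIX class name is closed with ":]". The input vector is made up for illustration; the pattern asks for at least one character that is neither blank nor a control character:

```r
# Illustrative input: newline only, ordinary text, tab only, empty string.
x <- c("\n", "a b", "\t", "")

# TRUE where the string holds a character outside [:blank:] and [:cntrl:].
hits <- grepl("[^[:blank:][:cntrl:]]", x, perl = TRUE)
hits  # FALSE TRUE FALSE FALSE
```

The short reproducer above drops the closing colons, so it is not the class syntax that was intended; it is useful only because it triggers the overwrite.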

>
> As described above, this error occurs on Ubuntu 10.04 when R is
> compiled without optimizations (I typically use CFLAGS="-ggdb"
> CXXFLAGS="-ggdb" FFLAGS="-ggdb" ./configure --enable-R-shlib), and the
> pcre_exec call executed from do_grep overwrites the integer nmatches
> and sets it to -1. This has the effect of making do_grep try and
> allocate a results vector of length -1, which of course causes the
> error message above.
>
> I'd be interested to know if this bug happens on other platforms.
>
> Below is my simple fix for R-2-13-branch (a similar fix works for
> trunk as well).
>
> Jeff
>
> $ svn diff main/grep.c
> Index: main/grep.c
> ===================================================================
> --- main/grep.c (revision 57110)
> +++ main/grep.c (working copy)
> @@ -723,7 +723,7 @@
>  {
>      SEXP pat, text, ind, ans;
>      regex_t reg;
> -    int i, j, n, nmatches = 0, ov, rc;
> +    int i, j, n, nmatches = 0, ov[3], rc;
>      int igcase_opt, value_opt, perl_opt, fixed_opt, useBytes, invert;
>      const char *spat = NULL;
>      pcre *re_pcre = NULL /* -Wall */;
> @@ -882,7 +882,7 @@
>             if (fixed_opt)
>                 LOGICAL(ind)[i] = fgrep_one(spat, s, useBytes, use_UTF8, NULL) >= 0;
>             else if (perl_opt) {
> -               if (pcre_exec(re_pcre, re_pe, s, strlen(s), 0, 0, &ov, 0) >= 0)
> +               if (pcre_exec(re_pcre, re_pe, s, strlen(s), 0, 0, ov, 3) >= 0)
>                     INTEGER(ind)[i] = 1;
>             } else {
>                 if (!use_WC)
>



Re: [Rd] grep problem in R-devel 2.14 r57004

2011-09-16 Thread Simon Urbanek
I forgot to mention the more obvious ;) - yes, it is a known issue in PCRE 8.13 
which is hitting more people.
After re-reading the standard I think the problem was that PCRE did not require 
enclosing [ to treat [. as special.  This has been addressed in the PCRE trunk 
since and it also has a comment on what happened. I have ported that fix into 
R-devel. 

Cheers,
Simon




Re: [Rd] grep problem in R-devel 2.14 r57004

2011-09-16 Thread Simon Urbanek
Mark, quick googling gives the answer: [.] is not what you think it is; you
probably meant [\.]. Bracket expressions starting with [. are collating symbols,
which are unsupported by PCRE (only [:xxx:] is supported; neither [=xxx=] nor
[.xxx.] is), but that's probably not what you intended. See POSIX:

9.3.5 RE Bracket Expression
[...]
1. [..] The character sequences "[.", "[=", and "[:" (left-bracket followed by 
a period, equals-sign, or colon) shall be special inside a bracket expression 
and are used to delimit collating symbols, equivalence class expressions, and 
character class expressions.
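
Ways to match two literal dots that avoid the "[." sequence altogether. This is a sketch with a made-up input vector, not a statement about what any particular PCRE version accepts:

```r
x <- c("a..b", "a.b", "ab")

# Literal match: no regular-expression syntax is interpreted at all.
m_fixed <- grepl("..", x, fixed = TRUE)

# Backslash-escaped dots, with the default engine and with PCRE.
m_regex <- grepl("\\.\\.", x)
m_perl  <- grepl("\\.\\.", x, perl = TRUE)

m_fixed  # TRUE FALSE FALSE
```

All three calls select only the first element.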

Cheers,
Simon



On Sep 16, 2011, at 12:45 AM, Mark Bravington wrote:

> Problem below with PCRE grep in R-devel; works fine in R-patched. (Unless 
> there's been an absolutely massive change in rules for updated PCRE version 
> 8.13; jeez I hope not)
> 
>> grep( '[.][.]', '', perl=TRUE)
> Error in grep("[.][.]", "", perl = TRUE) :
>  invalid regular expression '[.][.]'
> In addition: Warning message:
> In grep("[.][.]", "", perl = TRUE) : PCRE pattern compilation error
>   'POSIX collating elements are not supported'
>   at '[.][.]'
> 
>> sessionInfo()
> R Under development (unstable) (2011-09-13 r57004)
> Platform: i386-pc-mingw32/i386 (32-bit)
> 
> locale:
> [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
> [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
> [5] LC_TIME=English_Australia.1252
> 
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base
> 
> NB I'm sending to R-devel rather than posting a bug report because (i) I have 
> a dim recollection that's what we're supposed to do for bugs in R-devel, and 
> (ii) Bugzilla doesn't include an R-devel version and (iii) couldn't find any 
> guidance on these matters.
> 
> Mark
> 
> -- 
> Mark Bravington
> CSIRO Mathematical & Information Sciences
> Marine Laboratory
> Castray Esplanade
> Hobart 7001
> TAS
> 
> ph (+61) 3 6232 5118
> fax (+61) 3 6232 5012
> mob (+61) 438 315 623
> 


Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE

2007-05-17 Thread Prof Brian Ripley
On Thu, 17 May 2007, Petr Savicky wrote:

>> strncasecmp is not standard C (not even C99), but R does have a substitute
>> for it.  Unfortunately strncasecmp is not usable with multibyte charsets:
>> Linux systems have wcsncasecmp but that is not portable.  In these days of
>> widespread use of UTF-8 that is a blocking issue, I am afraid.
>
> What could help are the functions mbrtowc and towctrans and simple
> long integer comparison. Are the functions mbrtowc and towctrans
> available under Windows? mbrtowc seems to be available as Rmbrtowc
> in src/gnuwin32/extra.c.
>
> I did not find towctrans defined in R sources, but it is in 
> gnuwin32/Rdll.hide

I don't see it in Rdll.hide.  It is a C99 function (see your unix man 
page).

> and used in do_tolower. Does this mean that tolower is not usable
> with utf-8 under Windows?

UTF-8 is not usable under Windows, but tolower works in Windows DBCS (in 
so far as that makes sense: Chinese chars do not have 'case').

Rmbrtowc reflects an attempt to add UTF-8 support on Windows, but that is 
not currently active.

>> In the case of grep I think all you need is
>>
>> grep(tolower(pattern), tolower(x), fixed = TRUE)
>>
>> and similarly for regexpr.
>
> Yes, this is correct, but it has disadvantages. It needs more
> space and, if value=TRUE, we would have to do something like
>   x[grep(tolower(pattern), tolower(x), fixed = TRUE, value=FALSE)]
> This is hard to implement in src/library/base/R/grep.R,
> where the call to .Internal(grep(pattern,...)) is the last command
> and I think this should be preserved.
>
>>> Ignore case option is not meaningful in gsub.
>>
>> sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)
>>
>> is different from 'ignore.case=FALSE', and I see the meaning as clear.
>> So what did you mean?  (Unfortunately the tolower trick does not work for
>> [g]sub.)
>
> The meaning of ignore.case in [g]sub is problematic due to the following.
>  sub("abc", "xyz", c("ABCD", "abcd"), ignore.case=TRUE)
> produces
>  [1] "xyzD" "xyzd"
> but the user may in fact need the following
>  [1] "XYZD" "xyzd"

He may, but that is not what 'ignore case' means, more like 'case 
honouring'.

> It is correct that "xyzD" "xyzd" is produced, but the user
> should be aware of the fact that several substitutions like
>  x <- sub("abc", "xyz", c("ABCD", "abcd"))   # ignore.case=FALSE
>  sub("ABC", "XYZ", x)  # ignore.case=FALSE
> may be more useful.
>
> I have another question concerning the speed of grep. I expected that
> fgrep_one function is slower than calling a library routine
> for regular expressions. In particular, if the pattern has a lot of
> long partial matches in the target string, I expected that it may be much
> slower. A short example is
>  y <- "ab"
>  x <- "aaab"
>  grep(y,x)
> which requires 110 comparisons (10 comparisons for each of 11 possible
> beginnings of y in x). In general, the complexity in the worst case is
> O(m*n), where m,n are the lengths of y,x resp. I would expect that
> the library function for matching regular expressions needs
> time O(m+n) and so will be faster. However, the result obtained
> on a larger example is
>
>  > x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
>  > x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
>  > y <- paste(c(rep("a", times = 1), x2), collapse = "")
>  > z <- rep(y, times = 100)
>
>  > system.time(i <- grep(x1, z, fixed = T))
>  [1] 1.970 0.000 1.985 0.000 0.000
>
>  > system.time(i <- grep(x1, z, fixed = F))   # reg. expr. surprisingly slow (*)
>  [1] 40.374  0.003 40.381  0.000  0.000
>
>  > system.time(i <- grep(x2, z, fixed = T))
>  [1] 0.113 0.000 0.113 0.000 0.000
>
>  > system.time(i <- grep(x2, z, fixed = F))  # reg. expr. faster than fgrep_one
>  [1] 0.019 0.000 0.019 0.000 0.000
>
> Do you have an explanation of these results, in particular (*)?

Yes, there is a comment on the help page to that effect.  But these are 
highly atypical uses. Try perl=TRUE, and be aware that the locale matters 
a lot in such tests (via the charset).
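
A rough sketch of how the three matching modes might be compared on one machine. The sizes are deliberately smaller than in the quoted example and are an assumption of this sketch; no timings are shown because they depend on the R build, the regex libraries and the locale:

```r
# Pattern: 200 "a" followed by "b". Texts: one without and one with a match.
pat <- paste(c(rep("a", 200), "b"), collapse = "")
txt <- c(strrep("a", 2000), paste0(strrep("a", 2000), "b"))
z   <- rep(txt, times = 5)

t_fixed <- system.time(i_fixed <- grep(pat, z, fixed = TRUE))
t_regex <- system.time(i_regex <- grep(pat, z))
t_perl  <- system.time(i_perl  <- grep(pat, z, perl = TRUE))

# Elapsed seconds per mode; values vary from system to system.
c(fixed = t_fixed[["elapsed"]],
  regex = t_regex[["elapsed"]],
  perl  = t_perl[["elapsed"]])
```

All three calls should return the same indices, so any difference is purely one of speed.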

No one is attempting to make R a fast string-processing language and so
developers' resources are spent on performance where it matters to more
typical usage.  (E.g. reducing duplication in as.double and friends speeds
up just about every R session, and speeds up some numerical sessions
dramatically.)

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE

2007-05-17 Thread Petr Savicky
> strncasecmp is not standard C (not even C99), but R does have a substitute 
> for it.  Unfortunately strncasecmp is not usable with multibyte charsets: 
> Linux systems have wcsncasecmp but that is not portable.  In these days of 
> widespread use of UTF-8 that is a blocking issue, I am afraid.

What could help are the functions mbrtowc and towctrans and simple
long integer comparison. Are the functions mbrtowc and towctrans
available under Windows? mbrtowc seems to be available as Rmbrtowc
in src/gnuwin32/extra.c.

I did not find towctrans defined in R sources, but it is in gnuwin32/Rdll.hide
and used in do_tolower. Does this mean that tolower is not usable
with utf-8 under Windows?

> In the case of grep I think all you need is
> 
> grep(tolower(pattern), tolower(x), fixed = TRUE)
> 
> and similarly for regexpr.

Yes, this is correct, but it has disadvantages. It needs more
space and, if value=TRUE, we would have to do something like
   x[grep(tolower(pattern), tolower(x), fixed = TRUE, value=FALSE)]
This is hard to implement in src/library/base/R/grep.R,
where the call to .Internal(grep(pattern,...)) is the last command
and I think this should be preserved.

> >Ignore case option is not meaningful in gsub.
> 
> sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)
> 
> is different from 'ignore.case=FALSE', and I see the meaning as clear.
> So what did you mean?  (Unfortunately the tolower trick does not work for 
> [g]sub.)

The meaning of ignore.case in [g]sub is problematic due to the following.
  sub("abc", "xyz", c("ABCD", "abcd"), ignore.case=TRUE)
produces
  [1] "xyzD" "xyzd"
but the user may in fact need the following
  [1] "XYZD" "xyzd"

It is correct that "xyzD" "xyzd" is produced, but the user
should be aware of the fact that several substitutions like 
  x <- sub("abc", "xyz", c("ABCD", "abcd"))   # ignore.case=FALSE
  sub("ABC", "XYZ", x)  # ignore.case=FALSE
may be more useful.

I have another question concerning the speed of grep. I expected that
fgrep_one function is slower than calling a library routine
for regular expressions. In particular, if the pattern has a lot of
long partial matches in the target string, I expected that it may be much
slower. A short example is
  y <- "ab"
  x <- "aaab"
  grep(y,x)
which requires 110 comparisons (10 comparisons for each of 11 possible
beginnings of y in x). In general, the complexity in the worst case is
O(m*n), where m,n are the lengths of y,x resp. I would expect that
the library function for matching regular expressions needs
time O(m+n) and so will be faster. However, the result obtained
on a larger example is

  > x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
  > x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
  > y <- paste(c(rep("a", times = 1), x2), collapse = "")
  > z <- rep(y, times = 100)

  > system.time(i <- grep(x1, z, fixed = T))
  [1] 1.970 0.000 1.985 0.000 0.000

  > system.time(i <- grep(x1, z, fixed = F))   # reg. expr. surprisingly slow (*)
  [1] 40.374  0.003 40.381  0.000  0.000

  > system.time(i <- grep(x2, z, fixed = T))
  [1] 0.113 0.000 0.113 0.000 0.000

  > system.time(i <- grep(x2, z, fixed = F))  # reg. expr. faster than fgrep_one
  [1] 0.019 0.000 0.019 0.000 0.000

Do you have an explanation of these results, in particular (*)?
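
The count of 110 comparisons can be checked with a small sketch of naive substring search. The helper naive_count is hypothetical and is not the code of fgrep_one. The strings are also an assumption: the archived "ab" and "aaab" appear to have been shortened, so the sketch uses a 10-character pattern and a 20-character text, which gives the 11 start positions mentioned above:

```r
# Hypothetical helper: count character comparisons made by a naive search
# that tries each start position in turn and stops at the first full match.
naive_count <- function(pattern, text) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(text, "")[[1]]
  m <- length(p)
  n <- length(s)
  comparisons <- 0L
  if (m == 0L || m > n) return(comparisons)
  for (start in seq_len(n - m + 1L)) {
    matched <- TRUE
    for (k in seq_len(m)) {
      comparisons <- comparisons + 1L
      if (s[start + k - 1L] != p[k]) {
        matched <- FALSE
        break
      }
    }
    if (matched) break
  }
  comparisons
}

y <- paste0(strrep("a", 9), "b")    # assumed pattern: 9 "a" then "b"
x <- paste0(strrep("a", 19), "b")   # assumed text: 19 "a" then "b"
naive_count(y, x)                   # 110
naive_count("ab", "aaab")           # 6, for the strings as archived
```

Each of the first ten start positions fails on the tenth character and the eleventh matches, which gives 11 * 10 = 110 comparisons and illustrates the O(m*n) worst case.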

Petr.



Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE

2007-05-14 Thread Prof Brian Ripley
On Fri, 11 May 2007, Petr Savicky wrote:

> On Wed, May 09, 2007 at 06:41:23AM +0100, Prof Brian Ripley wrote:
>> I suggest you collaborate with the person who replied that he thought this
>> was a good idea to supply patches against the R-devel sources for
>> scrutiny.
>
> A possible solution is to use strncasecmp instead of strncmp
> in function fgrep_one in R-devel/src/main/character.c.
>
> Corresponding modification of character.c is at
>  http://www.cs.cas.cz/~savicky/ignore_case/character.c
> and diff file w.r.t. the original character.c (downloaded today) is at
>  http://www.cs.cas.cz/~savicky/ignore_case/diff.txt
>
> This seems to work in my installation of R-devel:
>
>  > x <- c("D.G cat", "d.g cat", "dog cat")
>  > z <- "d.g"
>  > grep(z, x, ignore.case = F, fixed = T)
>  [1] 2
>  > grep(z, x, ignore.case = T, fixed = T)  # this is the new behavior
>  [1] 1 2
>  > grep(z, x, ignore.case = T, fixed = F)
>  [1] 1 2 3
>  >
>
> Since fgrep_one is used many times in character.c, adding igcase_opt as
> an additional argument would imply extensive changes to the file.
> So, I introduced a new function fgrep_one_igcase called only once in
> the file. Another solution is possible.
>
> I do not understand well handling multibyte chars, so I did not test
> the function with real multibyte chars, although the code for
> this option is used.

Thanks for looking into this.

strncasecmp is not standard C (not even C99), but R does have a substitute 
for it.  Unfortunately strncasecmp is not usable with multibyte charsets: 
Linux systems have wcsncasecmp but that is not portable.  In these days of 
widespread use of UTF-8 that is a blocking issue, I am afraid.

In the case of grep I think all you need is

grep(tolower(pattern), tolower(x), fixed = TRUE)

and similarly for regexpr.
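
A short sketch of that idiom on the example vector from the earlier message. Indexing the original vector with the result recovers the values in their original case; the positions from regexpr refer to the lower-cased strings, which this sketch assumes have the same character counts as the originals:

```r
x <- c("D.G cat", "d.g cat", "dog cat")
z <- "d.g"

# Case-insensitive literal match: lower-case both sides, then match as is.
idx <- grep(tolower(z), tolower(x), fixed = TRUE)
idx       # 1 2
x[idx]    # "D.G cat" "d.g cat"

# The same idea for regexpr; -1 marks elements without a match.
pos <- as.integer(regexpr(tolower(z), tolower(x), fixed = TRUE))
pos       # 1 1 -1
```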

> Ignore case option is not meaningful in gsub.

sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)

is different from 'ignore.case=FALSE', and I see the meaning as clear.
So what did you mean?  (Unfortunately the tolower trick does not work for 
[g]sub.)
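
A small sketch of the difference, and of why lower-casing the input first is not a substitute for [g]sub; the variable names are only for illustration:

```r
x <- c("ABCD", "abcd")

with_ic    <- sub("abc", "123", x, ignore.case = TRUE)
without_ic <- sub("abc", "123", x, ignore.case = FALSE)
with_ic     # "123D" "123d"
without_ic  # "ABCD" "123d"

# Lower-casing first also changes the text that is not replaced.
lowered <- sub("abc", "123", tolower(x), fixed = TRUE)
lowered     # "123d" "123d"
```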

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE

2007-05-11 Thread Petr Savicky
On Wed, May 09, 2007 at 06:41:23AM +0100, Prof Brian Ripley wrote:
> I suggest you collaborate with the person who replied that he thought this 
> was a good idea to supply patches against the R-devel sources for 
> scrutiny.

A possible solution is to use strncasecmp instead of strncmp
in function fgrep_one in R-devel/src/main/character.c.

Corresponding modification of character.c is at
  http://www.cs.cas.cz/~savicky/ignore_case/character.c
and diff file w.r.t. the original character.c (downloaded today) is at
  http://www.cs.cas.cz/~savicky/ignore_case/diff.txt

This seems to work in my installation of R-devel:

  > x <- c("D.G cat", "d.g cat", "dog cat")
  > z <- "d.g"
  > grep(z, x, ignore.case = F, fixed = T)
  [1] 2
  > grep(z, x, ignore.case = T, fixed = T)  # this is the new behavior
  [1] 1 2
  > grep(z, x, ignore.case = T, fixed = F)
  [1] 1 2 3
  >

Since fgrep_one is used many times in character.c, adding igcase_opt as
an additional argument would imply extensive changes to the file.
So, I introduced a new function fgrep_one_igcase called only once in
the file. Another solution is possible.

I do not understand well handling multibyte chars, so I did not test
the function with real multibyte chars, although the code for
this option is used.

Ignore case option is not meaningful in gsub. It could be meaningful
in regexpr, however, this function does not allow ignore.case option,
so I made no changes to it.

All the best, Petr.



Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE

2007-05-08 Thread Prof Brian Ripley
On Mon, 7 May 2007, Petr Savicky wrote:

> Dear R developers,
>
> I suggest to modify the behaviour of "grep" function with fixed=TRUE option.
>
> Currently, fixed=TRUE implies ignore.case=FALSE (overrides ignore.case=TRUE,
> if set by the user).

As it clearly says it does.

> I suggest to keep ignore.case as set by the user even if fixed=TRUE. Since
> the default of ignore.case is FALSE, this would not change the behaviour
> of grep, if the user does not set ignore.case explicitly.
>
> In my opinion, fixed=TRUE is most useful for suppressing meta-character
> expansion. On the other hand, for a simple word search, ignoring
> case is sometimes useful.

Well, it was written to use in R's own code as a quick way to match a 
fixed sequence of bytes.  It is not suitable for a 'word' search as it 
does not (just) match to words.

> If for some reason, it is better to keep the current behavior of grep, then I
> suggest to extend the documentation as follows:
>
> ORIGINAL:
>   fixed: logical.  If 'TRUE', 'pattern' is a string to be matched as
>  is.  Overrides all conflicting arguments.
>
> SUGGESTED:
>   fixed: logical.  If 'TRUE', 'pattern' is a string to be matched as
>  is.  Overrides all conflicting arguments including ignore.case.

Oh come on, ignore.case clearly conflicts with 'as is'!
Adding unnecessary qualifiers just makes the text harder to read.


I suggest you collaborate with the person who replied that he thought this 
was a good idea to supply patches against the R-devel sources for 
scrutiny.

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK  Fax:  +44 1865 272595



Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE

2007-05-07 Thread Gabor Grothendieck
Seems like a good idea to me.

Here is a workaround that works in any event: it combines (?i), \Q and \E
to get the same effect.  (?i) gives case-insensitive matches, and \Q and \E
quote and endquote the intervening text, disabling special characters:

x <- c("D.G cat", "d.g cat", "dog cat")
z <- "d.g"
rx <- paste("(?i)\\Q", z, "\\E", sep = "")
grep(rx, x, perl = TRUE)  # 1 2
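
Another possible workaround, sketched under the assumption that simple
case folding with tolower() is adequate for the data at hand (it may not
be in every locale), is to fold both the pattern and the text and keep
fixed = TRUE:

```r
x <- c("D.G cat", "d.g cat", "dog cat")
z <- "d.g"

# Fold case on both sides, then match the literal string
hits <- grep(tolower(z), tolower(x), fixed = TRUE)
```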


On 5/7/07, Petr Savicky <[EMAIL PROTECTED]> wrote:
> Dear R developers,
>
> I suggest to modify the behaviour of "grep" function with fixed=TRUE option.
>
> Currently, fixed=TRUE implies ignore.case=FALSE (overrides ignore.case=TRUE,
> if set by the user).
>
> I suggest to keep ignore.case as set by the user even if fixed=TRUE. Since
> the default of ignore.case is FALSE, this would not change the behaviour
> of grep, if the user does not set ignore.case explicitly.
>
> In my opinion, fixed=TRUE is most useful for suppressing meta-character
> expansion. On the other hand, for a simple word search, ignoring
> case is sometimes useful.
>
> If for some reason, it is better to keep the current behavior of grep, then I
> suggest to extend the documentation as follows:
>
> ORIGINAL:
>   fixed: logical.  If 'TRUE', 'pattern' is a string to be matched as
>  is.  Overrides all conflicting arguments.
>
> SUGGESTED:
>   fixed: logical.  If 'TRUE', 'pattern' is a string to be matched as
>  is.  Overrides all conflicting arguments including ignore.case.
>
> All the best, Petr Savicky.
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



Re: [Rd] grep() and factors

2006-06-06 Thread Marc Schwartz (via MN)
On Tue, 2006-06-06 at 17:08 +0100, Prof Brian Ripley wrote:
> On Tue, 6 Jun 2006, Marc Schwartz (via MN) wrote:
> 
> > On Tue, 2006-06-06 at 11:12 +0100, Prof Brian Ripley wrote:
> >> On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:
> >>
> >>> Hi all,
> >>>
> >>> Based upon an offlist communication this morning, I am somewhat confused
> >>> (more than I usually am on most Monday mornings...) about the use of
> >>> grep() with factors as the 'x' argument.
> >>>
> >>> The argument guidance in ?grep indicates:
> >>>
> >>> x, text a character vector where matches are sought. Coerced to
> >>>character if possible.
> >>>
> >>> and in the Details section:
> >>>
> >>> Arguments which should be character strings or character vectors are
> >>> coerced to character if possible.
> >>>
> >>>
> >>> The wording of both would seem to reasonably lead to the conclusion that
> >>> a factor could be coerced to a character vector by the use of
> >>> as.character(FACTOR).
> >>
> >> Well, that is not what is meant by the wording, nor what happens: there is
> >> no method dispatch so the factor is coerced from an integer vector to a
> >> character vector.  'coerced' usually means at low level: where
> >> as.character() is involved we tend to say so.
> >>
> >> As for the comments on what happens if value=TRUE: if the 'x' has been
> >> coerced, I would expect the value to be based on the coerced value (and it
> >> currently is).
> >>
> >>> grep("1", factor(letters))
> >>   [1]  1 10 11 12 13 14 15 16 17 18 19 21
> >>> grep("1", factor(letters), value=TRUE)
> >>   [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "21"
> >>
> >> So whereas I am quite happy to replace the low-level coercion by method
> >> dispatch on as.character, I don't think this should be altered (and am
> >> pretty sure there is code out there which expects a character vector
> >> result).
> >
> > Prof. Ripley,
> >
> > Thanks for your reply and clarification.
> >
> > I would acknowledge that the coercion of a factor to its numeric values
> > would not be immediately intuitive to me (or others who have commented
> > on this) within the context of grep(). However, in light of your
> > comments and having reviewed the C code, it does make sense.
> >
> > Given this behavior, it would seem reasonable to provide a clarification
> > in ?grep, perhaps as follows:
> >
> > Arguments
> >
> > x, text a character vector where matches are sought. Coerced to
> > character if possible. See Details for factors.
> >
> >
> > Details
> >
> > Arguments which should be character strings or character vectors are
> > coerced to character if possible. In the case of factors, these are
> > coerced using as.integer(x). You must explicitly coerce the factor using
> > as.character(x) to use these functions on the character vector
> > equivalent.
> 
> I do think we should `replace the low-level coercion by method dispatch on 
> as.character', and have done so in R-devel (but am still testing 
> packages).  There have been quite a few instances of such low-level 
> coercion (including for dimnames), and I am currently looking through to 
> see if there are any others that either should be altered or the 
> documentation clarified.

Prof. Ripley,

I did not want to presume that you would indeed do this or, what is more,
had already done so. Though given your additional comments, I now note that
this is mentioned in the NEWS file for R-devel.

I do sincerely appreciate your efforts here.

Perhaps an interim change in ?grep as above for 2.3.1patched might be
considered, though now with an additional comment that this approach
will (might) change in 2.4.0?

I have added Bill Dunlap as a cc: here, given his expressed desire to be
consistent with R on this point.

Regards,

Marc



Re: [Rd] grep() and factors

2006-06-06 Thread Prof Brian Ripley
On Tue, 6 Jun 2006, Marc Schwartz (via MN) wrote:

> On Tue, 2006-06-06 at 11:12 +0100, Prof Brian Ripley wrote:
>> On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:
>>
>>> Hi all,
>>>
>>> Based upon an offlist communication this morning, I am somewhat confused
>>> (more than I usually am on most Monday mornings...) about the use of
>>> grep() with factors as the 'x' argument.
>>>
>>> The argument guidance in ?grep indicates:
>>>
>>> x, text a character vector where matches are sought. Coerced to
>>>character if possible.
>>>
>>> and in the Details section:
>>>
>>> Arguments which should be character strings or character vectors are
>>> coerced to character if possible.
>>>
>>>
>>> The wording of both would seem to reasonably lead to the conclusion that
>>> a factor could be coerced to a character vector by the use of
>>> as.character(FACTOR).
>>
>> Well, that is not what is meant by the wording, nor what happens: there is
>> no method dispatch so the factor is coerced from an integer vector to a
>> character vector.  'coerced' usually means at low level: where
>> as.character() is involved we tend to say so.
>>
>> As for the comments on what happens if value=TRUE: if the 'x' has been
>> coerced, I would expect the value to be based on the coerced value (and it
>> currently is).
>>
>>> grep("1", factor(letters))
>>   [1]  1 10 11 12 13 14 15 16 17 18 19 21
>>> grep("1", factor(letters), value=TRUE)
>>   [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "21"
>>
>> So whereas I am quite happy to replace the low-level coercion by method
>> dispatch on as.character, I don't think this should be altered (and am
>> pretty sure there is code out there which expects a character vector
>> result).
>
> Prof. Ripley,
>
> Thanks for your reply and clarification.
>
> I would acknowledge that the coercion of a factor to its numeric values
> would not be immediately intuitive to me (or others who have commented
> on this) within the context of grep(). However, in light of your
> comments and having reviewed the C code, it does make sense.
>
> Given this behavior, it would seem reasonable to provide a clarification
> in ?grep, perhaps as follows:
>
> Arguments
>
> x, text a character vector where matches are sought. Coerced to
> character if possible. See Details for factors.
>
>
> Details
>
> Arguments which should be character strings or character vectors are
> coerced to character if possible. In the case of factors, these are
> coerced using as.integer(x). You must explicitly coerce the factor using
> as.character(x) to use these functions on the character vector
> equivalent.

I do think we should `replace the low-level coercion by method dispatch on 
as.character', and have done so in R-devel (but am still testing 
packages).  There have been quite a few instances of such low-level 
coercion (including for dimnames), and I am currently looking through to 
see if there are any others that either should be altered or the 
documentation clarified.

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK  Fax:  +44 1865 272595



Re: [Rd] grep() and factors

2006-06-06 Thread Marc Schwartz (via MN)
On Tue, 2006-06-06 at 11:12 +0100, Prof Brian Ripley wrote:
> On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:
> 
> > Hi all,
> >
> > Based upon an offlist communication this morning, I am somewhat confused
> > (more than I usually am on most Monday mornings...) about the use of
> > grep() with factors as the 'x' argument.
> >
> > The argument guidance in ?grep indicates:
> >
> > x, text a character vector where matches are sought. Coerced to
> >character if possible.
> >
> > and in the Details section:
> >
> > Arguments which should be character strings or character vectors are
> > coerced to character if possible.
> >
> >
> > The wording of both would seem to reasonably lead to the conclusion that
> > a factor could be coerced to a character vector by the use of
> > as.character(FACTOR).
> 
> Well, that is not what is meant by the wording, nor what happens: there is 
> no method dispatch so the factor is coerced from an integer vector to a 
> character vector.  'coerced' usually means at low level: where 
> as.character() is involved we tend to say so.
> 
> As for the comments on what happens if value=TRUE: if the 'x' has been 
> coerced, I would expect the value to be based on the coerced value (and it 
> currently is).
> 
> > grep("1", factor(letters))
>   [1]  1 10 11 12 13 14 15 16 17 18 19 21
> > grep("1", factor(letters), value=TRUE)
>   [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "21"
> 
> So whereas I am quite happy to replace the low-level coercion by method 
> dispatch on as.character, I don't think this should be altered (and am 
> pretty sure there is code out there which expects a character vector 
> result).

Prof. Ripley,

Thanks for your reply and clarification.

I would acknowledge that the coercion of a factor to its numeric values
would not be immediately intuitive to me (or others who have commented
on this) within the context of grep(). However, in light of your
comments and having reviewed the C code, it does make sense.

Given this behavior, it would seem reasonable to provide a clarification
in ?grep, perhaps as follows:

Arguments

x, text a character vector where matches are sought. Coerced to
character if possible. See Details for factors.


Details

Arguments which should be character strings or character vectors are
coerced to character if possible. In the case of factors, these are
coerced using as.integer(x). You must explicitly coerce the factor using
as.character(x) to use these functions on the character vector
equivalent.


Thanks for your consideration.

Regards,

Marc Schwartz



Re: [Rd] grep() and factors

2006-06-06 Thread Prof Brian Ripley
On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:

> Hi all,
>
> Based upon an offlist communication this morning, I am somewhat confused
> (more than I usually am on most Monday mornings...) about the use of
> grep() with factors as the 'x' argument.
>
> The argument guidance in ?grep indicates:
>
> x, text a character vector where matches are sought. Coerced to
>character if possible.
>
> and in the Details section:
>
> Arguments which should be character strings or character vectors are
> coerced to character if possible.
>
>
> The wording of both would seem to reasonably lead to the conclusion that
> a factor could be coerced to a character vector by the use of
> as.character(FACTOR).

Well, that is not what is meant by the wording, nor what happens: there is 
no method dispatch so the factor is coerced from an integer vector to a 
character vector.  'coerced' usually means at low level: where 
as.character() is involved we tend to say so.

As for the comments on what happens if value=TRUE: if the 'x' has been 
coerced, I would expect the value to be based on the coerced value (and it 
currently is).

> grep("1", factor(letters))
  [1]  1 10 11 12 13 14 15 16 17 18 19 21
> grep("1", factor(letters), value=TRUE)
  [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "21"

So whereas I am quite happy to replace the low-level coercion by method 
dispatch on as.character, I don't think this should be altered (and am 
pretty sure there is code out there which expects a character vector 
result).
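
The difference between the two coercions can be sketched as follows. The
first call emulates the low-level coercion described above by converting
the integer codes explicitly; the second uses as.character(), which
yields the level labels. The variable names are illustrative only.

```r
f <- factor(letters)

# Low-level view: the factor's integer codes, converted to "1".."26"
codes_as_text <- as.character(as.integer(f))
grep("1", codes_as_text)

# Method-dispatch view: the level labels "a".."z"
labels_as_text <- as.character(f)
grep("[a-z]", labels_as_text)
```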

> In tracing through the C code in character.c for do_grep(), which in
> turn calls coerceVector() in coerce.c, unless I am mis-reading the code
> (always possible), I don't see an indication that a factor would be
> coerced to a character vector.
>
> Since a factor -> character coercion would seem at face value, the most
> logical coercion to take place when using grep(), I am curious if I am
> missing something, or if perhaps ?grep needs to be more clear in the
> coercions that will or might take place. Perhaps even the consideration
> of an error message if a factor is passed as the 'x' argument, if indeed
> the coercion would not take place.
>
> Perhaps the easiest example here might be:
>
> # On R Version 2.3.1 (2006-06-01) on FC5
>
>> grep("[a-z]", letters)
> [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
> [23] 23 24 25 26
>
>> grep("[a-z]", factor(letters))
> numeric(0)
>
>
> Thanks for any comments or any virtual rotten tomatoes coming my way at
> high speed.  :-)
>
> Marc Schwartz
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK  Fax:  +44 1865 272595



Re: [Rd] grep() and factors

2006-06-05 Thread Gabor Grothendieck
On 6/5/06, Bill Dunlap <[EMAIL PROTECTED]> wrote:
> On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:
>
> > > > > grep("[a-z]", factor(letters))
> > > > numeric(0)
> > >
> > > I was recently surprised by this also.  In addition, if
> > > R's grep did support factors in this way, what sort of
> > > object (factor or character) should it return when value=T?
> > > I recently changed Splus's grep to return a character vector in
> > > that case.
> > >
> > >Splus> grep("[def]", letters[26:1])
> > >[1] 21 22 23
> > >Splus>  grep("[def]", factor(letters[26:1], levels=letters[26:1]))
> > >[1] 21 22 23
> > >Splus> grep("[def]", letters[26:1], value=T)
> > >[1] "f" "e" "d"
> > >Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), 
> > > value=T)
> > >[1] "f" "e" "d"
> > >Splus> class(.Last.value)
> > >[1] "character"
> > >
> > > R does this when grepping an integer vector.
> > >R> grep("1", 0:11, value=T)
> > >[1] "1"  "10" "11"
> > > help(grep) says it returns "the matching elements themselves", but
> > > doesn't say if "themselves" means before or after the conversion to
> > > character.
> >
> > Bill,
> >
> > My first inclination for the return value when used on a factor would be
> > the indexed factor elements where grep() would otherwise simply return
> > the indices. This would also maintain the factor levels from the
> > original source factor since "[".factor would normally retain these when
> > drop = FALSE.
>
> That would be my first inclination also.  I would have expected the output of
>   grep(pattern, text, value=TRUE)
> to be identical to that of
>   text[grep(pattern, text, value=FALSE)]
> no matter what class text has.
>
> No end users have seen this in Splus so we can change it to anything,
> but we want to keep it the same as R's.
>
> > I could be convinced either way. The concern of course being that (given
> > the offlist replies I have received today) even experienced users are
> > getting bitten by the current behavior versus their intuitive
> > expectations, which are at least loosely supported by the documentation.
> >

If non-character text arguments are accepted, I would have expected
them to be coerced to character, so that
grep(pattern, text, ...) would return the same result as
grep(pattern, as.character(text), ...)



Re: [Rd] grep() and factors

2006-06-05 Thread Bill Dunlap
On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:

> > > > grep("[a-z]", factor(letters))
> > > numeric(0)
> >
> > I was recently surprised by this also.  In addition, if
> > R's grep did support factors in this way, what sort of
> > object (factor or character) should it return when value=T?
> > I recently changed Splus's grep to return a character vector in
> > that case.
> >
> >Splus> grep("[def]", letters[26:1])
> >[1] 21 22 23
> >Splus>  grep("[def]", factor(letters[26:1], levels=letters[26:1]))
> >[1] 21 22 23
> >Splus> grep("[def]", letters[26:1], value=T)
> >[1] "f" "e" "d"
> >Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), 
> > value=T)
> >[1] "f" "e" "d"
> >Splus> class(.Last.value)
> >[1] "character"
> >
> > R does this when grepping an integer vector.
> >R> grep("1", 0:11, value=T)
> >[1] "1"  "10" "11"
> > help(grep) says it returns "the matching elements themselves", but
> > doesn't say if "themselves" means before or after the conversion to
> > character.
>
> Bill,
>
> My first inclination for the return value when used on a factor would be
> the indexed factor elements where grep() would otherwise simply return
> the indices. This would also maintain the factor levels from the
> original source factor since "[".factor would normally retain these when
> drop = FALSE.

That would be my first inclination also.  I would have expected the output of
   grep(pattern, text, value=TRUE)
to be identical to that of
   text[grep(pattern, text, value=FALSE)]
no matter what class text has.
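
A short sketch of this expectation, using only base R. For a character
vector the two forms agree; for an integer vector, value=TRUE returns
strings while indexing returns the original integers, which is the
ambiguity about "themselves" noted above.

```r
text <- letters[26:1]
by_value <- grep("[def]", text, value = TRUE)
by_index <- text[grep("[def]", text, value = FALSE)]
identical(by_value, by_index)

ints <- 0:11
ints_by_value <- grep("1", ints, value = TRUE)  # character strings
ints_by_index <- ints[grep("1", ints)]          # original integers
```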

No end users have seen this in Splus so we can change it to anything,
but we want to keep it the same as R's.

> I could be convinced either way. The concern of course being that (given
> the offlist replies I have received today) even experienced users are
> getting bitten by the current behavior versus their intuitive
> expectations, which are at least loosely supported by the documentation.
>
> HTH,
>
> Marc Schwartz


Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and do
 not necessarily reflect Insightful Corporation policy or position."



Re: [Rd] grep() and factors

2006-06-05 Thread Sean Davis
Marc Schwartz (via MN) wrote:
> On Mon, 2006-06-05 at 13:45 -0700, Bill Dunlap wrote:
> 
>>On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:
>>
>>
>>>Based upon an offlist communication this morning, I am somewhat confused
>>>(more than I usually am on most Monday mornings...) about the use of
>>>grep() with factors as the 'x' argument.
>>> ...
>>>
>>>> grep("[a-z]", letters)
>>>
>>> [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
>>>[23] 23 24 25 26
>>>
>>>
>>>> grep("[a-z]", factor(letters))
>>>
>>>numeric(0)
>>
>>I was recently surprised by this also.  In addition, if
>>R's grep did support factors in this way, what sort of
>>object (factor or character) should it return when value=T?
>>I recently changed Splus's grep to return a character vector in
>>that case.
>>
>>   Splus> grep("[def]", letters[26:1])
>>   [1] 21 22 23
>>   Splus>  grep("[def]", factor(letters[26:1], levels=letters[26:1]))
>>   [1] 21 22 23
>>   Splus> grep("[def]", letters[26:1], value=T)
>>   [1] "f" "e" "d"
>>   Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), value=T)
>>   [1] "f" "e" "d"
>>   Splus> class(.Last.value)
>>   [1] "character"
>>
>>R does this when grepping an integer vector.
>>   R> grep("1", 0:11, value=T)
>>   [1] "1"  "10" "11"
>>help(grep) says it returns "the matching elements themselves", but
>>doesn't say if "themselves" means before or after the conversion to
>>character.
> 
> 
> Bill,
> 
> My first inclination for the return value when used on a factor would be
> the indexed factor elements where grep() would otherwise simply return
> the indices. This would also maintain the factor levels from the
> original source factor since "[".factor would normally retain these when
> drop = FALSE.
> 
> For example:
> 
> # Return the indexed values as would otherwise be done
> # in grep() if the factor to character coercion takes place:
> # Use the same indices 21:23 as above
> 
> 
>>factor(letters[26:1], levels = letters[26:1])[21:23]
> 
> [1] f e d
> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
> 
> 
> 
> From my read of the C code in do_grep() in character.c (again, if
> correct), when 'value = TRUE', the C code appears to first get the
> indices and then build the returned vector from the indexed values from
> the source vector in a for() loop. So this should not be a problem
> philosophically.
> 
> However, given your example of the coercion of integers, perhaps with
> grep() at least, consistent behavior would dictate that return values
> are always character vectors. These could then be coerced manually back
> to a factor, using the original levels, as may be required:
> 
> 
>>factor.letters <- factor(letters[26:1], levels=letters[26:1])
>>factor.letters
> 
>  [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
> 
> 
>>grep("[def]", as.character(factor.letters))
> 
> [1] 21 22 23
> 
> 
>>res <- grep("[def]", as.character(factor.letters), value = TRUE)
>>res
> 
> [1] "f" "e" "d"
> 
> 
>>factor(res, levels = levels(factor.letters))
> 
> [1] f e d
> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
> 
> Which of course is the same result I proposed initially above.
> 
> I could be convinced either way. The concern of course being that (given
> the offlist replies I have received today) even experienced users are
> getting bitten by the current behavior versus their intuitive
> expectations, which are at least loosely supported by the documentation.

I'll chime in on-list to say that I have had the same experience with 
expecting grep to coerce to text.  Despite the question of return 
values, I think of grep (not equivalent to the unix command, I 
understand, but it does have the same name) as operating on "text", not 
the factor levels themselves.  Not a big deal, but it does lead to 
sometimes hard to track bugs if one is not careful to put in 
as.character all the time.

Sean



Re: [Rd] grep() and factors

2006-06-05 Thread Marc Schwartz (via MN)
On Mon, 2006-06-05 at 13:45 -0700, Bill Dunlap wrote:
> On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:
> 
> > Based upon an offlist communication this morning, I am somewhat confused
> > (more than I usually am on most Monday mornings...) about the use of
> > grep() with factors as the 'x' argument.
> >  ...
> > > grep("[a-z]", letters)
> >  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
> > [23] 23 24 25 26
> >
> > > grep("[a-z]", factor(letters))
> > numeric(0)
> 
> I was recently surprised by this also.  In addition, if
> R's grep did support factors in this way, what sort of
> object (factor or character) should it return when value=T?
> I recently changed Splus's grep to return a character vector in
> that case.
> 
>Splus> grep("[def]", letters[26:1])
>[1] 21 22 23
>Splus>  grep("[def]", factor(letters[26:1], levels=letters[26:1]))
>[1] 21 22 23
>Splus> grep("[def]", letters[26:1], value=T)
>[1] "f" "e" "d"
>Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), value=T)
>[1] "f" "e" "d"
>Splus> class(.Last.value)
>[1] "character"
> 
> R does this when grepping an integer vector.
>R> grep("1", 0:11, value=T)
>[1] "1"  "10" "11"
> help(grep) says it returns "the matching elements themselves", but
> doesn't say if "themselves" means before or after the conversion to
> character.

Bill,

My first inclination for the return value when used on a factor would be
the indexed factor elements where grep() would otherwise simply return
the indices. This would also maintain the factor levels from the
original source factor since "[".factor would normally retain these when
drop = FALSE.

For example:

# Return the indexed values as would otherwise be done
# in grep() if the factor to character coercion takes place:
# Use the same indices 21:23 as above

> factor(letters[26:1], levels = letters[26:1])[21:23]
[1] f e d
Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a



From my read of the C code in do_grep() in character.c (again, if
correct), when 'value = TRUE', the C code appears to first get the
indices and then build the returned vector from the indexed values from
the source vector in a for() loop. So this should not be a problem
philosophically.

However, given your example of the coercion of integers, perhaps with
grep() at least, consistent behavior would dictate that return values
are always character vectors. These could then be coerced manually back
to a factor, using the original levels, as may be required:

> factor.letters <- factor(letters[26:1], levels=letters[26:1])
> factor.letters
 [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a

> grep("[def]", as.character(factor.letters))
[1] 21 22 23

> res <- grep("[def]", as.character(factor.letters), value = TRUE)
> res
[1] "f" "e" "d"

> factor(res, levels = levels(factor.letters))
[1] f e d
Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a

Which of course is the same result I proposed initially above.
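
The steps above could be wrapped in a small helper. This is only a
sketch; grep_factor is a hypothetical name, not a function in R.

```r
# Hypothetical helper: match against the level labels, return a factor
grep_factor <- function(pattern, f) {
  stopifnot(is.factor(f))
  idx <- grep(pattern, as.character(f))
  f[idx]  # "[.factor" keeps the original levels (drop = FALSE by default)
}

factor.letters <- factor(letters[26:1], levels = letters[26:1])
res <- grep_factor("[def]", factor.letters)
```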

I could be convinced either way. The concern of course being that (given
the offlist replies I have received today) even experienced users are
getting bitten by the current behavior versus their intuitive
expectations, which are at least loosely supported by the documentation.

HTH,

Marc Schwartz



Re: [Rd] grep() and factors

2006-06-05 Thread Bill Dunlap
On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:

> Based upon an offlist communication this morning, I am somewhat confused
> (more than I usually am on most Monday mornings...) about the use of
> grep() with factors as the 'x' argument.
>  ...
> > grep("[a-z]", letters)
>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
> [23] 23 24 25 26
>
> > grep("[a-z]", factor(letters))
> numeric(0)

I was recently surprised by this also.  In addition, if
R's grep did support factors in this way, what sort of
object (factor or character) should it return when value=T?
I recently changed Splus's grep to return a character vector in
that case.

   Splus> grep("[def]", letters[26:1])
   [1] 21 22 23
   Splus>  grep("[def]", factor(letters[26:1], levels=letters[26:1]))
   [1] 21 22 23
   Splus> grep("[def]", letters[26:1], value=T)
   [1] "f" "e" "d"
   Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), value=T)
   [1] "f" "e" "d"
   Splus> class(.Last.value)
   [1] "character"

R does this when grepping an integer vector.
   R> grep("1", 0:11, value=T)
   [1] "1"  "10" "11"
help(grep) says it returns "the matching elements themselves", but
doesn't say if "themselves" means before or after the conversion to
character.


Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and do
 not necessarily reflect Insightful Corporation policy or position."
