Re: [Rd] grep
On 02/10/2016 17:54, Pi wrote:

> Hello.
>
> It would be great if the grep function in R had the option to use the -m
> parameter as the linux command does.

I guess you mean the non-standard flag of the GNU version of grep (probably but not necessarily as used by Linux). That the POSIX standard for grep does not have this (nor any other commonly used implementation I am aware of) indicates that your enthusiasm for this is not shared by grep experts.

> That would allow to stop a grep search as soon as something is found.
> It would make many operations much faster.

Those who would have to do the work to implement this will not be taking your word for that, but would expect convincing examples of real problems where it was so and grep was the bottleneck.

Your 'case' seems to be for a shortcut for any(grepl()) along the lines of anyDuplicated().

> [[alternative HTML version deleted]]

This is a non-HTML list, as the posting guide told you. And using a real name adds credibility.

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
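
The any(grepl()) shortcut mentioned in this message can be approximated in user code. The following is only a sketch under stated assumptions: `any_grepl` and its `chunk` argument are hypothetical names, not part of base R, and whether it is actually faster than a plain any(grepl()) depends on where the first match sits.

```r
# Hypothetical helper (not in base R): report whether any element matches,
# scanning the input in chunks and returning at the first chunk with a match.
any_grepl <- function(pattern, x, chunk = 1000L, ...) {
  n <- length(x)
  if (n == 0L) return(FALSE)
  for (s in seq.int(1L, n, by = chunk)) {
    idx <- s:min(s + chunk - 1L, n)
    # grepl() stays vectorised within each chunk
    if (any(grepl(pattern, x[idx], ...))) return(TRUE)
  }
  FALSE
}

x <- c("needle", rep("hay", 1e5))
any_grepl("needle", x)   # same answer as any(grepl("needle", x))
any_grepl("zzz", x)      # scans every chunk before giving up
```

Extra arguments such as fixed = TRUE or perl = TRUE are passed through to grepl() unchanged.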
Re: [Rd] grep and PCRE fun
On Fri, 30 Sep 2011, Simon Urbanek wrote:

> Jeff,
>
> this is really a bug in PCRE since the length (0) is a multiple of 3 as
> documented so PCRE should not be writing anything. Anyway, this has been
> now fixed (by Brian).

Only in R-devel: R-2-13-branch is now closed (and was by the time I read the message).

> Cheers,
> Simon
>
> On Sep 29, 2011, at 5:00 PM, Jeffrey Horner wrote:
>
>> Hello,
>>
>> I think I've found a bug in the C function do_grep located in
>> src/main/grep.c. It seems to affect both the latest revisions of
>> R-2-13-branch and trunk when compiling R without optimizations and
>> with its own version of pcre located in src/extra, at least on Ubuntu
>> 10.04.
>>
>> According to the pcre_exec API (I presume the later versions), the
>> ovecsize argument must be a multiple of 3, and the ovector argument
>> must point to a location that can hold at least ovecsize integers. All
>> the pcre_exec calls made by do_grep, save one, honor this. That one
>> call seems to overwrite areas of the stack it shouldn't. Here's the
>> smallest example I found that tickles the bug:
>>
>> > grep("[^[:blank][:cntrl]]","\\n",perl=TRUE)
>> Error in grep("[^[:blank][:cntrl]]", "\\n", perl = TRUE) :
>>   negative length vectors are not allowed
>>
>> As described above, this error occurs on Ubuntu 10.04 when R is
>> compiled without optimizations (I typically use CFLAGS="-ggdb"
>> CXXFLAGS="-ggdb" FFLAGS="-ggdb" ./configure --enable-R-shlib), and the
>> pcre_exec call executed from do_grep overwrites the integer nmatches
>> and sets it to -1. This has the effect of making do_grep try to
>> allocate a results vector of length -1, which of course causes the
>> error message above.
>>
>> I'd be interested to know if this bug happens on other platforms.
>>
>> Below is my simple fix for R-2-13-branch (a similar fix works for
>> trunk as well).
>>
>> Jeff
>>
>> $ svn diff main/grep.c
>> Index: main/grep.c
>> ===
>> --- main/grep.c (revision 57110)
>> +++ main/grep.c (working copy)
>> @@ -723,7 +723,7 @@
>>  {
>>      SEXP pat, text, ind, ans;
>>      regex_t reg;
>> -    int i, j, n, nmatches = 0, ov, rc;
>> +    int i, j, n, nmatches = 0, ov[3], rc;
>>      int igcase_opt, value_opt, perl_opt, fixed_opt, useBytes, invert;
>>      const char *spat = NULL;
>>      pcre *re_pcre = NULL /* -Wall */;
>> @@ -882,7 +882,7 @@
>>          if (fixed_opt)
>>              LOGICAL(ind)[i] = fgrep_one(spat, s, useBytes, use_UTF8, NULL) >= 0;
>>          else if (perl_opt) {
>> -            if (pcre_exec(re_pcre, re_pe, s, strlen(s), 0, 0, &ov, 0) >= 0)
>> +            if (pcre_exec(re_pcre, re_pe, s, strlen(s), 0, 0, ov, 3) >= 0)
>>                  INTEGER(ind)[i] = 1;
>>          } else {
>>              if (!use_WC)

--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Re: [Rd] grep and PCRE fun
Jeff,

this is really a bug in PCRE since the length (0) is a multiple of 3 as documented so PCRE should not be writing anything. Anyway, this has been now fixed (by Brian).

Cheers,
Simon

On Sep 29, 2011, at 5:00 PM, Jeffrey Horner wrote:

> Hello,
>
> I think I've found a bug in the C function do_grep located in
> src/main/grep.c. It seems to affect both the latest revisions of
> R-2-13-branch and trunk when compiling R without optimizations and
> with its own version of pcre located in src/extra, at least on Ubuntu
> 10.04.
>
> According to the pcre_exec API (I presume the later versions), the
> ovecsize argument must be a multiple of 3, and the ovector argument
> must point to a location that can hold at least ovecsize integers. All
> the pcre_exec calls made by do_grep, save one, honor this. That one
> call seems to overwrite areas of the stack it shouldn't. Here's the
> smallest example I found that tickles the bug:
>
> > grep("[^[:blank][:cntrl]]","\\n",perl=TRUE)
> Error in grep("[^[:blank][:cntrl]]", "\\n", perl = TRUE) :
>   negative length vectors are not allowed
>
> As described above, this error occurs on Ubuntu 10.04 when R is
> compiled without optimizations (I typically use CFLAGS="-ggdb"
> CXXFLAGS="-ggdb" FFLAGS="-ggdb" ./configure --enable-R-shlib), and the
> pcre_exec call executed from do_grep overwrites the integer nmatches
> and sets it to -1. This has the effect of making do_grep try to
> allocate a results vector of length -1, which of course causes the
> error message above.
>
> I'd be interested to know if this bug happens on other platforms.
>
> Below is my simple fix for R-2-13-branch (a similar fix works for
> trunk as well).
>
> Jeff
>
> $ svn diff main/grep.c
> Index: main/grep.c
> ===
> --- main/grep.c (revision 57110)
> +++ main/grep.c (working copy)
> @@ -723,7 +723,7 @@
>  {
>      SEXP pat, text, ind, ans;
>      regex_t reg;
> -    int i, j, n, nmatches = 0, ov, rc;
> +    int i, j, n, nmatches = 0, ov[3], rc;
>      int igcase_opt, value_opt, perl_opt, fixed_opt, useBytes, invert;
>      const char *spat = NULL;
>      pcre *re_pcre = NULL /* -Wall */;
> @@ -882,7 +882,7 @@
>          if (fixed_opt)
>              LOGICAL(ind)[i] = fgrep_one(spat, s, useBytes, use_UTF8, NULL) >= 0;
>          else if (perl_opt) {
> -            if (pcre_exec(re_pcre, re_pe, s, strlen(s), 0, 0, &ov, 0) >= 0)
> +            if (pcre_exec(re_pcre, re_pe, s, strlen(s), 0, 0, ov, 3) >= 0)
>                  INTEGER(ind)[i] = 1;
>          } else {
>              if (!use_WC)
Re: [Rd] grep and PCRE fun
On Thu, Sep 29, 2011 at 2:00 PM, Jeffrey Horner wrote:

> Hello,
>
> I think I've found a bug in the C function do_grep located in
> src/main/grep.c. It seems to affect both the latest revisions of
> R-2-13-branch and trunk when compiling R without optimizations and
> with its own version of pcre located in src/extra, at least on Ubuntu
> 10.04.
>
> According to the pcre_exec API (I presume the later versions), the
> ovecsize argument must be a multiple of 3, and the ovector argument
> must point to a location that can hold at least ovecsize integers. All
> the pcre_exec calls made by do_grep, save one, honor this. That one
> call seems to overwrite areas of the stack it shouldn't. Here's the
> smallest example I found that tickles the bug:
>
> > grep("[^[:blank][:cntrl]]","\\n",perl=TRUE)
> Error in grep("[^[:blank][:cntrl]]", "\\n", perl = TRUE) :
>   negative length vectors are not allowed
>
> As described above, this error occurs on Ubuntu 10.04 when R is
> compiled without optimizations (I typically use CFLAGS="-ggdb"
> CXXFLAGS="-ggdb" FFLAGS="-ggdb" ./configure --enable-R-shlib), and the
> pcre_exec call executed from do_grep overwrites the integer nmatches
> and sets it to -1. This has the effect of making do_grep try to
> allocate a results vector of length -1, which of course causes the
> error message above.
>
> I'd be interested to know if this bug happens on other platforms.

With R devel (2011-09-28 r57099) and R v2.13.1 patched (2011-09-05 r56953) on Windows 7 64-bit you get:

> grep("[^[:blank][:cntrl]]","\\n",perl=TRUE)
integer(0)

/Henrik

> Below is my simple fix for R-2-13-branch (a similar fix works for
> trunk as well).
>
> Jeff
>
> $ svn diff main/grep.c
> Index: main/grep.c
> ===
> --- main/grep.c (revision 57110)
> +++ main/grep.c (working copy)
> @@ -723,7 +723,7 @@
>  {
>      SEXP pat, text, ind, ans;
>      regex_t reg;
> -    int i, j, n, nmatches = 0, ov, rc;
> +    int i, j, n, nmatches = 0, ov[3], rc;
>      int igcase_opt, value_opt, perl_opt, fixed_opt, useBytes, invert;
>      const char *spat = NULL;
>      pcre *re_pcre = NULL /* -Wall */;
> @@ -882,7 +882,7 @@
>          if (fixed_opt)
>              LOGICAL(ind)[i] = fgrep_one(spat, s, useBytes, use_UTF8, NULL) >= 0;
>          else if (perl_opt) {
> -            if (pcre_exec(re_pcre, re_pe, s, strlen(s), 0, 0, &ov, 0) >= 0)
> +            if (pcre_exec(re_pcre, re_pe, s, strlen(s), 0, 0, ov, 3) >= 0)
>                  INTEGER(ind)[i] = 1;
>          } else {
>              if (!use_WC)
Re: [Rd] grep and PCRE fun
On Thu, Sep 29, 2011 at 4:00 PM, Jeffrey Horner wrote:

> Hello,
>
> I think I've found a bug in the C function do_grep located in
> src/main/grep.c. It seems to affect both the latest revisions of
> R-2-13-branch and trunk when compiling R without optimizations and
> with its own version of pcre located in src/extra, at least on Ubuntu
> 10.04.
>
> According to the pcre_exec API (I presume the later versions), the
> ovecsize argument must be a multiple of 3, and the ovector argument
> must point to a location that can hold at least ovecsize integers. All
> the pcre_exec calls made by do_grep, save one, honor this. That one
> call seems to overwrite areas of the stack it shouldn't. Here's the
> smallest example I found that tickles the bug:
>
> > grep("[^[:blank][:cntrl]]","\\n",perl=TRUE)
> Error in grep("[^[:blank][:cntrl]]", "\\n", perl = TRUE) :
>   negative length vectors are not allowed

As many of you know, that regex is invalid. It's just the one I happened upon that tickled the bug. It actually came from an error that occurred when building R itself. Here's a snippet of my make log:

make[1]: Leaving directory `/home/hornerj/R-sources/branches/R-2-13-branch/po'
you should 'make docs' now ...
make[1]: Entering directory `/home/hornerj/R-sources/branches/R-2-13-branch/doc'
Error in grep("[^[:blank:][:cntrl:]]", unlist(Rd[sections == "TEXT"]), :
  negative length vectors are not allowed
Calls: saveRDS -> -> prepare2_Rd -> grep
Execution halted
make[1]: *** [NEWS.rds] Error 1

> As described above, this error occurs on Ubuntu 10.04 when R is
> compiled without optimizations (I typically use CFLAGS="-ggdb"
> CXXFLAGS="-ggdb" FFLAGS="-ggdb" ./configure --enable-R-shlib), and the
> pcre_exec call executed from do_grep overwrites the integer nmatches
> and sets it to -1. This has the effect of making do_grep try to
> allocate a results vector of length -1, which of course causes the
> error message above.
>
> I'd be interested to know if this bug happens on other platforms.
>
> Below is my simple fix for R-2-13-branch (a similar fix works for
> trunk as well).
>
> Jeff
>
> $ svn diff main/grep.c
> Index: main/grep.c
> ===
> --- main/grep.c (revision 57110)
> +++ main/grep.c (working copy)
> @@ -723,7 +723,7 @@
>  {
>      SEXP pat, text, ind, ans;
>      regex_t reg;
> -    int i, j, n, nmatches = 0, ov, rc;
> +    int i, j, n, nmatches = 0, ov[3], rc;
>      int igcase_opt, value_opt, perl_opt, fixed_opt, useBytes, invert;
>      const char *spat = NULL;
>      pcre *re_pcre = NULL /* -Wall */;
> @@ -882,7 +882,7 @@
>          if (fixed_opt)
>              LOGICAL(ind)[i] = fgrep_one(spat, s, useBytes, use_UTF8, NULL) >= 0;
>          else if (perl_opt) {
> -            if (pcre_exec(re_pcre, re_pe, s, strlen(s), 0, 0, &ov, 0) >= 0)
> +            if (pcre_exec(re_pcre, re_pe, s, strlen(s), 0, 0, ov, 3) >= 0)
>                  INTEGER(ind)[i] = 1;
>          } else {
>              if (!use_WC)
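
For reference, the pattern in the make log is the well-formed one, with the POSIX class names closed by ":]"; the example posted first dropped those colons. A small sketch of what the well-formed bracket expression selects follows. The test vector `x` is made up for illustration, and the sketch assumes an R build in which the overwrite described in this thread is fixed.

```r
# Made-up inputs: a newline, a letter, and a space followed by a tab.
x <- c("\n", "a", " \t")

# Negated bracket expression: any character that is neither a blank
# (space or tab) nor a control character.
hits <- grep("[^[:blank:][:cntrl:]]", x, perl = TRUE)
hits
# [1] 2
```

Only "a" qualifies: the newline is a control character, and the third element consists solely of a space and a tab.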
Re: [Rd] grep problem in R-devel 2.14 r57004
I forgot to mention the more obvious ;) - yes, it is a known issue in PCRE 8.13 which is hitting more people. After re-reading the standard I think the problem was that PCRE did not require an enclosing [ to treat [. as special. This has been addressed in the PCRE trunk since and it also has a comment on what happened. I have ported that fix into R-devel.

Cheers,
Simon

On Sep 16, 2011, at 9:01 AM, Simon Urbanek wrote:

> Mark, quick googling gives the answer - [.] is not what you think it is, you
> probably meant [\.]. Bracket expressions starting with [. are collating
> symbols, which are unsupported by PCRE (only [:xxx:] is supported, neither
> [=xxx=] nor [.xxx.] is) but that's probably not what you intended. See POSIX:
>
> 9.3.5 RE Bracket Expression
> [...]
> 1. [..] The character sequences "[.", "[=", and "[:" (left-bracket followed
> by a period, equals-sign, or colon) shall be special inside a bracket
> expression and are used to delimit collating symbols, equivalence class
> expressions, and character class expressions.
>
> Cheers,
> Simon
>
> On Sep 16, 2011, at 12:45 AM, wrote:
>
>> Problem below with PCRE grep in R-devel; works fine in R-patched. (Unless
>> there's been an absolutely massive change in rules for updated PCRE version
>> 8.13; jeez I hope not)
>>
>> > grep( '[.][.]', '', perl=TRUE)
>> Error in grep("[.][.]", "", perl = TRUE) :
>>   invalid regular expression '[.][.]'
>> In addition: Warning message:
>> In grep("[.][.]", "", perl = TRUE) : PCRE pattern compilation error
>>         'POSIX collating elements are not supported'
>>         at '[.][.]'
>>
>> > sessionInfo()
>> R Under development (unstable) (2011-09-13 r57004)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
>> [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
>> [5] LC_TIME=English_Australia.1252
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> NB I'm sending to R-devel rather than posting a bug report because (i) I
>> have a dim recollection that's what we're supposed to do for bugs in
>> R-devel, and (ii) Bugzilla doesn't include an R-devel version and (iii)
>> couldn't find any guidance on these matters.
>>
>> Mark
>>
>> --
>> Mark Bravington
>> CSIRO Mathematical & Information Sciences
>> Marine Laboratory
>> Castray Esplanade
>> Hobart 7001
>> TAS
>>
>> ph (+61) 3 6232 5118
>> fax (+61) 3 6232 5012
>> mob (+61) 438 315 623
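
With the fix described in this message in place, "[.][.]" should again match two literal dots. The sketch below shows three ways of writing that match; the input vector `x` is invented for illustration, and the first form assumes a PCRE build that does not have the 8.13 regression.

```r
# Made-up inputs: two dots, one dot, no dots.
x <- c("a..b", "a.b", "ab")

bracket <- grepl("[.][.]", x, perl = TRUE)   # each dot in its own bracket expression
escaped <- grepl("\\.\\.", x, perl = TRUE)   # backslash-escaped dots
literal <- grepl("..", x, fixed = TRUE)      # no regex interpretation at all

rbind(bracket, escaped, literal)
```

On an affected build, the escaped or fixed = TRUE forms would be the ones to reach for, since neither begins a bracket expression with "[.".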
Re: [Rd] grep problem in R-devel 2.14 r57004
Mark, quick googling gives the answer - [.] is not what you think it is, you probably meant [\.]. Bracket expressions starting with [. are collating symbols, which are unsupported by PCRE (only [:xxx:] is supported, neither [=xxx=] nor [.xxx.] is) but that's probably not what you intended. See POSIX:

9.3.5 RE Bracket Expression
[...]
1. [..] The character sequences "[.", "[=", and "[:" (left-bracket followed by a period, equals-sign, or colon) shall be special inside a bracket expression and are used to delimit collating symbols, equivalence class expressions, and character class expressions.

Cheers,
Simon

On Sep 16, 2011, at 12:45 AM, wrote:

> Problem below with PCRE grep in R-devel; works fine in R-patched. (Unless
> there's been an absolutely massive change in rules for updated PCRE version
> 8.13; jeez I hope not)
>
> > grep( '[.][.]', '', perl=TRUE)
> Error in grep("[.][.]", "", perl = TRUE) :
>   invalid regular expression '[.][.]'
> In addition: Warning message:
> In grep("[.][.]", "", perl = TRUE) : PCRE pattern compilation error
>         'POSIX collating elements are not supported'
>         at '[.][.]'
>
> > sessionInfo()
> R Under development (unstable) (2011-09-13 r57004)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
> [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
> [5] LC_TIME=English_Australia.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> NB I'm sending to R-devel rather than posting a bug report because (i) I have
> a dim recollection that's what we're supposed to do for bugs in R-devel, and
> (ii) Bugzilla doesn't include an R-devel version and (iii) couldn't find any
> guidance on these matters.
>
> Mark
>
> --
> Mark Bravington
> CSIRO Mathematical & Information Sciences
> Marine Laboratory
> Castray Esplanade
> Hobart 7001
> TAS
>
> ph (+61) 3 6232 5118
> fax (+61) 3 6232 5012
> mob (+61) 438 315 623
Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE
On Thu, 17 May 2007, Petr Savicky wrote:

>> strncasecmp is not standard C (not even C99), but R does have a substitute
>> for it. Unfortunately strncasecmp is not usable with multibyte charsets:
>> Linux systems have wcsncasecmp but that is not portable. In these days of
>> widespread use of UTF-8 that is a blocking issue, I am afraid.
>
> What could help are the functions mbrtowc and towctrans and simple
> long integer comparison. Are the functions mbrtowc and towctrans
> available under Windows? mbrtowc seems to be available as Rmbrtowc
> in src/gnuwin32/extra.c.
>
> I did not find towctrans defined in R sources, but it is in
> gnuwin32/Rdll.hide

I don't see it in Rdll.hide. It is a C99 function (see your unix man page).

> and used in do_tolower. Does this mean that tolower is not usable
> with utf-8 under Windows?

UTF-8 is not usable under Windows, but tolower works in Windows DBCS (in so far as that makes sense: Chinese chars do not have 'case'). Rmbrtowc reflects an attempt to add UTF-8 support on Windows, but that is not currently active.

>> In the case of grep I think all you need is
>>
>> grep(tolower(pattern), tolower(x), fixed = TRUE)
>>
>> and similarly for regexpr.
>
> Yes, this is correct, but it has disadvantages. It needs more
> space and, if value=TRUE, we would have to do something like
>   x[grep(tolower(pattern), tolower(x), fixed = TRUE, value=FALSE)]
> This is hard to implement in src/library/base/R/grep.R,
> where the call to .Internal(grep(pattern,...)) is the last command
> and I think this should be preserved.
>
>>> Ignore case option is not meaningful in gsub.
>>
>> sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)
>>
>> is different from 'ignore.case=FALSE', and I see the meaning as clear.
>> So what did you mean? (Unfortunately the tolower trick does not work for
>> [g]sub.)
>
> The meaning of ignore.case in [g]sub is problematic due to the following.
>   sub("abc", "xyz", c("ABCD", "abcd"), ignore.case=TRUE)
> produces
>   [1] "xyzD" "xyzd"
> but the user may in fact need the following
>   [1] "XYZD" "xyzd"

He may, but that is not what 'ignore case' means, more like 'case honouring'.

> It is correct that "xyzD" "xyzd" is produced, but the user
> should be aware of the fact that several substitutions like
>   x <- sub("abc", "xyz", c("ABCD", "abcd")) # ignore.case=FALSE
>   sub("ABC", "XYZ", x) # ignore.case=FALSE
> may be more useful.
>
> I have another question concerning the speed of grep. I expected that
> the fgrep_one function is slower than calling a library routine
> for regular expressions. In particular, if the pattern has a lot of
> long partial matches in the target string, I expected that it may be much
> slower. A short example is
>   y <- "ab"
>   x <- "aaab"
>   grep(y,x)
> which requires 110 comparisons (10 comparisons for each of 11 possible
> beginnings of y in x). In general, the complexity in the worst case is
> O(m*n), where m,n are the lengths of y,x resp. I would expect that
> the library function for matching regular expressions needs
> time O(m+n) and so will be faster. However, the result obtained
> on a larger example is
>
> > x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
> > x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
> > y <- paste(c(rep("a", times = 1), x2), collapse = "")
> > z <- rep(y, times = 100)
> >
> > system.time(i <- grep(x1, z, fixed = T))
> [1] 1.970 0.000 1.985 0.000 0.000
> >
> > system.time(i <- grep(x1, z, fixed = F)) # reg. expr. surprisingly slow (*)
> [1] 40.374 0.003 40.381 0.000 0.000
> >
> > system.time(i <- grep(x2, z, fixed = T))
> [1] 0.113 0.000 0.113 0.000 0.000
> >
> > system.time(i <- grep(x2, z, fixed = F)) # reg. expr. faster than fgrep_one
> [1] 0.019 0.000 0.019 0.000 0.000
>
> Do you have an explanation of these results, in particular (*)?

Yes, there is a comment on the help page to that effect. But these are highly atypical uses. Try perl=TRUE, and be aware that the locale matters a lot in such tests (via the charset).

No one is attempting to make R a fast string-processing language and so developers' resources are spent on performance where it matters to more typical usage. (E.g. reducing duplication in as.double and friends speeds up just about every R session, and speeds up some numerical sessions dramatically.)

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
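
The suggestion to try perl=TRUE can be checked with a scaled-down version of the timing experiment quoted above. This is a sketch only: the sizes (200, 2000, 20) are arbitrary choices rather than the ones from the thread, strrep() assumes R 3.3.0 or later, and the timings will differ by machine, locale and R version, so none are shown.

```r
# A pattern with a long run of "a" before "b", searched in longer runs of "a".
pat <- paste0(strrep("a", 200), "b")
txt <- rep(paste0(strrep("a", 2000), "b"), 20)

# Time the three matching engines on the same input.
t_fixed <- system.time(i_fixed <- grep(pat, txt, fixed = TRUE))
t_regex <- system.time(i_regex <- grep(pat, txt))
t_perl  <- system.time(i_perl  <- grep(pat, txt, perl = TRUE))

# Elapsed seconds for each engine; the match indices should agree.
c(fixed = t_fixed[["elapsed"]],
  regex = t_regex[["elapsed"]],
  perl  = t_perl[["elapsed"]])
```

Re-running it under a different locale (for example "C" versus a UTF-8 locale) is one way to see the charset effect mentioned in this message.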
Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE
> strncasecmp is not standard C (not even C99), but R does have a substitute
> for it. Unfortunately strncasecmp is not usable with multibyte charsets:
> Linux systems have wcsncasecmp but that is not portable. In these days of
> widespread use of UTF-8 that is a blocking issue, I am afraid.

What could help are the functions mbrtowc and towctrans and simple long integer comparison. Are the functions mbrtowc and towctrans available under Windows? mbrtowc seems to be available as Rmbrtowc in src/gnuwin32/extra.c.

I did not find towctrans defined in R sources, but it is in gnuwin32/Rdll.hide and used in do_tolower. Does this mean that tolower is not usable with utf-8 under Windows?

> In the case of grep I think all you need is
>
> grep(tolower(pattern), tolower(x), fixed = TRUE)
>
> and similarly for regexpr.

Yes, this is correct, but it has disadvantages. It needs more space and, if value=TRUE, we would have to do something like
  x[grep(tolower(pattern), tolower(x), fixed = TRUE, value=FALSE)]
This is hard to implement in src/library/base/R/grep.R, where the call to .Internal(grep(pattern,...)) is the last command and I think this should be preserved.

> > Ignore case option is not meaningful in gsub.
>
> sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)
>
> is different from 'ignore.case=FALSE', and I see the meaning as clear.
> So what did you mean? (Unfortunately the tolower trick does not work for
> [g]sub.)

The meaning of ignore.case in [g]sub is problematic due to the following.
  sub("abc", "xyz", c("ABCD", "abcd"), ignore.case=TRUE)
produces
  [1] "xyzD" "xyzd"
but the user may in fact need the following
  [1] "XYZD" "xyzd"

It is correct that "xyzD" "xyzd" is produced, but the user should be aware of the fact that several substitutions like
  x <- sub("abc", "xyz", c("ABCD", "abcd")) # ignore.case=FALSE
  sub("ABC", "XYZ", x) # ignore.case=FALSE
may be more useful.

I have another question concerning the speed of grep. I expected that the fgrep_one function is slower than calling a library routine for regular expressions. In particular, if the pattern has a lot of long partial matches in the target string, I expected that it may be much slower. A short example is
  y <- "ab"
  x <- "aaab"
  grep(y,x)
which requires 110 comparisons (10 comparisons for each of 11 possible beginnings of y in x). In general, the complexity in the worst case is O(m*n), where m,n are the lengths of y,x resp. I would expect that the library function for matching regular expressions needs time O(m+n) and so will be faster. However, the result obtained on a larger example is

> x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
> x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
> y <- paste(c(rep("a", times = 1), x2), collapse = "")
> z <- rep(y, times = 100)

> system.time(i <- grep(x1, z, fixed = T))
[1] 1.970 0.000 1.985 0.000 0.000

> system.time(i <- grep(x1, z, fixed = F)) # reg. expr. surprisingly slow (*)
[1] 40.374 0.003 40.381 0.000 0.000

> system.time(i <- grep(x2, z, fixed = T))
[1] 0.113 0.000 0.113 0.000 0.000

> system.time(i <- grep(x2, z, fixed = F)) # reg. expr. faster than fgrep_one
[1] 0.019 0.000 0.019 0.000 0.000

Do you have an explanation of these results, in particular (*)?

Petr.
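
The count of 110 comparisons can be reproduced with a naive matcher. The sketch below is an illustration, not the fgrep_one code: `count_comparisons` is a hypothetical helper, and the strings are an assumption chosen to fit the stated figures (a pattern of length 10 and a text of length 20, giving 11 start positions), since the strings shown in the archived message are shorter than that.

```r
# Hypothetical helper: count character comparisons made by a naive search
# that tries every start position and stops each attempt at the first mismatch.
count_comparisons <- function(pattern, text) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(text, "")[[1]]
  m <- length(p)
  n <- length(s)
  if (n < m) return(0L)
  total <- 0L
  for (start in seq_len(n - m + 1L)) {
    for (j in seq_len(m)) {
      total <- total + 1L
      if (s[start + j - 1L] != p[j]) break
    }
  }
  total
}

# Assumed inputs: nine "a" then "b", searched in nineteen "a" then "b".
y <- paste(c(rep("a", 9), "b"), collapse = "")
x <- paste(c(rep("a", 19), "b"), collapse = "")
count_comparisons(y, x)
# [1] 110
```

Each of the 11 start positions costs m = 10 comparisons here, which is the O(m*n) worst case described in the message.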
Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE
On Fri, 11 May 2007, Petr Savicky wrote:

> On Wed, May 09, 2007 at 06:41:23AM +0100, Prof Brian Ripley wrote:
>> I suggest you collaborate with the person who replied that he thought this
>> was a good idea to supply patches against the R-devel sources for
>> scrutiny.
>
> A possible solution is to use strncasecmp instead of strncmp
> in function fgrep_one in R-devel/src/main/character.c.
>
> Corresponding modification of character.c is at
> http://www.cs.cas.cz/~savicky/ignore_case/character.c
> and diff file w.r.t. the original character.c (downloaded today) is at
> http://www.cs.cas.cz/~savicky/ignore_case/diff.txt
>
> This seems to work in my installation of R-devel:
>
> > x <- c("D.G cat", "d.g cat", "dog cat")
> > z <- "d.g"
> > grep(z, x, ignore.case = F, fixed = T)
> [1] 2
> > grep(z, x, ignore.case = T, fixed = T) # this is the new behavior
> [1] 1 2
> > grep(z, x, ignore.case = T, fixed = F)
> [1] 1 2 3
> >
>
> Since fgrep_one is used many times in character.c, adding igcase_opt as
> an additional argument would imply extensive changes to the file.
> So, I introduced a new function fgrep_one_igcase called only once in
> the file. Another solution is possible.
>
> I do not understand well handling multibyte chars, so I did not test
> the function with real multibyte chars, although the code for
> this option is used.

Thanks for looking into this. strncasecmp is not standard C (not even C99), but R does have a substitute for it. Unfortunately strncasecmp is not usable with multibyte charsets: Linux systems have wcsncasecmp but that is not portable. In these days of widespread use of UTF-8 that is a blocking issue, I am afraid.

In the case of grep I think all you need is

grep(tolower(pattern), tolower(x), fixed = TRUE)

and similarly for regexpr.

> Ignore case option is not meaningful in gsub.

sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)

is different from 'ignore.case=FALSE', and I see the meaning as clear. So what did you mean? (Unfortunately the tolower trick does not work for [g]sub.)

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
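
Applied to the example data from this thread, the tolower approach suggested above looks as follows. This is a minimal sketch; the variable `idx` is introduced here only for illustration.

```r
x <- c("D.G cat", "d.g cat", "dog cat")
z <- "d.g"

# Case-insensitive literal match: lower-case both sides, then match as is.
idx <- grep(tolower(z), tolower(x), fixed = TRUE)
idx
# [1] 1 2

# To get the matching values in their original case, index x with the result
# instead of using value = TRUE (which would return the lower-cased copies).
x[idx]
# [1] "D.G cat" "d.g cat"
```

The dot stays literal because of fixed = TRUE, so "dog cat" is not matched.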
Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE
On Wed, May 09, 2007 at 06:41:23AM +0100, Prof Brian Ripley wrote:
> I suggest you collaborate with the person who replied that he thought this
> was a good idea to supply patches against the R-devel sources for
> scrutiny.

A possible solution is to use strncasecmp instead of strncmp in function fgrep_one in R-devel/src/main/character.c.

Corresponding modification of character.c is at
http://www.cs.cas.cz/~savicky/ignore_case/character.c
and diff file w.r.t. the original character.c (downloaded today) is at
http://www.cs.cas.cz/~savicky/ignore_case/diff.txt

This seems to work in my installation of R-devel:

> x <- c("D.G cat", "d.g cat", "dog cat")
> z <- "d.g"
> grep(z, x, ignore.case = F, fixed = T)
[1] 2
> grep(z, x, ignore.case = T, fixed = T) # this is the new behavior
[1] 1 2
> grep(z, x, ignore.case = T, fixed = F)
[1] 1 2 3
>

Since fgrep_one is used many times in character.c, adding igcase_opt as an additional argument would imply extensive changes to the file. So, I introduced a new function fgrep_one_igcase called only once in the file. Another solution is possible.

I do not understand the handling of multibyte chars well, so I did not test the function with real multibyte chars, although the code for this option is used.

Ignore case option is not meaningful in gsub. It could be meaningful in regexpr; however, this function does not allow the ignore.case option, so I made no changes to it.

All the best, Petr.
Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE
On Mon, 7 May 2007, Petr Savicky wrote:

> Dear R developers,
>
> I suggest to modify the behaviour of "grep" function with fixed=TRUE option.
>
> Currently, fixed=TRUE implies ignore.case=FALSE (overrides ignore.case=TRUE,
> if set by the user).

As it clearly says it does.

> I suggest to keep ignore.case as set by the user even if fixed=TRUE. Since
> the default of ignore.case is FALSE, this would not change the behaviour
> of grep, if the user does not set ignore.case explicitly.
>
> In my opinion, fixed=TRUE is most useful for suppressing meta-character
> expansion. On the other hand, for a simple word search, ignoring
> case is sometimes useful.

Well, it was written to use in R's own code as a quick way to match a fixed sequence of bytes. It is not suitable for a 'word' search as it does not (just) match to words.

> If for some reason, it is better to keep the current behavior of grep, then I
> suggest to extend the documentation as follows:
>
> ORIGINAL:
> fixed: logical. If 'TRUE', 'pattern' is a string to be matched as
> is. Overrides all conflicting arguments.
>
> SUGGESTED:
> fixed: logical. If 'TRUE', 'pattern' is a string to be matched as
> is. Overrides all conflicting arguments including ignore.case.

Oh come on, ignore.case clearly conflicts with 'as is'! Adding unnecessary qualifiers just makes the text harder to read.

I suggest you collaborate with the person who replied that he thought this was a good idea to supply patches against the R-devel sources for scrutiny.

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
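
The point that fixed=TRUE matches a sequence of bytes rather than words can be seen in a small sketch. The vector `x` is invented for illustration, and the word-boundary pattern is just one possible way to do a word search, not something proposed in the thread.

```r
# Made-up inputs.
x <- c("cat", "concatenate", "Cat")

# Fixed matching finds the byte sequence "cat" anywhere, including inside words.
as_bytes <- grep("cat", x, fixed = TRUE)
as_bytes
# [1] 1 2

# A regular expression with word boundaries matches only the whole word.
as_word <- grep("\\bcat\\b", x, perl = TRUE)
as_word
# [1] 1
```

Neither call matches "Cat", since both are case-sensitive.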
Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE
Seems like a good idea to me. Here is a workaround that works in any event, which combines (?i), \Q and \E to get the same effect. (?i) gives case insensitive matches, and \Q and \E quote and endquote the intervening text, disabling special characters:

x <- c("D.G cat", "d.g cat", "dog cat")
z <- "d.g"
rx <- paste("(?i)\\Q", z, "\\E", sep = "")
grep(rx, x, perl = TRUE) # 1 2

On 5/7/07, Petr Savicky <[EMAIL PROTECTED]> wrote:
> Dear R developers,
>
> I suggest to modify the behaviour of "grep" function with fixed=TRUE option.
>
> Currently, fixed=TRUE implies ignore.case=FALSE (overrides ignore.case=TRUE,
> if set by the user).
>
> I suggest to keep ignore.case as set by the user even if fixed=TRUE. Since
> the default of ignore.case is FALSE, this would not change the behaviour
> of grep, if the user does not set ignore.case explicitly.
>
> In my opinion, fixed=TRUE is most useful for suppressing meta-character
> expansion. On the other hand, for a simple word search, ignoring
> case is sometimes useful.
>
> If for some reason, it is better to keep the current behavior of grep, then I
> suggest to extend the documentation as follows:
>
> ORIGINAL:
> fixed: logical. If 'TRUE', 'pattern' is a string to be matched as
> is. Overrides all conflicting arguments.
>
> SUGGESTED:
> fixed: logical. If 'TRUE', 'pattern' is a string to be matched as
> is. Overrides all conflicting arguments including ignore.case.
>
> All the best, Petr Savicky.
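
The workaround in this message can be wrapped in a small function. This is a sketch under stated assumptions: `grep_literal_icase` is a hypothetical name, not an existing R function, and it simply refuses literals that themselves contain \E, because that sequence would end the quoted section early.

```r
# Hypothetical wrapper around the (?i)\Q...\E workaround.
grep_literal_icase <- function(literal, x, ...) {
  # Assumption: the literal must not contain "\E" (backslash followed by E),
  # which would terminate the \Q...\E quoting early.
  stopifnot(!grepl("\\E", literal, fixed = TRUE))
  rx <- paste0("(?i)\\Q", literal, "\\E")
  grep(rx, x, perl = TRUE, ...)
}

x <- c("D.G cat", "d.g cat", "dog cat")
grep_literal_icase("d.g", x)
# [1] 1 2
grep_literal_icase("d.g", x, value = TRUE)
# [1] "D.G cat" "d.g cat"
```

Unlike the tolower approach, value = TRUE here returns the strings in their original case.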
Re: [Rd] grep() and factors
On Tue, 2006-06-06 at 17:08 +0100, Prof Brian Ripley wrote: > On Tue, 6 Jun 2006, Marc Schwartz (via MN) wrote: > > > On Tue, 2006-06-06 at 11:12 +0100, Prof Brian Ripley wrote: > >> On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote: > >> > >>> Hi all, > >>> > >>> Based upon an offlist communication this morning, I am somewhat confused > >>> (more than I usually am on most Monday mornings...) about the use of > >>> grep() with factors as the 'x' argument. > >>> > >>> The argument guidance in ?grep indicates: > >>> > >>> x, text a character vector where matches are sought. Coerced to > >>>character if possible. > >>> > >>> and in the Details section: > >>> > >>> Arguments which should be character strings or character vectors are > >>> coerced to character if possible. > >>> > >>> > >>> The wording of both would seem to reasonably lead to the conclusion that > >>> a factor could be coerced to a character vector by the use of > >>> as.character(FACTOR). > >> > >> Well, that is not what is meant by the wording, nor what happens: there is > >> no method dispatch so the factor is coerced from an integer vector to a > >> character vector. 'coerced' usually means at low level: where > >> as.character() is involved we tend to say so. > >> > >> As for the comments on what happens if value=TRUE: if the 'x' has been > >> coerced, I would expect the value to be based on the coerced value (and it > >> currently is). > >> > >>> grep("1", factor(letters)) > >> [1] 1 10 11 12 13 14 15 16 17 18 19 21 > >>> grep("1", factor(letters), value=TRUE) > >> [1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "21" > >> > >> So whereas I am quite happy to replace the low-level coercion by method > >> dispatch on as.character, I don't think this should be altered (and am > >> pretty sure there is code out there which expects a character vector > >> result). > > > > Prof. Ripley, > > > > Thanks for your reply and clarification. 
> > > > I would acknowledge that the coercion of a factor to its numeric values > > would not be immediately intuitive to me (or others who have commented > > on this) within the context of grep(). However, in light of your > > comments and having reviewed the C code, it does make sense. > > > > Given this behavior, it would seem reasonable to provide a clarification > > in ?grep, perhaps as follows: > > > > Arguments > > > > x, text a character vector where matches are sought. Coerced to > > character if possible. See Details for factors. > > > > > > Details > > > > Arguments which should be character strings or character vectors are > > coerced to character if possible. In the case of factors, these are > > coerced using as.integer(x). You must explicitly coerce the factor using > > as.character(x) to use these functions on the character vector > > equivalent. > > I do think we should `replace the low-level coercion by method dispatch on > as.character', and have done so in R-devel (but am still testing > packages). There have been quite a few instances of such low-level > coercion (including for dimnames), and I am currently looking through to > see if there are any others that either should be altered or the > documentation clarified. Prof. Ripley, I did not want to presume that you would indeed do this or more, had already done so. Though given your additional comments, I now note that this is mentioned in the NEWS file for R-devel. I do sincerely appreciate your efforts here. Perhaps an interim change in ?grep as above for 2.3.1patched might be considered, though now with an additional comment that this approach will (might) change in 2.4.0? I have added Bill Dunlap as a cc: here, given his expressed desire to be consistent with R on this point. Regards, Marc __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] grep() and factors
On Tue, 6 Jun 2006, Marc Schwartz (via MN) wrote: > On Tue, 2006-06-06 at 11:12 +0100, Prof Brian Ripley wrote: >> On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote: >> >>> Hi all, >>> >>> Based upon an offlist communication this morning, I am somewhat confused >>> (more than I usually am on most Monday mornings...) about the use of >>> grep() with factors as the 'x' argument. >>> >>> The argument guidance in ?grep indicates: >>> >>> x, text a character vector where matches are sought. Coerced to >>>character if possible. >>> >>> and in the Details section: >>> >>> Arguments which should be character strings or character vectors are >>> coerced to character if possible. >>> >>> >>> The wording of both would seem to reasonably lead to the conclusion that >>> a factor could be coerced to a character vector by the use of >>> as.character(FACTOR). >> >> Well, that is not what is meant by the wording, nor what happens: there is >> no method dispatch so the factor is coerced from an integer vector to a >> character vector. 'coerced' usually means at low level: where >> as.character() is involved we tend to say so. >> >> As for the comments on what happens if value=TRUE: if the 'x' has been >> coerced, I would expect the value to be based on the coerced value (and it >> currently is). >> >>> grep("1", factor(letters)) >> [1] 1 10 11 12 13 14 15 16 17 18 19 21 >>> grep("1", factor(letters), value=TRUE) >> [1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "21" >> >> So whereas I am quite happy to replace the low-level coercion by method >> dispatch on as.character, I don't think this should be altered (and am >> pretty sure there is code out there which expects a character vector >> result). > > Prof. Ripley, > > Thanks for your reply and clarification. > > I would acknowledge that the coercion of a factor to its numeric values > would not be immediately intuitive to me (or others who have commented > on this) within the context of grep(). 
However, in light of your > comments and having reviewed the C code, it does make sense. > > Given this behavior, it would seem reasonable to provide a clarification > in ?grep, perhaps as follows: > > Arguments > > x, text a character vector where matches are sought. Coerced to > character if possible. See Details for factors. > > > Details > > Arguments which should be character strings or character vectors are > coerced to character if possible. In the case of factors, these are > coerced using as.integer(x). You must explicitly coerce the factor using > as.character(x) to use these functions on the character vector > equivalent. I do think we should `replace the low-level coercion by method dispatch on as.character', and have done so in R-devel (but am still testing packages). There have been quite a few instances of such low-level coercion (including for dimnames), and I am currently looking through to see if there are any others that either should be altered or the documentation clarified. -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UKFax: +44 1865 272595 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
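To illustrate the distinction drawn above between low-level coercion and method dispatch on as.character, here is a toy sketch. The class "celsius" and its method are invented for illustration and have nothing to do with the actual change made in R-devel; unclass() stands in for coercion that ignores the class attribute.

```r
# A made-up S3 class with its own as.character() method.
temp <- structure(c(21L, 30L), class = "celsius")
as.character.celsius <- function(x, ...) paste(unclass(x), "C")

via_method <- as.character(temp)           # dispatches: "21 C" "30 C"
low_level  <- as.character(unclass(temp))  # ignores the class: "21" "30"

grep("C", via_method)  # 1 2
grep("C", low_level)   # no matches
```

The same contrast applies to factors, where the method returns the level labels and the low-level route sees only the integer codes.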
Re: [Rd] grep() and factors
On Tue, 2006-06-06 at 11:12 +0100, Prof Brian Ripley wrote: > On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote: > > > Hi all, > > > > Based upon an offlist communication this morning, I am somewhat confused > > (more than I usually am on most Monday mornings...) about the use of > > grep() with factors as the 'x' argument. > > > > The argument guidance in ?grep indicates: > > > > x, text a character vector where matches are sought. Coerced to > >character if possible. > > > > and in the Details section: > > > > Arguments which should be character strings or character vectors are > > coerced to character if possible. > > > > > > The wording of both would seem to reasonably lead to the conclusion that > > a factor could be coerced to a character vector by the use of > > as.character(FACTOR). > > Well, that is not what is meant by the wording, nor what happens: there is > no method dispatch so the factor is coerced from an integer vector to a > character vector. 'coerced' usually means at low level: where > as.character() is involved we tend to say so. > > As for the comments on what happens if value=TRUE: if the 'x' has been > coerced, I would expect the value to be based on the coerced value (and it > currently is). > > > grep("1", factor(letters)) > [1] 1 10 11 12 13 14 15 16 17 18 19 21 > > grep("1", factor(letters), value=TRUE) > [1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "21" > > So whereas I am quite happy to replace the low-level coercion by method > dispatch on as.character, I don't think this should be altered (and am > pretty sure there is code out there which expects a character vector > result). Prof. Ripley, Thanks for your reply and clarification. I would acknowledge that the coercion of a factor to its numeric values would not be immediately intuitive to me (or others who have commented on this) within the context of grep(). However, in light of your comments and having reviewed the C code, it does make sense. 
Given this behavior, it would seem reasonable to provide a clarification in ?grep, perhaps as follows:

Arguments

x, text   a character vector where matches are sought. Coerced to character if possible. See Details for factors.

Details

Arguments which should be character strings or character vectors are coerced to character if possible. In the case of factors, these are coerced using as.integer(x). You must explicitly coerce the factor using as.character(x) to use these functions on the character vector equivalent.

Thanks for your consideration.

Regards,

Marc Schwartz
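The two coercions discussed in this thread can be compared directly. The sketch below coerces explicitly before calling grep(), so it should not depend on how a particular R version's grep() treats factors; the names f, codes and labels are illustrative.

```r
f <- factor(letters)

codes  <- as.character(unclass(f))  # "1" ... "26": the integer codes as text
labels <- as.character(f)           # "a" ... "z": the level labels

on_codes  <- grep("1", codes)       # 1 10 11 ... 19 21, as reported above
none      <- grep("[a-z]", codes)   # no matches: the codes contain no letters
on_labels <- grep("[a-z]", labels)  # 1 2 ... 26
```

Searching the codes reproduces both the "1 10 11 ..." result and the empty result quoted in the thread, while searching the labels gives what most posters expected.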
Re: [Rd] grep() and factors
On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote: > Hi all, > > Based upon an offlist communication this morning, I am somewhat confused > (more than I usually am on most Monday mornings...) about the use of > grep() with factors as the 'x' argument. > > The argument guidance in ?grep indicates: > > x, text a character vector where matches are sought. Coerced to >character if possible. > > and in the Details section: > > Arguments which should be character strings or character vectors are > coerced to character if possible. > > > The wording of both would seem to reasonably lead to the conclusion that > a factor could be coerced to a character vector by the use of > as.character(FACTOR). Well, that is not what is meant by the wording, nor what happens: there is no method dispatch so the factor is coerced from an integer vector to a character vector. 'coerced' usually means at low level: where as.character() is involved we tend to say so. As for the comments on what happens if value=TRUE: if the 'x' has been coerced, I would expect the value to be based on the coerced value (and it currently is). > grep("1", factor(letters)) [1] 1 10 11 12 13 14 15 16 17 18 19 21 > grep("1", factor(letters), value=TRUE) [1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "21" So whereas I am quite happy to replace the low-level coercion by method dispatch on as.character, I don't think this should be altered (and am pretty sure there is code out there which expects a character vector result). > In tracing through the C code in character.c for do_grep(), which in > turn calls coerceVector() in coerce.c, unless I am mis-reading the code > (always possible), I don't see an indication that a factor would be > coerced to a character vector. 
> > Since a factor -> character coercion would seem at face value, the most > logical coercion to take place when using grep(), I am curious if I am > missing something, or if perhaps ?grep needs to be more clear in the > coercions that will or might take place. Perhaps even the consideration > of an error message if a factor is passed as the 'x' argument, if indeed > the coercion would not take place. > > Perhaps the easiest example here might be: > > # On R Version 2.3.1 (2006-06-01) on FC5 > >> grep("[a-z]", letters) > [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 > [23] 23 24 25 26 > >> grep("[a-z]", factor(letters)) > numeric(0) > > > Thanks for any comments or any virtual rotten tomatoes coming my way at > high speed. :-) > > Marc Schwartz > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > > -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UKFax: +44 1865 272595 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] grep() and factors
On 6/5/06, Bill Dunlap <[EMAIL PROTECTED]> wrote: > On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote: > > > > > > grep("[a-z]", factor(letters)) > > > > numeric(0) > > > > > > I was recently surprised by this also. In addition, if > > > R's grep did support factors in this way, what sort of > > > object (factor or character) should it return when value=T? > > > I recently changed Splus's grep to return a character vector in > > > that case. > > > > > >Splus> grep("[def]", letters[26:1]) > > >[1] 21 22 23 > > >Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1])) > > >[1] 21 22 23 > > >Splus> grep("[def]", letters[26:1], value=T) > > >[1] "f" "e" "d" > > >Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), > > > value=T) > > >[1] "f" "e" "d" > > >Splus> class(.Last.value) > > >[1] "character" > > > > > > R does this when grepping an integer vector. > > >R> grep("1", 0:11, value=T) > > >[1] "1" "10" "11" > > > help(grep) says it returns "the matching elements themselves", but > > > doesn't say if "themselves" means before or after the conversion to > > > character. > > > > Bill, > > > > My first inclination for the return value when used on a factor would be > > the indexed factor elements where grep() would otherwise simply return > > the indices. This would also maintain the factor levels from the > > original source factor since "[".factor would normally retain these when > > drop = FALSE. > > That would be my first inclination also. I would have expected the output of > grep(pattern, text, value=TRUE) > to be identical to that of > text[grep(pattern, text, value=FALSE)] > no matter what class text has. > > No end users have seen this in Splus so we can change it to anything, > but we want to keep it the same as R's. > > > I could be convinced either way. 
The concern of course being that (given > > the offlist replies I have received today) even experienced users are > > getting bitten by the current behavior versus their intuitive > > expectations, which are at least loosely supported by the documentation.

If non-character text arguments are accepted, I would have expected them to be coerced to character, so that grep(pattern, text, ...) would return the same result as grep(pattern, as.character(text), ...).
Re: [Rd] grep() and factors
On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote: > > > > grep("[a-z]", factor(letters)) > > > numeric(0) > > > > I was recently surprised by this also. In addition, if > > R's grep did support factors in this way, what sort of > > object (factor or character) should it return when value=T? > > I recently changed Splus's grep to return a character vector in > > that case. > > > >Splus> grep("[def]", letters[26:1]) > >[1] 21 22 23 > >Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1])) > >[1] 21 22 23 > >Splus> grep("[def]", letters[26:1], value=T) > >[1] "f" "e" "d" > >Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), > > value=T) > >[1] "f" "e" "d" > >Splus> class(.Last.value) > >[1] "character" > > > > R does this when grepping an integer vector. > >R> grep("1", 0:11, value=T) > >[1] "1" "10" "11" > > help(grep) says it returns "the matching elements themselves", but > > doesn't say if "themselves" means before or after the conversion to > > character. > > Bill, > > My first inclination for the return value when used on a factor would be > the indexed factor elements where grep() would otherwise simply return > the indices. This would also maintain the factor levels from the > original source factor since "[".factor would normally retain these when > drop = FALSE. That would be my first inclination also. I would have expected the output of grep(pattern, text, value=TRUE) to be identical to that of text[grep(pattern, text, value=FALSE)] no matter what class text has. No end users have seen this in Splus so we can change it to anything, but we want to keep it the same as R's. > I could be convinced either way. The concern of course being that (given > the offlist replies I have received today) even experienced users are > getting bitten by the current behavior versus their intuitive > expectations, which are at least loosely supported by the documentation. 
> > HTH, > > Marc Schwartz Bill Dunlap Insightful Corporation bill at insightful dot com 360-428-8146 "All statements in this message represent the opinions of the author and do not necessarily reflect Insightful Corporation policy or position." __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] grep() and factors
Marc Schwartz (via MN) wrote: > On Mon, 2006-06-05 at 13:45 -0700, Bill Dunlap wrote: > >>On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote: >> >> >>>Based upon an offlist communication this morning, I am somewhat confused >>>(more than I usually am on most Monday mornings...) about the use of >>>grep() with factors as the 'x' argument. >>> ... >>> grep("[a-z]", letters) >>> >>> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 >>>[23] 23 24 25 26 >>> >>> grep("[a-z]", factor(letters)) >>> >>>numeric(0) >> >>I was recently surprised by this also. In addition, if >>R's grep did support factors in this way, what sort of >>object (factor or character) should it return when value=T? >>I recently changed Splus's grep to return a character vector in >>that case. >> >> Splus> grep("[def]", letters[26:1]) >> [1] 21 22 23 >> Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1])) >> [1] 21 22 23 >> Splus> grep("[def]", letters[26:1], value=T) >> [1] "f" "e" "d" >> Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), value=T) >> [1] "f" "e" "d" >> Splus> class(.Last.value) >> [1] "character" >> >>R does this when grepping an integer vector. >> R> grep("1", 0:11, value=T) >> [1] "1" "10" "11" >>help(grep) says it returns "the matching elements themselves", but >>doesn't say if "themselves" means before or after the conversion to >>character. > > > Bill, > > My first inclination for the return value when used on a factor would be > the indexed factor elements where grep() would otherwise simply return > the indices. This would also maintain the factor levels from the > original source factor since "[".factor would normally retain these when > drop = FALSE. 
> > For example: > > # Return the indexed values as would otherwise be done > # in grep() if the factor to character coercion takes place: > # Use the same indices 21:23 as above > > >>factor(letters[26:1], levels = letters[26:1])[21:23] > > [1] f e d > Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a > > > >>From my read of the C code in do_grep() in character.c (again, if > correct), when 'value = TRUE', the C code appears to first get the > indices and then build the returned vector from the indexed values from > the source vector in a for() loop. So this should not be a problem > philosophically. > > However, given your example of the coercion of integers, perhaps with > grep() at least, consistent behavior would dictate that return values > are always character vectors. These could then be coerced manually back > to a factor, using the original levels, as may be required: > > >>factor.letters <- factor(letters[26:1], levels=letters[26:1]) >>factor.letters > > [1] z y x w v u t s r q p o n m l k j i h g f e d c b a > Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a > > >>grep("[def]", as.character(factor.letters)) > > [1] 21 22 23 > > >>res <- grep("[def]", as.character(factor.letters), value = TRUE) >>res > > [1] "f" "e" "d" > > >>factor(res, levels = levels(factor.letters)) > > [1] f e d > Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a > > Which of course is the same result I proposed initially above. > > I could be convinced either way. The concern of course being that (given > the offlist replies I have received today) even experienced users are > getting bitten by the current behavior versus their intuitive > expectations, which are at least loosely supported by the documentation. I'll chime in on-list to say that I have had the same experience with expecting grep to coerce to text. 
Despite the question of return values, I think of grep (not equivalent to the Unix command, I understand, but it does have the same name) as operating on "text", not the factor levels themselves. Not a big deal, but it does sometimes lead to hard-to-track bugs if one is not careful to put in as.character all the time.

Sean
Re: [Rd] grep() and factors
On Mon, 2006-06-05 at 13:45 -0700, Bill Dunlap wrote: > On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote: > > > Based upon an offlist communication this morning, I am somewhat confused > > (more than I usually am on most Monday mornings...) about the use of > > grep() with factors as the 'x' argument. > > ... > > > grep("[a-z]", letters) > > [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 > > [23] 23 24 25 26 > > > > > grep("[a-z]", factor(letters)) > > numeric(0) > > I was recently surprised by this also. In addition, if > R's grep did support factors in this way, what sort of > object (factor or character) should it return when value=T? > I recently changed Splus's grep to return a character vector in > that case. > >Splus> grep("[def]", letters[26:1]) >[1] 21 22 23 >Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1])) >[1] 21 22 23 >Splus> grep("[def]", letters[26:1], value=T) >[1] "f" "e" "d" >Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), value=T) >[1] "f" "e" "d" >Splus> class(.Last.value) >[1] "character" > > R does this when grepping an integer vector. >R> grep("1", 0:11, value=T) >[1] "1" "10" "11" > help(grep) says it returns "the matching elements themselves", but > doesn't say if "themselves" means before or after the conversion to > character. Bill, My first inclination for the return value when used on a factor would be the indexed factor elements where grep() would otherwise simply return the indices. This would also maintain the factor levels from the original source factor since "[".factor would normally retain these when drop = FALSE. 
For example:

# Return the indexed values as would otherwise be done
# in grep() if the factor to character coercion takes place:
# Use the same indices 21:23 as above

> factor(letters[26:1], levels = letters[26:1])[21:23]
[1] f e d
Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a

From my read of the C code in do_grep() in character.c (again, if correct), when 'value = TRUE', the C code appears to first get the indices and then build the returned vector from the indexed values from the source vector in a for() loop. So this should not be a problem philosophically.

However, given your example of the coercion of integers, perhaps with grep() at least, consistent behavior would dictate that return values are always character vectors. These could then be coerced manually back to a factor, using the original levels, as may be required:

> factor.letters <- factor(letters[26:1], levels=letters[26:1])
> factor.letters
 [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a

> grep("[def]", as.character(factor.letters))
[1] 21 22 23

> res <- grep("[def]", as.character(factor.letters), value = TRUE)
> res
[1] "f" "e" "d"

> factor(res, levels = levels(factor.letters))
[1] f e d
Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a

Which of course is the same result I proposed initially above.

I could be convinced either way. The concern of course being that (given the offlist replies I have received today) even experienced users are getting bitten by the current behavior versus their intuitive expectations, which are at least loosely supported by the documentation.

HTH,

Marc Schwartz
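The manual round trip described above can be folded into one function. This is a sketch only: grep_factor is a hypothetical name, and it assumes the caller wants a factor back with the original levels, which is one of the two options the message weighs, not a settled behaviour.

```r
# Hypothetical helper, not part of base R.
grep_factor <- function(pattern, f) {
  stopifnot(is.factor(f))
  # Match against the level labels, then index the factor itself;
  # "[.factor" keeps the original levels unless drop = TRUE is given.
  idx <- grep(pattern, as.character(f))
  f[idx]
}

factor.letters <- factor(letters[26:1], levels = letters[26:1])
res <- grep_factor("[def]", factor.letters)

as.character(res)  # "f" "e" "d"
nlevels(res)       # 26
```

Indexing the factor avoids rebuilding it from a character vector, so the levels and their order carry over without being restated.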
Re: [Rd] grep() and factors
On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote: > Based upon an offlist communication this morning, I am somewhat confused > (more than I usually am on most Monday mornings...) about the use of > grep() with factors as the 'x' argument. > ... > > grep("[a-z]", letters) > [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 > [23] 23 24 25 26 > > > grep("[a-z]", factor(letters)) > numeric(0) I was recently surprised by this also. In addition, if R's grep did support factors in this way, what sort of object (factor or character) should it return when value=T? I recently changed Splus's grep to return a character vector in that case. Splus> grep("[def]", letters[26:1]) [1] 21 22 23 Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1])) [1] 21 22 23 Splus> grep("[def]", letters[26:1], value=T) [1] "f" "e" "d" Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), value=T) [1] "f" "e" "d" Splus> class(.Last.value) [1] "character" R does this when grepping an integer vector. R> grep("1", 0:11, value=T) [1] "1" "10" "11" help(grep) says it returns "the matching elements themselves", but doesn't say if "themselves" means before or after the conversion to character. Bill Dunlap Insightful Corporation bill at insightful dot com 360-428-8146 "All statements in this message represent the opinions of the author and do not necessarily reflect Insightful Corporation policy or position." __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
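The ambiguity about "the matching elements themselves" can be seen with the integer example from the message above. A small sketch, with illustrative variable names, comparing value=TRUE against indexing the original vector:

```r
x <- 0:11

idx      <- grep("1", x)                # 2 11 12
by_value <- grep("1", x, value = TRUE)  # character: "1" "10" "11"
by_index <- x[idx]                      # integer:   1 10 11

same_as_is     <- identical(by_value, by_index)                # FALSE
same_after_chr <- identical(by_value, as.character(by_index))  # TRUE
```

So for an integer vector, value=TRUE gives the elements after conversion to character, which is why the two forms agree only once the indexed values are converted as well.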