(Sheepishly)... Yes, thank you Hervé. It would have been nice if I had given correct solutions. Of course, fixed = TRUE could not have worked with the "[^a]" character class!
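To make the point about fixed = TRUE concrete, here is a minimal sketch. The toy vector x and the result names are made up for illustration; they are not the data used in the timings in this thread.

```r
# Hypothetical toy input, not the vector used in the timings
x <- c("banana", "abracadabra", "xyz")

# fixed = FALSE: "[^a]" is a regex character class, so every character
# other than "a" is removed and nchar() counts the remaining "a"s
counts_regex <- nchar(gsub("[^a]", "", x, fixed = FALSE))
counts_regex
# [1] 3 5 0

# fixed = TRUE: "[^a]" is searched for as a literal four-character string,
# which never occurs here, so nothing is removed and nchar() just returns
# the full string lengths
lengths_literal <- nchar(gsub("[^a]", "", x, fixed = TRUE))
lengths_literal
# [1]  6 11  3

# fixed = TRUE works when the pattern is the literal character itself:
# delete the "a"s and take the difference in length
counts_fixed <- nchar(x) - nchar(gsub("a", "", x, fixed = TRUE))
counts_fixed
# [1] 3 5 0
```

So the character-class idiom needs the regex engine, while the subtract-the-lengths idiom can use the literal matcher.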
Here's what I found with a 10-element vector, each member of which is a
1e5-length string:

> system.time((lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1))
   user  system elapsed
  0.013   0.000   0.013

> system.time(nchar(gsub("[^a]", "", x,fixed = FALSE)))
   user  system elapsed
  0.251   0.000   0.252   ## WAYYYY slower

> system.time(nchar(x) - nchar(gsub("a", "", x,fixed = TRUE)))
   user  system elapsed
  0.007   0.000   0.007   ## twice as fast

Clearly and unsurprisingly, the message is to avoid fixed = FALSE;
after that, it seems mostly to be: who cares?!

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Mon, Nov 14, 2016 at 12:26 PM, Hervé Pagès <hpa...@fredhutch.org> wrote:
> Hi,
>
> FWIW using gsub( , fixed=TRUE) is faster than using gsub( , fixed=FALSE)
> or strsplit( , fixed=TRUE):
>
>   set.seed(1)
>   Vec <- paste(sample(letters, 5000000, replace = TRUE), collapse = "")
>
>   system.time(res1 <- nchar(gsub("[^a]", "", Vec)))
>   #    user  system elapsed
>   #   0.585   0.000   0.586
>
>   system.time(res2 <- lengths(strsplit(Vec,"a",fixed=TRUE)) - 1L)
>   #    user  system elapsed
>   #   0.061   0.000   0.061
>
>   system.time(res3 <- nchar(Vec) - nchar(gsub("a", "", Vec, fixed=TRUE)))
>   #    user  system elapsed
>   #   0.039   0.000   0.039
>
>   identical(res1, res2)
>   # [1] TRUE
>   identical(res1, res3)
>   # [1] TRUE
>
> The gsub( , fixed=TRUE) solution also uses slightly less memory than the
> strsplit( , fixed=TRUE) solution.
>
> Cheers,
> H.
>
>
> On 11/14/2016 11:55 AM, Charles C. Berry wrote:
>>
>> On Mon, 14 Nov 2016, Marc Schwartz wrote:
>>
>>>
>>>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccbe...@ucsd.edu> wrote:
>>>>
>>>> On Mon, 14 Nov 2016, Bert Gunter wrote:
>>>>
>> [stuff deleted]
>>
>>> Hi,
>>>
>>> Both gsub() and strsplit() are using regex based pattern matching
>>> internally.
>>> That being said, they are ultimately calling .Internal
>>> code, so both are pretty fast.
>>>
>>> For comparison:
>>>
>>> ## Create a 1,000,000 character vector
>>> set.seed(1)
>>> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
>>>
>>> > nchar(Vec)
>>> [1] 1000000
>>>
>>> ## Split the vector into single characters and tabulate
>>> > table(strsplit(Vec, split = "")[[1]])
>>>
>>>     a     b     c     d     e     f     g     h     i     j     k     l
>>> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
>>>     m     n     o     p     q     r     s     t     u     v     w     x
>>> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
>>>     y     z
>>> 38265 38299
>>>
>>>
>>> ## Get just the count of "a"
>>> > table(strsplit(Vec, split = "")[[1]])["a"]
>>>     a
>>> 38664
>>>
>>> > nchar(gsub("[^a]", "", Vec))
>>> [1] 38664
>>>
>>>
>>> ## Check performance
>>> > system.time(table(strsplit(Vec, split = "")[[1]])["a"])
>>>    user  system elapsed
>>>   0.100   0.007   0.107
>>>
>>> > system.time(nchar(gsub("[^a]", "", Vec)))
>>>    user  system elapsed
>>>   0.270   0.001   0.272
>>>
>>>
>>> So, the above would suggest that using strsplit() is somewhat faster
>>> than using gsub(). However, as Chuck notes, in the absence of more
>>> exhaustive benchmarking, the difference may or may not be more
>>> generalizable.
>>
>>
>> Whether splitting on fixed strings rather than treating them as
>> regex'es (i.e. `fixed=TRUE') makes a big difference seems to depend on
>> what you split:
>>
>> First repeating what Marc did...
>>
>> > system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
>>    user  system elapsed
>>   0.132   0.010   0.139
>>
>> > system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
>>    user  system elapsed
>>   0.130   0.010   0.138
>>
>> ... fixed=TRUE hardly matters. But the idiom I proposed...
>>
>> > system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
>>    user  system elapsed
>>   0.017   0.000   0.018
>>
>> > system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
>>    user  system elapsed
>>   0.104   0.000   0.104
>>
>> ... is 5 times faster with fixed=TRUE for this case.
>>
>> This result matches Marc's count:
>>
>> > sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
>> [1] 38664
>>
>> Chuck
>>
>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fredhutch.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319