Chuck, Marc, and anyone else who still has interest in this odd little discussion ...
Yes, and with fixed = TRUE my approach took 1/3 as much time as
Chuck's with a 10 element vector each element of which is a character
string of length 1e5:

> set.seed(1001)
> x <- sapply(1:10, function(x)paste0(sample(letters,1e5,rep=TRUE),collapse = ""))
> system.time(sum(lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1))
   user  system elapsed
  0.012   0.000   0.012
> system.time(nchar(gsub("[^a]", "", x,fixed = TRUE)))
   user  system elapsed
  0.004   0.000   0.004

Best,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Mon, Nov 14, 2016 at 11:55 AM, Charles C. Berry <ccbe...@ucsd.edu> wrote:
> On Mon, 14 Nov 2016, Marc Schwartz wrote:
>
>>
>>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccbe...@ucsd.edu> wrote:
>>>
>>> On Mon, 14 Nov 2016, Bert Gunter wrote:
>>>
> [stuff deleted]
>
>> Hi,
>>
>> Both gsub() and strsplit() are using regex based pattern matching
>> internally. That being said, they are ultimately calling .Internal code, so
>> both are pretty fast.
>>
>> For comparison:
>>
>> ## Create a 1,000,000 character vector
>> set.seed(1)
>> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
>>
>> > nchar(Vec)
>> [1] 1000000
>>
>> ## Split the vector into single characters and tabulate
>> > table(strsplit(Vec, split = "")[[1]])
>>
>>     a     b     c     d     e     f     g     h     i     j     k     l
>> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
>>     m     n     o     p     q     r     s     t     u     v     w     x
>> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
>>     y     z
>> 38265 38299
>>
>> ## Get just the count of "a"
>> > table(strsplit(Vec, split = "")[[1]])["a"]
>>     a
>> 38664
>>
>> > nchar(gsub("[^a]", "", Vec))
>> [1] 38664
>>
>> ## Check performance
>> > system.time(table(strsplit(Vec, split = "")[[1]])["a"])
>>    user  system elapsed
>>   0.100   0.007   0.107
>>
>> > system.time(nchar(gsub("[^a]", "", Vec)))
>>    user  system elapsed
>>   0.270   0.001   0.272
>>
>> So, the above would suggest that using strsplit() is somewhat faster than
>> using gsub(). However, as Chuck notes, in the absence of more exhaustive
>> benchmarking, the difference may or may not be more generalizable.
>
> Whether splitting on fixed strings rather than treating them as
> regex'es (i.e. `fixed=TRUE') makes a big difference seems to depend on
> what you split:
>
> First repeating what Marc did...
>
> > system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
>    user  system elapsed
>   0.132   0.010   0.139
> > system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
>    user  system elapsed
>   0.130   0.010   0.138
>
> ... fixed=TRUE hardly matters. But the idiom I proposed...
>
> > system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
>    user  system elapsed
>   0.017   0.000   0.018
> > system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
>    user  system elapsed
>   0.104   0.000   0.104
>
> ... is 5 times faster with fixed=TRUE for this case.
>
> This result matches Marc's count:
>
> > sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
> [1] 38664
>
> Chuck

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
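
A caveat on the gsub() timing at the top of this message: with fixed = TRUE the pattern "[^a]" is matched as a literal four-character string rather than as a regular expression, so that call removes nothing and does not count anything. The following is only a minimal sketch to check that the idioms in this thread agree and to show that effect. The input is an assumption made for illustration (three strings of 1000 letters, far shorter than the ones timed above), and the names ref, split_count, gsub_regex and gsub_fixed are not from the thread.

```r
## Small reproducible input (hypothetical sizes, chosen so the check is instant)
set.seed(1001)
x <- sapply(1:3, function(i) paste0(sample(letters, 1000, replace = TRUE),
                                    collapse = ""))

## Reference: count "a" in each string character by character
ref <- vapply(strsplit(x, "", fixed = TRUE),
              function(ch) sum(ch == "a"), integer(1))

## strsplit() idiom from the thread: padding with "X" on both sides means the
## number of pieces is always (number of "a") + 1
split_count <- lengths(strsplit(paste0("X", x, "X"), "a", fixed = TRUE)) - 1L

## gsub() idiom with the pattern treated as a regex (the default)
gsub_regex <- nchar(gsub("[^a]", "", x))

## Same call with fixed = TRUE: "[^a]" is now a literal string that never
## occurs in x, so nothing is removed and nchar() returns the full length
gsub_fixed <- nchar(gsub("[^a]", "", x, fixed = TRUE))

## Both counting idioms agree with the reference
stopifnot(identical(split_count, ref), identical(gsub_regex, ref))

## The fixed = TRUE gsub() result is just the string length, not a count
all(gsub_fixed == nchar(x))
```

If this reading is right, a like-for-like timing of the two idioms would compare strsplit(..., fixed = TRUE) against gsub() without fixed = TRUE, since the negated character class needs regex matching to work.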