Re: [R] Frequency of a character in a string

Hervé Pagès Mon, 14 Nov 2016 12:28:11 -0800

Hi,

FWIW using gsub( , fixed=TRUE) is faster than using gsub( , fixed=FALSE)
or strsplit( , fixed=TRUE):


  set.seed(1)
  Vec <- paste(sample(letters, 5000000, replace = TRUE), collapse = "")

  system.time(res1 <- nchar(gsub("[^a]", "", Vec)))
  #  user  system elapsed
  # 0.585   0.000   0.586

  system.time(res2 <- lengths(strsplit(Vec,"a",fixed=TRUE)) - 1L)
  #  user  system elapsed
  # 0.061   0.000   0.061

  system.time(res3 <- nchar(Vec) - nchar(gsub("a", "", Vec, fixed=TRUE)))
  #  user  system elapsed
  # 0.039   0.000   0.039

  identical(res1, res2)
  # [1] TRUE
  identical(res1, res3)
  # [1] TRUE

The gsub( , fixed=TRUE) solution also uses slightly less memory than the
strsplit( , fixed=TRUE) solution.
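
For reuse, the gsub( , fixed=TRUE) idiom above can be wrapped in a small helper. This is only a sketch: the name count_fixed is hypothetical, and dividing by nchar(pattern) is an added assumption so that multi-character patterns are counted as non-overlapping matches.

```r
# Hypothetical helper (not from the thread): count non-overlapping
# occurrences of a fixed, non-empty pattern in each element of x.
count_fixed <- function(x, pattern) {
  stopifnot(is.character(pattern), length(pattern) == 1L, nzchar(pattern))
  # Characters removed by gsub() = (number of matches) * nchar(pattern)
  removed <- nchar(x) - nchar(gsub(pattern, "", x, fixed = TRUE))
  removed %/% nchar(pattern)
}

single  <- count_fixed("banana", "a")
vector  <- count_fixed(c("banana", "xyz"), "a")
twochar <- count_fixed("banana", "an")
```

Because nchar() and gsub() are both vectorized, the helper returns one count per element of x.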

Cheers,
H.


On 11/14/2016 11:55 AM, Charles C. Berry wrote:

On Mon, 14 Nov 2016, Marc Schwartz wrote:

On Nov 14, 2016, at 11:26 AM, Charles C. Berry <ccbe...@ucsd.edu> wrote:

On Mon, 14 Nov 2016, Bert Gunter wrote:

[stuff deleted]

Hi,

Both gsub() and strsplit() use regex-based pattern matching by default
(fixed = FALSE). That said, both ultimately call .Internal code, so
both are pretty fast.

For comparison:

## Create a single string of 1,000,000 characters
set.seed(1)
Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")

nchar(Vec)

[1] 1000000

## Split the string into single characters and tabulate

table(strsplit(Vec, split = "")[[1]])


   a     b     c     d     e     f     g     h     i     j     k     l
38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
   m     n     o     p     q     r     s     t     u     v     w     x
38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
   y     z
38265 38299


## Get just the count of "a"

table(strsplit(Vec, split = "")[[1]])["a"]

   a
38664

nchar(gsub("[^a]", "", Vec))

[1] 38664
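
One difference between the two approaches shows up when the character is absent from the string: the gsub() version returns 0, while indexing a table by a name it does not contain generally yields NA. A minimal sketch of one way around that, assuming it is acceptable to tabulate a factor whose only level is the target (the name count_via_table is hypothetical):

```r
# Hypothetical helper (not from the thread): table()-based count that
# reports 0 instead of NA when the target character does not occur.
count_via_table <- function(x, target) {
  chars <- strsplit(x, split = "", fixed = TRUE)[[1]]
  # Fixing the factor levels guarantees a table cell for `target`;
  # all other characters become NA and are dropped by table().
  tab <- table(factor(chars, levels = target))
  as.integer(tab[target])
}

present <- count_via_table("banana", "a")
absent  <- count_via_table("xyz", "a")
```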


## Check performance

system.time(table(strsplit(Vec, split = "")[[1]])["a"])

  user  system elapsed
 0.100   0.007   0.107

system.time(nchar(gsub("[^a]", "", Vec)))

  user  system elapsed
 0.270   0.001   0.272


So, the above suggests that strsplit() is somewhat faster than gsub()
here. However, as Chuck notes, without more exhaustive benchmarking it
is unclear how far the difference generalizes.
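
A single system.time() call can be noisy, so one modest step toward more exhaustive benchmarking is to repeat each call and compare medians. The sketch below uses only base R; the helper name median_elapsed, the repetition count, and the shorter test string are assumptions made for illustration.

```r
# Hypothetical helper (not from the thread): median elapsed time of
# calling f() `reps` times.
median_elapsed <- function(f, reps = 5L) {
  times <- vapply(
    seq_len(reps),
    function(i) system.time(f())[["elapsed"]],
    numeric(1)
  )
  median(times)
}

set.seed(1)
Vec <- paste(sample(letters, 100000, replace = TRUE), collapse = "")

t_split <- median_elapsed(function() table(strsplit(Vec, split = "")[[1]])["a"])
t_gsub  <- median_elapsed(function() nchar(gsub("[^a]", "", Vec)))
```

Timings will vary by machine and R version, so no expected values are shown.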



Whether splitting on fixed strings rather than treating them as
regexes (i.e. `fixed=TRUE') makes a big difference seems to depend on
what you split:

First repeating what Marc did...

system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])

   user  system elapsed
  0.132   0.010   0.139

system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])

   user  system elapsed
  0.130   0.010   0.138

... fixed=TRUE hardly matters. But the idiom I proposed...

system.time(sum(lengths(strsplit(paste0("X", Vec, "X"), "a", fixed=TRUE)) - 1))

   user  system elapsed
  0.017   0.000   0.018

system.time(sum(lengths(strsplit(paste0("X", Vec, "X"), "a", fixed=FALSE)) - 1))

   user  system elapsed
  0.104   0.000   0.104


... is about 6 times faster with fixed=TRUE for this case (0.018 vs.
0.104 elapsed).

This result matches Marc's count:

sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)

[1] 38664
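
The paste0("X", ..., "X") padding matters because strsplit() does not return a trailing empty string when the input ends with the split pattern, so an unpadded lengths(strsplit(...)) - 1 undercounts by one in that case. A small sketch of the difference, assuming the padding character differs from the character being counted (the name count_by_split is hypothetical):

```r
# Hypothetical helper (not from the thread): padded strsplit() count.
count_by_split <- function(x, target, pad = "X") {
  # The pad must not be the character being counted.
  stopifnot(pad != target)
  sum(lengths(strsplit(paste0(pad, x, pad), target, fixed = TRUE)) - 1L)
}

# "abca" ends with "a": the unpadded split is c("", "bc"), so one match is lost.
unpadded <- lengths(strsplit("abca", "a", fixed = TRUE)) - 1L
padded   <- count_by_split("abca", "a")
```

Here unpadded is 1 and padded is 2, the latter being the true number of "a" characters.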


Chuck

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
