Thanks for 6 ways to skin this cat! I am just beginning to learn about the power of regular expressions and appreciate the many examples of how they can be used in this context. This knowledge will come in handy the next time the number of characters is variable both before and after the dot. On my machine and for my particular example, however, Seth is correct in that substr is by far the fastest. I had forgotten that substr is vectorized.
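As a minimal sketch of the two points above: the vectors here are made up for illustration and are not the real go.ids data. It shows substr() handling a whole vector in one call, and a regular expression coping with prefixes of different lengths.

```r
# Hypothetical example vector in the thread's format (not the real data).
ids <- c("GO:0006091.NA", "GO:0008104.ISS", "GO:0006091.NAS")

# substr() is vectorized: one call processes every element, no loop needed.
fixed <- substr(ids, 1, 10)
print(fixed)

# When the number of characters before the dot varies, a fixed-width
# substr() no longer works, but a regex anchored on the first dot does.
mixed <- c("GO:0006091.NA", "X12.ISS", "longer_prefix.a.b")
stripped <- sub("\\..*$", "", mixed)
print(stripped)
```

The pattern "\\..*$" matches from the first literal dot to the end of the string, so everything before that dot is kept regardless of its length.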
Below is the output of my speed trials and sessionInfo in case anyone is curious. I artificially made the go.id vector 10X its normal length to magnify differences. I did also check to verify that each solution worked as predicted, which they all did.

Thanks again for your generous help,
Mark

> length(go.ids)
[1] 79750
> go.ids[1:5]
[1] "GO:0006091.NA"  "GO:0008104.ISS" "GO:0008104.ISS" "GO:0006091.NA"  "GO:0006091.NAS"
> system.time(z <- gsub("[.].*", "", go.ids))
[1] 0.47 0.00 0.47   NA   NA
> system.time(z <- gsub('\\..+$','', go.ids))
[1] 0.56 0.00 0.56   NA   NA
> system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
[1] 1.08 0.00 1.09   NA   NA
> system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
[1] 1.03 0.00 1.03   NA   NA
> system.time(z <- sub("\\..+", "", go.ids))
[1] 0.49 0.00 0.48   NA   NA
> system.time(z <- substr(go.ids, 0, 10))
[1] 0.02 0.00 0.01   NA   NA
> sessionInfo()
R version 2.4.1 (2006-12-18)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] "splines"   "stats"     "graphics"  "grDevices" "datasets"  "utils"     "tools"     "methods"   "base"

other attached packages:
        rat2302 xlsReadWritePro          qvalue   affycoretools         biomaRt           RCurl             XML         GOstats        Category
       "1.14.0"         "1.0.6"         "1.8.0"         "1.6.0"         "1.8.1"         "0.8-0"         "1.2-0"         "2.0.4"         "2.0.3"
     genefilter        survival            KEGG            RBGL        annotate              GO           graph         RWinEdt           limma
       "1.12.0"          "2.30"        "1.14.1"        "1.10.0"        "1.12.1"        "1.14.1"        "1.12.0"         "1.7-5"         "2.9.1"
           affy          affyio         Biobase
       "1.12.2"         "1.2.0"        "1.12.2"

Mark W. Kimpel MD

-----Original Message-----
From: Marc Schwartz
Sent: Wednesday, January 17, 2007 8:11 PM
To: Seth Falcon
Cc: Kimpel, Mark William; r-help@stat.math.ethz.ch
Subject: Re: [R] help with regexpr in gsub

On Wed, 2007-01-17 at 16:46 -0800, Seth Falcon wrote:
> "Kimpel, Mark William" <[EMAIL PROTECTED]> writes:
>
> > I have a very long vector of character strings of the format
> > "GO:0008104.ISS" and need to strip off the dot and anything that follows
> > it. There are always 10 characters before the dot. The actual characters
> > and the number of them after the dot is variable.
> >
> > So, I would like to return in the format "GO:0008104" . I could do this
> > with substr and loop over the entire vector, but I thought there might
> > be a more elegant (and faster) way to do this.
> >
> > I have tried gsub using regular expressions without success. The code
> >
> > gsub(pattern= "\.*?" , replacement="", x=character.vector)
>
> I guess you want:
>
>     sub("([GO:0-9]+)\\..*$", "\\1", goids)
>
> [You don't need gsub here]
>
> But I don't understand why you wouldn't want to use substr. At least
> for me substr looks to be about 20x faster than sub for this
> problem...
>
>
> > library(GO)
> > goids = ls(GOTERM)
> > gids = paste(goids, "ISS", sep=".")
> > gids[1:10]
> [1] "GO:0000001.ISS" "GO:0000002.ISS" "GO:0000003.ISS" "GO:0000004.ISS"
> [5] "GO:0000006.ISS" "GO:0000007.ISS" "GO:0000009.ISS" "GO:0000010.ISS"
> [9] "GO:0000011.ISS" "GO:0000012.ISS"
>
> > system.time(z <- substr(gids, 0, 10))
>    user  system elapsed
>   0.008   0.000   0.007
>
> > system.time(z2 <- sub("([GO:0-9]+)\\..*$", "\\1", gids))
>    user  system elapsed
>   0.136   0.000   0.134

I think that some of the overhead here in using sub() is due to the effective partitioning of the source vector, a more complex regex and then just returning the first element.
This can be shortened to:

# Note that I have 12 elements here
> gids
[1] "GO:0000001.ISS" "GO:0000002.ISS" "GO:0000003.ISS" "GO:0000004.ISS"
[5] "GO:0000005.ISS" "GO:0000006.ISS" "GO:0000007.ISS" "GO:0000008.ISS"
[9] "GO:0000009.ISS" "GO:0000010.ISS" "GO:0000011.ISS" "GO:0000012.ISS"

> system.time(z2 <- sub("\\..+", "", gids))
[1] 0 0 0 0 0

> z2
 [1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000004" "GO:0000005"
 [6] "GO:0000006" "GO:0000007" "GO:0000008" "GO:0000009" "GO:0000010"
[11] "GO:0000011" "GO:0000012"

Which would appear to be quicker than using substr().

HTH,

Marc Schwartz
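A timing of all zeros on a 12-element vector is below the resolution of system.time(), so by itself it cannot rank sub() against substr(). A rough sketch of a comparison on a larger vector follows; the synthetic IDs and the size n are assumptions made for illustration, and the timings will differ by machine and R version.

```r
# Synthetic IDs in the thread's "GO:0008104.ISS" format (made up, not real GO terms).
n <- 100000
suffixes <- rep_len(c("ISS", "NA", "NAS"), n)
gids <- sprintf("GO:%07d.%s", seq_len(n), suffixes)

# Time both approaches on the same large vector.
t_substr <- system.time(z1 <- substr(gids, 1, 10))
t_sub    <- system.time(z2 <- sub("\\..+", "", gids))

# Both approaches should return the same stripped IDs.
stopifnot(identical(z1, z2))

print(t_substr)
print(t_sub)
```

Checking identical(z1, z2) before comparing timings guards against benchmarking two calls that do not actually produce the same result.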