Re: [R] Regular Expression help

Gabor Grothendieck wrote:
> Here are a few more solutions. x is the input vector of character
> strings. The first is a slightly shorter version of one of Wacek's.
> The next three all create an anonymous grouping variable (using sub,
> substr/gsub and strapply respectively) whose components are p and q
> and then tapply is used to separate out the corresponding components
> of x according to the grouping:
>
>   sapply(c(p = "^[^pq]*p", q = "^[^pq]*q"), grep, x = x, value = TRUE)
>
>   tapply(x, sub("^[^pq]*(.).*", "\\1", x), c)
>
>   tapply(x, substr(gsub("[^pq]", "", x), 1, 1), c)
>
>   library(gsubfn)
>   tapply(x, strapply(x, "^[^pq]*(.)", simplify = c), c)

wow! cool stuff. if you're interested in comparing their efficiency, source the attached script.

vQ

# n random lowercase strings of length m
generate = function(n, m)
   replicate(n, paste(sample(letters, m, replace=TRUE), collapse=""))

# one function per candidate solution
tests = list(
   wacek = function(data) {
      p = grep("^[^pq]*p", data)
      list(p=data[p], q=data[-p]) },
   gabor1 = function(data)
      sapply(c(p="^[^pq]*p", q="^[^pq]*q"), grep, x=data, value=TRUE),
   gabor2 = function(data)
      tapply(data, sub("^[^pq]*(.).*", "\\1", data), c),
   gabor3 = function(data)
      tapply(data, substr(gsub("[^pq]", "", data), 1, 1), c),
   gabor4 = {
      library(gsubfn)
      function(data)
         tapply(data, strapply(data, "^[^pq]*(.)", simplify=c), c) } )

data = generate(1000,10)

# time 30 runs of each solution on the same data
lapply(names(tests),
   function(name) {
      cat(name, ":\n", sep="")
      print(system.time(replicate(30,tests[[name]](data)))) } )

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Regular Expression help

Wacek Kusnierczyk wrote:
> wow! cool stuff. if you're interested in comparing their efficiency,
> source the attached script.

using lapply for code with side effects should probably be considered bad practice, so replace lapply with a for loop. sorry.

vQ

# n random lowercase strings of length m
generate = function(n, m)
   replicate(n, paste(sample(letters, m, replace=TRUE), collapse=""))

# one function per candidate solution
tests = list(
   wacek = function(data) {
      p = grep("^[^pq]*p", data)
      list(p=data[p], q=data[-p]) },
   gabor1 = function(data)
      sapply(c(p="^[^pq]*p", q="^[^pq]*q"), grep, x=data, value=TRUE),
   gabor2 = function(data)
      tapply(data, sub("^[^pq]*(.).*", "\\1", data), c),
   gabor3 = function(data)
      tapply(data, substr(gsub("[^pq]", "", data), 1, 1), c),
   gabor4 = {
      library(gsubfn)
      function(data)
         tapply(data, strapply(data, "^[^pq]*(.)", simplify=c), c) } )

data = generate(1000,10)

# time 30 runs of each solution on the same data
for (name in names(tests)) {
   cat(name, ":\n", sep="")
   print(system.time(replicate(30,tests[[name]](data)))) }
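One thing worth checking before reading the timings: with strings drawn from all 26 letters, many contain neither p nor q, and the index-based and pattern-based solutions treat those strings differently. The sketch below is only an illustration under assumptions not in the thread (the seed and the names has_arm, index_based and pattern_based are made up here); it reuses the generate function from the script.

```r
set.seed(1)  # hypothetical seed, only to make the sketch repeatable

generate <- function(n, m)
  replicate(n, paste(sample(letters, m, replace = TRUE), collapse = ""))

data <- generate(1000, 10)

# Which strings contain a p or a q at all?
has_arm <- grepl("[pq]", data)
cat("share of strings with p or q:", mean(has_arm), "\n")

# Index-based split: everything that is not p-first ends up in q,
# including strings that contain neither letter.
p <- grep("^[^pq]*p", data)
index_based <- list(p = data[p], q = data[-p])

# Pattern-based split: strings with neither letter match no pattern
# and are dropped.
pattern_based <- lapply(c(p = "^[^pq]*p", q = "^[^pq]*q"),
                        grep, x = data, value = TRUE)

cat("q group, index-based:  ", length(index_based$q), "\n")
cat("q group, pattern-based:", length(pattern_based$q), "\n")
```

The p groups agree, while the q groups differ by exactly the number of strings that have neither letter, so the functions in the benchmark are not doing identical work on this kind of data.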
Re: [R] Regular Expression help

For the problem at hand I think I would use your solution, which is both easily understood and fastest. On the other hand, the tapply-based solutions are coordinate-free (i.e. no explicit mucking with indices) and readily generalize to more than 2 groups -- just replace [^pq] with [^pqr], say.

On Sat, Nov 8, 2008 at 4:21 PM, Wacek Kusnierczyk [EMAIL PROTECTED] wrote:
> wow! cool stuff. if you're interested in comparing their efficiency,
> source the attached script.
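The generalization to three groups can be sketched as follows. The vector y and its contents are hypothetical, invented only for this illustration; the pattern is the sub-based one from the thread with [^pq] widened to [^pqr].

```r
# Hypothetical input: strings whose first p, q or r decides the group.
y <- c("ab-p1", "q2-p", "xr3", "trq")

# [^pqr]* skips everything before the first p, q or r; (.) captures it.
key <- sub("^[^pqr]*(.).*", "\\1", y)
print(key)

# tapply splits y by the captured letter; no indices are handled by hand.
res <- tapply(y, key, c)
print(res)
```

Note that a string containing none of the letters would still get a key (its last character), so this sketch assumes every string has at least one of p, q or r.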
Re: [R] Regular Expression help

Gabor Grothendieck wrote:
> For the problem at hand I think I would use your solution which is
> both easily understood and fastest. On the other hand the tapply
> based solutions are coordinate free (i.e. no explicit mucking with
> indices) and readily generalize to more than 2 groups -- just
> replace [^pq] with [^pqr], say.

for sure, mine was optimized towards the case, not towards generalizability. the gsubfn one loses on speed, though. but the first one *is* easily generalizable, e.g.,

   letters = "pqrs"
   sapply(sprintf("^[^%s]*%s", letters, unlist(strsplit(letters, split=""))),
      grep, x=x, value=TRUE)

while being an order of magnitude faster than the tapply ones.

vQ
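A small runnable variant of the generalized grep approach, as a sketch only. The names arms, single, patterns and the sample vector y are assumptions made for this illustration; arms is used instead of letters so that the built-in letters constant is not masked, and lapply is used so the result is always a list.

```r
arms <- "pqrs"
single <- unlist(strsplit(arms, split = ""))   # "p" "q" "r" "s"

# One anchored pattern per letter, e.g. "^[^pqrs]*p".
patterns <- sprintf("^[^%s]*%s", arms, single)
names(patterns) <- single
print(patterns)

# Hypothetical input, invented for the example.
y <- c("ab-p1", "q2-p", "xr3", "s4q", "tuq")

groups <- lapply(patterns, grep, x = y, value = TRUE)
print(groups)
```

Naming the patterns gives list components p, q, r and s rather than components named after the patterns themselves.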
Re: [R] Regular Expression help

I suspect strapply is only relatively slow on short strings, where it doesn't matter anyway; for long strings performance would likely be dominated by the underlying regexp operations. I know that users are using the package for very long strings, since I once had to lift the 25,000 character limit after complaints about it. The expressiveness and brevity of strapply (it would be the shortest if it were not for the length of the word simplify) offset any disadvantage, in my view.

On Sat, Nov 8, 2008 at 5:02 PM, Wacek Kusnierczyk [EMAIL PROTECTED] wrote:
> for sure, mine was optimized towards the case, not towards
> generalizability. the gsubfn one is a loser, though. but the first
> one *is* easily generalizable, e.g.,
>
>   letters = "pqrs"
>   sapply(sprintf("^[^%s]*%s", letters, unlist(strsplit(letters, split=""))),
>      grep, x=x, value=TRUE)
>
> while an order of magnitude faster than the tapply ones.
>
> vQ
Re: [R] Regular Expression help

Gabor Grothendieck wrote:
> I suspect strapply is only relatively slow on short strings where it
> doesn't matter anyways since for long strings performance would
> likely be dominated by the underlying regexp operations. I know that
> users are using the package for very long strings since I once had to
> lift the 25,000 character limit since I had complaints about that.
> The expressiveness and brevity of strapply (it would be shortest if
> it were not for the length of the word simplify) offset any
> disadvantage in my view.

ok, the attached tests against strings of length 3 where the character that matches is precisely the last one. (gabor3 is dummy, because i had no patience to wait over a minute...) note that the strapply version is still approximately an order of magnitude slower. with the original script and string length (m) set to 1, the strapply version is two orders of magnitude slower.

it might be that the test is poor, though -- design a smart test where strapply wins ;) (related to the original problem, of course.)

vQ

# n strings: m random letters other than p and q, then a final p or q
generate = function(n, m)
   replicate(n,
      paste(
         paste(sample(letters[c(1:15, 18:26)], m, replace=TRUE), collapse=""),
         sample(letters[16:17], 1),
         sep=""))

# one function per candidate solution
tests = list(
   wacek = function(data) {
      p = grep("^[^pq]*p", data)
      list(p=data[p], q=data[-p]) },
   gabor1 = function(data)
      sapply(c(p="^[^pq]*p", q="^[^pq]*q"), grep, x=data, value=TRUE),
   gabor2 = function(data)
      tapply(data, sub("^[^pq]*(.).*", "\\1", data), c),
   gabor3 = function(data) 0,
      # tapply(data, substr(gsub("[^pq]", "", data), 1, 1), c),
   gabor4 = {
      library(gsubfn)
      function(data)
         tapply(data, strapply(data, "^[^pq]*(.)", simplify=c), c) } )

data = generate(10,3)

# time 30 runs of each solution on the same data
for (name in names(tests)) {
   cat(name, ":\n", sep="")
   print(system.time(replicate(30,tests[[name]](data)))) }
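To see what this generator produces, here is a short sketch. The seed is an assumption added only for repeatability; everything else follows the generate function in the script. Indices 16 and 17 of letters are p and q, so the body of each string is drawn from the other 24 letters and the only p or q is the final character.

```r
set.seed(42)  # hypothetical seed, not part of the original script

generate <- function(n, m)
  replicate(n,
    paste(
      paste(sample(letters[c(1:15, 18:26)], m, replace = TRUE), collapse = ""),
      sample(letters[16:17], 1),
      sep = ""))

data <- generate(10, 3)
print(data)

# Every string should be m letters without p or q, then one p or q.
ok <- all(grepl("^[^pq]{3}[pq]$", data))
print(ok)
```

Because the deciding letter is last, [^pq]* has to scan the whole string before the match succeeds, which is presumably why this layout was chosen for the timing test.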
Re: [R] Regular Expression help

I'll see if I can speed it up if I get some time. I personally use it on relatively short strings, where the low absolute time means that the higher relative time your comparisons show is not that important.

On Sat, Nov 8, 2008 at 5:33 PM, Wacek Kusnierczyk [EMAIL PROTECTED] wrote:
> ok, the attached tests against strings of length 3 where the
> character that matches is precisely the last one. (gabor3 is dummy,
> because i had no patience to wait over a minute...) note that the
> strapply version is still approximately an order of magnitude slower.
> with the original script and string length (m) set to 1, the strapply
> version is two orders of magnitude slower.
>
> it might be that the test is poor, though -- design a smart test
> where strapply wins ;) (related to the original problem, of course.)
>
> vQ
[R] Regular Expression help

Hi there,

I have a vector with a set of data. I just want to separate the elements based on the first p or q value mentioned within the data.

[1] chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3
[2] chr1q22-q24
[3] chr1q22-q24
[4] chr1pter-q24
[5] chr1pter-q24
[6] chr1pter-q24

I used the regular expression [+q*] to match the values, but it matches a q found anywhere. I know that is what I have written, but I just want it to match the first p or q value.

My result should be, for q:

[2] chr1q22-q24
[3] chr1q22-q24

and for p:

[1] chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3
[4] chr1pter-q24
[5] chr1pter-q24
[6] chr1pter-q24

--
View this message in context: http://www.nabble.com/Regular-Expression-help-tp20385971p20385971.html
Sent from the R help mailing list archive at Nabble.com.
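A note on why [+q*] behaves this way: square brackets define a character class, so the pattern matches any single +, q or * wherever it occurs. The sketch below assumes the six strings from the post are stored in a character vector x (the names anywhere and q_first are made up here) and contrasts that pattern with one anchored at the start of the string.

```r
# The six strings from the post.
x <- c("chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3",
       "chr1q22-q24", "chr1q22-q24",
       "chr1pter-q24", "chr1pter-q24", "chr1pter-q24")

# Inside [...] the characters +, q and * are literals, so this asks
# "is there a +, q or * anywhere in the string?".
anywhere <- grepl("[+q*]", x)
print(anywhere)

# ^ anchors at the start and [^pq]* skips characters that are neither
# p nor q, so the q that follows must come before any p.
q_first <- grepl("^[^pq]*q", x)
print(x[q_first])
```

Every element contains a q somewhere, which is why the unanchored class cannot tell the two groups apart.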
Re: [R] Regular Expression help

Rajasekaramya wrote:
> Hi there,
>
> I have a vector with a set of data. I just want to separate the
> elements based on the first p or q value mentioned within the data.
>
> [1] chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3
> [2] chr1q22-q24
> [3] chr1q22-q24
> [4] chr1pter-q24
> [5] chr1pter-q24
> [6] chr1pter-q24
>
> I used the regular expression [+q*] to match the values, but it
> matches a q found anywhere. I know that is what I have written, but
> I just want it to match the first p or q value.
>
> My result should be, for q:
>
> [2] chr1q22-q24
> [3] chr1q22-q24
>
> and for p:
>
> [1] chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3
> [4] chr1pter-q24
> [5] chr1pter-q24
> [6] chr1pter-q24

Something like

   sub("[^pq]*([pq]).*", "\\1", x)

should get you the first p or q.

--
   O__       Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~ - ([EMAIL PROTECTED])                             FAX: (+45) 35327907
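Applied to the data from the question, the suggestion can be sketched like this. The final split() step is an addition for illustration and not part of the reply; x, first_arm and groups are names assumed here.

```r
# The six strings from the question.
x <- c("chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3",
       "chr1q22-q24", "chr1q22-q24",
       "chr1pter-q24", "chr1pter-q24", "chr1pter-q24")

# [^pq]* consumes everything before the first p or q, ([pq]) captures
# that letter and .* consumes the rest, so the whole string is replaced
# by the captured letter.
first_arm <- sub("[^pq]*([pq]).*", "\\1", x)
print(first_arm)

# Group the original strings by that letter.
groups <- split(x, first_arm)
print(groups)
```

A string with neither letter would be returned unchanged by sub, so it would form a group of its own.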
Re: [R] Regular Expression help

Peter Dalgaard wrote:
> Something like
>
>   sub("[^pq]*([pq]).*", "\\1", x)
>
> should get you the first p or q

and the following will do the whole job (assuming x is your vector):

   result = lapply(
      list(p='p', q='q'),
      function(letter)
         grep(paste("^[^pq]*[", "]", sep=letter), x, value=TRUE))

   result$p # those with p first
   result$q # those with q first

vQ
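The paste call is the less obvious part: the letter is passed as the separator, so it ends up between the two fixed pieces of the pattern. A sketch on the data from the question (x and pattern_p are names assumed for the illustration):

```r
# The six strings from the question.
x <- c("chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3",
       "chr1q22-q24", "chr1q22-q24",
       "chr1pter-q24", "chr1pter-q24", "chr1pter-q24")

# paste() joins "^[^pq]*[" and "]" with the letter in between.
pattern_p <- paste("^[^pq]*[", "]", sep = "p")
print(pattern_p)

# One grep per letter, returning the matching strings themselves.
result <- lapply(
  list(p = "p", q = "q"),
  function(letter)
    grep(paste("^[^pq]*[", "]", sep = letter), x, value = TRUE))
print(result)
```

An equivalent and perhaps more readable way to build the same kind of pattern would be sprintf("^[^pq]*%s", letter).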
Re: [R] Regular Expression help

Wacek Kusnierczyk wrote:
> and the following will do the whole job (assuming x is your vector):
>
>   result = lapply(
>      list(p='p', q='q'),
>      function(letter)
>         grep(paste("^[^pq]*[", "]", sep=letter), x, value=TRUE))

and this one might be slightly faster, depending on your data:

   result = local({
      p = grep("^[^pq]*p", d)
      list(p=d[p], q=d[-p]) })

vQ
Re: [R] Regular Expression help

Wacek Kusnierczyk wrote:
> the following will do the whole job (assuming x is your vector):
>
>   result = local({
>      p = grep("^[^pq]*p", d)
>      list(p=d[p], q=d[-p]) })

oops, replace 'd' with 'x'

vQ
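With the correction applied, the index-based split runs as below. The second half of the sketch illustrates a general R behaviour that the thread does not discuss: when grep finds nothing, negative indexing with the empty result selects nothing rather than everything. The vector only_q and the logical-index variant are assumptions added for illustration.

```r
# The six strings from the question.
x <- c("chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3",
       "chr1q22-q24", "chr1q22-q24",
       "chr1pter-q24", "chr1pter-q24", "chr1pter-q24")

# Corrected version from the thread: d replaced by x.
result <- local({
  p <- grep("^[^pq]*p", x)
  list(p = x[p], q = x[-p]) })
print(result)

# Hypothetical input with no p-first string.
only_q <- c("chr1q22-q24", "chr2q11")
idx <- grep("^[^pq]*p", only_q)   # integer(0)
dropped <- only_q[-idx]           # empty, not the full vector
print(dropped)

# A logical index avoids the problem.
is_p <- grepl("^[^pq]*p", only_q)
safe <- list(p = only_q[is_p], q = only_q[!is_p])
print(safe)
```

For the data in the question both forms give the same answer; the logical form only matters when one of the groups can be empty.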
Re: [R] Regular Expression help

Here are a few more solutions. x is the input vector of character strings. The first is a slightly shorter version of one of Wacek's. The next three all create an anonymous grouping variable (using sub, substr/gsub and strapply respectively) whose components are p and q, and then tapply is used to separate out the corresponding components of x according to the grouping:

   sapply(c(p = "^[^pq]*p", q = "^[^pq]*q"), grep, x = x, value = TRUE)

   tapply(x, sub("^[^pq]*(.).*", "\\1", x), c)

   tapply(x, substr(gsub("[^pq]", "", x), 1, 1), c)

   library(gsubfn)
   tapply(x, strapply(x, "^[^pq]*(.)", simplify = c), c)

On Fri, Nov 7, 2008 at 1:09 PM, Rajasekaramya [EMAIL PROTECTED] wrote:
> Hi there,
>
> I have a vector with a set of data. I just want to separate the
> elements based on the first p or q value mentioned within the data.
>
> [1] chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3
> [2] chr1q22-q24
> [3] chr1q22-q24
> [4] chr1pter-q24
> [5] chr1pter-q24
> [6] chr1pter-q24
>
> I used the regular expression [+q*] to match the values, but it
> matches a q found anywhere. I know that is what I have written, but
> I just want it to match the first p or q value.
>
> My result should be, for q:
>
> [2] chr1q22-q24
> [3] chr1q22-q24
>
> and for p:
>
> [1] chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3
> [4] chr1pter-q24
> [5] chr1pter-q24
> [6] chr1pter-q24
>
> --
> View this message in context: http://www.nabble.com/Regular-Expression-help-tp20385971p20385971.html
> Sent from the R help mailing list archive at Nabble.com.
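The three base-R variants can be compared on the data from the question as sketched below (the gsubfn variant is left out so the sketch needs no extra package). The last part shows a property of sapply that is not raised in the thread: when both groups happen to have the same size, sapply simplifies the result to a matrix. The names by_grep, by_sub, by_gsub, balanced and m are assumptions for the illustration.

```r
# The six strings from the question.
x <- c("chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3",
       "chr1q22-q24", "chr1q22-q24",
       "chr1pter-q24", "chr1pter-q24", "chr1pter-q24")

# One grep per anchored pattern.
by_grep <- sapply(c(p = "^[^pq]*p", q = "^[^pq]*q"), grep, x = x, value = TRUE)

# Grouping variable = first p or q, captured by sub.
by_sub <- tapply(x, sub("^[^pq]*(.).*", "\\1", x), c)

# Grouping variable = first character left after deleting all non-p/q.
by_gsub <- tapply(x, substr(gsub("[^pq]", "", x), 1, 1), c)

print(by_grep)
same <- identical(by_grep$p, by_sub[["p"]]) &&
        identical(by_sub[["q"]], by_gsub[["q"]])
print(same)

# With two strings in each group, sapply returns a character matrix.
balanced <- x[1:4]
m <- sapply(c(p = "^[^pq]*p", q = "^[^pq]*q"), grep, x = balanced, value = TRUE)
print(is.matrix(m))
```

Using lapply, or sapply with simplify = FALSE, keeps the result a list regardless of group sizes.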