Re: [R] Regular Expression help

2008-11-08 Thread Wacek Kusnierczyk
Gabor Grothendieck wrote:
> Here are a few more solutions.  x is the input vector
> of character strings.
>
> The first is a slightly shorter version of one of Wacek's.
> The next three all create an anonymous grouping variable
> (using sub, substr/gsub and strapply respectively)
> whose components are p and q and then tapply
> is used to separate out the corresponding components
> of x according to the grouping:
>
> sapply(c(p = "^[^pq]*p", q = "^[^pq]*q"), grep, x = x, value = TRUE)
>
> tapply(x, sub("^[^pq]*(.).*", "\\1", x), c)
>
> tapply(x, substr(gsub("[^pq]", "", x), 1, 1), c)
>
> library(gsubfn)
> tapply(x, strapply(x, "^[^pq]*(.)", simplify = c), c)

wow!  cool stuff.  if you're interested in comparing their efficiency,
source the attached script.

vQ
# n random strings, each made of m lowercase letters
generate = function(n, m)
   replicate(n, paste(sample(letters, m, replace=TRUE), collapse=""))

# the candidate solutions, one function per entry
tests = list(

   wacek =
      function(data) {
         p = grep("^[^pq]*p", data)
         list(p=data[p], q=data[-p])
      },

   gabor1 =
      function(data)
         sapply(c(p="^[^pq]*p", q="^[^pq]*q"), grep, x=data, value=TRUE),

   gabor2 =
      function(data)
         tapply(data, sub("^[^pq]*(.).*", "\\1", data), c),

   gabor3 =
      function(data)
         tapply(data, substr(gsub("[^pq]", "", data), 1, 1), c),

   gabor4 =
      { library(gsubfn); function(data)
         tapply(data, strapply(data, "^[^pq]*(.)", simplify=c), c) }
)

# time 30 runs of each solution on the same data
data = generate(1000,10)
lapply(names(tests),
   function(name) {
      cat(name, ":\n", sep="")
      print(system.time(replicate(30,tests[[name]](data)))) } )
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Regular Expression help

2008-11-08 Thread Wacek Kusnierczyk
using lapply with side-effects code should probably be considered bad
practice, so replace lapply with a for loop.  sorry.

vQ
# n random strings, each made of m lowercase letters
generate = function(n, m)
   replicate(n, paste(sample(letters, m, replace=TRUE), collapse=""))

# the candidate solutions, one function per entry
tests = list(

   wacek =
      function(data) {
         p = grep("^[^pq]*p", data)
         list(p=data[p], q=data[-p])
      },

   gabor1 =
      function(data)
         sapply(c(p="^[^pq]*p", q="^[^pq]*q"), grep, x=data, value=TRUE),

   gabor2 =
      function(data)
         tapply(data, sub("^[^pq]*(.).*", "\\1", data), c),

   gabor3 =
      function(data)
         tapply(data, substr(gsub("[^pq]", "", data), 1, 1), c),

   gabor4 =
      { library(gsubfn); function(data)
         tapply(data, strapply(data, "^[^pq]*(.)", simplify=c), c) }
)

# time 30 runs of each solution on the same data
data = generate(1000,10)
for (name in names(tests)) {
   cat(name, ":\n", sep="")
   print(system.time(replicate(30,tests[[name]](data)))) }


Re: [R] Regular Expression help

2008-11-08 Thread Gabor Grothendieck
For the problem at hand I think I would use your solution
which is both easily understood and fastest.  On the
other hand the tapply based solutions are coordinate
free (i.e. no explicit mucking with indices) and readily
generalize to more than 2 groups -- just replace [^pq] with
[^pqr], say.
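
A rough sketch of that generalization follows. It is an illustration only: the vector x3 and its third group 'r' are invented for the example and do not come from the thread. The inputs deliberately omit the chr prefix, because the r in chr would otherwise be the first character matched once r joins the set.

```r
# Hypothetical input: 'r' is an invented third group, used only for illustration.
x3 <- c("10p15", "3q29", "7r2", "4q35-p1")

# Grouping key: the first character of each string that is p, q or r.
key <- sub("^[^pqr]*(.).*", "\\1", x3)

# tapply collects the elements of x3 that share a key.
grouped <- tapply(x3, key, c)
print(grouped)
```

No index bookkeeping is needed here: adding a group only changes the character class in the pattern.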



Re: [R] Regular Expression help

2008-11-08 Thread Wacek Kusnierczyk
Gabor Grothendieck wrote:
> For the problem at hand I think I would use your solution
> which is both easily understood and fastest.  On the
> other hand the tapply based solutions are coordinate
> free (i.e. no explicit mucking with indices) and readily
> generalize to more than 2 groups -- just replace [^pq] with
> [^pqr], say.

   

for sure, mine was optimized towards the case, not towards generalizability.
the gsubfn one is a loser, though.

but the first one *is* easily generalizable, e.g.,

letters = "pqrs"
sapply(sprintf("^[^%s]*%s", letters, unlist(strsplit(letters,
split=""))), grep, x=x, value=TRUE)

while an order of magnitude faster than the tapply ones.
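
For concreteness, here is a small sketch of the same idea run on the sample data from the original question. The variable names arms, chars and patterns are assumptions made for this example (arms is used instead of letters so that base R's built-in letters vector is not masked), and lapply stands in for sapply so that the result is always a list.

```r
# Sample data copied from the original question.
x <- c("chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3",
       "chr1q22-q24", "chr1q22-q24",
       "chr1pter-q24", "chr1pter-q24", "chr1pter-q24")

arms <- "pq"                                   # the set of group letters
chars <- unlist(strsplit(arms, split = ""))    # c("p", "q")

# One pattern per letter: "^[^pq]*p" and "^[^pq]*q".
patterns <- sprintf("^[^%s]*%s", arms, chars)
names(patterns) <- chars

# Each pattern selects the strings whose first group letter is that letter.
result <- lapply(patterns, grep, x = x, value = TRUE)
print(result)
```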

vQ



Re: [R] Regular Expression help

2008-11-08 Thread Gabor Grothendieck
I suspect strapply is only relatively slow on short strings where
it doesn't matter anyways since for long strings performance would
likely be dominated by the underlying regexp operations.  I know that
users are using the package for very long strings since I once had
to lift the 25,000 character limit since I had complaints about that.
The expressiveness and brevity of strapply (it would be shortest if it
were not for the length of the word simplify) offset any disadvantage
in my view.



Re: [R] Regular Expression help

2008-11-08 Thread Wacek Kusnierczyk
Gabor Grothendieck wrote:
> I suspect strapply is only relatively slow on short strings where
> it doesn't matter anyways since for long strings performance would
> likely be dominated by the underlying regexp operations.  I know that
> users are using the package for very long strings since I once had
> to lift the 25,000 character limit since I had complaints about that.
> The expressiveness and brevity of strapply (it would be shortest if it
> were not for the length of the word simplify) offset any disadvantage
> in my view.

ok, the attached tests against strings of length 3 where the
character that matches is precisely the last one.  (gabor3 is dummy,
because i had no patience to wait over a minute...)  note that the
strapply version is still approximately an order of magnitude slower. 

with the original script and string length (m) set to 1, the
strapply version is two orders of magnitude slower.

it might be that the test is poor, though -- design a smart test where
strapply wins ;)
(related to the original problem, of course.)

vQ
# n strings: m random letters other than p and q, followed by a single p or q
generate = function(n, m)
   replicate(n, paste(paste(sample(letters[c(1:15, 18:26)], m,
      replace=TRUE), collapse=""), sample(letters[16:17], 1), sep=""))

# the candidate solutions, one function per entry
tests = list(

   wacek =
      function(data) {
         p = grep("^[^pq]*p", data)
         list(p=data[p], q=data[-p])
      },

   gabor1 =
      function(data)
         sapply(c(p="^[^pq]*p", q="^[^pq]*q"), grep, x=data, value=TRUE),

   gabor2 =
      function(data)
         tapply(data, sub("^[^pq]*(.).*", "\\1", data), c),

   # dummy: the real version is commented out because it took too long
   gabor3 =
      function(data) 0,
      # tapply(data, substr(gsub("[^pq]", "", data), 1, 1), c),

   gabor4 =
      { library(gsubfn); function(data)
         tapply(data, strapply(data, "^[^pq]*(.)", simplify=c), c) }
)

# time 30 runs of each solution on the same data
data = generate(10,3)
for (name in names(tests)) {
   cat(name, ":\n", sep="")
   print(system.time(replicate(30,tests[[name]](data)))) }


Re: [R] Regular Expression help

2008-11-08 Thread Gabor Grothendieck
I'll see if I can speed it up if I get some time.  I personally use it on
relatively short strings where the low absolute time means that
the higher relative time your comparisons show is not that
important.




[R] Regular Expression help

2008-11-07 Thread Rajasekaramya

hi there

I have a vector with a set of data. I just want to separate them based on the
first p or q value mentioned within the data.

[1] chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3
[2] chr1q22-q24 
[3] chr1q22-q24 
[4] chr1pter-q24
[5] chr1pter-q24
[6] chr1pter-q24  

I used the regular expression [+q*] to match the values, but it matches a q
found anywhere. I know that is what I wrote, but I just want it to match
the first p or q value.

My result should be, for q:
[2] chr1q22-q24  
[3] chr1q22-q24  

for p
[1] chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3
[4] chr1pter-q24
[5] chr1pter-q24
[6] chr1pter-q24 



-- 
View this message in context: 
http://www.nabble.com/Regular-Expression-help-tp20385971p20385971.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] Regular Expression help

2008-11-07 Thread Peter Dalgaard
Something like

sub("[^pq]*([pq]).*", "\\1", x)

should get you the first p or q
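
As a quick check, the expression can be tried on the sample data from the question. This is a minimal sketch: the vector below is typed in by hand from the post, and the split() step at the end is an added assumption about how the extracted letter might then be used.

```r
# Sample data copied from the original question.
x <- c("chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3",
       "chr1q22-q24", "chr1q22-q24",
       "chr1pter-q24", "chr1pter-q24", "chr1pter-q24")

# Skip everything that is not p or q, capture the first p or q,
# and drop the rest of the string.
first <- sub("[^pq]*([pq]).*", "\\1", x)
print(first)

# Assumed follow-up step: use the extracted letter as a grouping key.
groups <- split(x, first)
print(groups)
```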


-- 
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - ([EMAIL PROTECTED])  FAX: (+45) 35327907



Re: [R] Regular Expression help

2008-11-07 Thread Wacek Kusnierczyk
and the following will do the whole job (assuming x is your vector):

result = lapply(
   list(p='p', q='q'),
   function(letter)
      grep(paste("^[^pq]*[", "]", sep=letter), x, value=TRUE))

result$p
# those with p first

result$q
# those with q first
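
The paste call with sep=letter may look odd at first, so here is a small sketch that only prints the patterns it builds. The name patterns is introduced for this illustration and is not part of the original code.

```r
# paste() joins "^[^pq]*[" and "]" with the letter as the separator,
# so the letter ends up inside the trailing character class.
patterns <- sapply(c(p = "p", q = "q"),
                   function(letter) paste("^[^pq]*[", "]", sep = letter))
print(patterns)
```

Each pattern reads: from the start of the string, any run of characters other than p and q, followed by the given letter.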

vQ



Re: [R] Regular Expression help

2008-11-07 Thread Wacek Kusnierczyk
and this one might be slightly faster, depending on your data:

result = local({
   p = grep("^[^pq]*p", d)
   list(p=d[p], q=d[-p])
})

vQ



Re: [R] Regular Expression help

2008-11-07 Thread Wacek Kusnierczyk
Wacek Kusnierczyk wrote:

   
 
> result = local({
>    p = grep("^[^pq]*p", d)
>    list(p=d[p], q=d[-p])
> })
   

oops, replace 'd' with 'x'
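
With that substitution applied, the snippet might read as below. This is a sketch run on the sample data from the question; the second variant using grepl is an assumption added here and not something proposed in the thread. It avoids negative indexing, which returns an empty vector when no string has p first.

```r
# Sample data copied from the original question.
x <- c("chr10p15.3 /// chr3q29 /// chr4q35 /// chr9q34.3",
       "chr1q22-q24", "chr1q22-q24",
       "chr1pter-q24", "chr1pter-q24", "chr1pter-q24")

# The snippet with 'd' replaced by 'x'.
result <- local({
   p <- grep("^[^pq]*p", x)      # indices of strings whose first p/q is a p
   list(p = x[p], q = x[-p])     # everything else is treated as q
})

# Assumed variant: logical indexing gives the same split here and still
# works when grep would return no indices at all.
is_p <- grepl("^[^pq]*p", x)
result2 <- list(p = x[is_p], q = x[!is_p])
```

Both versions put strings that contain neither p nor q into the q group, which is harmless for the data in the question.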

vQ



Re: [R] Regular Expression help

2008-11-07 Thread Gabor Grothendieck
Here are a few more solutions.  x is the input vector
of character strings.

The first is a slightly shorter version of one of Wacek's.
The next three all create an anonymous grouping variable
(using sub, substr/gsub and strapply respectively)
whose components are p and q and then tapply
is used to separate out the corresponding components
of x according to the grouping:

sapply(c(p = "^[^pq]*p", q = "^[^pq]*q"), grep, x = x, value = TRUE)

tapply(x, sub("^[^pq]*(.).*", "\\1", x), c)

tapply(x, substr(gsub("[^pq]", "", x), 1, 1), c)

library(gsubfn)
tapply(x, strapply(x, "^[^pq]*(.)", simplify = c), c)
