[R] Speeding up resampling of rows from a large matrix

2007-05-25 Thread Juan Pablo Lewinger
I'm trying to:

Resample with replacement pairs of distinct rows from a 120 x 65,000 
matrix H of 0's and 1's. For each resampled pair sum the resulting 2 
x 65,000 matrix by column:

 0 1 0 1 ...
+
 0 0 1 1 ...
___
=  0 1 1 2 ...

For each column accumulate the number of 0's, 1's and 2's over the 
resamples to obtain a 3 x 65,000 matrix G.

For those interested in the background, H is a matrix of haplotypes, 
each pair of haplotypes forms a genotype, and each column corresponds 
to a SNP. I'm using resampling to compute the null distribution of 
the maximum over correlated SNPs of a simple statistic.


The code:
#---
nSNPs - 1000
H - matrix(sample(0:1, 120*nSNPs , replace=T), nrow=120)
G - matrix(0, nrow=3, ncol=nSNPs)
# Keep in mind that the real H is 120 x 65000

nResamples - 3000
pair - replicate(nResamples, sample(1:120, 2))

gen - function(x){g - sum(x); c(g==0, g==1, g==2)}

for (i in 1:nResamples){
G - G + apply(H[pair[,i],], 2, gen)
}
#---
The problem is that the loop takes about 80 mins to complete and I 
need to repeat the whole thing 10,000 times, which would then take 
over a year and a half!

Is there a way to speed this up so that the full 10,000 iterations 
take a reasonable amount of time (say a week)?

My machine has an Intel Xeon 3.40GHz CPU with 1GB of RAM

  sessionInfo()
R version 2.5.0 (2007-04-23)
i386-pc-mingw32

I would greatly appreciate any help.

Juan Pablo Lewinger
Department of Preventive Medicine
Keck School of Medicine
University of Southern California

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speeding up resampling of rows from a large matrix

2007-05-25 Thread Bill.Venables
Here is a possibility.  The only catch is that if a pair of rows is
selected twice you will get the results in a block, not scattered at
random throughout the columns of G.  I can't see that as a problem.

### --- start code excerpt ---
nSNPs - 1000
H - matrix(sample(0:1, 120*nSNPs , replace=T), nrow=120)

# G - matrix(0, nrow=3, ncol=nSNPs)

# Keep in mind that the real H is 120 x 65000

ij - as.matrix(subset(expand.grid(i = 1:120, j = 1:120), i  j))

nResamples - 3000
sel - sample(1:nrow(ij), nResamples, rep = TRUE)
repf - table(sel)   # replication factors
ij - ij[as.numeric(names(repf)), ]  # distinct choice made

G - matrix(0, nrow = 3, ncol = nrow(ij))  # for now

for(j in 1:ncol(G))
  G[,j] - rowSums(outer(0:2, colSums(H[ij[j, ], ]), ==))

G - G[, rep(1:ncol(G), repf)] # bulk up the result

# _
# _
# _
# _pair - replicate(nResamples, sample(1:120, 2))
# _
# _gen - function(x){g - sum(x); c(g==0, g==1, g==2)}
# _
# _for (i in 1:nResamples){
# _G - G + apply(H[pair[,i],], 2, gen)
# _}
### --- end of code excerpt ---

I did a timing on my machine which is a middle-of-the range windows
monstrosity...

 system.time({
+ 
+ nSNPs - 1000
+ H - matrix(sample(0:1, 120*nSNPs , replace=T), nrow=120)
+ 
+ # G - matrix(0, nrow=3, ncol=nSNPs)
+ 
+ # Keep in mind that the real H is 120 x 65000
+ 
+ ij - as.matrix(subset(expand.grid(i = 1:120, j = 1:120), i  j))
+ 
+ nResamples - 3000
+ sel - sample(1:nrow(ij), nResamples, rep = TRUE)
+ repf - table(sel)   # replication factors
+ ij - ij[as.numeric(names(repf)), ]  # distinct choice made
+ 
+ G - matrix(0, nrow = 3, ncol = nrow(ij))  # for now
+ 
+ for(j in 1:ncol(G))
+   G[,j] - rowSums(outer(0:2, colSums(H[ij[j, ], ]), ==))
+ 
+ G - G[, rep(1:ncol(G), repf)] # bulk up the result
+ 
+ # _
+ # _
+ # _
+ # _pair - replicate(nResamples, sample(1:120, 2))
+ # _
+ # _gen - function(x){g - sum(x); c(g==0, g==1, g==2)}
+ # _
+ # _for (i in 1:nResamples){
+ # _G - G + apply(H[pair[,i],], 2, gen)
+ # _}
+ #
_#--
-
+ # _
+ })
   user  system elapsed 
   0.970.000.99 


Less than a second.  Somewhat of an improvement on the 80 minutes, I
reckon.  This will increase, of course when you step the size of the H
matrix up from 1000 to 65000 columns

Bill Venables
CSIRO Laboratories
PO Box 120, Cleveland, 4163
AUSTRALIA
Office Phone (email preferred): +61 7 3826 7251
Fax (if absolutely necessary):  +61 7 3826 7304
Mobile:(I don't have one!)
Home Phone: +61 7 3286 7700
mailto:[EMAIL PROTECTED]
http://www.cmis.csiro.au/bill.venables/ 

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Juan Pablo
Lewinger
Sent: Friday, 25 May 2007 4:04 PM
To: r-help@stat.math.ethz.ch
Subject: [R] Speeding up resampling of rows from a large matrix

I'm trying to:

Resample with replacement pairs of distinct rows from a 120 x 65,000 
matrix H of 0's and 1's. For each resampled pair sum the resulting 2 
x 65,000 matrix by column:

 0 1 0 1 ...
+
 0 0 1 1 ...
___
=  0 1 1 2 ...

For each column accumulate the number of 0's, 1's and 2's over the 
resamples to obtain a 3 x 65,000 matrix G.

For those interested in the background, H is a matrix of haplotypes, 
each pair of haplotypes forms a genotype, and each column corresponds 
to a SNP. I'm using resampling to compute the null distribution of 
the maximum over correlated SNPs of a simple statistic.


The code:
#---

nSNPs - 1000
H - matrix(sample(0:1, 120*nSNPs , replace=T), nrow=120)
G - matrix(0, nrow=3, ncol=nSNPs)
# Keep in mind that the real H is 120 x 65000

nResamples - 3000
pair - replicate(nResamples, sample(1:120, 2))

gen - function(x){g - sum(x); c(g==0, g==1, g==2)}

for (i in 1:nResamples){
G - G + apply(H[pair[,i],], 2, gen)
}
#---

The problem is that the loop takes about 80 mins to complete and I 
need to repeat the whole thing 10,000 times, which would then take 
over a year and a half!

Is there a way to speed this up so that the full 10,000 iterations 
take a reasonable amount of time (say a week)?

My machine has an Intel Xeon 3.40GHz CPU with 1GB of RAM

  sessionInfo()
R version 2.5.0 (2007-04-23)
i386-pc-mingw32

I would greatly appreciate any help.

Juan Pablo Lewinger
Department of Preventive Medicine
Keck School of Medicine
University of Southern California

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman

Re: [R] Speeding up resampling of rows from a large matrix

2007-05-25 Thread Juan Pablo Lewinger
That's beautiful. For the full 120 x 65,000 matrix your approach took 
85 seconds. A truly remarkable improvement over my 80 minutes!

Thank you!

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.