Changbin -
   Does

    100 * sapply(matt$reads,function(x)sum(matt$reads >= x))/length(matt$reads)

give what you want?

    By the way, if you want to use a loop (there's nothing wrong with that),
then try to avoid the most common mistake that people make with loops in R:
having your result grow inside the loop.  Here's a better way to use a loop
to solve your problem:

cover_per_1 <- function(data){
   l = length(data)
   output = numeric(l)
   for(i in 1:l)output[i] = 100 * sum(ifelse(data >= data[i], 1, 
0))/length(data)
   output
}

Using some random data, and comparing to your original cover_per function:

dat = rnorm(1000)
system.time(one <- cover_per(dat))
   user  system elapsed
0.816 0.000 0.824
system.time(two <- cover_per_1(dat))
   user  system elapsed
  0.792   0.000   0.805

Not that big a speedup, but it does increase quite a bit as the problem gets
larger.

There are two obvious ways to speed up your function:
   1)  Eliminate the ifelse function, since automatic coersion from
       logical to numeric does the same thing.
   2)  Multiply by 100 and divide by the length outside the loop:

cover_per_2 <- function(data){
   l = length(data)
   output = numeric(l)
   for(i in 1:l)output[i] = sum(data >= data[i])
   100 * output / l
}

system.time(three <- cover_per_2(dat))
   user  system elapsed
  0.024   0.000   0.027

That makes the loop just about equivalent to the sapply solution:

system.time(four <- 100*sapply(dat,function(x)sum(dat >= x))/length(dat))
   user  system elapsed
  0.024   0.000   0.026

                                        - Phil Spector
                                         Statistical Computing Facility
                                         Department of Statistics
                                         UC Berkeley
                                         spec...@stat.berkeley.edu








On Thu, 4 Nov 2010, Changbin Du wrote:

HI, Dear R community,

I have one data set like this,  What I want to do is to calculate the
cumulative coverage. The following codes works for small data set (#rows =
100), but when feed the whole data set,  it still running after 24 hours.
Can someone give some suggestions for long vector?

id                reads
Contig79:1    4
Contig79:2    8
Contig79:3    13
Contig79:4    14
Contig79:5    17
Contig79:6    20
Contig79:7    25
Contig79:8    27
Contig79:9    32
Contig79:10    33
Contig79:11    34

matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
sep="\t", skip=0, header=F,fill=T) #
dim(matt)
[1] 3384766       2

matt_plot<-function(matt, outputfile) {
names(matt)<-c("id","reads")

cover<-matt$reads


#calculate the cumulative coverage.
+ cover_per<-function (data) {
+ output<-numeric(0)
+ for (i in data) {
+           x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
+           output<-c(output, x)
+                 }
+ return(output)
+ }


result<-cover_per(cover)


Thanks so much!


--
Sincerely,
Changbin
--

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to