Re: [R] how to work with long vectors
There was a numerical typo below: I said the sample sizes were 5 and 10 thousand, but I should have said 10 and 20 thousand (the size argument to sample()). Also, I timed cover_per_2 and _3 for size 200,000 and got times of 338 and 0.12 seconds, respectively. Growing the problem by a factor of 10 made cover_per_2 use about 100 times more time and cover_per_3 about 10 times more (the times are too small to get an accurate ratio).

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of William Dunlap
> Sent: Friday, November 05, 2010 9:58 AM
> To: Changbin Du
> Cc: r-help@r-project.org
> Subject: Re: [R] how to work with long vectors
>
> The following cover_per_3 uses sorting to solve
> the problem more quickly. It still has room
> for improvement.
>
> cover_per_3 <- function (data)
> {
>     n <- length(data)
>     o <- rev(order(data))
>     sdata <- data[o]
>     r <- rle(sdata)$lengths
>     output <- numeric(n)
>     output[o] <- rep(cumsum(r), r)
>     100 * output/n
> }
>
> (The ecdf function would probably also do the
> job quickly.)
>
> When trying to work on problems like this I find
> it most fruitful to work on smaller datasets and
> see how the time grows with the size of the data,
> instead of seeing how many days it takes on a huge
> dataset. E.g., the following compares times for
> your original function, Phil Spector's simple cleanup
> of your function, and the sort-based approach for
> vectors of length 5 and 10 thousand. 
>
> > z<-sample(5e3, size=1e4, replace=TRUE) ; print(system.time(v <-
> cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
> print(system.time(v_3 <- cover_per_3(z)))
>    user  system elapsed
>   38.21    0.00   38.41
>    user  system elapsed
>    0.86    0.00    0.86
>    user  system elapsed
>       0       0       0
> > identical(v_3,v)
> [1] TRUE
> > z<-sample(1e4, size=2e4, replace=TRUE) ; print(system.time(v <-
> cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
> print(system.time(v_3 <- cover_per_3(z)))
>    user  system elapsed
>  158.48    0.07  159.31
>    user  system elapsed
>    3.23    0.00    3.25
>    user  system elapsed
>    0.02    0.00    0.02
> > identical(v_3,v)
> [1] TRUE
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
> > -----Original Message-----
> > From: r-help-boun...@r-project.org
> > [mailto:r-help-boun...@r-project.org] On Behalf Of Changbin Du
> > Sent: Friday, November 05, 2010 9:14 AM
> > To: Phil Spector
> > Cc:
> > Subject: Re: [R] how to work with long vectors
> >
> > HI, Phil,
> >
> > I used the following codes and ran it overnight for 15 hours;
> > this morning, I stopped it. It seems it is still not efficient.
> >
> > > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> > sep="\t", skip=0, header=F,fill=T) #
> > > names(matt)<-c("id","reads")
> > > dim(matt)
> > [1] 3384766       2
> > > cover<-matt$reads
> > > cover_per_2 <- function(data){
> > +   l = length(data)
> > +   output = numeric(l)
> > +   for(i in 1:l) output[i] = sum(data >= data[i])
> > +   100 * output / l
> > + }
> > > result3<-cover_per_2(cover)
> >
> > On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du
> > wrote:
> >
> > > Thanks Phil, that is great! I WILL try this and let you
> > know how it goes. 
> >
> > On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector
> > wrote:
> >
> > >> Changbin -
> > >>    Does
> > >>
> > >>    100 * sapply(matt$reads, function(x) sum(matt$reads >=
> > >> x))/length(matt$reads)
> > >>
> > >> give what you want?
> > >>
> > >>    By the way, if you want to use a loop (there's nothing wrong with that),
> > >> then try to avoid the most common mistake that people make with loops in R:
> > >> having your result grow inside the loop.
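Bill's cover_per_3 gets its speed from a single descending sort: after sorting, the number of elements >= a given value is just the position of the end of that value's tie group, so every element can be assigned its count in one pass. As a cross-check of the idea outside R, here is a Python sketch (not from the thread; the name cover_per_3_py is mine), compared against the brute-force O(n^2) definition:

```python
def cover_per_3_py(data):
    """For each value, the percent of entries >= that value.
    Sort-based O(n log n) sketch of the approach in Bill's cover_per_3."""
    n = len(data)
    # Indices that sort the data in descending order; ties stay adjacent.
    order = sorted(range(n), key=lambda i: data[i], reverse=True)
    out = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current tie group.
        j = i
        while j < n and data[order[j]] == data[order[i]]:
            j += 1
        # Every element in sorted positions 0..j-1 is >= this value,
        # so each member of the tie group gets the same count j.
        for k in range(i, j):
            out[order[k]] = 100.0 * j / n
        i = j
    return out
```

The brute-force definition, `100 * sum(y >= x for y in data) / n` for each x, gives the same answer but compares every pair of elements, which is why the original function scaled quadratically.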
Re: [R] how to work with long vectors
Thanks, William. It gave me a lesson. On Fri, Nov 5, 2010 at 9:58 AM, William Dunlap wrote: > The following cover_per_3 uses sorting to solve > the problem more quickly. It still has room > for improvement. > > cover_per_3 <- function (data) > { >n <- length(data) >o <- rev(order(data)) >sdata <- data[o] >r <- rle(sdata)$lengths >output <- numeric(n) >output[o] <- rep(cumsum(r), r) >100 * output/n > } > > (The ecdf function would probabably also do the > job quickly.) > > When trying to work on problems like this I find > it most fruitful to work on smaller datasets and > see how the time grows with the size of the data, > instead of seeing how many days a it takes on a huge > dataset. E.g., the following compares times for > your original function, Phil Spector's simple cleanup > of your function, and the sort based approach for > vectors of length 5 and 10 thousand. > > > z<-sample(5e3, size=1e4, replace=TRUE) ; print(system.time(v <- > cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ; > print(system.time(v_3 <- cover_per_3(z))) > user system elapsed > 38.210.00 38.41 > user system elapsed > 0.860.000.86 > user system elapsed > 0 0 0 > > identical(v_3,v) > [1] TRUE > > z<-sample(1e4, size=2e4, replace=TRUE) ; print(system.time(v <- > cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ; > print(system.time(v_3 <- cover_per_3(z))) > user system elapsed > 158.480.07 159.31 > user system elapsed > 3.230.003.25 > user system elapsed > 0.020.000.02 > > identical(v_3,v) > [1] TRUE > > Bill Dunlap > Spotfire, TIBCO Software > wdunlap tibco.com > > > -----Original Message- > > From: r-help-boun...@r-project.org > > [mailto:r-help-boun...@r-project.org] On Behalf Of Changbin Du > > Sent: Friday, November 05, 2010 9:14 AM > > To: Phil Spector > > Cc: > > Subject: Re: [R] how to work with long vectors > > > > HI, Phil, > > > > I used the following codes and run it overnight for 15 hours, > > this morning, > > I stopped it. 
It seems it is still not efficient. > > > > > > > > > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILL > UMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GW> > ZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth", > > sep="\t", skip=0, header=F,fill=T) # > > > names(matt)<-c("id","reads") > > > > > dim(matt) > > [1] 3384766 2 > > > > > cover<-matt$reads > > > > > cover_per_2 <- function(data){ > > + l = length(data) > > + output = numeric(l) > > + for(i in 1:l)output[i] = sum(data >= data[i]) > > + 100 * output / l > > + } > > > > > result3<-cover_per_2(cover) > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du > > wrote: > > > > > Thanks Phil, that is great! I WILL try this and let you > > know how it goes. > > > > > > > > > > > > > > > On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector > > wrote: > > > > > >> Changbin - > > >> Does > > >> > > >>100 * sapply(matt$reads,function(x)sum(matt$reads >= > > >> x))/length(matt$reads) > > >> > > >> give what you want? > > >> > > >>By the way, if you want to use a loop (there's nothing > > wrong with > > >> that), > > >> then try to avoid the most common mistake that people make > > with loops in > > >> R: > > >> having your result grow inside the loop. Here's a better > > way to use a > > >> loop > > >> to solve your problem: > > >> > > >> cover_per_1 <- function(data){ > > >> l = length(data) > > >> output = numeric(l) > > >> for(i in 1:l)output[i] = 100 * sum(ifelse(data >= data[i], 1, > > >> 0))/length(data) > > >> output > > >> } > > >> > > >> Using some random data, and comparing to your original > > cover_per function: > > >> > > >> dat = rnorm(1000) > > >>> system.time(one <- cover_per(dat)) > > >>> > > >> user system elapsed &g
Re: [R] how to work with long vectors
The following cover_per_3 uses sorting to solve the problem more quickly. It still has room for improvement.

cover_per_3 <- function (data)
{
    n <- length(data)
    o <- rev(order(data))
    sdata <- data[o]
    r <- rle(sdata)$lengths
    output <- numeric(n)
    output[o] <- rep(cumsum(r), r)
    100 * output/n
}

(The ecdf function would probably also do the job quickly.)

When trying to work on problems like this I find it most fruitful to work on smaller datasets and see how the time grows with the size of the data, instead of seeing how many days it takes on a huge dataset. E.g., the following compares times for your original function, Phil Spector's simple cleanup of your function, and the sort-based approach for vectors of length 5 and 10 thousand.

> z<-sample(5e3, size=1e4, replace=TRUE) ; print(system.time(v <- cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ; print(system.time(v_3 <- cover_per_3(z)))
   user  system elapsed
  38.21    0.00   38.41
   user  system elapsed
   0.86    0.00    0.86
   user  system elapsed
      0       0       0
> identical(v_3,v)
[1] TRUE
> z<-sample(1e4, size=2e4, replace=TRUE) ; print(system.time(v <- cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ; print(system.time(v_3 <- cover_per_3(z)))
   user  system elapsed
 158.48    0.07  159.31
   user  system elapsed
   3.23    0.00    3.25
   user  system elapsed
   0.02    0.00    0.02
> identical(v_3,v)
[1] TRUE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Changbin Du
> Sent: Friday, November 05, 2010 9:14 AM
> To: Phil Spector
> Cc:
> Subject: Re: [R] how to work with long vectors
>
> HI, Phil,
>
> I used the following codes and ran it overnight for 15 hours;
> this morning, I stopped it. It seems it is still not efficient. 
>
> > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> sep="\t", skip=0, header=F,fill=T) #
> > names(matt)<-c("id","reads")
> > dim(matt)
> [1] 3384766       2
> > cover<-matt$reads
> > cover_per_2 <- function(data){
> +   l = length(data)
> +   output = numeric(l)
> +   for(i in 1:l) output[i] = sum(data >= data[i])
> +   100 * output / l
> + }
> > result3<-cover_per_2(cover)
>
> On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du
> wrote:
>
> > Thanks Phil, that is great! I WILL try this and let you
> know how it goes.
>
> On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector
> wrote:
>
> >> Changbin -
> >>    Does
> >>
> >>    100 * sapply(matt$reads, function(x) sum(matt$reads >=
> >> x))/length(matt$reads)
> >>
> >> give what you want?
> >>
> >>    By the way, if you want to use a loop (there's nothing wrong with that),
> >> then try to avoid the most common mistake that people make with loops in R:
> >> having your result grow inside the loop. Here's a better way to use a loop
> >> to solve your problem:
> >>
> >> cover_per_1 <- function(data){
> >>   l = length(data)
> >>   output = numeric(l)
> >>   for(i in 1:l) output[i] = 100 * sum(ifelse(data >= data[i], 1, 0))/length(data)
> >>   output
> >> }
> >>
> >> Using some random data, and comparing to your original cover_per function:
> >>
> >> dat = rnorm(1000)
> >>> system.time(one <- cover_per(dat))
> >>    user  system elapsed
> >>   0.816   0.000   0.824
> >>> system.time(two <- cover_per_1(dat))
> >>    user  system elapsed
> >>   0.792   0.000   0.805
> >>
> >> Not that big a speedup, but it does increase quite a bit as the problem gets
> >> larger.
> >>
> >> There are two obvious ways to speed up your function:
> >> 1) Eliminate the ifelse function, since automatic coercion from
> >> logical to numeric does the same thing. 
> >> 2) Multiply by 100 and divide by the length outside the loop:
> >>
> >> cover_per_2 <- function(data){
> >>   l = length(data)
> >>   output = numeric(l)
> >>   for(i in 1:l) output[i] = sum(data >= data[i])
> >>   100 * output / l
> >> }
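Bill's aside that ecdf would also do the job can be made concrete: the percentage of entries >= x equals 100 * (n - #{y < x}) / n, i.e. 100 * (1 - F(x-)) for the empirical CDF F, and a binary search on the sorted data computes that count directly. A Python sketch of this route (not from the thread; the name cover_per_ecdf is mine):

```python
from bisect import bisect_left

def cover_per_ecdf(data):
    """Percent of entries >= each value via the empirical CDF:
    #{y >= x} = n - #{y < x}, and bisect_left on sorted data counts #{y < x}."""
    s = sorted(data)          # one O(n log n) sort
    n = len(s)
    # Each query is then an O(log n) binary search.
    return [100.0 * (n - bisect_left(s, x)) / n for x in data]
```

This is the same O(n log n) complexity as the rle/cumsum trick; in R, `ecdf(data)` returns a step function built on the same sorted data, which is presumably why Bill expects it to be fast too.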
Re: [R] how to work with long vectors
Thanks Martin! I will try it and will let your guys know how it goes. On Fri, Nov 5, 2010 at 9:42 AM, Martin Morgan wrote: > On 11/05/2010 09:13 AM, Changbin Du wrote: > > HI, Phil, > > > > I used the following codes and run it overnight for 15 hours, this > morning, > > I stopped it. It seems it is still not efficient. > > > > > >> > > > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth", > > sep="\t", skip=0, header=F,fill=T) # > >> names(matt)<-c("id","reads") > > > >> dim(matt) > > [1] 3384766 2 > > [snip] > > >>> On Thu, 4 Nov 2010, Changbin Du wrote: > >>> > >>> HI, Dear R community, > > I have one data set like this, What I want to do is to calculate the > cumulative coverage. The following codes works for small data set > (#rows > = > 100), but when feed the whole data set, it still running after 24 > hours. > Can someone give some suggestions for long vector? > > idreads > Contig79:14 > Contig79:28 > Contig79:313 > Contig79:414 > Contig79:517 > Contig79:620 > Contig79:725 > Contig79:827 > Contig79:932 > Contig79:1033 > Contig79:1134 > > > > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth", > sep="\t", skip=0, header=F,fill=T) # > dim(matt) > [1] 3384766 2 > > matt_plot<-function(matt, outputfile) { > names(matt)<-c("id","reads") > > cover<-matt$reads > > > #calculate the cumulative coverage. > + cover_per<-function (data) { > + output<-numeric(0) > + for (i in data) { > + x<-(100*sum(ifelse(data >= i, 1, 0))/length(data)) > + output<-c(output, x) > + } > + return(output) > + } > > > result<-cover_per(cover) > > Hi Changbin > > If I understand correctly, your contigs 'start' at position 1, and have > 'width' equal to matt$reads. 
You'd like to know the coverage at the last > covered location of each contig in matt$reads. > > ## first time only > source("http://bioconductor.org";) > biocLite("IRanges") > > ## > library(IRanges) > contigs = IRanges(start=1, width=matt$reads) > cvg = coverage(contigs) ## an RLE summarizing coverage, from position 1 > as.vector(cvg[matt$reads]) / nrow(matt) ## at the end of each contig > > for a larger data set: > > > matt=data.frame(reads=ceiling(as.integer(runif(3384766, 1, 100 > > contigs = IRanges(start=1, width=matt$reads) > > system.time(cvg <- coverage(contigs)) > user system elapsed > 5.145 0.050 5.202 > > Martin > > > > Thanks so much! > > > -- > Sincerely, > Changbin > -- > > [[alternative HTML version deleted]] > > __ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > > > >> > >> > >> -- > >> Sincerely, > >> Changbin > >> -- > >> > >> Changbin Du > >> DOE Joint Genome Institute > >> Bldg 400 Rm 457 > >> 2800 Mitchell Dr > >> Walnut Creet, CA 94598 > >> Phone: 925-927-2856 > >> > >> > >> > > > > > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 > -- Sincerely, Changbin -- Changbin Du DOE Joint Genome Institute Bldg 400 Rm 457 2800 Mitchell Dr Walnut Creet, CA 94598 Phone: 925-927-2856 [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] how to work with long vectors
On 11/05/2010 09:42 AM, Martin Morgan wrote:
> ## first time only
> source("http://bioconductor.org")

oops,

source("http://bioconductor.org/biocLite.R")

> biocLite("IRanges")
>
> ##
> library(IRanges)
> contigs = IRanges(start=1, width=matt$reads)
> cvg = coverage(contigs) ## an RLE summarizing coverage, from position 1
> as.vector(cvg[matt$reads]) / nrow(matt) ## at the end of each contig
>
> for a larger data set:
>
>> matt=data.frame(reads=ceiling(as.integer(runif(3384766, 1, 100
>> contigs = IRanges(start=1, width=matt$reads)
>> system.time(cvg <- coverage(contigs))
>    user  system elapsed
>   5.145   0.050   5.202
>
> Martin
>
> > Thanks so much!
>
> --
> Sincerely,
> Changbin
> --
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>>>
>>> --
>>> Sincerely,
>>> Changbin
>>> --
>>>
>>> Changbin Du
>>> DOE Joint Genome Institute
>>> Bldg 400 Rm 457
>>> 2800 Mitchell Dr
>>> Walnut Creet, CA 94598
>>> Phone: 925-927-2856
>>>
>>
>

--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] how to work with long vectors
On 11/05/2010 09:13 AM, Changbin Du wrote:
> HI, Phil,
>
> I used the following codes and run it overnight for 15 hours, this morning,
> I stopped it. It seems it is still not efficient.
>
> > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> sep="\t", skip=0, header=F,fill=T) #
> > names(matt)<-c("id","reads")
> > dim(matt)
> [1] 3384766       2

[snip]

>>> On Thu, 4 Nov 2010, Changbin Du wrote:
>>>
>>> HI, Dear R community,
>>>
>>> I have one data set like this, What I want to do is to calculate the
>>> cumulative coverage. The following codes works for small data set (#rows =
>>> 100), but when feed the whole data set, it still running after 24 hours.
>>> Can someone give some suggestions for long vector?
>>>
>>>          id reads
>>> Contig79:1      4
>>> Contig79:2      8
>>> Contig79:3     13
>>> Contig79:4     14
>>> Contig79:5     17
>>> Contig79:6     20
>>> Contig79:7     25
>>> Contig79:8     27
>>> Contig79:9     32
>>> Contig79:10    33
>>> Contig79:11    34
>>>
>>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
>>> sep="\t", skip=0, header=F,fill=T) #
>>> dim(matt)
>>> [1] 3384766       2
>>>
>>> matt_plot<-function(matt, outputfile) {
>>> names(matt)<-c("id","reads")
>>>
>>> cover<-matt$reads
>>>
>>> #calculate the cumulative coverage.
>>> + cover_per<-function (data) {
>>> +   output<-numeric(0)
>>> +   for (i in data) {
>>> +     x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
>>> +     output<-c(output, x)
>>> +   }
>>> +   return(output)
>>> + }
>>>
>>> result<-cover_per(cover)

Hi Changbin

If I understand correctly, your contigs 'start' at position 1, and have
'width' equal to matt$reads. You'd like to know the coverage at the last
covered location of each contig in matt$reads. 
## first time only
source("http://bioconductor.org")
biocLite("IRanges")

##
library(IRanges)
contigs = IRanges(start=1, width=matt$reads)
cvg = coverage(contigs) ## an RLE summarizing coverage, from position 1
as.vector(cvg[matt$reads]) / nrow(matt) ## at the end of each contig

for a larger data set:

> matt=data.frame(reads=ceiling(as.integer(runif(3384766, 1, 100
> contigs = IRanges(start=1, width=matt$reads)
> system.time(cvg <- coverage(contigs))
   user  system elapsed
  5.145   0.050   5.202

Martin

> Thanks so much!
>
> --
> Sincerely,
> Changbin
> --
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

>>
>> --
>> Sincerely,
>> Changbin
>> --
>>
>> Changbin Du
>> DOE Joint Genome Institute
>> Bldg 400 Rm 457
>> 2800 Mitchell Dr
>> Walnut Creet, CA 94598
>> Phone: 925-927-2856
>>

--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
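Martin's IRanges approach works because every contig covers positions 1..reads[i], so the depth at position p is simply the number of contigs with reads >= p, and indexing the coverage vector at each contig's own endpoint gives exactly the count Changbin wants. A Python difference-array sketch of what coverage() computes (illustrative only, and names are mine; IRanges actually returns a compressed Rle rather than a dense vector):

```python
def coverage_vector(reads):
    """Depth at every position 1..max(reads), where contig i covers 1..reads[i].
    Dense-vector sketch of the quantity IRanges' coverage() summarizes as an Rle."""
    m = max(reads)
    diff = [0] * (m + 2)      # difference array over positions 0..m+1
    for w in reads:
        diff[1] += 1          # every contig starts covering at position 1
        diff[w + 1] -= 1      # and stops covering after position w
    depth, cvg = 0, []
    for p in range(1, m + 1):
        depth += diff[p]      # running sum recovers the depth at position p
        cvg.append(depth)
    return cvg
```

Indexing at each contig's end, `cvg[r - 1]` for each r in reads, mirrors Martin's `cvg[matt$reads]`; dividing by the number of contigs gives the cumulative-coverage fraction.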
Re: [R] how to work with long vectors
HI, Phil, I used the following codes and run it overnight for 15 hours, this morning, I stopped it. It seems it is still not efficient. > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth", sep="\t", skip=0, header=F,fill=T) # > names(matt)<-c("id","reads") > dim(matt) [1] 3384766 2 > cover<-matt$reads > cover_per_2 <- function(data){ + l = length(data) + output = numeric(l) + for(i in 1:l)output[i] = sum(data >= data[i]) + 100 * output / l + } > result3<-cover_per_2(cover) On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du wrote: > Thanks Phil, that is great! I WILL try this and let you know how it goes. > > > > > On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector > wrote: > >> Changbin - >> Does >> >>100 * sapply(matt$reads,function(x)sum(matt$reads >= >> x))/length(matt$reads) >> >> give what you want? >> >>By the way, if you want to use a loop (there's nothing wrong with >> that), >> then try to avoid the most common mistake that people make with loops in >> R: >> having your result grow inside the loop. Here's a better way to use a >> loop >> to solve your problem: >> >> cover_per_1 <- function(data){ >> l = length(data) >> output = numeric(l) >> for(i in 1:l)output[i] = 100 * sum(ifelse(data >= data[i], 1, >> 0))/length(data) >> output >> } >> >> Using some random data, and comparing to your original cover_per function: >> >> dat = rnorm(1000) >>> system.time(one <- cover_per(dat)) >>> >> user system elapsed >> 0.816 0.000 0.824 >> >>> system.time(two <- cover_per_1(dat)) >>> >> user system elapsed >> 0.792 0.000 0.805 >> >> Not that big a speedup, but it does increase quite a bit as the problem >> gets >> larger. >> >> There are two obvious ways to speed up your function: >> 1) Eliminate the ifelse function, since automatic coersion from >> logical to numeric does the same thing. 
>> 2) Multiply by 100 and divide by the length outside the loop: >> >> cover_per_2 <- function(data){ >> l = length(data) >> output = numeric(l) >> for(i in 1:l)output[i] = sum(data >= data[i]) >> 100 * output / l >> } >> >> system.time(three <- cover_per_2(dat)) >>> >> user system elapsed >> 0.024 0.000 0.027 >> >> That makes the loop just about equivalent to the sapply solution: >> >> system.time(four <- 100*sapply(dat,function(x)sum(dat >= x))/length(dat)) >>> >> user system elapsed >> 0.024 0.000 0.026 >> >>- Phil Spector >> Statistical Computing Facility >> Department of Statistics >> UC Berkeley >> spec...@stat.berkeley.edu >> >> >> >> >> >> >> >> >> >> On Thu, 4 Nov 2010, Changbin Du wrote: >> >> HI, Dear R community, >>> >>> I have one data set like this, What I want to do is to calculate the >>> cumulative coverage. The following codes works for small data set (#rows >>> = >>> 100), but when feed the whole data set, it still running after 24 hours. >>> Can someone give some suggestions for long vector? >>> >>> idreads >>> Contig79:14 >>> Contig79:28 >>> Contig79:313 >>> Contig79:414 >>> Contig79:517 >>> Contig79:620 >>> Contig79:725 >>> Contig79:827 >>> Contig79:932 >>> Contig79:1033 >>> Contig79:1134 >>> >>> >>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth", >>> sep="\t", skip=0, header=F,fill=T) # >>> dim(matt) >>> [1] 3384766 2 >>> >>> matt_plot<-function(matt, outputfile) { >>> names(matt)<-c("id","reads") >>> >>> cover<-matt$reads >>> >>> >>> #calculate the cumulative coverage. >>> + cover_per<-function (data) { >>> + output<-numeric(0) >>> + for (i in data) { >>> + x<-(100*sum(ifelse(data >= i, 1, 0))/length(data)) >>> + output<-c(output, x) >>> + } >>> + return(output) >>> + } >>> >>> >>> result<-cover_per(cover) >>> >>> >>> Thanks so much! 
>>> >>> >>> -- >>> Sincerely, >>> Changbin >>> -- >>> >>>[[alternative HTML version deleted]] >>> >>> __ >>> R-help@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide >>> http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >>> >>> > > > -- > Sincerely, > Changbin > -- > > Changbin Du > DOE Joint Genome Institute > Bldg 400 Rm 457 > 2800 Mitchell Dr > Walnut Creet, CA 94598 > Phone: 925-927-2856 > > > -- Sincerely, Changbin -- Changbin Du DOE Joint Genome Institute Bldg 400 Rm 457 2800 Mitchell Dr Walnut Creet, CA 9459
Re: [R] how to work with long vectors
Thanks Phil, that is great! I WILL try this and let you know how it goes. On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector wrote: > Changbin - > Does > >100 * sapply(matt$reads,function(x)sum(matt$reads >= > x))/length(matt$reads) > > give what you want? > >By the way, if you want to use a loop (there's nothing wrong with that), > then try to avoid the most common mistake that people make with loops in R: > having your result grow inside the loop. Here's a better way to use a loop > to solve your problem: > > cover_per_1 <- function(data){ > l = length(data) > output = numeric(l) > for(i in 1:l)output[i] = 100 * sum(ifelse(data >= data[i], 1, > 0))/length(data) > output > } > > Using some random data, and comparing to your original cover_per function: > > dat = rnorm(1000) >> system.time(one <- cover_per(dat)) >> > user system elapsed > 0.816 0.000 0.824 > >> system.time(two <- cover_per_1(dat)) >> > user system elapsed > 0.792 0.000 0.805 > > Not that big a speedup, but it does increase quite a bit as the problem > gets > larger. > > There are two obvious ways to speed up your function: > 1) Eliminate the ifelse function, since automatic coersion from > logical to numeric does the same thing. 
> 2) Multiply by 100 and divide by the length outside the loop: > > cover_per_2 <- function(data){ > l = length(data) > output = numeric(l) > for(i in 1:l)output[i] = sum(data >= data[i]) > 100 * output / l > } > > system.time(three <- cover_per_2(dat)) >> > user system elapsed > 0.024 0.000 0.027 > > That makes the loop just about equivalent to the sapply solution: > > system.time(four <- 100*sapply(dat,function(x)sum(dat >= x))/length(dat)) >> > user system elapsed > 0.024 0.000 0.026 > >- Phil Spector > Statistical Computing Facility > Department of Statistics > UC Berkeley > spec...@stat.berkeley.edu > > > > > > > > > > On Thu, 4 Nov 2010, Changbin Du wrote: > > HI, Dear R community, >> >> I have one data set like this, What I want to do is to calculate the >> cumulative coverage. The following codes works for small data set (#rows = >> 100), but when feed the whole data set, it still running after 24 hours. >> Can someone give some suggestions for long vector? >> >> idreads >> Contig79:14 >> Contig79:28 >> Contig79:313 >> Contig79:414 >> Contig79:517 >> Contig79:620 >> Contig79:725 >> Contig79:827 >> Contig79:932 >> Contig79:1033 >> Contig79:1134 >> >> >> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth", >> sep="\t", skip=0, header=F,fill=T) # >> dim(matt) >> [1] 3384766 2 >> >> matt_plot<-function(matt, outputfile) { >> names(matt)<-c("id","reads") >> >> cover<-matt$reads >> >> >> #calculate the cumulative coverage. >> + cover_per<-function (data) { >> + output<-numeric(0) >> + for (i in data) { >> + x<-(100*sum(ifelse(data >= i, 1, 0))/length(data)) >> + output<-c(output, x) >> + } >> + return(output) >> + } >> >> >> result<-cover_per(cover) >> >> >> Thanks so much! 
>> >> >> -- >> Sincerely, >> Changbin >> -- >> >>[[alternative HTML version deleted]] >> >> __ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> >> -- Sincerely, Changbin -- Changbin Du DOE Joint Genome Institute Bldg 400 Rm 457 2800 Mitchell Dr Walnut Creet, CA 94598 Phone: 925-927-2856 [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] how to work with long vectors
Changbin -
   Does

   100 * sapply(matt$reads, function(x) sum(matt$reads >= x))/length(matt$reads)

give what you want?

   By the way, if you want to use a loop (there's nothing wrong with that), then try to avoid the most common mistake that people make with loops in R: having your result grow inside the loop. Here's a better way to use a loop to solve your problem:

cover_per_1 <- function(data){
  l = length(data)
  output = numeric(l)
  for(i in 1:l) output[i] = 100 * sum(ifelse(data >= data[i], 1, 0))/length(data)
  output
}

Using some random data, and comparing to your original cover_per function:

dat = rnorm(1000)
> system.time(one <- cover_per(dat))
   user  system elapsed
  0.816   0.000   0.824
> system.time(two <- cover_per_1(dat))
   user  system elapsed
  0.792   0.000   0.805

Not that big a speedup, but it does increase quite a bit as the problem gets larger.

There are two obvious ways to speed up your function:
1) Eliminate the ifelse function, since automatic coercion from logical to numeric does the same thing.
2) Multiply by 100 and divide by the length outside the loop:

cover_per_2 <- function(data){
  l = length(data)
  output = numeric(l)
  for(i in 1:l) output[i] = sum(data >= data[i])
  100 * output / l
}

> system.time(three <- cover_per_2(dat))
   user  system elapsed
  0.024   0.000   0.027

That makes the loop just about equivalent to the sapply solution:

> system.time(four <- 100*sapply(dat,function(x)sum(dat >= x))/length(dat))
   user  system elapsed
  0.024   0.000   0.026

   - Phil Spector
     Statistical Computing Facility
     Department of Statistics
     UC Berkeley
     spec...@stat.berkeley.edu

On Thu, 4 Nov 2010, Changbin Du wrote:

HI, Dear R community,

I have one data set like this. What I want to do is to calculate the cumulative coverage. The following code works for a small data set (#rows = 100), but when fed the whole data set, it was still running after 24 hours. Can someone give some suggestions for long vectors? 
         id reads
Contig79:1      4
Contig79:2      8
Contig79:3     13
Contig79:4     14
Contig79:5     17
Contig79:6     20
Contig79:7     25
Contig79:8     27
Contig79:9     32
Contig79:10    33
Contig79:11    34

matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
sep="\t", skip=0, header=F,fill=T) #
dim(matt)
[1] 3384766       2

matt_plot<-function(matt, outputfile) {
names(matt)<-c("id","reads")

cover<-matt$reads

#calculate the cumulative coverage.
+ cover_per<-function (data) {
+   output<-numeric(0)
+   for (i in data) {
+     x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
+     output<-c(output, x)
+   }
+   return(output)
+ }

result<-cover_per(cover)

Thanks so much!

--
Sincerely,
Changbin
--

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
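Phil's two fixes, preallocating the result instead of growing it and hoisting the constant 100/n factor out of the loop, translate directly to other languages. A Python sketch contrasting the growing anti-pattern with the preallocated version (function names are mine, not from the thread):

```python
def cover_per_grow(data):
    """The anti-pattern Phil warns about: out = out + [...] copies the whole
    list on every iteration, like output <- c(output, x) in R, adding
    quadratic bookkeeping on top of the already-quadratic comparisons."""
    out = []
    for x in data:
        out = out + [100.0 * sum(1 for y in data if y >= x) / len(data)]
    return out

def cover_per_prealloc(data):
    """Preallocate once and fill in place; hoist the constant 100/n factor
    out of the loop (the shape of Phil's cover_per_2 cleanup)."""
    n = len(data)
    out = [0.0] * n
    scale = 100.0 / n
    for i, x in enumerate(data):
        out[i] = scale * sum(1 for y in data if y >= x)
    return out
```

Both still do O(n^2) comparisons, which is Phil's point: the cleanup buys a constant factor, and the later sort-based approach is what changes the asymptotics.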
Re: [R] how to work with long vectors
Thanks Martin, I will try this.

On Thu, Nov 4, 2010 at 10:06 AM, Martin Morgan wrote:

> On 11/04/2010 09:45 AM, Changbin Du wrote:
> > Thanks, Jim!
> >
> > This is not what I want. What I want is to calculate the percentage of reads
> > bigger than or equal to the reads at each position. My output is like the
> > following:
>
> Hi Changbin -- I might be repeating myself, but the Bioconductor
> packages IRanges and GenomicRanges are designed to work with this sort
> of data, and include 'coverage' functions that do what you're interested
> in. Look into ?GRanges if interested.
>
> http://bioconductor.org/help/bioc-views/release/BiocViews.html#___HighThroughputSequencing
>
> Martin
>
> > for row 1, all the reads are >= 4, so the cover_per is 100;
> > for row 2, 99% of reads are >= 8, so the cover_per is 99.
> >> head(final)
> >   cover_per reads
> > 1       100     4
> > 2        99     8
> > 3        98    13
> > 4        97    14
> > 5        96    17
> > 6        95    20
> >
> > I attached the input file with this email. This file is only 100 rows, very
> > small. My original data set is 3384766 rows.
> >
> > > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> > sep="\t", skip=0, header=F,fill=T) #
> > > dim(matt)
> > [1] 3384766       2
> >
> > Thanks so much for your time!
> >
> > > matt<-read.table("/home/cdu/operon/dimer5_0623/matt_test.txt", sep="\t",
> > skip=0, header=F,fill=T) #
> > > names(matt)<-c("id","reads")
> > > dim(matt)
> > [1] 100   2
> > > cover<-matt$reads
> > >
> > > #calculate the cumulative coverage. 
> >> cover_per<-function (data) {
> > + output<-numeric(0)
> > + for (i in data) {
> > + x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
> > + output<-c(output, x)
> > + }
> > + return(output)
> > + }
> >>
> >> result<-cover_per(cover)
> >> head(result)
> > [1] 100  99  98  97  96  95
> >>
> >> final<-data.frame(result, cover)
> >> names(final)<-c("cover_per", "reads")
> >> head(final)
> >   cover_per reads
> > 1       100     4
> > 2        99     8
> > 3        98    13
> > 4        97    14
> > 5        96    17
> > 6        95    20
> >
> > On Thu, Nov 4, 2010 at 9:18 AM, jim holtman wrote:
> >
> >> Is this what you want:
> >>
> >>> x
> >>             id reads
> >> 1   Contig79:1     4
> >> 2   Contig79:2     8
> >> 3   Contig79:3    13
> >> 4   Contig79:4    14
> >> 5   Contig79:5    17
> >> 6   Contig79:6    20
> >> 7   Contig79:7    25
> >> 8   Contig79:8    27
> >> 9   Contig79:9    32
> >> 10 Contig79:10    33
> >> 11 Contig79:11    34
> >>> x$percent <- x$reads / max(x$reads) * 100
> >>> x
> >>             id reads   percent
> >> 1   Contig79:1     4  11.76471
> >> 2   Contig79:2     8  23.52941
> >> 3   Contig79:3    13  38.23529
> >> 4   Contig79:4    14  41.17647
> >> 5   Contig79:5    17  50.00000
> >> 6   Contig79:6    20  58.82353
> >> 7   Contig79:7    25  73.52941
> >> 8   Contig79:8    27  79.41176
> >> 9   Contig79:9    32  94.11765
> >> 10 Contig79:10    33  97.05882
> >> 11 Contig79:11    34 100.00000
> >>
> >> On Thu, Nov 4, 2010 at 11:46 AM, Changbin Du wrote:
> >>> HI, Dear R community,
> >>>
> >>> I have one data set like this. What I want to do is to calculate the
> >>> cumulative coverage. The following code works for a small data set
> >>> (#rows = 100), but when fed the whole data set, it is still running
> >>> after 24 hours. Can someone give some suggestions for long vectors?
> >>>
> >>> id          reads
> >>> Contig79:1      4
> >>> Contig79:2      8
> >>> Contig79:3     13
> >>> Contig79:4     14
> >>> Contig79:5     17
> >>> Contig79:6     20
> >>> Contig79:7     25
> >>> Contig79:8     27
> >>> Contig79:9     32
> >>> Contig79:10    33
> >>> Contig79:11    34
> >>>
> >>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> >>> sep="\t", skip=0, header=F, fill=T)
> >>> dim(matt)
> >>> [1] 3384766       2
> >>>
> >>> matt_plot<-function(matt, outputfile) {
> >>> names(matt)<-c("id","reads")
> >>>
> >>> cover<-matt$reads
> >>>
> >>> #calculate the cumulative coverage.
> >>> cover_per<-function (data) {
> >>> output<-numeric(0)
> >>> for (i in data) {
> >>> x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
> >>> output<-c(output, x)
> >>> }
> >>> return(output)
> >>> }
> >>>
> >>> result<-cover_per(cover)
> >>>
> >>> Thanks so much!
> >>>
> >>> --
> >>> Sincerely,
> >>> Changbin
> >>> --
> >>>
> >>> [[alternative HTML version deleted]]
> >>>
> >>> __
> >>> R-help@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the
Re: [R] how to work with long vectors
On 11/04/2010 09:45 AM, Changbin Du wrote:
> Thanks, Jim!
>
> This is not what I want. What I want is to calculate the percentage of
> reads bigger than or equal to the reads at each position. My output is
> like the following:

Hi Changbin -- I might be repeating myself, but the Bioconductor
packages IRanges and GenomicRanges are designed to work with this sort
of data, and include 'coverage' functions that do what you're interested
in. Look into ?GRanges if interested.

http://bioconductor.org/help/bioc-views/release/BiocViews.html#___HighThroughputSequencing

Martin

> for row 1, all the reads are >= 4, so the cover_per is 100;
> for row 2, 99% of reads are >= 8, so the cover_per is 99.
>> head(final)
>   cover_per reads
> 1       100     4
> 2        99     8
> 3        98    13
> 4        97    14
> 5        96    17
> 6        95    20
>
> I attached the input file with this email. This file is only 100 rows,
> very small. My original data set is 3384766 rows.
>
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> sep="\t", skip=0, header=F, fill=T)
>> dim(matt)
> [1] 3384766       2
>
> Thanks so much for your time!
>
>> matt<-read.table("/home/cdu/operon/dimer5_0623/matt_test.txt", sep="\t",
> skip=0, header=F, fill=T)
>> names(matt)<-c("id","reads")
>> dim(matt)
> [1] 100   2
>> cover<-matt$reads
>>
>> #calculate the cumulative coverage.
>> cover_per<-function (data) {
> + output<-numeric(0)
> + for (i in data) {
> + x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
> + output<-c(output, x)
> + }
> + return(output)
> + }
>>
>> result<-cover_per(cover)
>> head(result)
> [1] 100  99  98  97  96  95
>>
>> final<-data.frame(result, cover)
>> names(final)<-c("cover_per", "reads")
>> head(final)
>   cover_per reads
> 1       100     4
> 2        99     8
> 3        98    13
> 4        97    14
> 5        96    17
> 6        95    20
>
> On Thu, Nov 4, 2010 at 9:18 AM, jim holtman wrote:
>
>> Is this what you want:
>>
>>> x
>>             id reads
>> 1   Contig79:1     4
>> 2   Contig79:2     8
>> 3   Contig79:3    13
>> 4   Contig79:4    14
>> 5   Contig79:5    17
>> 6   Contig79:6    20
>> 7   Contig79:7    25
>> 8   Contig79:8    27
>> 9   Contig79:9    32
>> 10 Contig79:10    33
>> 11 Contig79:11    34
>>> x$percent <- x$reads / max(x$reads) * 100
>>> x
>>             id reads   percent
>> 1   Contig79:1     4  11.76471
>> 2   Contig79:2     8  23.52941
>> 3   Contig79:3    13  38.23529
>> 4   Contig79:4    14  41.17647
>> 5   Contig79:5    17  50.00000
>> 6   Contig79:6    20  58.82353
>> 7   Contig79:7    25  73.52941
>> 8   Contig79:8    27  79.41176
>> 9   Contig79:9    32  94.11765
>> 10 Contig79:10    33  97.05882
>> 11 Contig79:11    34 100.00000
>>
>> On Thu, Nov 4, 2010 at 11:46 AM, Changbin Du wrote:
>>> HI, Dear R community,
>>>
>>> I have one data set like this. What I want to do is to calculate the
>>> cumulative coverage. The following code works for a small data set
>>> (#rows = 100), but when fed the whole data set, it is still running
>>> after 24 hours. Can someone give some suggestions for long vectors?
>>>
>>> id          reads
>>> Contig79:1      4
>>> Contig79:2      8
>>> Contig79:3     13
>>> Contig79:4     14
>>> Contig79:5     17
>>> Contig79:6     20
>>> Contig79:7     25
>>> Contig79:8     27
>>> Contig79:9     32
>>> Contig79:10    33
>>> Contig79:11    34
>>>
>>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
>>> sep="\t", skip=0, header=F, fill=T)
>>> dim(matt)
>>> [1] 3384766       2
>>>
>>> matt_plot<-function(matt, outputfile) {
>>> names(matt)<-c("id","reads")
>>>
>>> cover<-matt$reads
>>>
>>> #calculate the cumulative coverage.
>>> cover_per<-function (data) {
>>> output<-numeric(0)
>>> for (i in data) {
>>> x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
>>> output<-c(output, x)
>>> }
>>> return(output)
>>> }
>>>
>>> result<-cover_per(cover)
>>>
>>> Thanks so much!
>>>
>>> --
>>> Sincerely,
>>> Changbin
>>> --
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> __
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posti
Re: [R] how to work with long vectors
HI, Henrique,

Thanks for the great help! I compared the output from your code:

> te<-rev(100 * cumsum(matt$reads > 1) / length(matt$reads))
> te
  [1] 100  99  98  97  96  95  94  93  92  91  90  89  88  87  86  85  84  83
 [19]  82  81  80  79  78  77  76  75  74  73  72  71  70  69  68  67  66  65
 [37]  64  63  62  61  60  59  58  57  56  55  54  53  52  51  50  49  48  47
 [55]  46  45  44  43  42  41  40  39  38  37  36  35  34  33  32  31  30  29
 [73]  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11
 [91]  10   9   8   7   6   5   4   3   2   1

with the output from my code:

> result
  [1] 100  99  98  97  96  95  94  93  92  91  90  89  88  87  86  85  84  83
 [19]  82  81  80  79  79  77  77  77  74  73  72  71  70  70  68  67  67  65
 [37]  64  64  62  62  60  59  58  57  56  56  54  53  52  51  51  49  48  47
 [55]  46  45  45  43  42  41  40  39  38  37  36  35  34  33  32  31  30  29
 [73]  28  27  27  27  24  24  22  21  20  19  19  19  19  15  14  14  12  11
 [91]  10   9   8   7   7   5   4   3   2   1

There are no ties in your output, but look at the data set: there are ties
in it. Your code runs fast, but I think the results are not accurate.

Thanks so much for the great help!

> matt[c(1:35), ]
            id reads
1   Contig79:1     4
2   Contig79:2     8
;
;
22 Contig79:22    64
23 Contig79:23    64
24 Contig79:24    68
25 Contig79:25    68
26 Contig79:26    68

I also attached the testing file with this email. Thanks!

On Thu, Nov 4, 2010 at 9:12 AM, Henrique Dallazuanna wrote:

> Try this:
>
> rev(100 * cumsum(matt$reads > 1) / length(matt$reads))
>
> On Thu, Nov 4, 2010 at 1:46 PM, Changbin Du wrote:
>
>> HI, Dear R community,
>>
>> I have one data set like this. What I want to do is to calculate the
>> cumulative coverage. The following code works for a small data set
>> (#rows = 100), but when fed the whole data set, it is still running
>> after 24 hours. Can someone give some suggestions for long vectors?
>>
>> id          reads
>> Contig79:1      4
>> Contig79:2      8
>> Contig79:3     13
>> Contig79:4     14
>> Contig79:5     17
>> Contig79:6     20
>> Contig79:7     25
>> Contig79:8     27
>> Contig79:9     32
>> Contig79:10    33
>> Contig79:11    34
>>
>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
>> sep="\t", skip=0, header=F, fill=T)
>> dim(matt)
>> [1] 3384766       2
>>
>> matt_plot<-function(matt, outputfile) {
>> names(matt)<-c("id","reads")
>>
>> cover<-matt$reads
>>
>> #calculate the cumulative coverage.
>> cover_per<-function (data) {
>> output<-numeric(0)
>> for (i in data) {
>> x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
>> output<-c(output, x)
>> }
>> return(output)
>> }
>>
>> result<-cover_per(cover)
>>
>> Thanks so much!
>>
>> --
>> Sincerely,
>> Changbin
>> --
>>
>> [[alternative HTML version deleted]]
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>> > > > > -- > Henrique Dallazuanna > Curitiba-Paraná-Brasil > 25° 25' 40" S 49° 16' 22" O > -- Sincerely, Changbin -- Changbin Du DOE Joint Genome Institute Bldg 400 Rm 457 2800 Mitchell Dr Walnut Creet, CA 94598 Phone: 925-927-2856 Contig79:1 4 Contig79:2 8 Contig79:3 13 Contig79:4 14 Contig79:5 17 Contig79:6 20 Contig79:7 25 Contig79:8 27 Contig79:9 32 Contig79:10 33 Contig79:11 34 Contig79:12 36 Contig79:13 39 Contig79:14 40 Contig79:15 44 Contig79:16 49 Contig79:17 55 Contig79:18 56 Contig79:19 59 Contig79:20 60 Contig79:21 62 Contig79:22 64 Contig79:23 64 Contig79:24 68 Contig79:25 68 Contig79:26 68 Contig79:27 70 Contig79:28 73 Contig79:29 76 Contig79:30 77 Contig79:31 78 Contig79:32 78 Contig79:33 79 Contig79:34 80 Contig79:35 80 Contig79:36 84 Contig79:37 87 Contig79:38 87 Contig79:39 88 Contig79:40 88 Contig79:41 89 Contig79:42 93 Contig79:43 94 Contig79:44 98 Contig79:45 99 Contig79:46 99 Contig79:47 102 Contig79:48 103 Contig79:49 108 Contig79:50 112 Contig79:51 112 Contig79:52 113 Contig79:53 116 Contig79:54 118 Contig79:55 120 Contig79:56 124 Contig79:57 124 Contig79:58 126 Contig79:59 128 Contig79:60 130 Contig79:61 133 Contig79:62 134 Contig79:63 136 Contig79:64 139 Contig79:65 144 Contig79:66 145 Contig79:67 146 Contig79:68 148 Contig79:69 149 Contig79:70 151 Contig79:71 156 Contig79:72 157 Contig79:73 158 Contig79
Re: [R] how to work with long vectors
Thanks, Jim!

This is not what I want. What I want is to calculate the percentage of
reads bigger than or equal to the reads at each position. My output is like
the following:

for row 1, all the reads are >= 4, so the cover_per is 100;
for row 2, 99% of reads are >= 8, so the cover_per is 99.

> head(final)
  cover_per reads
1       100     4
2        99     8
3        98    13
4        97    14
5        96    17
6        95    20

I attached the input file with this email. This file is only 100 rows, very
small. My original data set is 3384766 rows.

> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
sep="\t", skip=0, header=F, fill=T)
> dim(matt)
[1] 3384766       2

Thanks so much for your time!

> matt<-read.table("/home/cdu/operon/dimer5_0623/matt_test.txt", sep="\t",
skip=0, header=F, fill=T)
> names(matt)<-c("id","reads")
> dim(matt)
[1] 100   2
> cover<-matt$reads
>
> #calculate the cumulative coverage.
> cover_per<-function (data) {
+ output<-numeric(0)
+ for (i in data) {
+ x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
+ output<-c(output, x)
+ }
+ return(output)
+ }
>
> result<-cover_per(cover)
> head(result)
[1] 100  99  98  97  96  95
>
> final<-data.frame(result, cover)
> names(final)<-c("cover_per", "reads")
> head(final)
  cover_per reads
1       100     4
2        99     8
3        98    13
4        97    14
5        96    17
6        95    20

On Thu, Nov 4, 2010 at 9:18 AM, jim holtman wrote:

> Is this what you want:
>
> > x
>             id reads
> 1   Contig79:1     4
> 2   Contig79:2     8
> 3   Contig79:3    13
> 4   Contig79:4    14
> 5   Contig79:5    17
> 6   Contig79:6    20
> 7   Contig79:7    25
> 8   Contig79:8    27
> 9   Contig79:9    32
> 10 Contig79:10    33
> 11 Contig79:11    34
> > x$percent <- x$reads / max(x$reads) * 100
> > x
>             id reads   percent
> 1   Contig79:1     4  11.76471
> 2   Contig79:2     8  23.52941
> 3   Contig79:3    13  38.23529
> 4   Contig79:4    14  41.17647
> 5   Contig79:5    17  50.00000
> 6   Contig79:6    20  58.82353
> 7   Contig79:7    25  73.52941
> 8   Contig79:8    27  79.41176
> 9   Contig79:9    32  94.11765
> 10 Contig79:10    33  97.05882
> 11 Contig79:11    34 100.00000
>
> > On Thu, Nov 4, 2010 at 11:46 AM, Changbin Du wrote:
> > HI, Dear R community,
> >
> > I have one data set like this. What I want to do is to calculate the
> > cumulative coverage. The following code works for a small data set
> > (#rows = 100), but when fed the whole data set, it is still running
> > after 24 hours. Can someone give some suggestions for long vectors?
> >
> > id          reads
> > Contig79:1      4
> > Contig79:2      8
> > Contig79:3     13
> > Contig79:4     14
> > Contig79:5     17
> > Contig79:6     20
> > Contig79:7     25
> > Contig79:8     27
> > Contig79:9     32
> > Contig79:10    33
> > Contig79:11    34
> >
> > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> > sep="\t", skip=0, header=F, fill=T)
> > dim(matt)
> > [1] 3384766       2
> >
> > matt_plot<-function(matt, outputfile) {
> > names(matt)<-c("id","reads")
> >
> > cover<-matt$reads
> >
> > #calculate the cumulative coverage.
> > cover_per<-function (data) {
> > output<-numeric(0)
> > for (i in data) {
> > x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
> > output<-c(output, x)
> > }
> > return(output)
> > }
> >
> > result<-cover_per(cover)
> >
> > Thanks so much!
> >
> > --
> > Sincerely,
> > Changbin
> > --
> >
> > [[alternative HTML version deleted]]
> >
> > __
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
>

--
Sincerely,
Changbin
--
Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creek, CA 94598
Phone: 925-927-2856

Contig79:1 4 Contig79:2 8 Contig79:3 13 Contig79:4 14 Contig79:5 17 Contig79:6 20 Contig79:7 25 Contig79:8 27 Contig79:9 32 Contig79:10 33 Contig79:11 34 Contig79:12 36 Contig79:13 39 Contig79:14 40 Contig79:15 44 Contig79:16 49 Contig79:17 55 Contig79:18 56 Contig79:19 59 Contig79:20 60 Contig79:21 62 Contig79:22 64 Contig79:23 64 Contig79:24 68 Contig79:25 68 Contig79:26 68 Contig79:27 70 Contig79:28 73 Contig79:29 76 Contig79:30 77 Contig79:31
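Jim's x$reads / max(x$reads) * 100 and Changbin's cover_per answer different questions: the former scales each depth by the maximum depth, the latter asks what fraction of positions are at least this deep. The contrast on the first four sample rows (mean(reads >= r) is equivalent to the sum/ifelse/length expression in the thread):

```r
reads <- c(4, 8, 13, 14)   # first four rows of the sample data

# Jim's suggestion: each depth as a percentage of the maximum depth.
pct_of_max <- reads / max(reads) * 100
# approximately 28.57 57.14 92.86 100.00

# Changbin's definition: percent of positions with depth >= this depth.
cover_per <- sapply(reads, function(r) 100 * mean(reads >= r))
# 100 75 50 25
```

The two vectors only coincide in degenerate cases, which is why Jim's table did not match the desired head(final) output.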
Re: [R] how to work with long vectors
Try this:

rev(100 * cumsum(matt$reads > 1) / length(matt$reads))

On Thu, Nov 4, 2010 at 1:46 PM, Changbin Du wrote:

> HI, Dear R community,
>
> I have one data set like this. What I want to do is to calculate the
> cumulative coverage. The following code works for a small data set
> (#rows = 100), but when fed the whole data set, it is still running
> after 24 hours. Can someone give some suggestions for long vectors?
>
> id          reads
> Contig79:1      4
> Contig79:2      8
> Contig79:3     13
> Contig79:4     14
> Contig79:5     17
> Contig79:6     20
> Contig79:7     25
> Contig79:8     27
> Contig79:9     32
> Contig79:10    33
> Contig79:11    34
>
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> sep="\t", skip=0, header=F, fill=T)
> dim(matt)
> [1] 3384766       2
>
> matt_plot<-function(matt, outputfile) {
> names(matt)<-c("id","reads")
>
> cover<-matt$reads
>
> #calculate the cumulative coverage.
> cover_per<-function (data) {
> output<-numeric(0)
> for (i in data) {
> x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
> output<-c(output, x)
> }
> return(output)
> }
>
> result<-cover_per(cover)
>
> Thanks so much!
>
> --
> Sincerely,
> Changbin
> --
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] how to work with long vectors
Is this what you want:

> x
            id reads
1   Contig79:1     4
2   Contig79:2     8
3   Contig79:3    13
4   Contig79:4    14
5   Contig79:5    17
6   Contig79:6    20
7   Contig79:7    25
8   Contig79:8    27
9   Contig79:9    32
10 Contig79:10    33
11 Contig79:11    34
> x$percent <- x$reads / max(x$reads) * 100
> x
            id reads   percent
1   Contig79:1     4  11.76471
2   Contig79:2     8  23.52941
3   Contig79:3    13  38.23529
4   Contig79:4    14  41.17647
5   Contig79:5    17  50.00000
6   Contig79:6    20  58.82353
7   Contig79:7    25  73.52941
8   Contig79:8    27  79.41176
9   Contig79:9    32  94.11765
10 Contig79:10    33  97.05882
11 Contig79:11    34 100.00000

On Thu, Nov 4, 2010 at 11:46 AM, Changbin Du wrote:

> HI, Dear R community,
>
> I have one data set like this. What I want to do is to calculate the
> cumulative coverage. The following code works for a small data set
> (#rows = 100), but when fed the whole data set, it is still running
> after 24 hours. Can someone give some suggestions for long vectors?
>
> id          reads
> Contig79:1      4
> Contig79:2      8
> Contig79:3     13
> Contig79:4     14
> Contig79:5     17
> Contig79:6     20
> Contig79:7     25
> Contig79:8     27
> Contig79:9     32
> Contig79:10    33
> Contig79:11    34
>
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> sep="\t", skip=0, header=F, fill=T)
> dim(matt)
> [1] 3384766       2
>
> matt_plot<-function(matt, outputfile) {
> names(matt)<-c("id","reads")
>
> cover<-matt$reads
>
> #calculate the cumulative coverage.
> cover_per<-function (data) {
> output<-numeric(0)
> for (i in data) {
> x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
> output<-c(output, x)
> }
> return(output)
> }
>
> result<-cover_per(cover)
>
> Thanks so much!
>
> --
> Sincerely,
> Changbin
> --
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.