Re: [R] how to work with long vectors

2010-11-05 Thread William Dunlap
There was a numerical typo below: I said
the sample sizes were 5 and 10 thousand;
I should have said 10 and 20 thousand
(the size argument to sample()).

Also, I timed cover_per_2 and _3 for size
200,000 and got times of 338 and 0.12 seconds,
respectively.  Growing the problem by a factor
of 10 made cover_per_2 use about 100 times more
time and cover_per_3 about 10 times more (the
times are too small to get an accurate ratio).
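One way to check growth rates like these empirically is to time a function at size n and at size 2n and look at the ratio. A minimal sketch (growth_ratio is a made-up helper name, not from this thread):

```r
# Time f on a random integer vector of length n and of length 2n and
# return the elapsed-time ratio: roughly 4x suggests quadratic growth,
# roughly 2x suggests linear (or n log n) growth.
growth_ratio <- function(f, n) {
  t1 <- system.time(f(sample(n, size = n, replace = TRUE)))["elapsed"]
  t2 <- system.time(f(sample(2 * n, size = 2 * n, replace = TRUE)))["elapsed"]
  unname(t2 / t1)
}
```

For example, growth_ratio(cover_per_2, 5e3) versus growth_ratio(cover_per_3, 5e3) makes the difference between the two functions visible without ever running the full 3.4-million-row problem.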

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

> -Original Message-
> From: r-help-boun...@r-project.org 
> [mailto:r-help-boun...@r-project.org] On Behalf Of William Dunlap
> Sent: Friday, November 05, 2010 9:58 AM
> To: Changbin Du
> Cc: r-help@r-project.org
> Subject: Re: [R] how to work with long vectors
> 
> The following cover_per_3 uses sorting to solve
> the problem more quickly.  It still has room
> for improvement.
> 
> cover_per_3 <- function (data) 
> {
> n <- length(data)
> o <- rev(order(data))
> sdata <- data[o]
> r <- rle(sdata)$lengths
> output <- numeric(n)
> output[o] <- rep(cumsum(r), r)
> 100 * output/n
> }
> 
> (The ecdf function would probably also do the
> job quickly.)
> 
> When trying to work on problems like this I find
> it most fruitful to work on smaller datasets and
> see how the time grows with the size of the data,
> instead of seeing how many days it takes on a huge
> dataset.  E.g., the following compares times for
> your original function, Phil Spector's simple cleanup
> of your function, and the sort-based approach for
> vectors of length 5 and 10 thousand.
> 
> > z<-sample(5e3, size=1e4, replace=TRUE) ; print(system.time(v <-
> cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
> print(system.time(v_3 <- cover_per_3(z)))
>user  system elapsed 
>   38.210.00   38.41 
>user  system elapsed 
>0.860.000.86 
>user  system elapsed 
>   0   0   0 
> > identical(v_3,v)
> [1] TRUE
> > z<-sample(1e4, size=2e4, replace=TRUE) ; print(system.time(v <-
> cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
> print(system.time(v_3 <- cover_per_3(z)))
>user  system elapsed 
>  158.480.07  159.31 
>user  system elapsed 
>3.230.003.25 
>user  system elapsed 
>0.020.000.02 
> > identical(v_3,v)
> [1] TRUE
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com  
> 
> > -Original Message-
> > From: r-help-boun...@r-project.org 
> > [mailto:r-help-boun...@r-project.org] On Behalf Of Changbin Du
> > Sent: Friday, November 05, 2010 9:14 AM
> > To: Phil Spector
> > Cc: 
> > Subject: Re: [R] how to work with long vectors
> > 
> > HI, Phil,
> > 
> > I used the following code and ran it overnight for 15 hours;
> > this morning I stopped it. It seems it is still not efficient.
> > 
> > 
> > >
> > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> > sep="\t", skip=0, header=F,fill=T) #
> > > names(matt)<-c("id","reads")
> > 
> > > dim(matt)
> > [1] 3384766   2
> > 
> > >  cover<-matt$reads
> > 
> > > cover_per_2 <- function(data){
> > +   l = length(data)
> > +   output = numeric(l)
> > +   for(i in 1:l)output[i] = sum(data >= data[i])
> > +   100 * output / l
> > + }
> > 
> > > result3<-cover_per_2(cover)
> > 
> > 
> > On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du 
> >  wrote:
> > 
> > > Thanks Phil, that is great! I WILL try this and let you 
> > know how it goes.
> > >
> > >
> > >
> > >
> > > On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector 
> > wrote:
> > >
> > >> Changbin -
> > >>   Does
> > >>
> > >>100 * sapply(matt$reads,function(x)sum(matt$reads >=
> > >> x))/length(matt$reads)
> > >>
> > >> give what you want?
> > >>
> > >>By the way, if you want to use a loop (there's nothing 
> > wrong with
> > >> that),
> > >> then try to avoid the most common mistake that people make 
> > with loops in
> > >> R:
> > >> having your result grow inside the loop.

Re: [R] how to work with long vectors

2010-11-05 Thread Changbin Du
Thanks, William. It gave me a lesson.


On Fri, Nov 5, 2010 at 9:58 AM, William Dunlap  wrote:

> The following cover_per_3 uses sorting to solve
> the problem more quickly.  It still has room
> for improvement.
>
> cover_per_3 <- function (data)
> {
>n <- length(data)
>o <- rev(order(data))
>sdata <- data[o]
>r <- rle(sdata)$lengths
>output <- numeric(n)
>output[o] <- rep(cumsum(r), r)
>100 * output/n
> }
>
> (The ecdf function would probably also do the
> job quickly.)
>
> When trying to work on problems like this I find
> it most fruitful to work on smaller datasets and
> see how the time grows with the size of the data,
> instead of seeing how many days it takes on a huge
> dataset.  E.g., the following compares times for
> your original function, Phil Spector's simple cleanup
> of your function, and the sort-based approach for
> vectors of length 5 and 10 thousand.
>
> > z<-sample(5e3, size=1e4, replace=TRUE) ; print(system.time(v <-
> cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
> print(system.time(v_3 <- cover_per_3(z)))
>   user  system elapsed
>  38.210.00   38.41
>   user  system elapsed
>   0.860.000.86
>   user  system elapsed
>  0   0   0
> > identical(v_3,v)
> [1] TRUE
> > z<-sample(1e4, size=2e4, replace=TRUE) ; print(system.time(v <-
> cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
> print(system.time(v_3 <- cover_per_3(z)))
>   user  system elapsed
>  158.480.07  159.31
>   user  system elapsed
>   3.230.003.25
>   user  system elapsed
>   0.020.000.02
> > identical(v_3,v)
> [1] TRUE
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
> > -----Original Message-
> > From: r-help-boun...@r-project.org
> > [mailto:r-help-boun...@r-project.org] On Behalf Of Changbin Du
> > Sent: Friday, November 05, 2010 9:14 AM
> > To: Phil Spector
> > Cc: 
> > Subject: Re: [R] how to work with long vectors
> >
> > HI, Phil,
> >
> > I used the following code and ran it overnight for 15 hours;
> > this morning I stopped it. It seems it is still not efficient.
> >
> >
> > >
> > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> > sep="\t", skip=0, header=F,fill=T) #
> > > names(matt)<-c("id","reads")
> >
> > > dim(matt)
> > [1] 3384766   2
> >
> > >  cover<-matt$reads
> >
> > > cover_per_2 <- function(data){
> > +   l = length(data)
> > +   output = numeric(l)
> > +   for(i in 1:l)output[i] = sum(data >= data[i])
> > +   100 * output / l
> > + }
> >
> > > result3<-cover_per_2(cover)
> >
> >
> > On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du
> >  wrote:
> >
> > > Thanks Phil, that is great! I WILL try this and let you
> > know how it goes.
> > >
> > >
> > >
> > >
> > > On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector
> > wrote:
> > >
> > >> Changbin -
> > >>   Does
> > >>
> > >>100 * sapply(matt$reads,function(x)sum(matt$reads >=
> > >> x))/length(matt$reads)
> > >>
> > >> give what you want?
> > >>
> > >>By the way, if you want to use a loop (there's nothing
> > wrong with
> > >> that),
> > >> then try to avoid the most common mistake that people make
> > with loops in
> > >> R:
> > >> having your result grow inside the loop.  Here's a better
> > way to use a
> > >> loop
> > >> to solve your problem:
> > >>
> > >> cover_per_1 <- function(data){
> > >>   l = length(data)
> > >>   output = numeric(l)
> > >>   for(i in 1:l)output[i] = 100 * sum(ifelse(data >= data[i], 1,
> > >> 0))/length(data)
> > >>   output
> > >> }
> > >>
> > >> Using some random data, and comparing to your original
> > cover_per function:
> > >>
> > >>  dat = rnorm(1000)
> > >>> system.time(one <- cover_per(dat))
> > >>>
> > >>   user  system elapsed
> > >>  0.816   0.000   0.824

Re: [R] how to work with long vectors

2010-11-05 Thread William Dunlap
The following cover_per_3 uses sorting to solve
the problem more quickly.  It still has room
for improvement.

cover_per_3 <- function (data) 
{
n <- length(data)
o <- rev(order(data))
sdata <- data[o]
r <- rle(sdata)$lengths
output <- numeric(n)
output[o] <- rep(cumsum(r), r)
100 * output/n
}

(The ecdf function would probably also do the
job quickly.)
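In the same spirit, rank() gives the answer directly, since sum(data >= data[i]) equals n minus the number of values strictly below data[i]. A minimal sketch (cover_per_rank is a made-up name, not from this thread):

```r
# For each element, the percentage of values >= it.
# rank(..., ties.method = "min") returns 1 + (number of strictly
# smaller values), so n - rank + 1 counts the values >= each element,
# handling ties the same way as the looped versions.
cover_per_rank <- function(data) {
  n <- length(data)
  100 * (n - rank(data, ties.method = "min") + 1) / n
}
```

On small inputs this agrees with the looped versions, and like cover_per_3 it costs only a single sort's worth of work.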

When trying to work on problems like this I find
it most fruitful to work on smaller datasets and
see how the time grows with the size of the data,
instead of seeing how many days it takes on a huge
dataset.  E.g., the following compares times for
your original function, Phil Spector's simple cleanup
of your function, and the sort-based approach for
vectors of length 5 and 10 thousand.

> z<-sample(5e3, size=1e4, replace=TRUE) ; print(system.time(v <-
cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
print(system.time(v_3 <- cover_per_3(z)))
   user  system elapsed 
  38.210.00   38.41 
   user  system elapsed 
   0.860.000.86 
   user  system elapsed 
  0   0   0 
> identical(v_3,v)
[1] TRUE
> z<-sample(1e4, size=2e4, replace=TRUE) ; print(system.time(v <-
cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
print(system.time(v_3 <- cover_per_3(z)))
   user  system elapsed 
 158.480.07  159.31 
   user  system elapsed 
   3.230.003.25 
   user  system elapsed 
   0.020.000.02 
> identical(v_3,v)
[1] TRUE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

> -Original Message-
> From: r-help-boun...@r-project.org 
> [mailto:r-help-boun...@r-project.org] On Behalf Of Changbin Du
> Sent: Friday, November 05, 2010 9:14 AM
> To: Phil Spector
> Cc: 
> Subject: Re: [R] how to work with long vectors
> 
> HI, Phil,
> 
> I used the following code and ran it overnight for 15 hours;
> this morning I stopped it. It seems it is still not efficient.
> 
> 
> >
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> sep="\t", skip=0, header=F,fill=T) #
> > names(matt)<-c("id","reads")
> 
> > dim(matt)
> [1] 3384766   2
> 
> >  cover<-matt$reads
> 
> > cover_per_2 <- function(data){
> +   l = length(data)
> +   output = numeric(l)
> +   for(i in 1:l)output[i] = sum(data >= data[i])
> +   100 * output / l
> + }
> 
> > result3<-cover_per_2(cover)
> 
> 
> On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du 
>  wrote:
> 
> > Thanks Phil, that is great! I WILL try this and let you 
> know how it goes.
> >
> >
> >
> >
> > On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector 
> wrote:
> >
> >> Changbin -
> >>   Does
> >>
> >>100 * sapply(matt$reads,function(x)sum(matt$reads >=
> >> x))/length(matt$reads)
> >>
> >> give what you want?
> >>
> >>By the way, if you want to use a loop (there's nothing 
> wrong with
> >> that),
> >> then try to avoid the most common mistake that people make 
> with loops in
> >> R:
> >> having your result grow inside the loop.  Here's a better 
> way to use a
> >> loop
> >> to solve your problem:
> >>
> >> cover_per_1 <- function(data){
> >>   l = length(data)
> >>   output = numeric(l)
> >>   for(i in 1:l)output[i] = 100 * sum(ifelse(data >= data[i], 1,
> >> 0))/length(data)
> >>   output
> >> }
> >>
> >> Using some random data, and comparing to your original 
> cover_per function:
> >>
> >>  dat = rnorm(1000)
> >>> system.time(one <- cover_per(dat))
> >>>
> >>   user  system elapsed
> >>  0.816   0.000   0.824
> >>
> >>> system.time(two <- cover_per_1(dat))
> >>>
> >>   user  system elapsed
> >>  0.792   0.000   0.805
> >>
> >> Not that big a speedup, but it does increase quite a bit 
> as the problem
> >> gets
> >> larger.
> >>
> >> There are two obvious ways to speed up your function:
> >>   1)  Eliminate the ifelse function, since automatic coercion from
> >>   logical to numeric does the same thing.
> >>   2)  Multiply by 100 and divide by the length outside the loop:
> >>
> >> cover_per_2 <- function(data){
> >>   l = length(data)
> >>   output = numeric(l)
> >>   for(i in 1:l)output[i] = sum(data >= data[i])
> >>   100 * output / l
> >> }

Re: [R] how to work with long vectors

2010-11-05 Thread Changbin Du
Thanks Martin! I will try it and will let you guys know how it goes.


On Fri, Nov 5, 2010 at 9:42 AM, Martin Morgan  wrote:

> On 11/05/2010 09:13 AM, Changbin Du wrote:
> > HI, Phil,
> >
> > I used the following code and ran it overnight for 15 hours; this
> > morning I stopped it. It seems it is still not efficient.
> >
> >
> >>
> >
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> > sep="\t", skip=0, header=F,fill=T) #
> >> names(matt)<-c("id","reads")
> >
> >> dim(matt)
> > [1] 3384766   2
>
> [snip]
>
> >>> On Thu, 4 Nov 2010, Changbin Du wrote:
> >>>
> >>>  HI, Dear R community,
> 
>  I have one data set like this. What I want to do is to calculate the
>  cumulative coverage. The following code works for a small data set
>  (#rows = 100), but when fed the whole data set, it is still running
>  after 24 hours. Can someone give some suggestions for a long vector?
> 
>  id             reads
>  Contig79:1       4
>  Contig79:2       8
>  Contig79:3      13
>  Contig79:4      14
>  Contig79:5      17
>  Contig79:6      20
>  Contig79:7      25
>  Contig79:8      27
>  Contig79:9      32
>  Contig79:10     33
>  Contig79:11     34
> 
> 
> 
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
>  sep="\t", skip=0, header=F,fill=T) #
>  dim(matt)
>  [1] 3384766   2
> 
>  matt_plot<-function(matt, outputfile) {
>  names(matt)<-c("id","reads")
> 
>  cover<-matt$reads
> 
> 
>  #calculate the cumulative coverage.
>  + cover_per<-function (data) {
>  + output<-numeric(0)
>  + for (i in data) {
>  +   x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
>  +   output<-c(output, x)
>  + }
>  + return(output)
>  + }
> 
> 
>  result<-cover_per(cover)
>
> Hi Changbin
>
> If I understand correctly, your contigs 'start' at position 1, and have
> 'width' equal to matt$reads. You'd like to know the coverage at the last
> covered location of each contig in matt$reads.
>
> ## first time only
> source("http://bioconductor.org")
> biocLite("IRanges")
>
> ##
> library(IRanges)
> contigs = IRanges(start=1, width=matt$reads)
> cvg = coverage(contigs) ## an RLE summarizing coverage, from position 1
> as.vector(cvg[matt$reads]) / nrow(matt)  ## at the end of each contig
>
> for a larger data set:
>
> > matt=data.frame(reads=ceiling(as.integer(runif(3384766, 1, 100
> > contigs = IRanges(start=1, width=matt$reads)
> > system.time(cvg <- coverage(contigs))
>   user  system elapsed
>  5.145   0.050   5.202
>
> Martin
>
> 
> 
>  Thanks so much!
> 
> 
>  --
>  Sincerely,
>  Changbin
>  --
> 
> [[alternative HTML version deleted]]
> 
>  __
>  R-help@r-project.org mailing list
>  https://stat.ethz.ch/mailman/listinfo/r-help
>  PLEASE do read the posting guide
>  http://www.R-project.org/posting-guide.html
>  and provide commented, minimal, self-contained, reproducible code.
> 
> 
> >>
> >>
> >> --
> >> Sincerely,
> >> Changbin
> >> --
> >>
> >> Changbin Du
> >> DOE Joint Genome Institute
> >> Bldg 400 Rm 457
> >> 2800 Mitchell Dr
> >> Walnut Creet, CA 94598
> >> Phone: 925-927-2856
> >>
> >>
> >>
> >
> >
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
>



-- 
Sincerely,
Changbin
--

Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creet, CA 94598
Phone: 925-927-2856



Re: [R] how to work with long vectors

2010-11-05 Thread Martin Morgan
On 11/05/2010 09:42 AM, Martin Morgan wrote:

> ## first time only
> source("http://bioconductor.org")

oops, source("http://bioconductor.org/biocLite.R")

> biocLite("IRanges")
> 
> ##
> library(IRanges)
> contigs = IRanges(start=1, width=matt$reads)
> cvg = coverage(contigs) ## an RLE summarizing coverage, from position 1
> as.vector(cvg[matt$reads]) / nrow(matt)  ## at the end of each contig
> 
> for a larger data set:
> 
>> matt=data.frame(reads=ceiling(as.integer(runif(3384766, 1, 100
>> contigs = IRanges(start=1, width=matt$reads)
>> system.time(cvg <- coverage(contigs))
>user  system elapsed
>   5.145   0.050   5.202
> 
> Martin
> 
>
>
> Thanks so much!
>
>
> --
> Sincerely,
> Changbin
> --
>
>[[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
>>>
>>>
>>> --
>>> Sincerely,
>>> Changbin
>>> --
>>>
>>> Changbin Du
>>> DOE Joint Genome Institute
>>> Bldg 400 Rm 457
>>> 2800 Mitchell Dr
>>> Walnut Creet, CA 94598
>>> Phone: 925-927-2856
>>>
>>>
>>>
>>
>>
> 
> 


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



Re: [R] how to work with long vectors

2010-11-05 Thread Martin Morgan
On 11/05/2010 09:13 AM, Changbin Du wrote:
> HI, Phil,
> 
> I used the following code and ran it overnight for 15 hours; this morning
> I stopped it. It seems it is still not efficient.
> 
> 
>>
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> sep="\t", skip=0, header=F,fill=T) #
>> names(matt)<-c("id","reads")
> 
>> dim(matt)
> [1] 3384766   2

[snip]

>>> On Thu, 4 Nov 2010, Changbin Du wrote:
>>>
>>>  HI, Dear R community,

 I have one data set like this. What I want to do is to calculate the
 cumulative coverage. The following code works for a small data set
 (#rows = 100), but when fed the whole data set, it is still running
 after 24 hours. Can someone give some suggestions for a long vector?

 id             reads
 Contig79:1       4
 Contig79:2       8
 Contig79:3      13
 Contig79:4      14
 Contig79:5      17
 Contig79:6      20
 Contig79:7      25
 Contig79:8      27
 Contig79:9      32
 Contig79:10     33
 Contig79:11     34


 matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
 sep="\t", skip=0, header=F,fill=T) #
 dim(matt)
 [1] 3384766   2

 matt_plot<-function(matt, outputfile) {
 names(matt)<-c("id","reads")

 cover<-matt$reads


 #calculate the cumulative coverage.
 + cover_per<-function (data) {
 + output<-numeric(0)
 + for (i in data) {
 +   x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
 +   output<-c(output, x)
 + }
 + return(output)
 + }


 result<-cover_per(cover)

Hi Changbin

If I understand correctly, your contigs 'start' at position 1, and have
'width' equal to matt$reads. You'd like to know the coverage at the last
covered location of each contig in matt$reads.

## first time only
source("http://bioconductor.org")
biocLite("IRanges")

##
library(IRanges)
contigs = IRanges(start=1, width=matt$reads)
cvg = coverage(contigs) ## an RLE summarizing coverage, from position 1
as.vector(cvg[matt$reads]) / nrow(matt)  ## at the end of each contig
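The same per-row fraction can be sketched in base R with a counting sort, since every contig starts at position 1 (an illustration under that assumption; coverage_fraction is a made-up name, and it assumes reads holds positive integer widths):

```r
# Coverage at position p is the number of widths >= p.  Count each
# width with tabulate(), then take a reverse cumulative sum so that
# cvg[p] == sum(reads >= p); indexing by each row's width gives the
# coverage at its last covered position, as a fraction of the rows.
coverage_fraction <- function(reads) {
  cnt <- tabulate(reads, nbins = max(reads))
  cvg <- rev(cumsum(rev(cnt)))
  cvg[reads] / length(reads)
}
```

This avoids the package dependency at the cost of allocating a vector as long as the largest width.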

for a larger data set:

> matt=data.frame(reads=ceiling(as.integer(runif(3384766, 1, 100
> contigs = IRanges(start=1, width=matt$reads)
> system.time(cvg <- coverage(contigs))
   user  system elapsed
  5.145   0.050   5.202

Martin



 Thanks so much!


 --
 Sincerely,
 Changbin
 --

[[alternative HTML version deleted]]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


>>
>>
>> --
>> Sincerely,
>> Changbin
>> --
>>
>> Changbin Du
>> DOE Joint Genome Institute
>> Bldg 400 Rm 457
>> 2800 Mitchell Dr
>> Walnut Creet, CA 94598
>> Phone: 925-927-2856
>>
>>
>>
> 
> 


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



Re: [R] how to work with long vectors

2010-11-05 Thread Changbin Du
HI, Phil,

I used the following code and ran it overnight for 15 hours; this morning
I stopped it. It seems it is still not efficient.


>
matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
sep="\t", skip=0, header=F,fill=T) #
> names(matt)<-c("id","reads")

> dim(matt)
[1] 3384766   2

>  cover<-matt$reads

> cover_per_2 <- function(data){
+   l = length(data)
+   output = numeric(l)
+   for(i in 1:l)output[i] = sum(data >= data[i])
+   100 * output / l
+ }

> result3<-cover_per_2(cover)


On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du  wrote:

> Thanks Phil, that is great! I WILL try this and let you know how it goes.
>
>
>
>
> On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector 
> wrote:
>
>> Changbin -
>>   Does
>>
>>100 * sapply(matt$reads,function(x)sum(matt$reads >=
>> x))/length(matt$reads)
>>
>> give what you want?
>>
>>By the way, if you want to use a loop (there's nothing wrong with
>> that),
>> then try to avoid the most common mistake that people make with loops in
>> R:
>> having your result grow inside the loop.  Here's a better way to use a
>> loop
>> to solve your problem:
>>
>> cover_per_1 <- function(data){
>>   l = length(data)
>>   output = numeric(l)
>>   for(i in 1:l)output[i] = 100 * sum(ifelse(data >= data[i], 1,
>> 0))/length(data)
>>   output
>> }
>>
>> Using some random data, and comparing to your original cover_per function:
>>
>>  dat = rnorm(1000)
>>> system.time(one <- cover_per(dat))
>>>
>>   user  system elapsed
>>  0.816   0.000   0.824
>>
>>> system.time(two <- cover_per_1(dat))
>>>
>>   user  system elapsed
>>  0.792   0.000   0.805
>>
>> Not that big a speedup, but it does increase quite a bit as the problem
>> gets
>> larger.
>>
>> There are two obvious ways to speed up your function:
>>   1)  Eliminate the ifelse function, since automatic coercion from
>>   logical to numeric does the same thing.
>>   2)  Multiply by 100 and divide by the length outside the loop:
>>
>> cover_per_2 <- function(data){
>>   l = length(data)
>>   output = numeric(l)
>>   for(i in 1:l)output[i] = sum(data >= data[i])
>>   100 * output / l
>> }
>>
>>  system.time(three <- cover_per_2(dat))
>>>
>>   user  system elapsed
>>  0.024   0.000   0.027
>>
>> That makes the loop just about equivalent to the sapply solution:
>>
>>  system.time(four <- 100*sapply(dat,function(x)sum(dat >= x))/length(dat))
>>>
>>   user  system elapsed
>>  0.024   0.000   0.026
>>
>>- Phil Spector
>> Statistical Computing Facility
>> Department of Statistics
>> UC Berkeley
>> spec...@stat.berkeley.edu
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Thu, 4 Nov 2010, Changbin Du wrote:
>>
>>  HI, Dear R community,
>>>
>>> I have one data set like this. What I want to do is to calculate the
>>> cumulative coverage. The following code works for a small data set
>>> (#rows = 100), but when fed the whole data set, it is still running
>>> after 24 hours. Can someone give some suggestions for a long vector?
>>>
>>> id             reads
>>> Contig79:1       4
>>> Contig79:2       8
>>> Contig79:3      13
>>> Contig79:4      14
>>> Contig79:5      17
>>> Contig79:6      20
>>> Contig79:7      25
>>> Contig79:8      27
>>> Contig79:9      32
>>> Contig79:10     33
>>> Contig79:11     34
>>>
>>>
>>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
>>> sep="\t", skip=0, header=F,fill=T) #
>>> dim(matt)
>>> [1] 3384766   2
>>>
>>> matt_plot<-function(matt, outputfile) {
>>> names(matt)<-c("id","reads")
>>>
>>> cover<-matt$reads
>>>
>>>
>>> #calculate the cumulative coverage.
>>> + cover_per<-function (data) {
>>> + output<-numeric(0)
>>> + for (i in data) {
>>> +   x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
>>> +   output<-c(output, x)
>>> + }
>>> + return(output)
>>> + }
>>>
>>>
>>> result<-cover_per(cover)
>>>
>>>
>>> Thanks so much!
>>>
>>>
>>> --
>>> Sincerely,
>>> Changbin
>>> --
>>>
>>>[[alternative HTML version deleted]]
>>>
>>> __
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>
>
> --
> Sincerely,
> Changbin
> --
>
> Changbin Du
> DOE Joint Genome Institute
> Bldg 400 Rm 457
> 2800 Mitchell Dr
> Walnut Creet, CA 94598
> Phone: 925-927-2856
>
>
>


-- 
Sincerely,
Changbin
--

Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creet, CA 9459

Re: [R] how to work with long vectors

2010-11-04 Thread Changbin Du
Thanks Phil, that is great! I WILL try this and let you know how it goes.



On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector wrote:

> Changbin -
>   Does
>
>100 * sapply(matt$reads,function(x)sum(matt$reads >=
> x))/length(matt$reads)
>
> give what you want?
>
>By the way, if you want to use a loop (there's nothing wrong with that),
> then try to avoid the most common mistake that people make with loops in R:
> having your result grow inside the loop.  Here's a better way to use a loop
> to solve your problem:
>
> cover_per_1 <- function(data){
>   l = length(data)
>   output = numeric(l)
>   for(i in 1:l)output[i] = 100 * sum(ifelse(data >= data[i], 1,
> 0))/length(data)
>   output
> }
>
> Using some random data, and comparing to your original cover_per function:
>
>  dat = rnorm(1000)
>> system.time(one <- cover_per(dat))
>>
>   user  system elapsed
>  0.816   0.000   0.824
>
>> system.time(two <- cover_per_1(dat))
>>
>   user  system elapsed
>  0.792   0.000   0.805
>
> Not that big a speedup, but it does increase quite a bit as the problem
> gets
> larger.
>
> There are two obvious ways to speed up your function:
>   1)  Eliminate the ifelse function, since automatic coercion from
>   logical to numeric does the same thing.
>   2)  Multiply by 100 and divide by the length outside the loop:
>
> cover_per_2 <- function(data){
>   l = length(data)
>   output = numeric(l)
>   for(i in 1:l)output[i] = sum(data >= data[i])
>   100 * output / l
> }
>
>  system.time(three <- cover_per_2(dat))
>>
>   user  system elapsed
>  0.024   0.000   0.027
>
> That makes the loop just about equivalent to the sapply solution:
>
>  system.time(four <- 100*sapply(dat,function(x)sum(dat >= x))/length(dat))
>>
>   user  system elapsed
>  0.024   0.000   0.026
>
>- Phil Spector
> Statistical Computing Facility
> Department of Statistics
> UC Berkeley
> spec...@stat.berkeley.edu
>
>
>
>
>
>
>
>
>
> On Thu, 4 Nov 2010, Changbin Du wrote:
>
>  HI, Dear R community,
>>
>> I have one data set like this. What I want to do is to calculate the
>> cumulative coverage. The following code works for a small data set
>> (#rows = 100), but when fed the whole data set, it is still running
>> after 24 hours. Can someone give some suggestions for a long vector?
>>
>> id             reads
>> Contig79:1       4
>> Contig79:2       8
>> Contig79:3      13
>> Contig79:4      14
>> Contig79:5      17
>> Contig79:6      20
>> Contig79:7      25
>> Contig79:8      27
>> Contig79:9      32
>> Contig79:10     33
>> Contig79:11     34
>>
>>
>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
>> sep="\t", skip=0, header=F,fill=T) #
>> dim(matt)
>> [1] 3384766   2
>>
>> matt_plot<-function(matt, outputfile) {
>> names(matt)<-c("id","reads")
>>
>> cover<-matt$reads
>>
>>
>> #calculate the cumulative coverage.
>> + cover_per<-function (data) {
>> + output<-numeric(0)
>> + for (i in data) {
>> +   x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
>> +   output<-c(output, x)
>> + }
>> + return(output)
>> + }
>>
>>
>> result<-cover_per(cover)
>>
>>
>> Thanks so much!
>>
>>
>> --
>> Sincerely,
>> Changbin
>> --
>>
>>[[alternative HTML version deleted]]
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>


-- 
Sincerely,
Changbin
--

Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creet, CA 94598
Phone: 925-927-2856



Re: [R] how to work with long vectors

2010-11-04 Thread Phil Spector

Changbin -
   Does

100 * sapply(matt$reads,function(x)sum(matt$reads >= x))/length(matt$reads)

give what you want?

By the way, if you want to use a loop (there's nothing wrong with that),
then try to avoid the most common mistake that people make with loops in R:
having your result grow inside the loop.  Here's a better way to use a loop
to solve your problem:

cover_per_1 <- function(data){
   l = length(data)
   output = numeric(l)
   for(i in 1:l)output[i] = 100 * sum(ifelse(data >= data[i], 1, 
0))/length(data)
   output
}
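The difference that preallocation makes can be sketched directly (grow and prealloc are made-up names for this comparison):

```r
# Growing a vector with c() copies it on every iteration, which is
# quadratic work overall; preallocating with numeric(n) writes each
# element in place.  Both produce the same result.
grow <- function(n) {
  out <- numeric(0)
  for (i in 1:n) out <- c(out, i)
  out
}
prealloc <- function(n) {
  out <- numeric(n)
  for (i in 1:n) out[i] <- i
  out
}
```

At small n the two are indistinguishable; the copying cost of grow() is what makes it fall behind as n increases.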

Using some random data, and comparing to your original cover_per function:


dat = rnorm(1000)
system.time(one <- cover_per(dat))

   user  system elapsed
  0.816   0.000   0.824 

system.time(two <- cover_per_1(dat))

   user  system elapsed
  0.792   0.000   0.805

Not that big a speedup, but it does increase quite a bit as the problem gets
larger.

There are two obvious ways to speed up your function:
   1)  Eliminate the ifelse function, since automatic coercion from
   logical to numeric does the same thing.
   2)  Multiply by 100 and divide by the length outside the loop:

cover_per_2 <- function(data){
   l = length(data)
   output = numeric(l)
   for(i in 1:l)output[i] = sum(data >= data[i])
   100 * output / l
}
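Point 1 can be checked directly: logical values coerce to 0/1 under sum(), so the ifelse() wrapper adds work without changing the result (a minimal check, not part of the original message):

```r
# sum() on a logical vector counts the TRUEs, so ifelse() is redundant
x <- c(4, 8, 13, 14, 17)
stopifnot(sum(x >= 13) == sum(ifelse(x >= 13, 1, 0)))  # both equal 3
```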


system.time(three <- cover_per_2(dat))

   user  system elapsed
  0.024   0.000   0.027

That makes the loop just about equivalent to the sapply solution:


system.time(four <- 100*sapply(dat,function(x)sum(dat >= x))/length(dat))

   user  system elapsed
  0.024   0.000   0.026

- Phil Spector
 Statistical Computing Facility
 Department of Statistics
 UC Berkeley
 spec...@stat.berkeley.edu








On Thu, 4 Nov 2010, Changbin Du wrote:


HI, Dear R community,

I have one data set like this. What I want to do is to calculate the
cumulative coverage. The following code works for a small data set
(#rows = 100), but when fed the whole data set, it is still running
after 24 hours. Can someone give some suggestions for a long vector?

id             reads
Contig79:1       4
Contig79:2       8
Contig79:3      13
Contig79:4      14
Contig79:5      17
Contig79:6      20
Contig79:7      25
Contig79:8      27
Contig79:9      32
Contig79:10     33
Contig79:11     34

matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
sep="\t", skip=0, header=F,fill=T) #
dim(matt)
[1] 3384766   2

matt_plot<-function(matt, outputfile) {
names(matt)<-c("id","reads")

cover<-matt$reads


#calculate the cumulative coverage.
+ cover_per<-function (data) {
+ output<-numeric(0)
+ for (i in data) {
+   x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
+   output<-c(output, x)
+ }
+ return(output)
+ }


result<-cover_per(cover)


Thanks so much!


--
Sincerely,
Changbin
--

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.





Re: [R] how to work with long vectors

2010-11-04 Thread Changbin Du
Thanks Martin, I will try this.

On Thu, Nov 4, 2010 at 10:06 AM, Martin Morgan  wrote:

> On 11/04/2010 09:45 AM, Changbin Du wrote:
> > Thanks, Jim!
> >
> > This is not what I want,  What I want is calculate the percentage of
> reads
> > bigger or equal to that reads in each position.MY output is like the
> > following:
>
> Hi Changbin -- I might be repeating myself, but the Bioconductor
> packages IRanges and GenomicRanges are designed to work with this sort
> of data, and include 'coverage' functions that do what you're interested
> in. Look into ?GRanges if interested.
>
>
>
> http://bioconductor.org/help/bioc-views/release/BiocViews.html#___HighThroughputSequencing
>
> Martin
>
> > for row 1, all the reads is >= 4, so the cover_per is 100,
> > for row 2, 99 % reads >=4, so the cover_per is 99.
> >> head(final)
> >   cover_per reads
> > 1       100     4
> > 2        99     8
> > 3        98    13
> > 4        97    14
> > 5        96    17
> > 6        95    20
> >
> > I attached the input file with this email. This file is only 100 rows,
> very
> > small. MY original data set is 3384766 rows.
> >
> >  >
> >
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> > sep="\t", skip=0, header=F,fill=T) #
> >> dim(matt)
> > [1] 3384766   2
> >
> > Thanks so much for your time!
> >
> >> matt<-read.table("/home/cdu/operon/dimer5_0623/matt_test.txt", sep="\t",
> > skip=0, header=F,fill=T) #
> >> names(matt)<-c("id","reads")
> >> dim(matt)
> > [1] 100   2
> >> cover<-matt$reads
> >>
> >>
> >> #calculate the cumulative coverage.
> >> cover_per<-function (data) {
> > + output<-numeric(0)
> > + for (i in data) {
> > +   x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
> > +   output<-c(output, x)
> > + }
> > + return(output)
> > + }
> >>
> >>
> >> result<-cover_per(cover)
> >> head(result)
> > [1] 100  99  98  97  96  95
> >>
> >> final<-data.frame(result, cover)
> >>
> >> names(final)<-c("cover_per", "reads")
> >> head(final)
> >   cover_per reads
> > 1       100     4
> > 2        99     8
> > 3        98    13
> > 4        97    14
> > 5        96    17
> > 6        95    20
> >
> >
> >
> >
> >
> > On Thu, Nov 4, 2010 at 9:18 AM, jim holtman  wrote:
> >
> >> Is this what you want:
> >>
> >>> x
> >>id reads
> >> 1   Contig79:1     4
> >> 2   Contig79:2     8
> >> 3   Contig79:3    13
> >> 4   Contig79:4    14
> >> 5   Contig79:5    17
> >> 6   Contig79:6    20
> >> 7   Contig79:7    25
> >> 8   Contig79:8    27
> >> 9   Contig79:9    32
> >> 10 Contig79:10    33
> >> 11 Contig79:11    34
> >>> x$percent <- x$reads / max(x$reads) * 100
> >>> x
> >>id reads   percent
> >> 1   Contig79:1     4  11.76471
> >> 2   Contig79:2     8  23.52941
> >> 3   Contig79:3    13  38.23529
> >> 4   Contig79:4    14  41.17647
> >> 5   Contig79:5    17  50.00000
> >> 6   Contig79:6    20  58.82353
> >> 7   Contig79:7    25  73.52941
> >> 8   Contig79:8    27  79.41176
> >> 9   Contig79:9    32  94.11765
> >> 10 Contig79:10    33  97.05882
> >> 11 Contig79:11    34 100.00000
> >>
> >>
> >> On Thu, Nov 4, 2010 at 11:46 AM, Changbin Du 
> wrote:
> >>> HI, Dear R community,
> >>>
> >>> I have one data set like this,  What I want to do is to calculate the
> >>> cumulative coverage. The following codes works for small data set
> (#rows
> >> =
> >>> 100), but when feed the whole data set,  it still running after 24
> hours.
> >>> Can someone give some suggestions for long vector?
> >>>
> >>> id             reads
> >>> Contig79:1     4
> >>> Contig79:2     8
> >>> Contig79:3     13
> >>> Contig79:4     14
> >>> Contig79:5     17
> >>> Contig79:6     20
> >>> Contig79:7     25
> >>> Contig79:8     27
> >>> Contig79:9     32
> >>> Contig79:10    33
> >>> Contig79:11    34
> >>>
> >>>
> >>
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> >>> sep="\t", skip=0, header=F,fill=T) #
> >>> dim(matt)
> >>> [1] 3384766   2
> >>>
> >>> matt_plot<-function(matt, outputfile) {
> >>> names(matt)<-c("id","reads")
> >>>
> >>>  cover<-matt$reads
> >>>
> >>>
> >>> #calculate the cumulative coverage.
> >>> + cover_per<-function (data) {
> >>> + output<-numeric(0)
> >>> + for (i in data) {
> >>> +   x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
> >>> +   output<-c(output, x)
> >>> + }
> >>> + return(output)
> >>> + }
> >>>
> >>>
> >>>  result<-cover_per(cover)
> >>>
> >>>
> >>> Thanks so much!
> >>>
> >>>
> >>> --
> >>> Sincerely,
> >>> Changbin
> >>> --
> >>>
> >>>[[alternative HTML version deleted]]
> >>>
> >>> __
> >>> R-help@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html

Re: [R] how to work with long vectors

2010-11-04 Thread Martin Morgan
On 11/04/2010 09:45 AM, Changbin Du wrote:
> Thanks, Jim!
> 
> This is not what I want,  What I want is calculate the percentage of reads
> bigger or equal to that reads in each position.MY output is like the
> following:

Hi Changbin -- I might be repeating myself, but the Bioconductor
packages IRanges and GenomicRanges are designed to work with this sort
of data, and include 'coverage' functions that do what you're interested
in. Look into ?GRanges if interested.


http://bioconductor.org/help/bioc-views/release/BiocViews.html#___HighThroughputSequencing

Martin

> for row 1, all the reads is >= 4, so the cover_per is 100,
> for row 2, 99 % reads >=4, so the cover_per is 99.
>> head(final)
>   cover_per reads
> 1       100     4
> 2        99     8
> 3        98    13
> 4        97    14
> 5        96    17
> 6        95    20
> 
> I attached the input file with this email. This file is only 100 rows, very
> small. MY original data set is 3384766 rows.
> 
>  >
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> sep="\t", skip=0, header=F,fill=T) #
>> dim(matt)
> [1] 3384766   2
> 
> Thanks so much for your time!
> 
>> matt<-read.table("/home/cdu/operon/dimer5_0623/matt_test.txt", sep="\t",
> skip=0, header=F,fill=T) #
>> names(matt)<-c("id","reads")
>> dim(matt)
> [1] 100   2
>> cover<-matt$reads
>>
>>
>> #calculate the cumulative coverage.
>> cover_per<-function (data) {
> + output<-numeric(0)
> + for (i in data) {
> +   x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
> +   output<-c(output, x)
> + }
> + return(output)
> + }
>>
>>
>> result<-cover_per(cover)
>> head(result)
> [1] 100  99  98  97  96  95
>>
>> final<-data.frame(result, cover)
>>
>> names(final)<-c("cover_per", "reads")
>> head(final)
>   cover_per reads
> 1       100     4
> 2        99     8
> 3        98    13
> 4        97    14
> 5        96    17
> 6        95    20
> 
> 
> 
> 
> 
> On Thu, Nov 4, 2010 at 9:18 AM, jim holtman  wrote:
> 
>> Is this what you want:
>>
>>> x
>>id reads
>> 1   Contig79:1     4
>> 2   Contig79:2     8
>> 3   Contig79:3    13
>> 4   Contig79:4    14
>> 5   Contig79:5    17
>> 6   Contig79:6    20
>> 7   Contig79:7    25
>> 8   Contig79:8    27
>> 9   Contig79:9    32
>> 10 Contig79:10    33
>> 11 Contig79:11    34
>>> x$percent <- x$reads / max(x$reads) * 100
>>> x
>>id reads   percent
>> 1   Contig79:1     4  11.76471
>> 2   Contig79:2     8  23.52941
>> 3   Contig79:3    13  38.23529
>> 4   Contig79:4    14  41.17647
>> 5   Contig79:5    17  50.00000
>> 6   Contig79:6    20  58.82353
>> 7   Contig79:7    25  73.52941
>> 8   Contig79:8    27  79.41176
>> 9   Contig79:9    32  94.11765
>> 10 Contig79:10    33  97.05882
>> 11 Contig79:11    34 100.00000
>>
>>
>> On Thu, Nov 4, 2010 at 11:46 AM, Changbin Du  wrote:
>>> HI, Dear R community,
>>>
>>> I have one data set like this,  What I want to do is to calculate the
>>> cumulative coverage. The following codes works for small data set (#rows
>> =
>>> 100), but when feed the whole data set,  it still running after 24 hours.
>>> Can someone give some suggestions for long vector?
>>>
>>> id             reads
>>> Contig79:1     4
>>> Contig79:2     8
>>> Contig79:3     13
>>> Contig79:4     14
>>> Contig79:5     17
>>> Contig79:6     20
>>> Contig79:7     25
>>> Contig79:8     27
>>> Contig79:9     32
>>> Contig79:10    33
>>> Contig79:11    34
>>>
>>>
>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
>>> sep="\t", skip=0, header=F,fill=T) #
>>> dim(matt)
>>> [1] 3384766   2
>>>
>>> matt_plot<-function(matt, outputfile) {
>>> names(matt)<-c("id","reads")
>>>
>>>  cover<-matt$reads
>>>
>>>
>>> #calculate the cumulative coverage.
>>> + cover_per<-function (data) {
>>> + output<-numeric(0)
>>> + for (i in data) {
>>> +   x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
>>> +   output<-c(output, x)
>>> + }
>>> + return(output)
>>> + }
>>>
>>>
>>>  result<-cover_per(cover)
>>>
>>>
>>> Thanks so much!
>>>
>>>
>>> --
>>> Sincerely,
>>> Changbin
>>> --
>>>
>>>[[alternative HTML version deleted]]
>>>
>>> __
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?
>>
> 
> 
> 
> 
> 
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html

Re: [R] how to work with long vectors

2010-11-04 Thread Changbin Du
HI, Henrique,

Thanks for the great help!

I compared the output from your codes:
> te<-rev(100 * cumsum(matt$reads > 1) / length(matt$reads) )
> te
  [1] 100  99  98  97  96  95  94  93  92  91  90  89  88  87  86  85  84  83
 [19]  82  81  80  79  78  77  76  75  74  73  72  71  70  69  68  67  66  65
 [37]  64  63  62  61  60  59  58  57  56  55  54  53  52  51  50  49  48  47
 [55]  46  45  44  43  42  41  40  39  38  37  36  35  34  33  32  31  30  29
 [73]  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11
 [91]  10   9   8   7   6   5   4   3   2   1

The output from my code:
> result
  [1] 100  99  98  97  96  95  94  93  92  91  90  89  88  87  86  85  84  83
 [19]  82  81  80  79  79  77  77  77  74  73  72  71  70  70  68  67  67  65
 [37]  64  64  62  62  60  59  58  57  56  56  54  53  52  51  51  49  48  47
 [55]  46  45  45  43  42  41  40  39  38  37  36  35  34  33  32  31  30  29
 [73]  28  27  27  27  24  24  22  21  20  19  19  19  19  15  14  14  12  11
 [91]  10   9   8   7   7   5   4   3   2   1

There are no ties in your output, but there are ties in the data set. Your
code runs fast, but I think the results are not accurate. Thanks so much
for the great help!

> matt[c(1:35), ]
            id reads
1   Contig79:1     4
2   Contig79:2     8
;
;
22 Contig79:22    64
23 Contig79:23    64
24 Contig79:24    68
25 Contig79:25    68
26 Contig79:26    68

I also attached the testing file with this email. Thanks!
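One way to keep the speed of a vectorized solution while still handling the
ties would be this sketch (my own suggestion, not from the thread; it assumes
the reads column is sorted in increasing order, as in the attached test file,
and cover_per_sorted is a hypothetical name):

```r
# Assumes `reads` is sorted in increasing order (as in the attached file).
# match() returns the index of the first occurrence of each value, so
# tied values all map to the same position and get the same percentage.
cover_per_sorted <- function(reads) {
  n <- length(reads)
  first <- match(reads, reads)     # first position of each (possibly tied) value
  100 * (n - first + 1) / n        # percentage of reads >= reads[i]
}

cover_per_sorted(c(4, 8, 13, 13, 14))
# [1] 100  80  60  60  20
```

Unlike rev(100 * cumsum(matt$reads > 1) / length(matt$reads)), the two tied
13s both get 60 here.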



On Thu, Nov 4, 2010 at 9:12 AM, Henrique Dallazuanna wrote:

> Try this:
>
> rev(100 * cumsum(matt$reads > 1) / length(matt$reads) )
>
> On Thu, Nov 4, 2010 at 1:46 PM, Changbin Du  wrote:
>
>> HI, Dear R community,
>>
>> I have one data set like this,  What I want to do is to calculate the
>> cumulative coverage. The following codes works for small data set (#rows =
>> 100), but when feed the whole data set,  it still running after 24 hours.
>> Can someone give some suggestions for long vector?
>>
>> idreads
>> Contig79:14
>> Contig79:28
>> Contig79:313
>> Contig79:414
>> Contig79:517
>> Contig79:620
>> Contig79:725
>> Contig79:827
>> Contig79:932
>> Contig79:1033
>> Contig79:1134
>>
>>
>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
>> sep="\t", skip=0, header=F,fill=T) #
>> dim(matt)
>> [1] 3384766   2
>>
>> matt_plot<-function(matt, outputfile) {
>> names(matt)<-c("id","reads")
>>
>>  cover<-matt$reads
>>
>>
>> #calculate the cumulative coverage.
>> + cover_per<-function (data) {
>> + output<-numeric(0)
>> + for (i in data) {
>> +   x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
>> +   output<-c(output, x)
>> + }
>> + return(output)
>> + }
>>
>>
>>  result<-cover_per(cover)
>>
>>
>> Thanks so much!
>>
>>
>> --
>> Sincerely,
>> Changbin
>> --
>>
>>[[alternative HTML version deleted]]
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Henrique Dallazuanna
> Curitiba-Paraná-Brasil
> 25° 25' 40" S 49° 16' 22" O
>



-- 
Sincerely,
Changbin
--

Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creek, CA 94598
Phone: 925-927-2856
Contig79:1  4
Contig79:2  8
Contig79:3  13
Contig79:4  14
Contig79:5  17
Contig79:6  20
Contig79:7  25
Contig79:8  27
Contig79:9  32
Contig79:10 33
Contig79:11 34
Contig79:12 36
Contig79:13 39
Contig79:14 40
Contig79:15 44
Contig79:16 49
Contig79:17 55
Contig79:18 56
Contig79:19 59
Contig79:20 60
Contig79:21 62
Contig79:22 64
Contig79:23 64
Contig79:24 68
Contig79:25 68
Contig79:26 68
Contig79:27 70
Contig79:28 73
Contig79:29 76
Contig79:30 77
Contig79:31 78
Contig79:32 78
Contig79:33 79
Contig79:34 80
Contig79:35 80
Contig79:36 84
Contig79:37 87
Contig79:38 87
Contig79:39 88
Contig79:40 88
Contig79:41 89
Contig79:42 93
Contig79:43 94
Contig79:44 98
Contig79:45 99
Contig79:46 99
Contig79:47 102
Contig79:48 103
Contig79:49 108
Contig79:50 112
Contig79:51 112
Contig79:52 113
Contig79:53 116
Contig79:54 118
Contig79:55 120
Contig79:56 124
Contig79:57 124
Contig79:58 126
Contig79:59 128
Contig79:60 130
Contig79:61 133
Contig79:62 134
Contig79:63 136
Contig79:64 139
Contig79:65 144
Contig79:66 145
Contig79:67 146
Contig79:68 148
Contig79:69 149
Contig79:70 151
Contig79:71 156
Contig79:72 157
Contig79:73 158
Contig79

Re: [R] how to work with long vectors

2010-11-04 Thread Changbin Du
Thanks, Jim!

This is not what I want. What I want is to calculate the percentage of reads
bigger than or equal to the reads at each position. My output is like the
following:
for row 1, all the reads are >= 4, so the cover_per is 100;
for row 2, 99% of the reads are >= 8, so the cover_per is 99.
> head(final)
  cover_per reads
1       100     4
2        99     8
3        98    13
4        97    14
5        96    17
6        95    20

I attached the input file with this email. This file is only 100 rows, very
small. My original data set has 3384766 rows.

 >
matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
sep="\t", skip=0, header=F,fill=T) #
> dim(matt)
[1] 3384766   2

Thanks so much for your time!

> matt<-read.table("/home/cdu/operon/dimer5_0623/matt_test.txt", sep="\t",
skip=0, header=F,fill=T) #
> names(matt)<-c("id","reads")
> dim(matt)
[1] 100   2
> cover<-matt$reads
>
>
> #calculate the cumulative coverage.
> cover_per<-function (data) {
+ output<-numeric(0)
+ for (i in data) {
+   x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
+   output<-c(output, x)
+ }
+ return(output)
+ }
>
>
> result<-cover_per(cover)
> head(result)
[1] 100  99  98  97  96  95
>
> final<-data.frame(result, cover)
>
> names(final)<-c("cover_per", "reads")
> head(final)
  cover_per reads
1       100     4
2        99     8
3        98    13
4        97    14
5        96    17
6        95    20





On Thu, Nov 4, 2010 at 9:18 AM, jim holtman  wrote:

> Is this what you want:
>
> > x
>id reads
> 1   Contig79:1     4
> 2   Contig79:2     8
> 3   Contig79:3    13
> 4   Contig79:4    14
> 5   Contig79:5    17
> 6   Contig79:6    20
> 7   Contig79:7    25
> 8   Contig79:8    27
> 9   Contig79:9    32
> 10 Contig79:10    33
> 11 Contig79:11    34
> > x$percent <- x$reads / max(x$reads) * 100
> > x
>id reads   percent
> 1   Contig79:1     4  11.76471
> 2   Contig79:2     8  23.52941
> 3   Contig79:3    13  38.23529
> 4   Contig79:4    14  41.17647
> 5   Contig79:5    17  50.00000
> 6   Contig79:6    20  58.82353
> 7   Contig79:7    25  73.52941
> 8   Contig79:8    27  79.41176
> 9   Contig79:9    32  94.11765
> 10 Contig79:10    33  97.05882
> 11 Contig79:11    34 100.00000
>
>
> On Thu, Nov 4, 2010 at 11:46 AM, Changbin Du  wrote:
> > HI, Dear R community,
> >
> > I have one data set like this,  What I want to do is to calculate the
> > cumulative coverage. The following codes works for small data set (#rows
> =
> > 100), but when feed the whole data set,  it still running after 24 hours.
> > Can someone give some suggestions for long vector?
> >
> > id             reads
> > Contig79:1     4
> > Contig79:2     8
> > Contig79:3     13
> > Contig79:4     14
> > Contig79:5     17
> > Contig79:6     20
> > Contig79:7     25
> > Contig79:8     27
> > Contig79:9     32
> > Contig79:10    33
> > Contig79:11    34
> >
> >
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> > sep="\t", skip=0, header=F,fill=T) #
> > dim(matt)
> > [1] 3384766   2
> >
> > matt_plot<-function(matt, outputfile) {
> > names(matt)<-c("id","reads")
> >
> >  cover<-matt$reads
> >
> >
> > #calculate the cumulative coverage.
> > + cover_per<-function (data) {
> > + output<-numeric(0)
> > + for (i in data) {
> > +   x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
> > +   output<-c(output, x)
> > + }
> > + return(output)
> > + }
> >
> >
> >  result<-cover_per(cover)
> >
> >
> > Thanks so much!
> >
> >
> > --
> > Sincerely,
> > Changbin
> > --
> >
> >[[alternative HTML version deleted]]
> >
> > __
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
>



-- 
Sincerely,
Changbin
--

Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creek, CA 94598
Phone: 925-927-2856
Contig79:1  4
Contig79:2  8
Contig79:3  13
Contig79:4  14
Contig79:5  17
Contig79:6  20
Contig79:7  25
Contig79:8  27
Contig79:9  32
Contig79:10 33
Contig79:11 34
Contig79:12 36
Contig79:13 39
Contig79:14 40
Contig79:15 44
Contig79:16 49
Contig79:17 55
Contig79:18 56
Contig79:19 59
Contig79:20 60
Contig79:21 62
Contig79:22 64
Contig79:23 64
Contig79:24 68
Contig79:25 68
Contig79:26 68
Contig79:27 70
Contig79:28 73
Contig79:29 76
Contig79:30 77
Contig79:31

Re: [R] how to work with long vectors

2010-11-04 Thread Henrique Dallazuanna
Try this:

rev(100 * cumsum(matt$reads > 1) / length(matt$reads) )

On Thu, Nov 4, 2010 at 1:46 PM, Changbin Du  wrote:

> HI, Dear R community,
>
> I have one data set like this,  What I want to do is to calculate the
> cumulative coverage. The following codes works for small data set (#rows =
> 100), but when feed the whole data set,  it still running after 24 hours.
> Can someone give some suggestions for long vector?
>
> id             reads
> Contig79:1     4
> Contig79:2     8
> Contig79:3     13
> Contig79:4     14
> Contig79:5     17
> Contig79:6     20
> Contig79:7     25
> Contig79:8     27
> Contig79:9     32
> Contig79:10    33
> Contig79:11    34
>
>
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> sep="\t", skip=0, header=F,fill=T) #
> dim(matt)
> [1] 3384766   2
>
> matt_plot<-function(matt, outputfile) {
> names(matt)<-c("id","reads")
>
>  cover<-matt$reads
>
>
> #calculate the cumulative coverage.
> + cover_per<-function (data) {
> + output<-numeric(0)
> + for (i in data) {
> +   x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
> +   output<-c(output, x)
> + }
> + return(output)
> + }
>
>
>  result<-cover_per(cover)
>
>
> Thanks so much!
>
>
> --
> Sincerely,
> Changbin
> --
>
>[[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to work with long vectors

2010-11-04 Thread jim holtman
Is this what you want:

> x
id reads
1   Contig79:1     4
2   Contig79:2     8
3   Contig79:3    13
4   Contig79:4    14
5   Contig79:5    17
6   Contig79:6    20
7   Contig79:7    25
8   Contig79:8    27
9   Contig79:9    32
10 Contig79:10    33
11 Contig79:11    34
> x$percent <- x$reads / max(x$reads) * 100
> x
id reads   percent
1   Contig79:1     4  11.76471
2   Contig79:2     8  23.52941
3   Contig79:3    13  38.23529
4   Contig79:4    14  41.17647
5   Contig79:5    17  50.00000
6   Contig79:6    20  58.82353
7   Contig79:7    25  73.52941
8   Contig79:8    27  79.41176
9   Contig79:9    32  94.11765
10 Contig79:10    33  97.05882
11 Contig79:11    34 100.00000


On Thu, Nov 4, 2010 at 11:46 AM, Changbin Du  wrote:
> HI, Dear R community,
>
> I have one data set like this,  What I want to do is to calculate the
> cumulative coverage. The following codes works for small data set (#rows =
> 100), but when feed the whole data set,  it still running after 24 hours.
> Can someone give some suggestions for long vector?
>
> id                reads
> Contig79:1    4
> Contig79:2    8
> Contig79:3    13
> Contig79:4    14
> Contig79:5    17
> Contig79:6    20
> Contig79:7    25
> Contig79:8    27
> Contig79:9    32
> Contig79:10    33
> Contig79:11    34
>
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> sep="\t", skip=0, header=F,fill=T) #
> dim(matt)
> [1] 3384766       2
>
> matt_plot<-function(matt, outputfile) {
> names(matt)<-c("id","reads")
>
>  cover<-matt$reads
>
>
> #calculate the cumulative coverage.
> + cover_per<-function (data) {
> + output<-numeric(0)
> + for (i in data) {
> +           x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
> +           output<-c(output, x)
> +                 }
> + return(output)
> + }
>
>
>  result<-cover_per(cover)
>
>
> Thanks so much!
>
>
> --
> Sincerely,
> Changbin
> --
>
>        [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.