Re: [R] how to work with long vectors

2010-11-05 Thread Changbin Du
Hi Phil,

I used the following code and ran it overnight for 15 hours; this morning
I stopped it. It seems it is still not efficient.



matt <- read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
sep="\t", skip=0, header=F, fill=T)
names(matt) <- c("id", "reads")

dim(matt)
[1] 3384766       2

cover <- matt$reads

cover_per_2 <- function(data){
  l = length(data)
  output = numeric(l)
  for(i in 1:l) output[i] = sum(data >= data[i])
  100 * output / l
}

result3 <- cover_per_2(cover)












On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du changb...@gmail.com wrote:

 Thanks Phil, that is great! I WILL try this and let you know how it goes.




 On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector spec...@stat.berkeley.edu wrote:

 Changbin -
   Does

100 * sapply(matt$reads, function(x) sum(matt$reads >= x)) / length(matt$reads)

 give what you want?

By the way, if you want to use a loop (there's nothing wrong with that),
 then try to avoid the most common mistake that people make with loops in R:
 having your result grow inside the loop.  Here's a better way to use a loop
 to solve your problem:

 cover_per_1 <- function(data){
   l = length(data)
   output = numeric(l)
   for(i in 1:l) output[i] = 100 * sum(ifelse(data >= data[i], 1, 0)) / length(data)
   output
 }

 Using some random data, and comparing to your original cover_per function:

  dat = rnorm(1000)
 system.time(one <- cover_per(dat))

   user  system elapsed
  0.816   0.000   0.824

 system.time(two <- cover_per_1(dat))

   user  system elapsed
  0.792   0.000   0.805

 Not that big a speedup, but it does increase quite a bit as the problem
 gets larger.

 There are two obvious ways to speed up your function:
   1)  Eliminate the ifelse function, since automatic coercion from
   logical to numeric does the same thing.
   2)  Multiply by 100 and divide by the length outside the loop:

 cover_per_2 <- function(data){
   l = length(data)
   output = numeric(l)
   for(i in 1:l) output[i] = sum(data >= data[i])
   100 * output / l
 }

  system.time(three <- cover_per_2(dat))

   user  system elapsed
  0.024   0.000   0.027

 That makes the loop just about equivalent to the sapply solution:

  system.time(four <- 100 * sapply(dat, function(x) sum(dat >= x)) / length(dat))

   user  system elapsed
  0.024   0.000   0.026

- Phil Spector
 Statistical Computing Facility
 Department of Statistics
 UC Berkeley
 spec...@stat.berkeley.edu









 On Thu, 4 Nov 2010, Changbin Du wrote:

  HI, Dear R community,

 I have one data set like this. What I want to do is to calculate the
 cumulative coverage. The following code works for a small data set (#rows =
 100), but when fed the whole data set, it is still running after 24 hours.
 Can someone give some suggestions for long vectors?

 id             reads
 Contig79:1     4
 Contig79:2     8
 Contig79:3     13
 Contig79:4     14
 Contig79:5     17
 Contig79:6     20
 Contig79:7     25
 Contig79:8     27
 Contig79:9     32
 Contig79:10    33
 Contig79:11    34


 matt <- read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
 sep="\t", skip=0, header=F, fill=T)
 dim(matt)
 [1] 3384766       2

 matt_plot <- function(matt, outputfile) {
 names(matt) <- c("id", "reads")

 cover <- matt$reads


 # calculate the cumulative coverage
 cover_per <- function(data) {
   output <- numeric(0)
   for (i in data) {
     x <- 100 * sum(ifelse(data >= i, 1, 0)) / length(data)
     output <- c(output, x)
   }
   return(output)
 }


 result <- cover_per(cover)


 Thanks so much!


 --
 Sincerely,
 Changbin
 --


 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.






-- 
Sincerely,
Changbin
--

Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creek, CA 94598
Phone: 925-927-2856



Re: [R] how to work with long vectors

2010-11-05 Thread Martin Morgan
On 11/05/2010 09:13 AM, Changbin Du wrote:
[snip]


Hi Changbin,

If I understand correctly, your contigs 'start' at position 1, and have
'width' equal to matt$reads. You'd like to know the coverage at the last
covered location of each contig in matt$reads.

## first time only
source("http://bioconductor.org")
biocLite("IRanges")

##
library(IRanges)
contigs = IRanges(start=1, width=matt$reads)
cvg = coverage(contigs)  ## an Rle summarizing coverage, from position 1
as.vector(cvg[matt$reads]) / nrow(matt)  ## at the end of each contig

for a larger data set:

 matt = data.frame(reads=ceiling(as.integer(runif(3384766, 1, 100))))
 contigs = IRanges(start=1, width=matt$reads)
 system.time(cvg <- coverage(contigs))
   user  system elapsed
  5.145   0.050   5.202

Martin





-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



Re: [R] how to work with long vectors

2010-11-05 Thread Martin Morgan
On 11/05/2010 09:42 AM, Martin Morgan wrote:

 ## first time only
 source("http://bioconductor.org")

oops, source("http://bioconductor.org/biocLite.R")



Re: [R] how to work with long vectors

2010-11-05 Thread Changbin Du
Thanks Martin! I will try it and will let you guys know how it goes.


On Fri, Nov 5, 2010 at 9:42 AM, Martin Morgan mtmor...@fhcrc.org wrote:

 [snip]


Re: [R] how to work with long vectors

2010-11-05 Thread William Dunlap
The following cover_per_3 uses sorting to solve
the problem more quickly.  It still has room
for improvement.

cover_per_3 <- function (data)
{
    n <- length(data)
    o <- rev(order(data))
    sdata <- data[o]
    r <- rle(sdata)$lengths
    output <- numeric(n)
    output[o] <- rep(cumsum(r), r)
    100 * output/n
}

(The ecdf function would probably also do the
job quickly.)
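A minimal sketch of that ecdf idea (my own illustration, not code from the thread): for any value v, the fraction of the data that is >= v equals the fraction of the negated data that is <= -v, so a single call to ecdf() on -data does the whole computation vectorized.

```r
# ecdf-based version (an illustrative sketch, not from the thread).
# For each value v, #(data >= v) == #(-data <= -v), so we can reuse
# the empirical CDF of the negated data.
cover_per_ecdf <- function(data) {
  100 * ecdf(-data)(-data)
}

# quick check on a tiny vector: 5, 4, 3, 2, 1 of the 5 values
# are >= each element in turn
z <- c(4, 8, 13, 14, 17)
cover_per_ecdf(z)  # 100 80 60 40 20
```

Like the sort-based cover_per_3, this is O(n log n), so it should also handle a 3.4-million-row vector in seconds.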

When trying to work on problems like this I find
it most fruitful to work on smaller datasets and
see how the time grows with the size of the data,
instead of seeing how many days it takes on a huge
dataset.  E.g., the following compares times for
your original function, Phil Spector's simple cleanup
of your function, and the sort-based approach for
vectors of length 5 and 10 thousand.

 z <- sample(5e3, size=1e4, replace=TRUE) ; print(system.time(v <-
cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
print(system.time(v_3 <- cover_per_3(z)))
   user  system elapsed 
  38.21    0.00   38.41 
   user  system elapsed 
   0.86    0.00    0.86 
   user  system elapsed 
      0       0       0 
 identical(v_3, v)
[1] TRUE
 z <- sample(1e4, size=2e4, replace=TRUE) ; print(system.time(v <-
cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
print(system.time(v_3 <- cover_per_3(z)))
   user  system elapsed 
 158.48    0.07  159.31 
   user  system elapsed 
   3.23    0.00    3.25 
   user  system elapsed 
   0.02    0.00    0.02 
 identical(v_3, v)
[1] TRUE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Changbin Du
 Sent: Friday, November 05, 2010 9:14 AM
 To: Phil Spector
 Cc: r-help@r-project.org
 Subject: Re: [R] how to work with long vectors
 
 [snip]

Re: [R] how to work with long vectors

2010-11-05 Thread Changbin Du
Thanks, William. That taught me a lesson.


On Fri, Nov 5, 2010 at 9:58 AM, William Dunlap wdun...@tibco.com wrote:

  [snip]

Re: [R] how to work with long vectors

2010-11-05 Thread William Dunlap
There was a numerical typo below: I said
the sample sizes were 5 and 10 thousand;
I should have said 10 and 20 thousand
(the size argument to sample()).

Also, I timed cover_per_2 and _3 for size
200,000 and got times of 338 and 0.12 seconds,
respectively.  Growing the problem by a factor
of 10 made cover_per_2 use 100 times more time
and cover_per_3 c. 10 times more (the times are
too small to get an accurate ratio).
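The doubling experiment described above can be sketched as follows (the cover_per_2 and cover_per_3 definitions are copied from earlier in the thread; the sizes chosen here are illustrative, and timings will vary by machine):

```r
# Doubling experiment: time the quadratic loop (cover_per_2) and the
# sort-based version (cover_per_3) as n doubles, and check they agree.
cover_per_2 <- function(data) {
  l <- length(data)
  output <- numeric(l)
  for (i in 1:l) output[i] <- sum(data >= data[i])
  100 * output / l
}

cover_per_3 <- function(data) {
  n <- length(data)
  o <- rev(order(data))         # indices of data in decreasing order
  sdata <- data[o]
  r <- rle(sdata)$lengths       # run lengths of tied values
  output <- numeric(n)
  output[o] <- rep(cumsum(r), r)  # count of elements >= each value
  100 * output / n
}

for (n in c(5e3, 1e4, 2e4)) {
  z <- sample(n / 2, size = n, replace = TRUE)
  t2 <- system.time(v2 <- cover_per_2(z))["elapsed"]
  t3 <- system.time(v3 <- cover_per_3(z))["elapsed"]
  stopifnot(all(v2 == v3))      # both versions give the same answer
  cat(sprintf("n = %5d   cover_per_2: %6.2fs   cover_per_3: %6.2fs\n",
              n, t2, t3))
}
```

If cover_per_2's time roughly quadruples each time n doubles while cover_per_3's roughly doubles, that confirms the O(n^2) vs. O(n log n) growth described above.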

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of William Dunlap
 Sent: Friday, November 05, 2010 9:58 AM
 To: Changbin Du
 Cc: r-help@r-project.org
 Subject: Re: [R] how to work with long vectors
 
  [snip]

Re: [R] how to work with long vectors

2010-11-04 Thread jim holtman
Is this what you want:

 x
             id reads
1    Contig79:1     4
2    Contig79:2     8
3    Contig79:3    13
4    Contig79:4    14
5    Contig79:5    17
6    Contig79:6    20
7    Contig79:7    25
8    Contig79:8    27
9    Contig79:9    32
10  Contig79:10    33
11  Contig79:11    34
 x$percent <- x$reads / max(x$reads) * 100
 x
             id reads   percent
1    Contig79:1     4  11.76471
2    Contig79:2     8  23.52941
3    Contig79:3    13  38.23529
4    Contig79:4    14  41.17647
5    Contig79:5    17  50.00000
6    Contig79:6    20  58.82353
7    Contig79:7    25  73.52941
8    Contig79:8    27  79.41176
9    Contig79:9    32  94.11765
10  Contig79:10    33  97.05882
11  Contig79:11    34 100.00000


On Thu, Nov 4, 2010 at 11:46 AM, Changbin Du changb...@gmail.com wrote:
  [snip]




-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



Re: [R] how to work with long vectors

2010-11-04 Thread Henrique Dallazuanna
Try this:

rev(100 * cumsum(matt$reads > 1) / length(matt$reads))

On Thu, Nov 4, 2010 at 1:46 PM, Changbin Du changb...@gmail.com wrote:

 [snip]




-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O



Re: [R] how to work with long vectors

2010-11-04 Thread Changbin Du
Thanks, Jim!

This is not what I want. What I want is to calculate the percentage of reads
bigger than or equal to the reads at each position. My output is like the
following:
for row 1, all the reads are >= 4, so the cover_per is 100;
for row 2, 99% of the reads are >= 8, so the cover_per is 99.
 head(final)
  cover_per reads
1       100     4
2        99     8
3        98    13
4        97    14
5        96    17
6        95    20

I attached the input file with this email. This file is only 100 rows, very
small. My original data set is 3,384,766 rows.

 
matt <- read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
sep="\t", skip=0, header=FALSE, fill=TRUE)
 dim(matt)
[1] 3384766   2

Thanks so much for your time!

 matt <- read.table("/home/cdu/operon/dimer5_0623/matt_test.txt", sep="\t",
skip=0, header=FALSE, fill=TRUE)
 names(matt) <- c("id", "reads")
 dim(matt)
[1] 100   2
 cover <- matt$reads


 # calculate the cumulative coverage
 cover_per <- function(data) {
+   output <- numeric(0)
+   for (i in data) {
+     x <- 100 * sum(ifelse(data >= i, 1, 0)) / length(data)
+     output <- c(output, x)
+   }
+   return(output)
+ }


 result <- cover_per(cover)
 head(result)
[1] 100  99  98  97  96  95

 final <- data.frame(result, cover)

 names(final) <- c("cover_per", "reads")
 head(final)
  cover_per reads
1       100     4
2        99     8
3        98    13
4        97    14
5        96    17
6        95    20
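For reference, the quantity being computed here, the percentage of reads greater than or equal to each entry, can be obtained in one fully vectorized step. A sketch using base R's `rank` (not from the thread; `cover_per_vec` is a name made up for illustration):

```r
# Percentage of reads >= each value, fully vectorized.
# rank(..., ties.method = "min") returns 1 + (count of strictly smaller
# values), so n - rank + 1 counts the values >= each element.
cover_per_vec <- function(reads) {
  n <- length(reads)
  100 * (n - rank(reads, ties.method = "min") + 1) / n
}

cover_per_vec(c(4, 8, 13, 14))  # 100 75 50 25
```

This handles ties correctly and runs in O(n log n), so it should be practical even at 3,384,766 rows.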





On Thu, Nov 4, 2010 at 9:18 AM, jim holtman jholt...@gmail.com wrote:

 Is this what you want:

  x
             id reads
 1   Contig79:1     4
 2   Contig79:2     8
 3   Contig79:3    13
 4   Contig79:4    14
 5   Contig79:5    17
 6   Contig79:6    20
 7   Contig79:7    25
 8   Contig79:8    27
 9   Contig79:9    32
 10 Contig79:10    33
 11 Contig79:11    34
  x$percent <- x$reads / max(x$reads) * 100
  x
             id reads   percent
 1   Contig79:1     4  11.76471
 2   Contig79:2     8  23.52941
 3   Contig79:3    13  38.23529
 4   Contig79:4    14  41.17647
 5   Contig79:5    17  50.00000
 6   Contig79:6    20  58.82353
 7   Contig79:7    25  73.52941
 8   Contig79:8    27  79.41176
 9   Contig79:9    32  94.11765
 10 Contig79:10    33  97.05882
 11 Contig79:11    34 100.00000


 On Thu, Nov 4, 2010 at 11:46 AM, Changbin Du changb...@gmail.com wrote:
 



 --
 Jim Holtman
 Cincinnati, OH
 +1 513 646 9390

 What is the problem that you are trying to solve?




-- 
Sincerely,
Changbin
--

Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creek, CA 94598
Phone: 925-927-2856
Contig79:1  4
Contig79:2  8
Contig79:3  13
Contig79:4  14
Contig79:5  17
Contig79:6  20
Contig79:7  25
Contig79:8  27
Contig79:9  32
Contig79:10 33
Contig79:11 34
Contig79:12 36
Contig79:13 39
Contig79:14 40
Contig79:15 44
Contig79:16 49
Contig79:17 55
Contig79:18 56
Contig79:19 59
Contig79:20 60
Contig79:21 62
Contig79:22 64
Contig79:23 64
Contig79:24 68
Contig79:25 68
Contig79:26 68
Contig79:27 70
Contig79:28 73
Contig79:29 76
Contig79:30 77
Contig79:31 78
Contig79:32 78
Contig79:33 79
Contig79:34 80
Contig79:35 80
Contig79:36 84
Contig79:37 87
Contig79:38 87
Contig79:39 88
Contig79:40 88
Contig79:41 89

Re: [R] how to work with long vectors

2010-11-04 Thread Changbin Du
HI, Henrique,

Thanks for the great help!

I compared the output from your code:
 te <- rev(100 * cumsum(matt$reads >= 1) / length(matt$reads))
 te
  [1] 100  99  98  97  96  95  94  93  92  91  90  89  88  87  86  85  84  83
 [19]  82  81  80  79  78  77  76  75  74  73  72  71  70  69  68  67  66  65
 [37]  64  63  62  61  60  59  58  57  56  55  54  53  52  51  50  49  48  47
 [55]  46  45  44  43  42  41  40  39  38  37  36  35  34  33  32  31  30  29
 [73]  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11
 [91]  10   9   8   7   6   5   4   3   2   1

 and the output from my code:
 result
  [1] 100  99  98  97  96  95  94  93  92  91  90  89  88  87  86  85  84  83
 [19]  82  81  80  79  79  77  77  77  74  73  72  71  70  70  68  67  67  65
 [37]  64  64  62  62  60  59  58  57  56  56  54  53  52  51  51  49  48  47
 [55]  46  45  45  43  42  41  40  39  38  37  36  35  34  33  32  31  30  29
 [73]  28  27  27  27  24  24  22  21  20  19  19  19  19  15  14  14  12  11
 [91]  10   9   8   7   7   5   4   3   2   1

There are no ties in your output, but look at the data set: there are ties
in the data. Your code runs fast, but I think the results are not accurate.
Thanks so much for the great help!

 matt[c(1:35), ]
            id reads
1   Contig79:1     4
2   Contig79:2     8
;
;
22 Contig79:22    64
23 Contig79:23    64
24 Contig79:24    68
25 Contig79:25    68
26 Contig79:26    68

I also attached the testing file with this email. Thanks!
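One way to keep the speed of the reverse-cumsum idea while respecting ties is to accumulate counts per distinct value and map them back onto the rows. A sketch (not from the thread; `cover_per_ties` is a name made up for illustration):

```r
# Tie-aware variant of the reverse-cumsum idea:
# count the reads >= each *distinct* value, then map those counts
# back onto the original (unsorted, possibly tied) vector.
cover_per_ties <- function(reads) {
  n   <- length(reads)
  tab <- table(reads)            # counts per distinct value, ascending
  ge  <- rev(cumsum(rev(tab)))   # for each distinct value: count of reads >= it
  100 * as.numeric(ge[match(reads, names(tab))]) / n
}

cover_per_ties(c(4, 8, 8))  # the tied 8s both get the same percentage
```

Unlike a plain `rev(cumsum(...))` over rows, tied values here receive identical percentages.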



On Thu, Nov 4, 2010 at 9:12 AM, Henrique Dallazuanna www...@gmail.com wrote:

 Try this:

 rev(100 * cumsum(matt$reads >= 1) / length(matt$reads))

 On Thu, Nov 4, 2010 at 1:46 PM, Changbin Du changb...@gmail.com wrote:





 --
 Henrique Dallazuanna
 Curitiba-Paraná-Brasil
 25° 25' 40 S 49° 16' 22 O




-- 
Sincerely,
Changbin
--

Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creek, CA 94598
Phone: 925-927-2856
Contig79:1  4
Contig79:2  8
Contig79:3  13
Contig79:4  14
Contig79:5  17
Contig79:6  20
Contig79:7  25
Contig79:8  27
Contig79:9  32
Contig79:10 33
Contig79:11 34
Contig79:12 36
Contig79:13 39
Contig79:14 40
Contig79:15 44
Contig79:16 49
Contig79:17 55
Contig79:18 56
Contig79:19 59
Contig79:20 60
Contig79:21 62
Contig79:22 64
Contig79:23 64
Contig79:24 68
Contig79:25 68
Contig79:26 68
Contig79:27 70
Contig79:28 73
Contig79:29 76
Contig79:30 77
Contig79:31 78
Contig79:32 78
Contig79:33 79
Contig79:34 80
Contig79:35 80
Contig79:36 84
Contig79:37 87
Contig79:38 87
Contig79:39 88
Contig79:40 88
Contig79:41 89
Contig79:42 93
Contig79:43 94
Contig79:44 98
Contig79:45 99
Contig79:46 99
Contig79:47 102
Contig79:48 103
Contig79:49 108
Contig79:50 112
Contig79:51 112
Contig79:52 113
Contig79:53 116
Contig79:54 118
Contig79:55 120
Contig79:56 124
Contig79:57 124
Contig79:58 126
Contig79:59 128
Contig79:60 130
Contig79:61 133
Contig79:62 134
Contig79:63 136
Contig79:64 139
Contig79:65 144
Contig79:66 145
Contig79:67 146
Contig79:68 148
Contig79:69 149
Contig79:70 151
Contig79:71 156
Contig79:72 157
Contig79:73 158
Contig79:74 159
Contig79:75 159
Contig79:76 159
Contig79:77 160
Contig79:78 160
Contig79:79 161
Contig79:80 163

Re: [R] how to work with long vectors

2010-11-04 Thread Martin Morgan
On 11/04/2010 09:45 AM, Changbin Du wrote:
 Thanks, Jim!
 
 This is not what I want. What I want is to calculate the percentage of
 reads bigger than or equal to the reads at each position. My output is
 like the following:

Hi Changbin -- I might be repeating myself, but the Bioconductor
packages IRanges and GenomicRanges are designed to work with this sort
of data, and include 'coverage' functions that do what you're interested
in. Look into ?GRanges if interested.


http://bioconductor.org/help/bioc-views/release/BiocViews.html#___HighThroughputSequencing

Martin



-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


Re: [R] how to work with long vectors

2010-11-04 Thread Changbin Du
Thanks Martin, I will try this.

On Thu, Nov 4, 2010 at 10:06 AM, Martin Morgan mtmor...@fhcrc.org wrote:


Re: [R] how to work with long vectors

2010-11-04 Thread Phil Spector

Changbin -
   Does

100 * sapply(matt$reads, function(x) sum(matt$reads >= x)) / length(matt$reads)

give what you want?

By the way, if you want to use a loop (there's nothing wrong with that),
then try to avoid the most common mistake that people make with loops in R:
having your result grow inside the loop.  Here's a better way to use a loop
to solve your problem:

cover_per_1 <- function(data){
   l = length(data)
   output = numeric(l)
   for(i in 1:l) output[i] = 100 * sum(ifelse(data >= data[i], 1, 0)) / length(data)
   output
}

Using some random data, and comparing to your original cover_per function:


dat = rnorm(1000)
system.time(one <- cover_per(dat))

   user  system elapsed
  0.816   0.000   0.824 

system.time(two <- cover_per_1(dat))

   user  system elapsed
  0.792   0.000   0.805

Not that big a speedup, but it does increase quite a bit as the problem gets
larger.

There are two obvious ways to speed up your function:
   1)  Eliminate the ifelse function, since automatic coercion from
   logical to numeric does the same thing.
   2)  Multiply by 100 and divide by the length outside the loop:

cover_per_2 <- function(data){
   l = length(data)
   output = numeric(l)
   for(i in 1:l) output[i] = sum(data >= data[i])
   100 * output / l
}


system.time(three <- cover_per_2(dat))

   user  system elapsed
  0.024   0.000   0.027

That makes the loop just about equivalent to the sapply solution:


system.time(four <- 100 * sapply(dat, function(x) sum(dat >= x)) / length(dat))

   user  system elapsed
  0.024   0.000   0.026
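Both the improved loop and the sapply version still make n^2 comparisons, which is why the full 3,384,766-row vector takes hours. A sort-plus-binary-search approach brings this down to O(n log n); a sketch (not from the thread; `cover_per_3` is a name made up for illustration):

```r
# For each element: n minus the number of strictly smaller elements,
# found by binary search (findInterval) in a sorted copy of the data.
cover_per_3 <- function(data) {
  n <- length(data)
  s <- sort(data)
  # left.open = TRUE makes findInterval count elements of s strictly below each value
  100 * (n - findInterval(data, s, left.open = TRUE)) / n
}
```

On a vector of a few million values this should finish in seconds rather than hours.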

- Phil Spector
 Statistical Computing Facility
 Department of Statistics
 UC Berkeley
 spec...@stat.berkeley.edu








On Thu, 4 Nov 2010, Changbin Du wrote:







Re: [R] how to work with long vectors

2010-11-04 Thread Changbin Du
Thanks Phil, that is great! I will try this and let you know how it goes.



On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector spec...@stat.berkeley.edu wrote:





-- 
Sincerely,
Changbin
--

Changbin Du
DOE Joint Genome Institute
Bldg 400 Rm 457
2800 Mitchell Dr
Walnut Creek, CA 94598
Phone: 925-927-2856

