Re: [Bioc-sig-seq] `+` for GenomeData and coverage from several lanes

Patrick Aboyoun Sat, 04 Jul 2009 21:45:00 -0700

Simon,

I have a further update. Martin and I have added methods to performthe "+" operation for GenomeData objects containing coverageinformation in succinct function calls.

I worked on beefing up the low-level IRanges Sequence class, which hasaround 80 subclasses including GenomeData, GenomeDataList, RleList, toinclude functional programming tools like Reduce, Filter, Find, Map,Position, and mapply. These methods behave like the standard methodsin base so if you have two GenomeData objects containing Rle objects,you can use the Map function with f = "+" to perform element-wiseaddition. As with the Map function from base, the Map method forSequence returns a list object. Martin has also added a gdreducefunction to BSgenome that behaves like the Map function except itreturns a GenomeData object. I need to talk to Martin to see if hesees a need for a gdreduce function when we can simply make a Mapmethod for GenomeData objects that does the same thing. My preferenceis to replace gdreduce with a Map method for GenomeData since thereisn't much benefit to having two functions that are designed toperform the same operation.

suppressMessages(library(BSgenome))
x <- GenomeData(list(chr1 = Rle(1:10), chr2 = Rle(10:1)))
Map("+", x, x)

$chr1
  'integer' Rle instance of length 10 with 10 runs
  Lengths:  1 1 1 1 1 1 1 1 1 1
  Values :  2 4 6 8 10 12 14 16 18 20

$chr2
  'integer' Rle instance of length 10 with 10 runs
  Lengths:  1 1 1 1 1 1 1 1 1 1
  Values :  20 18 16 14 12 10 8 6 4 2

gdreduce("+", x, x)

A GenomeData instance
chromosomes(2): chr1 chr2

gdreduce("+", x, x)[[1]]

  'integer' Rle instance of length 10 with 10 runs
  Lengths:  1 1 1 1 1 1 1 1 1 1
  Values :  2 4 6 8 10 12 14 16 18 20

gdreduce("+", x, x)[[2]]

  'integer' Rle instance of length 10 with 10 runs
  Lengths:  1 1 1 1 1 1 1 1 1 1
  Values :  20 18 16 14 12 10 8 6 4 2

sessionInfo()

R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BSgenome_1.13.8    Biostrings_2.13.23 IRanges_1.3.30

loaded via a namespace (and not attached):
[1] Biobase_2.5.4



Quoting Patrick Aboyoun <[email protected]>:

Simon,
I checked in a speedup for coverage calculations to the IRanges
package. It should be about 30 - 50% faster now. On my laptop, it took
around 50 seconds to calculate the coverage of 29 million ranges over a
1e8 sequence domain on my laptop. Under the old coverage code, this
calculation took about 100 seconds.

suppressMessages(library(IRanges))
N <- 29e6
set.seed(0)
x <- IRanges(start=sample(1e8 - 36, N, replace = TRUE), width = 36)
system.time(coverage(x, width = 1e8))

   user  system elapsed
 49.747   4.262  54.305

sessionInfo()

R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] IRanges_1.3.28



Patrick


Quoting Simon Anders <[email protected]>:

Hi

Patrick Aboyoun wrote:

Simon,
Could you provide some profiling information to show where thebottlenecks are?


I don't know if there is really a clear bottleneck. 9 minutes to
calculate the coverage of 29 mio reads is 20 seconds per mio reads;
this is probably what the coverage function always needed. So, in the
code given in my mail, the summing up of the GenomeData objects is just
awkward but not a performance penalty.

I am also wondering if I should be building up the
functionality for RleList, which could have `+` and other Math
operations. We have a lot of classes in the Sequence space and it is
not clear yet which classes are going to be part of the winning
solution.


I'd say that this is the main issue. I discover new classes every day.
You just mentioned 'RleList', Michael mentions 'GenomeDataList', and
Martin has another way to go again.

I'm sorry to say that, at least for me, this has become hopelessly
confusing, and I imagine that many other users fell the same. You write
that "it is not clear yet which classes are going to be part of the
winning solution" and I completely agree that it makes more sense to
have a few good classes rather than adding functionality to any class
on demand. So, maybe don't bother with a `+` operation for now.

Best regards
 Simon


_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] `+` for GenomeData and coverage from several lanes

Reply via email to