Re: [Bioc-devel] coverage as IntegerList

Hervé Pagès Tue, 11 Feb 2014 18:59:24 -0800

Hi,

Why not. But I don't expect a significant speedup. Here is why:


There are actually 2 algos implemented by coverage(): one called "sort"
that computes the coverage directly into "Rle space", and one called
"hash" that computes the coverage into an ordinary integer vector and
turns this vector into an Rle at the end (this conversion is cheap).
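To make the difference between the two strategies concrete, here is a minimal Python sketch. It is only an illustration, not the code behind coverage(). The function names, the 1-based inclusive ranges, and the (value, run length) pairs standing in for an Rle are all assumptions made for this example. The "hash"-style function fills an ordinary integer vector and run-length encodes it at the end. The "sort"-style function sorts the range boundaries and emits runs directly.

```python
from itertools import groupby


def append_run(runs, value, n):
    """Append a run of n copies of value, merging with the previous run if equal."""
    if n <= 0:
        return
    if runs and runs[-1][0] == value:
        runs[-1] = (value, runs[-1][1] + n)
    else:
        runs.append((value, n))


def coverage_dense(starts, ends, length):
    """'hash'-style sketch: work in an ordinary integer vector, encode at the end.

    Assumes 1-based inclusive ranges that lie within 1..length.
    """
    diff = [0] * (length + 1)
    for s, e in zip(starts, ends):
        diff[s - 1] += 1   # coverage goes up where the range starts
        diff[e] -= 1       # and comes back down just past its end
    cov, depth = [], 0
    for d in diff[:length]:
        depth += d
        cov.append(depth)
    # Final conversion to run-length form (a single linear pass).
    return [(value, len(list(run))) for value, run in groupby(cov)]


def coverage_sorted(starts, ends, length):
    """'sort'-style sketch: sort the range boundaries and build the runs directly.

    Never allocates a vector of size `length`; same input assumptions as above.
    """
    events = sorted([(s - 1, 1) for s in starts] + [(e, -1) for e in ends])
    runs, pos, depth = [], 0, 0
    for p, delta in events:
        if p > pos:
            append_run(runs, depth, p - pos)
            pos = p
        depth += delta
    append_run(runs, depth, length - pos)  # trailing run, if any
    return runs


starts, ends = [1, 3, 8], [4, 5, 8]
print(coverage_dense(starts, ends, 10))
print(coverage_sorted(starts, ends, 10))
```

Both functions return the same runs. The dense version does work proportional to the length of the coverage vector, while the sorted version does work that depends on the number of ranges, which is why one suits dense data and the other sparse data.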

By default coverage() tries to pick the appropriate algo automatically:
"hash" when the data are dense, "sort" otherwise. The criterion used to
decide whether the data are dense is a little bit naive (and could maybe
be improved?): it just compares the number of ranges in the input with
the length of the coverage vector to return. If number of ranges >
0.25 * length-of-coverage-vector, the data are considered dense. Clearly
this formula is kind of arbitrary and I'm sure it could be tweaked a
little bit to do a better job.
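The rule described above can be written down in a few lines. This is a hedged Python sketch of the stated criterion only. The function name pick_method and the threshold parameter are hypothetical, introduced for illustration, and are not part of coverage().

```python
def pick_method(n_ranges, coverage_length, threshold=0.25):
    """Return "hash" if the data look dense by the stated rule, else "sort".

    Dense means: number of ranges > threshold * length of the coverage vector.
    The 0.25 default is the value quoted in the message above.
    """
    return "hash" if n_ranges > threshold * coverage_length else "sort"


print(pick_method(300, 1000))  # more than 250 ranges over 1000 positions
print(pick_method(10, 1000))   # only a handful of ranges
```

Note that the rule looks only at the number of ranges, not at their widths, which is one place where it could be refined.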

Note that the user can choose the algo to use via the 'method' arg.
If you know your data are dense, use method="hash". It will be almost
as fast as if coverage() were returning an IntegerList, except that
the coverage is turned into an Rle (but only at the end). I would
expect this final coercion to be nothing compared to the computation
of the coverage itself. This would need to be confirmed by some
profiling though.

Anyway maybe there are other benefits of returning an IntegerList:
smaller memory footprint when the data are dense,
more beginner-friendly container, maybe slightly faster
downstream computations (can this be a bottleneck?), others?

H.


On 02/11/2014 05:06 PM, Michael Lawrence wrote:

Right, it would be a choice. The compression is not worth it when the data
are dense.


On Tue, Feb 11, 2014 at 4:18 PM, Kasper Daniel Hansen <
kasperdanielhan...@gmail.com> wrote:

Sounds reasonable, _especially_ if you think it is faster. You're the
expert. I assume you will allow the user to choose the return value?
Having the option of Rle's is still nice, for some use cases.


On Tue, Feb 11, 2014 at 7:12 PM, Michael Lawrence <
lawrence.mich...@gene.com> wrote:

Just a thought: support coverage calculation directly to IntegerList. Will
very often be faster than RleList, especially when limiting to regions
without long runs of zeros, and with WGS data.

Something to put on the TODO list?

Michael


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
