On 06/28/2016 03:54 AM, Johnston, Jeffrey wrote:
During the BioC developer day, Hervé brought up the idea of possibly extending 
the concept of GenomicViews to other use cases. I'd like to share how we 
currently use GenomicRanges, Views and RleLists to perform certain analyses, 
and hopefully demonstrate that a way to directly use Views with GRanges would 
be quite useful.

As an example, let's say we have 2 ChIP-seq samples (as BigWig files) and a set 
of genomic ranges we are interested in (as a BED file). We want to find the sum 
of the ChIP-seq signal found in our regions of interest for each of the two 
samples. We'd first import the BED file as a GRanges object. Annotating the 
GRanges with two metadata columns representing the ChIP-seq signal for the two 
samples seems like a straightforward use for Views (in particular, viewSums), 
but it is a bit convoluted.

The main problem is that Views work with RangesLists, not GRanges. Coercing a 
GRanges to a RangesList possibly disrupts the order, so we have to first 
reorder the GRanges to match the order it will be given after coercion (keeping 
track so we can later revert back to the original order):

regions <- import("regions_of_interest.bed")
sample1_cov <- import("sample1.bw", as="RleList")
sample2_cov <- import("sample2.bw", as="RleList")
oo <- order(regions)
regions <- regions[oo]

Now we can construct a View and use viewSums to obtain our values (unlisting 
them as they are grouped by seqnames) and add them as a metadata column in our 
GRanges, restoring the original order of the GRanges when we are done:

v <- Views(sample1_cov, as(regions, "RangesList"))
mcols(regions)$sample1_signal <- unlist(viewSums(v))
regions[oo] <- regions

We then repeat the process to add another metadata column for sample2.

To me, having the ability to use a GRanges as a view into an RleList makes a 
lot more sense. That would allow us to reduce all the above complexity to 
something like:

regions$sample1_signal <- viewSums(Views(sample1_cov, regions))
regions$sample2_signal <- viewSums(Views(sample2_cov, regions))

That alone would be great! But, there's a way to make it even better. Storing 
these RleLists in memory for each of our samples is quite inefficient, 
especially since our regions of interest are only a small portion of them. The 
rtracklayer package already has some very useful functionality for 
instantiating an RleList with only the data from specific ranges of a BigWig 
file. Taking advantage of that, we can dramatically reduce our memory usage and 
increase our performance like so:

regions <- import("regions_of_interest.bed")
sample1_cov <- import("sample1.bw", as="RleList", which=regions)
sample2_cov <- import("sample2.bw", as="RleList", which=regions)
regions$sample1_signal <- viewSums(Views(sample1_cov, regions))
regions$sample2_signal <- viewSums(Views(sample2_cov, regions))

But can't this functionality be included in Views? Why not have it accept a 
BigWig file in place of an RleList and have it selectively load the portions of 
the BigWig it needs based on the provided GRanges? That would allow this:

regions <- import("regions_of_interest.bed")
regions$sample1_signal <- viewSums(Views("sample1.bw", regions))
regions$sample2_signal <- viewSums(Views("sample2.bw", regions))

Exactly. Thanks Jeff for taking the time to put this so nicely together.
That's what I've been having in mind for years. The above is very
intuitive and frees the user of having to deal with a lot of low-level
details.

The ability to subset with a GRanges subscript already provides some
level of convenience but having the ability to formally define views
on genomic data goes even further. In particular, as you pointed out,
it allows the use of delayed views realization strategies behind the
scene which can lead to important performance boosts.

H.


The above is quite close to how I use GRanges and BigWigs now, except I have to 
write and maintain all the hackish code to link BigWig files, GRanges, Views, 
RangesLists and RleLists together into something that behaves as one would 
intuitively expect.

I’d welcome any thoughts on how people perform similar analyses that involve 
GRanges and data stored in BigWig files or RleLists, and whether this would 
also be useful to them.

Thanks,
Jeff Johnston
Zeitlinger Lab
Stowers Institute

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to