Re: [Bioc-devel] Extending Views to support GRanges (and possibly BigWig files)
We have this low level functionality in genomation::ScoreMatrixBin (when bin.num=1), where users can give a set of regions they are interested in as GRanges and input data in the form of RleList, GRanges, Bam file or BigWig file, and get the a summary of scores per region of interest. Currently, it can do mean,median but not sum. Usually you will need "mean" if the regions of interest are not equal length. you probably want this functionality as core functionality, which is a good idea. In that case, you should be careful when dealing with methylation data, where no coverage doesn't mean 0 % methylation. This causes problems when using import.bw() on bigWig files containing methylation data. bases with no coverage are imported as 0 % methylation. Best, Altuna On Wed, Jun 29, 2016 at 1:45 AM, Hervé Pagès wrote: > On 06/28/2016 03:54 AM, Johnston, Jeffrey wrote: > >> During the BioC developer day, Hervé brought up the idea of possibly >> extending the concept of GenomicViews to other use cases. I'd like to share >> how we currently use GenomicRanges, Views and RleLists to perform certain >> analyses, and hopefully demonstrate that a way to directly use Views with >> GRanges would be quite useful. >> >> As an example, let's say we have 2 ChIP-seq samples (as BigWig files) and >> a set of genomic ranges we are interested in (as a BED file). We want to >> find the sum of the ChIP-seq signal found in our regions of interest for >> each of the two samples. We'd first import the BED file as a GRanges >> object. Annotating the GRanges with two metadata columns representing the >> ChIP-seq signal for the two samples seems like a straightforward use for >> Views (in particular, viewSums), but it is a bit convoluted. >> >> The main problem is that Views work with RangesLists, not GRanges. >> Coercing a GRanges to a RangesList possibly disrupts the order, so we have >> to first reorder the GRanges to match the order it will be given after >> coercion (keeping track so we can later revert back to the original order): >> >> regions <- import("regions_of_interest.bed") >> sample1_cov <- import("sample1.bw", as="RleList") >> sample2_cov <- import("sample2.bw", as="RleList") >> oo <- order(regions) >> regions <- regions[oo] >> >> Now we can construct a View and use viewSums to obtain our values >> (unlisting them as they are grouped by seqnames) and add them as a metadata >> column in our GRanges, restoring the original order of the GRanges when we >> are done: >> >> v <- Views(sample1_cov, as(regions, "RangesList")) >> mcols(regions)$sample1_signal <- unlist(viewSums(v)) >> regions[oo] <- regions >> >> We then repeat the process to add another metadata column for sample2. >> >> To me, having the ability to use a GRanges as a view into an RleList >> makes a lot more sense. That would allow us to reduce all the above >> complexity to something like: >> >> regions$sample1_signal <- viewSums(Views(sample1_cov, regions)) >> regions$sample2_signal <- viewSums(Views(sample2_cov, regions)) >> >> That alone would be great! But, there's a way to make it even better. >> Storing these RleLists in memory for each of our samples is quite >> inefficient, especially since our regions of interest are only a small >> portion of them. The rtracklayer package already has some very useful >> functionality for instantiating an RleList with only the data from specific >> ranges of a BigWig file. Taking advantage of that, we can dramatically >> reduce our memory usage and increase our performance like so: >> >> regions <- import("regions_of_interest.bed") >> sample1_cov <- import("sample1.bw", as="RleList", which=regions) >> sample2_cov <- import("sample2.bw", as="RleList", which=regions) >> regions$sample1_signal <- viewSums(Views(sample1_cov, regions)) >> regions$sample2_signal <- viewSums(Views(sample2_cov, regions)) >> >> But can't this functionality be included in Views? Why not have it accept >> a BigWig file in place of an RleList and have it selectively load the >> portions of the BigWig it needs based on the provided GRanges? That would >> allow this: >> >> regions <- import("regions_of_interest.bed") >> regions$sample1_signal <- viewSums(Views("sample1.bw", regions)) >> regions$sample2_signal <- viewSums(Views("sample2.bw", regions)) >> > > Exactly. Thanks Jeff for taking the time to put this so nicely together. > That's what I've been having in mind for years. The above is very > intuitive and frees the user of having to deal with a lot of low-level > details. > > The ability to subset with a GRanges subscript already provides some > level of convenience but having the ability to formally define views > on genomic data goes even further. In particular, as you pointed out, > it allows the use of delayed views realization strategies behind the > scene which can lead to important performance boosts. > > H. > > >> The above is quite close to how I use GRanges and BigWigs now, except I >> have to write and maintain all the hackish code
Re: [Bioc-devel] Extending Views to support GRanges (and possibly BigWig files)
Thanks Michael for clarifying. BTW it's interesting to notice that Views objects are actually List objects (i.e. Views extends List). Views is just below List in the class hierarchy and at the same level as SimpleList and CompressedList. Its internals are actually very close to CompressedList which also uses views behind the scene. An important difference is that with Views the views can be anywhere on the subject and can overlap, whereas with CompressedList they have to cover the unlistData slot and cannot overlap. This might sound restrictive but it's actually what makes CompressedList unbeatable when using the unlist/transform/relist trick. Also because a Views object is a List object, sum(Views(cov, gr)) works and is equivalent to sum(cov[gr]). viewSums() and family are cute but not needed because redundant with the existing verbs. H. On 06/28/2016 09:03 PM, Michael Lawrence wrote: Yes, I am generally in favor of the GViews, and I've been a fan of Views since they were invented. Was just pointing out that we can do a lot with Lists these days. On Tue, Jun 28, 2016 at 4:56 PM, Hervé Pagès wrote: On 06/28/2016 05:43 AM, Michael Lawrence wrote: Thanks for these use cases. I've been wondering about the usefulness of Views given how far along Lists have come since the invention of Views. Instead of viewSums(Views(cov, gr)) we could just do sum(cov[gr]). The latter works today. The difference between the Views and List approach is that the Views data structure defers the extraction until summarization. I think this is an important feature. So we can always retrieve the entire underlying vector, and the ranges of interest. But for summarize-and-forget use cases, Lists should work fine. They work, but they're not super convenient. The user needs to know how to import data for his/her regions of interest only. The way to do this can vary across the type of file (e.g. 2bit vs BigWig). Having GViews objects hides these details and provides a unified mode of operating on a set of genomic regions of interest. Also, unlike a List, a GViews object bundles together the genomic regions and the data associated with them. This makes the object self-descriptive and thus reduces the risk of errors. I like the idea of pushing the aggregation down to the data. BigWig files are particularly well suited for this, because they have precomputed summary statistics. See summary,BigWigFile. It would take some hacking of the Kent library to expose everything we want, like the sums. This sounds like an argument in favor of using GViews objects to me. Thanks, H. Michael On Tue, Jun 28, 2016 at 3:54 AM, Johnston, Jeffrey wrote: During the BioC developer day, Hervé brought up the idea of possibly extending the concept of GenomicViews to other use cases. I'd like to share how we currently use GenomicRanges, Views and RleLists to perform certain analyses, and hopefully demonstrate that a way to directly use Views with GRanges would be quite useful. As an example, let's say we have 2 ChIP-seq samples (as BigWig files) and a set of genomic ranges we are interested in (as a BED file). We want to find the sum of the ChIP-seq signal found in our regions of interest for each of the two samples. We'd first import the BED file as a GRanges object. Annotating the GRanges with two metadata columns representing the ChIP-seq signal for the two samples seems like a straightforward use for Views (in particular, viewSums), but it is a bit convoluted. The main problem is that Views work with RangesLists, not GRanges. Coercing a GRanges to a RangesList possibly disrupts the order, so we have to first reorder the GRanges to match the order it will be given after coercion (keeping track so we can later revert back to the original order): regions <- import("regions_of_interest.bed") sample1_cov <- import("sample1.bw", as="RleList") sample2_cov <- import("sample2.bw", as="RleList") oo <- order(regions) regions <- regions[oo] Now we can construct a View and use viewSums to obtain our values (unlisting them as they are grouped by seqnames) and add them as a metadata column in our GRanges, restoring the original order of the GRanges when we are done: v <- Views(sample1_cov, as(regions, "RangesList")) mcols(regions)$sample1_signal <- unlist(viewSums(v)) regions[oo] <- regions We then repeat the process to add another metadata column for sample2. To me, having the ability to use a GRanges as a view into an RleList makes a lot more sense. That would allow us to reduce all the above complexity to something like: regions$sample1_signal <- viewSums(Views(sample1_cov, regions)) regions$sample2_signal <- viewSums(Views(sample2_cov, regions)) That alone would be great! But, there's a way to make it even better. Storing these RleLists in memory for each of our samples is quite inefficient, especially since our regions of interest are only a small portion of them. The rtracklayer package already has some very useful functionalit
Re: [Bioc-devel] Extending Views to support GRanges (and possibly BigWig files)
Yes, I am generally in favor of the GViews, and I've been a fan of Views since they were invented. Was just pointing out that we can do a lot with Lists these days. On Tue, Jun 28, 2016 at 4:56 PM, Hervé Pagès wrote: > On 06/28/2016 05:43 AM, Michael Lawrence wrote: >> >> Thanks for these use cases. >> >> I've been wondering about the usefulness of Views given how far along >> Lists have come since the invention of Views. Instead of >> viewSums(Views(cov, gr)) we could just do sum(cov[gr]). The latter >> works today. The difference between the Views and List approach is >> that the Views data structure defers the extraction until >> summarization. > > > I think this is an important feature. > >> So we can always retrieve the entire underlying vector, >> and the ranges of interest. But for summarize-and-forget use cases, >> Lists should work fine. > > > They work, but they're not super convenient. The user needs to know > how to import data for his/her regions of interest only. The way to > do this can vary across the type of file (e.g. 2bit vs BigWig). > Having GViews objects hides these details and provides a unified > mode of operating on a set of genomic regions of interest. > > Also, unlike a List, a GViews object bundles together the genomic > regions and the data associated with them. This makes the object > self-descriptive and thus reduces the risk of errors. > >> >> I like the idea of pushing the aggregation down to the data. BigWig >> files are particularly well suited for this, because they have >> precomputed summary statistics. See summary,BigWigFile. It would take >> some hacking of the Kent library to expose everything we want, like >> the sums. > > > This sounds like an argument in favor of using GViews objects to me. > > Thanks, > H. > > >> >> Michael >> >> On Tue, Jun 28, 2016 at 3:54 AM, Johnston, Jeffrey >> wrote: >>> >>> During the BioC developer day, Hervé brought up the idea of possibly >>> extending the concept of GenomicViews to other use cases. I'd like to share >>> how we currently use GenomicRanges, Views and RleLists to perform certain >>> analyses, and hopefully demonstrate that a way to directly use Views with >>> GRanges would be quite useful. >>> >>> As an example, let's say we have 2 ChIP-seq samples (as BigWig files) and >>> a set of genomic ranges we are interested in (as a BED file). We want to >>> find the sum of the ChIP-seq signal found in our regions of interest for >>> each of the two samples. We'd first import the BED file as a GRanges object. >>> Annotating the GRanges with two metadata columns representing the ChIP-seq >>> signal for the two samples seems like a straightforward use for Views (in >>> particular, viewSums), but it is a bit convoluted. >>> >>> The main problem is that Views work with RangesLists, not GRanges. >>> Coercing a GRanges to a RangesList possibly disrupts the order, so we have >>> to first reorder the GRanges to match the order it will be given after >>> coercion (keeping track so we can later revert back to the original order): >>> >>> regions <- import("regions_of_interest.bed") >>> sample1_cov <- import("sample1.bw", as="RleList") >>> sample2_cov <- import("sample2.bw", as="RleList") >>> oo <- order(regions) >>> regions <- regions[oo] >>> >>> Now we can construct a View and use viewSums to obtain our values >>> (unlisting them as they are grouped by seqnames) and add them as a metadata >>> column in our GRanges, restoring the original order of the GRanges when we >>> are done: >>> >>> v <- Views(sample1_cov, as(regions, "RangesList")) >>> mcols(regions)$sample1_signal <- unlist(viewSums(v)) >>> regions[oo] <- regions >>> >>> We then repeat the process to add another metadata column for sample2. >>> >>> To me, having the ability to use a GRanges as a view into an RleList >>> makes a lot more sense. That would allow us to reduce all the above >>> complexity to something like: >>> >>> regions$sample1_signal <- viewSums(Views(sample1_cov, regions)) >>> regions$sample2_signal <- viewSums(Views(sample2_cov, regions)) >>> >>> That alone would be great! But, there's a way to make it even better. >>> Storing these RleLists in memory for each of our samples is quite >>> inefficient, especially since our regions of interest are only a small >>> portion of them. The rtracklayer package already has some very useful >>> functionality for instantiating an RleList with only the data from specific >>> ranges of a BigWig file. Taking advantage of that, we can dramatically >>> reduce our memory usage and increase our performance like so: >>> >>> regions <- import("regions_of_interest.bed") >>> sample1_cov <- import("sample1.bw", as="RleList", which=regions) >>> sample2_cov <- import("sample2.bw", as="RleList", which=regions) >>> regions$sample1_signal <- viewSums(Views(sample1_cov, regions)) >>> regions$sample2_signal <- viewSums(Views(sample2_cov, regions)) >>> >>> But can't this functionality be included in Views? Why not have it accept >>> a BigWig
Re: [Bioc-devel] Extending Views to support GRanges (and possibly BigWig files)
On 06/28/2016 05:43 AM, Michael Lawrence wrote: Thanks for these use cases. I've been wondering about the usefulness of Views given how far along Lists have come since the invention of Views. Instead of viewSums(Views(cov, gr)) we could just do sum(cov[gr]). The latter works today. The difference between the Views and List approach is that the Views data structure defers the extraction until summarization. I think this is an important feature. So we can always retrieve the entire underlying vector, and the ranges of interest. But for summarize-and-forget use cases, Lists should work fine. They work, but they're not super convenient. The user needs to know how to import data for his/her regions of interest only. The way to do this can vary across the type of file (e.g. 2bit vs BigWig). Having GViews objects hides these details and provides a unified mode of operating on a set of genomic regions of interest. Also, unlike a List, a GViews object bundles together the genomic regions and the data associated with them. This makes the object self-descriptive and thus reduces the risk of errors. I like the idea of pushing the aggregation down to the data. BigWig files are particularly well suited for this, because they have precomputed summary statistics. See summary,BigWigFile. It would take some hacking of the Kent library to expose everything we want, like the sums. This sounds like an argument in favor of using GViews objects to me. Thanks, H. Michael On Tue, Jun 28, 2016 at 3:54 AM, Johnston, Jeffrey wrote: During the BioC developer day, Hervé brought up the idea of possibly extending the concept of GenomicViews to other use cases. I'd like to share how we currently use GenomicRanges, Views and RleLists to perform certain analyses, and hopefully demonstrate that a way to directly use Views with GRanges would be quite useful. As an example, let's say we have 2 ChIP-seq samples (as BigWig files) and a set of genomic ranges we are interested in (as a BED file). We want to find the sum of the ChIP-seq signal found in our regions of interest for each of the two samples. We'd first import the BED file as a GRanges object. Annotating the GRanges with two metadata columns representing the ChIP-seq signal for the two samples seems like a straightforward use for Views (in particular, viewSums), but it is a bit convoluted. The main problem is that Views work with RangesLists, not GRanges. Coercing a GRanges to a RangesList possibly disrupts the order, so we have to first reorder the GRanges to match the order it will be given after coercion (keeping track so we can later revert back to the original order): regions <- import("regions_of_interest.bed") sample1_cov <- import("sample1.bw", as="RleList") sample2_cov <- import("sample2.bw", as="RleList") oo <- order(regions) regions <- regions[oo] Now we can construct a View and use viewSums to obtain our values (unlisting them as they are grouped by seqnames) and add them as a metadata column in our GRanges, restoring the original order of the GRanges when we are done: v <- Views(sample1_cov, as(regions, "RangesList")) mcols(regions)$sample1_signal <- unlist(viewSums(v)) regions[oo] <- regions We then repeat the process to add another metadata column for sample2. To me, having the ability to use a GRanges as a view into an RleList makes a lot more sense. That would allow us to reduce all the above complexity to something like: regions$sample1_signal <- viewSums(Views(sample1_cov, regions)) regions$sample2_signal <- viewSums(Views(sample2_cov, regions)) That alone would be great! But, there's a way to make it even better. Storing these RleLists in memory for each of our samples is quite inefficient, especially since our regions of interest are only a small portion of them. The rtracklayer package already has some very useful functionality for instantiating an RleList with only the data from specific ranges of a BigWig file. Taking advantage of that, we can dramatically reduce our memory usage and increase our performance like so: regions <- import("regions_of_interest.bed") sample1_cov <- import("sample1.bw", as="RleList", which=regions) sample2_cov <- import("sample2.bw", as="RleList", which=regions) regions$sample1_signal <- viewSums(Views(sample1_cov, regions)) regions$sample2_signal <- viewSums(Views(sample2_cov, regions)) But can't this functionality be included in Views? Why not have it accept a BigWig file in place of an RleList and have it selectively load the portions of the BigWig it needs based on the provided GRanges? That would allow this: regions <- import("regions_of_interest.bed") regions$sample1_signal <- viewSums(Views("sample1.bw", regions)) regions$sample2_signal <- viewSums(Views("sample2.bw", regions)) The above is quite close to how I use GRanges and BigWigs now, except I have to write and maintain all the hackish code to link BigWig files, GRanges, Views, RangesLists and RleLists together in
Re: [Bioc-devel] Extending Views to support GRanges (and possibly BigWig files)
On 06/28/2016 03:54 AM, Johnston, Jeffrey wrote: During the BioC developer day, Hervé brought up the idea of possibly extending the concept of GenomicViews to other use cases. I'd like to share how we currently use GenomicRanges, Views and RleLists to perform certain analyses, and hopefully demonstrate that a way to directly use Views with GRanges would be quite useful. As an example, let's say we have 2 ChIP-seq samples (as BigWig files) and a set of genomic ranges we are interested in (as a BED file). We want to find the sum of the ChIP-seq signal found in our regions of interest for each of the two samples. We'd first import the BED file as a GRanges object. Annotating the GRanges with two metadata columns representing the ChIP-seq signal for the two samples seems like a straightforward use for Views (in particular, viewSums), but it is a bit convoluted. The main problem is that Views work with RangesLists, not GRanges. Coercing a GRanges to a RangesList possibly disrupts the order, so we have to first reorder the GRanges to match the order it will be given after coercion (keeping track so we can later revert back to the original order): regions <- import("regions_of_interest.bed") sample1_cov <- import("sample1.bw", as="RleList") sample2_cov <- import("sample2.bw", as="RleList") oo <- order(regions) regions <- regions[oo] Now we can construct a View and use viewSums to obtain our values (unlisting them as they are grouped by seqnames) and add them as a metadata column in our GRanges, restoring the original order of the GRanges when we are done: v <- Views(sample1_cov, as(regions, "RangesList")) mcols(regions)$sample1_signal <- unlist(viewSums(v)) regions[oo] <- regions We then repeat the process to add another metadata column for sample2. To me, having the ability to use a GRanges as a view into an RleList makes a lot more sense. That would allow us to reduce all the above complexity to something like: regions$sample1_signal <- viewSums(Views(sample1_cov, regions)) regions$sample2_signal <- viewSums(Views(sample2_cov, regions)) That alone would be great! But, there's a way to make it even better. Storing these RleLists in memory for each of our samples is quite inefficient, especially since our regions of interest are only a small portion of them. The rtracklayer package already has some very useful functionality for instantiating an RleList with only the data from specific ranges of a BigWig file. Taking advantage of that, we can dramatically reduce our memory usage and increase our performance like so: regions <- import("regions_of_interest.bed") sample1_cov <- import("sample1.bw", as="RleList", which=regions) sample2_cov <- import("sample2.bw", as="RleList", which=regions) regions$sample1_signal <- viewSums(Views(sample1_cov, regions)) regions$sample2_signal <- viewSums(Views(sample2_cov, regions)) But can't this functionality be included in Views? Why not have it accept a BigWig file in place of an RleList and have it selectively load the portions of the BigWig it needs based on the provided GRanges? That would allow this: regions <- import("regions_of_interest.bed") regions$sample1_signal <- viewSums(Views("sample1.bw", regions)) regions$sample2_signal <- viewSums(Views("sample2.bw", regions)) Exactly. Thanks Jeff for taking the time to put this so nicely together. That's what I've been having in mind for years. The above is very intuitive and frees the user of having to deal with a lot of low-level details. The ability to subset with a GRanges subscript already provides some level of convenience but having the ability to formally define views on genomic data goes even further. In particular, as you pointed out, it allows the use of delayed views realization strategies behind the scene which can lead to important performance boosts. H. The above is quite close to how I use GRanges and BigWigs now, except I have to write and maintain all the hackish code to link BigWig files, GRanges, Views, RangesLists and RleLists together into something that behaves as one would intuitively expect. I’d welcome any thoughts on how people perform similar analyses that involve GRanges and data stored in BigWig files or RleLists, and whether this would also be useful to them. Thanks, Jeff Johnston Zeitlinger Lab Stowers Institute ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Extending Views to support GRanges (and possibly BigWig files)
Thanks for these use cases. I've been wondering about the usefulness of Views given how far along Lists have come since the invention of Views. Instead of viewSums(Views(cov, gr)) we could just do sum(cov[gr]). The latter works today. The difference between the Views and List approach is that the Views data structure defers the extraction until summarization. So we can always retrieve the entire underlying vector, and the ranges of interest. But for summarize-and-forget use cases, Lists should work fine. I like the idea of pushing the aggregation down to the data. BigWig files are particularly well suited for this, because they have precomputed summary statistics. See summary,BigWigFile. It would take some hacking of the Kent library to expose everything we want, like the sums. Michael On Tue, Jun 28, 2016 at 3:54 AM, Johnston, Jeffrey wrote: > During the BioC developer day, Hervé brought up the idea of possibly > extending the concept of GenomicViews to other use cases. I'd like to share > how we currently use GenomicRanges, Views and RleLists to perform certain > analyses, and hopefully demonstrate that a way to directly use Views with > GRanges would be quite useful. > > As an example, let's say we have 2 ChIP-seq samples (as BigWig files) and a > set of genomic ranges we are interested in (as a BED file). We want to find > the sum of the ChIP-seq signal found in our regions of interest for each of > the two samples. We'd first import the BED file as a GRanges object. > Annotating the GRanges with two metadata columns representing the ChIP-seq > signal for the two samples seems like a straightforward use for Views (in > particular, viewSums), but it is a bit convoluted. > > The main problem is that Views work with RangesLists, not GRanges. Coercing a > GRanges to a RangesList possibly disrupts the order, so we have to first > reorder the GRanges to match the order it will be given after coercion > (keeping track so we can later revert back to the original order): > > regions <- import("regions_of_interest.bed") > sample1_cov <- import("sample1.bw", as="RleList") > sample2_cov <- import("sample2.bw", as="RleList") > oo <- order(regions) > regions <- regions[oo] > > Now we can construct a View and use viewSums to obtain our values (unlisting > them as they are grouped by seqnames) and add them as a metadata column in > our GRanges, restoring the original order of the GRanges when we are done: > > v <- Views(sample1_cov, as(regions, "RangesList")) > mcols(regions)$sample1_signal <- unlist(viewSums(v)) > regions[oo] <- regions > > We then repeat the process to add another metadata column for sample2. > > To me, having the ability to use a GRanges as a view into an RleList makes a > lot more sense. That would allow us to reduce all the above complexity to > something like: > > regions$sample1_signal <- viewSums(Views(sample1_cov, regions)) > regions$sample2_signal <- viewSums(Views(sample2_cov, regions)) > > That alone would be great! But, there's a way to make it even better. Storing > these RleLists in memory for each of our samples is quite inefficient, > especially since our regions of interest are only a small portion of them. > The rtracklayer package already has some very useful functionality for > instantiating an RleList with only the data from specific ranges of a BigWig > file. Taking advantage of that, we can dramatically reduce our memory usage > and increase our performance like so: > > regions <- import("regions_of_interest.bed") > sample1_cov <- import("sample1.bw", as="RleList", which=regions) > sample2_cov <- import("sample2.bw", as="RleList", which=regions) > regions$sample1_signal <- viewSums(Views(sample1_cov, regions)) > regions$sample2_signal <- viewSums(Views(sample2_cov, regions)) > > But can't this functionality be included in Views? Why not have it accept a > BigWig file in place of an RleList and have it selectively load the portions > of the BigWig it needs based on the provided GRanges? That would allow this: > > regions <- import("regions_of_interest.bed") > regions$sample1_signal <- viewSums(Views("sample1.bw", regions)) > regions$sample2_signal <- viewSums(Views("sample2.bw", regions)) > > The above is quite close to how I use GRanges and BigWigs now, except I have > to write and maintain all the hackish code to link BigWig files, GRanges, > Views, RangesLists and RleLists together into something that behaves as one > would intuitively expect. > > I’d welcome any thoughts on how people perform similar analyses that involve > GRanges and data stored in BigWig files or RleLists, and whether this would > also be useful to them. > > Thanks, > Jeff Johnston > Zeitlinger Lab > Stowers Institute > > ___ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailma
[Bioc-devel] Extending Views to support GRanges (and possibly BigWig files)
During the BioC developer day, Hervé brought up the idea of possibly extending the concept of GenomicViews to other use cases. I'd like to share how we currently use GenomicRanges, Views and RleLists to perform certain analyses, and hopefully demonstrate that a way to directly use Views with GRanges would be quite useful. As an example, let's say we have 2 ChIP-seq samples (as BigWig files) and a set of genomic ranges we are interested in (as a BED file). We want to find the sum of the ChIP-seq signal found in our regions of interest for each of the two samples. We'd first import the BED file as a GRanges object. Annotating the GRanges with two metadata columns representing the ChIP-seq signal for the two samples seems like a straightforward use for Views (in particular, viewSums), but it is a bit convoluted. The main problem is that Views work with RangesLists, not GRanges. Coercing a GRanges to a RangesList possibly disrupts the order, so we have to first reorder the GRanges to match the order it will be given after coercion (keeping track so we can later revert back to the original order): regions <- import("regions_of_interest.bed") sample1_cov <- import("sample1.bw", as="RleList") sample2_cov <- import("sample2.bw", as="RleList") oo <- order(regions) regions <- regions[oo] Now we can construct a View and use viewSums to obtain our values (unlisting them as they are grouped by seqnames) and add them as a metadata column in our GRanges, restoring the original order of the GRanges when we are done: v <- Views(sample1_cov, as(regions, "RangesList")) mcols(regions)$sample1_signal <- unlist(viewSums(v)) regions[oo] <- regions We then repeat the process to add another metadata column for sample2. To me, having the ability to use a GRanges as a view into an RleList makes a lot more sense. That would allow us to reduce all the above complexity to something like: regions$sample1_signal <- viewSums(Views(sample1_cov, regions)) regions$sample2_signal <- viewSums(Views(sample2_cov, regions)) That alone would be great! But, there's a way to make it even better. Storing these RleLists in memory for each of our samples is quite inefficient, especially since our regions of interest are only a small portion of them. The rtracklayer package already has some very useful functionality for instantiating an RleList with only the data from specific ranges of a BigWig file. Taking advantage of that, we can dramatically reduce our memory usage and increase our performance like so: regions <- import("regions_of_interest.bed") sample1_cov <- import("sample1.bw", as="RleList", which=regions) sample2_cov <- import("sample2.bw", as="RleList", which=regions) regions$sample1_signal <- viewSums(Views(sample1_cov, regions)) regions$sample2_signal <- viewSums(Views(sample2_cov, regions)) But can't this functionality be included in Views? Why not have it accept a BigWig file in place of an RleList and have it selectively load the portions of the BigWig it needs based on the provided GRanges? That would allow this: regions <- import("regions_of_interest.bed") regions$sample1_signal <- viewSums(Views("sample1.bw", regions)) regions$sample2_signal <- viewSums(Views("sample2.bw", regions)) The above is quite close to how I use GRanges and BigWigs now, except I have to write and maintain all the hackish code to link BigWig files, GRanges, Views, RangesLists and RleLists together into something that behaves as one would intuitively expect. I’d welcome any thoughts on how people perform similar analyses that involve GRanges and data stored in BigWig files or RleLists, and whether this would also be useful to them. Thanks, Jeff Johnston Zeitlinger Lab Stowers Institute ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel