[Bioc-devel] Extending Views to support GRanges (and possibly BigWig files)

2016-06-28 Thread Johnston, Jeffrey
During the BioC developer day, Hervé brought up the idea of extending the
concept of GenomicViews to other use cases. I'd like to share how we currently
use GenomicRanges, Views and RleLists to perform certain analyses, and
hopefully demonstrate that a way to use Views directly with GRanges would be
quite useful.

As an example, let's say we have two ChIP-seq samples (as BigWig files) and a set
of genomic ranges we are interested in (as a BED file). We want to find the sum 
of the ChIP-seq signal found in our regions of interest for each of the two 
samples. We'd first import the BED file as a GRanges object. Annotating the 
GRanges with two metadata columns representing the ChIP-seq signal for the two 
samples seems like a straightforward use for Views (in particular, viewSums), 
but it is a bit convoluted.

The main problem is that Views works with RangesLists, not GRanges. Coercing a
GRanges to a RangesList can disrupt the order, because the coercion groups the
ranges by seqnames. We therefore have to reorder the GRanges up front to match
the order it will have after coercion (keeping track of the permutation so we
can restore the original order later):

library(rtracklayer)  # provides import() for BED and BigWig files

regions <- import("regions_of_interest.bed")
sample1_cov <- import("sample1.bw", as="RleList")
sample2_cov <- import("sample2.bw", as="RleList")
oo <- order(regions)    # sorts by seqnames, then by position
regions <- regions[oo]  # now matches the order produced by the coercion below

Now we can construct a Views object and use viewSums to obtain our values
(unlisting them, as they are grouped by seqnames) and add them as a metadata
column of our GRanges, restoring the original order of the GRanges when we are
done:

v <- Views(sample1_cov, as(regions, "RangesList"))
mcols(regions)$sample1_signal <- unlist(viewSums(v))
regions[oo] <- regions  # inverse-permutation assignment restores the original order

We then repeat the process to add another metadata column for sample2, as
spelled out below.
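
Spelled out, the repeat for sample2 is the same dance of reordering, viewing
and restoring (v2 is just an illustrative name):

oo <- order(regions)
regions <- regions[oo]  # sort to match the coercion order again
v2 <- Views(sample2_cov, as(regions, "RangesList"))
mcols(regions)$sample2_signal <- unlist(viewSums(v2))
regions[oo] <- regions  # restore the original order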

To me, having the ability to use a GRanges as a view into an RleList makes a 
lot more sense. That would allow us to reduce all the above complexity to 
something like:

regions$sample1_signal <- viewSums(Views(sample1_cov, regions))
regions$sample2_signal <- viewSums(Views(sample2_cov, regions))

That alone would be great! But there's a way to make it even better. Storing
these full-genome RleLists in memory for each of our samples is quite
inefficient, especially since our regions of interest cover only a small
portion of the genome. The rtracklayer package already has some very useful
functionality for instantiating an RleList with only the data from specific
ranges of a BigWig file. Taking advantage of that, we can dramatically reduce
our memory usage and improve our performance like so:

regions <- import("regions_of_interest.bed")
sample1_cov <- import("sample1.bw", as="RleList", which=regions)  # load only our regions
sample2_cov <- import("sample2.bw", as="RleList", which=regions)
regions$sample1_signal <- viewSums(Views(sample1_cov, regions))
regions$sample2_signal <- viewSums(Views(sample2_cov, regions))

But couldn't this functionality be included in Views itself? Why not have it
accept a BigWig file in place of an RleList and selectively load the portions
of the BigWig it needs based on the provided GRanges? That would allow this:

regions <- import("regions_of_interest.bed")
regions$sample1_signal <- viewSums(Views("sample1.bw", regions))
regions$sample2_signal <- viewSums(Views("sample2.bw", regions))

The above is quite close to how I use GRanges and BigWig files now, except
that I have to write and maintain all the hackish glue code that links BigWig
files, GRanges, Views, RangesLists and RleLists together into something that
behaves as one would intuitively expect.
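
For illustration, that glue boils down to a helper like the one below
(bigWigViewSums is a hypothetical name of mine, not an existing API; it just
bundles the selective import, reordering and Views steps shown above):

bigWigViewSums <- function(bw_file, regions) {
  cov <- import(bw_file, as="RleList", which=regions)  # load only covered ranges
  oo <- order(regions)
  sorted <- regions[oo]                                # match the coercion order
  sums <- unlist(viewSums(Views(cov, as(sorted, "RangesList"))))
  out <- numeric(length(regions))
  out[oo] <- sums                                      # undo the sort
  out
}

regions$sample1_signal <- bigWigViewSums("sample1.bw", regions)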

I’d welcome any thoughts on how people perform similar analyses that involve 
GRanges and data stored in BigWig files or RleLists, and whether this would 
also be useful to them.

Thanks,
Jeff Johnston
Zeitlinger Lab
Stowers Institute


[Bioc-devel] Parallel processing of reads in a single fastq file

2014-08-06 Thread Johnston, Jeffrey
Hi,

I have been using FastqStreamer() and yield() to process a large fastq file in
chunks, modifying both the read and its name and then appending the output to a
new fastq file as each chunk is processed. This works well, but it would
benefit greatly from being parallelized.

As far as I can tell, this problem isn’t easily solved with the existing 
parallel tools because you can’t determine how many jobs you’ll need in advance 
(you just call yield() until it stops returning reads).
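
For concreteness, the serial version of the pattern looks roughly like this
(a minimal sketch assuming the ShortRead package; the file names are
placeholders and narrow() stands in for the real read/name manipulation):

library(ShortRead)

fqstream <- FastqStreamer("input.fastq.gz", n=1e6)
repeat {
  fq <- yield(fqstream)
  if (length(fq) == 0) break                   # no more chunks
  fq <- narrow(fq, start=1, end=30)            # placeholder manipulation
  writeFastq(fq, "output.fastq.gz", mode="a")  # append this chunk
}
close(fqstream)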

After some digging, I found the sclapply() function in the HTSeqGenie package 
by Gregoire Pau, which he describes as a “multicore dispatcher”:

https://stat.ethz.ch/pipermail/bioc-devel/2013-October/004754.html

I wasn't able to install the package from source due to some dependencies
(there are no binaries for Mac), but I did extract the function and adapt it
slightly for my use case. Here's an example:

processChunk <- function(fq_chunk) {
  # manipulate the fastq reads in this chunk here
}

yieldHelper <- function() {
  fq <- yield(fqstream)
  if (length(fq) == 0) return(NULL)  # NULL signals the dispatcher to stop
  fq
}

fqstream <- FastqStreamer("…", n=1e6)
sclapply(yieldHelper, processChunk, max.parallel.jobs=4)
close(fqstream)

Based on the discussion linked above, it seems like there was some interest in 
integrating this idea into BiocParallel. I would find that very useful as it 
improves performance quite a bit and can likely be applied to numerous 
stream-based processing tasks.
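
For what it's worth, the example above maps directly onto an iterator-style
interface; here is a hedged sketch, assuming a BiocParallel version that
provides bpiterate(), with yieldHelper() and processChunk() as defined above:

library(BiocParallel)

bpiterate(yieldHelper, processChunk,
          BPPARAM=MulticoreParam(workers=4))  # workers ~ max.parallel.jobs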

I will point out that in my case above, the processChunk() function doesn't
return anything; instead, it appends the modified fastq records to a new file.
I have to use the Unix lockfile command to ensure that only one child process
appends to the output file at a time. I am not certain whether there is a more
elegant solution (perhaps a queue that is emptied by a dedicated writer
process?).
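
For reference, the lockfile guard amounts to something like this sketch (it
assumes procmail's lockfile utility is on the PATH; appendChunk is a
hypothetical helper name):

appendChunk <- function(fq, out="output.fastq.gz") {
  lock <- paste0(out, ".lock")
  system2("lockfile", lock)     # blocks until the lock file can be created
  on.exit(file.remove(lock))    # release the lock even if writeFastq() errors
  writeFastq(fq, out, mode="a")
}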

Thanks,
Jeff