I should also have compared to just using coverage(BigWigFileViews):

approach3 <- function(ranges, files) {
    bwViews <- BigWigFileViews(fileRange = ranges, fileList = files)
    cov <- coverage(bwViews)
    ## each assay element holds the coverage Rle for one (range, file)
    ## pair; summarize each by its mean
    vec <- sapply(assay(cov), function(xx) mean(xx[[1]]))
    matrix(vec, ncol = length(files), nrow = length(ranges))
}

system.time({
    mat3 <- approach3(all.grs[1:10], all.files[1:5])
})
##    user  system elapsed
##   9.705   1.531   2.646

system.time({
    mat3 <- approach3(all.grs[1:100], all.files[1:5])
})
##    user  system elapsed
##  66.454   1.587  18.151

On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen <kasperdanielhan...@gmail.com> wrote:

> I am testing GenomicFiles.
>
> My use case: I have 231k ranges of average width 1.9kb and total width
> 442 Mb. I also have 38 BigWig files. I want to compute the average
> coverage of the 38 BigWig files inside each range. This is similar to
> wanting to get coverage of, say, all promoters in the human genome.
>
> My data resides on a file server connected to the compute node through
> ethernet, so file access is very slow. Also, the BigWig files are
> (perhaps unusually) big: ~2 GB apiece.
>
> Below I have two approaches: one using basically straightforward code,
> the other using GenomicFiles (the examples run 10 / 100 ranges on 5
> files). GenomicFiles is 10-20x slower than the straightforward
> approach, most likely because reduceByRange/reduceByFile processes each
> (range, file) pair separately.
>
> It seems natural (to me) to allow some chunking of the mapping over the
> ranges. My naive approach is fast (I assume) because I read multiple
> ranges with a single import() command. I know I have random access to
> the BigWig files, but there must still be some overhead from seeking
> and, perhaps more importantly, from instantiating and moving stuff back
> and forth to R. So basically I would like to be able to write a MAP
> function which takes
>
>     ranges, file
>
> instead of just
>
>     range, file
>
> and then chunk over, say, 1,000s of ranges. I could then have an
> argument to reduceByXX called something like rangeSize, which is akin
> to yieldSize.
>
> Perhaps this is what is intended for reduceByYield on BigWig files?
>
> In a way, this is what is done in the vignette with the
> coverage(BAMFILE) example, where tileGenome is essentially constructed
> by the user to chunk the coverage computation.
>
> I think the example of a set of regions being queried across files will
> be an extremely common use case going forward. The raw data for all the
> regions together is "too big", but a computation on each region reduces
> the size. In this situation, all the focus is on the MAP step. I see
> the need for REDUCE in the case of the t-test example in the vignette,
> where the return object is a single "thing" for each base. But in
> general, I think we will use these file structures a lot to construct
> things without a REDUCE (across neither files nor ranges).
>
> Also, something completely different: it would be convenient for stuff
> like BigWigFileViews to not have to actually parse the file in the MAP
> step. Somehow I envision some kind of reading function, stored inside
> the object, which just returns an Rle when I ask for a (range, file).
> Perhaps this is better left for later.
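
To make the chunking idea above concrete, here is a rough, untested
sketch done by hand with plain rtracklayer; chunkedMap and its rangeSize
argument are hypothetical, not an existing GenomicFiles API:

library(rtracklayer)

chunkedMap <- function(ranges, files, rangeSize = 1000) {
    ## split the query ranges into chunks of at most rangeSize, so that
    ## there is one import() per (chunk, file) rather than one per
    ## (range, file)
    chunks <- split(ranges, ceiling(seq_along(ranges) / rangeSize))
    out <- lapply(files, function(file) {
        unlist(lapply(chunks, function(chunk) {
            rle <- import(file, as = "Rle", which = chunk,
                          format = "bw")[chunk]
            sapply(rle, mean)
        }), use.names = FALSE)
    })
    ## ranges x files matrix of mean coverage, as in the examples below
    do.call(cbind, out)
}

For example, chunkedMap(all.grs[1:100], all.files[1:5], rangeSize = 20)
would issue 25 import() calls (5 chunks x 5 files), instead of touching
the 500 (range, file) pairs one at a time as reduceByRange does now.
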
> Best,
> Kasper
>
> Examples
>
> approach1 <- function(ranges, files) {
>     ## by hand: one import() per file, fetching all ranges at once
>     all.Rle <- lapply(files, function(file) {
>         rle <- import(file, as = "Rle", which = ranges,
>                       format = "bw")[ranges]
>         rle
>     })
>     print(object.size(all.Rle), units = "auto")
>     mat <- do.call(cbind, lapply(all.Rle, function(xx) {
>         sapply(xx, mean)
>     }))
>     invisible(mat)
> }
>
> system.time({
>     mat1 <- approach1(all.grs[1:10], all.files[1:5])
> })
>
> 160.9 Kb
>    user  system elapsed
>   1.109   0.001   1.113
>
> system.time({
>     mat1 <- approach1(all.grs[1:100], all.files[1:5])
> })  # less than 4x slower than previous call
>
> 3 Mb
>    user  system elapsed
>   4.101   0.019   4.291
>
> approach2 <- function(ranges, files) {
>     gf <- GenomicFiles(rowData = ranges, files = files)
>     ## one import() per (range, file) pair
>     MAPPER <- function(range, file, ...) {
>         library(rtracklayer)
>         rle <- import(file, which = range, as = "Rle",
>                       format = "bw")[range]
>         mean(rle)
>     }
>     sExp <- reduceByRange(gf, MAP = MAPPER, summarize = TRUE,
>                           BPPARAM = SerialParam())
>     sExp
> }
>
> system.time({
>     mat2 <- approach2(all.grs[1:10], all.files[1:5])
> })  # 8-9x slower than approach1
>
>    user  system elapsed
>   9.349   0.280   9.581
>
> system.time({
>     mat2 <- approach2(all.grs[1:100], all.files[1:5])
> })  # 9x slower than previous call, 20x slower than approach1 on same input
>
>    user  system elapsed
>  89.310   0.627  91.044
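
As a postscript on the last idea in the quoted message (a reading
function stored inside the object): a closure-based sketch of what I
have in mind is below. readerFor is purely hypothetical, and it only
defers the parsing to the moment the reader is called rather than
avoiding it:

library(rtracklayer)

readerFor <- function(file) {
    force(file)
    ## returns a function that imports on demand: nothing is read from
    ## the BigWig file until the reader is called with a range
    function(range) {
        import(file, as = "Rle", which = range, format = "bw")[range][[1]]
    }
}

rdr <- readerFor(all.files[1])
rle <- rdr(all.grs[1])   ## the coverage Rle for this (range, file) pair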