hi Valerie, this sounds good to me.
I am thinking of working on a function (here or elsewhere) that helps decide, for reduce-by-range, how to optimally chunk a GRanges into a GRangesList. Practically, this could involve sampling the size of the imported data for a subset of cells in the (ranges, files) matrix, and asking for / estimating the amount of memory available to each worker.

best,

Mike
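A rough sketch of what I mean -- the helper name, the 'mem.per.worker' argument and the size estimate are all made up, nothing like this exists in GenomicFiles yet, and the estimate would certainly need refining:

    chunkGRanges <- function(ranges, files, mem.per.worker = 2e9, nsample = 10) {
        library(rtracklayer)
        ## sample a few cells of the (ranges, files) matrix and measure
        ## the size of the imported data
        idx <- sample(length(ranges), min(nsample, length(ranges)))
        fl <- sample(files, 1)
        sizes <- sapply(idx, function(i)
            as.numeric(object.size(
                import(fl, which = ranges[i], as = "Rle", format = "bw"))))
        ## pick a chunk size expected to fit in 'mem.per.worker' bytes
        chunk.size <- max(1L, floor(mem.per.worker / mean(sizes)))
        split(ranges, ceiling(seq_along(ranges) / chunk.size))
    }

The returned GRangesList could then be passed straight to reduceByRange() / reduceRanges().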
On Mon, Oct 27, 2014 at 2:35 PM, Valerie Obenchain <voben...@fhcrc.org> wrote:

> Hi Kasper and Mike,
>
> I've added 2 new functions to GenomicFiles and deprecated the old classes. The vignette now has a graphic which (hopefully) clarifies the MAP / REDUCE mechanics of the different functions.
>
> Below is some performance testing for the new functions and answers to leftover questions from previous emails.
>
> Major changes to GenomicFiles (in devel):
>
> - *FileViews classes have been deprecated:
>
> The idea is to use the GenomicFiles class to hold any type of file, be it BAM, BigWig, character vector etc., instead of having name-specific classes like BigWigFileViews. Currently GenomicFiles does not inherit from SummarizedExperiment but it may in the future.
>
> - Added reduceFiles() and reduceRanges():
>
> These functions pass all ranges or files to MAP, vs the lapply approach taken in reduceByFile() and reduceByRange().
>
> (1) Performance:
>
> When testing with reduceByFile() you noted "GenomicFiles is 10-20x slower than the straightforward approach". You also noted this was probably because of the lapply over all ranges - true. (Most likely there was overhead in creating the SE object as well.) With the new reduceFiles(), passing all ranges at once, we see performance very similar to that of the 'by hand' approach.
>
> In the test code I've used BAM instead of BigWig. Both test functions output lists, have comparable MAP and REDUCE steps etc.
>
> I used 5 files ('fls') and a GRanges ('grs') of length 100.
>
> > length(grs)
> [1] 100
> > sum(width(grs))
> [1] 1000000
>
> FUN1 is the 'by hand' version. These results are similar to what you saw, not quite a 4x difference between 10 and 100 ranges.
>
> > microbenchmark(FUN1(grs[1:10], fls), FUN1(grs[1:100], fls), times=10)
> Unit: seconds
>                   expr      min       lq     mean   median       uq      max neval
>   FUN1(grs[1:10], fls) 1.177858 1.190239 1.206127 1.201331 1.222298 1.256741    10
>  FUN1(grs[1:100], fls) 4.145503 4.163404 4.249619 4.208486 4.278463 4.533846    10
>
> FUN2 is the reduceFiles() approach and the results are very similar to FUN1.
>
> > microbenchmark(FUN2(grs[1:10], fls), FUN2(grs[1:100], fls), times=10)
> Unit: seconds
>                   expr      min       lq     mean   median       uq      max neval
>   FUN2(grs[1:10], fls) 1.242767 1.251188 1.257531 1.253154 1.267655 1.275698    10
>  FUN2(grs[1:100], fls) 4.251010 4.340061 4.390290 4.361007 4.384064 4.676068    10
>
> (2) Request for "chunking of the mapping of ranges":
>
> For now we decided not to add a 'yieldSize' argument for chunking. There are 2 approaches to chunking through ranges *within* the same file. In both cases the user splits the ranges, either before calling the function or in the MAP step.
>
> i) reduceByFile() with a GRangesList:
>
> The user provides a GRangesList as the 'ranges' arg. On each worker, lapply applies MAP to the one file and all elements of the GRangesList.
>
> ii) reduceFiles() with a MAP that handles chunking:
>
> The user splits the ranges in MAP and uses lapply or another loop to iterate. For example,
>
>     MAP <- function(range, file, ...) {
>         lst <- split(range, someFactor)
>         someFUN <- function(rngs, file, ...) { ## do something with one chunk }
>         lapply(lst, FUN = someFUN, ...)
>     }
>
> The same ideas apply for chunking through ranges *across* files with reduceByRange() and reduceRanges().
>
> iii) reduceByRange() with a GRangesList:
>
> Mike has a good example here: https://gist.github.com/mikelove/deaff999984dc75f125d
>
> iv) reduceRanges():
>
> 'ranges' should be a GRangesList. The MAP step will operate on an element of the GRangesList and all files. Unless you want to operate on all files at once I'd use reduceByRange() instead.
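To make (i) concrete with Valerie's BAM test objects ('grs', 'fls'): below is roughly what I'd expect it to look like, assuming the new signature reduceByFile(ranges, files, MAP, ...). The chunk size and the counting MAP are made up, and I haven't run this:

    library(GenomicFiles)
    library(Rsamtools)
    library(GenomicAlignments)
    grl <- split(grs, ceiling(seq_along(grs) / 25))   # GRangesList of 4 chunks
    MAP <- function(range, file, ...) {
        ## 'range' is one GRangesList element, i.e. a GRanges chunk
        param <- ScanBamParam(which = range)
        gal <- readGAlignments(file, param = param)
        countOverlaps(range, gal)    # integer vector, one count per range
    }
    ## with no REDUCE: a list with one element per file, each itself a
    ## list over the GRangesList chunks
    res <- reduceByFile(grl, fls, MAP = MAP)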
> (3) Return objects have different shape:
>
> Previous question:
>
> "... Why is the return object of reduceByFile vs reduceByRange (with summarize = FALSE) different? I understand why internally you have different nesting schemes (files and ranges) for the two functions, but it is not clear to me that it is desirable to have the return object depend on how the computation was done. ..."
>
> reduceByFile() and reduceFiles() output a list the same length as the number of files, while reduceByRange() and reduceRanges() output a list the same length as the number of ranges.
>
> Reduction is different depending on which function is chosen; data are collapsed either within a file or across files. When REDUCE does something substantial the outputs are not equivalent.
>
> While it's possible to get the same result (REDUCE simply unlists or isn't used), the two approaches were not intended to be equivalent ways of arriving at the same end. The idea was that the user had a specific use case in mind - they either wanted to collapse the data across or within files.
>
> (4) Return object from coverage(BigWigFileViews):
>
> Previous comment:
>
> "... coverage(BigWigFileViews) returns a "wrong" assay object in my opinion, ... Specifically, each (i,j) entry in the object is an RleList with a single element with a name equal to the seqnames of the i'th entry in the query GRanges. To me, this extra nestedness is unnecessary; I would have expected an Rle instead of an RleList with 1 element. ..."
>
> The return value from coverage(x) is an RleList with one coverage vector per seqlevel in 'x'. Even if there is only one seqlevel, the result still comes back as an RleList. This is just the default behavior.
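A quick illustration of that default, unrelated to GenomicFiles itself, plus the obvious way to drop the extra level when there is a single seqlevel:

    library(GenomicRanges)
    gr <- GRanges("chr1", IRanges(1, 10))
    cvg <- coverage(gr)    # RleList with one element per seqlevel ("chr1")
    cvg[["chr1"]]          # plain Rle, if the extra nesting is unwanted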
> (5) Separate the 'read' function from the MAP step:
>
> Previous comment:
>
> "... Also, something completely different, it seems like it would be convenient for stuff like BigWigFileViews to not have to actually parse the file in the MAP step. Somehow I would envision some kind of reading function, stored inside the object, which just returns an Rle when I ask for a (range, file). Perhaps this is better left for later. ..."
>
> The current approach for the reduce* functions is for MAP to both extract and manipulate data. The idea of separating the extraction step is actually implemented in reduceByYield(). (This function used to be yieldReduce() in Rsamtools in past releases.) For reduceByYield() the user must specify YIELD (a reader function), MAP, REDUCE and DONE (criteria to stop iteration).
>
> I'm not sure what is best here. I thought the many-argument approach of reduceByYield() was possibly confusing or burdensome and so didn't use it in the other GenomicFiles functions. Maybe it's not confusing but instead makes the individual steps more clear. What do you think,
>
> - Should the reader function be separate from the MAP? What are the advantages?
>
> - Should READER, MAP, REDUCE be stored inside the GenomicFiles object or supplied as arguments to the functions?
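For reference, this is how I read the reduceByYield() split, using one of the BAM test files. Argument names come from the man page, and I spell out DONE rather than rely on its default in case that differs:

    library(GenomicFiles)
    library(Rsamtools)
    library(GenomicAlignments)
    bf <- BamFile(fls[1], yieldSize = 100000)
    YIELD  <- function(x) readGAlignments(x)        # reader: next chunk of records
    MAP    <- function(value) coverage(value)       # per-chunk computation
    REDUCE <- `+`                                   # combine chunk results
    DONE   <- function(value) length(value) == 0L   # stop when nothing is read
    cvg <- reduceByYield(bf, YIELD, MAP, REDUCE, DONE)

The reader is cleanly separated from the computation, which is exactly the split asked about above.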
> (6) Unnamed assay in SummarizedExperiment:
>
> Previous comment:
>
> "... The return object of reduceByRange / reduceByFile with summarize = TRUE is a SummarizedExperiment with an unnamed assay. I was surprised to see that this is even possible. ..."
>
> There is no default assay name for SummarizedExperiment in general. I've named the assay 'data' for lack of a better term. We could also go with 'reducedData' or another suggestion.
>
> Thanks for the feedback.
>
> Valerie
>
> On 10/01/2014 08:30 AM, Michael Love wrote:
>
>> hi Kasper,
>>
>> For a concrete example, I posted an R and Rout file here:
>>
>> https://gist.github.com/mikelove/deaff999984dc75f125d
>>
>> Things to note: 'ranges' is a GRangesList, I cbind() the numeric vectors in the REDUCE, and then rbind() the final list to get the desired matrix.
>>
>> Other than the weird column name 'init', does this give you what you want?
>>
>> best,
>>
>> Mike
>>
>> On Tue, Sep 30, 2014 at 2:08 PM, Michael Love <michaelisaiahl...@gmail.com> wrote:
>>
>>> hi Kasper and Valerie,
>>>
>>> In Kasper's original email:
>>>
>>> "I would like to be able to write a MAP function which takes
>>>     ranges, file
>>> instead of just
>>>     range, file
>>> And then chunk over say 1,000s of ranges. I could then have an argument to reduceByXX called something like rangeSize, which is kind of like yieldSize."
>>>
>>> I was thinking, for your BigWig example, we get part of the way just by splitting the ranges into a GRangesList of your desired chunk size. Then the parallel iteration is over chunks of GRanges. Within your MAP function:
>>>
>>>     import(file, which = ranges, as = "Rle", format = "bw")[ranges]
>>>
>>> returns an RleList, and calling mean() on this gives a numeric vector of the mean of the coverage for each range.
>>>
>>> Then the only work needed is how to package the result into something reasonable. What you would get now is a list (length of GRangesList) of lists (length of files) of vectors (length of GRanges).
>>>
>>> best,
>>>
>>> Mike
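Spelling that pattern out here, as a sketch in the spirit of the gist rather than the gist itself: 'all.grs' and 'all.files' are Kasper's objects from the original post, the chunk size is arbitrary, and I do the cbind/rbind after the fact instead of inside REDUCE so the sketch does not depend on exactly how REDUCE is invoked:

    library(GenomicFiles)
    library(rtracklayer)
    ## chunk the query ranges into a GRangesList
    grl <- split(all.grs, ceiling(seq_along(all.grs) / 1000))
    MAP <- function(range, file, ...) {
        rle <- import(file, which = range, as = "Rle", format = "bw")[range]
        mean(rle)                      # one mean per range in the chunk
    }
    ## with no REDUCE: list (chunks) of lists (files) of numeric vectors
    res <- reduceByRange(grl, all.files, MAP = MAP)
    ## package into the desired matrix: all ranges x files
    mat <- do.call(rbind, lapply(res, function(x) do.call(cbind, x)))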
>>> On Mon, Sep 29, 2014 at 7:09 PM, Valerie Obenchain <voben...@fhcrc.org> wrote:
>>>
>>>> Hi Kasper,
>>>>
>>>> The reduceBy* functions were intended to combine data across ranges or files. If we have 10 ranges and 3 files you can think of it as a 10 x 3 grid where we'll have 30 queries and therefore 30 pieces of information that can be combined across different dimensions.
>>>>
>>>> The request to 'process ranges in one operation' is a good suggestion and, as you say, may even be the more common use case.
>>>>
>>>> Some action items and explanations:
>>>>
>>>> 1) treat ranges / files as a group
>>>>
>>>> I'll add functions to treat all ranges / files as a group; essentially no REDUCER other than concatenation. Chunking (optional) will occur as defined by 'yieldSize' in the BamFile.
>>>>
>>>> 2) class clean-up
>>>>
>>>> The GenomicFileViews class was the first iteration. It was overly complicated and the complication didn't gain us much. In my mind the GenomicFiles class is a stripped-down version that should replace the *FileViews classes. I plan to deprecate the *FileViews classes.
>>>>
>>>> Ideally the class(es) in GenomicFiles would inherit from SummarizedExperiment. A stand-alone package for SummarizedExperiment is in the works for the near future. Until those plans are finalized I'd rather not change the inheritance structure in GenomicFiles.
>>>>
>>>> 3) pack / unpack
>>>>
>>>> I experimented with this and came to the conclusion that packing / unpacking should be left to the user vs being done behind the scenes with reduceBy*. The primary difficulty is that you don't have pre-knowledge of the output format of the user's MAPPER. If MAPPER outputs an Rle, unpacking may be straightforward. If MAPPER outputs a NumericList or matrix or vector with no genomic coordinates then things are more complicated.
>>>>
>>>> I'm open if others have suggestions / prototypes.
>>>>
>>>> 4) reduceByYield
>>>>
>>>> This was intended for chunking through ranges in a single file. You can imagine using bplapply() over files where each file is chunked through with reduceByYield(). All ranges are chunked by 'yieldSize' defined in the BamFile unless a 'param' dictates a subset of ranges.
>>>>
>>>> 5) additional tidy
>>>>
>>>> I'll add a name to the 'assays' when summarize=TRUE, make sure return types are consistent when summarize=FALSE, etc.
>>>>
>>>> Thanks for the testing and feedback.
>>>>
>>>> Valerie
>>>>
>>>> On 09/29/2014 07:18 AM, Michael Love wrote:
>>>>
>>>>> On Mon, Sep 29, 2014 at 9:09 AM, Kasper Daniel Hansen <kasperdanielhan...@gmail.com> wrote:
>>>>>
>>>>>> I don't fully understand "the use case for reducing by range is when the entire dataset won't fit into memory". The basic assumption of these functions (as far as I can see) is that the output data fits in memory. What may not fit in memory is various earlier "iterations" of the data. For example, in my use case, if I just read in all the data in all the ranges in all the samples it is basically Rle's across 450MB times 38 files, which is not small. What fits in memory is smaller chunks of this; that is true for every application.
>>>>>
>>>>> I was unclear. I meant that, in approach1, you have an object, all.Rle, which contains Rles for every range over every file. Can you actually run this approach on the full dataset?
>>>>>
>>>>>> Reducing by range (or file) only makes sense when the final output includes one entity for several ranges/files ... right? So I don't see how reduce would help me.
>>>>>
>>>>> Yes, I think we agree. This is not a good use case for reduce by range as now implemented.
>>>>>
>>>>> This is a use case which would benefit from the user-facing function internally calling pack()/unpack() to reduce the number of import() calls, and then in the end giving back the mean coverage over the input ranges. I want this too.
>>>>>
>>>>> https://github.com/Bioconductor/GenomicFileViews/issues/2#issuecomment-32625456 (link to the old github repo; the new github repo is named GenomicFiles)
>>>>>
>>>>>> As I see the pack()/unpack() paradigm, it just re-orders the query ranges (which is super nice and matters a lot for speed for some applications). As I understand the code (and my understanding is developing) we need an extra layer to support processing multiple ranges in one operation.
>>>>>>
>>>>>> I am happy to help apart from complaining.
>>>>>>
>>>>>> Best,
>>>>>> Kasper
>>>>>>
>>>>>> On Mon, Sep 29, 2014 at 8:55 AM, Michael Love <michaelisaiahl...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for checking it out and benchmarking. We should be more clear in the docs that the use case for reducing by range is when the entire dataset won't fit into memory. Also, we had some discussion and Valerie had written up methods for packing the ranges supplied by the user into a better form for querying files. In your case it would have packed many ranges together, to reduce the number of import() calls like your naive approach. See the pack/unpack functions, which are not in the vignette but are in the man pages. If I remember, writing code to unpack() the result was not so simple, and development of these functions was set aside for the moment.
>>>>>>>
>>>>>>> Mike
>>>>>>>
>>>>>>> On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen <kasperdanielhan...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I am testing GenomicFiles.
>>>>>>>>
>>>>>>>> My use case: I have 231k ranges of average width 1.9kb and total width 442 MB. I also have 38 BigWig files. I want to compute the average coverage of the 38 BigWig files inside each range. This is similar to wanting to get coverage of - say - all promoters in the human genome.
>>>>>>>>
>>>>>>>> My data is residing on a file server which is connected to the compute node through ethernet, so basically I have very slow file access. Also, the BigWig files are (perhaps unusually) big: ~2GB / piece.
>>>>>>>>
>>>>>>>> Below I have two approaches, one using basically straightforward code and the other using GenomicFiles (examples are 10 / 100 ranges on 5 files). Basically GenomicFiles is 10-20x slower than the straightforward approach. This is most likely because reduceByRange/reduceByFile processes each (range, file) pair separately.
>>>>>>>>
>>>>>>>> It seems natural (to me) to allow some chunking of the mapping of the ranges. My naive approach is fast (I assume) because I read multiple ranges through one import() command. I know I have random access to the BigWig files, but there must still be some overhead of seeking and perhaps more importantly instantiating/moving stuff back and forth to R. So basically I would like to be able to write a MAP function which takes
>>>>>>>>     ranges, file
>>>>>>>> instead of just
>>>>>>>>     range, file
>>>>>>>> and then chunk over say 1,000s of ranges. I could then have an argument to reduceByXX called something like rangeSize, which is kind of like yieldSize.
>>>>>>>>
>>>>>>>> Perhaps this is what is intended for the reduceByYield on BigWig files?
>>>>>>>>
>>>>>>>> In a way, this is what is done in the vignette with the coverage(BAMFILE) example where tileGenome is essentially constructed by the user to chunk the coverage computation.
>>>>>>>>
>>>>>>>> I think the example of a set of regions I am querying on the files will be an extremely common use case going forward. The raw data for all the regions together is "too big" but I do a computation on each region to reduce the size. In this situation, all the focus is on the MAP step. I see the need for REDUCE in the case of the t-test example in the vignette, where the return object is a single "thing" for each base. But in general, I think we will use these file structures a lot to construct things without REDUCE (across neither files nor ranges).
>>>>>>>>
>>>>>>>> Also, something completely different, it seems like it would be convenient for stuff like BigWigFileViews to not have to actually parse the file in the MAP step. Somehow I would envision some kind of reading function, stored inside the object, which just returns an Rle when I ask for a (range, file). Perhaps this is better left for later.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kasper
>>>>>>>>
>>>>>>>> Examples
>>>>>>>>
>>>>>>>> approach1 <- function(ranges, files) {
>>>>>>>>     ## By hand
>>>>>>>>     all.Rle <- lapply(files, function(file) {
>>>>>>>>         rle <- import(file, as = "Rle", which = ranges, format = "bw")[ranges]
>>>>>>>>         rle
>>>>>>>>     })
>>>>>>>>     print(object.size(all.Rle), units = "auto")
>>>>>>>>     mat <- do.call(cbind, lapply(all.Rle, function(xx) {
>>>>>>>>         sapply(xx, mean)
>>>>>>>>     }))
>>>>>>>>     invisible(mat)
>>>>>>>> }
>>>>>>>>
>>>>>>>> system.time({
>>>>>>>>     mat1 <- approach1(all.grs[1:10], all.files[1:5])
>>>>>>>> })
>>>>>>>>
>>>>>>>> 160.9 Kb
>>>>>>>>    user  system elapsed
>>>>>>>>   1.109   0.001   1.113
>>>>>>>>
>>>>>>>> system.time({
>>>>>>>>     mat1 <- approach1(all.grs[1:100], all.files[1:5])
>>>>>>>> })  # less than 4x slower than previous call
>>>>>>>>
>>>>>>>> 3 Mb
>>>>>>>>    user  system elapsed
>>>>>>>>   4.101   0.019   4.291
>>>>>>>>
>>>>>>>> approach2 <- function(ranges, files) {
>>>>>>>>     gf <- GenomicFiles(rowData = ranges, files = files)
>>>>>>>>     MAPPER <- function(range, file, ...) {
>>>>>>>>         library(rtracklayer)
>>>>>>>>         rle <- import(file, which = range, as = "Rle", format = "bw")[range]
>>>>>>>>         mean(rle)
>>>>>>>>     }
>>>>>>>>     sExp <- reduceByRange(gf, MAP = MAPPER, summarize = TRUE, BPPARAM = SerialParam())
>>>>>>>>     sExp
>>>>>>>> }
>>>>>>>>
>>>>>>>> system.time({
>>>>>>>>     mat2 <- approach2(all.grs[1:10], all.files[1:5])
>>>>>>>> })  # 8-9x slower than approach1
>>>>>>>>
>>>>>>>>    user  system elapsed
>>>>>>>>   9.349   0.280   9.581
>>>>>>>>
>>>>>>>> system.time({
>>>>>>>>     mat2 <- approach2(all.grs[1:100], all.files[1:5])
>>>>>>>> })  # 9x slower than previous call, 20x slower than approach1 on same input
>>>>>>>>
>>>>>>>>    user  system elapsed
>>>>>>>>  89.310   0.627  91.044

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel