Thanks, Herve. These are amazing improvements, and it's great to see that the Ranges infrastructure continues to evolve and improve.
Michael On Fri, Dec 5, 2014 at 1:17 PM, Hervé Pagès <hpa...@fredhutch.org> wrote: > Developers, > > In BioC 3.1 the code behind overlap operations like findOverlaps(), > countOverlaps(), subsetByOverlaps(), summarizeOverlaps(), nearest(), > etc... has been refactored and improved. Some highlights on what has > changed: > > - The underlying code used for finding/counting overlaps is now based > on the beautiful Nested Containment List algorithm by Alexander V. > Alekseyenko1 and Christopher J. Lee2 [1]. > > - The old algorithm based on interval trees is still available. The > 'algorithm' argument was added to most overlap-based operations to > let the user choose between the new (algorithm="nclist", the default) > and the old (algorithm="intervaltree") algorithm. Except for the 3 > particular situations mentioned below, choosing one or the other > doesn't affect the output, only the performance. > > - Either the query or subject can be preprocessed with NCList() for > a Ranges object (replacement for IntervalTree()), and GNCList() > for a GenomicRanges object (replacement for GIntervalTree()). > For a one time use, it's not advised to explicitely preprocess the > input. This is because findOverlaps() or countOverlaps() will take > care of it and do a better job at it (that is, they preprocess only > what's needed when it's needed and release memory as they go). > > - GNCList() supports preprocessing of a GenomicRanges object with > ranges located on a circular sequence. > > - With the new algorithm, countOverlaps() on Ranges or GenomicRanges > objects doesn't call findOverlaps() to collect all the hits in a > growing Hits object and count them only at the end. Instead the > counting happens at the C level and the hits are not kept. This > reduces memory usage considerably when there is a lot of hits. > > - Zero-width ranges are now interpreted as insertion points and are > considered to overlap with ranges that contain them. This is the > 1st situation where using 'algorithm="nclist"' or > 'algorithm="intervaltree"' produces different output. > > - When using 'select="arbitrary"', the new algorithm will generally > not select the same hits as the old algorithm. This is the 2nd > situation where using 'algorithm="nclist"' or > 'algorithm="intervaltree"' produces different output. > > - When using the old interval tree algorithm, 'maxgap' has a special > meaning if 'type' is "start", "end", or "within". This is not yet > the case with the new algorithm. That feature seems somewhat useful > though so maybe the new algorithm should also support it? Anyway, > this is the 3rd situation where using 'algorithm="nclist"' or > 'algorithm="intervaltree"' produces different output. > > - Objects preprocessed with NCList() or GNCList() are serializable. > > Performance: > > In BioC 3.1 overlaps operations like findOverlaps() or > summarizeOverlaps() on GRanges and/or GRangesList objects are > typically between 3x and 10x faster than in BioC 3.0 for a medium > size data set (e.g. 25 million reads). Memory usage is also reduced > by 25% or more. These numbers may vary but the bigger the data set, > the better they will be. > > This is with IRanges 2.1.29 and GenomicRanges 1.19.21 (will become > available tomorrow). > > Some remaining tasks: > > (a) Bring back special meaning of 'maxgap' when 'type' is "start", > "end", or "within". > > (b) Update documentation. > > (c) Enable user interrupts. > > (d) GNCLists for prepreocessing a GRangesList object (just a > GNCList inside a CompressedList). > > (e) Use threads (OpenMP) for even better performance (to be > implemented in an add-on package). > > Even though a lot of time and effort was spent in testing the new > code and making sure it won't crash your session, you might run into > problems. Please report. > > Enjoy and let me know if you have questions or comments about this. > > Thanks! > H. > > [1] Alexander V. Alekseyenko1 and Christopher J. Lee2. > Nested Containment List (NCList): a new algorithm for accelerating > interval query of genome alignment and interval databases. > Bioinformatics (2007) 23 (11): 1386-1393. > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpa...@fredhutch.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel > [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel