Thanks. There is an improvement with those options. Not as much as the character creation example, but it does help.
user system elapsed 1048.42 71.16 1120.75 ________________________________________ From: Martin Morgan [mtmor...@fhcrc.org] Sent: Friday, 2 August 2013 9:24 PM To: Hervé Pagès Cc: Dario Strbenac; bioc-devel@r-project.org Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs On 07/31/2013 04:05 PM, Hervé Pagès wrote: > Hi Dario, > > Thanks for providing the file. > > The file has 55780092 records in it and yes it only contains primary > alignments so this is the best situation for the pairing algorithm. > > Here are the timings I get for the readBamGappedAlignmentPairs() > function (renamed readGAlignmentPairsFromBam() in devel) on a 64-bit > Ubuntu server with 384 Gb of RAM: > > ## With BioC release (Rsamtools 1.12.3) > ## ------------------------------------ > ## Timings: > ## - readBamGappedAlignments: 10 min 30 s (A=A1+A2) > ## - scanBam: 6 min (A1) I haven't looked at Dario's data directly, but a surprising bottleneck is the creation of large character vectors $ R --vanilla [...] > i = seq_len(55780092/2) > system.time(as.character(i)) user system elapsed 74.880 0.408 75.451 This can be alleviated by telling R that that you're likely to need lots of memory up front; I alias R so that $ R --min-vsize=2048M --min-nsize=20M --vanilla [...] > i = seq_len(55780092/2) > system.time(as.character(i)) user system elapsed 25.269 0.472 25.796 or $ R_GC_MEM_GROW=3 R --min-vsize=2048M --min-nsize=20M --vanilla [...] > i = seq_len(55780092/2) > system.time(as.character(i)) user system elapsed 12.196 0.464 12.687 These are documented on ?Memory and with > R.version.string [1] "R version 3.0.1 Patched (2013-06-05 r62876)" Martin > ## - other: 4 min 30 s (A2) > ## - makeGappedAlignmentPairs: 5 min 30 s (B=B1+B2) > ## - findMateAlignment: 1 min 30 s (B1) > ## - other: 4 min (B2) > ## Total time: 16 min (A+B) > ## Memory usage: 13 Gb > > ## With BioC devel (Rsamtools 1.13.19) > ## ----------------------------------- > ## Timings: > ## - readGAlignmentsFromBam: 11 min (A=A1+A2) > ## - scanBam: 6 min (A1) > ## - other: 5 min (A2) > ## - makeGAlignmentPairs: 4 min (B=B1+B2) > ## - findMateAlignment: 1 min (B1) > ## - other: 3 min (B2) > ## Total time: 15 min (A+B) > ## Memory usage: 13 Gb > > Indeed, not much difference between release and devel :-/ > > I didn't have the opportunity to do this kind of detailed timing of > the readGAlignmentPairsFromBam() function on such a big file before, > but I find it pretty interesting (I only benchmarked findMateAlignment() > and it was on simulated data). > > In particular I'm quite surprised to discover that the pairing algo > (implemented in findMateAlignment) is actually not the bottleneck > (and it being optimized in BioC devel explains the small difference > between total time in release and total time in devel). > > Just a note on the timings you sent earlier (I'm copying them here > again): > > > system.time(RNAreads <- readBamGappedAlignmentPairs(file)) > user system elapsed > 1852.16 59.00 1939.11 > > system.time(RNAreads <- readBamGappedAlignments(file)) > user system elapsed > 192.80 8.12 214.97 > > Subtracting the 2 times above to infer the time spent for the pairing > is a shortcut that overlooks the following: when > readBamGappedAlignmentPairs() calls readBamGappedAlignments() > internally, it calls it with 'use.names=TRUE', and also with a 'param' > arg that specifies the extra BAM fields, and also with a specific flag > filter. All those things are required for the pairing algo. So a more > sensible comparison is to compare: > > RNAreads <- readBamGappedAlignmentPairs(file) > > versus: > > flag0 <- scanBamFlag(isPaired=TRUE, isUnmappedQuery=FALSE, > hasUnmappedMate=FALSE) > what0 <- c("rname", "strand", "pos", "cigar", "flag", "mrnm", "mpos") > param0 <- ScanBamParam(flag=flag0, what=what0) > RNAreads <- readBamGappedAlignments(file, use.names=TRUE, param=param0) > > This internal call to readBamGappedAlignments() takes more than 70% > of the total time of readBamGappedAlignmentPairs() (11 min / 15 min > with BioC devel on my Linux server). This is a little bit surprising > to me and that's something I'd like to investigate. > > Cheers, > H. > > > On 07/30/2013 05:00 PM, Dario Strbenac wrote: >> Hi. >> >> The files are >> >> http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam >> http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam.bai >> >> While you have them, could you also investigate the granges coercion function >> ? It also took half an hour. >> >> ________________________________________ >> From: Hervé Pagès [hpa...@fhcrc.org] >> Sent: Tuesday, 30 July 2013 3:29 AM >> To: Dario Strbenac >> Cc: bioc-devel@r-project.org >> Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs >> >> On 07/29/2013 12:00 AM, Dario Strbenac wrote: >>> Because I only allowed unique mappings, the ratio is 2. After installing the >>> development package versions, I only got a 10% improvement. >> >> Mmmh that's disappointing. Can you put the file somewhere so I can have >> a look? Thanks. >> >> H. >> >>> >>> user system elapsed >>> 1681.36 65.79 1752.10 >>> >>> ________________________________________ >>> From: Pages, Herve [hpa...@fhcrc.org] >>> Sent: Monday, 29 July 2013 3:29 PM >>> To: Dario Strbenac >>> Cc: bioc-devel@r-project.org >>> Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs >>> >>> Hi Dario, >>> >>> The function was optimized in Bioc-devel. Depending on the number >>> of records in your BAM file (which is more relevant than its size), >>> it should now be between 2x and 5x faster. Please give it a try and >>> let us know. >>> >>> Also keep in mind that the pairing algo will slow down if >>> the "average nb of records per unique QNAME" is 3 or more. >>> This was discussed here last month: >>> https://stat.ethz.ch/pipermail/bioconductor/2013-June/053390.html >>> >>> Cheers, >>> H. >>> >>> >>> ----- Original Message ----- >>> From: "Dario Strbenac" <dstr7...@uni.sydney.edu.au> >>> To: bioc-devel@r-project.org >>> Sent: Sunday, July 28, 2013 9:00:30 PM >>> Subject: [Bioc-devel] Running Time of readBamGappedAlignmentPairs >>> >>> Hello, >>> >>> Could readBamGappedAlignmentPairs be optimised ? It's surprising that it >>> takes an extra 29 minutes to calculate the pairing information, for the >>> example below. The BAM file is 4 GB in size. >>> >>>> system.time(RNAreads <- readBamGappedAlignmentPairs(file)) >>> user system elapsed >>> 1852.16 59.00 1939.11 >>>> system.time(RNAreads <- readBamGappedAlignments(file)) >>> user system elapsed >>> 192.80 8.12 214.97 >>> >>> -------------------------------------- >>> Dario Strbenac >>> PhD Student >>> University of Sydney >>> Camperdown NSW 2050 >>> Australia >>> >>> _______________________________________________ >>> Bioc-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >>> >>> >>> _______________________________________________ >>> Bioc-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >>> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpa...@fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel