Try to run it through the lineprof package for memory profiling; I have found this to be very helpful.
Here is an old blog post I wrote about it:
http://www.hansenlab.org/rstats/2014/01/30/lineprof/

Kasper

On Wed, Aug 27, 2014 at 2:56 PM, Gabe Becker <becker.g...@gene.com> wrote:

> The profiling I attached in my previous email is for 24 geno fields, as I
> said, but our typical use case involves only ~4-6 fields, and is faster but
> still on the order of dozens of minutes.
>
> Sorry for the confusion.
> ~G
>
> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <becke...@gene.com> wrote:
>
> > Martin and Val.
> >
> > I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
> > with profiling enabled. The results of summaryRprof for that run are
> > attached, though for a variety of reasons they are pretty misleading.
> >
> > It took over an hour to write (3700+ seconds), so it's definitely a
> > bottleneck when the data get very large, even if it isn't for smaller
> > data.
> >
> > Michael and I both think the culprit is all the pasting and cbinding that
> > is going on, and more to the point, that memory for an internal
> > representation to be written out is allocated at all. Streaming across
> > the object, looping by rows and writing directly to file (e.g. from C)
> > should be blisteringly fast in comparison.
> >
> > ~G
> >
> > On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <micha...@gene.com>
> > wrote:
> >
> >> Gabe is still testing/profiling, but we'll send something randomized
> >> along eventually.
> >>
> >> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmor...@fhcrc.org>
> >> wrote:
> >>
> >>> I didn't see in the original thread a reproducible (simulated, I guess)
> >>> example, to be explicit about what the problem is??
> >>>
> >>> Martin
> >>>
> >>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
> >>>
> >>>> My understanding is that the heap optimization provided marginal
> >>>> gains, and that we need to think harder about how to optimize all of
> >>>> the string manipulation in writeVcf.
> >>>> We either need to reduce it or reduce its overhead (i.e., the CHARSXP
> >>>> allocation). Gabe is doing more tests.
> >>>>
> >>>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <voben...@fhcrc.org>
> >>>> wrote:
> >>>>
> >>>>> Hi Gabe,
> >>>>>
> >>>>> Martin responded, and so did Michael,
> >>>>>
> >>>>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
> >>>>>
> >>>>> It sounded like Michael was ok with working with/around heap
> >>>>> initialization.
> >>>>>
> >>>>> Michael, is that right or should we still consider this on the table?
> >>>>>
> >>>>> Val
> >>>>>
> >>>>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
> >>>>>
> >>>>>> Val,
> >>>>>>
> >>>>>> Has there been any movement on this? This remains a substantial
> >>>>>> bottleneck for us when writing very large VCF files (e.g.
> >>>>>> variants+genotypes for whole genome NGS samples).
> >>>>>>
> >>>>>> I was able to see a ~25% speedup with 4 cores and an "optimal"
> >>>>>> speedup of ~2x with 10-12 cores for a VCF with 500k rows using a
> >>>>>> very naive parallelization strategy and no other changes. I suspect
> >>>>>> this could be improved on quite a bit, or possibly made irrelevant
> >>>>>> with judicious use of serial C code.
> >>>>>>
> >>>>>> Did you and Martin make any plans regarding optimizing writeVcf?
> >>>>>>
> >>>>>> Best
> >>>>>> ~G
> >>>>>>
> >>>>>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
> >>>>>> <voben...@fhcrc.org> wrote:
> >>>>>>
> >>>>>>     Hi Michael,
> >>>>>>
> >>>>>>     I'm interested in working on this. I'll discuss with Martin next
> >>>>>>     week when we're both back in the office.
> >>>>>>
> >>>>>>     Val
> >>>>>>
> >>>>>>     On 08/05/14 07:46, Michael Lawrence wrote:
> >>>>>>
> >>>>>>         Hi guys (Val, Martin, Herve):
> >>>>>>
> >>>>>>         Anyone have an itch for optimization?
> >>>>>>         The writeVcf function is currently a bottleneck in our WGS
> >>>>>>         genotyping pipeline. For a typical 50 million row gVCF, it
> >>>>>>         was taking 2.25 hours prior to yesterday's improvements
> >>>>>>         (pasteCollapseRows) that brought it down to about 1 hour,
> >>>>>>         which is still too long by my standards (> 0). Only takes 3
> >>>>>>         minutes to call the genotypes (and associated likelihoods
> >>>>>>         etc) from the variant calls (using 80 cores and 450 GB RAM
> >>>>>>         on one node), so the output is an issue. Profiling suggests
> >>>>>>         that the running time scales non-linearly in the number of
> >>>>>>         rows.
> >>>>>>
> >>>>>>         Digging a little deeper, it seems to be something with R's
> >>>>>>         string/memory allocation. Below, pasting 1 million strings
> >>>>>>         takes 6 seconds, but 10 million strings takes over 2
> >>>>>>         minutes. It gets way worse with 50 million. I suspect it has
> >>>>>>         something to do with R's string hash table.
> >>>>>>
> >>>>>>         set.seed(1000)
> >>>>>>         end <- sample(1e8, 1e6)
> >>>>>>         system.time(paste0("END", "=", end))
> >>>>>>            user  system elapsed
> >>>>>>           6.396   0.028   6.420
> >>>>>>
> >>>>>>         end <- sample(1e8, 1e7)
> >>>>>>         system.time(paste0("END", "=", end))
> >>>>>>            user  system elapsed
> >>>>>>         134.714   0.352 134.978
> >>>>>>
> >>>>>>         Indeed, even this takes a long time (in a fresh session):
> >>>>>>
> >>>>>>         set.seed(1000)
> >>>>>>         end <- sample(1e8, 1e6)
> >>>>>>         end <- sample(1e8, 1e7)
> >>>>>>         system.time(as.character(end))
> >>>>>>            user  system elapsed
> >>>>>>          57.224   0.156  57.366
> >>>>>>
> >>>>>>         But running it a second time is faster (about what one would
> >>>>>>         expect?):
> >>>>>>
> >>>>>>         system.time(levels <- as.character(end))
> >>>>>>            user  system elapsed
> >>>>>>          23.582   0.021  23.589
> >>>>>>
> >>>>>>         I did some simple profiling of R to find that the resizing of
> >>>>>>         the string hash table is not a significant component of the
> >>>>>>         time. So maybe something to do with the R heap/gc? No time
> >>>>>>         right now to go deeper.
> >>>>>>         But I know Martin likes this sort of thing ;)
> >>>>>>
> >>>>>>         Michael
> >>>>>>
> >>>>>> --
> >>>>>> Computational Biologist
> >>>>>> Genentech Research
> >>>
> >>> --
> >>> Computational Biology / Fred Hutchinson Cancer Research Center
> >>> 1100 Fairview Ave. N.
> >>> PO Box 19024 Seattle, WA 98109
> >>>
> >>> Location: Arnold Building M1 B861
> >>> Phone: (206) 667-2793

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
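
For reference, a rough sketch in plain R (not C) of the chunked-write idea Gabe describes in the quoted thread. The function name write_in_chunks, the chunk_size argument and the temporary output file are all made up for illustration, and this is not what writeVcf does. It still allocates one CHARSXP per string, so at best it bounds the size of the intermediate character vector; it does not remove the allocation cost discussed in the thread.

```r
# Hypothetical helper: format and write `end` a chunk at a time instead of
# building the full character vector in memory before writing.
write_in_chunks <- function(end, path, chunk_size = 1e5) {
  con <- file(path, open = "wt")
  on.exit(close(con))
  n <- length(end)
  if (n == 0L) return(invisible(0L))
  # Start index of each chunk: 1, 1 + chunk_size, 1 + 2 * chunk_size, ...
  starts <- seq(1, n, by = chunk_size)
  for (s in starts) {
    idx <- s:min(s + chunk_size - 1, n)
    # Only this chunk's strings exist as an intermediate vector at once.
    writeLines(paste0("END", "=", end[idx]), con)
  }
  invisible(n)
}

# Small usage example; the sizes are kept tiny on purpose.
set.seed(1000)
end <- sample(1e8, 1e4)
out_file <- tempfile(fileext = ".txt")
write_in_chunks(end, out_file, chunk_size = 999)
```

Wrapping the call in system.time() for increasing lengths of end would show whether chunking changes the scaling on a given machine; no timings are claimed here.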