This approach, writing in chunks, is the same Herve and I used for writing FASTA in the Biostrings package, although I see that Herve has now replaced the R implementation with a C implementation. I similarly found an absolutely huge speed up when writing genomes, by chunking.
Best, Kasper On Tue, Sep 2, 2014 at 4:33 PM, Martin Morgan <mtmor...@fhcrc.org> wrote: > On 08/27/2014 11:56 AM, Gabe Becker wrote: > >> The profiling I attached in my previous email is for 24 geno fields, as I >> said, >> but our typical usecase involves only ~4-6 fields, and is faster but >> still on >> the order of dozens of minutes. >> > > I think Val is arriving at a (much) more efficient implementation, but... > > I wanted to share my guess that the poor _scaling_ is because the garbage > collector runs multiple times as the different strings are pasted together, > and has to traverse, in linear time, increasing numbers of allocated SEXPs. > So times scale approximately quadratically with the number of rows in the > VCF > > An efficiency is to reduce the number of SEXPs in play by writing out in > chunks -- as each chunk is written, the SEXPs become available for > collection and are re-used. Here's my toy example > > time.R > ====== > splitIndices <- function (nx, ncl) > { > i <- seq_len(nx) > if (ncl == 0L) > list() > else if (ncl == 1L || nx == 1L) > list(i) > else { > fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl) > breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L) > structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL) > } > } > > x = as.character(seq_len(1e7)); y = sample(x) > if (!is.na(Sys.getenv("SPLIT", NA))) { > idx <- splitIndices(length(x), 20) > system.time(for (i in idx) paste(x[i], y[i], sep=":")) > } else { > system.time(paste(x, y, sep=":")) > } > > > running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the > relevant time is > > user system elapsed > 15.320 0.064 15.381 > > versus with $ R --no-save --quiet -f time.R it is > > user system elapsed > 95.360 0.164 95.511 > > I think this is likely an overall strategy when dealing with character > data -- processing in independent chunks of moderate (1M?) size (enabling > as a consequence parallel evaluation in modest memory) that are sufficient > to benefit from vectorization, but that do not entail allocation of large > numbers of in-use SEXPs. > > Martin > > >> Sorry for the confusion. >> ~G >> >> >> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <becke...@gene.com >> <mailto:becke...@gene.com>> wrote: >> >> Martin and Val. >> >> I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno >> fields) with >> profiling enabled. The results of summaryRprof for that run are >> attached, >> though for a variety of reasons they are pretty misleading. >> >> It took over an hour to write (3700+seconds), so it's definitely a >> bottleneck when the data get very large, even if it isn't for smaller >> data. >> >> Michael and I both think the culprit is all the pasting and cbinding >> that is >> going on, and more to the point, that memory for an internal >> representation >> to be written out is allocated at all. Streaming across the object, >> looping >> by rows and writing directly to file (e.g. from C) should be >> blisteringly >> fast in comparison. >> >> ~G >> >> >> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <micha...@gene.com >> <mailto:micha...@gene.com>> wrote: >> >> Gabe is still testing/profiling, but we'll send something >> randomized >> along eventually. >> >> >> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan < >> mtmor...@fhcrc.org >> <mailto:mtmor...@fhcrc.org>> wrote: >> >> I didn't see in the original thread a reproducible >> (simulated, I >> guess) example, to be explicit about what the problem is?? >> >> Martin >> >> >> On 08/26/2014 10:47 AM, Michael Lawrence wrote: >> >> My understanding is that the heap optimization provided >> marginal >> gains, and >> that we need to think harder about how to optimize the >> all of >> the string >> manipulation in writeVcf. We either need to reduce it or >> reduce its >> overhead (i.e., the CHARSXP allocation). Gabe is doing >> more tests. >> >> >> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain >> <voben...@fhcrc.org <mailto:voben...@fhcrc.org>> >> wrote: >> >> Hi Gabe, >> >> Martin responded, and so did Michael, >> >> https://stat.ethz.ch/__pipermail/bioc-devel/2014-__ >> August/006082.html >> >> <https://stat.ethz.ch/pipermail/bioc-devel/2014- >> August/006082.html> >> >> It sounded like Michael was ok with working >> with/around heap >> initialization. >> >> Michael, is that right or should we still consider >> this on >> the table? >> >> >> Val >> >> >> On 08/26/2014 09:34 AM, Gabe Becker wrote: >> >> Val, >> >> Has there been any movement on this? This remains >> a >> substantial >> bottleneck for us when writing very large VCF >> files (e.g. >> variants+genotypes for whole genome NGS samples). >> >> I was able to see a ~25% speedup with 4 cores >> and an >> "optimal" speedup >> of ~2x with 10-12 cores for a VCF with 500k rows >> using >> a very naive >> parallelization strategy and no other changes. I >> suspect >> this could be >> improved on quite a bit, or possibly made >> irrelevant >> with judicious use >> of serial C code. >> >> Did you and Martin make any plans regarding >> optimizing >> writeVcf? >> >> Best >> ~G >> >> >> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain >> <voben...@fhcrc.org <mailto:voben...@fhcrc.org> >> <mailto:voben...@fhcrc.org <mailto: >> voben...@fhcrc.org>>> >> >> wrote: >> >> Hi Michael, >> >> I'm interested in working on this. I'll >> discuss >> with Martin next >> week when we're both back in the office. >> >> Val >> >> >> >> >> >> On 08/05/14 07:46, Michael Lawrence wrote: >> >> Hi guys (Val, Martin, Herve): >> >> Anyone have an itch for optimization? >> The >> writeVcf function is >> currently a >> bottleneck in our WGS genotyping >> pipeline. For >> a typical 50 >> million row >> gVCF, it was taking 2.25 hours prior to >> yesterday's improvements >> (pasteCollapseRows) that brought it >> down to >> about 1 hour, which >> is still >> too long by my standards (> 0). Only >> takes 3 >> minutes to call the >> genotypes >> (and associated likelihoods etc) from >> the >> variant calls (using >> 80 cores and >> 450 GB RAM on one node), so the output >> is an >> issue. Profiling >> suggests that >> the running time scales non-linearly in >> the >> number of rows. >> >> Digging a little deeper, it seems to be >> something with R's >> string/memory >> allocation. Below, pasting 1 million >> strings >> takes 6 seconds, but >> 10 >> million strings takes over 2 minutes. >> It gets >> way worse with 50 >> million. I >> suspect it has something to do with R's >> string >> hash table. >> >> set.seed(1000) >> end <- sample(1e8, 1e6) >> system.time(paste0("END", "=", end)) >> user system elapsed >> 6.396 0.028 6.420 >> >> end <- sample(1e8, 1e7) >> system.time(paste0("END", "=", end)) >> user system elapsed >> 134.714 0.352 134.978 >> >> Indeed, even this takes a long time (in >> a >> fresh session): >> >> set.seed(1000) >> end <- sample(1e8, 1e6) >> end <- sample(1e8, 1e7) >> system.time(as.character(end)) >> user system elapsed >> 57.224 0.156 57.366 >> >> But running it a second time is faster >> (about >> what one would >> expect?): >> >> system.time(levels <- as.character(end)) >> user system elapsed >> 23.582 0.021 23.589 >> >> I did some simple profiling of R to >> find that >> the resizing of >> the string >> hash table is not a significant >> component of >> the time. So maybe >> something >> to do with the R heap/gc? No time right >> now to >> go deeper. But I >> know Martin >> likes this sort of thing ;) >> >> Michael >> >> [[alternative HTML version >> deleted]] >> >> >> ______________________________ >> _____________________ >> Bioc-devel@r-project.org >> <mailto:Bioc-devel@r-project.org> >> <mailto:Bioc-devel@r-project.__org >> <mailto:Bioc-devel@r-project.org>> >> mailing list >> https://stat.ethz.ch/mailman/_ >> ___listinfo/bioc-devel >> <https://stat.ethz.ch/mailman/ >> __listinfo/bioc-devel> >> >> <https://stat.ethz.ch/mailman/ >> __listinfo/bioc-devel >> <https://stat.ethz.ch/mailman/listinfo/bioc-devel >> >> >> >> >> ______________________________ >> _____________________ >> Bioc-devel@r-project.org >> <mailto:Bioc-devel@r-project.org> >> <mailto:Bioc-devel@r-project.__org >> <mailto:Bioc-devel@r-project.org>> mailing >> list >> https://stat.ethz.ch/mailman/_ >> ___listinfo/bioc-devel >> <https://stat.ethz.ch/mailman/ >> __listinfo/bioc-devel> >> >> >> <https://stat.ethz.ch/mailman/ >> __listinfo/bioc-devel >> <https://stat.ethz.ch/mailman/listinfo/bioc-devel >> >> >> >> >> >> >> -- >> Computational Biologist >> Genentech Research >> >> >> >> >> >> [[alternative HTML version deleted]] >> >> _________________________________________________ >> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org >> > >> mailing list >> https://stat.ethz.ch/mailman/__listinfo/bioc-devel >> <https://stat.ethz.ch/mailman/listinfo/bioc-devel> >> >> >> >> -- >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 <tel:%28206%29%20667-2793> >> >> >> >> >> >> -- >> Computational Biologist >> Genentech Research >> >> >> >> >> -- >> Computational Biologist >> Genentech Research >> > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel > [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel