Val, That is great. I'll check this out and test it on our end.
~G On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <voben...@fhcrc.org> wrote: > The new writeVcf code is in 1.11.28. > > Using the illumina file you suggested, geno fields only, writing now takes > about 17 minutes. > > > hdr > class: VCFHeader > samples(1): NA12877 > meta(6): fileformat ApplyRecalibration ... reference source > fixed(1): FILTER > info(22): AC AF ... culprit set > geno(8): GT GQX ... PL VF > > > param = ScanVcfParam(info=NA) > > vcf = readVcf(fl, "", param=param) > > dim(vcf) > [1] 51612762 1 > > > system.time(writeVcf(vcf, "out.vcf")) > user system elapsed > 971.032 6.568 1004.593 > > In 1.11.28, parsing of geno data was moved to C. If this didn't speed > things up enough we were planning to implement 'chunking' through the VCF > and/or move the parsing of info to C, however, it looks like geno was the > bottleneck. > > I've tested a number of samples/fields combinations in files with >= .5 > million rows and the improvement over writeVcf() in release is ~ 90%. > > Valerie > > > > > On 09/04/14 15:28, Valerie Obenchain wrote: > >> Thanks Gabe. I should have something for you on Monday. >> >> Val >> >> >> On 09/04/2014 01:56 PM, Gabe Becker wrote: >> >>> Val and Martin, >>> >>> Apologies for the delay. >>> >>> We realized that the Illumina platinum genome vcf files make a good test >>> case, assuming you strip out all the info (info=NA when reading it into >>> R) stuff. >>> >>> ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz >>> took about ~4.2 hrs to write out, and is about 1.5x the size of the >>> files we are actually dealing with (~50M ranges vs our ~30M). >>> >>> Looking forward a new vastly improved writeVcf :). >>> >>> ~G >>> >>> >>> On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence >>> <lawrence.mich...@gene.com <mailto:lawrence.mich...@gene.com>> wrote: >>> >>> Yes, it's very clear that the scaling is non-linear, and Gabe has >>> been experimenting with a chunk-wise + parallel algorithm. >>> Unfortunately there is some frustrating overhead with the >>> parallelism. But I'm glad Val is arriving at something quicker. >>> >>> Michael >>> >>> >>> On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmor...@fhcrc.org >>> <mailto:mtmor...@fhcrc.org>> wrote: >>> >>> On 08/27/2014 11:56 AM, Gabe Becker wrote: >>> >>> The profiling I attached in my previous email is for 24 geno >>> fields, as I said, >>> but our typical usecase involves only ~4-6 fields, and is >>> faster but still on >>> the order of dozens of minutes. >>> >>> >>> I think Val is arriving at a (much) more efficient >>> implementation, but... >>> >>> I wanted to share my guess that the poor _scaling_ is because >>> the garbage collector runs multiple times as the different >>> strings are pasted together, and has to traverse, in linear >>> time, increasing numbers of allocated SEXPs. So times scale >>> approximately quadratically with the number of rows in the VCF >>> >>> An efficiency is to reduce the number of SEXPs in play by >>> writing out in chunks -- as each chunk is written, the SEXPs >>> become available for collection and are re-used. Here's my toy >>> example >>> >>> time.R >>> ====== >>> splitIndices <- function (nx, ncl) >>> { >>> i <- seq_len(nx) >>> if (ncl == 0L) >>> list() >>> else if (ncl == 1L || nx == 1L) >>> list(i) >>> else { >>> fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl) >>> breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L) >>> structure(split(i, cut(i, breaks, labels=FALSE)), names >>> = NULL) >>> } >>> } >>> >>> x = as.character(seq_len(1e7)); y = sample(x) >>> if (!is.na <http://is.na>(Sys.getenv("SPLIT", NA))) { >>> idx <- splitIndices(length(x), 20) >>> system.time(for (i in idx) paste(x[i], y[i], sep=":")) >>> } else { >>> system.time(paste(x, y, sep=":")) >>> } >>> >>> >>> running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f >>> time.R the relevant time is >>> >>> user system elapsed >>> 15.320 0.064 15.381 >>> >>> versus with $ R --no-save --quiet -f time.R it is >>> >>> user system elapsed >>> 95.360 0.164 95.511 >>> >>> I think this is likely an overall strategy when dealing with >>> character data -- processing in independent chunks of moderate >>> (1M?) size (enabling as a consequence parallel evaluation in >>> modest memory) that are sufficient to benefit from >>> vectorization, but that do not entail allocation of large >>> numbers of in-use SEXPs. >>> >>> Martin >>> >>> >>> Sorry for the confusion. >>> ~G >>> >>> >>> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker >>> <becke...@gene.com <mailto:becke...@gene.com> >>> <mailto:becke...@gene.com <mailto:becke...@gene.com>>> >>> wrote: >>> >>> Martin and Val. >>> >>> I re-ran writeVcf on our (G)VCF data (34790518 ranges, >>> 24 geno fields) with >>> profiling enabled. The results of summaryRprof for that >>> run are attached, >>> though for a variety of reasons they are pretty >>> misleading. >>> >>> It took over an hour to write (3700+seconds), so it's >>> definitely a >>> bottleneck when the data get very large, even if it >>> isn't for smaller data. >>> >>> Michael and I both think the culprit is all the pasting >>> and cbinding that is >>> going on, and more to the point, that memory for an >>> internal representation >>> to be written out is allocated at all. Streaming >>> across the object, looping >>> by rows and writing directly to file (e.g. from C) >>> should be blisteringly >>> fast in comparison. >>> >>> ~G >>> >>> >>> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence >>> <micha...@gene.com <mailto:micha...@gene.com> >>> <mailto:micha...@gene.com <mailto:micha...@gene.com>>> >>> wrote: >>> >>> Gabe is still testing/profiling, but we'll send >>> something randomized >>> along eventually. >>> >>> >>> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan >>> <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org> >>> <mailto:mtmor...@fhcrc.org >>> <mailto:mtmor...@fhcrc.org>>> wrote: >>> >>> I didn't see in the original thread a >>> reproducible (simulated, I >>> guess) example, to be explicit about what the >>> problem is?? >>> >>> Martin >>> >>> >>> On 08/26/2014 10:47 AM, Michael Lawrence wrote: >>> >>> My understanding is that the heap >>> optimization provided marginal >>> gains, and >>> that we need to think harder about how to >>> optimize the all of >>> the string >>> manipulation in writeVcf. We either need to >>> reduce it or reduce its >>> overhead (i.e., the CHARSXP allocation). >>> Gabe is doing more tests. >>> >>> >>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie >>> Obenchain >>> <voben...@fhcrc.org >>> <mailto:voben...@fhcrc.org> <mailto:voben...@fhcrc.org >>> <mailto:voben...@fhcrc.org>>> >>> >>> wrote: >>> >>> Hi Gabe, >>> >>> Martin responded, and so did Michael, >>> >>> >>> https://stat.ethz.ch/____pipermail/bioc-devel/2014-____ >>> August/006082.html >>> >>> <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html> >>> >>> >>> >>> <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html >>> >>> <https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>> >>> >>> It sounded like Michael was ok with >>> working with/around heap >>> initialization. >>> >>> Michael, is that right or should we >>> still consider this on >>> the table? >>> >>> >>> Val >>> >>> >>> On 08/26/2014 09:34 AM, Gabe Becker >>> wrote: >>> >>> Val, >>> >>> Has there been any movement on >>> this? This remains a >>> substantial >>> bottleneck for us when writing very >>> large VCF files (e.g. >>> variants+genotypes for whole genome >>> NGS samples). >>> >>> I was able to see a ~25% speedup >>> with 4 cores and an >>> "optimal" speedup >>> of ~2x with 10-12 cores for a VCF >>> with 500k rows using >>> a very naive >>> parallelization strategy and no >>> other changes. I suspect >>> this could be >>> improved on quite a bit, or >>> possibly made irrelevant >>> with judicious use >>> of serial C code. >>> >>> Did you and Martin make any plans >>> regarding optimizing >>> writeVcf? >>> >>> Best >>> ~G >>> >>> >>> On Tue, Aug 5, 2014 at 2:33 PM, >>> Valerie Obenchain >>> <voben...@fhcrc.org >>> <mailto:voben...@fhcrc.org> <mailto:voben...@fhcrc.org >>> <mailto:voben...@fhcrc.org>> >>> <mailto:voben...@fhcrc.org >>> <mailto:voben...@fhcrc.org> <mailto:voben...@fhcrc.org >>> <mailto:voben...@fhcrc.org>>>> >>> >>> wrote: >>> >>> Hi Michael, >>> >>> I'm interested in working on >>> this. I'll discuss >>> with Martin next >>> week when we're both back in >>> the office. >>> >>> Val >>> >>> >>> >>> >>> >>> On 08/05/14 07:46, Michael >>> Lawrence wrote: >>> >>> Hi guys (Val, Martin, >>> Herve): >>> >>> Anyone have an itch for >>> optimization? The >>> writeVcf function is >>> currently a >>> bottleneck in our WGS >>> genotyping pipeline. For >>> a typical 50 >>> million row >>> gVCF, it was taking 2.25 >>> hours prior to >>> yesterday's improvements >>> (pasteCollapseRows) that >>> brought it down to >>> about 1 hour, which >>> is still >>> too long by my standards >>> (> 0). Only takes 3 >>> minutes to call the >>> genotypes >>> (and associated >>> likelihoods etc) from the >>> variant calls (using >>> 80 cores and >>> 450 GB RAM on one node), >>> so the output is an >>> issue. Profiling >>> suggests that >>> the running time scales >>> non-linearly in the >>> number of rows. >>> >>> Digging a little deeper, >>> it seems to be >>> something with R's >>> string/memory >>> allocation. Below, >>> pasting 1 million strings >>> takes 6 seconds, but >>> 10 >>> million strings takes >>> over 2 minutes. It gets >>> way worse with 50 >>> million. I >>> suspect it has something >>> to do with R's string >>> hash table. >>> >>> set.seed(1000) >>> end <- sample(1e8, 1e6) >>> system.time(paste0("END", >>> "=", end)) >>> user system elapsed >>> 6.396 0.028 6.420 >>> >>> end <- sample(1e8, 1e7) >>> system.time(paste0("END", >>> "=", end)) >>> user system elapsed >>> 134.714 0.352 134.978 >>> >>> Indeed, even this takes a >>> long time (in a >>> fresh session): >>> >>> set.seed(1000) >>> end <- sample(1e8, 1e6) >>> end <- sample(1e8, 1e7) >>> >>> system.time(as.character(end)) >>> user system elapsed >>> 57.224 0.156 57.366 >>> >>> But running it a second >>> time is faster (about >>> what one would >>> expect?): >>> >>> system.time(levels <- >>> as.character(end)) >>> user system elapsed >>> 23.582 0.021 23.589 >>> >>> I did some simple >>> profiling of R to find that >>> the resizing of >>> the string >>> hash table is not a >>> significant component of >>> the time. So maybe >>> something >>> to do with the R heap/gc? >>> No time right now to >>> go deeper. But I >>> know Martin >>> likes this sort of >>> thing ;) >>> >>> Michael >>> >>> [[alternative >>> HTML version deleted]] >>> >>> >>> >>> _____________________________________________________ >>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> >>> <mailto:Bioc-devel@r-project.__org >>> <mailto:Bioc-devel@r-project.org>> >>> <mailto:Bioc-devel@r-project. >>> <mailto:Bioc-devel@r-project.>____org >>> <mailto:Bioc-devel@r-project.__org >>> <mailto:Bioc-devel@r-project.org>>> >>> mailing list >>> >>> https://stat.ethz.ch/mailman/______listinfo/bioc-devel >>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel> >>> >>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel >>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>> >>> >>> >>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel >>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel> >>> >>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel >>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>> >>> >>> >>> >>> _____________________________________________________ >>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> >>> <mailto:Bioc-devel@r-project.__org >>> <mailto:Bioc-devel@r-project.org>> >>> <mailto:Bioc-devel@r-project. >>> <mailto:Bioc-devel@r-project.>____org >>> <mailto:Bioc-devel@r-project.__org >>> <mailto:Bioc-devel@r-project.org>>> mailing >>> list >>> https://stat.ethz.ch/mailman/______listinfo/bioc-devel >>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel> >>> >>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel >>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>> >>> >>> >>> >>> >>> <https://stat.ethz.ch/mailman/____listinfo/bioc-devel >>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel> >>> >>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel >>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>> >>> >>> >>> >>> >>> -- >>> Computational Biologist >>> Genentech Research >>> >>> >>> >>> >>> >>> [[alternative HTML version >>> deleted]] >>> >>> >>> ___________________________________________________ >>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> >>> <mailto:Bioc-devel@r-project.__org >>> <mailto:Bioc-devel@r-project.org>> >>> mailing list >>> https://stat.ethz.ch/mailman/____listinfo/bioc-devel >>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel> >>> >>> <https://stat.ethz.ch/mailman/__listinfo/bioc-devel >>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>> >>> >>> >>> >>> -- >>> Computational Biology / Fred Hutchinson Cancer >>> Research Center >>> 1100 Fairview Ave. N. >>> PO Box 19024 Seattle, WA 98109 >>> >>> Location: Arnold Building M1 B861 >>> Phone: (206) 667-2793 >>> <tel:%28206%29%20667-2793> <tel:%28206%29%20667-2793> >>> >>> >>> >>> >>> >>> >>> -- >>> Computational Biologist >>> Genentech Research >>> >>> >>> >>> >>> -- >>> Computational Biologist >>> Genentech Research >>> >>> >>> >>> -- >>> Computational Biology / Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N. >>> PO Box 19024 Seattle, WA 98109 >>> >>> Location: Arnold Building M1 B861 >>> Phone: (206) 667-2793 <tel:%28206%29%20667-2793> >>> >>> >>> >>> >>> >>> -- >>> Computational Biologist >>> Genentech Research >>> >> >> _______________________________________________ >> Bioc-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> > > -- Computational Biologist Genentech Research [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel