Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:
Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome VCF files make a good test
case, assuming you strip out all the INFO fields (info=NA when reading the
file into R).

ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
took ~4.2 hrs to write out, and is about 1.5x the size of the files we
are actually dealing with (~50M ranges vs. our ~30M).
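
For concreteness, stripping INFO on input might look like the sketch
below, assuming VariantAnnotation's readVcf/ScanVcfParam interface (the
genome tag "hg19" is a placeholder, not from this thread):

library(VariantAnnotation)
## sketch: info=NA drops all INFO fields on input, as described above;
## "hg19" is a placeholder genome tag
param <- ScanVcfParam(info = NA)
vcf <- readVcf("NA12877_S1.genome.vcf.gz", genome = "hg19", param = param)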

Looking forward to a new, vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
<lawrence.mich...@gene.com> wrote:

    Yes, it's very clear that the scaling is non-linear, and Gabe has
    been experimenting with a chunk-wise + parallel algorithm.
    Unfortunately there is some frustrating overhead with the
    parallelism. But I'm glad Val is arriving at something quicker.

    Michael


    On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:

        On 08/27/2014 11:56 AM, Gabe Becker wrote:

            The profiling I attached in my previous email is for 24 geno
            fields, as I said, but our typical use case involves only ~4-6
            fields, and is faster but still on the order of dozens of
            minutes.


        I think Val is arriving at a (much) more efficient
        implementation, but...

        I wanted to share my guess that the poor _scaling_ is because the
        garbage collector runs multiple times as the different strings are
        pasted together, and has to traverse, in linear time, increasing
        numbers of allocated SEXPs. So times scale approximately
        quadratically with the number of rows in the VCF.
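
        A quick way to check this guess (a sketch, not one of the
        measurements below) is to ask the collector itself how much time it
        consumes, via gc.time():

        ## sketch: fraction of elapsed paste() time spent in garbage
        ## collection, using the same 10M-element vectors as the toy example
        x <- as.character(seq_len(1e7)); y <- sample(x)
        g0 <- gc.time()
        t1 <- system.time(paste(x, y, sep=":"))
        g1 <- gc.time()
        (g1 - g0)[3] / t1[3]   ## fraction of elapsed time spent in GC

        If that fraction grows with the number of rows, the collector is the
        likely culprit.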

        One way to reduce the number of SEXPs in play is to write out in
        chunks -- as each chunk is written, its SEXPs become available for
        collection and are re-used. Here's my toy example:

        time.R
        ======
        ## partition seq_len(nx) into ncl roughly equal-sized chunks
        ## (essentially parallel::splitIndices)
        splitIndices <- function (nx, ncl)
        {
            i <- seq_len(nx)
            if (ncl == 0L)
                list()
            else if (ncl == 1L || nx == 1L)
                list(i)
            else {
                fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
                breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
                structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
            }
        }

        x = as.character(seq_len(1e7)); y = sample(x)
        if (!is.na(Sys.getenv("SPLIT", NA))) {
            ## chunked: 20 pastes over ~500k elements each
            idx <- splitIndices(length(x), 20)
            system.time(for (i in idx) paste(x[i], y[i], sep=":"))
        } else {
            ## monolithic: one paste over all 10M elements
            system.time(paste(x, y, sep=":"))
        }


        Running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f
        time.R, the relevant time is

            user  system elapsed
          15.320   0.064  15.381

        versus with $ R --no-save --quiet -f time.R it is

            user  system elapsed
          95.360   0.164  95.511

        I think this is likely a good overall strategy when dealing with
        character data -- process in independent chunks of moderate (1M?)
        size, large enough to benefit from vectorization but small enough
        not to entail allocation of large numbers of in-use SEXPs (which,
        as a consequence, also enables parallel evaluation in modest
        memory).
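
        As a rough illustration (a sketch only; pasteChunked, the 1M chunk
        size, and cores=4 are illustrative, and mclapply is fork-based, so
        Unix-only), the splitIndices() chunks above can be pasted in
        parallel and combined afterwards:

        library(parallel)
        ## sketch: paste in ~1M-element chunks; each worker's temporary
        ## SEXPs become collectable as soon as its chunk completes
        pasteChunked <- function(x, y, chunkSize = 1e6, cores = 4L) {
            idx <- splitIndices(length(x), ceiling(length(x) / chunkSize))
            res <- mclapply(idx, function(i) paste(x[i], y[i], sep=":"),
                            mc.cores = cores)
            unlist(res, use.names = FALSE)
        }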

        Martin


            Sorry for the confusion.
            ~G


            On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
            <becke...@gene.com> wrote:

                 Martin and Val.

                 I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24
                 geno fields) with profiling enabled. The results of
                 summaryRprof for that run are attached, though for a variety
                 of reasons they are pretty misleading.

                 It took over an hour to write (3700+ seconds), so it's
                 definitely a bottleneck when the data get very large, even
                 if it isn't for smaller data.

                 Michael and I both think the culprit is all the pasting and
                 cbinding that is going on, and, more to the point, that
                 memory for an internal representation to be written out is
                 allocated at all. Streaming across the object, looping over
                 rows and writing directly to file (e.g. from C), should be
                 blisteringly fast in comparison.
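
                 A rough R-level approximation of that streaming idea (a
                 sketch with illustrative names -- writeChunked, chunkSize --
                 not the proposed C implementation) keeps one connection open
                 and never materializes the full character vector:

                 ## sketch: write a data.frame of VCF fields chunk by chunk
                 writeChunked <- function(df, path, chunkSize = 1e6) {
                     con <- file(path, "wt")
                     on.exit(close(con))
                     idx <- parallel::splitIndices(nrow(df),
                                                   ceiling(nrow(df) / chunkSize))
                     for (i in idx) {
                         cols <- unname(as.list(df[i, , drop = FALSE]))
                         writeLines(do.call(paste, c(cols, sep = "\t")), con)
                         ## this chunk's strings are collectable before the next
                     }
                 }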

                 ~G


                 On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence
                 <micha...@gene.com> wrote:

                     Gabe is still testing/profiling, but we'll send
                     something randomized along eventually.


                     On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan
                     <mtmor...@fhcrc.org> wrote:

                         I didn't see in the original thread a reproducible
                         (simulated, I guess) example, to be explicit about
                         what the problem is?

                         Martin


                         On 08/26/2014 10:47 AM, Michael Lawrence wrote:

                              My understanding is that the heap optimization
                              provided marginal gains, and that we need to
                              think harder about how to optimize all of the
                              string manipulation in writeVcf. We either need
                              to reduce it or reduce its overhead (i.e., the
                              CHARSXP allocation). Gabe is doing more tests.


                              On Tue, Aug 26, 2014 at 9:43 AM, Valerie
                              Obenchain <voben...@fhcrc.org> wrote:

                                 Hi Gabe,

                                  Martin responded, and so did Michael:

                                  https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html

                                  It sounded like Michael was ok with working
                                  with/around heap initialization.

                                  Michael, is that right, or is this still on
                                  the table?


                                 Val


                                 On 08/26/2014 09:34 AM, Gabe Becker wrote:

                                     Val,

                                      Has there been any movement on this?
                                      This remains a substantial bottleneck
                                      for us when writing very large VCF files
                                      (e.g. variants + genotypes for whole
                                      genome NGS samples).

                                      I was able to see a ~25% speedup with 4
                                      cores, and an "optimal" speedup of ~2x
                                      with 10-12 cores, for a VCF with 500k
                                      rows, using a very naive parallelization
                                      strategy and no other changes. I suspect
                                      this could be improved on quite a bit,
                                      or possibly made irrelevant with
                                      judicious use of serial C code.

                                      Did you and Martin make any plans
                                      regarding optimizing writeVcf?

                                     Best
                                     ~G


                                      On Tue, Aug 5, 2014 at 2:33 PM,
                                      Valerie Obenchain <voben...@fhcrc.org>
                                      wrote:

                                           Hi Michael,

                                           I'm interested in working on this.
                                           I'll discuss with Martin next week
                                           when we're both back in the office.

                                           Val





                                           On 08/05/14 07:46, Michael Lawrence wrote:

                                               Hi guys (Val, Martin, Herve):

                                                Anyone have an itch for optimization? The writeVcf
                                                function is currently a bottleneck in our WGS
                                                genotyping pipeline. For a typical 50 million row
                                                gVCF, it was taking 2.25 hours prior to yesterday's
                                                improvements (pasteCollapseRows) that brought it
                                                down to about 1 hour, which is still too long by my
                                                standards (> 0). It only takes 3 minutes to call
                                                the genotypes (and associated likelihoods etc.)
                                                from the variant calls (using 80 cores and 450 GB
                                                RAM on one node), so the output is an issue.
                                                Profiling suggests that the running time scales
                                                non-linearly in the number of rows.

                                                Digging a little deeper, it seems to be something
                                                with R's string/memory allocation. Below, pasting 1
                                                million strings takes 6 seconds, but 10 million
                                                strings takes over 2 minutes. It gets way worse
                                                with 50 million. I suspect it has something to do
                                                with R's string hash table.

                                                set.seed(1000)
                                                end <- sample(1e8, 1e6)
                                                system.time(paste0("END", "=", end))
                                                     user  system elapsed
                                                    6.396   0.028   6.420

                                                end <- sample(1e8, 1e7)
                                                system.time(paste0("END", "=", end))
                                                     user  system elapsed
                                                  134.714   0.352 134.978

                                                Indeed, even this takes a long time (in a fresh
                                                session):

                                                set.seed(1000)
                                                end <- sample(1e8, 1e6)
                                                end <- sample(1e8, 1e7)
                                                system.time(as.character(end))
                                                     user  system elapsed
                                                   57.224   0.156  57.366

                                                But running it a second time is faster (about what
                                                one would expect?):

                                                system.time(levels <- as.character(end))
                                                     user  system elapsed
                                                   23.582   0.021  23.589

                                                I did some simple profiling of R to find that the
                                                resizing of the string hash table is not a
                                                significant component of the time. So maybe
                                                something to do with the R heap/gc? No time right
                                                now to go deeper. But I know Martin likes this sort
                                                of thing ;)

                                               Michael

        --
        Computational Biology / Fred Hutchinson Cancer Research Center
        1100 Fairview Ave. N.
        PO Box 19024 Seattle, WA 98109

        Location: Arnold Building M1 B861
        Phone: (206) 667-2793





--
Computational Biologist
Genentech Research

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
