Val and Martin,
Apologies for the delay.
We realized that the Illumina platinum genome vcf files make a good test
case, assuming you strip out all the info (info=NA when reading it into
R) stuff.
ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
took about ~4.2 hrs to write out, and is about 1.5x the size of the
files we are actually dealing with (~50M ranges vs our ~30M).
Looking forward a new vastly improved writeVcf :).
~G
On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
<lawrence.mich...@gene.com <mailto:lawrence.mich...@gene.com>> wrote:
Yes, it's very clear that the scaling is non-linear, and Gabe has
been experimenting with a chunk-wise + parallel algorithm.
Unfortunately there is some frustrating overhead with the
parallelism. But I'm glad Val is arriving at something quicker.
Michael
On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmor...@fhcrc.org
<mailto:mtmor...@fhcrc.org>> wrote:
On 08/27/2014 11:56 AM, Gabe Becker wrote:
The profiling I attached in my previous email is for 24 geno
fields, as I said,
but our typical usecase involves only ~4-6 fields, and is
faster but still on
the order of dozens of minutes.
I think Val is arriving at a (much) more efficient
implementation, but...
I wanted to share my guess that the poor _scaling_ is because
the garbage collector runs multiple times as the different
strings are pasted together, and has to traverse, in linear
time, increasing numbers of allocated SEXPs. So times scale
approximately quadratically with the number of rows in the VCF
An efficiency is to reduce the number of SEXPs in play by
writing out in chunks -- as each chunk is written, the SEXPs
become available for collection and are re-used. Here's my toy
example
time.R
======
splitIndices <- function (nx, ncl)
{
i <- seq_len(nx)
if (ncl == 0L)
list()
else if (ncl == 1L || nx == 1L)
list(i)
else {
fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
structure(split(i, cut(i, breaks, labels=FALSE)), names
= NULL)
}
}
x = as.character(seq_len(1e7)); y = sample(x)
if (!is.na <http://is.na>(Sys.getenv("SPLIT", NA))) {
idx <- splitIndices(length(x), 20)
system.time(for (i in idx) paste(x[i], y[i], sep=":"))
} else {
system.time(paste(x, y, sep=":"))
}
running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f
time.R the relevant time is
user system elapsed
15.320 0.064 15.381
versus with $ R --no-save --quiet -f time.R it is
user system elapsed
95.360 0.164 95.511
I think this is likely an overall strategy when dealing with
character data -- processing in independent chunks of moderate
(1M?) size (enabling as a consequence parallel evaluation in
modest memory) that are sufficient to benefit from
vectorization, but that do not entail allocation of large
numbers of in-use SEXPs.
Martin
Sorry for the confusion.
~G
On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
<becke...@gene.com <mailto:becke...@gene.com>
<mailto:becke...@gene.com <mailto:becke...@gene.com>>> wrote:
Martin and Val.
I re-ran writeVcf on our (G)VCF data (34790518 ranges,
24 geno fields) with
profiling enabled. The results of summaryRprof for that
run are attached,
though for a variety of reasons they are pretty misleading.
It took over an hour to write (3700+seconds), so it's
definitely a
bottleneck when the data get very large, even if it
isn't for smaller data.
Michael and I both think the culprit is all the pasting
and cbinding that is
going on, and more to the point, that memory for an
internal representation
to be written out is allocated at all. Streaming
across the object, looping
by rows and writing directly to file (e.g. from C)
should be blisteringly
fast in comparison.
~G
On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence
<micha...@gene.com <mailto:micha...@gene.com>
<mailto:micha...@gene.com <mailto:micha...@gene.com>>>
wrote:
Gabe is still testing/profiling, but we'll send
something randomized
along eventually.
On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan
<mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>
<mailto:mtmor...@fhcrc.org
<mailto:mtmor...@fhcrc.org>>> wrote:
I didn't see in the original thread a
reproducible (simulated, I
guess) example, to be explicit about what the
problem is??
Martin
On 08/26/2014 10:47 AM, Michael Lawrence wrote:
My understanding is that the heap
optimization provided marginal
gains, and
that we need to think harder about how to
optimize the all of
the string
manipulation in writeVcf. We either need to
reduce it or reduce its
overhead (i.e., the CHARSXP allocation).
Gabe is doing more tests.
On Tue, Aug 26, 2014 at 9:43 AM, Valerie
Obenchain
<voben...@fhcrc.org
<mailto:voben...@fhcrc.org> <mailto:voben...@fhcrc.org
<mailto:voben...@fhcrc.org>>>
wrote:
Hi Gabe,
Martin responded, and so did Michael,
https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>
<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html
<https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>
It sounded like Michael was ok with
working with/around heap
initialization.
Michael, is that right or should we
still consider this on
the table?
Val
On 08/26/2014 09:34 AM, Gabe Becker wrote:
Val,
Has there been any movement on
this? This remains a
substantial
bottleneck for us when writing very
large VCF files (e.g.
variants+genotypes for whole genome
NGS samples).
I was able to see a ~25% speedup
with 4 cores and an
"optimal" speedup
of ~2x with 10-12 cores for a VCF
with 500k rows using
a very naive
parallelization strategy and no
other changes. I suspect
this could be
improved on quite a bit, or
possibly made irrelevant
with judicious use
of serial C code.
Did you and Martin make any plans
regarding optimizing
writeVcf?
Best
~G
On Tue, Aug 5, 2014 at 2:33 PM,
Valerie Obenchain
<voben...@fhcrc.org
<mailto:voben...@fhcrc.org> <mailto:voben...@fhcrc.org
<mailto:voben...@fhcrc.org>>
<mailto:voben...@fhcrc.org
<mailto:voben...@fhcrc.org> <mailto:voben...@fhcrc.org
<mailto:voben...@fhcrc.org>>>>
wrote:
Hi Michael,
I'm interested in working on
this. I'll discuss
with Martin next
week when we're both back in
the office.
Val
On 08/05/14 07:46, Michael
Lawrence wrote:
Hi guys (Val, Martin, Herve):
Anyone have an itch for
optimization? The
writeVcf function is
currently a
bottleneck in our WGS
genotyping pipeline. For
a typical 50
million row
gVCF, it was taking 2.25
hours prior to
yesterday's improvements
(pasteCollapseRows) that
brought it down to
about 1 hour, which
is still
too long by my standards
(> 0). Only takes 3
minutes to call the
genotypes
(and associated
likelihoods etc) from the
variant calls (using
80 cores and
450 GB RAM on one node),
so the output is an
issue. Profiling
suggests that
the running time scales
non-linearly in the
number of rows.
Digging a little deeper,
it seems to be
something with R's
string/memory
allocation. Below,
pasting 1 million strings
takes 6 seconds, but
10
million strings takes
over 2 minutes. It gets
way worse with 50
million. I
suspect it has something
to do with R's string
hash table.
set.seed(1000)
end <- sample(1e8, 1e6)
system.time(paste0("END",
"=", end))
user system elapsed
6.396 0.028 6.420
end <- sample(1e8, 1e7)
system.time(paste0("END",
"=", end))
user system elapsed
134.714 0.352 134.978
Indeed, even this takes a
long time (in a
fresh session):
set.seed(1000)
end <- sample(1e8, 1e6)
end <- sample(1e8, 1e7)
system.time(as.character(end))
user system elapsed
57.224 0.156 57.366
But running it a second
time is faster (about
what one would
expect?):
system.time(levels <-
as.character(end))
user system elapsed
23.582 0.021 23.589
I did some simple
profiling of R to find that
the resizing of
the string
hash table is not a
significant component of
the time. So maybe
something
to do with the R heap/gc?
No time right now to
go deeper. But I
know Martin
likes this sort of thing ;)
Michael
[[alternative
HTML version deleted]]
_____________________________________________________
Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
<mailto:Bioc-devel@r-project.__org
<mailto:Bioc-devel@r-project.org>>
<mailto:Bioc-devel@r-project.
<mailto:Bioc-devel@r-project.>____org
<mailto:Bioc-devel@r-project.__org
<mailto:Bioc-devel@r-project.org>>>
mailing list
https://stat.ethz.ch/mailman/______listinfo/bioc-devel
<https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
<https://stat.ethz.ch/mailman/____listinfo/bioc-devel
<https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
<https://stat.ethz.ch/mailman/____listinfo/bioc-devel
<https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
<https://stat.ethz.ch/mailman/__listinfo/bioc-devel
<https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>
_____________________________________________________
Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
<mailto:Bioc-devel@r-project.__org
<mailto:Bioc-devel@r-project.org>>
<mailto:Bioc-devel@r-project.
<mailto:Bioc-devel@r-project.>____org
<mailto:Bioc-devel@r-project.__org
<mailto:Bioc-devel@r-project.org>>> mailing
list
https://stat.ethz.ch/mailman/______listinfo/bioc-devel
<https://stat.ethz.ch/mailman/____listinfo/bioc-devel>
<https://stat.ethz.ch/mailman/____listinfo/bioc-devel
<https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>
<https://stat.ethz.ch/mailman/____listinfo/bioc-devel
<https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
<https://stat.ethz.ch/mailman/__listinfo/bioc-devel
<https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>
--
Computational Biologist
Genentech Research
[[alternative HTML version deleted]]
___________________________________________________
Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
<mailto:Bioc-devel@r-project.__org
<mailto:Bioc-devel@r-project.org>>
mailing list
https://stat.ethz.ch/mailman/____listinfo/bioc-devel
<https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
<https://stat.ethz.ch/mailman/__listinfo/bioc-devel
<https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
--
Computational Biology / Fred Hutchinson Cancer
Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
<tel:%28206%29%20667-2793> <tel:%28206%29%20667-2793>
--
Computational Biologist
Genentech Research
--
Computational Biologist
Genentech Research
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
--
Computational Biologist
Genentech Research