Valerie,
Apologies for this taking much longer than it should have. The changes in
Bioc-devel have wreaked havoc on the code we use to generate and process
the data we need to write out, but the fault is mine for not getting on top
of it sooner.
I'm not seeing the speed you mentioned above in
Hi Gabe,
It would help to have a common baseline. Please show the output for
writing the Illumina file you sent originally:
library(VariantAnnotation)
fl <- "NA12877_S1.genome.vcf.gz"
vcf <- readVcf(fl, "", param=ScanVcfParam(info=NA))
dim(vcf)
gc()
print(system.time(writeVcf(vcf, tempfile(
Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the
process of moving SimpleList and DataFrame from IRanges to S4Vectors;
finished up today I think. Anyhow, if you get VariantAnnotation from svn
you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe
rtracklayer).
Hi Val,
On 09/09/2014 02:12 PM, Valerie Obenchain wrote:
Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the
process of moving SimpleList and DataFrame from IRanges to S4Vectors;
finished up today I think.
I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check'
Hi Herve,
This unit test passes in VA 1.11.30 (the current version in svn). It was
related to writeVcf(), not the IRanges/S4Vector stuff. My fault, not yours.
Val
On 09/09/2014 02:47 PM, Hervé Pagès wrote:
Hi Val,
On 09/09/2014 02:12 PM, Valerie Obenchain wrote:
Writing 'list' data has
Ah, ok. I should have 'svn up' and re-tried 'R CMD check' before
reporting this. Thanks and sorry for the noise.
H.
On 09/09/2014 03:09 PM, Valerie Obenchain wrote:
Hi Herve,
This unit test passes in VA 1.11.30 (the current version in svn). It was
related to writeVcf(), not the
The new writeVcf code is in 1.11.28.
Using the Illumina file you suggested, geno fields only, writing now
takes about 17 minutes.
hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT
Val,
That is great. I'll check this out and test it on our end.
~G
On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain voben...@fhcrc.org
wrote:
The new writeVcf code is in 1.11.28.
Using the Illumina file you suggested, geno fields only, writing now takes
about 17 minutes.
hdr
class:
This approach, writing in chunks, is the same Herve and I used for writing
FASTA in the Biostrings package, although I see that Herve has now replaced
the R implementation with a C implementation. I similarly found an
absolutely huge speed up when writing genomes, by chunking.
Best,
Kasper
On
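A minimal sketch of the chunked-writing idea Kasper describes, in base R. This is not the VariantAnnotation or Biostrings implementation; `write_in_chunks` and `chunk_size` are hypothetical names used only for illustration, and the records are assumed to be a plain data.frame rather than a VCF object. The point is that only one chunk of formatted text is held in memory at a time.

```r
# Hypothetical helper (not part of any package): format and write a
# data.frame as tab-delimited lines, one chunk of rows at a time.
write_in_chunks <- function(records, path, chunk_size = 1e5) {
  con <- file(path, open = "wt")
  on.exit(close(con))
  n <- nrow(records)
  if (n == 0L) return(invisible(0L))
  for (s in seq(1L, n, by = chunk_size)) {
    e <- min(s + chunk_size - 1L, n)
    chunk <- records[s:e, , drop = FALSE]
    # Paste the columns of this chunk row-wise into text lines.
    lines <- do.call(paste, c(unname(as.list(chunk)), sep = "\t"))
    writeLines(lines, con)
  }
  invisible(n)
}

# Toy data standing in for VCF records.
df <- data.frame(chrom = "chr1", pos = 1:10, gt = "0/1")
out <- tempfile(fileext = ".txt")
write_in_chunks(df, out, chunk_size = 3)
```

A reasonable chunk size depends on the data; the value here is arbitrary.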
Val and Martin,
Apologies for the delay.
We realized that the Illumina platinum genome vcf files make a good test
case, assuming you strip out all the info (info=NA when reading it into R)
stuff.
ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took
about 4.2 hrs to write
Thanks Gabe. I should have something for you on Monday.
Val
On 09/04/2014 01:56 PM, Gabe Becker wrote:
Val and Martin,
Apologies for the delay.
We realized that the Illumina platinum genome vcf files make a good test
case, assuming you strip out all the info (info=NA when reading it into
R)
Yes, it's very clear that the scaling is non-linear, and Gabe has been
experimenting with a chunk-wise + parallel algorithm. Unfortunately there
is some frustrating overhead with the parallelism. But I'm glad Val is
arriving at something quicker.
Michael
On Tue, Sep 2, 2014 at 1:33 PM, Martin
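A rough sketch of what a chunk-wise + parallel approach could look like, assuming the formatting step is the expensive part and the write itself stays sequential. This is not Gabe's code; `format_chunk` and `write_parallel_chunks` are hypothetical names, and the input is assumed to be a plain data.frame. Shipping chunks to workers and collecting the formatted text is one source of the overhead Michael mentions.

```r
library(parallel)

# Hypothetical helper: turn one chunk of rows into tab-delimited lines.
format_chunk <- function(chunk) {
  do.call(paste, c(unname(as.list(chunk)), sep = "\t"))
}

# Hypothetical helper: format chunks in parallel, then write them in order.
write_parallel_chunks <- function(records, path, chunk_size = 1e5, cores = 2L) {
  n <- nrow(records)
  if (n == 0L) {
    file.create(path)
    return(invisible(0L))
  }
  # Row indices grouped into consecutive chunks.
  idx <- split(seq_len(n), ceiling(seq_len(n) / chunk_size))
  # mclapply() forks; on Windows it only supports mc.cores = 1.
  if (.Platform$OS.type == "windows") cores <- 1L
  formatted <- mclapply(
    idx,
    function(i) format_chunk(records[i, , drop = FALSE]),
    mc.cores = cores
  )
  con <- file(path, open = "wt")
  on.exit(close(con))
  for (lines in formatted) writeLines(lines, con)
  invisible(n)
}

# Toy data standing in for VCF records.
df <- data.frame(chrom = "chr1", pos = 1:10, gt = "0/1")
out <- tempfile(fileext = ".txt")
write_parallel_chunks(df, out, chunk_size = 4, cores = 2L)
```

Note that all formatted chunks are held in memory before writing, so this trades memory for speed compared with a purely sequential chunked writer.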
Try to run it through the lineprof package for memory profiling; I have
found this to be very helpful.
Here is an old blog post I wrote about it
http://www.hansenlab.org/rstats/2014/01/30/lineprof/
Kasper
On Wed, Aug 27, 2014 at 2:56 PM, Gabe Becker becker.g...@gene.com wrote:
The
Martin and Val.
I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with
profiling enabled. The results of summaryRprof for that run are attached,
though for a variety of reasons they are pretty misleading.
It took over an hour to write (3700+ seconds), so it's definitely a
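For readers who want to reproduce this kind of profile, here is a minimal sketch of the `Rprof()` / `summaryRprof()` workflow from base R's utils package. The workload `slow_paste` is a made-up stand-in, not writeVcf, and the one-second loop is only there so the sampler collects enough samples.

```r
# Stand-in workload (hypothetical): grow a character vector one
# element at a time, which is deliberately inefficient.
slow_paste <- function(n) {
  out <- character(0)
  for (i in seq_len(n)) out <- c(out, paste(i, i + 1L, sep = "\t"))
  out
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)   # start sampling the call stack

# Run the workload repeatedly for about one second of elapsed time.
t0 <- proc.time()[["elapsed"]]
while (proc.time()[["elapsed"]] - t0 < 1) invisible(slow_paste(2000))

Rprof(NULL)                          # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)                   # time spent in each function itself
```

`by.self` attributes time to the function on top of the stack, while `by.total` includes time spent in callees; reading only one of the two tables can give a skewed picture.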
The profiling I attached in my previous email is for 24 geno fields, as I
said, but our typical use case involves only ~4-6 fields, and is faster but
still on the order of dozens of minutes.
Sorry for the confusion.
~G
On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker becke...@gene.com wrote:
Hi Gabe,
Martin responded, and so did Michael,
https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
It sounded like Michael was ok with working with/around heap
initialization.
Michael, is that right or should we still consider this on the table?
Val
On 08/26/2014 09:34 AM,
I thought it might come down to the heap initialization. We'll work with
that.
On Wed, Aug 13, 2014 at 4:42 PM, Martin Morgan mtmor...@fhcrc.org wrote:
On 08/05/2014 07:46 AM, Michael Lawrence wrote:
Hi guys (Val, Martin, Herve):
Anyone have an itch for optimization? The writeVcf function
On 08/05/2014 07:46 AM, Michael Lawrence wrote:
Hi guys (Val, Martin, Herve):
Anyone have an itch for optimization? The writeVcf function is currently a
bottleneck in our WGS genotyping pipeline. For a typical 50 million row
gVCF, it was taking 2.25 hours prior to yesterday's improvements
Hi guys (Val, Martin, Herve):
Anyone have an itch for optimization? The writeVcf function is currently a
bottleneck in our WGS genotyping pipeline. For a typical 50 million row
gVCF, it was taking 2.25 hours prior to yesterday's improvements
(pasteCollapseRows) that brought it down to about 1