Re: [Bioc-devel] writeVcf performance

2014-09-30 Thread Gabe Becker
Valerie, Apologies for this taking much longer than it should have. The changes in Bioc-devel have wreaked havoc on the code we use to generate and process the data we need to write out, but the fault is mine for not getting on top of it sooner. I'm not seeing the speed you mentioned above in

Re: [Bioc-devel] writeVcf performance

2014-09-30 Thread Valerie Obenchain
Hi Gabe, It would help to have a common baseline. Please show the output for writing the Illumina file you sent originally: library(VariantAnnotation) fl <- "NA12877_S1.genome.vcf.gz" vcf <- readVcf(fl, "", param=ScanVcfParam(info=NA)) dim(vcf) gc() print(system.time(writeVcf(vcf, tempfile(

Re: [Bioc-devel] writeVcf performance

2014-09-09 Thread Valerie Obenchain
Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the process of moving SimpleList and DataFrame from IRanges to S4Vectors; finished up today I think. Anyhow, if you get VariantAnnotation from svn you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe rtracklayer).

Re: [Bioc-devel] writeVcf performance

2014-09-09 Thread Hervé Pagès
Hi Val, On 09/09/2014 02:12 PM, Valerie Obenchain wrote: Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the process of moving SimpleList and DataFrame from IRanges to S4Vectors; finished up today I think. I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check'

Re: [Bioc-devel] writeVcf performance

2014-09-09 Thread Valerie Obenchain
Hi Herve, This unit test passes in VA 1.11.30 (the current version in svn). It was related to writeVcf(), not the IRanges/S4Vector stuff. My fault, not yours. Val On 09/09/2014 02:47 PM, Hervé Pagès wrote: Hi Val, On 09/09/2014 02:12 PM, Valerie Obenchain wrote: Writing 'list' data has

Re: [Bioc-devel] writeVcf performance

2014-09-09 Thread Hervé Pagès
Ah, ok. I should have 'svn up' and re-tried 'R CMD check' before reporting this. Thanks and sorry for the noise. H. On 09/09/2014 03:09 PM, Valerie Obenchain wrote: Hi Herve, This unit test passes in VA 1.11.30 (the current version in svn). It was related to writeVcf(), not the

Re: [Bioc-devel] writeVcf performance

2014-09-08 Thread Valerie Obenchain
The new writeVcf code is in 1.11.28. Using the illumina file you suggested, geno fields only, writing now takes about 17 minutes. hdr class: VCFHeader samples(1): NA12877 meta(6): fileformat ApplyRecalibration ... reference source fixed(1): FILTER info(22): AC AF ... culprit set geno(8): GT

Re: [Bioc-devel] writeVcf performance

2014-09-08 Thread Gabe Becker
Val, That is great. I'll check this out and test it on our end. ~G On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain voben...@fhcrc.org wrote: The new writeVcf code is in 1.11.28. Using the illumina file you suggested, geno fields only, writing now takes about 17 minutes. hdr class:

Re: [Bioc-devel] writeVcf performance

2014-09-05 Thread Kasper Daniel Hansen
This approach, writing in chunks, is the same Herve and I used for writing FASTA in the Biostrings package, although I see that Herve has now replaced the R implementation with a C implementation. I similarly found an absolutely huge speed up when writing genomes, by chunking. Best, Kasper On
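
The chunking idea mentioned above can be sketched in base R. This is only an illustration of the general technique, not the writeVcf or Biostrings implementation: the helper name write_rows_chunked and the chunk_size default are assumptions made for the example. Each chunk of rows is formatted as tab-separated text and written before the next chunk is touched, so the fully formatted character vector never has to exist in memory at once.

```r
# Hypothetical helper (not part of VariantAnnotation or Biostrings):
# format and write a data.frame in fixed-size row chunks.
write_rows_chunked <- function(df, path, chunk_size = 100000L) {
  con <- file(path, open = "wt")
  on.exit(close(con))
  n <- nrow(df)
  if (n == 0L) return(invisible(0L))
  for (start in seq.int(1L, n, by = chunk_size)) {
    end <- min(start + chunk_size - 1L, n)
    chunk <- df[start:end, , drop = FALSE]
    # Paste the columns of this chunk only, one tab-separated line per row.
    lines <- do.call(paste, c(unname(as.list(chunk)), sep = "\t"))
    writeLines(lines, con)
  }
  invisible(n)
}

# Usage: five rows written in chunks of two.
df <- data.frame(a = 1:5, b = letters[1:5], stringsAsFactors = FALSE)
out <- tempfile()
write_rows_chunked(df, out, chunk_size = 2L)
```

The chunk size trades memory for the number of write calls; a suitable value would have to be found by measurement on real data.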

Re: [Bioc-devel] writeVcf performance

2014-09-04 Thread Gabe Becker
Val and Martin, Apologies for the delay. We realized that the Illumina platinum genome vcf files make a good test case, assuming you strip out all the info (info=NA when reading it into R) stuff. ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took about 4.2 hrs to write

Re: [Bioc-devel] writeVcf performance

2014-09-04 Thread Valerie Obenchain
Thanks Gabe. I should have something for you on Monday. Val On 09/04/2014 01:56 PM, Gabe Becker wrote: Val and Martin, Apologies for the delay. We realized that the Illumina platinum genome vcf files make a good test case, assuming you strip out all the info (info=NA when reading it into R)

Re: [Bioc-devel] writeVcf performance

2014-09-02 Thread Michael Lawrence
Yes, it's very clear that the scaling is non-linear, and Gabe has been experimenting with a chunk-wise + parallel algorithm. Unfortunately there is some frustrating overhead with the parallelism. But I'm glad Val is arriving at something quicker. Michael On Tue, Sep 2, 2014 at 1:33 PM, Martin

Re: [Bioc-devel] writeVcf performance

2014-08-29 Thread Kasper Daniel Hansen
Try to run it through the lineprof package for memory profiling; I have found this to be very helpful. Here is an old blog post I wrote about it http://www.hansenlab.org/rstats/2014/01/30/lineprof/ Kasper On Wed, Aug 27, 2014 at 2:56 PM, Gabe Becker becker.g...@gene.com wrote: The

Re: [Bioc-devel] writeVcf performance

2014-08-27 Thread Gabe Becker
Martin and Val. I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with profiling enabled. The results of summaryRprof for that run are attached, though for a variety of reasons they are pretty misleading. It took over an hour to write (3700+ seconds), so it's definitely a
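
For readers unfamiliar with the profiling workflow referred to above, here is a minimal sketch using base R's Rprof() and summaryRprof(). The workload function busy_work is a made-up stand-in, not the writeVcf call from the thread, and the 0.5 second duration and 0.01 second sampling interval are arbitrary choices for the example.

```r
# Stand-in workload (hypothetical): keep the CPU busy for about half a second
# so the sampling profiler has something to record.
busy_work <- function(seconds = 0.5) {
  t0 <- proc.time()[["elapsed"]]
  while (proc.time()[["elapsed"]] - t0 < seconds) {
    paste(seq_len(10000L), collapse = ",")
  }
  invisible(NULL)
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)  # start sampling the call stack
busy_work()
Rprof(NULL)                        # stop profiling

# Summarise time per function; by.self ranks functions by their own time.
prof <- summaryRprof(prof_file)
head(prof$by.self)
```

In a real run the call to busy_work() would be replaced by the code under investigation. As the message above notes, such summaries can be misleading, so they are best read alongside wall-clock timings from system.time().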

Re: [Bioc-devel] writeVcf performance

2014-08-27 Thread Gabe Becker
The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical use case involves only ~4-6 fields, and is faster but still on the order of dozens of minutes. Sorry for the confusion. ~G On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker becke...@gene.com wrote:

Re: [Bioc-devel] writeVcf performance

2014-08-26 Thread Valerie Obenchain
Hi Gabe, Martin responded, and so did Michael, https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html It sounded like Michael was ok with working with/around heap initialization. Michael, is that right or should we still consider this on the table? Val On 08/26/2014 09:34 AM,

Re: [Bioc-devel] writeVcf performance

2014-08-14 Thread Michael Lawrence
I thought it might come down to the heap initialization. We'll work with that. On Wed, Aug 13, 2014 at 4:42 PM, Martin Morgan mtmor...@fhcrc.org wrote: On 08/05/2014 07:46 AM, Michael Lawrence wrote: Hi guys (Val, Martin, Herve): Anyone have an itch for optimization? The writeVcf function

Re: [Bioc-devel] writeVcf performance

2014-08-13 Thread Martin Morgan
On 08/05/2014 07:46 AM, Michael Lawrence wrote: Hi guys (Val, Martin, Herve): Anyone have an itch for optimization? The writeVcf function is currently a bottleneck in our WGS genotyping pipeline. For a typical 50 million row gVCF, it was taking 2.25 hours prior to yesterday's improvements

[Bioc-devel] writeVcf performance

2014-08-05 Thread Michael Lawrence
Hi guys (Val, Martin, Herve): Anyone have an itch for optimization? The writeVcf function is currently a bottleneck in our WGS genotyping pipeline. For a typical 50 million row gVCF, it was taking 2.25 hours prior to yesterday's improvements (pasteCollapseRows) that brought it down to about 1