Re: [Bioc-devel] writeVcf performance
Valerie, Apologies for this taking much longer than it should have. The changes in Bioc-devel have wreaked havoc on the code we use to generate and process the data we need to write out, but the fault is mine for not getting on top of it sooner. I'm not seeing the speed you mentioned above in the latest devel version (1.11.35). It took ~1.5 hrs to write an expanded VCF with 56M rows (print output and sessionInfo() follow). I'll try reading in the Illumina platinum file and writing it back out to see if it is something about our specific vcf object (could ExpandedVCF vs VCF be an issue?).

vcfgeno
*class: ExpandedVCF*
*dim: 50307989 1*
rowData(vcf): GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
info(vcf): DataFrame with 1 column: END
Fields with no header: END
geno(vcf): SimpleList of length 7: AD, DP, FT, GT, GQ, PL, MIN_DP
geno(header(vcf)):
       Number Type    Description
AD     2      Integer Allelic depths (number of reads in each observed al...
DP     1      Integer Total read depth
FT     1      String  Variant filters
GT     1      String  Genotype
GQ     1      Integer Genotype quality
PL     3      Integer Normalized, Phred-scaled likelihoods for genotypes
MIN_DP 1      Integer Minimum DP observed within the GVCF block

sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] VariantCallingPaper_0.0.3 GenomicFeatures_1.17.17
 [3] AnnotationDbi_1.27.16     Biobase_2.25.0
 [5] gmapR_1.7.8               VTGenotyping_0.0.1
 [7] BiocParallel_0.99.22      futile.logger_1.3.7
 [9] VariantTools_1.7.5        *VariantAnnotation_1.11.35*
[11] Rsamtools_1.17.34         Biostrings_2.33.14
[13] XVector_0.5.8             rtracklayer_1.25.16
[15] GenomicRanges_1.17.42     GenomeInfoDb_1.1.23
[17] IRanges_1.99.28           S4Vectors_0.2.4
[19] BiocGenerics_0.11.5       switchr_0.2.1

loaded via a namespace (and not attached):
 [1] annotate_1.43.5                base64enc_0.1-2
 [3] BatchJobs_1.4                  BBmisc_1.7
 [5] biomaRt_2.21.1                 bitops_1.0-6
 [7] brew_1.0-6                     BSgenome_1.33.9
 [9] CGPtools_2.2.0                 checkmate_1.4
[11] codetools_0.2-9                DBI_0.3.1
[13] DESeq_1.17.0                   digest_0.6.4
[15] fail_1.2                       foreach_1.4.2
[17] futile.options_1.0.0           genefilter_1.47.6
[19] geneplotter_1.43.0             GenomicAlignments_1.1.29
[21] genoset_1.19.32                gneDB_0.4.18
[23] grid_3.1.1                     iterators_1.0.7
[25] lambda.r_1.1.6                 lattice_0.20-29
[27] Matrix_1.1-4                   RColorBrewer_1.0-5
[29] RCurl_1.95-4.3                 rjson_0.2.14
[31] RSQLite_0.11.4                 sendmailR_1.2-1
[33] splines_3.1.1                  stringr_0.6.2
[35] survival_2.37-7                tools_3.1.1
[37] TxDb.Hsapiens.BioMart.igis_2.3 XML_3.98-1.1
[39] xtable_1.7-4                   zlibbioc_1.11.1

On Wed, Sep 17, 2014 at 2:08 PM, Valerie Obenchain voben...@fhcrc.org wrote: Hi Gabe, Have you had a chance to test writeVcf? The changes made over the past week have shaved off more time. It now takes ~ 9 minutes to write the NA12877 example.
dim(vcf)
[1] 51612762        1
gc()
              used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   157818565 8428.5  298615851 15947.9  261235336 13951.5
Vcells  1109849222 8467.5 1778386307 13568.1 1693553890 12920.8
print(system.time(writeVcf(vcf, tempfile())))
   user  system elapsed
555.282   6.700 565.700
gc()
              used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   157821990 8428.7  329305975 17586.9  261482807 13964.7
Vcells  1176960717 8979.5 2183277445 16657.1 2171401955 16566.5

In the most recent version (1.11.35) I've added chunking for files with > 1e5 records. Right now the choice of # records per chunk is simple, based on total records only. We are still experimenting with this. You can override default chunking with 'nchunk'. Examples on the man page.

Valerie

On 09/08/14 08:43, Gabe Becker wrote: Val, That is great. I'll check this out and test it on our end. ~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain voben...@fhcrc.org wrote: The new writeVcf code is in 1.11.28. Using the Illumina file you suggested, geno fields only, writing now takes about 17 minutes.

hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
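For reference, a minimal sketch of how the chunked writer above might be exercised. It assumes the devel VariantAnnotation (>= 1.11.35) and that 'nchunk' controls the number of records per chunk; the actual semantics should be checked on the writeVcf man page.

library(VariantAnnotation)

fl  <- "NA12877_S1.genome.vcf.gz"
vcf <- readVcf(fl, "", param = ScanVcfParam(info = NA))  # geno fields only

## default chunking, chosen from the total number of records
system.time(writeVcf(vcf, tempfile()))

## explicit override; 1e6 records per chunk is an assumed value, not a recommendation
system.time(writeVcf(vcf, tempfile(), nchunk = 1e6))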
Re: [Bioc-devel] writeVcf performance
Hi Gabe,

It would help to have a common baseline. Please show the output for writing the Illumina file you sent originally:

library(VariantAnnotation)
fl <- "NA12877_S1.genome.vcf.gz"
vcf <- readVcf(fl, "", param=ScanVcfParam(info=NA))
dim(vcf)
gc()
print(system.time(writeVcf(vcf, tempfile())))
gc()

Here's my output. For this file, writing takes between 8-10 minutes depending on machine load. (Machine has 378 GB of RAM.)

dim(vcf)
[1] 51612762        1
gc()
              used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   157824501 8428.8  298615851 15947.9  261241334 13951.8
Vcells  1109937571 8468.2 1778479081 13568.8 1693642432 12921.5
print(system.time(writeVcf(vcf, tempfile())))
   user  system elapsed
601.265   9.261 614.230
gc()
              used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   157829989 8429.1  298615851 15947.9  261241334 13951.8
Vcells  1109941106 8468.2 2270282109 17320.9 2243573149 17117.2

I'm not seeing substantial differences in writeVcf with expanded vs collapsed VCF. However, as an aside, note that when writeVcf is called on a VRanges (or, equivalently, an ExpandedVCF), the new file will have multiple rows per position if there were multiple ALT alleles. I'm not sure this is technically valid as per the VCF specs but am assuming Michael knew about this.

Five positions in the original VCF:

fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "")
dim(vcf)
[1] 5 3
vr <- as(vcf, "VRanges")
length(vr)
[1] 21

Before writing out, the VRanges is coerced to ExpandedVCF; 7 rows are written out.

dim(as(vr, "VCF"))
[1] 7 3

Can you provide a portion of 'vcfgeno' for testing?

Valerie

sessionInfo()
R version 3.1.0 Patched (2014-06-24 r66016)
Platform: x86_64-unknown-linux-gnu (64-bit)
...
other attached packages:
[1] VariantAnnotation_1.11.35 Rsamtools_1.17.34
[3] Biostrings_2.33.14        XVector_0.5.8
[5] GenomicRanges_1.17.42     GenomeInfoDb_1.1.23
[7] IRanges_1.99.28           S4Vectors_0.2.4
[9] BiocGenerics_0.11.5

On 09/30/2014 08:36 AM, Gabe Becker wrote: Valerie, Apologies for this taking much longer than it should have. The changes in Bioc-devel have wreaked havoc on the code we use to generate and process the data we need to write out, but the fault is mine for not getting on top of it sooner. I'm not seeing the speed you mentioned above in the latest devel version (1.11.35). It took ~1.5 hrs to write an expanded VCF with 56M rows (print output and sessionInfo() follow). I'll try reading in the Illumina platinum file and writing it back out to see if it is something about our specific vcf object (could ExpandedVCF vs VCF be an issue?). vcfgeno *class: ExpandedVCF* *dim: 50307989 1* rowData(vcf): GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER info(vcf): DataFrame with 1 column: END Fields with no header: END geno(vcf): SimpleList of length 7: AD, DP, FT, GT, GQ, PL, MIN_DP geno(header(vcf)): Number Type Description AD 2 Integer Allelic depths (number of reads in each observed al...
DP 1 Integer Total read depth FT 1 String Variant filters GT 1 String Genotype GQ 1 Integer Genotype quality PL 3 Integer Normalized, Phred-scaled likelihoods for genotypes MIN_DP 1 Integer Minimum DP observed within the GVCF block sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] VariantCallingPaper_0.0.3 GenomicFeatures_1.17.17 [3] AnnotationDbi_1.27.16 Biobase_2.25.0 [5] gmapR_1.7.8 VTGenotyping_0.0.1 [7] BiocParallel_0.99.22 futile.logger_1.3.7 [9] VariantTools_1.7.5 *VariantAnnotation_1.11.35* [11] Rsamtools_1.17.34 Biostrings_2.33.14 [13] XVector_0.5.8 rtracklayer_1.25.16 [15] GenomicRanges_1.17.42 GenomeInfoDb_1.1.23 [17] IRanges_1.99.28 S4Vectors_0.2.4 [19] BiocGenerics_0.11.5 switchr_0.2.1 loaded via a namespace (and not attached): [1] annotate_1.43.5base64enc_0.1-2 [3] BatchJobs_1.4 BBmisc_1.7 [5] biomaRt_2.21.1 bitops_1.0-6 [7] brew_1.0-6 BSgenome_1.33.9 [9] CGPtools_2.2.0 checkmate_1.4 [11] codetools_0.2-9DBI_0.3.1 [13] DESeq_1.17.0 digest_0.6.4 [15] fail_1.2 foreach_1.4.2 [17] futile.options_1.0.0 genefilter_1.47.6 [19] geneplotter_1.43.0 GenomicAlignments_1.1.29 [21]
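Relating to the expanded-vs-collapsed question above, a small comparison sketch (illustrative only; it assumes VariantAnnotation's expand() coercion, which turns a CollapsedVCF into an ExpandedVCF with one row per ALT allele):

library(VariantAnnotation)

fl   <- "NA12877_S1.genome.vcf.gz"
vcf  <- readVcf(fl, "", param = ScanVcfParam(info = NA))  # CollapsedVCF
evcf <- expand(vcf)                                       # ExpandedVCF

system.time(writeVcf(vcf,  tempfile()))   # collapsed
system.time(writeVcf(evcf, tempfile()))   # expanded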
Re: [Bioc-devel] writeVcf performance
Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the process of moving SimpleList and DataFrame from IRanges to S4Vectors; finished up today I think. Anyhow, if you get VariantAnnotation from svn you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe rtracklayer). I'm working on the 'chunking' approach next. It looks like we can still gain from adding that. Valerie On 09/08/2014 12:00 PM, Valerie Obenchain wrote: fyi Martin found a bug with the treatment of list data (ie, Number = '.') in the header. Working on a fix ... Val On 09/08/2014 08:43 AM, Gabe Becker wrote: Val, That is great. I'll check this out and test it on our end. ~G On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain voben...@fhcrc.org mailto:voben...@fhcrc.org wrote: The new writeVcf code is in 1.11.28. Using the illumina file you suggested, geno fields only, writing now takes about 17 minutes. hdr class: VCFHeader samples(1): NA12877 meta(6): fileformat ApplyRecalibration ... reference source fixed(1): FILTER info(22): AC AF ... culprit set geno(8): GT GQX ... PL VF param = ScanVcfParam(info=NA) vcf = readVcf(fl, , param=param) dim(vcf) [1] 516127621 system.time(writeVcf(vcf, out.vcf)) user system elapsed 971.0326.568 1004.593 In 1.11.28, parsing of geno data was moved to C. If this didn't speed things up enough we were planning to implement 'chunking' through the VCF and/or move the parsing of info to C, however, it looks like geno was the bottleneck. I've tested a number of samples/fields combinations in files with = .5 million rows and the improvement over writeVcf() in release is ~ 90%. Valerie On 09/04/14 15:28, Valerie Obenchain wrote: Thanks Gabe. I should have something for you on Monday. Val On 09/04/2014 01:56 PM, Gabe Becker wrote: Val and Martin, Apologies for the delay. We realized that the Illumina platinum genome vcf files make a good test case, assuming you strip out all the info (info=NA when reading it into R) stuff. ftp://platgene:G3n3s4me@ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took about ~4.2 hrs to write out, and is about 1.5x the size of the files we are actually dealing with (~50M ranges vs our ~30M). Looking forward a new vastly improved writeVcf :). ~G On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence lawrence.mich...@gene.com mailto:lawrence.mich...@gene.com mailto:lawrence.michael@gene.__com mailto:lawrence.mich...@gene.com wrote: Yes, it's very clear that the scaling is non-linear, and Gabe has been experimenting with a chunk-wise + parallel algorithm. Unfortunately there is some frustrating overhead with the parallelism. But I'm glad Val is arriving at something quicker. Michael On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org wrote: On 08/27/2014 11:56 AM, Gabe Becker wrote: The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical usecase involves only ~4-6 fields, and is faster but still on the order of dozens of minutes. I think Val is arriving at a (much) more efficient implementation, but... I wanted to share my guess that the poor _scaling_ is because the garbage collector runs multiple times as the different strings are pasted together, and has to traverse, in linear time, increasing numbers of allocated SEXPs. 
So times scale approximately quadratically with the number of rows in the VCF. An efficiency is to reduce the number of SEXPs in play by writing out in chunks -- as each chunk is written, the SEXPs become available for collection and are re-used.

Here's my toy example time.R
==

splitIndices <- function (nx, ncl)
{
    i <- seq_len(nx)
    if (ncl == 0L)
Re: [Bioc-devel] writeVcf performance
Hi Val,

On 09/09/2014 02:12 PM, Valerie Obenchain wrote: Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the process of moving SimpleList and DataFrame from IRanges to S4Vectors; finished up today I think.

I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check' failed for me because of a unit test error in test_VRanges_vcf(). Here is how to quickly reproduce:

library(VariantAnnotation)
library(RUnit)
source("path/to/VariantAnnotation/inst/unitTests/test_VRanges-class.R")
dest <- tempfile()
vr <- make_TARGET_VRanges_vcf()
writeVcf(vr, dest)
vcf <- readVcf(dest, genome = "hg19")
perm <- c(1, 7, 8, 4, 2, 10)
vcf.vr <- as(vcf, "VRanges")[perm]
genome(vr) <- "hg19"
checkIdenticalVCF(vr, vcf.vr)  # Error in checkIdentical(orig, vcf) : FALSE

Hard for me to tell whether this is related to DataFrame moving from IRanges to S4Vectors or to a regression in writeVcf(). Do you think you can have a look? Thanks for the help and sorry for the trouble.

H.

Anyhow, if you get VariantAnnotation from svn you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe rtracklayer). I'm working on the 'chunking' approach next. It looks like we can still gain from adding that. Valerie

On 09/08/2014 12:00 PM, Valerie Obenchain wrote: fyi Martin found a bug with the treatment of list data (ie, Number = '.') in the header. Working on a fix ... Val

On 09/08/2014 08:43 AM, Gabe Becker wrote: Val, That is great. I'll check this out and test it on our end. ~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain voben...@fhcrc.org wrote: The new writeVcf code is in 1.11.28. Using the Illumina file you suggested, geno fields only, writing now takes about 17 minutes.

hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

param = ScanVcfParam(info=NA)
vcf = readVcf(fl, "", param=param)
dim(vcf)
[1] 51612762        1
system.time(writeVcf(vcf, "out.vcf"))
    user  system  elapsed
 971.032   6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't speed things up enough we were planning to implement 'chunking' through the VCF and/or move the parsing of info to C; however, it looks like geno was the bottleneck. I've tested a number of samples/fields combinations in files with >= 0.5 million rows and the improvement over writeVcf() in release is ~ 90%.

Valerie

On 09/04/14 15:28, Valerie Obenchain wrote: Thanks Gabe. I should have something for you on Monday. Val

On 09/04/2014 01:56 PM, Gabe Becker wrote: Val and Martin, Apologies for the delay. We realized that the Illumina platinum genome vcf files make a good test case, assuming you strip out all the info (info=NA when reading it into R) stuff. ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took about ~4.2 hrs to write out, and is about 1.5x the size of the files we are actually dealing with (~50M ranges vs our ~30M). Looking forward to a new, vastly improved writeVcf :). ~G

On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence lawrence.mich...@gene.com wrote: Yes, it's very clear that the scaling is non-linear, and Gabe has been experimenting with a chunk-wise + parallel algorithm. Unfortunately there is some frustrating overhead with the parallelism. But I'm glad Val is arriving at something quicker.
Michael On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org wrote: On 08/27/2014 11:56 AM, Gabe Becker wrote: The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical usecase involves only ~4-6 fields, and is faster but still on the order of dozens of minutes. I think Val is arriving at a (much) more efficient implementation, but... I wanted to share my guess that the poor _scaling_ is because the garbage collector runs multiple times as the
Re: [Bioc-devel] writeVcf performance
Hi Herve, This unit test passes in VA 1.11.30 (the current version in svn). It was related to writeVcf(), not the IRanges/S4Vector stuff. My fault, not yours. Val On 09/09/2014 02:47 PM, Hervé Pagès wrote: Hi Val, On 09/09/2014 02:12 PM, Valerie Obenchain wrote: Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the process of moving SimpleList and DataFrame from IRanges to S4Vectors; finished up today I think. I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check' failed for me because of an unit test error in test_VRanges_vcf(). Here is how to quickly reproduce: library(VariantAnnotation) library(RUnit) source(path/to/VariantAnnotation/inst/unitTests/test_VRanges-class.R) dest - tempfile() vr - make_TARGET_VRanges_vcf() writeVcf(vr, dest) vcf - readVcf(dest, genome = hg19) perm - c(1, 7, 8, 4, 2, 10) vcf.vr - as(vcf, VRanges)[perm] genome(vr) - hg19 checkIdenticalVCF(vr, vcf.vr) # Error in checkIdentical(orig, vcf) : FALSE Hard for me to tell whether this is related to DataFrame moving from IRanges to S4Vectors or to a regression in writeVcf(). Do you think you can have a look? Thanks for the help and sorry for the trouble. H. Anyhow, if you get VariantAnnotation from svn you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe rtracklayer). I'm working on the 'chunking' approach next. It looks like we can still gain from adding that. Valerie On 09/08/2014 12:00 PM, Valerie Obenchain wrote: fyi Martin found a bug with the treatment of list data (ie, Number = '.') in the header. Working on a fix ... Val On 09/08/2014 08:43 AM, Gabe Becker wrote: Val, That is great. I'll check this out and test it on our end. ~G On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain voben...@fhcrc.org mailto:voben...@fhcrc.org wrote: The new writeVcf code is in 1.11.28. Using the illumina file you suggested, geno fields only, writing now takes about 17 minutes. hdr class: VCFHeader samples(1): NA12877 meta(6): fileformat ApplyRecalibration ... reference source fixed(1): FILTER info(22): AC AF ... culprit set geno(8): GT GQX ... PL VF param = ScanVcfParam(info=NA) vcf = readVcf(fl, , param=param) dim(vcf) [1] 516127621 system.time(writeVcf(vcf, out.vcf)) user system elapsed 971.0326.568 1004.593 In 1.11.28, parsing of geno data was moved to C. If this didn't speed things up enough we were planning to implement 'chunking' through the VCF and/or move the parsing of info to C, however, it looks like geno was the bottleneck. I've tested a number of samples/fields combinations in files with = .5 million rows and the improvement over writeVcf() in release is ~ 90%. Valerie On 09/04/14 15:28, Valerie Obenchain wrote: Thanks Gabe. I should have something for you on Monday. Val On 09/04/2014 01:56 PM, Gabe Becker wrote: Val and Martin, Apologies for the delay. We realized that the Illumina platinum genome vcf files make a good test case, assuming you strip out all the info (info=NA when reading it into R) stuff. ftp://platgene:G3n3s4me@ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took about ~4.2 hrs to write out, and is about 1.5x the size of the files we are actually dealing with (~50M ranges vs our ~30M). Looking forward a new vastly improved writeVcf :). 
~G On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence lawrence.mich...@gene.com mailto:lawrence.mich...@gene.com mailto:lawrence.michael@gene.__com mailto:lawrence.mich...@gene.com wrote: Yes, it's very clear that the scaling is non-linear, and Gabe has been experimenting with a chunk-wise + parallel algorithm. Unfortunately there is some frustrating overhead with the parallelism. But I'm glad Val is arriving at something quicker. Michael On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org wrote: On 08/27/2014 11:56 AM, Gabe Becker wrote: The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical usecase involves only ~4-6 fields, and is faster but still on the order of dozens of minutes. I think Val is arriving at a (much) more efficient
Re: [Bioc-devel] writeVcf performance
Ah, ok. I should have 'svn up' and re-tried 'R CMD check' before reporting this. Thanks and sorry for the noise. H. On 09/09/2014 03:09 PM, Valerie Obenchain wrote: Hi Herve, This unit test passes in VA 1.11.30 (the current version in svn). It was related to writeVcf(), not the IRanges/S4Vector stuff. My fault, not yours. Val On 09/09/2014 02:47 PM, Hervé Pagès wrote: Hi Val, On 09/09/2014 02:12 PM, Valerie Obenchain wrote: Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the process of moving SimpleList and DataFrame from IRanges to S4Vectors; finished up today I think. I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check' failed for me because of an unit test error in test_VRanges_vcf(). Here is how to quickly reproduce: library(VariantAnnotation) library(RUnit) source(path/to/VariantAnnotation/inst/unitTests/test_VRanges-class.R) dest - tempfile() vr - make_TARGET_VRanges_vcf() writeVcf(vr, dest) vcf - readVcf(dest, genome = hg19) perm - c(1, 7, 8, 4, 2, 10) vcf.vr - as(vcf, VRanges)[perm] genome(vr) - hg19 checkIdenticalVCF(vr, vcf.vr) # Error in checkIdentical(orig, vcf) : FALSE Hard for me to tell whether this is related to DataFrame moving from IRanges to S4Vectors or to a regression in writeVcf(). Do you think you can have a look? Thanks for the help and sorry for the trouble. H. Anyhow, if you get VariantAnnotation from svn you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe rtracklayer). I'm working on the 'chunking' approach next. It looks like we can still gain from adding that. Valerie On 09/08/2014 12:00 PM, Valerie Obenchain wrote: fyi Martin found a bug with the treatment of list data (ie, Number = '.') in the header. Working on a fix ... Val On 09/08/2014 08:43 AM, Gabe Becker wrote: Val, That is great. I'll check this out and test it on our end. ~G On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain voben...@fhcrc.org mailto:voben...@fhcrc.org wrote: The new writeVcf code is in 1.11.28. Using the illumina file you suggested, geno fields only, writing now takes about 17 minutes. hdr class: VCFHeader samples(1): NA12877 meta(6): fileformat ApplyRecalibration ... reference source fixed(1): FILTER info(22): AC AF ... culprit set geno(8): GT GQX ... PL VF param = ScanVcfParam(info=NA) vcf = readVcf(fl, , param=param) dim(vcf) [1] 516127621 system.time(writeVcf(vcf, out.vcf)) user system elapsed 971.0326.568 1004.593 In 1.11.28, parsing of geno data was moved to C. If this didn't speed things up enough we were planning to implement 'chunking' through the VCF and/or move the parsing of info to C, however, it looks like geno was the bottleneck. I've tested a number of samples/fields combinations in files with = .5 million rows and the improvement over writeVcf() in release is ~ 90%. Valerie On 09/04/14 15:28, Valerie Obenchain wrote: Thanks Gabe. I should have something for you on Monday. Val On 09/04/2014 01:56 PM, Gabe Becker wrote: Val and Martin, Apologies for the delay. We realized that the Illumina platinum genome vcf files make a good test case, assuming you strip out all the info (info=NA when reading it into R) stuff. ftp://platgene:G3n3s4me@ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took about ~4.2 hrs to write out, and is about 1.5x the size of the files we are actually dealing with (~50M ranges vs our ~30M). Looking forward a new vastly improved writeVcf :). 
~G On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence lawrence.mich...@gene.com mailto:lawrence.mich...@gene.com mailto:lawrence.michael@gene.__com mailto:lawrence.mich...@gene.com wrote: Yes, it's very clear that the scaling is non-linear, and Gabe has been experimenting with a chunk-wise + parallel algorithm. Unfortunately there is some frustrating overhead with the parallelism. But I'm glad Val is arriving at something quicker. Michael On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org wrote: On 08/27/2014 11:56 AM, Gabe Becker wrote: The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical usecase involves only ~4-6 fields, and is
Re: [Bioc-devel] writeVcf performance
The new writeVcf code is in 1.11.28. Using the Illumina file you suggested, geno fields only, writing now takes about 17 minutes.

hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

param = ScanVcfParam(info=NA)
vcf = readVcf(fl, "", param=param)
dim(vcf)
[1] 51612762        1
system.time(writeVcf(vcf, "out.vcf"))
    user  system  elapsed
 971.032   6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't speed things up enough we were planning to implement 'chunking' through the VCF and/or move the parsing of info to C; however, it looks like geno was the bottleneck. I've tested a number of samples/fields combinations in files with >= 0.5 million rows and the improvement over writeVcf() in release is ~ 90%.

Valerie

On 09/04/14 15:28, Valerie Obenchain wrote: Thanks Gabe. I should have something for you on Monday. Val

On 09/04/2014 01:56 PM, Gabe Becker wrote: Val and Martin, Apologies for the delay. We realized that the Illumina platinum genome vcf files make a good test case, assuming you strip out all the info (info=NA when reading it into R) stuff. ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took about ~4.2 hrs to write out, and is about 1.5x the size of the files we are actually dealing with (~50M ranges vs our ~30M). Looking forward to a new, vastly improved writeVcf :). ~G

On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence lawrence.mich...@gene.com wrote: Yes, it's very clear that the scaling is non-linear, and Gabe has been experimenting with a chunk-wise + parallel algorithm. Unfortunately there is some frustrating overhead with the parallelism. But I'm glad Val is arriving at something quicker. Michael

On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan mtmor...@fhcrc.org wrote: On 08/27/2014 11:56 AM, Gabe Becker wrote: The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical usecase involves only ~4-6 fields, and is faster but still on the order of dozens of minutes. I think Val is arriving at a (much) more efficient implementation, but... I wanted to share my guess that the poor _scaling_ is because the garbage collector runs multiple times as the different strings are pasted together, and has to traverse, in linear time, increasing numbers of allocated SEXPs. So times scale approximately quadratically with the number of rows in the VCF. An efficiency is to reduce the number of SEXPs in play by writing out in chunks -- as each chunk is written, the SEXPs become available for collection and are re-used.
Here's my toy example time.R
==

splitIndices <- function (nx, ncl)
{
    i <- seq_len(nx)
    if (ncl == 0L)
        list()
    else if (ncl == 1L || nx == 1L)
        list(i)
    else {
        fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
        breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
        structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
    }
}

x = as.character(seq_len(1e7)); y = sample(x)

if (!is.na(Sys.getenv("SPLIT", NA))) {
    idx <- splitIndices(length(x), 20)
    system.time(for (i in idx) paste(x[i], y[i], sep=":"))
} else {
    system.time(paste(x, y, sep=":"))
}

running under R-devel with

    $ SPLIT=TRUE R --no-save --quiet -f time.R

the relevant time is

   user  system elapsed
 15.320   0.064  15.381

versus with

    $ R --no-save --quiet -f time.R

it is

   user  system elapsed
 95.360   0.164  95.511

I think this is likely an overall strategy when dealing with character data -- processing in independent chunks of moderate (1M?) size (enabling as a consequence parallel evaluation in modest memory) that are sufficient to benefit from vectorization, but that do not entail allocation of large numbers of in-use SEXPs.

Martin

Sorry for the confusion. ~G

On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker becke...@gene.com wrote: Martin and Val. I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with profiling enabled. The results of summaryRprof for that run are attached, though for a variety of reasons they are pretty
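Applied to file output rather than just paste(), the same idea might look like the sketch below: format and write one block of rows at a time to an open connection, so each block's temporary strings can be collected before the next block is built. This is only an illustration of the strategy described above, with two pre-built columns standing in for the real VCF fields; it is not the writeVcf implementation.

## illustrative chunked formatting + writing of two pre-built columns
writeChunked <- function(x, y, path, chunk_size = 1e6L) {
    con <- file(path, open = "w")
    on.exit(close(con))
    starts <- seq(1L, length(x), by = chunk_size)
    for (s in starts) {
        i <- s:min(s + chunk_size - 1L, length(x))
        ## strings allocated for this chunk become collectable once written
        writeLines(paste(x[i], y[i], sep = ":"), con)
    }
    invisible(path)
}

x <- as.character(seq_len(1e7)); y <- sample(x)
system.time(writeChunked(x, y, tempfile()))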
Re: [Bioc-devel] writeVcf performance
Val, That is great. I'll check this out and test it on our end. ~G On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain voben...@fhcrc.org wrote: The new writeVcf code is in 1.11.28. Using the illumina file you suggested, geno fields only, writing now takes about 17 minutes. hdr class: VCFHeader samples(1): NA12877 meta(6): fileformat ApplyRecalibration ... reference source fixed(1): FILTER info(22): AC AF ... culprit set geno(8): GT GQX ... PL VF param = ScanVcfParam(info=NA) vcf = readVcf(fl, , param=param) dim(vcf) [1] 516127621 system.time(writeVcf(vcf, out.vcf)) user system elapsed 971.0326.568 1004.593 In 1.11.28, parsing of geno data was moved to C. If this didn't speed things up enough we were planning to implement 'chunking' through the VCF and/or move the parsing of info to C, however, it looks like geno was the bottleneck. I've tested a number of samples/fields combinations in files with = .5 million rows and the improvement over writeVcf() in release is ~ 90%. Valerie On 09/04/14 15:28, Valerie Obenchain wrote: Thanks Gabe. I should have something for you on Monday. Val On 09/04/2014 01:56 PM, Gabe Becker wrote: Val and Martin, Apologies for the delay. We realized that the Illumina platinum genome vcf files make a good test case, assuming you strip out all the info (info=NA when reading it into R) stuff. ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took about ~4.2 hrs to write out, and is about 1.5x the size of the files we are actually dealing with (~50M ranges vs our ~30M). Looking forward a new vastly improved writeVcf :). ~G On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence lawrence.mich...@gene.com mailto:lawrence.mich...@gene.com wrote: Yes, it's very clear that the scaling is non-linear, and Gabe has been experimenting with a chunk-wise + parallel algorithm. Unfortunately there is some frustrating overhead with the parallelism. But I'm glad Val is arriving at something quicker. Michael On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org wrote: On 08/27/2014 11:56 AM, Gabe Becker wrote: The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical usecase involves only ~4-6 fields, and is faster but still on the order of dozens of minutes. I think Val is arriving at a (much) more efficient implementation, but... I wanted to share my guess that the poor _scaling_ is because the garbage collector runs multiple times as the different strings are pasted together, and has to traverse, in linear time, increasing numbers of allocated SEXPs. So times scale approximately quadratically with the number of rows in the VCF An efficiency is to reduce the number of SEXPs in play by writing out in chunks -- as each chunk is written, the SEXPs become available for collection and are re-used. 
Here's my toy example time.R == splitIndices - function (nx, ncl) { i - seq_len(nx) if (ncl == 0L) list() else if (ncl == 1L || nx == 1L) list(i) else { fuzz - min((nx - 1L)/1000, 0.4 * nx/ncl) breaks - seq(1 - fuzz, nx + fuzz, length = ncl + 1L) structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL) } } x = as.character(seq_len(1e7)); y = sample(x) if (!is.na http://is.na(Sys.getenv(SPLIT, NA))) { idx - splitIndices(length(x), 20) system.time(for (i in idx) paste(x[i], y[i], sep=:)) } else { system.time(paste(x, y, sep=:)) } running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the relevant time is user system elapsed 15.320 0.064 15.381 versus with $ R --no-save --quiet -f time.R it is user system elapsed 95.360 0.164 95.511 I think this is likely an overall strategy when dealing with character data -- processing in independent chunks of moderate (1M?) size (enabling as a consequence parallel evaluation in modest memory) that are sufficient to benefit from vectorization, but that do not entail allocation of large numbers of in-use SEXPs. Martin Sorry for the confusion. ~G On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker becke...@gene.com mailto:becke...@gene.com mailto:becke...@gene.com mailto:becke...@gene.com wrote: Martin and Val. I re-ran
Re: [Bioc-devel] writeVcf performance
This approach, writing in chunks, is the same Herve and I used for writing FASTA in the Biostrings package, although I see that Herve has now replaced the R implementation with a C implementation. I similarly found an absolutely huge speed up when writing genomes, by chunking. Best, Kasper On Tue, Sep 2, 2014 at 4:33 PM, Martin Morgan mtmor...@fhcrc.org wrote: On 08/27/2014 11:56 AM, Gabe Becker wrote: The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical usecase involves only ~4-6 fields, and is faster but still on the order of dozens of minutes. I think Val is arriving at a (much) more efficient implementation, but... I wanted to share my guess that the poor _scaling_ is because the garbage collector runs multiple times as the different strings are pasted together, and has to traverse, in linear time, increasing numbers of allocated SEXPs. So times scale approximately quadratically with the number of rows in the VCF An efficiency is to reduce the number of SEXPs in play by writing out in chunks -- as each chunk is written, the SEXPs become available for collection and are re-used. Here's my toy example time.R == splitIndices - function (nx, ncl) { i - seq_len(nx) if (ncl == 0L) list() else if (ncl == 1L || nx == 1L) list(i) else { fuzz - min((nx - 1L)/1000, 0.4 * nx/ncl) breaks - seq(1 - fuzz, nx + fuzz, length = ncl + 1L) structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL) } } x = as.character(seq_len(1e7)); y = sample(x) if (!is.na(Sys.getenv(SPLIT, NA))) { idx - splitIndices(length(x), 20) system.time(for (i in idx) paste(x[i], y[i], sep=:)) } else { system.time(paste(x, y, sep=:)) } running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the relevant time is user system elapsed 15.320 0.064 15.381 versus with $ R --no-save --quiet -f time.R it is user system elapsed 95.360 0.164 95.511 I think this is likely an overall strategy when dealing with character data -- processing in independent chunks of moderate (1M?) size (enabling as a consequence parallel evaluation in modest memory) that are sufficient to benefit from vectorization, but that do not entail allocation of large numbers of in-use SEXPs. Martin Sorry for the confusion. ~G On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker becke...@gene.com mailto:becke...@gene.com wrote: Martin and Val. I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with profiling enabled. The results of summaryRprof for that run are attached, though for a variety of reasons they are pretty misleading. It took over an hour to write (3700+seconds), so it's definitely a bottleneck when the data get very large, even if it isn't for smaller data. Michael and I both think the culprit is all the pasting and cbinding that is going on, and more to the point, that memory for an internal representation to be written out is allocated at all. Streaming across the object, looping by rows and writing directly to file (e.g. from C) should be blisteringly fast in comparison. ~G On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence micha...@gene.com mailto:micha...@gene.com wrote: Gabe is still testing/profiling, but we'll send something randomized along eventually. On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org wrote: I didn't see in the original thread a reproducible (simulated, I guess) example, to be explicit about what the problem is?? 
Martin On 08/26/2014 10:47 AM, Michael Lawrence wrote: My understanding is that the heap optimization provided marginal gains, and that we need to think harder about how to optimize the all of the string manipulation in writeVcf. We either need to reduce it or reduce its overhead (i.e., the CHARSXP allocation). Gabe is doing more tests. On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain voben...@fhcrc.org mailto:voben...@fhcrc.org wrote: Hi Gabe, Martin responded, and so did Michael, https://stat.ethz.ch/__pipermail/bioc-devel/2014-__ August/006082.html https://stat.ethz.ch/pipermail/bioc-devel/2014- August/006082.html It sounded like Michael was ok with working with/around heap initialization. Michael, is that right or should we still consider this on the table? Val On 08/26/2014 09:34 AM, Gabe Becker
Re: [Bioc-devel] writeVcf performance
Val and Martin, Apologies for the delay. We realized that the Illumina platinum genome vcf files make a good test case, assuming you strip out all the info (info=NA when reading it into R) stuff. ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took about ~4.2 hrs to write out, and is about 1.5x the size of the files we are actually dealing with (~50M ranges vs our ~30M). Looking forward a new vastly improved writeVcf :). ~G On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence lawrence.mich...@gene.com wrote: Yes, it's very clear that the scaling is non-linear, and Gabe has been experimenting with a chunk-wise + parallel algorithm. Unfortunately there is some frustrating overhead with the parallelism. But I'm glad Val is arriving at something quicker. Michael On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan mtmor...@fhcrc.org wrote: On 08/27/2014 11:56 AM, Gabe Becker wrote: The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical usecase involves only ~4-6 fields, and is faster but still on the order of dozens of minutes. I think Val is arriving at a (much) more efficient implementation, but... I wanted to share my guess that the poor _scaling_ is because the garbage collector runs multiple times as the different strings are pasted together, and has to traverse, in linear time, increasing numbers of allocated SEXPs. So times scale approximately quadratically with the number of rows in the VCF An efficiency is to reduce the number of SEXPs in play by writing out in chunks -- as each chunk is written, the SEXPs become available for collection and are re-used. Here's my toy example time.R == splitIndices - function (nx, ncl) { i - seq_len(nx) if (ncl == 0L) list() else if (ncl == 1L || nx == 1L) list(i) else { fuzz - min((nx - 1L)/1000, 0.4 * nx/ncl) breaks - seq(1 - fuzz, nx + fuzz, length = ncl + 1L) structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL) } } x = as.character(seq_len(1e7)); y = sample(x) if (!is.na(Sys.getenv(SPLIT, NA))) { idx - splitIndices(length(x), 20) system.time(for (i in idx) paste(x[i], y[i], sep=:)) } else { system.time(paste(x, y, sep=:)) } running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the relevant time is user system elapsed 15.320 0.064 15.381 versus with $ R --no-save --quiet -f time.R it is user system elapsed 95.360 0.164 95.511 I think this is likely an overall strategy when dealing with character data -- processing in independent chunks of moderate (1M?) size (enabling as a consequence parallel evaluation in modest memory) that are sufficient to benefit from vectorization, but that do not entail allocation of large numbers of in-use SEXPs. Martin Sorry for the confusion. ~G On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker becke...@gene.com mailto:becke...@gene.com wrote: Martin and Val. I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with profiling enabled. The results of summaryRprof for that run are attached, though for a variety of reasons they are pretty misleading. It took over an hour to write (3700+seconds), so it's definitely a bottleneck when the data get very large, even if it isn't for smaller data. Michael and I both think the culprit is all the pasting and cbinding that is going on, and more to the point, that memory for an internal representation to be written out is allocated at all. Streaming across the object, looping by rows and writing directly to file (e.g. from C) should be blisteringly fast in comparison. 
~G On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence micha...@gene.com mailto:micha...@gene.com wrote: Gabe is still testing/profiling, but we'll send something randomized along eventually. On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org wrote: I didn't see in the original thread a reproducible (simulated, I guess) example, to be explicit about what the problem is?? Martin On 08/26/2014 10:47 AM, Michael Lawrence wrote: My understanding is that the heap optimization provided marginal gains, and that we need to think harder about how to optimize the all of the string manipulation in writeVcf. We either need to reduce it or reduce its overhead (i.e., the CHARSXP allocation). Gabe is doing more tests. On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain voben...@fhcrc.org mailto:voben...@fhcrc.org wrote: Hi Gabe, Martin responded, and so did
Re: [Bioc-devel] writeVcf performance
Thanks Gabe. I should have something for you on Monday. Val On 09/04/2014 01:56 PM, Gabe Becker wrote: Val and Martin, Apologies for the delay. We realized that the Illumina platinum genome vcf files make a good test case, assuming you strip out all the info (info=NA when reading it into R) stuff. ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took about ~4.2 hrs to write out, and is about 1.5x the size of the files we are actually dealing with (~50M ranges vs our ~30M). Looking forward a new vastly improved writeVcf :). ~G On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence lawrence.mich...@gene.com mailto:lawrence.mich...@gene.com wrote: Yes, it's very clear that the scaling is non-linear, and Gabe has been experimenting with a chunk-wise + parallel algorithm. Unfortunately there is some frustrating overhead with the parallelism. But I'm glad Val is arriving at something quicker. Michael On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org wrote: On 08/27/2014 11:56 AM, Gabe Becker wrote: The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical usecase involves only ~4-6 fields, and is faster but still on the order of dozens of minutes. I think Val is arriving at a (much) more efficient implementation, but... I wanted to share my guess that the poor _scaling_ is because the garbage collector runs multiple times as the different strings are pasted together, and has to traverse, in linear time, increasing numbers of allocated SEXPs. So times scale approximately quadratically with the number of rows in the VCF An efficiency is to reduce the number of SEXPs in play by writing out in chunks -- as each chunk is written, the SEXPs become available for collection and are re-used. Here's my toy example time.R == splitIndices - function (nx, ncl) { i - seq_len(nx) if (ncl == 0L) list() else if (ncl == 1L || nx == 1L) list(i) else { fuzz - min((nx - 1L)/1000, 0.4 * nx/ncl) breaks - seq(1 - fuzz, nx + fuzz, length = ncl + 1L) structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL) } } x = as.character(seq_len(1e7)); y = sample(x) if (!is.na http://is.na(Sys.getenv(SPLIT, NA))) { idx - splitIndices(length(x), 20) system.time(for (i in idx) paste(x[i], y[i], sep=:)) } else { system.time(paste(x, y, sep=:)) } running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the relevant time is user system elapsed 15.320 0.064 15.381 versus with $ R --no-save --quiet -f time.R it is user system elapsed 95.360 0.164 95.511 I think this is likely an overall strategy when dealing with character data -- processing in independent chunks of moderate (1M?) size (enabling as a consequence parallel evaluation in modest memory) that are sufficient to benefit from vectorization, but that do not entail allocation of large numbers of in-use SEXPs. Martin Sorry for the confusion. ~G On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker becke...@gene.com mailto:becke...@gene.com mailto:becke...@gene.com mailto:becke...@gene.com wrote: Martin and Val. I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with profiling enabled. The results of summaryRprof for that run are attached, though for a variety of reasons they are pretty misleading. It took over an hour to write (3700+seconds), so it's definitely a bottleneck when the data get very large, even if it isn't for smaller data. 
Michael and I both think the culprit is all the pasting and cbinding that is going on, and more to the point, that memory for an internal representation to be written out is allocated at all. Streaming across the object, looping by rows and writing directly to file (e.g. from C) should be blisteringly fast in comparison. ~G On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence micha...@gene.com mailto:micha...@gene.com mailto:micha...@gene.com mailto:micha...@gene.com wrote: Gabe is still
Re: [Bioc-devel] writeVcf performance
Yes, it's very clear that the scaling is non-linear, and Gabe has been experimenting with a chunk-wise + parallel algorithm. Unfortunately there is some frustrating overhead with the parallelism. But I'm glad Val is arriving at something quicker. Michael On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan mtmor...@fhcrc.org wrote: On 08/27/2014 11:56 AM, Gabe Becker wrote: The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical usecase involves only ~4-6 fields, and is faster but still on the order of dozens of minutes. I think Val is arriving at a (much) more efficient implementation, but... I wanted to share my guess that the poor _scaling_ is because the garbage collector runs multiple times as the different strings are pasted together, and has to traverse, in linear time, increasing numbers of allocated SEXPs. So times scale approximately quadratically with the number of rows in the VCF An efficiency is to reduce the number of SEXPs in play by writing out in chunks -- as each chunk is written, the SEXPs become available for collection and are re-used. Here's my toy example time.R == splitIndices - function (nx, ncl) { i - seq_len(nx) if (ncl == 0L) list() else if (ncl == 1L || nx == 1L) list(i) else { fuzz - min((nx - 1L)/1000, 0.4 * nx/ncl) breaks - seq(1 - fuzz, nx + fuzz, length = ncl + 1L) structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL) } } x = as.character(seq_len(1e7)); y = sample(x) if (!is.na(Sys.getenv(SPLIT, NA))) { idx - splitIndices(length(x), 20) system.time(for (i in idx) paste(x[i], y[i], sep=:)) } else { system.time(paste(x, y, sep=:)) } running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the relevant time is user system elapsed 15.320 0.064 15.381 versus with $ R --no-save --quiet -f time.R it is user system elapsed 95.360 0.164 95.511 I think this is likely an overall strategy when dealing with character data -- processing in independent chunks of moderate (1M?) size (enabling as a consequence parallel evaluation in modest memory) that are sufficient to benefit from vectorization, but that do not entail allocation of large numbers of in-use SEXPs. Martin Sorry for the confusion. ~G On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker becke...@gene.com mailto:becke...@gene.com wrote: Martin and Val. I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with profiling enabled. The results of summaryRprof for that run are attached, though for a variety of reasons they are pretty misleading. It took over an hour to write (3700+seconds), so it's definitely a bottleneck when the data get very large, even if it isn't for smaller data. Michael and I both think the culprit is all the pasting and cbinding that is going on, and more to the point, that memory for an internal representation to be written out is allocated at all. Streaming across the object, looping by rows and writing directly to file (e.g. from C) should be blisteringly fast in comparison. ~G On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence micha...@gene.com mailto:micha...@gene.com wrote: Gabe is still testing/profiling, but we'll send something randomized along eventually. On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan mtmor...@fhcrc.org mailto:mtmor...@fhcrc.org wrote: I didn't see in the original thread a reproducible (simulated, I guess) example, to be explicit about what the problem is?? 
Martin On 08/26/2014 10:47 AM, Michael Lawrence wrote: My understanding is that the heap optimization provided marginal gains, and that we need to think harder about how to optimize the all of the string manipulation in writeVcf. We either need to reduce it or reduce its overhead (i.e., the CHARSXP allocation). Gabe is doing more tests. On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain voben...@fhcrc.org mailto:voben...@fhcrc.org wrote: Hi Gabe, Martin responded, and so did Michael, https://stat.ethz.ch/__pipermail/bioc-devel/2014-__ August/006082.html https://stat.ethz.ch/pipermail/bioc-devel/2014- August/006082.html It sounded like Michael was ok with working with/around heap initialization. Michael, is that right or should we still consider this on the table? Val On 08/26/2014 09:34 AM, Gabe Becker wrote:
Re: [Bioc-devel] writeVcf performance
Try to run it through the lineprof package for memory profiling; I have found this to be very helpful. Here is an old blog post I wrote about it http://www.hansenlab.org/rstats/2014/01/30/lineprof/ Kasper On Wed, Aug 27, 2014 at 2:56 PM, Gabe Becker becker.g...@gene.com wrote: The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical usecase involves only ~4-6 fields, and is faster but still on the order of dozens of minutes. Sorry for the confusion. ~G On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker becke...@gene.com wrote: Martin and Val. I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with profiling enabled. The results of summaryRprof for that run are attached, though for a variety of reasons they are pretty misleading. It took over an hour to write (3700+seconds), so it's definitely a bottleneck when the data get very large, even if it isn't for smaller data. Michael and I both think the culprit is all the pasting and cbinding that is going on, and more to the point, that memory for an internal representation to be written out is allocated at all. Streaming across the object, looping by rows and writing directly to file (e.g. from C) should be blisteringly fast in comparison. ~G On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence micha...@gene.com wrote: Gabe is still testing/profiling, but we'll send something randomized along eventually. On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan mtmor...@fhcrc.org wrote: I didn't see in the original thread a reproducible (simulated, I guess) example, to be explicit about what the problem is?? Martin On 08/26/2014 10:47 AM, Michael Lawrence wrote: My understanding is that the heap optimization provided marginal gains, and that we need to think harder about how to optimize the all of the string manipulation in writeVcf. We either need to reduce it or reduce its overhead (i.e., the CHARSXP allocation). Gabe is doing more tests. On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain voben...@fhcrc.org wrote: Hi Gabe, Martin responded, and so did Michael, https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html It sounded like Michael was ok with working with/around heap initialization. Michael, is that right or should we still consider this on the table? Val On 08/26/2014 09:34 AM, Gabe Becker wrote: Val, Has there been any movement on this? This remains a substantial bottleneck for us when writing very large VCF files (e.g. variants+genotypes for whole genome NGS samples). I was able to see a ~25% speedup with 4 cores and an optimal speedup of ~2x with 10-12 cores for a VCF with 500k rows using a very naive parallelization strategy and no other changes. I suspect this could be improved on quite a bit, or possibly made irrelevant with judicious use of serial C code. Did you and Martin make any plans regarding optimizing writeVcf? Best ~G On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain voben...@fhcrc.org mailto:voben...@fhcrc.org wrote: Hi Michael, I'm interested in working on this. I'll discuss with Martin next week when we're both back in the office. Val On 08/05/14 07:46, Michael Lawrence wrote: Hi guys (Val, Martin, Herve): Anyone have an itch for optimization? The writeVcf function is currently a bottleneck in our WGS genotyping pipeline. For a typical 50 million row gVCF, it was taking 2.25 hours prior to yesterday's improvements (pasteCollapseRows) that brought it down to about 1 hour, which is still too long by my standards ( 0). 
Only takes 3 minutes to call the genotypes (and associated likelihoods etc) from the variant calls (using 80 cores and 450 GB RAM on one node), so the output is an issue. Profiling suggests that the running time scales non-linearly in the number of rows. Digging a little deeper, it seems to be something with R's string/memory allocation. Below, pasting 1 million strings takes 6 seconds, but 10 million strings takes over 2 minutes. It gets way worse with 50 million. I suspect it has something to do with R's string hash table. set.seed(1000) end - sample(1e8, 1e6) system.time(paste0(END, =, end)) user system elapsed 6.396 0.028 6.420 end - sample(1e8, 1e7) system.time(paste0(END, =, end)) user system elapsed 134.714 0.352 134.978 Indeed, even this takes a long time (in a fresh session): set.seed(1000) end -
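A minimal sketch of the lineprof suggestion, assuming the GitHub package (devtools::install_github("hadley/lineprof")) and the large 'vcf' object from earlier in the thread; line-by-line memory figures need code sourced with srcrefs kept, otherwise results are aggregated per call:

library(lineprof)
library(VariantAnnotation)

prof <- lineprof(writeVcf(vcf, tempfile()))
prof          # time and memory allocated/released, per line (or per call)
shine(prof)   # interactive exploration in the browser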
Re: [Bioc-devel] writeVcf performance
Martin and Val. I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with profiling enabled. The results of summaryRprof for that run are attached, though for a variety of reasons they are pretty misleading. It took over an hour to write (3700+seconds), so it's definitely a bottleneck when the data get very large, even if it isn't for smaller data. Michael and I both think the culprit is all the pasting and cbinding that is going on, and more to the point, that memory for an internal representation to be written out is allocated at all. Streaming across the object, looping by rows and writing directly to file (e.g. from C) should be blisteringly fast in comparison. ~G On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence micha...@gene.com wrote: Gabe is still testing/profiling, but we'll send something randomized along eventually. On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan mtmor...@fhcrc.org wrote: I didn't see in the original thread a reproducible (simulated, I guess) example, to be explicit about what the problem is?? Martin On 08/26/2014 10:47 AM, Michael Lawrence wrote: My understanding is that the heap optimization provided marginal gains, and that we need to think harder about how to optimize the all of the string manipulation in writeVcf. We either need to reduce it or reduce its overhead (i.e., the CHARSXP allocation). Gabe is doing more tests. On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain voben...@fhcrc.org wrote: Hi Gabe, Martin responded, and so did Michael, https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html It sounded like Michael was ok with working with/around heap initialization. Michael, is that right or should we still consider this on the table? Val On 08/26/2014 09:34 AM, Gabe Becker wrote: Val, Has there been any movement on this? This remains a substantial bottleneck for us when writing very large VCF files (e.g. variants+genotypes for whole genome NGS samples). I was able to see a ~25% speedup with 4 cores and an optimal speedup of ~2x with 10-12 cores for a VCF with 500k rows using a very naive parallelization strategy and no other changes. I suspect this could be improved on quite a bit, or possibly made irrelevant with judicious use of serial C code. Did you and Martin make any plans regarding optimizing writeVcf? Best ~G On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain voben...@fhcrc.org mailto:voben...@fhcrc.org wrote: Hi Michael, I'm interested in working on this. I'll discuss with Martin next week when we're both back in the office. Val On 08/05/14 07:46, Michael Lawrence wrote: Hi guys (Val, Martin, Herve): Anyone have an itch for optimization? The writeVcf function is currently a bottleneck in our WGS genotyping pipeline. For a typical 50 million row gVCF, it was taking 2.25 hours prior to yesterday's improvements (pasteCollapseRows) that brought it down to about 1 hour, which is still too long by my standards ( 0). Only takes 3 minutes to call the genotypes (and associated likelihoods etc) from the variant calls (using 80 cores and 450 GB RAM on one node), so the output is an issue. Profiling suggests that the running time scales non-linearly in the number of rows. Digging a little deeper, it seems to be something with R's string/memory allocation. Below, pasting 1 million strings takes 6 seconds, but 10 million strings takes over 2 minutes. It gets way worse with 50 million. I suspect it has something to do with R's string hash table. 
set.seed(1000)
end <- sample(1e8, 1e6)
system.time(paste0("END", "=", end))
   user  system elapsed
  6.396   0.028   6.420

end <- sample(1e8, 1e7)
system.time(paste0("END", "=", end))
   user  system elapsed
134.714   0.352 134.978

Indeed, even this takes a long time (in a fresh session):

set.seed(1000)
end <- sample(1e8, 1e6)
end <- sample(1e8, 1e7)
system.time(as.character(end))
   user  system elapsed
 57.224   0.156  57.366

But running it a second time is faster (about what one would expect?):

system.time(levels <- as.character(end))
   user  system elapsed
 23.582   0.021  23.589

I did some simple profiling of R to find that the resizing of the string hash table is not a significant component of the time. So maybe something to do with the R heap/gc? No time right now to go deeper. But I know Martin likes this sort of thing ;) Michael
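To make the streaming idea above concrete, here is a minimal, hypothetical sketch rather than the actual writeVcf() internals: rows are formatted a chunk at a time and written straight to an open connection, so only one chunk of freshly allocated strings is alive at any point and can be garbage-collected before the next chunk is built. format_chunk() below is just a stand-in for the real per-record INFO/FORMAT formatting.

write_in_chunks <- function(n, path, chunk_size = 1e6) {
    ## stand-in for the real per-record formatting (pasting INFO/FORMAT fields etc.)
    format_chunk <- function(idx) paste0("END=", idx)
    con <- file(path, open = "wt")
    on.exit(close(con))
    for (s in seq(1L, n, by = chunk_size)) {
        idx <- s:min(s + chunk_size - 1, n)
        writeLines(format_chunk(idx), con)
        ## the chunk's strings go out of scope here, so they can be collected
        ## before the next chunk is allocated
    }
    invisible(path)
}

## e.g. write_in_chunks(1e7, tempfile()) keeps at most ~1e6 freshly built strings alive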
Re: [Bioc-devel] writeVcf performance
The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical use case involves only ~4-6 fields, which is faster but still on the order of dozens of minutes. Sorry for the confusion. ~G
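For reference, a run like the one Gabe describes can be profiled along these lines. This is only a sketch: 'vcf' stands in for the large (Expanded)VCF object from the thread, and the file names are arbitrary placeholders.

library(VariantAnnotation)

Rprof("writeVcf.prof", interval = 0.05)      # sample the call stack every 50 ms
writeVcf(vcf, "out.vcf")                     # 'vcf' is a placeholder for the real object
Rprof(NULL)                                  # stop profiling
head(summaryRprof("writeVcf.prof")$by.self)  # where the self-time goes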
Re: [Bioc-devel] writeVcf performance
Hi Gabe, Martin responded, and so did Michael: https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html It sounded like Michael was ok with working with/around heap initialization. Michael, is that right, or should we still consider this on the table? Val

On 08/26/2014 09:34 AM, Gabe Becker wrote: Val, Has there been any movement on this? This remains a substantial bottleneck for us when writing very large VCF files (e.g. variants+genotypes for whole genome NGS samples). I was able to see a ~25% speedup with 4 cores and an optimal speedup of ~2x with 10-12 cores for a VCF with 500k rows, using a very naive parallelization strategy and no other changes. I suspect this could be improved on quite a bit, or possibly made irrelevant with judicious use of serial C code. Did you and Martin make any plans regarding optimizing writeVcf? Best ~G
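Gabe's post does not say how the naive parallelization was done; one plausible shape for it, sketched under that assumption, is to format row chunks in parallel and write the results serially so record order is preserved. format_chunk() is again a hypothetical stand-in for writeVcf's per-row work, and mclapply() relies on forking, so this only parallelizes on Unix-alikes (on Windows it would need mc.cores = 1 or a PSOCK cluster). It also trades memory for speed, since every formatted chunk is held before writing.

library(parallel)

parallel_format_write <- function(n, path, cores = 4, chunk_size = 1e5) {
    format_chunk <- function(idx) paste0("END=", idx)        # hypothetical formatter
    chunks <- split(seq_len(n), ceiling(seq_len(n) / chunk_size))
    out <- mclapply(chunks, format_chunk, mc.cores = cores)  # format chunks in parallel
    con <- file(path, open = "wt")
    on.exit(close(con))
    for (x in out) writeLines(x, con)                        # serial write keeps order
    invisible(path)
}

## e.g. parallel_format_write(5e5, tempfile(), cores = 4)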
Re: [Bioc-devel] writeVcf performance
I thought it might come down to the heap initialization. We'll work with that.

On Wed, Aug 13, 2014 at 4:42 PM, Martin Morgan mtmor...@fhcrc.org wrote: my usual trick is

R --no-save --quiet --min-vsize=2048M --min-nsize=45M

which changes the example above from

system.time(as.character(end))
   user  system elapsed
 82.835   0.343  83.195

to

system.time(as.character(end))
   user  system elapsed
  9.245   0.169   9.424

but I think it's a one-time gain; I wonder what the writeVcf command is that you're running? Martin
Re: [Bioc-devel] writeVcf performance
On 08/05/2014 07:46 AM, Michael Lawrence wrote: ... Indeed, even this takes a long time (in a fresh session):

set.seed(1000)
end <- sample(1e8, 1e7)
system.time(as.character(end))
   user  system elapsed
 57.224   0.156  57.366

my usual trick is

R --no-save --quiet --min-vsize=2048M --min-nsize=45M

which changes the example above from

system.time(as.character(end))
   user  system elapsed
 82.835   0.343  83.195

to

system.time(as.character(end))
   user  system elapsed
  9.245   0.169   9.424

but I think it's a one-time gain; I wonder what the writeVcf command is that you're running? Martin
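A small sketch building on Martin's trick above, assuming the write is run as a batch job: the same initial-heap settings can be passed on the command line (the script name below is just a placeholder), and gc() inside the session reports the current trigger sizes, so you can confirm the settings took effect before paying for the big as.character()/paste() calls. R also honours the R_NSIZE and R_VSIZE environment variables at startup (see ?Memory), which is an alternative way to pass the same values.

## launch, e.g.:
##   R --no-save --quiet --min-nsize=45M --min-vsize=2048M -f write_vcf_job.R

gc()   # check Ncells/Vcells "gc trigger" reflects the enlarged initial heap

set.seed(1000)
end <- sample(1e8, 1e7)
system.time(as.character(end))   # should avoid much of the incremental heap growth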