Re: [Bioc-devel] writeVcf performance

2014-09-30 Thread Gabe Becker
Valerie,

Apologies for this taking much longer than it should have. The changes in
Bioc-devel have wreaked havoc on the code we use to generate and process
the data we need to write out, but the fault is mine for not getting on top
of it sooner.

I'm not seeing the speed you mentioned above in the latest devel version
(1.11.35). It took ~1.5hrs to write an expanded vcf with ~50M rows
(print output and sessionInfo() follow). I'll try reading in the Illumina
platinum file and writing it back out to see if it is something about our
specific vcf object (could ExpandedVCF vs VCF be an issue?).

> vcfgeno
class: ExpandedVCF
dim: 50307989 1
rowData(vcf):
  GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 1 column: END
  Fields with no header: END
geno(vcf):
  SimpleList of length 7: AD, DP, FT, GT, GQ, PL, MIN_DP
geno(header(vcf)):
          Number Type    Description
   AD     2      Integer Allelic depths (number of reads in each observed al...
   DP     1      Integer Total read depth
   FT     1      String  Variant filters
   GT     1      String  Genotype
   GQ     1      Integer Genotype quality
   PL     3      Integer Normalized, Phred-scaled likelihoods for genotypes
   MIN_DP 1      Integer Minimum DP observed within the GVCF block

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] VariantCallingPaper_0.0.3 GenomicFeatures_1.17.17
 [3] AnnotationDbi_1.27.16     Biobase_2.25.0
 [5] gmapR_1.7.8               VTGenotyping_0.0.1
 [7] BiocParallel_0.99.22      futile.logger_1.3.7
 [9] VariantTools_1.7.5        VariantAnnotation_1.11.35
[11] Rsamtools_1.17.34         Biostrings_2.33.14
[13] XVector_0.5.8             rtracklayer_1.25.16
[15] GenomicRanges_1.17.42     GenomeInfoDb_1.1.23
[17] IRanges_1.99.28           S4Vectors_0.2.4
[19] BiocGenerics_0.11.5       switchr_0.2.1

loaded via a namespace (and not attached):
 [1] annotate_1.43.5                base64enc_0.1-2
 [3] BatchJobs_1.4                  BBmisc_1.7
 [5] biomaRt_2.21.1                 bitops_1.0-6
 [7] brew_1.0-6                     BSgenome_1.33.9
 [9] CGPtools_2.2.0                 checkmate_1.4
[11] codetools_0.2-9                DBI_0.3.1
[13] DESeq_1.17.0                   digest_0.6.4
[15] fail_1.2                       foreach_1.4.2
[17] futile.options_1.0.0           genefilter_1.47.6
[19] geneplotter_1.43.0             GenomicAlignments_1.1.29
[21] genoset_1.19.32                gneDB_0.4.18
[23] grid_3.1.1                     iterators_1.0.7
[25] lambda.r_1.1.6                 lattice_0.20-29
[27] Matrix_1.1-4                   RColorBrewer_1.0-5
[29] RCurl_1.95-4.3                 rjson_0.2.14
[31] RSQLite_0.11.4                 sendmailR_1.2-1
[33] splines_3.1.1                  stringr_0.6.2
[35] survival_2.37-7                tools_3.1.1
[37] TxDb.Hsapiens.BioMart.igis_2.3 XML_3.98-1.1
[39] xtable_1.7-4                   zlibbioc_1.11.1

On Wed, Sep 17, 2014 at 2:08 PM, Valerie Obenchain <voben...@fhcrc.org> wrote:

 Hi Gabe,

 Have you had a chance to test writeVcf? The changes made over the past
 week have shaved off more time. It now takes ~9 minutes to write the
 NA12877 example.

 > dim(vcf)
 [1] 51612762        1

 > gc()
              used   (Mb) gc trigger    (Mb)   max used    (Mb)
 Ncells  157818565 8428.5  298615851 15947.9  261235336 13951.5
 Vcells 1109849222 8467.5 1778386307 13568.1 1693553890 12920.8

 > print(system.time(writeVcf(vcf, tempfile())))
    user  system elapsed
 555.282   6.700 565.700

 > gc()
              used   (Mb) gc trigger    (Mb)   max used    (Mb)
 Ncells  157821990 8428.7  329305975 17586.9  261482807 13964.7
 Vcells 1176960717 8979.5 2183277445 16657.1 2171401955 16566.5



 In the most recent version (1.11.35) I've added chunking for files with >
 1e5 records. Right now the choice of # records per chunk is simple, based
 on total records only. We are still experimenting with this. You can
 override default chunking with 'nchunk'. Examples on the man page.
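
 For instance, a minimal sketch of forcing a chunk size (hypothetical
 values; see ?writeVcf in 1.11.35 for the real examples):

 library(VariantAnnotation)
 fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
 vcf <- readVcf(fl, "hg19")
 writeVcf(vcf, tempfile(), nchunk = 1e6)  # ~1e6 records per chunk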

 Valerie


Re: [Bioc-devel] writeVcf performance

2014-09-30 Thread Valerie Obenchain

Hi Gabe,

It would help to have a common baseline. Please show the output for 
writing the Illumina file you sent originally:


library(VariantAnnotation)
fl <- "NA12877_S1.genome.vcf.gz"
vcf <- readVcf(fl, "", param=ScanVcfParam(info=NA))
dim(vcf)

gc()
print(system.time(writeVcf(vcf, tempfile())))
gc()

Here's my output. For this file, writing takes between 8 and 10 minutes
depending on machine load. (Machine has 378 GB of RAM.)



> dim(vcf)
[1] 51612762        1



> gc()
             used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157824501 8428.8  298615851 15947.9  261241334 13951.8
Vcells 1109937571 8468.2 1778479081 13568.8 1693642432 12921.5

> print(system.time(writeVcf(vcf, tempfile())))
   user  system elapsed
601.265   9.261 614.230

> gc()
             used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157829989 8429.1  298615851 15947.9  261241334 13951.8
Vcells 1109941106 8468.2 2270282109 17320.9 2243573149 17117.2


I'm not seeing substantial differences in writeVcf with expanded vs
collapsed VCF. However, as an aside, note that when calling writeVcf on a
VRanges (or equivalently an ExpandedVCF) the new file will have multiple
rows per position if there were multiple ALT alleles. I'm not sure this
is technically valid per the VCF spec but am assuming Michael knew
about this.


Five positions in original VCF:
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "")
> dim(vcf)
[1] 5 3

vr <- as(vcf, "VRanges")
> length(vr)
[1] 21

Before writing out, VRanges is coerced to ExpandedVCF; 7 rows are 
written out.

> dim(as(vr, "VCF"))
[1] 7 3

Can you provide a portion of 'vcfgeno' for testing?
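
Even a slice would be enough; e.g., something along these lines (a sketch;
'vcfgeno' is the object from your session):

chunk <- vcfgeno[seq_len(1e6), ]      # first 1M records
saveRDS(chunk, "vcfgeno_chunk.rds")   # or writeVcf(chunk, "chunk.vcf")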

Valerie


> sessionInfo()

R version 3.1.0 Patched (2014-06-24 r66016)
Platform: x86_64-unknown-linux-gnu (64-bit)


...


other attached packages:
[1] VariantAnnotation_1.11.35 Rsamtools_1.17.34
[3] Biostrings_2.33.14        XVector_0.5.8
[5] GenomicRanges_1.17.42 GenomeInfoDb_1.1.23
[7] IRanges_1.99.28   S4Vectors_0.2.4
[9] BiocGenerics_0.11.5





Re: [Bioc-devel] writeVcf performance

2014-09-09 Thread Valerie Obenchain
Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the 
process of moving SimpleList and DataFrame from IRanges to S4Vectors; 
finished up today I think. Anyhow, if you get VariantAnnotation from svn 
you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe 
rtracklayer).


I'm working on the 'chunking' approach next. It looks like we can still 
gain from adding that.


Valerie

On 09/08/2014 12:00 PM, Valerie Obenchain wrote:

fyi Martin found a bug with the treatment of list data (ie, Number =
'.') in the header. Working on a fix ...

Val



Re: [Bioc-devel] writeVcf performance

2014-09-09 Thread Hervé Pagès

Hi Val,

On 09/09/2014 02:12 PM, Valerie Obenchain wrote:

Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the
process of moving SimpleList and DataFrame from IRanges to S4Vectors;
finished up today I think.


I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check'
failed for me because of a unit test error in test_VRanges_vcf().
Here is how to quickly reproduce:

  library(VariantAnnotation)
  library(RUnit)
  source("path/to/VariantAnnotation/inst/unitTests/test_VRanges-class.R")

  dest <- tempfile()
  vr <- make_TARGET_VRanges_vcf()
  writeVcf(vr, dest)
  vcf <- readVcf(dest, genome = "hg19")
  perm <- c(1, 7, 8, 4, 2, 10)
  vcf.vr <- as(vcf, "VRanges")[perm]
  genome(vr) <- "hg19"
  checkIdenticalVCF(vr, vcf.vr)  # Error in checkIdentical(orig, vcf) : FALSE


Hard for me to tell whether this is related to DataFrame moving from
IRanges to S4Vectors or to a regression in writeVcf(). Do you think
you can have a look? Thanks for the help and sorry for the trouble.

H.


Re: [Bioc-devel] writeVcf performance

2014-09-09 Thread Valerie Obenchain

Hi Herve,

This unit test passes in VA 1.11.30 (the current version in svn). It was 
related to writeVcf(), not the IRanges/S4Vector stuff. My fault, not yours.


Val

Re: [Bioc-devel] writeVcf performance

2014-09-09 Thread Hervé Pagès

Ah, ok. I should have run 'svn up' and re-tried 'R CMD check' before
reporting this. Thanks, and sorry for the noise.

H.


Re: [Bioc-devel] writeVcf performance

2014-09-08 Thread Valerie Obenchain

The new writeVcf code is in 1.11.28.

Using the Illumina file you suggested, geno fields only, writing now
takes about 17 minutes.


> hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

> param = ScanVcfParam(info=NA)
> vcf = readVcf(fl, "", param=param)
> dim(vcf)
[1] 51612762        1

> system.time(writeVcf(vcf, "out.vcf"))
    user   system  elapsed
 971.032    6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't speed
things up enough we were planning to implement 'chunking' through the
VCF and/or move the parsing of info to C; however, it looks like geno
was the bottleneck.


I've tested a number of samples/fields combinations in files with >= 0.5
million rows and the improvement over writeVcf() in release is ~90%.


Valerie



Re: [Bioc-devel] writeVcf performance

2014-09-08 Thread Gabe Becker
Val,

That is great. I'll check this out and test it on our end.

~G


Re: [Bioc-devel] writeVcf performance

2014-09-05 Thread Kasper Daniel Hansen
This approach, writing in chunks, is the same one Herve and I used for
writing FASTA in the Biostrings package, although I see that Herve has now
replaced the R implementation with a C implementation. I similarly found an
absolutely huge speedup when writing genomes, by chunking.
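
The pattern is roughly this (a generic sketch, not the actual Biostrings
code):

## write FASTA-style records in blocks; each paste0() only allocates
## strings for one chunk, so earlier chunks can be garbage-collected
writeChunked <- function(id, seqs, path, chunkSize = 1e6L) {
    con <- file(path, "w")
    on.exit(close(con))
    for (s in seq(1L, length(id), by = chunkSize)) {
        i <- s:min(s + chunkSize - 1L, length(id))
        writeLines(paste0(">", id[i], "\n", seqs[i]), con)
    }
}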

Best,
Kasper



Re: [Bioc-devel] writeVcf performance

2014-09-04 Thread Gabe Becker
Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good test
case, assuming you strip out all the info stuff (info=NA when reading it
into R).

ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
took about 4.2 hrs to write out, and is about 1.5x the size of the files we
are actually dealing with (~50M ranges vs our ~30M).

Looking forward to a new, vastly improved writeVcf :).

~G


Re: [Bioc-devel] writeVcf performance

2014-09-04 Thread Valerie Obenchain

Thanks Gabe. I should have something for you on Monday.

Val



Re: [Bioc-devel] writeVcf performance

2014-09-02 Thread Michael Lawrence
Yes, it's very clear that the scaling is non-linear, and Gabe has been
experimenting with a chunk-wise + parallel algorithm. Unfortunately there
is some frustrating overhead with the parallelism. But I'm glad Val is
arriving at something quicker.
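
A rough sketch of the chunk-wise + parallel shape of it (hypothetical code,
not Gabe's actual implementation), with BiocParallel:

library(parallel)      # splitIndices
library(BiocParallel)  # bplapply

## format chunks of output lines in parallel, then write sequentially;
## the per-chunk paste() keeps the number of live strings bounded
writeChunkedParallel <- function(x, y, path, nchunks = 20L) {
    idx <- parallel::splitIndices(length(x), nchunks)
    chunks <- bplapply(idx, function(i) paste(x[i], y[i], sep = ":"))
    con <- file(path, "w")
    on.exit(close(con))
    for (ch in chunks) writeLines(ch, con)
}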

Michael


On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan mtmor...@fhcrc.org wrote:

 On 08/27/2014 11:56 AM, Gabe Becker wrote:

 The profiling I attached in my previous email is for 24 geno fields, as I
 said, but our typical usecase involves only ~4-6 fields, and is faster but
 still on the order of dozens of minutes.


 I think Val is arriving at a (much) more efficient implementation, but...

 I wanted to share my guess that the poor _scaling_ is because the garbage
 collector runs multiple times as the different strings are pasted together,
 and has to traverse, in linear time, increasing numbers of allocated SEXPs.
 So times scale approximately quadratically with the number of rows in the
 VCF.

 An efficiency is to reduce the number of SEXPs in play by writing out in
 chunks -- as each chunk is written, the SEXPs become available for
 collection and are re-used. Here's my toy example

 time.R
 ==
 splitIndices <- function (nx, ncl)
 {
     i <- seq_len(nx)
     if (ncl == 0L)
         list()
     else if (ncl == 1L || nx == 1L)
         list(i)
     else {
         fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
         breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
         structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
     }
 }

 x = as.character(seq_len(1e7)); y = sample(x)
 if (!is.na(Sys.getenv("SPLIT", NA))) {
     idx <- splitIndices(length(x), 20)
     system.time(for (i in idx) paste(x[i], y[i], sep=":"))
 } else {
     system.time(paste(x, y, sep=":"))
 }


 running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the
 relevant time is

user  system elapsed
  15.320   0.064  15.381

 versus with $ R --no-save --quiet -f time.R it is

user  system elapsed
  95.360   0.164  95.511

 I think this is likely an overall strategy when dealing with character
 data -- processing in independent chunks of moderate (1M?) size (enabling
 as a consequence parallel evaluation in modest memory) that are sufficient
 to benefit from vectorization, but that do not entail allocation of large
 numbers of in-use SEXPs.

 Martin


Re: [Bioc-devel] writeVcf performance

2014-08-29 Thread Kasper Daniel Hansen
Try to run it through the lineprof package for memory profiling; I have
found this to be very helpful.

Here is an old blog post I wrote about it
  http://www.hansenlab.org/rstats/2014/01/30/lineprof/
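
A minimal sketch of what it gives you (assuming the github version from the
post):

## devtools::install_github("hadley/lineprof")
library(lineprof)

f <- function(n) paste0("END", "=", sample(1e8, n))
prof <- lineprof(f(1e6))   # time, alloc, release per source line
prof
shine(prof)                # interactive line-by-line explorer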

Kasper


On Wed, Aug 27, 2014 at 2:56 PM, Gabe Becker becker.g...@gene.com wrote:

 The profiling I attached in my previous email is for 24 geno fields, as I
 said, but our typical usecase involves only ~4-6 fields, and is faster but
 still on the order of dozens of minutes.

 Sorry for the confusion.
 ~G


 On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker becke...@gene.com wrote:

  Martin and Val.
 
  I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
  with profiling enabled. The results of summaryRprof for that run are
  attached, though for a variety of reasons they are pretty misleading.
 
  It took over an hour to write (3700+ seconds), so it's definitely a
  bottleneck when the data get very large, even if it isn't for smaller
  data.
 
  Michael and I both think the culprit is all the pasting and cbinding
  that is going on, and more to the point, that memory for an internal
  representation to be written out is allocated at all. Streaming across
  the object, looping by rows and writing directly to file (e.g. from C)
  should be blisteringly fast in comparison.
 
  ~G
 
 
  On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <micha...@gene.com> wrote:
 
  Gabe is still testing/profiling, but we'll send something randomized
  along eventually.
 
 
  On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:
 
  I didn't see in the original thread a reproducible (simulated, I guess)
  example, to be explicit about what the problem is??
 
  Martin
 
 
  On 08/26/2014 10:47 AM, Michael Lawrence wrote:
 
  My understanding is that the heap optimization provided marginal gains,
  and that we need to think harder about how to optimize all of the string
  manipulation in writeVcf. We either need to reduce it or reduce its
  overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
 
 
  On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:
 
   Hi Gabe,
 
  Martin responded, and so did Michael,
 
  https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
 
  It sounded like Michael was ok with working with/around heap
  initialization.
 
  Michael, is that right or should we still consider this on the table?
 
 
  Val
 
 
  On 08/26/2014 09:34 AM, Gabe Becker wrote:
 
   Val,
 
  Has there been any movement on this? This remains a substantial
  bottleneck for us when writing very large VCF files (e.g.
  variants+genotypes for whole genome NGS samples).

  I was able to see a ~25% speedup with 4 cores and an optimal speedup of
  ~2x with 10-12 cores for a VCF with 500k rows using a very naive
  parallelization strategy and no other changes. I suspect this could be
  improved on quite a bit, or possibly made irrelevant with judicious use
  of serial C code.
 
  Did you and Martin make any plans regarding optimizing writeVcf?
 
  Best
  ~G
 
 
  On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <voben...@fhcrc.org> wrote:
 
   Hi Michael,
 
   I'm interested in working on this. I'll discuss with Martin next
   week when we're both back in the office.
 
   Val
 
 
 
 
 
   On 08/05/14 07:46, Michael Lawrence wrote:
 
   Hi guys (Val, Martin, Herve):
 
   Anyone have an itch for optimization? The writeVcf function is
   currently a bottleneck in our WGS genotyping pipeline. For a typical 50
   million row gVCF, it was taking 2.25 hours prior to yesterday's
   improvements (pasteCollapseRows) that brought it down to about 1 hour,
   which is still too long by my standards (> 0). Only takes 3 minutes to
   call the genotypes (and associated likelihoods etc) from the variant
   calls (using 80 cores and 450 GB RAM on one node), so the output is an
   issue. Profiling suggests that the running time scales non-linearly in
   the number of rows.
 
   Digging a little deeper, it seems to be something with R's
   string/memory allocation. Below, pasting 1 million strings takes 6
   seconds, but 10 million strings takes over 2 minutes. It gets way
   worse with 50 million. I suspect it has something to do with R's
   string hash table.
 
   set.seed(1000)
   end - sample(1e8, 1e6)
   system.time(paste0(END, =, end))
user  system elapsed
   6.396   0.028   6.420
 
   end - sample(1e8, 1e7)
   system.time(paste0(END, =, end))
user  system elapsed
   134.714   0.352 134.978
 
   Indeed, even this takes a long time (in a fresh session):
 
   set.seed(1000)
   end - 

Re: [Bioc-devel] writeVcf performance

2014-08-27 Thread Gabe Becker
Martin and Val.

I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with
profiling enabled. The results of summaryRprof for that run are attached,
though for a variety of reasons they are pretty misleading.

It took over an hour to write (3700+ seconds), so it's definitely a
bottleneck when the data get very large, even if it isn't for smaller data.

Michael and I both think the culprit is all the pasting and cbinding
going on, and, more to the point, the fact that memory is allocated at
all for an internal representation of the output. Streaming across the
object, looping over rows and writing directly to file (e.g. from C),
should be blisteringly fast in comparison.
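
A minimal sketch of that streaming idea, assuming a chunk-wise row
formatter; .formatVcfLines() below is hypothetical, not an actual
VariantAnnotation internal:

## Stream the VCF body to disk in fixed-size row chunks instead of
## materializing one giant character vector first.
writeVcfStreaming <- function(vcf, filename, chunk.size = 1e6) {
    con <- file(filename, open = "w")
    on.exit(close(con))
    n <- nrow(vcf)
    for (s in seq(1L, n, by = chunk.size)) {
        idx <- s:min(s + chunk.size - 1, n)
        chunk <- .formatVcfLines(vcf[idx, ])  # hypothetical formatter
        writeLines(chunk, con)                # chunk memory is reclaimed
    }
}

Nothing larger than one chunk is ever materialized, so peak memory
stays flat as the file grows.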

~G


On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence micha...@gene.com
wrote:

 Gabe is still testing/profiling, but we'll send something randomized along
 eventually.


 On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan mtmor...@fhcrc.org
 wrote:

 I didn't see a reproducible (simulated, I guess) example in the original
 thread; can we be explicit about what the problem is?

 Martin


 On 08/26/2014 10:47 AM, Michael Lawrence wrote:

 My understanding is that the heap optimization provided marginal gains,
 and that we need to think harder about how to optimize all of the string
 manipulation in writeVcf. We either need to reduce it or reduce its
 overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
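
 A rough, hedged micro-benchmark of that per-string cost: every distinct
 string is interned as a CHARSXP in R's global string cache, so distinct
 values pay an allocation each, while repeated values reuse one cached
 entry:

 x <- sample(1e8, 1e7)          # ~1e7 distinct integers
 system.time(as.character(x))   # ~1e7 fresh CHARSXP allocations
 y <- rep(1L, 1e7)              # same length, one distinct value
 system.time(as.character(y))   # reuses the single cached "1"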


 On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain voben...@fhcrc.org
 wrote:

  [remainder of the quoted thread trimmed; Val's message and the earlier
  thread appear in full below]

Re: [Bioc-devel] writeVcf performance

2014-08-27 Thread Gabe Becker
The profiling I attached in my previous email is for 24 geno fields, as I
said, but our typical use case involves only ~4-6 fields; that is faster,
but still on the order of dozens of minutes.

Sorry for the confusion.
~G


On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker becke...@gene.com wrote:

 [quoted thread trimmed; see the previous message]

Re: [Bioc-devel] writeVcf performance

2014-08-26 Thread Valerie Obenchain

Hi Gabe,

Martin responded, and so did Michael,

https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html

It sounded like Michael was ok with working with/around heap 
initialization.


Michael, is that right or should we still consider this on the table?


Val

On 08/26/2014 09:34 AM, Gabe Becker wrote:

Val,

Has there been any movement on this? This remains a substantial
bottleneck for us when writing very large VCF files (e.g.
variants+genotypes for whole genome NGS samples).

I was able to see a ~25% speedup with 4 cores and an optimal speedup
of ~2x with 10-12 cores for a VCF with 500k rows, using a very naive
parallelization strategy and no other changes. I suspect this could be
improved on quite a bit, or possibly made irrelevant with judicious use
of serial C code.
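
For concreteness, a sketch of the shape of that naive strategy (an
assumed reconstruction, not the actual code; .formatVcfLines() is a
hypothetical chunk formatter): format row chunks in parallel, then
write serially to keep row order:

library(parallel)

parWriteVcf <- function(vcf, filename, cores = 4L) {
    chunks <- splitIndices(nrow(vcf), cores)   # split rows across cores
    formatted <- mclapply(chunks, function(idx)
        .formatVcfLines(vcf[idx, ]),           # hypothetical formatter
        mc.cores = cores)
    writeLines(unlist(formatted), filename)    # serial, ordered write
}

The serial writeLines() is deliberate: output must preserve row order,
so only the formatting step is parallelized.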

Did you and Martin make any plans regarding optimizing writeVcf?

Best
~G


On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain voben...@fhcrc.org
mailto:voben...@fhcrc.org wrote:

Hi Michael,

I'm interested in working on this. I'll discuss with Martin next
week when we're both back in the office.

Val





On 08/05/14 07:46, Michael Lawrence wrote:

[Michael's original message trimmed here; it appears in full in Martin's
Aug 13 reply below]


Re: [Bioc-devel] writeVcf performance

2014-08-14 Thread Michael Lawrence
I thought it might come down to the heap initialization. We'll work with
that.
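
For reference, the knobs involved, as a sketch drawn from ?Memory (the
numbers are Martin's, from his reply quoted in full in the next
message):

# From a shell, pre-size the heap when starting R:
#   R --no-save --quiet --min-vsize=2048M --min-nsize=45M
# or, equivalently, set R_VSIZE / R_NSIZE in the environment beforehand.
# Inside R, gc() reports how far the heap triggers have actually grown:
gc()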


On Wed, Aug 13, 2014 at 4:42 PM, Martin Morgan mtmor...@fhcrc.org wrote:

 [Martin's reply, including Michael's original message and the heap
 pre-sizing timings, is archived in full as the next message; trimmed
 here]




Re: [Bioc-devel] writeVcf performance

2014-08-13 Thread Martin Morgan

On 08/05/2014 07:46 AM, Michael Lawrence wrote:

Hi guys (Val, Martin, Herve):

Anyone have an itch for optimization? The writeVcf function is currently a
bottleneck in our WGS genotyping pipeline. For a typical 50 million row
gVCF, it was taking 2.25 hours prior to yesterday's improvements
(pasteCollapseRows) that brought it down to about 1 hour, which is still
too long by my standards (>> 0). Only takes 3 minutes to call the genotypes
(and associated likelihoods etc) from the variant calls (using 80 cores and
450 GB RAM on one node), so the output is an issue. Profiling suggests that
the running time scales non-linearly in the number of rows.

Digging a little deeper, it seems to be something with R's string/memory
allocation. Below, pasting 1 million strings takes 6 seconds, but 10
million strings takes over 2 minutes. It gets way worse with 50 million. I
suspect it has something to do with R's string hash table.

set.seed(1000)
end <- sample(1e8, 1e6)
system.time(paste0("END", "=", end))
   user  system elapsed
  6.396   0.028   6.420

end <- sample(1e8, 1e7)
system.time(paste0("END", "=", end))
   user  system elapsed
134.714   0.352 134.978

Indeed, even this takes a long time (in a fresh session):

set.seed(1000)
end <- sample(1e8, 1e6)
end <- sample(1e8, 1e7)
system.time(as.character(end))
   user  system elapsed
 57.224   0.156  57.366


my usual trick is R --no-save --quiet --min-vsize=2048M --min-nsize=45M,
which changes the example above from

> system.time(as.character(end))
   user  system elapsed
 82.835   0.343  83.195

to

> system.time(as.character(end))
   user  system elapsed
  9.245   0.169   9.424

but I think it's a one-time gain; I wonder what the writeVcf command is
that you're running?


Martin



But running it a second time is faster (about what one would expect?):

system.time(levels <- as.character(end))
   user  system elapsed
 23.582   0.021  23.589

I did some simple profiling of R to find that the resizing of the string
hash table is not a significant component of the time. So maybe something
to do with the R heap/gc? No time right now to go deeper. But I know Martin
likes this sort of thing ;)
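
(A hedged sketch of doing that kind of profiling at the R level; the
hash-table resizing itself would have to be measured at the C level,
e.g. with perf on an instrumented R:

Rprof("as_character.out", interval = 0.02)
chars <- as.character(end)        # the slow coercion from above
Rprof(NULL)
summaryRprof("as_character.out")$by.self
)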

Michael





--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
