Re: [Bioc-devel] writeVcf performance

2014-09-30 Thread Valerie Obenchain

Hi Gabe,

It would help to have a common baseline. Please show the output for 
writing the Illumina file you sent originally:


library(VariantAnnotation)
fl <- "NA12877_S1.genome.vcf.gz"
vcf <- readVcf(fl, "", param=ScanVcfParam(info=NA))
dim(vcf)

gc()
print(system.time(writeVcf(vcf, tempfile())))
gc()

Here's my output. For this file, writing takes between 8 and 10 minutes,
depending on machine load. (The machine has 378 GB of RAM.)



dim(vcf)

[1] 51612762        1



gc()

             used    (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157824501  8428.8  298615851 15947.9  261241334 13951.8
Vcells 1109937571  8468.2 1778479081 13568.8 1693642432 12921.5

print(system.time(writeVcf(vcf, tempfile())))

   user  system elapsed
601.265   9.261 614.230

gc()

             used    (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157829989  8429.1  298615851 15947.9  261241334 13951.8
Vcells 1109941106  8468.2 2270282109 17320.9 2243573149 17117.2


I'm not seeing substantial differences in writeVcf between expanded and 
collapsed VCF. However, as an aside, note that when calling writeVcf on a 
VRanges (or, equivalently, an ExpandedVCF) the new file will have multiple 
rows per position if there were multiple ALT alleles. I'm not sure this 
is technically valid per the VCF spec, but I'm assuming Michael knew 
about this.


Five positions in the original VCF:
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "")
> dim(vcf)
[1] 5 3

vr <- as(vcf, "VRanges")
> length(vr)
[1] 21

Before writing out, VRanges is coerced to ExpandedVCF; 7 rows are 
written out.

> dim(as(vr, "VCF"))
[1] 7 3
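
Writing the coerced object shows the same behavior (a minimal sketch;
'vr' is the 21-record VRanges from above):

writeVcf(vr, tempfile())  # file has 7 data rows (the ExpandedVCF rows),
                          # not the original 5 positions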

Can you provide a portion of 'vcfgeno' for testing?

Valerie


sessionInfo()

R version 3.1.0 Patched (2014-06-24 r66016)
Platform: x86_64-unknown-linux-gnu (64-bit)


...


other attached packages:
[1] VariantAnnotation_1.11.35 Rsamtools_1.17.34
[3] Biostrings_2.33.14        XVector_0.5.8
[5] GenomicRanges_1.17.42     GenomeInfoDb_1.1.23
[7] IRanges_1.99.28           S4Vectors_0.2.4
[9] BiocGenerics_0.11.5




On 09/30/2014 08:36 AM, Gabe Becker wrote:

Valerie,

Apologies for this taking much longer than it should have. The changes
in Bioc-devel have wreaked havoc on the code we use to generate and
process the data we need to write out, but the fault is mine for not
getting on top of it sooner.

I'm not seeing the speed you mentioned above in the latest devel version
(1.11.35). It took ~1.5 hrs to write an expanded VCF with ~50M rows
(print output and sessionInfo() follow). I'll try reading in the
Illumina platinum file and writing it back out to see if it is something
about our specific vcf object (could ExpandedVCF vs VCF be an issue?).

 > vcfgeno
class: ExpandedVCF
dim: 50307989 1
rowData(vcf):
   GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
info(vcf):
   DataFrame with 1 column: END
   Fields with no header: END
geno(vcf):
   SimpleList of length 7: AD, DP, FT, GT, GQ, PL, MIN_DP
geno(header(vcf)):
       Number Type    Description
AD     2      Integer Allelic depths (number of reads in each observed al...
DP     1      Integer Total read depth
FT     1      String  Variant filters
GT     1      String  Genotype
GQ     1      Integer Genotype quality
PL     3      Integer Normalized, Phred-scaled likelihoods for genotypes
MIN_DP 1      Integer Minimum DP observed within the GVCF block
 > sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] VariantCallingPaper_0.0.3 GenomicFeatures_1.17.17
 [3] AnnotationDbi_1.27.16     Biobase_2.25.0
 [5] gmapR_1.7.8               VTGenotyping_0.0.1
 [7] BiocParallel_0.99.22      futile.logger_1.3.7
 [9] VariantTools_1.7.5        VariantAnnotation_1.11.35
[11] Rsamtools_1.17.34         Biostrings_2.33.14
[13] XVector_0.5.8             rtracklayer_1.25.16
[15] GenomicRanges_1.17.42     GenomeInfoDb_1.1.23
[17] IRanges_1.99.28           S4Vectors_0.2.4
[19] BiocGenerics_0.11.5       switchr_0.2.1

loaded via a namespace (and not attached):
 [1] annotate_1.43.5        base64enc_0.1-2
 [3] BatchJobs_1.4          BBmisc_1.7
 [5] biomaRt_2.21.1         bitops_1.0-6
 [7] brew_1.0-6             BSgenome_1.33.9
 [9] CGPtools_2.2.0         checkmate_1.4
[11] codetools_0.2-9        DBI_0.3.1
[13] DESeq_1.17.0           digest_0.6.4
[15] fail_1.2               foreach_1.4.2
[17] futile.options_1.0.0   genefilter_1.47.6
[19] geneplotter_1.43.0     GenomicAlignments_1.1.29

Re: [Bioc-devel] writeVcf performance

2014-09-30 Thread Gabe Becker
Valerie,

Apologies for this taking much longer than it should have. The changes in
Bioc-devel have wreaked havoc on the code we use to generate and process
the data we need to write out, but the fault is mine for not getting on top
of it sooner.

I'm not seeing the speed you mentioned above in the latest devel version
(1.11.35). It took ~1.5 hrs to write an expanded VCF with ~50M rows
(print output and sessionInfo() follow). I'll try reading in the Illumina
platinum file and writing it back out to see if it is something about our
specific vcf object (could ExpandedVCF vs VCF be an issue?).

> vcfgeno
class: ExpandedVCF
dim: 50307989 1
rowData(vcf):
  GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 1 column: END
  Fields with no header: END
geno(vcf):
  SimpleList of length 7: AD, DP, FT, GT, GQ, PL, MIN_DP
geno(header(vcf)):
       Number Type    Description
AD     2      Integer Allelic depths (number of reads in each observed al...
DP     1      Integer Total read depth
FT     1      String  Variant filters
GT     1      String  Genotype
GQ     1      Integer Genotype quality
PL     3      Integer Normalized, Phred-scaled likelihoods for genotypes
MIN_DP 1      Integer Minimum DP observed within the GVCF block

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] VariantCallingPaper_0.0.3 GenomicFeatures_1.17.17
 [3] AnnotationDbi_1.27.16     Biobase_2.25.0
 [5] gmapR_1.7.8               VTGenotyping_0.0.1
 [7] BiocParallel_0.99.22      futile.logger_1.3.7
 [9] VariantTools_1.7.5        VariantAnnotation_1.11.35
[11] Rsamtools_1.17.34         Biostrings_2.33.14
[13] XVector_0.5.8             rtracklayer_1.25.16
[15] GenomicRanges_1.17.42     GenomeInfoDb_1.1.23
[17] IRanges_1.99.28           S4Vectors_0.2.4
[19] BiocGenerics_0.11.5       switchr_0.2.1

loaded via a namespace (and not attached):
 [1] annotate_1.43.5                base64enc_0.1-2
 [3] BatchJobs_1.4                  BBmisc_1.7
 [5] biomaRt_2.21.1                 bitops_1.0-6
 [7] brew_1.0-6                     BSgenome_1.33.9
 [9] CGPtools_2.2.0                 checkmate_1.4
[11] codetools_0.2-9                DBI_0.3.1
[13] DESeq_1.17.0                   digest_0.6.4
[15] fail_1.2                       foreach_1.4.2
[17] futile.options_1.0.0           genefilter_1.47.6
[19] geneplotter_1.43.0             GenomicAlignments_1.1.29
[21] genoset_1.19.32                gneDB_0.4.18
[23] grid_3.1.1                     iterators_1.0.7
[25] lambda.r_1.1.6                 lattice_0.20-29
[27] Matrix_1.1-4                   RColorBrewer_1.0-5
[29] RCurl_1.95-4.3                 rjson_0.2.14
[31] RSQLite_0.11.4                 sendmailR_1.2-1
[33] splines_3.1.1                  stringr_0.6.2
[35] survival_2.37-7                tools_3.1.1
[37] TxDb.Hsapiens.BioMart.igis_2.3 XML_3.98-1.1
[39] xtable_1.7-4                   zlibbioc_1.11.1

On Wed, Sep 17, 2014 at 2:08 PM, Valerie Obenchain wrote:

> Hi Gabe,
>
> Have you had a chance to test writeVcf? The changes made over the past
> week have shaved off more time. It now takes ~ 9 minutes to write the
> NA12877 example.
>
>>> dim(vcf)
>> [1] 51612762        1
>>
>>> gc()
>>              used    (Mb) gc trigger    (Mb)   max used    (Mb)
>> Ncells  157818565  8428.5  298615851 15947.9  261235336 13951.5
>> Vcells 1109849222  8467.5 1778386307 13568.1 1693553890 12920.8
>>
>>> print(system.time(writeVcf(vcf, tempfile())))
>>    user  system elapsed
>> 555.282   6.700 565.700
>>
>>> gc()
>>              used    (Mb) gc trigger    (Mb)   max used    (Mb)
>> Ncells  157821990  8428.7  329305975 17586.9  261482807 13964.7
>> Vcells 1176960717  8979.5 2183277445 16657.1 2171401955 16566.5
>
>
> In the most recent version (1.11.35) I've added chunking for files with >
> 1e5 records. Right now the choice of # records per chunk is simple, based
> on total records only. We are still experimenting with this. You can
> override default chunking with 'nchunk'. Examples on the man page.
>
> Valerie
>
>
> On 09/08/14 08:43, Gabe Becker wrote:
>
>> Val,
>>
>> That is great. I'll check this out and test it on our end.
>>
>> ~G
>>
>> On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:
>>
>> The new writeVcf code is in 1.11.28.
>>
>> Using the illumina file you suggested, geno fields only, writing now
>> takes about 17 minutes.
>>
>>  > hdr
>> class: VCFHeader
>> samples(1): NA12877
>> meta(6): fileformat ApplyRecalibration ... reference source

Re: [Bioc-devel] writeVcf performance

2014-09-17 Thread Valerie Obenchain

Hi Gabe,

Have you had a chance to test writeVcf? The changes made over the past 
week have shaved off more time. It now takes ~ 9 minutes to write the 
NA12877 example.



dim(vcf)

[1] 51612762        1

gc()

             used    (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157818565  8428.5  298615851 15947.9  261235336 13951.5
Vcells 1109849222  8467.5 1778386307 13568.1 1693553890 12920.8

print(system.time(writeVcf(vcf, tempfile())))

   user  system elapsed
555.282   6.700 565.700

gc()

             used    (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157821990  8428.7  329305975 17586.9  261482807 13964.7
Vcells 1176960717  8979.5 2183277445 16657.1 2171401955 16566.5



In the most recent version (1.11.35) I've added chunking for files with > 1e5
records. Right now the choice of # records per chunk is simple, based on
total records only. We are still experimenting with this. You can override
the default chunking with 'nchunk'. Examples are on the man page.
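
For example (a sketch only; 'nchunk' is the override named above -- see
?writeVcf for the authoritative examples and current defaults):

writeVcf(vcf, tempfile(), nchunk=1e6)  # ~1e6 records per chunk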


Valerie


On 09/08/14 08:43, Gabe Becker wrote:

Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:

The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now
takes about 17 minutes.

 > hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

 > param = ScanVcfParam(info=NA)
 > vcf = readVcf(fl, "", param=param)
 > dim(vcf)
[1] 51612762        1

 > system.time(writeVcf(vcf, "out.vcf"))
    user  system elapsed
 971.032   6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't
speed things up enough we were planning to implement 'chunking'
through the VCF and/or move the parsing of info to C, however, it
looks like geno was the bottleneck.

I've tested a number of samples/fields combinations in files with >=
.5 million rows and the improvement over writeVcf() in release is ~ 90%.

Valerie




On 09/04/14 15:28, Valerie Obenchain wrote:

Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good
test case, assuming you strip out all the info (info=NA when reading
it into R) stuff.


ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz


took about 4.2 hrs to write out, and is about 1.5x the size of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward to a new, vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
<lawrence.mich...@gene.com> wrote:

 Yes, it's very clear that the scaling is non-linear, and Gabe has
 been experimenting with a chunk-wise + parallel algorithm.
 Unfortunately there is some frustrating overhead with the
 parallelism. But I'm glad Val is arriving at something quicker.

 Michael


 On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
 <mtmor...@fhcrc.org> wrote:

 On 08/27/2014 11:56 AM, Gabe Becker wrote:

 The profiling I attached in my previous email is for 24 geno
 fields, as I said, but our typical usecase involves only ~4-6
 fields, and is faster but still on the order of dozens of minutes.


 I think Val is arriving at a (much) more efficient
 implementation, but...

 I wanted to share my guess that the poor _scaling_ is because
 the garbage collector runs multiple times as the different
 strings are pasted together, and has to traverse, in linear
 time, increasing numbers of allocated SEXPs. So times scale
 approximately quadratically with the number of rows in the VCF

 An efficiency is to reduce the number of SEXPs in play by
 writing out in chunks -- as each chunk is written, the SEXPs
 become available for collection and are re-used.

Re: [Bioc-devel] writeVcf performance

2014-09-09 Thread Hervé Pagès

Ah, ok. I should have done an 'svn up' and re-tried 'R CMD check' before
reporting this. Thanks and sorry for the noise.

H.

On 09/09/2014 03:09 PM, Valerie Obenchain wrote:

Hi Herve,

This unit test passes in VA 1.11.30 (the current version in svn). It was
related to writeVcf(), not the IRanges/S4Vector stuff. My fault, not yours.

Val

On 09/09/2014 02:47 PM, Hervé Pagès wrote:

Hi Val,

On 09/09/2014 02:12 PM, Valerie Obenchain wrote:

Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the
process of moving SimpleList and DataFrame from IRanges to S4Vectors;
finished up today I think.


I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check'
failed for me because of a unit test error in test_VRanges_vcf().
Here is how to quickly reproduce:

   library(VariantAnnotation)
   library(RUnit)

source("path/to/VariantAnnotation/inst/unitTests/test_VRanges-class.R")

   dest <- tempfile()
   vr <- make_TARGET_VRanges_vcf()
   writeVcf(vr, dest)
   vcf <- readVcf(dest, genome = "hg19")
   perm <- c(1, 7, 8, 4, 2, 10)
   vcf.vr <- as(vcf, "VRanges")[perm]
   genome(vr) <- "hg19"
   checkIdenticalVCF(vr, vcf.vr)  # Error in checkIdentical(orig, vcf) :
FALSE

Hard for me to tell whether this is related to DataFrame moving from
IRanges to S4Vectors or to a regression in writeVcf(). Do you think
you can have a look? Thanks for the help and sorry for the trouble.

H.


Anyhow, if you get VariantAnnotation from svn
you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe
rtracklayer).

I'm working on the 'chunking' approach next. It looks like we can still
gain from adding that.

Valerie

On 09/08/2014 12:00 PM, Valerie Obenchain wrote:

fyi Martin found a bug with the treatment of list data (ie, Number =
'.') in the header. Working on a fix ...

Val


On 09/08/2014 08:43 AM, Gabe Becker wrote:

Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:

The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now
takes about 17 minutes.

 > hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

 > param = ScanVcfParam(info=NA)
 > vcf = readVcf(fl, "", param=param)
 > dim(vcf)
[1] 51612762        1

 > system.time(writeVcf(vcf, "out.vcf"))
    user  system elapsed
 971.032   6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't
speed things up enough we were planning to implement 'chunking'
through the VCF and/or move the parsing of info to C, however, it
looks like geno was the bottleneck.

I've tested a number of samples/fields combinations in files with >=
.5 million rows and the improvement over writeVcf() in release is ~ 90%.

Valerie




On 09/04/14 15:28, Valerie Obenchain wrote:

Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good
test case, assuming you strip out all the info (info=NA when reading
it into R) stuff.


ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz


took about 4.2 hrs to write out, and is about 1.5x the size of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward to a new, vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
<lawrence.mich...@gene.com> wrote:

 Yes, it's very clear that the scaling is non-linear, and Gabe has
 been experimenting with a chunk-wise + parallel algorithm.
 Unfortunately there is some frustrating overhead with the
 parallelism. But I'm glad Val is arriving at something quicker.

 Michael


 On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
 <mtmor...@fhcrc.org> wrote:

 On 08/27/2014 11:56 AM, Gabe Becker wrote:

 The profiling I attached in my previous email is for 24 geno
 fields, as I said, but our typical usecase involves only ~4-6
 fields, and is faster but still on the order of dozens of minutes.

Re: [Bioc-devel] writeVcf performance

2014-09-09 Thread Valerie Obenchain

Hi Herve,

This unit test passes in VA 1.11.30 (the current version in svn). It was 
related to writeVcf(), not the IRanges/S4Vector stuff. My fault, not yours.


Val

On 09/09/2014 02:47 PM, Hervé Pagès wrote:

Hi Val,

On 09/09/2014 02:12 PM, Valerie Obenchain wrote:

Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the
process of moving SimpleList and DataFrame from IRanges to S4Vectors;
finished up today I think.


I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check'
failed for me because of a unit test error in test_VRanges_vcf().
Here is how to quickly reproduce:

   library(VariantAnnotation)
   library(RUnit)
   source("path/to/VariantAnnotation/inst/unitTests/test_VRanges-class.R")

   dest <- tempfile()
   vr <- make_TARGET_VRanges_vcf()
   writeVcf(vr, dest)
   vcf <- readVcf(dest, genome = "hg19")
   perm <- c(1, 7, 8, 4, 2, 10)
   vcf.vr <- as(vcf, "VRanges")[perm]
   genome(vr) <- "hg19"
   checkIdenticalVCF(vr, vcf.vr)  # Error in checkIdentical(orig, vcf) :
FALSE

Hard for me to tell whether this is related to DataFrame moving from
IRanges to S4Vectors or to a regression in writeVcf(). Do you think
you can have a look? Thanks for the help and sorry for the trouble.

H.


Anyhow, if you get VariantAnnotation from svn
you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe
rtracklayer).

I'm working on the 'chunking' approach next. It looks like we can still
gain from adding that.

Valerie

On 09/08/2014 12:00 PM, Valerie Obenchain wrote:

fyi Martin found a bug with the treatment of list data (ie, Number =
'.') in the header. Working on a fix ...

Val


On 09/08/2014 08:43 AM, Gabe Becker wrote:

Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:

The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now
takes about 17 minutes.

 > hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

 > param = ScanVcfParam(info=NA)
 > vcf = readVcf(fl, "", param=param)
 > dim(vcf)
[1] 51612762        1

 > system.time(writeVcf(vcf, "out.vcf"))
    user  system elapsed
 971.032   6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't
speed things up enough we were planning to implement 'chunking'
through the VCF and/or move the parsing of info to C, however, it
looks like geno was the bottleneck.

I've tested a number of samples/fields combinations in files with >=
.5 million rows and the improvement over writeVcf() in release is ~ 90%.

Valerie




On 09/04/14 15:28, Valerie Obenchain wrote:

Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good
test case, assuming you strip out all the info (info=NA when reading
it into R) stuff.


ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz


took about 4.2 hrs to write out, and is about 1.5x the size of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward to a new, vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
<lawrence.mich...@gene.com> wrote:

 Yes, it's very clear that the scaling is non-linear, and Gabe has
 been experimenting with a chunk-wise + parallel algorithm.
 Unfortunately there is some frustrating overhead with the
 parallelism. But I'm glad Val is arriving at something quicker.

 Michael


 On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
 <mtmor...@fhcrc.org> wrote:

 On 08/27/2014 11:56 AM, Gabe Becker wrote:

 The profiling I attached in my previous email is for 24 geno
 fields, as I said, but our typical usecase involves only ~4-6
 fields, and is faster but still on the order of dozens of minutes.


 I think Val is arriving at a (much) more efficient
 implementation, but...

Re: [Bioc-devel] writeVcf performance

2014-09-09 Thread Hervé Pagès

Hi Val,

On 09/09/2014 02:12 PM, Valerie Obenchain wrote:

Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the
process of moving SimpleList and DataFrame from IRanges to S4Vectors;
finished up today I think.


I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check'
failed for me because of a unit test error in test_VRanges_vcf().
Here is how to quickly reproduce:

  library(VariantAnnotation)
  library(RUnit)
  source("path/to/VariantAnnotation/inst/unitTests/test_VRanges-class.R")

  dest <- tempfile()
  vr <- make_TARGET_VRanges_vcf()
  writeVcf(vr, dest)
  vcf <- readVcf(dest, genome = "hg19")
  perm <- c(1, 7, 8, 4, 2, 10)
  vcf.vr <- as(vcf, "VRanges")[perm]
  genome(vr) <- "hg19"
  checkIdenticalVCF(vr, vcf.vr)  # Error in checkIdentical(orig, vcf) : 
FALSE


Hard for me to tell whether this is related to DataFrame moving from 
IRanges to S4Vectors or to a regression in writeVcf(). Do you think
you can have a look? Thanks for the help and sorry for the trouble.

H.


Anyhow, if you get VariantAnnotation from svn
you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe
rtracklayer).

I'm working on the 'chunking' approach next. It looks like we can still
gain from adding that.

Valerie

On 09/08/2014 12:00 PM, Valerie Obenchain wrote:

fyi Martin found a bug with the treatment of list data (ie, Number =
'.') in the header. Working on a fix ...

Val


On 09/08/2014 08:43 AM, Gabe Becker wrote:

Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:

The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now
takes about 17 minutes.

 > hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

 > param = ScanVcfParam(info=NA)
 > vcf = readVcf(fl, "", param=param)
 > dim(vcf)
[1] 51612762        1

 > system.time(writeVcf(vcf, "out.vcf"))
    user  system elapsed
 971.032   6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't
speed things up enough we were planning to implement 'chunking'
through the VCF and/or move the parsing of info to C, however, it
looks like geno was the bottleneck.

I've tested a number of samples/fields combinations in files with >=
.5 million rows and the improvement over writeVcf() in release is
~ 90%.

Valerie




On 09/04/14 15:28, Valerie Obenchain wrote:

Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good
test case, assuming you strip out all the info (info=NA when reading
it into R) stuff.


ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz


took about 4.2 hrs to write out, and is about 1.5x the size of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward to a new, vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
<lawrence.mich...@gene.com> wrote:

 Yes, it's very clear that the scaling is non-linear, and Gabe has
 been experimenting with a chunk-wise + parallel algorithm.
 Unfortunately there is some frustrating overhead with the
 parallelism. But I'm glad Val is arriving at something quicker.

 Michael


 On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
 <mtmor...@fhcrc.org> wrote:

 On 08/27/2014 11:56 AM, Gabe Becker wrote:

 The profiling I attached in my previous email is for 24 geno
 fields, as I said, but our typical usecase involves only ~4-6
 fields, and is faster but still on the order of dozens of minutes.


 I think Val is arriving at a (much) more efficient
 implementation, but...

 I wanted to share my guess that the poor _scaling_ is because
 the garbage collector runs multiple times as the different
 strings are pasted together, and has to traverse, in linear
 time, increasing numbers of allocated SEXPs.

Re: [Bioc-devel] writeVcf performance

2014-09-09 Thread Valerie Obenchain
Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the 
process of moving SimpleList and DataFrame from IRanges to S4Vectors; 
finished up today I think. Anyhow, if you get VariantAnnotation from svn 
you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe 
rtracklayer).


I'm working on the 'chunking' approach next. It looks like we can still 
gain from adding that.


Valerie

On 09/08/2014 12:00 PM, Valerie Obenchain wrote:

fyi Martin found a bug with the treatment of list data (ie, Number =
'.') in the header. Working on a fix ...

Val


On 09/08/2014 08:43 AM, Gabe Becker wrote:

Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:

The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now
takes about 17 minutes.

 > hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

 > param = ScanVcfParam(info=NA)
 > vcf = readVcf(fl, "", param=param)
 > dim(vcf)
[1] 51612762        1

 > system.time(writeVcf(vcf, "out.vcf"))
    user  system elapsed
 971.032   6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't
speed things up enough we were planning to implement 'chunking'
through the VCF and/or move the parsing of info to C, however, it
looks like geno was the bottleneck.

I've tested a number of samples/fields combinations in files with >=
.5 million rows and the improvement over writeVcf() in release is
~ 90%.

Valerie




On 09/04/14 15:28, Valerie Obenchain wrote:

Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good
test case, assuming you strip out all the info (info=NA when reading
it into R) stuff.


ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz


took about 4.2 hrs to write out, and is about 1.5x the size of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward to a new, vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
<lawrence.mich...@gene.com> wrote:

 Yes, it's very clear that the scaling is non-linear, and Gabe has
 been experimenting with a chunk-wise + parallel algorithm.
 Unfortunately there is some frustrating overhead with the
 parallelism. But I'm glad Val is arriving at something quicker.

 Michael


 On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
 <mtmor...@fhcrc.org> wrote:

 On 08/27/2014 11:56 AM, Gabe Becker wrote:

 The profiling I attached in my previous email is for 24 geno
 fields, as I said, but our typical usecase involves only ~4-6
 fields, and is faster but still on the order of dozens of minutes.


 I think Val is arriving at a (much) more efficient
 implementation, but...

 I wanted to share my guess that the poor _scaling_ is because
 the garbage collector runs multiple times as the different
 strings are pasted together, and has to traverse, in linear
 time, increasing numbers of allocated SEXPs. So times scale
 approximately quadratically with the number of rows in the VCF

 An efficiency is to reduce the number of SEXPs in play by
 writing out in chunks -- as each chunk is written, the SEXPs
 become available for collection and are re-used. Here's my toy
 example

 time.R
 ==
 splitIndices <- function (nx, ncl)
 {
     i <- seq_len(nx)
     if (ncl == 0L)
         list()
     else if (ncl == 1L || nx == 1L)
         list(i)
     else {
         fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
         breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
         structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
     }
 }

Re: [Bioc-devel] writeVcf performance

2014-09-08 Thread Valerie Obenchain
fyi Martin found a bug with the treatment of list data (ie, Number = 
'.') in the header. Working on a fix ...
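
For context, 'Number = .' refers to header lines that declare a
variable-length field, which readVcf/writeVcf handle as list data --
for example, a line like:

##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count">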


Val


On 09/08/2014 08:43 AM, Gabe Becker wrote:

Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:

The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now
takes about 17 minutes.

 > hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

 > param = ScanVcfParam(info=NA)
 > vcf = readVcf(fl, "", param=param)
 > dim(vcf)
[1] 51612762        1

 > system.time(writeVcf(vcf, "out.vcf"))
    user  system elapsed
 971.032   6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't
speed things up enough we were planning to implement 'chunking'
through the VCF and/or move the parsing of info to C, however, it
looks like geno was the bottleneck.

I've tested a number of samples/fields combinations in files with >=
.5 million rows and the improvement over writeVcf() in release is ~ 90%.

Valerie




On 09/04/14 15:28, Valerie Obenchain wrote:

Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good
test case, assuming you strip out all the info (info=NA when reading
it into R) stuff.


ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz


took about 4.2 hrs to write out, and is about 1.5x the size of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward to a new, vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
<lawrence.mich...@gene.com> wrote:

 Yes, it's very clear that the scaling is non-linear, and Gabe has
 been experimenting with a chunk-wise + parallel algorithm.
 Unfortunately there is some frustrating overhead with the
 parallelism. But I'm glad Val is arriving at something quicker.

 Michael


 On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
 <mtmor...@fhcrc.org> wrote:

 On 08/27/2014 11:56 AM, Gabe Becker wrote:

 The profiling I attached in my previous email is for 24 geno
 fields, as I said, but our typical usecase involves only ~4-6
 fields, and is faster but still on the order of dozens of minutes.


 I think Val is arriving at a (much) more efficient
 implementation, but...

 I wanted to share my guess that the poor _scaling_ is because
 the garbage collector runs multiple times as the different
 strings are pasted together, and has to traverse, in linear
 time, increasing numbers of allocated SEXPs. So times scale
 approximately quadratically with the number of rows in the VCF

 An efficiency is to reduce the number of SEXPs in play by
 writing out in chunks -- as each chunk is written, the SEXPs
 become available for collection and are re-used. Here's my toy
 example

 time.R
 ==
 splitIndices <- function (nx, ncl)
 {
     i <- seq_len(nx)
     if (ncl == 0L)
         list()
     else if (ncl == 1L || nx == 1L)
         list(i)
     else {
         fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
         breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
         structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
     }
 }

Re: [Bioc-devel] writeVcf performance

2014-09-08 Thread Gabe Becker
Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain wrote:

> The new writeVcf code is in 1.11.28.
>
> Using the illumina file you suggested, geno fields only, writing now takes
> about 17 minutes.
>
> > hdr
> class: VCFHeader
> samples(1): NA12877
> meta(6): fileformat ApplyRecalibration ... reference source
> fixed(1): FILTER
> info(22): AC AF ... culprit set
> geno(8): GT GQX ... PL VF
>
> > param = ScanVcfParam(info=NA)
> > vcf = readVcf(fl, "", param=param)
> > dim(vcf)
> [1] 51612762        1
>
> > system.time(writeVcf(vcf, "out.vcf"))
>     user  system elapsed
>  971.032   6.568 1004.593
>
> In 1.11.28, parsing of geno data was moved to C. If this didn't speed
> things up enough we were planning to implement 'chunking' through the VCF
> and/or move the parsing of info to C, however, it looks like geno was the
> bottleneck.
>
> I've tested a number of samples/fields combinations in files with >= .5
> million rows and the improvement over writeVcf() in release is ~ 90%.
>
> Valerie
>
>
>
>
> On 09/04/14 15:28, Valerie Obenchain wrote:
>
>> Thanks Gabe. I should have something for you on Monday.
>>
>> Val
>>
>>
>> On 09/04/2014 01:56 PM, Gabe Becker wrote:
>>
>>> Val and Martin,
>>>
>>> Apologies for the delay.
>>>
>>> We realized that the Illumina platinum genome vcf files make a good test
>>> case, assuming you strip out all the info (info=NA when reading it into
>>> R) stuff.
>>>
>>> ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
>>> took about 4.2 hrs to write out, and is about 1.5x the size of the
>>> files we are actually dealing with (~50M ranges vs our ~30M).
>>>
>>> Looking forward to a new, vastly improved writeVcf :).
>>>
>>> ~G
>>>
>>>
>>> On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
>>> <lawrence.mich...@gene.com> wrote:
>>>
>>> Yes, it's very clear that the scaling is non-linear, and Gabe has
>>> been experimenting with a chunk-wise + parallel algorithm.
>>> Unfortunately there is some frustrating overhead with the
>>> parallelism. But I'm glad Val is arriving at something quicker.
>>>
>>> Michael
>>>
>>>
>>> On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:
>>>
>>> On 08/27/2014 11:56 AM, Gabe Becker wrote:
>>>
>>> The profiling I attached in my previous email is for 24 geno
>>> fields, as I said, but our typical usecase involves only ~4-6
>>> fields, and is faster but still on the order of dozens of minutes.
>>>
>>>
>>> I think Val is arriving at a (much) more efficient
>>> implementation, but...
>>>
>>> I wanted to share my guess that the poor _scaling_ is because
>>> the garbage collector runs multiple times as the different
>>> strings are pasted together, and has to traverse, in linear
>>> time, increasing numbers of allocated SEXPs. So times scale
>>> approximately quadratically with the number of rows in the VCF
>>>
>>> An efficiency is to reduce the number of SEXPs in play by
>>> writing out in chunks -- as each chunk is written, the SEXPs
>>> become available for collection and are re-used. Here's my toy
>>> example
>>>
>>> time.R
>>> ==
>>> splitIndices <- function (nx, ncl)
>>> {
>>>  i <- seq_len(nx)
>>>  if (ncl == 0L)
>>>  list()
>>>  else if (ncl == 1L || nx == 1L)
>>>  list(i)
>>>  else {
>>>  fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
>>>  breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
>>>  structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
>>>  }
>>> }
>>>
>>> x = as.character(seq_len(1e7)); y = sample(x)
>>> if (!is.na(Sys.getenv("SPLIT", NA))) {
>>>  idx <- splitIndices(length(x), 20)
>>>  system.time(for (i in idx) paste(x[i], y[i], sep=":"))
>>> } else {
>>>  system.time(paste(x, y, sep=":"))
>>> }
>>>
>>>
>>> running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f
>>> time.R the relevant time is
>>>
>>> user  system elapsed
>>>   15.320   0.064  15.381
>>>
>>> versus with $ R --no-save --quiet -f time.R it is
>>>
>>> user  system elapsed
>>>   95.360   0.164  95.511
>>>
>>> I think this is likely an overall strategy when dealing with
>>> character data -- processing in independent chunks of moderate
>>> (1M?) size (enabling as a consequence parallel evaluation in
>>> modest memory) that are sufficient to benefit from
>>> vectorization, but that do not entail allocation of large
>>> numbers of in-use SEXPs.
>>>
>>> Martin
>

Re: [Bioc-devel] writeVcf performance

2014-09-08 Thread Valerie Obenchain

The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now 
takes about 17 minutes.


> hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

> param = ScanVcfParam(info=NA)
> vcf = readVcf(fl, "", param=param)
> dim(vcf)
[1] 51612762        1

> system.time(writeVcf(vcf, "out.vcf"))
    user  system elapsed
 971.032   6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't speed 
things up enough we were planning to implement 'chunking' through the 
VCF and/or move the parsing of info to C, however, it looks like geno 
was the bottleneck.


I've tested a number of samples/fields combinations in files with >= .5 
million rows and the improvement over writeVcf() in release is ~ 90%.


Valerie



On 09/04/14 15:28, Valerie Obenchain wrote:

Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good test
case, assuming you strip out all the info (info=NA when reading it into
R) stuff.

ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
took about 4.2 hrs to write out, and is about 1.5x the size of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward to a new, vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
<lawrence.mich...@gene.com> wrote:

Yes, it's very clear that the scaling is non-linear, and Gabe has
been experimenting with a chunk-wise + parallel algorithm.
Unfortunately there is some frustrating overhead with the
parallelism. But I'm glad Val is arriving at something quicker.

Michael


On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:

On 08/27/2014 11:56 AM, Gabe Becker wrote:

The profiling I attached in my previous email is for 24 geno fields,
as I said, but our typical usecase involves only ~4-6 fields, and is
faster but still on the order of dozens of minutes.


I think Val is arriving at a (much) more efficient
implementation, but...

I wanted to share my guess that the poor _scaling_ is because
the garbage collector runs multiple times as the different
strings are pasted together, and has to traverse, in linear
time, increasing numbers of allocated SEXPs. So times scale
approximately quadratically with the number of rows in the VCF

An efficiency is to reduce the number of SEXPs in play by
writing out in chunks -- as each chunk is written, the SEXPs
become available for collection and are re-used. Here's my toy
example

time.R
==
splitIndices <- function (nx, ncl)
{
 i <- seq_len(nx)
 if (ncl == 0L)
 list()
 else if (ncl == 1L || nx == 1L)
 list(i)
 else {
 fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
 breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
 structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
 }
}

x = as.character(seq_len(1e7)); y = sample(x)
if (!is.na(Sys.getenv("SPLIT", NA))) {
 idx <- splitIndices(length(x), 20)
 system.time(for (i in idx) paste(x[i], y[i], sep=":"))
} else {
 system.time(paste(x, y, sep=":"))
}


running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f
time.R the relevant time is

user  system elapsed
  15.320   0.064  15.381

versus with $ R --no-save --quiet -f time.R it is

user  system elapsed
  95.360   0.164  95.511

I think this is likely an overall strategy when dealing with
character data -- processing in independent chunks of moderate
(1M?) size (enabling as a consequence parallel evaluation in
modest memory) that are sufficient to benefit from
vectorization, but that do not entail allocation of large
numbers of in-use SEXPs.

Martin


Sorry for the confusion.
~G


On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
<becke...@gene.com> wrote:

 Martin and Val.

 I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno
 fields) with profiling enabled. The results of summaryRprof for
 that run are attached, though for a variety of reasons they are
 pretty misleading.

 It took over an hour to write (3700+ seconds), so it's definitely
 a bottleneck when the data get very large, even if it isn't for
 smaller data.

Re: [Bioc-devel] writeVcf performance

2014-09-05 Thread Kasper Daniel Hansen
This approach, writing in chunks, is the same one Herve and I used for
writing FASTA in the Biostrings package, although I see that Herve has now
replaced the R implementation with a C implementation. I similarly found an
absolutely huge speed up when writing genomes, by chunking.

Best,
Kasper
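
A minimal, self-contained sketch of the chunk-wise writing idea (a
hypothetical helper for illustration, not the actual Biostrings or
VariantAnnotation internals): each chunk's lines are built and written
before the next chunk is allocated, so the temporary character vectors
stay eligible for garbage collection.

writeInChunks <- function(x, y, path, chunkSize = 1e6L) {
    con <- file(path, "w")
    on.exit(close(con))
    starts <- seq(1L, length(x), by = chunkSize)
    for (s in starts) {
        idx <- s:min(s + chunkSize - 1L, length(x))
        ## paste() over a bounded slice keeps the number of live
        ## in-use SEXPs small; writeLines() flushes the chunk to disk
        writeLines(paste(x[idx], y[idx], sep = ":"), con)
    }
}

x <- as.character(seq_len(1e7)); y <- sample(x)
writeInChunks(x, y, tempfile())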


On Tue, Sep 2, 2014 at 4:33 PM, Martin Morgan  wrote:

> On 08/27/2014 11:56 AM, Gabe Becker wrote:
>
>> The profiling I attached in my previous email is for 24 geno fields,
>> as I said, but our typical usecase involves only ~4-6 fields, and is
>> faster but still on the order of dozens of minutes.
>>
>
> I think Val is arriving at a (much) more efficient implementation, but...
>
> I wanted to share my guess that the poor _scaling_ is because the garbage
> collector runs multiple times as the different strings are pasted together,
> and has to traverse, in linear time, increasing numbers of allocated SEXPs.
> So times scale approximately quadratically with the number of rows in the
> VCF
>
> An efficiency is to reduce the number of SEXPs in play by writing out in
> chunks -- as each chunk is written, the SEXPs become available for
> collection and are re-used. Here's my toy example
>
> time.R
> ==
> splitIndices <- function (nx, ncl)
> {
> i <- seq_len(nx)
> if (ncl == 0L)
> list()
> else if (ncl == 1L || nx == 1L)
> list(i)
> else {
> fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
> breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
> structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
> }
> }
>
> x = as.character(seq_len(1e7)); y = sample(x)
> if (!is.na(Sys.getenv("SPLIT", NA))) {
> idx <- splitIndices(length(x), 20)
> system.time(for (i in idx) paste(x[i], y[i], sep=":"))
> } else {
> system.time(paste(x, y, sep=":"))
> }
>
>
> running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the
> relevant time is
>
>user  system elapsed
>  15.320   0.064  15.381
>
> versus with $ R --no-save --quiet -f time.R it is
>
>user  system elapsed
>  95.360   0.164  95.511
>
> I think this is likely an overall strategy when dealing with character
> data -- processing in independent chunks of moderate (1M?) size (enabling
> as a consequence parallel evaluation in modest memory) that are sufficient
> to benefit from vectorization, but that do not entail allocation of large
> numbers of in-use SEXPs.
>
> Martin
>
>
>> Sorry for the confusion.
>> ~G
>>
>>
>> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <becke...@gene.com> wrote:
>>
>> Martin and Val.
>>
>> I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno
>> fields) with
>> profiling enabled. The results of summaryRprof for that run are
>> attached,
>> though for a variety of reasons they are pretty misleading.
>>
>> It took over an hour to write (3700+seconds), so it's definitely a
>> bottleneck when the data get very large, even if it isn't for smaller
>> data.
>>
>> Michael and I both think the culprit is all the pasting and cbinding
>> that is
>> going on, and more to the point, that memory for an internal
>> representation
>> to be written out is allocated at all.  Streaming across the object,
>> looping
>> by rows and writing directly to file (e.g. from C) should be
>> blisteringly
>> fast in comparison.
>>
>> ~G
>>
>>
>> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <micha...@gene.com> wrote:
>>
>> Gabe is still testing/profiling, but we'll send something
>> randomized
>> along eventually.
>>
>>
>> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:
>>
>> I didn't see in the original thread a reproducible (simulated, I
>> guess) example, to be explicit about what the problem is??
>>
>> Martin
>>
>>
>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>>
>> My understanding is that the heap optimization provided marginal
>> gains, and that we need to think harder about how to optimize all
>> of the string manipulation in writeVcf. We either need to reduce
>> it or reduce its overhead (i.e., the CHARSXP allocation). Gabe is
>> doing more tests.
>>
>>
>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain
>> <voben...@fhcrc.org> wrote:
>>
>> Hi Gabe,
>>
>> Martin responded, and so did Michael,
>>
>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>
>> It sounded like Michael was ok with working with/around heap
>> initialization.
>>
>> 

Re: [Bioc-devel] writeVcf performance

2014-09-04 Thread Valerie Obenchain

Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good test
case, assuming you strip out all the info (info=NA when reading it into
R) stuff.

ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
took about 4.2 hrs to write out, and is about 1.5x the size of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward to a new, vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
<lawrence.mich...@gene.com> wrote:

Yes, it's very clear that the scaling is non-linear, and Gabe has
been experimenting with a chunk-wise + parallel algorithm.
Unfortunately there is some frustrating overhead with the
parallelism. But I'm glad Val is arriving at something quicker.

Michael


On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:

On 08/27/2014 11:56 AM, Gabe Becker wrote:

The profiling I attached in my previous email is for 24 geno fields,
as I said, but our typical usecase involves only ~4-6 fields, and is
faster but still on the order of dozens of minutes.


I think Val is arriving at a (much) more efficient
implementation, but...

I wanted to share my guess that the poor _scaling_ is because
the garbage collector runs multiple times as the different
strings are pasted together, and has to traverse, in linear
time, increasing numbers of allocated SEXPs. So times scale
approximately quadratically with the number of rows in the VCF

An efficiency is to reduce the number of SEXPs in play by
writing out in chunks -- as each chunk is written, the SEXPs
become available for collection and are re-used. Here's my toy
example

time.R
==
splitIndices <- function (nx, ncl)
{
 i <- seq_len(nx)
 if (ncl == 0L)
 list()
 else if (ncl == 1L || nx == 1L)
 list(i)
 else {
 fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
 breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
 structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
 }
}

x = as.character(seq_len(1e7)); y = sample(x)
if (!is.na(Sys.getenv("SPLIT", NA))) {
 idx <- splitIndices(length(x), 20)
 system.time(for (i in idx) paste(x[i], y[i], sep=":"))
} else {
 system.time(paste(x, y, sep=":"))
}


running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f
time.R the relevant time is

user  system elapsed
  15.320   0.064  15.381

versus with $ R --no-save --quiet -f time.R it is

user  system elapsed
  95.360   0.164  95.511

I think this is likely an overall strategy when dealing with
character data -- processing in independent chunks of moderate
(1M?) size (enabling as a consequence parallel evaluation in
modest memory) that are sufficient to benefit from
vectorization, but that do not entail allocation of large
numbers of in-use SEXPs.

Martin


Sorry for the confusion.
~G


On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
<becke...@gene.com> wrote:

 Martin and Val.

 I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno
 fields) with profiling enabled. The results of summaryRprof for
 that run are attached, though for a variety of reasons they are
 pretty misleading.

 It took over an hour to write (3700+ seconds), so it's definitely
 a bottleneck when the data get very large, even if it isn't for
 smaller data.

 Michael and I both think the culprit is all the pasting and
 cbinding that is going on, and more to the point, that memory for
 an internal representation to be written out is allocated at all.
 Streaming across the object, looping by rows and writing directly
 to file (e.g. from C) should be blisteringly fast in comparison.

 ~G


 On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence
 <micha...@gene.com> wrote:

 Gabe is still testing/profiling, but we'll send something
 randomized along eventually.

Re: [Bioc-devel] writeVcf performance

2014-09-04 Thread Gabe Becker
Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good test
case, assuming you strip out all the info (info=NA when reading it into R)
stuff.

ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took
about 4.2 hrs to write out, and is about 1.5x the size of the files we are
actually dealing with (~50M ranges vs our ~30M).

Looking forward to a new, vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence wrote:

> Yes, it's very clear that the scaling is non-linear, and Gabe has been
> experimenting with a chunk-wise + parallel algorithm. Unfortunately there
> is some frustrating overhead with the parallelism. But I'm glad Val is
> arriving at something quicker.
>
> Michael
>
>
> On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan  wrote:
>
>> On 08/27/2014 11:56 AM, Gabe Becker wrote:
>>
>>> The profiling I attached in my previous email is for 24 geno fields,
>>> as I said, but our typical usecase involves only ~4-6 fields, and is
>>> faster but still on the order of dozens of minutes.
>>>
>>
>> I think Val is arriving at a (much) more efficient implementation, but...
>>
>> I wanted to share my guess that the poor _scaling_ is because the garbage
>> collector runs multiple times as the different strings are pasted together,
>> and has to traverse, in linear time, increasing numbers of allocated SEXPs.
>> So times scale approximately quadratically with the number of rows in the
>> VCF
>>
>> An efficiency is to reduce the number of SEXPs in play by writing out in
>> chunks -- as each chunk is written, the SEXPs become available for
>> collection and are re-used. Here's my toy example
>>
>> time.R
>> ==
>> splitIndices <- function (nx, ncl)
>> {
>> i <- seq_len(nx)
>> if (ncl == 0L)
>> list()
>> else if (ncl == 1L || nx == 1L)
>> list(i)
>> else {
>> fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
>> breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
>> structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
>> }
>> }
>>
>> x = as.character(seq_len(1e7)); y = sample(x)
>> if (!is.na(Sys.getenv("SPLIT", NA))) {
>> idx <- splitIndices(length(x), 20)
>> system.time(for (i in idx) paste(x[i], y[i], sep=":"))
>> } else {
>> system.time(paste(x, y, sep=":"))
>> }
>>
>>
>> running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the
>> relevant time is
>>
>>user  system elapsed
>>  15.320   0.064  15.381
>>
>> versus with $ R --no-save --quiet -f time.R it is
>>
>>user  system elapsed
>>  95.360   0.164  95.511
>>
>> I think this is likely an overall strategy when dealing with character
>> data -- processing in independent chunks of moderate (1M?) size (enabling
>> as a consequence parallel evaluation in modest memory) that are sufficient
>> to benefit from vectorization, but that do not entail allocation of large
>> numbers of in-use SEXPs.
>>
>> Martin
>>
>>
>>> Sorry for the confusion.
>>> ~G
>>>
>>>
>>> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <becke...@gene.com> wrote:
>>>
>>> Martin and Val.
>>>
>>> I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno
>>> fields) with
>>> profiling enabled. The results of summaryRprof for that run are
>>> attached,
>>> though for a variety of reasons they are pretty misleading.
>>>
>>> It took over an hour to write (3700+seconds), so it's definitely a
>>> bottleneck when the data get very large, even if it isn't for
>>> smaller data.
>>>
>>> Michael and I both think the culprit is all the pasting and cbinding
>>> that is
>>> going on, and more to the point, that memory for an internal
>>> representation
>>> to be written out is allocated at all.  Streaming across the object,
>>> looping
>>> by rows and writing directly to file (e.g. from C) should be
>>> blisteringly
>>> fast in comparison.
>>>
>>> ~G
>>>
>>>
>>> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <micha...@gene.com> wrote:
>>>
>>> Gabe is still testing/profiling, but we'll send something
>>> randomized
>>> along eventually.
>>>
>>>
>>> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:
>>>
>>> I didn't see in the original thread a reproducible (simulated, I
>>> guess) example, to be explicit about what the problem is??
>>>
>>> Martin
>>>
>>>
>>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>>>
>>> My understanding is that the heap optimization provided marginal
>>> gains, and that we need to think harder about how to optimize all
>>> of the string manipulation in writeVcf. We either need to reduce
>>> it or reduce its overhead (i.e., the CHARSXP allocation). Gabe is
>>> doing more tests.

Re: [Bioc-devel] writeVcf performance

2014-09-02 Thread Michael Lawrence
Yes, it's very clear that the scaling is non-linear, and Gabe has been
experimenting with a chunk-wise + parallel algorithm. Unfortunately there
is some frustrating overhead with the parallelism. But I'm glad Val is
arriving at something quicker.

Michael


On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan  wrote:

> On 08/27/2014 11:56 AM, Gabe Becker wrote:
>
>> The profiling I attached in my previous email is for 24 geno fields,
>> as I said, but our typical usecase involves only ~4-6 fields, and is
>> faster but still on the order of dozens of minutes.
>>
>
> I think Val is arriving at a (much) more efficient implementation, but...
>
> I wanted to share my guess that the poor _scaling_ is because the garbage
> collector runs multiple times as the different strings are pasted together,
> and has to traverse, in linear time, increasing numbers of allocated SEXPs.
> So times scale approximately quadratically with the number of rows in the
> VCF
>
> An efficiency is to reduce the number of SEXPs in play by writing out in
> chunks -- as each chunk is written, the SEXPs become available for
> collection and are re-used. Here's my toy example
>
> time.R
> ==
> splitIndices <- function (nx, ncl)
> {
> i <- seq_len(nx)
> if (ncl == 0L)
> list()
> else if (ncl == 1L || nx == 1L)
> list(i)
> else {
> fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
> breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
> structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
> }
> }
>
> x = as.character(seq_len(1e7)); y = sample(x)
> if (!is.na(Sys.getenv("SPLIT", NA))) {
> idx <- splitIndices(length(x), 20)
> system.time(for (i in idx) paste(x[i], y[i], sep=":"))
> } else {
> system.time(paste(x, y, sep=":"))
> }
>
>
> running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the
> relevant time is
>
>user  system elapsed
>  15.320   0.064  15.381
>
> versus with $ R --no-save --quiet -f time.R it is
>
>user  system elapsed
>  95.360   0.164  95.511
>
> I think this is likely an overall strategy when dealing with character
> data -- processing in independent chunks of moderate (1M?) size (enabling
> as a consequence parallel evaluation in modest memory) that are sufficient
> to benefit from vectorization, but that do not entail allocation of large
> numbers of in-use SEXPs.
>
> Martin
>
>
>> Sorry for the confusion.
>> ~G
>>
>>
>> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker  wrote:
>>
>> Martin and Val.
>>
>> I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno
>> fields) with
>> profiling enabled. The results of summaryRprof for that run are
>> attached,
>> though for a variety of reasons they are pretty misleading.
>>
>> It took over an hour to write (3700+seconds), so it's definitely a
>> bottleneck when the data get very large, even if it isn't for smaller
>> data.
>>
>> Michael and I both think the culprit is all the pasting and cbinding
>> that is
>> going on, and more to the point, that memory for an internal
>> representation
>> to be written out is allocated at all.  Streaming across the object,
>> looping
>> by rows and writing directly to file (e.g. from C) should be
>> blisteringly
>> fast in comparison.
>>
>> ~G
>>
>>
>> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence  wrote:
>>
>> Gabe is still testing/profiling, but we'll send something
>> randomized
>> along eventually.
>>
>>
>> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan  wrote:
>>
>> I didn't see in the original thread a reproducible
>> (simulated, I
>> guess) example, to be explicit about what the problem is??
>>
>> Martin
>>
>>
>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>>
>> My understanding is that the heap optimization provided
>> marginal
>> gains, and
>> that we need to think harder about how to optimize all of
>> the string
>> manipulation in writeVcf. We either need to reduce it or
>> reduce its
>> overhead (i.e., the CHARSXP allocation). Gabe is doing
>> more tests.
>>
>>
>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain  wrote:
>>
>> Hi Gabe,
>>
>> Martin responded, and so did Michael,
>>
>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>
>>
>> It sounded like Michael was ok with working
>> with/around heap
>> initialization.
>>
>> Michael, is that right or should we still consider this on the table?

Re: [Bioc-devel] writeVcf performance

2014-09-02 Thread Martin Morgan

On 08/27/2014 11:56 AM, Gabe Becker wrote:

The profiling I attached in my previous email is for 24 geno fields, as I said,
but our typical use case involves only ~4-6 fields, and is faster but still on
the order of dozens of minutes.


I think Val is arriving at a (much) more efficient implementation, but...

I wanted to share my guess that the poor _scaling_ is because the garbage 
collector runs multiple times as the different strings are pasted together, and 
has to traverse, in linear time, increasing numbers of allocated SEXPs. So times 
scale approximately quadratically with the number of rows in the VCF.


One efficiency gain is to reduce the number of SEXPs in play by writing out in chunks 
-- as each chunk is written, the SEXPs become available for collection and are 
re-used. Here's my toy example


time.R
==
splitIndices <- function (nx, ncl)
{
i <- seq_len(nx)
if (ncl == 0L)
list()
else if (ncl == 1L || nx == 1L)
list(i)
else {
fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
}
}

x = as.character(seq_len(1e7)); y = sample(x)
if (!is.na(Sys.getenv("SPLIT", NA))) {
idx <- splitIndices(length(x), 20)
system.time(for (i in idx) paste(x[i], y[i], sep=":"))
} else {
system.time(paste(x, y, sep=":"))
}


running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the 
relevant time is


   user  system elapsed
 15.320   0.064  15.381

versus with $ R --no-save --quiet -f time.R it is

   user  system elapsed
 95.360   0.164  95.511

I think this is likely a good overall strategy when dealing with character data -- 
processing in independent chunks of moderate (1M?) size (enabling as a 
consequence parallel evaluation in modest memory) that are sufficient to benefit 
from vectorization, but that do not entail allocation of large numbers of in-use 
SEXPs.
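
The same chunk loop can drive output directly, so the full formatted vector
never needs to exist at once -- a sketch only, reusing x, y and splitIndices
from above (not the writeVcf internals):

con <- file(tempfile(), open = "w")
for (i in splitIndices(length(x), 20))
    writeLines(paste(x[i], y[i], sep=":"), con)  # chunk collectable once written
close(con)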


Martin



Sorry for the confusion.
~G


On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker  wrote:

Martin and Val.

I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with
profiling enabled. The results of summaryRprof for that run are attached,
though for a variety of reasons they are pretty misleading.

It took over an hour to write (3700+seconds), so it's definitely a
bottleneck when the data get very large, even if it isn't for smaller data.

Michael and I both think the culprit is all the pasting and cbinding that is
going on, and more to the point, that memory for an internal representation
to be written out is allocated at all.  Streaming across the object, looping
by rows and writing directly to file (e.g. from C) should be blisteringly
fast in comparison.

~G


On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence  wrote:

Gabe is still testing/profiling, but we'll send something randomized
along eventually.


On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan  wrote:

I didn't see in the original thread a reproducible (simulated, I
guess) example, to be explicit about what the problem is??

Martin


On 08/26/2014 10:47 AM, Michael Lawrence wrote:

My understanding is that the heap optimization provided marginal
gains, and
that we need to think harder about how to optimize all of
the string
manipulation in writeVcf. We either need to reduce it or reduce 
its
overhead (i.e., the CHARSXP allocation). Gabe is doing more 
tests.


On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain  wrote:

Hi Gabe,

Martin responded, and so did Michael,


https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html



It sounded like Michael was ok with working with/around heap
initialization.

Michael, is that right or should we still consider this on
the table?


Val


On 08/26/2014 09:34 AM, Gabe Becker wrote:

Val,

Has there been any movement on this? This remains a
substantial
bottleneck for us when writing very large VCF files 
(e.g.
variants+genotypes for whole genome NGS samples).

>>> I was able to see a ~25% speedup with 4 cores and an "optimal" speedup
>>> of ~2x with 10-12 cores for a VCF with 500k rows using a very naive
>>> parallelization strategy and no other changes.

Re: [Bioc-devel] writeVcf performance

2014-08-29 Thread Kasper Daniel Hansen
Try to run it through the lineprof package for memory profiling; I have
found this to be very helpful.

Here is an old blog post I wrote about it
  http://www.hansenlab.org/rstats/2014/01/30/lineprof/
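
A minimal sketch of the workflow (assumes lineprof installed from GitHub,
hadley/lineprof; the code under test must be source()d so line references
are retained, and the file name here is hypothetical):

# devtools::install_github("hadley/lineprof")
library(lineprof)
source("writeVcf-funs.R")                   # hypothetical file with the code to profile
prof <- lineprof(writeVcf(vcf, tempfile()))
shine(prof)                                 # browse per-line memory alloc/release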

Kasper


On Wed, Aug 27, 2014 at 2:56 PM, Gabe Becker  wrote:

> The profiling I attached in my previous email is for 24 geno fields, as I
> said, but our typical use case involves only ~4-6 fields, and is faster but
> still on the order of dozens of minutes.
>
> Sorry for the confusion.
> ~G
>
>
> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker  wrote:
>
> > Martin and Val.
> >
> > I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
> > with profiling enabled. The results of summaryRprof for that run are
> > attached, though for a variety of reasons they are pretty misleading.
> >
> > It took over an hour to write (3700+seconds), so it's definitely a
> > bottleneck when the data get very large, even if it isn't for smaller
> data.
> >
> > Michael and I both think the culprit is all the pasting and cbinding that
> > is going on, and more to the point, that memory for an internal
> > representation to be written out is allocated at all.  Streaming across
> the
> > object, looping by rows and writing directly to file (e.g. from C) should
> > be blisteringly fast in comparison.
> >
> > ~G
> >
> >
> > On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence 
> > wrote:
> >
> >> Gabe is still testing/profiling, but we'll send something randomized
> >> along eventually.
> >>
> >>
> >> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan 
> >> wrote:
> >>
> >>> I didn't see in the original thread a reproducible (simulated, I guess)
> >>> example, to be explicit about what the problem is??
> >>>
> >>> Martin
> >>>
> >>>
> >>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
> >>>
>  My understanding is that the heap optimization provided marginal
> gains,
>  and
>  that we need to think harder about how to optimize the all of the
> string
>  manipulation in writeVcf. We either need to reduce it or reduce its
>  overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
> 
> 
>  On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <
> voben...@fhcrc.org>
>  wrote:
> 
>   Hi Gabe,
> >
> > Martin responded, and so did Michael,
> >
> > https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
> >
> > It sounded like Michael was ok with working with/around heap
> > initialization.
> >
> > Michael, is that right or should we still consider this on the table?
> >
> >
> > Val
> >
> >
> > On 08/26/2014 09:34 AM, Gabe Becker wrote:
> >
> >  Val,
> >>
> >> Has there been any movement on this? This remains a substantial
> >> bottleneck for us when writing very large VCF files (e.g.
> >> variants+genotypes for whole genome NGS samples).
> >>
> >> I was able to see a ~25% speedup with 4 cores and  an "optimal"
> >> speedup
> >> of ~2x with 10-12 cores for a VCF with 500k rows  using a very naive
> >> parallelization strategy and no other changes. I suspect this could
> be
> >> improved on quite a bit, or possibly made irrelevant with judicious
> >> use
> >> of serial C code.
> >>
> >> Did you and Martin make any plans regarding optimizing writeVcf?
> >>
> >> Best
> >> ~G
> >>
> >>
> >> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <
> voben...@fhcrc.org
> >> > wrote:
> >>
> >>  Hi Michael,
> >>
> >>  I'm interested in working on this. I'll discuss with Martin
> next
> >>  week when we're both back in the office.
> >>
> >>  Val
> >>
> >>
> >>
> >>
> >>
> >>  On 08/05/14 07:46, Michael Lawrence wrote:
> >>
> >>  Hi guys (Val, Martin, Herve):
> >>
> >>  Anyone have an itch for optimization? The writeVcf function
> >> is
> >>  currently a
> >>  bottleneck in our WGS genotyping pipeline. For a typical 50
> >>  million row
> >>  gVCF, it was taking 2.25 hours prior to yesterday's
> >> improvements
> >>  (pasteCollapseRows) that brought it down to about 1 hour,
> >> which
> >>  is still
>>  too long by my standards (> 0). Only takes 3 minutes to call the
> >>  genotypes
> >>  (and associated likelihoods etc) from the variant calls
> >> (using
> >>  80 cores and
>>  450 GB RAM on one node), so the output is an issue. Profiling
>>  suggests that
> >>  the running time scales non-linearly in the number of rows.
> >>
> >>  Digging a little deeper, it seems to be something with R's
> >>  string/memory
> >>  allocation. Below, pasting 1 million strings takes 6
>> seconds, but 10 million strings takes over 2 minutes.

Re: [Bioc-devel] writeVcf performance

2014-08-27 Thread Gabe Becker
The profiling I attached in my previous email is for 24 geno fields, as I
said, but our typical use case involves only ~4-6 fields, and is faster but
still on the order of dozens of minutes.

Sorry for the confusion.
~G


On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker  wrote:

> Martin and Val.
>
> I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
> with profiling enabled. The results of summaryRprof for that run are
> attached, though for a variety of reasons they are pretty misleading.
>
> It took over an hour to write (3700+seconds), so it's definitely a
> bottleneck when the data get very large, even if it isn't for smaller data.
>
> Michael and I both think the culprit is all the pasting and cbinding that
> is going on, and more to the point, that memory for an internal
> representation to be written out is allocated at all.  Streaming across the
> object, looping by rows and writing directly to file (e.g. from C) should
> be blisteringly fast in comparison.
>
> ~G
>
>
> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence 
> wrote:
>
>> Gabe is still testing/profiling, but we'll send something randomized
>> along eventually.
>>
>>
>> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan 
>> wrote:
>>
>>> I didn't see in the original thread a reproducible (simulated, I guess)
>>> example, to be explicit about what the problem is??
>>>
>>> Martin
>>>
>>>
>>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>>>
 My understanding is that the heap optimization provided marginal gains,
 and
that we need to think harder about how to optimize all of the string
 manipulation in writeVcf. We either need to reduce it or reduce its
 overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.


 On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain 
 wrote:

  Hi Gabe,
>
> Martin responded, and so did Michael,
>
> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>
> It sounded like Michael was ok with working with/around heap
> initialization.
>
> Michael, is that right or should we still consider this on the table?
>
>
> Val
>
>
> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>
>  Val,
>>
>> Has there been any movement on this? This remains a substantial
>> bottleneck for us when writing very large VCF files (e.g.
>> variants+genotypes for whole genome NGS samples).
>>
>> I was able to see a ~25% speedup with 4 cores and  an "optimal"
>> speedup
>> of ~2x with 10-12 cores for a VCF with 500k rows  using a very naive
>> parallelization strategy and no other changes. I suspect this could be
>> improved on quite a bit, or possibly made irrelevant with judicious
>> use
>> of serial C code.
>>
>> Did you and Martin make any plans regarding optimizing writeVcf?
>>
>> Best
>> ~G
>>
>>
>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain  wrote:
>>
>>  Hi Michael,
>>
>>  I'm interested in working on this. I'll discuss with Martin next
>>  week when we're both back in the office.
>>
>>  Val
>>
>>
>>
>>
>>
>>  On 08/05/14 07:46, Michael Lawrence wrote:
>>
>>  Hi guys (Val, Martin, Herve):
>>
>>  Anyone have an itch for optimization? The writeVcf function
>> is
>>  currently a
>>  bottleneck in our WGS genotyping pipeline. For a typical 50
>>  million row
>>  gVCF, it was taking 2.25 hours prior to yesterday's
>> improvements
>>  (pasteCollapseRows) that brought it down to about 1 hour,
>> which
>>  is still
>>  too long by my standards (> 0). Only takes 3 minutes to call
>> the
>>  genotypes
>>  (and associated likelihoods etc) from the variant calls
>> (using
>>  80 cores and
>>  450 GB RAM on one node), so the output is an issue. Profiling
>>  suggests that
>>  the running time scales non-linearly in the number of rows.
>>
>>  Digging a little deeper, it seems to be something with R's
>>  string/memory
>>  allocation. Below, pasting 1 million strings takes 6
>> seconds, but
>> 10
>>  million strings takes over 2 minutes. It gets way worse with
>> 50
>>  million. I
>>  suspect it has something to do with R's string hash table.
>>
>>  set.seed(1000)
>>  end <- sample(1e8, 1e6)
>>  system.time(paste0("END", "=", end))
>>   user  system elapsed
>>  6.396   0.028   6.420
>>
>>  end <- sample(1e8, 1e7)
>>  system.time(paste0("END", "=", end))
>>   user  system elapsed
>>  134.714   0.352 134.978

Re: [Bioc-devel] writeVcf performance

2014-08-27 Thread Gabe Becker
Martin and Val.

I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with
profiling enabled. The results of summaryRprof for that run are attached,
though for a variety of reasons they are pretty misleading.

It took over an hour to write (3700+seconds), so it's definitely a
bottleneck when the data get very large, even if it isn't for smaller data.

Michael and I both think the culprit is all the pasting and cbinding that
is going on, and more to the point, that memory for an internal
representation to be written out is allocated at all.  Streaming across the
object, looping by rows and writing directly to file (e.g. from C) should
be blisteringly fast in comparison.
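
Even staying in R, something shaped like the following avoids ever holding
the whole formatted file in memory (a sketch only; .formatVcfRows is a
hypothetical helper standing in for the existing per-row formatting, not
the package API):

con <- file("out.vcf", open = "w")
for (i in parallel::splitIndices(nrow(vcfgeno), 50L))
    writeLines(.formatVcfRows(vcfgeno[i, ]), con)   # format + flush one window
close(con)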

~G


On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence 
wrote:

> Gabe is still testing/profiling, but we'll send something randomized along
> eventually.
>
>
> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan 
> wrote:
>
>> I didn't see in the original thread a reproducible (simulated, I guess)
>> example, to be explicit about what the problem is??
>>
>> Martin
>>
>>
>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>>
>>> My understanding is that the heap optimization provided marginal gains,
>>> and
>>> that we need to think harder about how to optimize the all of the string
>>> manipulation in writeVcf. We either need to reduce it or reduce its
>>> overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
>>>
>>>
>>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain 
>>> wrote:
>>>
>>>  Hi Gabe,

 Martin responded, and so did Michael,

 https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html

 It sounded like Michael was ok with working with/around heap
 initialization.

 Michael, is that right or should we still consider this on the table?


 Val


 On 08/26/2014 09:34 AM, Gabe Becker wrote:

  Val,
>
> Has there been any movement on this? This remains a substantial
> bottleneck for us when writing very large VCF files (e.g.
> variants+genotypes for whole genome NGS samples).
>
> I was able to see a ~25% speedup with 4 cores and  an "optimal" speedup
> of ~2x with 10-12 cores for a VCF with 500k rows  using a very naive
> parallelization strategy and no other changes. I suspect this could be
> improved on quite a bit, or possibly made irrelevant with judicious use
> of serial C code.
>
> Did you and Martin make any plans regarding optimizing writeVcf?
>
> Best
> ~G
>
>
> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain  wrote:
>
>  Hi Michael,
>
>  I'm interested in working on this. I'll discuss with Martin next
>  week when we're both back in the office.
>
>  Val
>
>
>
>
>
>  On 08/05/14 07:46, Michael Lawrence wrote:
>
>  Hi guys (Val, Martin, Herve):
>
>  Anyone have an itch for optimization? The writeVcf function is
>  currently a
>  bottleneck in our WGS genotyping pipeline. For a typical 50
>  million row
>  gVCF, it was taking 2.25 hours prior to yesterday's
> improvements
>  (pasteCollapseRows) that brought it down to about 1 hour,
> which
>  is still
>  too long by my standards (> 0). Only takes 3 minutes to call
> the
>  genotypes
>  (and associated likelihoods etc) from the variant calls (using
>  80 cores and
>  450 GB RAM on one node), so the output is an issue. Profiling
>  suggests that
>  the running time scales non-linearly in the number of rows.
>
>  Digging a little deeper, it seems to be something with R's
>  string/memory
>  allocation. Below, pasting 1 million strings takes 6 seconds,
> but
> 10
>  million strings takes over 2 minutes. It gets way worse with
> 50
>  million. I
>  suspect it has something to do with R's string hash table.
>
>  set.seed(1000)
>  end <- sample(1e8, 1e6)
>  system.time(paste0("END", "=", end))
>   user  system elapsed
>  6.396   0.028   6.420
>
>  end <- sample(1e8, 1e7)
>  system.time(paste0("END", "=", end))
>   user  system elapsed
>  134.714   0.352 134.978
>
>  Indeed, even this takes a long time (in a fresh session):
>
>  set.seed(1000)
>  end <- sample(1e8, 1e6)
>  end <- sample(1e8, 1e7)
>  system.time(as.character(end))
>   user  system elapsed
> 57.224   0.156  57.366
>
>  But running it a second time is faster (about what one would
>  expect?):
>
>  system.time(levels <- as.character(end))
>   user  system elapsed
>  23.582   0.021  23.589

Re: [Bioc-devel] writeVcf performance

2014-08-26 Thread Michael Lawrence
Gabe is still testing/profiling, but we'll send something randomized along
eventually.


On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan  wrote:

> I didn't see in the original thread a reproducible (simulated, I guess)
> example, to be explicit about what the problem is??
>
> Martin
>
>
> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>
>> My understanding is that the heap optimization provided marginal gains,
>> and
>> that we need to think harder about how to optimize all of the string
>> manipulation in writeVcf. We either need to reduce it or reduce its
>> overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
>>
>>
>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain 
>> wrote:
>>
>>  Hi Gabe,
>>>
>>> Martin responded, and so did Michael,
>>>
>>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>>
>>> It sounded like Michael was ok with working with/around heap
>>> initialization.
>>>
>>> Michael, is that right or should we still consider this on the table?
>>>
>>>
>>> Val
>>>
>>>
>>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>>>
>>>  Val,

 Has there been any movement on this? This remains a substantial
 bottleneck for us when writing very large VCF files (e.g.
 variants+genotypes for whole genome NGS samples).

 I was able to see a ~25% speedup with 4 cores and  an "optimal" speedup
 of ~2x with 10-12 cores for a VCF with 500k rows  using a very naive
 parallelization strategy and no other changes. I suspect this could be
 improved on quite a bit, or possibly made irrelevant with judicious use
 of serial C code.

 Did you and Martin make any plans regarding optimizing writeVcf?

 Best
 ~G


 On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain  wrote:

  Hi Michael,

  I'm interested in working on this. I'll discuss with Martin next
  week when we're both back in the office.

  Val





  On 08/05/14 07:46, Michael Lawrence wrote:

  Hi guys (Val, Martin, Herve):

  Anyone have an itch for optimization? The writeVcf function is
  currently a
  bottleneck in our WGS genotyping pipeline. For a typical 50
  million row
  gVCF, it was taking 2.25 hours prior to yesterday's
 improvements
  (pasteCollapseRows) that brought it down to about 1 hour, which
  is still
  too long by my standards (> 0). Only takes 3 minutes to call
 the
  genotypes
  (and associated likelihoods etc) from the variant calls (using
  80 cores and
  450 GB RAM on one node), so the output is an issue. Profiling
  suggests that
  the running time scales non-linearly in the number of rows.

  Digging a little deeper, it seems to be something with R's
  string/memory
  allocation. Below, pasting 1 million strings takes 6 seconds,
 but
 10
  million strings takes over 2 minutes. It gets way worse with 50
  million. I
  suspect it has something to do with R's string hash table.

  set.seed(1000)
  end <- sample(1e8, 1e6)
  system.time(paste0("END", "=", end))
   user  system elapsed
  6.396   0.028   6.420

  end <- sample(1e8, 1e7)
  system.time(paste0("END", "=", end))
   user  system elapsed
  134.714   0.352 134.978

  Indeed, even this takes a long time (in a fresh session):

  set.seed(1000)
  end <- sample(1e8, 1e6)
  end <- sample(1e8, 1e7)
  system.time(as.character(end))
   user  system elapsed
 57.224   0.156  57.366

  But running it a second time is faster (about what one would
  expect?):

  system.time(levels <- as.character(end))
   user  system elapsed
 23.582   0.021  23.589

  I did some simple profiling of R to find that the resizing of
  the string
  hash table is not a significant component of the time. So maybe
  something
  to do with the R heap/gc? No time right now to go deeper. But I
  know Martin
  likes this sort of thing ;)

  Michael

   [[alternative HTML version deleted]]

  ___
  Bioc-devel@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] writeVcf performance

2014-08-26 Thread Martin Morgan
I didn't see in the original thread a reproducible (simulated, I guess) example, 
to be explicit about what the problem is??


Martin

On 08/26/2014 10:47 AM, Michael Lawrence wrote:

My understanding is that the heap optimization provided marginal gains, and
that we need to think harder about how to optimize all of the string
manipulation in writeVcf. We either need to reduce it or reduce its
overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.


On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain 
wrote:


Hi Gabe,

Martin responded, and so did Michael,

https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html

It sounded like Michael was ok with working with/around heap
initialization.

Michael, is that right or should we still consider this on the table?


Val


On 08/26/2014 09:34 AM, Gabe Becker wrote:


Val,

Has there been any movement on this? This remains a substantial
bottleneck for us when writing very large VCF files (e.g.
variants+genotypes for whole genome NGS samples).

I was able to see a ~25% speedup with 4 cores and  an "optimal" speedup
of ~2x with 10-12 cores for a VCF with 500k rows  using a very naive
parallelization strategy and no other changes. I suspect this could be
improved on quite a bit, or possibly made irrelevant with judicious use
of serial C code.

Did you and Martin make any plans regarding optimizing writeVcf?

Best
~G


On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain  wrote:

 Hi Michael,

 I'm interested in working on this. I'll discuss with Martin next
 week when we're both back in the office.

 Val





 On 08/05/14 07:46, Michael Lawrence wrote:

 Hi guys (Val, Martin, Herve):

 Anyone have an itch for optimization? The writeVcf function is
 currently a
 bottleneck in our WGS genotyping pipeline. For a typical 50
 million row
 gVCF, it was taking 2.25 hours prior to yesterday's improvements
 (pasteCollapseRows) that brought it down to about 1 hour, which
 is still
 too long by my standards (> 0). Only takes 3 minutes to call the
 genotypes
 (and associated likelihoods etc) from the variant calls (using
 80 cores and
 450 GB RAM on one node), so the output is an issue. Profiling
 suggests that
 the running time scales non-linearly in the number of rows.

 Digging a little deeper, it seems to be something with R's
 string/memory
 allocation. Below, pasting 1 million strings takes 6 seconds, but
10
 million strings takes over 2 minutes. It gets way worse with 50
 million. I
 suspect it has something to do with R's string hash table.

 set.seed(1000)
 end <- sample(1e8, 1e6)
 system.time(paste0("END", "=", end))
  user  system elapsed
 6.396   0.028   6.420

 end <- sample(1e8, 1e7)
 system.time(paste0("END", "=", end))
  user  system elapsed
 134.714   0.352 134.978

 Indeed, even this takes a long time (in a fresh session):

 set.seed(1000)
 end <- sample(1e8, 1e6)
 end <- sample(1e8, 1e7)
 system.time(as.character(end))
  user  system elapsed
57.224   0.156  57.366

 But running it a second time is faster (about what one would
 expect?):

 system.time(levels <- as.character(end))
  user  system elapsed
23.582   0.021  23.589

 I did some simple profiling of R to find that the resizing of
 the string
 hash table is not a significant component of the time. So maybe
 something
 to do with the R heap/gc? No time right now to go deeper. But I
 know Martin
 likes this sort of thing ;)

 Michael

  [[alternative HTML version deleted]]

 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel


 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Computational Biologist
Genentech Research







[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] writeVcf performance

2014-08-26 Thread Michael Lawrence
My understanding is that the heap optimization provided marginal gains, and
that we need to think harder about how to optimize all of the string
manipulation in writeVcf. We either need to reduce it or reduce its
overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.


On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain 
wrote:

> Hi Gabe,
>
> Martin responded, and so did Michael,
>
> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>
> It sounded like Michael was ok with working with/around heap
> initialization.
>
> Michael, is that right or should we still consider this on the table?
>
>
> Val
>
>
> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>
>> Val,
>>
>> Has there been any movement on this? This remains a substantial
>> bottleneck for us when writing very large VCF files (e.g.
>> variants+genotypes for whole genome NGS samples).
>>
>> I was able to see a ~25% speedup with 4 cores and  an "optimal" speedup
>> of ~2x with 10-12 cores for a VCF with 500k rows  using a very naive
>> parallelization strategy and no other changes. I suspect this could be
>> improved on quite a bit, or possibly made irrelevant with judicious use
>> of serial C code.
>>
>> Did you and Martin make any plans regarding optimizing writeVcf?
>>
>> Best
>> ~G
>>
>>
>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain  wrote:
>>
>> Hi Michael,
>>
>> I'm interested in working on this. I'll discuss with Martin next
>> week when we're both back in the office.
>>
>> Val
>>
>>
>>
>>
>>
>> On 08/05/14 07:46, Michael Lawrence wrote:
>>
>> Hi guys (Val, Martin, Herve):
>>
>> Anyone have an itch for optimization? The writeVcf function is
>> currently a
>> bottleneck in our WGS genotyping pipeline. For a typical 50
>> million row
>> gVCF, it was taking 2.25 hours prior to yesterday's improvements
>> (pasteCollapseRows) that brought it down to about 1 hour, which
>> is still
>> too long by my standards (> 0). Only takes 3 minutes to call the
>> genotypes
>> (and associated likelihoods etc) from the variant calls (using
>> 80 cores and
>> 450 GB RAM on one node), so the output is an issue. Profiling
>> suggests that
>> the running time scales non-linearly in the number of rows.
>>
>> Digging a little deeper, it seems to be something with R's
>> string/memory
>> allocation. Below, pasting 1 million strings takes 6 seconds, but
>> 10
>> million strings takes over 2 minutes. It gets way worse with 50
>> million. I
>> suspect it has something to do with R's string hash table.
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> system.time(paste0("END", "=", end))
>>  user  system elapsed
>> 6.396   0.028   6.420
>>
>> end <- sample(1e8, 1e7)
>> system.time(paste0("END", "=", end))
>>  user  system elapsed
>> 134.714   0.352 134.978
>>
>> Indeed, even this takes a long time (in a fresh session):
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> end <- sample(1e8, 1e7)
>> system.time(as.character(end))
>>  user  system elapsed
>>57.224   0.156  57.366
>>
>> But running it a second time is faster (about what one would
>> expect?):
>>
>> system.time(levels <- as.character(end))
>>  user  system elapsed
>>23.582   0.021  23.589
>>
>> I did some simple profiling of R to find that the resizing of
>> the string
>> hash table is not a significant component of the time. So maybe
>> something
>> to do with the R heap/gc? No time right now to go deeper. But I
>> know Martin
>> likes this sort of thing ;)
>>
>> Michael
>>
>>  [[alternative HTML version deleted]]
>>
>> ___
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>> ___
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>>
>>
>> --
>> Computational Biologist
>> Genentech Research
>>
>
>
>

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] writeVcf performance

2014-08-26 Thread Valerie Obenchain

Hi Gabe,

Martin responded, and so did Michael,

https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html

It sounded like Michael was ok with working with/around heap 
initialization.


Michael, is that right or should we still consider this on the table?


Val

On 08/26/2014 09:34 AM, Gabe Becker wrote:

Val,

Has there been any movement on this? This remains a substantial
bottleneck for us when writing very large VCF files (e.g.
variants+genotypes for whole genome NGS samples).

I was able to see a ~25% speedup with 4 cores and  an "optimal" speedup
of ~2x with 10-12 cores for a VCF with 500k rows  using a very naive
parallelization strategy and no other changes. I suspect this could be
improved on quite a bit, or possibly made irrelevant with judicious use
of serial C code.

Did you and Martin make any plans regarding optimizing writeVcf?

Best
~G


On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain  wrote:

Hi Michael,

I'm interested in working on this. I'll discuss with Martin next
week when we're both back in the office.

Val





On 08/05/14 07:46, Michael Lawrence wrote:

Hi guys (Val, Martin, Herve):

Anyone have an itch for optimization? The writeVcf function is
currently a
bottleneck in our WGS genotyping pipeline. For a typical 50
million row
gVCF, it was taking 2.25 hours prior to yesterday's improvements
(pasteCollapseRows) that brought it down to about 1 hour, which
is still
too long by my standards (> 0). Only takes 3 minutes to call the
genotypes
(and associated likelihoods etc) from the variant calls (using
80 cores and
450 GB RAM on one node), so the output is an issue. Profiling
suggests that
the running time scales non-linearly in the number of rows.

Digging a little deeper, it seems to be something with R's
string/memory
allocation. Below, pasting 1 million strings takes 6 seconds, but 10
million strings takes over 2 minutes. It gets way worse with 50
million. I
suspect it has something to do with R's string hash table.

set.seed(1000)
end <- sample(1e8, 1e6)
system.time(paste0("END", "=", end))
 user  system elapsed
6.396   0.028   6.420

end <- sample(1e8, 1e7)
system.time(paste0("END", "=", end))
 user  system elapsed
134.714   0.352 134.978

Indeed, even this takes a long time (in a fresh session):

set.seed(1000)
end <- sample(1e8, 1e6)
end <- sample(1e8, 1e7)
system.time(as.character(end))
 user  system elapsed
   57.224   0.156  57.366

But running it a second time is faster (about what one would
expect?):

system.time(levels <- as.character(end))
 user  system elapsed
   23.582   0.021  23.589

I did some simple profiling of R to find that the resizing of
the string
hash table is not a significant component of the time. So maybe
something
to do with the R heap/gc? No time right now to go deeper. But I
know Martin
likes this sort of thing ;)

Michael

 [[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Computational Biologist
Genentech Research


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] writeVcf performance

2014-08-26 Thread Gabe Becker
Val,

Has there been any movement on this? This remains a substantial bottleneck
for us when writing very large VCF files (e.g. variants+genotypes for whole
genome NGS samples).

I was able to see a ~25% speedup with 4 cores and an "optimal" speedup of
~2x with 10-12 cores for a VCF with 500k rows using a very naive
parallelization strategy and no other changes. I suspect this could be
improved on quite a bit, or possibly made irrelevant with judicious use of
serial C code.
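
Roughly, the naive strategy was: split the rows, format each chunk in a
worker, then write in order (a sketch, with .vcfChunkLines standing in,
hypothetically, for the existing formatting code):

library(parallel)
idx <- splitIndices(nrow(vcf), 12L)
chunks <- mclapply(idx, function(i) .vcfChunkLines(vcf[i, ]), mc.cores=12L)
writeLines(unlist(chunks), "out.vcf")   # master writes the chunks in order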

Did you and Martin make any plans regarding optimizing writeVcf?

Best
~G


On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain 
wrote:

> Hi Michael,
>
> I'm interested in working on this. I'll discuss with Martin next week when
> we're both back in the office.
>
> Val
>
>
>
>
>
> On 08/05/14 07:46, Michael Lawrence wrote:
>
>> Hi guys (Val, Martin, Herve):
>>
>> Anyone have an itch for optimization? The writeVcf function is currently a
>> bottleneck in our WGS genotyping pipeline. For a typical 50 million row
>> gVCF, it was taking 2.25 hours prior to yesterday's improvements
>> (pasteCollapseRows) that brought it down to about 1 hour, which is still
>> too long by my standards (> 0). Only takes 3 minutes to call the genotypes
>> (and associated likelihoods etc) from the variant calls (using 80 cores
>> and
>> 450 GB RAM on one node), so the output is an issue. Profiling suggests
>> that
>> the running time scales non-linearly in the number of rows.
>>
>> Digging a little deeper, it seems to be something with R's string/memory
>> allocation. Below, pasting 1 million strings takes 6 seconds, but 10
>> million strings takes over 2 minutes. It gets way worse with 50 million. I
>> suspect it has something to do with R's string hash table.
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> system.time(paste0("END", "=", end))
>> user  system elapsed
>>6.396   0.028   6.420
>>
>> end <- sample(1e8, 1e7)
>> system.time(paste0("END", "=", end))
>> user  system elapsed
>> 134.714   0.352 134.978
>>
>> Indeed, even this takes a long time (in a fresh session):
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> end <- sample(1e8, 1e7)
>> system.time(as.character(end))
>> user  system elapsed
>>   57.224   0.156  57.366
>>
>> But running it a second time is faster (about what one would expect?):
>>
>> system.time(levels <- as.character(end))
>> user  system elapsed
>>   23.582   0.021  23.589
>>
>> I did some simple profiling of R to find that the resizing of the string
>> hash table is not a significant component of the time. So maybe something
>> to do with the R heap/gc? No time right now to go deeper. But I know
>> Martin
>> likes this sort of thing ;)
>>
>> Michael
>>
>> [[alternative HTML version deleted]]
>>
>> ___
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
> ___
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>



-- 
Computational Biologist
Genentech Research

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] writeVcf performance

2014-08-14 Thread Michael Lawrence
I thought it might come down to the heap initialization. We'll work with
that.


On Wed, Aug 13, 2014 at 4:42 PM, Martin Morgan  wrote:

> On 08/05/2014 07:46 AM, Michael Lawrence wrote:
>
>> Hi guys (Val, Martin, Herve):
>>
>> Anyone have an itch for optimization? The writeVcf function is currently a
>> bottleneck in our WGS genotyping pipeline. For a typical 50 million row
>> gVCF, it was taking 2.25 hours prior to yesterday's improvements
>> (pasteCollapseRows) that brought it down to about 1 hour, which is still
>> too long by my standards (> 0). Only takes 3 minutes to call the genotypes
>> (and associated likelihoods etc) from the variant calls (using 80 cores
>> and
>> 450 GB RAM on one node), so the output is an issue. Profiling suggests
>> that
>> the running time scales non-linearly in the number of rows.
>>
>> Digging a little deeper, it seems to be something with R's string/memory
>> allocation. Below, pasting 1 million strings takes 6 seconds, but 10
>> million strings takes over 2 minutes. It gets way worse with 50 million. I
>> suspect it has something to do with R's string hash table.
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> system.time(paste0("END", "=", end))
>> user  system elapsed
>>6.396   0.028   6.420
>>
>> end <- sample(1e8, 1e7)
>> system.time(paste0("END", "=", end))
>> user  system elapsed
>> 134.714   0.352 134.978
>>
>> Indeed, even this takes a long time (in a fresh session):
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> end <- sample(1e8, 1e7)
>> system.time(as.character(end))
>> user  system elapsed
>>   57.224   0.156  57.366
>>
>
> my usual trick is R --no-save --quiet --min-vsize=2048M --min-nsize=45M,
> which changes the example above from
>
> > system.time(as.character(end))
>user  system elapsed
>  82.835   0.343  83.195
>
> to
>
> > system.time(as.character(end))
>user  system elapsed
>   9.245   0.169   9.424
>
> but I think it's a one-time gain; I wonder what the writeVcf command is
> that you're running?
>
> Martin
>
>
>> But running it a second time is faster (about what one would expect?):
>>
>> system.time(levels <- as.character(end))
>> user  system elapsed
>>   23.582   0.021  23.589
>>
>> I did some simple profiling of R to find that the resizing of the string
>> hash table is not a significant component of the time. So maybe something
>> to do with the R heap/gc? No time right now to go deeper. But I know
>> Martin
>> likes this sort of thing ;)
>>
>> Michael
>>
>> [[alternative HTML version deleted]]
>>
>> ___
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] writeVcf performance

2014-08-13 Thread Martin Morgan

On 08/05/2014 07:46 AM, Michael Lawrence wrote:

Hi guys (Val, Martin, Herve):

Anyone have an itch for optimization? The writeVcf function is currently a
bottleneck in our WGS genotyping pipeline. For a typical 50 million row
gVCF, it was taking 2.25 hours prior to yesterday's improvements
(pasteCollapseRows) that brought it down to about 1 hour, which is still
too long by my standards (> 0). Only takes 3 minutes to call the genotypes
(and associated likelihoods etc) from the variant calls (using 80 cores and
450 GB RAM on one node), so the output is an issue. Profiling suggests that
the running time scales non-linearly in the number of rows.

Digging a little deeper, it seems to be something with R's string/memory
allocation. Below, pasting 1 million strings takes 6 seconds, but 10
million strings takes over 2 minutes. It gets way worse with 50 million. I
suspect it has something to do with R's string hash table.

set.seed(1000)
end <- sample(1e8, 1e6)
system.time(paste0("END", "=", end))
user  system elapsed
   6.396   0.028   6.420

end <- sample(1e8, 1e7)
system.time(paste0("END", "=", end))
user  system elapsed
134.714   0.352 134.978

Indeed, even this takes a long time (in a fresh session):

set.seed(1000)
end <- sample(1e8, 1e6)
end <- sample(1e8, 1e7)
system.time(as.character(end))
user  system elapsed
  57.224   0.156  57.366


my usual trick is R --no-save --quiet --min-vsize=2048M --min-nsize=45M, which 
changes the example above from


> system.time(as.character(end))
   user  system elapsed
 82.835   0.343  83.195

to

> system.time(as.character(end))
   user  system elapsed
  9.245   0.169   9.424

but I think it's a one-time gain; I wonder what the writeVcf command is that 
you're running?
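
(The same knobs are available as environment variables, per ?Memory, if
command-line flags are awkward to thread through a pipeline, with the same
one-time effect:

$ R_NSIZE=45M R_VSIZE=2048M R --no-save --quiet -f time.R
)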


Martin



But running it a second time is faster (about what one would expect?):

system.time(levels <- as.character(end))
user  system elapsed
  23.582   0.021  23.589

I did some simple profiling of R to find that the resizing of the string
hash table is not a significant component of the time. So maybe something
to do with the R heap/gc? No time right now to go deeper. But I know Martin
likes this sort of thing ;)

Michael

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] writeVcf performance

2014-08-05 Thread Valerie Obenchain

Hi Michael,

I'm interested in working on this. I'll discuss with Martin next week 
when we're both back in the office.


Val




On 08/05/14 07:46, Michael Lawrence wrote:

Hi guys (Val, Martin, Herve):

Anyone have an itch for optimization? The writeVcf function is currently a
bottleneck in our WGS genotyping pipeline. For a typical 50 million row
gVCF, it was taking 2.25 hours prior to yesterday's improvements
(pasteCollapseRows) that brought it down to about 1 hour, which is still
too long by my standards (> 0). Only takes 3 minutes to call the genotypes
(and associated likelihoods etc) from the variant calls (using 80 cores and
450 GB RAM on one node), so the output is an issue. Profiling suggests that
the running time scales non-linearly in the number of rows.

Digging a little deeper, it seems to be something with R's string/memory
allocation. Below, pasting 1 million strings takes 6 seconds, but 10
million strings takes over 2 minutes. It gets way worse with 50 million. I
suspect it has something to do with R's string hash table.

set.seed(1000)
end <- sample(1e8, 1e6)
system.time(paste0("END", "=", end))
user  system elapsed
   6.396   0.028   6.420

end <- sample(1e8, 1e7)
system.time(paste0("END", "=", end))
user  system elapsed
134.714   0.352 134.978

Indeed, even this takes a long time (in a fresh session):

set.seed(1000)
end <- sample(1e8, 1e6)
end <- sample(1e8, 1e7)
system.time(as.character(end))
user  system elapsed
  57.224   0.156  57.366

But running it a second time is faster (about what one would expect?):

system.time(levels <- as.character(end))
user  system elapsed
  23.582   0.021  23.589

I did some simple profiling of R to find that the resizing of the string
hash table is not a significant component of the time. So maybe something
to do with the R heap/gc? No time right now to go deeper. But I know Martin
likes this sort of thing ;)

Michael

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] writeVcf performance

2014-08-05 Thread Michael Lawrence
Hi guys (Val, Martin, Herve):

Anyone have an itch for optimization? The writeVcf function is currently a
bottleneck in our WGS genotyping pipeline. For a typical 50 million row
gVCF, it was taking 2.25 hours prior to yesterday's improvements
(pasteCollapseRows) that brought it down to about 1 hour, which is still
too long by my standards (> 0). Only takes 3 minutes to call the genotypes
(and associated likelihoods etc) from the variant calls (using 80 cores and
450 GB RAM on one node), so the output is an issue. Profiling suggests that
the running time scales non-linearly in the number of rows.

Digging a little deeper, it seems to be something with R's string/memory
allocation. Below, pasting 1 million strings takes 6 seconds, but 10
million strings takes over 2 minutes. It gets way worse with 50 million. I
suspect it has something to do with R's string hash table.

set.seed(1000)
end <- sample(1e8, 1e6)
system.time(paste0("END", "=", end))
   user  system elapsed
  6.396   0.028   6.420

end <- sample(1e8, 1e7)
system.time(paste0("END", "=", end))
   user  system elapsed
134.714   0.352 134.978

Indeed, even this takes a long time (in a fresh session):

set.seed(1000)
end <- sample(1e8, 1e6)
end <- sample(1e8, 1e7)
system.time(as.character(end))
   user  system elapsed
 57.224   0.156  57.366

But running it a second time is faster (about what one would expect?):

system.time(levels <- as.character(end))
   user  system elapsed
 23.582   0.021  23.589

I did some simple profiling of R to find that the resizing of the string
hash table is not a significant component of the time. So maybe something
to do with the R heap/gc? No time right now to go deeper. But I know Martin
likes this sort of thing ;)
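
A quick way to eyeball the collector's involvement (fresh session; gcinfo is
base R and simply reports each collection as it runs):

gcinfo(TRUE)
set.seed(1000)
end <- sample(1e8, 1e7)
invisible(as.character(end))    # count the collections that fire
gcinfo(FALSE)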

Michael

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel