Re: [Bioc-devel] VCF Intersection Using readVcf Remarkably Slow

Martin Morgan Wed, 28 Sep 2016 06:06:48 -0700

On 09/27/2016 06:00 PM, Dario Strbenac wrote:

Good day,


file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz", package = 
"VariantAnnotation")
aSet <- readVcf(file, "hg19")
system.time(commonMutations <- readVcf(anotherFile, "hg19", rowRanges(aSet)))
   user  system elapsed
209.120  16.628 226.083

Reading in the Exome chromosome 22 VCF and intersecting it with the other file 
in the data directory takes almost 4 minutes.

However, reading in the whole file is much faster.

system.time(anotherSet <- readVcf(anotherFile, "hg19"))

   user  system elapsed
  0.376   0.016   0.392

and doing the intersection manually takes a fraction of a second

system.time(fastCommonMutations <- intersect(rowRanges(aSet), 
rowRanges(anotherSet)))

   user  system elapsed
  0.128   0.000   0.129

This comparison ignores the finer details such as the identities of the 
alleles, but does it have to be so slow ? My real use case is intersecting 
dozens of VCF files of cancer samples with the ExAC consortium's VCF file that 
is 4 GB in size when compressed. I can't imagine how long that would take.

Can the code of readVcf be optimised ?


iterate through the file using yieldSize to manage memory, e.g.,

rngs <- rowRanges(aSet)
genome(rngs) <- "b37"

GenomicFiles::reduceByYield(
    VcfFile(anotherFile, yieldSize=10000),
    readVcf,
    function(YIELD) subsetByOverlaps(YIELD, rngs),
    c)


--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



This email message may contain legally privileged and/or...{{dropped:2}}

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] VCF Intersection Using readVcf Remarkably Slow

Reply via email to