Dario's computer is faster than mine > system.time(commonMutations <- readVcf(anotherFile, "hg19", rowRanges(aSet)))
user system elapsed 426.271 57.296 483.766 The disk infrastructure is a determinant of throughput. Most VCF queries are decomposable and can be parallelized. After chunking the indices of aSet to 20 chunks > system.time(comm <- mclapply(ch, function(x) readVcf(anotherFile, "hg19", rowRanges(aSet)[x]), mc.cores=20)) user system elapsed 628.307 322.830 51.303 As far as I can tell, the answers agree. Is this a risky approach? Also the payload of readVcf can be reduced with a ScanVcfParam, that might improve throughput. On Tue, Sep 27, 2016 at 6:23 PM, Michael Lawrence <lawrence.mich...@gene.com > wrote: > I think the basic problem is that each range requires a separate query > through tabix. BAM and tabix are designed to be fast for single > queries, like what a genome browser might generate, but not for > querying thousands of regions at once. At least that's the way it > seems to me. The index is only at the block level, because the data > are compressed. In principle, smarter caching could speed this up, > but that belongs at the samtools level. > > To make many queries, it pays to load the data into memory, or a more > efficient on-disk representation (HDF5, GDS, ...), first. > > Michael > > On Tue, Sep 27, 2016 at 3:00 PM, Dario Strbenac > <dstr7...@uni.sydney.edu.au> wrote: > > Good day, > > > > file <- system.file("extdata", "chr22.vcf.gz", package = > "VariantAnnotation") > > anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz", > package = "VariantAnnotation") > > aSet <- readVcf(file, "hg19") > > system.time(commonMutations <- readVcf(anotherFile, "hg19", > rowRanges(aSet))) > > user system elapsed > > 209.120 16.628 226.083 > > > > Reading in the Exome chromosome 22 VCF and intersecting it with the > other file in the data directory takes almost 4 minutes. > > > > However, reading in the whole file is much faster. > > > >> system.time(anotherSet <- readVcf(anotherFile, "hg19")) > > user system elapsed > > 0.376 0.016 0.392 > > > > and doing the intersection manually takes a fraction of a second > > > >> system.time(fastCommonMutations <- intersect(rowRanges(aSet), > rowRanges(anotherSet))) > > user system elapsed > > 0.128 0.000 0.129 > > > > This comparison ignores the finer details such as the identities of the > alleles, but does it have to be so slow ? My real use case is intersecting > dozens of VCF files of cancer samples with the ExAC consortium's VCF file > that is 4 GB in size when compressed. I can't imagine how long that would > take. > > > > Can the code of readVcf be optimised ? > > > > -------------------------------------- > > Dario Strbenac > > University of Sydney > > Camperdown NSW 2050 > > Australia > > _______________________________________________ > > Bioc-devel@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/bioc-devel > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel > [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel