Re: [Bioc-devel] VCF Intersection Using readVcf Remarkably Slow
On 09/27/2016 06:00 PM, Dario Strbenac wrote: Good day, file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation") anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz", package = "VariantAnnotation") aSet <- readVcf(file, "hg19") system.time(commonMutations <- readVcf(anotherFile, "hg19", rowRanges(aSet))) user system elapsed 209.120 16.628 226.083 Reading in the Exome chromosome 22 VCF and intersecting it with the other file in the data directory takes almost 4 minutes. However, reading in the whole file is much faster. system.time(anotherSet <- readVcf(anotherFile, "hg19")) user system elapsed 0.376 0.016 0.392 and doing the intersection manually takes a fraction of a second system.time(fastCommonMutations <- intersect(rowRanges(aSet), rowRanges(anotherSet))) user system elapsed 0.128 0.000 0.129 This comparison ignores the finer details such as the identities of the alleles, but does it have to be so slow ? My real use case is intersecting dozens of VCF files of cancer samples with the ExAC consortium's VCF file that is 4 GB in size when compressed. I can't imagine how long that would take. Can the code of readVcf be optimised ? iterate through the file using yieldSize to manage memory, e.g., rngs <- rowRanges(aSet) genome(rngs) <- "b37" GenomicFiles::reduceByYield( VcfFile(anotherFile, yieldSize=1), readVcf, function(YIELD) subsetByOverlaps(YIELD, rngs), c) -- Dario Strbenac University of Sydney Camperdown NSW 2050 Australia ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel This email message may contain legally privileged and/or...{{dropped:2}} ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] VCF Intersection Using readVcf Remarkably Slow
Dario's computer is faster than mine > system.time(commonMutations <- readVcf(anotherFile, "hg19", rowRanges(aSet))) user system elapsed 426.271 57.296 483.766 The disk infrastructure is a determinant of throughput. Most VCF queries are decomposable and can be parallelized. After chunking the indices of aSet to 20 chunks > system.time(comm <- mclapply(ch, function(x) readVcf(anotherFile, "hg19", rowRanges(aSet)[x]), mc.cores=20)) user system elapsed 628.307 322.830 51.303 As far as I can tell, the answers agree. Is this a risky approach? Also the payload of readVcf can be reduced with a ScanVcfParam, that might improve throughput. On Tue, Sep 27, 2016 at 6:23 PM, Michael Lawrencewrote: > I think the basic problem is that each range requires a separate query > through tabix. BAM and tabix are designed to be fast for single > queries, like what a genome browser might generate, but not for > querying thousands of regions at once. At least that's the way it > seems to me. The index is only at the block level, because the data > are compressed. In principle, smarter caching could speed this up, > but that belongs at the samtools level. > > To make many queries, it pays to load the data into memory, or a more > efficient on-disk representation (HDF5, GDS, ...), first. > > Michael > > On Tue, Sep 27, 2016 at 3:00 PM, Dario Strbenac > wrote: > > Good day, > > > > file <- system.file("extdata", "chr22.vcf.gz", package = > "VariantAnnotation") > > anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz", > package = "VariantAnnotation") > > aSet <- readVcf(file, "hg19") > > system.time(commonMutations <- readVcf(anotherFile, "hg19", > rowRanges(aSet))) > >user system elapsed > > 209.120 16.628 226.083 > > > > Reading in the Exome chromosome 22 VCF and intersecting it with the > other file in the data directory takes almost 4 minutes. > > > > However, reading in the whole file is much faster. > > > >> system.time(anotherSet <- readVcf(anotherFile, "hg19")) > >user system elapsed > > 0.376 0.016 0.392 > > > > and doing the intersection manually takes a fraction of a second > > > >> system.time(fastCommonMutations <- intersect(rowRanges(aSet), > rowRanges(anotherSet))) > >user system elapsed > > 0.128 0.000 0.129 > > > > This comparison ignores the finer details such as the identities of the > alleles, but does it have to be so slow ? My real use case is intersecting > dozens of VCF files of cancer samples with the ExAC consortium's VCF file > that is 4 GB in size when compressed. I can't imagine how long that would > take. > > > > Can the code of readVcf be optimised ? > > > > -- > > Dario Strbenac > > University of Sydney > > Camperdown NSW 2050 > > Australia > > ___ > > Bioc-devel@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/bioc-devel > > ___ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel > [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] VCF Intersection Using readVcf Remarkably Slow
I think the basic problem is that each range requires a separate query through tabix. BAM and tabix are designed to be fast for single queries, like what a genome browser might generate, but not for querying thousands of regions at once. At least that's the way it seems to me. The index is only at the block level, because the data are compressed. In principle, smarter caching could speed this up, but that belongs at the samtools level. To make many queries, it pays to load the data into memory, or a more efficient on-disk representation (HDF5, GDS, ...), first. Michael On Tue, Sep 27, 2016 at 3:00 PM, Dario Strbenacwrote: > Good day, > > file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation") > anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz", package = > "VariantAnnotation") > aSet <- readVcf(file, "hg19") > system.time(commonMutations <- readVcf(anotherFile, "hg19", rowRanges(aSet))) >user system elapsed > 209.120 16.628 226.083 > > Reading in the Exome chromosome 22 VCF and intersecting it with the other > file in the data directory takes almost 4 minutes. > > However, reading in the whole file is much faster. > >> system.time(anotherSet <- readVcf(anotherFile, "hg19")) >user system elapsed > 0.376 0.016 0.392 > > and doing the intersection manually takes a fraction of a second > >> system.time(fastCommonMutations <- intersect(rowRanges(aSet), >> rowRanges(anotherSet))) >user system elapsed > 0.128 0.000 0.129 > > This comparison ignores the finer details such as the identities of the > alleles, but does it have to be so slow ? My real use case is intersecting > dozens of VCF files of cancer samples with the ExAC consortium's VCF file > that is 4 GB in size when compressed. I can't imagine how long that would > take. > > Can the code of readVcf be optimised ? > > -- > Dario Strbenac > University of Sydney > Camperdown NSW 2050 > Australia > ___ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
[Bioc-devel] VCF Intersection Using readVcf Remarkably Slow
Good day, file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation") anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz", package = "VariantAnnotation") aSet <- readVcf(file, "hg19") system.time(commonMutations <- readVcf(anotherFile, "hg19", rowRanges(aSet))) user system elapsed 209.120 16.628 226.083 Reading in the Exome chromosome 22 VCF and intersecting it with the other file in the data directory takes almost 4 minutes. However, reading in the whole file is much faster. > system.time(anotherSet <- readVcf(anotherFile, "hg19")) user system elapsed 0.376 0.016 0.392 and doing the intersection manually takes a fraction of a second > system.time(fastCommonMutations <- intersect(rowRanges(aSet), > rowRanges(anotherSet))) user system elapsed 0.128 0.000 0.129 This comparison ignores the finer details such as the identities of the alleles, but does it have to be so slow ? My real use case is intersecting dozens of VCF files of cancer samples with the ExAC consortium's VCF file that is 4 GB in size when compressed. I can't imagine how long that would take. Can the code of readVcf be optimised ? -- Dario Strbenac University of Sydney Camperdown NSW 2050 Australia ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel