Thanks for the report and reproducible example. Now fixed in 1.14.8 and
1.15.22.
Valerie
On 08/07/2015 07:25 AM, Julian Gehring wrote:
Hi,
When I use the 'filterVcf' function to process a VCF file in chunks, the
output file contains as many copies of the header as there are
chunks. Consider the example, adapted from the 'filterVcf' man page:
library(VariantAnnotation)
fl <- system.file(package = "VariantAnnotation", "extdata", "chr22.vcf.gz")
filt <- FilterRules(list(isSNP = function(x) info(x)$VT == "SNP"))
## Control: Processing without chunking
t0 <- TabixFile(fl) ## yieldSize -> NA
out0 <- filterVcf(t0, "hg19", tempfile(), filters = filt)
tx0 = readLines(out0)
header_lines0 = grep("^##", tx0)
## 30 header lines, in line 1,..,30
## Case: Processing in 3 chunks
t1 <- TabixFile(fl, yieldSize = 5e3) ## gives us 3 chunks here
out1 <- filterVcf(t1, "hg19", tempfile(), filters = filt)
tx1 = readLines(out1)
header_lines1 = grep("^##", tx1)
## 90 header lines, header 3 times duplicated
## 1) 1,..,30, 2) 4827,..,4856, 3) 9673,..,9702
It seems that for each chunk, a complete VCF file including the header
gets written. See the relevant part of VariantFiltering:::.filter:
while (nrow(vcfChunk <- readVcf(tbxFile, genome, ..., param = param))) {
vcfChunk <- subsetByFilter(vcfChunk, filters)
writeVcf(vcfChunk, filtered)
}
For the processing of VCFs with large headers in many chunks
(e.g. 1000genomes callsets), this can result in the paradox situation
that the filtered file ends up being significantly larger than the
original.
Best wishes
Julian
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] VariantAnnotation_1.14.6 Rsamtools_1.20.4 Biostrings_2.36.1
[4] XVector_0.8.0 GenomicRanges_1.20.5 GenomeInfoDb_1.4.1
[7] IRanges_2.2.5 S4Vectors_0.6.3 BiocGenerics_0.14.0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.30.1 zlibbioc_1.14.0 GenomicAlignments_1.4.1
[4] BiocParallel_1.2.9 BSgenome_1.36.3 tools_3.2.1
[7] Biobase_2.28.0 DBI_0.3.1 lambda.r_1.1.7
[10] futile.logger_1.4.1 rtracklayer_1.28.6 futile.options_1.0.0
[13] bitops_1.0-6 RCurl_1.95-4.7 biomaRt_2.24.0
[16] RSQLite_1.0.0 GenomicFeatures_1.20.1 XML_3.98-1.3
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109
Email: voben...@fredhutch.org
Phone: (206) 667-3158
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel