Thanks for the report and reproducible example. Now fixed in 1.14.8 and 1.15.22.

Valerie



On 08/07/2015 07:25 AM, Julian Gehring wrote:
Hi,

When I use the 'filterVcf' function to process a VCF file in chunks, the
output file contains as many copies of the header as there are
chunks. Consider the example, adapted from the 'filterVcf' man page:

   library(VariantAnnotation)

   fl <- system.file(package = "VariantAnnotation", "extdata", "chr22.vcf.gz")
   filt <- FilterRules(list(isSNP = function(x) info(x)$VT == "SNP"))

   ## Control: Processing without chunking
   t0 <- TabixFile(fl) ## yieldSize -> NA
   out0 <- filterVcf(t0, "hg19", tempfile(), filters = filt)

   tx0 = readLines(out0)
   header_lines0 = grep("^##", tx0)
   ## 30 header lines, in line 1,..,30

   ## Case: Processing in 3 chunks
   t1 <- TabixFile(fl, yieldSize = 5e3) ## gives us 3 chunks here
   out1 <- filterVcf(t1, "hg19", tempfile(), filters = filt)

   tx1 = readLines(out1)
   header_lines1 = grep("^##", tx1)
   ## 90 header lines, header 3 times duplicated
   ## 1) 1,..,30, 2) 4827,..,4856, 3) 9673,..,9702

It seems that for each chunk, a complete VCF file including the header
gets written.  See the relevant part of VariantFiltering:::.filter:

    while (nrow(vcfChunk <- readVcf(tbxFile, genome, ..., param = param))) {
      vcfChunk <- subsetByFilter(vcfChunk, filters)
      writeVcf(vcfChunk, filtered)
    }

For the processing of VCFs with large headers in many chunks
(e.g. 1000genomes callsets), this can result in the paradox situation
that the filtered file ends up being significantly larger than the
original.

Best wishes
Julian



R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] VariantAnnotation_1.14.6 Rsamtools_1.20.4         Biostrings_2.36.1
[4] XVector_0.8.0            GenomicRanges_1.20.5     GenomeInfoDb_1.4.1
[7] IRanges_2.2.5            S4Vectors_0.6.3          BiocGenerics_0.14.0

loaded via a namespace (and not attached):
  [1] AnnotationDbi_1.30.1    zlibbioc_1.14.0         GenomicAlignments_1.4.1
  [4] BiocParallel_1.2.9      BSgenome_1.36.3         tools_3.2.1
  [7] Biobase_2.28.0          DBI_0.3.1               lambda.r_1.1.7
[10] futile.logger_1.4.1     rtracklayer_1.28.6      futile.options_1.0.0
[13] bitops_1.0-6            RCurl_1.95-4.7          biomaRt_2.24.0
[16] RSQLite_1.0.0           GenomicFeatures_1.20.1  XML_3.98-1.3

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to