Yes, I think it would make sense for the Xtra package to follow the established convention in VariantAnnotation/VCF.
On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo <robert.cast...@upf.edu> wrote: > dear devel people, specially Val and Michael, > > Hervé has recently added an annotation package that includes non-SNVs > variants from dbSNP, it is called: > > library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38) > > if you execute the corresponding example you'll see the kind of > information stored in the package: > > example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38) > > > if you pay attention to the following case: > > my_snps1[1:2] > GRanges object with 2 ranges and 3 metadata columns: > seqnames ranges strand | RefSNP_id alleles > <Rle> <IRanges> <Rle> | <character> <character> > [1] 22 [10513380, 10513380] - | rs386831164 -/T > [2] 22 [10519678, 10519677] + | rs71286731 -/TTT > ref_allele > <DNAStringSet> > [1] T > [2] - > ------- > seqinfo: 25 sequences (1 circular) from GRCh38 genome > > you'll see the first variant (rs386831164) is a deletion of one nucleotide > and the second (rs71286731) is an insertion of three nucleotides (TTT). > > it struck me that the ranges representing the insertion had an start > position one nucleotide larger than then and i contacted Hervé thinking > that this was a mistake. however, i've learned from him that these are > so-called "zero-width" ranges and they actually allow to distinguish > insertions from every other type of variant without the need to know > anything about the reference or the alternate allele. > > currently, the VCF specification 4.2 (http://samtools.github.io/ > hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition of the REF > column to help distinguishing insertions by including the flanking > nucleotide of the inserted sequence. As a result, > VariantAnnotation::readVcf() produces ranges that mimic this standard > having identical start and end positions leading to 1-width ranges: > > fl <- system.file("extdata", "CEUtrio.vcf.bgz", package="VariantFiltering") > vcf <- readVcf(fl, genome="hg19") > rowRanges(vcf[isInsertion(vcf), ])[1:2] > GRanges object with 2 ranges and 5 metadata columns: > seqnames ranges strand | paramRangeID > <Rle> <IRanges> <Rle> | <factor> > rs11474033 20 [45093330, 45093330] * | <NA> > 20:47592746_G/GGC 20 [47592746, 47592746] * | <NA> > REF ALT QUAL FILTER > <DNAStringSet> <DNAStringSetList> <numeric> <character> > rs11474033 C CTTCT 2901.12 . > 20:47592746_G/GGC G GGC 608.88 . > ------- > seqinfo: 84 sequences from hg19 genome > > > table(width(rowRanges(vcf[isInsertion(vcf), ]))) > > 1 > 78 > > i would like to ask whether it would make sense to harmonize the way in > which dbSNP insertions and variants are imported into Bioconductor by > making VariantAnnotation::readVcf() to produce zero-width ranges for > insertion variants. this not a feature request, i only would like to know > what whether there is a particular reason not to use the available > zero-width ranges that seem to be implemented for this purpose. > > > cheers, > > robert. > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel > [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel