On 03/16/2015 09:22 AM, Michael Lawrence wrote:
Yes, I think it would make sense for the Xtra package to follow the
established convention in VariantAnnotation/VCF.

I don't know. I agree it would be nice to use a more consistent
representation of insertions across the software but I'm not convinced
we should necessarily follow the VCF way, which is to include the base
before the event in the ref and alt alleles as well as in the reported
range.

Note that there doesn't seem to be any consensus in the broader
Bioinformatics community either. For example dbSNP and HGVS report the
range that corresponds to the 2 flanking nucleotides but they don't
include these nucleotides in the ref or alt alleles. VCF does the same
as GFF3 which says "start equals end and the implied site is to the
right of the indicated base" except that VCF wants to treat events that
occur at position 1 in a special way. In that case VCF says the base
*after* the event should be included (seems like the VCF authors want
to avoid both: empty ranges and also ranges that start at POS 0).
BTW it doesn't seem that VariantAnnotation::isInsertion() is aware of
that special treatment.

UCSC uses a zero-width range, and so does the XtraSNPlocs.* packages.
The advantage of this representation is its simplicity. There is no
special cases. It also reflects the notion that an insertion is
a replacement of an empty string with a non-empty string. No need
to include flanking nucleotides in the representation (which is
artificial and distorts the "real" alt allele).

H.


On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo <robert.cast...@upf.edu>
wrote:

dear devel people, specially Val and Michael,

Hervé has recently added an annotation package that includes non-SNVs
variants from dbSNP, it is called:

library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)

if you execute the corresponding example you'll see the kind of
information stored in the package:

example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)


if you pay attention to the following case:

my_snps1[1:2]
GRanges object with 2 ranges and 3 metadata columns:
       seqnames               ranges strand |   RefSNP_id     alleles
          <Rle>            <IRanges>  <Rle> | <character> <character>
   [1]       22 [10513380, 10513380]      - | rs386831164         -/T
   [2]       22 [10519678, 10519677]      + |  rs71286731       -/TTT
           ref_allele
       <DNAStringSet>
   [1]              T
   [2]              -
   -------
   seqinfo: 25 sequences (1 circular) from GRCh38 genome

you'll see the first variant (rs386831164) is a deletion of one nucleotide
and the second (rs71286731) is an insertion of three nucleotides (TTT).

it struck me that the ranges representing the insertion had an start
position one nucleotide larger than then and i contacted Hervé thinking
that this was a mistake. however, i've learned from him that these are
so-called "zero-width" ranges and they actually allow to distinguish
insertions from every other type of variant without the need to know
anything about the reference or the alternate allele.

currently, the VCF specification 4.2 (http://samtools.github.io/
hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition of the REF
column to help distinguishing insertions by including the flanking
nucleotide of the inserted sequence. As a result,
VariantAnnotation::readVcf() produces ranges that mimic this standard
having identical start and end positions leading to 1-width ranges:

fl <- system.file("extdata", "CEUtrio.vcf.bgz", package="VariantFiltering")
vcf <- readVcf(fl, genome="hg19")
rowRanges(vcf[isInsertion(vcf), ])[1:2]
GRanges object with 2 ranges and 5 metadata columns:
                     seqnames               ranges strand | paramRangeID
                        <Rle>            <IRanges>  <Rle> |     <factor>
          rs11474033       20 [45093330, 45093330]      * |         <NA>
   20:47592746_G/GGC       20 [47592746, 47592746]      * |         <NA>
                                REF                ALT      QUAL      FILTER
                     <DNAStringSet> <DNAStringSetList> <numeric> <character>
          rs11474033              C              CTTCT   2901.12           .
   20:47592746_G/GGC              G                GGC    608.88           .
   -------
   seqinfo: 84 sequences from hg19 genome


table(width(rowRanges(vcf[isInsertion(vcf), ])))

  1
78

i would like to ask whether it would make sense to harmonize the way in
which dbSNP insertions and variants are imported into Bioconductor by
making VariantAnnotation::readVcf() to produce zero-width ranges for
insertion variants. this not a feature request, i only would like to know
what whether there is a particular reason not to use the available
zero-width ranges that seem to be implemented for this purpose.


cheers,

robert.

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to