Hi,

In our group's efforts to accurately parse UCSC human SNP records, several
small puzzles have emerged.  First, a few general questions about SNP
records of any class:

1) As a value in the 'refUCSC' field, does "-" always simply denote a gap
relative to a subject (or some other reference) genome?

2) Is the value in the 'refUCSC' field, or its reverse complement (if and
only if "-" is the value of 'strand' but not the value of 'refUCSC'?),
always at least an implied value of the 'observed' field, in addition to any
other value(s) listed there?  That is, for entries where the values of
'observed' do not include either the value of 'refUCSC' or its reverse
complement, are we to presume that the missing value would be included in a
more verbose population of the 'observed' field?

3) Are all values for 'observed' given as part of the strand specified in
the 'strand' field, while values in 'refUCSC' are given as part of the plus
strand, regardless of the value listed in the 'strand' field?

4) What mapping, if any, holds between the allele state listed in 'refUCSC'
and the ancestral (versus derived) allele state for that SNP?
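
To make questions 2 and 3 concrete, here is a minimal sketch of the check we
currently run.  The function names are our own (hypothetical), and the strand
handling encodes the very assumption we are asking about: that 'refUCSC' is
given on the plus strand while 'observed' is given on the strand named in
'strand'.

```python
# Hypothetical helpers; field names follow the columns discussed above.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string ("-" is unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]


def ref_in_observed(strand, ref_ucsc, observed):
    """True if refUCSC, oriented to 'strand', is one of the observed alleles.

    ASSUMPTION (this is what questions 2 and 3 ask): refUCSC is on the plus
    strand, and the '/'-separated observed alleles are on 'strand'.
    """
    alleles = observed.split("/")
    ref = ref_ucsc
    if strand == "-" and ref_ucsc != "-":
        ref = reverse_complement(ref_ucsc)
    return ref in alleles
```

Under that assumption, a record with strand = "-", refUCSC = "A" and
observed = "-/T" passes the check, while one with strand = "-", refUCSC = "-"
and observed = "C/G" does not.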

Next, a few questions about SNP records of class "single":

1) For a class "single" SNP entry, it appears that the value of 'chromStart'
equals that of 'chromEnd' if and only if the value of 'refUCSC' is "-"
(e.g., rs3542401, rs1755135).  Yet the allele states (always multiple)
listed in the observed field never include "-", but instead always appear as
single bases (e.g., "A/T").  How does the "-" value in 'refUCSC' relate to
the multiple allele state values in 'observed'?

2) Given that the values of 'chromStart' and 'chromEnd' are equal
only where the value of 'refUCSC' is "-", are we right to infer that such
cases represent single-base insertions/deletions, while all other class
"single" cases represent single-base substitutions?  If this interpretation
is right, why is SNP rs17551353 (strand = -; refUCSC = -; observed = C/G)
classified as class "single", while SNP rs28383030 (strand = "-"; refUCSC =
"A"; observed = "-/T") is classified as class "in-del"?

3) In some class "single" entries (e.g., rs5869813, rs61556558), the value
of 'refUCSC' is a multibase string, but each allele state in 'observed' is a
single base.  How are such entries (specifically, the multibase value of
'refUCSC') to be interpreted, especially for parsing which (class =...)
"single" base is the site of variation?  In what cases, if any (or all), are
we to infer that the site of variation for a class "single" SNP is
'chromStart'+1?
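
For reference, this is roughly how we flag the class "single" records that
prompted the three questions above.  It is only a sketch: the function name
and the wording of the labels are ours, and "a plain one-base substitution"
is our working expectation for this class, not something we have confirmed.

```python
def single_anomalies(chrom_start, chrom_end, ref_ucsc, observed):
    """List the ways a class "single" record departs from a plain
    one-base substitution (our working expectation, not a confirmed rule)."""
    problems = []
    span = chrom_end - chrom_start
    if span != 1:
        problems.append("span is %d, not 1" % span)
    if ref_ucsc == "-":
        problems.append("refUCSC is a gap")
    elif len(ref_ucsc) != 1:
        problems.append("refUCSC is multibase")
    # Every observed allele is expected to be exactly one base.
    if any(len(a) != 1 or a == "-" for a in observed.split("/")):
        problems.append("observed has a non-single-base allele")
    return problems
```

An empty list means the record looks like an ordinary substitution; the
records cited above come back with one or more labels.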

Next, a few questions about SNP records of class "in-del":

1) Is the identity of the ancestral allele invoked in further classifying a
class "in-del" SNP as either class "insertion" or class "deletion"?  If not,
what is the basis/purpose of this subclassification?

2) Just as for class "single" entries, "-" may appear as the value of the
'refUCSC' field, but not as a value of 'observed' for that entry.  Are such
cases always also of class "insertion"?

3) When the values of 'chromStart' and 'chromEnd' are equal, the value of
'observed' appears to always be "lengthTooLong"; by contrast, when the
'chromStart' and 'chromEnd' values are not equal, the value of observed may
or may not be "lengthTooLong".  Is every entry with "lengthTooLong" in the
observed field to be interpreted as an allelism in which the two possible
allele states are a too-long-to-be-reliably-sequenced motif versus a gap?
Is the specific nucleotide sequence of that motif stored somewhere in the
database?  If not, how, if at all, can we find its value?

4) How, if at all, does the value of the 'strand' field affect the
interpretation of the "lengthTooLong" value listed in the observed field,
and/or of the "-" value listed in the 'refUCSC' field?

5) In some cases (e.g., rs10605661), the 'observed' field contains a "-"
value and/or a multinucleotide value, but the 'refUCSC' field contains only
a single-base value.  Why is the 'refUCSC' value not one of the values
listed in 'observed'?

6) Is the variable segment in a class "in-del" SNP always the segment that
starts at position 'chromStart' + 1 and continues through position
'chromEnd' (even when the value of strand is "-"), or are there other rules
for inferring exactly which positions vary?
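
The reading proposed in question 6 amounts to the following sketch.  The
function name is hypothetical, and the rule itself (positions 'chromStart' + 1
through 'chromEnd', regardless of strand) is the assumption we would like
confirmed.

```python
def variable_positions(chrom_start, chrom_end):
    """Return the positions chromStart+1 .. chromEnd, inclusive.

    ASSUMPTION: this range is the variable segment on either strand.
    An empty list means chromStart == chromEnd (the record spans no bases).
    """
    return list(range(chrom_start + 1, chrom_end + 1))
```

With made-up coordinates, variable_positions(100, 103) gives the three
positions 101, 102 and 103, and variable_positions(100, 100) gives none.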

For class "insertion":

Are the following inferences right?  If not, please advise on the correct
interpretation:

1) Every class "insertion" SNP has exactly two allele state values in
'observed'.

2) "-" appears as the value of 'refUCSC', and as a value of 'observed', if
and only if the value of 'chromStart' equals the value of 'chromEnd'.

3) If the value of 'chromStart' does not equal the value of 'chromEnd', and
the length of some non-'refUCSC' allele listed in 'observed' equals the
quantity 'chromEnd' - 'chromStart', then the subject and reference genomes
align with no local gap, and the subject genome has that non-'refUCSC'
allele substituted for the reference positions 'chromStart+1' to 'chromEnd'.

4) If the value of 'chromStart' does not equal the value of 'chromEnd', and
the length of some non-'refUCSC' allele listed in 'observed' exceeds the
quantity 'chromEnd - chromStart', then the reference genome contains a local
gap relative to the subject genome, and the subject genome has that
non-refUCSC allele substituted for gap-inclusive reference positions
'chromStart+1' to 'chromEnd'.
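
Inferences 1 and 2 for class "insertion" can be tested record by record.  A
minimal sketch, with a hypothetical function name and result keys of our own
choosing:

```python
def check_insertion_inferences(chrom_start, chrom_end, ref_ucsc, observed):
    """Report whether one class "insertion" record satisfies inferences 1 and 2."""
    alleles = observed.split("/")
    zero_span = chrom_start == chrom_end
    # Inference 2: "-" is in refUCSC and in observed if and only if the span is zero.
    gap_in_both = ref_ucsc == "-" and "-" in alleles
    return {
        "inference_1_two_alleles": len(alleles) == 2,
        "inference_2_gap_iff_zero_span": gap_in_both == zero_span,
    }
```

Any record for which either value is False would be a counterexample to the
corresponding inference.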

For class "deletion":

Are the following inferences right?  If not, please advise on the correct
interpretation:

1) No class "deletion" SNP has equal 'chromStart' and 'chromEnd' values.

2) No class "deletion" SNP has "-" as a 'refUCSC' value.

3) Every class "deletion" SNP with "-" as a value in 'observed' has exactly
one other 'observed' value; other cases (in which "-" is not listed as a
value in 'observed') may have more than two possible allele states listed in
'observed'.

4) If the length of some non-'refUCSC' allele listed in 'observed' equals
the quantity 'chromEnd' - 'chromStart', then the subject and reference
genomes align with no local gap, and the subject genome has that
non-'refUCSC' allele substituted for reference positions 'chromStart+1' to
'chromEnd'.

5) If the length of some non-'refUCSC' allele listed in 'observed' is less
than the quantity 'chromEnd - chromStart', then the subject genome contains
a local gap relative to the reference genome, and the subject genome has
that non-'refUCSC' allele substituted for reference positions 'chromStart+1'
to 'chromEnd'.
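
The length comparisons in inferences 3 and 4 for class "insertion", and 4 and
5 for class "deletion", reduce to one test.  The sketch below uses a
hypothetical function name, and it treats "-" as an allele of length zero,
which is our assumption rather than a documented rule.

```python
def compare_allele_to_span(chrom_start, chrom_end, allele):
    """Compare an observed allele's length with chromEnd - chromStart.

    ASSUMPTION: the allele "-" has length zero.
    """
    span = chrom_end - chrom_start
    length = 0 if allele == "-" else len(allele)
    if length == span:
        return "same length: no local gap"
    if length > span:
        return "allele longer: gap in reference"
    return "allele shorter: gap in subject"
```

Whether these three outcomes really correspond to the alignments described in
the inferences above is what we would like confirmed.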

Thanks very much for any clarification you might provide!  Sincerely,

Nathan Pearson
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
