Hi,

In our group's efforts to accurately parse UCSC human SNP records, several small puzzles have emerged.

First, a few general questions about SNP records of any class:
1) As a value in the 'refUCSC' field, does "-" always simply denote a gap relative to a subject (or some other reference) genome?
2) Is the value in the 'refUCSC' field, or its reverse complement (if and only if "-" is the value of 'strand' but not the value of 'refUCSC'?), always at least an implied value of the 'observed' field, in addition to any other value(s) listed there? That is, for entries where the values of 'observed' do not include either the value of 'refUCSC' or its reverse complement, are we to presume that the missing value would be included in a more verbose population of the 'observed' field?
3) Are all values for 'observed' given as part of the strand specified in the 'strand' field, while values in 'refUCSC' are given as part of the plus strand, regardless of the value listed in the 'strand' field?
4) What mapping, if any, holds between the allele state listed in 'refUCSC' and the ancestral (versus derived) allele state for that SNP?

Next, a few questions about SNP records of class "single":
1) For a class "single" SNP entry, it appears that the value of 'chromStart' equals that of 'chromEnd' if and only if the value of 'refUCSC' is "-" (e.g., rs3542401, rs1755135). Yet the allele states (always multiple) listed in the 'observed' field never include "-", but instead always appear as single bases (e.g., "A/T"). How does the "-" value in 'refUCSC' relate to the multiple allele state values in 'observed'?
2) Given that the values of 'chromStart' and 'chromEnd' are equal only where the value of 'refUCSC' is "-", are we right to infer that such cases represent single-base insertions/deletions, while all other class "single" cases represent single-base substitutions? If this interpretation is right, why is SNP rs17551353 (strand = "-"; refUCSC = "-"; observed = "C/G") classified as class "single", while SNP rs28383030 (strand = "-"; refUCSC = "A"; observed = "-/T") is classified as class "in-del"?
3) In some class "single" entries (e.g., rs5869813, rs61556558), the value of 'refUCSC' is a multibase string, but each allele state in 'observed' is a single base. How are such entries (specifically, the multibase value of 'refUCSC') to be interpreted, especially for parsing which (class =...) "single" base is the site of variation? In what cases, if any (or all), are we to infer that the site of variation for a class "single" SNP is 'chromStart'+1?

Next, a few questions about SNP records of class "in-del":
1) Is the identity of the ancestral allele invoked in further classifying a class "in-del" SNP as either class "insertion" or class "deletion"? If not, what is the basis/purpose of this subclassification?
2) Just as for class "single" entries, "-" may appear as the value of the 'refUCSC' field, but not as a value of 'observed' for that entry. Are such cases always also of class "insertion"?
3) When the values of 'chromStart' and 'chromEnd' are equal, the value of 'observed' appears to always be "lengthTooLong"; by contrast, when the 'chromStart' and 'chromEnd' values are not equal, the value of 'observed' may or may not be "lengthTooLong". Is every entry with "lengthTooLong" in the 'observed' field to be interpreted as an allelism in which the two possible allele states are a too-long-to-be-reliably-sequenced motif versus a gap? Is the specific nucleotide sequence of that motif stored somewhere in the database? If not, how, if at all, can we find its value?
4) How, if at all, does the value of the 'strand' field affect the interpretation of the "lengthTooLong" value listed in the 'observed' field, and/or of the "-" value listed in the 'refUCSC' field?
5) In some cases (e.g., rs10605661), the 'observed' field contains a "-" value and/or a multinucleotide value, but the 'refUCSC' field contains only a single-base value. Why is the 'refUCSC' value not one of the values listed in 'observed'?
6) Is the variable segment in a class "in-del" SNP always the segment that starts at position 'chromStart' + 1 and continues through position 'chromEnd' (even when the value of 'strand' is "-"), or are there other rules for inferring exactly which positions vary?

For class "insertion": Are the following inferences right (and, if not, please advise re. the correct interpretation)?
1) Every class "insertion" SNP has exactly two allele state values in 'observed'.
2) "-" appears as the value of 'refUCSC', and as a value of 'observed', if and only if the value of 'chromStart' equals the value of 'chromEnd'.
3) If the value of 'chromStart' does not equal the value of 'chromEnd', and the length of some non-'refUCSC' allele listed in 'observed' equals the quantity 'chromEnd' - 'chromStart', then the subject and reference genomes align with no local gap, and the subject genome has that non-'refUCSC' allele substituted for the reference positions 'chromStart+1' to 'chromEnd'.
4) If the value of 'chromStart' does not equal the value of 'chromEnd', and the length of some non-'refUCSC' allele listed in 'observed' exceeds the quantity 'chromEnd' - 'chromStart', then the reference genome contains a local gap relative to the subject genome, and the subject genome has that non-'refUCSC' allele substituted for gap-inclusive reference positions 'chromStart+1' to 'chromEnd'.

For class "deletion": Are the following inferences right (and, if not, please advise re. the correct interpretation)?
1) No class "deletion" SNP has equal 'chromStart' and 'chromEnd' values.
2) No class "deletion" SNP has "-" as a 'refUCSC' value.
3) Every class "deletion" SNP with "-" as a value in 'observed' has exactly one other 'observed' value; other cases (in which "-" is not listed as a value in 'observed') may have more than two possible allele states listed in 'observed'.
4) If the length of some non-'refUCSC' allele listed in 'observed' equals the quantity 'chromEnd' - 'chromStart', then the subject and reference genomes align with no local gap, and the subject genome has that non-'refUCSC' allele substituted for reference positions 'chromStart+1' to 'chromEnd'.
5) If the length of some non-'refUCSC' allele listed in 'observed' is less than the quantity 'chromEnd' - 'chromStart', then the subject genome contains a local gap relative to the reference genome, and the subject genome has that non-'refUCSC' allele substituted for reference positions 'chromStart+1' to 'chromEnd'.

Thanks very much for any clarification you might provide!

Sincerely,
Nathan Pearson
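
P.S. To make the strand and coordinate questions above concrete, here is a minimal Python sketch. It is not a statement of how the table actually works: it assumes exactly the readings the questions ask about, namely that 'observed' is reported on the strand named in 'strand', that 'refUCSC' is reported on the plus strand, and that the variable segment runs from 'chromStart'+1 through 'chromEnd'. The SnpRecord type, the function names, and the coordinates in the two example records are hypothetical; only the strand, refUCSC, observed, and class values are those quoted above for rs28383030 and rs17551353.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Translation table for complementing bases; other characters pass through.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
# 'observed' values that are not literal sequence and so are never complemented.
_NON_SEQUENCE = {"-", "lengthTooLong"}


@dataclass
class SnpRecord:
    """Hypothetical container for the fields discussed in this message."""
    name: str
    chrom_start: int
    chrom_end: int
    strand: str
    ref_ucsc: str
    observed: str
    snp_class: str


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def observed_on_plus_strand(rec: SnpRecord) -> List[str]:
    """ASSUMPTION: 'observed' is on rec.strand, so flip it when strand is '-'."""
    alleles = rec.observed.split("/")
    if rec.strand != "-":
        return alleles
    return [a if a in _NON_SEQUENCE else reverse_complement(a) for a in alleles]


def ref_in_observed(rec: SnpRecord) -> bool:
    """ASSUMPTION: 'refUCSC' is on the plus strand; test if 'observed' lists it."""
    return rec.ref_ucsc in observed_on_plus_strand(rec)


def variable_span(rec: SnpRecord) -> Optional[Tuple[int, int]]:
    """ASSUMPTION: variation spans 1-based positions chromStart+1..chromEnd.

    Returns None when chromStart == chromEnd (a zero-length location).
    """
    if rec.chrom_start == rec.chrom_end:
        return None
    return (rec.chrom_start + 1, rec.chrom_end)


# Coordinates below are made up for illustration; the other values are from above.
indel = SnpRecord("rs28383030", 100, 101, "-", "A", "-/T", "in-del")
single = SnpRecord("rs17551353", 200, 200, "-", "-", "C/G", "single")

for rec in (indel, single):
    print(rec.name, observed_on_plus_strand(rec), ref_in_observed(rec), variable_span(rec))
```

Under these assumptions, the 'refUCSC' value of rs28383030 is among its strand-corrected 'observed' values, while that of rs17551353 is not, which is the discrepancy the class "single" questions ask about.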
_______________________________________________ Genome maillist - [email protected] https://lists.soe.ucsc.edu/mailman/listinfo/genome
