Hello David,

I'll try to explain case by case, using hg19 (and hg18, noting where they 
differ). Please note that the dbSNP release for hg19 is still considered 
provisional. If there are problems in the data, dbSNP would need to correct the 
source data on their end - we only place the data on the genome and add some 
descriptions.

Case A - rs3223599

You are correct, the CA repeat is part of the reference sequence. The genomic 
flanking sequence for the SNP is not quite what would be expected. It looks as 
if both the C and the A should be noted as the SNP position, not just the C, 
which would leave the leading A out of the right flanking sequence. The 
alignment of the flanking sequence to the reference genome sequence shows this. 
My guess is that the repeats caused some problems with placing the SNP. From a 
practical perspective, the SNP covers all of the bases of the CA repeat (all 
48) observed in the reference sequence, and the variability is in that region 
(the observed copy numbers of the CA repeat, of which 24 is one). Since an 
observed allele is never zero copies or a single "CA", the SNP placement and 
the flanking sequence are confused. Hg18 is the same: different coordinates, 
but a similar SNP description. 
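
To find other rows with the same single-base placement, something like the following Python sketch might help. It is only an illustration: the column names are my reading of the snp130 table layout (an assumption, please check them against the table schema), and the helper names are made up for this example. The test row is the Case A line from your message.

```python
# Column names follow the snp130 table layout as I understand it
# (an assumption; verify against the table schema before relying on them).
SNP130_COLUMNS = [
    "bin", "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "refNCBI", "refUCSC", "observed", "molType", "class", "valid",
    "avHet", "avHetSE", "func", "locType", "weight",
]

def parse_snp130_line(line):
    """Split one tab-separated snp130.txt row into a dict of named fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(SNP130_COLUMNS):
        raise ValueError(
            f"expected {len(SNP130_COLUMNS)} fields, got {len(values)}"
        )
    row = dict(zip(SNP130_COLUMNS, values))
    row["chromStart"] = int(row["chromStart"])
    row["chromEnd"] = int(row["chromEnd"])
    return row

def is_single_base_microsatellite(row):
    """True when a microsatellite row spans one base (chromEnd = chromStart + 1)."""
    return (
        row["class"] == "microsatellite"
        and row["chromEnd"] - row["chromStart"] == 1
    )

# The Case A row from the original question, re-joined with tabs.
case_a = "\t".join([
    "627", "chr1", "5576651", "5576652", "rs3223599", "0", "+", "C", "C",
    "(CA)19/20/21/22/23/24", "genomic", "microsatellite", "by-frequency",
    "0.752086", "0.089764", "unknown", "exact", "1",
])
row = parse_snp130_line(case_a)
print(row["name"], is_single_base_microsatellite(row))
```

Rows flagged this way are the ones whose coordinates most likely need a manual check against the reference sequence.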

Case B - rs3220726

Almost the same case, just reversed: the microsatellite (ms) is in the left 
(not right) flanking sequence. Only one base is noted as the reference allele 
(C), when the variation is a repeating "CA". This one also has the problem that 
the SNP position is inside of a CAT in the genomic reference sequence, not at 
the start of the actual CA repeating block. The BLAT alignment of the flanking 
sequence was done the same way as in case A. When the flanking sequence is 
organized this way, the ms coordinates could end up at the start or end of the 
actual ms (or, in this case, slightly before its start). Repetitive sequence is 
difficult to align to. It would be better if at least one observed allele was 
in the "allele" field.

Question 1: For either of these, there is not much to do except to adjust the 
coordinates yourself and perhaps submit the data to dbSNP. There isn't another 
data source. It may take some hand editing to perfect the coordinate positions, 
until dbSNP has a chance to edit it. 
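
As a starting point for that hand editing, here is a minimal Python sketch of one way to grow a single-base coordinate out to the full tandem repeat. It assumes you have already extracted the surrounding reference sequence into a string; the function name and the toy sequence are hypothetical, not real genome data.

```python
def repeat_extent(seq, pos, unit="CA"):
    """Return (start, end), 0-based half-open, of the tandem run of `unit`
    that covers index `pos` in `seq`, or None if no copy of `unit` covers `pos`.
    """
    k = len(unit)
    # Look for a copy of the unit that overlaps pos, trying each phase.
    for start in range(pos - k + 1, pos + 1):
        if start >= 0 and seq[start:start + k] == unit:
            break
    else:
        return None
    end = start + k
    # Extend to the left and right one whole unit at a time.
    while start - k >= 0 and seq[start - k:start] == unit:
        start -= k
    while seq[end:end + k] == unit:
        end += k
    return start, end

# Toy sequence (made up for illustration): 3 flanking bases, 5 CA copies, 3 more.
toy = "GGT" + "CA" * 5 + "TGG"
print(repeat_extent(toy, 12))  # position on the last A of the repeat
print(repeat_extent(toy, 3))   # position on the first C of the repeat
print(repeat_extent(toy, 1))   # position in the flank, outside the repeat
```

Note that this only works when the given position falls on a copy of the repeat unit. For a placement like Case B, where the position sits slightly before the repeating block, you would first need to scan a small window around the position for the longest run.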

Case C - rs3222966

Better, since one of the actual observed alleles is noted as the allele, but it 
does have a problem with a missing base. If you actually look at the genome 
sequence, there isn't another A to make the full 48 bases. There is however a 
leading A before the SNP starts. So, is this an AC repeat? Is the actual 
observed really only 23 copies? Perhaps the G after the ms is a bad base call?
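
The missing base shows up directly in the arithmetic on the coordinates from your Case C row. A small Python sketch, assuming the table coordinates are 0-based half-open (chromEnd - chromStart gives the length); the function name is just for illustration:

```python
def copies_in_span(chrom_start, chrom_end, unit_len=2):
    """Return (whole_copies, leftover_bases) for a 0-based half-open span."""
    span = chrom_end - chrom_start
    return divmod(span, unit_len)

# Case C coordinates: 22129973 - 22129926 = 47 bases.
whole, leftover = copies_in_span(22129926, 22129973)
print(whole, leftover)  # 23 whole dinucleotide copies plus 1 leftover base
```

A nonzero leftover is a cheap way to spot rows where the span cannot be a whole number of repeat units.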

Question 2: Agreed, the lengthTooLong is not very helpful. Although all three 
of these are obviously describing the same type of feature, they all have 
little problems with how they are modeled in the data. Again, you will need to 
make repairs to the data itself for your own use and consider submitting the 
evidence to dbSNP, for them to use for a correction.
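
When cleaning up the data for your own use, it may help to split the observed field into its parts first. The Python sketch below is based only on the three observed strings in your message, so the format it assumes, "(UNIT)" followed by slash-separated values, is a guess at the general pattern rather than a documented rule. The function name is hypothetical.

```python
import re

def parse_observed(observed):
    """Split a microsatellite 'observed' string such as '(CA)19/20/21'.

    Returns a dict with the repeat unit, the integer copy numbers, and any
    other slash-separated tokens, or None if the string does not match
    (for example 'lengthTooLong').
    """
    match = re.fullmatch(r"\(([ACGT]+)\)(.+)", observed)
    if match is None:
        return None
    unit, rest = match.group(1), match.group(2)
    copies, other = [], []
    for token in rest.split("/"):
        if token.isdigit():
            copies.append(int(token))
        else:
            other.append(token)  # e.g. the stray A and T in Case D
    return {"unit": unit, "copies": copies, "other": other}

print(parse_observed("(CA)19/20/21/22/23/24"))
print(parse_observed("(CA)20/21/22/23/A/T"))
print(parse_observed("lengthTooLong"))
```

Rows that come back as None, or with a non-empty "other" list, are the ones to review by hand.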

Case D - rs3219614

Question 3: The /A/T are supposed to represent alternate alleles, but I think 
this is probably an error. The ms starts after this base (the allele coordinate 
is one base too small). Many of the same issues you noted in earlier cases 
apply, with this new wrinkle. I count only 21 CA repeats in the genome sequence 
(and in the flanking sequence), and only 16 if the CAT towards the end is the 
true end of the feature (not likely; it is probably poor base calling). You 
would need to see all the evidence for each of the observed alleles in place 
with flanking sequence to know for certain.

Question 4: A good question. The reference is the genome the SNP is placed 
upon, which, as you note, only has 21 copies. hg18 is the same. Another problem.
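
For reference, the copy count from the browser positions you gave can be checked as below. This is a small Python sketch that assumes those positions are 1-based and inclusive (as displayed in the browser), unlike the 0-based starts in the table; the function name is made up for the example.

```python
def copies_from_one_based(first, last, unit_len=2):
    """Return (whole_copies, leftover_bases) for 1-based inclusive positions."""
    length = last - first + 1
    return divmod(length, unit_len)

# Case D: the CA repeat runs from 35119591 to 35119632 on chr1.
ref_copies, leftover = copies_from_one_based(35119591, 35119632)
observed_copies = [20, 21, 22, 23]  # from (CA)20/21/22/23/A/T
print(ref_copies, leftover, ref_copies in observed_copies)
```

The 42 bases give 21 whole copies, which is one of the listed copy numbers, but the row itself does not say which of the observed values the reference carries.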

Mostly what I can do is confirm that what you are seeing is the same as what I 
can see here. All of these are the same class of variation and should be 
formatted in the same way to facilitate analysis. This is the data from dbSNP - 
so any changes/corrections would need to flow from them.

I hope I helped a bit, 
Jennifer

------------------------------------------------ 
Jennifer Jackson 
UCSC Genome Bioinformatics Group 

----- "David Gordon" <[email protected]> wrote:

> From: "David Gordon" <[email protected]>
> To: [email protected]
> Cc: "David Gordon" <[email protected]>
> Sent: Thursday, December 10, 2009 5:51:39 PM GMT -08:00 US/Canada Pacific
> Subject: [Genome] questions on microsatellites in snp130.txt download
>
> Dear UCSC,
> 
> I've looked through the archives so I think my question hasn't yet
> been answered.
> 
> I'm looking at microsatellites in the snp130.txt file.  I am trying
> to
> make sense of the coordinates.  In many case the coordinates of a
> microsatellite refer to a single base (chromEnd = chromStart + 1).
> Such is the cases A and B below.  But where is the microsatellite?
> According to the alignments (by clicking on the rs... name), in case
> A
> the indicated microsatellite (the black bar in the browser with
> snp130
> set to "full") is at the *end* of the CA repeat (the actual
> microsatellite).  In case B, the indicated microsatellite is at the
> *beginning* of the CA repeat.  Both of these are top strand snps.
> 
> Case A.
> 
> 627     chr1    5576651 5576652 rs3223599       0       +       C
> C       (CA)19/20/21/22/23/24   genomic microsatellite  by-frequency
> 0.752086        0.089764        unknown exact   1
> 
> The genome browser shows the entire microsatellite
> repeat (all 24 copies of CA, so 48 bases) as the reference
> sequence. The position 5576652 marks the *end* of the CA repeat.  The
> browser just shows the microsatellite as a single base.
> 
> Case B:
> 
>   658     chr1    9585594 9585595 rs3220726       0       +       C
>   C       lengthTooLong   genomic microsatellite  by-frequency
>   0.8126  0.129764        unknown exact   1
> 
> The genome browser shows base at 1-position 9585595 in this case is
> at
> the *left* (beginning) of the CA repeat.  This repeat is not
> particularly long: 58 bases.  I don't see any way that I can get this
> information from the line above.
> 
> Question 1)
> 
> So how would anyone know, by looking in snp130.txt, where the actual
> microsatellite is?  Is there some other table that I could download
> that would give this information?
> 
> In case C, the coordinates given are the actual coordinates of the
> microsatellite.
> 
> Case C:
> 
> 753     chr1    22129926        22129973        rs3222966       0
> +       CACACACACACACACACACACACACACACACACACACACACACACAC
> CACACACACACACACACACACACACACACACACACACACACACACAC
> (CA)17/18/19/20/21/22/23/24     genomic microsatellite  by-frequency
> 0.7524  0.158867        unknown range   1
> 
> In this case, the microsatellite shows the full coordinates of the
> 47-base microsatellite which includes all (but 1/2) of the 24-copy CA
> repeat.
> 
> Question 2)
> 
> If the observed is listed as lengthTooLong, is there any way to
> determine what the bases of the microsatellite are?  (Without
> that, they aren't much use.)
> 
> 
> Case D:
> 
> 852     chr1    35119589        35119590        rs3219614       0
> +       T       T       (CA)20/21/22/23/A/T     genomic
> microsatellite
> by-frequency    0.284918        0.283047        unknown exact   1
> 
> Question 3)
> 
> In case D, what does the /A/T mean at the end of (CA)20/21/22/23/A/T
> ?
> 
> Question 4)
> 
> In case D, the CA repeat starts at position 35119591 (chr1) and ends
> at
> 35119632, giving 42 bases or 21 copies of the repeat.  So why does
> the
> allele indicate that there are 23 copies?
> 
> Thank you very much!
> 
> David Gordon
> 
> 
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome