Hi Tom,

Thank you for your input on SNPs and the Genome Browser tracks.  I have 
passed on your recommendation to supplement the existing 46-way 
comparative genomics track with the human variation frequencies to our 
engineers.

--
Brooke Rhead
UCSC Genome Bioinformatics Group


On 12/8/11 9:03 AM, thomas pringle wrote:
> I read through the 324 abstracts Robert K sent for the 2011 ASHG
> meeting. As usual, a vast number of papers found coding changes in this
> or that gene to be the explanation for a disease condition, sequencing
> entire family genomes only to throw all the gigabases away at the end
> (except for that 1 bp they wanted).
>
> One option used for evaluating a non-synonymous SNP: if the gene has an
> available 3D structure from X-ray or NMR, the variant can be stubbed in,
> the structural effect evaluated, and the substitution characterized as
> bad or neutral (good is rare and much harder to prove).
>
> The problem here is the 'if'. As far as I know, nobody explicitly tracks
> how many of the 20,000 human genes have an experimentally determined
> structure nor graphs how fast we are progressing per year to 100% due to
> proteomics initiatives, for purposes of SNP evaluation. I would estimate
> an 80-100 year time frame on this the way things are going.
>
> This is because human proteins are large (ave 450 aa) relative to a
> typical structural determination (ave 260 aa over the 1,000 most recent
> PDB additions), and 20% or so are not soluble and may never be amenable
> to structural study.
>
> Below I took 50 genes at random and blastped them individually against
> human and non-human PDB:
>
> -- 18% had a determined structure for the human query; however, these
> were seldom full length, resulting in 68% coverage on average
>
> -- these are somewhat over-weighted on disease genes (adjusted for
> incidence) but many structures have been determined for small enzymes of
> low clinical interest.
>
> -- another 30% had a determined structure for a human paralog; however,
> these had both low coverage and low identity, resulting in 18% coverage
> of a given SNP when identity is required.
>
> -- very few situations were improved using structures from homologous
> proteins in other species.
>
> -- overall, chance of a given SNP having any kind of structural coverage
> of the original amino acid was 17%, so if there was a 25% chance of the
> enveloping patch providing reliable structural evaluation of the SNP,
> this option works out roughly 4-5% of the time (after adjusting for much
> shorter protein length in the on-target structures). Ab initio
> calculations of structure are not at a point where they can affect SNP
> evaluation statistics.
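
The final estimate in the list above can be checked with a few lines of arithmetic. This is a minimal sketch in Python: the variable names are illustrative, the 17% and 25% figures are the ones quoted in the message, and treating the two chances as independent is an assumption. The adjustment for shorter on-target structures is not modeled, since the message does not give its size.

```python
# Figures quoted in the message; variable names are illustrative only.
p_covered = 0.17   # chance the original residue has any structural coverage
p_reliable = 0.25  # supposed chance the surrounding patch gives a reliable call

# Both conditions must hold, so (assuming independence) the chances multiply.
p_usable = p_covered * p_reliable
print(f"{p_usable:.2%}")
```

The product is a little over 4%, which sits at the low end of the "roughly 4-5%" range given in the message.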
>
> In summary, 3D structure is a nice tool in the toolbox to evaluate
> nsSNPs but it is rarely applicable, whereas massive comparative genomics
> + human variation frequencies along the protein is universally
> available. The latter data, though available already, will be a done
> deal in 2-3 years. Finally, it is better suited to computerized
> evaluation without subjective human intervention and so to personal
> genomic medicine.
>
> For the genome browser, while it is fine to provide rs9898090-type links
> to Chimera and LS-SNP on the details page, it is more useful to the
> visitor if we supplement the existing 46-way comparative genomics with
> the human variation frequencies (which can be done on the same display
> over the human line with logo sizes). So that is a top priority: to
> extract the naturally occurring amino acid variant frequencies from the
> many new studies above. The 1000 Genomes Project may be the only
> realistic source for that. It is complicated by ethnic group differences
> so needs stratification.
>
> With these data, we could precompute the dysfunctional effect of all 19
> possible aa substitutions at every site in the 9,000,000 aa proteome,
> and for other key species as well.
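
To give a sense of the size of that precomputation, here is a rough sketch in Python using only figures quoted in the message (20,000 genes, average length 450 aa, 19 alternative amino acids per site). The variable names are illustrative, and the count ignores isoforms and any filtering a real pipeline might apply.

```python
# Figures quoted in the message; names are illustrative only.
genes = 20_000          # approximate number of human genes
mean_length_aa = 450    # average human protein length in amino acids

# Proteome size implied by those two figures.
proteome_aa = genes * mean_length_aa

# Each site can be replaced by any of the other 19 standard amino acids.
alternatives_per_site = 20 - 1
total_substitutions = proteome_aa * alternatives_per_site

print(proteome_aa)          # 9000000
print(total_substitutions)  # 171000000
```

The gene count and average length reproduce the 9,000,000 aa proteome figure, and the full table of single-residue substitutions comes to 171 million entries for human alone.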
>
>
>
> gene      coverage    id  cov*id (aa)  length (aa)
> TYMP          100%  100%          482          482
> TBC1D2         63%  100%          326          517
> PPARA          59%  100%          276          468
> MAPK11        100%   99%          360          364
> MAPK12        100%   99%          363          367
> ARSA           96%   99%          484          509
> SCO2           63%   99%          166          266
> BRD1           12%   99%          126         1058
> CHKB           98%   98%          379          395
> MIOX           87%   93%          231          285
> PIM3           96%   70%          219          326
> SBF1           40%   66%          500         1893
> PLXNB2         63%   50%          579         1838
> FBLN1          53%   45%          168          703
> MAPK8IP         7%   42%           23          797
> ACR            59%   41%          102          421
> HDAC10         63%   38%          160          669
> CELSR1         28%   38%          321         3014
> RABL2B         70%   36%           58          229
> SHANK3         12%   34%           71         1747
> CPT1B          75%   32%          185          772
> KLHDC7B        47%   30%           84          594
> MOV10L1        36%   30%          131         1211
> TUBGCP6        12%   28%           61         1819
> ADM2              0     0            0          148
> LMF2              0     0            0          707
> NCAPH2            0     0            0          606
> ODF3B             0     0            0          253
> C22orf41          0     0            0           88
> PPP6R2            0     0            0          959
> FAM116B           0     0            0          585
> SELO              0     0            0          669
> TRABD             0     0            0          376
> PANX2             0     0            0          677
> MLC1              0     0            0          377
> IL17REL           0     0            0          336
> CRELD2            0     0            0          402
> ALG12             0     0            0          488
> ZBED4             0     0            0         1171
> FAM19A5           0     0            0          132
> CERK              0     0            0          537
> GRAMD4            0     0            0          578
> TRMU              0     0            0          421
> GTSE1             0     0            0          739
> TTC38             0     0            0          469
> PKDREJ            0     0            0         2253
> RP4-695           0     0            0          219
> WNT7B             0     0            0          349
> ATXN10            0     0            0          475
> RIBC2             0     0            0          377
> ave             29%   31%          117          702
> chance of a SNP having any kind of coverage: 17%
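
The 17% figure appears to be the pooled ratio of the cov*id column to the length column in the table above. The Python sketch below recomputes it from the table values. It assumes that cov*id is a residue count (coverage x identity x length) and that SNPs fall uniformly along the protein, so longer genes weigh more; neither is stated outright in the message, and the variable names are illustrative.

```python
# (cov*id, length) in amino acids for the 24 genes with a PDB hit,
# copied from the table in table order.
hits = [
    (482, 482), (326, 517), (276, 468), (360, 364), (363, 367), (484, 509),
    (166, 266), (126, 1058), (379, 395), (231, 285), (219, 326), (500, 1893),
    (579, 1838), (168, 703), (23, 797), (102, 421), (160, 669), (321, 3014),
    (58, 229), (71, 1747), (185, 772), (84, 594), (131, 1211), (61, 1819),
]

# Lengths of the 26 genes with no hit (their cov*id is 0).
no_hit_lengths = [
    148, 707, 606, 253, 88, 959, 585, 669, 376, 677, 377, 336, 402,
    488, 1171, 132, 537, 578, 421, 739, 469, 2253, 219, 349, 475, 377,
]

n_genes = len(hits) + len(no_hit_lengths)
covered = sum(c for c, _ in hits)
total_length = sum(length for _, length in hits) + sum(no_hit_lengths)

print(f"genes: {n_genes}")
print(f"mean cov*id per gene: {covered / n_genes:.1f} aa")
print(f"mean length per gene: {total_length / n_genes:.1f} aa")
# Pooled (length-weighted) chance that a random residue is covered.
print(f"chance of coverage: {covered / total_length:.1%}")
```

Pooling over residues, rather than averaging per-gene percentages, matches the question being asked: a randomly placed SNP is more likely to land in a long protein than a short one.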
>
>
>
> _______________________________________________
> Genecats maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genecats
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
