Hi Tom,

Thank you for your input on SNPs and the Genome Browser tracks. I have passed on your recommendation to supplement the existing 46-way comparative genomics track with the human variation frequencies to our engineers.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

On 12/8/11 9:03 AM, thomas pringle wrote:
> I read through the 324 abstracts Robert K sent for the 2011 ASHG
> meeting. As usual, a vast number of papers found coding changes in this
> or that gene to be the explanation for a disease condition, sequencing
> entire family genomes only to throw all the gigabases away at the end
> (except for that 1 bp they wanted).
>
> One option used for evaluating a non-synonymous SNP: if the gene has an
> available 3D structure from X-ray or NMR, the variant can be stubbed in,
> the structural effect evaluated, and the substitution characterized as
> bad or neutral (good is rare and much harder to prove).
>
> The problem here is the 'if'. As far as I know, nobody explicitly tracks
> how many of the 20,000 human genes have an experimentally determined
> structure, nor graphs how fast we are progressing per year toward 100%
> due to proteomics initiatives, for purposes of SNP evaluation. I would
> estimate an 80-100 year time frame on this the way things are going.
>
> This is because human proteins are large (ave 450 aa) relative to a
> typical structural determination (ave 260 aa over the 1,000 most recent
> PDB additions), and 20% or so are not soluble and may never be amenable
> to structural study.
>
> Below I took 50 genes at random and ran blastp on each individually
> against human and non-human PDB:
>
> -- 18% had a determined structure for the human query; however, these
> were seldom full length, resulting in 68% coverage on average
>
> -- these are somewhat over-weighted on disease genes (adjusted for
> incidence) but many structures have been determined for small enzymes of
> low clinical interest.
>
> -- another 30% had a determined structure for a human paralog; however,
> these had both low coverage and low id, resulting in 18% coverage of a
> given SNP requiring identity.
>
> -- very few situations were improved using structures from homologous
> proteins in other species.
>
> -- overall, the chance of a given SNP having any kind of structural
> coverage of the original amino acid was 17%, so if there was a 25%
> chance of the enveloping patch providing reliable structural evaluation
> of the SNP, this option works out roughly 4-5% of the time (after
> adjusting for much shorter protein length in the on-target structures).
> Ab initio calculations of structure are not at a point where they can
> affect SNP evaluation statistics.
>
> In summary, 3D structure is a nice tool in the toolbox to evaluate
> nsSNPs but it is rarely applicable, whereas massive comparative genomics
> + human variation frequencies along the protein is universally
> available. The latter data, though available already, will be a done
> deal in 2-3 years. Finally, it is better suited to computerized
> evaluation without subjective human intervention and so to personal
> genomic medicine.
>
> For the genome browser, while it is fine to provide rs9898090-type links
> to Chimera and LS-SNP on the details page, it is more useful to the
> visitor if we supplement the existing 46-way comparative genomics with
> the human variation frequencies (which can be done on the same display
> over the human line with logo sizes). So that is a top priority, to
> extract the naturally occurring amino acid variant frequencies from the
> many new studies above. The 1000 Genomes Project may be the only
> realistic source for that. It is complicated by ethnic group differences
> so needs stratification.
>
> With these data, we could precompute the dysfunctional effect of all 19
> possible aa substitutions at every site in the 9,000,000 aa proteome,
> and for other key species as well.
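
The back-of-envelope figures in the message can be reproduced with a few lines of Python. This is only a sketch of the arithmetic as stated: the 25% patch reliability is the hypothetical value from the "if" in the text, the constant names are made up for illustration, and the unspecified adjustment for shorter on-target structures is not modeled.

```python
# Figures quoted in the message; all names here are illustrative.
P_STRUCTURAL_COVERAGE = 0.17   # chance a SNP site has any structural coverage
P_PATCH_RELIABLE = 0.25        # hypothetical: chance the surrounding patch is usable

HUMAN_GENES = 20_000           # approximate human gene count from the message
MEAN_PROTEIN_LENGTH_AA = 450   # average human protein length from the message
ALTERNATIVE_RESIDUES = 19      # 20 standard amino acids minus the reference one


def usable_fraction(p_coverage: float, p_reliable: float) -> float:
    """Fraction of SNPs that structure could evaluate, treating both as independent."""
    return p_coverage * p_reliable


def substitutions_to_score(genes: int, mean_length: int, alternatives: int) -> int:
    """Number of single-residue substitutions in a precomputed proteome-wide table."""
    return genes * mean_length * alternatives


p_usable = usable_fraction(P_STRUCTURAL_COVERAGE, P_PATCH_RELIABLE)
proteome_sites = HUMAN_GENES * MEAN_PROTEIN_LENGTH_AA
table_size = substitutions_to_score(
    HUMAN_GENES, MEAN_PROTEIN_LENGTH_AA, ALTERNATIVE_RESIDUES
)

print(f"structure usable for about {p_usable:.2%} of SNPs")
print(f"proteome sites: {proteome_sites:,}")
print(f"substitutions to precompute: {table_size:,}")
```

The product lands in the quoted 4-5% range, and the proteome size agrees with the 9,000,000 aa figure; a precomputed table would hold 19 scores per site.
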
>
> gene      coverage  id    cov*id (aa)  length (aa)
> TYMP      100%      100%  482          482
> TBC1D2    63%       100%  326          517
> PPARA     59%       100%  276          468
> MAPK11    100%      99%   360          364
> MAPK12    100%      99%   363          367
> ARSA      96%       99%   484          509
> SCO2      63%       99%   166          266
> BRD1      12%       99%   126          1058
> CHKB      98%       98%   379          395
> MIOX      87%       93%   231          285
> PIM3      96%       70%   219          326
> SBF1      40%       66%   500          1893
> PLXNB2    63%       50%   579          1838
> FBLN1     53%       45%   168          703
> MAPK8IP   7%        42%   23           797
> ACR       59%       41%   102          421
> HDAC10    63%       38%   160          669
> CELSR1    28%       38%   321          3014
> RABL2B    70%       36%   58           229
> SHANK3    12%       34%   71           1747
> CPT1B     75%       32%   185          772
> KLHDC7B   47%       30%   84           594
> MOV10L1   36%       30%   131          1211
> TUBGCP6   12%       28%   61           1819
> ADM2      0         0     0            148
> LMF2      0         0     0            707
> NCAPH2    0         0     0            606
> ODF3B     0         0     0            253
> C22orf41  0         0     0            88
> PPP6R2    0         0     0            959
> FAM116B   0         0     0            585
> SELO      0         0     0            669
> TRABD     0         0     0            376
> PANX2     0         0     0            677
> MLC1      0         0     0            377
> IL17REL   0         0     0            336
> CRELD2    0         0     0            402
> ALG12     0         0     0            488
> ZBED4     0         0     0            1171
> FAM19A5   0         0     0            132
> CERK      0         0     0            537
> GRAMD4    0         0     0            578
> TRMU      0         0     0            421
> GTSE1     0         0     0            739
> TTC38     0         0     0            469
> PKDREJ    0         0     0            2253
> RP4-695   0         0     0            219
> WNT7B     0         0     0            349
> ATXN10    0         0     0            475
> RIBC2     0         0     0            377
> ave       29%       31%   117          702
>
> chance of a SNP having any kind of coverage: 17%
>
> _______________________________________________
> Genecats maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genecats
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
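
The table's derived numbers can be checked with a short Python sketch. It rests on two assumptions inferred from the rows, not stated in the message: that the "cov*id" column is coverage x identity x protein length (in residues), and that the 17% figure is total covered-and-identical residues divided by total protein length. The function and variable names are illustrative.

```python
# (gene, coverage %, identity %, cov*id residues as tabulated, length in aa)
HITS = [
    ("TYMP", 100, 100, 482, 482), ("TBC1D2", 63, 100, 326, 517),
    ("PPARA", 59, 100, 276, 468), ("MAPK11", 100, 99, 360, 364),
    ("MAPK12", 100, 99, 363, 367), ("ARSA", 96, 99, 484, 509),
    ("SCO2", 63, 99, 166, 266), ("BRD1", 12, 99, 126, 1058),
    ("CHKB", 98, 98, 379, 395), ("MIOX", 87, 93, 231, 285),
    ("PIM3", 96, 70, 219, 326), ("SBF1", 40, 66, 500, 1893),
    ("PLXNB2", 63, 50, 579, 1838), ("FBLN1", 53, 45, 168, 703),
    ("MAPK8IP", 7, 42, 23, 797), ("ACR", 59, 41, 102, 421),
    ("HDAC10", 63, 38, 160, 669), ("CELSR1", 28, 38, 321, 3014),
    ("RABL2B", 70, 36, 58, 229), ("SHANK3", 12, 34, 71, 1747),
    ("CPT1B", 75, 32, 185, 772), ("KLHDC7B", 47, 30, 84, 594),
    ("MOV10L1", 36, 30, 131, 1211), ("TUBGCP6", 12, 28, 61, 1819),
]

# Lengths (aa) of the 26 proteins with no PDB match, in table order.
NO_HIT_LENGTHS = [
    148, 707, 606, 253, 88, 959, 585, 669, 376, 677, 377, 336, 402,
    488, 1171, 132, 537, 578, 421, 739, 469, 2253, 219, 349, 475, 377,
]


def covered_residues(coverage_pct: int, identity_pct: int, length: int) -> int:
    """Residues both covered by a structure and identical to the query (assumed formula)."""
    return round(coverage_pct * identity_pct * length / 10_000)


def snp_coverage_chance(hits, no_hit_lengths) -> float:
    """Length-weighted chance that a random residue has identical structural coverage."""
    covered = sum(row[3] for row in hits)
    total_length = sum(row[4] for row in hits) + sum(no_hit_lengths)
    return covered / total_length


recomputed = [covered_residues(cov, ident, n) for _, cov, ident, _, n in HITS]
mismatches = [row[0] for row, value in zip(HITS, recomputed) if value != row[3]]
chance = snp_coverage_chance(HITS, NO_HIT_LENGTHS)

print("rows that disagree with the assumed formula:", mismatches)
print(f"chance of structural coverage for a random SNP: {chance:.0%}")
```

Weighting by length matters here: averaging the per-gene coverage and identity percentages separately (29% and 31%) would not give the same answer as dividing covered residues by total residues.
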
