Thanks so much Tom. Great input!! -D
On 1/24/12 2:48 PM, Vanessa Kirkup Swing wrote:
> Hi Tom,
>
> Thank you for your input on the gene details pages. I have passed on
> your recommendations to our engineers.
>
> Vanessa Kirkup Swing
> UCSC Genome Bioinformatics Group
>
> ---------- Forwarded message ----------
> From: thomas pringle<[email protected]>
> Date: Sun, Jan 22, 2012 at 9:03 AM
> Subject: [Genome] 9 bugs on gene description page
> To: [email protected]
> Cc: David Haussler<[email protected]>, Donna Karolchik
> <[email protected]>
>
> I may have sent in some version of 7-9 previously, included here since
> on the same gene details page.
>
> 1. The 'Orthologous Genes' table has lumped C. elegans and S. cerevisiae
> into a single column. The code may just need another <td> in each row.
> However, yesterday zebrafish and D. melanogaster were also lumped into
> a single table cell so more may be going on. (see attached graphic).
>
> 2. The 'Orthologous Genes' table is really questionable when it comes
> to fly, worm and yeast. Orthology has a long-standing fixed
> definition. That definition is not best-reciprocal Blastp -- a sloppy
> proxy that we use because of computational convenience. Orthology is
> already difficult to establish computationally between human and mouse
> (see below) and seldom reliable outside of mammals without extensive
> manual curation.
>
> With fly, worm and yeast, there is almost never syntenic support. As
> with human, these species have experienced large numbers of gene gains
> and losses; yeast had an old whole genome duplication. This gives rise
> to sets of paralogs with highly variable rates of evolution and
> dramatic changes in function, making them very difficult to compare.
>
> The table should say 'best reciprocal Blastp' (BRBp for brevity) in
> the table cells if that's all it is, not 'ortholog'. If there is
> nothing to put in the cell, '---' is perhaps better than text clutter.
>
> 3. Also in the 'Orthologous Genes' table, it says "Orthologies between
> human, mouse, and rat are computed by taking the best BLASTP hit..."
> Here the target protein database has to be specified. For example, I'm
> not seeing a rat ortholog entry for ERI3 despite 98% identity between
> rat, mouse and human. It has been represented since 2008 at GenBank as
> "AAI67080 unknown protein [Rattus norvegicus]".
>
> Rat annotation at NCBI appears to be a total joke. Even slamdunks like
> ERI3 have not been given a RefSeq in the 7 years since the genome was
> released. Obviously NCBI has no intention of annotating rat. However
> it is correctly annotated in the browser by RGD Genes 2. ERI3 is not
> listed at rat GeneSorter, suggesting RefSeq was used there too but not
> RGD Genes 2. Since there is no entry in the 'Orthologous Genes' table,
> we seem to have used only RefSeq and not consulted RGD Genes 2 or
> GenBank. That should be stated because otherwise visitors will assume
> we looked around for rat gene annotation.
>
> So here we are, providing visitors with obsolete, misleading product
> that will never be fixed by updates. I recommend either doing rat
> right or dropping it from this table since its orthologs including
> ERI3 are already done correctly at Protein Fasta.
>
> What I am really writing about here is the slow accrual of legacy
> crud. We have to be very careful about hosting orphaned products that
> people have lost interest in. Automatic updating will accomplish
> nothing here since NCBI is not updating rat RefSeq. There is no
> trajectory converging on correctness.
>
> 4. Also in the 'Orthologous Genes' table, not ok: "Note that the
> absence of an ortholog in the table below may reflect incomplete
> annotations in the other species rather than a true absence of the
> orthologous gene." The previous sentence just said we used reciprocal
> best-Blastp. That suffices.
> If one of the species lacks a gene model,
> reciprocal best-Blastp obviously could not have been conducted.
>
> Worm, fly and yeast are small genomes that were exhaustively annotated
> many years ago. It is impossible that any protein with significant
> homology to human has been overlooked. That's because they repeatedly
> do Blastx of their entire genome to all proteins in GenBank.
> Meanwhile, human has been over-annotated: not just the genuine coding
> genes but thousands of pseudogenes and junk transcripts. Human may
> have lost a few hundred genes but these will still show up as good
> worm/fly/yeast matches in the other vertebrates.
>
> This sentence should be replaced with "The absence of an ortholog or
> best reciprocal Blastp entry in the table below may reflect a genuine
> lack of candidates, multiple paralogous matches of indistinguishable
> low quality, sub-threshold percent identity (<25% or whatever was
> used), less than full-length matches (<80% or whatever was used) with
> chimeric domain proteins, unannotated pseudogene debris, or gene loss
> in the clade representative used." (An example of the latter would be
> URAH, pseudogene debris in human but with good orthologs between
> gorilla and mouse). These tables are only useful to visitors if they
> know how they were made.
>
> 5. Zebrafish has to be treated differently from the others in the
> 'Orthologous Genes' table. First, it has no GeneSorter. One solution
> is to drop it entirely: the browser annotation is awful; the genome
> assembly has dragged on for a decade, never attaining high quality.
> Protein divergence to human is generally high; lineage-specific gene
> family expansion is rampant; syntenic retention is rare. It is
> redundant in this table because we already have a whole genome
> alignment best-guess at Protein Fasta. And everything is a model
> organism today.
>
> The biggest problem is that whole genome duplication makes a
> meaningful choice of ortholog systemically impractical.
> What visitors
> want here, in view of the whole genome duplication, is whether both
> copies were retained. If only one copy was retained, orthologous
> correspondence to human is clear. If both copies were retained, the
> correspondence becomes very murky (co-orthology). We need to double
> up on the zebrafish column. Best Blastp does not work in this
> situation. What we are doing now -- picking one, not mentioning the
> second -- is a disservice to the visitor.
>
> 6. The worst bug of all is in the 'Orthologous Genes' table. We did
> not really filter out non-syntenic hits in mouse and rat. "Filtering
> out of non-syntenic hits" should be changed to what we actually did
> here operationally. Visitors cannot use our material without an
> explanation of methods.
>
> I provide below a counter-example, PRDM9, that proves syntenic
> filtering was not done (see attached graphic). The attached mouse and
> human browser screenshots show human and mouse 'PRDM9' are not
> remotely syntenic. The flanking genes do not correspond. This is
> actually the mouse ortholog of human PRDM7.
>
> Synteny ('same thread') refers to conserved adjacency of potentially
> rearrangeable genes or other features on a chromosome. The minimum
> unit is two genes. There is a great risk of cross-matching paralogs
> and larger segmental duplications.
>
> Synteny does not refer to parts of a single gene such as exons,
> introns, and promoter regions because these are not units that can be
> routinely shuffled by chromosomal rearrangements with retention of
> function.
>
> Synteny does not refer to best whole-genome-alignment of two species,
> restricted to a single gene and its internal nucleotides. That is
> called best-Blastn.
>
> I'm guessing the procedure used took the comparative genomics track in
> the human browser, i.e. the one that shows the mouse chrs, then
> intersected with the mouse gene table. That won't work, faulty
> algorithm.
> There have to be multiple genes in the contiguous patch of
> syntenic chr, each normally best-blastp.
>
> 7. When a visitor has landed by whatever route on a gene description
> page, a natural thing to do next is visit GeneSorter (banner menu).
> Here GeneSorter should default to the gene on the description page. It
> does not, it defaults only to the GeneSorter gateway page. The visitor
> then has to go back, scrape off the gene name, forward to the gateway
> page, move the mouse to the text box, paste the gene name, hit return.
> This is inconsistent with our overall 'smart' interface that carries
> database fields along with page clicks and inserts them appropriately.
>
> In the 'Orthologous Genes' table on this same page, clicking on
> GeneSorter in the mouse ortholog column already does the right thing.
> It's just human that is broken.
>
> 8. Protein Fasta desperately needs to be renamed. Its current name
> does not describe what it is. Visitors do not have time to explore
> cryptic links. This is causing one of the most important pages on the
> entire browser to be greatly under-utilized. Please change to Aligned
> Orthologs or another appropriately descriptive name (check w Brian
> Rainey).
>
> 9. When a visitor has landed on a gene description page, another
> natural thing to do next is visit Protein Fasta. Here the browser
> should display the gene name as well as the uc index number. For
> example, "Human Gene RHOT1 (uc002hgw.3) Description and Page Index"
> should go over to "Protein Alignments for Human Gene RHOT1
> (uc002hgw.3)" on the Protein Fasta page. Right now, it goes over to
> just "Protein Alignments for knownGene uc002hgw.3".
>
> Here the visitor harvests the fasta sequences, then has to go back a
> window, scrape off the gene name, search and replace the uc name with
> the gene name in the harvested sequences which should have emphasized
> the gene name to begin with.
> (The uc name is just a redundant in-house
> indexing system; gene names mean something to biomedical researchers.)
> The current set-up is again inconsistent with our overall 'smart'
> interface.
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
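
Until item 9 is addressed, the search-and-replace step it describes can be scripted. This is only a minimal sketch under stated assumptions: the harvested sequences are assumed to be ordinary FASTA text whose header lines start with '>' and contain the uc ID somewhere in them (the exact header layout of the Protein Fasta output is not given in this thread), and relabel_fasta_headers is a hypothetical helper, not anything in the browser code.

```python
import re


def relabel_fasta_headers(fasta_text: str, uc_id: str, gene_name: str) -> str:
    """Prefix the gene name to the uc ID on FASTA header lines.

    Assumption: headers start with '>' and contain the uc ID verbatim.
    Sequence lines are passed through unchanged.
    """
    # Match the uc ID as a whole identifier: not preceded by a letter or
    # digit, and not followed by another digit (so uc002hgw.3 does not
    # match inside uc002hgw.30).
    pattern = re.compile(r"(?<![A-Za-z0-9])" + re.escape(uc_id) + r"(?!\d)")

    out_lines = []
    for line in fasta_text.splitlines(keepends=True):
        if line.startswith(">"):
            # Keep the uc ID and put the gene name in front of it.
            line = pattern.sub(lambda m: f"{gene_name}_{m.group(0)}", line)
        out_lines.append(line)
    return "".join(out_lines)
```

For the example in item 9 the call would be relabel_fasta_headers(text, "uc002hgw.3", "RHOT1"). The uc ID is kept after the gene name rather than dropped, so the result can still be traced back to the browser entry.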
