Thanks so much Tom. Great input!!  -D

On 1/24/12 2:48 PM, Vanessa Kirkup Swing wrote:
> Hi Tom,
>
> Thank you for your input on the gene details pages.  I have passed on
> your recommendations to our
> engineers.
>
> Vanessa Kirkup Swing
> UCSC Genome Bioinformatics Group
>
>
>
> ---------- Forwarded message ----------
> From: thomas pringle<[email protected]>
> Date: Sun, Jan 22, 2012 at 9:03 AM
> Subject: [Genome] 9 bugs on gene description page
> To: [email protected]
> Cc: David Haussler<[email protected]>, Donna Karolchik
> <[email protected]>
>
>
> I may have sent in some version of 7-9 previously, included here since
> on the same gene details page.
>
> 1. The 'Orthologous Genes' table has lumped C. elegans and S. cerevisiae
> into a single column. The code may just need another <td> in each row.
> However, yesterday zebrafish and D. melanogaster were also lumped into
> a single table cell so more may be going on. (see attached graphic).
>
> 2. The 'Orthologous Genes' table is really questionable when it comes
> to fly, worm and yeast. Orthology has a long-standing fixed
> definition. That definition is not best-reciprocal Blastp -- a sloppy
> proxy that we use because of computational convenience. Orthology is
> already difficult to establish computationally between human and mouse
> (see below) and seldom reliable outside of mammals without extensive
> manual curation.
>
> With fly, worm and yeast, there is almost never syntenic support. As
> with human, these species have experienced large numbers of gene gains
> and losses; yeast had an old whole genome duplication. This gives rise
> to sets of paralogs with highly variable rates of evolution and
> dramatic changes in function, making them very difficult to compare.
>
> The table should say 'best reciprocal Blastp' (BRBp for brevity) in
> the table cells if that's all it is, not 'ortholog'. If there is
> nothing to put in the cell, '---' is perhaps better than text clutter.
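
For reference, a minimal sketch of what 'best reciprocal Blastp' means
operationally, and of the suggested cell labels. It is not the code
behind the table. The input layout (a dict of bit scores keyed by
query/subject pair) and all function names are assumptions made for
illustration.

```python
def best_hits(scores):
    """Map each query to its top-scoring subject.

    scores: dict of {(query, subject): bit_score} (assumed layout).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each protein is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


def cell_label(gene, rbh_pairs):
    """Table cell text: 'BRBp: <partner>' or '---' when nothing qualifies."""
    partners = dict(rbh_pairs)
    return f"BRBp: {partners[gene]}" if gene in partners else "---"
```

A gene whose best hit prefers a different partner gets '---' rather
than being labelled an ortholog.
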
>
> 3. Also in the 'Orthologous Genes' table, it says "Orthologies between
> human, mouse, and rat are computed by taking the best BLASTP hit..."
> Here the target protein database has to be specified. For example, I'm
> not seeing a rat ortholog entry for ERI3 despite 98% identity between
> rat, mouse and human. It has been represented since 2008 at GenBank as
> "AAI67080  unknown protein [Rattus norvegicus]".
>
> Rat annotation at NCBI appears to be a total joke. Even slamdunks like
> ERI3 have not been given a RefSeq in the 7 years since the genome was
> released. Obviously NCBI has no intention of annotating rat. However
> it is correctly annotated in the browser by RGD Genes 2. ERI3 is not
> listed at rat GeneSorter, suggesting RefSeq was used there too but not
> RGD Genes 2. Since there is no entry in the 'Orthologous Genes' table,
> we seem to have used only RefSeq and not consulted RGD Genes 2 or
> GenBank. That should be stated because otherwise visitors will assume
> we looked around for rat gene annotation.
>
> So here we are, providing visitors with obsolete, misleading product
> that will never be fixed by updates. I recommend either doing rat
> right or dropping it from this table since its orthologs including
> ERI3 are already done correctly at Protein Fasta.
>
> What I am really writing about here is the slow accrual of legacy
> crud. We have to be very careful about hosting orphaned products that
> people have lost interest in. Automatic updating will accomplish
> nothing here since NCBI is not updating rat RefSeq. There is no
> trajectory converging on correctness.
>
> 4. Also in the 'Orthologous Genes' table, not ok: "Note that the
> absence of an ortholog in the table below may reflect incomplete
> annotations in the other species rather than a true absence of the
> orthologous gene." The previous sentence just said we used reciprocal
> best-Blastp. That suffices. If one of the species lacks a gene model,
> reciprocal best-Blastp obviously could not have been conducted.
>
> Worm, fly and yeast are small genomes that were exhaustively annotated
> many years ago. It is impossible that any protein with significant
> homology to human has been overlooked. That's because they repeatedly
> do Blastx of their entire genome to all proteins in GenBank.
> Meanwhile, human has been over-annotated: not just the genuine coding
> genes but thousands of pseudogenes and junk transcripts. Human may
> have lost a few hundred genes but these will still show up as good
> worm/fly/yeast matches in the other vertebrates.
>
> This sentence should be replaced with "The absence of an ortholog or best
> reciprocal Blastp entry in the table below may reflect a genuine lack
> of candidates, multiple paralogous matches of indistinguishably low
> quality, sub-threshold percent identity (<25% or whatever was used),
> less than full-length matches (<80% or whatever was used) with
> chimeric domain proteins, unannotated pseudogene debris, or gene loss
> in the clade representative used." (An example of the latter would be
> URAH, pseudogene debris in human but with good orthologs between
> gorilla and mouse). These tables are only useful to visitors if they
> know how they were made.
>
> 5. Zebrafish has to be treated differently from the others in the
> 'Orthologous Genes' table. First, it has no GeneSorter. One solution
> is to drop it entirely: the browser annotation is awful; the genome
> assembly has dragged on for a decade, never attaining high quality.
> Protein divergence to human is generally high; lineage-specific gene
> family expansion is rampant; syntenic retention is rare. It is
> redundant in this table because we already have a whole genome
> alignment best-guess at Protein Fasta. And everything is a model
> organism today.
>
> The biggest problem is that whole genome duplication makes a
> meaningful choice of ortholog systemically impractical. What visitors
> want here, in view of the whole genome duplication, is whether both
> copies were retained. If only one copy was retained, orthologous
> correspondence to human is clear. If both copies were retained, the
> correspondence becomes very murky (co-orthology).  We need to double
> up on the zebrafish column. Best Blastp does not work in this
> situation. What we are doing now -- picking one, not mentioning the
> second -- is a disservice to the visitor.
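
A rough sketch of 'doubling up' the zebrafish column: report two
candidates when a second hit scores close to the first, instead of
silently picking one. The score-ratio cutoff (min_ratio) is a
hypothetical stand-in. Showing that both duplicates were actually
retained would need more evidence than Blastp scores.

```python
def zebrafish_cells(hits, min_ratio=0.8):
    """Return a pair of zebrafish candidates for one human gene.

    hits: list of (zebrafish_gene, bit_score) tuples, in any order.
    min_ratio: hypothetical cutoff; a second copy is reported when its
    score is at least this fraction of the top score.
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    if not ranked:
        return ("---", "---")
    top_gene, top_score = ranked[0]
    if len(ranked) > 1 and ranked[1][1] >= min_ratio * top_score:
        # Two comparable candidates: possible co-orthologs, show both.
        return (top_gene, ranked[1][0])
    return (top_gene, "---")
```
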
>
> 6. The worst bug of all is in the 'Orthologous Genes' table. We did
> not really filter out non-syntenic hits in mouse and rat. "Filtering
> out of non-syntenic hits" should be changed to what we actually did
> here operationally. Visitors cannot use our material without an
> explanation of methods.
>
> I provide below a counter-example, PRDM9, that proves syntenic
> filtering was not done (see attached graphic). The attached mouse and
> human browser screenshots show human and mouse 'PRDM9' are not
> remotely syntenic. The flanking genes do not correspond. This is
> actually the mouse ortholog of human PRDM7.
>
> Synteny ('same thread') refers to conserved adjacency of potentially
> rearrangeable genes or other features on a chromosome. The minimum
> unit is two genes. There is a great risk of cross-matching paralogs
> and larger segmental duplications.
>
> Synteny does not refer to parts of a single gene such as exons,
> introns, and promoter regions because these are not units that can be
> routinely shuffled by chromosomal rearrangements with retention of
> function.
>
> Synteny does not refer to best whole-genome-alignment of two species,
> restricted to a single gene and its internal nucleotides. That is
> called best-Blastn.
>
> I'm guessing the procedure used took the comparative genomics track in
> the human browser, i.e. the one that shows the mouse chrs, then
> intersected with the mouse gene table. That won't work, faulty
> algorithm. There have to be multiple genes in the contiguous patch of
> syntenic chr, each normally best-blastp.
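
A minimal sketch of a synteny filter in the sense described above: a
candidate pair counts as syntenic only if at least one flanking gene
also has its best hit among the flanking genes in the other species.
The gene-order lists, the window size, and the function names are
assumptions for illustration, not the browser's actual procedure.

```python
def flanking(order, gene, window):
    """Genes within `window` positions of `gene` in a chromosome gene order."""
    i = order.index(gene)
    return set(order[max(0, i - window):i] + order[i + 1:i + 1 + window])


def has_syntenic_support(human_gene, mouse_gene, human_order, mouse_order,
                         best_hit, window=3, min_shared=1):
    """True if enough human neighbors have best hits among mouse neighbors.

    best_hit: dict mapping a human gene to its best-Blastp mouse gene.
    min_shared=1 matches the 'minimum unit is two genes' rule: the
    candidate pair itself plus at least one corresponding neighbor.
    """
    human_nb = flanking(human_order, human_gene, window)
    mouse_nb = flanking(mouse_order, mouse_gene, window)
    shared = {g for g in human_nb if best_hit.get(g) in mouse_nb}
    return len(shared) >= min_shared
```

Under this check, a pair whose flanking genes do not correspond at all
would be filtered out no matter how good the single-gene alignment is.
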
>
> 7. When a visitor has landed by whatever route on a gene description
> page, a natural thing to do next is visit GeneSorter (banner menu).
> Here GeneSorter should default to the gene on the description page. It
> does not, it defaults only to the GeneSorter gateway page. The visitor
> then has to go back, scrape off the gene name, forward to the gateway
> page, move the mouse to the text box, paste the gene name, hit return.
> This is inconsistent with our overall 'smart' interface that carries
> database fields along with page clicks and inserts them appropriately.
>
> In the 'Orthologous Genes' table on this same page, clicking on
> GeneSorter in the mouse ortholog column already does the right thing.
> It's just human that is broken.
>
> 8. Protein Fasta desperately needs to be renamed. Its current name
> does not describe what it is. Visitors do not have time to explore
> cryptic links. This is causing one of the most important pages on the
> entire browser to be greatly under-utilized. Please change to Aligned
> Orthologs or another appropriately descriptive name (check w Brian
> Rainey).
>
> 9. When a visitor has landed on a gene description page, another
> natural thing to do next is visit Protein Fasta. Here the browser
> should display the gene name as well as the uc index number. For
> example, "Human Gene RHOT1 (uc002hgw.3) Description and Page Index"
> should go over to "Protein Alignments for Human Gene RHOT1
> (uc002hgw.3)" on the Protein Fasta page. Right now, it goes over to
> just "Protein Alignments for knownGene uc002hgw.3".
>
> Here the visitor harvests the fasta sequences, then has to go back a
> window, scrape off the gene name, search and replace the uc name with
> the gene name in the harvested sequences, which should have emphasized
> the gene name to begin with. (The uc name is just a redundant inhouse
> indexing system; gene names mean something to biomedical researchers.)
> The current set-up is again inconsistent with our overall 'smart'
> interface.
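
Until the page carries the gene name itself, the manual
search-and-replace described above can be scripted. This is a small
sketch that assumes the uc identifier appears verbatim in each FASTA
header line. The exact header layout on the Protein Fasta page is not
given here, so the sample headers in the test are made up.

```python
def relabel_fasta(fasta_text, uc_id, gene_name):
    """Prefix the gene name to the uc identifier in FASTA header lines.

    Only lines starting with '>' are changed; sequence lines pass
    through untouched.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            line = line.replace(uc_id, f"{gene_name}_{uc_id}")
        out.append(line)
    return "\n".join(out)
```

For example, relabel_fasta(text, "uc002hgw.3", "RHOT1") turns a header
containing uc002hgw.3 into one containing RHOT1_uc002hgw.3.
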
>
>
>
>
>
>
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
