I may have sent in some version of items 7-9 previously; they are included here
since they concern the same gene details page.
1. The 'Orthologous Genes' table has lumped C. elegans and S. cerevisiae into a
single column. The code may just need another <td> in each row. However,
yesterday zebrafish and D. melanogaster were also lumped into a single table
cell, so more may be going on (see attached graphic).
2. The 'Orthologous Genes' table is really questionable when it comes to fly,
worm and yeast. Orthology has a long-standing fixed definition. That definition
is not best-reciprocal Blastp -- a sloppy proxy that we use because of
computational convenience. Orthology is already difficult to establish
computationally between human and mouse (see below) and seldom reliable outside
of mammals without extensive manual curation.
With fly, worm and yeast, there is almost never syntenic support. As with
human, these species have experienced large numbers of gene gains and losses;
yeast had an old whole genome duplication. This gives rise to sets of paralogs
with highly variable rates of evolution and dramatic changes in function,
making them very difficult to compare.
The table should say 'best reciprocal Blastp' (BRBp for brevity) in the table
cells if that's all it is, not 'ortholog'. If there is nothing to put in the
cell, '---' is perhaps better than text clutter.
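
For reference, best reciprocal Blastp amounts to no more than the following. This
is a minimal sketch in Python, assuming Blastp has already been run in both
directions and the hits reduced to (query, subject, bit score) tuples. The
function names, gene names and scores are invented for illustration; this is not
a description of the browser's actual pipeline.

```python
def best_hits(hits):
    """Map each query to its single highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a."""
    fwd = best_hits(a_vs_b)
    rev = best_hits(b_vs_a)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


# Toy data (invented): humanX and humanY are paralogs that both hit flyX best.
human_vs_fly = [("humanX", "flyX", 310.0), ("humanX", "flyY", 120.0),
                ("humanY", "flyX", 150.0)]
fly_vs_human = [("flyX", "humanX", 305.0), ("flyX", "humanY", 148.0),
                ("flyY", "humanX", 118.0)]

brbp = reciprocal_best_hits(human_vs_fly, fly_vs_human)
print(brbp)
```

Nothing in this procedure looks at synteny, gene trees or paralog history, which
is why the resulting pairs are better labelled BRBp than 'ortholog'.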
3. Also in the 'Orthologous Genes' table, it says "Orthologies between human,
mouse, and rat are computed by taking the best BLASTP hit..." Here the target
protein database has to be specified. For example, I'm not seeing a rat
ortholog entry for ERI3 despite 98% identity between rat, mouse and human. It
has been represented since 2008 at GenBank as "AAI67080 unknown protein
[Rattus norvegicus]".
Rat annotation at NCBI appears to be a total joke. Even slamdunks like ERI3
have not been given a RefSeq in the 7 years since the genome was released.
Obviously NCBI has no intention of annotating rat. However, ERI3 is correctly
annotated in the browser by RGD Genes 2. ERI3 is not listed at rat GeneSorter,
suggesting RefSeq was used there too but not RGD Genes 2. Since there is no
entry in the 'Orthologous Genes' table, we seem to have used only RefSeq and
not consulted RGD Genes 2 or GenBank. That should be stated because otherwise
visitors will assume we looked around for rat gene annotation.
So here we are, providing visitors with an obsolete, misleading product that
will never be fixed by updates. I recommend either doing rat right or dropping
it from this table, since its orthologs, including ERI3, are already done
correctly at Protein Fasta.
What I am really writing about here is the slow accrual of legacy crud. We have
to be very careful about hosting orphaned products that people have lost
interest in. Automatic updating will accomplish nothing here since NCBI is not
updating rat RefSeq. There is no trajectory converging on correctness.
4. Also in the 'Orthologous Genes' table, not ok: "Note that the absence of an
ortholog in the table below may reflect incomplete annotations in the other
species rather than a true absence of the orthologous gene." The previous
sentence just said we used reciprocal best-Blastp. That suffices. If one of the
species lacks a gene model, reciprocal best-Blastp obviously could not have
been conducted.
Worm, fly and yeast are small genomes that were exhaustively annotated many
years ago. It is impossible that any protein with significant homology to human
has been overlooked. That's because they repeatedly do Blastx of their entire
genome to all proteins in GenBank. Meanwhile, human has been over-annotated:
not just the genuine coding genes but thousands of pseudogenes and junk
transcripts. Human may have lost a few hundred genes but these will still show
up as good worm/fly/yeast matches in the other vertebrates.
This sentence should be replaced with "The absence of ortholog or best
reciprocal Blastp entry in the table below may reflect a genuine lack of
candidates, multiple paralogous matches of indistinguishable low quality,
sub-threshold percent identity (<25% or whatever was used), less than
full-length matches (<80% or whatever was used) with chimeric domain proteins,
unannotated pseudogene debris, or gene loss in the clade representative used."
(An example of the last would be URAH, pseudogene debris in human but with
good orthologs between gorilla and mouse). These tables are only useful to
visitors if they know how they were made.
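
Several of the reasons listed in the proposed sentence could be recorded
mechanically at the time the table is built. The sketch below is a hedged
illustration in Python: the 25% identity and 80% coverage defaults are just the
placeholder figures from the sentence above ("or whatever was used"), and the
`tie_margin` parameter is a hypothetical way of deciding that two paralogous
matches are indistinguishable. Pseudogene debris and gene loss cannot be read
off a hit list and would need separate evidence.

```python
def empty_cell_reason(hits, min_identity=25.0, min_coverage=80.0, tie_margin=5.0):
    """Return None if one acceptable hit stands out, else a short reason string.

    hits: list of (subject, percent_identity, percent_query_coverage, bit_score).
    All thresholds are illustrative assumptions, not the browser's settings.
    """
    if not hits:
        return "no candidates"
    passing = [h for h in hits if h[1] >= min_identity]
    if not passing:
        return "below identity threshold"
    full = [h for h in passing if h[2] >= min_coverage]
    if not full:
        return "partial-length matches only"
    full.sort(key=lambda h: h[3], reverse=True)
    if len(full) > 1 and full[0][3] - full[1][3] < tie_margin:
        return "multiple indistinguishable paralogs"
    return None


# Invented example: two paralogs scoring within tie_margin of each other.
print(empty_cell_reason([("a", 30.0, 90.0, 60.0), ("b", 30.0, 90.0, 58.0)]))
```

Showing the returned reason in the table cell, or in a tooltip, would tell the
visitor why the cell is empty instead of leaving them to guess.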
5. Zebrafish has to be treated differently from the others in the 'Orthologous
Genes' table. First, it has no GeneSorter. One solution is to drop it entirely:
the browser annotation is awful; the genome assembly has dragged on for a
decade, never attaining high quality. Protein divergence to human is generally
high; lineage-specific gene family expansion is rampant; syntenic retention is
rare. It is redundant in this table because we already have a whole genome
alignment best-guess at Protein Fasta. And everything is a model organism
today.
The biggest problem is that whole genome duplication makes a meaningful choice
of ortholog systemically impractical. What visitors want here, in view of the
whole genome duplication, is whether both copies were retained. If only one
copy was retained, orthologous correspondence to human is clear. If both copies
were retained, the correspondence becomes very murky (co-orthology). We need
to double up on the zebrafish column. Best Blastp does not work in this
situation. What we are doing now -- picking one, not mentioning the second --
is a disservice to the visitor.
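
One way to fill a doubled-up zebrafish column is sketched below in Python. It
assumes, purely for illustration, that a zebrafish gene counts as a retained
copy of a human gene when that human gene is its best Blastp hit; real
co-orthology calls would need more evidence than this. Gene names and scores
are invented.

```python
from collections import defaultdict


def zebrafish_copies(fish_vs_human):
    """Group zebrafish genes by their best human hit.

    fish_vs_human: list of (fish_gene, human_gene, bit_score).
    Returns {human_gene: [fish genes, best score first]}.
    """
    best = {}
    for fish, human, score in fish_vs_human:
        if fish not in best or score > best[fish][1]:
            best[fish] = (human, score)
    grouped = defaultdict(list)
    for fish, (human, score) in best.items():
        grouped[human].append((score, fish))
    return {h: [f for _, f in sorted(pairs, reverse=True)]
            for h, pairs in grouped.items()}


# Toy data (invented): geneA kept both duplicates, geneB kept one.
hits = [("fishA1", "geneA", 400.0), ("fishA2", "geneA", 380.0),
        ("fishA2", "geneB", 90.0), ("fishB", "geneB", 500.0)]
copies = zebrafish_copies(hits)
print(copies)
```

A human gene with two entries would get both listed in the table; a gene with a
single entry is the clear one-to-one case.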
6. The worst bug of all is in the 'Orthologous Genes' table. We did not really
filter out non-syntenic hits in mouse and rat. "Filtering out of non-syntenic
hits" should be changed to what we actually did here operationally. Visitors
cannot use our material without an explanation of methods.
I provide below a counter-example, PRDM9, that proves syntenic filtering was
not done (see attached graphic). The attached mouse and human browser
screenshots show human and mouse 'PRDM9' are not remotely syntenic. The
flanking genes do not correspond. This is actually the mouse ortholog of human
PRDM7.
Synteny ('same thread') refers to conserved adjacency of potentially
rearrangeable genes or other features on a chromosome. The minimum unit is two
genes. There is a great risk of cross-matching paralogs and larger segmental
duplications.
Synteny does not refer to parts of a single gene such as exons, introns, and
promoter regions because these are not units that can be routinely shuffled by
chromosomal rearrangements with retention of function.
Synteny does not refer to best whole-genome-alignment of two species,
restricted to a single gene and its internal nucleotides. That is called
best-Blastn.
I'm guessing the procedure used took the comparative genomics track in the
human browser, i.e. the one that shows the mouse chromosomes, then intersected
it with the mouse gene table. That won't work; it is a faulty algorithm. There
have to be multiple genes in the contiguous patch of syntenic chromosome, each
normally a best-Blastp match.
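
A gene-level synteny filter along these lines can be sketched as follows. This
is a minimal Python illustration under stated assumptions: each species is
represented by one ordered list of gene names standing in for chromosomal
order, `brbp` is a precomputed human-to-mouse best reciprocal Blastp mapping,
and the `window` and `min_shared` parameters are hypothetical knobs. All gene
names are invented; this is not the procedure the browser used.

```python
def neighbors(order, gene, window):
    """Genes within `window` positions of `gene` in an ordered gene list."""
    i = order.index(gene)
    return set(order[max(0, i - window):i] + order[i + 1:i + 1 + window])


def has_syntenic_support(gene, human_order, mouse_order, brbp,
                         window=2, min_shared=1):
    """True if enough human flanking genes map to genes flanking the mouse hit."""
    candidate = brbp.get(gene)
    if candidate is None or candidate not in mouse_order:
        return False
    mouse_flank = neighbors(mouse_order, candidate, window)
    mapped = {brbp[g] for g in neighbors(human_order, gene, window) if g in brbp}
    return len(mapped & mouse_flank) >= min_shared


# Toy data (invented): hX sits among the same neighbors in both species,
# while hP's mouse match mP is surrounded by unrelated genes q1..q4.
human_order = ["h1", "h2", "hX", "h3", "h4", "hP", "h5", "h6"]
mouse_order = ["m1", "m2", "mX", "m3", "m4", "q1", "q2", "mP", "q3", "q4"]
brbp = {"h1": "m1", "h2": "m2", "hX": "mX", "h3": "m3", "h4": "m4",
        "hP": "mP", "h5": "m5", "h6": "m6"}

print(has_syntenic_support("hX", human_order, mouse_order, brbp))
print(has_syntenic_support("hP", human_order, mouse_order, brbp))
```

The point of the design is that the test is made on flanking genes, never on
the nucleotides of the gene itself, so a case like PRDM9 with non-corresponding
flanks would be rejected.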
7. When a visitor has landed by whatever route on a gene description page, a
natural thing to do next is visit GeneSorter (banner menu). Here GeneSorter
should default to the gene on the description page. It does not, it defaults
only to the GeneSorter gateway page. The visitor then has to go back, scrape
off the gene name, forward to the gateway page, move the mouse to the text box,
paste the gene name, hit return. This is inconsistent with our overall 'smart'
interface that carries database fields along with page clicks and inserts them
appropriately.
In the 'Orthologous Genes' table on this same page, clicking on GeneSorter in
the mouse ortholog column already does the right thing. It's just human that is
broken.
8. Protein Fasta desperately needs to be renamed. Its current name does not
describe what it is. Visitors do not have time to explore cryptic links. This
is causing one of the most important pages on the entire browser to be greatly
under-utilized. Please change to Aligned Orthologs or another appropriately
descriptive name (check with Brian Rainey).
9. When a visitor has landed on a gene description page, another natural thing
to do next is visit Protein Fasta. Here the browser should display the gene
name as well as the uc index number. For example, "Human Gene RHOT1
(uc002hgw.3) Description and Page Index" should go over to "Protein Alignments
for Human Gene RHOT1 (uc002hgw.3)" on the Protein Fasta page. Right now, it
goes over to just "Protein Alignments for knownGene uc002hgw.3".
Here the visitor harvests the fasta sequences, then has to go back a window,
scrape off the gene name, and search and replace the uc name with the gene name
in the harvested sequences, which should have emphasized the gene name to begin
with. (The uc name is just a redundant in-house indexing system; gene names mean
something to biomedical researchers.) The current set-up is again inconsistent
with our overall 'smart' interface.
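
Until that is fixed, the search and replace can at least be scripted. The Python
sketch below assumes the harvested fasta headers begin with the uc identifier
(for example ">uc002hgw.3_hg19"; the exact header layout is an assumption) and
that the visitor supplies the id-to-name mapping. The sequence shown is a
placeholder, not the real RHOT1 protein.

```python
import re


def relabel_fasta(fasta_text, gene_names):
    """Prefix the gene symbol to header lines that start with a known uc id."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = re.match(r">(uc\d+[a-z]+\.\d+)", line)
            if match and match.group(1) in gene_names:
                line = ">" + gene_names[match.group(1)] + "_" + line[1:]
        out.append(line)
    return "\n".join(out)


# Placeholder record; header format assumed, sequence invented.
harvested = ">uc002hgw.3_hg19\nACDEFGHIK"
relabeled = relabel_fasta(harvested, {"uc002hgw.3": "RHOT1"})
print(relabeled)
```

The uc identifier is kept after the gene name so nothing is lost for anyone who
still needs the in-house index.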