I may have sent in some version of items 7-9 previously; they are included here 
because they concern the same gene details page. 

1. The 'Orthologous Genes' table has lumped C. elegans and S. cerevisiae into a 
single column. The code may just need another <td> in each row. However, 
yesterday zebrafish and D. melanogaster were also lumped into a single table 
cell, so more may be going on (see attached graphic).

2. The 'Orthologous Genes' table is really questionable when it comes to fly, 
worm and yeast. Orthology has a long-standing fixed definition. That definition 
is not best-reciprocal Blastp -- a sloppy proxy that we use because of 
computational convenience. Orthology is already difficult to establish 
computationally between human and mouse (see below) and seldom reliable outside 
of mammals without extensive manual curation. 

With fly, worm and yeast, there is almost never syntenic support. As with 
human, these species have experienced large numbers of gene gains and losses; 
yeast had an old whole genome duplication. This gives rise to sets of paralogs 
with highly variable rates of evolution and dramatic changes in function, 
making them very difficult to compare. 

The table should say 'best reciprocal Blastp' (BRBp for brevity) in the table 
cells if that's all it is, not 'ortholog'. If there is nothing to put in the 
cell, '---' is perhaps better than text clutter.
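
A minimal sketch of what 'best reciprocal Blastp' means operationally, and of the cell labelling suggested above. The input layout (a dict of bit scores per query) and all function names are assumptions made for illustration; this is not the code behind the table.

```python
from typing import Dict, Optional

# Hypothetical input: for each query protein, {subject protein: Blastp bit score}.
Hits = Dict[str, Dict[str, float]]


def best_hit(hits: Hits, query: str) -> Optional[str]:
    """Return the top-scoring subject for a query, or None if it has no hits."""
    scores = hits.get(query)
    if not scores:
        return None
    return max(scores, key=scores.get)


def reciprocal_best_hits(a_vs_b: Hits, b_vs_a: Hits) -> Dict[str, str]:
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    pairs = {}
    for a in a_vs_b:
        b = best_hit(a_vs_b, a)
        if b is not None and best_hit(b_vs_a, b) == a:
            pairs[a] = b
    return pairs


def table_cell(pairs: Dict[str, str], gene: str) -> str:
    """Label the match as BRBp rather than 'ortholog'; use '---' when empty."""
    return f"{pairs[gene]} (BRBp)" if gene in pairs else "---"
```

Note that a pair passing this test is only a best reciprocal match; nothing in it checks synteny or gene history.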

3. Also in the 'Orthologous Genes' table, it says "Orthologies between human, 
mouse, and rat are computed by taking the best BLASTP hit..." Here the target 
protein database has to be specified. For example, I'm not seeing a rat 
ortholog entry for ERI3 despite 98% identity between rat, mouse and human. It 
has been represented since 2008 at GenBank as "AAI67080  unknown protein 
[Rattus norvegicus]".

Rat annotation at NCBI appears to be a total joke. Even slam dunks like ERI3 
have not been given a RefSeq in the 7 years since the genome was released. 
Obviously NCBI has no intention of annotating rat. However it is correctly 
annotated in the browser by RGD Genes 2. ERI3 is not listed at rat GeneSorter, 
suggesting RefSeq was used there too but not RGD Genes 2. Since there is no 
entry in the 'Orthologous Genes' table, we seem to have used only RefSeq and 
not consulted RGD Genes 2 or GenBank. That should be stated because otherwise 
visitors will assume we looked around for rat gene annotation.

So here we are, providing visitors with an obsolete, misleading product that will 
never be fixed by updates. I recommend either doing rat right or dropping it 
from this table since its orthologs including ERI3 are already done correctly 
at Protein Fasta.  

What I am really writing about here is the slow accrual of legacy crud. We have 
to be very careful about hosting orphaned products that people have lost 
interest in. Automatic updating will accomplish nothing here since NCBI is not 
updating rat RefSeq. There is no trajectory converging on correctness.  

4. Also in the 'Orthologous Genes' table, not ok: "Note that the absence of an 
ortholog in the table below may reflect incomplete annotations in the other 
species rather than a true absence of the orthologous gene." The previous 
sentence just said we used reciprocal best-Blastp. That suffices. If one of the 
species lacks a gene model, reciprocal best-Blastp obviously could not have 
been conducted. 

Worm, fly and yeast are small genomes that were exhaustively annotated many 
years ago. It is impossible that any protein with significant homology to human 
has been overlooked. That's because they repeatedly do Blastx of their entire 
genome to all proteins in GenBank. Meanwhile, human has been over-annotated: 
not just the genuine coding genes but thousands of pseudogenes and junk 
transcripts. Human may have lost a few hundred genes but these will still show 
up as good worm/fly/yeast matches in the other vertebrates.

This sentence should be replaced with "The absence of an ortholog or best 
reciprocal Blastp entry in the table below may reflect a genuine lack of 
candidates, multiple paralogous matches of indistinguishable low quality, 
sub-threshold percent identity (<25% or whatever was used), less than 
full-length matches (<80% or whatever was used) with chimeric domain proteins, 
unannotated pseudogene debris, or gene loss in the clade representative used." 
(An example of the last would be URAH, pseudogene debris in human but with 
good orthologs between gorilla and mouse). These tables are only useful to 
visitors if they know how they were made.
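
To make the point concrete, here is a hedged sketch of how an empty cell could carry its reason. The 25% and 80% cutoffs are just the placeholders from the proposed sentence ("or whatever was used"), the tie margin is invented, and the `Hit` type and `explain_cell` name are hypothetical. Pseudogene debris and gene loss cannot be detected from a hit list alone, so they are not covered.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str
    percent_identity: float  # 0-100
    query_coverage: float    # 0-100, share of the query covered by the match


# Placeholder thresholds; the values actually used for the table are not known.
MIN_IDENTITY = 25.0
MIN_COVERAGE = 80.0
TIE_MARGIN = 2.0  # assumed: top hits closer than this are "indistinguishable"


def explain_cell(hits: List[Hit]) -> str:
    """Return the best subject, or the reason the table cell is empty."""
    if not hits:
        return "no candidates"
    passing = [h for h in hits if h.percent_identity >= MIN_IDENTITY]
    if not passing:
        return "sub-threshold percent identity"
    full = [h for h in passing if h.query_coverage >= MIN_COVERAGE]
    if not full:
        return "less than full-length match"
    ranked = sorted(full, key=lambda h: h.percent_identity, reverse=True)
    if len(ranked) > 1 and ranked[0].percent_identity - ranked[1].percent_identity < TIE_MARGIN:
        return "multiple indistinguishable paralogous matches"
    return ranked[0].subject
```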

5. Zebrafish has to be treated differently from the others in the 'Orthologous 
Genes' table. First, it has no GeneSorter. One solution is to drop it entirely: 
the browser annotation is awful; the genome assembly has dragged on for a 
decade, never attaining high quality. Protein divergence to human is generally 
high; lineage-specific gene family expansion is rampant; syntenic retention is 
rare. It is redundant in this table because we already have a whole genome 
alignment best-guess at Protein Fasta. And everything is a model organism 
today. 

The biggest problem is that whole genome duplication makes a meaningful choice 
of ortholog systematically impractical. What visitors want here, in view of the 
whole genome duplication, is whether both copies were retained. If only one 
copy was retained, orthologous correspondence to human is clear. If both copies 
were retained, the correspondence becomes very murky (co-orthology).  We need 
to double up on the zebrafish column. Best Blastp does not work in this 
situation. What we are doing now -- picking one, not mentioning the second -- 
is a disservice to the visitor.
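
A small sketch of what 'doubling up' the zebrafish column could look like. It assumes the list of retained zebrafish copies has already been worked out by some other means; the function name and the gene names in the usage note are hypothetical.

```python
from typing import List


def zebrafish_cell(copies: List[str]) -> str:
    """Show every retained zebrafish copy instead of silently picking one."""
    if not copies:
        return "---"
    if len(copies) == 1:
        return f"{copies[0]} (single copy retained)"
    return " / ".join(sorted(copies)) + " (co-orthologs)"
```

For instance, `zebrafish_cell(["genexb", "genexa"])` gives `genexa / genexb (co-orthologs)`, which tells the visitor at a glance that the correspondence to human is not one-to-one.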

6. The worst bug of all is in the 'Orthologous Genes' table. We did not really 
filter out non-syntenic hits in mouse and rat. "Filtering out of non-syntenic 
hits" should be changed to what we actually did here operationally. Visitors 
cannot use our material without an explanation of methods. 

I provide below a counter-example, PRDM9, that proves syntenic filtering was 
not done (see attached graphic). The attached mouse and human browser 
screenshots show human and mouse 'PRDM9' are not remotely syntenic. The 
flanking genes do not correspond. This is actually the mouse ortholog of human 
PRDM7.

Synteny ('same thread') refers to conserved adjacency of potentially 
rearrangeable genes or other features on a chromosome. The minimum unit is two 
genes. With anything less, there is a great risk of cross-matching paralogs and 
larger segmental duplications.

Synteny does not refer to parts of a single gene such as exons, introns, and 
promoter regions because these are not units that can be routinely shuffled by 
chromosomal rearrangements with retention of function. 

Synteny does not refer to best whole-genome-alignment of two species, 
restricted to a single gene and its internal nucleotides. That is called 
best-Blastn.

I'm guessing the procedure used took the comparative genomics track in the 
human browser, i.e. the one that shows the mouse chromosomes, then intersected 
it with the mouse gene table. That won't work; it is a faulty algorithm. There 
have to be multiple genes in the contiguous patch of syntenic chromosome, each 
normally a best Blastp match.
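
A rough sketch of a flank-based synteny check along those lines: a candidate is accepted only if at least one flanking human gene has its own best Blastp partner among the genes flanking the mouse candidate. Gene orders are given as simple lists along one chromosome; the window size, the minimum support, and all names are assumptions made for illustration.

```python
from typing import Dict, List, Set


def neighbors(order: List[str], gene: str, window: int) -> Set[str]:
    """Genes within `window` positions on either side of `gene`."""
    i = order.index(gene)
    return set(order[max(0, i - window):i] + order[i + 1:i + 1 + window])


def syntenic_support(human_order: List[str], mouse_order: List[str],
                     best_hit: Dict[str, str], human_gene: str,
                     window: int = 2) -> int:
    """Count human flanking genes whose mouse best hit flanks the mouse candidate."""
    mouse_gene = best_hit.get(human_gene)
    if (mouse_gene is None or mouse_gene not in mouse_order
            or human_gene not in human_order):
        return 0
    mouse_flank = neighbors(mouse_order, mouse_gene, window)
    human_flank = neighbors(human_order, human_gene, window)
    return sum(1 for g in human_flank if best_hit.get(g) in mouse_flank)


def is_syntenic(human_order: List[str], mouse_order: List[str],
                best_hit: Dict[str, str], human_gene: str,
                window: int = 2, min_support: int = 1) -> bool:
    """Two genes is the minimum unit: the candidate plus one supporting neighbor."""
    support = syntenic_support(human_order, mouse_order, best_hit, human_gene, window)
    return support >= min_support
```

A candidate whose flanking genes do not correspond at all, as in the PRDM9 screenshots, would score zero support and be filtered out.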

7. When a visitor has landed by whatever route on a gene description page, a 
natural thing to do next is visit GeneSorter (banner menu). Here GeneSorter 
should default to the gene on the description page. It does not; it defaults 
only to the GeneSorter gateway page. The visitor then has to go back, scrape 
off the gene name, forward to the gateway page, move the mouse to the text box, 
paste the gene name, hit return. This is inconsistent with our overall 'smart' 
interface that carries database fields along with page clicks and inserts them 
appropriately. 

In the 'Orthologous Genes' table on this same page, clicking on GeneSorter in 
the mouse ortholog column already does the right thing. It's just human that is 
broken.

8. Protein Fasta desperately needs to be renamed. Its current name does not 
describe what it is. Visitors do not have time to explore cryptic links. This 
is causing one of the most important pages on the entire browser to be greatly 
under-utilized. Please change to Aligned Orthologs or another appropriately 
descriptive name (check with Brian Rainey).

9. When a visitor has landed on a gene description page, another natural thing 
to do next is visit Protein Fasta. Here the browser should display the gene 
name as well as the uc index number. For example, "Human Gene RHOT1 
(uc002hgw.3) Description and Page Index" should go over to "Protein Alignments 
for Human Gene RHOT1 (uc002hgw.3)" on the Protein Fasta page. Right now, it 
goes over to just "Protein Alignments for knownGene uc002hgw.3". 

Here the visitor harvests the fasta sequences, then has to go back a window, 
scrape off the gene name, and search and replace the uc name with the gene name 
in the harvested sequences, which should have emphasized the gene name to begin 
with. (The uc name is just a redundant in-house indexing system; gene names mean 
something to biomedical researchers.) The current set-up is again inconsistent 
with our overall 'smart' interface.
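
For illustration, this is the manual search-and-replace the visitor currently has to do, sketched as a few lines of code. It assumes only that the uc identifier appears somewhere in each fasta header line; the header layout in the usage note and the function name are hypothetical.

```python
def relabel_fasta(fasta_text: str, uc_id: str, gene_name: str) -> str:
    """Prefix the uc identifier with the gene name in every fasta header line."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Keep the uc id for traceability, but lead with the gene name.
            line = line.replace(uc_id, f"{gene_name}|{uc_id}")
        out.append(line)
    return "\n".join(out)
```

Called as `relabel_fasta(text, "uc002hgw.3", "RHOT1")`, a header such as `>uc002hgw.3_human` would become `>RHOT1|uc002hgw.3_human`. If the page emitted the gene name itself, none of this would be needed.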



_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
