Hello Mike,

Here is a brief explanation of the table contents and what is displayed 
in the Browser:

knownGene = the alignment of individual transcripts
knownIsoforms = groups these transcripts to define a cluster (gene bound)
knownCanonical = the single transcript from each cluster 
(knownIsoforms.clusterId) chosen to represent the group

These fields all hold the transcript identifier and can be linked:
  knownGene.name
  knownIsoforms.transcript
  knownCanonical.transcript
  kgXref.kgID
  kgAlias.kgID

These last two tables, kgXref and kgAlias, link the UCSC Known Gene 
transcripts to other common gene names, symbols, and acronyms. The Known 
Gene track description goes into detail about the data sources. The 
"describe table schema" link in the Table Browser, reached from the 
primary table "knownGene", also has this information (scroll to the 
bottom of the page). The field kgXref.geneSymbol is what is displayed 
in the Browser.
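
As a rough sketch of how these links fit together (not our actual 
pipeline), the Python below builds tiny in-memory SQLite stand-ins for 
four of the tables and joins them on the fields listed above. The 
ucTest* names, the GENEA/GENEB symbols, and the coordinates are all made 
up, and only the columns needed for the join are included. A similar 
SELECT should work against the MySQL server, but treat that as an 
assumption to verify.

```python
import sqlite3

# Toy stand-ins for the UCSC tables. Every row is invented for
# illustration; the real tables have more columns than shown here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE knownGene (name TEXT, chrom TEXT, strand TEXT,
                        txStart INT, txEnd INT);
CREATE TABLE knownIsoforms (clusterId INT, transcript TEXT);
CREATE TABLE knownCanonical (clusterId INT, transcript TEXT);
CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT);
INSERT INTO knownGene VALUES
  ('ucTest1.1', 'chr1', '+', 1000, 5000),
  ('ucTest2.1', 'chr1', '+', 1200, 5200),
  ('ucTest3.1', 'chr2', '-', 300, 900);
INSERT INTO knownIsoforms VALUES
  (1, 'ucTest1.1'), (1, 'ucTest2.1'), (2, 'ucTest3.1');
INSERT INTO knownCanonical VALUES (1, 'ucTest1.1'), (2, 'ucTest3.1');
INSERT INTO kgXref VALUES
  ('ucTest1.1', 'GENEA'), ('ucTest2.1', 'GENEA'), ('ucTest3.1', 'GENEB');
""")

# One row per transcript: its cluster, its display symbol, and a flag
# saying whether it is the canonical transcript of that cluster.
QUERY = """
SELECT i.clusterId, g.name, x.geneSymbol,
       CASE WHEN c.transcript = g.name THEN 1 ELSE 0 END AS isCanonical
FROM knownGene g
JOIN knownIsoforms i ON i.transcript = g.name
JOIN knownCanonical c ON c.clusterId = i.clusterId
LEFT JOIN kgXref x ON x.kgID = g.name
ORDER BY i.clusterId, g.name
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The cluster ID, rather than any symbol, is the value shared by every 
row of a group, so it is the natural key for a one-to-many relationship 
between a gene and its transcripts.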

It is expected that an external identifier will map to more than one 
transcript. Ideally, these transcripts represent variants and are all 
assigned to the same cluster (gene bound). Occasionally, though, an 
external identifier maps to more than one cluster. We place no 
constraints on external data, names, or symbols and make no attempt to 
curate them. For example, in one unusual case I examined recently, the 
same gene acronym (symbol) had been given to two obviously distinct 
genes (different genomic locations, different functions, different 
transcripts/proteins, etc.). Both were "correct", but this can 
obviously create confusion, both for programming and for data analysis. 
Therefore, we do not use transcript/gene names from any external source 
as unique identifiers; instead we create and use our own and just link 
in the external data.
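
To see how often this happens in your own copy of the data, a query 
along the following lines may help. It is only a sketch on made-up rows 
(the ucTest* names and GENEA/GENEB symbols are hypothetical): GENEA maps 
to two transcripts in one cluster, which is the expected case, while 
GENEB is spread over two clusters.

```python
import sqlite3

# Toy data only: GENEA stays within cluster 1, GENEB spans clusters 2 and 3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE knownIsoforms (clusterId INT, transcript TEXT);
CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT);
INSERT INTO knownIsoforms VALUES
  (1, 'ucTest1.1'), (1, 'ucTest2.1'), (2, 'ucTest3.1'), (3, 'ucTest4.1');
INSERT INTO kgXref VALUES
  ('ucTest1.1', 'GENEA'), ('ucTest2.1', 'GENEA'),
  ('ucTest3.1', 'GENEB'), ('ucTest4.1', 'GENEB');
""")

# List every symbol that is attached to more than one cluster.
AMBIGUOUS = """
SELECT x.geneSymbol, COUNT(DISTINCT i.clusterId) AS nClusters
FROM kgXref x
JOIN knownIsoforms i ON i.transcript = x.kgID
GROUP BY x.geneSymbol
HAVING COUNT(DISTINCT i.clusterId) > 1
ORDER BY x.geneSymbol
"""
ambiguous = conn.execute(AMBIGUOUS).fetchall()
print(ambiguous)
```

Any symbol this returns cannot safely serve as a gene-level key on its 
own.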

Hopefully this information will help you navigate the tables and 
create your file. We do have GTF format output from the Table Browser 
(same as GFF v2; see our FAQ about file formats for more info). There is 
no GFF3 output (yet), for complicated reasons. Creating your own, for 
your own use, is an excellent idea since you will then know exactly what 
the data represents.
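
If it helps as a starting point, here is one possible sketch of turning 
joined rows into GFF3 gene and transcript lines. Everything specific in 
it is an assumption rather than an official format: the function name, 
the "clusterN" gene IDs, taking the gene span as the minimum start and 
maximum end of the member transcripts, and the "transcript" feature 
type. Exon and CDS features are left out. The one fixed point is the 
coordinate shift: our tables store 0-based starts, while GFF3 starts are 
1-based, so 1 is added to each start.

```python
from collections import defaultdict

def cluster_to_gff3(transcripts, source="knownGene"):
    """Build GFF3 lines from (clusterId, name, chrom, strand, txStart, txEnd)
    tuples. Hypothetical helper; gene IDs of the form 'clusterN' are our
    own invention for this sketch."""
    clusters = defaultdict(list)
    for row in transcripts:
        clusters[row[0]].append(row)

    lines = ["##gff-version 3"]
    for cluster_id in sorted(clusters):
        members = clusters[cluster_id]
        chrom, strand = members[0][2], members[0][3]
        # Gene span = union of member transcripts; +1 converts the
        # 0-based table start to a 1-based GFF3 start.
        start = min(m[4] for m in members) + 1
        end = max(m[5] for m in members)
        gene_id = f"cluster{cluster_id}"
        lines.append("\t".join([chrom, source, "gene", str(start), str(end),
                                ".", strand, ".", f"ID={gene_id}"]))
        for _, name, _, _, tx_start, tx_end in sorted(members, key=lambda m: m[1]):
            lines.append("\t".join([chrom, source, "transcript",
                                    str(tx_start + 1), str(tx_end),
                                    ".", strand, ".",
                                    f"ID={name};Parent={gene_id}"]))
    return lines

# Made-up rows for one cluster with two transcripts.
rows = [
    (1, "ucTest1.1", "chr1", "+", 1000, 5000),
    (1, "ucTest2.1", "chr1", "+", 1200, 5200),
]
gff = cluster_to_gff3(rows)
print("\n".join(gff))
```

Using the cluster as the Parent keeps the gene-to-transcript link 
independent of any external symbol; the symbol from kgXref could be 
added as a Name attribute if you want it for display.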

Jennifer Jackson
UCSC Genome Bioinformatics Group

Michael Muratet wrote:
> Greetings
>
> I am trying to create a GFF3-formatted file from knownGene,  
> knownIsoforms and knownCanonical. (Most importantly, has anyone  
> already done this?) I'm using the MySQL server directly; it's the  
> easiest for me and should not be a burden to the server (but let me  
> know). I see the join between knownGene and knownIsoforms on the name  
> and transcript fields, but I'm looking for a gene name that is common  
> to all the rows in knownIsoforms and I'm not finding it. For the cases  
> I've examined, knownCanonical contains the same information as  
> knownGene. If I look in the browser at a member from clusterId=2, say  
> uc001aac.2 I can see that it has the synonym FLJ0038 as do several  
> others, but then all the other names are different. (These are  
> pseudogenes and may therefore be a bad example.) Is there a field in a  
> table somewhere that has the necessary one-to-many relationship  
> between 'gene name' and clusterId? Am I misinterpreting knownIsoforms?
>
> Thanks
>
> Mike
>
> Michael Muratet, Ph.D.
> Senior Scientist
> HudsonAlpha Institute for Biotechnology
> [email protected]
> (256) 327-0473 (p)
> (256) 327-0966 (f)
>
> Room 4005
> 601 Genome Way
> Huntsville, Alabama 35806
>
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>   