Hello Mike,

A brief explanation of the table contents and what is displayed in the Browser:

knownGene      = the alignment of individual transcripts
knownIsoforms  = groups these transcripts to define a cluster (gene bound)
knownCanonical = the single transcript from any cluster
                 (knownIsoforms.clusterID) chosen to represent the group

These fields can be linked:

knownGene.name
knownIsoforms.transcript
knownCanonical.transcript
kgXref.kgID
kgAlias.kgID

The last two tables, kgXref and kgAlias, link the UCSC Known Gene transcripts to other common gene names, symbols, and acronyms. The Known Gene track's description goes into detail about data sources. The "describe table schema" link in the Table Browser for the primary table "knownGene" also has this information (scroll to the bottom of the page).

The field kgXref.geneSymbol is what is displayed in the Browser.

It is expected that an external identifier will map to more than one transcript. Ideally, these transcripts would represent variants and all be assigned to the same cluster (gene bound). Occasionally, however, an external identifier may map to more than one cluster. There are no constraints on, or attempts to curate, external data/names/symbols.

For example, one unusual case I examined recently was where a certain gene acronym (symbol) was given to two obviously distinct genes (different genomic locations, different functions, different transcripts/proteins, etc.). Both were "correct", but this can obviously create confusion, both for programming and for data analysis. Therefore, we do not use transcript/gene names from any external source as unique identifiers; instead we create and use our own and just link in the external data.

Hopefully this information will help you navigate the tables and create your file. We do have GTF format output from the Table Browser (the same as GFF v2; see our FAQ about file formats for more info). There is no GFF3 format (yet), for complicated reasons. Creating your own, for your own use, is an excellent idea, since you will then know exactly what the data represents.
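
As a rough illustration of the linkage described above, the Python sketch below builds small in-memory SQLite stand-ins for four of the tables and joins them on the linked fields. Everything in it is an assumption made for the example: the transcript IDs and gene symbols are invented, only a few columns per table are included, and SQLite is used in place of the public MySQL server. It labels each cluster with the geneSymbol of its knownCanonical transcript, then lists symbols that land in more than one cluster.

```python
import sqlite3

# Toy stand-ins for the UCSC tables. Column subsets and all rows are
# invented for illustration; the real tables carry many more fields.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE knownGene      (name TEXT, chrom TEXT, txStart INTEGER, txEnd INTEGER);
CREATE TABLE knownIsoforms  (clusterId INTEGER, transcript TEXT);
CREATE TABLE knownCanonical (clusterId INTEGER, transcript TEXT);
CREATE TABLE kgXref         (kgID TEXT, geneSymbol TEXT);
""")

conn.executemany("INSERT INTO knownGene VALUES (?, ?, ?, ?)", [
    ("tx_a1", "chr1", 1000, 5000),
    ("tx_a2", "chr1", 1200, 5000),
    ("tx_b1", "chr1", 9000, 9800),
    ("tx_c1", "chr7", 400, 2500),
])
conn.executemany("INSERT INTO knownIsoforms VALUES (?, ?)", [
    (1, "tx_a1"), (1, "tx_a2"), (2, "tx_b1"), (3, "tx_c1"),
])
conn.executemany("INSERT INTO knownCanonical VALUES (?, ?)", [
    (1, "tx_a1"), (2, "tx_b1"), (3, "tx_c1"),
])
# GENEA is deliberately given to transcripts in two unrelated clusters.
conn.executemany("INSERT INTO kgXref VALUES (?, ?)", [
    ("tx_a1", "GENEA"), ("tx_a2", "GENEA"),
    ("tx_b1", "GENEB"), ("tx_c1", "GENEA"),
])

# Every transcript, its cluster, and the symbol of the cluster's
# canonical transcript (one label per cluster).
rows = conn.execute("""
    SELECT i.clusterId, i.transcript, g.chrom, cx.geneSymbol
    FROM knownIsoforms i
    JOIN knownGene g      ON g.name = i.transcript
    JOIN knownCanonical c ON c.clusterId = i.clusterId
    JOIN kgXref cx        ON cx.kgID = c.transcript
    ORDER BY i.clusterId, i.transcript
""").fetchall()

# Symbols that map to more than one cluster, so they cannot serve as keys.
ambiguous = conn.execute("""
    SELECT x.geneSymbol, COUNT(DISTINCT i.clusterId) AS nClusters
    FROM kgXref x
    JOIN knownIsoforms i ON i.transcript = x.kgID
    GROUP BY x.geneSymbol
    HAVING COUNT(DISTINCT i.clusterId) > 1
""").fetchall()

for row in rows:
    print(row)
print("symbols in more than one cluster:", ambiguous)
```

In this toy data the symbol GENEA appears in clusters 1 and 3, which is the situation described above and the reason a cluster ID, rather than an external symbol, is the safer grouping key when writing parent/child records to a file.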

Jennifer Jackson
UCSC Genome Bioinformatics Group

Michael Muratet wrote:
> Greetings
>
> I am trying to create a GFF3-formatted file from knownGene,
> knownIsoforms and knownCanonical. (Most importantly, has anyone
> already done this?) I'm using the mySQL server directly, it's the
> easiest for me and should not be a burden to the server (but let me
> know). I see the join between knownGene and knownIsoforms on the name
> and transcript fields, but I'm looking for a gene name that is common
> to all the rows in knownIsoforms and I'm not finding it. For the cases
> I've examined, knownCanonical contains the same information as
> knownGenes. If I look in the browser at a member from clusterId=2, say
> uc001aac.2 I can see that it has the synonym FLJ0038 as do several
> others, but then all the other names are different. (These are
> pseudogenes and may therefore be a bad example.) Is there a field in a
> table somewhere that has the necessary one-to-many relationship
> between 'gene name' and clusterId? Am I misinterpreting knownIsoforms?
>
> Thanks
>
> Mike
>
> Michael Muratet, Ph.D.
> Senior Scientist
> HudsonAlpha Institute for Biotechnology
> [email protected]
> (256) 327-0473 (p)
> (256) 327-0966 (f)
>
> Room 4005
> 601 Genome Way
> Huntsville, Alabama 35806
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
