On Jun 8, 2009, at 7:43 PM, Jennifer Jackson wrote:

> Hello Mike,
>
> A brief explanation of the table contents and what is displayed in
> the Browser:
>
> knownGene = the alignment of individual transcripts
> knownIsoforms = groups these transcripts to define a cluster (gene
> bound)
> knownCanonical = the single transcript from any cluster
> (knownIsoforms.clusterID) chosen to represent the group

Jennifer

I see that 14K of 67K rows in knownGene have cdsStart=cdsEnd, but
txStart is never equal to txEnd. Is this an artifact?

Thanks

Mike

>
>
> These fields can be linked:
> knownGene.name
> knownIsoforms.transcript
> knownCanonical.transcript
> kgXref.kgID
> kgAlias.kgID
>
> These last two tables, kgXref and kgAlias, link the UCSC Known
> Gene's transcripts to other common gene names, symbols, and
> acronyms. The Known Gene track's description goes into detail about
> data sources. The "describe table schema" link in the Table browser
> from the primary table "knownGene" also has this information (scroll
> to bottom of page). The field kgXref.geneSymbol is what is displayed
> in the Browser.
>
> It is expected that an external identifier would map to more than
> one transcript. Ideally, these would represent variants and all be
> assigned to the same cluster (gene bound). But, occasionally, an
> external identifier may map to more than one cluster. There are no
> constraints or attempts to curate external data/names/symbols. For
> an example, one unusual case I examined recently was where a certain
> gene acronym (symbol) was given to two obviously distinct genes
> (different genomic locations, different functions, different
> transcripts/proteins, etc). Both were "correct", but this can
> obviously create confusion (both for programming and data analysis).
> Therefore, we do not use transcript/gene names from any external
> source as unique identifiers and instead create/use our own and just
> link in the external data.
>
> Hopefully this information will help you to navigate the tables and
> create your file. We do have a GTF format output from the table
> browser (same as GFF v2, see our FAQ about file formats for more
> info). There is no GFF3 format (yet) for complicated reasons.
> Creating your own, for your own use, is an excellent idea since you
> will then know exactly what the data represents.
>
> Jennifer Jackson
> UCSC Genome Bioinformatics Group
>
> Michael Muratet wrote:
>> Greetings
>>
>> I am trying to create a GFF3-formatted file from knownGene,
>> knownIsoforms and knownCanonical. (Most importantly, has anyone
>> already done this?) I'm using the MySQL server directly; it's the
>> easiest for me and should not be a burden to the server (but let
>> me know). I see the join between knownGene and knownIsoforms on
>> the name and transcript fields, but I'm looking for a gene name
>> that is common to all the rows in knownIsoforms and I'm not
>> finding it. For the cases I've examined, knownCanonical contains
>> the same information as knownGene. If I look in the browser at a
>> member from clusterId=2, say uc001aac.2, I can see that it has the
>> synonym FLJ0038, as do several others, but then all the other names
>> are different. (These are pseudogenes and may therefore be a bad
>> example.) Is there a field in a table somewhere that has the
>> necessary one-to-many relationship between 'gene name' and
>> clusterId? Am I misinterpreting knownIsoforms?
>>
>> Thanks
>>
>> Mike
>>
>> Michael Muratet, Ph.D.
>> Senior Scientist
>> HudsonAlpha Institute for Biotechnology
>> [email protected]
>> (256) 327-0473 (p)
>> (256) 327-0966 (f)
>>
>> Room 4005
>> 601 Genome Way
>> Huntsville, Alabama 35806
>>
>> _______________________________________________
>> Genome maillist - [email protected]
>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
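
For readers following the thread, the table links Jennifer lists (knownGene.name, knownIsoforms.transcript, knownCanonical.transcript, kgXref.kgID) and the cdsStart=cdsEnd count Mike mentions can be sketched roughly as below. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 with an in-memory database instead of the UCSC MySQL server, the table definitions are cut down to the columns named in the thread, and every row (tx1, tx2, tx3, GENEA, GENEB) is invented toy data rather than anything from the real tables.

```python
import sqlite3
from collections import defaultdict


def build_toy_db():
    """Create an in-memory database with simplified, hypothetical copies
    of the tables discussed in the thread (toy rows, not UCSC data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE knownGene (name TEXT, txStart INT, txEnd INT,
                                cdsStart INT, cdsEnd INT);
        CREATE TABLE knownIsoforms (clusterId INT, transcript TEXT);
        CREATE TABLE knownCanonical (transcript TEXT);
        CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT);
    """)
    conn.executemany("INSERT INTO knownGene VALUES (?, ?, ?, ?, ?)", [
        ("tx1", 100, 900, 200, 800),
        ("tx2", 100, 950, 950, 950),     # toy row with cdsStart == cdsEnd
        ("tx3", 5000, 7000, 5100, 6900),
    ])
    conn.executemany("INSERT INTO knownIsoforms VALUES (?, ?)",
                     [(1, "tx1"), (1, "tx2"), (2, "tx3")])
    conn.executemany("INSERT INTO knownCanonical VALUES (?)",
                     [("tx1",), ("tx3",)])
    conn.executemany("INSERT INTO kgXref VALUES (?, ?)",
                     [("tx1", "GENEA"), ("tx2", "GENEA"), ("tx3", "GENEB")])
    return conn


# Join on the fields listed in the thread; LEFT JOINs keep transcripts
# that lack a symbol or are not the canonical member of their cluster.
CLUSTER_QUERY = """
    SELECT i.clusterId, g.name, x.geneSymbol,
           c.transcript IS NOT NULL AS isCanonical
    FROM knownGene g
    JOIN knownIsoforms i ON i.transcript = g.name
    LEFT JOIN kgXref x ON x.kgID = g.name
    LEFT JOIN knownCanonical c ON c.transcript = g.name
    ORDER BY i.clusterId, g.name
"""


def transcripts_by_cluster(conn):
    """Map clusterId -> list of (transcript, geneSymbol, is_canonical)."""
    clusters = defaultdict(list)
    for cluster_id, name, symbol, is_canonical in conn.execute(CLUSTER_QUERY):
        clusters[cluster_id].append((name, symbol, bool(is_canonical)))
    return dict(clusters)


def count_equal_cds(conn):
    """Count knownGene rows where cdsStart equals cdsEnd."""
    query = "SELECT COUNT(*) FROM knownGene WHERE cdsStart = cdsEnd"
    return conn.execute(query).fetchone()[0]


conn = build_toy_db()
for cluster_id, members in transcripts_by_cluster(conn).items():
    print(cluster_id, members)
print("rows with cdsStart = cdsEnd:", count_equal_cds(conn))
```

Grouping on clusterId rather than on geneSymbol follows Jennifer's point that external names are not used as unique identifiers: a symbol may appear in more than one cluster, so the cluster is the safer key when building per-gene records for a GFF3 file.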
