Hello Mike,

The txStart and txEnd values are the global alignment coordinates for
the entire mRNA sequence (the "footprint"), including all aligned exons
and the introns between them. The cdsStart and cdsEnd values are also
global, but cover only the coding region of the gene. For any gene that
is non-coding, cdsStart and cdsEnd are equal, so the rows you found are
expected and not an artifact. Individual exon positions are defined in
the exonStarts and exonEnds fields.
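As a rough illustration of these fields, here is a minimal Python
sketch that reads one knownGene row and treats cdsStart == cdsEnd as
"no coding region". It assumes the row arrives as a dict keyed by
column name and that exonStarts/exonEnds are comma-separated strings
(possibly with a trailing comma). The Transcript type, the helper
names, and the sample rows are made up for the example.

```python
from typing import List, NamedTuple, Optional, Tuple


class Transcript(NamedTuple):
    """Hypothetical container for the coordinate fields of one row."""
    name: str
    tx: Tuple[int, int]              # txStart, txEnd: the whole footprint
    cds: Optional[Tuple[int, int]]   # None when cdsStart == cdsEnd
    exons: List[Tuple[int, int]]     # paired exonStarts / exonEnds


def parse_positions(field: str) -> List[int]:
    """Parse a comma-separated coordinate list such as "100,400,"."""
    return [int(part) for part in field.split(",") if part]


def read_transcript(row: dict) -> Transcript:
    """Build a Transcript from a dict keyed by knownGene column names."""
    starts = parse_positions(row["exonStarts"])
    ends = parse_positions(row["exonEnds"])
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds differ in length")

    cds_start, cds_end = int(row["cdsStart"]), int(row["cdsEnd"])
    # Equal CDS coordinates mark a non-coding transcript.
    cds = None if cds_start == cds_end else (cds_start, cds_end)

    return Transcript(
        name=row["name"],
        tx=(int(row["txStart"]), int(row["txEnd"])),
        cds=cds,
        exons=list(zip(starts, ends)),
    )


# Made-up rows, for illustration only.
sample_rows = [
    {"name": "txCoding", "txStart": 100, "txEnd": 500,
     "cdsStart": 150, "cdsEnd": 450,
     "exonStarts": "100,400,", "exonEnds": "200,500,"},
    {"name": "txNoncoding", "txStart": 100, "txEnd": 500,
     "cdsStart": 100, "cdsEnd": 100,
     "exonStarts": "100,400,", "exonEnds": "200,500,"},
]

for row in sample_rows:
    t = read_transcript(row)
    kind = "coding" if t.cds is not None else "non-coding"
    print(t.name, kind, t.exons)
```

Returning None for the CDS keeps the "no coding region" case explicit,
so later conversion code cannot mistake the placeholder coordinates for
a real zero-length interval.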
Another way to interpret these "equal cds" values is to think of them
as being "NULL". We don't actually use the mySQL NULL value here due to
some coding reasons, but for your use, perhaps this would be a good
translation value during the data conversion to GFF3 format.

Thanks,
Jennifer Jackson
UCSC Genome Bioinformatics Group

Michael Muratet wrote:
>
> On Jun 8, 2009, at 7:43 PM, Jennifer Jackson wrote:
>
>> Hello Mike,
>>
>> A brief explanation of the table contents and what is displayed in
>> the Browser:
>>
>> knownGene = the alignment of individual transcripts
>> knownIsoforms = groups these transcripts to define a cluster (gene bound)
>> knownCanonical = the single transcript from any cluster
>> (knownIsoforms.clusterID) chosen to represent the group
>
> Jennifer
>
> I see that 14K of 67K rows in knownGene have cdsStart=cdsEnd but
> txStart is never equal to txEnd. Is this an artifact?
>
> Thanks
>
> Mike
>
>>
>> These fields can be linked:
>> knownGene.name
>> knownIsoforms.transcript
>> knownCanonical.transcript
>> kgXref.kgID
>> kgAlias.kgID
>>
>> These last two tables, kgXref and kgAlias, link the UCSC Known Gene's
>> transcripts to other common gene names, symbols, and acronyms. The
>> Known Gene track's description goes into detail about data sources.
>> The "describe table schema" link in the Table browser from the
>> primary table "knownGene" also has this information (scroll to bottom
>> of page). The field kgXref.geneSymbol is what is displayed in the
>> Browser.
>>
>> It is expected that an external identifier would map to more than one
>> transcript. Ideally, these would represent variants and all be
>> assigned to the same cluster (gene bound). But, occasionally, an
>> external identifier may map to more than one cluster. There are no
>> constraints or attempts to curate external data/names/symbols. For an
>> example, one unusual case I examined recently was where a certain
>> gene acronym (symbol) was given to two obviously distinct genes
>> (different genomic locations, different functions, different
>> transcripts/proteins, etc). Both were "correct", but this can
>> obviously create confusion (both for programming and data analysis).
>> Therefore, we do not use transcript/gene names from any external
>> source as unique identifiers and instead create/use our own and just
>> link in the external data.
>>
>> Hopefully this information will help you to navigate the tables and
>> create your file. We do have a GTF format output from the table
>> browser (same as GFF v2, see our FAQ about file formats for more
>> info). There is no GFF3 format (yet) for complicated reasons.
>> Creating your own, for your own use, is an excellent idea since you
>> will then know exactly what the data represents.
>>
>> Jennifer Jackson
>> UCSC Genome Bioinformatics Group
>>
>> Michael Muratet wrote:
>>> Greetings
>>>
>>> I am trying to create a GFF3-formatted file from knownGene,
>>> knownIsoforms and knownCanonical. (Most importantly, has anyone
>>> already done this?) I'm using the mySQL server directly, it's the
>>> easiest for me and should not be a burden to the server (but let me
>>> know). I see the join between knownGene and knownIsoforms on the
>>> name and transcript fields, but I'm looking for a gene name that is
>>> common to all the rows in knownIsoforms and I'm not finding it. For
>>> the cases I've examined, knownCanonical contains the same
>>> information as knownGenes. If I look in the browser at a member
>>> from clusterId=2, say uc001aac.2 I can see that it has the synonym
>>> FLJ0038 as do several others, but then all the other names are
>>> different. (These are pseudogenes and may therefore be a bad
>>> example.) Is there a field in a table somewhere that has the
>>> necessary one-to-many relationship between 'gene name' and
>>> clusterId? Am I misinterpreting knownIsoforms?
>>>
>>> Thanks
>>>
>>> Mike
>>>
>>> Michael Muratet, Ph.D.
>>> Senior Scientist
>>> HudsonAlpha Institute for Biotechnology
>>> [email protected]
>>> (256) 327-0473 (p)
>>> (256) 327-0966 (f)
>>>
>>> Room 4005
>>> 601 Genome Way
>>> Huntsville, Alabama 35806
>>>
>>> _______________________________________________
>>> Genome maillist - [email protected]
>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
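
The table links listed in the quoted message (knownGene.name,
knownIsoforms.transcript, knownCanonical.transcript, kgXref.kgID) can
be sketched as a single join. The example below is only an
illustration: it uses an in-memory SQLite database as a stand-in for
the MySQL server, creates toy tables holding just the columns named in
the thread, and fills them with made-up transcript IDs and symbols. The
real tables have more columns, and the clusterId spelling follows
Mike's message.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create toy tables with only the columns named in the thread."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE knownGene (name TEXT);
        CREATE TABLE knownIsoforms (clusterId INTEGER, transcript TEXT);
        CREATE TABLE knownCanonical (clusterId INTEGER, transcript TEXT);
        CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT);

        -- Made-up identifiers, for illustration only.
        INSERT INTO knownGene VALUES ('txA.1'), ('txB.1'), ('txC.1');
        INSERT INTO knownIsoforms VALUES
            (1, 'txA.1'), (1, 'txB.1'), (2, 'txC.1');
        INSERT INTO knownCanonical VALUES (1, 'txA.1'), (2, 'txC.1');
        INSERT INTO kgXref VALUES
            ('txA.1', 'GENE1'), ('txB.1', 'GENE1'), ('txC.1', 'GENE2');
    """)
    return conn


# One row per transcript: its cluster, its displayed symbol, and whether
# it is the transcript chosen to represent the cluster.
QUERY = """
    SELECT i.clusterId,
           g.name,
           x.geneSymbol,
           (c.transcript = g.name) AS isCanonical
    FROM knownGene AS g
    JOIN knownIsoforms AS i ON i.transcript = g.name
    JOIN knownCanonical AS c ON c.clusterId = i.clusterId
    LEFT JOIN kgXref AS x ON x.kgID = g.name
    ORDER BY i.clusterId, g.name
"""


def cluster_rows(conn: sqlite3.Connection) -> list:
    """Return (clusterId, transcript, geneSymbol, isCanonical) tuples."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in cluster_rows(build_demo_db()):
        print(row)
```

The LEFT JOIN keeps transcripts that have no kgXref entry. Because
external symbols are not curated and may map to more than one cluster,
clusterId rather than geneSymbol is the safer grouping key when
building gene-level records.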
