On Jun 8, 2009, at 7:43 PM, Jennifer Jackson wrote:

> Hello Mike,
>
> A brief explanation of the table contents and what is displayed in  
> the Browser:
>
> knownGene = the alignment of individual transcripts
> knownIsoforms = groups these transcripts to define a cluster (gene
> bound)
> knownCanonical = the single transcript from any cluster  
> (knownIsoforms.clusterID) chosen to represent the group

Jennifer

I see that 14K of 67K rows in knownGene have cdsStart=cdsEnd but  
txStart is never equal to txEnd. Is this an artifact?

Thanks

Mike
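
The count described above can be reproduced with a short sketch. This is
only an illustration: it uses an in-memory SQLite database as a stand-in
for the MySQL server, the rows are made up (not real UCSC data), and the
helper names build_demo_db and count_equal_bounds are hypothetical. Only
the knownGene column names come from the message.

```python
import sqlite3

# Made-up rows: (name, txStart, txEnd, cdsStart, cdsEnd). Not real UCSC data.
DEMO_ROWS = [
    ("tx_a", 100, 900, 150, 800),      # cdsStart != cdsEnd
    ("tx_b", 1000, 2000, 1000, 1000),  # cdsStart == cdsEnd
    ("tx_c", 3000, 4000, 4000, 4000),  # cdsStart == cdsEnd
]


def build_demo_db(rows=DEMO_ROWS):
    """Create an in-memory stand-in for knownGene (assumed minimal schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE knownGene ("
        "name TEXT, txStart INTEGER, txEnd INTEGER, "
        "cdsStart INTEGER, cdsEnd INTEGER)"
    )
    conn.executemany("INSERT INTO knownGene VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def count_equal_bounds(conn):
    """Return (total rows, rows with cdsStart = cdsEnd, rows with txStart = txEnd)."""
    row = conn.execute(
        "SELECT COUNT(*), "
        "COALESCE(SUM(cdsStart = cdsEnd), 0), "
        "COALESCE(SUM(txStart = txEnd), 0) "
        "FROM knownGene"
    ).fetchone()
    return tuple(row)


if __name__ == "__main__":
    print(count_equal_bounds(build_demo_db()))
```

The same SELECT should run unchanged against MySQL, since a comparison
evaluates to 0 or 1 there as well, but that has not been checked against
the public server here.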

>
>
> These fields can be linked:
> knownGene.name
> knownIsoforms.transcript
> knownCanonical.transcript
> kgXref.kgID
> kgAlias.kgID
>
> These last two tables, kgXref and kgAlias, link the UCSC Known  
> Gene's transcripts to other common gene names, symbols, and  
> acronyms. The Known Gene track's description goes into detail about  
> data sources. The "describe table schema" link in the Table browser  
> from the primary table "knownGene" also has this information (scroll  
> to bottom of page). The field kgXref.geneSymbol is what is displayed  
> in the Browser.
>
> It is expected that an external identifier would map to more than  
> one transcript. Ideally, these would represent variants and all be  
> assigned to the same cluster (gene bound). But, occasionally, an  
> external identifier may map to more than one cluster. There are no  
> constraints or attempts to curate external data/names/symbols. For
> example, one unusual case I examined recently was where a certain
> gene acronym (symbol) was given to two obviously distinct genes  
> (different genomic locations, different functions, different  
> transcripts/proteins, etc). Both were "correct", but this can  
> obviously create confusion (both for programming and data analysis).  
> Therefore, we do not use transcript/gene names from any external  
> source as unique identifiers and instead create/use our own and just  
> link in the external data.
>
> Hopefully this information will help you to navigate the tables and  
> create your file. We do have a GTF format output from the table  
> browser (same as GFF v2, see our FAQ about file formats for more  
> info). There is no GFF3 format (yet) for complicated reasons.  
> Creating your own, for your own use, is an excellent idea since you  
> will then know exactly what the data represents.
>
> Jennifer Jackson
> UCSC Genome Bioinformatics Group
>
> Michael Muratet wrote:
>> Greetings
>>
>> I am trying to create a GFF3-formatted file from knownGene,
>> knownIsoforms and knownCanonical. (Most importantly, has anyone
>> already done this?) I'm using the MySQL server directly; it's the
>> easiest for me and should not be a burden to the server (but let
>> me know). I see the join between knownGene and knownIsoforms on
>> the name and transcript fields, but I'm looking for a gene name
>> that is common to all the rows in knownIsoforms and I'm not
>> finding it. For the cases I've examined, knownCanonical contains
>> the same information as knownGene. If I look in the browser at a
>> member from clusterId=2, say uc001aac.2, I can see that it has the
>> synonym FLJ0038 as do several others, but then all the other names
>> are different. (These are pseudogenes and may therefore be a bad
>> example.) Is there a field in a table somewhere that has the
>> necessary one-to-many relationship between 'gene name' and
>> clusterId? Am I misinterpreting knownIsoforms?
>>
>> Thanks
>>
>> Mike
>>
>> Michael Muratet, Ph.D.
>> Senior Scientist
>> HudsonAlpha Institute for Biotechnology
>> [email protected]
>> (256) 327-0473 (p)
>> (256) 327-0966 (f)
>>
>> Room 4005
>> 601 Genome Way
>> Huntsville, Alabama 35806
>>
>>
>>
>>
>>
>> _______________________________________________
>> Genome maillist  -  [email protected]
>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>>
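
The linked fields listed in the quoted reply suggest one possible way to
get a single label per cluster: take the geneSymbol of the cluster's
canonical transcript and attach it to every transcript in that cluster.
The sketch below is an illustration under stated assumptions, not a UCSC
recipe. It uses in-memory SQLite instead of the MySQL server, assumes
that knownCanonical carries a clusterId column, and uses made-up
identifiers (tx1, GENE_A, and so on). The helper names build_demo_db and
cluster_labels are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create minimal stand-ins for the three tables, with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE knownIsoforms (clusterId INTEGER, transcript TEXT);
        CREATE TABLE knownCanonical (clusterId INTEGER, transcript TEXT);
        CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO knownIsoforms VALUES (?, ?)",
        [(1, "tx1"), (1, "tx2"), (2, "tx3")],
    )
    # One representative transcript per cluster.
    conn.executemany(
        "INSERT INTO knownCanonical VALUES (?, ?)",
        [(1, "tx1"), (2, "tx3")],
    )
    # Symbols may differ between transcripts of the same cluster.
    conn.executemany(
        "INSERT INTO kgXref VALUES (?, ?)",
        [("tx1", "GENE_A"), ("tx2", "GENE_A_ALT"), ("tx3", "GENE_B")],
    )
    return conn


def cluster_labels(conn):
    """Return (clusterId, canonical geneSymbol, transcript) for every isoform."""
    return conn.execute(
        "SELECT i.clusterId, x.geneSymbol, i.transcript "
        "FROM knownIsoforms AS i "
        "JOIN knownCanonical AS c ON c.clusterId = i.clusterId "
        "JOIN kgXref AS x ON x.kgID = c.transcript "
        "ORDER BY i.clusterId, i.transcript"
    ).fetchall()


if __name__ == "__main__":
    for row in cluster_labels(build_demo_db()):
        print(row)
```

Because the same symbol can be attached to more than one cluster, as the
quoted reply points out, clusterId remains the safer key for grouping;
the symbol here is only a display label.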
