Hello Jian,

It seems there is some confusion about how the RefSeq Genes track is created and what the data represent, or possibly about what alignment-based track data represent in general.
Tracks based on nucleotide or protein alignments take a query sequence and a target sequence and use a tool (like BLAT) to calculate the "best fit" of the query on the genome. Sometimes the query represents a fragment of a transcript that belongs to a gene (EST tracks, some mRNA tracks). Other times the query is curated (assembled from fragments) to represent a complete transcript that belongs to a gene (most Gene and Gene Prediction tracks, some mRNA tracks). In another version of this, a transcript is created, the translated protein is extracted, and the protein is then used for the alignment (other Gene and Gene Prediction tracks). There are even query sequences that do not represent transcripts or genes at all, but other genome sequence features.

For most Gene and Gene Prediction tracks, any individual gene will have one or more associated transcripts, called variants or isoforms. Some tracks single out one transcript variant from the gene group so that it can be used alone to represent the gene for analysis. In others, all transcript variants from a gene are simply grouped together and the data user has to determine the representative (when needed). Some tracks have transcripts or proteins that are not grouped into genes at all, and the data user has to decide how to cluster the transcripts/proteins into genes and how to call a representative. There are even tracks that contain only one protein per gene bound (CCDS is an example).

The RefSeq Genes track is a track where the transcript variants are assigned to a specific gene, but no canonical, or representative, transcript is assigned in the UCSC database (although the GenBank datasheet may have annotation, or references to annotation, about this). For this track, the field refGene.name is the transcript name and refGene.name2 is the gene name. The table refLink has additional data handles (protein accession, etc.).
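Because refGene has one row per transcript, one way to get a single coordinate span per gene is to group the rows by refGene.name2 and take the smallest txStart and the largest txEnd in each group. The following is a minimal Python sketch of that idea, not part of any UCSC tool. It assumes a tab-separated Table browser export of refGene whose header line names the columns name, chrom, strand, txStart, txEnd and name2. The function name gene_bounds and the sample rows are made up for illustration. Grouping by chromosome and strand as well as gene name is a choice made here so that rows sharing a gene name on different chromosomes are not merged.

```python
import csv


def gene_bounds(lines):
    """Collapse per-transcript rows into one (start, end) span per gene.

    `lines` is an iterable of tab-separated text lines; the first line is
    a header (optionally starting with '#') naming the columns.
    """
    lines = iter(lines)
    header = next(lines).lstrip("#").rstrip("\r\n").split("\t")
    reader = csv.DictReader(lines, fieldnames=header, delimiter="\t")

    bounds = {}
    for row in reader:
        # Assumption: one gene = same name2 on the same chrom and strand.
        key = (row["name2"], row["chrom"], row["strand"])
        start, end = int(row["txStart"]), int(row["txEnd"])
        if key in bounds:
            old_start, old_end = bounds[key]
            bounds[key] = (min(old_start, start), max(old_end, end))
        else:
            bounds[key] = (start, end)
    return bounds


# Made-up rows for illustration only (not real RefSeq accessions).
sample = [
    "#name\tchrom\tstrand\ttxStart\ttxEnd\tname2\n",
    "NM_TEST1\tchr1\t+\t1000\t5000\tGENE_A\n",
    "NM_TEST2\tchr1\t+\t1200\t6500\tGENE_A\n",
    "NM_TEST3\tchr2\t-\t300\t900\tGENE_B\n",
]

for (gene, chrom, strand), (start, end) in sorted(gene_bounds(sample).items()):
    print(gene, chrom, strand, start, end)
```

This defines the gene bounds as the union of all transcript variants, which is only one possible definition; picking a single representative transcript instead would give different coordinates.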
If you want only one transcript per gene, you will need to decide how to pick the canonical transcript and filter the data accordingly.

The UCSC Genes track is a track where the transcript variants are also assigned to a specific gene, but a canonical transcript is selected (usually, but not always, the longest or most 5' reaching transcript). For this track, the field knownGene.name is the transcript name. Tables such as kgAlias and kgXref have several common gene symbol handles. The table knownIsoforms links the various transcripts of a gene together into a cluster. The table knownCanonical describes the canonical splice variant of a gene. This is the data that would allow you to capture gene bound coordinates in the simplest way using the Table browser.

The RefSeq Gene dataset is included in the UCSC Gene dataset. You could decide to use the UCSC Gene dataset instead. Or, you could link from the RefSeq Genes track into the UCSC Genes track (via the table knownToRefSeq) to use the UCSC Gene clustering information (instead of the GenBank clustering in the refGene.name2 field). Or, you could compare the two, read the methods, and decide which to use once you have evaluated the differences.

To explore, link together, and download data for these tracks, the Table browser is an excellent tool. Here is the main help to get started:
http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html#TableBrowser

Hopefully this is helpful,
Jennifer

---------------------------------
Jennifer Jackson
UCSC Genome Informatics Group
http://genome.ucsc.edu/

On 3/19/10 1:35 PM, Jian WJ Wang wrote:
> Dear UCSC genome browser experts:
>
> I would like to download full refseq gene sets (hg18) with genomic
> coordinates for each gene. I understand this can be done using table
> browser. But I tried many files/tracks and none of them export gene
> coordinates. Instead, all the coordinates are based on transcripts, cds,
> exons etc. Any help is greatly appreciated.
>
> Best regards,
>
> -Jian
>
> ______________________
> Jian Wang, PhD
> Informatics
> 317 655 3496
> [email protected]
>
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
