Dear Vikram,

I've looked into this, and I see what's going on. The current version of UCSC Genes was built using a download of RefSeq sequences dated Aug 24, 2009. The sequence NM_001010847.1 was added to RefSeq as a reviewed sequence on June 30, 2010. When NM_001010847.1 was released, it replaced three predicted RefSeq entries: XM_059074.6, XM_943661.3, and XM_001713942.2.

Because UCSC Genes does not link to predicted RefSeq sequences, there were no links recorded between uc001avb.2 and these sequences. There were links between uc001avb.2 and the non-RefSeq sequences that it was built from, such as LRC38_HUMAN.
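For the second half of the question (finding every UCSC gene ID that has no RefSeq cross-reference), one option is to compare the two tables offline. The following is only a minimal sketch, assuming both tables have been saved from the Table Browser as tab-separated text with the UCSC gene ID in the first column. The function name `ids_without_refseq`, the second gene ID, and the RefSeq accession in the toy data are made up for illustration.

```python
import csv
import io


def ids_without_refseq(known_gene_tsv, known_to_refseq_tsv):
    """Return UCSC gene IDs found in knownGene but not in knownToRefSeq.

    Both arguments are open text files (or file-like objects) holding
    tab-separated rows whose first column is the UCSC gene ID.
    Lines starting with '#' are treated as headers and skipped.
    """
    linked = set()
    for row in csv.reader(known_to_refseq_tsv, delimiter="\t"):
        if row and not row[0].startswith("#"):
            linked.add(row[0])

    missing = []
    for row in csv.reader(known_gene_tsv, delimiter="\t"):
        if row and not row[0].startswith("#") and row[0] not in linked:
            missing.append(row[0])
    return missing


# Toy data: uc000xyz.1 and NM_000000.1 are hypothetical placeholders.
known_gene = io.StringIO("#name\tchrom\nuc001avb.2\tchr1\nuc000xyz.1\tchr1\n")
known_to_refseq = io.StringIO("#name\tvalue\nuc000xyz.1\tNM_000000.1\n")
print(ids_without_refseq(known_gene, known_to_refseq))
```

Note that an ID in the resulting list is not necessarily an error: it may have been built from non-RefSeq sequences, or from a RefSeq entry added after the Aug 24, 2009 download.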
We are working on updating UCSC Genes, and there should be a new version within a few months. In the meantime, UCSC Genes will not contain links to any RefSeq sequences that were added since Aug 24, 2009.

I hope this information helps. If you have any more questions, feel free to reply to this mail thread.

Cheers,
Melissa

On Thu, Sep 9, 2010 at 3:42 PM, Vikram Agarwal <[email protected]> wrote:
> Hello,
>
> I would like to report a concern I have about UCSC genes listings:
>
> The UCSC gene ID: uc001avb.2 does not have any RefSeq ID
> cross-referenced in the table knownToRefseq. On the genome browser this
> UCSC gene was clearly derived from RefSeq ID: NM_001010847.1
> <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&db=Nucleotide&term=NM_001010847&doptcmdl=GenBank&tool=genome.ucsc.edu>.
>
> The genes should have been generated according to this procedure described:
> "For RefSeq transcripts the RefSeq protein prediction is used directly
> instead of this procedure."
>
> Is there any way to fix this or to identify all UCSC gene IDs in which
> this occurs?
>
> Best,
> Vikram
>
> On 09/07/2010 08:00 PM, Mary Goldman wrote:
>> Hi Vikram,
>>
>> To get the list of protein coding, canonical genes in GFF format, you
>> will need to do a two-part extraction from the Table Browser
>> (http://genome.ucsc.edu/cgi-bin/hgTables). The first part involves
>> getting a list of canonical genes (ie. no splice variants), while the
>> second part involves filtering out non-coding genes by looking for genes
>> where the cdsStart does not equal the cdsEnd (our notation for a
>> non-coding gene
>> https://lists.soe.ucsc.edu/pipermail/genome/2009-July/019588.html).
>>
>> Getting a list of canonical genes:
>> 1. Go to the Table Browser and select your genome and assembly of interest.
>> 2. UCSC Genes should automatically be selected as the track. Select
>> "knownCanonical" from the table pull down menu.
>> 3. Select "selected fields from primary and related tables" as the
>> output format and enter a file name for the output file. Click "get output".
>> 4. Select "transcript" and then click "get output".
>> Please see this previous mailing list question for clarification about
>> the construction of the knownCanonical table:
>> https://lists.soe.ucsc.edu/pipermail/genome/2005-July/008123.html
>>
>> Filtering out non-coding genes:
>> 1. Go back to the Table Browser and select "knownGene" from the table
>> pull down menu.
>> 2. To upload our list of canonical genes from before, click "upload
>> list" next to identifiers. Select your file and click "submit".
>> 3. Make a filter by clicking "create" next to filter. For cdsStart,
>> select "!=" from the pull down menu and type "hg19.knownGene.cdsEnd"
>> into the text box. Click "submit".
>> 4. Select "GTF - gene transfer format" as the output format (GTF is very
>> similar to GFF; see this page for more information:
>> http://genome.ucsc.edu/FAQ/FAQformat#format4) and click "get output".
>>
>> I hope this information is helpful. Please feel free to contact the
>> mail list again if you require further assistance.
>>
>> Best,
>> Mary
>> ------------------
>> Mary Goldman
>> UCSC Bioinformatics Group
>>
>> On 9/3/10 11:59 AM, Vikram Agarwal wrote:
>>> Hello,
>>>
>>> I would like to extract the coordinates for all protein-coding gene
>>> models listed in UCSC genes in gff format. In the genome browser, it
>>> has an option to restrict the viewing of splice variants to show only
>>> one gene model per gene. I would like to extract only one model per
>>> gene according to the criterion that this option takes. Is there an
>>> easy way to accomplish this while also removing non-coding genes? Also,
>>> is there information somewhere about the criterion the genome browser
>>> takes to view only one gene model?
>>>
>>> Help is greatly appreciated!
>>>
>>> Thank you,
>>> Vikram
>>> _______________________________________________
>>> Genome maillist - [email protected]
>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
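The two-part extraction Mary describes in the quoted message (keep only canonical transcripts, then drop rows where cdsStart equals cdsEnd) can also be applied to downloaded tables. This is a rough sketch rather than the Table Browser's own logic. It assumes a tab-separated knownGene dump in the standard column order (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, ...) and a set of transcript IDs taken from the "transcript" field of knownCanonical. The function name `coding_canonical` and the gene IDs and coordinates in the toy data are invented for illustration.

```python
import csv
import io


def coding_canonical(known_gene_tsv, canonical_ids):
    """Return names of knownGene rows that are canonical and protein coding.

    A row counts as coding when cdsStart != cdsEnd, the same test as the
    Table Browser filter described in the thread. Assumes knownGene column
    order: name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, ...
    Lines starting with '#' are treated as headers and skipped.
    """
    kept = []
    for row in csv.reader(known_gene_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        name = row[0]
        cds_start, cds_end = int(row[5]), int(row[6])
        if name in canonical_ids and cds_start != cds_end:
            kept.append(name)
    return kept


# Toy rows with made-up IDs and coordinates:
# ucA.1 is canonical and coding, ucB.1 is canonical but non-coding,
# ucC.1 is coding but not in the canonical set.
known_gene = io.StringIO(
    "#name\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\tcdsEnd\n"
    "ucA.1\tchr1\t+\t100\t900\t150\t800\n"
    "ucB.1\tchr1\t-\t1000\t2000\t1000\t1000\n"
    "ucC.1\tchr2\t+\t50\t500\t60\t400\n"
)
print(coding_canonical(known_gene, {"ucA.1", "ucB.1"}))
```

The sketch returns only transcript names. Writing GTF output, as in step 4 of the quoted instructions, would additionally need the exon fields of each kept row.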
