Hi Vikram,

To get the list of protein-coding, canonical genes in GFF format, you will need to do a two-part extraction from the Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables). The first part involves getting a list of canonical genes (i.e., no splice variants). The second part involves filtering out non-coding genes by keeping only genes where cdsStart does not equal cdsEnd; cdsStart equal to cdsEnd is our notation for a non-coding gene (https://lists.soe.ucsc.edu/pipermail/genome/2009-July/019588.html).
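
If you would rather check the logic offline, here is a minimal Python sketch of the same two-part idea. It is only an illustration, not how the Table Browser works internally. It assumes you have saved the knownCanonical "transcript" field as one ID per line and knownGene as tab-separated text with all fields, with name in column 1, cdsStart in column 6 and cdsEnd in column 7. The function names and the transcript IDs in the demo are made up.

```python
import csv
import io


def read_canonical_ids(handle):
    """Collect transcript IDs, skipping blank lines and '#' header lines."""
    return {
        line.strip()
        for line in handle
        if line.strip() and not line.startswith("#")
    }


def coding_canonical(known_gene_handle, canonical_ids):
    """Yield knownGene rows that are canonical and have cdsStart != cdsEnd."""
    reader = csv.reader(known_gene_handle, delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header line
        # Assumed column order: name, chrom, strand, txStart, txEnd,
        # cdsStart, cdsEnd, ...
        name, cds_start, cds_end = row[0], int(row[5]), int(row[6])
        if name in canonical_ids and cds_start != cds_end:
            yield row


# Made-up demo data (hypothetical transcript IDs and coordinates).
canonical = read_canonical_ids(
    io.StringIO("#hg19.knownCanonical.transcript\nuc_demo1.1\nuc_demo2.1\n")
)
known_gene = io.StringIO(
    "#name\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\tcdsEnd\n"
    "uc_demo1.1\tchr1\t+\t100\t1000\t150\t900\n"    # canonical, coding
    "uc_demo2.1\tchr1\t-\t2000\t3000\t2000\t2000\n"  # canonical, non-coding
    "uc_demo3.1\tchr1\t+\t100\t1200\t150\t1100\n"    # coding, not canonical
)
kept = [row[0] for row in coding_canonical(known_gene, canonical)]
print(kept)
```

Only the first demo row passes both checks. The sketch does not write GTF; the Table Browser steps remain the way to get that output format.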

Getting a list of canonical genes:

1. Go to the Table Browser and select your genome and assembly of interest.
2. UCSC Genes should automatically be selected as the track. Select "knownCanonical" from the table pull-down menu.
3. Select "selected fields from primary and related tables" as the output format and enter a file name for the output file. Click "get output".
4. On the next page, select the "transcript" field and then click "get output".

Please see this previous mailing list question for clarification about the construction of the knownCanonical table:
https://lists.soe.ucsc.edu/pipermail/genome/2005-July/008123.html

Filtering out non-coding genes:

1. Go back to the Table Browser and select "knownGene" from the table pull-down menu.
2. To upload the list of canonical genes from before, click "upload list" next to identifiers. Select your file and click "submit".
3. Make a filter by clicking "create" next to filter. For cdsStart, select "!=" from the pull-down menu and type "hg19.knownGene.cdsEnd" into the text box (this example is for hg19; use your own assembly's database name in place of hg19). Click "submit".
4. Select "GTF - gene transfer format" as the output format (GTF is very similar to GFF; see this page for more information: http://genome.ucsc.edu/FAQ/FAQformat#format4) and click "get output".

I hope this information is helpful. Please feel free to contact the mailing list again if you require further assistance.

Best,
Mary

------------------
Mary Goldman
UCSC Bioinformatics Group

On 9/3/10 11:59 AM, Vikram Agarwal wrote:
> Hello,
>
> I would like to extract the coordinates for all protein-coding gene
> models listed in UCSC genes in gff format. In the genome browser, it
> has an option to restrict the viewing of splice variants to show only
> one gene model per gene. I would like to extract only one model per
> gene according to the criterion that this option takes. Is there an
> easy way to accomplish this while also removing non-coding genes? Also,
> is there information somewhere about the criterion the genome browser
> takes to view only one gene model?
>
> Help is greatly appreciated!
>
> Thank you,
> Vikram
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
