Hello Rathi,

Unfortunately, the RefSeq dataset does not contain a "canonical" transcript designation. The transcripts are clustered by gene, however (name2 in the primary table).

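As a rough sketch of working from the name2 grouping (this is not part of any UCSC or Galaxy tool), the Python below assumes a tab-separated dump of the refGene primary table with the column order bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, score, name2; please check that against the table schema before relying on it. The function names, the sample rows, and the transcript/gene identifiers in them are made up for illustration. It picks the longest-span transcript per gene and merges exons per gene and strand, so the gene name stays attached to each merged interval.

```python
from collections import defaultdict

# Assumed refGene column order; any extra trailing columns are ignored.
COLS = ["bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
        "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2"]

def parse_refgene(lines):
    """Yield one dict per transcript row of a tab-separated refGene dump."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        rec = dict(zip(COLS, line.rstrip("\n").split("\t")))
        rec["txStart"], rec["txEnd"] = int(rec["txStart"]), int(rec["txEnd"])
        # exonStarts/exonEnds are comma-separated lists with a trailing comma.
        starts = [int(x) for x in rec["exonStarts"].split(",") if x]
        ends = [int(x) for x in rec["exonEnds"].split(",") if x]
        rec["exons"] = list(zip(starts, ends))
        yield rec

def longest_per_gene(records):
    """Keep, for each name2, the transcript with the longest genomic span."""
    best = {}
    for rec in records:
        span = rec["txEnd"] - rec["txStart"]
        cur = best.get(rec["name2"])
        if cur is None or span > cur["txEnd"] - cur["txStart"]:
            best[rec["name2"]] = rec
    return best

def merged_exons_per_gene(records):
    """Merge overlapping or touching exons within each (gene, chrom, strand)."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["name2"], rec["chrom"], rec["strand"])].extend(rec["exons"])
    merged = {}
    for key, exons in groups.items():
        out = []
        for start, end in sorted(exons):
            if out and start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[key] = [tuple(e) for e in out]
    return merged

# Hypothetical rows: two isoforms of one made-up gene on chr1, plus strand.
SAMPLE = [
    "0\tNM_000001\tchr1\t+\t100\t500\t100\t500\t2\t100,300,\t200,500,\t0\tGeneA",
    "0\tNM_000002\tchr1\t+\t50\t500\t50\t500\t2\t50,300,\t150,500,\t0\tGeneA",
]

if __name__ == "__main__":
    recs = list(parse_refgene(SAMPLE))
    for gene, rec in longest_per_gene(recs).items():
        print(gene, rec["name"], rec["chrom"], rec["txStart"], rec["txEnd"])
    for (gene, chrom, strand), exons in merged_exons_per_gene(recs).items():
        for start, end in exons:
            print(chrom, start, end, gene, strand)
```

Grouping on (name2, chrom, strand) rather than on name2 alone keeps a gene symbol that appears on more than one chromosome or strand from being collapsed into a single locus. "Longest span" is only one of several possible selection rules.
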
There are a few choices, as you bring up:

1) Select a canonical transcript yourself (longest, most exons, furthest 5' reaching, or similar).
2) Use UCSC Genes to cluster and use its canonical transcript for your analysis. All of RefSeq is included in UCSC Genes, but the canonical may or may not be a RefSeq. So even this method would require some independent analysis.

For the merging question, perhaps use the RefSeq gene name and only compare RefSeqs assigned to the same gene when collapsing the redundancy in Galaxy. You could alternatively use the UCSC gene "cluster" name and the linked gene symbols (and discard the RefSeq assignments). If you try to merge the two clustering methods, it is almost certain that some regions will have more than one gene assigned.

This is obvious, but just to be sure it is considered: make sure strand is taken into account for any interval merges. This would not be an issue if you are already clustering by gene (all isoforms would *hopefully* be on the same strand already), but strand would definitely be important to consider if you are clustering by reference genome position/footprint before doing the redundancy analysis.

Some outliers should probably be expected for footprint-based analysis: genes within other genes, partially interleaved regions, that sort of thing. This is likely where your issue with the redundancy clustering producing more than one gene name came from. You will have to decide how to sort these out if you choose not to set up the analysis per-gene in the beginning. Even the UCSC Genes track has interleaved transcripts/genes present (on purpose).

Hopefully this gives you some more information about the RefSeq track and some clustering/analysis ideas.

Thanks,
Jen

---------------------------------
Jennifer Jackson
UCSC Genome Informatics Group
http://genome.ucsc.edu/

On 4/2/10 6:03 PM, Rathi Thiagarajan wrote:
> Hi there,
>
> Could you please advise me on the best way to obtain (mm9) RefSeq canonical
> transcripts genomic intervals?
> I see that there is a table for UCSC genes
> "knownCanonical", but I was wondering if there was something similar just
> for RefSeq? I could just filter for UCSC genes with linked RefSeq IDs but
> was wondering if there was a better way?
>
> Also, is it possible to get a non-redundant set of RefSeq exons while still
> retaining the Gene Name information? I have tried to merge the exon
> genomic intervals within Galaxy, but it doesn't return the gene names.
> Basically, my goal is to get RefSeq-based locus information either
> through non-redundant exons or non-redundant whole gene co-ordinates.
>
> Thanking you in advance.
>
> Cheers,
> Rathi
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
