Hello Irina,

I am able to extract just the same three transcripts when I limit the region to chr15:22787248-22813789:

uc001ywz.1
uc001yxa.1
uc001yxc.2

However, I noticed that in the other list, where the additional exons are reported, some of the chromosome coordinates fall outside of this region and have transcript IDs different from the first three. These additional transcripts have alignments that overlap the first three (using global genomic footprint coordinates), but they are not completely contained in the region chr15:22787248-22813789. The first and last entries in the longer list you sent help explain:

chr15 22758353 22758449 uc001ywp.1_exon_4_0_chr15_22758354_f 0 +
chr15 22781099 22784472 uc001yxd.2_exon_0_0_chr15_22781100_f 0 +

Both of these transcripts, uc001ywp.1 and uc001yxd.2, are not in the original list and have exons outside of the original region.

uc001ywp.1 has exons that span this region: chr15:22,619,887-22,774,822
uc001yxd.2 has exons that span this region: chr15:22,781,100-22,784,472

I would expect this to be the case in several genomic regions where genes overlap (even on the same strand, as in this case, at least for the examples you sent). The rules for merging transcripts and clustering transcripts into gene bounds include the following (among others; see the UCSC Genes track description for more details):

- If a transcript produces a distinct protein, then it is retained as a distinct transcript.
- If two transcripts do not share any common exons, then they are assigned to different clusters (gene bounds).

My guess is that when you were analyzing the total output, you extracted all exons that fell into the region chr15:22787248-22813789. By doing this, additional transcripts were pulled in.

To do more complex overlap analysis (that includes rules about the type of overlap required), the Table browser Intersect function may be useful. There are also tools in Galaxy for doing "Interval" based queries. And if you are interested, the kent source tree has Unix command-line utilities to do comparisons like this.
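
As a rough illustration of the difference between selecting by transcript and selecting by exon, here is a minimal Python sketch. It is not a UCSC tool, and the helper names (overlaps, parse_bed12, split_exons) are made up for this example. It uses the three BED12 records from the original report and the two transcript spans quoted above, and it assumes the region is given in 1-based browser coordinates while BED starts are 0-based.

```python
# Region from the thread, converted from 1-based browser coordinates
# to 0-based, half-open BED coordinates (assumption of this sketch).
REGION = ("chr15", 22787248 - 1, 22813789)

# BED12 records copied from the original report (hg18 knownGene).
BED12 = """\
chr15 22751162 22795318 uc001ywz.1 0 + 22774020 22776299 0 6 140,96,139,126,138,109, 0,7191,22855,23270,25024,44047,
chr15 22751162 22795318 uc001yxa.1 0 + 22751162 22751162 0 7 140,96,151,107,138,663,109, 0,7191,13009,24613,25024,26852,44047,
chr15 22778014 22817591 uc001yxc.2 0 + 22778014 22778014 0 9 663,121,109,130,356,164,122,42,762, 0,16591,17195,18470,19583,20950,37252,38053,38815,
"""

# Transcript spans quoted in the reply (1-based, inclusive).
EXTRA_SPANS = {
    "uc001ywp.1": (22619887, 22774822),
    "uc001yxd.2": (22781100, 22784472),
}


def overlaps(chrom, start, end, region):
    """True if the 0-based half-open interval overlaps the region."""
    r_chrom, r_start, r_end = region
    return chrom == r_chrom and start < r_end and end > r_start


def parse_bed12(text):
    """Yield one dict per BED12 line (whitespace-separated fields)."""
    for line in text.splitlines():
        if not line.strip():
            continue
        f = line.split()
        yield {
            "chrom": f[0],
            "start": int(f[1]),
            "end": int(f[2]),
            "name": f[3],
            "strand": f[5],
            "sizes": [int(x) for x in f[10].split(",") if x],
            "starts": [int(x) for x in f[11].split(",") if x],
        }


def split_exons(tx):
    """Yield (chrom, start, end, name, strand) for each block of a transcript."""
    for size, rel_start in zip(tx["sizes"], tx["starts"]):
        exon_start = tx["start"] + rel_start
        yield (tx["chrom"], exon_start, exon_start + size, tx["name"], tx["strand"])


# Filter whole transcripts by region first, then split into exons.
kept = [tx for tx in parse_bed12(BED12)
        if overlaps(tx["chrom"], tx["start"], tx["end"], REGION)]
region_exons = [exon for tx in kept for exon in split_exons(tx)]

# The two extra transcripts do not overlap the region as whole transcripts.
extra_overlap = {name: overlaps("chr15", start - 1, end, REGION)
                 for name, (start, end) in EXTRA_SPANS.items()}

print(len(kept), len(region_exons))  # 3 22
print(extra_overlap)  # {'uc001ywp.1': False, 'uc001yxd.2': False}
```

Filtering on whole transcripts and splitting into exons only at the end gives the 22 exons from the region-limited query. Selecting exon records by coordinate from a whole-genome exon file instead can pick up exons that belong to other transcripts.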

For the utilities in the kent tree, the best advice is to examine the utility descriptions (and usage, once the compiled version is available) and experiment. bedIntersect may be of interest to you. You may have problems with a BED file that is based on exons, so some special handling or processing on your side to account for common transcript IDs may be necessary. Or start with the genePred format, then filter by region (as the Table browser does), then split into exons at the end.

http://genomewiki.cse.ucsc.edu/index.php/Kent_source_utilities
http://genomewiki.cse.ucsc.edu/index.php/The_source_tree
http://genome.ucsc.edu/FAQ/FAQdownloads#download27

README files in the source explain how to set up the environment and compile. Or, if your system is a match, pre-compiled utilities are available for ftp from here:

http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/

Follow the regular ftp instructions, but instead of a cd into goldenPath/ you will cd into admin/.

http://genome.ucsc.edu/FAQ/FAQdownloads#download1

We hope this helps,
Jennifer

------------------------------------------------
Jennifer Jackson
UCSC Genome Bioinformatics Group

----- "Irina Khrebtukova" <[email protected]> wrote:

> From: "Irina Khrebtukova" <[email protected]>
> To: [email protected]
> Sent: Monday, January 18, 2010 12:31:36 PM GMT -08:00 US/Canada Pacific
> Subject: [Genome] extra knownGene exons when using Table tool
>
> Hi,
>
> I'd like to report a strange bug: when making custom track/bed file of
> the knownGene exons using Table tool for the whole genome I get a number
> of exons that are not in actual knownGene track.
>
> For example more than 100 extra exons are listed for region
> chr15:22787248-22813789 (hg18).
> That region has only 3 ucsc genes overlapping:
>
> chr15 22751162 22795318 uc001ywz.1 0 + 22774020 22776299 0 6 140,96,139,126,138,109, 0,7191,22855,23270,25024,44047,
> chr15 22751162 22795318 uc001yxa.1 0 + 22751162 22751162 0 7 140,96,151,107,138,663,109, 0,7191,13009,24613,25024,26852,44047,
> chr15 22778014 22817591 uc001yxc.2 0 + 22778014 22778014 0 9 663,121,109,130,356,164,122,42,762, 0,16591,17195,18470,19583,20950,37252,38053,38815,
>
> and correspondingly 22 exons. This could be seen when making exon track
> for the region only. However when creating such track for the whole
> genome (hg18) the same region on chr15 has 139 exons, listing 119 extra
> exons, Below are just a few of those (let me know if you want me to post
> whole list):
>
> chr15 22758353 22758449 uc001ywp.1_exon_4_0_chr15_22758354_f 0 +
> chr15 22764171 22764322 uc001ywy.1_exon_2_0_chr15_22764172_f 0 +
> chr15 22765034 22765069 uc001ywu.1_exon_3_0_chr15_22765035_f 0 +
> chr15 22770527 22770696 uc001ywx.1_exon_3_0_chr15_22770528_f 0 +
> chr15 22770550 22770696 uc001ywp.1_exon_6_0_chr15_22770551_f 0 +
> chr15 22770881 22771197 uc001ywv.1_exon_5_0_chr15_22770882_f 0 +
> chr15 22771597 22771749 uc001ywy.1_exon_4_0_chr15_22771598_f 0 +
> chr15 22772544 22772656 uc001ywp.1_exon_8_0_chr15_22772545_f 0 +
> chr15 22772544 22772656 uc001ywy.1_exon_5_0_chr15_22772545_f 0 +
> chr15 22773116 22773269 uc001ywp.1_exon_9_0_chr15_22773117_f 0 +
> chr15 22774017 22774156 uc001ywy.1_exon_7_0_chr15_22774018_f 0 +
> chr15 22774432 22774558 uc001ywp.1_exon_11_0_chr15_22774433_f 0 +
> chr15 22774646 22774822 uc001ywp.1_exon_12_0_chr15_22774647_f 0 +
> chr15 22778233 22780030 uc001yxb.2_exon_0_0_chr15_22778234_f 0 +
> chr15 22781099 22784472 uc001yxd.2_exon_0_0_chr15_22781100_f 0 +
>
> Could you please check this? I haven't noticed similar problem for any
> other gene prediction tracks, only for known genes.
>
> Thanks!
>
> Irina Khrebtukova
> Illumina, Hayward CA
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
