Hello Irina,

I am able to extract just the same three transcripts when I limit the region to 
chr15:22787248-22813789.
uc001ywz.1
uc001yxa.1
uc001yxc.2

However, I noticed that in the other list, where the additional exons are 
reported, some of the chromosome coordinates fall outside of this region 
and have transcript IDs different from the first three. These additional 
transcripts have alignments that overlap the first three (by overall genomic 
footprint coordinates), but they are not completely contained in the 
region chr15:22787248-22813789.

The first and last entries in the longer list you sent help explain:
chr15 22758353 22758449 uc001ywp.1_exon_4_0_chr15_22758354_f 0 +
chr15 22781099 22784472 uc001yxd.2_exon_0_0_chr15_22781100_f 0 +

Both of these transcripts, uc001ywp.1 and uc001yxd.2, are absent from the 
original list and have exons outside of the original region.

uc001ywp.1 has exons that span this region: chr15:22,619,887-22,774,822
uc001yxd.2 has exons that span this region: chr15:22,781,100-22,784,472
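
As a rough illustration of these coordinates (a sketch only: the `overlaps` 
helper and the variable names are hypothetical, and the "footprint" is simply 
the smallest chromStart and largest chromEnd of the three transcripts in your 
message, converted to 1-based coordinates), a closed-interval overlap test 
shows that neither transcript touches the query region, while both overlap the 
combined footprint of the first three:

```python
# 1-based, closed intervals on chr15 (hg18), taken from this thread.
QUERY_REGION = (22787248, 22813789)

# Combined span of uc001ywz.1, uc001yxa.1 and uc001yxc.2.
# The BED chromStart 22751162 is 0-based, so 1 is added here.
FOOTPRINT = (22751163, 22817591)

EXTRA_TRANSCRIPTS = {
    "uc001ywp.1": (22619887, 22774822),
    "uc001yxd.2": (22781100, 22784472),
}


def overlaps(a, b):
    """Return True if two closed intervals (start, end) share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


for name, span in EXTRA_TRANSCRIPTS.items():
    print(
        name,
        "| overlaps query region:", overlaps(span, QUERY_REGION),
        "| overlaps footprint:", overlaps(span, FOOTPRINT),
    )
```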

I would expect this to be the case in several genomic regions where genes 
overlap (even on the same strand, as in this case, at least for the 
examples you sent). 

The rules for merging transcripts and clustering them into gene bounds 
include the following (among others; see the UCSC Genes track description for 
more details):
- If a transcript produces a distinct protein, then it is retained as a 
distinct transcript. 
- If two transcripts do not share any common exons, then they are assigned to 
different clusters (gene bounds).

My guess is that when you were analyzing the total output, you extracted all 
exons that fell into the region chr15:22787248-22813789. By doing this, 
additional transcripts were pulled in. To do more complex overlap analysis 
(including rules about the type of overlap required), the Table Browser's 
Intersect function may be useful. There are also tools in Galaxy for 
"Interval"-based queries. And if you are interested, the kent source tree has 
Unix command-line utilities for comparisons like this. 
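
One quick way to see which transcripts contribute the extra exons is to group 
the exon rows by the transcript ID embedded in the BED name field. The sketch 
below is only an illustration: it assumes the name always has the form 
<transcriptID>_exon_<...>, as in the rows you posted, and `transcript_id`, 
`EXPECTED` and `EXON_NAMES` are hypothetical names. The input is the 15 sample 
names from your message.

```python
from collections import Counter

# The three transcripts reported for the region-limited query.
EXPECTED = {"uc001ywz.1", "uc001yxa.1", "uc001yxc.2"}

# Name fields of the sample "extra" exon rows posted in this thread.
EXON_NAMES = [
    "uc001ywp.1_exon_4_0_chr15_22758354_f",
    "uc001ywy.1_exon_2_0_chr15_22764172_f",
    "uc001ywu.1_exon_3_0_chr15_22765035_f",
    "uc001ywx.1_exon_3_0_chr15_22770528_f",
    "uc001ywp.1_exon_6_0_chr15_22770551_f",
    "uc001ywv.1_exon_5_0_chr15_22770882_f",
    "uc001ywy.1_exon_4_0_chr15_22771598_f",
    "uc001ywp.1_exon_8_0_chr15_22772545_f",
    "uc001ywy.1_exon_5_0_chr15_22772545_f",
    "uc001ywp.1_exon_9_0_chr15_22773117_f",
    "uc001ywy.1_exon_7_0_chr15_22774018_f",
    "uc001ywp.1_exon_11_0_chr15_22774433_f",
    "uc001ywp.1_exon_12_0_chr15_22774647_f",
    "uc001yxb.2_exon_0_0_chr15_22778234_f",
    "uc001yxd.2_exon_0_0_chr15_22781100_f",
]


def transcript_id(exon_name):
    """Return the part of the exon name that precedes '_exon_'."""
    return exon_name.split("_exon_")[0]


# Count exons per transcript, then keep only the unexpected transcripts.
counts = Counter(transcript_id(name) for name in EXON_NAMES)
extra = {tid: n for tid, n in counts.items() if tid not in EXPECTED}

for tid in sorted(extra):
    print(tid, extra[tid])
```

For these sample rows, every exon belongs to a transcript other than the 
three expected ones.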

For the utilities in the kent tree, the best advice is to examine the utility 
descriptions (and usage, once the compiled version is available) and 
experiment. bedIntersect may be interesting to you. You may have problems with 
a BED file that is based on exons, so you may need to develop some special 
handling to group exons by their shared transcript ID. Or start 
with the genePred format, filter by region (as the Table Browser does), 
and split into exons at the end.
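
The "filter by region first, split into exons last" idea can be sketched as 
follows. This is a minimal illustration, not how the Table Browser is 
implemented: it uses whole-transcript BED12 rows as a stand-in for genePred, 
the function names are hypothetical, and the fourth row (uc001yxd.2) is 
reconstructed from its single exon row with placeholder thickStart/thickEnd 
values, which the code does not use.

```python
# Whole-transcript BED12 rows (0-based start, exclusive end).
BED12_ROWS = [
    "chr15 22751162 22795318 uc001ywz.1 0 + 22774020 22776299 0 6 "
    "140,96,139,126,138,109, 0,7191,22855,23270,25024,44047,",
    "chr15 22751162 22795318 uc001yxa.1 0 + 22751162 22751162 0 7 "
    "140,96,151,107,138,663,109, 0,7191,13009,24613,25024,26852,44047,",
    "chr15 22778014 22817591 uc001yxc.2 0 + 22778014 22778014 0 9 "
    "663,121,109,130,356,164,122,42,762, "
    "0,16591,17195,18470,19583,20950,37252,38053,38815,",
    # Assumed row: thickStart/thickEnd are placeholders.
    "chr15 22781099 22784472 uc001yxd.2 0 + 22781099 22781099 0 1 3373, 0,",
]


def parse_region(region):
    """'chr15:22787248-22813789' (1-based) -> (chrom, 0-based start, end)."""
    chrom, span = region.split(":")
    start, end = span.replace(",", "").split("-")
    return chrom, int(start) - 1, int(end)


def exons_in_region(rows, region):
    """Keep transcripts overlapping the region, then emit their exons."""
    chrom, r_start, r_end = parse_region(region)
    exons = []
    for row in rows:
        f = row.split()
        t_chrom, t_start, t_end = f[0], int(f[1]), int(f[2])
        name, strand = f[3], f[5]
        # Step 1: filter on the whole-transcript span, not on single exons.
        if t_chrom != chrom or t_start >= r_end or t_end <= r_start:
            continue
        # Step 2: split the surviving transcript into its exon blocks.
        sizes = [int(x) for x in f[10].split(",") if x]
        offsets = [int(x) for x in f[11].split(",") if x]
        for i, (size, offset) in enumerate(zip(sizes, offsets)):
            start = t_start + offset
            exons.append((t_chrom, start, start + size,
                          f"{name}_exon_{i}", strand))
    return exons


exons = exons_in_region(BED12_ROWS, "chr15:22787248-22813789")
print(len(exons))
```

With these four rows, only the three transcripts that overlap the region 
survive the filter, giving the 22 exons mentioned in the original message.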

http://genomewiki.cse.ucsc.edu/index.php/Kent_source_utilities
http://genomewiki.cse.ucsc.edu/index.php/The_source_tree
http://genome.ucsc.edu/FAQ/FAQdownloads#download27

README files in the source explain how to set up the environment and compile. 
Or, if your system is a match, pre-compiled utilities are available by ftp from 
here:
http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/

Follow the regular ftp instructions, but instead of a cd into goldenPath/ you 
will cd into admin/.
http://genome.ucsc.edu/FAQ/FAQdownloads#download1

We hope this helps,
Jennifer







------------------------------------------------ 
Jennifer Jackson 
UCSC Genome Bioinformatics Group 

----- "Irina Khrebtukova" <[email protected]> wrote:

> From: "Irina Khrebtukova" <[email protected]>
> To: [email protected]
> Sent: Monday, January 18, 2010 12:31:36 PM GMT -08:00 US/Canada Pacific
> Subject: [Genome] extra knownGene exons when using Table tool
>
> Hi,
> 
>  
> 
> I'd like to report a strange bug: when making custom track/bed file
> of
> the knownGene exons using Table tool for the whole genome I get a
> number
> of exons that are not in actual knownGene track.
> 
> For example more than 100 extra exons are listed for region
> chr15:22787248-22813789 (hg18). That region has only 3 ucsc genes
> overlapping:
> 
>  
> 
> chr15 22751162 22795318 uc001ywz.1 0 + 22774020 22776299 0 6 140,96,139,126,138,109, 0,7191,22855,23270,25024,44047,
> 
> chr15 22751162 22795318 uc001yxa.1 0 + 22751162 22751162 0 7 140,96,151,107,138,663,109, 0,7191,13009,24613,25024,26852,44047,
> 
> chr15 22778014 22817591 uc001yxc.2 0 + 22778014 22778014 0 9 663,121,109,130,356,164,122,42,762, 0,16591,17195,18470,19583,20950,37252,38053,38815,
> 
>  
> 
> and  correspondingly 22 exons. This could be seen when making exon
> track
> for the region only. However when creating such track for the whole
> genome (hg18)  the same region on chr15 has 139 exons, listing 119
> extra
> exons, Below are just a few of those (let me know if you want me to
> post
> whole list):
> 
>  
> 
> chr15    22758353          22758449
> uc001ywp.1_exon_4_0_chr15_22758354_f          0          +
> 
> chr15    22764171          22764322
> uc001ywy.1_exon_2_0_chr15_22764172_f          0          +
> 
> chr15    22765034          22765069
> uc001ywu.1_exon_3_0_chr15_22765035_f          0          +
> 
> chr15    22770527          22770696
> uc001ywx.1_exon_3_0_chr15_22770528_f          0          +
> 
> chr15    22770550          22770696
> uc001ywp.1_exon_6_0_chr15_22770551_f          0          +
> 
> chr15    22770881          22771197
> uc001ywv.1_exon_5_0_chr15_22770882_f          0          +
> 
> chr15    22771597          22771749
> uc001ywy.1_exon_4_0_chr15_22771598_f          0          +
> 
> chr15    22772544          22772656
> uc001ywp.1_exon_8_0_chr15_22772545_f          0          +
> 
> chr15    22772544          22772656
> uc001ywy.1_exon_5_0_chr15_22772545_f          0          +
> 
> chr15    22773116          22773269
> uc001ywp.1_exon_9_0_chr15_22773117_f          0          +
> 
> chr15    22774017          22774156
> uc001ywy.1_exon_7_0_chr15_22774018_f          0          +
> 
> chr15    22774432          22774558
> uc001ywp.1_exon_11_0_chr15_22774433_f        0          +
> 
> chr15    22774646          22774822
> uc001ywp.1_exon_12_0_chr15_22774647_f        0          +
> 
> chr15    22778233          22780030
> uc001yxb.2_exon_0_0_chr15_22778234_f           0          +
> 
> chr15    22781099          22784472
> uc001yxd.2_exon_0_0_chr15_22781100_f           0          +
> 
>  
> 
> Could you please check this? I haven't noticed similar problem for
> any
> other gene prediction tracks, only for known genes.
> 
>  
> 
> Thanks!
> 
>  
> 
> Irina Khrebtukova
> 
> Illumina, Hayward CA
> 
>  
> 
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
