Hi, Brent!

The psl record found in kgTargetAli has already been changed for 
genepred use.

If you just blat uc002imy.2 to hg19 you can see everything that's happening.

1. 16 bases of the poly-A tail are removed from the picture.
2. A small exon of 5 bases is merged into its neighbor exon.
   (This wipes out 1 tiny q-gap).
3. 2 tiny q-gaps are ignored since they cannot be represented in genepred.

The original blat psl shows them though.

 >uc002imy.2 (COPZ2) length=923
ggcggcgagcggaatgcagcggcccgaggcctggccacgtccgcacccgggggagggggc
cgcggcggcccaggccgggggcccggcgccgcctgctcgagccggggagccctcggggct
gcggttgcaggaaccttccctctacaccatcaaggctgttttcatcctagataatgacgg
gcgccggctgctggccaagtattatgatgacacattcccctccatgaaggagcagatggt
tttcgagaaaaatgtcttcaacaagaccagccggactgagagtgagattgcattttttgg
gggtatgaccatcgtctacaagaacagcattgacctcttcctatacgtggtgggctcatc
ctacgagaatgagctgatgctcatgtctgttctcacctgcctgtttgagtctctgaacca
catgttaaggaagaacgtggagaagcgctggttgctggagaacatggacggagccttctt
ggtgctggacgagattgtggatggcggtgtgattctggagagtgacccccagcaagtgat
ccagaaggtgaattttagggcagatgatggcggcttgactgaacagagtgtggcccaggt
tcttcagtctgccaaggaacaaattaaatggtcgttattgaaatgaaggctgtggattca
aggctccctgccccccagatcatttccccaatcctggcaaaagcccaaagatcccagggt
caggagagacccctctgtatccccaggtccctcccagaactgactcctaaggtctccagc
cagggcttctgagatgcaaaggtttggcctcaggagagtcaccttttctcacggccctgg
ccttaactcatatcttaggcattcctggccccagggccctaataaacctgcttttgtctt
ctgccaaaaaaaaaaaaaaaaaa

The quick answer is that there were tiny 1 and 2 bp gaps on
the query side that caused the alignment to be broken on the
target side.

People are used to seeing gaps of size zero on the query side
all the time. They are not used to seeing them on the target side.
This is just the flip side of having an insert on the opposite side.
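
As a rough illustration of that flip side, here is a minimal Python
sketch. The block coordinates are made up for illustration (they are
not the real uc002imy.2 alignment), and block_gaps is a hypothetical
helper, not part of blat or the kent tools. It just measures the gap
between consecutive alignment blocks on each side.

```python
# Hypothetical PSL-style blocks as (qStart, tStart, size).
# These numbers are invented for illustration only.
blocks = [
    (0, 1000, 50),
    (50, 3000, 40),   # follows a q-gap of 0 and a t-gap of 1950
    (92, 3040, 30),   # follows a q-gap of 2 and a t-gap of 0
]

def block_gaps(blocks):
    """Return (q_gap, t_gap) between each pair of consecutive blocks."""
    gaps = []
    for (q0, t0, size0), (q1, t1, _size1) in zip(blocks, blocks[1:]):
        q_gap = q1 - (q0 + size0)   # unaligned query bases between blocks
        t_gap = t1 - (t0 + size0)   # unaligned target bases between blocks
        gaps.append((q_gap, t_gap))
    return gaps

for q_gap, t_gap in block_gaps(blocks):
    print("q-gap=%d t-gap=%d" % (q_gap, t_gap))
```

The first pair is the familiar case: nothing missing on the query,
a large gap on the target. The second pair is the mirror image: a
2-base insert on the query, and the two blocks touch on the target.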

I suppose that a human looking at these would merge them together.
The second-to-last intron is size 0 and the last intron is only
size 2, which is not biologically realistic.

So really, there are 9 exons here, not 11 (genepred) or 12 (blat psl).
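
To check the arithmetic, here is a minimal Python sketch that takes
the exonStarts/exonEnds from the knownGene row quoted below, computes
the intron sizes, and merges exons separated by implausibly small
introns. The MIN_INTRON cutoff of 10 and the helper names are
assumptions chosen for illustration; this is not something the
browser itself does.

```python
# exonStarts / exonEnds for uc002imy.2, copied from the quoted knownGene row.
EXON_STARTS = ("46103534,46105837,46106490,46109521,46110051,46110576,"
               "46111228,46114216,46115032,46115092,46115124,")
EXON_ENDS = ("46103841,46105876,46106542,46109599,46110107,46110668,"
             "46111310,46114291,46115092,46115122,46115152,")

MIN_INTRON = 10  # assumed cutoff; anything smaller is treated as a tiny gap

def parse_list(field):
    """Parse a comma-terminated genePred coordinate list into ints."""
    return [int(x) for x in field.strip(",").split(",")]

def merge_exons(starts, ends, min_intron):
    """Merge neighboring exons whose separating gap is below min_intron."""
    merged = []
    for start, end in zip(starts, ends):
        if merged and start - merged[-1][1] < min_intron:
            merged[-1][1] = end          # extend the previous exon
        else:
            merged.append([start, end])  # start a new exon
    return [tuple(exon) for exon in merged]

starts = parse_list(EXON_STARTS)
ends = parse_list(EXON_ENDS)

# Intron i lies between the end of exon i and the start of exon i+1.
intron_sizes = [s - e for s, e in zip(starts[1:], ends[:-1])]
merged = merge_exons(starts, ends, MIN_INTRON)

print("intron sizes:", intron_sizes)
print("exons before:", len(starts), "after:", len(merged))
```

With these coordinates the last two intron sizes come out as 0 and 2,
and merging across them takes the 11 genepred exons down to 9.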

-Galt

5/24/2011 8:18 AM, Brent Pedersen:
> On Sun, May 22, 2011 at 8:02 AM, Brent Pedersen<[email protected]>  wrote:
>> hi, I have grabbed some data from mysql like this:
>>
>> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D $ORG -P
>> 3306   -e "select
>> chrom,txStart,txEnd,cdsStart,cdsEnd,K.name,X.geneSymbol,proteinID,strand,exonStarts,exonEnds
>> from knownGene as K,kgXref as X where  X.kgId=K.name"
>>
>> I have a couple questions about the data. First, a row like this:
>>
>> chrom   txStart txEnd   cdsStart        cdsEnd  name    geneSymbol      
>> proteinID       strand  exonStarts      exonEnds
>> chr17    46103534       46115152        46103793        46115139        
>> uc002imy.2      COPZ2   Q9P299  -       
>> 46103534,46105837,46106490,46109521,46110051,46110576,46111228,46114216,46115032,46115092,46115124,
>>      
>> 46103841,46105876,46106542,46109599,46110107,46110668,46111310,46114291,46115092,46115122,46115152,
>>
>> note that the 2nd-to-last exonStart is the same as the 3rd-from-last
>> exonEnd: 46115092. Does this mean a 0 length intron? And what does
>> that mean within a transcript?
>
> Can anyone comment on this? Is there some way I can clarify the question?
> I find cases like this 185 times in hg19
> Thanks,
> -Brent
>
>
>>
>> Second question: for this same row; is it correct to infer that the
>> first exon in (0-based) bed format would be:
>>   start=46103534, end=46103841
>> and the first intron would be:
>>   start=46103841 end=46105837
>>
>> but then the problem is that start == end for the 0-length intron.
>>
>> I have seen this: http://genome.ucsc.edu/FAQ/FAQtracks#tracks1 so the
>> internal format matches the BED format, correct?
>>
>> thanks,
>> -Brent
>>
>
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome

