Hi, Brent! The psl record found in kgTargetAli has already been changed for genepred use.
If you just blat uc002imy.2 to hg19 you can see everything that's happening.

1. 16 bases of the poly-A tail are removed from the picture.
2. A small exon of 5 bases is merged into its neighbor exon.
   (This wipes out 1 tiny q-gap.)
3. 2 tiny q-gaps are ignored since they cannot be represented in genepred.
   The original blat psl shows them though.

>uc002imy.2 (COPZ2) length=923
ggcggcgagcggaatgcagcggcccgaggcctggccacgtccgcacccgggggagggggc
cgcggcggcccaggccgggggcccggcgccgcctgctcgagccggggagccctcggggct
gcggttgcaggaaccttccctctacaccatcaaggctgttttcatcctagataatgacgg
gcgccggctgctggccaagtattatgatgacacattcccctccatgaaggagcagatggt
tttcgagaaaaatgtcttcaacaagaccagccggactgagagtgagattgcattttttgg
gggtatgaccatcgtctacaagaacagcattgacctcttcctatacgtggtgggctcatc
ctacgagaatgagctgatgctcatgtctgttctcacctgcctgtttgagtctctgaacca
catgttaaggaagaacgtggagaagcgctggttgctggagaacatggacggagccttctt
ggtgctggacgagattgtggatggcggtgtgattctggagagtgacccccagcaagtgat
ccagaaggtgaattttagggcagatgatggcggcttgactgaacagagtgtggcccaggt
tcttcagtctgccaaggaacaaattaaatggtcgttattgaaatgaaggctgtggattca
aggctccctgccccccagatcatttccccaatcctggcaaaagcccaaagatcccagggt
caggagagacccctctgtatccccaggtccctcccagaactgactcctaaggtctccagc
cagggcttctgagatgcaaaggtttggcctcaggagagtcaccttttctcacggccctgg
ccttaactcatatcttaggcattcctggccccagggccctaataaacctgcttttgtctt
ctgccaaaaaaaaaaaaaaaaaa

The quick answer is that there were tiny 1 and 2 bp gaps on the query side
that caused the alignment to be broken on the target side.

People are used to seeing gaps of size zero on the query side all the time.
They are not used to seeing them on the target side. This is just the
flip-side of having an insert on the opposite side.

I suppose that a human looking at these would merge them together.
If the second to the last intron is size 0, the last intron is only size 2,
which is not biologically realistic.
So really, there are 9 exons here, not 11 (genepred) or 12 (blat psl).
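To make the arithmetic concrete, here is a rough Python sketch (not the code the browser uses) that takes the exonStarts/exonEnds from the knownGene row quoted below, treats them as 0-based half-open intervals, lists the intron sizes, and merges exons separated by an unrealistically small gap. The helper names and the MIN_INTRON cutoff of 10 bp are my own assumptions for illustration only.

```python
# Exon coordinates for uc002imy.2 (hg19, chr17, minus strand), copied from
# the knownGene row in the quoted question. 0-based starts, half-open ends.
EXON_STARTS = ("46103534,46105837,46106490,46109521,46110051,46110576,"
               "46111228,46114216,46115032,46115092,46115124,")
EXON_ENDS = ("46103841,46105876,46106542,46109599,46110107,46110668,"
             "46111310,46114291,46115092,46115122,46115152,")

# Hypothetical cutoff: gaps shorter than this are treated as alignment
# artifacts rather than real introns. Chosen only for this illustration.
MIN_INTRON = 10


def parse_coords(field):
    """Parse a comma-terminated coordinate list into a list of ints."""
    return [int(x) for x in field.split(",") if x]


def intron_sizes(starts, ends):
    """Size of each gap between one exon's end and the next exon's start."""
    return [s - e for s, e in zip(starts[1:], ends)]


def merge_close_exons(starts, ends, min_intron):
    """Merge neighboring exons whose separating gap is below min_intron."""
    merged = [[starts[0], ends[0]]]
    for s, e in zip(starts[1:], ends[1:]):
        if s - merged[-1][1] < min_intron:
            merged[-1][1] = e          # extend the previous exon
        else:
            merged.append([s, e])      # start a new exon
    return [tuple(exon) for exon in merged]


starts = parse_coords(EXON_STARTS)
ends = parse_coords(EXON_ENDS)
sizes = intron_sizes(starts, ends)
merged = merge_close_exons(starts, ends, MIN_INTRON)

print("genepred exons:", len(starts))
print("intron sizes:  ", sizes)
print("merged exons:  ", len(merged))
```

With these coordinates the last two gaps come out as 0 and 2 bases, and merging across them leaves 9 exons, the last one spanning 46115032 to 46115152.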
-Galt

5/24/2011 8:18 AM, Brent Pedersen:
> On Sun, May 22, 2011 at 8:02 AM, Brent Pedersen<[email protected]> wrote:
>> hi, I have grabbed some data from mysql like this:
>>
>> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D $ORG -P
>> 3306 -e "select
>> chrom,txStart,txEnd,cdsStart,cdsEnd,K.name,X.geneSymbol,proteinID,strand,exonStarts,exonEnds
>> from knownGene as K,kgXref as X where X.kgId=K.name
>>
>> I have a couple questions about the data. First, a row like this:
>>
>> chrom txStart txEnd cdsStart cdsEnd name geneSymbol
>> proteinID strand exonStarts exonEnds
>> chr17 46103534 46115152 46103793 46115139
>> uc002imy.2 COPZ2 Q9P299 -
>> 46103534,46105837,46106490,46109521,46110051,46110576,46111228,46114216,46115032,46115092,46115124,
>>
>> 46103841,46105876,46106542,46109599,46110107,46110668,46111310,46114291,46115092,46115122,46115152,
>>
>> note that the 2nd-to-last exonStart is the same as the 3rd-from-last
>> exonEnd: 46115092. Does this mean a 0 length intron? And what does
>> that mean within a transcript?
>
> Can anyone comment on this? Is there some way I can clarify the question?
> I find cases like this 185 times in hg19
> Thanks,
> -Brent
>
>
>>
>> Second question: for this same row; is it correct to infer that the
>> first exon in (0-based) bed format would be:
>> start=46103534, end=46103841
>> and the first intron would be:
>> start=46103841 end=46105837
>>
>> but then the problem is that start == end in for the 0-length intron.
>>
>> I have seen this: http://genome.ucsc.edu/FAQ/FAQtracks#tracks1 so the
>> internal format matches the BED format, correct?
>>
>> thanks,
>> -Brent
>>
>
> _______________________________________________
> Genome maillist - [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
