On Tue, May 24, 2011 at 9:30 AM, Galt Barber <[email protected]> wrote:
>
> Hi, Brent!
>
> The psl record found in kgTargetAli has already been changed for genepred
> use.
>
> If you just blat uc002imy.2 to hg19 you can see everything that's happening.
>
> 1. 16-bases of the poly-A tail are removed from the picture.
> 2. A small exon of 5 bases is merged into its neighbor exon.
>    (This wipes out 1 tiny q-gap).
> 3. 2 tiny q-gaps are ignored since they cannot be represented in genepred.
>
> The original blat psl shows them though.
>
> >uc002imy.2 (COPZ2) length=923
> ggcggcgagcggaatgcagcggcccgaggcctggccacgtccgcacccgggggagggggc
> cgcggcggcccaggccgggggcccggcgccgcctgctcgagccggggagccctcggggct
> gcggttgcaggaaccttccctctacaccatcaaggctgttttcatcctagataatgacgg
> gcgccggctgctggccaagtattatgatgacacattcccctccatgaaggagcagatggt
> tttcgagaaaaatgtcttcaacaagaccagccggactgagagtgagattgcattttttgg
> gggtatgaccatcgtctacaagaacagcattgacctcttcctatacgtggtgggctcatc
> ctacgagaatgagctgatgctcatgtctgttctcacctgcctgtttgagtctctgaacca
> catgttaaggaagaacgtggagaagcgctggttgctggagaacatggacggagccttctt
> ggtgctggacgagattgtggatggcggtgtgattctggagagtgacccccagcaagtgat
> ccagaaggtgaattttagggcagatgatggcggcttgactgaacagagtgtggcccaggt
> tcttcagtctgccaaggaacaaattaaatggtcgttattgaaatgaaggctgtggattca
> aggctccctgccccccagatcatttccccaatcctggcaaaagcccaaagatcccagggt
> caggagagacccctctgtatccccaggtccctcccagaactgactcctaaggtctccagc
> cagggcttctgagatgcaaaggtttggcctcaggagagtcaccttttctcacggccctgg
> ccttaactcatatcttaggcattcctggccccagggccctaataaacctgcttttgtctt
> ctgccaaaaaaaaaaaaaaaaaa
>
> The quick answer is that there were tiny 1 and 2 bp gaps on
> the query side that caused the alignment to be broken on the
> target side.
>
> People are used to seeing gaps of size zero on the query side
> all the time. They are not used to seeing it on the target side.
> This is just the flip-side of having an insert on the opposite side.
>
> I suppose that a human looking at these would merge them together.
> If the second to the last intron is size 0, the last intron is only
> size 2 which is not biologically realistic.
>
> So really, there are 9 exons here, not 11 (genepred) or 12 (blat psl).
>
> -Galt
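
To make the exon counts concrete, here is a minimal Python sketch of the kind of merge a human might do by eye. It assumes the exonStarts/exonEnds strings from the knownGene row quoted in this thread; `parse_coords`, `merge_close_exons`, and the `max_gap` threshold are hypothetical names chosen for illustration, and this is not a description of how the UCSC genepred conversion itself works.

```python
# exonStarts / exonEnds for uc002imy.2 (COPZ2), copied from the row in this thread.
EXON_STARTS = ("46103534,46105837,46106490,46109521,46110051,46110576,"
               "46111228,46114216,46115032,46115092,46115124,")
EXON_ENDS = ("46103841,46105876,46106542,46109599,46110107,46110668,"
             "46111310,46114291,46115092,46115122,46115152,")


def parse_coords(field):
    """Split a comma-separated coordinate string, skipping the trailing comma."""
    return [int(x) for x in field.split(",") if x]


def merge_close_exons(starts, ends, max_gap=2):
    """Merge neighbouring exons whose separating gap is at most max_gap bases.

    max_gap is an illustrative threshold (an assumption), not a UCSC parameter.
    """
    merged = [[starts[0], ends[0]]]
    for start, end in zip(starts[1:], ends[1:]):
        if start - merged[-1][1] <= max_gap:
            merged[-1][1] = end          # extend the previous exon
        else:
            merged.append([start, end])  # a real intron: begin a new exon
    return [tuple(exon) for exon in merged]


starts = parse_coords(EXON_STARTS)
ends = parse_coords(EXON_ENDS)
merged = merge_close_exons(starts, ends, max_gap=2)

print(len(starts), "exons in genepred,", len(merged), "after merging")
print("last merged exon:", merged[-1])
```

With `max_gap=2` the last three genepred exons collapse into one, which gives the 9 exons mentioned above; with `max_gap=0` only the zero-length intron is removed and 10 exons remain.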
Thanks Galt, this makes sense. I will tinker with BLAT and try to understand fully.
-B

>
> 5/24/2011 8:18 AM, Brent Pedersen:
>>
>> On Sun, May 22, 2011 at 8:02 AM, Brent Pedersen<[email protected]>
>> wrote:
>>>
>>> hi, I have grabbed some data from mysql like this:
>>>
>>> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D $ORG -P
>>> 3306 -e "select
>>>
>>> chrom,txStart,txEnd,cdsStart,cdsEnd,K.name,X.geneSymbol,proteinID,strand,exonStarts,exonEnds
>>> from knownGene as K,kgXref as X where X.kgId=K.name
>>>
>>> I have a couple questions about the data. First, a row like this:
>>>
>>> chrom txStart txEnd cdsStart cdsEnd name geneSymbol
>>> proteinID strand exonStarts exonEnds
>>> chr17 46103534 46115152 46103793 46115139
>>> uc002imy.2 COPZ2 Q9P299 -
>>> 46103534,46105837,46106490,46109521,46110051,46110576,46111228,46114216,46115032,46115092,46115124,
>>>
>>> 46103841,46105876,46106542,46109599,46110107,46110668,46111310,46114291,46115092,46115122,46115152,
>>>
>>> note that the 2nd-to-last exonStart is the same as the 3rd-from-last
>>> exonEnd: 46115092. Does this mean a 0 length intron? And what does
>>> that mean within a transcript?
>>
>> Can anyone comment on this? Is there some way I can clarify the question?
>> I find cases like this 185 times in hg19
>> Thanks,
>> -Brent
>>
>>
>>>
>>> Second question: for this same row; is it correct to infer that the
>>> first exon in (0-based) bed format would be:
>>> start=46103534, end=46103841
>>> and the first intron would be:
>>> start=46103841 end=46105837
>>>
>>> but then the problem is that start == end for the 0-length intron.
>>>
>>> I have seen this: http://genome.ucsc.edu/FAQ/FAQtracks#tracks1 so the
>>> internal format matches the BED format, correct?
>>>
>>> thanks,
>>> -Brent
>>>
>>
>> _______________________________________________
>> Genome maillist - [email protected]
>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
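
On the second question in the thread, a small Python sketch of how exon and intron intervals fall out of the row. It assumes the knownGene coordinates are 0-based, half-open (the same convention as BED), so each intron runs from one exon's end to the next exon's start. The helper names `parse_coords` and `exon_intron_intervals` are hypothetical, introduced only for this illustration.

```python
# exonStarts / exonEnds for uc002imy.2, copied from the row quoted in the thread.
EXON_STARTS = ("46103534,46105837,46106490,46109521,46110051,46110576,"
               "46111228,46114216,46115032,46115092,46115124,")
EXON_ENDS = ("46103841,46105876,46106542,46109599,46110107,46110668,"
             "46111310,46114291,46115092,46115122,46115152,")


def parse_coords(field):
    """Split a comma-separated coordinate string, skipping the trailing comma."""
    return [int(x) for x in field.split(",") if x]


def exon_intron_intervals(exon_starts, exon_ends):
    """Return (exons, introns) as lists of 0-based, half-open (start, end) pairs."""
    starts = parse_coords(exon_starts)
    ends = parse_coords(exon_ends)
    exons = list(zip(starts, ends))
    # Intron i spans from the end of exon i to the start of exon i + 1.
    introns = list(zip(ends[:-1], starts[1:]))
    return exons, introns


exons, introns = exon_intron_intervals(EXON_STARTS, EXON_ENDS)

# In half-open coordinates, start == end is simply an empty interval.
zero_length = [(s, e) for s, e in introns if s == e]

print("first exon:  ", exons[0])
print("first intron:", introns[0])
print("zero-length introns:", zero_length)
```

Under that assumption the first exon and first intron match the values in the question, and the zero-length intron shows up as an empty interval at 46115092 that downstream code can filter out or merge away.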
