Hey Jaaved, Thanks for the very detailed problem report. The behavior you are seeing is due to how MAF coordinates are defined. If an alignment is on the negative (-) strand, then the coordinate reported is the number of bases from the *end* of the sequence. This means that to convert the coordinates in the MAF file to the coordinates that blat gives you, or to extract sequence from a FASTA or 2bit file, you need to subtract the end coordinate of the range, from the total size of the scaffold.
So, if your MAF coordinate is scaffold_13337 1466195 1466344, then to calculate the start coordinate of the region on the positive strand, you need to take the size of scaffold_13337, which is 23293914, and subtract the end coordinate of your range from it: 23293914 - 1466344 = 21827570 which is the 0-based coordinate where you can find the sequence from the MAF block in the FASTA file. Remember that the browser reports coordinates in a 1-based system, but the MAF and PSL files are in 0-based coordinates. We realize that these coordinate systems can be confusing, but using these systems internally makes it easier to avoid different code for alignments on the different strands. I hope this answers your questions. If you have any follow-up questions, please respond to this list. Brian On Thu, Jan 27, 2011 at 2:15 PM, Jaaved Mohammed <[email protected]> wrote: > Hello, > > I am seeing numerous discrepancy between the sequences and coordinates in > the insect 15-way multiple-alignment files > (http://hgdownload.cse.ucsc.edu/goldenPath/dm3/multiz15way/) primarily for > D. ananassae and D. virilus. Hopefully I can explain this as concisely as > possible, however, let me know how I can clarify: > > For example: > 1. I extract the MAF block of a popular melanogaster miRNA (dme-mir-289, > chr3L:13613907-13614035)using the maf_parse program from Adam Siepel's lab: > maf_parse -o MAF --start 13613907 --end 13614035 chr3L.maf > See the attached dme-mir-289.maf output file. Notice that the droAna3 > sequence in this file is > "CAGCTCGGGTTTTAGGTTGAGTTTACAGTAAAATAAATATTTAAGTGGAGCCTGCGACTctgctactgccactgc > cactgccactgccactgccGCTCGGGGAGTCACTTGAGCGTTTGTTGGCACGTAAAAGACATCATAATTAGCATT" > and the coordinate is scaffold_13337:1466195-1466344 -. > > 2. When I extract the sequence of this coordinate from the droAna3.fa file > (http://hgdownload.cse.ucsc.edu/goldenPath/droAna3/bigZips/) and reverse > complement it, I get a completely different sequence than that reported (see > the file droAna3_289_maf.fa). > > 3. When I blat the sequence reported in the MAF file against droAna3.fa, I > get a best scoring coordinate that is different from the MAF coordinate. > That is, scaffold_13337:21827571-21827720 -. No blat reported coordinate > agrees with the MAF coordinate (see droAna3_blatOutput.txt for all blat > reported coordinates). When I extract the sequence of the blat reported > coordinate and reverse complement it (see droAna3_289_blat.fa), this > sequence agrees with the MAF reported sequence. > > I can furnish additional examples upon request. I really appreciate any time > and effort spent helping me explain this bizarre observation. > > Thanks, > Jaaved > > -- > Jaaved Mohammed, > Ph.D. Student of Computational Biology > Tri-Institutional Training Program in Computational Biology and Medicine > (Cornell University - Ithaca, Weill Cornell Medical College, and Memorial > Sloan-Kettering Cancer Center) > > > > _______________________________________________ > Genome maillist - [email protected] > https://lists.soe.ucsc.edu/mailman/listinfo/genome > > _______________________________________________ Genome maillist - [email protected] https://lists.soe.ucsc.edu/mailman/listinfo/genome
