Hi Brian, Thanks a bunch. This is a great explanation which I'm sure will help many who parse MAF files. I should also point out to others that the total size of the scaffold (23293914 in this case) is actually included in the MAF file so you don't need to look-up this info elsewhere to perform the computation.
Thanks again, Jaaved -----Original Message----- From: [email protected] [mailto:[email protected]] On Behalf Of Brian Raney Sent: Friday, January 28, 2011 5:40 PM To: Jaaved Mohammed Cc: [email protected] Subject: Re: [Genome] Insect MAF file coordinate-sequence disagreement Hey Jaaved, Thanks for the very detailed problem report. The behavior you are seeing is due to how MAF coordinates are defined. If an alignment is on the negative (-) strand, then the coordinate reported is the number of bases from the *end* of the sequence. This means that to convert the coordinates in the MAF file to the coordinates that blat gives you, or to extract sequence from a FASTA or 2bit file, you need to subtract the end coordinate of the range, from the total size of the scaffold. So, if your MAF coordinate is scaffold_13337 1466195 1466344, then to calculate the start coordinate of the region on the positive strand, you need to take the size of scaffold_13337, which is 23293914, and subtract the end coordinate of your range from it: 23293914 - 1466344 = 21827570 which is the 0-based coordinate where you can find the sequence from the MAF block in the FASTA file. Remember that the browser reports coordinates in a 1-based system, but the MAF and PSL files are in 0-based coordinates. We realize that these coordinate systems can be confusing, but using these systems internally makes it easier to avoid different code for alignments on the different strands. I hope this answers your questions. If you have any follow-up questions, please respond to this list. Brian On Thu, Jan 27, 2011 at 2:15 PM, Jaaved Mohammed <[email protected]> wrote: > Hello, > > I am seeing numerous discrepancy between the sequences and coordinates in > the insect 15-way multiple-alignment files > (http://hgdownload.cse.ucsc.edu/goldenPath/dm3/multiz15way/) primarily for > D. ananassae and D. virilus. Hopefully I can explain this as concisely as > possible, however, let me know how I can clarify: > > For example: > 1. I extract the MAF block of a popular melanogaster miRNA (dme-mir-289, > chr3L:13613907-13614035)using the maf_parse program from Adam Siepel's lab: > maf_parse -o MAF --start 13613907 --end 13614035 chr3L.maf > See the attached dme-mir-289.maf output file. Notice that the droAna3 > sequence in this file is > "CAGCTCGGGTTTTAGGTTGAGTTTACAGTAAAATAAATATTTAAGTGGAGCCTGCGACTctgctactgccactgc > cactgccactgccactgccGCTCGGGGAGTCACTTGAGCGTTTGTTGGCACGTAAAAGACATCATAATTAGCATT" > and the coordinate is scaffold_13337:1466195-1466344 -. > > 2. When I extract the sequence of this coordinate from the droAna3.fa file > (http://hgdownload.cse.ucsc.edu/goldenPath/droAna3/bigZips/) and reverse > complement it, I get a completely different sequence than that reported (see > the file droAna3_289_maf.fa). > > 3. When I blat the sequence reported in the MAF file against droAna3.fa, I > get a best scoring coordinate that is different from the MAF coordinate. > That is, scaffold_13337:21827571-21827720 -. No blat reported coordinate > agrees with the MAF coordinate (see droAna3_blatOutput.txt for all blat > reported coordinates). When I extract the sequence of the blat reported > coordinate and reverse complement it (see droAna3_289_blat.fa), this > sequence agrees with the MAF reported sequence. > > I can furnish additional examples upon request. I really appreciate any time > and effort spent helping me explain this bizarre observation. > > Thanks, > Jaaved > > -- > Jaaved Mohammed, > Ph.D. Student of Computational Biology > Tri-Institutional Training Program in Computational Biology and Medicine > (Cornell University - Ithaca, Weill Cornell Medical College, and Memorial > Sloan-Kettering Cancer Center) > > > > _______________________________________________ > Genome maillist - [email protected] > https://lists.soe.ucsc.edu/mailman/listinfo/genome > > _______________________________________________ Genome maillist - [email protected] https://lists.soe.ucsc.edu/mailman/listinfo/genome
