Hi Kyle, You could perform your own alignments of flanking sequences with blat, but you would still need to do some text processing in order to find mismatches. There might well be some cases where blat's results differ from dbSNP's, because dbSNP does independent alignments of the left and right flank with BLAST, and then uses their own process to join up left and right flank alignments to get SNP coordinates.
If you run blat with no command line arguments, you will see a basic usage description. The available output formats are these: -out=type Controls output file format. Type is one of: psl - Default. Tab separated format, no sequence pslx - Tab separated format with sequence axt - blastz-associated axt format maf - multiz-associated maf format sim4 - similar to sim4 format wublast - similar to wublast format blast - similar to NCBI blast format blast8- NCBI blast tabular format blast9 - NCBI blast tabular format with comments Blast format gives the familiar rows of bases (and -'s) separated by rows of |'s (and spaces). Several other output formats also include the aligned sequence (e.g. pslx, axt, maf). You could parse the aligned sequences (making note of the start coordinates and the strand) to determine the location of the SNP base and any mismatches or insertions/deletions. Angie ----- "Kyle Tretina" <[email protected]> wrote: > From: "Kyle Tretina" <[email protected]> > To: "Angie Hinrichs" <[email protected]>, "UCSC" <[email protected]> > Sent: Saturday, March 20, 2010 1:11:50 PM GMT -08:00 US/Canada Pacific > Subject: Re: [Genome] Stand Alone Blat Question > > Hi Angie, > It seems like writing a script that sends queries to your website would take a lot longer, given the time restraints posed by the site (1/15sec) than using stand alone Blat. Is there any way that I can modify stand alone Blat to get the same output as the website (including the SNP base)? > > > > Kyle Tretina > > > On Tue, Mar 16, 2010 at 11:15 PM, Angie Hinrichs < [email protected] > > wrote: > Hi Kyle, > > You're right, there is nothing explicit in the HTML returned from that > command that gives mismatch coordinates. You will need to parse the alignment > section and go through base-by-base to find the nearest mismatch to the SNP > base. In HTML, the section starts like this: > > <PRE><B>Alignment between genome (hg19 chr16:79237287-79237887, + strand; 601 > bp) and dbSNP sequence (rs870; 601 bp)</ B > > ID (including gaps) 99.8%, coverage (of both) 100.0% > > Then there are one or more triplets of lines like this (reference, > match/mismatch indicators, flank): > > > 79237487 > AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC > 79237586 > |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > > 00000201 > AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC > 00000300 > > Then the SNP base itself: > > <B>79237587 G 79237587 > > 00000301 R 00000301 > > And then more triplets (the first one beginning with </B>): > > </B>79237588 > TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT > 79237687 > |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > > 00000302 > TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT > 00000401 > > As you noticed, when there is a mismatch, there is a space between the |'s. > So one approach would be to collect the |||'s from each flank, and step > backwards from the end of the left flank & forwards from the start of the > right flank, looking for the closest spaces. I personally would prefer to > collect the actual sequences, and step through reference bases vs. flanking > bases -- I think bugs would become more obvious that way. > > It ain't pretty, but if I had to do it (without leveraging my experience with > our C code base), that's how I would do it. > > > Hope that helps, > Angie > > _______________________________________________ Genome maillist - [email protected] https://lists.soe.ucsc.edu/mailman/listinfo/genome
