Hi Kyle, 

You're right, there is nothing explicit in the HTML returned from that command 
that gives mismatch coordinates. You will need to parse the alignment section 
and go through base-by-base to find the nearest mismatch to the SNP base. In 
HTML, the section starts like this: 

<PRE><B>Alignment between genome (hg19 chr16:79237287-79237887, + strand; 601 
bp) and dbSNP sequence (rs870; 601 bp)</ B > 
ID (including gaps) 99.8%, coverage (of both) 100.0% 

Then there are one or more triplets of lines like this (reference, 
match/mismatch indicators, flank): 

79237487 
AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC
 79237586 
         
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 
00000201 
AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC
 00000300 

Then the SNP base itself:

<B>79237587 G 79237587
          
00000301 R 00000301

And then more triplets (the first one beginning with </B>):

</B>79237588 
TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT
 79237687
         
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
00000302 
TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT
 00000401

As you noticed, when there is a mismatch, there is a space between the |'s.  So 
one approach would be to collect the |||'s from each flank, and step backwards 
from the end of the left flank & forwards from the start of the right flank, 
looking for the closest spaces.  I personally would prefer to collect the 
actual sequences, and step through reference bases vs. flanking bases -- I 
think bugs would become more obvious that way.  

It ain't pretty, but if I had to do it (without leveraging my experience with 
our C code base), that's how I would do it.  

Hope that helps,
Angie

----- "Kyle Tretina" <[email protected]> wrote: 
> From: "Kyle Tretina" <[email protected]> 
> To: "Angie Hinrichs" <[email protected]>, "UCSC" <[email protected]> 
> Sent: Monday, March 15, 2010 11:56:54 AM GMT -08:00 US/Canada Pacific 
> Subject: Re: [Genome] Stand Alone Blat Question 
> 
> 
Hello Angie, 

To answer a couple of your questions: Yes, my goal is to establish a mapping 
pipeline so that I can reproduce the details page re-alignment for a list of 
existing SNPs. However, if that sequence has a mismatch I only want the 
sequence around the reference SNP up to the base right before the mismatch. 
Also, in my last email I did tweak some | for illustration and those lines did 
come from r870 in hg19. 

So now my question is, if I follow your second suggestion and write a script 
that sends queries to your website, and then pick the part that I need from the 
returned HTML, is it possible to only get the sequence around the reference 
base that matches (as in my example in the last email and explanation above)? 
The reason that I am asking this is that in the URL generated by the script 
that you suggested in your last email (as seen below) 

http://genome.ucsc.edu/cgi-bin/hgc?db=hg19&c=$chrom&l=$start&r=$end&o=$start&t=$end&g=snp130&i=$rsID
 
> e.g. 
> http://genome.ucsc.edu/cgi-bin/hgc?db=hg19&c=chr16&l=79237586&r=79237587&o=79237586&t=79237587&g=snp130&i=rs870
>  
there does not seem to be any variable denoting mismatches or referring to them 
at all. I am assuming that this information can be gained from either the table 
browser, or by some alteration of BlatBot.pl. 



Kyle Tretina 
Junior 
Wheaton College 

> 
> On Mon, Mar 8, 2010 at 2:17 PM, Angie Hinrichs < [email protected] > wrote: 
> 

Hi Kyle, 
> 
> 
> > 1) First of all, in the last response I received I was told that the SNPs 
> > come from dbSNP - both the novel content and reference genome position - 
> > this is not parsed from BLAT output. However, this poses a problem for the 
> > sequences that I am looking for in particular. I only want the sequence 
> > near 
> > to the reference genome position that matches. 
> 
> To clarify your goal: are you establishing a mapping pipeline so that 
> you can map new SNPs given new flanking sequences, or are you looking 
> to reproduce the details page re-alignment for a list of existing 
> SNPs? The better we understand your desired output, the more detailed 
> suggestions we can give. 
> 
> 
> 
> > If the sequence below were taken as an example and I were to do this 
> > manually on the UCSC website, I would count the number of bases upstream 
> > from the reference genome position to the first mismatch upstream (in this 
> > case 71 bases upstream). Then I would could the number of bases downstream 
> > from the reference genome position to the first mismatch downstream (in 
> > this 
> > case 41 bases downstream). When I then went to click on "Get DNA," I would 
> > type in those numbers -- 71 bases upstream and 41 bases downstream. How do 
> > you think I should address this problem? 
> > 
> > 79237487 
> > AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC
> >  79237586 
> > ||||||||||||||||||||||||||| 
> > ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 
> > 00000201 
> > AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC
> >  00000300 
> > 
> > *79237587 G 79237587 
> > 
> > 00000301 R 00000301 
> > 
> > *79237588 
> > TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT
> >  79237687 
> > ||||||||||||||||||||||||||||||||||||||||||||| 
> > |||||||||||||||||||||||||||||||||||||||||||||||||||||| 
> > 00000302 
> > TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT
> >  00000401 
> 
> The automated way to "Get DNA" is to download the genomic sequence 
> ( http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit ) and 
> use our program twoBitToFa, either with the -seqList=file option or 
> -seq=chrom -start=N -end=N options. 
> 
> BTW that looks like 100 bases perfectly matching upstream and 100 
> bases perfectly matching downstream -- did you tweak some '|' to 
> spaces for illustration? It looks like those lines came from the 
> details page for rs870 in hg19, is that correct? 
> 
> If you want to duplicate the above alignment, and you have a local 
> copy of our source tree (see step 6 part d: 
> http://genome.ucsc.edu/admin/mirror.html#step6 ), then you can read our 
> C code that makes that alignment. The file is kent/src/hg/hgc/hgc.c 
> -- search for printSnpAlignment. At a high level, it retrieves the 
> dbSNP fasta record and genomic sequence, limits the length, and passes 
> the sequences to generateAlignment which calls axtAffine to perform 
> the re-alignment. A bit more detail about printSnpAlignment's code: 
> 
> 1. Retrieve the flanking sequences for the SNP from our local copy of 
> the rs_ch*.fas.gz data (snp130.fa). 
> 
> 2. Parse the fasta record into 3 parts: left flank, ambiguous base 
> for observed alleles, and right flank. If a flank is larger than 
> 1000 bases, trim it to the 1000 bases closest to the ambiguous base. 
> 
> 3. Subtract the length of the left flank from the SNP's reported 
> starting position on the genome (chromStart field of snp130 table), 
> and add the length of the right flank to the end position (chromEnd). 
> Make sure we don't get coords < 0 or > chromosome length. 
> 
> 4. Get the genomic sequence at those coordinates. If the SNP was 
> aligned to the - strand, reverse-complement the genomic sequence. 
> 
> 5. [You may not be interested in this part: Print the genomic sequence 
> for the left flank, the genomic sequence at the coords from dbSNP, and 
> the genomic sequence for the right flank. Print the dbSNP fasta 
> record's left flank, ambiguous allele and right flank.] 
> 
> 6. Next, the sequences are passed to generateAlignment. Search 
> backwards in the source to see the definition of generateAlignment. 
> Note that it calls axtScoreSchemeSimpleDna to get a scoring matrix and 
> gap penalties, and then passes the sequences and scoring scheme to 
> axtAffine. See kent/src/lib/axt.c and kent/src/lib/axtAffine.c . The 
> realignment is printed out block by block using printOffsetAndBoldAxt. 
> 
> It's a project just to read the C code -- *if* you want to duplicate 
> the alignment, you may want to consider writing a script that sends 
> queries to our web site, and then picks the part that you need from 
> the returned HTML. Your script should generate URLs like this: 
> 
> http://genome.ucsc.edu/cgi-bin/hgc?db=hg19&c=$chrom&l=$start&r=$end&o=$start&t=$end&g=snp130&i=$rsID
>  
> e.g. 
> http://genome.ucsc.edu/cgi-bin/hgc?db=hg19&c=chr16&l=79237586&r=79237587&o=79237586&t=79237587&g=snp130&i=rs870
>  
> 
> and fetch them following the guidelines in BlatBot.pl 
> ( http://genomewiki.ucsc.edu/index.php/Blat_Scripts ). Note the 
> $sleepTime in that script -- that keeps your usage within the limits 
> stated on our home page. Using cookies as the script does also 
> reduces impact on our site. 
> 
> To generate URLs like that, parse $chrom, $start, $end and $name from 
> a file created using the Table Browser -- paste/upload your rs# IDs, 
> then do output=selected fields + save to local file, and select chrom, 
> chromStart, chromEnd, and name fields for output. 
> 
> 
> 
> > 2) I believe that there was a miscommunication in one of my previous 
> > questions. I was making the observation that there were more results in the 
> > BLAT output than queries that had been used. I used the same parameters 
> > (stand alone BLAT) that are in the FAQ, and I was just wondering what 
> > determines which sequence is presented on the genome website. Should I use 
> > pslReps to solve this issue? Is that what the website does? 
> 
> Yes, pslReps is the right approach, but we have a newer program, 
> pslCDnaFilter, that we now recommend instead of pslReps. 
> 
> (The website is different; for untranslated nucleotide alignments, any 
> result with at least 20 matching bases is displayed.) 
> 
> Hope that helps, 
> Angie 
> 
> 
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome

Reply via email to