Re: [Genome] Stand Alone Blat Question

Angie Hinrichs Mon, 08 Mar 2010 11:18:25 -0800

Hi Kyle,

> 1) First of all, in the last response I received I was told that the SNPs
> come from dbSNP  - both the novel content and reference genome position -
> this is not parsed from BLAT output. However, this poses a problem for the
> sequences that I am looking for in particular. I only want the sequence near
> to the reference genome position that matches.


To clarify your goal: are you establishing a mapping pipeline so that
you can map new SNPs given new flanking sequences, or are you looking
to reproduce the details page re-alignment for a list of existing
SNPs?  The better we understand your desired output, the more detailed
suggestions we can give.  


> If the sequence below were taken as an example and I were to do this
> manually on the UCSC website, I would count the number of bases upstream
> from the reference genome position to the first mismatch upstream (in this
> case 71 bases upstream). Then I would could the number of bases downstream
> from the reference genome position to the first mismatch downstream (in this
> case 41 bases downstream). When I then went to click on "Get DNA," I would
> type in those numbers -- 71 bases upstream and 41 bases downstream. How do
> you think I should address this problem?
> 
>  79237487 
> AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC
>  79237586
>           ||||||||||||||||||||||||||| 
> |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
>  00000201 
> AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC
>  00000300
> 
> *79237587 G 79237587
> 
>  00000301 R 00000301
> 
> *79237588 
> TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT
>  79237687
>           ||||||||||||||||||||||||||||||||||||||||||||| 
> ||||||||||||||||||||||||||||||||||||||||||||||||||||||
>  00000302 
> TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT
>  00000401

The automated way to "Get DNA" is to download the genomic sequence
(http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit) and
use our program twoBitToFa, either with the -seqList=file option or 
-seq=chrom -start=N -end=N options.  

BTW that looks like 100 bases perfectly matching upstream and 100
bases perfectly matching downstream -- did you tweak some '|' to
spaces for illustration?  It looks like those lines came from the
details page for rs870 in hg19, is that correct?

If you want to duplicate the above alignment, and you have a local
copy of our source tree (see step 6 part d:
http://genome.ucsc.edu/admin/mirror.html#step6), then you can read our
C code that makes that alignment.  The file is kent/src/hg/hgc/hgc.c
-- search for printSnpAlignment.  At a high level, it retrieves the
dbSNP fasta record and genomic sequence, limits the length, and passes
the sequences to generateAlignment which calls axtAffine to perform
the re-alignment.  A bit more detail about printSnpAlignment's code:

1. Retrieve the flanking sequences for the SNP from our local copy of
the rs_ch*.fas.gz data (snp130.fa).

2. Parse the fasta record into 3 parts: left flank, ambiguous base
for observed alleles, and right flank.  If a flank is larger than
1000 bases, trim it to the 1000 bases closest to the ambiguous base.  

3. Subtract the length of the left flank from the SNP's reported
starting position on the genome (chromStart field of snp130 table),
and add the length of the right flank to the end position (chromEnd).
Make sure we don't get coords < 0 or > chromosome length.  

4. Get the genomic sequence at those coordinates.  If the SNP was
aligned to the - strand, reverse-complement the genomic sequence.

5. [You may not be interested in this part: Print the genomic sequence
for the left flank, the genomic sequence at the coords from dbSNP, and
the genomic sequence for the right flank.  Print the dbSNP fasta
record's left flank, ambiguous allele and right flank.]

6. Next, the sequences are passed to generateAlignment.  Search
backwards in the source to see the definition of generateAlignment.
Note that it calls axtScoreSchemeSimpleDna to get a scoring matrix and
gap penalties, and then passes the sequences and scoring scheme to
axtAffine.  See kent/src/lib/axt.c and kent/src/lib/axtAffine.c .  The
realignment is printed out block by block using printOffsetAndBoldAxt.

It's a project just to read the C code -- *if* you want to duplicate
the alignment, you may want to consider writing a script that sends
queries to our web site, and then picks the part that you need from
the returned HTML.  Your script should generate URLs like this:

http://genome.ucsc.edu/cgi-bin/hgc?db=hg19&c=$chrom&l=$start&r=$end&o=$start&t=$end&g=snp130&i=$rsID
e.g. 
http://genome.ucsc.edu/cgi-bin/hgc?db=hg19&c=chr16&l=79237586&r=79237587&o=79237586&t=79237587&g=snp130&i=rs870

and fetch them following the guidelines in BlatBot.pl
(http://genomewiki.ucsc.edu/index.php/Blat_Scripts).  Note the
$sleepTime in that script -- that keeps your usage within the limits
stated on our home page.  Using cookies as the script does also
reduces impact on our site.  

To generate URLs like that, parse $chrom, $start, $end and $name from
a file created using the Table Browser -- paste/upload your rs# IDs,
then do output=selected fields + save to local file, and select chrom,
chromStart, chromEnd, and name fields for output.


> 2) I believe that there was a miscommunication in one of my previous
> questions. I was making the observation that there were more results in the
> BLAT output than queries that had been used. I used the same parameters
> (stand alone BLAT) that are in the FAQ, and I was just wondering what
> determines which sequence is presented on the genome website. Should I use
> pslReps to solve this issue? Is that what the website does?

Yes, pslReps is the right approach, but we have a newer program,
pslCDnaFilter, that we now recommend instead of pslReps.  

(The website is different; for untranslated nucleotide alignments, any
result with at least 20 matching bases is displayed.)

Hope that helps,
Angie



----- "Kyle Tretina" <[email protected]> wrote:

> From: "Kyle Tretina" <[email protected]>
> To: "UCSC" <[email protected]>
> Sent: Thursday, March 4, 2010 9:31:04 PM GMT -08:00 US/Canada Pacific
> Subject: Re: [Genome] Stand Alone Blat Question
>
> Thank you very much for your help thus far. I do have a few questions in
> response:
> 
> 1) First of all, in the last response I received I was told that the SNPs
> come from dbSNP  - both the novel content and reference genome position -
> this is not parsed from BLAT output. However, this poses a problem for the
> sequences that I am looking for in particular. I only want the sequence near
> to the reference genome position that matches.
> 
> If the sequence below were taken as an example and I were to do this
> manually on the UCSC website, I would count the number of bases upstream
> from the reference genome position to the first mismatch upstream (in this
> case 71 bases upstream). Then I would could the number of bases downstream
> from the reference genome position to the first mismatch downstream (in this
> case 41 bases downstream). When I then went to click on "Get DNA," I would
> type in those numbers -- 71 bases upstream and 41 bases downstream. How do
> you think I should address this problem?
> 
> 79237487 
> AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC
>  79237586
>          ||||||||||||||||||||||||||| 
> |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> 00000201 
> AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC
>  00000300
> 
> *79237587 G 79237587
>  00000301 R 00000301
> 
> *79237588 
> TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT
>  79237687
>           ||||||||||||||||||||||||||||||||||||||||||||| 
> ||||||||||||||||||||||||||||||||||||||||||||||||||||||
>  00000302 
> TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT
>  00000401
> 
> 
> 
> 2) I believe that there was a miscommunication in one of my previous
> questions. I was making the observation that there were more results in the
> BLAT output than queries that had been used. I used the same parameters
> (stand alone BLAT) that are in the FAQ, and I was just wondering what
> determines which sequence is presented on the genome website. Should I use
> pslReps to solve this issue? Is that what the website does?
> 
> 
> Thank you for all of your help,
> 
> Kyle Tretina
> Wheaton College
> 
> 
> On Thu, Mar 4, 2010 at 7:38 PM, Jennifer Jackson <[email protected]>
> wrote:
> 
> > Hi Kyle,
> >
> > The best place to find out how UCSC did the processing for a track
> is in
> > what we call the "makedoc". There is one per assembly containing the
> track
> > build processing. Makedocs are located in the kent source tree.
> >
> > kent/src/hg/makeDb/doc/hg18.txt
> > kent/src/hg/makeDb/doc/hg19.txt
> >
> > These documents can also be browsed online at:
> > http://genome-test.cse.ucsc.edu/~kent/src/unzipped/hg/makeDb/doc/
> > (try the link tomorrow, access to genome-test is limited right now)
> >
> > All information is in the makedoc or the track description page or
> the
> > online FAQ, but in summary:
> >
> > For the flanking sequence question:
> >
> > - 1 Specific for this track, the flanking sequence for both
> assemblies is
> > at gbdb/hg18/snp/snp130.fa and can be downloaded using ftp to the
> downloads
> > server. (the flanking sequence data was the same for hg18 & hg19, so
> we did
> > not duplicate it.)
> >
> > For the BLAT questions:
> >
> > - 2a The SNPs come from dbSNP  - both the novel content and
> reference
> > genome position - this is not parsed from BLAT output
> >
> > - 2b IUPAC characters are declared on the SNP track's description
> page.
> > Section = Re-alignment of the SNP's flanking sequences to the
> genomic
> > sequence
> > dbSNP flanking sequences and observed allele code for rsXXXXX:
> > (Uses IUPAC ambiguity codes)
> > Go into the track description for the link or directly see:
> > http://genome.ucsc.edu/goldenPath/help/iupac.html
> >
> > - 3 BLAT documentation comes with the software when download, but is
> also
> > online here:
> > http://genome.ucsc.edu/goldenPath/help/blatSpec.html
> >
> > This should help you get going again, but please let us know if you
> need
> > more help Kyle,
> >
> > Jennifer
> >
> > ---------------------------------
> > Jennifer Jackson
> > UCSC Genome Bioinformatics Group
> > http://genome.ucsc.edu/
> >
> >
> > On 3/3/10 3:33 PM, Kyle Tretina wrote:
> >
> >> To whom it may concern,
> >>
> >> I wish to batch automate the re-alignment of all SNP flanking
> sequences on
> >> chromosome 16 from
> >> ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/rs_fasta/
> >>  and align that to the genomic download for chr16 at
> >> http://hgdownload.cse.ucsc.edu/downloads.html.<
> >> http://hgdownload.cse.ucsc.edu/downloads.html>I
> >>
> >> have a couple of questions:
> >>
> >> 1) Is there anywhere that I can get the SNP flanking sequences
> customized
> >> (i.e. only get the positively selected SNP sequences if I have a
> list of
> >> all
> >> of the rs numbers for those SNP's, and the same for the
> non-positively
> >> selected)?
> >>
> >> 2) I did a test BLAT run using only a portion of the queries in the
> SNP
> >> sequences and I then converted it to the human readable output
> using
> >> pslPretty. However, I was wondering
> >>    a) How I could identify the base to which the SNP referrs to in
> the
> >> pslPretty output, as it is on the website (see example below)?
> Below it
> >> seems to be identified by a "G" for the genomic sequence (?) and an
> (R)
> >> for
> >> the reference sequence (?)
> >>    b) I am assuming that the output that I received for this test
> run was
> >> in
> >> order from highest score to lowest for each query. Is there any way
> to
> >> modify the parameters so that only the result with the highest
> score is in
> >> the output file? Is this what happens on the ucsc website?
> >>
> >> 79237487
> >>
> AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC
> >> 79237586
> >>
> >> 
> ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> >> 00000201
> >>
> AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC
> >> 00000300
> >>
> >> *79237587 G 79237587
> >>
> >> 00000301 R 00000301
> >>
> >> *79237588
> >>
> TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT
> >> 79237687
> >>
> >> 
> ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> >> 00000302
> >>
> TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT
> >> 00000401
> >>
> >> 3) Finally, I was wondering if there were any
> documents/descriptions
> >> online
> >> of common modifications of BLAT and pslPretty.
> >>
> >>
> >>
> >>
> >> I apologize for the length of this email. I am an undergraduate
> >> bioinformatics intern, and so I have to ask for your patient in
> helping
> >> me.
> >>
> >> Kyle Tretina
> >> Junior
> >> Wheaton College
> >> _______________________________________________
> >> Genome maillist  -  [email protected]
> >> https://lists.soe.ucsc.edu/mailman/listinfo/genome
> >>
> >
> >
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome

Re: [Genome] Stand Alone Blat Question

Reply via email to