Hi Kyle, > 1) First of all, in the last response I received I was told that the SNPs > come from dbSNP - both the novel content and reference genome position - > this is not parsed from BLAT output. However, this poses a problem for the > sequences that I am looking for in particular. I only want the sequence near > to the reference genome position that matches.
To clarify your goal: are you establishing a mapping pipeline so that you can map new SNPs given new flanking sequences, or are you looking to reproduce the details page re-alignment for a list of existing SNPs? The better we understand your desired output, the more detailed suggestions we can give. > If the sequence below were taken as an example and I were to do this > manually on the UCSC website, I would count the number of bases upstream > from the reference genome position to the first mismatch upstream (in this > case 71 bases upstream). Then I would could the number of bases downstream > from the reference genome position to the first mismatch downstream (in this > case 41 bases downstream). When I then went to click on "Get DNA," I would > type in those numbers -- 71 bases upstream and 41 bases downstream. How do > you think I should address this problem? > > 79237487 > AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC > 79237586 > ||||||||||||||||||||||||||| > ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > 00000201 > AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC > 00000300 > > *79237587 G 79237587 > > 00000301 R 00000301 > > *79237588 > TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT > 79237687 > ||||||||||||||||||||||||||||||||||||||||||||| > |||||||||||||||||||||||||||||||||||||||||||||||||||||| > 00000302 > TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT > 00000401 The automated way to "Get DNA" is to download the genomic sequence (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit) and use our program twoBitToFa, either with the -seqList=file option or -seq=chrom -start=N -end=N options. BTW that looks like 100 bases perfectly matching upstream and 100 bases perfectly matching downstream -- did you tweak some '|' to spaces for illustration? It looks like those lines came from the details page for rs870 in hg19, is that correct? If you want to duplicate the above alignment, and you have a local copy of our source tree (see step 6 part d: http://genome.ucsc.edu/admin/mirror.html#step6), then you can read our C code that makes that alignment. The file is kent/src/hg/hgc/hgc.c -- search for printSnpAlignment. At a high level, it retrieves the dbSNP fasta record and genomic sequence, limits the length, and passes the sequences to generateAlignment which calls axtAffine to perform the re-alignment. A bit more detail about printSnpAlignment's code: 1. Retrieve the flanking sequences for the SNP from our local copy of the rs_ch*.fas.gz data (snp130.fa). 2. Parse the fasta record into 3 parts: left flank, ambiguous base for observed alleles, and right flank. If a flank is larger than 1000 bases, trim it to the 1000 bases closest to the ambiguous base. 3. Subtract the length of the left flank from the SNP's reported starting position on the genome (chromStart field of snp130 table), and add the length of the right flank to the end position (chromEnd). Make sure we don't get coords < 0 or > chromosome length. 4. Get the genomic sequence at those coordinates. If the SNP was aligned to the - strand, reverse-complement the genomic sequence. 5. [You may not be interested in this part: Print the genomic sequence for the left flank, the genomic sequence at the coords from dbSNP, and the genomic sequence for the right flank. Print the dbSNP fasta record's left flank, ambiguous allele and right flank.] 6. Next, the sequences are passed to generateAlignment. Search backwards in the source to see the definition of generateAlignment. Note that it calls axtScoreSchemeSimpleDna to get a scoring matrix and gap penalties, and then passes the sequences and scoring scheme to axtAffine. See kent/src/lib/axt.c and kent/src/lib/axtAffine.c . The realignment is printed out block by block using printOffsetAndBoldAxt. It's a project just to read the C code -- *if* you want to duplicate the alignment, you may want to consider writing a script that sends queries to our web site, and then picks the part that you need from the returned HTML. Your script should generate URLs like this: http://genome.ucsc.edu/cgi-bin/hgc?db=hg19&c=$chrom&l=$start&r=$end&o=$start&t=$end&g=snp130&i=$rsID e.g. http://genome.ucsc.edu/cgi-bin/hgc?db=hg19&c=chr16&l=79237586&r=79237587&o=79237586&t=79237587&g=snp130&i=rs870 and fetch them following the guidelines in BlatBot.pl (http://genomewiki.ucsc.edu/index.php/Blat_Scripts). Note the $sleepTime in that script -- that keeps your usage within the limits stated on our home page. Using cookies as the script does also reduces impact on our site. To generate URLs like that, parse $chrom, $start, $end and $name from a file created using the Table Browser -- paste/upload your rs# IDs, then do output=selected fields + save to local file, and select chrom, chromStart, chromEnd, and name fields for output. > 2) I believe that there was a miscommunication in one of my previous > questions. I was making the observation that there were more results in the > BLAT output than queries that had been used. I used the same parameters > (stand alone BLAT) that are in the FAQ, and I was just wondering what > determines which sequence is presented on the genome website. Should I use > pslReps to solve this issue? Is that what the website does? Yes, pslReps is the right approach, but we have a newer program, pslCDnaFilter, that we now recommend instead of pslReps. (The website is different; for untranslated nucleotide alignments, any result with at least 20 matching bases is displayed.) Hope that helps, Angie ----- "Kyle Tretina" <[email protected]> wrote: > From: "Kyle Tretina" <[email protected]> > To: "UCSC" <[email protected]> > Sent: Thursday, March 4, 2010 9:31:04 PM GMT -08:00 US/Canada Pacific > Subject: Re: [Genome] Stand Alone Blat Question > > Thank you very much for your help thus far. I do have a few questions in > response: > > 1) First of all, in the last response I received I was told that the SNPs > come from dbSNP - both the novel content and reference genome position - > this is not parsed from BLAT output. However, this poses a problem for the > sequences that I am looking for in particular. I only want the sequence near > to the reference genome position that matches. > > If the sequence below were taken as an example and I were to do this > manually on the UCSC website, I would count the number of bases upstream > from the reference genome position to the first mismatch upstream (in this > case 71 bases upstream). Then I would could the number of bases downstream > from the reference genome position to the first mismatch downstream (in this > case 41 bases downstream). When I then went to click on "Get DNA," I would > type in those numbers -- 71 bases upstream and 41 bases downstream. How do > you think I should address this problem? > > 79237487 > AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC > 79237586 > ||||||||||||||||||||||||||| > ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > 00000201 > AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC > 00000300 > > *79237587 G 79237587 > 00000301 R 00000301 > > *79237588 > TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT > 79237687 > ||||||||||||||||||||||||||||||||||||||||||||| > |||||||||||||||||||||||||||||||||||||||||||||||||||||| > 00000302 > TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT > 00000401 > > > > 2) I believe that there was a miscommunication in one of my previous > questions. I was making the observation that there were more results in the > BLAT output than queries that had been used. I used the same parameters > (stand alone BLAT) that are in the FAQ, and I was just wondering what > determines which sequence is presented on the genome website. Should I use > pslReps to solve this issue? Is that what the website does? > > > Thank you for all of your help, > > Kyle Tretina > Wheaton College > > > On Thu, Mar 4, 2010 at 7:38 PM, Jennifer Jackson <[email protected]> > wrote: > > > Hi Kyle, > > > > The best place to find out how UCSC did the processing for a track > is in > > what we call the "makedoc". There is one per assembly containing the > track > > build processing. Makedocs are located in the kent source tree. > > > > kent/src/hg/makeDb/doc/hg18.txt > > kent/src/hg/makeDb/doc/hg19.txt > > > > These documents can also be browsed online at: > > http://genome-test.cse.ucsc.edu/~kent/src/unzipped/hg/makeDb/doc/ > > (try the link tomorrow, access to genome-test is limited right now) > > > > All information is in the makedoc or the track description page or > the > > online FAQ, but in summary: > > > > For the flanking sequence question: > > > > - 1 Specific for this track, the flanking sequence for both > assemblies is > > at gbdb/hg18/snp/snp130.fa and can be downloaded using ftp to the > downloads > > server. (the flanking sequence data was the same for hg18 & hg19, so > we did > > not duplicate it.) > > > > For the BLAT questions: > > > > - 2a The SNPs come from dbSNP - both the novel content and > reference > > genome position - this is not parsed from BLAT output > > > > - 2b IUPAC characters are declared on the SNP track's description > page. > > Section = Re-alignment of the SNP's flanking sequences to the > genomic > > sequence > > dbSNP flanking sequences and observed allele code for rsXXXXX: > > (Uses IUPAC ambiguity codes) > > Go into the track description for the link or directly see: > > http://genome.ucsc.edu/goldenPath/help/iupac.html > > > > - 3 BLAT documentation comes with the software when download, but is > also > > online here: > > http://genome.ucsc.edu/goldenPath/help/blatSpec.html > > > > This should help you get going again, but please let us know if you > need > > more help Kyle, > > > > Jennifer > > > > --------------------------------- > > Jennifer Jackson > > UCSC Genome Bioinformatics Group > > http://genome.ucsc.edu/ > > > > > > On 3/3/10 3:33 PM, Kyle Tretina wrote: > > > >> To whom it may concern, > >> > >> I wish to batch automate the re-alignment of all SNP flanking > sequences on > >> chromosome 16 from > >> ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/rs_fasta/ > >> and align that to the genomic download for chr16 at > >> http://hgdownload.cse.ucsc.edu/downloads.html.< > >> http://hgdownload.cse.ucsc.edu/downloads.html>I > >> > >> have a couple of questions: > >> > >> 1) Is there anywhere that I can get the SNP flanking sequences > customized > >> (i.e. only get the positively selected SNP sequences if I have a > list of > >> all > >> of the rs numbers for those SNP's, and the same for the > non-positively > >> selected)? > >> > >> 2) I did a test BLAT run using only a portion of the queries in the > SNP > >> sequences and I then converted it to the human readable output > using > >> pslPretty. However, I was wondering > >> a) How I could identify the base to which the SNP referrs to in > the > >> pslPretty output, as it is on the website (see example below)? > Below it > >> seems to be identified by a "G" for the genomic sequence (?) and an > (R) > >> for > >> the reference sequence (?) > >> b) I am assuming that the output that I received for this test > run was > >> in > >> order from highest score to lowest for each query. Is there any way > to > >> modify the parameters so that only the result with the highest > score is in > >> the output file? Is this what happens on the ucsc website? > >> > >> 79237487 > >> > AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC > >> 79237586 > >> > >> > |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > >> 00000201 > >> > AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC > >> 00000300 > >> > >> *79237587 G 79237587 > >> > >> 00000301 R 00000301 > >> > >> *79237588 > >> > TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT > >> 79237687 > >> > >> > |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > >> 00000302 > >> > TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT > >> 00000401 > >> > >> 3) Finally, I was wondering if there were any > documents/descriptions > >> online > >> of common modifications of BLAT and pslPretty. > >> > >> > >> > >> > >> I apologize for the length of this email. I am an undergraduate > >> bioinformatics intern, and so I have to ask for your patient in > helping > >> me. > >> > >> Kyle Tretina > >> Junior > >> Wheaton College > >> _______________________________________________ > >> Genome maillist - [email protected] > >> https://lists.soe.ucsc.edu/mailman/listinfo/genome > >> > > > > > _______________________________________________ > Genome maillist - [email protected] > https://lists.soe.ucsc.edu/mailman/listinfo/genome _______________________________________________ Genome maillist - [email protected] https://lists.soe.ucsc.edu/mailman/listinfo/genome
