Thank you very much for your help thus far. I do have a few questions in
response:
1) First of all, in the last response I received I was told that the SNPs
come from dbSNP - both the novel content and reference genome position -
and that this is not parsed from BLAT output. However, this poses a problem
for the sequences that I am looking for in particular: I only want the
sequence near the reference genome position that matches.
If the sequence below were taken as an example and I were to do this
manually on the UCSC website, I would count the number of bases upstream
from the reference genome position to the first mismatch upstream (in this
case 71 bases upstream). Then I would count the number of bases downstream
from the reference genome position to the first mismatch downstream (in this
case 41 bases downstream). When I then went to click on "Get DNA," I would
type in those numbers -- 71 bases upstream and 41 bases downstream. How do
you think I should address this problem?
79237487
AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC
79237586
|||||||||||||||||||||||||||
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
00000201
AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC
00000300
*79237587 G 79237587
00000301 R 00000301
*79237588
TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT
79237687
|||||||||||||||||||||||||||||||||||||||||||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||
00000302
TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT
00000401
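
The manual counting described above can be sketched in Python. This is only a minimal illustration, assuming the genomic and flanking sequences have already been aligned without gaps and that the 0-based index of the SNP within them is known; the function name `flank_match_lengths` and the toy sequences are hypothetical and not part of any UCSC tool.

```python
def flank_match_lengths(genome, query, snp_index):
    """Count consecutive matching bases on each side of the SNP position.

    genome, query: gap-free aligned sequences of equal length (assumption).
    snp_index: 0-based index of the SNP within both sequences (assumption).
    Returns (upstream, downstream) counts up to the first mismatch.
    """
    if len(genome) != len(query):
        raise ValueError("aligned sequences must have equal length")
    if not 0 <= snp_index < len(genome):
        raise ValueError("snp_index is outside the alignment")

    # Walk left from the SNP until the first mismatch or the start.
    upstream = 0
    i = snp_index - 1
    while i >= 0 and genome[i].upper() == query[i].upper():
        upstream += 1
        i -= 1

    # Walk right from the SNP until the first mismatch or the end.
    downstream = 0
    j = snp_index + 1
    while j < len(genome) and genome[j].upper() == query[j].upper():
        downstream += 1
        j += 1

    return upstream, downstream


if __name__ == "__main__":
    # Toy sequences, not real data; the SNP (IUPAC code R) is at index 5.
    print(flank_match_lengths("ACGTTGCAGGT", "ATGTTRCAGCT", 5))
```

The two returned numbers correspond to the upstream and downstream values that would be typed into the "Get DNA" form.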
2) I believe that there was a miscommunication in one of my previous
questions. I was making the observation that there were more results in the
BLAT output than queries that had been used. I used the same parameters
(standalone BLAT) that are in the FAQ, and I was just wondering what
determines which sequence is presented on the genome website. Should I use
pslReps to solve this issue? Is that what the website does?
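
One way to keep only the top-scoring alignment per query can be sketched in Python. This is a rough illustration, not a description of what pslReps or the website actually does. It assumes tab-separated PSL output with the standard 21 columns, and it assumes the nucleotide score is matches + repMatches/2 - misMatches - qNumInsert - tNumInsert; the function names `psl_score` and `best_hits` are hypothetical.

```python
def psl_score(fields):
    """Approximate score for one nucleotide PSL row (assumed formula)."""
    matches = int(fields[0])
    mismatches = int(fields[1])
    rep_matches = int(fields[2])
    q_num_insert = int(fields[4])
    t_num_insert = int(fields[6])
    return matches + rep_matches // 2 - mismatches - q_num_insert - t_num_insert


def best_hits(psl_lines):
    """Return {query name: (score, fields)} keeping the best row per query."""
    best = {}
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip header lines and anything that is not a 21-column PSL row.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        q_name = fields[9]  # column 10 of PSL is the query name
        score = psl_score(fields)
        # Ties keep the first row seen for that query.
        if q_name not in best or score > best[q_name][0]:
            best[q_name] = (score, fields)
    return best
```

To use it, pass an open file of BLAT PSL output, for example `best_hits(open("out.psl"))`, where `out.psl` is a placeholder file name.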
Thank you for all of your help,
Kyle Tretina
Wheaton College
On Thu, Mar 4, 2010 at 7:38 PM, Jennifer Jackson <[email protected]> wrote:
> Hi Kyle,
>
> The best place to find out how UCSC did the processing for a track is in
> what we call the "makedoc". There is one per assembly containing the track
> build processing. Makedocs are located in the kent source tree.
>
> kent/src/hg/makeDb/doc/hg18.txt
> kent/src/hg/makeDb/doc/hg19.txt
>
> These documents can also be browsed online at:
> http://genome-test.cse.ucsc.edu/~kent/src/unzipped/hg/makeDb/doc/
> (try the link tomorrow, access to genome-test is limited right now)
>
> All information is in the makedoc or the track description page or the
> online FAQ, but in summary:
>
> For the flanking sequence question:
>
> - 1 Specific for this track, the flanking sequence for both assemblies is
> at gbdb/hg18/snp/snp130.fa and can be downloaded using ftp to the downloads
> server. (the flanking sequence data was the same for hg18 & hg19, so we did
> not duplicate it.)
>
> For the BLAT questions:
>
> - 2a The SNPs come from dbSNP - both the novel content and reference
> genome position - this is not parsed from BLAT output
>
> - 2b IUPAC characters are declared on the SNP track's description page.
> Section = Re-alignment of the SNP's flanking sequences to the genomic
> sequence
> dbSNP flanking sequences and observed allele code for rsXXXXX:
> (Uses IUPAC ambiguity codes)
> Go into the track description for the link or directly see:
> http://genome.ucsc.edu/goldenPath/help/iupac.html
>
> - 3 BLAT documentation comes with the software when downloaded, but is also
> online here:
> http://genome.ucsc.edu/goldenPath/help/blatSpec.html
>
> This should help you get going again, but please let us know if you need
> more help Kyle,
>
> Jennifer
>
> ---------------------------------
> Jennifer Jackson
> UCSC Genome Bioinformatics Group
> http://genome.ucsc.edu/
>
>
> On 3/3/10 3:33 PM, Kyle Tretina wrote:
>
>> To whom it may concern,
>>
>> I wish to batch automate the re-alignment of all SNP flanking sequences on
>> chromosome 16 from
>> ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/rs_fasta/
>> and align that to the genomic download for chr16 at
>> http://hgdownload.cse.ucsc.edu/downloads.html.
>>
>> I have a couple of questions:
>>
>> 1) Is there anywhere that I can get the SNP flanking sequences customized
>> (i.e. only get the positively selected SNP sequences if I have a list of
>> all of the rs numbers for those SNPs, and the same for the non-positively
>> selected)?
>>
>> 2) I did a test BLAT run using only a portion of the queries in the SNP
>> sequences and I then converted it to the human readable output using
>> pslPretty. However, I was wondering
>> a) how I could identify the base to which the SNP refers in the
>> pslPretty output, as it is on the website (see example below)? Below it
>> seems to be identified by a "G" for the genomic sequence (?) and an "R"
>> for the reference sequence (?)
>> b) I am assuming that the output that I received for this test run was
>> in order from highest score to lowest for each query. Is there any way to
>> modify the parameters so that only the result with the highest score is in
>> the output file? Is this what happens on the UCSC website?
>>
>> 79237487
>> AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC
>> 79237586
>>
>>
>> ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
>> 00000201
>> AAACAAACAGCTTGTTTGTGGTTCGTCCTGAAATCCTCCCTGCTCACAAAACAGCCAGCTACTTGGTTTTCTAAAAGACGTAATTTTGCAGGCAGACTTC
>> 00000300
>>
>> *79237587 G 79237587
>>
>> 00000301 R 00000301
>>
>> *79237588
>> TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT
>> 79237687
>>
>>
>> ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
>> 00000302
>> TAGAGCCATTCTGTGCAGAAGAAGGGAAGGGAGAAGCTGTTTGTTTTACCTGTAGTATGAAGATATTCTTTGCGCTGTTAGAACTGAGCTCATTAATTCT
>> 00000401
>>
>> 3) Finally, I was wondering if there were any documents/descriptions
>> online of common modifications of BLAT and pslPretty.
>>
>>
>>
>>
>> I apologize for the length of this email. I am an undergraduate
>> bioinformatics intern, and so I have to ask for your patience in helping
>> me.
>>
>> Kyle Tretina
>> Junior
>> Wheaton College
>> _______________________________________________
>> Genome maillist - [email protected]
>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>>
>
>