Re: [Genome] Download coding sequence bulk

Lipika Ray Thu, 09 Sep 2010 09:04:46 -0700

Hello Galt,

Many thanks for your kind reply. It helped me a lot to understand
where I was making a mistake. I was following every step and only got confused
by the 0-based counting, i.e. what to pass to the substr function in Perl. Now it is
clear, thanks.
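For reference, a minimal sketch of the 0-based arithmetic discussed here, written in Python rather than Perl. It assumes UCSC-style coordinates (0-based start, end not included), in which case the slice is simply start:end; the corresponding Perl call would be substr($seq, $start, $end - $start). The function name and the toy sequence are illustrative only.

```python
def extract_range(seq: str, start: int, end: int) -> str:
    """Return the bases of a 0-based, half-open range [start, end)."""
    if not 0 <= start <= end <= len(seq):
        raise ValueError(f"range {start}-{end} is outside the sequence")
    # With a 0-based start and an exclusive end, no +1/-1 adjustment is needed.
    return seq[start:end]


toy = "ACGTACGTAC"  # toy sequence, not real genomic data
print(extract_range(toy, 2, 6))  # GTAC
```

The length of the extracted piece is always end - start under this convention.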

Another thing: I was not aware of those utilities. We have Linux 64-bit and
32-bit machines, and it seems not all programs are available for all
platforms. Of the two you mentioned, twoBitToFa and faRc, I found
twoBitToFa for the 64-bit machine but not faRc. For 32-bit it seems nothing
is available except liftOver. Am I missing something again, or is it true
that not all utilities are available for all platforms?


Thanks again for your detailed help - now I am getting correct sequences.
Thanks,

Lipika

On Wed, Sep 8, 2010 at 5:03 PM, Galt Barber <[email protected]> wrote:

> Hi, Lipika!
>
> I was able to do something very similar to the process you
> describe and it worked.  Here are my results:
>
> I made a file called ranges which has each exon
>
> [hgwdev:~/lipika> cat ranges
> chr1:2479150-2479831
> chr1:2480082-2480114
> chr1:2481163-2481306
> chr1:2482264-2482355
> chr1:2483000-2483156
> chr1:2484510-2484636
> chr1:2485144-2485253
> chr1:2486245-2486613
>
>  twoBitToFa /gbdb/hg18/hg18.2bit -seqList=ranges out.fa
>
> This extracts the pieces as separate sequences.
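A ranges file like the one above could be generated from the refFlat fields instead of typed by hand. This is only a sketch: the function name is hypothetical, and it assumes that twoBitToFa reads each chrom:start-end entry with a 0-based start and an exclusive end, which is consistent with the matching result reported in this thread.

```python
def exon_ranges(chrom: str, exon_starts: str, exon_ends: str) -> list[str]:
    """Turn refFlat exonStarts/exonEnds strings into chrom:start-end lines."""
    # refFlat lists end with a trailing comma, so drop empty fields.
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds differ in length")
    return [f"{chrom}:{s}-{e}" for s, e in zip(starts, ends)]


lines = exon_ranges(
    "chr1",
    "2479150,2480082,2481163,2482264,2483000,2484510,2485144,2486245,",
    "2479831,2480114,2481306,2482355,2483156,2484636,2485253,2486613,",
)
print("\n".join(lines))
```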
>
> Then I merge the several exon pieces into a single new fasta record
> creating a new header and stripping out the original multiple headers.
>  echo ">NM_003820_from_gp" > test.fa
>  cat out.fa | grep -v '>' >> test.fa
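For machines where only some of the tools are at hand, the same merge step can be sketched in plain Python. The function name and the toy records are made up for illustration; unlike the grep command, this version also re-wraps the merged sequence at a fixed width.

```python
def merge_fasta_records(fasta_text: str, new_name: str, width: int = 50) -> str:
    """Join all records of a FASTA text into one record named new_name."""
    # Keep only sequence lines; every original '>' header is dropped.
    bases = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    )
    wrapped = [bases[i:i + width] for i in range(0, len(bases), width)]
    return "\n".join([f">{new_name}"] + wrapped) + "\n"


toy = ">chr1:0-4\nACGT\n>chr1:10-16\nGGCCAA\n"  # toy records
merged = merge_fasta_records(toy, "toy_merged")
print(merged, end="")
```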
>
> Then I reverse-complement the results since it was reported on the negative
> strand:
>  faRc test.fa testRC.fa
>
> >RC_NM_003820_from_gp
> ccttcataccggcccttcccctcggctttgcctggacagctcctgcctcc
> cgcagggcccacctgtgtcccccagcgccgctccacccagcaggcctgag
> cccctctctgctgccagacaccccctgctgcccactctcctgctgctcgg
> gttctgaggcacagcttgtcacaccgaggcggattctctttctctttctc
> tttctcttctggcccacagccgcagcaatggcgctgagttcctctgctgg
> agttcatcctgctagctgggttcccgagctgccggtctgagcctgaggca
> tggagcctcctggagactgggggcctcctccctggagatccacccccaaa
> accgacgtcttgaggctggtgctgtatctcaccttcctgggagccccctg
> ctacgccccagctctgccgtcctgcaaggaggacgagtacccagtgggct
> ccgagtgctgccccaagtgcagtccaggttatcgtgtgaaggaggcctgc
> ggggagctgacgggcacagtgtgtgaaccctgccctccaggcacctacat
> tgcccacctcaatggcctaagcaagtgtctgcagtgccaaatgtgtgacc
> cagccatgggcctgcgcgcgagccggaactgctccaggacagagaacgcc
> gtgtgtggctgcagcccaggccacttctgcatcgtccaggacggggacca
> ctgcgccgcgtgccgcgcttacgccacctccagcccgggccagagggtgc
> agaagggaggcaccgagagtcaggacaccctgtgtcagaactgccccccg
> gggaccttctctcccaatgggaccctggaggaatgtcagcaccagaccaa
> gtgcagctggctggtgacgaaggccggagctgggaccagcagctcccact
> gggtatggtggtttctctcagggagcctcgtcatcgtcattgtttgctcc
> acagttggcctaatcatatgtgtgaaaagaagaaagccaaggggtgatgt
> agtcaaggtgatcgtctccgtccagcggaaaagacaggaggcagaaggtg
> aggccacagtcattgaggccctgcaggcccctccggacgtcaccacggtg
> gccgtggaggagacaataccctcattcacggggaggagcccaaaccactg
> acccacagactctgcaccccgacgccagagatacctggagcgacggctgc
> tgaaagaggctgtccacctggcggaaccaccggagcccggaggcttgggg
> gctccgccctgggctggcttccgtctcctccagtggagggagaggtgggg
> cccctgctggggtagagctggggacgccacgtgccattcccatgggccag
> tgagggcctggggcctctgttctgctgtggcctgagctccccagagtcct
> gaggaggagcgccagttgcccctcgctcacagaccacacacccagccctc
> ctgggccagcccagagggcccttcagaccccagctgtctgcgcgtctgac
> tcttgtggcctcagcaggacaggccccgggcactgcctcacagccaaggc
> tggactgggttggctgcagtgtggtgtttagtggataccacatcggaagt
> gattttctaaattggatttgaattcggctcctgttttctatttgtcatga
> aacagtgtatttggggagatgctgtgggaggatgtaaatatcttgtttct
> cctcaa
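If faRc is not available for a given platform, reverse-complementing is simple enough to sketch directly. This is not the faRc implementation, just a minimal stand-in that preserves case and handles N; other IUPAC codes are left unchanged.

```python
# Translation table for the complement; case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the whole sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ttgagg"))  # cctcaa
```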
>
> Here is the browser output for hg18 refSeq NM_003820
> cDNA NM_003820
>
> CCTTCATACC GGCCCTTCCC CTCGGCTTTG CCTGGACAGC TCCTGCCTCC  50
> CGCAGGGCCC ACCTGTGTCC CCCAGCGCCG CTCCACCCAG CAGGCCTGAG  100
> CCCCTCTCTG CTGCCAGACA CCCCCTGCTG CCCACTCTCC TGCTGCTCGG  150
> GTTCTGAGGC ACAGCTTGTC ACACCGAGGC GGATTCTCTT TCTCTTTCTC  200
> TTTCTCTTCT GGCCCACAGC CGCAGCAATG GCGCTGAGTT CCTCTGCTGG  250
> AGTTCATCCT GCTAGCTGGG TTCCCGAGCT GCCGGTCTGA GCCTGAGGCA  300
> TGGAGCCTCC TGGAGACTGG GGGCCTCCTC CCTGGAGATC CACCCCCAAA  350
> ACCGACGTCT TGAGGCTGGT GCTGTATCTC ACCTTCCTGG GAGCCCCCTG  400
> CTACGCCCCA GCTCTGCCGT CCTGCAAGGA GGACGAGTAC CCAGTGGGCT  450
> CCGAGTGCTG CCCCAAGTGC AGTCCAGGTT ATCGTGTGAA GGAGGCCTGC  500
> GGGGAGCTGA CGGGCACAGT GTGTGAACCC TGCCCTCCAG GCACCTACAT  550
> TGCCCACCTC AATGGCCTAA GCAAGTGTCT GCAGTGCCAA ATGTGTGACC  600
> CAGCCATGGG CCTGCGCGCG AGCCGGAACT GCTCCAGGAC AGAGAACGCC  650
> GTGTGTGGCT GCAGCCCAGG CCACTTCTGC ATCGTCCAGG ACGGGGACCA  700
> CTGCGCCGCG TGCCGCGCTT ACGCCACCTC CAGCCCGGGC CAGAGGGTGC  750
> AGAAGGGAGG CACCGAGAGT CAGGACACCC TGTGTCAGAA CTGCCCCCCG  800
> GGGACCTTCT CTCCCAATGG GACCCTGGAG GAATGTCAGC ACCAGACCAA  850
> GTGCAGCTGG CTGGTGACGA AGGCCGGAGC TGGGACCAGC AGCTCCCACT  900
> GGGTATGGTG GTTTCTCTCA GGGAGCCTCG TCATCGTCAT TGTTTGCTCC  950
> ACAGTTGGCC TAATCATATG TGTGAAAAGA AGAAAGCCAA GGGGTGATGT  1000
> AGTCAAGGTG ATCGTCTCCG TCCAGCGGAA AAGACAGGAG GCAGAAGGTG  1050
> AGGCCACAGT CATTGAGGCC CTGCAGGCCC CTCCGGACGT CACCACGGTG  1100
> GCCGTGGAGG AGACAATACC CTCATTCACG GGGAGGAGCC CAAACCACTG  1150
> ACCCACAGAC TCTGCACCCC GACGCCAGAG ATACCTGGAG CGACGGCTGC  1200
> TGAAAGAGGC TGTCCACCTG GCGGAACCAC CGGAGCCCGG AGGCTTGGGG  1250
> GCTCCGCCCT GGGCTGGCTT CCGTCTCCTC CAGTGGAGGG AGAGGTGGGG  1300
> CCCCTGCTGG GGTAGAGCTG GGGACGCCAC GTGCCATTCC CATGGGCCAG  1350
> TGAGGGCCTG GGGCCTCTGT TCTGCTGTGG CCTGAGCTCC CCAGAGTCCT  1400
> GAGGAGGAGC GCCAGTTGCC CCTCGCTCAC AGACCACACA CCCAGCCCTC  1450
> CTGGGCCAGC CCAGAGGGCC CTTCAGACCC CAGCTGTCTG CGCGTCTGAC  1500
> TCTTGTGGCC TCAGCAGGAC AGGCCCCGGG CACTGCCTCA CAGCCAAGGC  1550
> TGGACTGGGT TGGCTGCAGT GTGGTGTTTA GTGGATACCA CATCGGAAGT  1600
> GATTTTCTAA ATTGGATTTG AATTCGGCTC CTGTTTTCTA TTTGTCATGA  1650
> AACAGTGTAT TTGGGGAGAT GCTGTGGGAG GATGTAAATA TCTTGTTTCT  1700
> CCTCAAaaaa aaaaaaaaaa aaaaaaaaaa
>
> As you can see, they are a very good match.
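The comparison can also be done mechanically rather than by eye. The sketch below assumes the browser text has had its header line removed, strips spaces and position numbers, ignores case, and tolerates a trailing run of a's in the cDNA (presumably the poly-A tail, which is not part of the genomic sequence). Function names and toy strings are illustrative.

```python
def normalize(text: str) -> str:
    """Keep letters only (drops spaces, digits, newlines) and upper-case them."""
    return "".join(ch for ch in text if ch.isalpha()).upper()


def matches_ignoring_polya(genomic: str, cdna: str) -> bool:
    """True if cdna equals genomic, optionally followed by only A's."""
    g = normalize(genomic)
    c = normalize(cdna)
    return c.startswith(g) and set(c[len(g):]) <= {"A"}


# Toy data shaped like the two listings above (not real sequence).
print(matches_ignoring_polya("acgtacgtac\nggt", "ACGTACGTAC  10\nGGTaaaa"))  # True
```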
>
> -Galt
>
>
> On 09/08/10 13:23, Jennifer Jackson wrote:
>
>> Hello Lipika,
>>
>> Perhaps some help understanding the coordinate system used by UCSC will
>> help. We use a 0-based start position. This can get tricky, especially when
>> converting to the (-) strand, since we also store all coordinates
>> smallest->largest along the chromosome.
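A small sketch of the two points in this paragraph, with hypothetical helper names: converting a 0-based, end-exclusive range to the 1-based, end-inclusive form, and putting blocks that are stored smallest->largest into transcript order for a (-) strand gene.

```python
def to_one_based(start0: int, end0: int) -> tuple[int, int]:
    """0-based half-open [start0, end0) -> 1-based closed [start, end]."""
    return start0 + 1, end0


def transcript_order(blocks: list[tuple[int, int]], strand: str) -> list[tuple[int, int]]:
    """Order blocks 5'->3' along the transcript.

    Blocks are stored smallest->largest on the chromosome, so a (-) strand
    gene reads them in reverse; each piece must still be reverse-complemented.
    """
    ordered = sorted(blocks)
    return ordered[::-1] if strand == "-" else ordered


print(to_one_based(2479150, 2479831))                # (2479151, 2479831)
print(transcript_order([(1, 5), (10, 20)], "-"))     # [(10, 20), (1, 5)]
```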
>>
>> Help is located in this wiki:
>> http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
>>
>> All database tables/files will be formatted this way unless specifically
>> noted in the data format FAQ:
>> http://genome.ucsc.edu/FAQ/FAQformat.html
>>
>> There are utilities readily available that work with our coordinate
>> system. Some function stand-alone and others require a database. The public
>> mySQL database can be used when a database is required, if you do not run
>> your own mirror.
>>
>> A list of utilities is here:
>> http://hgwdev.cse.ucsc.edu/~larrym/utilities.html
>>
>> Many can be downloaded pre-compiled from here (for certain platforms):
>> http://hgdownload.cse.ucsc.edu/admin/exe/
>>
>> Otherwise, obtain the source and compile locally:
>> http://hgdownload.cse.ucsc.edu/downloads.html#source_downloads
>>
>> Public mySQL access instructions:
>> http://genome.ucsc.edu/FAQ/FAQdownloads.html#download29
>>
>> Please feel free to contact the mailing list support team again if you
>> would like more assistance.
>>
>> Warm regards,
>>
>> Jen
>> UCSC Genome Browser Support
>>
>> On 9/8/10 11:35 AM, Lipika Ray wrote:
>>
>>> Hello UCSC group,
>>>
>>> I would like to get the coding sequences of genes from RefSeq mRNA IDs (like
>>> NM_003820) in hg18, for a big list of such IDs.
>>>
>>> So I am getting the exonStarts, exonEnds, cdsStart and cdsEnd information
>>> from the refFlat table in hg18.
>>>
>>> So for NM_003820, the record looks like this:
>>>
>>> geneName: TNFRSF14
>>>       name: NM_003820
>>>      chrom: chr1
>>>     strand: -
>>>    txStart: 2479150
>>>      txEnd: 2486613
>>>   cdsStart: 2479705
>>>     cdsEnd: 2486314
>>>  exonCount: 8
>>> exonStarts:
>>> 2479150,2480082,2481163,2482264,2483000,2484510,2485144,2486245,
>>>   exonEnds:
>>> 2479831,2480114,2481306,2482355,2483156,2484636,2485253,2486613,
>>>
>>> To get the DNA sequence corresponding to the coding regions, I am
>>> reading the chr1.fa.gz file from the hg18 chromosomes download and then
>>> extracting the DNA sequence corresponding to these regions:
>>>
>>> 2479705-2479831, 2480082-2480114, 2481163-2481306, 2482264-2482355,
>>> 2483000-2483156, 2484510-2484636, 2485144-2485253, 2486245-2486314
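The region list above is what one gets by clipping each exon to the cdsStart..cdsEnd window. A minimal sketch of that clipping, using the values from the record above and assuming all coordinates are 0-based with an exclusive end (the function name is hypothetical):

```python
def cds_blocks(exon_starts, exon_ends, cds_start, cds_end):
    """Clip each exon to [cds_start, cds_end); drop exons with no coding part."""
    blocks = []
    for start, end in zip(exon_starts, exon_ends):
        clipped_start = max(start, cds_start)
        clipped_end = min(end, cds_end)
        if clipped_start < clipped_end:
            blocks.append((clipped_start, clipped_end))
    return blocks


exon_starts = [2479150, 2480082, 2481163, 2482264, 2483000, 2484510, 2485144, 2486245]
exon_ends = [2479831, 2480114, 2481306, 2482355, 2483156, 2484636, 2485253, 2486613]
blocks = cds_blocks(exon_starts, exon_ends, 2479705, 2486314)
total = sum(end - start for start, end in blocks)
print(blocks[0], blocks[-1], total)  # (2479705, 2479831) (2486245, 2486314) 852
```

Under the half-open assumption each block contributes end - start bases, and the total of 852 is a multiple of three, as expected for a coding sequence. Treating the same numbers as 1-based inclusive positions shifts every piece by one base.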
>>>
>>> The resulting sequence does not match when I cross-check it against the
>>> sequence from the web. Can you please tell me whether I can extract the sequence
>>> this way, or whether you already have the sequences corresponding to genes stored
>>> separately in your database?
>>>
>>> Thanks for your help.
>>>
>>> Lipika
>>> _______________________________________________
>>> Genome maillist  -  [email protected]
>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>>>
>>
>
