Re: [Genome] Download coding sequence bulk

Galt Barber Wed, 08 Sep 2010 14:04:41 -0700

Hi, Lipika!

I was able to do something very similar to the process you
describe and it worked.  Here are my results:


I made a file called ranges that lists each exon, one range per line:

[hgwdev:~/lipika> cat ranges
chr1:2479150-2479831
chr1:2480082-2480114
chr1:2481163-2481306
chr1:2482264-2482355
chr1:2483000-2483156
chr1:2484510-2484636
chr1:2485144-2485253
chr1:2486245-2486613

  twoBitToFa /gbdb/hg18/hg18.2bit -seqList=ranges out.fa

This extracts the pieces as separate sequences.
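
For a long list of IDs, the ranges file can be generated from the refFlat fields instead of being typed by hand. The following is a minimal sketch, assuming Python is available; the function name exon_ranges is hypothetical and not part of any UCSC tool. The refFlat values are written out unchanged, as in the ranges file above. Passing cdsStart and cdsEnd clips the exons to the coding region, which is one possible way to get only the coding portion.

```python
def exon_ranges(chrom, exon_starts, exon_ends, cds_start=None, cds_end=None):
    """Return 'chrom:start-end' lines for one refFlat record.

    exon_starts and exon_ends are the comma-separated strings from refFlat
    (the trailing comma is tolerated). If cds_start and cds_end are both
    given, each exon is clipped to the coding region, and exons that lie
    entirely in a UTR are dropped.
    """
    starts = [int(s) for s in exon_starts.split(",") if s]
    ends = [int(e) for e in exon_ends.split(",") if e]
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds differ in length")
    lines = []
    for start, end in zip(starts, ends):
        if cds_start is not None and cds_end is not None:
            start, end = max(start, cds_start), min(end, cds_end)
            if start >= end:
                continue  # exon is entirely outside the coding region
        lines.append(f"{chrom}:{start}-{end}")
    return lines


# Values copied from the NM_003820 refFlat record quoted in this thread.
STARTS = "2479150,2480082,2481163,2482264,2483000,2484510,2485144,2486245,"
ENDS = "2479831,2480114,2481306,2482355,2483156,2484636,2485253,2486613,"

if __name__ == "__main__":
    # Whole exons (the full transcript), one range per line.
    print("\n".join(exon_ranges("chr1", STARTS, ENDS)))
```

Calling exon_ranges("chr1", STARTS, ENDS, 2479705, 2486314) instead gives the clipped ranges; the merge and reverse-complement steps stay the same.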

Then I merge the exon pieces into a single new FASTA record,
creating a new header and stripping out the original per-exon headers:
  echo ">NM_003820_from_gp" > test.fa
  grep -v '>' out.fa >> test.fa
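
The same merge can be sketched in Python for anyone scripting the whole process. This is only an illustration under the assumption that the FASTA text fits in memory; merge_fasta_records is a hypothetical helper name. Unlike the grep version, it joins the sequence into one unwrapped line.

```python
def merge_fasta_records(fasta_text, new_name):
    """Concatenate the sequence lines of every record under one new header."""
    pieces = [
        line.strip()
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")  # skip headers and blanks
    ]
    return ">" + new_name + "\n" + "".join(pieces) + "\n"


if __name__ == "__main__":
    # Tiny made-up input, just to show the shape of the result.
    demo = ">piece1\nACGT\nAC\n>piece2\nGG\n"
    print(merge_fasta_records(demo, "NM_003820_from_gp"), end="")
```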

Then I reverse-complement the result, since the transcript is on the
negative strand:
  faRc test.fa testRC.fa

 >RC_NM_003820_from_gp
ccttcataccggcccttcccctcggctttgcctggacagctcctgcctcc
cgcagggcccacctgtgtcccccagcgccgctccacccagcaggcctgag
cccctctctgctgccagacaccccctgctgcccactctcctgctgctcgg
gttctgaggcacagcttgtcacaccgaggcggattctctttctctttctc
tttctcttctggcccacagccgcagcaatggcgctgagttcctctgctgg
agttcatcctgctagctgggttcccgagctgccggtctgagcctgaggca
tggagcctcctggagactgggggcctcctccctggagatccacccccaaa
accgacgtcttgaggctggtgctgtatctcaccttcctgggagccccctg
ctacgccccagctctgccgtcctgcaaggaggacgagtacccagtgggct
ccgagtgctgccccaagtgcagtccaggttatcgtgtgaaggaggcctgc
ggggagctgacgggcacagtgtgtgaaccctgccctccaggcacctacat
tgcccacctcaatggcctaagcaagtgtctgcagtgccaaatgtgtgacc
cagccatgggcctgcgcgcgagccggaactgctccaggacagagaacgcc
gtgtgtggctgcagcccaggccacttctgcatcgtccaggacggggacca
ctgcgccgcgtgccgcgcttacgccacctccagcccgggccagagggtgc
agaagggaggcaccgagagtcaggacaccctgtgtcagaactgccccccg
gggaccttctctcccaatgggaccctggaggaatgtcagcaccagaccaa
gtgcagctggctggtgacgaaggccggagctgggaccagcagctcccact
gggtatggtggtttctctcagggagcctcgtcatcgtcattgtttgctcc
acagttggcctaatcatatgtgtgaaaagaagaaagccaaggggtgatgt
agtcaaggtgatcgtctccgtccagcggaaaagacaggaggcagaaggtg
aggccacagtcattgaggccctgcaggcccctccggacgtcaccacggtg
gccgtggaggagacaataccctcattcacggggaggagcccaaaccactg
acccacagactctgcaccccgacgccagagatacctggagcgacggctgc
tgaaagaggctgtccacctggcggaaccaccggagcccggaggcttgggg
gctccgccctgggctggcttccgtctcctccagtggagggagaggtgggg
cccctgctggggtagagctggggacgccacgtgccattcccatgggccag
tgagggcctggggcctctgttctgctgtggcctgagctccccagagtcct
gaggaggagcgccagttgcccctcgctcacagaccacacacccagccctc
ctgggccagcccagagggcccttcagaccccagctgtctgcgcgtctgac
tcttgtggcctcagcaggacaggccccgggcactgcctcacagccaaggc
tggactgggttggctgcagtgtggtgtttagtggataccacatcggaagt
gattttctaaattggatttgaattcggctcctgttttctatttgtcatga
aacagtgtatttggggagatgctgtgggaggatgtaaatatcttgtttct
cctcaa
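
For readers without faRc, the reverse-complement step on the merged sequence can be sketched in Python as below. This is not the faRc implementation, only an illustration of the operation; reverse_complement is a hypothetical name, and characters other than A, C, G, T (such as N) are assumed to pass through unchanged.

```python
# Translation table for complementing bases while preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the whole sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ACGTn"))  # nACGT
```

Because the exons are concatenated in chromosome order first, reversing the whole merged sequence also puts the exons into transcript order for a (-) strand gene.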

Here is the browser output for hg18 refSeq NM_003820
cDNA NM_003820

CCTTCATACC GGCCCTTCCC CTCGGCTTTG CCTGGACAGC TCCTGCCTCC  50
CGCAGGGCCC ACCTGTGTCC CCCAGCGCCG CTCCACCCAG CAGGCCTGAG  100
CCCCTCTCTG CTGCCAGACA CCCCCTGCTG CCCACTCTCC TGCTGCTCGG  150
GTTCTGAGGC ACAGCTTGTC ACACCGAGGC GGATTCTCTT TCTCTTTCTC  200
TTTCTCTTCT GGCCCACAGC CGCAGCAATG GCGCTGAGTT CCTCTGCTGG  250
AGTTCATCCT GCTAGCTGGG TTCCCGAGCT GCCGGTCTGA GCCTGAGGCA  300
TGGAGCCTCC TGGAGACTGG GGGCCTCCTC CCTGGAGATC CACCCCCAAA  350
ACCGACGTCT TGAGGCTGGT GCTGTATCTC ACCTTCCTGG GAGCCCCCTG  400
CTACGCCCCA GCTCTGCCGT CCTGCAAGGA GGACGAGTAC CCAGTGGGCT  450
CCGAGTGCTG CCCCAAGTGC AGTCCAGGTT ATCGTGTGAA GGAGGCCTGC  500
GGGGAGCTGA CGGGCACAGT GTGTGAACCC TGCCCTCCAG GCACCTACAT  550
TGCCCACCTC AATGGCCTAA GCAAGTGTCT GCAGTGCCAA ATGTGTGACC  600
CAGCCATGGG CCTGCGCGCG AGCCGGAACT GCTCCAGGAC AGAGAACGCC  650
GTGTGTGGCT GCAGCCCAGG CCACTTCTGC ATCGTCCAGG ACGGGGACCA  700
CTGCGCCGCG TGCCGCGCTT ACGCCACCTC CAGCCCGGGC CAGAGGGTGC  750
AGAAGGGAGG CACCGAGAGT CAGGACACCC TGTGTCAGAA CTGCCCCCCG  800
GGGACCTTCT CTCCCAATGG GACCCTGGAG GAATGTCAGC ACCAGACCAA  850
GTGCAGCTGG CTGGTGACGA AGGCCGGAGC TGGGACCAGC AGCTCCCACT  900
GGGTATGGTG GTTTCTCTCA GGGAGCCTCG TCATCGTCAT TGTTTGCTCC  950
ACAGTTGGCC TAATCATATG TGTGAAAAGA AGAAAGCCAA GGGGTGATGT  1000
AGTCAAGGTG ATCGTCTCCG TCCAGCGGAA AAGACAGGAG GCAGAAGGTG  1050
AGGCCACAGT CATTGAGGCC CTGCAGGCCC CTCCGGACGT CACCACGGTG  1100
GCCGTGGAGG AGACAATACC CTCATTCACG GGGAGGAGCC CAAACCACTG  1150
ACCCACAGAC TCTGCACCCC GACGCCAGAG ATACCTGGAG CGACGGCTGC  1200
TGAAAGAGGC TGTCCACCTG GCGGAACCAC CGGAGCCCGG AGGCTTGGGG  1250
GCTCCGCCCT GGGCTGGCTT CCGTCTCCTC CAGTGGAGGG AGAGGTGGGG  1300
CCCCTGCTGG GGTAGAGCTG GGGACGCCAC GTGCCATTCC CATGGGCCAG  1350
TGAGGGCCTG GGGCCTCTGT TCTGCTGTGG CCTGAGCTCC CCAGAGTCCT  1400
GAGGAGGAGC GCCAGTTGCC CCTCGCTCAC AGACCACACA CCCAGCCCTC  1450
CTGGGCCAGC CCAGAGGGCC CTTCAGACCC CAGCTGTCTG CGCGTCTGAC  1500
TCTTGTGGCC TCAGCAGGAC AGGCCCCGGG CACTGCCTCA CAGCCAAGGC  1550
TGGACTGGGT TGGCTGCAGT GTGGTGTTTA GTGGATACCA CATCGGAAGT  1600
GATTTTCTAA ATTGGATTTG AATTCGGCTC CTGTTTTCTA TTTGTCATGA  1650
AACAGTGTAT TTGGGGAGAT GCTGTGGGAG GATGTAAATA TCTTGTTTCT  1700
CCTCAAaaaa aaaaaaaaaa aaaaaaaaaa

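As you can see, they are a very good match. The only visible difference
is the run of a's at the end of the cDNA record (the poly-A tail), which
is not part of the genomic exons.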
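
The comparison can also be done mechanically rather than by eye. The sketch below is an assumption about how one might do it, with hypothetical helper names: it strips the spaces and position counters from the browser sequence lines, lower-cases them, and then checks that the genomic result is a prefix of the cDNA with only a's left over.

```python
import re


def normalize_browser_cdna(text):
    """Drop spaces, position counters, and newlines; lower-case the bases.

    Pass only the sequence lines, not the 'cDNA NM_...' header line,
    because every letter in the input is kept.
    """
    return re.sub(r"[^A-Za-z]", "", text).lower()


def matches_up_to_poly_a(genomic, cdna):
    """True if cdna equals genomic followed by nothing but a's."""
    genomic, cdna = genomic.lower(), cdna.lower()
    return cdna.startswith(genomic) and set(cdna[len(genomic):]) <= {"a"}


if __name__ == "__main__":
    # Last line of each sequence shown above.
    browser_tail = normalize_browser_cdna("CCTCAAaaaa aaaaaaaaaa aaaaaaaaaa")
    print(matches_up_to_poly_a("cctcaa", browser_tail))
```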

-Galt

On 09/08/10 13:23, Jennifer Jackson wrote:
> Hello Lipika,
> 
> Perhaps an explanation of the coordinate system used by UCSC will 
> help. We use a 0-based start position. This can get tricky, especially 
> when converting to the (-) strand, since we also store all coordinates 
> smallest->largest along the chromosome.
> 
> Help is located in this wiki:
> http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
> 
> All database tables/files will be formatted this way unless specifically 
> noted in the data format FAQ:
> http://genome.ucsc.edu/FAQ/FAQformat.html
> 
> There are utilities readily available that work with our coordinate 
> system. Some function stand-alone and others require a database. The 
> public MySQL database can be used when a database is required, if you do 
> not run your own mirror.
> 
> A list of utilities is here:
> http://hgwdev.cse.ucsc.edu/~larrym/utilities.html
> 
> Many can be downloaded pre-compiled from here (for certain platforms):
> http://hgdownload.cse.ucsc.edu/admin/exe/
> 
> Otherwise, obtain the source and compile locally:
> http://hgdownload.cse.ucsc.edu/downloads.html#source_downloads
> 
> Public MySQL access instructions:
> http://genome.ucsc.edu/FAQ/FAQdownloads.html#download29
> 
> Please feel free to contact the mailing list support team again if you 
> would like more assistance.
> 
> Warm regards,
> 
> Jen
> UCSC Genome Browser Support
> 
> On 9/8/10 11:35 AM, Lipika Ray wrote:
>> Hello UCSC group,
>>
>> I would like to get the coding sequences of genes from RefSeq mRNA IDs
>> (such as NM_003820) in hg18, for a big list of such IDs.
>>
>> So I am getting the exonStarts, exonEnds, cdsStart, and cdsEnd values from
>> the refFlat table under hg18.
>>
>> So for NM_003820, the record looks like this:
>>
>> geneName: TNFRSF14
>>        name: NM_003820
>>       chrom: chr1
>>      strand: -
>>     txStart: 2479150
>>       txEnd: 2486613
>>    cdsStart: 2479705
>>      cdsEnd: 2486314
>>   exonCount: 8
>> exonStarts: 2479150,2480082,2481163,2482264,2483000,2484510,2485144,2486245,
>>    exonEnds: 2479831,2480114,2481306,2482355,2483156,2484636,2485253,2486613,
>>
>> To get the DNA sequence corresponding to the coding regions, I am reading
>> the chr1.fa.gz file under chromosomes in hg18 and then extracting the
>> DNA sequence for these regions:
>>
>> 2479705-2479831, 2480082-2480114, 2481163-2481306, 2482264-2482355,
>> 2483000-2483156, 2484510-2484636, 2485144-2485253, 2486245-2486314
>>
>> The resulting sequence does not match when I cross-check it against the
>> sequence from the web. Can you please tell me whether I can extract the
>> sequence in this way, or whether you already have the sequences for genes
>> stored separately in your database?
>>
>> Thanks for your help.
>>
>> Lipika
>> _______________________________________________
>> Genome maillist  -  [email protected]
>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
