Re: [Genome] Download coding sequence bulk

Galt Barber Thu, 09 Sep 2010 10:48:50 -0700

The number of pre-built utilities we have available
is much smaller than the number in our entire source
tree.  We only build the ones for which there is high
demand.  32-bit versions are basically ignored.  Most
people use 64-bit systems because they are common now
and because they are often needed for large genomic
databases on systems with much more than 4GB RAM.


Even for the case of Linux 64-bit, our pre-compiled
utilities will only work on a fraction of all systems
out there.  People can just get the source and compile
it if they want access to everything.
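
On a machine with no pre-built faRc, the clipping and
reverse-complement steps from the quoted messages below can be
approximated in a short script.  This is only a minimal sketch, not
the kent source: the function names are hypothetical, and it assumes
refFlat-style coordinates (0-based start, end not included) and a
chromosome already loaded into memory as one string.  Python slices
use the same convention, so chrom_seq[start:end] needs no +1/-1
adjustment; the Perl equivalent would be
substr($chrom, $start, $end - $start).

```python
# Sketch only: helper names are illustrative, not UCSC utilities.
# Coordinates are assumed 0-based, half-open, as in refFlat.

def parse_list(field):
    """Turn a refFlat list such as '10,20,30,' into [10, 20, 30]."""
    return [int(x) for x in field.split(",") if x]


def cds_blocks(exon_starts, exon_ends, cds_start, cds_end):
    """Clip each exon to the coding region; drop exons that are all UTR."""
    blocks = []
    for start, end in zip(exon_starts, exon_ends):
        start, end = max(start, cds_start), min(end, cds_end)
        if start < end:
            blocks.append((start, end))
    return blocks


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def coding_sequence(chrom_seq, blocks, strand):
    """Join the blocks in chromosome order; flip for '-' strand genes."""
    seq = "".join(chrom_seq[start:end] for start, end in blocks)
    return reverse_complement(seq) if strand == "-" else seq


if __name__ == "__main__":
    # refFlat values for NM_003820 quoted later in this thread.
    starts = parse_list(
        "2479150,2480082,2481163,2482264,2483000,2484510,2485144,2486245,")
    ends = parse_list(
        "2479831,2480114,2481306,2482355,2483156,2484636,2485253,2486613,")
    for start, end in cds_blocks(starts, ends, 2479705, 2486314):
        print("chr1:%d-%d" % (start, end))
```

The printed ranges can be checked against a known transcript before
running a big list of ids; a coding length that is not a multiple of
three is a quick sign that a coordinate is off by one.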

-Galt

On 09/08/10 20:23, Lipika Ray wrote:
> Hello Galt,
>  
> Many many thanks for your kind reply - that helped me a lot to 
> understand where I was doing mistake - I was following every step, only 
> got confused with 0-based counting - what to write in substr function in 
> perl - now it is straight - thanks.
> Another thing I was not aware of those utilities - we have linux 64-bit 
> and 32-bit machines - it seems all programs are not available for all 
> the executables - like you mentioned about twoBitToFa and faRc - I got 
> twoBitToFa for 64-bit machine not the other one. For 32-bit it seems 
> nothing is available except liftover. -- Am I missing something again or 
> it is true that all utilities are not available for all platforms?
>  
> Thanks again for your detailed help - now I am getting correct sequences.
> Thanks,
>  
> Lipika
> 
> On Wed, Sep 8, 2010 at 5:03 PM, Galt Barber <[email protected] 
> <mailto:[email protected]>> wrote:
> 
>     Hi, Lipika!
> 
>     I was able to do something very similar to the process you
>     describe and it worked.  Here are my results:
> 
>     I made a file called ranges which has each exon
> 
>     [hgwdev:~/lipika> cat ranges
>     chr1:2479150-2479831
>     chr1:2480082-2480114
>     chr1:2481163-2481306
>     chr1:2482264-2482355
>     chr1:2483000-2483156
>     chr1:2484510-2484636
>     chr1:2485144-2485253
>     chr1:2486245-2486613
> 
>      twoBitToFa /gbdb/hg18/hg18.2bit -seqList=ranges out.fa
> 
>     This extracts the pieces as separate sequences.
> 
>     Then I merge the several exon pieces into a single new fasta record
>     creating a new header and stripping out the original multiple headers.
>      echo ">NM_003820_from_gp" > test.fa
>      cat out.fa | grep -v '>' >> test.fa
> 
>     Then I reverse-complement the results since it was reported on the
>     negative strand:
>      faRc test.fa testRC.fa
> 
>      >RC_NM_003820_from_gp
>     ccttcataccggcccttcccctcggctttgcctggacagctcctgcctcc
>     cgcagggcccacctgtgtcccccagcgccgctccacccagcaggcctgag
>     cccctctctgctgccagacaccccctgctgcccactctcctgctgctcgg
>     gttctgaggcacagcttgtcacaccgaggcggattctctttctctttctc
>     tttctcttctggcccacagccgcagcaatggcgctgagttcctctgctgg
>     agttcatcctgctagctgggttcccgagctgccggtctgagcctgaggca
>     tggagcctcctggagactgggggcctcctccctggagatccacccccaaa
>     accgacgtcttgaggctggtgctgtatctcaccttcctgggagccccctg
>     ctacgccccagctctgccgtcctgcaaggaggacgagtacccagtgggct
>     ccgagtgctgccccaagtgcagtccaggttatcgtgtgaaggaggcctgc
>     ggggagctgacgggcacagtgtgtgaaccctgccctccaggcacctacat
>     tgcccacctcaatggcctaagcaagtgtctgcagtgccaaatgtgtgacc
>     cagccatgggcctgcgcgcgagccggaactgctccaggacagagaacgcc
>     gtgtgtggctgcagcccaggccacttctgcatcgtccaggacggggacca
>     ctgcgccgcgtgccgcgcttacgccacctccagcccgggccagagggtgc
>     agaagggaggcaccgagagtcaggacaccctgtgtcagaactgccccccg
>     gggaccttctctcccaatgggaccctggaggaatgtcagcaccagaccaa
>     gtgcagctggctggtgacgaaggccggagctgggaccagcagctcccact
>     gggtatggtggtttctctcagggagcctcgtcatcgtcattgtttgctcc
>     acagttggcctaatcatatgtgtgaaaagaagaaagccaaggggtgatgt
>     agtcaaggtgatcgtctccgtccagcggaaaagacaggaggcagaaggtg
>     aggccacagtcattgaggccctgcaggcccctccggacgtcaccacggtg
>     gccgtggaggagacaataccctcattcacggggaggagcccaaaccactg
>     acccacagactctgcaccccgacgccagagatacctggagcgacggctgc
>     tgaaagaggctgtccacctggcggaaccaccggagcccggaggcttgggg
>     gctccgccctgggctggcttccgtctcctccagtggagggagaggtgggg
>     cccctgctggggtagagctggggacgccacgtgccattcccatgggccag
>     tgagggcctggggcctctgttctgctgtggcctgagctccccagagtcct
>     gaggaggagcgccagttgcccctcgctcacagaccacacacccagccctc
>     ctgggccagcccagagggcccttcagaccccagctgtctgcgcgtctgac
>     tcttgtggcctcagcaggacaggccccgggcactgcctcacagccaaggc
>     tggactgggttggctgcagtgtggtgtttagtggataccacatcggaagt
>     gattttctaaattggatttgaattcggctcctgttttctatttgtcatga
>     aacagtgtatttggggagatgctgtgggaggatgtaaatatcttgtttct
>     cctcaa
> 
>     Here is the browser output for hg18 refSeq NM_003820
>     cDNA NM_003820
> 
>     CCTTCATACC GGCCCTTCCC CTCGGCTTTG CCTGGACAGC TCCTGCCTCC  50
>     CGCAGGGCCC ACCTGTGTCC CCCAGCGCCG CTCCACCCAG CAGGCCTGAG  100
>     CCCCTCTCTG CTGCCAGACA CCCCCTGCTG CCCACTCTCC TGCTGCTCGG  150
>     GTTCTGAGGC ACAGCTTGTC ACACCGAGGC GGATTCTCTT TCTCTTTCTC  200
>     TTTCTCTTCT GGCCCACAGC CGCAGCAATG GCGCTGAGTT CCTCTGCTGG  250
>     AGTTCATCCT GCTAGCTGGG TTCCCGAGCT GCCGGTCTGA GCCTGAGGCA  300
>     TGGAGCCTCC TGGAGACTGG GGGCCTCCTC CCTGGAGATC CACCCCCAAA  350
>     ACCGACGTCT TGAGGCTGGT GCTGTATCTC ACCTTCCTGG GAGCCCCCTG  400
>     CTACGCCCCA GCTCTGCCGT CCTGCAAGGA GGACGAGTAC CCAGTGGGCT  450
>     CCGAGTGCTG CCCCAAGTGC AGTCCAGGTT ATCGTGTGAA GGAGGCCTGC  500
>     GGGGAGCTGA CGGGCACAGT GTGTGAACCC TGCCCTCCAG GCACCTACAT  550
>     TGCCCACCTC AATGGCCTAA GCAAGTGTCT GCAGTGCCAA ATGTGTGACC  600
>     CAGCCATGGG CCTGCGCGCG AGCCGGAACT GCTCCAGGAC AGAGAACGCC  650
>     GTGTGTGGCT GCAGCCCAGG CCACTTCTGC ATCGTCCAGG ACGGGGACCA  700
>     CTGCGCCGCG TGCCGCGCTT ACGCCACCTC CAGCCCGGGC CAGAGGGTGC  750
>     AGAAGGGAGG CACCGAGAGT CAGGACACCC TGTGTCAGAA CTGCCCCCCG  800
>     GGGACCTTCT CTCCCAATGG GACCCTGGAG GAATGTCAGC ACCAGACCAA  850
>     GTGCAGCTGG CTGGTGACGA AGGCCGGAGC TGGGACCAGC AGCTCCCACT  900
>     GGGTATGGTG GTTTCTCTCA GGGAGCCTCG TCATCGTCAT TGTTTGCTCC  950
>     ACAGTTGGCC TAATCATATG TGTGAAAAGA AGAAAGCCAA GGGGTGATGT  1000
>     AGTCAAGGTG ATCGTCTCCG TCCAGCGGAA AAGACAGGAG GCAGAAGGTG  1050
>     AGGCCACAGT CATTGAGGCC CTGCAGGCCC CTCCGGACGT CACCACGGTG  1100
>     GCCGTGGAGG AGACAATACC CTCATTCACG GGGAGGAGCC CAAACCACTG  1150
>     ACCCACAGAC TCTGCACCCC GACGCCAGAG ATACCTGGAG CGACGGCTGC  1200
>     TGAAAGAGGC TGTCCACCTG GCGGAACCAC CGGAGCCCGG AGGCTTGGGG  1250
>     GCTCCGCCCT GGGCTGGCTT CCGTCTCCTC CAGTGGAGGG AGAGGTGGGG  1300
>     CCCCTGCTGG GGTAGAGCTG GGGACGCCAC GTGCCATTCC CATGGGCCAG  1350
>     TGAGGGCCTG GGGCCTCTGT TCTGCTGTGG CCTGAGCTCC CCAGAGTCCT  1400
>     GAGGAGGAGC GCCAGTTGCC CCTCGCTCAC AGACCACACA CCCAGCCCTC  1450
>     CTGGGCCAGC CCAGAGGGCC CTTCAGACCC CAGCTGTCTG CGCGTCTGAC  1500
>     TCTTGTGGCC TCAGCAGGAC AGGCCCCGGG CACTGCCTCA CAGCCAAGGC  1550
>     TGGACTGGGT TGGCTGCAGT GTGGTGTTTA GTGGATACCA CATCGGAAGT  1600
>     GATTTTCTAA ATTGGATTTG AATTCGGCTC CTGTTTTCTA TTTGTCATGA  1650
>     AACAGTGTAT TTGGGGAGAT GCTGTGGGAG GATGTAAATA TCTTGTTTCT  1700
>     CCTCAAaaaa aaaaaaaaaa aaaaaaaaaa
> 
>     As you can see, they are a very good match.
> 
>     -Galt
> 
> 
>     On 09/08/10 13:23, Jennifer Jackson wrote:
> 
>         Hello Lipika,
> 
>         Perhaps some help understanding the coordinate system used by
>         UCSC will help. We use a 0-based start position. This can get
>         tricky, especially when converting to the (-) strand, since we
>         also store all coordinates smallest->largest along the chromosome.
> 
>         Help is located in this wiki:
>         http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
> 
>         All database tables/files will be formatted this way unless
>         specifically noted in the data format FAQ:
>         http://genome.ucsc.edu/FAQ/FAQformat.html
> 
>         There are utilities readily available that work with our
>         coordinate system. Some function stand-alone and others require
>         a database. The public mySQL database can be used when a
>         database is required, if you do not run your own mirror.
> 
>         A list of utilities is here:
>         http://hgwdev.cse.ucsc.edu/~larrym/utilities.html
> 
>         Many can be downloaded pre-compiled from here (for certain
>         platforms):
>         http://hgdownload.cse.ucsc.edu/admin/exe/
> 
>         Otherwise, obtain the source and compile locally:
>         http://hgdownload.cse.ucsc.edu/downloads.html#source_downloads
> 
>         Public mySQL access instructions:
>         http://genome.ucsc.edu/FAQ/FAQdownloads.html#download29
> 
>         Please feel free to contact the mailing list support team again
>         if you would like more assistance.
> 
>         Warm regards,
> 
>         Jen
>         UCSC Genome Browser Support
> 
>         On 9/8/10 11:35 AM, Lipika Ray wrote:
> 
>             Hello UCSC group,
> 
>             I like to get the coding sequence of gene from refseq mrna
>             ids (like,
>             NM_003820) from hg18 version - big list of such ids.
> 
>             So I am getting information of exonstarts , exonends,
>             cdsStart, cdsend from
>             refFlat table under hg18.
> 
>             So for NM_003820, the record looks like this:
> 
>             geneName: TNFRSF14
>                   name: NM_003820
>                  chrom: chr1
>                 strand: -
>                txStart: 2479150
>                  txEnd: 2486613
>               cdsStart: 2479705
>                 cdsEnd: 2486314
>              exonCount: 8
>             exonStarts:
>             2479150,2480082,2481163,2482264,2483000,2484510,2485144,2486245,
>               exonEnds:
>             2479831,2480114,2481306,2482355,2483156,2484636,2485253,2486613,
> 
>             To get the dna sequence corresponding to the coding regions,
>             I am extracting
>             sequences from chr1.fa.gz file under chromosomes in hg18
>             version and then
>             extracting the dna sequence corresponding to the region:
> 
>             2479705-2479831, 2480082-2480114, 2481163-2481306,
>             2482264-2482355,
>             2483000-2483156, 2484510-2484636, 2485144-2485253,
>             2486245-2486314
> 
>             The corresponding sequence is not matching if I cross check
>             with the
>             sequence from web. Can you please guide me whether I can
>             extract sequence in
>             this way, or you already have sequences corresponding to
>             genes stored
>             separately in your database.
> 
>             Thanks for your help.
> 
>             Lipika
>             _______________________________________________
>             Genome maillist  -  [email protected]
>             <mailto:[email protected]>
>             https://lists.soe.ucsc.edu/mailman/listinfo/genome
> 
> 
> 
>  
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
