Hello Galt,

Many thanks for your kind reply. It helped me a lot to understand where I was making a mistake. I was following every step and only got confused by the 0-based counting, that is, by what to pass to the substr function in Perl. It is clear now, thanks.

Another thing: I was not aware of those utilities. We have 64-bit and 32-bit Linux machines, and it seems that not all of the programs are available as executables for every platform. Of the two you mentioned, twoBitToFa and faRc, I found twoBitToFa for the 64-bit machine but not faRc. For 32-bit it seems nothing is available except liftOver. Am I missing something again, or is it true that not all utilities are available for all platforms?

Thanks again for your detailed help - now I am getting correct sequences.

Thanks,
Lipika

On Wed, Sep 8, 2010 at 5:03 PM, Galt Barber <[email protected]> wrote:

> Hi, Lipika!
>
> I was able to do something very similar to the process you
> describe and it worked. Here are my results:
>
> I made a file called ranges which has each exon
>
> [hgwdev:~/lipika> cat ranges
> chr1:2479150-2479831
> chr1:2480082-2480114
> chr1:2481163-2481306
> chr1:2482264-2482355
> chr1:2483000-2483156
> chr1:2484510-2484636
> chr1:2485144-2485253
> chr1:2486245-2486613
>
> twoBitToFa /gbdb/hg18/hg18.2bit -seqList=ranges out.fa
>
> This extracts the pieces as separate sequences.
>
> Then I merge the several exon pieces into a single new fasta record
> creating a new header and stripping out the original multiple headers.
> echo ">NM_003820_from_gp" > test.fa
> cat out.fa | grep -v '>' >> test.fa
>
> Then I reverse-complement the results since it was reported on the negative
> strand:
> faRc test.fa testRC.fa
>
> >RC_NM_003820_from_gp
> ccttcataccggcccttcccctcggctttgcctggacagctcctgcctcc
> cgcagggcccacctgtgtcccccagcgccgctccacccagcaggcctgag
> cccctctctgctgccagacaccccctgctgcccactctcctgctgctcgg
> gttctgaggcacagcttgtcacaccgaggcggattctctttctctttctc
> tttctcttctggcccacagccgcagcaatggcgctgagttcctctgctgg
> agttcatcctgctagctgggttcccgagctgccggtctgagcctgaggca
> tggagcctcctggagactgggggcctcctccctggagatccacccccaaa
> accgacgtcttgaggctggtgctgtatctcaccttcctgggagccccctg
> ctacgccccagctctgccgtcctgcaaggaggacgagtacccagtgggct
> ccgagtgctgccccaagtgcagtccaggttatcgtgtgaaggaggcctgc
> ggggagctgacgggcacagtgtgtgaaccctgccctccaggcacctacat
> tgcccacctcaatggcctaagcaagtgtctgcagtgccaaatgtgtgacc
> cagccatgggcctgcgcgcgagccggaactgctccaggacagagaacgcc
> gtgtgtggctgcagcccaggccacttctgcatcgtccaggacggggacca
> ctgcgccgcgtgccgcgcttacgccacctccagcccgggccagagggtgc
> agaagggaggcaccgagagtcaggacaccctgtgtcagaactgccccccg
> gggaccttctctcccaatgggaccctggaggaatgtcagcaccagaccaa
> gtgcagctggctggtgacgaaggccggagctgggaccagcagctcccact
> gggtatggtggtttctctcagggagcctcgtcatcgtcattgtttgctcc
> acagttggcctaatcatatgtgtgaaaagaagaaagccaaggggtgatgt
> agtcaaggtgatcgtctccgtccagcggaaaagacaggaggcagaaggtg
> aggccacagtcattgaggccctgcaggcccctccggacgtcaccacggtg
> gccgtggaggagacaataccctcattcacggggaggagcccaaaccactg
> acccacagactctgcaccccgacgccagagatacctggagcgacggctgc
> tgaaagaggctgtccacctggcggaaccaccggagcccggaggcttgggg
> gctccgccctgggctggcttccgtctcctccagtggagggagaggtgggg
> cccctgctggggtagagctggggacgccacgtgccattcccatgggccag
> tgagggcctggggcctctgttctgctgtggcctgagctccccagagtcct
> gaggaggagcgccagttgcccctcgctcacagaccacacacccagccctc
> ctgggccagcccagagggcccttcagaccccagctgtctgcgcgtctgac
> tcttgtggcctcagcaggacaggccccgggcactgcctcacagccaaggc
> tggactgggttggctgcagtgtggtgtttagtggataccacatcggaagt
> gattttctaaattggatttgaattcggctcctgttttctatttgtcatga
> aacagtgtatttggggagatgctgtgggaggatgtaaatatcttgtttct
> cctcaa
>
> Here is the browser output for hg18 refSeq NM_003820
> cDNA NM_003820
>
> CCTTCATACC GGCCCTTCCC CTCGGCTTTG CCTGGACAGC TCCTGCCTCC 50
> CGCAGGGCCC ACCTGTGTCC CCCAGCGCCG CTCCACCCAG CAGGCCTGAG 100
> CCCCTCTCTG CTGCCAGACA CCCCCTGCTG CCCACTCTCC TGCTGCTCGG 150
> GTTCTGAGGC ACAGCTTGTC ACACCGAGGC GGATTCTCTT TCTCTTTCTC 200
> TTTCTCTTCT GGCCCACAGC CGCAGCAATG GCGCTGAGTT CCTCTGCTGG 250
> AGTTCATCCT GCTAGCTGGG TTCCCGAGCT GCCGGTCTGA GCCTGAGGCA 300
> TGGAGCCTCC TGGAGACTGG GGGCCTCCTC CCTGGAGATC CACCCCCAAA 350
> ACCGACGTCT TGAGGCTGGT GCTGTATCTC ACCTTCCTGG GAGCCCCCTG 400
> CTACGCCCCA GCTCTGCCGT CCTGCAAGGA GGACGAGTAC CCAGTGGGCT 450
> CCGAGTGCTG CCCCAAGTGC AGTCCAGGTT ATCGTGTGAA GGAGGCCTGC 500
> GGGGAGCTGA CGGGCACAGT GTGTGAACCC TGCCCTCCAG GCACCTACAT 550
> TGCCCACCTC AATGGCCTAA GCAAGTGTCT GCAGTGCCAA ATGTGTGACC 600
> CAGCCATGGG CCTGCGCGCG AGCCGGAACT GCTCCAGGAC AGAGAACGCC 650
> GTGTGTGGCT GCAGCCCAGG CCACTTCTGC ATCGTCCAGG ACGGGGACCA 700
> CTGCGCCGCG TGCCGCGCTT ACGCCACCTC CAGCCCGGGC CAGAGGGTGC 750
> AGAAGGGAGG CACCGAGAGT CAGGACACCC TGTGTCAGAA CTGCCCCCCG 800
> GGGACCTTCT CTCCCAATGG GACCCTGGAG GAATGTCAGC ACCAGACCAA 850
> GTGCAGCTGG CTGGTGACGA AGGCCGGAGC TGGGACCAGC AGCTCCCACT 900
> GGGTATGGTG GTTTCTCTCA GGGAGCCTCG TCATCGTCAT TGTTTGCTCC 950
> ACAGTTGGCC TAATCATATG TGTGAAAAGA AGAAAGCCAA GGGGTGATGT 1000
> AGTCAAGGTG ATCGTCTCCG TCCAGCGGAA AAGACAGGAG GCAGAAGGTG 1050
> AGGCCACAGT CATTGAGGCC CTGCAGGCCC CTCCGGACGT CACCACGGTG 1100
> GCCGTGGAGG AGACAATACC CTCATTCACG GGGAGGAGCC CAAACCACTG 1150
> ACCCACAGAC TCTGCACCCC GACGCCAGAG ATACCTGGAG CGACGGCTGC 1200
> TGAAAGAGGC TGTCCACCTG GCGGAACCAC CGGAGCCCGG AGGCTTGGGG 1250
> GCTCCGCCCT GGGCTGGCTT CCGTCTCCTC CAGTGGAGGG AGAGGTGGGG 1300
> CCCCTGCTGG GGTAGAGCTG GGGACGCCAC GTGCCATTCC CATGGGCCAG 1350
> TGAGGGCCTG GGGCCTCTGT TCTGCTGTGG CCTGAGCTCC CCAGAGTCCT 1400
> GAGGAGGAGC GCCAGTTGCC CCTCGCTCAC AGACCACACA CCCAGCCCTC 1450
> CTGGGCCAGC CCAGAGGGCC CTTCAGACCC CAGCTGTCTG CGCGTCTGAC 1500
> TCTTGTGGCC TCAGCAGGAC AGGCCCCGGG CACTGCCTCA CAGCCAAGGC 1550
> TGGACTGGGT TGGCTGCAGT GTGGTGTTTA GTGGATACCA CATCGGAAGT 1600
> GATTTTCTAA ATTGGATTTG AATTCGGCTC CTGTTTTCTA TTTGTCATGA 1650
> AACAGTGTAT TTGGGGAGAT GCTGTGGGAG GATGTAAATA TCTTGTTTCT 1700
> CCTCAAaaaa aaaaaaaaaa aaaaaaaaaa
>
> As you can see, they are a very good match.
>
> -Galt
>
>
> On 09/08/10 13:23, Jennifer Jackson wrote:
>
>> Hello Lipika,
>>
>> Perhaps some help understanding the coordinate system used by UCSC will
>> help. We use a 0-based start position. This can get tricky, especially when
>> converting to the (-) strand, since we also store all coordinates
>> smallest->largest along the chromosome.
>>
>> Help is located in this wiki:
>> http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
>>
>> All database tables/files will be formatted this way unless specifically
>> noted in the data format FAQ:
>> http://genome.ucsc.edu/FAQ/FAQformat.html
>>
>> There are utilities readily available that work with our coordinate
>> system. Some function stand-alone and others require a database. The public
>> mySQL database can be used when a database is required, if you do not run
>> your own mirror.
>>
>> A list of utilities is here:
>> http://hgwdev.cse.ucsc.edu/~larrym/utilities.html
>>
>> Many can be downloaded pre-compiled from here (for certain platforms):
>> http://hgdownload.cse.ucsc.edu/admin/exe/
>>
>> Otherwise, obtain the source and compile locally:
>> http://hgdownload.cse.ucsc.edu/downloads.html#source_downloads
>>
>> Public mySQL access instructions:
>> http://genome.ucsc.edu/FAQ/FAQdownloads.html#download29
>>
>> Please feel free to contact the mailing list support team again if you
>> would like more assistance.
>>
>> Warm regards,
>>
>> Jen
>> UCSC Genome Browser Support
>>
>> On 9/8/10 11:35 AM, Lipika Ray wrote:
>>
>>> Hello UCSC group,
>>>
>>> I would like to get the coding sequence of genes from RefSeq mRNA IDs (like
>>> NM_003820) from the hg18 version, for a big list of such IDs.
>>>
>>> So I am getting the exonStarts, exonEnds, cdsStart and cdsEnd information
>>> from the refFlat table under hg18.
>>>
>>> So for NM_003820, the record looks like this:
>>>
>>> geneName: TNFRSF14
>>> name: NM_003820
>>> chrom: chr1
>>> strand: -
>>> txStart: 2479150
>>> txEnd: 2486613
>>> cdsStart: 2479705
>>> cdsEnd: 2486314
>>> exonCount: 8
>>> exonStarts: 2479150,2480082,2481163,2482264,2483000,2484510,2485144,2486245,
>>> exonEnds: 2479831,2480114,2481306,2482355,2483156,2484636,2485253,2486613,
>>>
>>> To get the DNA sequence corresponding to the coding regions, I am
>>> extracting sequences from the chr1.fa.gz file under chromosomes in the hg18
>>> version and then extracting the DNA sequence corresponding to the regions:
>>>
>>> 2479705-2479831, 2480082-2480114, 2481163-2481306, 2482264-2482355,
>>> 2483000-2483156, 2484510-2484636, 2485144-2485253, 2486245-2486314
>>>
>>> The corresponding sequence does not match when I cross-check it against the
>>> sequence from the web. Can you please guide me on whether I can extract the
>>> sequence in this way, or whether you already have the sequences
>>> corresponding to genes stored separately in your database?
>>>
>>> Thanks for your help.
>>>
>>> Lipika
>>> _______________________________________________
>>> Genome maillist - [email protected]
>>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
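
For readers who prefer to script the steps discussed in this thread, here is a minimal Python sketch of the coordinate handling. It assumes the refFlat fields are 0-based, half-open and stored smallest->largest, as described above. The function names (parse_coords, cds_blocks, reverse_complement, extract_cds) are made up for illustration and are not part of any UCSC utility. The sketch does not read a genome file; chrom_seq stands for a chromosome sequence you have already loaded as a string.

```python
from typing import List, Tuple

# Complement table for upper- and lower-case bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_coords(field: str) -> List[int]:
    """Parse a refFlat-style comma-separated list such as '10,20,30,'."""
    return [int(x) for x in field.split(",") if x]


def cds_blocks(exon_starts: List[int], exon_ends: List[int],
               cds_start: int, cds_end: int) -> List[Tuple[int, int]]:
    """Clip each exon to [cds_start, cds_end) and drop non-coding exons.

    All coordinates are 0-based, half-open, ascending along the chromosome.
    """
    blocks = []
    for start, end in zip(exon_starts, exon_ends):
        s, e = max(start, cds_start), min(end, cds_end)
        if s < e:
            blocks.append((s, e))
    return blocks


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_cds(chrom_seq: str, blocks: List[Tuple[int, int]],
                strand: str) -> str:
    """Join the blocks in chromosome order; reverse-complement for '-'."""
    # Python slices are 0-based and half-open, so the table coordinates
    # can be used directly as chrom_seq[start:end] with no +1 or -1.
    joined = "".join(chrom_seq[s:e] for s, e in blocks)
    return reverse_complement(joined) if strand == "-" else joined


# Values copied from the NM_003820 refFlat record quoted in this thread.
exon_starts = parse_coords(
    "2479150,2480082,2481163,2482264,2483000,2484510,2485144,2486245,")
exon_ends = parse_coords(
    "2479831,2480114,2481306,2482355,2483156,2484636,2485253,2486613,")
blocks = cds_blocks(exon_starts, exon_ends, 2479705, 2486314)

# Print one range per line in the chrom:start-end form of the ranges file.
for s, e in blocks:
    print(f"chr1:{s}-{e}")
```

With the record above, the clipped blocks are the same eight ranges listed in the original question, so the ranges themselves were fine and only the 0-based extraction needed care. In Perl, substr takes a 0-based offset and a length, so the equivalent call for one block would be substr($seq, $start, $end - $start).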
