Hi, Lipika! I was able to do something very similar to the process you describe and it worked. Here are my results:
I made a file called ranges which has each exon:

[hgwdev:~/lipika> cat ranges
chr1:2479150-2479831
chr1:2480082-2480114
chr1:2481163-2481306
chr1:2482264-2482355
chr1:2483000-2483156
chr1:2484510-2484636
chr1:2485144-2485253
chr1:2486245-2486613

twoBitToFa /gbdb/hg18/hg18.2bit -seqList=ranges out.fa

This extracts the pieces as separate sequences. Then I merge the several exon pieces into a single new fasta record, creating a new header and stripping out the original multiple headers.

echo ">NM_003820_from_gp" > test.fa
cat out.fa | grep -v '>' >> test.fa

Then I reverse-complement the results since it was reported on the negative strand:

faRc test.fa testRC.fa

>RC_NM_003820_from_gp
ccttcataccggcccttcccctcggctttgcctggacagctcctgcctcc
cgcagggcccacctgtgtcccccagcgccgctccacccagcaggcctgag
cccctctctgctgccagacaccccctgctgcccactctcctgctgctcgg
gttctgaggcacagcttgtcacaccgaggcggattctctttctctttctc
tttctcttctggcccacagccgcagcaatggcgctgagttcctctgctgg
agttcatcctgctagctgggttcccgagctgccggtctgagcctgaggca
tggagcctcctggagactgggggcctcctccctggagatccacccccaaa
accgacgtcttgaggctggtgctgtatctcaccttcctgggagccccctg
ctacgccccagctctgccgtcctgcaaggaggacgagtacccagtgggct
ccgagtgctgccccaagtgcagtccaggttatcgtgtgaaggaggcctgc
ggggagctgacgggcacagtgtgtgaaccctgccctccaggcacctacat
tgcccacctcaatggcctaagcaagtgtctgcagtgccaaatgtgtgacc
cagccatgggcctgcgcgcgagccggaactgctccaggacagagaacgcc
gtgtgtggctgcagcccaggccacttctgcatcgtccaggacggggacca
ctgcgccgcgtgccgcgcttacgccacctccagcccgggccagagggtgc
agaagggaggcaccgagagtcaggacaccctgtgtcagaactgccccccg
gggaccttctctcccaatgggaccctggaggaatgtcagcaccagaccaa
gtgcagctggctggtgacgaaggccggagctgggaccagcagctcccact
gggtatggtggtttctctcagggagcctcgtcatcgtcattgtttgctcc
acagttggcctaatcatatgtgtgaaaagaagaaagccaaggggtgatgt
agtcaaggtgatcgtctccgtccagcggaaaagacaggaggcagaaggtg
aggccacagtcattgaggccctgcaggcccctccggacgtcaccacggtg
gccgtggaggagacaataccctcattcacggggaggagcccaaaccactg
acccacagactctgcaccccgacgccagagatacctggagcgacggctgc
tgaaagaggctgtccacctggcggaaccaccggagcccggaggcttgggg
gctccgccctgggctggcttccgtctcctccagtggagggagaggtgggg
cccctgctggggtagagctggggacgccacgtgccattcccatgggccag
tgagggcctggggcctctgttctgctgtggcctgagctccccagagtcct
gaggaggagcgccagttgcccctcgctcacagaccacacacccagccctc
ctgggccagcccagagggcccttcagaccccagctgtctgcgcgtctgac
tcttgtggcctcagcaggacaggccccgggcactgcctcacagccaaggc
tggactgggttggctgcagtgtggtgtttagtggataccacatcggaagt
gattttctaaattggatttgaattcggctcctgttttctatttgtcatga
aacagtgtatttggggagatgctgtgggaggatgtaaatatcttgtttct
cctcaa

Here is the browser output for hg18 refSeq NM_003820:

cDNA NM_003820
CCTTCATACC GGCCCTTCCC CTCGGCTTTG CCTGGACAGC TCCTGCCTCC 50
CGCAGGGCCC ACCTGTGTCC CCCAGCGCCG CTCCACCCAG CAGGCCTGAG 100
CCCCTCTCTG CTGCCAGACA CCCCCTGCTG CCCACTCTCC TGCTGCTCGG 150
GTTCTGAGGC ACAGCTTGTC ACACCGAGGC GGATTCTCTT TCTCTTTCTC 200
TTTCTCTTCT GGCCCACAGC CGCAGCAATG GCGCTGAGTT CCTCTGCTGG 250
AGTTCATCCT GCTAGCTGGG TTCCCGAGCT GCCGGTCTGA GCCTGAGGCA 300
TGGAGCCTCC TGGAGACTGG GGGCCTCCTC CCTGGAGATC CACCCCCAAA 350
ACCGACGTCT TGAGGCTGGT GCTGTATCTC ACCTTCCTGG GAGCCCCCTG 400
CTACGCCCCA GCTCTGCCGT CCTGCAAGGA GGACGAGTAC CCAGTGGGCT 450
CCGAGTGCTG CCCCAAGTGC AGTCCAGGTT ATCGTGTGAA GGAGGCCTGC 500
GGGGAGCTGA CGGGCACAGT GTGTGAACCC TGCCCTCCAG GCACCTACAT 550
TGCCCACCTC AATGGCCTAA GCAAGTGTCT GCAGTGCCAA ATGTGTGACC 600
CAGCCATGGG CCTGCGCGCG AGCCGGAACT GCTCCAGGAC AGAGAACGCC 650
GTGTGTGGCT GCAGCCCAGG CCACTTCTGC ATCGTCCAGG ACGGGGACCA 700
CTGCGCCGCG TGCCGCGCTT ACGCCACCTC CAGCCCGGGC CAGAGGGTGC 750
AGAAGGGAGG CACCGAGAGT CAGGACACCC TGTGTCAGAA CTGCCCCCCG 800
GGGACCTTCT CTCCCAATGG GACCCTGGAG GAATGTCAGC ACCAGACCAA 850
GTGCAGCTGG CTGGTGACGA AGGCCGGAGC TGGGACCAGC AGCTCCCACT 900
GGGTATGGTG GTTTCTCTCA GGGAGCCTCG TCATCGTCAT TGTTTGCTCC 950
ACAGTTGGCC TAATCATATG TGTGAAAAGA AGAAAGCCAA GGGGTGATGT 1000
AGTCAAGGTG ATCGTCTCCG TCCAGCGGAA AAGACAGGAG GCAGAAGGTG 1050
AGGCCACAGT CATTGAGGCC CTGCAGGCCC CTCCGGACGT CACCACGGTG 1100
GCCGTGGAGG AGACAATACC CTCATTCACG GGGAGGAGCC CAAACCACTG 1150
ACCCACAGAC TCTGCACCCC GACGCCAGAG ATACCTGGAG CGACGGCTGC 1200
TGAAAGAGGC TGTCCACCTG GCGGAACCAC CGGAGCCCGG AGGCTTGGGG 1250
GCTCCGCCCT GGGCTGGCTT CCGTCTCCTC CAGTGGAGGG AGAGGTGGGG 1300
CCCCTGCTGG GGTAGAGCTG GGGACGCCAC GTGCCATTCC CATGGGCCAG 1350
TGAGGGCCTG GGGCCTCTGT TCTGCTGTGG CCTGAGCTCC CCAGAGTCCT 1400
GAGGAGGAGC GCCAGTTGCC CCTCGCTCAC AGACCACACA CCCAGCCCTC 1450
CTGGGCCAGC CCAGAGGGCC CTTCAGACCC CAGCTGTCTG CGCGTCTGAC 1500
TCTTGTGGCC TCAGCAGGAC AGGCCCCGGG CACTGCCTCA CAGCCAAGGC 1550
TGGACTGGGT TGGCTGCAGT GTGGTGTTTA GTGGATACCA CATCGGAAGT 1600
GATTTTCTAA ATTGGATTTG AATTCGGCTC CTGTTTTCTA TTTGTCATGA 1650
AACAGTGTAT TTGGGGAGAT GCTGTGGGAG GATGTAAATA TCTTGTTTCT 1700
CCTCAAaaaa aaaaaaaaaa aaaaaaaaaa

As you can see, they are a very good match.

-Galt

On 09/08/10 13:23, Jennifer Jackson wrote:
> Hello Lipika,
>
> Perhaps some help understanding the coordinate system used by UCSC will
> help. We use a 0-based start position. This can get tricky, especially
> when converting to the (-) strand, since we also store all coordinates
> smallest->largest along the chromosome.
>
> Help is located in this wiki:
> http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
>
> All database tables/files will be formatted this way unless specifically
> noted in the data format FAQ:
> http://genome.ucsc.edu/FAQ/FAQformat.html
>
> There are utilities readily available that work with our coordinate
> system. Some function stand-alone and others require a database. The
> public mySQL database can be used when a database is required, if you do
> not run your own mirror.
>
> A list of utilities is here:
> http://hgwdev.cse.ucsc.edu/~larrym/utilities.html
>
> Many can be downloaded pre-compiled from here (for certain platforms):
> http://hgdownload.cse.ucsc.edu/admin/exe/
>
> Otherwise, obtain the source and compile locally:
> http://hgdownload.cse.ucsc.edu/downloads.html#source_downloads
>
> Public mySQL access instructions:
> http://genome.ucsc.edu/FAQ/FAQdownloads.html#download29
>
> Please feel free to contact the mailing list support team again if you
> would like more assistance.
>
> Warm regards,
>
> Jen
> UCSC Genome Browser Support
>
> On 9/8/10 11:35 AM, Lipika Ray wrote:
>> Hello UCSC group,
>>
>> I would like to get the coding sequence of genes from RefSeq mRNA IDs (like
>> NM_003820) from the hg18 version, for a big list of such IDs.
>>
>> So I am getting the exonStarts, exonEnds, cdsStart and cdsEnd information
>> from the refFlat table under hg18.
>>
>> So for NM_003820, the record looks like this:
>>
>> geneName: TNFRSF14
>> name: NM_003820
>> chrom: chr1
>> strand: -
>> txStart: 2479150
>> txEnd: 2486613
>> cdsStart: 2479705
>> cdsEnd: 2486314
>> exonCount: 8
>> exonStarts: 2479150,2480082,2481163,2482264,2483000,2484510,2485144,2486245,
>> exonEnds: 2479831,2480114,2481306,2482355,2483156,2484636,2485253,2486613,
>>
>> To get the DNA sequence corresponding to the coding regions, I am taking
>> the chr1.fa.gz file under chromosomes in the hg18 version and then
>> extracting the DNA sequence corresponding to these regions:
>>
>> 2479705-2479831, 2480082-2480114, 2481163-2481306, 2482264-2482355,
>> 2483000-2483156, 2484510-2484636, 2485144-2485253, 2486245-2486314
>>
>> The resulting sequence does not match when I cross-check it against the
>> sequence from the web. Can you please tell me whether I can extract the
>> sequence in this way, or whether you already have sequences corresponding
>> to genes stored separately in your database?
>>
>> Thanks for your help.
>>
>> Lipika
>> _______________________________________________
>> Genome maillist - [email protected]
>> https://lists.soe.ucsc.edu/mailman/listinfo/genome
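
For anyone who wants only the coding portion rather than the whole exons, the coordinate handling discussed in this thread can be sketched in Python. This is a minimal illustration, not a UCSC tool: the helper names (parse_coords, cds_exon_ranges, reverse_complement, spliced_sequence) are hypothetical, and it assumes that refFlat coordinates are 0-based with an exclusive end and that the ranges file takes the same convention, which the matching output above suggests.

```python
# Hypothetical helpers for illustration; none of these are UCSC utilities.

def parse_coords(field):
    """Turn a refFlat list such as '2479150,2480082,' into a list of ints."""
    return [int(x) for x in field.split(",") if x]


def cds_exon_ranges(exon_starts, exon_ends, cds_start, cds_end):
    """Clip each exon to the coding region.

    Assumes 0-based, half-open [start, end) coordinates stored
    smallest->largest along the chromosome, regardless of strand.
    Exons that fall entirely inside a UTR are dropped.
    """
    blocks = []
    for start, end in zip(exon_starts, exon_ends):
        s, e = max(start, cds_start), min(end, cds_end)
        if s < e:
            blocks.append((s, e))
    return blocks


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def spliced_sequence(chrom_seq, blocks, strand):
    """Join the blocks taken from an in-memory chromosome string.

    Half-open 0-based blocks map directly onto Python slices; a tool
    that expects 1-based inclusive positions would need start + 1.
    """
    seq = "".join(chrom_seq[s:e] for s, e in blocks)
    return reverse_complement(seq) if strand == "-" else seq


# NM_003820 record as quoted in this thread.
exon_starts = parse_coords(
    "2479150,2480082,2481163,2482264,2483000,2484510,2485144,2486245,")
exon_ends = parse_coords(
    "2479831,2480114,2481306,2482355,2483156,2484636,2485253,2486613,")
blocks = cds_exon_ranges(exon_starts, exon_ends, 2479705, 2486314)

# One "chrom:start-end" line per coding block, in the same shape as the
# ranges file shown above.
ranges = [f"chr1:{s}-{e}" for s, e in blocks]
cds_length = sum(e - s for s, e in blocks)

print("\n".join(ranges))
print(cds_length, cds_length % 3)
```

The printed lines could be saved as the ranges file and run through the same twoBitToFa, merge and faRc steps. For this record the clipped blocks add up to 852 bases, a multiple of 3, which is a quick sanity check that the CDS boundaries were applied correctly.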
