I was
exporting some upstream sequences for Homo sapiens. Of the 31,545 genes
exported (no filters) I received 21 duplicates. Both the fasta header
line with '>' and the upstream sequence were identical in all cases.
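
The kind of check described above (same header, same sequence, seen twice) can be sketched as follows. This is a minimal illustration, not the script that produced the debugging output in this post. The helper names `read_fasta` and `find_duplicates` are hypothetical, and the code assumes that the gene identifier is the first '|'-separated field of the header and that a trailing '.N' is a version suffix.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def find_duplicates(records):
    """Return (record_no, first_record_no, gene_id, sequences_match) tuples."""
    first_seen = {}  # gene_id -> (record number, sequence) of first occurrence
    duplicates = []
    for record_no, (header, seq) in enumerate(records, start=1):
        # Assumption: first '|' field is the gene id; drop a '.N' version suffix.
        gene_id = header.split("|")[0].split(".")[0]
        if gene_id in first_seen:
            first_no, first_seq = first_seen[gene_id]
            duplicates.append((record_no, first_no, gene_id, seq == first_seq))
        else:
            first_seen[gene_id] = (record_no, seq)
    return duplicates


# Made-up demo data; GENE_A and GENE_B are placeholders, not real identifiers.
demo = io.StringIO(
    ">GENE_A.1|demo\nACGT\nAC\n>GENE_B.2\nTTTT\n>GENE_A.1|demo\nACGTAC\n"
)
duplicates = find_duplicates(read_fasta(demo))
for record_no, first_no, gene_id, same in duplicates:
    print(f"record {record_no}: {gene_id} first seen at record {first_no}; "
          f"sequences match? {same}")
```

For a real export, the `io.StringIO` demo would be replaced by an open file handle, for example one from `gzip.open(path, "rt")` for a gzipped results file.
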
Here is some debugging output showing details:

NOTE 1th duplicate, at fasta input record 30639: [ENSG00000185960, ENSG00000185960.4].
gene identifier 'ENSG00000185960' previously found at fasta input record 8429 which has these geneIds: [ENSG00000185960, ENSG00000185960.4].
Do the sequences match? true
Partial old sequence: TAAAAAGAAAAGTGTTTCCTCCCTGGCTGGAGGACCCAGGAGGAGGTCCCAGTTTTCCGGTGGGGATGGGCGTGGAGTAGGGGGCGGGGAAGGGATGAGG
Partial new sequence: TAAAAAGAAAAGTGTTTCCTCCCTGGCTGGAGGACCCAGGAGGAGGTCCCAGTTTTCCGGTGGGGATGGGCGTGGAGTAGGGGGCGGGGAAGGGATGAGG

NOTE 2th duplicate, at fasta input record 30727: [ENSG00000197976, ENSG00000197976.2].
gene identifier 'ENSG00000197976' previously found at fasta input record 9268 which has these geneIds: [ENSG00000197976, ENSG00000197976.2].
Do the sequences match? true
Partial old sequence: CCTTCCCCTCCCCTCCCCTCCTTTCCCTTCCCCTCCCCTCCTCTCCCTTCCCCTCCCCTCCTCTCCCTTCCCCTCCCTTCCCCTCCATTCCCCTCCCTTC
Partial new sequence: CCTTCCCCTCCCCTCCCCTCCTTTCCCTTCCCCTCCCCTCCTCTCCCTTCCCCTCCCCTCCTCTCCCTTCCCCTCCCTTCCCCTCCATTCCCCTCCCTTC

NOTE 3th duplicate, at fasta input record 30730: [ENSG00000182162, ENSG00000182162.2].
gene identifier 'ENSG00000182162' previously found at fasta input record 9310 which has these geneIds: [ENSG00000182162, ENSG00000182162.2].
Do the sequences match? true
Partial old sequence: TTTATTTGTTTATTTATTTATTTTTTGAGACAGAGTTTCGCTCTTGTTGCCCAGGCTGGGGTGCAGCGGCATGATCTCGGCTCACTGCAACCTCCGCCTC
Partial new sequence: TTTATTTGTTTATTTATTTATTTTTTGAGACAGAGTTTCGCTCTTGTTGCCCAGGCTGGGGTGCAGCGGCATGATCTCGGCTCACTGCAACCTCCGCCTC

NOTE 4th duplicate, at fasta input record 30798: [ENSG00000205681, ENSG00000205681.1].
gene identifier 'ENSG00000205681' previously found at fasta input record 10007 which has these geneIds: [ENSG00000205681, ENSG00000205681.1].
Do the sequences match? true
Partial old sequence: GCTATGGCGCTTGGCTACCTGAGTCTTTATTCTGCCTTCCAGGTGCTTGTTGGTTGGATAACTTTGGGTAGGTTCTTGTACCTCTTTGAGCTTCAAGACT
Partial new sequence: GCTATGGCGCTTGGCTACCTGAGTCTTTATTCTGCCTTCCAGGTGCTTGTTGGTTGGATAACTTTGGGTAGGTTCTTGTACCTCTTTGAGCTTCAAGACT

NOTE 5th duplicate, at fasta input record 30820: [ENSG00000124343, ENSG00000124343.2].
gene identifier 'ENSG00000124343' previously found at fasta input record 7623 which has these geneIds: [ENSG00000124343, ENSG00000124343.2].
Do the sequences match? true
Partial old sequence: CTAATCTCCAGTGATCCGCTCACCTCAGCCACCCAAAGTGCTGGGATTACAGACGTGAGCCACCGGGCCCAGCCAGCAGGGCTGATTTCTTCTGATGCTG
Partial new sequence: CTAATCTCCAGTGATCCGCTCACCTCAGCCACCCAAAGTGCTGGGATTACAGACGTGAGCCACCGGGCCCAGCCAGCAGGGCTGATTTCTTCTGATGCTG

NOTE 6th duplicate, at fasta input record 30844: [ENSG00000124333, ENSG00000124333.4].
gene identifier 'ENSG00000124333' previously found at fasta input record 19603 which has these geneIds: [ENSG00000124333, ENSG00000124333.4].
Do the sequences match? true
Partial old sequence: AGGAAAAATAGCTAATGCATGCTGGGCTTTAATACCTAGGTGATGGGTTGATAGGTGCAGCAAATTACCATGGCACACATTTACCTGTATAACAAACCTG
Partial new sequence: AGGAAAAATAGCTAATGCATGCTGGGCTTTAATACCTAGGTGATGGGTTGATAGGTGCAGCAAATTACCATGGCACACATTTACCTGTATAACAAACCTG

NOTE 7th duplicate, at fasta input record 30934: [ENSG00000198223, ENSG00000198223.3].
gene identifier 'ENSG00000198223' previously found at fasta input record 8798 which has these geneIds: [ENSG00000198223, ENSG00000198223.3].
Do the sequences match? true
Partial old sequence: TCCTGCAGGAATGGGGAGGCTAAGACGGTAGAGGTGCAGCCTGGTCAGCCATCTTTCACCTTTGCTGATGTTGCTATCCAGGTGTTTTCCATTGCATGTG
Partial new sequence: TCCTGCAGGAATGGGGAGGCTAAGACGGTAGAGGTGCAGCCTGGTCAGCCATCTTTCACCTTTGCTGATGTTGCTATCCAGGTGTTTTCCATTGCATGTG

NOTE 8th duplicate, at fasta input record 30968: [ENSG00000205755, ENSG00000205755.1].
gene identifier 'ENSG00000205755' previously found at fasta input record 9187 which has these geneIds: [ENSG00000205755, ENSG00000205755.1].
Do the sequences match? true
Partial old sequence: GACGGAGTCTTGCTCTTGTCGCCCAGGCTGGAGTGCCGTGGCACGATCTCAGCTCACTGCCAACTCCGCCTCCCGGGTTCACGCCATTCTCCTGCCTCAG
Partial new sequence: GACGGAGTCTTGCTCTTGTCGCCCAGGCTGGAGTGCCGTGGCACGATCTCAGCTCACTGCCAACTCCGCCTCCCGGGTTCACGCCATTCTCCTGCCTCAG

NOTE 9th duplicate, at fasta input record 31013: [ENSG00000196433, ENSG00000196433.2].
gene identifier 'ENSG00000196433' previously found at fasta input record 9741 which has these geneIds: [ENSG00000196433, ENSG00000196433.2].
Do the sequences match? true
Partial old sequence: GCCAATATAGTGAAACCCTGTCTCTACGAAAAATACAAAAATTAGCCAGGTATGGTGGCAGGTGCTTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGAA
Partial new sequence: GCCAATATAGTGAAACCCTGTCTCTACGAAAAATACAAAAATTAGCCAGGTATGGTGGCAGGTGCTTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGAA

NOTE 10th duplicate, at fasta input record 31022: [ENSG00000168939, ENSG00000168939.2].
gene identifier 'ENSG00000168939' previously found at fasta input record 20849 which has these geneIds: [ENSG00000168939, ENSG00000168939.2].
Do the sequences match? true
Partial old sequence: GAGACAGCCTGAGTCAGCCTGAGTTAAAATCCTAGATCTGCAAACTGCCAACTGTGTAACCTTGGACAAGTTACTTAAGGTCTTTGGACCTTGGTTTCTC
Partial new sequence: GAGACAGCCTGAGTCAGCCTGAGTTAAAATCCTAGATCTGCAAACTGCCAACTGTGTAACCTTGGACAAGTTACTTAAGGTCTTTGGACCTTGGTTTCTC

NOTE 11th duplicate, at fasta input record 31055: [ENSG00000169100, ENSG00000169100.3].
gene identifier 'ENSG00000169100' previously found at fasta input record 7624 which has these geneIds: [ENSG00000169100, ENSG00000169100.3].
Do the sequences match? true
Partial old sequence: AGCCAGCCTCATCTGGAAATAGCAGCTCTGGTCCCGGCCTCGCTGAGGCACTGAAAACCAGCACCAGGGCCCCGTCCAGCCCGGCCTCGCTGAGGCTGGG
Partial new sequence: AGCCAGCCTCATCTGGAAATAGCAGCTCTGGTCCCGGCCTCGCTGAGGCACTGAAAACCAGCACCAGGGCCCCGTCCAGCCCGGCCTCGCTGAGGCTGGG

NOTE 12th duplicate, at fasta input record 31115: [ENSG00000185291, ENSG00000185291.3].
gene identifier 'ENSG00000185291' previously found at fasta input record 8154 which has these geneIds: [ENSG00000185291, ENSG00000185291.3].
Do the sequences match? true
Partial old sequence: AGGCTGGTCTTGAACCCCTGACCTCAGGTGATGCACCCACCTTGGCCTCCCACAGAGCTGGGATTACAGGCGTGAGCCACTGGGCCCCGCCCTGTATTTG
Partial new sequence: AGGCTGGTCTTGAACCCCTGACCTCAGGTGATGCACCCACCTTGGCCTCCCACAGAGCTGGGATTACAGGCGTGAGCCACTGGGCCCCGCCCTGTATTTG

NOTE 13th duplicate, at fasta input record 31130: [ENSG00000124334, ENSG00000124334.6].
gene identifier 'ENSG00000124334' previously found at fasta input record 19934 which has these geneIds: [ENSG00000124334, ENSG00000124334.6].
Do the sequences match? true
Partial old sequence: CTTTTCTCTTAAGCATGGGTGACATAGTACTCTTTCTTCATGTGTTTGATAAATTTGTTTTTATCTTAGAAATTGTGAATGGTATACATTGTTGAGACTG
Partial new sequence: CTTTTCTCTTAAGCATGGGTGACATAGTACTCTTTCTTCATGTGTTTGATAAATTTGTTTTTATCTTAGAAATTGTGAATGGTATACATTGTTGAGACTG

NOTE 14th duplicate, at fasta input record 31198: [ENSG00000169084, ENSG00000169084.3].
gene identifier 'ENSG00000169084' previously found at fasta input record 9163 which has these geneIds: [ENSG00000169084, ENSG00000169084.3].
Do the sequences match? true
Partial old sequence: ATTACCTGAGGTCAGGAGTTTGAGACCAGCCAGGCCAACATGGTGAAATCCCATCTCTATTAAAAATACGAAAATTATTTGGGTGTGCTGGTGCATGCCT
Partial new sequence: ATTACCTGAGGTCAGGAGTTTGAGACCAGCCAGGCCAACATGGTGAAATCCCATCTCTATTAAAAATACGAAAATTATTTGGGTGTGCTGGTGCATGCCT

NOTE 15th duplicate, at fasta input record 31327: [ENSG00000182484, ENSG00000182484.4].
gene identifier 'ENSG00000182484' previously found at fasta input record 19614 which has these geneIds: [ENSG00000182484, ENSG00000182484.4].
Do the sequences match? true
Partial old sequence: ATGCATTCAGAAAACTTTAGATCACGGTTGAGAAGAATCAAAAATATTAAATCAAATGCAGATACTCCTTGTTTAGGAGCAGTACACTCATTATTGTTAG
Partial new sequence: ATGCATTCAGAAAACTTTAGATCACGGTTGAGAAGAATCAAAAATATTAAATCAAATGCAGATACTCCTTGTTTAGGAGCAGTACACTCATTATTGTTAG

NOTE 16th duplicate, at fasta input record 31342: [ENSG00000002586, ENSG00000002586.7].
gene identifier 'ENSG00000002586' previously found at fasta input record 8086 which has these geneIds: [ENSG00000002586, ENSG00000002586.7].
Do the sequences match? true
Partial old sequence: AGCCTGTACCCCAGAACTTAAAGTATAATAATAACAATAATAAAAAGACAGGTGTTATCTCAGAGCCCCTGACTCAGTCGGCTGGGCAGCAAGTATGCCA
Partial new sequence: AGCCTGTACCCCAGAACTTAAAGTATAATAATAACAATAATAAAAAGACAGGTGTTATCTCAGAGCCCCTGACTCAGTCGGCTGGGCAGCAAGTATGCCA

NOTE 17th duplicate, at fasta input record 31373: [ENSG00000182378, ENSG00000182378.3].
gene identifier 'ENSG00000182378' previously found at fasta input record 8467 which has these geneIds: [ENSG00000182378, ENSG00000182378.3].
Do the sequences match? true
Partial old sequence: GACCACAGTCCACATCACACCAGGACACGGAGGAAGGGCCAGGCCTCATGACCACAGTCCAGATCACACCAGGACACAGAGGAAGGGCCGGGCCCTGTGA
Partial new sequence: GACCACAGTCCACATCACACCAGGACACGGAGGAAGGGCCAGGCCTCATGACCACAGTCCAGATCACACCAGGACACAGAGGAAGGGCCGGGCCCTGTGA

NOTE 18th duplicate, at fasta input record 31428: [ENSG00000169093, ENSG00000169093.5].
gene identifier 'ENSG00000169093' previously found at fasta input record 9036 which has these geneIds: [ENSG00000169093, ENSG00000169093.5].
Do the sequences match? true
Partial old sequence: TATTCCTTGATTTCAGATGTCTGGGCTCCAGAGCTGTAATACAATTAAGTTTTGCTGTTTTAAGCCCCAGGGTTTTGAGTGACAGTTACCAGCAACCCCC
Partial new sequence: TATTCCTTGATTTCAGATGTCTGGGCTCCAGAGCTGTAATACAATTAAGTTTTGCTGTTTTAAGCCCCAGGGTTTTGAGTGACAGTTACCAGCAACCCCC

NOTE 19th duplicate, at fasta input record 31442: [ENSG00000167393, ENSG00000167393.7].
gene identifier 'ENSG00000167393' previously found at fasta input record 9170 which has these geneIds: [ENSG00000167393, ENSG00000167393.7].
Do the sequences match? true
Partial old sequence: CCCAGCAAACTCTGCAACACCTCAGGCCCTGCCAGCCTTGGGGGCCCGACAGCACCTCTTTGTTCTCCCAGAGCAAAGCCTGCACGGAGTGGGCCCCCGG
Partial new sequence: CCCAGCAAACTCTGCAACACCTCAGGCCCTGCCAGCCTTGGGGGCCCGACAGCACCTCTTTGTTCTCCCAGAGCAAAGCCTGCACGGAGTGGGCCCCCGG

NOTE 20th duplicate, at fasta input record 31485: [ENSG00000178605, ENSG00000178605.4].
gene identifier 'ENSG00000178605' previously found at fasta input record 9699 which has these geneIds: [ENSG00000178605, ENSG00000178605.4].
Do the sequences match? true
Partial old sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Partial new sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NOTE 21th duplicate, at fasta input record 31508: [ENSG00000169098, ENSG00000169098.5].
gene identifier 'ENSG00000169098' previously found at fasta input record 9900 which has these geneIds: [ENSG00000169098, ENSG00000169098.5].
Do the sequences match? true
Partial old sequence: GCCGGGCACGGTGGCTCACGCCTGCAATGCCAGCACTTTAGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCGAGACCAGCCTGGCCAACATG
Partial new sequence: GCCGGGCACGGTGGCTCACGCCTGCAATGCCAGCACTTTAGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCGAGACCAGCCTGGCCAACATG

For those interested, the results should still be available for a little while:
http://www.biomart.org/biomart/martresults?file=martquery_0523154530_544.txt.gz

Ideas?

Thanks,
Peter Andrews

--
--------------
Peter Andrews
Computational Genetics Lab
Dartmouth Hitchcock Medical Center
(603) 653-3598
- [mart-dev] Biomart exporting duplicate genes/sequences? Peter Andrews
