Hmm. It's telling the truth - the moltype has a non-zero rank, so it doesn't find it.
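To illustrate what is going on, here is a minimal sketch in plain Java. It is not the actual BioJavaX code: the Note class, the two lookup methods and the rank value 1 are all made up for this example. A lookup pinned to rank 0 misses a note stored under another rank, whereas a first-match lookup finds it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

public class RankLookupSketch {

    /** Hypothetical stand-in for an annotation note: term name, rank, value. */
    public static final class Note {
        final String term;
        final int rank;
        final String value;

        public Note(String term, int rank, String value) {
            this.term = term;
            this.rank = rank;
            this.value = value;
        }
    }

    /** Strict lookup: only a note with exactly this term AND this rank matches. */
    public static String getByTermAndRank(List<Note> notes, String term, int rank) {
        for (Note n : notes) {
            if (n.term.equals(term) && n.rank == rank) {
                return n.value;
            }
        }
        throw new NoSuchElementException("No such property: " + term + ", rank " + rank);
    }

    /** Lenient lookup: the first note with this term matches, whatever its rank. */
    public static String getFirstByTerm(List<Note> notes, String term) {
        for (Note n : notes) {
            if (n.term.equals(term)) {
                return n.value;
            }
        }
        throw new NoSuchElementException("No such property: " + term);
    }

    public static void main(String[] args) {
        List<Note> notes = new ArrayList<>();
        // Assumption for the example: the parser stored moltype with rank 1.
        notes.add(new Note("biojavax:moltype", 1, "mRNA"));

        try {
            getByTermAndRank(notes, "biojavax:moltype", 0);
        } catch (NoSuchElementException e) {
            // The strict lookup fails because no note has rank 0.
            System.out.println(e.getMessage());
        }

        // The lenient lookup ignores the rank and returns the stored value.
        System.out.println(getFirstByTerm(notes, "biojavax:moltype"));
    }
}
```

The strict variant mirrors the "rank 0" message in the exception quoted below; the lenient variant is only meant to show the idea of returning the first value found regardless of rank.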
I've modified SimpleRichAnnotation in CVS to return the first value found regardless of rank. Try it now.

cheers,
Richard

On Wed, 2006-06-07 at 11:36 -0400, Seth Johnson wrote:
> Hello again Richard,
>
> Thank you for updating the INSDseqFormat to 1.4 so promptly. Another
> reason I inquired about accessing different terms is this code:
>
> rs.getAnnotation().getProperty(Terms.getMolTypeTerm())
>
> When the above is executed after parsing the INSDseq file, it produces
> the following exception:
> ~~~~~~~~~~~~~~~~~~~~~
> Exception in thread "main" java.util.NoSuchElementException: No such property: biojavax:moltype, rank 0
>     at org.biojavax.SimpleRichAnnotation.getNote(SimpleRichAnnotation.java:137)
>     at org.biojavax.SimpleRichAnnotation.getProperty(SimpleRichAnnotation.java:147)
>     at exonhit.parsers.GenBankParser.main(GenBankParser.java:370)
> ~~~~~~~~~~~~~~~~~~~~~
> The file that I'm parsing is as follows and does contain the 'moltype':
> +++++++++++++++++++++
> <?xml version="1.0"?>
> <!DOCTYPE INSDSet PUBLIC "-//NCBI//INSD INSDSeq/EN"
> "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd">
> <INSDSet>
> <INSDSeq>
> <INSDSeq_locus>AY069118</INSDSeq_locus>
> <INSDSeq_length>1502</INSDSeq_length>
> <INSDSeq_strandedness>single</INSDSeq_strandedness>
> <INSDSeq_moltype>mRNA</INSDSeq_moltype>
> <INSDSeq_topology>linear</INSDSeq_topology>
> <INSDSeq_division>INV</INSDSeq_division>
> <INSDSeq_update-date>17-DEC-2001</INSDSeq_update-date>
> <INSDSeq_create-date>15-DEC-2001</INSDSeq_create-date>
> <INSDSeq_definition>Drosophila melanogaster GH13089 full length
> cDNA</INSDSeq_definition>
> <INSDSeq_primary-accession>AY069118</INSDSeq_primary-accession>
> <INSDSeq_accession-version>AY069118.1</INSDSeq_accession-version>
> <INSDSeq_other-seqids>
> <INSDSeqid>gb|AY069118.1|</INSDSeqid>
> <INSDSeqid>gi|17861571</INSDSeqid>
> </INSDSeq_other-seqids>
> <INSDSeq_keywords>
> <INSDKeyword>FLI_CDNA</INSDKeyword>
> </INSDSeq_keywords>
> <INSDSeq_source>Drosophila melanogaster
(fruit fly)</INSDSeq_source> > <INSDSeq_organism>Drosophila melanogaster</INSDSeq_organism> > <INSDSeq_taxonomy>Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; > Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; > Ephydroidea; Drosophilidae; Drosophila</INSDSeq_taxonomy> > <INSDSeq_references> > <INSDReference> > <INSDReference_reference>1 (bases 1 to 1502)</INSDReference_reference> > <INSDReference_position>1..1502</INSDReference_position> > <INSDReference_authors> > <INSDAuthor>Stapleton,M.</INSDAuthor> > <INSDAuthor>Brokstein,P.</INSDAuthor> > <INSDAuthor>Hong,L.</INSDAuthor> > <INSDAuthor>Agbayani,A.</INSDAuthor> > <INSDAuthor>Carlson,J.</INSDAuthor> > <INSDAuthor>Champe,M.</INSDAuthor> > <INSDAuthor>Chavez,C.</INSDAuthor> > <INSDAuthor>Dorsett,V.</INSDAuthor> > <INSDAuthor>Farfan,D.</INSDAuthor> > <INSDAuthor>Frise,E.</INSDAuthor> > <INSDAuthor>George,R.</INSDAuthor> > <INSDAuthor>Gonzalez,M.</INSDAuthor> > <INSDAuthor>Guarin,H.</INSDAuthor> > <INSDAuthor>Li,P.</INSDAuthor> > <INSDAuthor>Liao,G.</INSDAuthor> > <INSDAuthor>Miranda,A.</INSDAuthor> > <INSDAuthor>Mungall,C.J.</INSDAuthor> > <INSDAuthor>Nunoo,J.</INSDAuthor> > <INSDAuthor>Pacleb,J.</INSDAuthor> > <INSDAuthor>Paragas,V.</INSDAuthor> > <INSDAuthor>Park,S.</INSDAuthor> > <INSDAuthor>Phouanenavong,S.</INSDAuthor> > <INSDAuthor>Wan,K.</INSDAuthor> > <INSDAuthor>Yu,C.</INSDAuthor> > <INSDAuthor>Lewis,S.E.</INSDAuthor> > <INSDAuthor>Rubin,G.M.</INSDAuthor> > <INSDAuthor>Celniker,S.</INSDAuthor> > </INSDReference_authors> > <INSDReference_title>Direct Submission</INSDReference_title> > <INSDReference_journal>Submitted (10-DEC-2001) Berkeley > Drosophila Genome Project, Lawrence Berkeley National Laboratory, One > Cyclotron Road, Berkeley, CA 94720, USA</INSDReference_journal> > </INSDReference> > </INSDSeq_references> > <INSDSeq_comment>Sequence submitted by: Berkeley Drosophila Genome > Project Lawrence Berkeley National Laboratory Berkeley, CA 94720 This > clone was sequenced as 
part of a high-throughput process to sequence > clones from Drosophila Gene Collection 1 (Rubin et al., Science 2000). > The sequence has been subjected to integrity checks for sequence > accuracy, presence of a polyA tail and contiguity within 100 kb in the > genome. Thus we believe the sequence to reflect accurately this > particular cDNA clone. However, there are artifacts associated with > the generation of cDNA clones that may have not been detected in our > initial analyses such as internal priming, priming from contaminating > genomic DNA, retained introns due to reverse transcription of > unspliced precursor RNAs, and reverse transcriptase errors that result > in single base changes. For further information about this sequence, > including its location and relationship to other sequences, please > visit our Web site (http://fruitfly.berkeley.edu) or send email to > [EMAIL PROTECTED]</INSDSeq_comment> > <INSDSeq_feature-table> > <INSDFeature> > <INSDFeature_key>source</INSDFeature_key> > <INSDFeature_location>1..1502</INSDFeature_location> > <INSDFeature_intervals> > <INSDInterval> > <INSDInterval_from>1</INSDInterval_from> > <INSDInterval_to>1502</INSDInterval_to> > <INSDInterval_accession>AY069118.1</INSDInterval_accession> > </INSDInterval> > </INSDFeature_intervals> > <INSDFeature_quals> > <INSDQualifier> > <INSDQualifier_name>organism</INSDQualifier_name> > <INSDQualifier_value>Drosophila melanogaster</INSDQualifier_value> > </INSDQualifier> > <INSDQualifier> > <INSDQualifier_name>mol_type</INSDQualifier_name> > <INSDQualifier_value>mRNA</INSDQualifier_value> > </INSDQualifier> > <INSDQualifier> > <INSDQualifier_name>strain</INSDQualifier_name> > <INSDQualifier_value>y; cn bw sp</INSDQualifier_value> > </INSDQualifier> > <INSDQualifier> > <INSDQualifier_name>db_xref</INSDQualifier_name> > <INSDQualifier_value>taxon:7227</INSDQualifier_value> > </INSDQualifier> > <INSDQualifier> > <INSDQualifier_name>map</INSDQualifier_name> > 
<INSDQualifier_value>39B3-39B3</INSDQualifier_value> > </INSDQualifier> > </INSDFeature_quals> > </INSDFeature> > <INSDFeature> > <INSDFeature_key>gene</INSDFeature_key> > <INSDFeature_location>1..1502</INSDFeature_location> > <INSDFeature_intervals> > <INSDInterval> > <INSDInterval_from>1</INSDInterval_from> > <INSDInterval_to>1502</INSDInterval_to> > <INSDInterval_accession>AY069118.1</INSDInterval_accession> > </INSDInterval> > </INSDFeature_intervals> > <INSDFeature_quals> > <INSDQualifier> > <INSDQualifier_name>gene</INSDQualifier_name> > <INSDQualifier_value>E2f2</INSDQualifier_value> > </INSDQualifier> > <INSDQualifier> > <INSDQualifier_name>note</INSDQualifier_name> > <INSDQualifier_value>alignment with genomic scaffold > AE003669</INSDQualifier_value> > </INSDQualifier> > <INSDQualifier> > <INSDQualifier_name>db_xref</INSDQualifier_name> > <INSDQualifier_value>FLYBASE:FBgn0024371</INSDQualifier_value> > </INSDQualifier> > </INSDFeature_quals> > </INSDFeature> > <INSDFeature> > <INSDFeature_key>CDS</INSDFeature_key> > <INSDFeature_location>189..1301</INSDFeature_location> > <INSDFeature_intervals> > <INSDInterval> > <INSDInterval_from>189</INSDInterval_from> > <INSDInterval_to>1301</INSDInterval_to> > <INSDInterval_accession>AY069118.1</INSDInterval_accession> > </INSDInterval> > </INSDFeature_intervals> > <INSDFeature_quals> > <INSDQualifier> > <INSDQualifier_name>gene</INSDQualifier_name> > <INSDQualifier_value>E2f2</INSDQualifier_value> > </INSDQualifier> > <INSDQualifier> > <INSDQualifier_name>note</INSDQualifier_name> > <INSDQualifier_value>Longest ORF</INSDQualifier_value> > </INSDQualifier> > <INSDQualifier> > <INSDQualifier_name>codon_start</INSDQualifier_name> > <INSDQualifier_value>1</INSDQualifier_value> > </INSDQualifier> > <INSDQualifier> > <INSDQualifier_name>transl_table</INSDQualifier_name> > <INSDQualifier_value>1</INSDQualifier_value> > </INSDQualifier> > <INSDQualifier> > <INSDQualifier_name>product</INSDQualifier_name> > 
<INSDQualifier_value>GH13089p</INSDQualifier_value> > </INSDQualifier> > <INSDQualifier> > <INSDQualifier_name>protein_id</INSDQualifier_name> > <INSDQualifier_value>AAL39263.1</INSDQualifier_value> > </INSDQualifier> > <INSDQualifier> > <INSDQualifier_name>db_xref</INSDQualifier_name> > <INSDQualifier_value>GI:17861572</INSDQualifier_value> > </INSDQualifier> > <INSDQualifier> > <INSDQualifier_name>db_xref</INSDQualifier_name> > <INSDQualifier_value>FLYBASE:FBgn0024371</INSDQualifier_value> > </INSDQualifier> > <INSDQualifier> > <INSDQualifier_name>translation</INSDQualifier_name> > > <INSDQualifier_value>MYKRKTASIVKRDSSAAGTTSSAMMMKVDSAETSVRSQSYESTPVSMDTSPDPPTPIKSPSNSQSQSQPGQQRSVGSLVLLTQKFVDLVKANEGSIDLKAATKILDVQKRRIYDITNVLEGIGLIDKGRHCSLVRWRGGGFNNAKDQENYDLARSRTNHLKMLEDDLDRQLEYAQRNLRYVMQDPSNRSYAYVTRDDLLDIFGDDSVFTIPNYDEEVDIKRNHYELAVSLDNGSAIDIRLVTNQGKSTTNPHDVDGFFDYHRLDTPSPSTSSHSSEDGNAPACAGNVITDEHGYSCNPGMKDEMKLLENELTAKIIFQNYLSGHSLRRFYPDDPNLENPPLLQLNPPQEDFNFALKSDEGICELFDVQCS</INSDQualifier_value> > </INSDQualifier> > </INSDFeature_quals> > </INSDFeature> > </INSDSeq_feature-table> > > 
<INSDSeq_sequence>AAGAATAGAGGGAGAATGAAAAAAATGACATAAATGGCGGAAAGCAAACCTAGCGCCAACATTCGTATTTTCGTTTAATTTTCGCTCCAAAGTGCAATTAATTCCGGCTTCTTGATCGCTGCATATTGAGTGCAGCCACGCAAAGAGTTACAAGGACAGGAGTATAGTCATCGAGTCGATTGCGGACCATGTACAAGCGCAAAACCGCGAGTATTGTTAAAAGAGACAGCTCCGCAGCGGGCACCACCTCCTCGGCTATGATGATGAAGGTGGATTCGGCTGAGACTTCGGTCCGGTCGCAGAGCTACGAGTCTACACCCGTTAGCATGGACACATCACCGGATCCTCCAACGCCAATCAAGTCTCCGTCGAATTCACAATCGCAATCGCAGCCTGGACAACAGCGCTCCGTGGGCTCACTGGTCCTGCTCACACAGAAGTTTGTGGATCTCGTGAAGGCCAACGAAGGATCCATCGACCTGAAAGCGGCAACCAAAATCTTGGACGTACAGAAGCGCCGAATATACGATATTACCAATGTTTTAGAGGGCATTGGACTAATTGATAAGGGCAGACACTGCTCCCTAGTGCGCTGGCGCGGAGGGGGCTTTAACAATGCCAAGGACCAAGAGAACTACGACCTGGCACGTAGCCGGACTAATCATTTGAAAATGTTGGAGGATGACCTAGACAGGCAACTGGAGTATGCACAGCGCAATCTGCGCTACGTTATGCAGGATCCCTCGAATAGGTCGTATGCATATGTGACACGTGATGATCTGCTGGACATCTTTGGAGATGATTCCGTATTCACAATACCTAATTATGACGAGGAAGTAGATATCAAGCGTAATCATTACGAGCTGGCCGTGTCGCTGGACAATGGCAGCGCAATTGACATTCGCCTGGTGACGAACCAAGGAAAGAGTACTACAAATCCGCACGATGTGGATGGGTTCTTTGACTA! TCACCGTCTGGACACGCCCTCACCCTCGACGTCGTCGCACTCCAGCGAGGATGGTAACGCTCCAGCATGCGCGGGGAACGTGATCACCGACGAGCACGGTTACTCGTGCAATCCCGGGATGAAAGATGAGATGAAACTTTTGGAGAACGAGCTGACGGCCAAGATAATCTTCCAAAATTATCTGTCCGGTCATTCGCTGCGGCGATTTTATCCCGATGATCCGAATCTAGAAAACCCGCCGCTGCTGCAGCTGAATCCTCCGCAGGAAGACTTCAACTTTGCGTTAAAAAGCGACGAAGGTATTTGCGAGCTGTTTGATGTTCAGTGCTCCTAACTGTGGAAGGGGATGTACACCTTAGGACTATAGCTACACTGCAACTGGCCGCGTGCATTGTGCAAATATTTATGATTAGTACAATTTTGACTTTGGATTTCTCTATATCGTCTAGAAATTTTTAATTAGTGTAATACCTTGTAATTTCGCAAATAACAGCAAAACCAATAAATTCGTAAATGCAAAAAAAAAAAAAAAAAA</INSDSeq_sequence> > </INSDSeq> > </INSDSet> > +++++++++++++++++++++ > On 6/6/06, Richard Holland <[EMAIL PROTECTED]> wrote: > ... > > For your second question, the tutorial makes the mistake in several > > places of saying getNoteSet(Terms.blahblah()). 
This was shorthand for:
> >
> > rs.getAnnotation().getProperty(Terms.blahblah())
> > (for single values)
> >
> > or
> >
> > ((RichAnnotation)rs.getAnnotation()).getProperties(Terms.blahblah())
> > (for multiple values)
> >
> > but never got expanded. Maybe someone can fix that one day... :)
> >
> > I'm just updating INSDseq to 1.4 now. The guys next door gave me the
> > details of the changes, and told me that 1.3 is actually no longer
> > supported by them after Friday this week! So I'll make it 1.4 only.
> >
> > cheers,
> > Richard
> >
> > On Tue, 2006-06-06 at 10:34 -0400, Seth Johnson wrote:
> > > I think it would be best to wait for the 'official response'. I could
> > > only locate the general changes detailed here:
> > >
> > > http://www.bio.net/bionet/mm/genbankb/2005-December/000233.html
> > >
> > > As for a solution to the ever-changing formats, I just don't see
> > > an elegant way. :( The only thing that comes to mind is creating a
> > > separate format "INSDseq14Format.java" and building new readers & writers
> > > on top of that.
> > >
> > > #1: On that note, I wanted to ask about the differences between the Genbank
> > > & INSDseq parsers and the ways to retrieve certain values. The tutorial
> > > states that those two formats are essentially mirror images of each
> > > other, with the latter being XML. When parsing Genbank files,
> > > "rs.getIdentifier()" retrieves the GI number; however, when the same
> > > function is used on a RichSequence obtained by parsing INSDseq format, I
> > > get a 'null' value. Moreover, I could not even locate that number
> > > during debugging in the structure of the RichSequence object. Is there a
> > > bug, or should the GI number be obtained differently?
> > >
> > > #2: Also, what is the best way to obtain the "mol_type" value from a
> > > RichSequence object? The tutorial states that it's
> > > "getNoteSet(Terms.getMolTypeTerm())".
I guess it's either a simplified
> > > explanation or something has changed, since .getNoteSet() does not take
> > > any parameters. I used
> > > "rs.getAnnotation().asMap().get(Terms.getMolTypeTerm())" and was
> > > wondering if that's how it was intended to be retrieved.
> > >
> > > As always, below is the INSDseq file I tried to parse:
> > > ================================
> > > <?xml version="1.0"?>
> > > <!DOCTYPE INSDSet PUBLIC "-//NCBI//INSD INSDSeq/EN"
> > > "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd">
> > > <INSDSet>
> > > <INSDSeq>
> > > <INSDSeq_locus>AY069118</INSDSeq_locus>
> > > <INSDSeq_length>1502</INSDSeq_length>
> > > <INSDSeq_strandedness>single</INSDSeq_strandedness>
> > > <INSDSeq_moltype>mRNA</INSDSeq_moltype>
> > > <INSDSeq_topology>linear</INSDSeq_topology>
> > > <INSDSeq_division>INV</INSDSeq_division>
> > > <INSDSeq_update-date>17-DEC-2001</INSDSeq_update-date>
> > > <INSDSeq_create-date>15-DEC-2001</INSDSeq_create-date>
> > > <INSDSeq_definition>Drosophila melanogaster GH13089 full length
> > > cDNA</INSDSeq_definition>
> > > <INSDSeq_primary-accession>AY069118</INSDSeq_primary-accession>
> > > <INSDSeq_accession-version>AY069118.1</INSDSeq_accession-version>
> > > <INSDSeq_other-seqids>
> > > <INSDSeqid>gb|AY069118.1|</INSDSeqid>
> > > <INSDSeqid>gi|17861571</INSDSeqid>
> > > </INSDSeq_other-seqids>
> > > <INSDSeq_keywords>
> > > <INSDKeyword>FLI_CDNA</INSDKeyword>
> > > </INSDSeq_keywords>
> > > <INSDSeq_source>Drosophila melanogaster (fruit fly)</INSDSeq_source>
> > > <INSDSeq_organism>Drosophila melanogaster</INSDSeq_organism>
> > > <INSDSeq_taxonomy>Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta;
> > > Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
> > > Ephydroidea; Drosophilidae; Drosophila</INSDSeq_taxonomy>
> > > <INSDSeq_references>
> > > <INSDReference>
> > > <INSDReference_reference>1 (bases 1 to 1502)</INSDReference_reference>
> > > <INSDReference_position>1..1502</INSDReference_position>
> > > <INSDReference_authors> > > > <INSDAuthor>Stapleton,M.</INSDAuthor> > > > <INSDAuthor>Brokstein,P.</INSDAuthor> > > > <INSDAuthor>Hong,L.</INSDAuthor> > > > <INSDAuthor>Agbayani,A.</INSDAuthor> > > > <INSDAuthor>Carlson,J.</INSDAuthor> > > > <INSDAuthor>Champe,M.</INSDAuthor> > > > <INSDAuthor>Chavez,C.</INSDAuthor> > > > <INSDAuthor>Dorsett,V.</INSDAuthor> > > > <INSDAuthor>Farfan,D.</INSDAuthor> > > > <INSDAuthor>Frise,E.</INSDAuthor> > > > <INSDAuthor>George,R.</INSDAuthor> > > > <INSDAuthor>Gonzalez,M.</INSDAuthor> > > > <INSDAuthor>Guarin,H.</INSDAuthor> > > > <INSDAuthor>Li,P.</INSDAuthor> > > > <INSDAuthor>Liao,G.</INSDAuthor> > > > <INSDAuthor>Miranda,A.</INSDAuthor> > > > <INSDAuthor>Mungall,C.J.</INSDAuthor> > > > <INSDAuthor>Nunoo,J.</INSDAuthor> > > > <INSDAuthor>Pacleb,J.</INSDAuthor> > > > <INSDAuthor>Paragas,V.</INSDAuthor> > > > <INSDAuthor>Park,S.</INSDAuthor> > > > <INSDAuthor>Phouanenavong,S.</INSDAuthor> > > > <INSDAuthor>Wan,K.</INSDAuthor> > > > <INSDAuthor>Yu,C.</INSDAuthor> > > > <INSDAuthor>Lewis,S.E.</INSDAuthor> > > > <INSDAuthor>Rubin,G.M.</INSDAuthor> > > > <INSDAuthor>Celniker,S.</INSDAuthor> > > > </INSDReference_authors> > > > <INSDReference_title>Direct Submission</INSDReference_title> > > > <INSDReference_journal>Submitted (10-DEC-2001) Berkeley > > > Drosophila Genome Project, Lawrence Berkeley National Laboratory, One > > > Cyclotron Road, Berkeley, CA 94720, USA</INSDReference_journal> > > > </INSDReference> > > > </INSDSeq_references> > > > <INSDSeq_comment>Sequence submitted by: Berkeley Drosophila Genome > > > Project Lawrence Berkeley National Laboratory Berkeley, CA 94720 This > > > clone was sequenced as part of a high-throughput process to sequence > > > clones from Drosophila Gene Collection 1 (Rubin et al., Science 2000). > > > The sequence has been subjected to integrity checks for sequence > > > accuracy, presence of a polyA tail and contiguity within 100 kb in the > > > genome. 
Thus we believe the sequence to reflect accurately this > > > particular cDNA clone. However, there are artifacts associated with > > > the generation of cDNA clones that may have not been detected in our > > > initial analyses such as internal priming, priming from contaminating > > > genomic DNA, retained introns due to reverse transcription of > > > unspliced precursor RNAs, and reverse transcriptase errors that result > > > in single base changes. For further information about this sequence, > > > including its location and relationship to other sequences, please > > > visit our Web site (http://fruitfly.berkeley.edu) or send email to > > > [EMAIL PROTECTED]</INSDSeq_comment> > > > <INSDSeq_feature-table> > > > <INSDFeature> > > > <INSDFeature_key>source</INSDFeature_key> > > > <INSDFeature_location>1..1502</INSDFeature_location> > > > <INSDFeature_intervals> > > > <INSDInterval> > > > <INSDInterval_from>1</INSDInterval_from> > > > <INSDInterval_to>1502</INSDInterval_to> > > > <INSDInterval_accession>AY069118.1</INSDInterval_accession> > > > </INSDInterval> > > > </INSDFeature_intervals> > > > <INSDFeature_quals> > > > <INSDQualifier> > > > <INSDQualifier_name>organism</INSDQualifier_name> > > > <INSDQualifier_value>Drosophila > > > melanogaster</INSDQualifier_value> > > > </INSDQualifier> > > > <INSDQualifier> > > > <INSDQualifier_name>mol_type</INSDQualifier_name> > > > <INSDQualifier_value>mRNA</INSDQualifier_value> > > > </INSDQualifier> > > > <INSDQualifier> > > > <INSDQualifier_name>strain</INSDQualifier_name> > > > <INSDQualifier_value>y; cn bw sp</INSDQualifier_value> > > > </INSDQualifier> > > > <INSDQualifier> > > > <INSDQualifier_name>db_xref</INSDQualifier_name> > > > <INSDQualifier_value>taxon:7227</INSDQualifier_value> > > > </INSDQualifier> > > > <INSDQualifier> > > > <INSDQualifier_name>map</INSDQualifier_name> > > > <INSDQualifier_value>39B3-39B3</INSDQualifier_value> > > > </INSDQualifier> > > > </INSDFeature_quals> > > > </INSDFeature> > > > 
<INSDFeature> > > > <INSDFeature_key>gene</INSDFeature_key> > > > <INSDFeature_location>1..1502</INSDFeature_location> > > > <INSDFeature_intervals> > > > <INSDInterval> > > > <INSDInterval_from>1</INSDInterval_from> > > > <INSDInterval_to>1502</INSDInterval_to> > > > <INSDInterval_accession>AY069118.1</INSDInterval_accession> > > > </INSDInterval> > > > </INSDFeature_intervals> > > > <INSDFeature_quals> > > > <INSDQualifier> > > > <INSDQualifier_name>gene</INSDQualifier_name> > > > <INSDQualifier_value>E2f2</INSDQualifier_value> > > > </INSDQualifier> > > > <INSDQualifier> > > > <INSDQualifier_name>note</INSDQualifier_name> > > > <INSDQualifier_value>alignment with genomic scaffold > > > AE003669</INSDQualifier_value> > > > </INSDQualifier> > > > <INSDQualifier> > > > <INSDQualifier_name>db_xref</INSDQualifier_name> > > > <INSDQualifier_value>FLYBASE:FBgn0024371</INSDQualifier_value> > > > </INSDQualifier> > > > </INSDFeature_quals> > > > </INSDFeature> > > > <INSDFeature> > > > <INSDFeature_key>CDS</INSDFeature_key> > > > <INSDFeature_location>189..1301</INSDFeature_location> > > > <INSDFeature_intervals> > > > <INSDInterval> > > > <INSDInterval_from>189</INSDInterval_from> > > > <INSDInterval_to>1301</INSDInterval_to> > > > <INSDInterval_accession>AY069118.1</INSDInterval_accession> > > > </INSDInterval> > > > </INSDFeature_intervals> > > > <INSDFeature_quals> > > > <INSDQualifier> > > > <INSDQualifier_name>gene</INSDQualifier_name> > > > <INSDQualifier_value>E2f2</INSDQualifier_value> > > > </INSDQualifier> > > > <INSDQualifier> > > > <INSDQualifier_name>note</INSDQualifier_name> > > > <INSDQualifier_value>Longest ORF</INSDQualifier_value> > > > </INSDQualifier> > > > <INSDQualifier> > > > <INSDQualifier_name>codon_start</INSDQualifier_name> > > > <INSDQualifier_value>1</INSDQualifier_value> > > > </INSDQualifier> > > > <INSDQualifier> > > > <INSDQualifier_name>transl_table</INSDQualifier_name> > > > <INSDQualifier_value>1</INSDQualifier_value> > > > 
</INSDQualifier> > > > <INSDQualifier> > > > <INSDQualifier_name>product</INSDQualifier_name> > > > <INSDQualifier_value>GH13089p</INSDQualifier_value> > > > </INSDQualifier> > > > <INSDQualifier> > > > <INSDQualifier_name>protein_id</INSDQualifier_name> > > > <INSDQualifier_value>AAL39263.1</INSDQualifier_value> > > > </INSDQualifier> > > > <INSDQualifier> > > > <INSDQualifier_name>db_xref</INSDQualifier_name> > > > <INSDQualifier_value>GI:17861572</INSDQualifier_value> > > > </INSDQualifier> > > > <INSDQualifier> > > > <INSDQualifier_name>db_xref</INSDQualifier_name> > > > <INSDQualifier_value>FLYBASE:FBgn0024371</INSDQualifier_value> > > > </INSDQualifier> > > > <INSDQualifier> > > > <INSDQualifier_name>translation</INSDQualifier_name> > > > > > > <INSDQualifier_value>MYKRKTASIVKRDSSAAGTTSSAMMMKVDSAETSVRSQSYESTPVSMDTSPDPPTPIKSPSNSQSQSQPGQQRSVGSLVLLTQKFVDLVKANEGSIDLKAATKILDVQKRRIYDITNVLEGIGLIDKGRHCSLVRWRGGGFNNAKDQENYDLARSRTNHLKMLEDDLDRQLEYAQRNLRYVMQDPSNRSYAYVTRDDLLDIFGDDSVFTIPNYDEEVDIKRNHYELAVSLDNGSAIDIRLVTNQGKSTTNPHDVDGFFDYHRLDTPSPSTSSHSSEDGNAPACAGNVITDEHGYSCNPGMKDEMKLLENELTAKIIFQNYLSGHSLRRFYPDDPNLENPPLLQLNPPQEDFNFALKSDEGICELFDVQCS</INSDQualifier_value> > > > </INSDQualifier> > > > </INSDFeature_quals> > > > </INSDFeature> > > > </INSDSeq_feature-table> > > > > > > 
<INSDSeq_sequence>AAGAATAGAGGGAGAATGAAAAAAATGACATAAATGGCGGAAAGCAAACCTAGCGCCAACATTCGTATTTTCGTTTAATTTTCGCTCCAAAGTGCAATTAATTCCGGCTTCTTGATCGCTGCATATTGAGTGCAGCCACGCAAAGAGTTACAAGGACAGGAGTATAGTCATCGAGTCGATTGCGGACCATGTACAAGCGCAAAACCGCGAGTATTGTTAAAAGAGACAGCTCCGCAGCGGGCACCACCTCCTCGGCTATGATGATGAAGGTGGATTCGGCTGAGACTTCGGTCCGGTCGCAGAGCTACGAGTCTACACCCGTTAGCATGGACACATCACCGGATCCTCCAACGCCAATCAAGTCTCCGTCGAATTCACAATCGCAATCGCAGCCTGGACAACAGCGCTCCGTGGGCTCACTGGTCCTGCTCACACAGAAGTTTGTGGATCTCGTGAAGGCCAACGAAGGATCCATCGACCTGAAAGCGGCAACCAAAATCTTGGACGTACAGAAGCGCCGAATATACGATATTACCAATGTTTTAGAGGGCATTGGACTAATTGATAAGGGCAGACACTGCTCCCTAGTGCGCTGGCGCGGAGGGGGCTTTAACAATGCCAAGGACCAAGAGAACTACGACCTGGCACGTAGCCGGACTAATCATTTGAAAATGTTGGAGGATGACCTAGACAGGCAACTGGAGTATGCACAGCGCAATCTGCGCTACGTTATGCAGGATCCCTCGAATAGGTCGTATGCATATGTGACACGTGATGATCTGCTGGACATCTTTGGAGATGATTCCGTATTCACAATACCTAATTATGACGAGGAAGTAGATATCAAGCGTAATCATTACGAGCTGGCCGTGTCGCTGGACAATGGCAGCGCAATTGACATTCGCCTGGTGACGAACCAAGGAAAGAGTACTACAAATCCGCACGATGTGGATGGGTTCTTTG! ACTATCACCGTCTGGACACGCCCTCACCCTCGACGTCGTCGCACTCCAGCGAGGATGGTAACGCTCCAGCATGCGCGGGGAACGTGATCACCGACGAGCACGGTTACTCGTGCAATCCCGGGATGAAAGATGAGATGAAACTTTTGGAGAACGAGCTGACGGCCAAGATAATCTTCCAAAATTATCTGTCCGGTCATTCGCTGCGGCGATTTTATCCCGATGATCCGAATCTAGAAAACCCGCCGCTGCTGCAGCTGAATCCTCCGCAGGAAGACTTCAACTTTGCGTTAAAAAGCGACGAAGGTATTTGCGAGCTGTTTGATGTTCAGTGCTCCTAACTGTGGAAGGGGATGTACACCTTAGGACTATAGCTACACTGCAACTGGCCGCGTGCATTGTGCAAATATTTATGATTAGTACAATTTTGACTTTGGATTTCTCTATATCGTCTAGAAATTTTTAATTAGTGTAATACCTTGTAATTTCGCAAATAACAGCAAAACCAATAAATTCGTAAATGCAAAAAAAAAAAAAAAAAA</INSDSeq_sequence> > > > </INSDSeq> > > > </INSDSet> > > > ================================ > > > On 6/6/06, Richard Holland <[EMAIL PROTECTED]> wrote: > > > > I can't find any document detailing the differences between INSDseq XML > > > > versions 1.3 and 1.4, so I've asked the guys over in the data library > > > > section here to see if they have one or can produce one for me. They > > > > wrote it so they should know! 
> > > >
> > > > Once I have this I'll get the INSDseq parser up-to-date. (I could go
> > > > through the DTDs by hand and work it all out manually, but that would
> > > > take rather longer than I've got time for at the moment!)
> > > >
> > > > It's a bit of a pain trying to keep the parsers up-to-date all the time,
> > > > especially when people start wanting backwards-compatibility. Does
> > > > anyone have any bright ideas as to how to manage version changes in file
> > > > formats?
> > > >
> > > > cheers,
> > > > Richard
> > > >
> > > > On Mon, 2006-06-05 at 12:28 -0400, Seth Johnson wrote:
> > > > > I agree with you on that one. However, the problem might be a little
> > > > > deeper. The same '?' characters appear in the INSDseq format, bounded by
> > > > > <INSDReference_reference> tags, and cause the following exception.
> > > > > This tells me that the '?' are actually values that are being
> > > > > incorrectly parsed. Further examination of the .dtd reveals that
> > > > > INSDseqFormat.java is tailored towards INSDSeq v. 1.3, whereas the
> > > > > files I obtain are in INSDSeq v. 1.4 (which, among other things,
> > > > > contains a new tag <INSDReference_position>). Here are links to both
> > > > > .dtd's:
> > > > >
> > > > > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDSeq_v1.3.dtd.txt
> > > > >
> > > > > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDC_V1.4.dtd.txt
> > > > >
> > > > > I think it might be worth accommodating the changes for the INSDseq
> > > > > format; I'm not sure how that would affect the '?' in Genbank.
> > > > >
> > > > > Seth
> > > > >
> > > > > ======================
> > > > > org.biojava.bio.BioException: Could not read sequence
> > > > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > >     at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
> > > > > Caused by: org.biojava.bio.seq.io.ParseException:
> > > > > org.biojava.bio.seq.io.ParseException: Bad reference line found: ?
> > > > > at > > > > > org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:250) > > > > > at > > > > > org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109) > > > > > ... 1 more > > > > > Caused by: org.biojava.bio.seq.io.ParseException: Bad reference line > > > > > found: ? > > > > > at > > > > > org.biojavax.bio.seq.io.INSDseqFormat$INSDseqHandler.endElement(INSDseqFormat.java:901) > > > > > at > > > > > com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633) > > > > > at > > > > > com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241) > > > > > at > > > > > com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685) > > > > > at > > > > > com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368) > > > > > at > > > > > com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834) > > > > > at > > > > > com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764) > > > > > at > > > > > com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148) > > > > > at > > > > > com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242) > > > > > at javax.xml.parsers.SAXParser.parse(SAXParser.java:375) > > > > > at org.biojavax.utils.XMLTools.readXMLChunk(XMLTools.java:97) > > > > > at > > > > > org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:246) > > > > > ... 
2 more > > > > > Java Result: -1 > > > > > ====================== > > > > > > > > > > ~~~~~~~~~~~~~~~~~~~~~~ > > > > > <INSDSeq_references> > > > > > <INSDReference> > > > > > <INSDReference_reference>?</INSDReference_reference> > > > > > <INSDReference_position>1..16732</INSDReference_position> > > > > > <INSDReference_authors> > > > > > <INSDAuthor>Bjornerfeldt,S.</INSDAuthor> > > > > > <INSDAuthor>Webster,M.T.</INSDAuthor> > > > > > <INSDAuthor>Vila,C.</INSDAuthor> > > > > > </INSDReference_authors> > > > > > <INSDReference_title>Relaxation of Selective Constraint on Dog > > > > > Mitochondrial DNA Following Domestication</INSDReference_title> > > > > > <INSDReference_journal>Unpublished</INSDReference_journal> > > > > > </INSDReference> > > > > > <INSDReference> > > > > > <INSDReference_reference>?</INSDReference_reference> > > > > > <INSDReference_position>1..16732</INSDReference_position> > > > > > <INSDReference_authors> > > > > > <INSDAuthor>Bjornerfeldt,S.</INSDAuthor> > > > > > <INSDAuthor>Webster,M.T.</INSDAuthor> > > > > > <INSDAuthor>Vila,C.</INSDAuthor> > > > > > </INSDReference_authors> > > > > > <INSDReference_journal>Submitted (06-APR-2006) to the > > > > > EMBL/GenBank/DDBJ databases. Evolutionary Biology, Evolutionary > > > > > Biology, Norbyvagen 18D, Uppsala 752 36, > > > > > Sweden</INSDReference_journal> > > > > > </INSDReference> > > > > > </INSDSeq_references> > > > > > ~~~~~~~~~~~~~~~~~~~~~~ > > > > > > > > > > On 6/5/06, Richard Holland <[EMAIL PROTECTED]> wrote: > > > > > > Hmmm... interesting. I _could_ put in a special case that ignores > > > > > > the > > > > > > question marks, but that wouldn't be 'nice' really - this is more > > > > > > of a > > > > > > problem with the program that is producing the Genbank files than a > > > > > > problem with the parser trying to read them. '?' 
is not a valid tag in
> > > > > > the official Genbank format, and has no meaning attached to it that I
> > > > > > can work out, so I'm reluctant to make the parser recognise it.
> > > > > >
> > > > > > I'd suggest you contact the people who write the software you are using
> > > > > > to produce the Genbank files and ask them if they could stick to the
> > > > > > rules!
> > > > > >
> > > > > > In the meantime you could work around the problem by stripping the
> > > > > > question marks in some kind of pre-processor before passing the file on to
> > > > > > BioJavaX for parsing.
> > > > > >
> > > > > > cheers,
> > > > > > Richard
> > > > > >
> > > > > > On Mon, 2006-06-05 at 11:39 -0400, Seth Johnson wrote:
> > > > > > > Removing '?' (or several of them in my case) avoids the following
> > > > > > > exception:
> > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > > > org.biojava.bio.BioException: Could not read sequence
> > > > > > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > > > >     at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
> > > > > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ415957
> > > > > > >     at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
> > > > > > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > > > > > ... 1 more
> > > > > > > Java Result: -1
> > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > > > I don't know where that previous tokenization problem came from, since
> > > > > > > I can no longer reproduce it. This time it's more or less
> > > > > > > straightforward.
> > > > > > > Here's the original file with question marks:
> > > > > > > ============================
> > > > > > > LOCUS       DQ415957     1437 bp    mRNA    linear   VRT 01-JUN-2006
> > > > > > > DEFINITION  Danio rerio capillary morphogenesis protein 2A (cmg2a) mRNA,
> > > > > > >             complete cds.
> > > > > > > ACCESSION   DQ415957
> > > > > > > VERSION     DQ415957.1  GI:89513612
> > > > > > > KEYWORDS    .
> > > > > > > SOURCE      Unknown.
> > > > > > >   ORGANISM  Unknown.
> > > > > > >             Unclassified.
> > > > > > > ?
> > > > > > > ?
> > > > > > > FEATURES             Location/Qualifiers
> > > > > > > ?
> > > > > > >      gene            1..1437
> > > > > > >                      /gene="cmg2a"
> > > > > > >      CDS             1..1437
> > > > > > >                      /gene="cmg2a"
> > > > > > >                      /note="cell surface receptor; similar to anthrax toxin
> > > > > > >                      receptor 2 (ANTXR2, ATR2, CMG2)"
> > > > > > >                      /codon_start=1
> > > > > > >                      /product="capillary morphogenesis protein 2A"
> > > > > > >                      /protein_id="ABD74633.1"
> > > > > > >                      /db_xref="GI:89513613"
> > > > > > >                      /translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS
> > > > > > >                      GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS
> > > > > > >                      EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY
> > > > > > >                      GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS
> > > > > > >                      SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT
> > > > > > >                      VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC
> > > > > > >                      CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS
> > > > > > >                      TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL
> > > > > > >                      RRQYDRVSVMRPTSADKGRCMNFSRTQH"
> > > > > > > ORIGIN
> > > > > > >         1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt ctgtttatgc
> > > > > > >        61 ttttcatctt ttaaagcgga aaccccatct tgtcatggtg cctacgacct gtactttgtg
> > > > > > >       121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt tgtcaaaaat
> > > > > > >       181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt ttcatcaaga
> > > > > > >       241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg cctgaagacc
> > > > > > >       301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa attggcaact
> > > > > > >       361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt gactgatgga
> > > > > > >       421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc aaggaagtat
> > > > > > >       481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct agccgatgtg
> > > > > > >       541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct caaaggcatc
> > > > > > >       601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc gtccagcgtc
> > > > > > >       661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt ggggagacaa
> > > > > > >       721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca aaaaccaacc
> > > > > > >       781 aaagtgaaga ttgactacat cctatgtcct gctccagtgc tgtatacagt tggacagcaa
> > > > > > >       841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc tttcatcatc
> > > > > > >       901 actgcctctt catgttcgga cggcacagtg gtggccattg tgttcttggt gctttttctc
> > > > > > >       961 ctgttggctt tggctctgat gtggtggttc tggcctctat gctgcactgt cgttattaaa
> > > > > > >      1021 gacccacctc cacaaagacc tcctccacct ccacctaagc tagagccaga cccggaaccc
> > > > > > >      1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc tggtggaatc
> > > > > > >      1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc aagactagag
> > > > > > >      1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat ggtcaaaaag
> > > > > > >      1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac accaatcaga
> > > > > > >      1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt ttcagttatg
> > > > > > >      1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca gcattaa
> > > > > > > //
> > > > > > > ============================
> > > > > > >
> > > > > > > On 6/5/06, Richard Holland <[EMAIL PROTECTED]> wrote:
> > > > > > > > Hi again.
> > > > > > > >
> > > > > > > > Could you remove the offending question mark from the GenBank file and
> > > > > > > > try it again to see if that fixes it? The parser should just ignore it
> > > > > > > > but apparently not. The error looks weird to me because the tokenization
> > > > > > > > for a DNA GenBank file _does_ contain the letter 't'! Not sure what's
> > > > > > > > going on here.
> > > > > > > ...
> > > > > > > >
> > > > > > > > cheers,
> > > > > > > > Richard
> > > > > > > >
> > > > > > > > On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote:
> > > > > > > > > Hello again Richard,
> > > > > > > > >
> > > > > > > > > No sooner had I reported that the last parsing exception was fixed
> > > > > > > > > than another one came up with the Genbank format:
> > > > > > > > > --------------------------------------
> > > > > > > > > org.biojava.bio.seq.io.ParseException: DQ431065
> > > > > > > > > org.biojava.bio.BioException: Could not read sequence
> > > > > > > > >   at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > > > > > >   at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151)
> > > > > > > > >   at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246)
> > > > > > > > >   at exonhit.parsers.GenBankParser.main(GenBankParser.java:326)
> > > > > > > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065
> > > > > > > > >   at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
> > > > > > > > >   at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > > > > > > >   ... 3 more
> > > > > > > > > org.biojava.bio.seq.io.ParseException:
> > > > > > > > > org.biojava.bio.symbol.IllegalSymbolException: This tokenization
> > > > > > > > > doesn't contain character: 't'
> > > > > > > > > ----------------------------------------
> > > > > > > > > The Genbank file that caused it is as follows:
> > > > > > > > > =========================================
> > > > > > > > > LOCUS       DQ431065      425 bp    DNA     linear   INV 01-JUN-2006
> > > > > > > > > DEFINITION  Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial
> > > > > > > > >             sequence; mitochondrial.
> > > > > > > > > ACCESSION   DQ431065
> > > > > > > > > VERSION     DQ431065.1  GI:90102206
> > > > > > > > > KEYWORDS    .
> > > > > > > > > SOURCE      Vaccinium corymbosum
> > > > > > > > >   ORGANISM  Vaccinium corymbosum
> > > > > > > > >             Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
> > > > > > > > >             Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
> > > > > > > > >             asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae;
> > > > > > > > >             Vaccinium.
> > > > > > > > > ?
> > > > > > > > > REFERENCE   2  (bases 1 to 425)
> > > > > > > > >   AUTHORS   Naik,L.D. and Rowland,L.J.
> > > > > > > > >   TITLE     Expressed Sequence Tags of cDNA clones from subtracted library of
> > > > > > > > >             Vaccinium corymbosum
> > > > > > > > >   JOURNAL   Unpublished (2005)
> > > > > > > > > FEATURES             Location/Qualifiers
> > > > > > > > >      source          1..425
> > > > > > > > >                      /organism="Vaccinium corymbosum"
> > > > > > > > >                      /mol_type="genomic DNA"
> > > > > > > > >                      /cultivar="Bluecrop"
> > > > > > > > >                      /db_xref="taxon:69266"
> > > > > > > > >                      /tissue_type="Flower buds"
> > > > > > > > >                      /clone_lib="Subtracted cDNA library of Vaccinium
> > > > > > > > >                      corymbosum"
> > > > > > > > >                      /dev_stage="399 hour chill unit exposure"
> > > > > > > > >                      /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I"
> > > > > > > > >      rRNA            <1..>425
> > > > > > > > >                      /product="16S ribosomal RNA"
> > > > > > > > > ORIGIN
> > > > > > > > >         1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac
> > > > > > > > >        61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt
> > > > > > > > >       121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt
> > > > > > > > >       181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta
> > > > > > > > >       241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat
> > > > > > > > >       301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta
> > > > > > > > >       361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag
> > > > > > > > >       421 cgtaa
> > > > > > > > > //
> > > > > > > > > ==================================
> > > > > > > > > I think it's the presence of the '?' at the beginning of the
> > > > > > > > > line?!?!
> > > > > > > > > I'm not sure whether the information that was supposed to be present
> > > > > > > > > instead of those question marks is absent from the original ASN.1
> > > > > > > > > batch file or whether it's a bug in the NCBI ASN2GO software. It
> > > > > > > > > looks to me as though the former is the case, since the file from
> > > > > > > > > the NCBI website contains much more information than the batch file.
> > > > > > > > > Just bringing this to everyone's attention.
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best Regards,
> > > > > > > > >
> > > > > > > > > Seth Johnson
> > > > > > > > > Senior Bioinformatics Associate
> > > > > > > > >
> > > > > > > > > Ph: (202) 470-0900
> > > > > > > > > Fx: (775) 251-0358
> > > > > > > > >
> > > > > > > > > On 6/2/06, Richard Holland <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > Hi Seth.
> > > > > > > > > >
> > > > > > > > > > Your second point, about the authors string not being read correctly
> > > > > > > > > > in Genbank format, has been fixed (or should have been if I got the
> > > > > > > > > > code right!). Could you check the latest version of biojava-live out
> > > > > > > > > > of CVS and give it another go? Basically the parser did not recognise
> > > > > > > > > > the CONSRTM tag, as it is not mentioned in the sample record provided
> > > > > > > > > > by NCBI, which is what I based the parser on.
> > > > > > > > > ...
> > > > > > > > > >
> > > > > > > > > > cheers,
> > > > > > > > > > Richard

--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416
_______________________________________________
Biojava-l mailing list - [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l
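
The pre-processing workaround suggested in the thread (strip the stray question marks before the file reaches BioJavaX) could look roughly like the sketch below. It is only an illustration: the class name QuestionMarkFilter and the method strip are made up for this example and are not part of BioJava or BioJavaX, and it assumes the offending '?' always sits alone on its own line, as in the two records quoted above.

```java
/**
 * Hypothetical helper, not part of BioJava/BioJavaX.
 * Assumption: the stray '?' characters always occupy a line of their own.
 */
class QuestionMarkFilter {

    /**
     * Returns the given GenBank text with every line that contains only
     * question marks (ignoring surrounding whitespace) removed.
     * All other lines are kept unchanged and terminated with '\n'.
     */
    static String strip(String genbankText) {
        StringBuilder cleaned = new StringBuilder();
        // Accept both Unix and Windows line endings in the input.
        for (String line : genbankText.split("\r?\n")) {
            if (line.trim().matches("\\?+")) {
                continue; // drop the bogus '?' line
            }
            cleaned.append(line).append('\n');
        }
        return cleaned.toString();
    }
}
```

The cleaned text can then be wrapped in a BufferedReader over a StringReader and handed to whatever parsing code is already in use. Question marks inside real content, such as a qualifier value, are left alone because only whole-line matches are dropped.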
