Re: [Biojava-l] Parsing INSDseq Sequences (1.3 & 1.4)

Richard Holland Wed, 07 Jun 2006 08:53:35 -0700

Hmm. The exception is telling the truth - the moltype note is stored with
a non-zero rank, so a lookup that asks for rank 0 doesn't find it.


I've modified SimpleRichAnnotation in CVS so that getProperty() returns the
first value found regardless of rank.

Try it now.
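
To illustrate the rank issue, here is a minimal sketch in plain Java. It is
not the BioJavaX source: the RankedNotes class, its Note type and both lookup
methods are hypothetical names made up for this example, and the rank value 5
is arbitrary. It only shows why a lookup tied to rank 0 misses a note stored
under another rank, while a first-match lookup finds it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a ranked annotation; not a BioJavaX class.
class RankedNotes {
    /** One annotation entry: a term name, a rank, and a value. */
    static final class Note {
        final String term;
        final int rank;
        final String value;

        Note(String term, int rank, String value) {
            this.term = term;
            this.rank = rank;
            this.value = value;
        }
    }

    private final List<Note> notes = new ArrayList<>();

    void add(String term, int rank, String value) {
        notes.add(new Note(term, rank, value));
    }

    /** Strict lookup: only a note with exactly this term and rank matches. */
    String getStrict(String term, int rank) {
        for (Note n : notes) {
            if (n.term.equals(term) && n.rank == rank) {
                return n.value;
            }
        }
        throw new NoSuchElementException("No such property: " + term + ", rank " + rank);
    }

    /** Lenient lookup: the first note with this term wins, whatever its rank. */
    String getFirst(String term) {
        for (Note n : notes) {
            if (n.term.equals(term)) {
                return n.value;
            }
        }
        throw new NoSuchElementException("No such property: " + term);
    }

    public static void main(String[] args) {
        RankedNotes notes = new RankedNotes();
        notes.add("biojavax:moltype", 5, "mRNA"); // 5 is an arbitrary non-zero rank

        System.out.println(notes.getFirst("biojavax:moltype")); // found despite the rank
        try {
            notes.getStrict("biojavax:moltype", 0);
        } catch (NoSuchElementException e) {
            System.out.println(e.getMessage()); // same shape as the reported error
        }
    }
}
```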

cheers,
Richard

On Wed, 2006-06-07 at 11:36 -0400, Seth Johnson wrote:
> Hello again Richard,
> 
> Thank you for updating the INSDseqFormat to 1.4 so promptly.  Another
> reason I asked about accessing different terms is this code:
> 
> rs.getAnnotation().getProperty(Terms.getMolTypeTerm())
> 
> When it is executed after parsing the INSDseq file, it produces
> the following exception:
> ~~~~~~~~~~~~~~~~~~~~~
> Exception in thread "main" java.util.NoSuchElementException: No such property: biojavax:moltype, rank 0
>         at org.biojavax.SimpleRichAnnotation.getNote(SimpleRichAnnotation.java:137)
>         at org.biojavax.SimpleRichAnnotation.getProperty(SimpleRichAnnotation.java:147)
>         at exonhit.parsers.GenBankParser.main(GenBankParser.java:370)
> ~~~~~~~~~~~~~~~~~~~~~
> The file that I'm parsing is as follows and does contain the 'moltype':
> +++++++++++++++++++++
> <?xml version="1.0"?>
> <!DOCTYPE INSDSet PUBLIC "-//NCBI//INSD INSDSeq/EN"
> "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd">
> <INSDSet>
> <INSDSeq>
>   <INSDSeq_locus>AY069118</INSDSeq_locus>
>   <INSDSeq_length>1502</INSDSeq_length>
>   <INSDSeq_strandedness>single</INSDSeq_strandedness>
>   <INSDSeq_moltype>mRNA</INSDSeq_moltype>
>   <INSDSeq_topology>linear</INSDSeq_topology>
>   <INSDSeq_division>INV</INSDSeq_division>
>   <INSDSeq_update-date>17-DEC-2001</INSDSeq_update-date>
>   <INSDSeq_create-date>15-DEC-2001</INSDSeq_create-date>
>   <INSDSeq_definition>Drosophila melanogaster GH13089 full length
> cDNA</INSDSeq_definition>
>   <INSDSeq_primary-accession>AY069118</INSDSeq_primary-accession>
>   <INSDSeq_accession-version>AY069118.1</INSDSeq_accession-version>
>   <INSDSeq_other-seqids>
>     <INSDSeqid>gb|AY069118.1|</INSDSeqid>
>     <INSDSeqid>gi|17861571</INSDSeqid>
>   </INSDSeq_other-seqids>
>   <INSDSeq_keywords>
>     <INSDKeyword>FLI_CDNA</INSDKeyword>
>   </INSDSeq_keywords>
>   <INSDSeq_source>Drosophila melanogaster (fruit fly)</INSDSeq_source>
>   <INSDSeq_organism>Drosophila melanogaster</INSDSeq_organism>
>   <INSDSeq_taxonomy>Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta;
> Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
> Ephydroidea; Drosophilidae; Drosophila</INSDSeq_taxonomy>
>   <INSDSeq_references>
>     <INSDReference>
>       <INSDReference_reference>1 (bases 1 to 1502)</INSDReference_reference>
>       <INSDReference_position>1..1502</INSDReference_position>
>       <INSDReference_authors>
>         <INSDAuthor>Stapleton,M.</INSDAuthor>
>         <INSDAuthor>Brokstein,P.</INSDAuthor>
>         <INSDAuthor>Hong,L.</INSDAuthor>
>         <INSDAuthor>Agbayani,A.</INSDAuthor>
>         <INSDAuthor>Carlson,J.</INSDAuthor>
>         <INSDAuthor>Champe,M.</INSDAuthor>
>         <INSDAuthor>Chavez,C.</INSDAuthor>
>         <INSDAuthor>Dorsett,V.</INSDAuthor>
>         <INSDAuthor>Farfan,D.</INSDAuthor>
>         <INSDAuthor>Frise,E.</INSDAuthor>
>         <INSDAuthor>George,R.</INSDAuthor>
>         <INSDAuthor>Gonzalez,M.</INSDAuthor>
>         <INSDAuthor>Guarin,H.</INSDAuthor>
>         <INSDAuthor>Li,P.</INSDAuthor>
>         <INSDAuthor>Liao,G.</INSDAuthor>
>         <INSDAuthor>Miranda,A.</INSDAuthor>
>         <INSDAuthor>Mungall,C.J.</INSDAuthor>
>         <INSDAuthor>Nunoo,J.</INSDAuthor>
>         <INSDAuthor>Pacleb,J.</INSDAuthor>
>         <INSDAuthor>Paragas,V.</INSDAuthor>
>         <INSDAuthor>Park,S.</INSDAuthor>
>         <INSDAuthor>Phouanenavong,S.</INSDAuthor>
>         <INSDAuthor>Wan,K.</INSDAuthor>
>         <INSDAuthor>Yu,C.</INSDAuthor>
>         <INSDAuthor>Lewis,S.E.</INSDAuthor>
>         <INSDAuthor>Rubin,G.M.</INSDAuthor>
>         <INSDAuthor>Celniker,S.</INSDAuthor>
>       </INSDReference_authors>
>       <INSDReference_title>Direct Submission</INSDReference_title>
>       <INSDReference_journal>Submitted (10-DEC-2001) Berkeley
> Drosophila Genome Project, Lawrence Berkeley National Laboratory, One
> Cyclotron Road, Berkeley, CA 94720, USA</INSDReference_journal>
>     </INSDReference>
>   </INSDSeq_references>
>   <INSDSeq_comment>Sequence submitted by: Berkeley Drosophila Genome
> Project Lawrence Berkeley National Laboratory Berkeley, CA 94720 This
> clone was sequenced as part of a high-throughput process to sequence
> clones from Drosophila Gene Collection 1 (Rubin et al., Science 2000).
> The sequence has been subjected to integrity checks for sequence
> accuracy, presence of a polyA tail and contiguity within 100 kb in the
> genome. Thus we believe the sequence to reflect accurately this
> particular cDNA clone. However, there are artifacts associated with
> the generation of cDNA clones that may have not been detected in our
> initial analyses such as internal priming, priming from contaminating
> genomic DNA, retained introns due to reverse transcription of
> unspliced precursor RNAs, and reverse transcriptase errors that result
> in single base changes. For further information about this sequence,
> including its location and relationship to other sequences, please
> visit our Web site (http://fruitfly.berkeley.edu) or send email to
> [EMAIL PROTECTED]</INSDSeq_comment>
>   <INSDSeq_feature-table>
>     <INSDFeature>
>       <INSDFeature_key>source</INSDFeature_key>
>       <INSDFeature_location>1..1502</INSDFeature_location>
>       <INSDFeature_intervals>
>         <INSDInterval>
>           <INSDInterval_from>1</INSDInterval_from>
>           <INSDInterval_to>1502</INSDInterval_to>
>           <INSDInterval_accession>AY069118.1</INSDInterval_accession>
>         </INSDInterval>
>       </INSDFeature_intervals>
>       <INSDFeature_quals>
>         <INSDQualifier>
>           <INSDQualifier_name>organism</INSDQualifier_name>
>           <INSDQualifier_value>Drosophila melanogaster</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>mol_type</INSDQualifier_name>
>           <INSDQualifier_value>mRNA</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>strain</INSDQualifier_name>
>           <INSDQualifier_value>y; cn bw sp</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>db_xref</INSDQualifier_name>
>           <INSDQualifier_value>taxon:7227</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>map</INSDQualifier_name>
>           <INSDQualifier_value>39B3-39B3</INSDQualifier_value>
>         </INSDQualifier>
>       </INSDFeature_quals>
>     </INSDFeature>
>     <INSDFeature>
>       <INSDFeature_key>gene</INSDFeature_key>
>       <INSDFeature_location>1..1502</INSDFeature_location>
>       <INSDFeature_intervals>
>         <INSDInterval>
>           <INSDInterval_from>1</INSDInterval_from>
>           <INSDInterval_to>1502</INSDInterval_to>
>           <INSDInterval_accession>AY069118.1</INSDInterval_accession>
>         </INSDInterval>
>       </INSDFeature_intervals>
>       <INSDFeature_quals>
>         <INSDQualifier>
>           <INSDQualifier_name>gene</INSDQualifier_name>
>           <INSDQualifier_value>E2f2</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>note</INSDQualifier_name>
>           <INSDQualifier_value>alignment with genomic scaffold
> AE003669</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>db_xref</INSDQualifier_name>
>           <INSDQualifier_value>FLYBASE:FBgn0024371</INSDQualifier_value>
>         </INSDQualifier>
>       </INSDFeature_quals>
>     </INSDFeature>
>     <INSDFeature>
>       <INSDFeature_key>CDS</INSDFeature_key>
>       <INSDFeature_location>189..1301</INSDFeature_location>
>       <INSDFeature_intervals>
>         <INSDInterval>
>           <INSDInterval_from>189</INSDInterval_from>
>           <INSDInterval_to>1301</INSDInterval_to>
>           <INSDInterval_accession>AY069118.1</INSDInterval_accession>
>         </INSDInterval>
>       </INSDFeature_intervals>
>       <INSDFeature_quals>
>         <INSDQualifier>
>           <INSDQualifier_name>gene</INSDQualifier_name>
>           <INSDQualifier_value>E2f2</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>note</INSDQualifier_name>
>           <INSDQualifier_value>Longest ORF</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>codon_start</INSDQualifier_name>
>           <INSDQualifier_value>1</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>transl_table</INSDQualifier_name>
>           <INSDQualifier_value>1</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>product</INSDQualifier_name>
>           <INSDQualifier_value>GH13089p</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>protein_id</INSDQualifier_name>
>           <INSDQualifier_value>AAL39263.1</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>db_xref</INSDQualifier_name>
>           <INSDQualifier_value>GI:17861572</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>db_xref</INSDQualifier_name>
>           <INSDQualifier_value>FLYBASE:FBgn0024371</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>translation</INSDQualifier_name>
>           
> <INSDQualifier_value>MYKRKTASIVKRDSSAAGTTSSAMMMKVDSAETSVRSQSYESTPVSMDTSPDPPTPIKSPSNSQSQSQPGQQRSVGSLVLLTQKFVDLVKANEGSIDLKAATKILDVQKRRIYDITNVLEGIGLIDKGRHCSLVRWRGGGFNNAKDQENYDLARSRTNHLKMLEDDLDRQLEYAQRNLRYVMQDPSNRSYAYVTRDDLLDIFGDDSVFTIPNYDEEVDIKRNHYELAVSLDNGSAIDIRLVTNQGKSTTNPHDVDGFFDYHRLDTPSPSTSSHSSEDGNAPACAGNVITDEHGYSCNPGMKDEMKLLENELTAKIIFQNYLSGHSLRRFYPDDPNLENPPLLQLNPPQEDFNFALKSDEGICELFDVQCS</INSDQualifier_value>
>         </INSDQualifier>
>       </INSDFeature_quals>
>     </INSDFeature>
>   </INSDSeq_feature-table>
>   
> <INSDSeq_sequence>AAGAATAGAGGGAGAATGAAAAAAATGACATAAATGGCGGAAAGCAAACCTAGCGCCAACATTCGTATTTTCGTTTAATTTTCGCTCCAAAGTGCAATTAATTCCGGCTTCTTGATCGCTGCATATTGAGTGCAGCCACGCAAAGAGTTACAAGGACAGGAGTATAGTCATCGAGTCGATTGCGGACCATGTACAAGCGCAAAACCGCGAGTATTGTTAAAAGAGACAGCTCCGCAGCGGGCACCACCTCCTCGGCTATGATGATGAAGGTGGATTCGGCTGAGACTTCGGTCCGGTCGCAGAGCTACGAGTCTACACCCGTTAGCATGGACACATCACCGGATCCTCCAACGCCAATCAAGTCTCCGTCGAATTCACAATCGCAATCGCAGCCTGGACAACAGCGCTCCGTGGGCTCACTGGTCCTGCTCACACAGAAGTTTGTGGATCTCGTGAAGGCCAACGAAGGATCCATCGACCTGAAAGCGGCAACCAAAATCTTGGACGTACAGAAGCGCCGAATATACGATATTACCAATGTTTTAGAGGGCATTGGACTAATTGATAAGGGCAGACACTGCTCCCTAGTGCGCTGGCGCGGAGGGGGCTTTAACAATGCCAAGGACCAAGAGAACTACGACCTGGCACGTAGCCGGACTAATCATTTGAAAATGTTGGAGGATGACCTAGACAGGCAACTGGAGTATGCACAGCGCAATCTGCGCTACGTTATGCAGGATCCCTCGAATAGGTCGTATGCATATGTGACACGTGATGATCTGCTGGACATCTTTGGAGATGATTCCGTATTCACAATACCTAATTATGACGAGGAAGTAGATATCAAGCGTAATCATTACGAGCTGGCCGTGTCGCTGGACAATGGCAGCGCAATTGACATTCGCCTGGTGACGAACCAAGGAAAGAGTACTACAAATCCGCACGATGTGGATGGGTTCTTTGACTATCACCGTCTGGACACGCCCTCACCCTCGACGTCGTCGCACTCCAGCGAGGATGGTAACGCTCCAGCATGCGCGGGGAACGTGATCACCGACGAGCACGGTTACTCGTGCAATCCCGGGATGAAAGATGAGATGAAACTTTTGGAGAACGAGCTGACGGCCAAGATAATCTTCCAAAATTATCTGTCCGGTCATTCGCTGCGGCGATTTTATCCCGATGATCCGAATCTAGAAAACCCGCCGCTGCTGCAGCTGAATCCTCCGCAGGAAGACTTCAACTTTGCGTTAAAAAGCGACGAAGGTATTTGCGAGCTGTTTGATGTTCAGTGCTCCTAACTGTGGAAGGGGATGTACACCTTAGGACTATAGCTACACTGCAACTGGCCGCGTGCATTGTGCAAATATTTATGATTAGTACAATTTTGACTTTGGATTTCTCTATATCGTCTAGAAATTTTTAATTAGTGTAATACCTTGTAATTTCGCAAATAACAGCAAAACCAATAAATTCGTAAATGCAAAAAAAAAAAAAAAAAA</INSDSeq_sequence>
> </INSDSeq>
> </INSDSet>
> +++++++++++++++++++++
> On 6/6/06, Richard Holland <[EMAIL PROTECTED]> wrote:
> ...
> > For your second question, the tutorial makes the mistake in several
> > places of saying getNoteSet(Terms.blahblah()). This was shorthand for:
> >
> >  rs.getAnnotation().getProperty(Terms.blahblah())
> >         (for single values)
> >
> > or
> >
> >  ((RichAnnotation)rs.getAnnotation()).getProperties(Terms.blahblah())
> >         (for multiple values)
> >
> > but never got expanded. Maybe someone can fix that one day... :)
> >
> > I'm just updating INSDseq to 1.4 now. The guys next door gave me the
> > details of the changes, and told me that 1.3 is actually no longer
> > supported by them after Friday this week! So I'll make it 1.4 only.
> >
> > cheers,
> > Richard
> >
> > On Tue, 2006-06-06 at 10:34 -0400, Seth Johnson wrote:
> > > I think it would be best to wait for the 'official response'.  I could
> > > only locate the general changes detailed here:
> > >
> > > http://www.bio.net/bionet/mm/genbankb/2005-December/000233.html
> > >
> > > > As for a solution to the ever-changing formats, I just don't see
> > > > an elegant way. :(  The only thing that comes to mind is creating a
> > > > separate format, "INSDseq14Format.java", and building new readers and
> > > > writers on top of it.
> > >
> > > > #1: On that note, I wanted to ask about the differences between the
> > > > Genbank and INSDseq parsers and the ways to retrieve certain values.  The
> > > > tutorial states that the two formats are essentially mirror images of
> > > > each other, with the latter being XML.  When parsing Genbank files,
> > > > "rs.getIdentifier()" retrieves the GI number; however, when the same
> > > > method is used on a RichSequence obtained by parsing the INSDseq format, I
> > > > get a 'null' value.  Moreover, I could not even locate that number
> > > > in the structure of the RichSequence object while debugging.  Is this a
> > > > bug, or should the GI number be obtained differently?
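
One possible workaround, sketched under assumptions: if only the GI number is
needed, it can be read straight from the <INSDSeqid>gi|...</INSDSeqid> element
with the JDK's own XML parser, without going through BioJavaX at all. The
class and method names below are made up for this example. The sample string
is a cut-down fragment of the AY069118 record without its DOCTYPE line; with
the DOCTYPE present, the parser may try to fetch the DTD from the NCBI URL.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper; it reads the XML directly and does not use BioJavaX.
class GiFromInsdSeq {
    /** Returns the number from the first "gi|..." INSDSeqid element, or null if there is none. */
    static String findGi(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList ids = doc.getElementsByTagName("INSDSeqid");
            for (int i = 0; i < ids.getLength(); i++) {
                String id = ids.item(i).getTextContent().trim();
                if (id.startsWith("gi|")) {
                    return id.substring("gi|".length());
                }
            }
            return null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse INSDSeq XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<INSDSeq><INSDSeq_other-seqids>"
                + "<INSDSeqid>gb|AY069118.1|</INSDSeqid>"
                + "<INSDSeqid>gi|17861571</INSDSeqid>"
                + "</INSDSeq_other-seqids></INSDSeq>";
        System.out.println(findGi(xml)); // prints 17861571
    }
}
```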
> > >
> > > > #2: Also, what is the best way to obtain the "mol_type" value from a
> > > > RichSequence object?  The tutorial states that it's
> > > > "getNoteSet(Terms.getMolTypeTerm())".  I guess it's either a simplified
> > > > explanation or something has changed, since .getNoteSet() does not take
> > > > any parameters.  I used
> > > > "rs.getAnnotation().asMap().get(Terms.getMolTypeTerm())" and was
> > > > wondering if that's how it was intended to be retrieved.
> > >
> > > As always, below is the INSDseq file I tried to parse:
> > > ================================
> > > > [snip: the same AY069118 INSDSet document that is quoted in full above]
> > > ================================
> > > On 6/6/06, Richard Holland <[EMAIL PROTECTED]> wrote:
> > > > I can't find any document detailing the differences between INSDseq XML
> > > > versions 1.3 and 1.4, so I've asked the guys over in the data library
> > > > section here to see if they have one or can produce one for me. They
> > > > wrote it so they should know!
> > > >
> > > > Once I have this I'll get the INSDseq parser up-to-date. (I could go
> > > > through the DTDs by hand and work it all out manually, but that would
> > > > take rather longer than I've got time for at the moment!).
> > > >
> > > > It's a bit of a pain trying to keep the parsers up-to-date all the time,
> > > > especially when people start wanting backwards-compatibility. Does
> > > > anyone have any bright ideas as to how to manage version changes in file
> > > > formats?
> > > >
> > > > cheers,
> > > > Richard
> > > >
> > > > On Mon, 2006-06-05 at 12:28 -0400, Seth Johnson wrote:
> > > > > > I agree with you on that one.  However, the problem might be a little
> > > > > > deeper.  The same '?' appears in the INSDseq format inside
> > > > > > <INSDReference_reference> tags and causes the following exception.
> > > > > > This tells me that the '?' are actually values that are being
> > > > > > incorrectly parsed.  Further examination of the .dtd reveals that
> > > > > > INSDseqFormat.java is tailored towards INSDSeq v. 1.3, whereas the
> > > > > > files I obtain are in INSDSeq v. 1.4 (which, among other things,
> > > > > > contains a new tag, <INSDReference_position>).  Here are links to both
> > > > > > .dtd's:
> > > > >
> > > > > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDSeq_v1.3.dtd.txt
> > > > >
> > > > > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDC_V1.4.dtd.txt
> > > > >
> > > > > > I think it might be worth accommodating the changes in the INSDseq
> > > > > > format; I'm not sure how that would affect the '?' in Genbank.
> > > > >
> > > > > Seth
> > > > >
> > > > > ======================
> > > > > > org.biojava.bio.BioException: Could not read sequence
> > > > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > > >         at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
> > > > > > Caused by: org.biojava.bio.seq.io.ParseException: org.biojava.bio.seq.io.ParseException: Bad reference line found: ?
> > > > > >         at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:250)
> > > > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > > > >         ... 1 more
> > > > > > Caused by: org.biojava.bio.seq.io.ParseException: Bad reference line found: ?
> > > > > >         at org.biojavax.bio.seq.io.INSDseqFormat$INSDseqHandler.endElement(INSDseqFormat.java:901)
> > > > > >         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633)
> > > > > >         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241)
> > > > > >         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685)
> > > > > >         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
> > > > > >         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
> > > > > >         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
> > > > > >         at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
> > > > > >         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
> > > > > >         at javax.xml.parsers.SAXParser.parse(SAXParser.java:375)
> > > > > >         at org.biojavax.utils.XMLTools.readXMLChunk(XMLTools.java:97)
> > > > > >         at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:246)
> > > > > >         ... 2 more
> > > > > Java Result: -1
> > > > > ======================
> > > > >
> > > > > ~~~~~~~~~~~~~~~~~~~~~~
> > > > > <INSDSeq_references>
> > > > >     <INSDReference>
> > > > >       <INSDReference_reference>?</INSDReference_reference>
> > > > >       <INSDReference_position>1..16732</INSDReference_position>
> > > > >       <INSDReference_authors>
> > > > >         <INSDAuthor>Bjornerfeldt,S.</INSDAuthor>
> > > > >         <INSDAuthor>Webster,M.T.</INSDAuthor>
> > > > >         <INSDAuthor>Vila,C.</INSDAuthor>
> > > > >       </INSDReference_authors>
> > > > >       <INSDReference_title>Relaxation of Selective Constraint on Dog
> > > > > Mitochondrial DNA Following Domestication</INSDReference_title>
> > > > >       <INSDReference_journal>Unpublished</INSDReference_journal>
> > > > >     </INSDReference>
> > > > >     <INSDReference>
> > > > >       <INSDReference_reference>?</INSDReference_reference>
> > > > >       <INSDReference_position>1..16732</INSDReference_position>
> > > > >       <INSDReference_authors>
> > > > >         <INSDAuthor>Bjornerfeldt,S.</INSDAuthor>
> > > > >         <INSDAuthor>Webster,M.T.</INSDAuthor>
> > > > >         <INSDAuthor>Vila,C.</INSDAuthor>
> > > > >       </INSDReference_authors>
> > > > >       <INSDReference_journal>Submitted (06-APR-2006) to the
> > > > > EMBL/GenBank/DDBJ databases. Evolutionary Biology, Evolutionary
> > > > > Biology, Norbyvagen 18D, Uppsala 752 36,
> > > > > Sweden</INSDReference_journal>
> > > > >     </INSDReference>
> > > > >   </INSDSeq_references>
> > > > > ~~~~~~~~~~~~~~~~~~~~~~
> > > > >
> > > > > On 6/5/06, Richard Holland <[EMAIL PROTECTED]> wrote:
> > > > > > > Hmmm... interesting. I _could_ put in a special case that ignores the
> > > > > > > question marks, but that wouldn't be 'nice' really - this is more of a
> > > > > > > problem with the program that is producing the Genbank files than a
> > > > > > > problem with the parser trying to read them. '?' is not a valid tag in
> > > > > > > the official Genbank format, and has no meaning attached to it that I
> > > > > > > can work out, so I'm reluctant to make the parser recognise it.
> > > > > >
> > > > > > > I'd suggest you contact the people who write the software you are using
> > > > > > > to produce the Genbank files and ask them if they could stick to the
> > > > > > > rules!
> > > > > >
> > > > > > > In the meantime you could work around the problem by stripping the
> > > > > > > question marks in some kind of pre-processor before passing the file
> > > > > > > on to BioJavaX for parsing.
> > > > > >
> > > > > > cheers,
> > > > > > Richard
> > > > > >
> > > > > > On Mon, 2006-06-05 at 11:39 -0400, Seth Johnson wrote:
> > > > > > > > Removing '?' (or several of them in my case) avoids the following exception:
> > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > > > org.biojava.bio.BioException: Could not read sequence
> > > > > > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > > > > >         at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
> > > > > > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ415957
> > > > > > > >         at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
> > > > > > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > > > > >         ... 1 more
> > > > > > > Java Result: -1
> > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > > > > I don't know where that previous tokenization problem came from, since
> > > > > > > > I can no longer reproduce it.  This time it's more or less
> > > > > > > > straightforward.
> > > > > > > Here's the original file with question marks:
> > > > > > > ============================
> > > > > > > > LOCUS       DQ415957                1437 bp    mRNA    linear   VRT 01-JUN-2006
> > > > > > > > DEFINITION  Danio rerio capillary morphogenesis protein 2A (cmg2a) mRNA,
> > > > > > > >             complete cds.
> > > > > > > ACCESSION   DQ415957
> > > > > > > VERSION     DQ415957.1  GI:89513612
> > > > > > > KEYWORDS    .
> > > > > > > SOURCE      Unknown.
> > > > > > >   ORGANISM  Unknown.
> > > > > > >             Unclassified.
> > > > > > > ?
> > > > > > > ?
> > > > > > > FEATURES             Location/Qualifiers
> > > > > > > ?
> > > > > > >      gene            1..1437
> > > > > > >                      /gene="cmg2a"
> > > > > > >      CDS             1..1437
> > > > > > >                      /gene="cmg2a"
> > > > > > >                      /note="cell surface receptor; similar to 
> > > > > > > anthrax toxin
> > > > > > >                      receptor 2 (ANTXR2, ATR2, CMG2)"
> > > > > > >                      /codon_start=1
> > > > > > >                      /product="capillary morphogenesis protein 2A"
> > > > > > >                      /protein_id="ABD74633.1"
> > > > > > >                      /db_xref="GI:89513613"
> > > > > > >                      
> > > > > > > /translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS
> > > > > > >                      
> > > > > > > GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS
> > > > > > >                      
> > > > > > > EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY
> > > > > > >                      
> > > > > > > GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS
> > > > > > >                      
> > > > > > > SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT
> > > > > > >                      
> > > > > > > VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC
> > > > > > >                      
> > > > > > > CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS
> > > > > > >                      
> > > > > > > TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL
> > > > > > >                      RRQYDRVSVMRPTSADKGRCMNFSRTQH"
> > > > > > > ORIGIN
> > > > > > > >         1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt ctgtttatgc
> > > > > > > >        61 ttttcatctt ttaaagcgga aaccccatct tgtcatggtg cctacgacct gtactttgtg
> > > > > > > >       121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt tgtcaaaaat
> > > > > > > >       181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt ttcatcaaga
> > > > > > > >       241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg cctgaagacc
> > > > > > > >       301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa attggcaact
> > > > > > > >       361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt gactgatgga
> > > > > > > >       421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc aaggaagtat
> > > > > > > >       481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct agccgatgtg
> > > > > > > >       541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct caaaggcatc
> > > > > > > >       601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc gtccagcgtc
> > > > > > > >       661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt ggggagacaa
> > > > > > > >       721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca aaaaccaacc
> > > > > > > >       781 aaagtgaaga ttgactacat cctatgtcct gctccagtgc tgtatacagt tggacagcaa
> > > > > > > >       841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc tttcatcatc
> > > > > > > >       901 actgcctctt catgttcgga cggcacagtg gtggccattg tgttcttggt gctttttctc
> > > > > > > >       961 ctgttggctt tggctctgat gtggtggttc tggcctctat gctgcactgt cgttattaaa
> > > > > > > >      1021 gacccacctc cacaaagacc tcctccacct ccacctaagc tagagccaga cccggaaccc
> > > > > > > >      1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc tggtggaatc
> > > > > > > >      1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc aagactagag
> > > > > > > >      1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat ggtcaaaaag
> > > > > > > >      1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac accaatcaga
> > > > > > > >      1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt ttcagttatg
> > > > > > > >      1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca gcattaa
> > > > > > > //
> > > > > > >
> > > > > > > ============================
> > > > > > >
> > > > > > >
> > > > > > > On 6/5/06, Richard Holland <[EMAIL PROTECTED]> wrote:
> > > > > > > > Hi again.
> > > > > > > >
> > > > > > > > > Could you remove the offending question mark from the GenBank file and
> > > > > > > > > try it again to see if that fixes it? The parser should just ignore it,
> > > > > > > > > but apparently it does not. The error looks weird to me because the
> > > > > > > > > tokenization for a DNA GenBank file _does_ contain the letter 't'! Not
> > > > > > > > > sure what's going on here.
> > > > > > > ...
> > > > > > > >
> > > > > > > > cheers,
> > > > > > > > Richard
> > > > > > > >
> > > > > > > > On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote:
> > > > > > > > > > Hello again Richard,
> > > > > > > > >
> > > > > > > > > > No sooner had I reported that the last parsing exception was fixed than
> > > > > > > > > > another one came up with the Genbank format:
> > > > > > > > > --------------------------------------
> > > > > > > > > org.biojava.bio.seq.io.ParseException: DQ431065
> > > > > > > > > org.biojava.bio.BioException: Could not read sequence
> > > > > > > > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > > > > > > >         at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151)
> > > > > > > > > >         at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246)
> > > > > > > > > >         at exonhit.parsers.GenBankParser.main(GenBankParser.java:326)
> > > > > > > > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065
> > > > > > > > > >         at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
> > > > > > > > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > > > > > > > >         ... 3 more
> > > > > > > > > > org.biojava.bio.seq.io.ParseException: org.biojava.bio.symbol.IllegalSymbolException: This tokenization doesn't contain character: 't'
> > > > > > > > > ----------------------------------------
> > > > > > > > > The Genbank file that caused it is as follows:
> > > > > > > > > =========================================
> > > > > > > > > > LOCUS       DQ431065                 425 bp    DNA     linear   INV 01-JUN-2006
> > > > > > > > > > DEFINITION  Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial
> > > > > > > > > >             sequence; mitochondrial.
> > > > > > > > > ACCESSION   DQ431065
> > > > > > > > > VERSION     DQ431065.1  GI:90102206
> > > > > > > > > KEYWORDS    .
> > > > > > > > > SOURCE      Vaccinium corymbosum
> > > > > > > > >   ORGANISM  Vaccinium corymbosum
> > > > > > > > > >             Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
> > > > > > > > > >             Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
> > > > > > > > > >             asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae;
> > > > > > > > > >             Vaccinium.
> > > > > > > > > ?
> > > > > > > > > REFERENCE   2  (bases 1 to 425)
> > > > > > > > >   AUTHORS   Naik,L.D. and Rowland,L.J.
> > > > > > > > > >   TITLE     Expressed Sequence Tags of cDNA clones from subtracted library of
> > > > > > > > > >             Vaccinium corymbosum
> > > > > > > > >   JOURNAL   Unpublished (2005)
> > > > > > > > > FEATURES             Location/Qualifiers
> > > > > > > > >      source          1..425
> > > > > > > > >                      /organism="Vaccinium corymbosum"
> > > > > > > > >                      /mol_type="genomic DNA"
> > > > > > > > >                      /cultivar="Bluecrop"
> > > > > > > > >                      /db_xref="taxon:69266"
> > > > > > > > >                      /tissue_type="Flower buds"
> > > > > > > > >                      /clone_lib="Subtracted cDNA library of 
> > > > > > > > > Vaccinium
> > > > > > > > >                      corymbosum"
> > > > > > > > >                      /dev_stage="399 hour chill unit exposure"
> > > > > > > > >                      /note="Vector: pCR4TOPO; Site_1: Eco R 
> > > > > > > > > I; Site_2: Eco R I"
> > > > > > > > >      rRNA            <1..>425
> > > > > > > > >                      /product="16S ribosomal RNA"
> > > > > > > > > ORIGIN
> > > > > > > > > >         1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac
> > > > > > > > > >        61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt
> > > > > > > > > >       121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt
> > > > > > > > > >       181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta
> > > > > > > > > >       241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat
> > > > > > > > > >       301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta
> > > > > > > > > >       361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag
> > > > > > > > >       421 cgtaa
> > > > > > > > > //
> > > > > > > > > ==================================
> > > > > > > > > > I think the cause is the '?' at the beginning of the line. I'm not
> > > > > > > > > > sure whether the information that should appear in place of those
> > > > > > > > > > question marks is missing from the original ASN.1 batch file, or
> > > > > > > > > > whether it's a bug in the NCBI ASN2GO software.  It looks to me like
> > > > > > > > > > the former, since the file from the NCBI website contains much more
> > > > > > > > > > information than the batch file. Just bringing this to everyone's
> > > > > > > > > > attention.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best Regards,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Seth Johnson
> > > > > > > > > Senior Bioinformatics Associate
> > > > > > > > >
> > > > > > > > > Ph: (202) 470-0900
> > > > > > > > > Fx: (775) 251-0358
> > > > > > > > >
> > > > > > > > > On 6/2/06, Richard Holland <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > Hi Seth.
> > > > > > > > > >
> > > > > > > > > > > Your second point, about the authors string not being read correctly in
> > > > > > > > > > > Genbank format, has been fixed (or should have been if I got the code
> > > > > > > > > > > right!). Could you check the latest version of biojava-live out of CVS
> > > > > > > > > > > and give it another go? Basically the parser did not recognise the
> > > > > > > > > > > CONSRTM tag, as it is not mentioned in the sample record provided by
> > > > > > > > > > > NCBI, which is what I based the parser on.
> > > > > > > > > ...
> > > > > > > > > >
> > > > > > > > > > cheers,
> > > > > > > > > > Richard
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > --
> > > > > > > > Richard Holland (BioMart Team)
> > > > > > > > EMBL-EBI
> > > > > > > > Wellcome Trust Genome Campus
> > > > > > > > Hinxton
> > > > > > > > Cambridge CB10 1SD
> > > > > > > > UNITED KINGDOM
> > > > > > > > Tel: +44-(0)1223-494416
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > --
> > > > > > Richard Holland (BioMart Team)
> > > > > > EMBL-EBI
> > > > > > Wellcome Trust Genome Campus
> > > > > > Hinxton
> > > > > > Cambridge CB10 1SD
> > > > > > UNITED KINGDOM
> > > > > > Tel: +44-(0)1223-494416
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > --
> > > > Richard Holland (BioMart Team)
> > > > EMBL-EBI
> > > > Wellcome Trust Genome Campus
> > > > Hinxton
> > > > Cambridge CB10 1SD
> > > > UNITED KINGDOM
> > > > Tel: +44-(0)1223-494416
> > > >
> > > >
> > >
> > >
> > --
> > Richard Holland (BioMart Team)
> > EMBL-EBI
> > Wellcome Trust Genome Campus
> > Hinxton
> > Cambridge CB10 1SD
> > UNITED KINGDOM
> > Tel: +44-(0)1223-494416
> >
> >
> 
> 
-- 
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416


_______________________________________________
Biojava-l mailing list  -  [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l
