Hi Chris, I've run into this problem before.
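The "Mark invalid" in your trace is the message java.io.BufferedReader.reset() throws when the position saved by mark() has been discarded. The sketch below reproduces that with the plain JDK only. It assumes, without quoting the BioJava source, that the FASTA reader marks the stream with a fixed read-ahead limit before peeking at the next line. MarkResetDemo, sampleData and resetSucceeds are made-up names for illustration, and the buffer sizes and limits are arbitrary.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class MarkResetDemo {

    // Hypothetical test data: one short line followed by one very long line,
    // standing in for a record followed by a long description line.
    static String sampleData(int longLineLength) {
        StringBuilder sb = new StringBuilder("ab\n");
        for (int i = 0; i < longLineLength; i++) {
            sb.append('x');
        }
        return sb.append('\n').toString();
    }

    // Returns true if reset() still works after peeking at the long line.
    static boolean resetSucceeds(int bufferSize, int readAheadLimit) throws IOException {
        BufferedReader br =
            new BufferedReader(new StringReader(sampleData(100)), bufferSize);
        br.readLine();            // consume the short line
        br.mark(readAheadLimit);  // remember the position before peeking
        br.readLine();            // peek at the 100-character line
        try {
            br.reset();           // try to go back to the marked position
            return true;
        } catch (IOException e) {
            return false;         // the mark was invalidated
        }
    }

    public static void main(String[] args) throws IOException {
        // Small buffer, small limit: the long line forces a buffer refill.
        System.out.println(resetSucceeds(16, 8));
        // Large buffer, same limit: everything is already buffered.
        System.out.println(resetSucceeds(8192, 8));
        // Small buffer, but a limit longer than the line.
        System.out.println(resetSucceeds(16, 256));
    }
}
```

On the OpenJDK implementations I know of, the mark is only dropped when the reader has to refill its buffer after reading past the limit, so the first call should fail and the other two should succeed. That would fit what you see: the same record parses fine alone or in a small file, and only breaks in a large file where a long description line happens to straddle a buffer boundary.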
See http://lists.open-bio.org/pipermail/biojava-l/2009-May/006834.html for details and some unofficial patches that fix the problem.

Josh

On 12/18/2009 10:53 AM, Chris Cole wrote:
> I'm wanting to parse a fasta file obtained from IPI using the code at
> the bottom of this message, but I get the following error:
>
> org.biojava.bio.BioException: Could not read sequence
>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
>     at test.readFasta(test.java:39)
>     at test.main(test.java:18)
> Caused by: java.io.IOException: Mark invalid
>     at java.io.BufferedReader.reset(BufferedReader.java:485)
>     at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:202)
>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
>     ... 2 more
>
> Looking at the Fasta file itself and doing some tests, it seems to fail
> consistently at one or two entries /preceding/ an entry with a very long
> description line e.g.:
> >IPI:IPI00021421.4|SWISS-PROT:Q9UMR5-1|TREMBL:B0S868|ENSEMBL:ENSP00000382748;ENSP00000382749;ENSP00000382750;ENSP00000387679;ENSP00000388341;ENSP00000388618;ENSP00000389930;ENSP00000392885;ENSP00000393009;ENSP00000395242;ENSP00000395562;ENSP00000397025;ENSP00000399879;ENSP00000403820;ENSP00000406496;ENSP00000406566;ENSP00000408703;ENSP00000411007;ENSP00000411625;ENSP00000412827|REFSEQ:NP_005146|VEGA:OTTHUMP00000014775;OTTHUMP00000014776;OTTHUMP00000014778;OTTHUMP00000175028;OTTHUMP00000175029;OTTHUMP00000175030;OTTHUMP00000193135;OTTHUMP00000193136;OTTHUMP00000193138;OTTHUMP00000193964;OTTHUMP00000193965;OTTHUMP00000193967;OTTHUMP00000194391;OTTHUMP00000194392;OTTHUMP00000194394 Tax_Id=9606 Gene_Symbol=PPT2 Isoform 1 of Lysosomal thioesterase PPT2
> MLGLWGQRLPAAWVLLLLPFLPLLLLAAPAPHRASYKPVIVVHGLFDSSYSFRHLLEYIN
> ETHPGTVVTVLDLFDGRESLRPLWEQVQGFREAVVPIMAKAPQGVHLICYSQGGLVCRAL
> LSVMDDHNVDSFISLSSPQMGQYGDTDYLKWLFPTSMRSNLYRICYSPWGQEFSICNYWH
> DPHHDDLYLNASSFLALINGERDHPNATVWRKNFLRVGHLVLIGGPDDGVITPWQSSFFG
> FYDANETVLEMEEQLVYLRDSFGLKTLLARGAIVRCPMAGISHTAWHSNRTLYETCIEPW
> LS
>
> Deleting the large entries allows the code to continue until it reaches
> another long description line.
>
> It also seems to be a feature of large Fasta files as reading the above
> sequence alone or as part of a small file is fine.
>
> Is this a known problem or am I doing something wrong? BTW I'm using
> biojava 1.7 and Java 1.6.0_17.
> Any help would be most appreciated.
> Cheers.
>
> code:
> import java.io.*;
>
> import org.biojava.bio.*;
> import org.biojavax.*;
> import org.biojavax.bio.seq.*;
>
> public class test {
>     private static PrintStream o = System.out;
>
>     public static void main(String[] args) {
>         // TODO Auto-generated method stub
>         readFasta(args[0]);
>     }
>
>     public static void readFasta(String filename) {
>         try {
>             o.println("Reading file: " + filename);
>             //prepare a BufferedReader for file io
>             BufferedReader br = new BufferedReader(new FileReader(filename));
>
>             // read Fasta file as BioJava RichSequence object
>             Namespace ns = RichObjectFactory.getDefaultNamespace();
>             RichSequenceIterator iter =
>                 RichSequence.IOTools.readFastaProtein(br,ns);
>
>             int numProteins = 0;
>             while(iter.hasNext()) {
>                 ++numProteins;
>
>                 // Retrieve sequence and description data
>                 RichSequence seq = iter.nextRichSequence();
>                 String ipi = seq.getName().substring(4,15);
>                 o.println(ipi);
>
>             }
>             o.println("Found " + numProteins + " in Fasta file");
>         } catch (FileNotFoundException ex) {
>             //can't find file specified by args[0]
>             ex.printStackTrace();
>         } catch (BioException ex) {
>             //error parsing requested format
>             ex.printStackTrace();
>         }
>     }
>
> }
> _______________________________________________
> Biojava-l mailing list - [email protected]
> http://lists.open-bio.org/mailman/listinfo/biojava-l
