Hi Pola and Jose (and BioJava community),

this mail is first of all a reply on the MrBayes topic, but it also contains a new idea for parsing multiple sequence alignments in BioJava, which in this case is closely linked. I'm sorry that the text is quite long, but I wanted to be clear about what I mean.

MrBayes topic:

I'm following this topic with interest, because I'm currently working programmatically with both BioJava and MrBayes, although I don't know anything about the concrete plans behind the BioJava issue on MrBayes. My first question would be: Do we want to parse just the consensus tree output, or also the Markov chain log files (*.p and *.t) which record the states of the MCMC runs (*.t contains the trees and *.p the other parameter states)? Parsing such files might be interesting for analysing a run (e.g. to address questions like "Was the stationary state reached early enough?" or "Did the chain run long enough?"). Features like this would be similar to those implemented in Tracer (http://tree.bio.ed.ac.uk/software/tracer/ ), which comes with BEAST. It is also worth considering that BEAST2 and other programs output similar data, possibly in a similar format.

The more interesting/urgent thing, though, might be parsing the consensus tree, which is in Nexus format (or writing the input files for MrBayes). The Nexus format is not really state of the art anymore, and replacements such as NeXML (http://nexml.org/ ), which overcome its limitations, should be preferred when implementing new software. Still, Nexus is widely used, and supporting it in BioJava 3 (or 4) would surely be a good idea. There was an extensible Nexus parser in BioJava 1.x (http://www.biojava.org/docs/api1.9.1/org/biojavax/bio/phylo/io/nexus/package-summary.html ) which could be ported to BioJava 3 (4). (This has never been done so far, has it?) The thing about Nexus is that it can contain tree, sequence and meta data, so a complete parser would need to cover all of these. The previous approach of having a set of plug-in classes, one for each type of Nexus block, made sense to me.
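
To make the plug-in idea a bit more concrete, here is a minimal sketch in plain Java of how one handler per Nexus block could be registered and dispatched. This is not the BioJava 1.x API: the names NexusBlockHandler, TreesBlockHandler and NexusBlockDispatcher are hypothetical, and the tokenizing is deliberately naive (it simply splits on ';' and ignores Nexus comments in [...] and quoted names, which a real parser would have to handle).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical plug-in interface: one handler per Nexus block type.
interface NexusBlockHandler {
    // Receives one ';'-terminated command of the block, without the ';'.
    void handleCommand(String command);
}

// Example plug-in: collects the raw TREE commands of a TREES block.
class TreesBlockHandler implements NexusBlockHandler {
    final List<String> treeCommands = new ArrayList<>();

    @Override
    public void handleCommand(String command) {
        if (command.toLowerCase(Locale.ROOT).startsWith("tree ")) {
            treeCommands.add(command);
        }
    }
}

// Routes the commands of each block to the handler registered for its name.
class NexusBlockDispatcher {
    private final Map<String, NexusBlockHandler> handlers = new HashMap<>();

    void register(String blockName, NexusBlockHandler handler) {
        handlers.put(blockName.toLowerCase(Locale.ROOT), handler);
    }

    // Naive parsing: assumes no ';' inside comments or quoted names.
    void parse(String nexus) {
        String body = nexus.trim();
        if (body.regionMatches(true, 0, "#NEXUS", 0, 6)) {
            body = body.substring(6);  // drop the file header token
        }
        NexusBlockHandler current = null;
        for (String raw : body.split(";")) {
            String command = raw.trim();
            String lower = command.toLowerCase(Locale.ROOT);
            if (lower.startsWith("begin ")) {
                // Blocks without a registered handler are skipped.
                current = handlers.get(lower.substring(6).trim());
            } else if (lower.equals("end") || lower.equals("endblock")) {
                current = null;
            } else if (current != null && !command.isEmpty()) {
                current.handleCommand(command);
            }
        }
    }
}

class NexusDispatchDemo {
    public static void main(String[] args) {
        String nexus = "#NEXUS\nbegin taxa;\n dimensions ntax=3;\nend;\n"
                + "begin trees;\n tree con = (A,(B,C));\nend;\n";
        TreesBlockHandler trees = new TreesBlockHandler();
        NexusBlockDispatcher dispatcher = new NexusBlockDispatcher();
        dispatcher.register("trees", trees);
        dispatcher.parse(nexus);
        System.out.println(trees.treeCommands);
    }
}
```

With such a design, support for further blocks (e.g. CHARACTERS or a MrBayes-specific block) could be added later by registering additional handlers without touching the dispatcher.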
If you are thinking about writing a whole new parser, you can also have a look at the code I already wrote for the phylogenetic tree editor TreeGraph as a starting point: http://bioinfweb.info/Code/sventon/repos/TreeGraph2/list/trunk/main/src/info/bioinfweb/treegraph/document/io/nexus/?revision=HEAD (Of course you would have to use a different tree model in BioJava, maybe forester, if that is the current standard.)

Sequence parsers:

This already leads to another topic (which I was planning to post to this list at some point anyway): When talking about sequence parsers, I have another idea, namely to implement a general parser framework for multiple sequence alignments in BioJava. Different parsers (implementing the corresponding interfaces) could be added to it over time to support many formats, following an abstract strategy pattern. In contrast to the current parsers in BioJava (e.g. http://www.biojava.org/docs/api/org/biojava3/core/sequence/io/FastaReader.html ), the ones I'm thinking of should not decide themselves which implementation of the Sequence interface to use, but leave this decision to the user of the class.

To achieve this, I would propose an interface extending the Sequence interface, called e.g. EditableSequence, which additionally offers methods like setTokenAt(), insertTokenAt(), removeTokenAt(), ... . Implementations of the parser classes would then just use these methods to load the sequence into RAM instead of the current way. (A general way of creating instances of EditableSequence implementations would of course also be needed, which would be no problem with corresponding factory method definitions.) The benefit of this is that the storage method would be independent of the parser class. This would allow using e.g. compressed sequence storage as http://www.biojava.org/docs/api/index.html?org/biojava3/core/sequence/storage/TwoBitSequenceReader.html currently does, or a cached sequence for large data sets, etc.
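
The following is a minimal, self-contained sketch of what I have in mind, not a finished design. To keep it runnable without BioJava on the classpath, EditableSequence here does not extend the real Sequence interface and works on plain chars with 0-based indices. Only setTokenAt(), insertTokenAt() and removeTokenAt() are taken from the proposal above; getLength(), getTokenAt(), StringBuilderSequence and SimpleFastaAlignmentParser are hypothetical names chosen for illustration, and the FASTA handling is simplified.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in; in BioJava this would extend the existing Sequence interface.
interface EditableSequence {
    int getLength();
    char getTokenAt(int index);               // 0-based in this sketch
    void setTokenAt(int index, char token);
    void insertTokenAt(int index, char token);
    void removeTokenAt(int index);
}

// One possible storage strategy: a plain in-memory buffer.
// A compressed or disk-cached implementation could be swapped in instead.
class StringBuilderSequence implements EditableSequence {
    private final StringBuilder tokens = new StringBuilder();

    @Override public int getLength() { return tokens.length(); }
    @Override public char getTokenAt(int index) { return tokens.charAt(index); }
    @Override public void setTokenAt(int index, char token) { tokens.setCharAt(index, token); }
    @Override public void insertTokenAt(int index, char token) { tokens.insert(index, token); }
    @Override public void removeTokenAt(int index) { tokens.deleteCharAt(index); }
    @Override public String toString() { return tokens.toString(); }
}

// The parser never picks the storage class itself; the caller passes a factory.
class SimpleFastaAlignmentParser {
    static Map<String, EditableSequence> parse(Reader in,
            Supplier<? extends EditableSequence> factory) {
        Map<String, EditableSequence> result = new LinkedHashMap<>();
        EditableSequence current = null;
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue;
                }
                if (line.startsWith(">")) {
                    // New record: let the factory decide how tokens are stored.
                    current = factory.get();
                    result.put(line.substring(1).trim(), current);
                } else if (current != null) {
                    // Append token by token; no intermediate String per sequence.
                    for (int i = 0; i < line.length(); i++) {
                        current.insertTokenAt(current.getLength(), line.charAt(i));
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}

class EditableSequenceDemo {
    public static void main(String[] args) {
        String fasta = ">seqA\nACG-T\n>seqB\nAC\nGTT\n";
        Map<String, EditableSequence> alignment = SimpleFastaAlignmentParser.parse(
                new StringReader(fasta), StringBuilderSequence::new);
        EditableSequence a = alignment.get("seqA");
        a.setTokenAt(3, 'G');   // replace the gap
        a.removeTokenAt(0);     // delete the first token
        System.out.println(a + " " + alignment.get("seqB"));
    }
}
```

The point of the factory argument is that the same parser could fill a two-bit packed or a cached implementation simply by passing a different supplier.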
(If I haven't missed something, the problem with the current implementation is that you cannot benefit from the compression of such classes unless you implement your own parser, one that avoids loading each sequence into a string first just to pass it to the constructor of a Sequence implementation.) Of course new implementations would be needed for EditableSequence, but since this interface extends Sequence, such new classes would be fully interoperable with current code relying on the Sequence interface, while additionally offering editability.

I have already implemented a similar framework in one of my current projects, LibrAlign (http://bioinfweb.info/LibrAlign/ ), which is a Java GUI library for multiple sequence alignments and attached raw and meta data, and which is compatible with BioJava. See http://bioinfweb.info/Code/sventon/repos/LibrAlign/show/trunk/main/src/info/bioinfweb/libralign/sequenceprovider/SequenceDataProvider.java?revision=HEAD and http://bioinfweb.info/Code/sventon/repos/LibrAlign/show/trunk/main/src/info/bioinfweb/libralign/sequenceprovider/implementations/PackedSequenceDataProvider.java?revision=HEAD . Implementations of similar parsers can be found here: http://bioinfweb.info/Code/sventon/repos/Commons.Java/list/trunk/main/experimental/info/bioinfweb/commons/bio/biojava3/alignment/io/?revision=HEAD&bypassEmpty=true

Therefore I would offer to implement such functionality for BioJava. Before making a pull request or anything, though, I wanted to ask for the community's opinion on the idea, and also whether I might have missed concepts in BioJava that already allow something similar. I would be happy to get some feedback.

@Pola: If you have further questions on MrBayes, let me know. I could also send you some illustrations of how the MCMC works from one of my lectures, if needed. http://www2.ieb.uni-muenster.de/EvolBiodivPlants/en/Teaching/WS2013_2014/MolecularPhylogenetics

Best
Ben

Dipl. Biologe Ben Stöver
Evolution and Biodiversity of Plants Group
Institute for Evolution and Biodiversity
University of Münster
Germany
Phone: +49 251 83 21647
Fax: +49 251 83 24668
http://www2.ieb.uni-muenster.de/EvolBiodivPlants/en/People/Stoever
[email protected]

Jose Manuel Duarte wrote on 2014-11-05:
> Hi Pola
> Welcome and great that you want to plunge in! I don't know much about
> MrBayes myself, but the idea was to include a parser in the
> biojava3-phylo module. The module uses forester
> (https://code.google.com/p/forester/wiki/forester) as the underlying
> library to deal with phylogeny data. So the idea would be to parse the
> output into a forester data structure (most likely into
> org.forester.phylogeny.Phylogeny).
> Anyway hopefully someone with a bit more knowledge about this might be
> able to add something.
> Cheers
> Jose
> On 31/10/14 18:17, Pola Kyzioł wrote:
> >Hello,
> >my name is Pola and I'm currently a third year student in the
> >Theoretical Computer Science
> >at Jagiellonian University. I have also interest in biology,
> >especially in the field of genetics.
> >I've been searching a project connected with bioinformatics which I
> >could develop
> >and next use to writing my bachelor's thesis. I've found BioJava and
> >looked at its issues -
> >parser for MrBayes output seems for me to be interesting to code.
> >I would like to know some details about it:
> >- what data you want extracted from MrBayes' output files;
> >- how the created model should look like and if appropriate modules
> >already exist.
> >Thanks for your help,
> >Pola
> >_______________________________________________
> >Biojava-l mailing list - [email protected]
> >http://mailman.open-bio.org/mailman/listinfo/biojava-l
