Hi,

With all of the recent discussion about alignments, I have been thinking about ways to represent EST assemblies in BioJava. In reality they are gapped alignments with potentially tens of thousands of sequences. They also tend to carry some cruft around with them, such as a consensus sequence (the alignment consensus) and the contig sequence, which is basically the quality-clipped, ungapped consensus. There needs to be a mapping between the contig sequence coordinates and the underlying alignment coordinates. They also have interesting features like SNPs, which really only exist as columns in the alignment that exceed some threshold conditions.
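
To make the coordinate mapping concrete, here is a rough sketch in plain Java. It is not BioJava API: the class name GapMap and its methods are hypothetical, positions are 0-based (BioJava itself counts from 1), '-' is assumed to be the gap symbol, and quality clipping is ignored (it would add a fixed offset on the contig side).

```java
import java.util.Arrays;

/**
 * Hypothetical helper (not part of BioJava): maps positions in the ungapped
 * consensus to columns of the gapped alignment consensus, and back.
 */
final class GapMap {
    private static final char GAP = '-';

    // ungappedToColumn[i] = alignment column of the i-th non-gap consensus symbol
    private final int[] ungappedToColumn;
    // columnToUngapped[c] = ungapped index at column c, or -1 if column c is a gap
    private final int[] columnToUngapped;

    GapMap(String gappedConsensus) {
        int columns = gappedConsensus.length();
        columnToUngapped = new int[columns];
        int[] tmp = new int[columns];
        int n = 0;
        for (int col = 0; col < columns; col++) {
            if (gappedConsensus.charAt(col) == GAP) {
                columnToUngapped[col] = -1;
            } else {
                columnToUngapped[col] = n;
                tmp[n++] = col;
            }
        }
        ungappedToColumn = Arrays.copyOf(tmp, n);
    }

    /** Alignment column holding the given ungapped (contig) position. */
    int toColumn(int ungappedIndex) {
        return ungappedToColumn[ungappedIndex];
    }

    /** Ungapped (contig) position for an alignment column, or -1 for a gap column. */
    int toUngapped(int column) {
        return columnToUngapped[column];
    }

    /** Length of the consensus once gaps are removed. */
    int ungappedLength() {
        return ungappedToColumn.length;
    }

    public static void main(String[] args) {
        GapMap map = new GapMap("AC--GT-A");
        System.out.println(map.toColumn(2));   // the 'G' sits in column 4
        System.out.println(map.toUngapped(3)); // column 3 is a gap, so -1
    }
}
```

Both lookups are constant time after one pass over the consensus, at the cost of two int arrays per contig.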
Some issues to think about:

1. How best to hold potentially thousands of sequences in an alignment. One solution might be to store only each sequence's differences from the consensus and infer the rest from the consensus.
2. How to represent the quality data. Should the contig/consensus sequence be represented as PhredSequences, Sequences, or maybe even Markov Chains?
3. How to make a SNP-like feature.

- Mark

Mark Schreiber
Bioinformatics
AgResearch Invermay
PO Box 50034
Mosgiel
New Zealand
PH: +64 3 489 9175
FAX: +64 3 489 3739

_______________________________________________
Biojava-l mailing list - [EMAIL PROTECTED]
http://biojava.org/mailman/listinfo/biojava-l
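
P.S. On issue 1, storing only the differences from the consensus could look roughly like the sketch below. This is plain Java, not BioJava API: the class DiffRead and its methods are hypothetical, columns are 0-based, and both the consensus and the read are assumed to be already gapped into alignment coordinates.

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical class (not part of BioJava): one aligned read stored as its
 * start column, its length, and only the columns where it differs from the
 * gapped consensus.
 */
final class DiffRead {
    private final int start;   // first alignment column covered by the read
    private final int length;  // number of alignment columns covered
    // alignment column -> read symbol, only where the read disagrees with the consensus
    private final Map<Integer, Character> diffs = new TreeMap<>();

    DiffRead(String gappedConsensus, int start, String alignedRead) {
        if (start < 0 || start + alignedRead.length() > gappedConsensus.length()) {
            throw new IllegalArgumentException("read does not fit inside the consensus");
        }
        this.start = start;
        this.length = alignedRead.length();
        for (int i = 0; i < length; i++) {
            char symbol = alignedRead.charAt(i);
            if (symbol != gappedConsensus.charAt(start + i)) {
                diffs.put(start + i, symbol);
            }
        }
    }

    /** Read symbol at an alignment column, or ' ' if the read does not cover it. */
    char symbolAt(String gappedConsensus, int column) {
        if (column < start || column >= start + length) {
            return ' ';
        }
        Character diff = diffs.get(column);
        return diff != null ? diff : gappedConsensus.charAt(column);
    }

    /** Rebuilds the full aligned read by filling unchanged columns from the consensus. */
    String reconstruct(String gappedConsensus) {
        StringBuilder sb = new StringBuilder(length);
        for (int col = start; col < start + length; col++) {
            sb.append(symbolAt(gappedConsensus, col));
        }
        return sb.toString();
    }

    /** Number of columns actually stored for this read. */
    int differenceCount() {
        return diffs.size();
    }
}
```

For reads that mostly agree with the consensus, the per-read storage then scales with the number of mismatches rather than the read length. The same per-column difference maps might also be a starting point for counting disagreements in a column when looking for SNP candidates.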
