Dear Sir,

Could anyone tell me where I could start? Is there any lead who might need my help with software development and research-oriented aspects?
Any comments on my previous emails would be most welcome.

Regards,
Jitesh Dundas

On 4/8/10, Andreas Dräger <[email protected]> wrote:
>
> Hi all,
>
> This e-mail is just for your information about somebody new who'd like to
> contribute to our project.
>
> Cheers
> Andreas
>
>
> Subject: Re: Fwd: Proposing a project on "Biojava alignment lead"
> From: Andreas Dräger <[email protected]>
> Date: Wed, 07 Apr 2010 09:27:13 +0200
> To: Cai Shaojiang <[email protected]>
>
> Hi Cai Shaojiang,
>
> Thank you for your e-mail! I don't know what happened to the e-mail list.
> Sometimes it takes a while due to the spam filters, I guess.
>
> > I am a PhD student from National University of Singapore. My major
> > research area is local alignment algorithms and data structures for SNP
> > identification. And I have used Java and Eclipse for years for software
> > development. I am very interested in your GSoC programme. I find that
> > there is a module called "biojava-alignment lead" whose mentor is you. I
> > want to propose a new project on this module. I have several questions
> > about this module.
>
> Yes, that's me. So great to get your support.
>
> > 1. It seems that pairwise alignment is to find similarity between two
> > short sequences. Existing pairwise alignment is based on dynamic
> > programming; is it the Smith-Waterman algorithm?
>
> So, currently, BioJava contains three different alignment approaches.
> There are two deterministic algorithms, i.e., Smith-Waterman for local
> alignment and Needleman-Wunsch for global alignment. Third, there is the
> possibility to apply Hidden Markov Models for alignment. An example of the
> latter approach should be in the cookbook.
>
> > 2. What is the exact task of "refactoring of underlying data structures"?
>
> Yes, this is something I did last week already, but it could still be
> improved. The problem was that the alignment algorithms actually produced a
> kind of string that looks similar to the output of BLAST.
> This string contained the score, the computation time, the length of the
> alignment, etc. The problem was that people wanted to perform higher-level
> computation on the score value or evaluate some other information. Now, the
> alignment will produce a data structure that contains all the information
> and can, in addition to that, also produce such a BLAST-like output. There
> is, however, still the following problem: The data structure requires both
> sequences in the pairwise alignment to have an identical length. In the
> case of local alignment this is especially stupid (actually), because gaps
> are inserted to fill the sequences. And then the data structure tries to
> keep the old sequence coordinates, leading to the effect that the numbers
> "query start", "query end", "subject start", and "subject end" are required
> to shift the sequences against each other when displaying the output. So,
> you cannot easily print the sequences below each other; you first have to
> shift them. Please check out the latest version of this package via
> anonymous svn and have a look ;-)
>
> > 3. My existing research area aims to deal with aligning short reads
> > (10s~100s bp) against extremely long sequences (e.g., human genome). As
> > far as I know, no such alignment tool implemented in Java exists.
> > Would you consider this direction?
>
> See, this would be very nice to include. But this requires that we no
> longer fill the short sequence with many, many gap symbols (just a waste of
> memory), but improve the data structure. There is already an
> UnequalLenghtAlignment (just a data structure, no algorithm) and I think we
> could use this as a starting point. Then your algorithm should only produce
> such a data structure and this would be fine.
>
> > 4. It seems that the existing tools just lack some refactoring and
> > representation interfaces. Any more underlying tasks?
>
> Hm.
> Yes: With the release of BioJava 3, data structures have changed again.
> So maybe there's also some adaptation to the new structure required.
>
> > I have been keeping an eye on GSoC since last month, but I am sorry to
> > find out that I sent the initial email to the mailing list before I
> > subscribed to it...
>
> Ok. Sounds good. Thanks for your interest. So I suggest: Download the
> latest trunk, have a look, play around and if you can improve something
> we'll put it into the trunk and write your name into the authors' tag.
>
> Cheers
> Andreas
>
> --
> Dipl.-Bioinform. Andreas Dräger
> Eberhard Karls University Tübingen
> Center for Bioinformatics (ZBIT)
> Sand 1
> 72076 Tübingen
> Germany
>
> Phone: +49-7071-29-70436
> Fax: +49-7071-29-5091
> _______________________________________________
> Biojava-l mailing list - [email protected]
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
_______________________________________________
Biojava-l mailing list - [email protected]
http://lists.open-bio.org/mailman/listinfo/biojava-l
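
For anyone new to the thread, the refactoring described in the answer to question 2 (return a data structure that carries the score and coordinates, and let it render the BLAST-like text on demand) might look roughly like the sketch below. It is only an illustration under stated assumptions: `LocalAlignmentSketch`, `Result`, `align`, and `toBlastLikeString` are hypothetical names and not BioJava's API, and the scoring is a plain Smith-Waterman with a linear gap penalty, which may differ from what BioJava implements. The result stores only the aligned region plus 1-based start/end coordinates, so neither sequence has to be padded with gaps to a common length.

```java
// Hypothetical sketch, not BioJava code: Smith-Waterman local alignment
// returning a result object instead of a preformatted string.
final class LocalAlignmentSketch {

    /** Holds the values callers may want to compute on. */
    static final class Result {
        public final int score;
        public final int queryStart, queryEnd;     // 1-based, inclusive
        public final int subjectStart, subjectEnd; // 1-based, inclusive
        public final String alignedQuery, alignedSubject; // aligned region only, '-' is a gap

        Result(int score, int qs, int qe, int ss, int se, String aq, String as) {
            this.score = score;
            this.queryStart = qs;
            this.queryEnd = qe;
            this.subjectStart = ss;
            this.subjectEnd = se;
            this.alignedQuery = aq;
            this.alignedSubject = as;
        }

        /** Renders a simple BLAST-like text block from the stored fields. */
        public String toBlastLikeString() {
            return "Score: " + score + "\n"
                 + "Query: " + queryStart + " " + alignedQuery + " " + queryEnd + "\n"
                 + "Sbjct: " + subjectStart + " " + alignedSubject + " " + subjectEnd + "\n";
        }
    }

    private static int sub(char a, char b, int match, int mismatch) {
        return a == b ? match : mismatch;
    }

    /** Assumes match > 0, mismatch < 0, and gap < 0 (linear gap penalty). */
    static Result align(String query, String subject, int match, int mismatch, int gap) {
        int n = query.length(), m = subject.length();
        int[][] h = new int[n + 1][m + 1]; // row 0 and column 0 stay 0
        int best = 0, bi = 0, bj = 0;

        // Fill the score matrix and remember the first highest-scoring cell.
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = h[i - 1][j - 1] + sub(query.charAt(i - 1), subject.charAt(j - 1), match, mismatch);
                int v = Math.max(0, Math.max(diag, Math.max(h[i - 1][j] + gap, h[i][j - 1] + gap)));
                h[i][j] = v;
                if (v > best) {
                    best = v;
                    bi = i;
                    bj = j;
                }
            }
        }

        // Trace back from the best cell until a zero cell is reached.
        StringBuilder qa = new StringBuilder(), sa = new StringBuilder();
        int i = bi, j = bj;
        while (i > 0 && j > 0 && h[i][j] > 0) {
            int diag = h[i - 1][j - 1] + sub(query.charAt(i - 1), subject.charAt(j - 1), match, mismatch);
            if (h[i][j] == diag) {
                qa.append(query.charAt(--i));
                sa.append(subject.charAt(--j));
            } else if (h[i][j] == h[i - 1][j] + gap) {
                qa.append(query.charAt(--i));
                sa.append('-');
            } else {
                qa.append('-');
                sa.append(subject.charAt(--j));
            }
        }
        return new Result(best, i + 1, bi, j + 1, bj,
                qa.reverse().toString(), sa.reverse().toString());
    }
}
```

With these assumed parameters, `LocalAlignmentSketch.align("ACGT", "TTACGTTT", 2, -1, -2)` yields a score of 8 with query coordinates 1 to 4 and subject coordinates 3 to 6, and callers can read `score` directly instead of parsing it back out of the text.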
