Just kicked off a new build; alpha4 should be on the servers shortly. You don't need CruiseControl for a release: anybody with an SSH account on portal.open-bio (and correctly set-up SSH keys) can run mvn release:clean release:prepare release:perform
A

On Sat, Oct 30, 2010 at 5:20 AM, Andy Yates <[email protected]> wrote:
> You should be aware I just found a bug in the code. This has been fixed, but
> the bug will still be in the alpha3 release. I would recommend either
> building a version yourself or, if Andreas can post up the continuous
> integration server address, there will be a release tonight.
>
> Just goes to show you should always do more testing than you think :).
>
> Andy
>
> On 29 Oct 2010, at 20:43, jitesh dundas wrote:
>
>> That is good news. Thanks for the directions, Andy.
>>
>> I have already started on this. Let me analyze and write the code now.
>>
>> Maybe a next-month deadline is not unreachable in this case.
>>
>> Here we go!
>> JD
>>
>> On 10/30/10, Andy Yates <[email protected]> wrote:
>>> So we've got some basic k-mer work now in SVN. If you look in the class
>>> SequenceMixin there are two static methods there for generating the two
>>> types of k-mers. It's not developed with Map storage in mind & I'll leave
>>> the door open there for anyone else to come in & develop it. The k-mers are
>>> also not unique across the sequence but it's a start :)
>>>
>>> Share & enjoy!
>>>
>>> Andy
>>>
>>> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>>>
>>>> I agree, Andy. These have become standard functionalities that
>>>> scientists use these days. I am all for implementing that in BioJava3.
>>>> Java isn't that efficient for such functionalities so we will surely
>>>> need more effort compared to the same in Python/Perl.
>>>>
>>>> Regards,
>>>> Jitesh Dundas
>>>>
>>>> On 10/30/10, Andy Yates <[email protected]> wrote:
>>>>> So if it's a suffix tree, that's quite a fixed data structure, so
>>>>> developing a pluggable mechanism there would be hard. I think there also
>>>>> has to be a limit as to what we can sensibly do. If people want to
>>>>> contribute this kind of work though then it'll all be very well received
>>>>> (with the corresponding test environment/cases of course).
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Andy
>>>>>
>>>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>>>>
>>>>>> It might be useful to make the K-mer storage mechanism pluggable. This
>>>>>> would allow a developer to use anything from a simple MultiMap to a NoSQL
>>>>>> key-value database to store K-mers. You could plug in custom map
>>>>>> implementations to allow you to keep a count of the number of instances of
>>>>>> particular K-mers that were found. It might also be useful to be able to do
>>>>>> set operations on those K-mer collections. You could use it to determine
>>>>>> which K-mers were present in a pathogen and not in a host.
>>>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> card.ly: <http://card.ly/phidias51>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> This is good to have. I feel that including it as a part of core may not be
>>>>>>> necessary but having it as part of the Genomic module in biojava3 will be nice.
>>>>>>> There is a project, Bioinformatica
>>>>>>>
>>>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>>>>>>>
>>>>>>> which does something similar although not exactly. It counts the k-mers in a
>>>>>>> given fasta file but it does not count k-mers for each sequence within the
>>>>>>> file, just all within a file. This is a good feature to have, especially if
>>>>>>> one is trying to find patterns within sequences, which is what I am trying to
>>>>>>> do. It would most certainly be helpful to have a k-mer counting algorithm
>>>>>>> that counts k-mer frequency for each sequence. The way to go would be to use
>>>>>>> suffix trees.
>>>>>>> Again, I don't know if biojava has a suffix tree API or not
>>>>>>> since I haven't used Java in a while and am just switching back to it. A
>>>>>>> paper on using suffix trees to generate genome-wide k-mer frequencies is:
>>>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.; the
>>>>>>> software is Tallymer). It would be some work to implement this in Java as a
>>>>>>> module for biojava3 but I can see that this will be helpful. Again, for small
>>>>>>> fasta files it might not be efficient to create a suffix tree, but for bigger
>>>>>>> files I think that might be the way to go.
>>>>>>>
>>>>>>> That's just my two cents. What do you think?
>>>>>>>
>>>>>>> -vishal
>>>>>>>
>>>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Vishal,
>>>>>>>>
>>>>>>>> As far as I am aware there is nothing which will generate them in BioJava
>>>>>>>> at the moment. However it is possible to do it with BioJava3:
>>>>>>>>
>>>>>>>> public static void main(String[] args) {
>>>>>>>>   DNASequence d = new DNASequence("ATGATC");
>>>>>>>>   System.out.println("Non-Overlap");
>>>>>>>>   nonOverlap(d);
>>>>>>>>   System.out.println("Overlap");
>>>>>>>>   overlap(d);
>>>>>>>> }
>>>>>>>>
>>>>>>>> public static final int KMER = 3;
>>>>>>>>
>>>>>>>> //Generate triplets overlapping
>>>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>>>>>>       new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>>>>   for(int i=1; i<=KMER; i++) {
>>>>>>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>>>>         i, d.getLength());
>>>>>>>>     WindowedSequence<NucleotideCompound> w =
>>>>>>>>         new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>>>>     l.add(w);
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   //Will return ATG, ATC, TGA & GAT
>>>>>>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>>>>     for(List<NucleotideCompound> subList:
>>>>>>>>         w) {
>>>>>>>>       System.out.println(subList);
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> //Generate triplet Compound lists non-overlapping
>>>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>>>>   WindowedSequence<NucleotideCompound> w =
>>>>>>>>       new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>>>>   //Will return ATG & ATC
>>>>>>>>   for(List<NucleotideCompound> subList: w) {
>>>>>>>>     System.out.println(subList);
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> The disadvantage of all of these solutions is that they generate lists of
>>>>>>>> Compounds, so k-mer generation can/will be a memory-intensive operation. This
>>>>>>>> does not mean it has to be, since sub-sequences are thin wrappers around an
>>>>>>>> underlying sequence. Also the overlap solution is non-optimal since it
>>>>>>>> iterates through each window rather than stepping through delegating onto
>>>>>>>> each base in turn (hence why we get ATG & ATC before TGA).
>>>>>>>>
>>>>>>>> As for unique k-mers, that's something which would require a bit more
>>>>>>>> engineering & would be better suited to a solution built around a Trie
>>>>>>>> (prefix tree).
>>>>>>>>
>>>>>>>> Hope this helps,
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I had a quick question: does BioJava have a method to generate k-mers or
>>>>>>>>> do k-mer counting in a given FASTA sequence / file? Basically, I want k-mer
>>>>>>>>> counts for every sequence in a FASTA file. If something like this exists it
>>>>>>>>> would save me some time to write the code.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Vishal
>>>>>>>>> _______________________________________________
>>>>>>>>> Biojava-l mailing list - [email protected]
>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>
>>>>>>>> --
>>>>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>>>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>>>>>
>>>>>>> --
>>>>>>> *Vishal Thapar, Ph.D.*
>>>>>>> *Scientific informatics Analyst
>>>>>>> Cold Spring Harbor Lab
>>>>>>> Quick Bldg, Lowe Lab
>>>>>>> 1 Bungtown Road
>>>>>>> Cold Spring Harbor, NY - 11724*
>
> _______________________________________________
> Biojava-l mailing list - [email 
protected]
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------
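
For anyone reading the thread later: the original question was about per-sequence k-mer counts, and the thread notes that the SVN code was not written with Map storage in mind. Below is a minimal sketch of that counting step using plain Java strings and a Map. It is not the SequenceMixin code and does not use any BioJava3 classes; the class name KmerCounter, the method countOverlapping and the two sample records are hypothetical, and it assumes Java 8 or later for Map.merge.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of BioJava: counts overlapping k-mers per sequence.
class KmerCounter {

    /** Returns each overlapping k-mer of the sequence with its number of occurrences. */
    static Map<String, Integer> countOverlapping(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // LinkedHashMap keeps k-mers in order of first appearance.
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Slide a window of width k along the sequence one base at a time.
        for (int i = 0; i + k <= sequence.length(); i++) {
            counts.merge(sequence.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up records standing in for the entries of a FASTA file.
        Map<String, String> records = new LinkedHashMap<>();
        records.put("seq1", "ATGATC");
        records.put("seq2", "ATATAT");
        for (Map.Entry<String, String> record : records.entrySet()) {
            System.out.println(record.getKey() + " "
                    + countOverlapping(record.getValue(), 3));
        }
    }
}
```

Running main prints "seq1 {ATG=1, TGA=1, GAT=1, ATC=1}" and "seq2 {ATA=2, TAT=2}". Substrings are copied here for simplicity, so this trades memory for clarity; a suffix tree or trie, as discussed in the thread, would be the better fit for large inputs.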
