I'm looking to move some previously hard-coded, workflow-style analysis of protein sequences (reference links below) to something like Kepler, so that it can be easily modified and expanded. The biggest problem is that there are a lot of proteins to process (about 1.5 million, across 537 bacterial genomes), and each one requires various tasks. (If you know bioinformatics, it's stuff like membrane helix prediction, running BLAST against the NR database, secondary structure prediction, protein threading, and running MODELLER.) This is more work than I would want one computer to do.

Has anybody done any work on parallel workflows? What is the best way to handle a workflow of this scope? I could set it up so that Kepler merely manages the workflow by coordinating web services and queue submissions, but that would introduce a lot of extra lag for communication time and submission to busy queues. I would prefer some method where I get a block of computers on a cluster, the various actors fire off on whichever nodes are free, and the director/management system coordinates the actors and moves data between the different machines.
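To make the director/actor scheme concrete, here is a minimal single-machine sketch of the pattern I mean: a coordinator hands each sequence to whichever worker slot is free and collects results as they finish. The task functions are hypothetical stand-ins (not real tools or any Kepler API); on a cluster, each would instead shell out to the external program on a free node.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-ins for the real per-protein tools
# (membrane helix prediction, BLAST vs. NR, secondary structure, threading);
# in practice each would invoke the external program on a free node.
def predict_membrane_helices(seq):
    return f"helices({seq[:8]})"

def predict_secondary_structure(seq):
    return f"ss({seq[:8]})"

PIPELINE = [predict_membrane_helices, predict_secondary_structure]

def analyze(seq):
    # One "actor" invocation: run every stage for a single sequence.
    return {task.__name__: task(seq) for task in PIPELINE}

def run_workflow(sequences, max_workers=4):
    # The "director": submit each sequence to whichever worker slot is
    # free, and collect results as they complete, in any order.
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(analyze, s): s for s in sequences}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results

seqs = ["MKTAYIAKQR", "MSDNGPQNQR"]
out = run_workflow(seqs)
print(len(out))  # one result dict per input sequence
```

The open question is exactly how to scale this pattern from worker threads on one box to actors scattered across cluster nodes, with the director also handling data movement between them.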
Has there been any research into this sort of thing? Does anybody have ideas on the best way to tackle it?

Kyle Ellrott

PROSPECT-PSPP: an automatic computational pipeline for protein structure prediction.
http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=15215441&ordinalpos=5&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum

A computational pipeline for protein structure prediction and analysis at genome scale.
http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=14555633&ordinalpos=7&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum

