Thanks Richard for a good summary, and it does sound like quite exciting research!
On Fri, Jul 8, 2011 at 16:24, Richard Holland <[email protected]> wrote:
> If I got this right, Efthymia is suggesting that workflows are not very
> unlike SQL queries. They consist of a number of data sources which are
> filtered and joined to produce the final result. The order in which they
> are filtered and joined is the decision of the query optimiser, which can
> be helped by hints inserted by the query author but largely has to rely on
> observed statistics to make decisions.
>
> The key difference is that query optimisers usually have full access to
> all stats on all the data sources involved. If stats are missing then they
> just go by the order originally specified by the query author in the SQL
> statement. Obviously such stats are not immediately available when it
> comes to services.
>
> If you can imagine a workflow as a pseudo-SQL statement it might help:
>
> select a.seqID, b.E_VALUE
> from MY_FASTA_FILE a
> left join BLAST_RESULTS b on a.seqID=b.querySeqID
> left join ENSEMBL_DB c on a.seqID=c.xref_id
> where b.db='human'
> and b.E_VALUE<0.0001
> and c.xref_db='custom mappings';
>
> Efthymia appears to have written a query optimiser that can reorganise the
> steps of a workflow to make the most efficient use of available services.
> This is not so much about whether a service can or will accept multiple
> values in a query, although that is definitely part of the problem, but
> more about the number of items returned by a call to a service and how
> much this is affected by pre-filtering of the input/selection criteria by
> other services. In the example above the query/workflow would perform
> differently depending on whether more of the input seqIDs were filtered
> out by the BLAST step or the Ensembl step, and performance would also be
> affected by which of the two services would respond faster.
> The optimiser would know this and reorder to suit (the logical choice in
> this example being to do the Ensembl filter first, even though the
> workflow specifies to do the BLAST first).
>
> I think this sounds like a good idea. It would need clever tricks to
> actually come up with any stats about services that can be used by such an
> optimiser, but once those stats were gathered the optimiser could easily
> decide whether it would be best to dynamically reorder the workflow
> components to produce the most efficient execution time.
>
> cheers,
> Richard.
>
> On 8 Jul 2011, at 15:45, Stian Soiland-Reyes wrote:
>
>> I think you have described almost every bioinformatics genomics workflow.
>>
>> You start with a small selection, search and expand out to lots of
>> candidates, then filter and narrow down the search space before
>> finding out more about those prioritised/matched items.
>>
>> The workflow designer must get an understanding of where the large
>> data sizes and processing times are before they can determine the
>> order of many of these operations - and so the first workflows are
>> probably very inefficient compared to the later ones, when it becomes
>> clearer where one needs to filter, and which services can be
>> done later ('fill in details' services which don't contribute to
>> filtering).
>>
>> However, we've often seen cases where involving a computer scientist
>> in reviewing the workflow can provide further optimisation. For
>> instance, in one case we realised that a service unofficially
>> supported multiple identifiers per search by using comma separation.
>> That meant we could move from 40,000 individual service calls to 1,000
>> grouped calls (the service still fell over if you gave too many in
>> that list!).
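[Editor's note: the batching trick Stian describes can be sketched as below. The chunk size and identifier format are invented for illustration; the comma-separated multi-ID support was an unofficial feature of one particular service, so real services may or may not accept it, and may cap how many IDs fit in one call.]

```python
# Sketch of turning one-call-per-identifier into grouped calls by
# joining identifiers with commas, as in the anecdote above.
# The batch size and ID format are hypothetical.

def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def batched_queries(ids, batch_size=40):
    """One comma-separated query string per group of identifiers."""
    return [",".join(group) for group in chunk(ids, batch_size)]

ids = ["ID%05d" % n for n in range(40000)]
queries = batched_queries(ids, batch_size=40)
print(len(ids), "ids ->", len(queries), "grouped calls")  # 40000 -> 1000
```

With a batch size of 40, the 40,000 individual calls collapse into 1,000 grouped ones, matching the ratio Stian mentions.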
>>
>>
>> On Thu, Jul 7, 2011 at 13:01, Efthymia Tsamoura <[email protected]> wrote:
>>> Hello,
>>> I am a PhD student and at the moment I am working on workflow
>>> optimization problems in distributed environments. I would like to
>>> ask whether there exist any cases where, if the order of task
>>> invocation in a scientific workflow changes, its performance changes
>>> too without, however, affecting the produced results. In the
>>> following, I present a small use case of the problem I am interested in:
>>>
>>> Suppose that a company wants to obtain a list of email addresses of
>>> potential customers, selecting only those who have a good payment
>>> history for at least one card and a credit rating above some
>>> threshold. The company has the right to use the following web services:
>>>
>>> WS1 : SSN id (ssn, threshold) -> credit rating (cr)
>>> WS2 : SSN id (ssn) -> credit card numbers (ccn)
>>> WS3 : card number (ccn, good) -> good payment history (gph)
>>> WS4 : SSN id (ssn) -> email addresses (ea)
>>>
>>> The input data containing customer identifiers (ssn) and other
>>> relevant information is stored in a local data resource. Two possible
>>> linear web service workflows that can be formed to process the input
>>> data using the above services are C1 = WS2,WS3,WS1,WS4 and
>>> C2 = WS1,WS2,WS3,WS4. In the first workflow, the customers with a
>>> good payment history are selected first (WS2,WS3), and then the
>>> remaining customers whose credit rating is below the threshold are
>>> filtered out (through WS1). The C2 workflow performs the same tasks in
>>> the reverse order. These linear workflows may have different
>>> performance: if WS3 filters out more data than WS1, then it is more
>>> beneficial to invoke WS3 before WS1, so that the subsequent web
>>> services in the workflow process less data.
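[Editor's note: the C1/C2 comparison above can be made concrete with a toy cost model. All selectivities and per-item costs below are invented for illustration; a real optimiser, as Richard notes, would have to estimate them from observed service statistics.]

```python
# Toy cost model for ordering filtering steps by selectivity,
# after the WS1-WS4 example. Selectivity = fraction of inputs that
# survive a step; cost = work per input item. Numbers are invented.

def plan_cost(n_inputs, steps):
    """Total work for a linear plan: each step only processes
    the items that earlier steps let through."""
    total, remaining = 0.0, float(n_inputs)
    for name, cost_per_item, selectivity in steps:
        total += remaining * cost_per_item
        remaining *= selectivity
    return total

# (name, cost per item, selectivity); here WS3 filters harder than WS1.
ws1 = ("WS1 credit rating", 1.0, 0.8)
ws2 = ("WS2 card numbers", 1.0, 1.0)
ws3 = ("WS3 good history", 1.0, 0.2)
ws4 = ("WS4 email", 1.0, 1.0)

c1 = plan_cost(10000, [ws2, ws3, ws1, ws4])  # history filter first
c2 = plan_cost(10000, [ws1, ws2, ws3, ws4])  # rating filter first
print("C1 cost:", c1, " C2 cost:", c2)
```

Under these assumed numbers C1 does less total work than C2, because the highly selective WS3 runs early and shrinks the input for everything downstream; flip the selectivities of WS1 and WS3 and the ordering preference flips too.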
>>>
>>> It would be very useful to know whether there exist similar scientific
>>> workflow examples (where users have many options for ordering the
>>> workflow tasks but cannot decide which task ordering to use, while the
>>> workflow performance depends on the task invocation order), and
>>> whether you would be interested in using optimizers for such types of
>>> workflows.
>>>
>>> I am asking because I have recently developed an optimization
>>> algorithm for this problem and I would like to test its performance in
>>> a real-world workflow management system with real-world workflows.
>>>
>>> P.S.: references to publications or any other information dealing with
>>> scientific workflows of the above rationale would be extremely useful.
>>>
>>> Thank you very much for your time.
>>
>> --
>> Stian Soiland-Reyes, myGrid team
>> School of Computer Science
>> The University of Manchester
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: [email protected]
> http://www.eaglegenomics.com/

--
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester

_______________________________________________
taverna-users mailing list
[email protected]
[email protected]
Web site: http://www.taverna.org.uk
Mailing lists: http://www.taverna.org.uk/about/contact-us/
