Dear Stian and Richard,

Thank you very much for the valuable feedback.
Richard, the example that you have written is exactly what I was looking for: a workflow of several filtering and join tasks where, if the task execution order changes, the execution time changes too. I have just started searching for similar workflows on the myExperiment site.

Best Regards,
Efi

Quoting Richard Holland <[email protected]>:

> If I got this right, Efthymia is suggesting that workflows are not
> very unlike SQL queries. They consist of a number of data sources
> which are filtered and joined to produce the final result. The order
> in which they are filtered and joined is the decision of the query
> optimiser, which can be helped by hints inserted by the query author
> but largely has to rely on observed statistics to make decisions.
>
> The key difference is that query optimisers usually have full access
> to all stats on all the data sources involved. If stats are missing,
> they just go by the order originally specified by the query author
> in the SQL statement. Obviously such stats are not immediately
> available when it comes to services.
>
> If you can imagine a workflow as a pseudo-SQL statement it might help:
>
> select a.seqID, b.E_VALUE
> from MY_FASTA_FILE a
> left join BLAST_RESULTS b on a.seqID = b.querySeqID
> left join ENSEMBL_DB c on a.seqID = c.xref_id
> where b.db = 'human'
> and b.E_VALUE < 0.0001
> and c.xref_db = 'custom mappings';
>
> Efthymia appears to have written a query optimiser that can
> reorganise the steps of a workflow to make the most efficient use of
> available services. This is not so much about whether a service can
> or will accept multiple values in a query, although that is
> definitely part of the problem, but more about the number of items
> returned by a call to a service and how much this is affected by
> pre-filtering of the input/selection criteria by other services.
> In the example above the query/workflow would perform differently
> depending on whether more of the input seqIDs were filtered out by
> the BLAST step or the Ensembl step, and performance would also be
> affected by which of the two services would respond faster. The
> optimiser would know this and reorder to suit (the logical choice in
> this example being to do the Ensembl filter first, even though the
> workflow specifies doing the BLAST first).
>
> I think this sounds like a good idea. It would need clever tricks to
> actually come up with any stats about services that can be used by
> such an optimiser, but once those stats were gathered the optimiser
> could easily decide whether it would be best to dynamically reorder
> the workflow components to produce the most efficient execution time.
>
> cheers,
> Richard.
>
> On 8 Jul 2011, at 15:45, Stian Soiland-Reyes wrote:
>
>> I think you have described almost every bioinformatics genomics workflow.
>>
>> You start with a small selection, search and expand out to lots of
>> candidates, then filter and narrow down the search space before
>> finding out more about those prioritised/matched items.
>>
>> The workflow designer must get an understanding of where the large
>> data sizes and processing times are before they can determine the
>> order of many of these operations - and so the first workflows are
>> probably very inefficient compared to the later ones, when it
>> becomes clearer where one needs to filter and which services can be
>> done later ('fill in details' services which don't contribute to
>> filtering).
>>
>> However, we've often seen cases where involving a computer scientist
>> in reviewing the workflow can provide further optimisation. For
>> instance, in one case we realised that a service unofficially
>> supported multiple identifiers per search by using comma separation.
>> That meant we could move from 40,000 individual service calls to 1,000
>> grouped calls (the service still fell over if you gave it too many in
>> that list!).
>>
>>
>> On Thu, Jul 7, 2011 at 13:01, Efthymia Tsamoura
>> <[email protected]> wrote:
>>> Hello,
>>>
>>> I am a PhD student, and at the moment I am working on workflow
>>> optimization problems in distributed environments. I would like to
>>> ask whether there exist any cases where, if the order of task
>>> invocation in a scientific workflow changes, its performance changes
>>> too without, however, affecting the produced results. In the
>>> following, I present a small use case of the problem I am interested in.
>>>
>>> Suppose that a company wants to obtain a list of email addresses of
>>> potential customers, selecting only those who have a good payment
>>> history for at least one card and a credit rating above some
>>> threshold. The company has the right to use the following web services:
>>>
>>> WS1 : SSN id (ssn, threshold) -> credit rating (cr)
>>> WS2 : SSN id (ssn) -> credit card numbers (ccn)
>>> WS3 : card number (ccn, good) -> good payment history (gph)
>>> WS4 : SSN id (ssn) -> email addresses (ea)
>>>
>>> The input data containing customer identifiers (ssn) and other
>>> relevant information is stored in a local data resource. Two possible
>>> linear web service workflows that can be formed to process the input
>>> data using the above services are C1 = WS2,WS3,WS1,WS4 and C2 =
>>> WS1,WS2,WS3,WS4. In the first workflow, the customers having a
>>> good payment history are selected first (WS2,WS3), and then the
>>> remaining customers whose credit rating is below the threshold are
>>> filtered out (through WS1). The C2 workflow performs the same tasks
>>> in the reverse order.
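[Editor's note: the C1/C2 example above can be sketched with a simple cost model. All selectivities and per-call costs below are invented for illustration; the thread itself gives no numbers. Note that precedence constraints still apply — WS3 needs the card numbers produced by WS2, so both C1 and C2 keep WS2 before WS3.]

```python
# Sketch: each filtering service is modelled by a selectivity (the
# fraction of input items it keeps) and a per-item call cost. The
# total cost of a linear workflow is the number of calls made, with
# one call per item that survives the preceding filters.
def workflow_cost(order, n_inputs, services):
    """Total number of service calls for a linear workflow."""
    total_calls = 0.0
    remaining = float(n_inputs)
    for name in order:
        selectivity, cost_per_call = services[name]
        total_calls += remaining * cost_per_call
        remaining *= selectivity  # only surviving items flow onward
    return total_calls

# Hypothetical statistics, not from the thread:
services = {
    "WS1": (0.8, 1.0),  # credit-rating filter: keeps 80% of inputs
    "WS2": (1.0, 1.0),  # SSN -> card numbers: no filtering
    "WS3": (0.2, 1.0),  # payment-history filter: keeps only 20%
    "WS4": (1.0, 1.0),  # SSN -> email addresses: no filtering
}

c1 = workflow_cost(["WS2", "WS3", "WS1", "WS4"], 1000, services)
c2 = workflow_cost(["WS1", "WS2", "WS3", "WS4"], 1000, services)
# With these numbers C1 makes 2360 calls and C2 makes 2760, so
# invoking the more selective WS3 earlier is cheaper, as Efthymia argues.
```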
>>> The above linear workflows may have different performance; if WS3
>>> filters out more data than WS1, then it is more beneficial to invoke
>>> WS3 before WS1, so that the subsequent web services in the workflow
>>> process less data.
>>>
>>> It would be very useful to know whether there exist similar scientific
>>> workflow examples (where users have many options for ordering the
>>> workflow tasks but cannot decide which task ordering to use, while the
>>> workflow performance depends on the task invocation order), and
>>> whether you would be interested in using optimizers for such types of
>>> workflows.
>>>
>>> I am asking because I have recently developed an optimization
>>> algorithm for this problem and I would like to test its performance in
>>> a real-world workflow management system with real-world workflows.
>>>
>>> P.S.: References to publications or any other information dealing with
>>> scientific workflows of the above rationale will be extremely useful.
>>>
>>> Thank you very much for your time.
>>>
>>> ------------------------------------------------------------------------------
>>> All of the data generated in your IT infrastructure is seriously valuable.
>>> Why? It contains a definitive record of application performance, security
>>> threats, fraudulent activity, and more. Splunk takes this data and makes
>>> sense of it. IT sense. And common sense.
>>> http://p.sf.net/sfu/splunk-d2d-c2
>>> _______________________________________________
>>> taverna-users mailing list
>>> [email protected]
>>> [email protected]
>>> Web site: http://www.taverna.org.uk
>>> Mailing lists: http://www.taverna.org.uk/about/contact-us/
>>
>> --
>> Stian Soiland-Reyes, myGrid team
>> School of Computer Science
>> The University of Manchester
>
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: [email protected]
> http://www.eaglegenomics.com/
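[Editor's note: the grouped-call optimisation Stian describes earlier in the thread — collapsing ~40,000 single-identifier calls into ~1,000 comma-separated batched calls — can be sketched as follows. The batch size is hypothetical; as Stian notes, the real limit depends on how many identifiers the service will tolerate per request.]

```python
# Sketch: instead of one service call per identifier, send
# comma-separated batches of identifiers in each call.
def batched_calls(identifiers, batch_size=40):
    """Yield comma-separated groups of at most batch_size identifiers."""
    for i in range(0, len(identifiers), batch_size):
        yield ",".join(identifiers[i:i + batch_size])

# 40,000 identifiers collapse into 1,000 grouped requests:
ids = [f"seq{i}" for i in range(40000)]
batches = list(batched_calls(ids))
```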
