Thanks Richard for a good summary, and it does sound like quite exciting research!
On Fri, Jul 8, 2011 at 16:24, Richard Holland <[email protected]> wrote:
> If I got this right, Efthymia is suggesting that workflows are not very
> unlike SQL queries. They consist of a number of data sources which are
> filtered and joined to produce the final result. The order in which they
> are filtered and joined is the decision of the query optimiser, which can
> be helped by hints inserted by the query author but largely has to rely on
> observed statistics to make decisions.
>
> The key difference is that query optimisers usually have full access to
> all stats on all the data sources involved. If stats are missing then they
> just go by the order originally specified by the query author in the SQL
> statement. Obviously such stats are not immediately available when it
> comes to services.
>
> If you can imagine a workflow as a pseudo-SQL statement it might help:
>
> select a.seqID, b.E_VALUE
> from MY_FASTA_FILE a
> left join BLAST_RESULTS b on a.seqID=b.querySeqID
> left join ENSEMBL_DB c on a.seqID=c.xref_id
> where b.db='human'
> and b.E_VALUE<0.0001
> and c.xref_db='custom mappings';
>
> Efthymia appears to have written a query optimiser that can reorganise the
> steps of a workflow to make the most efficient use of available services.
> This is not so much about whether a service can or will accept multiple
> values in a query, although that is definitely part of the problem, but
> more about the number of items returned by a call to a service and how
> much this is affected by pre-filtering of the input/selection criteria by
> other services. In the example above the query/workflow would perform
> differently depending on whether more of the input seqIDs were filtered
> out by the BLAST step or the Ensembl step, and performance would also be
> affected by which of the two services would respond faster.
> The optimiser would know this and reorder to suit (the logical choice in
> this example being to do the Ensembl filter first, even though the
> workflow specifies to do the BLAST first).
>
> I think this sounds like a good idea. It would need clever tricks to
> actually come up with any stats about services that can be used by such an
> optimiser, but once those stats were gathered the optimiser could easily
> decide whether it would be best to dynamically reorder the workflow
> components to produce the most efficient execution time.
>
> cheers,
> Richard.
>
> On 8 Jul 2011, at 15:45, Stian Soiland-Reyes wrote:
>
>> I think you have described almost every bioinformatics genomics workflow.
>>
>> You start with a small selection, search and expand out to lots of
>> candidates, then filter and narrow down the search space before
>> finding out more about those prioritised/matched items.
>>
>> The workflow designer must get an understanding of where the large
>> data sizes and processing times are before they can determine the
>> order of many of these operations - and so the first workflows are
>> probably very inefficient compared to the later ones, when it becomes
>> clearer where one needs to filter, and which services can be
>> done later ('fill in details' services which don't contribute to
>> filtering).
>>
>> However, we've often seen cases where involving a computer scientist
>> in reviewing the workflow can provide further optimisation. For
>> instance, in one case we realised that a service unofficially
>> supported multiple identifiers per search by using comma separation.
>> That meant we could move from 40,000 individual service calls to 1,000
>> grouped calls (the service still fell over if you gave too many in
>> that list!).
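[Editor's note: the batching trick Stian describes can be sketched as below. The chunk size and identifier format are invented for illustration; the comma-separated multi-ID support was an unofficial feature of one particular service, so real services may or may not accept it, and may cap how many IDs fit in one call.]

```python
# Sketch of turning one-call-per-identifier into grouped calls by
# joining identifiers with commas, as in the anecdote above.
# The batch size and ID format are hypothetical.

def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def batched_queries(ids, batch_size=40):
    """One comma-separated query string per group of identifiers."""
    return [",".join(group) for group in chunk(ids, batch_size)]

ids = ["ID%05d" % n for n in range(40000)]
queries = batched_queries(ids, batch_size=40)
print(len(ids), "ids ->", len(queries), "grouped calls")  # 40000 -> 1000
```

With a batch size of 40, the 40,000 individual calls collapse into 1,000 grouped ones, matching the ratio Stian mentions.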
>>
>>
>> On Thu, Jul 7, 2011 at 13:01, Efthymia Tsamoura <[email protected]> wrote:
>>> Hello,
>>> I am a PhD student and at the moment I am working on workflow
>>> optimization problems in distributed environments. I would like to
>>> ask whether there exist any cases where, if the order of task
>>> invocation in a scientific workflow changes, its performance changes
>>> too without, however, affecting the produced results. In the
>>> following, I present a small use case of the problem I am interested in:
>>>
>>> Suppose that a company wants to obtain a list of email addresses of
>>> potential customers, selecting only those who have a good payment
>>> history for at least one card and a credit rating above some
>>> threshold. The company has the right to use the following web services:
>>>
>>> WS1 : SSN id (ssn, threshold) -> credit rating (cr)
>>> WS2 : SSN id (ssn) -> credit card numbers (ccn)
>>> WS3 : card number (ccn, good) -> good payment history (gph)
>>> WS4 : SSN id (ssn) -> email addresses (ea)
>>>
>>> The input data containing customer identifiers (ssn) and other
>>> relevant information is stored in a local data resource. Two possible
>>> linear web service workflows that can be formed to process the input
>>> data using the above services are C1 = WS2,WS3,WS1,WS4 and
>>> C2 = WS1,WS2,WS3,WS4. In the first workflow, the customers with a
>>> good payment history are selected first (WS2,WS3), and then the
>>> remaining customers whose credit rating is below the threshold are
>>> filtered out (through WS1). The C2 workflow performs the same tasks in
>>> the reverse order. These linear workflows may have different
>>> performance: if WS3 filters out more data than WS1, then it is more
>>> beneficial to invoke WS3 before WS1, so that the subsequent web
>>> services in the workflow process less data.
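[Editor's note: the C1/C2 comparison above can be made concrete with a toy cost model. All selectivities and per-item costs below are invented for illustration; a real optimiser, as Richard notes, would have to estimate them from observed service statistics.]

```python
# Toy cost model for ordering filtering steps by selectivity,
# after the WS1-WS4 example. Selectivity = fraction of inputs that
# survive a step; cost = work per input item. Numbers are invented.

def plan_cost(n_inputs, steps):
    """Total work for a linear plan: each step only processes
    the items that earlier steps let through."""
    total, remaining = 0.0, float(n_inputs)
    for name, cost_per_item, selectivity in steps:
        total += remaining * cost_per_item
        remaining *= selectivity
    return total

# (name, cost per item, selectivity); here WS3 filters harder than WS1.
ws1 = ("WS1 credit rating", 1.0, 0.8)
ws2 = ("WS2 card numbers", 1.0, 1.0)
ws3 = ("WS3 good history", 1.0, 0.2)
ws4 = ("WS4 email", 1.0, 1.0)

c1 = plan_cost(10000, [ws2, ws3, ws1, ws4])  # history filter first
c2 = plan_cost(10000, [ws1, ws2, ws3, ws4])  # rating filter first
print("C1 cost:", c1, " C2 cost:", c2)
```

Under these assumed numbers C1 does less total work than C2, because the highly selective WS3 runs early and shrinks the input for everything downstream; flip the selectivities of WS1 and WS3 and the ordering preference flips too.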
>>>
>>> It would be very useful to know whether there exist similar scientific
>>> workflow examples (where users have many options for ordering the
>>> workflow tasks but cannot decide which task ordering to use, while the
>>> workflow performance depends on the task invocation order), and
>>> whether you would be interested in using optimizers for such types of
>>> workflows.
>>>
>>> I am asking because I have recently developed an optimization
>>> algorithm for this problem and I would like to test its performance in
>>> a real-world workflow management system with real-world workflows.
>>>
>>> P.S.: references to publications or any other information dealing with
>>> scientific workflows of the above rationale would be extremely useful.
>>>
>>> Thank you very much for your time.
>>
>> --
>> Stian Soiland-Reyes, myGrid team
>> School of Computer Science
>> The University of Manchester
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: [email protected]
> http://www.eaglegenomics.com/

--
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester

_______________________________________________
taverna-users mailing list
[email protected]
[email protected]
Web site: http://www.taverna.org.uk
Mailing lists: http://www.taverna.org.uk/about/contact-us/
