Dear Stian and Richard,

Thank you very much for the valuable feedback.
Richard, the example that you have written is exactly what I was looking for: a workflow of several filtering and join tasks where, if the task execution order changes, the execution time changes too. I have just started searching for similar workflows on the myExperiment site.

Best Regards,
Efi

Quoting Richard Holland <[email protected]>:

> If I got this right, Efthymia is suggesting that workflows are not
> very unlike SQL queries. They consist of a number of data sources
> which are filtered and joined to produce the final result. The order
> in which they are filtered and joined is the decision of the query
> optimiser, which can be helped by hints inserted by the query author
> but largely has to rely on observed statistics to make decisions.
>
> The key difference is that query optimisers usually have full access
> to all stats on all the data sources involved. If stats are missing,
> they just go by the order originally specified by the query author
> in the SQL statement. Obviously such stats are not immediately
> available when it comes to services.
>
> If you can imagine a workflow as a pseudo-SQL statement it might help:
>
> select a.seqID, b.E_VALUE
> from MY_FASTA_FILE a
> left join BLAST_RESULTS b on a.seqID = b.querySeqID
> left join ENSEMBL_DB c on a.seqID = c.xref_id
> where b.db = 'human'
> and b.E_VALUE < 0.0001
> and c.xref_db = 'custom mappings';
>
> Efthymia appears to have written a query optimiser that can
> reorganise the steps of a workflow to make the most efficient use of
> available services. This is not so much about whether a service can
> or will accept multiple values in a query, although that is
> definitely part of the problem, but more about the number of items
> returned by a call to a service and how much this is affected by
> pre-filtering of the input/selection criteria by other services.
> In the example above the query/workflow would perform differently
> depending on whether more of the input seqIDs were filtered out by
> the BLAST step or the Ensembl step, and performance would also be
> affected by which of the two services would respond faster. The
> optimiser would know this and reorder to suit (the logical choice in
> this example being to do the Ensembl filter first, even though the
> workflow specifies doing the BLAST first).
>
> I think this sounds like a good idea. It would need clever tricks to
> actually come up with any stats about services that can be used by
> such an optimiser, but once those stats were gathered the optimiser
> could easily decide whether it would be best to dynamically reorder
> the workflow components to produce the most efficient execution time.
>
> cheers,
> Richard.
>
> On 8 Jul 2011, at 15:45, Stian Soiland-Reyes wrote:
>
>> I think you have described almost every bioinformatics genomics workflow.
>>
>> You start with a small selection, search and expand out to lots of
>> candidates, then filter and narrow down the search space before
>> finding out more about those prioritised/matched items.
>>
>> The workflow designer must get an understanding of where the large
>> data sizes and processing times are before they can determine the
>> order of many of these operations - and so the first workflows are
>> probably very inefficient compared to the later ones, when it
>> becomes clearer where one needs to filter and which services can be
>> done later ('fill in details' services which don't contribute to
>> filtering).
>>
>> However, we've often seen cases where involving a computer scientist
>> in reviewing the workflow can provide further optimisation. For
>> instance, in one case we realised that a service unofficially
>> supported multiple identifiers per search by using comma separation.
>> That meant we could move from 40,000 individual service calls to 1,000
>> grouped calls (the service still fell over if you gave it too many in
>> that list!).
>>
>>
>> On Thu, Jul 7, 2011 at 13:01, Efthymia Tsamoura
>> <[email protected]> wrote:
>>> Hello,
>>>
>>> I am a PhD student, and at the moment I am working on workflow
>>> optimization problems in distributed environments. I would like to
>>> ask whether there exist any cases where, if the order of task
>>> invocation in a scientific workflow changes, its performance changes
>>> too without, however, affecting the produced results. In the
>>> following, I present a small use case of the problem I am interested in.
>>>
>>> Suppose that a company wants to obtain a list of email addresses of
>>> potential customers, selecting only those who have a good payment
>>> history for at least one card and a credit rating above some
>>> threshold. The company has the right to use the following web services:
>>>
>>> WS1 : SSN id (ssn, threshold) -> credit rating (cr)
>>> WS2 : SSN id (ssn) -> credit card numbers (ccn)
>>> WS3 : card number (ccn, good) -> good payment history (gph)
>>> WS4 : SSN id (ssn) -> email addresses (ea)
>>>
>>> The input data containing customer identifiers (ssn) and other
>>> relevant information is stored in a local data resource. Two possible
>>> linear web service workflows that can be formed to process the input
>>> data using the above services are C1 = WS2,WS3,WS1,WS4 and C2 =
>>> WS1,WS2,WS3,WS4. In the first workflow, the customers having a
>>> good payment history are selected first (WS2,WS3), and then the
>>> remaining customers whose credit rating is below the threshold are
>>> filtered out (through WS1). The C2 workflow performs the same tasks
>>> in the reverse order.
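[Editor's note: the C1/C2 example above can be sketched with a simple cost model. All selectivities and per-call costs below are invented for illustration; the thread itself gives no numbers. Note that precedence constraints still apply — WS3 needs the card numbers produced by WS2, so both C1 and C2 keep WS2 before WS3.]

```python
# Sketch: each filtering service is modelled by a selectivity (the
# fraction of input items it keeps) and a per-item call cost. The
# total cost of a linear workflow is the number of calls made, with
# one call per item that survives the preceding filters.
def workflow_cost(order, n_inputs, services):
    """Total number of service calls for a linear workflow."""
    total_calls = 0.0
    remaining = float(n_inputs)
    for name in order:
        selectivity, cost_per_call = services[name]
        total_calls += remaining * cost_per_call
        remaining *= selectivity  # only surviving items flow onward
    return total_calls

# Hypothetical statistics, not from the thread:
services = {
    "WS1": (0.8, 1.0),  # credit-rating filter: keeps 80% of inputs
    "WS2": (1.0, 1.0),  # SSN -> card numbers: no filtering
    "WS3": (0.2, 1.0),  # payment-history filter: keeps only 20%
    "WS4": (1.0, 1.0),  # SSN -> email addresses: no filtering
}

c1 = workflow_cost(["WS2", "WS3", "WS1", "WS4"], 1000, services)
c2 = workflow_cost(["WS1", "WS2", "WS3", "WS4"], 1000, services)
# With these numbers C1 makes 2360 calls and C2 makes 2760, so
# invoking the more selective WS3 earlier is cheaper, as Efthymia argues.
```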
>>> The above linear workflows may have different performance; if WS3
>>> filters out more data than WS1, then it is more beneficial to invoke
>>> WS3 before WS1, so that the subsequent web services in the workflow
>>> process less data.
>>>
>>> It would be very useful to know whether there exist similar scientific
>>> workflow examples (where users have many options for ordering the
>>> workflow tasks but cannot decide which task ordering to use, while the
>>> workflow performance depends on the task invocation order), and
>>> whether you would be interested in using optimizers for such types of
>>> workflows.
>>>
>>> I am asking because I have recently developed an optimization
>>> algorithm for this problem and I would like to test its performance in
>>> a real-world workflow management system with real-world workflows.
>>>
>>> P.S.: References to publications or any other information dealing with
>>> scientific workflows of the above rationale will be extremely useful.
>>>
>>> Thank you very much for your time.
>>>
>>> ------------------------------------------------------------------------------
>>> All of the data generated in your IT infrastructure is seriously valuable.
>>> Why? It contains a definitive record of application performance, security
>>> threats, fraudulent activity, and more. Splunk takes this data and makes
>>> sense of it. IT sense. And common sense.
>>> http://p.sf.net/sfu/splunk-d2d-c2
>>> _______________________________________________
>>> taverna-users mailing list
>>> [email protected]
>>> [email protected]
>>> Web site: http://www.taverna.org.uk
>>> Mailing lists: http://www.taverna.org.uk/about/contact-us/
>>
>> --
>> Stian Soiland-Reyes, myGrid team
>> School of Computer Science
>> The University of Manchester
>
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: [email protected]
> http://www.eaglegenomics.com/
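[Editor's note: the grouped-call optimisation Stian describes earlier in the thread — collapsing ~40,000 single-identifier calls into ~1,000 comma-separated batched calls — can be sketched as follows. The batch size is hypothetical; as Stian notes, the real limit depends on how many identifiers the service will tolerate per request.]

```python
# Sketch: instead of one service call per identifier, send
# comma-separated batches of identifiers in each call.
def batched_calls(identifiers, batch_size=40):
    """Yield comma-separated groups of at most batch_size identifiers."""
    for i in range(0, len(identifiers), batch_size):
        yield ",".join(identifiers[i:i + batch_size])

# 40,000 identifiers collapse into 1,000 grouped requests:
ids = [f"seq{i}" for i in range(40000)]
batches = list(batched_calls(ids))
```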
