If I got this right, Efthymia is suggesting that workflows are not unlike
SQL queries. They consist of a number of data sources which are filtered and
joined to produce the final result. The order in which they are filtered and
joined is the decision of the query optimiser, which can be helped by hints
inserted by the query author but largely has to rely on observed statistics to
make decisions.
The key difference is that query optimisers usually have full access to all
stats on all the data sources involved. If stats are missing then they just go
by the order originally specified by the query author in the SQL statement.
Obviously such stats are not immediately available when it comes to services.
If you imagine a workflow as a pseudo-SQL statement it might help (plain
joins here, since the where clause filters on both joined tables anyway):
select a.seqID, b.E_VALUE
from MY_FASTA_FILE a
join BLAST_RESULTS b on a.seqID = b.querySeqID
join ENSEMBL_DB c on a.seqID = c.xref_id
where b.db = 'human'
and b.E_VALUE < 0.0001
and c.xref_db = 'custom mappings';
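As a runnable illustration of the pseudo-SQL above, here is a sketch using Python's sqlite3 with invented toy data (every table row and E-value below is made up; plain joins are used because the where clause filters both joined tables anyway):

```python
import sqlite3

# In-memory database standing in for the three data sources.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE my_fasta_file (seqID TEXT);
CREATE TABLE blast_results (querySeqID TEXT, db TEXT, E_VALUE REAL);
CREATE TABLE ensembl_db    (xref_id TEXT, xref_db TEXT);
INSERT INTO my_fasta_file VALUES ('s1'), ('s2'), ('s3');
INSERT INTO blast_results VALUES
  ('s1', 'human', 1e-6), ('s2', 'human', 0.5), ('s3', 'mouse', 1e-9);
INSERT INTO ensembl_db VALUES
  ('s1', 'custom mappings'), ('s3', 'custom mappings');
""")

# Both filters must pass: s2 fails the E-value cut, s3 fails db='human',
# so only s1 survives.
rows = con.execute("""
SELECT a.seqID, b.E_VALUE
FROM my_fasta_file a
JOIN blast_results b ON a.seqID = b.querySeqID
JOIN ensembl_db    c ON a.seqID = c.xref_id
WHERE b.db = 'human'
  AND b.E_VALUE < 0.0001
  AND c.xref_db = 'custom mappings'
""").fetchall()
print(rows)  # [('s1', 1e-06)]
```

The query engine is free to apply the BLAST filter or the Ensembl filter first; the result set is identical either way, which is exactly what makes reordering safe.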
Efthymia appears to have written a query optimiser that can reorganise the
steps of a workflow to make the most efficient use of available services. This
is not so much about whether a service can or will accept multiple values in a
query (although that is definitely part of the problem) as about the number of
items returned by each call to a service, and how much that is affected by
pre-filtering of the input/selection criteria by other services. In the
example above the query/workflow would perform differently depending on whether
more of the input seqIDs were filtered out by the BLAST step or the Ensembl
step, and performance would also be affected by which of the two services
responded faster. The optimiser would know this and reorder accordingly (the
logical choice in this example being to apply the Ensembl filter first, even
though the workflow specifies BLAST first).
I think this sounds like a good idea. It would take clever tricks to gather
usable statistics about services, but once those stats were available the
optimiser could easily decide whether to dynamically reorder the workflow
components to achieve the most efficient execution time.
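For the curious, one classic heuristic such an optimiser might apply (a sketch only, not Efthymia's actual algorithm; the selectivities and per-item costs below are invented for illustration) is to order independent filtering steps by cost divided by the fraction of items removed, so cheap and highly selective steps run first:

```python
# Sketch of a stats-driven reorderer for independent filtering steps.
# selectivity = fraction of input items that survive the filter,
# cost = average per-item service response time (seconds).
# All numbers are invented for illustration.

def rank(step):
    sel, cost = step["selectivity"], step["cost"]
    # Classic predicate-ordering rank: cheaper and more selective first.
    return cost / (1.0 - sel) if sel < 1.0 else float("inf")

def total_cost(order, n_items):
    """Expected total time: each step sees what earlier steps let through."""
    total, remaining = 0.0, float(n_items)
    for step in order:
        total += remaining * step["cost"]
        remaining *= step["selectivity"]
    return total

steps = [
    {"name": "BLAST",   "selectivity": 0.8, "cost": 2.0},  # slow, weak filter
    {"name": "Ensembl", "selectivity": 0.1, "cost": 0.5},  # fast, strong filter
]

best = sorted(steps, key=rank)
print([s["name"] for s in best])    # Ensembl first
print(total_cost(steps, 1000))      # workflow-specified order: 2400.0s
print(total_cost(best, 1000))       # optimised order: 700.0s
```

With these made-up numbers the optimised order does the same work in less than a third of the time, purely because the strong filter shrinks the input to the expensive service.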
cheers,
Richard.
On 8 Jul 2011, at 15:45, Stian Soiland-Reyes wrote:
> I think you have described almost every bioinformatics genomics workflow.
>
> You start with a small selection, search and expand out to lots of
> candidates, then filter and narrow down the search space before
> finding more about those prioritised/matched items.
>
> The workflow designer must get an understanding of where the large
> data sizes and processing times are before they can determine the
> order of many of these operations - and so the first workflows are
> probably very inefficient compared to the later ones, when it's
> becoming more clear where one needs to filter, which services can be
> done later ('fill in details' services which don't contribute to
> filtering).
>
> However, we've often seen cases where involving a computer scientist
> in reviewing the workflow can provide further optimisation. For
> instance, in one case we realised that a service unofficially
> supported multiple identifiers for search by using comma separation.
> That meant we could move from 40,000 individual service calls to 1,000
> grouped calls (the service still fell over if you gave it too many in
> that list!).
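The grouping trick Stian describes can be sketched in a few lines (the batch size of 40 and the identifier format are invented; the real service and its limits would differ):

```python
def chunked(ids, size):
    """Split a list of identifiers into comma-separated batches."""
    for i in range(0, len(ids), size):
        yield ",".join(ids[i:i + size])

# 40,000 invented identifiers, as in the anecdote above.
ids = [f"id{n}" for n in range(40000)]

# One call per identifier would mean 40,000 requests; batching 40
# identifiers per comma-separated query gives 1,000 grouped calls.
batches = list(chunked(ids, 40))
print(len(batches))  # 1000
```

A real implementation would also need a cap discovered by experiment, since (as noted above) the service still falls over past some undocumented batch size.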
>
>
> On Thu, Jul 7, 2011 at 13:01, Efthymia Tsamoura <[email protected]> wrote:
>> Hello,
>> I am a PhD student, currently working on workflow optimization
>> problems in distributed environments. I would like to ask whether
>> there are any cases where changing the order of task invocation in a
>> scientific workflow changes its performance without affecting the
>> produced results. Below is a small use case of the problem I am
>> interested in:
>>
>> Suppose that a company wants to obtain a list of email addresses of
>> potential customers, selecting only those who have a good payment
>> history for at least one card and a credit rating above some
>> threshold. The company has the right to use the following web services:
>>
>> WS1 : SSN id (ssn, threshold) -> credit rating (cr)
>> WS2 : SSN id (ssn) -> credit card numbers (ccn)
>> WS3 : card number (ccn, good) -> good history (gph)
>> WS4 : SSN id (ssn) -> email addresses (ea)
>>
>> The input data containing customer identifiers (ssn) and other
>> relevant information is stored in a local data resource. Two possible
>> web service linear workflows that can be formed to process the input
>> data using the above services are C1 = WS2,WS3,WS1,WS4 and C2 =
>> WS1,WS2,WS3,WS4. In the first workflow, the customers with a good
>> payment history are selected first (WS2, WS3), and then those whose
>> credit rating is below the threshold are filtered out (WS1). The C2
>> workflow performs the same tasks in the reverse order. The two linear
>> workflows may have different performance: if WS3 filters out more data
>> than WS1, then it is more beneficial to invoke WS3 before WS1, so that
>> the subsequent web services in the workflow process less data.
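To make the C1-versus-C2 comparison concrete, here is a toy cost model (all selectivities and per-call costs are invented; in practice they would have to come from observed statistics):

```python
# Toy cost model for the two orderings C1 and C2 described above.
# selectivity = fraction of customers that survive the step,
# cost = per-customer call time in seconds; all values are invented.
services = {
    "WS1": {"selectivity": 0.5, "cost": 1.0},  # credit-rating filter
    "WS2": {"selectivity": 1.0, "cost": 0.2},  # ssn -> card numbers (no filtering)
    "WS3": {"selectivity": 0.1, "cost": 0.3},  # good-payment-history filter
    "WS4": {"selectivity": 1.0, "cost": 0.5},  # ssn -> email (no filtering)
}

def pipeline_cost(order, n_customers):
    """Expected total time: each service processes what earlier steps pass on."""
    total, remaining = 0.0, float(n_customers)
    for name in order:
        s = services[name]
        total += remaining * s["cost"]
        remaining *= s["selectivity"]
    return total

c1 = pipeline_cost(["WS2", "WS3", "WS1", "WS4"], 10000)  # 6250.0
c2 = pipeline_cost(["WS1", "WS2", "WS3", "WS4"], 10000)  # 12750.0
print(c1, c2)  # with these numbers, invoking WS3 before WS1 (C1) wins
```

Both orderings return the same final customer list; only the work done differs, which is exactly the property an optimizer can exploit.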
>>
>> It would be very useful to know whether similar scientific workflow
>> examples exist (where users have many options for ordering the
>> workflow tasks but cannot decide which ordering to use, while the
>> workflow performance depends on the task invocation order), and
>> whether you would be interested in using optimizers for such
>> workflows.
>>
>> I am asking because I have recently developed an optimization
>> algorithm for this problem and I would like to test its performance in
>> a real-world workflow management system with real-world workflows.
>>
>> P.S.: references to publications or any other information dealing
>> with scientific workflows of this kind would be extremely useful.
>>
>> Thank you very much for your time
>>
>>
>>
>>
>> ------------------------------------------------------------------------------
>> All of the data generated in your IT infrastructure is seriously valuable.
>> Why? It contains a definitive record of application performance, security
>> threats, fraudulent activity, and more. Splunk takes this data and makes
>> sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-d2d-c2
>> _______________________________________________
>> taverna-users mailing list
>> [email protected]
>> [email protected]
>> Web site: http://www.taverna.org.uk
>> Mailing lists: http://www.taverna.org.uk/about/contact-us/
>>
>
>
>
> --
> Stian Soiland-Reyes, myGrid team
> School of Computer Science
> The University of Manchester
>
--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: [email protected]
http://www.eaglegenomics.com/