Re: [Taverna-hackers] Taverna, SCUFL2 and wfdesc

Stian Soiland-Reyes Fri, 24 Oct 2014 09:09:13 -0700

On 23 October 2014 18:02, Nebojsa Tijanic <[email protected]> wrote:
>
> Thanks a lot for the links and explanations. I've seen some taverna workflow
> XML files earlier and they seemed too tightly coupled with Java to suit our
> purpose (I suppose they were from older version).

Yes, that is particularly one reason why we moved away from the t2flow
format - it relied on a particular serialization from Java that was hard to
deal with outside the engine code when those classes are not available.

> The Scufl2 format seems a
> lot more similar to what we've been discussing here. I didn't understand a
> lot of things about it, hope you don't mind me asking a few questions:

Feel free! I am sorry that our scufl2 documentation is still lacking quite
a bit..

We have only used Scufl2 for Taverna workflows, so it follows the Taverna
execution semantics quite closely. For other systems it would be natural
to reuse what can be equivalent (e.g. processors, ports, data links) and
leave out what is not directly mappable, e.g. the dispatch layers.

We have made processors be configurable as well, so if there is no need to
distinguish between a node and its execution, then profiles for workflow
system X could skip activities and their binding to processors.

> - If I understand correctly, the "parallelize" layer would process
> individual items of a collection supplied on an input port if incoming
> collection depth is larger than declared port depth. It seems that
> iterationStrategyStack is meant to specify what happens if there is more
> than one port with depth > expected. Is this correct? If so, are there any
> strategies apart from combinations of Cartesian product and zipping?

That's right. If the port depths of inputs match those on the link, there
would be no iterations and no parallelization within that processor - it
is simply executed 'as is'. That is also the interpretation if there is no
iteration strategy defined. (In Taverna 1 we would autogenerate a cross
product if nothing was defined, but this gave unpredictable iterations with
more than two inputs)

If the depth on the incoming link is less than the expected depth (e.g. a
single value when expecting a list) it is simply wrapped in singleton
list(s).
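
As a rough illustration of that wrapping rule (a sketch only, not the engine's
code; the helper names depth and wrap_to_depth are made up for this example):

```python
def depth(value):
    """Nesting depth: 0 for a single value, 1 for a list of values, and so on."""
    d = 0
    while isinstance(value, list):
        d += 1
        # Look at the first element to keep descending; an empty list ends the walk.
        value = value[0] if value else None
    return d


def wrap_to_depth(value, expected_depth):
    """Wrap a too-shallow value in singleton lists until it reaches expected_depth."""
    while depth(value) < expected_depth:
        value = [value]
    return value


print(wrap_to_depth("a", 1))         # single value where a list is expected
print(wrap_to_depth("a", 2))         # single value where a list of lists is expected
print(wrap_to_depth(["a", "b"], 1))  # depths already match, nothing to wrap
```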

There are only those two strategies at the moment, which we call cross
product and dot product (the Cartesian product and zipping you mention).

http://dev.mygrid.org.uk/wiki/display/tav250/List+handling
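
To make the two strategies concrete, here is a minimal sketch in plain Python.
It only shows which input combinations each strategy produces; it is not how
the engine implements them, and the function names are invented for the
example. Note that zip() silently stops at the shortest list, which is a
property of this sketch rather than a statement about Taverna.

```python
from itertools import product


def cross_product(*lists):
    """Every combination of one item from each input list (Cartesian product)."""
    return list(product(*lists))


def dot_product(*lists):
    """Items paired up by position: first with first, second with second, ..."""
    return list(zip(*lists))


numbers = [1, 2]
letters = ["a", "b"]

print(cross_product(numbers, letters))  # four invocations
print(dot_product(numbers, letters))    # two invocations
```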

It is also possible, in SCUFL2, to combine multiple layers in the iteration
strategy stack to do, say, an outer dot product of lists at depth 2 and a
cross product of the inner lists at depth 1. This is quite complex to
explain to the users, so we never added this to the user interface. We
instead explain how to do this with a nested workflow -
http://dev.mygrid.org.uk/wiki/display/tav250/List+handling#Listhandling-Usingnestedworkflowstotweaklisthandling

In our engine we also have a third option which we have never really used -
PrefixDotProduct.

> Matches jobs where the index array of the job on index 0 is the prefix of
> the index array of the job on index 1. This node can only ever have exactly
> two child nodes!

As it was never used we did not add support for this to SCUFL2.
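
Read literally, that description could be sketched as below. This is only my
reading of the quoted sentence, not the engine code: jobs are modelled as
(index tuple, value) pairs, and giving the combined job the longer index is an
assumption made for the example.

```python
def prefix_dot_product(jobs0, jobs1):
    """Pair each job from child 0 with the child-1 jobs whose index it prefixes."""
    pairs = []
    for index0, value0 in jobs0:
        for index1, value1 in jobs1:
            # index0 is a prefix of index1 if index1 starts with the same positions.
            if index1[:len(index0)] == index0:
                pairs.append((index1, (value0, value1)))
    return pairs


outer = [((0,), "x"), ((1,), "y")]
inner = [((0, 0), "a"), ((0, 1), "b"), ((1, 0), "c")]

for index, values in prefix_dot_product(outer, inner):
    print(index, values)
```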

Are there other iteration mechanisms you are thinking of? E.g. "Just first
value" or something?

> - What is the difference between portDepth and granularPortDepth?

This has to do with streaming activities. Most Taverna activities always
have the same value for both (in scufl2 you can therefore leave
granularPortDepth==null to mean the same). However, there are some services
that output a list, but which are able to give back those values one at a
time. The Taverna Engine supports this and forwards those items to any
service downstream that expects individual values. The original service
does, however, still have to return the final list when it has finished
iterating (if it ever finishes); the iteration strategy and any downstream
services expecting more than single values wait for that list.

This is for instance the case for a service that retrieves a large CSV file
or does a large SQL query, and can spit out the individual column values,
row by row, even before the whole file has been transferred back.

In this case, granularPortDepth=0 and portDepth=1.

When modifying granularPortDepth there are many port combinations that
don't make sense, as you have to output on all the granular ports at the
same time. E.g. if you say output ports A, B and C have granular depth 0
and depth 1, while port D has granular depth and depth 0, then you have to
return all of A/B/C for every granular return.

{ a1, b1, c1 } # granular
{ a2, b2, c2 } # granular
{ a3, b3, c3 } # granular
{ [a1,a2,a3] [b1,b2,b3] [c1,c2,c3] d } # final
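
The same sequence can be mimicked with a Python generator. This is just a toy
model of the behaviour described above, not a Taverna API: the function name
and the ("granular" / "final") event labels are invented for the example.

```python
def streaming_service(rows, d):
    """Emit A/B/C together for each row, then one final result with full lists."""
    all_a, all_b, all_c = [], [], []
    for a, b, c in rows:
        all_a.append(a)
        all_b.append(b)
        all_c.append(c)
        # Granular return: every granular port (A, B, C) is emitted at the same time.
        yield ("granular", {"A": a, "B": b, "C": c})
    # Final return: the complete depth-1 lists, plus D which is never streamed.
    yield ("final", {"A": all_a, "B": all_b, "C": all_c, "D": d})


rows = [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3")]
for kind, ports in streaming_service(rows, "d"):
    print(kind, ports)
```

A downstream step that expects single values could start on each granular
event, while one that expects whole lists would have to wait for the final
event.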

> - Can port types be declared? If so, what are the available types and are
> structs supported?

Syntactic types we do as part of the activity configuration - as different
activities have different ways to describe bytes/integers/etc.

We use the configuration key "dataType" per port defined in the config, e.g.

{ "outputTypes": [
{ "port": "inputA", "dataType": "R_EXP" },
{ "port": "inputB", "dataType": "PNG_FILE" }
] }

In this example, configuring an R script, there are some pre-determined
constants which determine how the output values are picked up or delivered
to R - e.g. PNG_FILE will save a graph as a PNG, while R_EXP serializes an
R structure so that it can be passed to another R script.
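
For anyone consuming such a configuration outside Taverna, a lookup from port
name to dataType could be built roughly like this. The JSON is the fragment
from above; the helper name data_types_by_port is made up, and a real
activity configuration will contain other keys that this sketch ignores.

```python
import json

config_text = """
{ "outputTypes": [
    { "port": "inputA", "dataType": "R_EXP" },
    { "port": "inputB", "dataType": "PNG_FILE" }
] }
"""


def data_types_by_port(config):
    """Map each configured port name to its declared dataType constant."""
    return {entry["port"]: entry["dataType"] for entry in config["outputTypes"]}


types = data_types_by_port(json.loads(config_text))
print(types["inputB"])
```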

Most of the time we don't need to define the syntactic types in the
workflow definition, as the service implementation will know based on the
rest of the configuration - e.g. if we are using a WSDL Service then the
XSD will say that for portA we need a string, and for portB we need to
base64-encode a binary. A REST service will know based on the declared
Content-Type if the input is binary or string - and it will know from the
HTTP-returned Content-Type what we're getting back.

In t2flow we often included this inferred information at definition time,
but we found that usually it would end up out of sync, wrong or confusing
(many home-brewed MIME types that were never used by anything). With SCUFL2
we wanted to move to a more prescriptive workflow definition.

Semantic types we do as annotations as they do not affect the execution. We
also see these as advisory, as the service might change its mind overnight
and return something different (or, more likely, something even more
specific).
We use http://purl.org/DP/components - see the extracted wfdesc (which
includes all annotations):

https://github.com/wf4ever/scufl2-wfdesc/blob/master/src/test/resources/valid_component_imagemagickconvert.wfdesc.ttl

> - In the paper, the example for loops was an async service. This seems like
> it saves some compute resources (a thread), but like it also could have been
> done by having the processor poll+block. Have you encountered any other
> cases where loops were needed?

Yes, it can also be implemented by the specific activity/processor, which
several of our plugins do. It is however difficult to support this for
generic services as no agreed system for poll/getResult is established.

The loop can also be used for user-driven loops, where you have an
interactive step that asks for tweaked parameters or "is this result good
enough". This can even be automated, as you can configure the loop to feed
output ports back as new input ports. In this case your inner workflow must
output a "loop": "false" value to stop the loop.

The loop condition checker can also call out to something in the outside
world, e.g. loop until the weather forecast is sunny or a sensor measurement
is within range.
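
The feedback variant could be pictured like this. It is a simplified sketch of
the idea, not the engine's loop layer: the inner workflow is modelled as a
plain function from an inputs dict to an outputs dict, and the max_iterations
guard is my own addition for the example.

```python
def run_loop(inner_workflow, inputs, max_iterations=100):
    """Re-run inner_workflow, feeding outputs back as inputs, until "loop" is "false"."""
    for _ in range(max_iterations):
        outputs = inner_workflow(inputs)
        if outputs.get("loop") == "false":
            return outputs
        # Feed every output except the loop flag back in as the next round's inputs.
        feedback = {port: value for port, value in outputs.items() if port != "loop"}
        inputs = {**inputs, **feedback}
    raise RuntimeError("loop did not stop within max_iterations")


def refine(inputs):
    """Toy inner workflow: bump x by one and ask to stop once it reaches 3."""
    x = inputs["x"] + 1
    return {"x": x, "loop": "false" if x >= 3 else "true"}


print(run_loop(refine, {"x": 0}))
```

An interactive step or a check against an external sensor would simply take
the place of refine here.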

> Regarding the tool service, the model seems straightforward (by looking at
> the screenshots). The advanced/file_lists wiki page is empty, and that was
> one of the things I didn't get: how do input/output ports map to tools that
> work with lists of files (of arbitrary length)?

It gets trickier.. as Taverna lists are ordered. For input lists you can
say to put it as a folder, and we'll make folder/0, folder/1 etc. and also
produce an index file.

We have not yet got support for arbitrary lists of outputs, but I guess you
could do the same in reverse.
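
A sketch of that folder layout, in both directions. The index file name
(index.txt) and its one-name-per-line format are assumptions made for this
example, as is reading a list back from a folder, which Taverna does not
support yet.

```python
import os

INDEX_NAME = "index.txt"  # hypothetical name; the real index file format is not specified here


def write_list_as_folder(values, folder):
    """Write an ordered list as folder/0, folder/1, ... plus an index file."""
    os.makedirs(folder, exist_ok=True)
    names = []
    for position, value in enumerate(values):
        name = str(position)
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            handle.write(value)
        names.append(name)
    # The index records the order, since directory listings are not ordered.
    with open(os.path.join(folder, INDEX_NAME), "w", encoding="utf-8") as handle:
        handle.write("\n".join(names))
    return names


def read_folder_as_list(folder):
    """Inverse direction: rebuild the ordered list by following the index file."""
    with open(os.path.join(folder, INDEX_NAME), encoding="utf-8") as handle:
        names = handle.read().splitlines()
    values = []
    for name in names:
        with open(os.path.join(folder, name), encoding="utf-8") as handle:
            values.append(handle.read())
    return values
```

The index is what preserves list order for a tool that only sees a directory.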

>
> Thanks,
>
> On Tue, Oct 21, 2014 at 5:42 PM, Stian Soiland-Reyes <[email protected]>
> wrote:
>>
>> Hi,
>>
>> I am one of the developers of the Taverna workflow system -
>> http://taverna.org.uk/
>>
>> (btw - we have just recently been accepted as an Apache Incubator project
>> - http://incubator.apache.org/projects/taverna.html )
>>
>>
>>
>> In Taverna 2, we had our workflow definition language called t2flow. It is
>> a fairly one-to-one mapping to internal Java objects in Taverna, and people
>> found it hard to develop against it (although several did anyway, e.g. a
>> web-based editor) - see http://ns.taverna.org.uk/2008/xml/t2flow/ for the
>> gory details :) .
>>
>>
>>
>> For Taverna 3 we therefore made the Scufl2 workflow format -
>> http://dev.mygrid.org.uk/wiki/display/developer/SCUFL2 - and a corresponding
>> Java API - http://github.com/taverna/taverna-scufl2. SCUFL2 is not meant as
>> a generic workflow language, but as a way to generalize the Taverna workflow
>> model. See our "Taverna, reloaded" paper
>> http://www.taverna.org.uk/pages/wp-content/uploads/2010/04/T2Architecture.pdf
>> for details of the Taverna execution model.
>>
>> In short, a Scufl2 Workflow Bundle is a structured ZIP file with a series
>> of XML files, which follow XSD schemas (but are also valid RDF/XML). The
>> schemas are currently at
>>
>> https://github.com/taverna/taverna-scufl2/tree/master/scufl2-rdfxml/src/main/resources/uk/org/taverna/scufl2/rdfxml/xsd
>>
>> Within the bundle, configuration of each step (e.g. a command line tool
>> invocation) is described as a JSON file - their structure varies per activity
>> type, and Taverna has quite a few activity types from plugins -
>> http://dev.mygrid.org.uk/wiki/display/tav250/Service+types
>>
>>
>> The JSON for the Command Line Tool activity, which I guess is most
>> relevant to you, is unfortunately a bit in flux - we feel a need for a
>> cleaner separation between the Tool Definition (command line, parameters,
>> input and output files, etc) and the invocation details (e.g. where to SSH
>> or how to create symbolic links).
>>
>>
>> In Taverna's Workbench application we have a UI for creating such
>> configuration on an ad-hoc basis with a kind of template-based shell script.
>> I mentioned this in the call. See
>> http://dev.mygrid.org.uk/wiki/display/tav250/Command and sibling pages.
>>
>> It is possible to load a set of these descriptions from an XML file, to
>> browse possible command line tools. Obviously this also requires these tools
>> to be installed, so this has been used mainly within a grid execution
>> infrastructure like Nordugrid/KnowARC. The most used XML is
>> http://taverna.nordugrid.org/sharedRepository/xml.php which is generated
>> from http://taverna.nordugrid.org/sharedRepository/index.php
>>
>>
>> --
>> Stian Soiland-Reyes
>> http://orcid.org/0000-0001-9842-9718
>>
>
>
>
>
> --
> Nebojsa Tijanic
> Seven Bridges Genomics, Inc.

--
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester
http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718
