Hi Keith,

On 13.04.2015 at 20:12, Keith Suderman wrote:
> Hello,
> 
> Our group is investigating using Galaxy as a workflow engine for NLP (Natural 
> Language Processing) tasks. 

Good choice! :)

> I have installed a local Galaxy instance and created wrappers for the 
> services we use and so far everything is working great.  I do have a few 
> questions and they all fall under the “Advanced Topics”  section as defined 
> at the end of the tutorial for creating a Histogram [1]
> 
> 1. parameter validation: 
> 
> Many of our tools rely on additions made by previous tools in the workflow; 
> for example, a tool that identifies noun phrases may require that the input 
> has been run through a part of speech (POS) tagger, the POS tagger may 
> require that the input has been run through a tokenizer, etc.  Our tools can 
> do this validation, I am just looking for a way to wire this into Galaxy so a 
> user can only connect tools in the workflow editor if this validation passes.
> 
> I have been looking at the code for 
> lib/galaxy/tools/parameters/validation.py and I don’t see anything that I can 
> (easily) bend to our use case.  What I was hoping for was something like:
> 
>       <input type="data" format="our_custom_format" name="input">
>               <validator type="dataset_custom">
>                       <command interpreter="bash">validate.sh $input</command>
>                       <!-- OR -->
>                       <tool file="custom_validator.xml"/>
>               </validator>
>       </input>

Can you tell me how your tool detects whether its input was already
processed by another tool? Metadata detection? Is it a different file type?
If so, you can define your own datatype(s). Each of your tools can then only
consume the file types that another tool outputs, and so on.
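
To make the datatype idea a bit more concrete, here is a rough sketch in
plain Python. It is not Galaxy code: the class names Text, Tokens and
PosTagged and the helper can_connect are made up for illustration. The
assumption is that every processing stage gets its own datatype, and that
an input accepts its declared format or any subtype of it.

```python
# Hypothetical model of datatype-based wiring; none of these names come from Galaxy.

class Text:
    """Raw, unprocessed document."""


class Tokens(Text):
    """Document that has been run through a tokenizer."""


class PosTagged(Tokens):
    """Tokenized document that also carries part-of-speech tags."""


def can_connect(output_type: type, input_type: type) -> bool:
    """Allow a connection only if the output is the required type or a subtype of it."""
    return issubclass(output_type, input_type)


# A noun-phrase tool that requires POS-tagged input:
print(can_connect(PosTagged, PosTagged))  # True: POS tagger output fits
print(can_connect(Tokens, PosTagged))     # False: tokenizer output lacks POS tags
```

With a hierarchy like this, the "was it tagged before?" check becomes a
question of which format a dataset has, rather than a script that has to
inspect the file.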

> I also see the tantalizing sentence, "Custom code execution at various time 
> points of the workflow that allows a fine grained control over the execution 
> process", but I can't find any examples of how this is done.

This is currently only accessible via the API I think. The backend is
currently under testing and it will be integrated during the next
releases afaik.

> 2. data repositories / data collections
> 
> I need to be able to process collections of data pulled from remote servers. 
> I have been looking at DataManagers and data collections in Galaxy, but 
> everything seems to assume 
> the data is local to the server, or can be copied/uploaded to the server.

This is the preferred way, for reproducibility reasons.

> For practical and legal reasons beyond my pay grade this is not a solution in 
> our case.  

> For example, an organization may be willing to allow our users to query their 
> service for documents, 
> run the documents through our workflow, and store the intermediate results; 
> but they will not allow us to copy their data to another server verbatim.  

> There are possibilities for me to cache data, but the general use case is 
> that I have to call an external service to fetch documents one at a 
> time and then run the same workflow on each document.

I don't think you can use data collections for this :(
What you can do is simply write a tool which takes a URL, fetches
the document, and does the first step. But in the end this result/document
will be stored on a server.

> Any suggestions on how to accomplish this in Galaxy?  I can do single 
> documents, I just need to expand this to include collections of documents.  
> A typical workflow might look something like:
> 
> a) Query Tool -> Server, find all documents that contain the word “cheese”
> b) Server -> Here is the list of document IDs [ id1, id2, …, idn ]
> c) WorkFlow -> for each id in the list do
>       c1) Download document 
>       c2 ) Work work work work…
>       c3) Persist output
> 
> I can do all of the above except the most important bit; iterating…

Oh yes, this is simple. Just create one workflow that deals with one ID.
You can then run this workflow on multiple IDs.
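
As a minimal sketch of that loop, assuming the list of IDs from step (b) is
already available on the client side: run_workflow_per_id and fake_run are
hypothetical names, and fake_run only stands in for whatever call actually
starts the single-document workflow (for example through the Galaxy API).

```python
from typing import Callable, Dict, Iterable


def run_workflow_per_id(doc_ids: Iterable[str],
                        run_one: Callable[[str], str]) -> Dict[str, str]:
    """Invoke the single-document workflow once per ID and collect the results."""
    results: Dict[str, str] = {}
    for doc_id in doc_ids:
        results[doc_id] = run_one(doc_id)
    return results


def fake_run(doc_id: str) -> str:
    # Stand-in for the real call that would start the workflow for one document.
    return f"output-for-{doc_id}"


print(run_workflow_per_id(["id1", "id2"], fake_run))
# {'id1': 'output-for-id1', 'id2': 'output-for-id2'}
```

Passing the invocation in as a function keeps the iteration separate from
how the workflow is started, so the same loop works with any client.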

> 
> 3. format conversion:  
> 
> Is it possible for Galaxy to automatically convert between formats when 
> designing a workflow?  
> I see the <change_format/> tag, but that seems to change the output format of 
> a tool based on the input 
> (or some other condition) in the same tool; I need to be able to change the 
> format based on the input requirements 
> of the next tool in the workflow. For example, if Tool A produces format X, 
> Tool B requires format Y,  and a converter 
> from X to Y has been defined in the datatypes_conf.xml; I would like for 
> Galaxy to implicitly insert the converter 
> from X to Y when I drag the output noodle from Tool A to Tool B in the 
> designer.  Is this possible? 

Oh yes, this is supported out of the box!
See here for some brief documentation:
https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox#supported-filetypes

Here is an example of how you can write your own datatypes:

https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox/datatypes
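
Conceptually, the implicit conversion boils down to a lookup like the one
sketched below. This only models the idea and is not how Galaxy implements
it: the CONVERTERS table, the converter name and plan_connection are all
hypothetical, and the formats "x" and "y" are the placeholders from your
question.

```python
from typing import Dict, Optional, Tuple

# Hypothetical registry: (source format, target format) -> converter tool name.
CONVERTERS: Dict[Tuple[str, str], str] = {
    ("x", "y"): "x_to_y_converter",
}


def plan_connection(produced: str, required: str) -> Optional[str]:
    """Return "direct", the name of a converter to insert, or None if no path exists."""
    if produced == required:
        return "direct"
    return CONVERTERS.get((produced, required))


print(plan_connection("x", "x"))  # direct
print(plan_connection("x", "y"))  # x_to_y_converter
print(plan_connection("y", "x"))  # None
```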

> 
> 4. OAuth 2.0 / OpenID Connect: 
> 
> I need to be able to fetch documents from data providers that require an 
> OAuth 2.0 access token. Currently, I use a separate service to go 
> through the OAuth authentication/authorization process and then have the user 
> copy/paste their access token into a text field in Galaxy.   
> Is there a way to perform the OAuth authentication dance required by the 
> remote service inside Galaxy itself?   

I don't think so, but maybe someone else has an idea here.

> I’ve looked at the Trello site for Galaxy and see that both OAuth 2.0 and 
> OpenID Connect are on the radar, hopefully this use case is being considered 
> as well.
> 
> I’m sure to have more questions after working through some visualization 
> examples, but this should keep me busy for now.

Hope you are busy now :)
Cheers and keep us up to date!
Bjoern

> Sincerely,
> Keith Suderman
> 
> REFERENCES
> 
> 1. https://wiki.galaxyproject.org/Admin/Tools/AddingTools
> 
> ------------------------------
> Research Associate
> Department of Computer Science
> Vassar College
> Poughkeepsie, NY
> 
> 
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   https://lists.galaxyproject.org/
> 
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/
> 