[galaxy-dev] Consuming dataset collections

Steve Cassidy Mon, 25 Jul 2016 23:00:07 -0700

Hi all,
   I’m staring at the discussion of handling dataset collections:


http://planemo.readthedocs.io/en/latest/_writing_collections.html

but failing to see the solution to my problem.

I have a tool that creates a dataset collection, a group of files with names 
like 1_1308_1_2_092-ch6-speaker16.TextGrid where the 1_1308_1_2_092 part is a 
unique identifier that I’d like to keep track of.  I’ve used a 
discover_datasets tag in the tool xml file to match my output filenames and 
extract the designation (1_1308_1_2_092-ch6-speaker16.TextGrid) and the ext 
(TextGrid).
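
To make the naming scheme concrete, here is a minimal Python sketch of the same kind of match. The regular expression is an assumption modelled on the named groups mentioned above (designation and ext), not the pattern from the actual tool xml, and treating everything before the first hyphen as the identifier is likewise a guess based on the example filename.

```python
import re

# Hypothetical pattern: "designation" captures the whole file name and
# "ext" captures the part after the last dot, as described in the post.
# The real discover_datasets pattern is not shown, so this is illustrative.
PATTERN = re.compile(r"(?P<designation>.+\.(?P<ext>[^.]+))$")


def split_output_name(filename):
    """Return (designation, ext, identifier) for one output file name."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError("unexpected output file name: %r" % filename)
    designation = match.group("designation")
    # Assumption: the unique identifier is everything before the first hyphen.
    identifier = designation.split("-", 1)[0]
    return designation, match.group("ext"), identifier
```

With the example name, the third element returned would be the 1_1308_1_2_092 part that needs to be tracked.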

I have another tool that runs a query over these files and generates a single 
tabular result that will ideally include the identifier in some form. Here’s 
the command section for that tool:

        query_textgrids.py --textgrid "${",".join(map(str, $textgrid))}" --tier $tier --regex '$regex' --output_path $output

where ‘$textgrid’ is one of my input parameters that has multiple="true" set so 
that it can be a dataset collection.  That works OK, but what I get as input are 
the file names (dataset_1.dat, etc.), not the names of the datasets.

The page above mentions something called the ‘element_identifier’ and gives 
this funky example:


merge_rows --name "${re.sub('[^\w\-_]', '_', $input.element_identifier)}" 
--file "$input" --to $output;


I can’t see what this element_identifier thing is - the suggestion is that it 
might be the dataset name, but I’m not sure.  I also don’t understand why the 
command above is replacing characters in it with underscores.
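
Run on its own in Python, the substitution from that example behaves as sketched below. The function name and the sample string are made up for illustration. The character class matches anything that is not a word character, a hyphen or an underscore, so it affects more than whitespace; presumably the aim is a value that is safe to embed in a command line, but that motive is a guess.

```python
import re


def sanitize(name):
    # Same pattern as the documentation example: every character that is
    # not a word character, a hyphen or an underscore becomes an underscore.
    return re.sub(r"[^\w\-_]", "_", name)


# Spaces, brackets and dots are all replaced; letters, digits, hyphens
# and underscores are left alone.
print(sanitize("speaker 16 (ch6).TextGrid"))
print(sanitize("1_1308_1_2_092-ch6-speaker16"))
```

An identifier such as 1_1308_1_2_092-ch6-speaker16 passes through unchanged, so the substitution would only matter for names containing spaces or punctuation.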

If this is the name I’m after, it would seem that I’d need to pass these names 
along with the textgrid files and then pair them up inside my script - is that 
what I need to do?
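
If pairing inside the script is the way to go, a minimal sketch of that side might look like the following. The --identifier option is hypothetical (query_textgrids.py has no such option at present), and the sketch assumes the names arrive comma-separated in the same order as the paths, with no commas inside the names themselves.

```python
import argparse


def pair_inputs(paths_arg, names_arg):
    """Pair comma-separated dataset paths with comma-separated names."""
    paths = paths_arg.split(",")
    names = names_arg.split(",")
    if len(paths) != len(names):
        raise ValueError(
            "got %d paths but %d identifiers" % (len(paths), len(names))
        )
    # Order is assumed to be the same in both lists.
    return list(zip(names, paths))


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--textgrid", required=True,
                        help="comma-separated dataset paths")
    # Hypothetical option carrying the element names, one per path.
    parser.add_argument("--identifier", required=True,
                        help="comma-separated element names")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    for name, path in pair_inputs(args.textgrid, args.identifier):
        print(name, path)
```

Each row of the tabular result could then carry the name rather than the dataset_N.dat path.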

All of this cries out to me for a more explicit representation of a dataset 
collection that my tool can create and consume rather than this hacky treatment 
of filenames.  If I could generate a manifest file of some kind describing my 
dataset collection then none of this parsing of filenames would be needed.  I 
could consume the manifest file as well, and it could be used for 
collection-level metadata.  Is this a silly idea?
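
As a rough illustration of the manifest idea, here is a sketch in Python. The JSON layout, the key names and both function names are invented for this example; this is not an existing Galaxy format or feature.

```python
import json


def write_manifest(path, elements, metadata=None):
    """Write a hypothetical collection manifest as JSON.

    elements is a list of (identifier, filename) pairs; metadata is an
    optional dict of collection-level values.
    """
    manifest = {
        "metadata": metadata or {},
        "elements": [
            {"identifier": identifier, "filename": filename}
            for identifier, filename in elements
        ],
    }
    with open(path, "w") as handle:
        json.dump(manifest, handle, indent=2)


def read_manifest(path):
    """Read the manifest back as (metadata, [(identifier, filename), ...])."""
    with open(path) as handle:
        manifest = json.load(handle)
    elements = [
        (entry["identifier"], entry["filename"])
        for entry in manifest["elements"]
    ]
    return manifest["metadata"], elements
```

A producing tool would call write_manifest once after creating its files, and a consuming tool would call read_manifest instead of parsing filenames.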

Anyway, any help with my immediate problem would be appreciated.

Thanks,

Steve

—
Department of Computing, Macquarie University
http://web.science.mq.edu.au/~cassidy

