Re: [galaxy-dev] Appending _task_%d suffix to multi files

Jorrit Boekel Thu, 01 Aug 2013 02:46:26 -0700

Hi Piotr,

In our proteomics lab, a protein sample is fractionated (e.g. by pH) into a number of sample fractions before analysis. The fractions are then run through the mass spectrometer one at a time. Each fraction yields a data file.

The mass spec data is then matched to peptides by searching a FASTA file, termed target, with protein sequences. Afterwards the matches are statistically scored by machine learning. To do this, the data is also matched with a scrambled FASTA file, termed decoy. Each fraction is matched to a target and decoy file, which yields two match files per fraction.

The machine learning tool thus picks a target and a decoy match file and puts statistical significances on the matches. In order for this to be correct, it needs to pick match files that correspond, i.e. that are derived from the same fraction.
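
The pairing requirement above can be sketched roughly as follows. This is only an illustration, not the code from any Galaxy fork: the function name pair_by_task and the example file names are hypothetical, and it assumes each match file name ends in the _task_%d suffix discussed in this thread.

```python
import re

# Assumption: every file name ends in _task_<n>, the suffix from this thread.
TASK_SUFFIX = re.compile(r"_task_(\d+)$")


def pair_by_task(target_files, decoy_files):
    """Return (target, decoy) pairs that share the same _task_%d number."""

    def index(files):
        # Map task number -> file name for one group of match files.
        by_task = {}
        for name in files:
            match = TASK_SUFFIX.search(name)
            if match is None:
                raise ValueError(f"no _task_%d suffix in {name!r}")
            by_task[int(match.group(1))] = name
        return by_task

    targets, decoys = index(target_files), index(decoy_files)
    if targets.keys() != decoys.keys():
        raise ValueError("target and decoy fractions do not match up")
    # Pair files fraction by fraction, in ascending task order.
    return [(targets[n], decoys[n]) for n in sorted(targets)]


pairs = pair_by_task(
    ["target_task_1", "target_task_0"],
    ["decoy_task_0", "decoy_task_1"],
)
```

The order in which the files are listed does not matter here; only the task number decides which target goes with which decoy.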

In our lab, we have not yet looked at John Chilton's (I think) work with the m: data sets, and our parallel processing is done inside Galaxy, using its split and merge functions to divide a job into tasks. Each task is sent as a separate job to SGE, I think, but others may know more about this than I.


I really have to get back to my holiday now, cheers,
jorrit

On 08/01/2013 04:17 AM, piotr.s...@csiro.au wrote:

Hi Jorrit,
Thank you for your explanation. Would you be able to give us an example of what you mean by fractions and of when the task_%d suffixes are used to pick files? We just want to make sure we have a good understanding of the problem that you solved.
Also, I vaguely remember seeing "data parallelism" mentioned somewhere in relation to the m: data sets. Do you currently support in any way automatic distribution of processing of such datasets to parallel environments (e.g. array jobs in SGE or such)?
Cheers,

-Piotr

*From:*Jorrit Boekel [mailto:jorrit.boe...@scilifelab.se]
*Sent:* Wednesday, July 31, 2013 8:18 PM
*To:* Khassapov, Alex (CSIRO IM&T, Clayton)
*Cc:* p.j.a.c...@googlemail.com; jmchil...@gmail.com; galaxy-dev@lists.bx.psu.edu; Szul, Piotr (ICT Centre, Marsfield); Burdett, Neil (ICT Centre, Herston - RBWH)
*Subject:* Re: Appending _task_%d suffix to multi files

Hi Alex,
In our lab, files are often fractions of an experiment, but they are named by their creators in whatever way they like. I put that code in to standardize fraction naming, in case a tool needs input from two files that originate from the same fraction (but have been treated in different ways). In those cases, in my fork, Galaxy always picks the files with the same task_%d numbers.
I can't help you very much right now, as I'm currently away from work until October, but I hope this explains why it's in there.
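
As a rough illustration of that naming scheme (not the actual code in the fork): the helper add_task_suffix and the file names below are made up, and it assumes that numbering starts at 0 and follows the order in which the fractions are listed.

```python
def add_task_suffix(names):
    """Append _task_<index> to each name, numbering in the given order."""
    return ["%s_task_%d" % (name, i) for i, name in enumerate(names)]


# Two differently treated versions of the same two fractions,
# named freely by whoever created them (hypothetical names).
raw = add_task_suffix(["sampleA_pH3", "sampleA_pH4"])
filtered = add_task_suffix(["run1_filtered", "run2_filtered"])
```

Because both lists are numbered the same way, the entry with _task_0 in one list and the entry with _task_0 in the other come from the same fraction, whatever their original names were.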
cheers,
jorrit
On 07/31/2013 04:15 AM, alex.khassa...@csiro.au wrote:
    Hi guys,

    We've been using Galaxy for a year now, we created our own Galaxy
    fork where we were making changes to adapt Galaxy to our
    requirements.  As we need "multiple file datasets", we were using
    John's fork for that initially.

    Now we are trying to use "The most updated version of the multiple
    file dataset stuff" https://bitbucket.org/msiappdev/galaxy-extras/
    directly as we don't want to maintain our own version.

    One of the problems we have - when we upload multiple files -
    their file names are changed (_task_%d suffix is added to their
    names).

    On our branch we simply removed the code which does it, but now we
    wonder if it is possible to avoid this renaming somehow? I.e. make
    it configurable?

    Is it really necessary to change the file names?

    -Alex

    -----Original Message-----
    From: galaxy-dev-boun...@lists.bx.psu.edu On Behalf Of Jorrit Boekel
    Sent: Thursday, 25 October 2012 8:35 PM
    To: Peter Cock
    Cc: galaxy-dev@lists.bx.psu.edu
    Subject: Re: [galaxy-dev] the multi job splitter

    I keep the files matched by keeping a _task_%d suffix on their
    names. So each task is matched with its correct counterpart with
    the same number.

    cheers,

    jorrit

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
