Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-23 Thread Peter Cock
On Wed, Feb 22, 2012 at 7:07 PM,  dannonba...@me.com wrote:
 Awesome, I'll take a look.  And, if you're able to pull it together easily
 enough, clean branches are always nice.

 -Dannon

It is all on one new branch, covering FASTA splitting (ready),
splitting in the BLAST+ wrapper (ready bar merging datatypes),
and XML merging (may need more work). It has also occurred to me
I may need to implement HTML merging (or even remove this as
a BLAST output option - do people use it?).

https://bitbucket.org/peterjc/galaxy-central/src/split_blast

All the commits should be self-contained, allowing the FASTA
splitting bits to be transplanted/cherry-picked. If you want I'll do
that on a new branch focused on FASTA splitting only.

But before I do, I'd appreciate any initial comments you might
have from a first inspection.

Thanks,

Peter

___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-22 Thread Peter Cock
On Thu, Feb 16, 2012 at 9:02 PM, Peter wrote:
 On Thu, Feb 16, 2012 at 6:42 PM, Chris wrote:
 On Feb 16, 2012, at 12:24 PM, Peter wrote:
 I also need to look at merging multiple BLAST XML outputs,
 but this is looking promising.

 Yep, that's definitely one where a simple concatenation
 wouldn't work (though NCBI used to think so, years ago…)

 Well, given the NCBI's historic practice of producing 'XML'
 output which was actually the concatenation of several XML files,
 some tools will tolerate this out of practicality - the Biopython
 BLAST XML parser, for example.

 But yes, some care is needed over the header/footer to
 ensure a valid XML output is created by the merge. This
 may also require renumbering queries... I will check.

Basic BLAST XML merging implemented and apparently working:
https://bitbucket.org/peterjc/galaxy-central/changeset/ebf65c0b1e26

This does not currently attempt to remap the iteration
numbers or automatically assigned query names, e.g.
you can have this kind of thing in the middle of the XML
at a merge point:

  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>Query_1</Iteration_query-ID>

That isn't a problem for some tools, e.g. my code in
Galaxy to convert BLAST XML to tabular, but I suspect
it could cause trouble elsewhere. If anyone has specific
suggestions for what to test, that would be great.

If this is an issue, then the merge code needs a little
more work to edit these values.
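
For the curious, the heart of the merge is roughly the following (a
simplified sketch of the idea, not the committed code - see the
changeset above for the real thing):

    # Keep the XML header from the first file and the footer from the
    # last, splicing in every <Iteration> block in between. Iteration
    # numbers are NOT remapped, hence the caveat above.
    def merge_blast_xml(split_files, output_file):
        with open(output_file, "w") as out:
            for i, name in enumerate(split_files):
                with open(name) as handle:
                    text = handle.read()
                start = text.find("<Iteration>")
                end = text.rfind("</Iteration>") + len("</Iteration>")
                if i == 0:
                    out.write(text[:start])  # header, first file only
                out.write(text[start:end])   # all the <Iteration> blocks
            out.write(text[end:])            # footer, from the last file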

I think the FASTA split code could be reviewed for
inclusion though. Dan - do you want to look at that?
Would a clean branch help?

Peter

___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-22 Thread dannonbaker
Awesome, I'll take a look. And, if you're able to pull it together easily
enough, clean branches are always nice.

-Dannon

On Feb 22, 2012, at 10:59 AM, Peter Cock p.j.a.c...@googlemail.com wrote:

 Basic BLAST XML merging implemented and apparently working:
 https://bitbucket.org/peterjc/galaxy-central/changeset/ebf65c0b1e26

 This does not currently attempt to remap the iteration
 numbers or automatically assigned query names, e.g.
 you can have this kind of thing in the middle of the XML
 at a merge point:

   <Iteration_iter-num>1</Iteration_iter-num>
   <Iteration_query-ID>Query_1</Iteration_query-ID>

 That isn't a problem for some tools, e.g. my code in
 Galaxy to convert BLAST XML to tabular, but I suspect
 it could cause trouble elsewhere. If anyone has specific
 suggestions for what to test, that would be great.

 If this is an issue, then the merge code needs a little
 more work to edit these values.

 I think the FASTA split code could be reviewed for
 inclusion though. Dan - do you want to look at that?
 Would a clean branch help?

 Peter
___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/

Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-20 Thread Bram Slabbinck

Hi Dannon,

If I may further elaborate on this issue, I would like to mention that 
this kind of functionality is also supported by the Sun Grid Engine in 
the form of 'array jobs'. With this functionality you can execute a job 
multiple times in an independent way, only differing for instance in the 
parameter settings. From your description below, it seems similar to the 
Galaxy parallelism tag. Is there or do you foresee any implementation of 
this SGE functionality through the drmaa interface in Galaxy? If not, is 
there anybody who has achieved this through some custom coding? We would 
be highly interested in this.
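
For concreteness, this is the sort of thing I mean, submitted through
the Python drmaa bindings rather than Galaxy (a minimal sketch; the
script name is a placeholder):

    # Submit an SGE array job of 10 independent tasks; each task can
    # read its own index from the $SGE_TASK_ID environment variable.
    import drmaa

    s = drmaa.Session()
    s.initialize()
    jt = s.createJobTemplate()
    jt.remoteCommand = "./my_task.sh"  # placeholder task script
    job_ids = s.runBulkJobs(jt, 1, 10, 1)  # task indices 1 to 10, step 1
    s.synchronize(job_ids, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)
    s.deleteJobTemplate(jt)
    s.exit()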


thanks
Bram

On 15/02/2012 18:08, Dannon Baker wrote:

It's definitely an experimental feature at this point, and there's no wiki, but basic support for breaking jobs into
tasks does exist.  It needs a lot more work and can go in a few different directions to make it better, but check out
the wrappers with <parallelism> defined, and enable use_tasked_jobs in your universe_wsgi.ini and restart.
That's all it should take from a fresh galaxy install to get, iirc, at least BWA and a few other tools working.  If
you want a super trivial example to play with, change the tool .xml for a text tool like "change case" to
have <parallelism method="basic"></parallelism> and give that a shot.

If you decide to try this out, do keep in mind that this feature is not at all
complete and, while there's a long list of things we still want to experiment
with along these lines, suggestions (and especially contributions) are
absolutely welcome.

-Dannon

On Feb 15, 2012, at 11:36 AM, Peter Cock wrote:


Hi all,

The comments on this issue suggest that the Galaxy team is/were
working on splitting large jobs over multiple nodes/CPUs:

https://bitbucket.org/galaxy/galaxy-central/issue/79/split-large-jobs

Is there any relevant page on the wiki I should be aware of?

Specifically I am hoping for a general framework where one of the tool
inputs can be marked as embarrassingly parallel meaning it can be
subdivided easily (e.g. multiple sequences in FASTA or FASTQ format,
multiple annotations in BED format, multiple lines in tabular format) and
the outputs can all be easily combined (e.g. by concatenation in the
same order as the input was split).

Thanks,

Peter
___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/

___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

   http://lists.bx.psu.edu/

--
==
Bram Slabbinck, PhD

Bioinformatics & Systems Biology Division
VIB Department of Plant Systems Biology, UGent
Technologiepark 927, 9052 Gent, BELGIUM

Email: bram.slabbi...@psb.ugent.be
WWW: http://bioinformatics.psb.ugent.be
==
Please consider the environment before printing this email

___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-20 Thread Peter Cock
On Mon, Feb 20, 2012 at 8:08 AM, Bram Slabbinck br...@psb.vib-ugent.be wrote:
 Hi Dannon,

 If I may further elaborate on this issue, I would like to mention that this
 kind of functionality is also supported by the Sun Grid Engine in the form
 of 'array jobs'. With this functionality you can execute a job multiple
 times in an independent way, only differing for instance in the parameter
 settings. From your description below, it seems similar to the Galaxy
 parallelism tag. Is there or do you foresee any implementation of this SGE
 functionality through the drmaa interface in Galaxy? If not, is there
 anybody who has achieved this through some custom coding? We
 would be highly interested in this.

 thanks
 Bram

I was wondering why Galaxy submits N separate jobs to SGE
after splitting (identical bar their working directory). I'm not sure if
all the other supported cluster back ends can do this, but basic
job dependencies are possible using SGE. That means the cluster could
take care of scheduling the split job, the N processing jobs, and
the final merge job (i.e. three stages, where for example it won't
run the merge until all N processing jobs have finished).
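
With plain SGE that three-stage chain could be expressed directly at
submission time, e.g. (illustrative only - the script names are made up):

    qsub -N split                         split_input.sh
    qsub -N work  -hold_jid split -t 1-4  process_part.sh
    qsub -N merge -hold_jid work          merge_outputs.sh

Here the merge job is held until every task of the "work" array job
has finished.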

My hunch is Galaxy is doing a lot of this 'housekeeping' internally
in order to remain flexible regarding the cluster back end.

Peter
___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-20 Thread Dannon Baker
Peter has it right in that we need to do this internally to ensure 
functionality across a range of job runners.  A side benefit is that it gives 
us direct access to the tasks so that we can eventually do interesting things 
with scheduling, resubmission, feedback, etc.  If the overhead looks to be a 
performance issue I could see having an override that would allow pushing task 
scheduling to the underlying cluster, but that functionality would come later.

-Dannon

On Feb 20, 2012, at 3:13 AM, Peter Cock wrote:

 On Mon, Feb 20, 2012 at 8:08 AM, Bram Slabbinck br...@psb.vib-ugent.be 
 wrote:
 Hi Dannon,
 
 If I may further elaborate on this issue, I would like to mention that this
 kind of functionality is also supported by the Sun Grid Engine in the form
 of 'array jobs'. With this functionality you can execute a job multiple
 times in an independent way, only differing for instance in the parameter
 settings. From your description below, it seems similar to the Galaxy
 parallelism tag. Is there or do you foresee any implementation of this SGE
 functionality through the drmaa interface in Galaxy? If not, is there
 anybody who has achieved this through some custom coding? We
 would be highly interested in this.
 
 thanks
 Bram
 
 I was wondering why Galaxy submits N separate jobs to SGE
 after splitting (identical bar their working directory). I'm not sure if
 all the other supported cluster back ends can do this, but basic
 job dependencies are possible using SGE. That means the cluster could
 take care of scheduling the split job, the N processing jobs, and
 the final merge job (i.e. three stages, where for example it won't
 run the merge until all N processing jobs have finished).
 
 My hunch is Galaxy is doing a lot of this 'housekeeping' internally
 in order to remain flexible regarding the cluster back end.
 
 Peter
 ___
 Please keep all replies on the list by using reply all
 in your mail client.  To manage your subscriptions to this
 and other Galaxy lists, please use the interface at:
 
  http://lists.bx.psu.edu/

___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-17 Thread Peter Cock
On Thu, Feb 16, 2012 at 9:02 PM, Peter wrote:
 On Thu, Feb 16, 2012 at 6:42 PM, Chris wrote:
 Cool!  Seems like a perfectly fine start.  I guess you could
 grab the # of sequences from the dataset somehow (I'm
 guessing that is set somehow upon import into Galaxy).

 Yes, I should be able to get that from Galaxy's metadata
 if known - much like how the FASTQ splitter works. It only
 needs to be an estimate anyway - which is what I think
 Galaxy does for large files - if we get it wrong then rather
 than using n sub-jobs as suggested, we might use n+1
 or n-1.

Done, and it seems to be working nicely now. If we don't
know the sequence count, I divide the file based on the
total size in bytes - which avoids any extra IO.
https://bitbucket.org/peterjc/galaxy-central/changeset/26a0c0aa776d
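
The byte-based division boils down to something like this (a simplified
sketch of the idea, not the committed code):

    import os

    # Pick roughly evenly spaced byte offsets, then nudge each one
    # forward to the next ">" header so FASTA records stay intact.
    def fasta_split_points(filename, parts):
        size = os.path.getsize(filename)
        targets = [size * i // parts for i in range(1, parts)]
        points = []
        with open(filename, "rb") as handle:
            for target in targets:
                handle.seek(target)
                handle.readline()  # finish the (possibly partial) line
                while True:
                    pos = handle.tell()
                    line = handle.readline()
                    if not line or line.startswith(b">"):
                        points.append(pos)  # chunk boundary (or EOF)
                        break
        return points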

Taking advantage of this I have switched the BLAST tools
from saying "split the query into batches of 500 sequences"
(which worked fine but only gave benefits if doing genome
scale queries) to just "split the query into four parts" (which
will be done based on the sequence count if known, or the
file size if not). This way any multi-query BLAST will get
divided and run in parallel, not just the larger jobs. This
gives a nice improvement (over yesterday's progress)
with small tasks like 10 query sequences against a big
database like NR or NT.
https://bitbucket.org/peterjc/galaxy-central/changeset/1fb89ae798be

Peter

___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker dannonba...@me.com wrote:

 Main still runs these jobs in the standard non-split fashion, and as a
 resource that is occasionally saturated (and thus doesn't necessarily have
 extra resources to parallelize to) will probably continue doing so as long
 as there's significant overhead involved in splitting the files.  Fancy
 scheduling could minimize the issue, but as it is during heavy load you
 would actually have lower total throughput due to the splitting overhead.


Because the splitting (currently) happens on the main server?

 Regarding the merging of the output, I see there is a default merge
 method in lib/galaxy/datatypes/data.py which just concatenates
 the files. I am surprised at that - it seems like a very bad idea in
 general - consider many binary files, or XML. Why not put this
 as the default for text and subclasses thereof?

 I can't think of a better reasonable default behavior for "Data", though
 you're obviously right that each datatype subclass will need to define
 particular behaviors for merging files.

The default should raise an error (and better yet, refuse to do the
split in the first place). Zen of Python: In the face of ambiguity,
refuse the temptation to guess.

Peter

___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Dannon Baker

On Feb 16, 2012, at 5:15 AM, Peter Cock wrote:

 On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker dannonba...@me.com wrote:
 
 Main still runs these jobs in the standard non-split fashion, and as a
 resource that is occasionally saturated (and thus doesn't necessarily have
 extra resources to parallelize to) will probably continue doing so as long
 as there's significant overhead involved in splitting the files.  Fancy
 scheduling could minimize the issue, but as it is during heavy load you
 would actually have lower total throughput due to the splitting overhead.
 
 
 Because the splitting (currently) happens on the main server?

No, because the splitting process is work which has to happen somewhere.  
Ignoring possible benefits from things that haven't been implemented yet, in a 
situation where your cluster is saturated with work you are unable to take 
advantage of the parallelism and splitting files apart is only adding more 
work, reducing total job throughput.  That splitting always happens on the head 
node is not ideal, and needs to be configurable.  I have a fork somewhere that 
attempts to address this but it needs work.
___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker dannonba...@me.com wrote:
 Good luck, let me know how it goes, and again - contributions are certainly
 welcome :)

I think I found the first bug: the split method in lib/galaxy/datatypes/sequence.py
for the Sequence class assumes four lines per sequence. This would make
sense as the split method of the Fastq class (after grooming to remove
any line wrapping) but is a very bad idea on most sequence file formats
(e.g. FASTA).

It looks like a little refactoring is needed, defining a Sequence split method
which raises not implemented, and moving the current code to the Fastq
class, then writing something similar but allowing multiple lines per record
for the Fasta class.

Does that sound reasonable? I'll do this on a new branch for review...
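
Roughly this shape (the parameter names here are made up, not the real
Galaxy signatures):

    class Sequence(object):
        @classmethod
        def split(cls, input_file, chunk_size, make_subdir):
            # Force each concrete format to define its own splitter.
            raise NotImplementedError("split must be defined per format")

    class Fastq(Sequence):
        @classmethod
        def split(cls, input_file, chunk_size, make_subdir):
            # After grooming there are exactly four lines per record,
            # so a chunk is simply chunk_size * 4 lines.
            pass

    class Fasta(Sequence):
        @classmethod
        def split(cls, input_file, chunk_size, make_subdir):
            # Records are a ">" header plus a variable number of
            # sequence lines; only start a new chunk at a header.
            pass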

Peter
___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Thu, Feb 16, 2012 at 10:47 AM, Peter Cock p.j.a.c...@googlemail.com wrote:
 On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker dannonba...@me.com wrote:
 Good luck, let me know how it goes, and again - contributions are certainly
 welcome :)

 I think I found the first bug, method split in 
 lib/galaxy/datatypes/sequence.py
 for class Sequence assumes four lines per sequence. This would make
 sense as the split method of the Fastq class (after grooming to remove
 any line wrapping) but is a very bad idea on most sequence file formats
 (e.g. FASTA).

 It looks like a little refactoring is needed, defining a Sequence split method
 which raises not implemented, and moving the current code to the Fastq
 class, then writing something similar but allowing multiple lines per record
 for the Fasta class.

 Does that sound reasonable? I'll do this on a new branch for review...

Refactoring lib/galaxy/datatypes/sequence.py split method here,
https://bitbucket.org/peterjc/galaxy-central/changeset/762777618073

This is part of a work-in-progress split_blast branch to try splitting
BLAST jobs, for which I will need to split FASTA files as inputs, and
also merge BLAST XML output:
https://bitbucket.org/peterjc/galaxy-central/src/split_blast

Peter
___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Fields, Christopher J
On Feb 16, 2012, at 4:47 AM, Peter Cock wrote:

 On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker dannonba...@me.com wrote:
 Good luck, let me know how it goes, and again - contributions are certainly
 welcome :)
 
 I think I found the first bug, method split in 
 lib/galaxy/datatypes/sequence.py
 for class Sequence assumes four lines per sequence. This would make
 sense as the split method of the Fastq class (after grooming to remove
 any line wrapping) but is a very bad idea on most sequence file formats
 (e.g. FASTA).
 
 It looks like a little refactoring is needed, defining a Sequence split method
 which raises not implemented, and moving the current code to the Fastq
 class, then writing something similar but allowing multiple lines per record
 for the Fasta class.
 
 Does that sound reasonable? I'll do this on a new branch for review...
 
 Peter

Makes sense from my perspective; splits have to be defined based on data type.  
It could be as low-level as defining a simple iterator per record, then a 
wrapper that allows a specific chunk-size.  The split file creation could 
almost be abstracted completely away into a common method.

As Peter implies, maybe a simple API for defining a split method would be all 
that is needed.  It might also be useful for the merge step; 'cat'-like merges 
won't work for every format but would be a suitable default.

chris
___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Thu, Feb 16, 2012 at 1:53 PM, Fields, Christopher J
cjfie...@illinois.edu wrote:

 Makes sense from my perspective; splits have to be defined based on
 data type.  It could be as low-level as defining a simple iterator per
 record, then a wrapper that allows a specific chunk-size.  The split
 file creation could almost be abstracted completely away into a
 common method.

I'm trying to understand exactly how the current code creates the
splits, but yes - something like that is what I would expect.

 As Peter implies, maybe a simple API for defining a split method
 would be all that is needed.  It might also be useful for the merge
 step; 'cat'-like merges won't work for every format but would be
 a suitable default.

Yes, for a lot of file types concatenation is fine. Again, like the
splitting, this has to be (and is) defined at the data type level (which
is a hierarchy of classes in Galaxy).

Peter

___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
Hi Dan,

I think I need a little more advice - what is the role of the script
scripts/extract_dataset_part.py and the JSON files created
when splitting FASTQ files in lib/galaxy/datatypes/sequence.py,
and then used by the class' process_split_file method?

Why is there no JSON file created by the base data class in
lib/galaxy/datatypes/data.py and no method process_split_file?

Is the JSON thing part of a partial and unfinished rewrite of the
splitter code?

On the assumption that not all splitters bother with the JSON,
I am trying a little hack to scripts/extract_dataset_part.py to
abort silently if there is no JSON file:
https://bitbucket.org/peterjc/galaxy-central/changeset/ebe94a2c25c3
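
The hack amounts to a guard of this shape at the top of the script
(paraphrased - the real diff is in the changeset above, and the file
name here is made up):

    import os
    import sys

    # If the splitter wrote no JSON instructions file, there is
    # nothing for this script to do, so exit quietly rather than fail.
    json_file = sys.argv[1] if len(sys.argv) > 1 else "split_info.json"
    if not os.path.exists(json_file):
        sys.exit(0)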

This seems to be working with my current attempt at a FASTA
splitter (not checked in yet, only partly implemented and tested).

Peter
___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Thu, Feb 16, 2012 at 4:28 PM, Peter Cock p.j.a.c...@googlemail.com wrote:
 Hi Dan,

 I think I need a little more advice - what is the role of the script
 scripts/extract_dataset_part.py and the JSON files created
 when splitting FASTQ files in lib/galaxy/datatypes/sequence.py,
 and then used by the class' process_split_file method?

 Why is there no JSON file created by the base data class in
 lib/galaxy/datatypes/data.py and no method process_split_file?

 Is the JSON thing part of a partial and unfinished rewrite of the
 splitter code?

 On the assumption that not all splitters bother with the JSON,
 I am trying a little hack to scripts/extract_dataset_part.py to
 abort silently if there is no JSON file:
 https://bitbucket.org/peterjc/galaxy-central/changeset/ebe94a2c25c3

 This seems to be working with my current attempt at a FASTA
 splitter (not checked in yet, only partly implemented and tested).

I've checked in my FASTA splitting, which now seems to be
working OK with my BLAST tests. So far this only does splitting
into chunks of the requested number of sequences, rather than
the option to split the whole file into a given number of pieces.
https://bitbucket.org/peterjc/galaxy-central/changeset/416c961c0da9

I also need to look at merging multiple BLAST XML outputs, but
this is looking promising.

Peter
___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Dannon Baker
Very cool, I'll check it out!  The addition of the JSON files is indeed very 
new and was likely unfinished with respect to the base splitter.

-Dannon

On Feb 16, 2012, at 1:24 PM, Peter Cock wrote:

 On Thu, Feb 16, 2012 at 4:28 PM, Peter Cock p.j.a.c...@googlemail.com wrote:
 Hi Dan,
 
 I think I need a little more advice - what is the role of the script
 scripts/extract_dataset_part.py and the JSON files created
 when splitting FASTQ files in lib/galaxy/datatypes/sequence.py,
 and then used by the class' process_split_file method?
 
 Why is there no JSON file created by the base data class in
 lib/galaxy/datatypes/data.py and no method process_split_file?
 
 Is the JSON thing part of a partial and unfinished rewrite of the
 splitter code?
 
 On the assumption that not all splitters bother with the JSON,
 I am trying a little hack to scripts/extract_dataset_part.py to
 abort silently if there is no JSON file:
 https://bitbucket.org/peterjc/galaxy-central/changeset/ebe94a2c25c3
 
 This seems to be working with my current attempt at a FASTA
 splitter (not checked in yet, only partly implemented and tested).
 
 I've checked in my FASTA splitting, which now seems to be
 working OK with my BLAST tests. So far this only does splitting
 into chunks of the requested number of sequences, rather than
 the option to split the whole file into a given number of pieces.
 https://bitbucket.org/peterjc/galaxy-central/changeset/416c961c0da9
 
 I also need to look at merging multiple BLAST XML outputs, but
 this is looking promising.
 
 Peter

___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-15 Thread Dannon Baker
It's definitely an experimental feature at this point, and there's no wiki, but 
basic support for breaking jobs into tasks does exist.  It needs a lot more 
work and can go in a few different directions to make it better, but check out 
the wrappers with <parallelism> defined, and enable use_tasked_jobs in your 
universe_wsgi.ini and restart.  That's all it should take from a fresh galaxy 
install to get, iirc, at least BWA and a few other tools working.  If you want 
a super trivial example to play with, change the tool .xml for a text tool like 
"change case" to have <parallelism method="basic"></parallelism> and give that 
a shot.
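
Spelled out, that is just these two edits. In universe_wsgi.ini (then
restart):

    use_tasked_jobs = True

And in the tool's .xml, inside the <tool> element:

    <parallelism method="basic"></parallelism>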

If you decide to try this out, do keep in mind that this feature is not at all 
complete and, while there's a long list of things we still want to experiment 
with along these lines, suggestions (and especially contributions) are 
absolutely welcome.

-Dannon

On Feb 15, 2012, at 11:36 AM, Peter Cock wrote:

 Hi all,
 
 The comments on this issue suggest that the Galaxy team is/were
 working on splitting large jobs over multiple nodes/CPUs:
 
 https://bitbucket.org/galaxy/galaxy-central/issue/79/split-large-jobs
 
 Is there any relevant page on the wiki I should be aware of?
 
 Specifically I am hoping for a general framework where one of the tool
 inputs can be marked as embarrassingly parallel meaning it can be
 subdivided easily (e.g. multiple sequences in FASTA or FASTQ format,
 multiple annotations in BED format, multiple lines in tabular format) and
 the outputs can all be easily combined (e.g. by concatenation in the
 same order as the input was split).
 
 Thanks,
 
 Peter
 ___
 Please keep all replies on the list by using reply all
 in your mail client.  To manage your subscriptions to this
 and other Galaxy lists, please use the interface at:
 
  http://lists.bx.psu.edu/

___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-15 Thread Fields, Christopher J
Ah, was just about to ask about this as well, nice to know something is already 
in place (as experimental as it might be). Thanks Dannon!

chris

On Feb 15, 2012, at 11:08 AM, Dannon Baker wrote:

 It's definitely an experimental feature at this point, and there's no wiki, 
 but basic support for breaking jobs into tasks does exist.  It needs a lot 
 more work and can go in a few different directions to make it better, but 
 check out the wrappers with <parallelism> defined, and enable use_tasked_jobs 
 in your universe_wsgi.ini and restart.  That's all it should take from a 
 fresh galaxy install to get, iirc, at least BWA and a few other tools 
 working.  If you want a super trivial example to play with, change the tool 
 .xml for a text tool like "change case" to have 
 <parallelism method="basic"></parallelism> and give that a shot.
 
 If you decide to try this out, do keep in mind that this feature is not at 
 all complete and, while there's a long list of things we still want to 
 experiment with along these lines, suggestions (and especially contributions) 
 are absolutely welcome.
 
 -Dannon
 
 On Feb 15, 2012, at 11:36 AM, Peter Cock wrote:
 
 Hi all,
 
 The comments on this issue suggest that the Galaxy team is/were
 working on splitting large jobs over multiple nodes/CPUs:
 
 https://bitbucket.org/galaxy/galaxy-central/issue/79/split-large-jobs
 
 Is there any relevant page on the wiki I should be aware of?
 
 Specifically I am hoping for a general framework where one of the tool
 inputs can be marked as embarrassingly parallel meaning it can be
 subdivided easily (e.g. multiple sequences in FASTA or FASTQ format,
 multiple annotations in BED format, multiple lines in tabular format) and
 the outputs can all be easily combined (e.g. by concatenation in the
 same order as the input was split).
 
 Thanks,
 
 Peter
 ___
 Please keep all replies on the list by using reply all
 in your mail client.  To manage your subscriptions to this
 and other Galaxy lists, please use the interface at:
 
 http://lists.bx.psu.edu/
 
 ___
 Please keep all replies on the list by using reply all
 in your mail client.  To manage your subscriptions to this
 and other Galaxy lists, please use the interface at:
 
  http://lists.bx.psu.edu/


___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-15 Thread Peter Cock
On Wed, Feb 15, 2012 at 5:08 PM, Dannon Baker dannonba...@me.com wrote:
 It's definitely an experimental feature at this point, and there's no wiki,
 but basic support for breaking jobs into tasks does exist.  It needs a lot
 more work and can go in a few different directions to make it better,

Not what I was hoping to hear, but a promising start :)

 but check out the wrappers with <parallelism> defined, and enable
 use_tasked_jobs in your universe_wsgi.ini and restart.  That's all it
 should take from a fresh galaxy install to get, iirc, at least BWA and
 a few other tools working.  If you want a super trivial example to play
 with, change the tool .xml for a text tool like "change case" to have
 <parallelism method="basic"></parallelism> and give that a shot.

Excellent - that saved me searching blindly.

$ cd tools
$ grep parallelism */*.xml
samtools/sam_bitwise_flag_filter.xml:  <parallelism method="basic"></parallelism>
sr_mapping/bowtie_wrapper.xml:  <parallelism method="basic"></parallelism>
sr_mapping/bwa_color_wrapper.xml:  <parallelism method="basic"></parallelism>
sr_mapping/bwa_wrapper.xml:  <parallelism method="basic"></parallelism>

Are those four tools being used on Galaxy Main already with
this basic parallelism in place?

Looking at the code in lib/galaxy/jobs/splitters/basic.py, its
comments suggest it only works on tools with one input and
one output file (although that seems a bit fuzzy as you could
be using BWA with a FASTA history item as the reference -
would that fail?).

I also see interesting things in lib/galaxy/jobs/splitters/multi.py.
Is that even more experimental? It looks like it could be used
to say BWA's read file was to be split, but the reference file
shared.
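
I'd guess the declaration would look something like this (attribute
names guessed from a quick skim of multi.py, so possibly not exact):

    <parallelism method="multi" split_inputs="input1"
                 shared_inputs="reference" merge_outputs="output1">
    </parallelism>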

Regarding the merging of the output, I see there is a default merge
method in lib/galaxy/datatypes/data.py which just concatenates
the files. I am surprised at that - it seems like a very bad idea in
general - consider many binary files, or XML. Why not put this
as the default for text and subclasses thereof?

There is also one example where the merge method gets
overridden: lib/galaxy/datatypes/tabular.py, which avoids
repeating the headers when merging SAM files.

That should be enough clues to implement customized
merge code for other datatypes.
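
For example, something along these lines for a SAM-style merge (a
simplified sketch, not the actual tabular.py code):

    # Concatenate the split outputs, keeping "@" header lines from the
    # first file only so they are not repeated mid-merge.
    def merge_sam(split_files, output_file):
        with open(output_file, "w") as out:
            for i, name in enumerate(split_files):
                with open(name) as handle:
                    for line in handle:
                        if line.startswith("@") and i > 0:
                            continue  # skip repeated headers
                        out.write(line)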

 If you decide to try this out, do keep in mind that this feature is not
 at all complete and while there's a long list of things we still want
 to experiment with along these lines suggestions (and especially
 contributions) are absolutely welcome.

OK then, I hope to have a play with this shortly.

Thanks,

Peter

___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-15 Thread Dannon Baker
 Are those four tools being used on Galaxy Main already with
 this basic parallelism in place?

Main still runs these jobs in the standard non-split fashion, and as a
resource that is occasionally saturated (and thus doesn't necessarily have
extra resources to parallelize to) will probably continue doing so as long
as there's significant overhead involved in splitting the files. Fancy
scheduling could minimize the issue, but as it is during heavy load you
would actually have lower total throughput due to the splitting overhead.

 Looking at the code in lib/galaxy/jobs/splitters/basic.py, its
 comments suggest it only works on tools with one input and
 one output file (although that seems a bit fuzzy as you could
 be using BWA with a FASTA history item as the reference -
 would that fail?).

I haven't tried it, but probably.

 I also see interesting things in lib/galaxy/jobs/splitters/multi.py.
 Is that even more experimental? It looks like it could be used
 to say BWA's read file was to be split, but the reference file
 shared.

Yes.

 Regarding the merging of the output, I see there is a default merge
 method in lib/galaxy/datatypes/data.py which just concatenates
 the files. I am surprised at that - it seems like a very bad idea in
 general - consider many binary files, or XML. Why not put this
 as the default for text and subclasses thereof?

I can't think of a better reasonable default behavior for "Data", though
you're obviously right that each datatype subclass will need to define
particular behaviors for merging files.

 OK then, I hope to have a play with this shortly.

Good luck, let me know how it goes, and again - contributions are certainly
welcome :)

-Dannon
___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/