Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-23 Thread Peter Cock
On Wed, Feb 22, 2012 at 7:07 PM, Dannon Baker wrote:
> Awesome, I'll take a look.  And, if you're able to pull it together easily
> enough, clean branches are always nice.
>
> -Dannon

It is currently all on one new branch, which covers FASTA splitting
(ready), splitting in the BLAST+ wrappers (ready, bar merging
datatypes), and XML merging (may need more work). It has also
occurred to me that I may need to implement HTML merging (or even
remove HTML as a BLAST output option - do people use it?).

https://bitbucket.org/peterjc/galaxy-central/src/split_blast

All the commits should be self-contained, allowing the FASTA
splitting bits to be transplanted/cherry-picked. If you want, I'll do
that on a new branch focused on FASTA splitting only.

But before I do, I'd appreciate any initial comments you might
have from a first inspection.

Thanks,

Peter


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-22 Thread dannonbaker
Awesome, I'll take a look.  And, if you're able to pull it together
easily enough, clean branches are always nice.

-Dannon

On Feb 22, 2012, at 10:59 AM, Peter Cock wrote:

> Basic BLAST XML merging implemented and apparently working:
> https://bitbucket.org/peterjc/galaxy-central/changeset/ebf65c0b1e26
>
> This does not currently attempt to remap the iteration numbers or
> automatically assigned query names, e.g. you can have this kind of
> thing in the middle of the XML at a merge point:
>
>   <Iteration_iter-num>1</Iteration_iter-num>
>   <Iteration_query-ID>Query_1</Iteration_query-ID>
>
> That isn't a problem for some tools, e.g. my code in Galaxy to
> convert BLAST XML to tabular, but I suspect it could cause trouble
> elsewhere. If anyone has specific suggestions for what to test,
> that would be great.
>
> If this is an issue, then the merge code needs a little more work
> to edit these values.
>
> I think the FASTA split code could be reviewed for inclusion though.
> Dan - do you want to look at that? Would a clean branch help?
>
> Peter

Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-22 Thread Peter Cock
On Thu, Feb 16, 2012 at 9:02 PM, Peter wrote:
> On Thu, Feb 16, 2012 at 6:42 PM, Chris wrote:
>> On Feb 16, 2012, at 12:24 PM, Peter wrote:
>>> I also need to look at merging multiple BLAST XML outputs,
>>> but this is looking promising.
>>
>> Yep, that's definitely one where a simple concatenation
>> wouldn't work (though NCBI used to think so, years ago…)
>
> Well, given the NCBI's historic practice of producing 'XML'
> output which was the concatenation of several XML files,
> some tools will tolerate this out of practicality - the Biopython
> BLAST XML parser for example.
>
> But yes, some care is needed over the header/footer to
> ensure a valid XML output is created by the merge. This
> may also require renumbering queries... I will check.

Basic BLAST XML merging implemented and apparently working:
https://bitbucket.org/peterjc/galaxy-central/changeset/ebf65c0b1e26

This does not currently attempt to remap the iteration
numbers or automatically assigned query names, e.g.
you can have this kind of thing in the middle of the XML
at a merge point:

  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>Query_1</Iteration_query-ID>

That isn't a problem for some tools, e.g. my code in
Galaxy to convert BLAST XML to tabular, but I suspect
it could cause trouble elsewhere. If anyone has specific
suggestions for what to test, that would be great.

If this is an issue, then the merge code needs a little
more work to edit these values.
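
For reference, the merge itself is essentially just header/footer
bookkeeping - along these lines (a simplified sketch of the idea, not
the committed code; it assumes the tags sit on their own lines, as the
NCBI tools write them):

  def merge_blast_xml(split_files, output_file):
      out = open(output_file, "w")
      # Header plus the first file's <Iteration> blocks, stopping just
      # before the closing </BlastOutput_iterations> tag:
      for line in open(split_files[0]):
          if "</BlastOutput_iterations>" in line:
              break
          out.write(line)
      # Just the <Iteration> blocks from the remaining files:
      for name in split_files[1:]:
          inside = False
          for line in open(name):
              if "<BlastOutput_iterations>" in line:
                  inside = True
              elif "</BlastOutput_iterations>" in line:
                  break
              elif inside:
                  out.write(line)
      # Restore the tags we cut off above:
      out.write("</BlastOutput_iterations>\n</BlastOutput>\n")
      out.close()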

I think the FASTA split code could be reviewed for
inclusion though. Dan - do you want to look at that?
Would a clean branch help?

Peter



Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-20 Thread Dannon Baker
Peter has it right in that we need to do this internally to ensure 
functionality across a range of job runners.  A side benefit is that it gives 
us direct access to the tasks so that we can eventually do interesting things 
with scheduling, resubmission, feedback, etc.  If the overhead looks to be a 
performance issue I could see having an override that would allow pushing task 
scheduling to the underlying cluster, but that functionality would come later.

-Dannon

On Feb 20, 2012, at 3:13 AM, Peter Cock wrote:

> On Mon, Feb 20, 2012 at 8:08 AM, Bram Slabbinck wrote:
>> Hi Dannon,
>> 
>> If I may further elaborate on this issue, I would like to mention that this
>> kind of functionality is also supported by the Sun Grid Engine in the form
>> of 'array jobs'. With this functionality you can execute a job multiple
>> times in an independent way, only differing for instance in the parameter
>> settings. From your description below, it seems similar to the Galaxy
>> parallelism tag. Is there or do you foresee any implementation of this SGE
>> functionality through the drmaa interface in Galaxy? If not, is there
>> anybody who has achieved this through some custom coding? We
>> would be highly interested in this.
>> 
>> thanks
>> Bram
> 
> I was wondering why Galaxy submits N separate jobs to SGE
> after splitting (identical bar their working directory). I'm not sure if
> all the other supported cluster back ends can do this, but basic
> job dependencies are possible with SGE. That means the cluster could
> take care of scheduling the split jobs, the N processing jobs, and
> the final merge job (i.e. three stages where, for example, it won't
> do the merge until all the N processing jobs are finished).
> 
> My hunch is Galaxy is doing a lot of this 'housekeeping' internally
> in order to remain flexible regarding the cluster back end.
> 
> Peter



Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-20 Thread Peter Cock
On Mon, Feb 20, 2012 at 8:08 AM, Bram Slabbinck wrote:
> Hi Dannon,
>
> If I may further elaborate on this issue, I would like to mention that this
> kind of functionality is also supported by the Sun Grid Engine in the form
> of 'array jobs'. With this functionality you can execute a job multiple
> times in an independent way, only differing for instance in the parameter
> settings. From your description below, it seems similar to the Galaxy
> parallelism tag. Is there or do you foresee any implementation of this SGE
> functionality through the drmaa interface in Galaxy? If not, is there
> anybody who has achieved this through some custom coding? We
> would be highly interested in this.
>
> thanks
> Bram

I was wondering why Galaxy submits N separate jobs to SGE
after splitting (identical bar their working directory). I'm not sure if
all the other supported cluster back ends can do this, but basic
job dependencies are possible with SGE. That means the cluster could
take care of scheduling the split jobs, the N processing jobs, and
the final merge job (i.e. three stages where, for example, it won't
do the merge until all the N processing jobs are finished).
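
For example, with the DRMAA Python bindings against SGE I'd expect
something along these lines to work (an untested sketch - the two
shell scripts are hypothetical stand-ins for the chunk and merge
steps):

  import drmaa

  s = drmaa.Session()
  s.initialize()
  try:
      # Array job for the N processing steps; each task would use
      # $SGE_TASK_ID to pick its chunk of the input.
      jt = s.createJobTemplate()
      jt.remoteCommand = "./run_chunk.sh"    # hypothetical worker script
      task_ids = s.runBulkJobs(jt, 1, 4, 1)  # tasks 1 to 4
      array_id = task_ids[0].split(".")[0]   # SGE job id of the array
      # Merge job, held until every task in the array has finished:
      mt = s.createJobTemplate()
      mt.remoteCommand = "./merge_chunks.sh" # hypothetical merge script
      mt.nativeSpecification = "-hold_jid %s" % array_id
      s.runJob(mt)
      s.deleteJobTemplate(jt)
      s.deleteJobTemplate(mt)
  finally:
      s.exit()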

My hunch is Galaxy is doing a lot of this 'housekeeping' internally
in order to remain flexible regarding the cluster back end.

Peter


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-20 Thread Bram Slabbinck

Hi Dannon,

If I may further elaborate on this issue, I would like to mention that 
this kind of functionality is also supported by the Sun Grid Engine in 
the form of 'array jobs'. With this functionality you can execute a job 
multiple times in an independent way, only differing for instance in the 
parameter settings. From your description below, it seems similar to the 
Galaxy parallelism tag. Is there or do you foresee any implementation of 
this SGE functionality through the drmaa interface in Galaxy? If not, is 
there anybody who has achieved this through some custom coding? We would 
be highly interested in this.


thanks
Bram

On 15/02/2012 18:08, Dannon Baker wrote:

It's definitely an experimental feature at this point, and there's no
wiki, but basic support for breaking jobs into tasks does exist.  It
needs a lot more work and can go in a few different directions to make
it better, but check out the wrappers with <parallelism> defined, and
enable use_tasked_jobs in your universe_wsgi.ini and restart.  That's
all it should take from a fresh Galaxy install to get, iirc, at least
BWA and a few other tools working.  If you want a super trivial example
to play with, change the tool .xml for a text tool like "change case"
to have <parallelism method="basic"> and give that a shot.

If you decide to try this out, do keep in mind that this feature is not
at all complete, and while there's a long list of things we still want
to experiment with along these lines, suggestions (and especially
contributions) are absolutely welcome.

-Dannon

On Feb 15, 2012, at 11:36 AM, Peter Cock wrote:


Hi all,

The comments on this issue suggest that the Galaxy team is/were
working on splitting large jobs over multiple nodes/CPUs:

https://bitbucket.org/galaxy/galaxy-central/issue/79/split-large-jobs

Is there any relevant page on the wiki I should be aware of?

Specifically I am hoping for a general framework where one of the tool
inputs can be marked as "embarrassingly parallel" meaning it can be
subdivided easily (e.g. multiple sequences in FASTA or FASTQ format,
multiple annotations in BED format, multiple lines in tabular format) and
the outputs can all be easily combined (e.g. by concatenation in the
same order as the input was split).

Thanks,

Peter

--
==
Bram Slabbinck, PhD

Bioinformatics & Systems Biology Division
VIB Department of Plant Systems Biology, UGent
Technologiepark 927, 9052 Gent, BELGIUM

Email: bram.slabbi...@psb.ugent.be
WWW: http://bioinformatics.psb.ugent.be
==
Please consider the environment before printing this email



Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-17 Thread Peter Cock
On Thu, Feb 16, 2012 at 9:02 PM, Peter wrote:
> On Thu, Feb 16, 2012 at 6:42 PM, Chris wrote:
>> Cool!  Seems like a perfectly fine start.  I guess you could
>> grab the # of sequences from the dataset somehow (I'm
>> guessing that is set somehow upon import into Galaxy).
>
> Yes, I should be able to get that from Galaxy's metadata
> if known - much like how the FASTQ splitter works. It only
> needs to be an estimate anyway - which is what I think
> Galaxy does for large files - if we get it wrong then rather
> than using n sub-jobs as suggested, we might use n+1
> or n-1.

Done, and it seems to be working nicely now. If we don't
know the sequence count, I divide the file based on the
total size in bytes - which avoids any extra IO.
https://bitbucket.org/peterjc/galaxy-central/changeset/26a0c0aa776d
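
The byte-based fallback amounts to picking rough cut points and then
nudging each one forward to the next record header - something like
this (an illustrative sketch, not the committed code):

  import os

  def fasta_split_points(path, n):
      # Return n+1 byte offsets dividing the file into n parts of
      # roughly equal size, each cut moved to the start of the next
      # '>' header line so no record is broken.
      size = os.path.getsize(path)
      points = [0]
      handle = open(path, "rb")
      for i in range(1, n):
          handle.seek(i * size // n)
          handle.readline()  # skip the (probably partial) current line
          line = handle.readline()
          while line and not line.startswith(b">"):
              line = handle.readline()
          if line:
              points.append(handle.tell() - len(line))
      handle.close()
      points.append(size)
      return points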

Taking advantage of this, I have switched the BLAST tools
from splitting the query into batches of 500 sequences
(which worked fine, but only gave benefits for genome
scale queries) to simply splitting the query into four parts
(based on the sequence count if known, or the file size if
not). This way any multi-query BLAST will get divided and
run in parallel, not just the larger jobs. This gives a nice
improvement (over yesterday's progress) with small tasks
like 10 query sequences against a big database like NR or NT.
https://bitbucket.org/peterjc/galaxy-central/changeset/1fb89ae798be
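
In tool XML terms the change is roughly this (attribute names as used
on my branch, so treat them as provisional):

  before: <parallelism method="multi" split_inputs="query"
                       split_mode="to_size" split_size="500">
  after:  <parallelism method="multi" split_inputs="query"
                       split_mode="number_of_parts" split_size="4">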

Peter



Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Thu, Feb 16, 2012 at 6:42 PM, Fields, Christopher J wrote:
> On Feb 16, 2012, at 12:24 PM, Peter Cock wrote:
>> I've checked in my FASTA splitting, which now seems to be
>> working OK with my BLAST tests.

(If this was unclear, I mean checked into my branch - I don't
have commit privileges to the main repository. When/if this
is ready I'll ask for it to be merged in though.)

>> So far this only does splitting
>> into chunks of the requested number of sequences, rather than
>> the option to split the whole file into a given number of pieces.
>> https://bitbucket.org/peterjc/galaxy-central/changeset/416c961c0da9
>
> Cool!  Seems like a perfectly fine start.  I guess you could
> grab the # of sequences from the dataset somehow (I'm
> guessing that is set somehow upon import into Galaxy).

Yes, I should be able to get that from Galaxy's metadata
if known - much like how the FASTQ splitter works. It only
needs to be an estimate anyway - which is what I think
Galaxy does for large files - if we get it wrong then rather
than using n sub-jobs as suggested, we might use n+1
or n-1.

>> I also need to look at merging multiple BLAST XML outputs,
>> but this is looking promising.
>
> Yep, that's definitely one where a simple concatenation
> wouldn't work (though NCBI used to think so, years ago…)

Well, given the NCBI's historic practice of producing 'XML'
output which was the concatenation of several XML files,
some tools will tolerate this out of practicality - the Biopython
BLAST XML parser for example.

But yes, some care is needed over the header/footer to
ensure a valid XML output is created by the merge. This
may also require renumbering queries... I will check.

Peter



Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Fields, Christopher J
On Feb 16, 2012, at 12:24 PM, Peter Cock wrote:

> On Thu, Feb 16, 2012 at 4:28 PM, Peter Cock wrote:
>> Hi Dan,
>> 
>> I think I need a little more advice - what is the role of the script
>> scripts/extract_dataset_part.py and the JSON files created
>> when splitting FASTQ files in lib/galaxy/datatypes/sequence.py,
>> and then used by the class' process_split_file method?
>> 
>> Why is there no JSON file created by the base data class in
>> lib/galaxy/datatypes/data.py and no method process_split_file?
>> 
>> Is the JSON thing part of a partial and unfinished rewrite of the
>> splitter code?
>> 
>> On the assumption that not all splitters bother with the JSON,
>> I am trying a little hack to scripts/extract_dataset_part.py to
>> abort silently if there is no JSON file:
>> https://bitbucket.org/peterjc/galaxy-central/changeset/ebe94a2c25c3
>> 
>> This seems to be working with my current attempt at a FASTA
>> splitter (not checked in yet, only partly implemented and tested).
> 
> I've checked in my FASTA splitting, which now seems to be
> working OK with my BLAST tests. So far this only does splitting
> into chunks of the requested number of sequences, rather than
> the option to split the whole file into a given number of pieces.
> https://bitbucket.org/peterjc/galaxy-central/changeset/416c961c0da9

Cool!  Seems like a perfectly fine start.  I guess you could grab the # of 
sequences from the dataset somehow (I'm guessing that is set somehow upon 
import into Galaxy).

> I also need to look at merging multiple BLAST XML outputs, but
> this is looking promising.
> 
> Peter

Yep, that's definitely one where a simple concatenation wouldn't work (though 
NCBI used to think so, years ago…)

chris


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Dannon Baker
Very cool, I'll check it out!  The addition of the JSON files is indeed very 
new and was likely unfinished with respect to the base splitter.

-Dannon

On Feb 16, 2012, at 1:24 PM, Peter Cock wrote:

> On Thu, Feb 16, 2012 at 4:28 PM, Peter Cock wrote:
>> Hi Dan,
>> 
>> I think I need a little more advice - what is the role of the script
>> scripts/extract_dataset_part.py and the JSON files created
>> when splitting FASTQ files in lib/galaxy/datatypes/sequence.py,
>> and then used by the class' process_split_file method?
>> 
>> Why is there no JSON file created by the base data class in
>> lib/galaxy/datatypes/data.py and no method process_split_file?
>> 
>> Is the JSON thing part of a partial and unfinished rewrite of the
>> splitter code?
>> 
>> On the assumption that not all splitters bother with the JSON,
>> I am trying a little hack to scripts/extract_dataset_part.py to
>> abort silently if there is no JSON file:
>> https://bitbucket.org/peterjc/galaxy-central/changeset/ebe94a2c25c3
>> 
>> This seems to be working with my current attempt at a FASTA
>> splitter (not checked in yet, only partly implemented and tested).
> 
> I've checked in my FASTA splitting, which now seems to be
> working OK with my BLAST tests. So far this only does splitting
> into chunks of the requested number of sequences, rather than
> the option to split the whole file into a given number of pieces.
> https://bitbucket.org/peterjc/galaxy-central/changeset/416c961c0da9
> 
> I also need to look at merging multiple BLAST XML outputs, but
> this is looking promising.
> 
> Peter



Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Thu, Feb 16, 2012 at 4:28 PM, Peter Cock wrote:
> Hi Dan,
>
> I think I need a little more advice - what is the role of the script
> scripts/extract_dataset_part.py and the JSON files created
> when splitting FASTQ files in lib/galaxy/datatypes/sequence.py,
> and then used by the class' process_split_file method?
>
> Why is there no JSON file created by the base data class in
> lib/galaxy/datatypes/data.py and no method process_split_file?
>
> Is the JSON thing part of a partial and unfinished rewrite of the
> splitter code?
>
> On the assumption that not all splitters bother with the JSON,
> I am trying a little hack to scripts/extract_dataset_part.py to
> abort silently if there is no JSON file:
> https://bitbucket.org/peterjc/galaxy-central/changeset/ebe94a2c25c3
>
> This seems to be working with my current attempt at a FASTA
> splitter (not checked in yet, only partly implemented and tested).

I've checked in my FASTA splitting, which now seems to be
working OK with my BLAST tests. So far this only does splitting
into chunks of the requested number of sequences, rather than
the option to split the whole file into a given number of pieces.
https://bitbucket.org/peterjc/galaxy-central/changeset/416c961c0da9

I also need to look at merging multiple BLAST XML outputs, but
this is looking promising.

Peter


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
Hi Dan,

I think I need a little more advice - what is the role of the script
scripts/extract_dataset_part.py and the JSON files created
when splitting FASTQ files in lib/galaxy/datatypes/sequence.py,
and then used by the class' process_split_file method?

Why is there no JSON file created by the base data class in
lib/galaxy/datatypes/data.py and no method process_split_file?

Is the JSON thing part of a partial and unfinished rewrite of the
splitter code?

On the assumption that not all splitters bother with the JSON,
I am trying a little hack to scripts/extract_dataset_part.py to
abort silently if there is no JSON file:
https://bitbucket.org/peterjc/galaxy-central/changeset/ebe94a2c25c3
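
In outline the hack is just an early exit, something like this
(simplified; the real script obviously does more):

  import os
  import sys

  def main():
      json_path = sys.argv[1]  # per-task JSON written by the splitter
      if not os.path.isfile(json_path):
          # This splitter wrote no JSON instructions - nothing to
          # extract, so exit quietly instead of failing the task.
          return 0
      # ... otherwise load the JSON and extract the part as before ...
      return 0

  if __name__ == "__main__":
      sys.exit(main())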

This seems to be working with my current attempt at a FASTA
splitter (not checked in yet, only partly implemented and tested).

Peter


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Thu, Feb 16, 2012 at 1:53 PM, Fields, Christopher J wrote:
>
> Makes sense from my perspective; splits have to be defined based on
> data type.  It could be as low-level as defining a simple iterator per
> record, then a wrapper that allows a specific chunk-size.  The split
> file creation could almost be abstracted completely away into a
> common method.

I'm trying to understand exactly how the current code creates the
splits, but yes - something like that is what I would expect.

> As Peter implies, maybe a simple API for defining a split method
> would be all that is needed.  Might also be useful on any merge
> step, 'cat'-like merges won't work for every format but would be
> a suitable default.

Yes, for a lot of file types concatenation is fine. Again, like the
splitting, this has to be (and is) defined at the datatype level (which
is a hierarchy of classes in Galaxy).

Peter



Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Fields, Christopher J
On Feb 16, 2012, at 4:47 AM, Peter Cock wrote:

> On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker wrote:
>> Good luck, let me know how it goes, and again - contributions are certainly
>> welcome :)
> 
> I think I found the first bug: the split method in
> lib/galaxy/datatypes/sequence.py for the Sequence class assumes four
> lines per sequence. This would make sense as the split method of the
> Fastq class (after grooming to remove any line wrapping) but is a very
> bad idea for most other sequence file formats (e.g. FASTA).
> 
> It looks like a little refactoring is needed: defining a Sequence split
> method which raises NotImplementedError, moving the current code to the
> Fastq class, then writing something similar but allowing multiple lines
> per record for the Fasta class.
> 
> Does that sound reasonable? I'll do this on a new branch for review...
> 
> Peter

Makes sense from my perspective; splits have to be defined based on data type.  
It could be as low-level as defining a simple iterator per record, then a 
wrapper that allows a specific chunk-size.  The split file creation could 
almost be abstracted completely away into a common method.
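
In rough Python, the sort of thing I mean - a per-record iterator with
a trivial chunking wrapper on top (sketch only):

  def fasta_records(handle):
      # One record = a '>' header line plus its sequence lines.
      record = []
      for line in handle:
          if line.startswith(">") and record:
              yield record
              record = []
          record.append(line)
      if record:
          yield record

  def in_chunks(handle, chunk_size):
      # Batch the records into groups of chunk_size for the split files.
      batch = []
      for rec in fasta_records(handle):
          batch.append(rec)
          if len(batch) == chunk_size:
              yield batch
              batch = []
      if batch:
          yield batch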

As Peter implies, maybe a simple API for defining a split method would be all 
that is needed.  Might also be useful on any merge step, 'cat'-like merges 
won't work for every format but would be a suitable default.

chris


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Thu, Feb 16, 2012 at 10:47 AM, Peter Cock wrote:
> On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker wrote:
>> Good luck, let me know how it goes, and again - contributions are certainly
>> welcome :)
>
> I think I found the first bug: the split method in
> lib/galaxy/datatypes/sequence.py for the Sequence class assumes four
> lines per sequence. This would make sense as the split method of the
> Fastq class (after grooming to remove any line wrapping) but is a very
> bad idea for most other sequence file formats (e.g. FASTA).
>
> It looks like a little refactoring is needed: defining a Sequence split
> method which raises NotImplementedError, moving the current code to the
> Fastq class, then writing something similar but allowing multiple lines
> per record for the Fasta class.
>
> Does that sound reasonable? I'll do this on a new branch for review...

Refactoring of the lib/galaxy/datatypes/sequence.py split method is here:
https://bitbucket.org/peterjc/galaxy-central/changeset/762777618073

This is part of a work-in-progress "split_blast" branch to try splitting
BLAST jobs, for which I will need to split FASTA files as inputs, and
also merge BLAST XML output:
https://bitbucket.org/peterjc/galaxy-central/src/split_blast

Peter


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker wrote:
> Good luck, let me know how it goes, and again - contributions are certainly
> welcome :)

I think I found the first bug: the split method in
lib/galaxy/datatypes/sequence.py for the Sequence class assumes four
lines per sequence. This would make sense as the split method of the
Fastq class (after grooming to remove any line wrapping) but is a very
bad idea for most other sequence file formats (e.g. FASTA).

It looks like a little refactoring is needed: defining a Sequence split
method which raises NotImplementedError, moving the current code to the
Fastq class, then writing something similar but allowing multiple lines
per record for the Fasta class.
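
i.e. A skeleton along these lines (simplified - I'm writing the method
signature from memory, so it may differ from what Galaxy actually uses):

  class Sequence(object):
      @classmethod
      def split(cls, input_datasets, subdir_generator_function, split_params):
          # No safe generic way to split an arbitrary sequence file.
          raise NotImplementedError("Can't split generic sequence files")

  class Fastq(Sequence):
      @classmethod
      def split(cls, input_datasets, subdir_generator_function, split_params):
          pass  # the current four-lines-per-record code moves here

  class Fasta(Sequence):
      @classmethod
      def split(cls, input_datasets, subdir_generator_function, split_params):
          pass  # new code: start a record at each '>' header line,
                # allowing any number of sequence lines per record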

Does that sound reasonable? I'll do this on a new branch for review...

Peter


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Dannon Baker

On Feb 16, 2012, at 5:15 AM, Peter Cock wrote:

> On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker wrote:
>> 
>> Main still runs these jobs in the standard non-split fashion, and as a
>> resource that is occasionally saturated (and thus doesn't necessarily have
>> extra resources to parallelize to) will probably continue doing so as long
>> as there's significant overhead involved in splitting the files.  Fancy
>> scheduling could minimize the issue, but as it is during heavy load you
>> would actually have lower total throughput due to the splitting overhead.
>> 
> 
> Because the splitting (currently) happens on the main server?

No, because the splitting process is work which has to happen somewhere.
Ignoring possible benefits from things that haven't been implemented yet,
in a situation where your cluster is saturated with work you are unable
to take advantage of the parallelism, and splitting files apart only adds
more work, reducing total job throughput.  That the splitting always
happens on the head node is not ideal, and needs to be configurable.  I
have a fork somewhere that attempts to address this, but it needs work.


Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-16 Thread Peter Cock
On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker wrote:
>
> Main still runs these jobs in the standard non-split fashion, and as a
> resource that is occasionally saturated (and thus doesn't necessarily have
> extra resources to parallelize to) will probably continue doing so as long
> as there's significant overhead involved in splitting the files.  Fancy
> scheduling could minimize the issue, but as it is during heavy load you
> would actually have lower total throughput due to the splitting overhead.
>

Because the splitting (currently) happens on the main server?

>> Regarding the merging of the output, I see there is a default merge
>> method in lib/galaxy/datatypes/data.py which just concatenates
>> the files. I am surprised at that - it seems like a very bad idea in
>> general - consider many binary files, or XML. Why not put this
>> as the default for text and subclasses thereof?
>
> I can't think of a better reasonable default behavior for "Data", though
> you're obviously right that each datatype subclass will need to define
> particular behaviors for merging files.

The default should raise an error (and better yet, refuse to do the
split in the first place). Zen of Python: In the face of ambiguity,
refuse the temptation to guess.
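
i.e. For the base datatype something as blunt as this (give or take
the signature Galaxy actually uses):

  def merge(split_files, output_file):
      # Fail loudly rather than silently concatenating arbitrary data.
      raise NotImplementedError("Merging is not supported for this datatype")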

Peter



Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-15 Thread Dannon Baker
> Are those four tools being used on Galaxy Main already with this
> basic parallelism in place?

Main still runs these jobs in the standard non-split fashion, and as a
resource that is occasionally saturated (and thus doesn't necessarily
have extra resources to parallelize to) will probably continue doing so
as long as there's significant overhead involved in splitting the files.
Fancy scheduling could minimize the issue, but as it is during heavy
load you would actually have lower total throughput due to the
splitting overhead.

> Looking at the code in lib/galaxy/jobs/splitters/basic.py, its
> comments suggest it only works on tools with one input and one output
> file (although that seems a bit fuzzy, as you could be using BWA with
> a FASTA history item as the reference - would that fail?).

I haven't tried it, but probably.

> I see also interesting things in lib/galaxy/jobs/splitters/multi.py.
> Is that even more experimental? It looks like it could be used to say
> BWA's read file was to be split, but the reference file shared.

Yes.

> Regarding the merging of the output, I see there is a default merge
> method in lib/galaxy/datatypes/data.py which just concatenates the
> files. I am surprised at that - it seems like a very bad idea in
> general - consider many binary files, or XML. Why not put this as the
> default for text and subclasses thereof?

I can't think of a better reasonable default behavior for "Data", though
you're obviously right that each datatype subclass will need to define
particular behaviors for merging files.

> OK then, I hope to have a play with this shortly.

Good luck, let me know how it goes, and again - contributions are
certainly welcome :)

-Dannon

Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-15 Thread Peter Cock
On Wed, Feb 15, 2012 at 5:08 PM, Dannon Baker wrote:
> It's definitely an experimental feature at this point, and there's no wiki,
> but basic support for breaking jobs into tasks does exist.  It needs a lot
> more work and can go in a few different directions to make it better,

Not what I was hoping to hear, but a promising start :)

> but check out the wrappers with <parallelism> defined, and enable
> use_tasked_jobs in your universe_wsgi.ini and restart.  That's all it
> should take from a fresh Galaxy install to get, iirc, at least BWA and
> a few other tools working.  If you want a super trivial example to play
> with, change the tool .xml for a text tool like "change case" to have
> <parallelism method="basic"> and give that a shot.

Excellent - that saved me searching blindly.

$ cd tools
$ grep parallelism */*.xml
samtools/sam_bitwise_flag_filter.xml:  <parallelism method="basic"></parallelism>
sr_mapping/bowtie_wrapper.xml:  <parallelism method="basic"></parallelism>
sr_mapping/bwa_color_wrapper.xml:  <parallelism method="basic"></parallelism>
sr_mapping/bwa_wrapper.xml:  <parallelism method="basic"></parallelism>

Are those four tools being used on Galaxy Main already with
this basic parallelism in place?

Looking at the code in lib/galaxy/jobs/splitters/basic.py, its
comments suggest it only works on tools with one input and
one output file (although that seems a bit fuzzy, as you could
be using BWA with a FASTA history item as the reference -
would that fail?).

I see also interesting things in lib/galaxy/jobs/splitters/multi.py.
Is that even more experimental? It looks like it could be used
to say BWA's read file was to be split, but the reference file
shared.

Regarding the merging of the output, I see there is a default merge
method in lib/galaxy/datatypes/data.py which just concatenates
the files. I am surprised at that - it seems like a very bad idea in
general - consider many binary files, or XML. Why not put this
as the default for text and subclasses thereof?

There is also one example where the merge method gets
overridden: lib/galaxy/datatypes/tabular.py, which avoids
repeating the headers when merging SAM files.

That should be enough clues to implement other customized
merge code for other datatypes.
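
For a header-based text format the override boils down to something
like this (my sketch, not the actual tabular.py code):

  def merge_sam(split_files, output_file):
      out = open(output_file, "w")
      for i, name in enumerate(split_files):
          for line in open(name):
              # Keep the '@' header lines from the first part only.
              if i and line.startswith("@"):
                  continue
              out.write(line)
      out.close()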

> If you decide to try this out, do keep in mind that this feature is not
> at all complete, and while there's a long list of things we still want
> to experiment with along these lines, suggestions (and especially
> contributions) are absolutely welcome.

OK then, I hope to have a play with this shortly.

Thanks,

Peter



Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-15 Thread Fields, Christopher J
Ah, was just about to ask about this as well, nice to know something is already 
in place (as experimental as it might be). Thanks Dannon!

chris

On Feb 15, 2012, at 11:08 AM, Dannon Baker wrote:

> It's definitely an experimental feature at this point, and there's no
> wiki, but basic support for breaking jobs into tasks does exist.  It
> needs a lot more work and can go in a few different directions to make
> it better, but check out the wrappers with <parallelism> defined, and
> enable use_tasked_jobs in your universe_wsgi.ini and restart.  That's
> all it should take from a fresh Galaxy install to get, iirc, at least
> BWA and a few other tools working.  If you want a super trivial example
> to play with, change the tool .xml for a text tool like "change case"
> to have <parallelism method="basic"> and give that a shot.
> 
> If you decide to try this out, do keep in mind that this feature is not
> at all complete, and while there's a long list of things we still want
> to experiment with along these lines, suggestions (and especially
> contributions) are absolutely welcome.
> 
> -Dannon
> 
> On Feb 15, 2012, at 11:36 AM, Peter Cock wrote:
> 
>> Hi all,
>> 
>> The comments on this issue suggest that the Galaxy team is/were
>> working on splitting large jobs over multiple nodes/CPUs:
>> 
>> https://bitbucket.org/galaxy/galaxy-central/issue/79/split-large-jobs
>> 
>> Is there any relevant page on the wiki I should be aware of?
>> 
>> Specifically I am hoping for a general framework where one of the tool
>> inputs can be marked as "embarrassingly parallel" meaning it can be
>> subdivided easily (e.g. multiple sequences in FASTA or FASTQ format,
>> multiple annotations in BED format, multiple lines in tabular format) and
>> the outputs can all be easily combined (e.g. by concatenation in the
>> same order as the input was split).
>> 
>> Thanks,
>> 
>> Peter




Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

2012-02-15 Thread Dannon Baker
It's definitely an experimental feature at this point, and there's no
wiki, but basic support for breaking jobs into tasks does exist.  It
needs a lot more work and can go in a few different directions to make
it better, but check out the wrappers with <parallelism> defined, and
enable use_tasked_jobs in your universe_wsgi.ini and restart.  That's
all it should take from a fresh Galaxy install to get, iirc, at least
BWA and a few other tools working.  If you want a super trivial example
to play with, change the tool .xml for a text tool like "change case"
to have <parallelism method="basic"> and give that a shot.
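
Concretely that's just two small edits. In universe_wsgi.ini:

  use_tasked_jobs = True

and in the tool XML (tools/filters/changeCase.xml for "change case",
assuming a default install layout), inside the <tool> element:

  <parallelism method="basic"></parallelism>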

If you decide to try this out, do keep in mind that this feature is not
at all complete, and while there's a long list of things we still want
to experiment with along these lines, suggestions (and especially
contributions) are absolutely welcome.

-Dannon

On Feb 15, 2012, at 11:36 AM, Peter Cock wrote:

> Hi all,
> 
> The comments on this issue suggest that the Galaxy team is/were
> working on splitting large jobs over multiple nodes/CPUs:
> 
> https://bitbucket.org/galaxy/galaxy-central/issue/79/split-large-jobs
> 
> Is there any relevant page on the wiki I should be aware of?
> 
> Specifically I am hoping for a general framework where one of the tool
> inputs can be marked as "embarrassingly parallel" meaning it can be
> subdivided easily (e.g. multiple sequences in FASTA or FASTQ format,
> multiple annotations in BED format, multiple lines in tabular format) and
> the outputs can all be easily combined (e.g. by concatenation in the
> same order as the input was split).
> 
> Thanks,
> 
> Peter
