Re: [galaxy-dev] Preferred way of running a tool on multiple input files

2013-02-18 Thread Hagai Cohen
Hi John,
I'm using your Bitbucket fork. I still haven't finished all the needed work,
but meanwhile it works great.
The two tools you added, multi-upload & split, work great too and are
exactly what I needed.
(I only had to add one patch somewhere - I'm still trying to understand the
Galaxy code there.)

The distinction between a tool which accepts a single input file and a tool
which accepts a multi-input file is nice.
For now, I'm going to use this. I hope the official release will have a
similar feature in the future.

Thanks,
Hagai



Re: [galaxy-dev] Preferred way of running a tool on multiple input files

2013-02-12 Thread Joachim Jacob |VIB|

Hi Hagai,

Actually, using a workflow, you are able to select multiple input files
and let the workflow run separately on each input file.


I would proceed by creating a data library for all your fastq files,
which you can upload via FTP or via a system directory.
You can use a sample fastq file to build, in a history, the steps you
want to perform, and extract a workflow out of it.
Next, copy all fastq files from the data library into a new history, and
run your workflow on all the input files.
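
A minimal sketch of that copy-into-a-new-history step via the Galaxy API,
using the BioBlend client; the URL, API key, and library name below are
placeholders, not details from this thread:

from bioblend.galaxy import GalaxyInstance

# Placeholder URL and API key -- point these at your own Galaxy server.
gi = GalaxyInstance(url="http://localhost:8080", key="YOUR_API_KEY")

# Look up the data library holding the fastq files (the name is made up).
library = gi.libraries.get_libraries(name="fastq_runs")[0]

# Create a fresh history and copy every library dataset into it, ready
# for the extracted workflow to be run on all of them.
history = gi.histories.create_history(name="bowtie batch")
for item in gi.libraries.show_library(library["id"], contents=True):
    if item["type"] == "file":
        gi.histories.upload_dataset_from_library(history["id"], item["id"])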


I hope this helps you further,
Joachim


Joachim Jacob

Rijvisschestraat 120, 9052 Zwijnaarde
Tel: +32 9 244.66.34
Bioinformatics Training and Services (BITS)
http://www.bits.vib.be
@bitsatvib

On 02/12/2013 04:02 PM, Hagai Cohen wrote:

Hi,
I'm looking for a preferred way of running Bowtie (or any other tool)
on multiple input files and running statistics on the Bowtie output
afterwards.


The input is a directory of files fastq1..fastq100.
The Bowtie output should be bed1..bed100.
The statistics tool should run on bed1..bed100 and return xls1..xls100.
Then I will write a tool which will take xls1..xls100 and merge them into
one final output.


I searched for similar cases, and I couldn't find anyone who has had
this problem before.
I can't use the parallelism tag, because what would be the input for each
tool? It should be a fastq file, not a directory of fastq files.
Nor would I like to run each fastq file in a different workflow -
that creates a mess.


I have thought of only two solutions:
1. Implement new datatypes, bed_dir & fastq_dir, and implement new
tool wrappers which take a folder instead of a file.
2. Merge the input files before sending them to Bowtie, and use the
parallelism tag to have them split & merged again around each tool.


Does anyone have a better suggestion?

Thanks,
Hagai













Re: [galaxy-dev] Preferred way of running a tool on multiple input files

2013-02-12 Thread Hagai Cohen
Thanks for your answer.
I figured out that there is an option to run a workflow on multiple files,
but I can't merge the outputs afterwards. I would like the workflow to
return one final output.

But you gave me another idea.
Can I somehow tell one workflow to run on another workflow's output?
If this can be done, I can run 100 different workflows with Bowtie &
statistics, each working on one fastq file, then run another workflow which
gets 100 xls inputs and merges them into one.
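
A minimal sketch of that idea with the BioBlend client, launching the
per-sample workflow once per fastq dataset; the workflow and history names
are placeholders, and the workflow is assumed to have a single fastq input
step:

from bioblend.galaxy import GalaxyInstance

# Placeholder URL and API key.
gi = GalaxyInstance(url="http://localhost:8080", key="YOUR_API_KEY")

# The per-sample workflow (bowtie & statistics); the name is made up.
workflow = gi.workflows.get_workflows(name="bowtie and stats")[0]
# Assume the workflow has exactly one input step (the fastq file).
input_step = list(gi.workflows.show_workflow(workflow["id"])["inputs"])[0]

# Launch one workflow run per fastq dataset in the source history.
source = gi.histories.get_histories(name="fastq inputs")[0]
for ds in gi.histories.show_history(source["id"], contents=True):
    gi.workflows.run_workflow(
        workflow["id"],
        dataset_map={input_step: {"id": ds["id"], "src": "hda"}},
        history_name="run of %s" % ds["name"],
    )

# The outputs of all runs can then be copied into one history and handed
# to a final workflow wrapping the merging tool.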





Re: [galaxy-dev] Preferred way of running a tool on multiple input files

2013-02-12 Thread John Chilton
Hagai,

Jorrit Boekel and I have implemented essentially exactly what you described.

https://bitbucket.org/galaxy/galaxy-central/pull-request/116/multiple-file-datasets-implementation

Merge this into your Galaxy tree:
https://bitbucket.org/jmchilton/galaxy-central-multifiles-feb2013.
Switch use_composite_multfiles to true in universe_wsgi.ini. Then you
automatically get a multiple-file version of each of your datatypes
(so m:fastq, m:xls, etc.). Tools that process the singleton version of
a datatype can seamlessly process a multiple-file version of that
dataset in parallel, and the resulting outputs will be of the
multi-file type corresponding to the original output types.

These datasets can be created using the multi-file upload tool, a
directory on the FTP server, or via library imports through the API.
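
As a general illustration of the library-import-via-API route, a plain
BioBlend library directory import (the library name and server directory
are placeholders; the directory must live under the Galaxy server's
configured library_import_dir):

from bioblend.galaxy import GalaxyInstance

# Placeholder URL and API key.
gi = GalaxyInstance(url="http://localhost:8080", key="YOUR_API_KEY")

# Create a library and import every file from a directory on the server
# (the path is relative to the configured library_import_dir).
library = gi.libraries.create_library(name="fastq_runs")
gi.libraries.upload_file_from_server(
    library["id"],
    server_dir="run42_fastq",
    file_type="fastqsanger",
)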

Input names are preserved like you described.

Some huge caveats:
 - The Galaxy team has expressed reservations about this particular
implementation, so it will never be officially supported.
 - It's early days and this is very experimental (use at your own risk).
 - I am pretty sure it is not going to work with bed files, since
there is special logic in Galaxy to deal with bed indices (I think we
can work around it by declaring a concrete m:bed type and replicating
that logic; it's on the TODO list, but I'm happy to accept
contributions :) ).

More discussion of this can be found at these places:
http://www.youtube.com/watch?v=DxJzEkOasu4
https://bitbucket.org/galaxy/galaxy-central/pull-request/116/multiple-file-datasets-implementation
http://dev.list.galaxyproject.org/pass-more-information-on-a-dataset-merge-td4656455.html

-John





Re: [galaxy-dev] Preferred way of running a tool on multiple input files

2013-02-12 Thread Joachim Jacob |VIB|

You cannot directly couple different workflows.

But you could indeed copy all outputs of the different workflows into 
one history, and create a separate workflow with your tool to work on 
all those input files.
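
A minimal sketch of what such a merging tool's script might look like,
assuming the per-sample statistics are tab-separated tables that share a
header line (paths come from the command line; names are placeholders):

#!/usr/bin/env python
"""Concatenate per-sample statistics tables into one output file.

Usage: merge_stats.py OUTPUT INPUT [INPUT ...]
"""
import sys


def main():
    out_path, in_paths = sys.argv[1], sys.argv[2:]
    with open(out_path, "w") as out:
        for i, path in enumerate(in_paths):
            with open(path) as handle:
                header = handle.readline()
                if i == 0:           # keep the header only once
                    out.write(header)
                for line in handle:  # copy the data rows from every input
                    out.write(line)


if __name__ == "__main__":
    main()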


Cheers,
Joachim

Joachim Jacob

Rijvisschestraat 120, 9052 Zwijnaarde
Tel: +32 9 244.66.34
Bioinformatics Training and Services (BITS)
http://www.bits.vib.be
@bitsatvib



Re: [galaxy-dev] Preferred way of running a tool on multiple input files

2013-02-12 Thread Hagai Cohen
John, that seems great.
I will read this stuff and see if I can use it (the bed format isn't that
essential; Bowtie can output bam instead).

If it won't work, I will try the other solution, which doesn't require
changing Galaxy's own code (creating hundreds of workflow runs, linking to
their outputs, and running a last workflow with the merging tool - this
solution also distributes the work in a better way).

Because Galaxy is used a lot on sequencer output, I think someday it
should support this kind of job internally.
When I have a running solution, I will publish which solution I
used.

It's really great to know I'm not the first one to attack this problem.
Thanks for the advice.
Hagai









___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/