Re: parallel + blast + LSF

Giuseppe Aprea Wed, 15 Apr 2015 12:51:18 -0700

Hi George!

I am not sure who you are talking to: Martin or me? As a reminder, the original
topic is about using blast under parallel with LSF.
Martin's problem sounds off-topic.


You have both sysadmin and bioinformatics experience so I would really
appreciate your help!

I am working on a cluster, so I must use LSF to get slots, and I would prefer
to use parallel as well, since it splits the input automatically with --recstart
(which is quite nice :D otherwise I would have to use another script for that). I
see I could do better with the chunk size (I have 1 record at a time in my
example), but that's a secondary problem for now. First I have to solve the
"lsb_launch(): Failed while waiting for tasks to finish." issue.

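On the chunk size point, here is a minimal sketch of handing several FASTA records to each job, assuming GNU parallel is installed. The -N value of 2, the five-record test file and the use of grep -c as a stand-in for blastp are illustrative choices, not taken from the real job.

```shell
#!/bin/bash
# Build a small 5-record FASTA file (made-up sequences, for illustration only).
fasta=$(mktemp)
for i in 1 2 3 4 5; do
    printf '>seq%s\nMKVLAA\n' "$i"
done > "$fasta"

records=$(grep -c '^>' "$fasta")
echo "records in input: ${records}"

# With --pipe, -N sets how many records each job receives on stdin.
# grep -c stands in for blastp here and prints the number of records
# each job actually got.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    per_job=$(parallel --pipe --recstart '>' -N 2 -k "grep -c '^>'" < "$fasta")
    echo "records per job:"
    echo "${per_job}"
fi
```

For a real run, a larger -N (or --block with a size) would be the knob to tune; the right value depends on the database and on how long a single query takes.
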
cheers,

g



On Wed, Apr 15, 2015 at 7:44 PM, George Marselis <[email protected]> wrote:

> By the way, LSF and GNU parallel do almost the same thing, so using one of
> the two defeats the purpose of using the other.
>
> In the same way, you could have used bsub alone to submit your jobs to LSF:
>
> bsub < script.sh
>
> where script.sh contains
>
> bsub -J amoeba -q smalljobs  qfasta file1
> bsub -J amoeba -q smalljobs  qfasta file2
> ...
> bsub -J amoeba -q smalljobs  qfasta file2000
>
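
A minimal sketch of generating such a script.sh with a loop rather than writing the 2000 lines by hand. The job name, queue and qfasta command are the ones from the example above; the file1 ... file2000 naming is assumed to continue in the obvious way, and the final submission is left commented out.

```shell
#!/bin/bash
# Generate one bsub line per input file (file1 ... file2000).
for i in $(seq 1 2000); do
    echo "bsub -J amoeba -q smalljobs qfasta file${i}"
done > script.sh

# Quick sanity check before submitting anything.
echo "lines written: $(wc -l < script.sh)"
head -n 2 script.sh

# On the cluster, the whole file is then handed to LSF in one go:
# bsub < script.sh
```
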
> On Wed, Apr 15, 2015 at 8:39 PM, George Marselis <[email protected]> wrote:
>
>> Hi. LSF/Openlava sysadmin in bioinformatics and parallel user here.
>>
>> I have seen this a couple of times before: you are trying to use GNU parallel
>> to submit the jobs to all nodes.
>>
>> That's not the way to do things: you should not submit jobs on *all* your
>> nodes. Please don't do that, as bsub was not designed to read large chunks
>> of jobs. bsub writes the jobs to your home directory, so if your storage is
>> not designed for a lot of writes, you are going to blow the cluster out of
>> the water.
>>
>> What you want to do is look up either:
>>
>> 1. bsub scripts
>> https://rc.fas.harvard.edu/resources/documentation/legacy-lsf/lsf-submit-an-lsf-job/
>>
>> or
>>
>> 2. job arrays
>> https://rc.fas.harvard.edu/resources/documentation/legacy-lsf/lsf-submitting-lots-of-short-jobs-job-arrays/
>>
>> Both bsub scripts and job arrays are useful to you. bsub scripts can be
>> submitted as part of a pipeline: your pipeline can generate the bsub
>> script and then submit it to bsub. So, instead of
>> submitting your job 2000 times as in
>>
>> bsub job0
>> bsub job1
>>
>> ....
>>
>> bsub job1999
>>
>> you just submit "bsub < scriptname", where scriptname contains 2000 lines that
>> describe your jobs, and you are done. The rest is done by bsub/LSF.
>>
>>
>> Now, if your jobs are similar in that you just increment a counter
>> (as in most bioinformatics jobs), use job arrays:
>>
>> bsub -J "JOBNAME[1-2000]", where JOBNAME is the string you would like to
>> name your job, e.g. "fasta files alignment". LSF array indices must be
>> positive integers, so the range starts at 1 rather than 0.
>>
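
A sketch of what a job array script could look like, assuming input files named file1 ... file2000 and a placeholder command called qfasta (both hypothetical here). LSF sets LSB_JOBINDEX to the array index of each element and expands %I in output file names; the fallback to index 1 is only there so the script can be dry-run outside LSF.

```shell
#!/bin/bash
# Write the array job script; the quoted 'EOF' keeps variables unexpanded.
cat > array_job.sh <<'EOF'
#!/bin/bash
#BSUB -J "amoeba[1-2000]"      # 2000 array elements; indices are positive integers
#BSUB -q smalljobs             # queue name (placeholder)
#BSUB -o amoeba.%J.%I.out      # %J = job ID, %I = array index

# LSF sets LSB_JOBINDEX for each array element; default to 1 outside LSF.
index=${LSB_JOBINDEX:-1}
# Print the command instead of running it, since qfasta is a placeholder.
echo "qfasta file${index}"
EOF

# Dry run outside LSF, pretending to be array element 7.
dry_run=$(LSB_JOBINDEX=7 bash array_job.sh)
echo "${dry_run}"

# On the cluster, a single submission creates all 2000 elements:
# bsub < array_job.sh
```
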
>>
>> These techniques are useful because you can submit all 2000 jobs in less
>> than a second, you can do it from a single node and you will not have to
>> deal with a grumpy sysadmin or grumpy colleagues who cannot use the
>> cluster. Just make sure you use the appropriate queue.
>>
>> Let me know if you have any questions.
>>
>> Best Regards,
>>
>> George Marselis
>>
>> On Wed, Apr 15, 2015 at 6:48 PM, Martin d'Anjou <
>> [email protected]> wrote:
>>
>>>  Hi,
>>>
>>> Thanks for clarifying. I want to use GNU Parallel to bsub jobs. This way
>>> I can use GNU Parallel to throttle the number of jobs that are submitted to
>>> LSF, and it is easier than writing a loop.
>>>
>>> parallel -j 100 my_script [bsub options] ::: {1..2000}
>>>
>>> my_script (pseudo-code):
>>> #!/bin/bash
>>> ...
>>> bsub [bsub options] command ...
>>> post-process data
>>>
>>> This way I can submit jobs, say 100 at a time. When I submit all 2000
>>> jobs, it gets problematic and I start hitting limits with file descriptors,
>>> etc.
>>>
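
A hedged sketch of that throttling idea, assuming each submission uses bsub -K, which makes bsub wait until the job finishes so that GNU Parallel's -j 100 caps the number of jobs in flight. The queue, command and input names are placeholders, and --dry-run only prints the commands instead of submitting them.

```shell
#!/bin/bash
# Command template: {} is replaced by GNU Parallel with each argument.
# The queue, qfasta and file names are placeholders, not real settings.
submit_cmd='bsub -K -q smalljobs qfasta file{}'

# Only run when GNU Parallel (not the moreutils tool of the same name) is found.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # --dry-run prints the commands; drop it to actually submit.
    # -j 100 keeps at most 100 bsub -K processes running at once.
    preview=$(parallel --dry-run -k -j 100 "$submit_cmd" ::: 1 2 3)
    echo "${preview}"
fi
```

Any post-processing step could be chained after the bsub -K call in the template, since bsub -K only returns once the job has completed.
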
>>> Thanks for sharing,
>>> Martin
>>>
>>>
>>> On 15-04-15 11:35 AM, Giuseppe Aprea wrote:
>>>
>>> Hi Martin,
>>>
>>>  I am not sure I understand. As far as I can see, things work exactly
>>> the opposite way: you have an LSF script which launches GNU Parallel on
>>> some hosts provided by LSF. Something like:
>>>
>>>
>>> -------------------------------------------------------------------------------
>>> #!/bin/bash
>>>
>>> #BSUB -J gnuParallel_blast_test   # Name of the job.
>>> #BSUB -o %J.out                   # Appends std output to file %J.out. (%J is the Job ID)
>>> #BSUB -e %J.err                   # Appends std error to file %J.err.
>>> #BSUB -q large                    # Queue name.
>>> #BSUB -n 30                       # Number of CPUs.
>>>
>>> module load 4.8.3/ncbi/12.0.0
>>> module load 4.8.3/parallel/20150122
>>>
>>> # The LSF host file has one line per allocated slot.
>>> SLOTS=$(wc -l < "${LSB_DJOB_HOSTFILE}")
>>>
>>> SERVER=""
>>>
>>> # Build the login file for parallel: one blaunch wrapper entry per slot.
>>> for i in $(sort "${LSB_DJOB_HOSTFILE}")
>>> do
>>>     echo "/afs/enea.it/software/bin/blaunch.sh ${i}" >> servers
>>> done
>>>
>>> cat absolute_path_to_sequences.fasta | parallel --no-notice -vv -j ${SLOTS} \
>>>     --slf servers --plain --recstart '>' -N 1 --pipe \
>>>     blastp -evalue 1e-05 -outfmt 6 -db absolute_path_to_db_file -query - \
>>>     -out absolute_path_to_result_file_{%}
>>>
>>> -------------------------------------------------------------------------------
>>>
>>> LSF is the one that gives you the execution hosts, so if you are
>>> launching bsub from GNU parallel, how do you know what to put in the --slf
>>> option?
>>>
>>>
>>>  g
>>>
>>>
>>>
>>>   On Wed, Apr 15, 2015 at 4:24 PM, Martin d'Anjou <
>>> [email protected]> wrote:
>>>
>>>> On 15-04-15 09:34 AM, Giuseppe Aprea wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I would like to ask for some help using parallel with the blast
>>>>> alignment software.
>>>>>
>>>>>
>>>>> I am trying to use GNU parallel v. 20150122 with blast for a very
>>>>> large sequence alignment. I am using parallel on a cluster which uses LSF
>>>>> as its queue system.
>>>>>
>>>>
>>>>  Hello Giuseppe,
>>>>
>>>> I am an avid LSF user, and I want to use GNU Parallel to dispatch jobs
>>>> to LSF. Could you please explain a little bit to me how GNU Parallel works
>>>> with LSF? I do not see it in the on-line tutorials. For example, I would
>>>> like to understand how to pass "bsub" options like -oo, -q queue_name, etc.
>>>> to LSF from GNU Parallel.
>>>>
>>>> Thanks,
>>>> Martin
>>>>
>>>>
>>>>
>>>
>>>
>>
>
