On Sun, Apr 3, 2011 at 9:56 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Apr 3, 2011, at 8:14 AM, Laurence Marks wrote:
>
>> Let me expand on this slightly (in response to Ralph Castain's posting
>> -- I had digest mode set). As currently constructed, a shell script in
>> Wien2k (www.wien2k.at) launches a series of tasks using
>>
>> ($remote $remotemachine "cd $PWD;$t $ttt;rm -f .lock_$lockfile[$p]") >> .time1_$loop &
>>
>> where the standard setting for "remote" is "ssh", remotemachine is the
>> appropriate host, "t" is "time", and "ttt" is a concatenation of
>> commands; for instance, when using 2 cores on one node for Task1, 2
>> cores on each of 2 nodes for Task2, and 2 cores on 1 node for Task3:
>>
>> Task1:
>> mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine1
>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
>> Task2:
>> mpirun -v -x LD_LIBRARY_PATH -x PATH -np 4 -machinefile .machine2
>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_2.def
>> Task3:
>> mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine3
>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_3.def
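>>
>> For concreteness, with the standard settings the launch line above expands
>> for Task1 into roughly the following (node001, the working directory, the
>> lock-file name and the loop index are placeholders; it is all one line,
>> wrapped here for readability):
>>
>> (ssh node001 "cd /path/to/workdir;time mpirun -v -x LD_LIBRARY_PATH -x PATH
>>    -np 2 -machinefile .machine1 /home/lma712/src/Virgin_10.1/lapw1Q_mpi
>>    lapw1Q_1.def;rm -f .lock_1") >> .time1_1 &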
>>
>> This is a stable script; it works under SGI, Linux, MVAPICH and many
>> other environments using ssh or rsh (although I've never myself used it
>> with rsh). It is general purpose, i.e. it will run just 1 task on 8x8
>> nodes/cores, or 8 parallel tasks on 8 nodes each with 8 cores, or any
>> scatter of nodes/cores.
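>>
>> For the Task1/Task2/Task3 example above, each machinefile simply lists one
>> host name per process slot; e.g. .machine2 for Task2 might contain
>> (hostnames hypothetical):
>>
>> node002
>> node002
>> node003
>> node003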
>>
>> According to some, ssh is becoming obsolete within supercomputers and
>> the "replacement" is pbsdsh at least under Torque.
>
> Somebody is playing an April Fools joke on you. The majority of 
> supercomputers use ssh as their sole launch mechanism, and I have seen no 
> indication that anyone intends to change that situation. That said, Torque is 
> certainly popular and a good environment.

Alas, it is not an April Fools' joke; to quote from
http://www.bear.bham.ac.uk/bluebear/pbsdsh.shtml
"pbsdsh can be used as a replacement for an ssh or rsh command which
invokes a user command on a worker machine. Some applications expect
the availability of rsh or ssh  in order to invoke parts of the
computation on the sister worker nodes of the main worker. Using
pbsdsh instead is necessary on this cluster because direct use of ssh
or rsh is not allowed, for accounting and security reasons."

I am not using that computer. A scenario that I have come across is
that when an msub job is killed because it has exceeded its walltime,
MPI tasks spawned by ssh may not be terminated because (so I am told)
Torque does not know about them.

>
>> Getting pbsdsh to work is
>> certainly not as simple as the documentation I've seen suggests. To get
>> it to even partially work, I am using for "remote" a script "pbsh" which
>> writes exports of HOME, PATH, LD_LIBRARY_PATH etc., as well as the PBS
>> environment variables listed at the bottom of
>> http://www.bear.bham.ac.uk/bluebear/pbsdsh.shtml plus PBS_NODEFILE, to
>> a file $PBS_O_WORKDIR/.tmp_$1, appends the relevant command, and then
>> runs
>>
>> pbsdsh -h $1 /bin/bash -lc " $PBS_O_WORKDIR/.tmp_$1  "
>>
>> This works fine so long as Task2 does not use 2 nodes (probably 3 as
>> well; I've not tested this). If it does, there is a communications
>> failure and nothing is launched on the 2nd node of Task2.
>>
>> I'm including the script below, as maybe there are some other
>> environment variables needed, or some that should not be there, in order
>> to properly rebuild the environment so things will work. (And yes, I know
>> there should be tests to see whether the variables are set first and so
>> forth; this is not so clean, it is just an initial version.)
>
> By providing all those PBS-related envars to OMPI, you are causing OMPI to 
> think it should use Torque as the launch mechanism. Unfortunately, that won't 
> work in this scenario.
>
> When you start a Torque job (get an allocation etc.), Torque puts you on one 
> of the allocated nodes and creates a "sister mom" on that node. This is your 
> job's "master node". All Torque-based launches must take place from that 
> location.
>
> So when you pbsdsh to another node and attempt to execute mpirun with those 
> envars set, mpirun attempts to contact the local "sister mom" so it can order 
> the launch of any daemons on other nodes....only the "sister mom" isn't 
> there! So the connection fails and mpirun aborts.
>
> If mpirun is -only- launching procs on the local node, then it doesn't need 
> to launch another daemon (as mpirun will host the local procs itself), and so 
> it doesn't attempt to contact the "sister mom" and the comm failure doesn't 
> occur.
>
> What I still don't understand is why you are trying to do it this way. Why 
> not just run
>
> time mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machineN 
> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
>
> where machineN contains the names of the nodes where you want the MPI apps to 
> execute? mpirun will only execute apps on those nodes, so this accomplishes 
> the same thing as your script - only with a lot less pain.
>
> Your script would just contain a sequence of these commands, each with its 
> number of procs and machinefile as required.
>
> Actually, it would be pretty much identical to the script I use when doing 
> scaling tests...
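>
> For example, a minimal sketch of such a driver script (the log-file names
> are placeholders, and backgrounding the commands with & mirrors the way
> your current script runs the tasks in parallel; drop the & to run them
> sequentially):
>
> #!/bin/bash
> # Run each task from the job's master node; mpirun places the processes
> # on the nodes named in the machinefiles.
> (time mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine1 \
>     /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def) >> .time1_1 &
> (time mpirun -v -x LD_LIBRARY_PATH -x PATH -np 4 -machinefile .machine2 \
>     /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_2.def) >> .time1_2 &
> (time mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine3 \
>     /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_3.def) >> .time1_3 &
> # Wait for all three backgrounded tasks to finish before the job exits.
> wait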

This can be done, and in fact I have a job running where the non-MPI
parts are launched using pbsdsh but all the MPI is launched locally, and
this seems to be working. This may be a viable, general solution, but
there could also be issues with SCRATCH and other directories. In
principle there could also be issues with launching N MPI tasks from one
node. The executables I am using work well with very scattered cores,
e.g. using procs=64 or procs=256, but (at least with the system I am
using) I may only end up with 1 or 2 cores on the local node where the
job starts. (I've asked the sysadmin people to find a way to do this
better, e.g. to prefer launching from the node with the largest number
of cores available, which I think can be done, but they do not have this
set up as yet.)
>
>
>>
>> ----------
>> # Script to replace ssh by pbsdsh
>> # Beta version, April 2011, L. D. Marks
>> #
>> # Remove old file -- needed !
>> rm -f $PBS_O_WORKDIR/.tmp_$1
>>
>> # Create a script that exports the environment we have
>> # This may not be enough
>> # (The shebang must be quoted; otherwise the "#" makes echo treat the
>> # rest of the line, including the redirection, as a comment.)
>> echo '#!/bin/bash' > $PBS_O_WORKDIR/.tmp_$1
>> echo source $HOME/.bashrc                       >> $PBS_O_WORKDIR/.tmp_$1
>> echo cd $PBS_O_WORKDIR                          >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PATH=$PBS_O_PATH                    >> $PBS_O_WORKDIR/.tmp_$1
>> echo export TMPDIR=$TMPDIR                      >> $PBS_O_WORKDIR/.tmp_$1
>> echo export SCRATCH=$SCRATCH                    >> $PBS_O_WORKDIR/.tmp_$1
>> echo export LD_LIBRARY_PATH=$LD_LIBRARY_PATH    >> $PBS_O_WORKDIR/.tmp_$1
>>
>> # Openmpi needs to have this defined, even if we don't use it
>> echo export PBS_NODEFILE=$PBS_NODEFILE >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_ENVIRONMENT=$PBS_ENVIRONMENT    >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_JOBCOOKIE=$PBS_JOBCOOKIE        >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_JOBID=$PBS_JOBID                >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_JOBNAME=$PBS_JOBNAME            >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_MOMPORT=$PBS_MOMPORT            >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_NODENUM=$PBS_NODENUM            >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_HOME=$PBS_O_HOME              >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_HOST=$PBS_O_HOST              >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_LANG=$PBS_O_LANG              >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_LOGNAME=$PBS_O_LOGNAME        >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_MAIL=$PBS_O_MAIL              >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_PATH=$PBS_O_PATH              >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_QUEUE=$PBS_O_QUEUE            >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_SHELL=$PBS_O_SHELL            >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_O_WORKDIR=$PBS_O_WORKDIR        >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_QUEUE=$PBS_QUEUE                >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_TASKNUM=$PBS_TASKNUM            >> $PBS_O_WORKDIR/.tmp_$1
>> echo export PBS_VNODENUM=$PBS_VNODENUM          >> $PBS_O_WORKDIR/.tmp_$1
>>
>> # Now the command we want to run
>> echo $2 >> $PBS_O_WORKDIR/.tmp_$1
>>
>> # Make it executable
>> chmod a+x $PBS_O_WORKDIR/.tmp_$1
>>
>> pbsdsh -h $1 /bin/bash -lc " $PBS_O_WORKDIR/.tmp_$1  "
>>
>> #Cleanup if needed (commented out for debugging)
>> #rm $PBS_O_WORKDIR/.tmp_$1
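>>
>> The wrapper is then used exactly like ssh in the launch line at the top
>> of this mail, i.e. roughly (with "remote" pointing at this pbsh script):
>>
>> (pbsh $remotemachine "cd $PWD;$t $ttt;rm -f .lock_$lockfile[$p]") >> .time1_$loop &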
>>
>>
>> On Sat, Apr 2, 2011 at 9:36 PM, Laurence Marks <l-ma...@northwestern.edu> 
>> wrote:
>>> I have a problem which may or may not be openmpi, but since this list
>>> was useful before with a race condition I am posting.
>>>
>>> I am trying to use pbsdsh as an ssh replacement, pushed by sysadmins
>>> because Torque does not know about ssh tasks launched from a task. In a
>>> simple case, a script launches three MPI tasks in parallel,
>>>
>>> Task1: NodeA
>>> Task2: NodeB and NodeC
>>> Task3: NodeD
>>>
>>> (some cores on each, all handled correctly). Reproducibly (but with
>>> different nodes and numbers of cores), Task1 and Task3 work fine; the
>>> MPI task starts on NodeB but nothing starts on NodeC, and it appears
>>> that NodeC does not communicate. It does not have to be this layout;
>>> it could be
>>>
>>> Task1: NodeA NodeB
>>> Task2: NodeC NodeD
>>>
>>> Here NodeC will start, and it looks as if NodeD never starts anything.
>>> I've also run it with 4 tasks (1, 3 and 4 work), and if Task2 only uses
>>> one node (the number of cores does not matter) it is fine.
>>>
>>



-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Research is to see what everybody else has seen, and to think what
nobody else has thought
Albert Szent-Györgyi
