Let me expand on this slightly (in response to Ralph Castain's posting; I had digest mode set). As currently constructed, a shell script in Wien2k (www.wien2k.at) launches a series of tasks using

($remote $remotemachine "cd $PWD;$t $ttt;rm -f .lock_$lockfile[$p]") >>.time1_$loop &

where the standard setting for "remote" is "ssh", "remotemachine" is the appropriate host, "t" is "time" and "ttt" is a concatenation of commands. For instance, when using 2 cores on one node for Task1, 2 cores on 2 nodes for Task2 and 2 cores on 1 node for Task3:

Task1:
mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine1 /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
Task2:
mpirun -v -x LD_LIBRARY_PATH -x PATH -np 4 -machinefile .machine2 /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_2.def
Task3:
mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine3 /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_3.def

This is a stable script that works under SGI, Linux, mvapich and many others, using ssh or rsh (although I have never used it with rsh myself). It is general purpose, i.e. it will work to run just 1 task on 8x8 nodes/cores, or 8 parallel tasks on 8 nodes all with 8 cores, or any scatter of nodes/cores.

According to some, ssh is becoming obsolete within supercomputers and the "replacement" is pbsdsh, at least under Torque. Getting pbsdsh to work is certainly not as simple as the documentation I've seen suggests. To get it to even partially work I am using for "remote" a script "pbsh". It creates an executable bash file $PBS_O_WORKDIR/.tmp_$1 in which HOME, PATH, LD_LIBRARY_PATH etc., as well as the PBS environment variables listed at the bottom of http://www.bear.bham.ac.uk/bluebear/pbsdsh.shtml plus PBS_NODEFILE, are set, followed by the relevant command, and then runs

pbsdsh -h $1 /bin/bash -lc " $PBS_O_WORKDIR/.tmp_$1 "

This works fine so long as Task2 does not have 2 nodes (probably 3 as well; I've not tested this). If it does, there is a communications failure with nothing launched on the 2nd node of Task2.
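As a rough sketch of the idea behind the wrapper just described (not the pbsh script itself), the environment file could be written by a small helper that only exports variables which are actually set and quotes their values. The function name write_env_script and its argument order are made up for illustration; the code avoids bash-only features so that it should also run under a plain POSIX sh.

```shell
#!/bin/sh
# Sketch only: write_env_script is a hypothetical helper, not part of Wien2k or Torque.
# Usage: write_env_script OUTFILE "command to run" VAR1 VAR2 ...
write_env_script() {
    local out cmd name isset value quoted
    out=$1
    cmd=$2
    shift 2
    {
        # Quote the shebang, otherwise the shell treats it as a comment
        printf '%s\n' '#!/bin/bash'
        for name in "$@"; do
            # isset becomes "x" only if the variable named by $name is set
            eval "isset=\${$name+x}"
            [ -n "$isset" ] || continue
            eval "value=\$$name"
            # Wrap the value in single quotes, turning each embedded ' into '\''
            quoted=$(printf '%s' "$value" | sed "s/'/'\\\\''/g")
            printf "export %s='%s'\n" "$name" "$quoted"
        done
        # Finally the command that should run in the rebuilt environment
        printf '%s\n' "$cmd"
    } > "$out"
    chmod u+x "$out"
}
```

It would be called along the lines of write_env_script $PBS_O_WORKDIR/.tmp_$1 "$2" PATH LD_LIBRARY_PATH PBS_NODEFILE PBS_JOBID, with the list of names extended as needed. Whether this set of variables is sufficient for mpirun under pbsdsh is exactly the open question here.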
I'm including the script below, as maybe there are some other environment variables needed, or some that should not be there, in order to properly rebuild the environment so things will work. (And yes, I know there should be tests to see if the variables are set first and so forth, and this is not so clean; this is just an initial version.)

----------
# Script to replace ssh by pbsdsh
# Beta version, April 2011, L. D. Marks
#
# Remove old file -- needed !
rm -f $PBS_O_WORKDIR/.tmp_$1

# Create a script that exports the environment we have
# This may not be enough
# (the shebang is quoted, otherwise the shell treats it as a comment)
echo '#!/bin/bash' > $PBS_O_WORKDIR/.tmp_$1
echo source $HOME/.bashrc >> $PBS_O_WORKDIR/.tmp_$1
echo cd $PBS_O_WORKDIR >> $PBS_O_WORKDIR/.tmp_$1
echo export PATH=$PBS_O_PATH >> $PBS_O_WORKDIR/.tmp_$1
echo export TMPDIR=$TMPDIR >> $PBS_O_WORKDIR/.tmp_$1
echo export SCRATCH=$SCRATCH >> $PBS_O_WORKDIR/.tmp_$1
echo export LD_LIBRARY_PATH=$LD_LIBRARY_PATH >> $PBS_O_WORKDIR/.tmp_$1

# Openmpi needs to have this defined, even if we don't use it
echo export PBS_NODEFILE=$PBS_NODEFILE >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_ENVIRONMENT=$PBS_ENVIRONMENT >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_JOBCOOKIE=$PBS_JOBCOOKIE >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_JOBID=$PBS_JOBID >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_JOBNAME=$PBS_JOBNAME >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_MOMPORT=$PBS_MOMPORT >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_NODENUM=$PBS_NODENUM >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_O_HOME=$PBS_O_HOME >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_O_HOST=$PBS_O_HOST >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_O_LANG=$PBS_O_LANG >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_O_LOGNAME=$PBS_O_LOGNAME >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_O_MAIL=$PBS_O_MAIL >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_O_PATH=$PBS_O_PATH >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_O_QUEUE=$PBS_O_QUEUE >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_O_SHELL=$PBS_O_SHELL >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_O_WORKDIR=$PBS_O_WORKDIR >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_QUEUE=$PBS_QUEUE >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_TASKNUM=$PBS_TASKNUM >> $PBS_O_WORKDIR/.tmp_$1
echo export PBS_VNODENUM=$PBS_VNODENUM >> $PBS_O_WORKDIR/.tmp_$1

# Now the command we want to run
echo "$2" >> $PBS_O_WORKDIR/.tmp_$1

# Make it executable
chmod a+x $PBS_O_WORKDIR/.tmp_$1

pbsdsh -h $1 /bin/bash -lc " $PBS_O_WORKDIR/.tmp_$1 "

# Cleanup if needed (commented out for debugging)
#rm $PBS_O_WORKDIR/.tmp_$1
----------

On Sat, Apr 2, 2011 at 9:36 PM, Laurence Marks <l-ma...@northwestern.edu> wrote:
> I have a problem which may or may not be openmpi, but since this list
> was useful before with a race condition I am posting.
>
> I am trying to use pbsdsh as a ssh replacement, pushed by sysadmins as
> Torque does not know about ssh tasks launched from a task. In a simple
> case, a script launches three mpi tasks in parallel,
>
> Task1: NodeA
> Task2: NodeB and NodeC
> Task3: NodeD
>
> (some cores on each, all handled correctly). Reproducible (but with
> different nodes and numbers of cores) Task1 and Task3 work fine, the
> mpi task starts on NodeB but nothing starts on NodeC, it appears that
> NodeC does not communicate. It does not have to be this it could be
>
> Task1: NodeA NodeB
> Task2: NodeC NodeD
>
> Here NodeC will start and it looks as if NodeD never starts anything.
> I've also run it with 4 Tasks (1,3,4 work) and if Task2 only uses one
> Node (number of cores do not matter) it is fine.
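
One way to narrow down whether the silent node is unreachable through pbsdsh at all, or only fails once mpirun is involved, might be to run a trivial command on every host of a machine file first. The following is a sketch under assumptions: check_nodes and the LAUNCH variable are hypothetical names, the launcher is assumed to accept the host as in the "pbsdsh -h" call in my script, and /bin/hostname is assumed to exist on every node.

```shell
#!/bin/sh
# Sketch only: check_nodes and LAUNCH are hypothetical names.
# LAUNCH is a command that takes a host followed by the program to run there;
# it defaults to "pbsdsh -h" as used in the pbsh script.
check_nodes() {
    local nodefile launch host
    nodefile=$1
    launch=${LAUNCH:-pbsdsh -h}
    # Machine files list one line per core, so reduce them to unique hosts
    sort -u "$nodefile" | while read -r host; do
        [ -n "$host" ] || continue
        # stdin comes from /dev/null so the launcher cannot swallow the host list
        if $launch "$host" /bin/hostname </dev/null >/dev/null 2>&1; then
            echo "$host ok"
        else
            echo "$host FAILED"
        fi
    done
}
```

Calling check_nodes .machine2 inside the job would then print one line per host of Task2. If the second node already fails here, the problem is probably in how pbsdsh is being invoked rather than in openmpi; if both report ok, the environment handed to mpirun is the more likely suspect.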
>
> --
> Laurence Marks
> Department of Materials Science and Engineering
> MSE Rm 2036 Cook Hall
> 2220 N Campus Drive
> Northwestern University
> Evanston, IL 60208, USA
> Tel: (847) 491-3996 Fax: (847) 491-7820
> email: L-marks at northwestern dot edu
> Web: www.numis.northwestern.edu
> Chair, Commission on Electron Crystallography of IUCR
> www.numis.northwestern.edu/
> Research is to see what everybody else has seen, and to think what
> nobody else has thought
> Albert Szent-Györgi