Hello,

I am using the --multi-prog switch with srun for a master/slave collective
communication process, with Open MPI passing host-based buffers between the
master and the slave.

The code works over the InfiniBand connection between the master and the slave.
However, I want to test over 10GigE, which also connects my two nodes.

My batch file call (for this GPU-based code) is as follows:

sbatch --gres=gpu:1 --nodelist="fupone4,t2"  -n2 slurm.bat

Where slurm.bat invokes the 'multi-prog' switch as follows:

#!/bin/csh

#SBATCH --output=pnacq.txt

#SBATCH --export=HOME,DISPLAY

srun --multi-prog -l master_slave1.conf


The master_slave1.conf file is as follows:

0      test.bat
1      test1.bat


Where test.bat is:

#!/bin/tcsh

hostname
source $HOME/.cshrc
xms
xmpy pnacq.py


And test1.bat is:

#!/bin/tcsh

hostname
source $HOME/.cshrc
env
xms
/home/brian/praxis/host/pnacq_grid.exe


NOTE: in test.bat and test1.bat above, 'xms' starts an interpretive
X-Midas environment against which the baseline code (pnacq.py and
pnacq_grid.exe) is compiled and run.

I tried setting environment variables to suppress openib and force my TCP
connection, as follows:

setenv OMPI_MCA_BTL_TCP_IF_INCLUDE 81.1.1.194/24
setenv OMPI_MCA_BTL \^openib
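
For what it's worth, the Open MPI documentation shows these environment
variables with lowercase parameter names, so I am not even sure the uppercase
spelling above is being picked up. The documented form would be:

setenv OMPI_MCA_btl_tcp_if_include 81.1.1.194/24
setenv OMPI_MCA_btl \^openib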

In general, these MCA parameters are recognized by the 'mpirun' command, but
as you can see from the scripts above, I am not using mpirun; I am launching
my apps from the X-Midas environment started by the 'xms' call.
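
One thing I have not tried yet is setting the MCA variables inside slurm.bat
itself, right before the srun line, on the assumption that srun passes the
batch script's environment through to both tasks. A rough sketch:

#!/bin/csh

#SBATCH --output=pnacq.txt

#SBATCH --export=HOME,DISPLAY

# untested: point Open MPI at the 10GigE interface for both tasks
setenv OMPI_MCA_btl_tcp_if_include 81.1.1.194/24
setenv OMPI_MCA_btl \^openib

srun --multi-prog -l master_slave1.conf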

I tried using the sbatch 'network' switches as follows:

sbatch -v --gres=gpu:1 --nodelist="fupone4,t2" --network=devtype="IPONLY"
--network=devname="eth3" -n2 slurm.bat

But the communication path still defaults to InfiniBand.
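
For what it's worth, a simple check I can run to see which path is in use is
to compare the 10GigE byte counters on each node before and after a run; if
eth3's counters barely move, the transfers are clearly not going over the
10GigE link:

cat /sys/class/net/eth3/statistics/rx_bytes
cat /sys/class/net/eth3/statistics/tx_bytes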

I am all out of ideas.  Any suggestions would be appreciated.

Regards,

Brian
