Hello, I am using the --multi-prog switch with srun for a master/slave collective communication process. I am using Open MPI to pass host-based buffers between the master and the slave.
The code works over the InfiniBand connection between master and slave. However, I want to test over 10GigE, which also connects my two nodes.

My batch file call (for this GPU-based code) is as follows:

    sbatch --gres=gpu:1 --nodelist="fupone4,t2" -n2 slurm.bat

Where slurm.bat invokes the --multi-prog switch as follows:

    #!/bin/csh
    #SBATCH --output=pnacq.txt
    #SBATCH --export=HOME,DISPLAY
    srun --multi-prog -l master_slave1.conf

The master_slave1.conf file is as follows:

    0 test.bat
    1 test1.bat

Where test.bat is:

    #!/bin/tcsh
    hostname
    source $HOME/.cshrc
    xms xmpy pnacq.py

And test1.bat is:

    #!/bin/tcsh
    hostname
    source $HOME/.cshrc
    env
    xms /home/brian/praxis/host/pnacq_grid.exe

NOTE: in test.bat and test1.bat above, 'xms' starts an interpretive X-Midas environment in which the baseline code (pnacq.py and pnacq_grid.exe) is compiled and run.

I tried setting environment variables to suppress openib and force my TCP connection as follows (a sketch of where these lines would sit in test.bat is shown at the end of this message):

    setenv OMPI_MCA_BTL_TCP_IF_INCLUDE 81.1.1.194/24
    setenv OMPI_MCA_BTL \^openib

In general, the MCA parameters are recognized by the 'mpirun' command, but as you can see from the scripts above I am not using mpirun; I am launching my apps from the X-Midas environment started by the 'xms' call.

I also tried the sbatch 'network' switches as follows:

    sbatch -v --gres=gpu:1 --nodelist="fupone4,t2" --network=devtype="IPONLY" --network=devname="eth3" -n2 slurm.bat

But the communication path still defaults to InfiniBand. I am all out of ideas. Any suggestions would be appreciated.
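To make that attempt concrete, here is a minimal sketch of how test.bat would look with the setenv lines placed directly ahead of the xms call. The placement is illustrative only; the variable names and the 81.1.1.194/24 subnet are the ones from my attempt above, and whether processes started under xms actually inherit these settings is exactly what I am unsure about:

    #!/bin/tcsh
    # Sketch only: set the Open MPI MCA parameters in the task script
    # so that anything launched under xms inherits them.
    hostname
    source $HOME/.cshrc

    # Restrict the TCP BTL to the 10GigE subnet and exclude openib.
    setenv OMPI_MCA_BTL_TCP_IF_INCLUDE 81.1.1.194/24
    setenv OMPI_MCA_BTL \^openib

    xms xmpy pnacq.py

The corresponding test1.bat would get the same two setenv lines ahead of its xms call.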
Regards,
Brian