Hi,

I am trying to get familiar with SLURM.  I like what I have read and
tried out so far, but I cannot find out how to do one very simple thing in
SLURM.  I want to distribute a bunch of srun commands across multiple nodes
without duplication (that is, without the same job being run on more than
one node), and I want to tell SLURM which node configuration I need.

So far I have tried srun, sbatch and salloc, and I thought sbatch would do
what I am looking for.  However, sbatch starts by allocating the requested
resource configuration but then runs every srun command on every node.
For instance, my script looks like this:

#!/bin/bash

# set the number of nodes
#SBATCH --nodes=3

# set the number of GPU cards to use per node
#SBATCH --gres=gpu:2

# set name of job
#SBATCH --job-name=testslurm2

# run the application
echo $CUDA_VISIBLE_DEVICES

srun --gres=gpu:1 ./mycode.x < input.in > output1.out &
srun --gres=gpu:1 ./mycode.x < input.in > output2.out &
srun --gres=gpu:1 ./mycode.x < input.in > output3.out &
srun --gres=gpu:1 ./mycode.x < input.in > output4.out &
wait

I want to run 4 srun commands on a total of 3 nodes (how they are
distributed among the nodes is up to SLURM), but this script runs every
srun command on every node.  Can someone please help me with this very
simple problem?

Thank you very much in advance!
