Hi, I am trying to get familiar with SLURM. I like what I have read and tried out so far, but I cannot find how to do a very simple thing in SLURM. I want to distribute a bunch of srun commands across multiple nodes without duplication (the same job running on more than one node), and I want to tell SLURM the node configuration parameters I need.
So far I have tried srun, sbatch, and salloc, and thought sbatch would do what I am looking for. However, sbatch starts by assigning the requested resources but then runs every srun command on every node. For instance, my script looks like:

#!/bin/bash
# set the number of nodes
#SBATCH --nodes=3
# set the number of GPU cards to use per node
#SBATCH --gres=gpu:2
# set name of job
#SBATCH --job-name=testslurm2
# run the application
echo $CUDA_VISIBLE_DEVICES
srun --gres=gpu:1 ./mycode.x < input.in > output1.out &
srun --gres=gpu:1 ./mycode.x < input.in > output2.out &
srun --gres=gpu:1 ./mycode.x < input.in > output3.out &
srun --gres=gpu:1 ./mycode.x < input.in > output4.out &
wait

I want to run 4 srun commands on 3 nodes in total (how to distribute them among the nodes is up to SLURM), but this script runs every srun command on every node. Can someone please help me with this very simple problem? Thank you very much in advance!
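In case it helps future readers, one common approach (a sketch only, not tested on this cluster) is to give each srun an explicit step size: without -n/-N, each srun inherits the full allocation and launches across all nodes. Limiting each step to one task on one node, and adding --exclusive (or --exact on SLURM 21.08 and newer) so concurrent steps do not share the same resources, lets SLURM spread the steps over the allocated nodes. The filenames and GPU counts below just mirror the script above:

```shell
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks=4            # four job steps in total
#SBATCH --gres=gpu:2          # two GPUs per allocated node
#SBATCH --job-name=testslurm2

# Each srun launches ONE task on ONE node. --exclusive (use --exact on
# SLURM >= 21.08) keeps concurrent steps from sharing the same CPUs/GPUs,
# so SLURM distributes the four steps across the three nodes.
srun --exclusive -N1 -n1 --gres=gpu:1 ./mycode.x < input.in > output1.out &
srun --exclusive -N1 -n1 --gres=gpu:1 ./mycode.x < input.in > output2.out &
srun --exclusive -N1 -n1 --gres=gpu:1 ./mycode.x < input.in > output3.out &
srun --exclusive -N1 -n1 --gres=gpu:1 ./mycode.x < input.in > output4.out &
wait   # block until all four background steps finish
```

Whether --exclusive or --exact is the right flag depends on the SLURM version installed, so check `man srun` on your system.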
