Hi all,

I'm trying to use heterogeneous jobs with the following slurm script:


#!/usr/bin/env bash

#SBATCH --partition=cpu --time=01:00:00 --nodes=2 --ntasks-per-node=1 --cpus-per-task=2 --mem=8G
#SBATCH hetjob
#SBATCH --partition=gpu --time=01:00:00 --nodes=2 --ntasks-per-node=1 --cpus-per-task=2 --mem=8G --gres=gpu:1

srun \
    --het-group=0 -K sh -c 'echo group 0 $(hostname) $SLURM_PROCID' : \
    --het-group=1 -K sh -c 'echo group 1 $(hostname) $SLURM_PROCID'

The commands work when I run them manually inside an salloc allocation, but the job fails when submitted via sbatch, with the following error:


srun: error: Allocation failure of 2 nodes: job size of 2, already allocated 2 nodes to previous components.
srun: error: Allocation failure of 2 nodes: job size of 2, already allocated 2 nodes to previous components.

Am I misunderstanding the sbatch documentation? Is it normal that sbatch and 
salloc behave differently?

Note: with salloc the commands run on the slurmctld server, whereas with sbatch the job script runs on the first node allocated to the batch job. We are running Slurm 20.11.3.

Best regards,
Nicolas
