Hi.
I think I am also experiencing the same problem with SLURM 16.05.2.
slurm.conf:
SchedulerPort=7321
SchedulerRootFilter=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory,CR_Pack_Nodes
NodeName=CompNode[001-020] NodeAddr=172.16.32.[1-20] Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=193233 TmpDisk=102350 State=UNKNOWN
PartitionName=unlimited Nodes=CompNode[001-020] Default=NO MaxNodes=20 DefaultTime=UNLIMITED MaxTime=UNLIMITED Priority=100 Hidden=YES State=UP AllowGroups=slurmspecial
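(For reference, what slurmctld actually registered for these definitions can be cross-checked with the standard scontrol queries; node and partition names are taken from the config above:

scontrol show node CompNode001
scontrol show partition unlimited
)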
mm.batch:
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --nodes=5
#SBATCH --ntasks=10
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=12
#SBATCH --licenses=comsol@headnode
#SBATCH --partition=unlimited
#SBATCH --exclusive
#SBATCH --output=~/Documents/COMSOL/mm.out
#SBATCH --error=~/Documents/COMSOL/mm.err
module load Programs/comsol-5.2a
comsol -nn $SLURM_NTASKS -nnhost $SLURM_NTASKS_PER_NODE -np $SLURM_CPUS_PER_TASK \
    -numasets 2 -mpmode owner -mpibootstrap slurm -mpifabrics ofa:ofa \
    batch -job b1 -inputfile ~/Documents/COMSOL/mm.mph \
    -outputfile ~/Documents/COMSOL/mm-results.mph \
    -batchlog ~/Documents/COMSOL/mm.log
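With these directives the requested layout is self-consistent: 5 nodes x 2 tasks/node = 10 tasks, and 2 tasks x 12 cpus/task = 24 cores per node, exactly the 2 sockets x 12 cores each node provides. To see what Slurm actually hands the job, a debug line can be added before the comsol call (a minimal sketch; the echo is purely illustrative):

echo "nodes=$SLURM_JOB_NUM_NODES ntasks=$SLURM_NTASKS" \
     "ntasks-per-node=$SLURM_NTASKS_PER_NODE cpus-per-task=$SLURM_CPUS_PER_TASK"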
mm.err:
srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't match the requested tasks 5 with the number of requested nodes 5. Ignoring --ntasks-per-node.
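Note the inconsistency: mm.batch requests --nodes=5 --ntasks=10 --ntasks-per-node=2, which is self-consistent (5 x 2 = 10), yet the warning reports 5 tasks on 5 nodes, as if srun had recomputed the task count from the node count. A COMSOL-free reproducer should be no larger than this (a hypothetical test script; any trivial command in place of hostname will do):

test.batch:
#!/bin/bash
#SBATCH --nodes=5
#SBATCH --ntasks=10
#SBATCH --ntasks-per-node=2
#SBATCH --partition=unlimited
srun hostname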
mm.out:
[0] MPI startup(): Multi-threaded optimized library
[9] MPID_nem_ofacm_init(): Init
[1] MPID_nem_ofacm_init(): Init
[6] MPID_nem_ofacm_init(): Init
[4] MPID_nem_ofacm_init(): Init
[0] MPID_nem_ofacm_init(): Init
[8] MPID_nem_ofacm_init(): Init
[5] MPID_nem_ofacm_init(): Init
[7] MPID_nem_ofacm_init(): Init
[3] MPID_nem_ofacm_init(): Init
[2] MPID_nem_ofacm_init(): Init
[9] MPI startup(): ofa data transfer mode
[1] MPI startup(): ofa data transfer mode
[6] MPI startup(): ofa data transfer mode
[4] MPI startup(): ofa data transfer mode
[0] MPI startup(): ofa data transfer mode
[8] MPI startup(): ofa data transfer mode
[7] MPI startup(): ofa data transfer mode
[5] MPI startup(): ofa data transfer mode
[3] MPI startup(): ofa data transfer mode
[2] MPI startup(): ofa data transfer mode
[0] MPI startup(): Rank  Pid    Node name    Pin cpu
[0] MPI startup(): 0     15011  CompNode001  {1,3,5,7,9,11,13,15,17,19,21,23}
[0] MPI startup(): 1     15012  CompNode001  {0,2,4,6,8,10,12,14,16,18,20,22}
[0] MPI startup(): 2     45695  CompNode002  {1,3,5,7,9,11,13,15,17,19,21,23}
[0] MPI startup(): 4     17504  CompNode003  {1,3,5,7,9,11,13,15,17,19,21,23}
[0] MPI startup(): 5     17505  CompNode003  {0,2,4,6,8,10,12,14,16,18,20,22}
[0] MPI startup(): 6     8158   CompNode004  {1,3,5,7,9,11,13,15,17,19,21,23}
[0] MPI startup(): 7     8159   CompNode004  {0,2,4,6,8,10,12,14,16,18,20,22}
[0] MPI startup(): 8     45973  CompNode005  {1,3,5,7,9,11,13,15,17,19,21,23}
[0] MPI startup(): 9     45974  CompNode005  {0,2,4,6,8,10,12,14,16,18,20,22}
Node 0 is running on host: CompNode001
Node 0 has address: CompNode001.laced.ib
Node 1 is running on host: CompNode001
Node 1 has address: CompNode001.laced.ib
Node 2 is running on host: CompNode002
Node 2 has address: CompNode002.laced.ib
Node 3 is running on host: CompNode002
Node 3 has address: CompNode002.laced.ib
Node 4 is running on host: CompNode003
Node 4 has address: CompNode003.laced.ib
Node 5 is running on host: CompNode003
Node 5 has address: CompNode003.laced.ib
Node 6 is running on host: CompNode004
Node 6 has address: CompNode004.laced.ib
Node 7 is running on host: CompNode004
Node 7 has address: CompNode004.laced.ib
Node 8 is running on host: CompNode005
Node 8 has address: CompNode005.laced.ib
Node 9 is running on host: CompNode005
Node 9 has address: CompNode005.laced.ib
Regards,
Miguel
On 10/21/2016 02:54 PM, Manuel Rodríguez Pascual wrote:
Wrong behaviour of "--tasks-per-node" flag
Hi all,
I am having the weirdest error ever. I am pretty sure this is a bug.
I have reproduced the error in the latest slurm commit (slurm
17.02.0-0pre2, commit 406d3fe429ef6b694f30e19f69acf989e65d7509) and
in the slurm 16.05.5 branch. It does NOT happen in slurm 15.08.12.
My cluster is composed of 8 nodes, each with 2 sockets, each socket
with 8 cores. The slurm.conf content is:
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear #DEDICATED NODES
NodeName=acme[11-14,21-24] CPUs=16 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
I am running a simple hello world parallel code, submitted as
"sbatch --ntasks=X --tasks-per-node=Y myScript.sh". The problem is
that, depending on the values of X and Y, Slurm performs a wrong
operation and returns an error.
"
sbatch --ntasks=8 --tasks-per-node=2 myScript.sh
srun: Warning: can't honor --ntasks-per-node set to 2 which doesn't
match the requested tasks 4 with the number of requested nodes 4.
Ignoring --ntasks-per-node.
"
Note that I did not request 4 but 8 tasks, and I did not request any
number of nodes.
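(myScript.sh itself is not quoted in this excerpt; a hypothetical stand-in matching its description, i.e. an MPI hello-world binary launched through srun, would be:

myScript.sh:
#!/bin/bash
srun ./helloWorld
)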