Hi Karl,

I haven't tested the MAX_TASKS_PER_NODE limits.
According to the slurm.conf man page:

*MaxTasksPerNode*
   Maximum number of tasks Slurm will allow a job step to spawn on a
   single node.
   The default *MaxTasksPerNode* is 512. May not exceed 65533.

So I'd try setting that and running "scontrol reconfigure"
before attempting a recompile.
The documentation seems inconsistent on this point: the srun page
(quoted below) says the limit is fixed at build time, while the
slurm.conf page presents it as a configurable option.
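
Untested, but something like this in slurm.conf (1024 is only an
example value; the documented cap above is 65533):

    MaxTasksPerNode=1024

followed by this, to make the running daemons re-read the config:

    scontrol reconfigure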

-Emre



Karl Lovink wrote:
Hi Emre,

MAX_TASKS_PER_NODE is set to 512. Does this mean I cannot run more than
512 jobs in parallel on one node? Or can I change MAX_TASKS_PER_NODE to
a higher value and recompile Slurm?

Regards,
Karl


On 14/09/2021 21:47, Emre Brookes wrote:
*-O*, *--overcommit*
    Overcommit resources. When applied to job allocation, only one CPU
    is allocated to the job per node and options used to specify the
    number of tasks per node, socket, core, etc. are ignored. When
    applied to job step allocations (the *srun* command when executed
    within an existing job allocation), this option can be used to
    launch more than one task per CPU. Normally, *srun* will not
    allocate more than one process per CPU. By specifying *--overcommit*
    you are explicitly allowing more than one process per CPU. However
    no more than *MAX_TASKS_PER_NODE* tasks are permitted to execute per
    node. NOTE: *MAX_TASKS_PER_NODE* is defined in the file /slurm.h/
    and is not a variable, it is set at Slurm build time.

I have used this successfully to run more jobs than there are
CPUs/cores available.
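
For example, something like your srun line below with *--overcommit*
in place of *--exclusive* (untested on your exact setup):

    srun --overcommit --nodes 3 --ntasks 384 /ddos/demo/showproc.sh

384 tasks across 3 nodes is 128 per node, well under the 512-task
*MAX_TASKS_PER_NODE* cap.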

-e.



Karl Lovink wrote:
Hello,

I am in the process of setting up our SLURM environment. We want to use
SLURM during our DDoS exercises for dispatching DDoS attack scripts. We
need a lot of parallel running jobs on a total of 3 nodes. I can't get it
to run more than 128 jobs simultaneously. There are 128 CPUs in each of
the compute nodes.

How can I ensure that I can run more jobs in parallel than there are
CPUs in the compute node?

Thanks
Karl


My srun script is:
srun --exclusive --nodes 3 --ntasks 384 /ddos/demo/showproc.sh

And my slurm.conf file:
ClusterName=ddos-cluster
ControlMachine=slurm
SlurmUser=ddos
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/opt/slurm/spool/ctld
SlurmdSpoolDir=/opt/slurm/spool/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/opt/slurm/run/.pid
SlurmdPidFile=/opt/slurm/run/slurmd.pid
ProctrackType=proctrack/pgid
PluginDir=/opt/slurm/lib/slurm
ReturnToService=2
TaskPlugin=task/none
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/backfill

SelectType=select/cons_tres
SelectTypeParameters=CR_Core

SlurmctldDebug=3
SlurmctldLogFile=/opt/slurm/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/opt/slurm/log/slurmd.log
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/none
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageUser=slurm
SlurmctldParameters=enable_configless
GresTypes=gpu
DefMemPerNode=256000
NodeName=aivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
NodeName=mivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
NodeName=fiod CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
PartitionName=ddos Nodes=ALL Default=YES MaxTime=INFINITE State=UP
PartitionName=adhoc Nodes=ALL Default=YES MaxTime=INFINITE State=UP



