Re: [slurm-users] Running multi jobs on one CPU in parallel
The simplest approach might be to run multiple processes within each
batch job.

Gareth

________________________________
From: slurm-users on behalf of Emre Brookes
Sent: Wednesday, September 15, 2021 6:42:24 AM
To: Karl Lovink; Slurm User Community List
Subject: Re: [slurm-users] Running multi jobs on one CPU in parallel

[Emre's message and the rest of the quoted thread appear in full below.]
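A minimal sketch of Gareth's approach, launching Karl's 384 workers from
inside a single batch job (the script path is taken from Karl's post; the
node, CPU, and process counts are illustrative assumptions):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=128

    # Launch many worker processes inside one job allocation; the OS
    # scheduler time-slices them across the 128 allocated CPUs.
    for i in $(seq 1 384); do
        /ddos/demo/showproc.sh &
    done
    wait   # return only after all background workers have exited

Because the oversubscription happens inside the allocation, Slurm only
sees one task, so no scheduler or compile-time task limits are involved.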
Re: [slurm-users] Running multi jobs on one CPU in parallel
Hi Karl,

I haven't tested the MAX_TASKS_PER_NODE limits. According to the
slurm.conf documentation:

    *MaxTasksPerNode*
        Maximum number of tasks Slurm will allow a job step to spawn on
        a single node. The default *MaxTasksPerNode* is 512. May not
        exceed 65533.

So I'd try setting that and running "scontrol reconfigure" before
attempting a recompile. The documentation seems inconsistent on this
point.

-Emre

Karl Lovink wrote:
> Hi Emre,
>
> MAX_TASKS_PER_NODE is set to 512. Does this mean I cannot run more
> than 512 jobs in parallel on one node? Or can I change
> MAX_TASKS_PER_NODE to a higher value and recompile Slurm?
> [...]

[Remainder of the quoted thread trimmed; the earlier messages appear in
full below.]
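Concretely, the change Emre describes would look something like this
(1024 is an illustrative value, not a recommendation):

    # In slurm.conf on the controller (and synced to all nodes):
    MaxTasksPerNode=1024

    # Apply the change without restarting the daemons:
    scontrol reconfigure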
Re: [slurm-users] Running multi jobs on one CPU in parallel
Hi Emre,

MAX_TASKS_PER_NODE is set to 512. Does this mean I cannot run more than
512 jobs in parallel on one node? Or can I change MAX_TASKS_PER_NODE to
a higher value and recompile Slurm?

Regards,
Karl

On 14/09/2021 21:47, Emre Brookes wrote:
> *-O*, *--overcommit*
>     Overcommit resources. [...] However, no more than
>     *MAX_TASKS_PER_NODE* tasks are permitted to execute per node.
>     NOTE: *MAX_TASKS_PER_NODE* is defined in the file /slurm.h/ and
>     is not a variable; it is set at Slurm build time.
>
> I have used this successfully to run more jobs than CPUs/cores
> available.
> [...]

[Remainder of the quoted thread trimmed; the original message appears in
full below.]
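For what the recompile Karl asks about would involve, a sketch only: per
the srun man page the constant lives in slurm.h, but the exact file and
build steps vary between Slurm releases, and the prefix below is just an
assumption mirroring the /opt/slurm paths in Karl's config:

    # Edit the compile-time constant in the Slurm source tree, e.g.
    #   #define MAX_TASKS_PER_NODE 512  ->  #define MAX_TASKS_PER_NODE 1024
    # (some releases generate slurm.h from slurm/slurm.h.in), then rebuild:
    ./configure --prefix=/opt/slurm && make && make install
    # Restart slurmctld and slurmd on all nodes afterwards.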
Re: [slurm-users] Running multi jobs on one CPU in parallel
*-O*, *--overcommit*
    Overcommit resources. When applied to a job allocation, only one CPU
    is allocated to the job per node, and options used to specify the
    number of tasks per node, socket, core, etc. are ignored. When
    applied to job step allocations (the *srun* command when executed
    within an existing job allocation), this option can be used to
    launch more than one task per CPU. Normally, *srun* will not
    allocate more than one process per CPU. By specifying *--overcommit*
    you are explicitly allowing more than one process per CPU. However,
    no more than *MAX_TASKS_PER_NODE* tasks are permitted to execute per
    node. NOTE: *MAX_TASKS_PER_NODE* is defined in the file /slurm.h/
    and is not a variable; it is set at Slurm build time.

I have used this successfully to run more jobs than CPUs/cores
available.

-e.

Karl Lovink wrote:
> Hello,
>
> I am in the process of setting up our SLURM environment. We want to
> use SLURM during our DDoS exercises for dispatching DDoS attack
> scripts. We need a lot of parallel running jobs on a total of 3
> nodes. I can't get it to run more than 128 jobs simultaneously. There
> are 128 CPUs in each compute node.
>
> How can I ensure that I can run more jobs in parallel than there are
> CPUs in the compute node?
>
> Thanks,
> Karl
>
> My srun command is:
>
>     srun --exclusive --nodes 3 --ntasks 384 /ddos/demo/showproc.sh
>
> And my slurm.conf file:
>
>     ClusterName=ddos-cluster
>     ControlMachine=slurm
>     SlurmUser=ddos
>     SlurmctldPort=6817
>     SlurmdPort=6818
>     AuthType=auth/munge
>     StateSaveLocation=/opt/slurm/spool/ctld
>     SlurmdSpoolDir=/opt/slurm/spool/d
>     SwitchType=switch/none
>     MpiDefault=none
>     SlurmctldPidFile=/opt/slurm/run/.pid
>     SlurmdPidFile=/opt/slurm/run/slurmd.pid
>     ProctrackType=proctrack/pgid
>     PluginDir=/opt/slurm/lib/slurm
>     ReturnToService=2
>     TaskPlugin=task/none
>     SlurmctldTimeout=300
>     SlurmdTimeout=300
>     InactiveLimit=0
>     MinJobAge=300
>     KillWait=30
>     Waittime=0
>     SchedulerType=sched/backfill
>
>     SelectType=select/cons_tres
>     SelectTypeParameters=CR_Core
>
>     SlurmctldDebug=3
>     SlurmctldLogFile=/opt/slurm/log/slurmctld.log
>     SlurmdDebug=3
>     SlurmdLogFile=/opt/slurm/log/slurmd.log
>     JobCompType=jobcomp/none
>     JobAcctGatherType=jobacct_gather/none
>     AccountingStorageTRES=gres/gpu
>     DebugFlags=CPU_Bind,gres
>     AccountingStorageType=accounting_storage/slurmdbd
>     AccountingStorageHost=localhost
>     AccountingStoragePass=/var/run/munge/munge.socket.2
>     AccountingStorageUser=slurm
>     SlurmctldParameters=enable_configurable
>     GresTypes=gpu
>     DefMemPerNode=256000
>     NodeName=aivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>     NodeName=mivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>     NodeName=fiod CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>     PartitionName=ddos Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>     PartitionName=adhoc Nodes=ALL Default=YES MaxTime=INFINITE State=UP
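Applied to Karl's original command, Emre's flag would be used roughly
like this (same node and task counts as the original post):

    # --overcommit (-O) lets srun launch more tasks than allocated CPUs,
    # up to MAX_TASKS_PER_NODE tasks on any one node:
    srun --overcommit --nodes 3 --ntasks 384 /ddos/demo/showproc.sh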