The simplest approach might be to run multiple processes within each batch job.
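As a rough, untested sketch of that idea: a batch script can start several copies of the script in the background and then wait for all of them. The helper name `run_many` and the variable `PROCS_PER_JOB` are made up for illustration; `/ddos/demo/showproc.sh` is the script from Karl's srun line, and the `#SBATCH` values are only placeholders.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1

# run_many COUNT CMD [ARGS...]
# Hypothetical helper: start COUNT copies of CMD in the background,
# wait for every one of them, and return non-zero if any copy failed.
run_many() {
    count=$1
    shift
    pids=""
    failed=0
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &
        pids="$pids $!"
        i=$((i + 1))
    done
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    [ "$failed" -eq 0 ]
}

# Only launch the workload when running inside a Slurm job.
# PROCS_PER_JOB is an assumed knob, defaulting to 4 copies per job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_many "${PROCS_PER_JOB:-4}" /ddos/demo/showproc.sh
fi
```

Slurm then only sees one task per job, so the number of processes actually running is the number of jobs times `PROCS_PER_JOB`, regardless of how many CPUs the node reports.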
Gareth

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Emre Brookes <emre.broo...@mso.umt.edu>
Sent: Wednesday, September 15, 2021 6:42:24 AM
To: Karl Lovink <k...@lovink.net>; Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Running multi jobs on one CPU in parallel

Hi Karl,

I haven't tested the MAX_TASKS_PER_NODE limits.
According to slurm.conf

*MaxTasksPerNode*
    Maximum number of tasks Slurm will allow a job step to spawn on a
    single node. The default *MaxTasksPerNode* is 512. May not exceed 65533

So I'd try setting that and "scontrol reconfigure" before attempting a recompile.
Seems the documentation is inconsistent on this point.

-Emre

Karl Lovink wrote:
> Hi Emre,
>
> MAX_TASKS_PER_NODE is set to 512. Does this means I cannot run more than
> 512 jobs in parallel on one node? Or can I change MAX_TASKS_PER_NODE to
> a higher value?
> And recompile slurm.....
>
> Regards,
> Karl
>
>
> On 14/09/2021 21:47, Emre Brookes wrote:
>> *-O*, *--overcommit*
>> Overcommit resources. When applied to job allocation, only one CPU
>> is allocated to the job per node and options used to specify the
>> number of tasks per node, socket, core, etc. are ignored. When
>> applied to job step allocations (the *srun* command when executed
>> within an existing job allocation), this option can be used to
>> launch more than one task per CPU. Normally, *srun* will not
>> allocate more than one process per CPU. By specifying *--overcommit*
>> you are explicitly allowing more than one process per CPU. However
>> no more than *MAX_TASKS_PER_NODE* tasks are permitted to execute per
>> node. NOTE: *MAX_TASKS_PER_NODE* is defined in the file /slurm.h/
>> and is not a variable, it is set at Slurm build time.
>>
>> I have used this successfully to run more jobs than cpus/cores avail.
>>
>> -e.
>>
>>
>>
>> Karl Lovink wrote:
>>> Hello,
>>>
>>> I am in the process of setting up our SLURM environment. We want to use
>>> SLURM during our DDoS exercises for dispatching DDoS attack scripts. We
>>> need a lot of parallel running jobs on a total of 3 nodes.I can't get it
>>> to run more than 128 jobs simultaneously. There are 128 cpu's in the
>>> compute nodes.
>>>
>>> How can I ensure that I can run more jobs in parallel than there are
>>> CPUs in the compute node?
>>>
>>> Thanks
>>> Karl
>>>
>>>
>>> My srun script is:
>>> srun --exclusive --nodes 3 --ntasks 384 /ddos/demo/showproc.sh
>>>
>>> And my slurm.conf file:
>>> ClusterName=ddos-cluster
>>> ControlMachine=slurm
>>> SlurmUser=ddos
>>> SlurmctldPort=6817
>>> SlurmdPort=6818
>>> AuthType=auth/munge
>>> StateSaveLocation=/opt/slurm/spool/ctld
>>> SlurmdSpoolDir=/opt/slurm/spool/d
>>> SwitchType=switch/none
>>> MpiDefault=none
>>> SlurmctldPidFile=/opt/slurm/run/.pid
>>> SlurmdPidFile=/opt/slurm/run/slurmd.pid
>>> ProctrackType=proctrack/pgid
>>> PluginDir=/opt/slurm/lib/slurm
>>> ReturnToService=2
>>> TaskPlugin=task/none
>>> SlurmctldTimeout=300
>>> SlurmdTimeout=300
>>> InactiveLimit=0
>>> MinJobAge=300
>>> KillWait=30
>>> Waittime=0
>>> SchedulerType=sched/backfill
>>>
>>> SelectType=select/cons_tres
>>> SelectTypeParameters=CR_Core
>>>
>>> SlurmctldDebug=3
>>> SlurmctldLogFile=/opt/slurm/log/slurmctld.log
>>> SlurmdDebug=3
>>> SlurmdLogFile=/opt/slurm/log/slurmd.log
>>> JobCompType=jobcomp/none
>>> JobAcctGatherType=jobacct_gather/none
>>> AccountingStorageTRES=gres/gpu
>>> DebugFlags=CPU_Bind,gres
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> AccountingStorageHost=localhost
>>> AccountingStoragePass=/var/run/munge/munge.socket.2
>>> AccountingStorageUser=slurm
>>> SlurmctldParameters=enable_configurable
>>> GresTypes=gpu
>>> DefMemPerNode=256000
>>> NodeName=aivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
>>> ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>>> NodeName=mivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
>>> ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>>> NodeName=fiod CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
>>> ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>>> PartitionName=ddos Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>> PartitionName=adhoc Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>>
>>> .
>>>
> .
>