Re: [slurm-users] Running multi jobs on one CPU in parallel

2021-09-14 Thread Williams, Gareth (IM&T, Black Mountain)
The simplest approach might be to run multiple processes within each batch job.

Gareth
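
A minimal sketch of that approach: submit one batch job per node and let the shell fan out worker processes itself, so Slurm's one-task-per-CPU accounting never comes into play. The worker command and the process count below are placeholders, not anything from the original post.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
# One Slurm task; the shell itself launches many background workers,
# so the number of workers is not limited by the CPU count.
NPROCS=256                        # placeholder count
for i in $(seq 1 "$NPROCS"); do
    echo "worker $i started" &    # placeholder for the real worker command
done
wait                              # return only after all workers have exited
```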



From: slurm-users on behalf of Emre Brookes
Sent: Wednesday, September 15, 2021 6:42:24 AM
To: Karl Lovink; Slurm User Community List
Subject: Re: [slurm-users] Running multi jobs on one CPU in parallel

Hi Karl,

I haven't tested the MAX_TASKS_PER_NODE limit myself.
According to the slurm.conf documentation:

*MaxTasksPerNode*
Maximum number of tasks Slurm will allow a job step to spawn on a
single node.
The default *MaxTasksPerNode* is 512. It may not exceed 65533.

So I'd try setting that and running "scontrol reconfigure"
before attempting a recompile.
The documentation seems inconsistent on this point.

-Emre
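
Concretely, the change suggested above would amount to something like the following on the controller node. The value 1024 is purely illustrative; the only documented bound is the 65533 ceiling quoted earlier, and whether a reconfigure (rather than a daemon restart) picks the change up is exactly the point the documentation is unclear on.

```
# In slurm.conf (illustrative value; must not exceed 65533):
MaxTasksPerNode=1024

# Then push the updated configuration to the daemons:
scontrol reconfigure
```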



Karl Lovink wrote:
> Hi Emre,
>
> MAX_TASKS_PER_NODE is set to 512. Does this mean I cannot run more than
> 512 jobs in parallel on one node? Or can I change MAX_TASKS_PER_NODE to
> a higher value and recompile Slurm?
>
> Regards,
> Karl
>
>
> On 14/09/2021 21:47, Emre Brookes wrote:
>> *-O*, *--overcommit*
>> Overcommit resources. When applied to job allocation, only one CPU
>> is allocated to the job per node and options used to specify the
>> number of tasks per node, socket, core, etc. are ignored. When
>> applied to job step allocations (the *srun* command when executed
>> within an existing job allocation), this option can be used to
>> launch more than one task per CPU. Normally, *srun* will not
>> allocate more than one process per CPU. By specifying *--overcommit*
>> you are explicitly allowing more than one process per CPU. However
>> no more than *MAX_TASKS_PER_NODE* tasks are permitted to execute per
>> node. NOTE: *MAX_TASKS_PER_NODE* is defined in the file /slurm.h/
>> and is not a variable, it is set at Slurm build time.
>>
>> I have used this successfully to run more jobs than there are CPUs/cores available.
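
Applied to the srun command from the original post, that would look something like the line below. This is an untested sketch: the task count is still capped per node by MAX_TASKS_PER_NODE, and --overcommit changes how many tasks land on each allocated CPU, not the total resources allocated.

```
srun --overcommit --nodes 3 --ntasks 384 /ddos/demo/showproc.sh
```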
>>
>> -e.
>>
>>
>>
>> Karl Lovink wrote:
>>> Hello,
>>>
>>> I am in the process of setting up our Slurm environment. We want to use
>>> Slurm during our DDoS exercises for dispatching DDoS attack scripts. We
>>> need a lot of jobs running in parallel across a total of 3 nodes. I can't
>>> get it to run more than 128 jobs simultaneously. There are 128 CPUs in
>>> each compute node.
>>>
>>> How can I ensure that I can run more jobs in parallel than there are
>>> CPUs in a compute node?
>>>
>>> Thanks
>>> Karl
>>>
>>>
>>> My srun script is:
>>> srun --exclusive --nodes 3 --ntasks 384 /ddos/demo/showproc.sh
>>>
>>> And my slurm.conf file:
>>> ClusterName=ddos-cluster
>>> ControlMachine=slurm
>>> SlurmUser=ddos
>>> SlurmctldPort=6817
>>> SlurmdPort=6818
>>> AuthType=auth/munge
>>> StateSaveLocation=/opt/slurm/spool/ctld
>>> SlurmdSpoolDir=/opt/slurm/spool/d
>>> SwitchType=switch/none
>>> MpiDefault=none
>>> SlurmctldPidFile=/opt/slurm/run/.pid
>>> SlurmdPidFile=/opt/slurm/run/slurmd.pid
>>> ProctrackType=proctrack/pgid
>>> PluginDir=/opt/slurm/lib/slurm
>>> ReturnToService=2
>>> TaskPlugin=task/none
>>> SlurmctldTimeout=300
>>> SlurmdTimeout=300
>>> InactiveLimit=0
>>> MinJobAge=300
>>> KillWait=30
>>> Waittime=0
>>> SchedulerType=sched/backfill
>>>
>>> SelectType=select/cons_tres
>>> SelectTypeParameters=CR_Core
>>>
>>> SlurmctldDebug=3
>>> SlurmctldLogFile=/opt/slurm/log/slurmctld.log
>>> SlurmdDebug=3
>>> SlurmdLogFile=/opt/slurm/log/slurmd.log
>>> JobCompType=jobcomp/none
>>> JobAcctGatherType=jobacct_gather/none
>>> AccountingStorageTRES=gres/gpu
>>> DebugFlags=CPU_Bind,gres
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> AccountingStorageHost=localhost
>>> AccountingStoragePass=/var/run/munge/munge.socket.2
>>> AccountingStorageUser=slurm
>>> SlurmctldParameters=enable_configurable
>>> GresTypes=gpu
>>> DefMemPerNode=256000
>>> NodeName=aivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
>>> ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>>> NodeName=mivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
>>> ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>>> NodeName=fiod CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
>>> ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
>>> PartitionName=ddos Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>> PartitionName=adhoc Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>>



