Re: [slurm-users] how to configure correctly node and memory when a script fails with out of memory

2023-10-30 Thread AMU

If I try to request just nodes and memory, for instance:
#SBATCH -N 2
#SBATCH --mem=0
to request all the memory on a node (2 nodes seem sufficient for a 
program that consumes 100GB), I got this error:

sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
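
Something that may avoid this error: --mem is a per-node limit, and with 
MaxMemPerCPU=5778 Slurm raises the job's CPU count until it covers the 
requested memory, so --mem=0 (all of a node's memory) can translate into 
more CPUs than a node actually has. A minimal sketch that states the 
memory explicitly instead (the 50G-per-node split and the task count are 
assumptions, untested here):

#!/bin/bash
#SBATCH -N 2
#SBATCH --mem=50G             # per-node request: 2 nodes x 50G = 100G total
#SBATCH --ntasks-per-node=9   # 9 CPUs x 5778M/CPU >= 50G, fits under MaxMemPerCPU
mpirun ../zsimpletest_analyse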


thanks

On 30/10/2023 15:46, Gérard Henry (AMU) wrote:

Hello all,


I can't configure the Slurm script correctly. My program needs 100GB of 
memory; that is the only criterion. But the job always fails with an 
out-of-memory error.

Here's the cluster configuration I'm using:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

partition:
DefMemPerCPU=5770 MaxMemPerCPU=5778
TRES=cpu=5056,mem=3002M,node=158
for each node: CPUAlloc=32 RealMemory=19 AllocMem=184640
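
(For what it's worth, those numbers look mutually consistent: 
AllocMem = 32 CPUs x DefMemPerCPU 5770M = 184640M, i.e. about 180GB 
allocatable per node at roughly 5.7GB per CPU.)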

My script contains:
#SBATCH -N 5
#SBATCH --ntasks=60
#SBATCH --mem-per-cpu=1500M
#SBATCH --cpus-per-task=1
...
mpirun ../zsimpletest_analyse

When it fails, sacct gives the following information:
JobID         JobName     Elapsed  NCPUS   TotalCPU    CPUTime ReqMem     MaxRSS MaxDiskRead MaxDiskWrite      State ExitCode
------------- ---------- -------- ------ ---------- ---------- ------ ---------- ----------- ------------ ---------- --------
8500578       analyse5   00:03:04     60   02:57:58   03:04:00     9M                                     OUT_OF_ME+    0:125
8500578.bat+  batch      00:03:04     16  46:34.302   00:49:04        21465736K       0.23M        0.01M  OUT_OF_ME+    0:125
8500578.0     orted      00:03:05     44   02:11:24   02:15:40           40952K       0.42M        0.03M   COMPLETED      0:0


I don't understand why a MaxRSS of 21465736K (about 21GB) leads to "out 
of memory" with 16 CPUs and 1500M per CPU (24000M).
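
A way to double-check the step caps against peak usage, using only 
standard sacct fields:

sacct -j 8500578 --format=JobID,State,NNodes,NCPUS,ReqMem,MaxRSS,MaxRSSNode

With --mem-per-cpu, each step is capped by the CPUs it holds on a given 
node; if the batch step really holds 16 CPUs on the first node, its cap 
is 16 x 1500M = 24000M (about 23.4GiB), not far above the 21465736K 
(about 20.5GiB) MaxRSS shown above.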


Can anybody help?

thanks in advance



--
Gérard HENRY
Institut Fresnel - UMR 7249
+33 413945457
Aix-Marseille Université - Campus Etoile, BATIMENT FRESNEL, Avenue 
Escadrille Normandie Niemen, 13013 Marseille

Site : https://fresnel.fr/
To respect the environment, please print this email only if necessary.




Re: [slurm-users] slurmctld and slurmdbd on the server, mysql on remote

2023-07-19 Thread AMU
Oops, I found my error: I forgot to remove JobCompHost. I found it after 
reading this:

https://bugs.schedmd.com/show_bug.cgi?id=2322#c5

sorry for the noise
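
For the record, the relevant slurm.conf lines after the fix look roughly 
like this (a sketch; the AccountingStorageHost value is an assumption 
about this site):

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost   # slurmctld talks to slurmdbd, not to MySQL directly
JobCompType=jobcomp/none
#JobCompHost=                     # removed: the jobcomp/mysql path makes slurmctld open its own MySQL connection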

On 19/07/2023 14:51, Gérard Henry (AMU) wrote:

Hello all,
Is it possible to have this configuration? I installed Slurm on Ubuntu 
20 LTS, but slurmctld refuses to start, with these messages:


[2023-07-19T14:37:59.563] Job completion MYSQL plugin loaded
[2023-07-19T14:37:59.563] debug:  /var/log/slurm/jobcomp doesn't look like a database name using slurm_jobcomp_db
[2023-07-19T14:37:59.563] debug2: mysql_connect() called for db slurm_jobcomp_db
[2023-07-19T14:37:59.571] debug2: Attempting to connect to localhost:3306
[2023-07-19T14:37:59.571] error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
[2023-07-19T14:37:59.572] fatal: You haven't inited this storage yet.

slurmdbd is running, and some data does seem to be written to the DB:
# sacctmgr show cluster
   Cluster ControlHost  ControlPort   RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall    QOS Def QOS
---------- ----------- ------------ ----- ----- ------- ------- --------- ------- ------- --------- ------- ------ -------
   cathena                        0     0     1                                                             normal

I don't understand why slurmctld needs to connect to MySQL, since it 
connects to slurmdbd.


The Slurm version is:
# slurmctld -V
slurm-wlm 19.05.5

Thanks in advance for help,




--
Gérard HENRY
Institut Fresnel - UMR 7249
+33 413945457
Aix-Marseille Université - Campus Etoile, BATIMENT FRESNEL, Avenue 
Escadrille Normandie Niemen, 13013 Marseille

Site : https://fresnel.fr/
To respect the environment, please print this email only if necessary.




[slurm-users] slurm 18.08.3 on CentOS 6.10: error: _slurm_cgroup_destroy

2020-02-28 Thread AMU

Hello,
on an old CentOS 6.10 machine, I've installed Slurm 18.08.3 from source 
and tried to set up a simple configuration (slurm.conf attached).
After starting slurmctld and slurmd, sinfo shows everything okay, but at 
the first submission with sbatch I get errors and the node becomes "drain":
[2020-02-28T14:44:57.883] [2.batch] error: _slurm_cgroup_destroy: Unable to move pid 10322 to root cgroup
[2020-02-28T14:44:57.883] [2.batch] error: proctrack_g_create: No such file or directory
[2020-02-28T14:44:57.883] [2.batch] error: job_manager exiting abnormally, rc = 4014
[2020-02-28T14:44:57.883] [2.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4014 status 0


I'm not very confident with cgroups, and I don't understand where the 
problem is.
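
One thing worth checking alongside ProctrackType=proctrack/cgroup is 
/etc/slurm/cgroup.conf. A minimal sketch, assuming the cgroup hierarchy 
is not mounted at boot on this old kernel (CgroupAutomount asks Slurm to 
mount it itself):

CgroupAutomount=yes
ConstrainCores=no
ConstrainRAMSpace=no

If cgroups remain unusable on CentOS 6, ProctrackType=proctrack/linuxproc 
in slurm.conf is the usual fallback.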

I read in the archives that people have had success with CentOS 6 and 
Slurm 18. Can anybody help?

Thanks in advance,

Gérard

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=tramel
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=99
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=4
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
TaskPluginParam=Sched
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
AccountingStorageLoc=/var/log/slurm/accounting
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/SlurmdLogFile.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=tramel CPUs=64 Boards=2 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=1033892
PartitionName=hipe Nodes=tramel Default=YES MaxTime=INFINITE State=UP

