Re: [slurm-users] how to configure correctly node and memory when a script fails with out of memory
If I try to request just nodes and memory, for instance:

    #SBATCH -N 2
    #SBATCH --mem=0

to request all the memory on a node (and 2 nodes seem sufficient for a program that consumes 100GB), I get this error:

    sbatch: error: CPU count per node can not be satisfied
    sbatch: error: Batch job submission failed: Requested node configuration is not available

Thanks

On 30/10/2023 15:46, Gérard Henry (AMU) wrote:
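The "CPU count per node can not be satisfied" error is consistent with how MaxMemPerCPU interacts with memory requests: when the per-node memory request exceeds MaxMemPerCPU times the requested CPUs, Slurm silently raises the CPU count to cover the memory, and with --mem=0 (all of a node's memory) that implied CPU count can exceed what the node actually has. A rough sketch of the arithmetic, using a hypothetical node size of 190000 MB (the real RealMemory value is garbled in the post below) and the partition's MaxMemPerCPU=5778:

```shell
# Hypothetical node size; MaxMemPerCPU taken from the partition config quoted below.
node_mem_mb=190000
max_mem_per_cpu_mb=5778

# Slurm raises the per-node CPU request to ceil(mem / MaxMemPerCPU):
cpus_implied=$(( (node_mem_mb + max_mem_per_cpu_mb - 1) / max_mem_per_cpu_mb ))
echo "$cpus_implied"   # 33 -> more than the 32 CPUs available per node
```

Requesting a concrete per-node amount instead (for example `#SBATCH --mem=50G` with `-N 2`) keeps the implied CPU count within the node's limits, assuming ~50GB per node covers the program's needs.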
[slurm-users] how to configure correctly node and memory when a script fails with out of memory
Hello all,

I can't configure the slurm script correctly. My program needs 100GB of memory; that is the only requirement. But the job always fails with an out-of-memory error.

Here is the cluster configuration I'm using:

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory

partition:

    DefMemPerCPU=5770 MaxMemPerCPU=5778
    TRES=cpu=5056,mem=3002M,node=158

for each node:

    CPUAlloc=32 RealMemory=19 AllocMem=184640

My script contains:

    #SBATCH -N 5
    #SBATCH --ntasks=60
    #SBATCH --mem-per-cpu=1500M
    #SBATCH --cpus-per-task=1
    ...
    mpirun ../zsimpletest_analyse

When it fails, sacct gives the following information:

    JobID         JobName   Elapsed   NCPUS  TotalCPU   CPUTime   ReqMem  MaxRSS     MaxDiskRead  MaxDiskWrite  State       ExitCode
    ------------  --------  --------  -----  ---------  --------  ------  ---------  -----------  ------------  ----------  --------
    8500578       analyse5  00:03:04  60     02:57:58   03:04:00  9M                                            OUT_OF_ME+  0:125
    8500578.bat+  batch     00:03:04  16     46:34.302  00:49:04          21465736K  0.23M        0.01M         OUT_OF_ME+  0:125
    8500578.0     orted     00:03:05  44     02:11:24   02:15:40          40952K     0.42M        0.03M         COMPLETED   0:0

I don't understand why MaxRSS=21465736K (about 21GB) leads to "out of memory" with 16 CPUs and 1500M per CPU (about 24GB). Can anybody help?

Thanks in advance

--
Gérard HENRY
Institut Fresnel - UMR 7249
+33 413945457
Aix-Marseille Université - Campus Etoile, BATIMENT FRESNEL, Avenue Escadrille Normandie Niemen, 13013 Marseille
Site : https://fresnel.fr/
To respect the environment, please print this email only if necessary.
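One plausible explanation, not confirmed in the thread: the submission never requests enough memory in total. 60 tasks at 1500M per CPU is 90000M (~88GB), already below the ~100GB the program needs, so some rank's node-local share can be exhausted even while the polled MaxRSS values look safe (MaxRSS is sampled, and with CR_Core_Memory the limit is enforced per node, not cluster-wide). A quick check of the totals:

```shell
# Numbers taken from the #SBATCH lines in the script above.
ntasks=60
mem_per_cpu_mb=1500
total_req_mb=$(( ntasks * mem_per_cpu_mb ))
echo "$total_req_mb"   # 90000 MB (~88 GB), below the ~100 GB the program needs
```

Raising --mem-per-cpu (within the partition's MaxMemPerCPU=5778) or the task count so that the product comfortably exceeds 100GB would be the first thing to try.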
Re: [slurm-users] slurmctld and slurmdbd on the server, mysql on remote
Oops, I found my error: I had forgotten to remove JobCompHost. I found it after reading this:
https://bugs.schedmd.com/show_bug.cgi?id=2322#c5

Sorry for the noise.

On 19/07/2023 14:51, Gérard Henry (AMU) wrote:
[slurm-users] slurmctld and slurmdbd on the server, mysql on remote
Hello all,

Is it possible to have this configuration? I installed slurm on Ubuntu 20 LTS, but slurmctld refuses to start, with these messages:

    [2023-07-19T14:37:59.563] Job completion MYSQL plugin loaded
    [2023-07-19T14:37:59.563] debug: /var/log/slurm/jobcomp doesn't look like a database name using slurm_jobcomp_db
    [2023-07-19T14:37:59.563] debug2: mysql_connect() called for db slurm_jobcomp_db
    [2023-07-19T14:37:59.571] debug2: Attempting to connect to localhost:3306
    [2023-07-19T14:37:59.571] error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
    [2023-07-19T14:37:59.572] fatal: You haven't inited this storage yet.

slurmdbd is running, and some data does seem to be written to the database:

    # sacctmgr show cluster
      Cluster  ControlHost  ControlPort  RPC  Share  GrpJobs  GrpTRES  GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall  QOS     Def QOS
    ---------  -----------  -----------  ---  -----  -------  -------  ---------  -------  -------  ---------  -------  ------  -------
      cathena                          0    0      1                                                                    normal

I don't understand why slurmctld needs to connect to mysql, since it connects to slurmdbd. The slurm version is:

    # slurmctld -V
    slurm-wlm 19.05.5

Thanks in advance for your help,
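The resolution in the reply above was removing JobCompHost: when JobCompHost is set, slurmctld loads the MySQL job-completion plugin and opens its own database connection instead of going through slurmdbd. A minimal slurm.conf sketch of the relevant lines (option names are from slurm.conf(5); the host value is an assumption for this setup):

```
# slurm.conf (sketch): let slurmdbd own the MySQL connection
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost   # host running slurmdbd, not MySQL
JobCompType=jobcomp/none          # avoid loading the jobcomp MySQL plugin in slurmctld
#JobCompHost=                     # leave unset -- this setting was the culprit
```

With this layout, only slurmdbd (configured via slurmdbd.conf) needs to reach the MySQL server, which can live on a remote host.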
[slurm-users] slurm 18.08.3 on CentOS 6.10: error: _slurm_cgroup_destroy
Hello,

On an old CentOS 6.10 machine, I installed slurm 18.08.3 from source and tried to set up a simple configuration (slurm.conf attached below). After starting slurmctld and slurmd, sinfo shows everything okay, but on the first submission with sbatch I get errors and the node becomes "drain":

    [2020-02-28T14:44:57.883] [2.batch] error: _slurm_cgroup_destroy: Unable to move pid 10322 to root cgroup
    [2020-02-28T14:44:57.883] [2.batch] error: proctrack_g_create: No such file or directory
    [2020-02-28T14:44:57.883] [2.batch] error: job_manager exiting abnormally, rc = 4014
    [2020-02-28T14:44:57.883] [2.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4014 status 0

I'm not very confident with cgroups and don't understand where the problem is. I read in the archives that people have had success with CentOS 6 and slurm 18. Can anybody help?

Thanks in advance,
Gérard

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=tramel
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=99
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=4
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
TaskPluginParam=Sched
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
AccountingStorageLoc=/var/log/slurm/accounting
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/SlurmdLogFile.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=tramel CPUs=64 Boards=2 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=1033892
PartitionName=hipe Nodes=tramel Default=YES MaxTime=INFINITE State=UP
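The "_slurm_cgroup_destroy: Unable to move pid ... to root cgroup" error with ProctrackType=proctrack/cgroup usually means the cgroup hierarchies slurmd expects are not mounted where it looks for them; on CentOS 6 cgroups are not mounted by default (the cgconfig service mounts them, typically under /cgroup rather than /sys/fs/cgroup). A hedged cgroup.conf sketch for this kind of setup; the mountpoint and options are assumptions to check against cgroup.conf(5) for 18.08:

```
# cgroup.conf (sketch) -- assumes cgroups live under /cgroup (CentOS 6 cgconfig default)
CgroupMountpoint=/cgroup
CgroupAutomount=yes     # let slurmd mount missing subsystems itself
ConstrainCores=yes
```

If cgroups cannot be made to work reliably on such an old kernel, switching to ProctrackType=proctrack/pgid in slurm.conf is a commonly used fallback, at the cost of weaker process tracking.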