Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-08-30 Thread Brian Andrus
After you restart slurmctld, do "scontrol reconfigure". Brian Andrus On 8/30/2019 6:57 AM, Robert Kudyba wrote: I had set RealMemory to a really high number as I misinterpreted the recommendation. NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1 But now I
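A minimal sketch of that sequence on the head node, assuming slurmctld runs under systemd (service management may differ on a Bright-managed cluster):

    systemctl restart slurmctld   # pick up the edited slurm.conf on the controller
    scontrol reconfigure          # tell the running Slurm daemons to re-read slurm.conf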

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-08-30 Thread Robert Kudyba
I had set RealMemory to a really high number as I misinterpreted the recommendation. NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1 But now I set it to RealMemory=191000. I restarted slurmctld. And according to the Bright Cluster support team: "Unless it ha
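For reference, the corrected node definition would look roughly like this in slurm.conf (values taken from the message above; RealMemory is expressed in megabytes):

    NodeName=node[001-003] CoresPerSocket=12 RealMemory=191000 Sockets=2 Gres=gpu:1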

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-08-29 Thread Alex Chekholko
Sounds like maybe you didn't correctly roll out / update your slurm.conf everywhere, as your RealMemory value is back to the large, wrong number. You need to update your slurm.conf everywhere and restart all the Slurm daemons. I recommend the "safe procedure" from here: https://wiki.fysik.dtu.dk/ni
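A rough sketch of that rollout, assuming the config lives at /etc/slurm/slurm.conf and the daemons run under systemd (Bright clusters may keep slurm.conf elsewhere and manage services through cmsh, so adjust paths and hostnames to your site):

    # push the updated config to every compute node and restart its slurmd
    for n in node001 node002 node003; do
        scp /etc/slurm/slurm.conf $n:/etc/slurm/slurm.conf
        ssh $n systemctl restart slurmd
    done
    # then restart the controller on the head node
    systemctl restart slurmctld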

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-08-29 Thread Robert Kudyba
I thought I had taken care of this a while back but it appears the issue has returned. A very simple sbatch slurmhello.sh:

    cat slurmhello.sh
    #!/bin/sh
    #SBATCH -o my.stdout
    #SBATCH -N 3
    #SBATCH --ntasks=16
    module add shared openmpi/gcc/64/1.10.7 slurm
    mpirun hello

    sbatch slurmhello.sh
    Submitted b
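When a submission sits in the queue like this, the pending reason and the node state usually point at the cause; a quick diagnostic sketch with standard Slurm commands (the job id and node name here are placeholders):

    squeue -u $USER              # NODELIST(REASON) column shows why a job is still pending
    scontrol show job <jobid>    # full job record, including the Reason= field
    scontrol show node node001   # node state plus the configured RealMemory/CPUs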

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-07-08 Thread Robert Kudyba
Thanks Brian, indeed we did have it set in bytes. I set it to the MB value. Hoping this takes care of the situation. > On Jul 8, 2019, at 4:02 PM, Brian Andrus wrote: > > Your problem here is that the configuration for the nodes in question has an > incorrect amount of memory set for them. Loo

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-07-08 Thread Brian Andrus
Your problem here is that the configuration for the nodes in question has an incorrect amount of memory set for them. It looks like you have it set in bytes instead of megabytes. In your slurm.conf you should look at the RealMemory setting: *RealMemory* Size of real memory on the node in megab
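A simple way to get a sane RealMemory value is to ask the node itself: slurmd -C prints the hardware that slurmd detects, already formatted as a slurm.conf line, with RealMemory in megabytes:

    slurmd -C    # run on a compute node; prints NodeName=... CoresPerSocket=... RealMemory=...
    free -m      # total memory in MB, as a cross-check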

[slurm-users] sbatch tasks stuck in queue when a job is hung

2019-07-08 Thread Robert Kudyba
I’m new to Slurm and we have a 3-node + head-node cluster running CentOS 7 and Bright Cluster 8.1. Their support sent me here, as they say Slurm is configured optimally to allow multiple tasks to run. However, at times a job will hold up new jobs. Are there any other logs I can look at and/or sett
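Two standard starting points for that (log paths vary per site, so query the running configuration rather than guessing):

    scontrol show config | grep -i log    # shows the SlurmctldLogFile / SlurmdLogFile locations
    sinfo -R                              # lists nodes that are down or drained, with the reason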