Re: [slurm-users] SlurmdSpoolDir full

2023-12-10 Thread Ole Holm Nielsen
On 10-12-2023 17:29, Ryan Novosielski wrote: This is basically always somebody filling up /tmp, with /tmp residing on the same filesystem as the actual SlurmdSpoolDir. /tmp, without modifications, is almost certainly the wrong place for temporary HPC files. Too large. Agreed! That's

Re: [slurm-users] SlurmdSpoolDir full

2023-12-10 Thread Peter Goode
We maintain /tmp as a separate partition on all nodes to mitigate this exact scenario, though it doesn’t necessarily need to be part of the primary system RAID. There is no need for /tmp resiliency. Regards, Peter. Peter Goode, Research Computing Systems Administrator, Lafayette College > On Dec 10,

Re: [slurm-users] SlurmdSpoolDir full

2023-12-10 Thread Ryan Novosielski
This is basically always somebody filling up /tmp, with /tmp residing on the same filesystem as the actual SlurmdSpoolDir. /tmp, without modifications, is almost certainly the wrong place for temporary HPC files. Too large. > On Dec 8, 2023, at 10:02, Xaver

Re: [slurm-users] SlurmdSpoolDir full

2023-12-10 Thread Xaver Stiensmeier
Hello Brian Andrus, we ran 'df -h' to determine the amount of free space I mentioned below. I should also add that at the time we inspected the node, there was still around 38 GB of space left. However, we were unable to watch the remaining space while the error occurred, so maybe the large

Re: [slurm-users] SlurmdSpoolDir full

2023-12-09 Thread Brian Andrus
Xaver, It is likely your /var or /var/spool mount. That may be a separate partition or part of your root partition. It is the partition that is full, not the directory itself. So the cause could very well be log files in /var/log. I would check to see what (if any) partitions are getting

Re: [slurm-users] SlurmdSpoolDir full

2023-12-08 Thread Ole Holm Nielsen
Hi Xaver, On 12/8/23 16:00, Xaver Stiensmeier wrote: during a larger cluster run (the same one I mentioned earlier, with 242 nodes), I got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a directory on the workers that is used for job state information

[slurm-users] SlurmdSpoolDir full

2023-12-08 Thread Xaver Stiensmeier
Dear slurm-users list, during a larger cluster run (the same one I mentioned earlier, with 242 nodes), I got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a directory on the workers that is used for job state information (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir).
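
Several replies in this thread point out that it is the filesystem holding the spool directory that fills up, not the directory itself, and that /tmp or /var/log often share that filesystem. A minimal sketch of how one might check this on a worker node follows. The paths are assumptions (/var/spool/slurmd is the documented default for SlurmdSpoolDir, but your slurm.conf may set something else), and the helper names same_filesystem and usage_report are hypothetical, introduced only for illustration.

```python
import os
import shutil


def same_filesystem(path_a: str, path_b: str) -> bool:
    """True if both paths report the same device ID, i.e. share a filesystem."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def usage_report(path: str) -> dict:
    """Total/used/free space, in GiB, of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    gib = 2 ** 30
    return {
        "path": path,
        "total_gib": usage.total / gib,
        "used_gib": usage.used / gib,
        "free_gib": usage.free / gib,
    }


if __name__ == "__main__":
    # Assumed paths: replace spool_dir with the SlurmdSpoolDir from your slurm.conf.
    spool_dir = "/var/spool/slurmd"
    candidates = ["/tmp", "/var/log"]

    for path in [spool_dir] + candidates:
        if not os.path.exists(path):
            print(f"{path}: not present on this node")
            continue
        report = usage_report(path)
        print(f"{path}: {report['free_gib']:.1f} GiB free "
              f"of {report['total_gib']:.1f} GiB")

    # Flag directories that compete with the spool directory for space.
    if os.path.exists(spool_dir):
        for path in candidates:
            if os.path.exists(path) and same_filesystem(spool_dir, path):
                print(f"{path} shares a filesystem with {spool_dir}")
```

Comparing device IDs is only a heuristic: it reports what the kernel sees at the moment the script runs, so a filesystem that fills up briefly during a job and is cleaned afterwards would need this check to be run repeatedly while the job is active.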