Re: [slurm-users] SlurmdSpoolDir full
On 10-12-2023 17:29, Ryan Novosielski wrote:
> This is basically always somebody filling up /tmp, with /tmp residing on
> the same filesystem as the actual SlurmdSpoolDir.
>
> /tmp, without modifications, is almost certainly the wrong place for
> temporary HPC files. Too large.

Agreed! That's why temporary job directories can be configured in Slurm; see
this Wiki page for a summary:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#temporary-job-directories

/Ole

On Dec 8, 2023, at 10:02, Xaver Stiensmeier wrote:
> Dear slurm-user list,
>
> during a larger cluster run (the same I mentioned earlier, 242 nodes), I
> got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
> directory on the workers that is used for job state information
> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
> I was unable to find more precise information about that directory. We
> compute all data on another volume, so SlurmdSpoolDir has roughly 38 GB
> of free space where nothing is intentionally put during the run. This
> error only occurred on very few nodes.
>
> I would like to understand what slurmd is placing in this directory that
> fills up the space. Do you have any ideas? Due to the workflow used, we
> have a hard time reconstructing the exact scenario that caused this
> error. I guess the "fix" is to just pick a somewhat larger disk, but I am
> unsure whether Slurm behaves normally here.
>
> Best regards
> Xaver Stiensmeier
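[Editor's note: the per-job temporary directory setup that the linked page describes can be done with Slurm's job_container/tmpfs plugin. The sketch below shows the relevant options; the /localscratch path is an assumption for illustration, not taken from the thread.]

```
# slurm.conf -- give each job a private /tmp via the tmpfs container plugin
JobContainerType=job_container/tmpfs

# job_container.conf -- BasePath should sit on a large node-local disk;
# /localscratch is an example path, adjust to your nodes
AutoBasePath=true
BasePath=/localscratch
```

With this in place, each job's /tmp is bind-mounted under BasePath and cleaned up when the job ends, so runaway temporary files can no longer fill the filesystem that holds SlurmdSpoolDir.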
Re: [slurm-users] SlurmdSpoolDir full
We maintain /tmp as a separate partition on all nodes to mitigate exactly
this scenario, though it doesn't necessarily need to be part of the primary
system RAID; there is no need for /tmp resiliency.

Regards,
Peter

Peter Goode
Research Computing Systems Administrator
Lafayette College

> On Dec 10, 2023, at 11:33, Ryan Novosielski wrote:
>
> This is basically always somebody filling up /tmp, with /tmp residing on
> the same filesystem as the actual SlurmdSpoolDir.
>
> /tmp, without modifications, is almost certainly the wrong place for
> temporary HPC files. Too large.
>
>> On Dec 8, 2023, at 10:02, Xaver Stiensmeier wrote:
>>
>> Dear slurm-user list,
>>
>> during a larger cluster run (the same I mentioned earlier, 242 nodes), I
>> got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
>> directory on the workers that is used for job state information
>> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
>> I was unable to find more precise information about that directory. We
>> compute all data on another volume, so SlurmdSpoolDir has roughly 38 GB
>> of free space where nothing is intentionally put during the run. This
>> error only occurred on very few nodes.
>>
>> I would like to understand what slurmd is placing in this directory that
>> fills up the space. Do you have any ideas? Due to the workflow used, we
>> have a hard time reconstructing the exact scenario that caused this
>> error. I guess the "fix" is to just pick a somewhat larger disk, but I am
>> unsure whether Slurm behaves normally here.
>>
>> Best regards
>> Xaver Stiensmeier
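[Editor's note: a partition layout like the one Peter describes might look as follows in /etc/fstab. The device names, filesystem type, and mount options are illustrative assumptions, not details from the thread.]

```
# /etc/fstab -- keep /tmp on its own partition so a full /tmp cannot
# fill the filesystem holding SlurmdSpoolDir (devices are examples)
/dev/sdb1  /tmp               ext4  defaults,nosuid,nodev  0  2
/dev/sdb2  /var/spool/slurmd  ext4  defaults               0  2
```

Because /tmp holds only disposable data, it can live on a plain unmirrored disk, which is the point about not needing RAID resiliency.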
Re: [slurm-users] SlurmdSpoolDir full
This is basically always somebody filling up /tmp, with /tmp residing on the
same filesystem as the actual SlurmdSpoolDir.

/tmp, without modifications, is almost certainly the wrong place for
temporary HPC files. Too large.

> On Dec 8, 2023, at 10:02, Xaver Stiensmeier wrote:
>
> Dear slurm-user list,
>
> during a larger cluster run (the same I mentioned earlier, 242 nodes), I
> got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
> directory on the workers that is used for job state information
> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
> I was unable to find more precise information about that directory. We
> compute all data on another volume, so SlurmdSpoolDir has roughly 38 GB
> of free space where nothing is intentionally put during the run. This
> error only occurred on very few nodes.
>
> I would like to understand what slurmd is placing in this directory that
> fills up the space. Do you have any ideas? Due to the workflow used, we
> have a hard time reconstructing the exact scenario that caused this
> error. I guess the "fix" is to just pick a somewhat larger disk, but I am
> unsure whether Slurm behaves normally here.
>
> Best regards
> Xaver Stiensmeier
Re: [slurm-users] SlurmdSpoolDir full
Hello Brian Andrus,

we ran 'df -h' to determine the amount of free space I mentioned below. I
should also add that at the time we inspected the node, there were still
around 38 GB of space left; however, we were unable to watch the remaining
space while the error occurred, so maybe the large file(s) got removed
immediately.

I will take a look at /var/log. That's a good idea. I don't think there
will be anything unusual, but it's something I haven't considered yet (the
cause of the error being somewhere else entirely).

Best regards
Xaver

On 10.12.23 00:41, Brian Andrus wrote:
> Xaver,
>
> It is likely your /var or /var/spool mount. That may be a separate
> partition or part of your root partition. It is the partition that is
> full, not the directory itself, so the cause could very well be log
> files in /var/log.
>
> I would check to see what (if any) partitions are getting filled on the
> node. You can run 'df -h' to see some info that would get you started.
>
> Brian Andrus
>
> On 12/8/2023 7:00 AM, Xaver Stiensmeier wrote:
>> Dear slurm-user list,
>>
>> during a larger cluster run (the same I mentioned earlier, 242 nodes), I
>> got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
>> directory on the workers that is used for job state information
>> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
>> I was unable to find more precise information about that directory. We
>> compute all data on another volume, so SlurmdSpoolDir has roughly 38 GB
>> of free space where nothing is intentionally put during the run. This
>> error only occurred on very few nodes.
>>
>> I would like to understand what slurmd is placing in this directory that
>> fills up the space. Do you have any ideas? Due to the workflow used, we
>> have a hard time reconstructing the exact scenario that caused this
>> error. I guess the "fix" is to just pick a somewhat larger disk, but I
>> am unsure whether Slurm behaves normally here.
>>
>> Best regards
>> Xaver Stiensmeier
Re: [slurm-users] SlurmdSpoolDir full
Xaver,

It is likely your /var or /var/spool mount. That may be a separate
partition or part of your root partition. It is the partition that is full,
not the directory itself, so the cause could very well be log files in
/var/log.

I would check to see what (if any) partitions are getting filled on the
node. You can run 'df -h' to see some info that would get you started.

Brian Andrus

On 12/8/2023 7:00 AM, Xaver Stiensmeier wrote:
> Dear slurm-user list,
>
> during a larger cluster run (the same I mentioned earlier, 242 nodes), I
> got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
> directory on the workers that is used for job state information
> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
> I was unable to find more precise information about that directory. We
> compute all data on another volume, so SlurmdSpoolDir has roughly 38 GB
> of free space where nothing is intentionally put during the run. This
> error only occurred on very few nodes.
>
> I would like to understand what slurmd is placing in this directory that
> fills up the space. Do you have any ideas? Due to the workflow used, we
> have a hard time reconstructing the exact scenario that caused this
> error. I guess the "fix" is to just pick a somewhat larger disk, but I am
> unsure whether Slurm behaves normally here.
>
> Best regards
> Xaver Stiensmeier
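[Editor's note: Brian's suggestion can be sketched as the two checks below. The target path /var and the depth are examples, not details from the thread; point du at whichever partition df reports as nearly full.]

```shell
# Show how full each mounted partition is, as suggested above.
df -h

# Drill into the suspect filesystem to find the largest directories.
# -x keeps du from crossing into other mounted filesystems, so a full
# /var is not misattributed to something mounted beneath it.
du -xh --max-depth=2 /var 2>/dev/null | sort -h | tail -n 20
```

Running the du step periodically while jobs execute would also catch short-lived files that are deleted again before anyone logs in to look, which matches Xaver's observation that the space was already free by inspection time.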
Re: [slurm-users] SlurmdSpoolDir full
Hi Xaver,

On 12/8/23 16:00, Xaver Stiensmeier wrote:
> during a larger cluster run (the same I mentioned earlier, 242 nodes), I
> got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
> directory on the workers that is used for job state information
> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
> I was unable to find more precise information about that directory. We
> compute all data on another volume, so SlurmdSpoolDir has roughly 38 GB
> of free space where nothing is intentionally put during the run. This
> error only occurred on very few nodes.
>
> I would like to understand what slurmd is placing in this directory that
> fills up the space. Do you have any ideas? Due to the workflow used, we
> have a hard time reconstructing the exact scenario that caused this
> error. I guess the "fix" is to just pick a somewhat larger disk, but I am
> unsure whether Slurm behaves normally here.

With a Slurm RPM installation this directory is configured as:

$ scontrol show config | grep SlurmdSpoolDir
SlurmdSpoolDir          = /var/spool/slurmd

In SlurmdSpoolDir we find job scripts and various cached data. In our
cluster it's usually a few megabytes on each node. We have never had any
issues with the size of SlurmdSpoolDir.

Do you store SlurmdSpoolDir on shared network storage, or something
similar? Can your job scripts contain large amounts of data?

/Ole
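[Editor's note: following Ole's pointer, the spool path can be read from the running config and its contents sized in one go. This is a sketch assuming scontrol is on PATH and produces the "SlurmdSpoolDir = ..." line shown above.]

```shell
# Extract the spool path (third field of the scontrol output line),
# then list the largest entries in it: job scripts, cached state, etc.
SPOOL=$(scontrol show config | awk '/^SlurmdSpoolDir/ {print $3}')
du -sh "${SPOOL}"/* 2>/dev/null | sort -h | tail -n 10
```

If the entries here really are only a few megabytes, as Ole reports for his cluster, the "SlurmdSpoolDir full" message points at the enclosing partition being filled by something else (e.g. /tmp or /var/log) rather than by slurmd itself.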
[slurm-users] SlurmdSpoolDir full
Dear slurm-user list,

during a larger cluster run (the same I mentioned earlier, 242 nodes), I
got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
directory on the workers that is used for job state information
(https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However, I
was unable to find more precise information about that directory. We
compute all data on another volume, so SlurmdSpoolDir has roughly 38 GB of
free space where nothing is intentionally put during the run. This error
only occurred on very few nodes.

I would like to understand what slurmd is placing in this directory that
fills up the space. Do you have any ideas? Due to the workflow used, we
have a hard time reconstructing the exact scenario that caused this error.
I guess the "fix" is to just pick a somewhat larger disk, but I am unsure
whether Slurm behaves normally here.

Best regards
Xaver Stiensmeier