[slurm-users] bug when using SlurmctldParameters=cloud_reg_addrs ? error: get_name_info: getnameinfo() failed: Name or service not known

2021-10-25 Thread Pablo Escobar Lopez
Hi, I have configured Slurm cloud scheduling for OpenStack. I am using CentOS 7 with Slurm version 20.11.8 installed from the EPEL RPMs. It is working fine, but I am getting some strange errors in the Slurm master logs, which I think are a bug. I am using these options in slurm.conf:

Re: [slurm-users] Suspending jobs for file system maintenance

2021-10-25 Thread Juergen Salk
Hi Alan and Paul, I can't claim to be a Lustre guru, but my understanding is that Lustre failover does not imply umount/mount of the file system on the client side. On the client side, the OSTs just stall until they are back. So open file handles should actually be kept during that process.

Re: [slurm-users] Suspending jobs for file system maintenance

2021-10-25 Thread Paul Edmon
I think it depends on the filesystem type. Lustre generally fails over nicely and handles reconnections without much of a problem. We've done this before without any hitches, even with the jobs being live. Generally the jobs just hang and then resolve once the filesystem comes back. On a

Re: [slurm-users] Suspending jobs for file system maintenance

2021-10-25 Thread Alan Orth
Dear Jurgen and Paul, This is an interesting strategy, thanks for sharing. So if I read the scontrol man page correctly, `scontrol suspend` sends a SIGSTOP to all job processes. The processes remain in memory, but are paused. What happens to open file handles, since the underlying filesystem goes