All, I have a relatively successful cloud implementation of Slurm in Azure. I am experiencing an issue with the ResumeProgram not running. Things work great, but after a while it just plain stops calling the script. I have enabled debug logging on slurmctld, and I see jobs being assigned to nodes that are in the idle~ state, but no calls to the script.
If I restart slurmctld, the backlog starts running and things work. Any ideas what could cause this?

Brian Andrus