Dear Group, I have a question on Slurm robustness. In this scenario, I am looking at a cluster that is built out of VMs running in OpenStack.
If the Slurm controller machine were to crash (and there is no failover), I would assume this makes no difference to running jobs. The slurmd on each compute node would keep trying to report to the controller, and when the controller comes back (assuming persistent disk storage for the state files), jobs would eventually be reported as running/finished.

However, in OpenStack, if you restart a machine from an image, it can come back with a different IP address. slurm.conf specifies both ControlMachine and ControlAddr. If the ControlAddr entry changes, but not ControlMachine, will the slurmds on the compute nodes still be able to communicate with the controller?

We can push out new slurm.conf files to the compute nodes and restart each slurmd, but in that case, will the controller still be able to determine the correct state of jobs that were running at the time it went down?

If the job database or slurmdbd were to fail for a period of time, would that have any long-term effect on running jobs (other than possibly incorrect reporting of job times/usage)?

thanks
--
Simon Michnowicz
Monash e-Research Centre
PH: (03) 9902 0794
Mob: 0418 302 046
www.monash.edu.au/eresearch
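P.S. For concreteness, these are the slurm.conf entries I am referring to (the hostname and address below are made up for illustration):

```
# slurm.conf, identical on the controller and all compute nodes
ControlMachine=slurm-ctl     # hostname of the slurmctld machine
ControlAddr=10.0.0.5         # its IP address; this is the value that can change
                             # when the controller VM is rebuilt from an image
```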
