Dear Group,
I have a question on slurm robustness. In this scenario, I am looking at a
cluster that is built out of VMs running in OpenStack.

If the slurm controller machine were to crash (and there is no failover), I
would assume this makes no difference to running jobs. The slurmd on the
compute nodes would keep trying to report to the controller, and when the
controller comes back (assuming persistent disk storage for the state
files), jobs would eventually be reported as running/finished.

However, in OpenStack, if you restart a machine from an image, it can come
up with a different IP address. slurm.conf specifies both ControlMachine
and ControlAddr.
If the ControlAddr entry changes, but not ControlMachine, will the slurmds
on the compute nodes still be able to communicate with the slurm controller?
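For reference, this is the sort of slurm.conf fragment I have in mind (the hostname, address, and path below are illustrative, not our actual values):

```
# slurm.conf (identical on controller and compute nodes)
ControlMachine=slurm-ctl              # hostname of the controller VM
ControlAddr=10.0.0.15                 # IP of the controller; may change after an OpenStack rebuild
StateSaveLocation=/var/spool/slurmctld  # must live on persistent storage for state recovery
```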

We can push out new slurm.conf files to the compute nodes and restart each
slurmd, but in that case, will the controller still be able to determine
the correct state of jobs that were running at the time the slurm
controller went down?
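By "push out and restart" I mean something along these lines (node names are hypothetical; this sketch just prints the commands it would run rather than executing them, since the hosts do not exist here):

```shell
# Dry run: print the copy/restart commands for each compute node.
# Replace the echoes with the real scp/ssh calls on an actual cluster.
for node in node01 node02; do
    echo scp /etc/slurm/slurm.conf "$node":/etc/slurm/slurm.conf
    echo ssh "$node" systemctl restart slurmd
done
```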

If the job database or slurmdbd were to fail for a period of time, would
this have any long-term effect on running jobs (other than possibly
incorrect reporting of job times/usage)?

thanks

-- 
Simon Michnowicz
Monash e-Research Centre
PH:   (03) 9902 0794
Mob: 0418 302 046
www.monash.edu.au/eresearch
