All,

I have configured a backup slurmctld system and it appears to work at first, but not in practice. In particular, when I start it, it says it is running in background mode:

    [2017-01-25T14:23:37.648] slurmctld version 16.05.6 started on cluster hamming
    [2017-01-25T14:23:37.650] slurmctld running in background mode

But if I stop the primary daemon, it does not take over. I keep getting Invalid RPC errors (random snippets):

    [2017-01-25T15:50:37.664] error: Invalid RPC received 2007 while in standby mode
    [2017-01-25T15:53:50.495] error: Invalid RPC received 5018 while in standby mode
    [2017-01-25T15:59:36.847] error: Invalid RPC received 2007 while in standby mode
    [2017-01-25T15:59:37.499] error: Invalid RPC received 2007 while in standby mode
    [2017-01-25T15:59:38.923] error: Invalid RPC received 2007 while in standby mode
    [2017-01-25T15:59:38.985] error: Invalid RPC received 2007 while in standby mode
    [2017-01-25T15:59:39.246] error: Invalid RPC received 2007 while in standby mode
    [2017-01-25T15:59:39.293] error: Invalid RPC received 2009 while in standby mode
    [2017-01-25T15:59:39.522] error: Invalid RPC received 5018 while in standby mode
    [2017-01-25T15:59:43.839] error: Invalid RPC received 2009 while in standby mode
    [2017-01-25T15:59:43.930] error: Invalid RPC received 2009 while in standby mode
    [2017-01-25T16:19:47.215] error: Invalid RPC received 6012 while in standby mode
    [2017-01-25T16:19:48.238] error: Invalid RPC received 6012 while in standby mode

On any client, a command such as 'sinfo' simply hangs.

The interfaces for both slurmctld controllers are in the 'trusted' firewall group and there is no filtering between them.

Is there something I am missing to make the backup controller 'kick in' and start responding to requests?

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238