On 12/07/2012 04:24 PM, Daniel Petersen wrote:
>
> Hi,
>
> I've been running several different scenarios to test how RebootProgram
> behaves, and I'm seeing a problem consistently: pending job(s) which
> should run on the node after the node has rebooted and come back online
> are instead failing immediately when the node begins to shut down.
>
> Here's a more detailed breakdown of the scenario, what I expect, and
> what I see:
>
> scenario:
> * Node X is idle.
> * Submit jobs A and B to the queue. The jobs are confined to run only on
>   node X.
> * Job A starts running on node X.
> * While job A is running, run the command 'scontrol reboot_nodes nodeX'.
>
> what I would expect:
>
> 1. slurmd (if that's where the RebootProgram logic is) on nodeX
>    determines a suitable 'idle point' after which it will block pending
>    job B from being scheduled, so it can reboot at that point.
>
> 2. Job A is allowed to run and finish up to this 'idle point'. All
>    pending jobs which would run past this point are blocked from
>    starting (i.e. job B).
>
> 3. The node is now idle, so it reboots.
>
> 4. After the node reboots and comes online again, it can take the
>    pending job B, which had previously been blocked.
>
> what I'm seeing:
> At step 3 above, right when it runs the RebootProgram, it also seems to
> be continuing to run the next job(s) in the queue, and since the node is
> rebooting, the jobs fail. For some reason, the logic to block/hold
> pending jobs seems to be failing.
>
> That's pretty much it. Please let me know if I'm missing something with
> my assumptions of expected behavior. I'm assuming this isn't a prevalent
> problem since I couldn't find anything in the mailing list, so perhaps
> something in our configuration is unusual and is triggering this.
>
> Since the problem seems to relate to the internal logic of how
> RebootProgram handles the queue flow when it's about to reboot, I don't
> know what other information to include to help troubleshoot, be it
> specific configuration details, logging or what have you. Just let me
> know and I can provide it.
>
> Kind Regards,
>
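
For reference, the expected behaviour in steps 1-4 above can be sketched as a toy simulation. This is only a model of the expectation described in the quoted message, not Slurm's actual scheduling code; the Node class, the schedule() helper and all method names are hypothetical and exist purely for illustration.

```python
from collections import deque


class Node:
    """Toy model of a node with a pending reboot request (not Slurm code)."""

    def __init__(self, name):
        self.name = name
        self.running = None            # job currently running, if any
        self.reboot_requested = False  # set by the 'scontrol reboot_nodes' step
        self.log = []

    def request_reboot(self):
        self.reboot_requested = True
        self.log.append("reboot requested")

    def can_start_job(self):
        # Expected rule: no new job may start while a reboot is pending.
        return self.running is None and not self.reboot_requested

    def start(self, job):
        self.running = job
        self.log.append(f"start {job}")

    def finish(self):
        self.log.append(f"finish {self.running}")
        self.running = None

    def reboot_if_idle(self):
        # Reboot only once the running job has finished.
        if self.reboot_requested and self.running is None:
            self.log.append("reboot")
            self.reboot_requested = False


def schedule(node, queue):
    """One scheduler pass: start the next pending job if the node allows it."""
    if queue and node.can_start_job():
        node.start(queue.popleft())


def simulate():
    node = Node("nodeX")
    queue = deque(["A", "B"])
    schedule(node, queue)    # job A starts
    node.request_reboot()    # reboot requested while A is running
    schedule(node, queue)    # B stays pending: node is busy
    node.finish()            # A completes
    schedule(node, queue)    # B still pending: reboot not yet done
    node.reboot_if_idle()    # node is idle, so it reboots
    schedule(node, queue)    # node is back, B starts
    return node.log


if __name__ == "__main__":
    print(simulate())
```

The behaviour reported here corresponds to the third schedule() call starting job B even though the reboot is still pending, which is exactly what can_start_job() is meant to prevent in this model.
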
Hi,

I ran some more tests and what I'm seeing is consistent: when the
RebootProgram is called, the next job(s) in the queue, which should be
blocked until after the reboot, are instead being started during the
reboot and therefore fail.

I don't know how to go further with this, but it does seem to be a Slurm
issue. Any suggestions on how to pin this problem down further would be
appreciated.

It would also be interesting to hear from any of you who use
RebootProgram about your setup, reliability, etc.

Regards,
-- 
Daniel Petersen
