Not sure if anyone will know the answer to this, but I thought I would ask: is 
backfill supposed to work on BGQ when a midplane is being drained to recover 
from nodes being in a SoftwareError state? 

Right now our BGQ has a node in the SoftwareError state, so it is draining all 
jobs on that midplane so it can reboot the block. The last running job will 
finish in about 9 hours, but there are jobs waiting with 6-hour time limits. 
One such job is waiting for 32 nodes; 1 node is in error and 319 are idle. So I 
don't believe it is a case of not being able to find a 32-node block in the 
correct configuration. I suspect backfill is not working properly, but it is 
hard to guess what algorithm it uses to pack jobs into 5D toruses and whether 
space is really available. 
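
To make my reasoning concrete, here is a minimal sketch of the check I would 
naively expect backfill to make. It is not Slurm's actual algorithm: the 
function name and parameters are hypothetical, and it deliberately ignores the 
5D torus geometry, which may well be the real constraint.

```python
def fits_before_drain(job_nodes, job_limit_hours, idle_nodes, hours_until_drained):
    """Naive backfill check (hypothetical; ignores torus/block geometry).

    A waiting job is a candidate if enough nodes are idle and its time
    limit expires before the draining midplane is ready to be rebooted.
    """
    enough_nodes = job_nodes <= idle_nodes
    finishes_in_time = job_limit_hours <= hours_until_drained
    return enough_nodes and finishes_in_time


# Numbers from the situation described above: a 32-node job with a
# 6-hour limit, 319 idle nodes, and about 9 hours until the drain completes.
print(fits_before_drain(32, 6, 319, 9))
```

By this simple count the waiting job qualifies, which is why I suspect either 
the block geometry or the drain state itself is what keeps it from starting.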

So does anyone know if backfill will even try to schedule jobs in a situation 
like this? If it should work but doesn't, I will dig into the code to see if I 
can find out why. If it doesn't even try, then I will leave it.

On a somewhat related note, is it possible to simulate the SoftwareError state 
in the Slurm BGQ simulation mode?

Thanks,
Carl

-- 
Carl Schmidtmann 
Center for Integrated Research Computing 
University of Rochester 
