Not sure if anyone will know the answer to this, but I thought I would ask: is backfill supposed to work on BGQ when a midplane is being drained to recover from nodes being in a SoftwareError state?
Right now our BGQ has a node in the SoftwareError state, so it is draining all jobs on that midplane so that it can reboot the block. The last job will finish in about 9 hours, but there are jobs waiting with 6-hour time limits. The job is waiting for 32 nodes; 1 node is in error and 319 are idle. So I don't believe it is a case of not being able to find a 32-node block of nodes in the correct configuration.

I have been suspecting backfill of not working properly, but it is hard to guess what algorithm it uses to pack jobs into 5D toruses and whether space is really available. So does anyone know if backfill will even try to schedule things in different situations? If it should work but doesn't, I will dig into the code to see if I can find out why. If it doesn't even try to work, then I will leave it.

On a somewhat related issue, is it possible to simulate the SoftwareError state in the Slurm BGQ simulation mode?

Thanks,
Carl

--
Carl Schmidtmann
Center for Integrated Research Computing
University of Rochester