Hello, I’m new to Slurm and am trying to deploy it on a new cluster that we stood up. I’ve had some success so far, but not entirely with our Knights Landing (KNL) nodes. Has anyone had experience with these, and am I missing some obvious configuration? Here are the issues I’m currently seeing, in case anyone has some insight:
- Submitting a job to a KNL partition seemingly chooses nodes from that partition at random, then reboots them and changes their constraints, even when nodes with the proper constraints are idle and waiting. The documentation seems to indicate that choosing the already-matching nodes should work by default.

- I’ve set a default partition (one that does not include KNL nodes) and can submit jobs just fine whether or not I specify a partition. They generally seem to go to the right place. However, if I specify no partition in an sbatch submission and give constraints that are specific to KNL, the job lands on the default (wrong) partition and sits in a held state. I would expect Slurm to see the constraints and send the job to the right nodes in the right partition. It isn’t clear from the docs whether constraints will override a default partition. If they don’t, is there a way to make this work so that we can still enforce a non-KNL default?

Since Slurm seems to be mostly doing the right things, I’m not having any luck figuring out where to poke next for the above issues. The Slurm logs don’t give me much to go on either, even with a higher debug level set. I’m using a knl_generic.conf with the same options shown in the Slurm documentation. Lastly, I’m running Slurm 16.05.10-2.

Thanks!

--
John Roberts
[email protected]
