> On Mar 27, 2019, at 9:32 PM, Chris Samuel <ch...@csamuel.org> wrote:
>
> On 27/3/19 2:43 pm, Noam Bernstein wrote:
>
>> Hi fellow slurm users - I’ve been using slurm happily for a few months, but
>> now I feel like it’s gone crazy, and I’m wondering if anyone can explain
>> what’s going on. I have a trivial batch script which I submit multiple
>> times, and ends up with different numbers of nodes allocated. Does anyone
>> have any idea why?
>
> You would need to share the output of "scontrol show nodes" to get an idea of
> what resources Slurm thinks each node has.

Thanks for the pointer. I believe this revealed the problem.

Systematically going over the "scontrol show nodes" output showed that while the number of cores was the same on each node, the memory was not, because one node had a badly seated DIMM. Even though I wasn’t explicitly requesting memory, the partition defaults to x * total_mem/n_cores per task, where x ≈ 0.9, so Slurm must have realized that the node was short of memory for that default request.

I fixed the underlying memory issue, and now I can no longer reproduce the weird behavior: every submission now gets 2 nodes.

Noam
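
P.S. In case it helps anyone chasing something similar, here is a rough sketch of how the node-by-node comparison could be automated rather than done by eye. It assumes the one-line-per-node form of the output ("scontrol -o show nodes") and that RealMemory is the field worth comparing. The function names and the sample node names and values are made up for illustration, not taken from my cluster.

```python
import re
from collections import Counter


def parse_nodes(oneliner_output):
    """Parse `scontrol -o show nodes` text into {node_name: {field: value}}.

    Only simple KEY=VALUE tokens are picked up; values containing spaces
    (e.g. OS or Reason) are truncated at the first space, which is fine
    for numeric fields such as CPUTot and RealMemory.
    """
    nodes = {}
    for line in oneliner_output.splitlines():
        fields = dict(re.findall(r"(\w+)=(\S+)", line))
        if "NodeName" in fields:
            nodes[fields["NodeName"]] = fields
    return nodes


def odd_nodes(nodes, key="RealMemory"):
    """Return {node_name: value} for nodes whose `key` differs from the most common value."""
    counts = Counter(info.get(key) for info in nodes.values())
    if not counts:
        return {}
    typical, _ = counts.most_common(1)[0]
    return {
        name: info.get(key)
        for name, info in nodes.items()
        if info.get(key) != typical
    }


# Hypothetical sample in the one-line format; real output has many more fields.
sample = """\
NodeName=node01 CPUTot=16 RealMemory=64000 State=IDLE
NodeName=node02 CPUTot=16 RealMemory=64000 State=IDLE
NodeName=node03 CPUTot=16 RealMemory=48000 State=IDLE
"""

mismatched = odd_nodes(parse_nodes(sample))

# On a live system the text could instead come from something like:
#   subprocess.run(["scontrol", "-o", "show", "nodes"],
#                  capture_output=True, text=True).stdout
```

With the sample above, node03 is the only node flagged. Passing key="CPUTot" instead returns nothing, which matches what I saw: same core count everywhere, different memory on one node.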