> On Mar 27, 2019, at 9:32 PM, Chris Samuel <ch...@csamuel.org> wrote:
> 
> On 27/3/19 2:43 pm, Noam Bernstein wrote:
> 
>> Hi fellow slurm users - I’ve been using slurm happily for a few months, but 
>> now I feel like it’s gone crazy, and I’m wondering if anyone can explain 
>> what’s going on.  I have a trivial batch script which I submit multiple 
>> times, and ends up with different numbers of nodes allocated. Does anyone 
>> have any idea why?
> 
> You would need to share the output of "scontrol show nodes" to get an idea of 
> what resources Slurm thinks each node has.

Thanks for the pointer.  I believe this revealed the problem.  Going 
systematically through the "scontrol show nodes" output showed that while the 
number of cores was the same on each node, the memory was not, because one 
node had a badly seated DIMM.  Even though I wasn’t explicitly requesting 
memory, the partition defaults to x * total_mem/n_cores per task, where x is 
about 0.9, so Slurm must have seen that this node was short of memory and 
presumably could not place as many tasks on it as on the others.  I fixed the 
underlying memory issue, and I can no longer reproduce the weird behavior: the 
job now always gets 2 nodes.
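
To make the arithmetic concrete, here is a rough Python sketch of how a 
per-task memory default can change the node count.  All of the numbers (16 
cores, 64000 MB per node, 16000 MB missing) and the function names are 
hypothetical, and the greedy fill is a simplification for illustration, not 
Slurm's actual scheduling algorithm.

```python
# Hypothetical figures; none of them come from the cluster in this thread.
CORES_PER_NODE = 16
HEALTHY_MEM_MB = 64000
X_PERCENT = 90  # x =~ 0.9, kept as an integer percentage to avoid float rounding

# Partition default: x * total_mem / n_cores per task
MEM_PER_TASK_MB = X_PERCENT * HEALTHY_MEM_MB // (100 * CORES_PER_NODE)


def tasks_that_fit(cores, real_mem_mb, mem_per_task_mb):
    """Tasks a node can host when each task needs one core plus its memory share."""
    return min(cores, real_mem_mb // mem_per_task_mb)


def nodes_needed(n_tasks, nodes, mem_per_task_mb):
    """Count nodes used when filling them greedily in the order given."""
    remaining = n_tasks
    used = 0
    for cores, real_mem_mb in nodes:
        if remaining <= 0:
            break
        fit = tasks_that_fit(cores, real_mem_mb, mem_per_task_mb)
        if fit > 0:
            remaining -= fit
            used += 1
    if remaining > 0:
        raise ValueError("not enough resources for the requested tasks")
    return used


healthy = (CORES_PER_NODE, HEALTHY_MEM_MB)
short = (CORES_PER_NODE, HEALTHY_MEM_MB - 16000)  # node with a DIMM not detected

# 32 tasks on healthy nodes only, then with the short node in the mix
print(nodes_needed(32, [healthy, healthy, healthy], MEM_PER_TASK_MB))
print(nodes_needed(32, [healthy, short, healthy], MEM_PER_TASK_MB))
```

Under these assumptions the short node holds only 13 tasks instead of 16, so 
a 32-task job needs a third node whenever that node is part of the 
allocation, and 2 nodes otherwise.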

                                                                        Noam

