Hi Noam,

If you use the RealMemory parameter for the hosts, Slurm will drain (close) a host that reports less than the configured memory (see the slurm.conf sketch below). Thus
1. you would have seen much earlier that something was wrong with the node, and
2. no job would have been submitted to that node, since it would have been drained.
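
For example, a minimal slurm.conf sketch (node names, core count and memory figure are placeholders, not your actual values):

    # slurm.conf: declare the memory each host must report at registration
    NodeName=n[001-016] CPUs=32 RealMemory=191000 State=UNKNOWN

    # Running "slurmd -C" on a host prints the values slurmd actually
    # detects, which helps in picking a safe RealMemory figure.

If a host then registers with less memory than its RealMemory, slurmctld drains it (reason "Low RealMemory") and no new jobs land on it until it is resumed with scontrol.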


best
Marcus

On 3/28/19 2:32 PM, Noam Bernstein wrote:
On Mar 27, 2019, at 9:32 PM, Chris Samuel <ch...@csamuel.org> wrote:

On 27/3/19 2:43 pm, Noam Bernstein wrote:

Hi fellow slurm users - I’ve been using slurm happily for a few months, but now 
I feel like it’s gone crazy, and I’m wondering if anyone can explain what’s 
going on.  I have a trivial batch script which I submit multiple times, and it 
ends up with different numbers of nodes allocated. Does anyone have any idea 
why?
You would need to share the output of "scontrol show nodes" to get an idea of 
what resources Slurm thinks each node has.
Thanks for the pointer.  I believe this revealed the problem.  Systematically going 
over the "scontrol show nodes" output showed that while the number of cores was 
the same on each node, the memory was not, because one node had a badly socketed 
DIMM.  Even though I wasn’t explicitly requesting memory, the partition defaults to 
x * total_mem/n_cores per task, where x =~ 0.9, so it must have realized that the 
node was short of memory.  I fixed the underlying memory issue, and now I can no 
longer reproduce the weird behavior - now it always gets 2 nodes.
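
That per-task default usually comes from DefMemPerCPU on the partition; a sketch of how such a default might look in slurm.conf (the numbers and the 0.9 factor are illustrative, not the actual settings):

    # slurm.conf: default memory per CPU for jobs that do not request
    # memory explicitly, e.g. 192000 MB / 32 cores * 0.9 ~= 5400 MB
    PartitionName=batch Nodes=n[001-016] DefMemPerCPU=5400 State=UP

With such a default, a job's implicit memory request is DefMemPerCPU times the CPUs it is allocated, so fewer tasks fit on a node that has less memory than its peers, which is why the node count varied.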

                                                                        Noam



--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

