On 2016-06-16 22:05, James Board wrote:
> 
> I'm not sure if my messages to this list are propagated.  It might be a Yahoo 
> mail thing.
> 
> I have a small cluster with 10 nodes running CentOS 7.0 and Slurm 15.08.  I 
> start slurmd on all the nodes and so far so good.  But when I try to run a 
> job on all 10 nodes, two of the nodes (node09 and node10) have problems.  
> This is the error message reported by slurmd (in verbose mode):
> 
>   slurmd: error: _step_connect: connect() failed dir /tmp/slurmd node node09 job 123 step 0 No such file or directory
> 
> /tmp/slurmd is what I set for the SlurmdSpoolDir in slurm.conf.  That 
> directory does in fact exist on all nodes.
> 
> Any help would be appreciated.
> 

Perhaps not exactly related to this, but on an older cluster we
"inherited" the original slurm.conf from our "upstream", and it also had
SlurmdSpoolDir set to /tmp/slurmd.  It turns out that slurmd becomes
really unhappy when tmpwatch deletes files under /tmp/slurmd: slurmd
keeps job state and the per-step communication sockets there (which is
exactly what the _step_connect error above fails to find), so any /tmp
cleaner will eventually remove them out from under running jobs.
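
If you really must keep the spool under /tmp, at least exclude it from
the cleanup.  A rough sketch, assuming the stock RHEL/CentOS setups
(check which cleaner your nodes actually run):

  # tmpwatch (older RHEL/CentOS, typically /etc/cron.daily/tmpwatch):
  /usr/sbin/tmpwatch --exclude=/tmp/slurmd 10d /tmp

  # systemd-tmpfiles (CentOS 7), e.g. in /etc/tmpfiles.d/slurmd.conf;
  # 'x' excludes the path and its contents from age-based cleanup:
  x /tmp/slurmd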

Moral of the story: unless you really know what you're doing and have a
good reason, just leave SlurmdSpoolDir at its default value.
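
For reference, the compiled-in default is normally /var/spool/slurmd
(it can vary with how Slurm was configured at build time), so the fix
is usually just:

  # slurm.conf: keep the slurmd spool somewhere persistent, not in /tmp
  SlurmdSpoolDir=/var/spool/slurmd

The directory must exist on every node and be writable by the user
slurmd runs as (normally root), and slurmd needs a restart after the
change.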

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || [email protected]
