On 2016-06-16 22:05, James Board wrote:
> I'm not sure if my messages to this list are propagated. It might be a
> Yahoo mail thing.
>
> I have a small cluster with 10 nodes running CentOS 7.0 and slurm 15.0.8.
> I start slurmd on all the nodes and so far so good. But, when I try to
> run a job on all 10 nodes, two of the nodes (node09 and node10) have
> problems. This is the error message reported by slurmd (in verbose mode):
>
> slurmd: error: _step_connect: connect() failed dir /tmp/slurmd node node09
> job 123 step 0 No such file or directory
>
> /tmp/slurmd is what I set for the SlurmdSpoolDir in slurm.conf. That
> directory does in fact exist on all nodes.
>
> Any help would be appreciated.
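One way to narrow this down is to confirm, on each node, that the spool directory actually exists and is writable at the time the job runs (tmpwatch may delete it between checks). A minimal sketch — the `check_spooldir` helper and the node names below are illustrative, not from the original thread:

```shell
#!/bin/sh
# Sketch: verify a slurmd spool directory exists and is writable.
# Run it on each node with the SlurmdSpoolDir value from slurm.conf.
check_spooldir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "MISSING: $dir"
        return 1
    elif [ ! -w "$dir" ]; then
        echo "NOT WRITABLE: $dir"
        return 1
    fi
    echo "OK: $dir"
}

# Example: check the whole cluster over ssh (node names illustrative):
# for n in node01 node02 node09 node10; do
#     ssh "$n" 'test -d /tmp/slurmd && test -w /tmp/slurmd \
#         && echo "OK: $n" || echo "BAD: $n"'
# done
```

If the directory reported as "OK" right after boot but the job still fails later, that points at something removing it underneath slurmd, which matches the tmpwatch issue described below.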
Perhaps not exactly related to this, but on an older cluster we "inherited" the original slurm.conf from our "upstream", and it also had SlurmdSpoolDir set to /tmp/slurmd. It turns out that slurmd becomes really unhappy when tmpwatch goes and deletes stuff in /tmp/slurmd.

Moral of the story: unless you really, really know what you're doing and have good reasons for doing so, just leave SlurmdSpoolDir at its default value.

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || [email protected]
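For reference, the compiled-in default for SlurmdSpoolDir is /var/spool/slurmd, which sits outside the directories tmpwatch normally prunes. A minimal slurm.conf fragment (the path shown is the stock default, not taken from the original poster's config):

```conf
# Keep slurmd's state directory out of /tmp so tmpwatch (or
# systemd-tmpfiles on newer distributions) never deletes live job state.
# /var/spool/slurmd is the compiled-in default.
SlurmdSpoolDir=/var/spool/slurmd
```

The directory must exist on every compute node, be writable by the user slurmd runs as (typically root), and slurmd must be restarted after the change.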