Greetings.

I have been using munge in a SLURM environment on my MPI development  
cluster at Cisco for quite a while.  I have noticed over the past year  
or so that a munge 0.5.8 daemon on a compute node sometimes just  
randomly dies, leaving the slurmd on that node unable to communicate  
with the slurmctld on the cluster head node.  SLURM therefore thinks  
that the node is down.

Over the past month, this has been happening on about a dozen nodes a
week.  So far, I have not investigated any further than noticing the
failures.

Is this a known issue?  I can send more details if it is not.

FWIW, here's a summary of my setup:

- RHEL4U4 on all machines
- SLURM 1.3.1 (SLURM has been steadily upgraded over time, staying  
more-or-less current)
- Using Perceus to image/provision the back-end nodes
- Dell 1950 Intel Xeon servers (a few different specific flavors)

-- 
Jeff Squyres
Cisco Systems


