Greetings. I have been using munge in a SLURM environment on my MPI development cluster at Cisco for quite a while. Over the past year or so I have noticed that the munge 0.5.8 daemon (munged) on a compute node occasionally dies for no apparent reason, leaving the slurmd on that node unable to communicate with the slurmctld on the cluster head node. SLURM therefore marks the node as down.
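For anyone hitting the same symptom, here is a rough sketch of the kind of per-node check one could run to spot a dead munged (not something from my actual setup; the process name check via pgrep and the RHEL-style init-script path are assumptions):

```shell
#!/bin/sh
# Sketch of a munged liveness check (assumptions: pgrep is available,
# and munge's init script lives at /etc/init.d/munge as on RHEL4).

check_daemon() {
    # Succeeds (exit 0) iff a process with exactly this name is running.
    pgrep -x "$1" >/dev/null 2>&1
}

if ! check_daemon munged; then
    echo "munged appears to be dead on $(hostname)"
    # /etc/init.d/munge start    # uncomment to auto-restart (assumed path)
fi

# A stronger end-to-end test is to round-trip a credential:
#   munge -n | unmunge
# which fails if munged is down on this node.
```

Run from cron, something like this would at least flag (or restart) dead daemons until the root cause is understood.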
Over the past month, this has been happening to roughly a dozen nodes a week. Until now, I haven't been tracking it any more closely than that. Is this a known issue? If not, I can send more details.

FWIW, here's a summary of my setup:
- RHEL4U4 on all machines
- SLURM 1.3.1 (SLURM has been steadily upgraded over time, staying more-or-less current)
- Using Perceus to image/provision the back-end nodes
- Dell 1950 Intel Xeon servers (a few different specific flavors)

-- 
Jeff Squyres
Cisco Systems

_______________________________________________
munge-users mailing list
[email protected]
https://mail.gna.org/listinfo/munge-users
