No, this is not a known issue. Please send more details. Is there anything relevant in the munge log files after a crash?
-Chris

On Sun, 2008-05-18 at 08:10am EDT, Jeff Squyres wrote:
>
> Greetings.
>
> I have been using munge in a SLURM environment on my MPI development
> cluster at Cisco for quite a while. I have noticed over the past year
> or so that a munge 0.5.8 daemon on a compute node sometimes just
> randomly dies, leaving the slurmd on that node unable to communicate
> with the slurmctld on the cluster head node. SLURM therefore thinks
> that the node is down.
>
> Over the past month, this has been happening to about a dozen nodes a
> week. To this point, I haven't been paying closer attention than that.
>
> Is this a known issue? I can send more details if it is not.
>
> FWIW, here's a summary of my setup:
>
> - RHEL4U4 on all machines
> - SLURM 1.3.1 (SLURM has been steadily upgraded over time, staying
>   more-or-less current)
> - Using Perceus to image/provision the back-end nodes
> - Dell 1950 Intel Xeon servers (a few different specific flavors)
>
> --
> Jeff Squyres
> Cisco Systems

_______________________________________________
munge-users mailing list
[email protected]
https://mail.gna.org/listinfo/munge-users
