Ok. It hasn't happened since my post yesterday (I usually check my nodes first thing in the morning); I'll post again when it does.
Is there anything I should do in preparation to get more data? E.g., is there a way to crank up the verbosity of munge's logs?

On May 18, 2008, at 2:00 PM, Chris Dunlap wrote:

> No, this is not a known issue. Please send more details. Is there
> anything relevant in the munge log files after a crash?
>
> -Chris
>
>
> On Sun, 2008-05-18 at 08:10am EDT, Jeff Squyres wrote:
>>
>> Greetings.
>>
>> I have been using munge in a SLURM environment on my MPI development
>> cluster at Cisco for quite a while. I have noticed over the past year
>> or so that a munge 0.5.8 daemon on a compute node sometimes just
>> randomly dies, leaving the slurmd on that node unable to communicate
>> with the slurmctld on the cluster head node. SLURM therefore thinks
>> that the node is down.
>>
>> Over the past month, this has been happening to about a dozen nodes a
>> week. To this point, I haven't been paying closer attention than that.
>>
>> Is this a known issue? I can send more details if it is not.
>>
>> FWIW, here's a summary of my setup:
>>
>> - RHEL4U4 on all machines
>> - SLURM 1.3.1 (SLURM has been steadily upgraded over time, staying
>>   more-or-less current)
>> - Using Perceus to image/provision the back-end nodes
>> - Dell 1950 Intel Xeon servers (a few different specific flavors)
>>
>> --
>> Jeff Squyres
>> Cisco Systems

--
Jeff Squyres
Cisco Systems

_______________________________________________
munge-users mailing list
https://mail.gna.org/listinfo/munge-users
