There's currently no way to change the log's verbosity at runtime,
and it's probably not worth recompiling with a different log priority
since very little in the source is logged at LOG_DEBUG.
If nothing relevant shows up in the logs, you could try running with
"--num-threads=1" to see if this might be thread-related. You could
also try disabling the group-update-timer ("--group-update-time=0").
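For reference, a rough sketch of restarting the daemon with both of
those suggestions applied (the init-script path is an assumption and
will vary by distro; the flags are the ones named above):

```shell
# Stop the currently running daemon first (path is distro-dependent).
/etc/init.d/munge stop

# Restart munged single-threaded with the group-update timer disabled,
# per the suggestions above.
munged --num-threads=1 --group-update-time=0
```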
A backtrace would be nice, but I disable core file creation via
setrlimit() in munged.c when the daemon is backgrounded -- in
hindsight, it would be good to make that configurable.
I suppose you could run munged in the foreground ("--foreground")
and capture its stdout/stderr to file, stowing the core and restarting
the daemon upon crash. Bernstein's daemontools might be useful here.
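A minimal sketch of the foreground approach, assuming munged is in
$PATH; the scratch directory and log filename are arbitrary choices,
not anything munged requires:

```shell
# Allow core files in this shell; the setrlimit() call in munged.c
# only disables them when the daemon backgrounds itself.
ulimit -c unlimited

# Hypothetical scratch dir so any core file lands somewhere known.
cd /var/tmp/munged-debug

# Run in the foreground and append stdout/stderr to a file.
munged --foreground >> munged.out 2>&1
```

On a crash, the core (if any) would be left in the scratch dir and the
tail of munged.out should show the last messages before death.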
I'll try to look at these later today and see what's involved in
setting that up.
http://cr.yp.to/daemontools.html
http://cr.yp.to/daemontools/supervise.html
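Under daemontools, the restart-on-crash part could look something like
the run script below (directory layout per the daemontools docs; the
/service/munged path is an assumption):

```shell
#!/bin/sh
# /service/munged/run -- executed by supervise, which re-runs it
# whenever munged exits, giving automatic restart after a crash.
ulimit -c unlimited
exec munged --foreground 2>&1
```

Make the script executable and symlink its directory into /service so
svscan picks it up; output can then be collected with a log/run script
using multilog if desired.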
Was 0.5.8 the first release you saw this problem with? With what sort
of frequency does this occur (you mention ~12 nodes/wk, but out of how
many total nodes running munged)?
-Chris
On Mon, 2008-05-19 at 07:05am EDT, Jeff Squyres wrote:
>
> Ok. It hasn't happened since my post yesterday (I usually check my
> nodes first thing in the morning); I'll post again when it does.
>
> Is there anything I should do in preparation to get more data? E.g.,
> is there a way to crank up the verbosity of munge's logs?
>
>
> On May 18, 2008, at 2:00 PM, Chris Dunlap wrote:
>
> > No, this is not a known issue. Please send more details. Is there
> > anything relevant in the munge log files after a crash?
> >
> > -Chris
> >
> >
> > On Sun, 2008-05-18 at 08:10am EDT, Jeff Squyres wrote:
> >>
> >> Greetings.
> >>
> >> I have been using munge in a SLURM environment on my MPI
> >> development cluster at Cisco for quite a while. I have noticed
> >> over the past year or so that a munge 0.5.8 daemon on a compute
> >> node sometimes just randomly dies, leaving the slurmd on that node
> >> unable to communicate with the slurmctld on the cluster head node.
> >> SLURM therefore thinks that the node is down.
> >>
> >> Over the past month, this has been happening to about a dozen nodes
> >> a week. To this point, I haven't been paying closer attention than
> >> that.
> >>
> >> Is this a known issue? I can send more details if it is not.
> >>
> >> FWIW, here's a summary of my setup:
> >>
> >> - RHEL4U4 on all machines
> >> - SLURM 1.3.1 (SLURM has been steadily upgraded over time, staying
> >> more-or-less current)
> >> - Using Perceus to image/provision the back-end nodes
> >> - Dell 1950 Intel Xeon servers (a few different specific flavors)
> >>
> >> --
> >> Jeff Squyres
> >> Cisco Systems
_______________________________________________
munge-users mailing list
[email protected]
https://mail.gna.org/listinfo/munge-users