On May 19, 2008, at 3:32 PM, Chris Dunlap wrote:
> There's currently no means to change the log's verbosity at runtime.
> And it's probably not worth recompiling with a different log priority
> since I don't see much in the src logged at LOG_DEBUG.
Ok.
> If nothing relevant shows up in the logs, you could try running with
> "--num-threads=1" to see if this might be thread-related. You could
> also try disabling the group-update-timer ("--group-update-time=0").
I'll run without these while waiting for my next failure (hasn't
happened yet since I reported to the list -- I check about once a day).
> A backtrace would be nice, but I disable core file creation via
> setrlimit() in munged.c when the daemon is backgrounded -- in
> hindsight, it would be good to make that configurable.
I can hack the source for this; if the failure turns out to generate a
core file, I'll send a backtrace.
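For reference, here is a minimal sketch of what a configurable core-file limit could look like. It is not the code in munged.c; the helper name set_core_limit() and its allow_core flag are hypothetical, and only the getrlimit()/setrlimit() calls with RLIMIT_CORE are standard.

```c
#include <sys/resource.h>

/* Hypothetical helper, not taken from munged.c.
 * allow_core != 0: raise the soft core-file limit to the hard limit.
 * allow_core == 0: set the soft limit to 0, which disables core files.
 * Returns 0 on success, -1 on error (errno set by getrlimit/setrlimit). */
int set_core_limit(int allow_core)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) < 0)
        return -1;

    /* Only the soft limit changes, so no extra privilege is needed. */
    lim.rlim_cur = allow_core ? lim.rlim_max : 0;

    return setrlimit(RLIMIT_CORE, &lim);
}
```

If the hard limit is already 0 (for example, because the shell that started the daemon ran "ulimit -c 0" as a hard limit), raising the soft limit has no effect and no core file will be written.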
> I suppose you could run munged in the foreground ("--foreground")
> and capture its stdout/stderr to file, stowing the core and restarting
> the daemon upon crash. Bernstein's daemontools might be useful here.
> I'll try to look at these later today and see what's involved in
> setting it up.
>
> http://cr.yp.to/daemontools.html
> http://cr.yp.to/daemontools/supervise.html
Let's do that as a second step.
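As a rough illustration of the foreground-plus-restart idea, the sketch below runs a command as a child, appends its stdout/stderr to a log file, and restarts it if it is killed by a signal. This is not daemontools and not part of munge; run_logged() and supervise() are hypothetical names, and stowing the core file is left as a comment because its name and location are system-dependent.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] once with stdout/stderr appended to log_path.
 * Stores the waitpid() status in *status. Returns 0, or -1 on error. */
int run_logged(char *const argv[], const char *log_path, int *status)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: redirect output to the log, then exec the command. */
        int fd = open(log_path, O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (fd < 0 || dup2(fd, STDOUT_FILENO) < 0
                   || dup2(fd, STDERR_FILENO) < 0)
            _exit(127);
        if (fd > STDERR_FILENO)
            close(fd);
        execv(argv[0], argv);
        _exit(127);             /* exec failed */
    }
    /* Parent: wait for the child, retrying if interrupted by a signal. */
    while (waitpid(pid, status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}

/* Rerun the command while it keeps dying from a signal, up to
 * max_restarts restarts. Returns the exit code of a normal exit,
 * or -1 on error or once the restart budget is used up. */
int supervise(char *const argv[], const char *log_path, int max_restarts)
{
    for (int i = 0; i <= max_restarts; i++) {
        int status;

        if (run_logged(argv, log_path, &status) < 0)
            return -1;
        if (WIFEXITED(status))
            return WEXITSTATUS(status);
        if (WIFSIGNALED(status)) {
            fprintf(stderr, "child killed by signal %d\n",
                    WTERMSIG(status));
            /* A core file, if one was written, would be moved aside
             * here before the next restart. */
        }
    }
    return -1;
}
```

A call might look like supervise((char *[]){"/usr/sbin/munged", "--foreground", NULL}, "munged.log", 10), where the install path, log name, and restart count are assumptions for illustration.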
> Was 0.5.8 the first release you saw this problem with? With what sort
> of frequency does this occur (you mention ~12 nodes/wk, but out of how
> many total nodes running munged)?
I honestly don't know if 0.5.8 was the first I saw this with -- I've
been running 0.5.8 for so long (well over a year?) and have paid so
little attention to it that I don't have proper details before now. :-\
My cluster is currently 50 nodes (about to go up to 54) running
SLURM. I use it for Open MPI regression testing, so it sees a
continual barrage of relatively short sruns across 2-, 4-, and 8-node
SLURM allocations (i.e., OMPI's "mpirun" invokes "srun" under the
covers to launch its individual processes).
The "dozen" number I cited is a rough guess -- I have not been keeping
track of this scientifically.
--
Jeff Squyres
Cisco Systems
_______________________________________________
munge-users mailing list
https://mail.gna.org/listinfo/munge-users