On May 19, 2008, at 3:32 PM, Chris Dunlap wrote:

> There's currently no means to change the log's verbosity at runtime.
> And it's probably not worth recompiling with a different log priority
> since I don't see much in the src logged at LOG_DEBUG.

Ok.

> If nothing relevant shows up in the logs, you could try running with
> "--num-threads=1" to see if this might be thread-related.  You could
> also try disabling the group-update-timer ("--group-update-time=0").

I'll run without these while waiting for my next failure (hasn't  
happened yet since I reported to the list -- I check about once a day).

> A backtrace would be nice, but I disable core file creation via
> setrlimit() in munged.c when the daemon is backgrounded -- in
> hindsight, it would be good to make that configurable.

I can hack the source for this; if it's something that generates a  
corefile, I'll send a backtrace.
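
For reference, a minimal sketch of what such a hack might look like. This is not the actual code from munged.c; the helper name set_core_limit() and its allow_core flag are hypothetical, and only the standard getrlimit()/setrlimit() calls with RLIMIT_CORE are assumed. It sets the soft core-file limit either to zero (no cores) or up to the existing hard limit.

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Hypothetical helper, not taken from munged.c.
 * allow_core == 0: set the soft core-file size limit to 0 (no core files).
 * allow_core != 0: raise the soft limit to the current hard limit.
 * Returns 0 on success, -1 on failure.
 */
int set_core_limit(int allow_core)
{
    struct rlimit rl;

    /* Read the current limits so the hard limit is left untouched. */
    if (getrlimit(RLIMIT_CORE, &rl) < 0) {
        perror("getrlimit");
        return -1;
    }
    rl.rlim_cur = allow_core ? rl.rlim_max : 0;

    if (setrlimit(RLIMIT_CORE, &rl) < 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

Calling set_core_limit(1) early in the daemon's startup would permit core files up to the hard limit. Whether a core actually gets written probably also depends on the hard limit inherited from the launching shell and on the daemon's working directory being writable.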

> I suppose you could run munged in the foreground ("--foreground")
> and capture its stdout/stderr to file, stowing the core and restarting
> the daemon upon crash.  Bernstein's daemontools might be useful here.
> I'll try to look at these later today and see what's involved in
> setting it up.
>
>  http://cr.yp.to/daemontools.html
>  http://cr.yp.to/daemontools/supervise.html

Let's do that as a second step.

> Was 0.5.8 the first release you saw this problem with?  With what sort
> of frequency does this occur (you mention ~12 nodes/wk, but out of how
> many total nodes running munged)?

I honestly don't know if 0.5.8 was the first I saw this with -- I've  
been running 0.5.8 for so long (well over a year?) and have paid so  
little attention to it that I don't have proper details before now.  :-\

My cluster is currently 50 nodes (about to go up to 54) running
SLURM.  I use my cluster for Open MPI regression testing, so it's a
continual barrage of relatively short sruns across 2-, 4-, and 8-node
SLURM allocations (i.e., OMPI's "mpirun" invokes "srun" under the
covers to launch its individual processes).

The "dozen" number I cited is a swag -- I have not been keeping track  
of this scientifically.

-- 
Jeff Squyres
Cisco Systems

