Chris pinged me off-list with some additional instructions on the  
daemontools.
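
For anyone following along, a daemontools "run" script for supervising munged would look roughly like this (a sketch only -- the munged path and options here are illustrative, not lifted from Chris's instructions):

```shell
#!/bin/sh
# Hypothetical daemontools run script, e.g. /service/munged/run.
# supervise restarts the daemon whenever it exits (e.g., after a crash).
exec 2>&1                    # merge stderr into stdout for logging
ulimit -c unlimited          # allow the daemon to leave core files
exec /usr/sbin/munged --foreground
```

Running munged with --foreground keeps it attached to supervise, which is what lets supervise notice the crash and restart it.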

I had a munged failure earlier today with no extra output in the  
syslog.  So I installed the daemontools and hacked the munge source to  
allow leaving coredumps.  Once my cluster settles down, I'll reboot my  
nodes to pick up the new image.  I might be able to do this tonight,  
before the evening's regression runs start up.


On May 20, 2008, at 5:07 PM, Jeff Squyres wrote:

> On May 19, 2008, at 3:32 PM, Chris Dunlap wrote:
>
>> There's currently no means to change the log's verbosity at runtime.
>> And it's probably not worth recompiling with a different log priority
>> since I don't see much in the src logged at LOG_DEBUG.
>
> Ok.
>
>> If nothing relevant shows up in the logs, you could try running with
>> "--num-threads=1" to see if this might be thread-related.  You could
>> also try disabling the group-update-timer ("--group-update-time=0").
>
> I'll run without these while waiting for my next failure (hasn't
> happened yet since I reported to the list -- I check about once a  
> day).
>
>> A backtrace would be nice, but I disable core file creation via
>> setrlimit() in munged.c when the daemon is backgrounded -- in
>> hindsight, it would be good to make that configurable.
>
> I can hack the source for this; if it's something that generates a
> corefile, I'll send a backtrace.
>
>> I suppose you could run munged in the foreground ("--foreground")
>> and capture its stdout/stderr to file, stowing the core and  
>> restarting
>> the daemon upon crash.  Bernstein's daemontools might be useful here.
>> I'll try to look at these later today and see what's involved in
>> setting it up.
>>
>> http://cr.yp.to/daemontools.html
>> http://cr.yp.to/daemontools/supervise.html
>
> Let's do that as a second step.
>
>> Was 0.5.8 the first release you saw this problem with?  With what  
>> sort
>> of frequency does this occur (you mention ~12 nodes/wk, but out of  
>> how
>> many total nodes running munged)?
>
> I honestly don't know if 0.5.8 was the first I saw this with -- I've
> been running 0.5.8 for so long (well over a year?) and have paid so
> little attention to it that I don't have proper details before  
> now.  :-\
>
> My cluster is currently 50 nodes (about to go up to 54) running
> SLURM.  I use my cluster for Open MPI regression testing, so it's a
> continual barrage of relatively short srun's across 2, 4, and 8 node
> SLURM allocations (i.e., OMPI's "mpirun" invokes "srun" under the
> covers to launch its individual processes).
>
> The "dozen" number I cited is a swag -- I have not been keeping track
> of this scientifically.
>
> -- 
> Jeff Squyres
> Cisco Systems
>
>
> _______________________________________________
> munge-users mailing list
> [email protected]
> https://mail.gna.org/listinfo/munge-users


-- 
Jeff Squyres
Cisco Systems


_______________________________________________
munge-users mailing list
[email protected]
https://mail.gna.org/listinfo/munge-users