Chris pinged me off-list with some additional instructions on the
daemontools.
I had a munged failure earlier today with no extra output in the
syslog, so I installed daemontools and hacked the munge source to
allow it to leave core dumps. Once my cluster settles down, I'll reboot my
nodes to get the new image. I might be able to do this before
tonight's regression runs start up.
On May 20, 2008, at 5:07 PM, Jeff Squyres wrote:
> On May 19, 2008, at 3:32 PM, Chris Dunlap wrote:
>
>> There's currently no means to change the log's verbosity at runtime.
>> And it's probably not worth recompiling with a different log priority
>> since I don't see much in the src logged at LOG_DEBUG.
>
> Ok.
>
>> If nothing relevant shows up in the logs, you could try running with
>> "--num-threads=1" to see if this might be thread-related. You could
>> also try disabling the group-update-timer ("--group-update-time=0").
>
> I'll run without these while waiting for my next failure (hasn't
> happened yet since I reported to the list -- I check about once a
> day).
>
>> A backtrace would be nice, but I disable core file creation via
>> setrlimit() in munged.c when the daemon is backgrounded -- in
>> hindsight, it would be good to make that configurable.
>
> I can hack the source for this; if it's something that generates a
> corefile, I'll send a backtrace.
>
>> I suppose you could run munged in the foreground ("--foreground")
>> and capture its stdout/stderr to a file, stowing the core and
>> restarting the daemon upon crash. Bernstein's daemontools might be
>> useful here. I'll try to look at these later today and see what's
>> involved in setting it up.
>>
>> http://cr.yp.to/daemontools.html
>> http://cr.yp.to/daemontools/supervise.html
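For anyone following along, a minimal supervise "run" script for this
might look like the sketch below. I'm assuming munged is in $PATH and
that the service directory lives under /service/munged/; I haven't
verified this against munged 0.5.8:

```shell
#!/bin/sh
# /service/munged/run -- supervise restarts munged whenever it exits.
# Redirect stderr into stdout so a log/run multilog service (or a
# plain redirect) captures everything munged prints in the foreground.
exec 2>&1
exec munged --foreground
```

After chmod +x and linking the directory under /service, svscan should
pick it up automatically; any core file would land in the service's
working directory, provided the setrlimit() call in munged has been
disabled.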
>
> Let's do that as a second step.
>
>> Was 0.5.8 the first release you saw this problem with? With what
>> sort of frequency does this occur (you mention ~12 nodes/wk, but out
>> of how many total nodes running munged)?
>
> I honestly don't know if 0.5.8 was the first I saw this with -- I've
> been running 0.5.8 for so long (well over a year?) and have paid so
> little attention to it that I don't have proper details before
> now. :-\
>
> My cluster is currently 50 nodes (about to go up to 54) running
> SLURM. I use my cluster for Open MPI regression testing, so it's a
> continual barrage of relatively short sruns across 2, 4, and 8 node
> SLURM allocations (i.e., OMPI's "mpirun" invokes "srun" under the
> covers to launch its individual processes).
>
> The "dozen" number I cited is a swag -- I have not been keeping track
> of this scientifically.
>
--
Jeff Squyres
Cisco Systems
_______________________________________________
munge-users mailing list
[email protected]
https://mail.gna.org/listinfo/munge-users