Ok.  It hasn't happened since my post yesterday (I usually check my  
nodes first thing in the morning); I'll post again when it does.

Is there anything I should do in preparation to get more data?  E.g.,  
is there a way to crank up the verbosity of munge's logs?


On May 18, 2008, at 2:00 PM, Chris Dunlap wrote:

> No, this is not a known issue.  Please send more details.  Is there
> anything relevant in the munge log files after a crash?
>
> -Chris
>
>
> On Sun, 2008-05-18 at 08:10am EDT, Jeff Squyres wrote:
>>
>> Greetings.
>>
>> I have been using munge in a SLURM environment on my MPI development
>> cluster at Cisco for quite a while.  I have noticed over the past year
>> or so that a munge 0.5.8 daemon on a compute node sometimes just
>> randomly dies, leaving the slurmd on that node unable to communicate
>> with the slurmctld on the cluster head node.  SLURM therefore thinks
>> that the node is down.
>>
>> Over the past month, this has been happening to about a dozen nodes a
>> week.  To this point, I haven't been paying closer attention than that.
>>
>> Is this a known issue?  I can send more details if it is not.
>>
>> FWIW, here's a summary of my setup:
>>
>> - RHEL4U4 on all machines
>> - SLURM 1.3.1 (SLURM has been steadily upgraded over time, staying
>> more-or-less current)
>> - Using Perceus to image/provision the back-end nodes
>> - Dell 1950 Intel Xeon servers (a few different specific flavors)
>>
>> -- 
>> Jeff Squyres
>> Cisco Systems


-- 
Jeff Squyres
Cisco Systems


_______________________________________________
munge-users mailing list
[email protected]
https://mail.gna.org/listinfo/munge-users
