Re: [Beowulf] Supercomputers face growing resilience problems

Hearns, John Thu, 22 Nov 2012 02:39:50 -0800

"Other researchers have shown that multiple failures are sometimes correlated 
with each other, because a failure with one technology may affect performance 
in others, according to Gainaru."


Interesting idea - and of course cluster admins have been doing this all along, 
only manually.
How often have you run a parallel shell to grep through logs on a bunch of nodes 
to look for a certain string or a certain event?
Automating this is a good concept.
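
For what it's worth, a minimal sketch of the manual version in Python, assuming 
passwordless ssh to each node and a syslog at /var/log/messages. The node 
names, the log path and the function names (grep_node, parallel_grep) are my 
own placeholders, not anything from the article. It only collects matching 
lines per node; correlating the events is left out.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def grep_node(node, pattern, logfile="/var/log/messages"):
    """Grep one node's log over ssh; return (node, list of matching lines).

    Assumes key-based ssh access; the log path is a placeholder.
    """
    # Quote the remote command so patterns with spaces survive the remote shell.
    remote_cmd = shlex.join(["grep", "-e", pattern, logfile])
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", node, remote_cmd],
            capture_output=True, text=True, timeout=30,
        )
    except subprocess.TimeoutExpired:
        # An unresponsive node is reported as having no matches in this sketch.
        return node, []
    return node, result.stdout.splitlines()


def parallel_grep(nodes, pattern, runner=grep_node, workers=16):
    """Run `runner` on every node concurrently; keep only nodes with matches."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda n: runner(n, pattern), nodes)
        return {node: lines for node, lines in results if lines}
```

Something like parallel_grep(["node01", "node02"], "link down") would return a 
dict of node name to matching lines. The runner argument is there so the ssh 
call can be swapped out, e.g. for reading logs already collected centrally.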

But in the department of No Shit Sherlock!:

 "For instance, when a network card fails, it will soon hobble other system 
processes that rely on network communication."
Well I never...


_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
