[Lustre-discuss] Lustre crashes periodically

2013-10-09 Thread Arya Mazaheri
Hi everyone,
I have lately had a problem with our Lustre 1.8 deployment. It crashes
periodically in a way that the nodes can mount the storage and I can't access
the Lustre server machine either. So I have to manually restart the machine
every time to make everything normal again. I checked the logs, memory usage
and lock count to see whether any of these might be the cause of the problem,
but I don't think they account for this issue.
An interesting symptom I see every time this problem happens is that the
InfiniBand switch's network activity lights blink very fast. I think heavy
traffic on the InfiniBand network to the Lustre server may be causing the
server to crash. Does this connection seem plausible?

Anyway, I hope some of you have experienced this problem before and could
help me understand what is happening and how to avoid crashing the server again!

Thanks,
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre crashes periodically

2013-10-09 Thread Arya Mazaheri
Sorry, I have to correct this: the nodes CANNOT mount the storage and I can't
access the Lustre server machine either.




Re: [Lustre-discuss] Lustre crashes periodically

2013-10-09 Thread Abraham.Alawi
Did you run lfsck against it?
Are there any kernel crash dumps?

Maybe it's not a Lustre-related problem? If you have no active/passive MDS
setup, the Lustre file system will be unusable if the MDS server crashes for
whatever reason.

Abraham Alawi
Linux/UNIX Systems and Storage Specialist | STACC Project | Information
Management & Technology (IMT) | CSIRO
