Hi everyone, 
Lately I have been having a problem with our Lustre 1.8 deployment. It crashes
periodically in such a way that the client nodes can't mount the storage and I
can't access the Lustre server machine either. So I have to restart the machine
manually every time to get everything back to normal. I checked the logs, memory
usage and lock counts to see whether any of them might be the cause of the
problem, but I don't think they account for this issue.
An interesting symptom I see every time this problem happens is that the
activity lights on the InfiniBand switch blink very fast. I suspect that heavy
traffic over the InfiniBand network to the Lustre server may be causing the
server to crash. Does this connection seem plausible?

Anyway, I hope some of you have experienced this problem before and can help
me understand what is happening and how to keep the server from crashing again!

Thanks,
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss