Re: [Lustre-discuss] MDS hangs with OFED

2011-03-17 Thread Cliff White
Unfortunately, we've had lots of reports of IB instability. It does appear
to happen quite a bit, and generally is not a Lustre problem at all.
- Check all mechanical connections, cables, etc. - replace if need be - many
issues have been cable-related.
- Check firmware versions of all IB cards, and find the best version for your cards.
- Make sure your IB cards are in the proper (best performing) slots in your
backplane.
- If you have an IB switch with monitoring/error reporting you may be able
to get more data.
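
The firmware and error-reporting checks above can be partly scripted on each node. The following is a minimal sketch, not a tested tool: it assumes the HCA driver exposes `fw_ver` and per-port `counters/` files under `/sys/class/infiniband`, and the counter names in `ERROR_COUNTERS` are an assumption to adjust for your hardware. `scan_ib` and `read_text` are hypothetical helper names.

```python
import os

# Counter file names as commonly exposed under the Linux InfiniBand sysfs
# tree; this list is an assumption, adjust it to what your HCA provides.
ERROR_COUNTERS = (
    "symbol_error",
    "link_error_recovery",
    "link_downed",
    "port_rcv_errors",
    "port_xmit_discards",
)


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def scan_ib(root="/sys/class/infiniband"):
    """Collect the firmware version and nonzero error counters per device."""
    report = {}
    if not os.path.isdir(root):
        return report
    for dev in sorted(os.listdir(root)):
        dev_dir = os.path.join(root, dev)
        entry = {"fw_ver": read_text(os.path.join(dev_dir, "fw_ver")), "errors": {}}
        ports_dir = os.path.join(dev_dir, "ports")
        ports = sorted(os.listdir(ports_dir)) if os.path.isdir(ports_dir) else []
        for port in ports:
            for name in ERROR_COUNTERS:
                raw = read_text(os.path.join(ports_dir, port, "counters", name))
                # Only report counters that parse as a positive integer.
                if raw is not None and raw.isdigit() and int(raw) > 0:
                    entry["errors"][(port, name)] = int(raw)
        report[dev] = entry
    return report


if __name__ == "__main__":
    for dev, info in scan_ib().items():
        print(dev, "firmware:", info["fw_ver"])
        for (port, name), value in sorted(info["errors"].items()):
            print("  port", port, name, "=", value)
```

Running this periodically and comparing results across nodes may show a card with mismatched firmware or a port whose error counters keep climbing, which would point at a cable or slot. It is meant to be run while the node is still healthy.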
cliffw


On Thu, Mar 17, 2011 at 10:54 AM, Kevin Hildebrand wrote:

>
> We've been seeing occasional hangs on our MDS and I'd like to see if
> anyone else is seeing this or can provide suggestions on where to look.
> This might not even be a Lustre problem at all.
>
> We're running Lustre 1.8.4 with OFED 1.5.2, and kernel version
> 2.6.18-194.3.1.el5_lustre.1.8.4.
>
> The problem is that at some point it appears that something in the IB
> stack is going out to lunch: pings to the IPoIB interface time out, and
> anything that touches IB (perfquery, etc) goes into a hard hang and cannot
> be killed.
>
> The only solution to the problem once it occurs is to power-cycle the
> machine, as shutdown/reboot hang as well.
>
> From what I can see, the first abnormal entries in the system logs on
> the MDS are messages showing that connections to the OSSes are timing out.
>
> Any insight would be appreciated.
>
> Thanks,
>
> Kevin
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>



-- 
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com


[Lustre-discuss] MDS hangs with OFED

2011-03-17 Thread Kevin Hildebrand

We've been seeing occasional hangs on our MDS and I'd like to see if 
anyone else is seeing this or can provide suggestions on where to look.
This might not even be a Lustre problem at all.

We're running Lustre 1.8.4 with OFED 1.5.2, and kernel version 
2.6.18-194.3.1.el5_lustre.1.8.4.

The problem is that at some point it appears that something in the IB
stack is going out to lunch: pings to the IPoIB interface time out, and
anything that touches IB (perfquery, etc) goes into a hard hang and cannot 
be killed.
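
A process that hangs hard and cannot be killed is typically stuck in uninterruptible sleep (state `D`) inside the kernel. As a rough sketch for confirming that on the affected node, the script below lists such processes by reading `/proc/<pid>/stat`. The helper names `parse_stat` and `uninterruptible` are hypothetical, and this only identifies the stuck processes; it does not explain why they are stuck.

```python
import os


def parse_stat(text):
    """Return (comm, state) from the contents of a /proc/<pid>/stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so locate the first '(' and the last ')'.
    start = text.index("(")
    end = text.rindex(")")
    comm = text[start + 1:end]
    state = text[end + 1:].split()[0]
    return comm, state


def uninterruptible(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for processes in state D."""
    stuck = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "stat")) as f:
                comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            # The process exited while we were reading, or the file is malformed.
            continue
        if state == "D":
            stuck.append((int(name), comm))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, comm in uninterruptible():
        print(pid, comm)
```

If commands such as perfquery show up here, that would suggest they are blocked inside a kernel driver call, which would be consistent with the shutdown and reboot hangs.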

The only solution to the problem once it occurs is to power-cycle the 
machine, as shutdown/reboot hang as well.

From what I can see, the first abnormal entries in the system logs on
the MDS are messages showing that connections to the OSSes are timing out.

Any insight would be appreciated.

Thanks,

Kevin