On Wed, May 04, 2011 at 01:37:14PM +0200, DEGREMONT Aurelien wrote:
> > I assume that the 25315s is from a bug

BTW, do you see this problem with both extent & inodebits locks?

> (fixed in 1.8.5 I think, not sure if it was ported to 2.x) that calculated
> the wrong time when printing this error message for LDLM lock timeouts.
> >
> I did not find the bug for that.

I think Andreas was referring to bug 17887. However you should have the
patch applied already since it was landed for 2.0.0.

> > If there are routers they can cause dropped RPCs from the server to the
> > client, and the client will be evicted for unresponsiveness even though it
> > is not at fault. At one time Johann was working on a patch (or at least
> > investigating) the ability to have servers resend RPCs before evicting
> > clients. The tricky part is that you don't want to send 2 RPCs each with
> > 1/2 the timeout interval, since that may reduce stability instead of
> > increasing it.
> >
> How can I track those dropped RPCs on routers?

I don't think routers can drop RPCs w/o a good reason. It is just that a
router failure can lead to packet loss and given that servers don't resend
local callbacks, this can result in client evictions.

> Is this an expected behaviour?

Well, let's call this a known problem we would like to address at some
point.

> How could I protect my filesystem from that? If I increase the timeout
> this won't change anything

Right, tweaking timeouts cannot help here.

> if client/server do not re-send their RPC.

To be clear, clients go through a disconnect/reconnect cycle and eventually
resend RPCs.

> > I think the bugzilla bug was called "limited server-side resend" or
> > similar, filed by me several years ago.
> >
> Did not find either :)

That's bug 3622. Fanyong also used to work on a patch, see
http://review.whamcloud.com/#change,125.

HTH

Cheers,
Johann
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss