We seem to be having a problem with clients getting disconnected under
Lustre 1.6.4.3.

We currently have a box that just does maintenance work on the cluster
(du, stats, and other such jobs), and there are some directories it cannot
enter.  (The shell just hangs and never times out.)

Running "lfs check servers" shows that all of the servers are OK:

% lfs check servers
content-MDT0000-mdc-ffff810210b0fc00 active.
content-OST0000-osc-ffff810210b0fc00 active.
content-OST0001-osc-ffff810210b0fc00 active.
content-OST0002-osc-ffff810210b0fc00 active.
content-OST0003-osc-ffff810210b0fc00 active.
content-OST0004-osc-ffff810210b0fc00 active.
content-OST0005-osc-ffff810210b0fc00 active.
content-OST0006-osc-ffff810210b0fc00 active.
content-OST0007-osc-ffff810210b0fc00 active.
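
For anyone who wants to run the same check from a script on several clients, here is a minimal sketch in Python. It only assumes the two-column "name status." layout shown above; the function name is made up for illustration, and I have not confirmed what lfs prints for an unhealthy device, so anything that is not "active" simply gets flagged.

```python
def not_active(lfs_output):
    """Return the device names whose status column is not 'active'."""
    flagged = []
    for line in lfs_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines and anything not in 'name status.' form
        name, status = parts
        if status.rstrip(".") != "active":
            flagged.append(name)
    return flagged


# Two of the lines from the output above, pasted in as sample input.
sample = """\
content-MDT0000-mdc-ffff810210b0fc00 active.
content-OST0000-osc-ffff810210b0fc00 active.
"""

print(not_active(sample))
```

On the output above this prints an empty list, which matches what I see by eye: the client believes every import is healthy.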

I enabled rpctrace in the debug logs and am now seeing this:

00000100:00080000:2:1210181389.481562:0:4282:0:(pinger.c:139:ptlrpc_pinger_main()) not pinging MGS (in recovery: FULL or recovery disabled: 0/1)
00000100:00080000:2:1210181414.476881:0:4282:0:(pinger.c:139:ptlrpc_pinger_main()) not pinging MGS (in recovery: FULL or recovery disabled: 0/1)
00000100:00080000:2:1210181439.471197:0:4282:0:(pinger.c:139:ptlrpc_pinger_main()) not pinging MGS (in recovery: FULL or recovery disabled: 0/1)

If I reboot the machine, it comes back.  The other clients connected to
this cluster are not experiencing this problem.
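
In case it helps anyone compare against their own logs, here is a rough Python sketch that pulls the timestamps out of the pinger lines quoted above and prints the gap between them. The field layout in the regular expression is taken from those three lines only. Treating the fourth field as seconds.microseconds since the epoch is my assumption, and the function name is made up for illustration.

```python
import re
from decimal import Decimal

# Header layout as seen in the log lines quoted above. Only the timestamp,
# pid, source file, line number and function name are captured; the meaning
# of the other fields is not assumed here.
HEADER = re.compile(
    r"^[0-9a-f]{8}:[0-9a-f]{8}:\d+:(?P<ts>\d+\.\d+):\d+:(?P<pid>\d+):\d+:"
    r"\((?P<src>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)"
)


def pinger_intervals(log_text, func="ptlrpc_pinger_main"):
    """Return the gaps between consecutive log entries emitted by `func`."""
    stamps = []
    for line in log_text.splitlines():
        m = HEADER.match(line.strip())
        if m and m.group("func") == func:
            # Decimal keeps the microsecond digits exact when subtracting.
            stamps.append(Decimal(m.group("ts")))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# The three entries from the debug log above.
sample = """\
00000100:00080000:2:1210181389.481562:0:4282:0:(pinger.c:139:ptlrpc_pinger_main())
 not pinging MGS (in recovery: FULL or recovery disabled: 0/1)
00000100:00080000:2:1210181414.476881:0:4282:0:(pinger.c:139:ptlrpc_pinger_main())
 not pinging MGS (in recovery: FULL or recovery disabled: 0/1)
00000100:00080000:2:1210181439.471197:0:4282:0:(pinger.c:139:ptlrpc_pinger_main())
 not pinging MGS (in recovery: FULL or recovery disabled: 0/1)
"""

for gap in pinger_intervals(sample):
    print(gap)
```

For the three entries above this prints 24.995319 and 24.994316, so under the timestamp assumption the message repeats roughly every 25 seconds, which suggests the pinger thread on this client is still waking up on a regular schedule.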

Is anyone else seeing these issues?  Thoughts?

Thanks!

--
Andrew
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
