What does netstat -i give you? Are there any RX or TX drops?
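
If it helps to check many clients at once, here is a rough sketch of the same check done from a script. It assumes a Linux host with Python 3 available and reads the drop columns from /proc/net/dev, where the Linux kernel exposes per-interface counters. The function name parse_drops is made up for this example.

```python
# Sketch: report per-interface drop counters from /proc/net/dev on Linux.
# Assumes Python 3; parse_drops is an illustrative name, not an existing tool.
from pathlib import Path


def parse_drops(text):
    """Return {interface: (rx_drop, tx_drop)} from /proc/net/dev content."""
    drops = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        fields = counters.split()
        if len(fields) < 16:
            continue
        # Eight receive columns come first (bytes, packets, errs, drop, ...),
        # followed by eight transmit columns (bytes, packets, errs, drop, ...).
        drops[name.strip()] = (int(fields[3]), int(fields[11]))
    return drops


if __name__ == "__main__":
    content = Path("/proc/net/dev").read_text()
    for iface, (rx, tx) in sorted(parse_drops(content).items()):
        flag = "  <-- drops" if rx or tx else ""
        print(f"{iface:10s} rx_drop={rx} tx_drop={tx}{flag}")
```

The counters are cumulative since boot, so running it twice a few minutes apart and comparing the numbers tells you whether drops are still happening.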

On Sun, Nov 16, 2008 at 11:09 AM, Brock Palen <[EMAIL PROTECTED]> wrote:
> Running 1.6.5.1 on both server and client, on RHEL4 patchless clients.
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> [EMAIL PROTECTED]
> (734)936-1985
>
> On Nov 16, 2008, at 8:26 AM, Mag Gam wrote:
>
>> Brock:
>>
>> What is the client version? I am getting the same type of failures.
>>
>> Also, check your network for any TX/RX packet drops (netstat -i).
>>
>> I am wondering if you are having the same problem as us.
>>
>> On Fri, Nov 14, 2008 at 6:37 PM, Brock Palen <[EMAIL PROTECTED]> wrote:
>>>
>>> We consistently see random occurrences of a client being kicked out,
>>> and while lustre says it tries to reconnect, it almost never can
>>> without a reboot:
>>>
>>> Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:226:ptlrpc_invalidate_import()) nobackup-MDT0000_UUID: rc = -110 waiting for callback (3 != 0)
>>> Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:230:ptlrpc_invalidate_import()) @@@ still on sending list [EMAIL PROTECTED] x979024/t0 o101->nobackup-[EMAIL PROTECTED]@tcp:12/10 lens 448/1184 e 0 to 100 dl 1226700928 ref 1 fl Rpc:RES/0/0 rc -4/0
>>> Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:230:ptlrpc_invalidate_import()) Skipped 1 previous similar message
>>> Nov 14 18:28:18 nyx-login1 kernel: Lustre: nobackup-MDT0000-mdc-00000100f7ef0400: Connection restored to service nobackup-MDT0000 using nid [EMAIL PROTECTED]
>>> Nov 14 18:30:32 nyx-login1 kernel: LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_statfs operation failed with -107
>>> Nov 14 18:30:32 nyx-login1 kernel: Lustre: nobackup-MDT0000-mdc-00000100f7ef0400: Connection to service nobackup-MDT0000 via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
>>> Nov 14 18:30:32 nyx-login1 kernel: LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
>>> Nov 14 18:30:32 nyx-login1 kernel: LustreError: 16523:0:(llite_lib.c:1549:ll_statfs_internal()) mdc_statfs fails: rc = -5
>>> Nov 14 18:30:35 nyx-login1 kernel: LustreError: 16525:0:(client.c:716:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x983192/t0 o41->[EMAIL PROTECTED]@tcp:12/10 lens 128/400 e 0 to 100 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
>>> Nov 14 18:30:35 nyx-login1 kernel: LustreError: 16525:0:(llite_lib.c:1549:ll_statfs_internal()) mdc_statfs fails: rc = -108
>>>
>>> Is there any way to make lustre more robust against these types of
>>> failures? According to the manual (and many times in practice, like
>>> rebooting a MDS) the filesystem will just block and come back. This
>>> almost never comes back; after a while it will say reconnected, but
>>> will fail again right away.
>>>
>>> On the MDS I see:
>>>
>>> Nov 14 18:30:20 mds1 kernel: Lustre: nobackup-MDT0000: haven't heard from client 1284bfca-91bd-03f6-649c-f591e5d807d5 (at [EMAIL PROTECTED]) in 227 seconds. I think it's dead, and I am evicting it.
>>> Nov 14 18:30:28 mds1 kernel: LustreError: 11463:0:(handler.c:1515:mds_handle()) operation 41 on unconnected MDS from [EMAIL PROTECTED]
>>> Nov 14 18:30:28 mds1 kernel: LustreError: 11463:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x983190/t0 o41-><?>@<?>:0/0 lens 128/0 e 0 to 0 dl 1226705528 ref 1 fl Interpret:/0/0 rc -107/0
>>> Nov 14 18:34:15 mds1 kernel: Lustre: nobackup-MDT0000: haven't heard from client 1284bfca-91bd-03f6-649c-f591e5d807d5 (at [EMAIL PROTECTED]) in 227 seconds. I think it's dead, and I am evicting it.
>>>
>>> It just keeps kicking the client out, and
>>> /proc/fs/lustre/health_check reports healthy on both the client
>>> and the servers.
>>>
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> Center for Advanced Computing
>>> [EMAIL PROTECTED]
>>> (734)936-1985
>>>
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss@lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss