What does netstat -i give you?

Any RX or TX drops?
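
If it is easier to collect this from many clients at once, the same drop counters that netstat -i reports can be read from /proc/net/dev on Linux. The following is only a rough sketch: the function name parse_drops is made up for illustration, and it assumes the standard 16-counter /proc/net/dev layout (8 receive fields followed by 8 transmit fields, with "drop" as the fourth field of each group).

```python
def parse_drops(proc_net_dev_text):
    """Return {interface: (rx_drop, tx_drop)} from /proc/net/dev-style text.

    Assumes the usual Linux layout: after "iface:", 8 receive counters
    (bytes packets errs drop fifo frame compressed multicast) followed by
    8 transmit counters (bytes packets errs drop fifo colls carrier compressed).
    """
    drops = {}
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, counters = line.split(":", 1)
        fields = counters.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        rx_drop = int(fields[3])   # 4th receive counter
        tx_drop = int(fields[11])  # 4th transmit counter
        drops[name.strip()] = (rx_drop, tx_drop)
    return drops
```

On a client, something like parse_drops(open("/proc/net/dev").read()) would give the current counters. Comparing two readings taken a few minutes apart says more than a single snapshot, since the counters are cumulative since boot.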


On Sun, Nov 16, 2008 at 11:09 AM, Brock Palen <[EMAIL PROTECTED]> wrote:
> Running 1.6.5.1 on both server and client; the clients are patchless on RHEL4.
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> [EMAIL PROTECTED]
> (734)936-1985
>
>
>
> On Nov 16, 2008, at 8:26 AM, Mag Gam wrote:
>
>> Brock:
>>
>> What is the client version? I am getting the same type of failures.
>>
>> Also, check your network if you have any TX/RX packet drops (netstat -i).
>>
>> I am wondering if you are having the same problem as us.
>>
>>
>>
>> On Fri, Nov 14, 2008 at 6:37 PM, Brock Palen <[EMAIL PROTECTED]> wrote:
>>>
>>> We consistently see random occurrences of a client being kicked out,
>>> and while Lustre says it tries to reconnect, it almost never can
>>> without a reboot:
>>>
>>>
>>> Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:
>>> 226:ptlrpc_invalidate_import()) nobackup-MDT0000_UUID: rc = -110
>>> waiting for callback (3 != 0)
>>> Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:
>>> 230:ptlrpc_invalidate_import()) @@@ still on sending list
>>> [EMAIL PROTECTED] x979024/t0 o101->nobackup-
>>> [EMAIL PROTECTED]@tcp:12/10 lens 448/1184 e 0 to 100 dl
>>> 1226700928 ref 1 fl Rpc:RES/0/0 rc -4/0
>>> Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:
>>> 230:ptlrpc_invalidate_import()) Skipped 1 previous similar message
>>> Nov 14 18:28:18 nyx-login1 kernel: Lustre: nobackup-MDT0000-
>>> mdc-00000100f7ef0400: Connection restored to service nobackup-MDT0000
>>> using nid [EMAIL PROTECTED]
>>> Nov 14 18:30:32 nyx-login1 kernel: LustreError: 11-0: an error
>>> occurred while communicating with [EMAIL PROTECTED] The mds_statfs
>>> operation failed with -107
>>> Nov 14 18:30:32 nyx-login1 kernel: Lustre: nobackup-MDT0000-
>>> mdc-00000100f7ef0400: Connection to service nobackup-MDT0000 via nid
>>> [EMAIL PROTECTED] was lost; in progress operations using this service
>>> will wait for recovery to complete.
>>> Nov 14 18:30:32 nyx-login1 kernel: LustreError: 167-0: This client
>>> was evicted by nobackup-MDT0000; in progress operations using this
>>> service will fail.
>>> Nov 14 18:30:32 nyx-login1 kernel: LustreError: 16523:0:(llite_lib.c:
>>> 1549:ll_statfs_internal()) mdc_statfs fails: rc = -5
>>> Nov 14 18:30:35 nyx-login1 kernel: LustreError: 16525:0:(client.c:
>>> 716:ptlrpc_import_delay_req()) @@@ IMP_INVALID  [EMAIL PROTECTED]
>>> x983192/t0 o41->[EMAIL PROTECTED]@tcp:12/10 lens
>>> 128/400 e 0 to 100 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
>>> Nov 14 18:30:35 nyx-login1 kernel: LustreError: 16525:0:(llite_lib.c:
>>> 1549:ll_statfs_internal()) mdc_statfs fails: rc = -108
>>>
>>> Is there any way to make Lustre more robust against these types of
>>> failures?  According to the manual (and many times in practice, such as
>>> when rebooting an MDS), the filesystem will just block and come back.  In
>>> this case it almost never comes back; after a while it says it has
>>> reconnected, but it fails again right away.
>>>
>>> On the MDS I see:
>>>
>>> Nov 14 18:30:20 mds1 kernel: Lustre: nobackup-MDT0000: haven't heard
>>> from client 1284bfca-91bd-03f6-649c-f591e5d807d5 (at
>>> [EMAIL PROTECTED]) in 227 seconds. I think it's dead, and I am
>>> evicting it.
>>> Nov 14 18:30:28 mds1 kernel: LustreError: 11463:0:(handler.c:
>>> 1515:mds_handle()) operation 41 on unconnected MDS from
>>> [EMAIL PROTECTED]
>>> Nov 14 18:30:28 mds1 kernel: LustreError: 11463:0:(ldlm_lib.c:
>>> 1536:target_send_reply_msg()) @@@ processing error (-107)
>>> [EMAIL PROTECTED] x983190/t0 o41-><?>@<?>:0/0 lens 128/0 e 0 to 0
>>> dl 1226705528 ref 1 fl Interpret:/0/0 rc -107/0
>>> Nov 14 18:34:15 mds1 kernel: Lustre: nobackup-MDT0000: haven't heard
>>> from client 1284bfca-91bd-03f6-649c-f591e5d807d5 (at
>>> [EMAIL PROTECTED]) in 227 seconds. I think it's dead, and I am
>>> evicting it.
>>>
>>> The MDS just keeps kicking the client out, even though
>>> /proc/fs/lustre/health_check reports healthy on both the client and
>>> the servers.
>>>
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> Center for Advanced Computing
>>> [EMAIL PROTECTED]
>>> (734)936-1985
>>>
>>>
>>>
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss@lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>
>>
>>
>
>
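
For anyone reading the quoted logs: the rc values there (-110, -107, -5, -108, -4) are negated Linux errno numbers. A small lookup like the one below can help when scanning a lot of these. It is a sketch only: the table is hard-coded with the Linux values for just the codes that appear in this thread, and the names LINUX_ERRNO and describe_rc are made up for illustration.

```python
# Linux errno values for the return codes seen in the logs in this thread.
# Hard-coded on purpose: errno numbers differ between operating systems,
# and Lustre kernel messages use the Linux values.
LINUX_ERRNO = {
    4: ("EINTR", "Interrupted system call"),
    5: ("EIO", "Input/output error"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    108: ("ESHUTDOWN", "Cannot send after transport endpoint shutdown"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def describe_rc(rc):
    """Turn a (usually negative) kernel return code into a readable string."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return f"{rc}: {name} ({text})"


for rc in (-110, -107, -5, -108, -4):
    print(describe_rc(rc))
```

Read this way, the client log shows a timeout waiting on the import (-110), then "not connected" (-107) on mds_statfs after the eviction, which matches the MDS reporting an operation on an unconnected client.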
