Some more information that might be helpful. One of our users runs a particular code. Personally, after the trouble this code has caused us, we'd like to hand him a calculator and disable his accounts, but sadly that's not an option. Since the time of the hang there has been what seems to be one Lustre-related process running under the problem user's userid: "ll_sa_15530". A trace of this process in its current state shows this:

Apr 30 11:29:30 cola10 kernel: ll_sa_15530 S 0000000000000000 0 15531 1 17700 18228 (L-TLB)
Apr 30 11:29:30 cola10 kernel: ffff810116c31c10 0000000000000046 ffff81013e7747a0 ffffffff80087d0e
Apr 30 11:29:30 cola10 kernel: 0000000000000007 ffff81003a76b040 ffff81012f11f0c0 000fcb5175eba398
Apr 30 11:29:30 cola10 kernel: 0000000000001407 ffff81003a76b228 0000000000000001 0000000000000068
Apr 30 11:29:30 cola10 kernel: Call Trace:
Apr 30 11:29:30 cola10 kernel: [<ffffffff80087d0e>] enqueue_task+0x41/0x56
Apr 30 11:29:30 cola10 kernel: [<ffffffff8862b7e4>] :ptlrpc:ldlm_prep_enqueue_req+0x1b4/0x2e0
Apr 30 11:29:30 cola10 kernel: [<ffffffff886e528c>] :mdc:mdc_req_avail+0x6c/0xf0
Apr 30 11:29:30 cola10 kernel: [<ffffffff886e6275>] :mdc:mdc_enter_request+0x145/0x1e0
Apr 30 11:29:30 cola10 kernel: [<ffffffff800884ed>] default_wake_function+0x0/0xe
Apr 30 11:29:30 cola10 kernel: [<ffffffff886e6410>] :mdc:mdc_intent_lookup_pack+0xd0/0xf0
Apr 30 11:29:30 cola10 kernel: [<ffffffff886e6644>] :mdc:mdc_intent_getattr_async+0x214/0x420
Apr 30 11:29:30 cola10 kernel: [<ffffffff887ae63d>] :lustre:ll_i2gids+0x5d/0x150
Apr 30 11:29:30 cola10 kernel: [<ffffffff887b94c5>] :lustre:ll_statahead_thread+0xf75/0x1810
Apr 30 11:29:30 cola10 kernel: [<ffffffff800884ed>] default_wake_function+0x0/0xe
Apr 30 11:29:30 cola10 kernel: [<ffffffff8005bfb1>] child_rip+0xa/0x11
Apr 30 11:29:30 cola10 kernel: [<ffffffff887b8550>] :lustre:ll_statahead_thread+0x0/0x1810
Apr 30 11:29:30 cola10 kernel: [<ffffffff8005bfa7>] child_rip+0x0/0x11

Is this a problem with the Lustre statahead code? If so, would this fix it? "echo 0 > /proc/fs/lustre/llite/*/statahead_count"
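
One caveat with the echo command quoted above: as far as I know, bash refuses a redirect to a glob that matches more than one file, so it only works as written when a single Lustre filesystem is mounted. A minimal sketch of writing the value to every match follows. The function name set_tunable and the dry-run default are illustrative choices of mine, and the path is simply the one quoted above; whether the tunable exists under that name on a given Lustre version is an assumption, not something verified here.

```python
import glob

# Path pattern quoted above; whether this tunable exists under this name
# depends on the Lustre version (assumption, not verified here).
STATAHEAD_GLOB = "/proc/fs/lustre/llite/*/statahead_count"


def set_tunable(pattern, value, dry_run=True):
    """Write `value` to every file matching `pattern`; return the matched paths."""
    paths = sorted(glob.glob(pattern))
    for path in paths:
        if dry_run:
            # Only report what would be written, change nothing.
            print(f"would write {value!r} to {path}")
        else:
            with open(path, "w") as handle:
                handle.write(f"{value}\n")
    return paths


if __name__ == "__main__":
    # Dry run by default so nothing is changed until the path is confirmed.
    set_tunable(STATAHEAD_GLOB, 0)
```

Running it with dry_run=False (as root) would actually write the value; the dry run only lists the files the glob matched.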

Thank you so much for all your help.

-Aaron

On Apr 30, 2008, at 11:16 AM, Aaron S. Knister wrote:

I have a Lustre client that was evicted, seemingly at random, early this morning. The errors from dmesg are below. It's running InfiniBand. There were no InfiniBand errors that I could tell, and all the MDS/MGS and the OSSs said was "haven't heard from client xyz in 2277 seconds. Evicting". The client has halfway come back and now shows this:


[EMAIL PROTECTED]:~ $ lfs df -h
UUID                     bytes      Used Available  Use% Mounted on
data-MDT0000_UUID        87.5G      6.4G     81.1G    7% /data[MDT:0]
data-OST0000_UUID         5.4T      4.9T    439.6G   92% /data[OST:0]
data-OST0001_UUID   : inactive device
data-OST0002_UUID   : inactive device
data-OST0003_UUID   : inactive device
data-OST0004_UUID   : inactive device
data-OST0005_UUID   : inactive device
data-OST0006_UUID   : inactive device
data-OST0007_UUID   : inactive device
data-OST0008_UUID   : inactive device
data-OST0009_UUID   : inactive device

filesystem summary:       5.4T      4.9T    439.6G   92% /data

So it's reconnected to one of the 10 OSTs. I tried to do an "lctl --device {device} reconnect" and it said "Error: Operation in progress". I have no idea what went wrong, and I'm confident a reboot would fix it, but I'd like to avoid that if possible.
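
To keep an eye on which targets are still disconnected, the "inactive device" lines can be pulled out of the lfs df output. This is a minimal sketch that assumes the line shape shown above ("<UUID> : inactive device"); the function name inactive_targets and the shortened sample text are illustrative only.

```python
import re

# Line shape copied from the output above: "<UUID>   : inactive device".
INACTIVE_RE = re.compile(r"^(\S+_UUID)\s*:\s*inactive device\s*$")


def inactive_targets(lfs_df_output):
    """Return the UUIDs that `lfs df` reported as inactive, in order."""
    found = []
    for line in lfs_df_output.splitlines():
        match = INACTIVE_RE.match(line.strip())
        if match:
            found.append(match.group(1))
    return found


# Shortened copy of the output above, used only to exercise the parser.
SAMPLE = """\
data-MDT0000_UUID        87.5G      6.4G     81.1G    7% /data[MDT:0]
data-OST0000_UUID         5.4T      4.9T    439.6G   92% /data[OST:0]
data-OST0001_UUID   : inactive device
data-OST0002_UUID   : inactive device
"""

if __name__ == "__main__":
    for uuid in inactive_targets(SAMPLE):
        print(uuid)
```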


Thanks in advance.

LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_statfs operation failed with -107
Lustre: data-MDT0000-mdc-ffff81013037b800: Connection to service data-MDT0000 via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 167-0: This client was evicted by data-MDT0000; in progress operations using this service will fail.
LustreError: 22345:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -5
LustreError: 22396:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717113/t0 o41->[EMAIL PROTECTED] @o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22396:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22454:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717114/t0 o41->[EMAIL PROTECTED] @o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22454:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22463:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717115/t0 o41->[EMAIL PROTECTED] @o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22463:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22734:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717138/t0 o41->[EMAIL PROTECTED] @o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22734:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22736:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717139/t0 o41->[EMAIL PROTECTED] @o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22736:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22912:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717140/t0 o41->[EMAIL PROTECTED] @o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22912:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22971:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717143/t0 o41->[EMAIL PROTECTED] @o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22971:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
LustreError: 22971:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22971:0:(llite_lib.c:1508:ll_statfs_internal()) Skipped 2 previous similar messages
LustreError: 23781:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717144/t0 o41->[EMAIL PROTECTED] @o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 23781:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 23796:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717156/t0 o41->[EMAIL PROTECTED] @o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 23827:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717157/t0 o41->[EMAIL PROTECTED] @o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 23827:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 23827:0:(llite_lib.c:1508:ll_statfs_internal()) Skipped 1 previous similar message
LustreError: 22346:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717169/t0 o35->[EMAIL PROTECTED] @o2ib:12 lens 296/896 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22346:0:(file.c:97:ll_close_inode_openhandle()) inode 21601226 mdc close failed: rc = -108
Lustre: data-MDT0000-mdc-ffff81013037b800: Connection restored to service data-MDT0000 using nid [EMAIL PROTECTED]
LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The ost_statfs operation failed with -107
Lustre: data-OST0001-osc-ffff81013037b800: Connection to service data-OST0001 via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The ost_statfs operation failed with -107
LustreError: 167-0: This client was evicted by data-OST0001; in progress operations using this service will fail.
LustreError: 167-0: This client was evicted by data-OST0002; in progress operations using this service will fail.
LustreError: 24093:0:(llite_lib.c:1520:ll_statfs_internal()) obd_statfs fails: rc = -5
Lustre: data-OST0000-osc-ffff81013037b800: Connection restored to service data-OST0000 using nid [EMAIL PROTECTED]
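
The negative numbers in the messages above are negated Linux errno values: -5 is EIO, -107 is ENOTCONN and -108 is ESHUTDOWN. A small sketch for tallying them from a saved dmesg dump is below. The function name tally_return_codes is illustrative, the regular expression only covers the two phrasings seen in this log ("rc = -N" and "failed with -N"), and the errno table is hardcoded with the Linux values so the result does not depend on the platform running the script.

```python
import re
from collections import Counter

# Linux errno values (from errno.h); hardcoded so the sketch gives the same
# answer on any platform.
LINUX_ERRNO = {5: "EIO", 107: "ENOTCONN", 108: "ESHUTDOWN"}

# Matches only the two phrasings that appear in the log above.
RC_RE = re.compile(r"(?:rc = |failed with )-(\d+)")


def tally_return_codes(log_text):
    """Count how often each negative return code appears in the log text."""
    counts = Counter()
    for code in RC_RE.findall(log_text):
        number = int(code)
        counts[LINUX_ERRNO.get(number, f"errno {number}")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python tally_rc.py   (the script name is hypothetical)
    for name, count in tally_return_codes(sys.stdin.read()).most_common():
        print(f"{name}: {count}")
```

Note that the "rc 0/0" field in the IMP_INVALID lines is deliberately not matched, since it is part of the request dump rather than a failure code.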

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
[EMAIL PROTECTED]



