Hello, I am testing 2.8.0 in server and clients are getting evicted when an OST failover occurs resulting in IO error. Failover happens within 2 mins and lustre timeouts are very comfortable (lustre timeout is set to 300, at_min is 40, at_max is 400, ldlm_enqueue_min=260). Do you see any thing abnormal in the below debug? Any help is appreciated.
Client side debug when OST000e is unmounted and mounted on an OSS node(it was mounted again at 7:21): Feb 2 07:20:25 hcrddm005 kernel: LustreError: 11-0: hsi003-OST000e-osc-ffff8815c65f4400: Communicating with 10.10.150.21@o2ib, operation ost_read failed with -107. Feb 2 07:20:25 hcrddm005 kernel: Lustre: hsi003-OST000e-osc-ffff8815c65f4400: Connection to hsi003-OST000e (at 10.10.150.21@o2ib) was lost; in progress operations using this service will wait for recovery to complete Feb 2 07:20:25 hcrddm005 kernel: LustreError: dumping log to */tmp/lustre-log.1486041625.111200* Feb 2 07:20:25 hcrddm005 kernel: LustreError: Skipped 1 previous similar message Feb 2 07:21:27 hcrddm005 kernel: LustreError: 167-0: hsi003-OST000e-osc-ffff8815c65f4400: This client was evicted by hsi003-OST000e; in progress operations using this service will fail. Feb 2 07:21:28 hcrddm005 kernel: LustreError: 78611:0:(osc_lock.c:832:osc_ldlm_completion_ast()) lock@ffff8819fd5eca98[3 3 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x1000e0000:0x26442:0x0] { Feb 2 07:21:28 hcrddm005 kernel: LustreError: 78611:0:(osc_lock.c:832:osc_ldlm_completion_ast()) lovsub@ffff881be723c720: [0 ffff881fb55c26e0 R(1):[0, 18446744073709551615]@[0x200000bd1:0x1e:0x0]] Feb 2 07:21:28 hcrddm005 kernel: LustreError: 78611:0:(osc_lock.c:832:osc_ldlm_completion_ast()) osc@ffff881ab01f5978: ffff881d5cb4ebc0 0x20000001001 0x3e482ccc3f43b14c 3 ffff880b6711a5d0 size: 61045997568 mtime: 1485886851 atime: 0 ctime: 1485886851 blocks: 113896795 Feb 2 07:21:28 hcrddm005 kernel: LustreError: 78611:0:(osc_lock.c:832:osc_ldlm_completion_ast()) } lock@ffff8819fd5eca98 Feb 2 07:21:28 hcrddm005 kernel: LustreError: 78611:0:(osc_lock.c:832:osc_ldlm_completion_ast()) dlmlock returned -5 Feb 2 07:21:28 hcrddm005 kernel: LustreError: 78611:0:(ldlm_resource.c:809:ldlm_resource_complain()) hsi003-OST000e-osc-ffff8815c65f4400: namespace resource [0x26442:0x0:0x0].0 (ffff8812426c35c0) refcount nonzero (1) after lock cleanup; forcing cleanup. Feb 2 07:21:28 hcrddm005 kernel: LustreError: 78611:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource: [0x26442:0x0:0x0].0 (ffff8812426c35c0) refcount = 2 Feb 2 07:21:28 hcrddm005 kernel: LustreError: 78611:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in reverse order): Feb 2 07:21:28 hcrddm005 kernel: LustreError: 78611:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns: hsi003-OST000e-osc-ffff8815c65f4400 lock: ffff881d5cb4ebc0/0x3e482ccc3f43b14c lrc: 3/0,0 mode: PR/PR res: [0x26442:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x126400000000 nid: local remote: 0xa7242e5b8ddab5d0 expref: -99 pid: 232385 timeout: 0 lvb_type: 1 Feb 2 07:21:28 hcrddm005 kernel: LustreError: 78611:0:(import.c:1296:ptlrpc_invalidate_import_thread()) dump the log upon eviction Feb 2 07:21:28 hcrddm005 kernel: LustreError: dumping log to */tmp/lustre-log.1486041688.78611* Feb 2 07:21:28 hcrddm005 kernel: Lustre: hsi003-OST000e-osc-ffff8815c65f4400: Connection restored to hsi003-OST000e (at 10.10.150.21@o2ib) Feb 2 07:21:34 hcrddm005 kernel: LustreError: 78524:0:(cl_lock.c:1422:cl_unuse_try()) result = -5, this is unlikely! Feb 2 07:21:34 hcrddm005 kernel: LustreError: 78524:0:(cl_lock.c:1437:cl_unuse_locked()) lock@ffff8817dddda078[2 0 0 1 0 00000000] R(1):[0, 18446744073709551615]@[0x200000bd1:0x1e:0x0] { Feb 2 07:21:34 hcrddm005 kernel: LustreError: 78524:0:(cl_lock.c:1437:cl_unuse_locked()) vvp@ffff881c7a3823f8: Feb 2 07:21:34 hcrddm005 kernel: LustreError: 78524:0:(cl_lock.c:1437:cl_unuse_locked()) lov@ffff881fb55c26e0: 1 Feb 2 07:21:34 hcrddm005 kernel: LustreError: 78524:0:(cl_lock.c:1437:cl_unuse_locked()) 0 0: --- Feb 2 07:21:34 hcrddm005 kernel: LustreError: 78524:0:(cl_lock.c:1437:cl_unuse_locked()) Feb 2 07:21:34 hcrddm005 kernel: LustreError: 78524:0:(cl_lock.c:1437:cl_unuse_locked()) } lock@ffff8817dddda078 Feb 2 07:21:34 hcrddm005 kernel: LustreError: 78524:0:(cl_lock.c:1437:cl_unuse_locked()) unuse return –5 [client]</lus/hsi003/raj>$ dd if=raj_14 of=/dev/null bs=1M *dd: reading `raj_14': Input/output error* 8307+0 records in 8307+0 records out 8710520832 bytes (8.7 GB) copied, 72.793 s, 120 MB/s [client]</lus/hsi003/raj>$ /tmp/lustre-log.1486041625.111200 says: 00000100:00080000:12.0:1486041625.090841:0:111200:0:(recover.c:249:ptlrpc_request_handle_notconn()) import hsi003-OST000e-osc-ffff8815c65f4400 of hsi003-OST000e_UUID@10.10.150.21@o2ib abruptly disconnected: reconnecting 00000100:02000400:12.0:1486041625.090846:0:111200:0:(import.c:167:ptlrpc_set_import_discon()) hsi003-OST000e-osc-ffff8815c65f4400: Connection to hsi003-OST000e (at 10.10.150.21@o2ib) was lost; in progress operations using this service will wait for recovery to complete 00000100:00080000:12.0:1486041625.090855:0:111200:0:(import.c:176:ptlrpc_set_import_discon()) ffff880a9ea66000 hsi003-OST000e_UUID: changing import state from FULL to DISCONN 00000100:00080000:13.0:1486041625.090916:0:111200:0:(import.c:640:ptlrpc_connect_import()) ffff880a9ea66000 hsi003-OST000e_UUID: changing import state from DISCONN to CONNECTING 00000100:00080000:13.0:1486041625.090923:0:111200:0:(import.c:485:import_select_connection()) hsi003-OST000e-osc-ffff8815c65f4400: connect to NID 10.10.150.21@o2ib last attempt 34103933580 00000100:00080000:13.0:1486041625.090930:0:111200:0:(import.c:563:import_select_connection()) hsi003-OST000e-osc-ffff8815c65f4400: import ffff880a9ea66000 using connection 10.10.150.21@o2ib/10.10.150.21@o2ib 00000100:00080000:14.0:1486041625.091061:0:111188:0:(import.c:1158:ptlrpc_connect_interpret()) ffff880a9ea66000 hsi003-OST000e_UUID: changing import state from CONNECTING to DISCONN 00000100:00080000:14.0:1486041625.091066:0:111188:0:(import.c:1204:ptlrpc_connect_interpret()) recovery of hsi003-OST000e_UUID on 10.10.150.21@o2ib failed (-19) 00000100:00080000:11.0:1486041625.092288:0:111201:0:(client.c:1438:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN) req@ffff880a2208d400 x1541968458196356/t0(0) o3-> hsi003-OST000e-osc-ffff8815c65f4400@10.10.150.21@o2ib:6/4 lens 488/432 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1 00000100:00080000:13.0:1486041625.093806:0:111202:0:(client.c:1438:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN) req@ffff880c2caaac00 x1541968458196360/t0(0) o3-> hsi003-OST000e-osc-ffff8815c65f4400@10.10.150.21@o2ib:6/4 lens 488/432 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1 00000100:00080000:13.0:1486041625.098325:0:111205:0:(client.c:1438:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN) req@ffff88070d29a800 x1541968458196372/t0(0) o3-> hsi003-OST000e-osc-ffff8815c65f4400@10.10.150.21@o2ib:6/4 lens 488/432 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1 00000100:00080000:13.0:1486041625.099788:0:111205:0:(client.c:1438:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN) req@ffff8804d4fc0400 x1541968458196376/t0(0) o3-> hsi003-OST000e-osc-ffff8815c65f4400@10.10.150.21@o2ib:6/4 lens 488/432 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1 00000100:00080000:11.0:1486041625.101315:0:111207:0:(client.c:1438:ptlrpc_send_new_req()) @@@ req from PID 0 waiting for recovery: (FULL != DISCONN) req@ffff880ae155bc00 x1541968458196380/t0(0) o3-> hsi003-OST000e-osc-ffff8815c65f4400@10.10.150.21@o2ib:6/4 lens 488/432 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1 Debug log: 269934 lines, 269934 kept, 0 dropped, 0 bad. */tmp/lustre-log.1486041688.78611* 00000100:00080000:0.0:1486041681.536712:0:33379:0:(service.c:1079:ptlrpc_update_export_timer()) updating export 2189d151-cd58-3ada-30bd-fcff46ee16dc at 1486041681 exp ffff881e810c7c00 10000000:01000000:1.0F:1486041681.536805:0:32111:0:(mgc_request.c:962:mgc_blocking_ast()) Lock res [0x333030697368:0x2:0x0].0 (hsi003) 10000000:01000000:10.0:1486041687.976410:0:111932:0:(mgc_request.c:512:do_requeue()) updating log hsi003-cliir 10000000:01000000:10.0:1486041687.976413:0:111932:0:(mgc_request.c:1848:mgc_process_log()) Process log hsi003-cliir:ffff8815c65f4400 from 163 10000000:01000000:10.0:1486041687.976418:0:111932:0:(mgc_request.c:1061:mgc_enqueue()) Enqueue for hsi003-cliir (res 0x333030697368) 00000020:00000080:10.0:1486041687.976986:0:111932:0:(obd_config.c:1107:class_process_config()) processing cmd: cf00f 00000100:00080000:10.0:1486041687.977019:0:111932:0:(import.c:191:ptlrpc_set_import_discon()) osc: import ffff880a9ea66000 already not connected (conn 9, was 0): DISCONN 00010000:00080000:10.0:1486041687.977031:0:111932:0:(ldlm_lib.c:98:import_set_conn()) imp ffff880a9ea66000@hsi003-OST000e-osc-ffff8815c65f4400: found existing conn 10.10.150.21@o2ib, moved to head 00000100:00080000:10.0:1486041687.977034:0:111932:0:(import.c:640:ptlrpc_connect_import()) ffff880a9ea66000 hsi003-OST000e_UUID: changing import state from DISCONN to CONNECTING 00000100:00080000:10.0:1486041687.977040:0:111932:0:(import.c:485:import_select_connection()) hsi003-OST000e-osc-ffff8815c65f4400: connect to NID 10.10.150.21@o2ib last attempt 0 00000100:00080000:10.0:1486041687.977045:0:111932:0:(import.c:563:import_select_connection()) hsi003-OST000e-osc-ffff8815c65f4400: import ffff880a9ea66000 using connection 10.10.150.21@o2ib/10.10.150.21@o2ib 00000020:01000000:10.0:1486041687.977063:0:111932:0:(obd_config.c:1352:class_process_proc_param()) hsi003-OST000e-osc-ffff8815c65f4400.osc: Set parameter import=connection=10.10.150.21@o2ib::21 10000000:01000000:10.0:1486041687.977080:0:111932:0:(mgc_request.c:1879:mgc_process_log()) MGC10.10.150.15@o2ib: configuration from log 'hsi003-cliir' succeeded (0). 00010000:00080000:14.0F:1486041687.977538:0:111188:0:(ldlm_request.c:1317:ldlm_cli_update_pool()) @@@ Zero SLV or Limit found (SLV: 0, Limit: 1611812) req@ffff8818f261e800 x1541968458196396/t0(0) o8->hsi003-OST000e-osc-ffff8815c65f4400@10.10.150.21@o2ib:28/4 lens 400/384 e 0 to 0 dl 1486041707 ref 1 fl Rpc:RN/0/0 rc 0/0 00000100:00080000:14.0:1486041687.977554:0:111188:0:(import.c:831:ptlrpc_connect_interpret()) hsi003-OST000e-osc-ffff8815c65f4400: connect to target with instance 21 00000001:00080000:14.0:1486041687.977569:0:111188:0:(lcommon_misc.c:77:cl_init_ea_size()) updating max_mdsize/max_cookiesize: 432/512 00000100:00080000:14.0:1486041687.977574:0:111188:0:(import.c:980:ptlrpc_connect_interpret()) @@@ hsi003-OST000e-osc-ffff8815c65f4400: evicting (reconnect/recover flags not set: 4) req@ffff8818f261e800 x1541968458196396/t0(0) o8-> hsi003-OST000e-osc-ffff8815c65f4400@10.10.150.21@o2ib:28/4 lens 400/384 e 0 to 0 dl 1486041707 ref 1 fl Interpret:RN/0/0 rc 0/0 00000100:00080000:14.0:1486041687.977582:0:111188:0:(import.c:983:ptlrpc_connect_interpret()) ffff880a9ea66000 hsi003-OST000e_UUID: changing import state from CONNECTING to EVICTED 00000100:02020000:14.0:1486041687.977585:0:111188:0:(import.c:1346:ptlrpc_import_recovery_state_machine()) 167-0: hsi003-OST000e-osc-ffff8815c65f4400: This client was evicted by hsi003-OST000e; in progress operations using this service will fail. 00000100:00080000:14.0:1486041687.993395:0:111188:0:(import.c:1350:ptlrpc_import_recovery_state_machine()) evicted from hsi003-OST000e_UUID@10.10.150.21@o2ib; invalidating 00000100:00080000:12.0:1486041687.993449:0:78611:0:(import.c:1291:ptlrpc_invalidate_import_thread()) thread invalidate import hsi003-OST000e-osc-ffff8815c65f4400 to hsi003-OST000e_UUID@10.10.150.21@o2ib 00000100:00080000:12.0:1486041687.993454:0:78611:0:(import.c:203:ptlrpc_deactivate_and_unlock_import()) setting import hsi003-OST000e_UUID INVALID Thank you
_______________________________________________ lustre-discuss mailing list lustre-discuss@lists.lustre.org http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org