Hello all,

I have a problem with Lustre 1.8.0 on CentOS 5.0. These are the logs on the client:

Oct 4 23:56:13 web5 kernel: Lustre: lustre-MDT0000-mdc-ffff81022e8e9400: Connection to service lustre-MDT0000 via nid 192.168.1...@tcp was lost; in progress
Oct 4 23:56:13 web5 kernel: LustreError: 3260:0:(import.c:909:ptlrpc_connect_interpret()) lustre-MDT0000_UUID went back in time (transno 12892928971 was pre
Oct 4 23:56:13 web5 kernel: LustreError: 167-0: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
Oct 4 23:56:13 web5 kernel: LustreError: 1204:0:(client.c:792:ptlrpc_import_delay_req()) @@@ IMP_INVALID r...@ffff81017b295c00 x1315300608347977/t0 o101->lu
Oct 4 23:56:13 web5 kernel: LustreError: 1204:0:(mdc_locks.c:601:mdc_enqueue()) ldlm_cli_enqueue: -108
Oct 4 23:56:13 web5 php-cgi: PHP Warning: session_start() [<a href='function.session-start'>function.session-start</a>]: open(/mnt/lustre/megashare/session
Oct 4 23:56:13 web5 php-cgi: PHP Warning: session_start() [<a href='function.session-start'>function.session-start</a>]: open(/mnt/lustre/megashare/session
Oct 4 23:56:13 web5 php-cgi: PHP Warning: session_start() [<a href='function.session-start'>function.session-start</a>]: open(/mnt/lustre/megashare/session
Oct 4 23:56:13 web5 kernel: LustreError: 1193:0:(client.c:792:ptlrpc_import_delay_req()) @@@ IMP_INVALID r...@ffff81016f855000 x1315300608347982/t0 o101->lu
Oct 4 23:56:13 web5 kernel: LustreError: 1193:0:(client.c:792:ptlrpc_import_delay_req()) Skipped 4 previous similar messages
Oct 4 23:56:13 web5 php-cgi: PHP Warning: Unknown: open(/mnt/lustre/megashare/session/sess_7s287b1tjipfoh1ud1basgu773, O_RDWR) failed: Cannot send after tr
Oct 4 23:56:13 web5 php-cgi: PHP Warning: Unknown: Failed to write session data (files). Please verify that the current setting of session.save_path is cor
Oct 4 23:56:13 web5 kernel: LustreError: 1193:0:(mdc_locks.c:601:mdc_enqueue()) ldlm_cli_enqueue: -108
Oct 4 23:56:13 web5 kernel: LustreError: 1193:0:(mdc_locks.c:601:mdc_enqueue()) Skipped 4 previous similar messages
Oct 4 23:56:13 web5 php-cgi: PHP Warning: fopen(/mnt/lustre/megashare/tmp/query-qqjrdr3m16ni2s21kurbilbeu2) [<a href='function.fopen'>function.fopen</a>]:
Oct 4 23:56:13 web5 php-cgi: PHP Warning: fopen(/mnt/lustre/megashare/tmp/query-qqjrdr3m16ni2s21kurbilbeu2) [<a href='function.fopen'>function.fopen</a>]:
Oct 4 23:56:13 web5 kernel: LustreError: 1605:0:(dir.c:239:ll_get_dir_page()) lock enqueue: rc: -108
Oct 4 23:56:13 web5 php-cgi: PHP Warning: fwrite(): supplied argument is not a valid stream resource in /srv/www/lighttpd/search.php on line 66
Oct 4 23:56:13 web5 php-cgi: PHP Warning: fwrite(): supplied argument is not a valid stream resource in /srv/www/lighttpd/search.php on line 66
Oct 4 23:56:13 web5 kernel: LustreError: 1605:0:(dir.c:384:ll_readdir_18()) error reading dir 28443012/2724301973 page 171: rc -108
Oct 4 23:56:13 web5 php-cgi: PHP Warning: fclose(): supplied argument is not a valid stream resource in /srv/www/lighttpd/search.php on line 67
Oct 4 23:56:13 web5 kernel: LustreError: 1605:0:(dir.c:239:ll_get_dir_page()) lock enqueue: rc: -108
Oct 4 23:56:13 web5 kernel: LustreError: 1605:0:(dir.c:239:ll_get_dir_page()) Skipped 10 previous similar messages
Oct 4 23:56:13 web5 kernel: LustreError: 1605:0:(dir.c:384:ll_readdir_18()) error reading dir 28443012/2724301973 page 182: rc -108
Oct 4 23:56:13 web5 php-cgi: PHP Warning: fclose(): supplied argument is not a valid stream resource in /srv/www/lighttpd/search.php on line 67
Oct 4 23:56:13 web5 kernel: LustreError: 1605:0:(dir.c:384:ll_readdir_18()) Skipped 10 previous similar messages
Oct 4 23:56:13 web5 kernel: LustreError: 1714:0:(file.c:114:ll_close_inode_openhandle()) inode 28511437 mdc close failed: rc = -108
Oct 4 23:56:13 web5 kernel: Lustre: lustre-MDT0000-mdc-ffff81022e8e9400: Connection restored to service lustre-MDT0000 using nid 192.168.1...@tcp.
Oct 4 23:56:14 web5 php-cgi: PHP Warning: unlink(/mnt/lustre/megashare/tmp/query-qqjrdr3m16ni2s21kurbilbeu2) [<a href='function.unlink'>function.unlink</a>
Oct 4 23:56:14 web5 php-cgi: PHP Warning: unlink(/mnt/lustre/megashare/tmp/query-qqjrdr3m16ni2s21kurbilbeu2) [<a href='function.unlink'>function.unlink</a>
Oct 4 23:56:14 web5 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.1...@tcp. The mds_close operation failed with -116
Oct 4 23:56:14 web5 kernel: LustreError: 1605:0:(file.c:114:ll_close_inode_openhandle()) inode 28443012 mdc close failed: rc = -116
Oct 4 23:56:14 web5 kernel: LustreError: 1605:0:(ldlm_lock.c:602:ldlm_lock_decref_internal_nolock()) ASSERTION(lock->l_writers > 0) failed
Oct 4 23:56:14 web5 kernel: LustreError: 1605:0:(ldlm_lock.c:602:ldlm_lock_decref_internal_nolock()) LBUG
Oct 4 23:56:14 web5 kernel: Lustre: 1605:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 1605
Oct 4 23:56:14 web5 kernel: php-cgi R running task 0 1605 10752 1704 1592 (NOTLB)
Oct 4 23:56:14 web5 kernel: ffff810169e3fba4 0000000188496a21 ffff810169e3fba0 ffff810176aaac80
Oct 4 23:56:14 web5 kernel: ffff81022e0f7e00 0000000000000000 0000002000000000 0000000100000001
Oct 4 23:56:14 web5 kernel: ffff8101677d4060 000000028848dd29 0000000000000000 ffff81012da80cc0
Oct 4 23:56:14 web5 kernel: Call Trace:
Oct 4 23:56:14 web5 kernel: [<ffffffff884b802f>] :ptlrpc:ldlm_flock_completion_ast+0x83f/0x920
Oct 4 23:56:14 web5 kernel: [<ffffffff884913e9>] :ptlrpc:ldlm_lock_enqueue+0x879/0x9c0
Oct 4 23:56:14 web5 kernel: [<ffffffff884a9960>] :ptlrpc:ldlm_cli_enqueue_fini+0xa40/0xbd0
Oct 4 23:56:14 web5 kernel: [<ffffffff884a8c93>] :ptlrpc:ldlm_prep_elc_req+0x2f3/0x420
Oct 4 23:56:14 web5 kernel: [<ffffffff80087d97>] default_wake_function+0x0/0xe
Oct 4 23:56:14 web5 kernel: [<ffffffff884ab791>] :ptlrpc:ldlm_cli_enqueue+0x681/0x740
Oct 4 23:56:14 web5 kernel: [<ffffffff886727cf>] :lustre:ll_file_flock+0x57f/0x670
Oct 4 23:56:14 web5 kernel: [<ffffffff884b77f0>] :ptlrpc:ldlm_flock_completion_ast+0x0/0x920
Oct 4 23:56:14 web5 kernel: [<ffffffff80030506>] locks_remove_posix+0x84/0xa8
Oct 4 23:56:14 web5 kernel: [<ffffffff80062f4b>] thread_return+0x0/0xdf
Oct 4 23:56:14 web5 kernel: [<ffffffff80023a32>] filp_close+0x54/0x64
Oct 4 23:56:14 web5 kernel: [<ffffffff8001da4b>] sys_close+0x88/0xa2
Oct 4 23:56:14 web5 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Oct 4 23:56:14 web5 kernel:
Oct 4 23:56:14 web5 kernel: LustreError: dumping log to /tmp/lustre-log.1254675374.1605
Oct 5 00:02:35 web5 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [php-cgi:1782]
Oct 5 00:02:35 web5 kernel: CPU 1:
Oct 5 00:02:35 web5 kernel: Modules linked in: nfs(U) lockd(U) fscache(U) nfs_acl(U) mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) o
Oct 5 00:02:35 web5 kernel: Pid: 1782, comm: php-cgi Tainted: G 2.6.18-92.1.17.el5_lustre.1.8.0custom #2
....................

And here are the logs on the MDS (MGS and MDT are on the same node):
Oct 4 23:55:29 MDS1 kernel: LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 192.168.1...@tcp ns: mds-lustre-MDT0000_UUID lock: ffff81010de8dc00/0x2d5a67974b5b0f96 lrc: 3/0,0 mode: CR/CR res: 28444597/2724645784 bits 0x3 rrc: 2 type: IBT flags: 0x4000020 remote: 0x9b8763ff47462714 expref: 869 pid: 19255 timeout: 4921443228
Oct 4 23:55:29 MDS1 kernel: Lustre: 19259:0:(mds_reint.c:237:mds_finish_transno()) commit transaction for disconnected client 039bd606-7bf9-7b26-2866-c58d394d5441: rc 0
Oct 4 23:55:29 MDS1 kernel: LustreError: 19304:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 28443012: cookie 0x2d5a67974b5b0b60 r...@ffff81011c5c9c00 x1315300608349333/t0 o35->039bd606-7bf9-7b26-2866-c58d394d5...@net_0x20000c0a8014b_uuid:0/0 lens 408/696 e 0 to 0 dl 1254675335 ref 1 fl Interpret:/0/0 rc 0/0
Oct 4 23:55:29 MDS1 kernel: LustreError: 19304:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-116) r...@ffff81011c5c9c00 x1315300608349333/t0 o35->039bd606-7bf9-7b26-2866-c58d394d5...@net_0x20000c0a8014b_uuid:0/0 lens 408/696 e 0 to 0 dl 1254675335 ref 1 fl Interpret:/0/0 rc -116/0
Oct 5 00:51:08 MDS1 kernel: LustreError: 21766:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096), evicting 039bd606-7bf9-7b26-2866-c58d394d5...@net_0x20000c0a8014b_uuid
Oct 5 00:51:08 MDS1 kernel: r...@ffff81018360ac00 x1315300608351958/t0 o37->039bd606-7bf9-7b26-2866-c58d394d5...@net_0x20000c0a8014b_uuid:0/0 lens 408/360 e 1 to 0 dl 1254678693 ref 1 fl Interpret:/0/0 rc 0/0
Oct 5 00:51:08 MDS1 kernel: Lustre: 21766:0:(ldlm_lockd.c:808:ldlm_server_completion_ast()) ### enqueue wait took 3339s from 1254675329 ns: mds-lustre-MDT0000_UUID lock: ffff81017a6b6600/0x2d5a67974b5b63c7 lrc: 3/0,0 mode: PW/PW res: 28456302/2724672227 rrc: 3 type: FLK pid: 1206 [0->9223372036854775807] flags: 0x0 remote: 0x9b8763ff474660b1 expref: 185 pid: 19262 timeout: 0
Oct 5 00:51:31 MDS1 kernel: LustreError: 19268:0:(handler.c:1554:mds_handle()) operation 400 on unconnected MDS from 12345-192.168.1...@tcp
Oct 5 00:51:31 MDS1 kernel: LustreError: 19268:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-107) r...@ffff810136d09800 x1315300608351964/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1254678697 ref 1 fl Interpret:H/0/0 rc -107/0
Oct 5 00:51:53 MDS1 kernel: Lustre: 3408:0:(socklnd_cb.c:2728:ksocknal_check_peer_timeouts()) Stale ZC_REQs for peer 192.168.1...@tcp detected: 1; the oldest (ffff810116084ac0) timed out 0 secs ago
Oct 5 00:51:53 MDS1 kernel: Lustre: Request x1314965658587167 sent from lustre-MDT0000 to NID 192.168.1...@tcp 45s ago has timed out (limit 45s).
Oct 5 00:51:53 MDS1 kernel: LustreError: 138-a: lustre-MDT0000: A client on nid 192.168.1...@tcp was evicted due to a lock completion callback to 192.168.1...@tcp timed out: rc -107
Oct 5 00:52:05 MDS1 kernel: Lustre: 3408:0:(socklnd_cb.c:2728:ksocknal_check_peer_timeouts()) Stale ZC_REQs for peer 192.168.1...@tcp detected: 1; the oldest (ffff810116084ac0) timed out 12 secs ago
Oct 5 00:52:18 MDS1 kernel: Lustre: 3408:0:(socklnd_cb.c:2728:ksocknal_check_peer_timeouts()) Stale ZC_REQs for peer 192.168.1...@tcp detected: 1; the oldest (ffff810116084ac0) timed out 25 secs ago
Oct 5 00:52:30 MDS1 kernel: Lustre: 3408:0:(socklnd_cb.c:2728:ksocknal_check_peer_timeouts()) Stale ZC_REQs for peer 192.168.1...@tcp detected: 1; the oldest (ffff810116084ac0) timed out 37 secs ago
Oct 5 00:52:32 MDS1 kernel: Lustre: 21766:0:(service.c:1319:ptlrpc_server_handle_request()) @@@ Request x1315300608351958 took longer than estimated (31+59s); client may timeout. r...@ffff81018360ac00 x1315300608351958/t0 o37->039bd606-7bf9-7b26-2866-c58d394d5...@net_0x20000c0a8014b_uuid:0/0 lens 408/360 e 1 to 0 dl 1254678693 ref 1 fl Complete:/0/0 rc -110/-110
Oct 5 00:55:20 MDS1 kernel: Lustre: MGS: haven't heard from client 91e0da15-85fa-166e-4633-d907fdc1fdc6 (at 192.168.1...@tcp) in 229 seconds. I think it's dead, and I am evicting it.
Oct 5 00:55:23 MDS1 kernel: Lustre: lustre-MDT0000: haven't heard from client 039bd606-7bf9-7b26-2866-c58d394d5441 (at 192.168.1...@tcp) in 232 seconds. I think it's dead, and I am evicting it.
Oct 5 00:55:23 MDS1 kernel: LustreError: 19238:0:(lprocfs_status.c:1644:lprocfs_exp_cleanup()) obd_export's client uuid 039bd606-7bf9-7b26-2866-c58d394d5441 are not found in its nid_stats list

It looks like the TCP connection between the MDS and the client dropped, but when I check with lctl ping it is OK. The exception is that the MDS (and other nodes) cannot ssh to this client. My provisional workaround is to reboot the client, and then it works fine. But after some time the problem happens again (and it can happen with other clients as well). I don't think the cause is the network, because my network is stable. What is happening to my system, and how can I fix it? I know client evictions have been discussed a lot, but I have not found a way to solve my problem.

Thanks,
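For reference, the checks I run from the MDS when this happens look roughly like this (the client address below is a placeholder, since the NIDs are abbreviated in the logs above; the filesystem name "lustre" is taken from the logs):

```shell
# Placeholder client address -- substitute the real NID/IP of the evicted client.
CLIENT_NID=192.168.1.x@tcp
CLIENT_IP=192.168.1.x

# LNET-level reachability from the MDS -- this succeeds for me:
lctl ping $CLIENT_NID

# Plain TCP reachability (ssh) -- this is what fails until the client is rebooted:
ssh $CLIENT_IP uptime

# Current obd_timeout on the MDS (Lustre 1.8 default is 100 seconds,
# which matches the "timer expired after 101s" message above):
cat /proc/sys/lustre/timeout
```

I am also wondering whether raising the timeout on the MGS (e.g. `lctl conf_param lustre.sys.timeout=<secs>`) would only mask the problem rather than fix it, given that ssh to the client fails too.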
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
