Gah! OK, no problem. There's no risk of data loss, right? And is there any way to 'limp along' until an outage window without rebooting the OSTs?
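For what it's worth, I'm wondering whether just marking the troubled OST inactive on the affected clients would be a sane way to limp along until then, something like the sketch below. The device number is only an example ('lctl dl' shows the real one on each node); please correct me if this is the wrong approach for 1.6:

    # Find the OSC device for the troubled OST on the client:
    lctl dl | grep nobackup-OST000d
    #   e.g. "14 UP osc nobackup-OST000d-osc-000001022c2a7800 ..."  (14 is just an example)
    # Mark it inactive so I/O to that OST fails fast instead of hanging:
    lctl --device 14 deactivate
    # ...and re-enable it once the OSS has been rebooted and recovery completes:
    lctl --device 14 activate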
Thanks for the insight!

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

On Jan 14, 2009, at 7:27 PM, Cliff White wrote:

> Brock Palen wrote:
>> I am having servers LBUG on a regular basis. Clients are running
>> 1.6.6 patchless on RHEL4; servers are running RHEL4 with the 1.6.5.1
>> RPMs from the download page. All connections are over Ethernet, and
>> the servers are x4600s.
>
> This looks like bug 16496, which is fixed in 1.6.6. You should upgrade
> your servers to 1.6.6.
> cliffw
>
>> The OSS that LBUG'd has this in its log:
>>
>> Jan 13 16:35:39 oss2 kernel: LustreError: 10243:0:(ldlm_lock.c:430:__ldlm_handle2lock()) ASSERTION(lock->l_resource != NULL) failed
>> Jan 13 16:35:39 oss2 kernel: LustreError: 10243:0:(tracefile.c:432:libcfs_assertion_failed()) LBUG
>> Jan 13 16:35:39 oss2 kernel: Lustre: 10243:0:(linux-debug.c:167:libcfs_debug_dumpstack()) showing stack for process 10243
>> Jan 13 16:35:39 oss2 kernel: ldlm_cn_08 R running task 0 10243 1 10244 7776 (L-TLB)
>> Jan 13 16:35:39 oss2 kernel: 0000000000000000 ffffffffa0414629 00000103d83c7e00 0000000000000000
>> Jan 13 16:35:39 oss2 kernel: 00000101f8c88d40 ffffffffa021445e 00000103e315dd98 0000000000000001
>> Jan 13 16:35:39 oss2 kernel: 00000101f3993ea0 0000000000000000
>> Jan 13 16:35:39 oss2 kernel: Call Trace:<ffffffffa0414629>{:ptlrpc:ptlrpc_server_handle_request+2457}
>> Jan 13 16:35:39 oss2 kernel: <ffffffffa021445e>{:libcfs:lcw_update_time+30} <ffffffff80133855>{__wake_up_common+67}
>> Jan 13 16:35:39 oss2 kernel: <ffffffffa0416d05>{:ptlrpc:ptlrpc_main+3989} <ffffffffa0415270>{:ptlrpc:ptlrpc_retry_rqbds+0}
>> Jan 13 16:35:39 oss2 kernel: <ffffffffa0415270>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa0415270>{:ptlrpc:ptlrpc_retry_rqbds+0}
>> Jan 13 16:35:39 oss2 kernel: <ffffffff80110de3>{child_rip+8} <ffffffffa0415d70>{:ptlrpc:ptlrpc_main+0}
>> Jan 13 16:35:39 oss2 kernel: <ffffffff80110ddb>{child_rip+0}
>> Jan 13 16:35:40 oss2 kernel: LustreError: dumping log to /tmp/lustre-log.1231882539.10243
>>
>> At the same time a client (nyx346) lost contact with that OSS and is
>> never allowed to reconnect.
>>
>> Client /var/log/messages:
>>
>> Jan 13 16:37:20 nyx346 kernel: Lustre: nobackup-OST000d-osc-000001022c2a7800: Connection to service nobackup-OST000d via nid 10.164.3....@tcp was lost; in progress operations using this service will wait for recovery to complete.
>> Jan 13 16:37:20 nyx346 kernel: Lustre: Skipped 6 previous similar messages
>> Jan 13 16:37:20 nyx346 kernel: LustreError: 3889:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
>> Jan 13 16:37:20 nyx346 kernel: LustreError: 3889:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
>> Jan 13 16:37:20 nyx346 kernel: LustreError: 11-0: an error occurred while communicating with 10.164.3....@tcp. The ost_connect operation failed with -16
>> Jan 13 16:37:20 nyx346 kernel: LustreError: Skipped 10 previous similar messages
>> Jan 13 16:37:45 nyx346 kernel: Lustre: 3849:0:(import.c:410:import_select_connection()) nobackup-OST000d-osc-000001022c2a7800: tried all connections, increasing latency to 7s
>>
>> Even now the server (OSS) is refusing connections to OST000d, with this message:
>>
>> Lustre: 9631:0:(ldlm_lib.c:760:target_handle_connect()) nobackup-OST000d: refuse reconnection from 145a1ec5-07ef-f7eb-0ca9-2a2b6503e...@10.164.1.90@tcp to 0x00000103d5ce7000; still busy with 2 active RPCs
>>
>> If I reboot the OSS, the OSTs on it go through recovery like normal,
>> and then the client is fine.
>>
>> The network looks clean; I found one machine with lots of dropped
>> packets between the servers, but that is not the client in question.
>>
>> Thank you! If it happens again and I find any other data, I will let
>> you know.
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss