Gah! OK, no problem.

No risk of data loss, right? And is there any way to 'limp along'
until an outage without rebooting the OSTs?
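
One idea I had for limping along, untested and possibly off base:
rather than rebooting the whole OSS, evict just the stuck client from
the affected OST through its proc entry (assuming the 1.6 evict_client
interface works the way I think it does; <client-uuid> is a
placeholder for the UUID named in the 'refuse reconnection' message):

  echo <client-uuid> > /proc/fs/lustre/obdfilter/nobackup-OST000d/evict_client

Would that clear the two stuck RPCs, or just make matters worse?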

Thanks for the insight!

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Jan 14, 2009, at 7:27 PM, Cliff White wrote:

> Brock Palen wrote:
>> I am having servers LBUG on a regular basis. Clients are running
>> 1.6.6 patchless on RHEL4; servers are running RHEL4 with 1.6.5.1
>> RPMs from the download page. All connections are over Ethernet,
>> and the servers are x4600s.
>
> This looks like bug 16496, which is fixed in 1.6.6. You should
> upgrade your servers to 1.6.6.
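>
> If it helps, a quick sanity check of what each node actually has,
> before and after the upgrade (the proc file below is the 1.6.x one):
>
>   rpm -qa | grep -i lustre        # RPMs installed
>   cat /proc/fs/lustre/version     # version the loaded modules report
>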
> cliffw
>
>> The OSS that LBUG'd has the following in its log:
>> Jan 13 16:35:39 oss2 kernel: LustreError: 10243:0:(ldlm_lock.c:430:__ldlm_handle2lock()) ASSERTION(lock->l_resource != NULL) failed
>> Jan 13 16:35:39 oss2 kernel: LustreError: 10243:0:(tracefile.c:432:libcfs_assertion_failed()) LBUG
>> Jan 13 16:35:39 oss2 kernel: Lustre: 10243:0:(linux-debug.c:167:libcfs_debug_dumpstack()) showing stack for process 10243
>> Jan 13 16:35:39 oss2 kernel: ldlm_cn_08    R  running task       0  10243      1         10244  7776 (L-TLB)
>> Jan 13 16:35:39 oss2 kernel: 0000000000000000 ffffffffa0414629 00000103d83c7e00 0000000000000000
>> Jan 13 16:35:39 oss2 kernel:        00000101f8c88d40 ffffffffa021445e 00000103e315dd98 0000000000000001
>> Jan 13 16:35:39 oss2 kernel:        00000101f3993ea0 0000000000000000
>> Jan 13 16:35:39 oss2 kernel: Call Trace:<ffffffffa0414629>{:ptlrpc:ptlrpc_server_handle_request+2457}
>> Jan 13 16:35:39 oss2 kernel:        <ffffffffa021445e>{:libcfs:lcw_update_time+30} <ffffffff80133855>{__wake_up_common+67}
>> Jan 13 16:35:39 oss2 kernel:        <ffffffffa0416d05>{:ptlrpc:ptlrpc_main+3989} <ffffffffa0415270>{:ptlrpc:ptlrpc_retry_rqbds+0}
>> Jan 13 16:35:39 oss2 kernel:        <ffffffffa0415270>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa0415270>{:ptlrpc:ptlrpc_retry_rqbds+0}
>> Jan 13 16:35:39 oss2 kernel:        <ffffffff80110de3>{child_rip+8} <ffffffffa0415d70>{:ptlrpc:ptlrpc_main+0}
>> Jan 13 16:35:39 oss2 kernel:        <ffffffff80110ddb>{child_rip+0}
>> Jan 13 16:35:40 oss2 kernel: LustreError: dumping log to /tmp/lustre-log.1231882539.10243
>> At the same time a client (nyx346) lost contact with that OSS and
>> is never allowed to reconnect.
>> Client /var/log/messages:
>> Jan 13 16:37:20 nyx346 kernel: Lustre: nobackup-OST000d-osc-000001022c2a7800: Connection to service nobackup-OST000d via nid 10.164.3....@tcp was lost; in progress operations using this service will wait for recovery to complete.
>> Jan 13 16:37:20 nyx346 kernel: Lustre: Skipped 6 previous similar messages
>> Jan 13 16:37:20 nyx346 kernel: LustreError: 3889:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
>> Jan 13 16:37:20 nyx346 kernel: LustreError: 3889:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
>> Jan 13 16:37:20 nyx346 kernel: LustreError: 11-0: an error occurred while communicating with 10.164.3....@tcp. The ost_connect operation failed with -16
>> Jan 13 16:37:20 nyx346 kernel: LustreError: Skipped 10 previous similar messages
>> Jan 13 16:37:45 nyx346 kernel: Lustre: 3849:0:(import.c:410:import_select_connection()) nobackup-OST000d-osc-000001022c2a7800: tried all connections, increasing latency to 7s
>> Even now the server (OSS) is refusing connections to OST000d, with
>> the message:
>> Lustre: 9631:0:(ldlm_lib.c:760:target_handle_connect()) nobackup-OST000d: refuse reconnection from 145a1ec5-07ef-f7eb-0ca9-2a2b6503e...@10.164.1.90@tcp to 0x00000103d5ce7000; still busy with 2 active RPCs
>> If I reboot the OSS, the OSTs on it go through recovery as normal,
>> and then the client is fine.
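>> For reference, recovery progress can be watched from the OSS side
>> while this happens (proc path assumed for 1.6.x):
>>   cat /proc/fs/lustre/obdfilter/nobackup-OST000d/recovery_status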
>> The network looks clean; I found one machine with lots of dropped
>> packets between the servers, but it is not the client in question.
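>> For anyone who wants to repeat the check, per-interface drop counts
>> are visible with plain netstat:
>>   netstat -i    # watch the RX-DRP / TX-DRP columns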
>> Thank you! If it happens again and I find any other data, I will
>> let you know.
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
