Oleg,
Thanks for your response.
See my responses inline.

lisa

On 7/14/11 2:47 PM, Oleg Drokin wrote:
Hello!

On Jul 14, 2011, at 1:59 PM, Lisa Giacchetti wrote:

Jul  7 07:10:08 cmsls6 kernel: Lustre: 
15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
Jul  7 07:59:42 cmsls6 kernel: Lustre: 
3272:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905647245 sent from cmsprod1-OST002d to NID 131.225.191.35@tcp 7s ago 
has timed out (7s prior to deadline).
Jul  7 07:59:42 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client 
on nid 131.225.191.35@tcp was evicted due to a lock completion callback to 
131.225.191.35@tcp timed out: rc -107
Jul  7 09:26:58 cmsls6 kernel: Lustre: 
15433:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
9235f65e-ff71-2b1f-60fb-c049cbad5728 reconnecting
Jul  7 09:53:50 cmsls6 kernel: Lustre: 
2663:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905668862 sent from cmsprod1-OST002d to NID 131.225.204.88@tcp 7s ago 
has timed out (7s prior to deadline).
Jul  7 09:53:50 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client 
on nid 131.225.204.88@tcp was evicted due to a lock blocking callback to 
131.225.204.88@tcp timed out: rc -107
Jul  7 10:18:57 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client 
on nid 131.225.207.176@tcp was evicted due to a lock blocking callback to 
131.225.207.176@tcp timed out: rc -107
Jul  7 10:23:01 cmsls6 kernel: Lustre: 
15405:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905675944 sent from cmsprod1-OST002d to NID 131.225.204.118@tcp 7s ago 
has timed out (7s prior to deadline).
Jul  7 11:06:31 cmsls6 kernel: Lustre: 
15341:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 
e25b2761-680a-4d94-ed2c-10913403c0a3 reconnecting
Jul  7 12:26:17 cmsls6 kernel: Lustre: 
15352:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request 
x1359120905703492 sent from cmsprod1-OST002d to NID 131.225.190.151@tcp 7s ago 
has timed out (7s prior to deadline).
Jul  7 12:26:17 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client 
on nid 131.225.190.151@tcp was evicted due to a lock blocking callback to 
131.225.190.151@tcp timed out: rc -107
Jul  7 12:26:17 cmsls6 kernel: LustreError: 
15352:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on destroyed export 
ffff810c3926f400 ns: filter-cmsprod1-OST002d_UUID lock: 
ffff8109c7f21a00/0xf22d54118e04e04d lrc: 3/0,0 mode: --/PW res: 337742/0 rrc: 2 type: 
EXT [0->1048575] (req 0->1048575) flags: 0x0 remote: 0x6c03f21f59f6b4e6 expref: 
19 pid: 15352 timeout 0
Jul  7 12:26:17 cmsls6 kernel: Lustre: 
2740:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk IO 
comm error with f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x2000083e1be97_UUID 
id 12345-131.225.190.151@tcp - client will retry
Jul  7 12:26:19 cmsls6 kernel: Lustre: 
2742:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk IO 
comm error with f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x2000083e1be97_UUID 
id 12345-131.225.190.151@tcp - client will retry


Some of these errors seem really bad, like the bulk IO comm error or the 
eviction due to a lock callback.
What should I be looking for here?  I have determined that some of the 
messages saying a client has been evicted because the OSS thinks it's dead 
are not due to the system being down. So what makes the OSS think the client 
is dead?

Well, the clients become unresponsive for some reason; you really need to look 
at the client-side logs for some clues on that.

I have been doing this as I was waiting for a reply and going through the manual and lustre-discuss archives.
Here is an example of one of the client's logs during the appropriate time frame:

Jul 7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred while communicating with 131.225.191.164@tcp. The obd_ping operation failed with -107
Jul 7 11:55:33 cmswn1526 kernel: Lustre: cmsprod1-OST0033-osc-ffff810617966400: Connection to service cmsprod1-OST0033 via nid 131.225.191.164@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jul 7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred while communicating with 131.225.191.164@tcp. The ost_write operation failed with -107
Jul 7 11:55:35 cmswn1526 kernel: LustreError: 167-0: This client was evicted by cmsprod1-OST0033; in progress operations using this service will fail.
Jul 7 11:55:35 cmswn1526 kernel: LustreError: 3750:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81031d414400 x1373265269802511/t0 o4->cmsprod1-OST0033_UUID@131.225.191.164@tcp:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
Jul 7 11:55:35 cmswn1526 kernel: Lustre: cmsprod1-OST0033-osc-ffff810617966400: Connection restored to service cmsprod1-OST0033 using nid 131.225.191.164@tcp.
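
To line up the OSS evictions with the client-side logs, something like the following Python sketch may help. It is not from the thread: it only assumes the "138-a" eviction message format shown in the OSS log above, with each syslog entry on a single line, and the function name find_evictions is made up for illustration. It pulls out the timestamp, target, client NID, and callback type so the matching client log can be checked around the same time.

```python
import re

# Matches the "138-a" eviction message as it appears in the OSS log above.
# Assumption: each syslog entry is one line (the wrapping in the mail is
# only an artifact of the mail client).
EVICT_RE = re.compile(
    r"^(?P<ts>\w{3} \d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"LustreError: 138-a: (?P<target>\S+): A client on nid (?P<nid>\S+) "
    r"was evicted due to a lock (?P<kind>\w+) callback"
)


def find_evictions(lines):
    """Return one dict per eviction message found in the given log lines."""
    evictions = []
    for line in lines:
        # Collapse runs of whitespace so "Jul  7" and "Jul 7" both match.
        normalized = " ".join(line.split())
        match = EVICT_RE.match(normalized)
        if match:
            evictions.append(match.groupdict())
    return evictions


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python find_evictions.py /var/log/messages
    with open(sys.argv[1], errors="replace") as log:
        for ev in find_evictions(log):
            print(ev["ts"], ev["target"], ev["nid"], ev["kind"])
```

The output is a list of (time, OST, NID, callback type) entries; the NID identifies which client's log to inspect.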

Also is there any way to determine what files are involved in these errors?
Well, the lock blocking callback message will provide you with the OST number 
and object index, which you might be able to back-reference to a file.

I know there is a way to do this from the /proc file system (at least I think it's /proc), but I can't find any reference to it
in the book I got from class or in the manual.
Can someone refresh my memory?
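
As a starting point for the back-reference, here is a hedged Python sketch (not from the thread) that extracts the OST index and object number from a lock dump line like the "lock on destroyed export" entry above. It assumes that the four hex digits after "OST" in the namespace name are the OST index and that the first number after "res:" is the object id on that OST; the function name lock_resource is made up for illustration. The resulting pair could then be compared against the obdidx/objid columns that lfs getstripe prints for a candidate file.

```python
import re

# Pulls the filesystem name, OST index (hex), and resource id out of a lock
# dump such as:
#   ... ns: filter-cmsprod1-OST002d_UUID lock: ... res: 337742/0 rrc: 2 ...
# Assumption: "res: A/B" is the object id followed by its group.
LOCK_RE = re.compile(
    r"ns: filter-(?P<fsname>[^-\s]+)-OST(?P<ost_hex>[0-9a-fA-F]{4})_UUID"
    r".*?res: (?P<objid>\d+)/(?P<group>\d+)"
)


def lock_resource(line):
    """Return fsname, OST index, object id and group, or None if no match."""
    normalized = " ".join(line.split())
    match = LOCK_RE.search(normalized)
    if match is None:
        return None
    return {
        "fsname": match.group("fsname"),
        "ost_index": int(match.group("ost_hex"), 16),  # OST002d -> 45
        "objid": int(match.group("objid")),
        "group": int(match.group("group")),
    }
```

This only narrows things down to an (OST index, object id) pair; mapping that pair to a path still requires checking file stripe information on a client.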

All that said, 1.8.3 is quite old and I think it would be a much better idea to 
try 1.8.6 and see if it improves things.

Downtimes are few and far between for us, so this may take a while to get scheduled.
If there is anything that can be done in the meantime, I'd like to try it.

lisa

Bye,
     Oleg
--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.



_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
