[Pvfs2-developers] BMI-IB hang

Kyle Schochenmaier Fri, 25 Aug 2006 07:11:01 -0700

Pete, All:

It seems I've stumbled across another deadlock somewhere in the IB code. I'm back to square one with my testing of the eHCAs over pvfs2. The hangs occur at about the same places as they used to (back in April/May?) and still occur at random times. The difference now is that if I let the client-side communication time out (not the pvfs2-client process; this is using libpvfs2), the metadata server crashes, while all the other servers do not. If I put in enormous amounts of debug output, it won't hang, but when I approach 10-15 Gbit/client of bandwidth, things start to go bad.


Any ideas here??

Here's a log from the client side of the crash. It appears that nothing is wrong? Which state are we in here?

[D 13:49:56.711110] generic_post_recv: rq 0x1011aa80 matches RQ_RTS_WAITING_USER_POST.
[D 13:49:56.711134] generic_post_recv: rq 0x1011aa80 RQ_RTS_WAITING_USER_POST send cts.
[D 13:49:56.711160] memcache_register: hit [0] 0x400010a4000 len 65536 (via 0x400010a4000 len 65536) refcnt now 1.
[D 13:49:56.711183] send_cts: rq 0x1011aa80 from da8:3336 opid 0x54e970 len 65536.
[D 13:49:56.711210] openib_post_rr_ack: da8:3336 bh 10.
[D 13:49:56.711235] openib_post_sr: da8:3336 bh 10 len 48 wr 45187/0.
[D 13:49:56.711270] test_rq: rq 0x1011a730 completed 65536 from da7:3336.
[D 13:49:56.711296] BMI_testcontext completing: 105720
[D 13:49:56.711327] ib_check_cq: found something.
[D 13:49:56.711350] ib_check_cq: sr (ack?) to da8:3336 send completed.
[D 13:49:56.711374] ib_check_cq: found something.
[D 13:49:56.711397] ib_check_cq: ack message da8:3336 my bufnum 10.
[D 13:49:56.711420] encourage_recv_incoming_cts_ack: rq 0x1011aa80 RQ_RTS_WAITING_DATA.
[D 13:49:56.711443] memcache_deregister: dec refcount [0] 0x400010a4000 len 65536 count now 0.
[D 13:49:56.711467] encourage_recv_incoming_cts_ack: rq 0x1011aa80 now RQ_RTS_WAITING_USER_TEST.
[D 13:49:56.711492] test_rq: rq 0x1011aa80 completed 65536 from da8:3336.
[D 13:49:56.711516] BMI_testcontext completing: 105722
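
One way to answer "which state are we in" from a longer capture is to track the last RQ_* state logged for each rq pointer and list the ones that never reach a "test_rq: ... completed" line. The following is a minimal sketch, not part of PVFS2; it assumes only the line shapes visible in the excerpt above, and the function names are hypothetical.

```python
import re

# Patterns are assumptions based on the debug lines quoted in this message.
RQ_RE = re.compile(r"rq (0x[0-9a-f]+)")   # request pointer
STATE_RE = re.compile(r"RQ_[A-Z_]+")      # state names such as RQ_RTS_WAITING_DATA


def last_states(lines):
    """Return ({rq pointer: last RQ_* state seen}, set of completed rq pointers)."""
    states, completed = {}, set()
    for line in lines:
        rq = RQ_RE.search(line)
        if rq is None:
            continue
        ptr = rq.group(1)
        found = STATE_RE.findall(line)
        if found:
            # If several states appear on one line, the last one is the newest.
            states[ptr] = found[-1]
        if "test_rq:" in line and " completed " in line:
            completed.add(ptr)
    return states, completed


def stuck(lines):
    """Requests that logged a state but never logged a test_rq completion."""
    states, completed = last_states(lines)
    return {ptr: st for ptr, st in states.items() if ptr not in completed}
```

Run against the excerpt above, this reports no pending request, since rq 0x1011aa80 reaches "test_rq: ... completed"; a hung transfer would more likely show up as an rq left in one of the WAITING states at the end of the full log.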

And from my metadata server:

[D 13:31:07.647479] PVFS2 Server version 1.5.1 starting.
[E 13:56:46.701579] Job time out: cancelling flow operation, job_id: 139985.
[E 13:56:46.701633] Flow proto cancel called on 0x2aaaaad216b0
[E 13:56:46.701645] Flow proto error cleanup started on 0x2aaaaad216b0, error_code: -1610612737
[E 13:56:46.704647] Flow proto 0x2aaaaad216b0 canceling a total of 1 BMI or Trove operations
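
The error_code above is easier to read in hex. The sketch below only does the arithmetic: it splits the magnitude into its high bits and a small low code. Treating bits 29 and 30 as flag bits is an assumption about how PVFS2 packs its error values, not something checked against the 1.5.1 headers; split_error and flag_mask are hypothetical names.

```python
def split_error(code, flag_mask=0x60000000):
    """Split a negative error value into (high flag bits, low code).

    flag_mask (bits 29 and 30) is an assumed layout, not a confirmed one.
    """
    magnitude = -code
    return magnitude & flag_mask, magnitude & ~flag_mask


flags, low = split_error(-1610612737)
print(hex(flags), low)  # prints: 0x60000000 1
```

Under that assumed layout the value is low code 1 with both high bits set, which would be consistent with the flow having been cancelled by the job timeout rather than failing with an ordinary errno value.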



--
Kyle Schochenmaier
[EMAIL PROTECTED]
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy

Scalable Computing Laboratory

_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
