Pete, All:

It seems I've stumbled across another deadlock somewhere in the IB code. I'm back to square one with my testing of the eHCAs over pvfs2. The hangs occur at about the same places as they used to (back in April/May?) and still occur at random times. The difference now is that if I let the client-side communication time out (not the pvfs2-client process; this is using libpvfs2), the metadata server crashes, while all the other servers do not. If I put in enormous amounts of debug output, it won't hang, but when I approach 10-15 Gbit/s per client of bandwidth, things start to go bad.

Any ideas here?
Here's a log from the client side of the crash. It appears that nothing is wrong, though. Which state are we in here?

[D 13:49:56.711110] generic_post_recv: rq 0x1011aa80 matches RQ_RTS_WAITING_USER_POST.
[D 13:49:56.711134] generic_post_recv: rq 0x1011aa80 RQ_RTS_WAITING_USER_POST send cts.
[D 13:49:56.711160] memcache_register: hit [0] 0x400010a4000 len 65536 (via 0x400010a4000 len 65536) refcnt now 1.
[D 13:49:56.711183] send_cts: rq 0x1011aa80 from da8:3336 opid 0x54e970 len 65536.
[D 13:49:56.711210] openib_post_rr_ack: da8:3336 bh 10.
[D 13:49:56.711235] openib_post_sr: da8:3336 bh 10 len 48 wr 45187/0.
[D 13:49:56.711270] test_rq: rq 0x1011a730 completed 65536 from da7:3336.
[D 13:49:56.711296] BMI_testcontext completing: 105720
[D 13:49:56.711327] ib_check_cq: found something.
[D 13:49:56.711350] ib_check_cq: sr (ack?) to da8:3336 send completed.
[D 13:49:56.711374] ib_check_cq: found something.
[D 13:49:56.711397] ib_check_cq: ack message da8:3336 my bufnum 10.
[D 13:49:56.711420] encourage_recv_incoming_cts_ack: rq 0x1011aa80 RQ_RTS_WAITING_DATA.
[D 13:49:56.711443] memcache_deregister: dec refcount [0] 0x400010a4000 len 65536 count now 0.
[D 13:49:56.711467] encourage_recv_incoming_cts_ack: rq 0x1011aa80 now RQ_RTS_WAITING_USER_TEST.
[D 13:49:56.711492] test_rq: rq 0x1011aa80 completed 65536 from da8:3336.
[D 13:49:56.711516] BMI_testcontext completing: 105722
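
To answer the "which state" question over a longer log, something like the following might help. It is only a rough sketch: the function name `last_states` and the `completed` label are made up for illustration, and it assumes the debug line formats shown in the excerpt above (an `rq 0x...` address followed by an `RQ_*` state name, or a `test_rq ... completed` entry). It reports the last state seen for each request.

```python
import re

# One debug entry: a "[D hh:mm:ss.usec]" stamp and everything up to the next stamp.
# Entries are matched by stamp rather than by line, so several on one line still work.
ENTRY = re.compile(r"\[D [\d:.]+\] (.*?)(?=\[[DE] [\d:.]+\]|\Z)", re.S)
RQ = re.compile(r"rq (0x[0-9a-f]+)")
STATE = re.compile(r"\bRQ_[A-Z_]+\b")


def last_states(log_text):
    """Return {rq address: last state name seen} for a client debug log."""
    states = {}
    for entry in ENTRY.finditer(log_text):
        body = entry.group(1)
        rq = RQ.search(body)
        if rq is None:
            continue  # entry does not mention a request
        if "test_rq" in body and "completed" in body:
            # "completed" is a label chosen for this sketch, not a BMI state name.
            states[rq.group(1)] = "completed"
            continue
        found = STATE.findall(body)
        if found:
            states[rq.group(1)] = found[-1]
    return states


if __name__ == "__main__":
    import sys

    for rq, state in sorted(last_states(sys.stdin.read()).items()):
        print(rq, state)
```

Fed the full client log on stdin, any request whose last state is not `completed` would be a candidate for the one that is stuck. In the excerpt above, both 0x1011aa80 and 0x1011a730 end up completed.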

And from my metadata server:

[D 13:31:07.647479] PVFS2 Server version 1.5.1 starting.
[E 13:56:46.701579] Job time out: cancelling flow operation, job_id: 139985.
[E 13:56:46.701633] Flow proto cancel called on 0x2aaaaad216b0
[E 13:56:46.701645] Flow proto error cleanup started on 0x2aaaaad216b0, error_code: -1610612737
[E 13:56:46.704647] Flow proto 0x2aaaaad216b0 canceling a total of 1 BMI or Trove operations


--
Kyle Schochenmaier
[EMAIL PROTECTED]
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
