Pete Wyckoff wrote:
[EMAIL PROTECTED] wrote on Fri, 25 Aug 2006 09:11 -0500:
It seems I've stumbled across another deadlock somewhere in the IB code. I'm back to square one with my testing of the eHCAs over pvfs2. The hangs occur at about the same places as they used to (back in April/May?) and still occur at random times.

Maybe the best way to proceed is to get me to repeat the failure.
As much as you can, try to specify the setup and give me any codes I
don't have and I'll try to break things.  If that doesn't work we
can look through your logs some.  Too bad it won't break if
debugging is turned on.

The difference now is that if I let the client-side communication time out (not the pvfs2-client process; this is using libpvfs2 directly), the metadata server crashes while all the other servers stay up.

That's bad, but BMI_Cancel is hopefully not the root of your
problems.  We need to understand why you're getting random hangs.
(Of course, if you have a somewhat repeatable way of getting this to
happen, I'll certainly go fix it.  I did just fix some minor cancel
problem yesterday, so please do your testing on CVS head from now
on.)
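
To make sure we're talking about the same scenario, here is a minimal, self-contained sketch of the test-with-timeout-then-cancel pattern in question. The bmi_*_stub names are stand-ins I made up for this sketch, not the real BMI_testcontext/BMI_cancel signatures:

#include <stdio.h>

typedef long op_id_t;

/* Stub: poll for completion of an outstanding op, waiting up to idle_ms.
 * Returns 1 if it completed, 0 if not (here: pretend it never completes). */
static int bmi_test_stub(op_id_t id, int idle_ms)
{
    (void)id; (void)idle_ms;
    return 0;
}

/* Stub: ask the transport to cancel an outstanding op. */
static void bmi_cancel_stub(op_id_t id)
{
    printf("cancel op %ld\n", id);
}

int main(void)
{
    op_id_t op = 105722;   /* op id taken from the client log below */
    int waited_ms = 0;
    const int poll_ms = 100, limit_ms = 30000;

    while (!bmi_test_stub(op, poll_ms)) {
        waited_ms += poll_ms;
        if (waited_ms >= limit_ms) {
            /* The case under discussion: the client gives up and
             * cancels, after which the metadata server crashes. */
            bmi_cancel_stub(op);
            break;
        }
    }
    return 0;
}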

Any ideas here?
Here's a log from the client side of the crash. It appears that nothing is wrong? Which state are we in here?

[D 13:49:56.711110] generic_post_recv: rq 0x1011aa80 matches RQ_RTS_WAITING_USER_POST.
[D 13:49:56.711134] generic_post_recv: rq 0x1011aa80 RQ_RTS_WAITING_USER_POST send cts.
[D 13:49:56.711160] memcache_register: hit [0] 0x400010a4000 len 65536 (via 0x400010a4000 len 65536) refcnt now 1.
[D 13:49:56.711183] send_cts: rq 0x1011aa80 from da8:3336 opid 0x54e970 len 65536.
[D 13:49:56.711210] openib_post_rr_ack: da8:3336 bh 10.
[D 13:49:56.711235] openib_post_sr: da8:3336 bh 10 len 48 wr 45187/0.
[D 13:49:56.711270] test_rq: rq 0x1011a730 completed 65536 from da7:3336.
[D 13:49:56.711296] BMI_testcontext completing: 105720
[D 13:49:56.711327] ib_check_cq: found something.
[D 13:49:56.711350] ib_check_cq: sr (ack?) to da8:3336 send completed.
[D 13:49:56.711374] ib_check_cq: found something.
[D 13:49:56.711397] ib_check_cq: ack message da8:3336 my bufnum 10.
[D 13:49:56.711420] encourage_recv_incoming_cts_ack: rq 0x1011aa80 RQ_RTS_WAITING_DATA.
[D 13:49:56.711443] memcache_deregister: dec refcount [0] 0x400010a4000 len 65536 count now 0.
[D 13:49:56.711467] encourage_recv_incoming_cts_ack: rq 0x1011aa80 now RQ_RTS_WAITING_USER_TEST.
[D 13:49:56.711492] test_rq: rq 0x1011aa80 completed 65536 from da8:3336.
[D 13:49:56.711516] BMI_testcontext completing: 105722

The client did BMI_recv or similar.  The server had already sent an
RTS asking to send some data.  It would be better to try to prepost
the receive if possible, but things should still work fine.  The
client did BMI_test or equivalent and found that the receive of a 64
kB message from da8 completed fine.  Are there other requests
outstanding from the client's point of view?
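
For reference, here is the receive-side state progression those log lines walk through, as I read it from the messages themselves. The enum below is only a sketch using the state names that appear in the log; it is not copied from the BMI IB module source:

/* Receive-side rendezvous states visible in the log above. */
enum rq_state {
    RQ_RTS_WAITING_USER_POST,  /* server's RTS arrived before the user posted the receive */
    RQ_RTS_WAITING_DATA,       /* CTS sent, memory registered; waiting for data and its ack */
    RQ_RTS_WAITING_USER_TEST   /* data landed, memory deregistered; waiting for BMI_test* pickup */
};

/* Progression for rq 0x1011aa80 in the log:
 *   RQ_RTS_WAITING_USER_POST -> send CTS, memcache_register
 *   RQ_RTS_WAITING_DATA      -> CTS ack arrives, memcache_deregister
 *   RQ_RTS_WAITING_USER_TEST -> BMI_testcontext completes id 105722
 */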

And from my metadata server:

[D 13:31:07.647479] PVFS2 Server version 1.5.1 starting.
[E 13:56:46.701579] Job time out: cancelling flow operation, job_id: 139985.
[E 13:56:46.701633] Flow proto cancel called on 0x2aaaaad216b0
[E 13:56:46.701645] Flow proto error cleanup started on 0x2aaaaad216b0, error_code: -1610612737
[E 13:56:46.704647] Flow proto 0x2aaaaad216b0 canceling a total of 1 BMI or Trove operations

During either a read or a write operation, the server gave up
waiting on the client to send more data or accept more data.  Can't
tell which without more debugging here.

If your clocks are to be trusted, this was well after the successful
little read activity above.
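
The path that server log shows is roughly the following. This is a self-contained sketch; the names are illustrative stand-ins I invented here, not the actual PVFS2 job/flow API:

#include <stdio.h>
#include <time.h>

struct flow { int job_id; int outstanding_ops; };

/* Stub: cancel a flow and the BMI/Trove ops it has in flight. */
static void flow_cancel_stub(struct flow *f)
{
    printf("Flow proto cancel: job %d, canceling %d BMI or Trove operations\n",
           f->job_id, f->outstanding_ops);
}

/* Stub: the timeout check the "Job time out" log line implies. */
static void job_check_timeout_stub(struct flow *f, time_t started, int limit_s)
{
    if (time(NULL) - started > limit_s)
        flow_cancel_stub(f);   /* "Job time out: cancelling flow operation" */
}

int main(void)
{
    struct flow f = { 139985, 1 };   /* job_id and op count from the log */
    /* pretend the job started 40 s ago with a 30 s limit */
    job_check_timeout_stub(&f, time(NULL) - 40, 30);
    return 0;
}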

                -- Pete

I should have specified: my clocks are off, sorry. I'll build against CVS head now and see what the issue is. Also, Pete, I have a way of reproducing the problem via my netpipe module; I'll get things pulled together for that and send it your way, hopefully this afternoon.

The setup is as follows:
6x AMD64 servers, 1x PPC64 client (cross-compiling to 64-bit).
1 metadata server, which is also one of the 6 AMD64 nodes.

What would you recommend for debug output, i.e. which flags? Right now I'm enabling the client and network debugging, which makes for a lot of output to begin with.

Thanks,

~Kyle


--
Kyle Schochenmaier
[EMAIL PROTECTED]
Research Assistant, Dr. Brett Bode
AmesLab - US Dept. Energy
Scalable Computing Laboratory
