Re: how to debug (mlx4) CQ overrun

2011-10-08 Thread Wendy Cheng
On Fri, Sep 23, 2011 at 2:30 PM, Jason Gunthorpe
jguntho...@obsidianresearch.com wrote:

> There are not really any tools, but this is usually straightforward to
> look at from your app.


Many thanks for the response. It helped us confirm that our CQ handling
logic was OK. The issue turned out to be build related. After doing a
clean rebuild of the OFED IB modules with the modified header files, the
problem went away. The header file change was a result of exporting
kernel FMR (fast memory registration) to user space for an
experimental project.

Again, thank you for the write-up. It is much appreciated.

-- Wendy
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


how to debug (mlx4) CQ overrun

2011-09-23 Thread Wendy Cheng
I have a test program that does RDMA reads and writes as follows:

node A: server listens for and handles connection requests
  - sets up a piece of memory initialized to 0
node B: two processes, parent and child

child:
  1. set up a new channel with the server, including a CQ with 1024 entries
     (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
  2. RDMA sequential write (8192 bytes at a time) to server memory
  4. sync with parent

parent:
  1. set up a new channel with the server, including a CQ with 1024 entries
     (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
  3. RDMA sequential read (8192 bytes at a time) of the same piece of
     memory from the server
     - check the buffer contents
     - if the memory content is still zero, re-read
  4. sync with child

The parent hangs (while the child finishes its writes) after the
following message appears in /var/log/messages:
 mlx4_core :06:00.0: CQ overrun on CQN 87

I have my own counters that restrict the reads (and writes) to 512 max.
Both write and read are blocking (i.e. the CQ is polled after each
read/write). I suspect I do not have the CQ poll logic correct. The
question is: is there any diagnostic tool available to check the
internal counters (and/or state) of the ibverbs library and/or kernel
drivers (to help debug RDMA applications)? In my case, it hangs
around block 14546 (i.e. after 14546*8192 bytes).

Thanks,
Wendy


Re: how to debug (mlx4) CQ overrun

2011-09-23 Thread Jason Gunthorpe
On Fri, Sep 23, 2011 at 02:15:30PM -0700, Wendy Cheng wrote:

> I have my own counters that restrict the reads (and writes) to 512 max.
> Both write and read are blocking (i.e. the CQ is polled after each
> read/write). I suspect I do not have the CQ poll logic correct. The
> question is: is there any diagnostic tool available to check the
> internal counters (and/or state) of the ibverbs library and/or kernel
> drivers (to help debug RDMA applications)? In my case, it hangs
> around block 14546 (i.e. after 14546*8192 bytes).

There are not really any tools, but this is usually straightforward to
look at from your app.

Every time you post to the send Q, increment a counter (A). For each
completion you get back from ibv_poll_cq, increment another counter (B).

The difference (A - B) must never exceed the number of entries in the
CQ, and it must not exceed the number of entries in the send Q (very
important).
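
The two counters above can be sketched in plain C roughly as follows.
This is only an illustration under assumptions: the names wr_tracker,
tracker_can_post, tracker_on_post and tracker_on_poll are hypothetical
and not part of libibverbs, and the real ibv_post_send / ibv_poll_cq
calls appear only in comments so the sketch builds without the library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one send queue and its CQ
 * (an illustration, not a libibverbs type). */
struct wr_tracker {
    uint64_t posted;    /* A: work requests accepted by ibv_post_send */
    uint64_t completed; /* B: completions returned by ibv_poll_cq */
    uint32_t cq_depth;  /* cqe value given to ibv_create_cq, e.g. 1024 */
    uint32_t sq_depth;  /* cap.max_send_wr requested when creating the QP */
};

/* A - B: work requests posted but not yet reaped from the CQ. */
static inline uint64_t tracker_outstanding(const struct wr_tracker *t)
{
    return t->posted - t->completed;
}

/* True if one more signaled WR fits in both the send queue and the CQ. */
static inline bool tracker_can_post(const struct wr_tracker *t)
{
    uint32_t limit = t->cq_depth < t->sq_depth ? t->cq_depth : t->sq_depth;
    return tracker_outstanding(t) < limit;
}

/* Call after ibv_post_send() has returned 0 for a single signaled WR. */
static inline void tracker_on_post(struct wr_tracker *t)
{
    t->posted++;
}

/* Call with the return value of ibv_poll_cq(). */
static inline void tracker_on_poll(struct wr_tracker *t, int n)
{
    if (n <= 0)
        return; /* 0: CQ empty; negative: poll error, left to the caller */
    t->completed += (uint64_t)n;
    assert(t->completed <= t->posted); /* more completions than posts is a bug */
}
```

The intended use is to test tracker_can_post() before each post and,
when it returns false, to poll the CQ until it returns true again
instead of posting.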

This assumes you are posting everything with IBV_SEND_SIGNALED. Doing
otherwise is basically the same, but managing the CQ counter is a bit
more complex, as each completion represents multiple send Q entries.
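
For the mixed signaled/unsignaled case, one possible way to do the
accounting is sketched below. It assumes completions for a single send
queue arrive in posting order, so that a signaled completion also
retires the unsignaled entries posted before it. All names here
(batch_tracker, batch_on_post, ...) are hypothetical, and carrying the
"covered" count in wr_id is just one option, not something the verbs
API prescribes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accounting when only some WRs are signaled. */
struct batch_tracker {
    uint32_t sq_in_use;     /* send Q entries posted and not yet retired */
    uint32_t cq_in_use;     /* signaled WRs whose completion is not yet polled */
    uint32_t pending_unsig; /* unsignaled WRs posted since the last signaled one */
    uint32_t sq_depth;      /* send queue capacity */
    uint32_t cq_depth;      /* CQ capacity */
};

/* Every WR needs a send Q slot; a signaled WR also needs a CQ slot. */
static inline bool batch_can_post(const struct batch_tracker *t, bool signaled)
{
    if (t->sq_in_use >= t->sq_depth)
        return false;
    return !signaled || t->cq_in_use < t->cq_depth;
}

/* Record a successful post. For a signaled WR, returns how many send Q
 * entries its completion will retire (itself plus the unsignaled WRs
 * before it); the caller could stash that value in wr_id. Returns 0 for
 * an unsignaled WR. */
static inline uint64_t batch_on_post(struct batch_tracker *t, bool signaled)
{
    t->sq_in_use++;
    if (!signaled) {
        t->pending_unsig++;
        return 0;
    }
    uint64_t covered = (uint64_t)t->pending_unsig + 1;
    t->pending_unsig = 0;
    t->cq_in_use++;
    return covered;
}

/* Record one polled completion that covers `covered` send Q entries. */
static inline void batch_on_completion(struct batch_tracker *t, uint64_t covered)
{
    assert(covered >= 1 && covered <= t->sq_in_use && t->cq_in_use > 0);
    t->sq_in_use -= (uint32_t)covered;
    t->cq_in_use--;
}
```

With this scheme a signaled WR has to be posted before the send Q
fills up, since otherwise no completion ever arrives to free the
unsignaled entries.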

Make sure you check for error codes from ibv_post_send.
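
A minimal sketch of such a check follows. It relies on the documented
behaviour that ibv_post_send returns 0 on success and an errno-style
value on failure; the helper name check_post_send and the wr_seq
argument are made up for illustration, and the return code is passed
in as a plain int so the snippet builds without libibverbs.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* rc is the value returned by ibv_post_send(qp, &wr, &bad_wr);
 * wr_seq is the application's own running count of posts (hypothetical),
 * logged so a failure can be matched to a block number.
 * Returns 0 if the post succeeded, -1 otherwise. */
static inline int check_post_send(int rc, unsigned long long wr_seq)
{
    if (rc == 0)
        return 0;
    fprintf(stderr, "ibv_post_send failed at WR %llu: %s (%d)\n",
            wr_seq, strerror(rc), rc);
    return -1;
}
```

A failed post should not be counted as posted. Some providers appear
to report a full send queue as ENOMEM, but treat that as an assumption
to verify against the driver in use.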

Jason