Re: how to debug (mlx4) CQ overrun
On Fri, Sep 23, 2011 at 2:30 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote:

There are not really any tools, but this is usually straightforward to look at from your app.

Great thanks for the response. It helped (to ensure our cq handling logic was ok). The issue turns out to be build related. After doing a clean rebuild of OFED IB modules with the modified header files, the problem went away. The (header file) change was a result of exporting kernel FMR (fast memory registration) to user space for an experimental project.

Again, thank you for the write-up. It is very appreciated.

-- Wendy

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
how to debug (mlx4) CQ overrun
I have a test program that does RDMA read-write as follows:

node A: server
  - listens and handles connection requests
  - sets up a piece of memory initialized to 0

node B: two processes (parent and child)

child:
  1. set up a new channel with the server, including a CQ with 1024 entries (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
  2. RDMA sequential write (8192 bytes at a time) to server memory
  4. sync with parent

parent:
  1. set up a new channel with the server, including a CQ with 1024 entries (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
  3. RDMA sequential read (8192 bytes at a time) of the same piece of memory from the server
     - check the buffer contents
     - if the memory content is still zero, re-read
  4. sync with child

The parent hangs (but the child finishes its write) after the following pops up in /var/log/messages:

mlx4_core :06:00.0: CQ overrun on CQN 87

I have my own counters that restrict the reads (and writes) to 512 max. Both write and read are blocking (i.e. the cq is polled after each read/write). I suspect I do not have the cq poll logic correct. The question here is: is there any diag tool available to check the internal counters (and/or states) of the ibverbs library and/or kernel drivers (to help debug RDMA applications)? In my case, it hangs around block 14546 (i.e. after 14546*8192 bytes).

Thanks,
Wendy
Re: how to debug (mlx4) CQ overrun
On Fri, Sep 23, 2011 at 02:15:30PM -0700, Wendy Cheng wrote:

I have my own counters that restrict the read (and write) to 512 max. Both write and read are blocking (i.e. cq is polled after each read/write). I suspect I do not have the cq poll logic correct. The question here is .. is there any diag tool available to check on the internal counters (and /or states) of ibverbs library and/or kernel drivers (to help RDMA applications debug) ? In my case, it hangs around 14546 block (i.e. after 14546*8192 bytes).

There are not really any tools, but this is usually straightforward to look at from your app.

Every time you post to the send Q, increment a counter (A). Every time you get something back from ibv_poll_cq, increment another counter (B). The difference (A - B) must never exceed the number of entries in the CQ, and it must not exceed the number of entries in the send Q (very important).

This assumes you are posting everything with IBV_SEND_SIGNALED. Doing otherwise is basically the same, but there is a bit more complexity in managing the CQ counter, as each completion then represents multiple send Q entries.

Make sure you check for error codes from ibv_post_send.

Jason
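The counter check described above could be sketched roughly as follows. This is only an illustration of the bookkeeping, not code from the thread: the struct and the tracker_* helpers are hypothetical names, nothing here calls into libibverbs, and it assumes every work request is posted with IBV_SEND_SIGNALED so that one completion corresponds to one send Q entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping struct; none of these names exist in libibverbs. */
struct wr_tracker {
    uint64_t posted;    /* A: work requests accepted by ibv_post_send() */
    uint64_t completed; /* B: completions returned by ibv_poll_cq() */
    uint32_t cq_depth;  /* CQ size requested from ibv_create_cq() */
    uint32_t sq_depth;  /* send Q size (max_send_wr) requested for the QP */
};

/* A - B: work requests posted but not yet reaped from the CQ. */
static uint64_t tracker_outstanding(const struct wr_tracker *t)
{
    return t->posted - t->completed;
}

/* True if one more signaled work request keeps (A - B) within both limits. */
static bool tracker_can_post(const struct wr_tracker *t)
{
    uint32_t limit = t->cq_depth < t->sq_depth ? t->cq_depth : t->sq_depth;
    return tracker_outstanding(t) < limit;
}

/* Call only after ibv_post_send() has returned 0 for a single work request. */
static void tracker_on_post(struct wr_tracker *t)
{
    t->posted++;
}

/* Call with the return value of ibv_poll_cq(); zero or negative adds nothing. */
static void tracker_on_completions(struct wr_tracker *t, int n)
{
    if (n > 0)
        t->completed += (uint64_t)n;
    /* More completions than posts means the accounting itself is broken. */
    assert(t->completed <= t->posted);
}
```

In an application, tracker_can_post() would be consulted before each ibv_post_send(), and the poll loop would feed each ibv_poll_cq() return value to tracker_on_completions(). If tracker_can_post() ever returns false at a point where the application believes nothing is in flight, the poll logic is probably dropping completions.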