I have a test program that does RDMA reads and writes as follows:

node A: server listens for and handles connection requests;
        sets up a piece of memory initialized to "0"
node B: two processes, parent & child

child:
  1. set up a new channel with the server, including a CQ with 1024 entries
     (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
  2. RDMA sequential write (8192 bytes at a time) to server memory
  4. sync with parent

parent:
  1. set up a new channel with the server, including a CQ with 1024 entries
     (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
  3. RDMA sequential read (8192 bytes at a time) of the same piece of
     server memory
       - check the buffer contents
       - if the memory content is still zero, re-read
  4. sync with child

The parent hangs (but child finishes its write) after the following
pops up in /var/log/messages:
 mlx4_core 0000:06:00.0: CQ overrun on CQN 000087

I have my own counters that restrict the reads (and writes) to 512 max.
Both write and read are blocking (i.e. the CQ is polled after each
read/write). I suspect my CQ poll logic is not correct. My question:
is there any diagnostic tool available to check the internal counters
(and/or states) of the ibverbs library and/or kernel drivers, to help
debug RDMA applications? In my case, the hang occurs around block 14546
(i.e. after 14546*8192 bytes).
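
To make the suspicion about the poll logic concrete, here is a toy model
in plain C. It is not libibverbs code and it is not my actual test
program; the names cq_model, cq_model_complete, cq_model_poll and
first_overrun_block are made up for illustration. It only tracks how many
CQ entries are in use, under the assumption that every block generates a
fixed number of completions and the poll step reaps a fixed maximum. It
shows how a loop that reaps fewer completions than it generates would
eventually fill a 1024-entry CQ, which is the kind of imbalance I suspect.

```c
#include <assert.h>

/* Toy model of a completion queue: it only tracks occupancy. */
struct cq_model {
    int capacity;   /* entries requested at creation, e.g. 1024 */
    int used;       /* completions generated but not yet polled */
};

/* "Hardware" side: one signaled work request completes.
 * Returns 0 on success, -1 if the CQ is already full (an overrun). */
static int cq_model_complete(struct cq_model *cq)
{
    if (cq->used >= cq->capacity)
        return -1;
    cq->used++;
    return 0;
}

/* Application side: reap up to max_entries completions.
 * Returns the number actually reaped. */
static int cq_model_poll(struct cq_model *cq, int max_entries)
{
    int n = cq->used < max_entries ? cq->used : max_entries;
    cq->used -= n;
    return n;
}

/* Run up to `blocks` iterations. Each iteration generates `per_block`
 * completions and then polls for at most `polled_per_block` of them.
 * Returns the 1-based block at which the CQ overruns, or 0 if it
 * never overruns within `blocks` iterations. */
long first_overrun_block(int capacity, long blocks,
                         int per_block, int polled_per_block)
{
    struct cq_model cq = { capacity, 0 };

    assert(capacity > 0 && per_block >= 0 && polled_per_block >= 0);

    for (long b = 1; b <= blocks; b++) {
        for (int i = 0; i < per_block; i++) {
            if (cq_model_complete(&cq) != 0)
                return b;   /* CQ was full when a completion arrived */
        }
        cq_model_poll(&cq, polled_per_block);
    }
    return 0;
}
```

In this model, first_overrun_block(1024, 100000, 1, 1) returns 0 (one
completion generated and one reaped per block never overruns), while
first_overrun_block(1024, 100000, 2, 1) returns 1024, because one entry
leaks per block. The model does not explain why my real program fails
around block 14546; it only illustrates the failure mode I am trying to
rule out.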

Thanks,
Wendy