Update: I found ways to improve active side performance from 10 million RDMA writes per second to 20 million (which I believe is the PCIe bottleneck):
1. Use inline payloads - I think this reduces PCIe traffic.
2. Use non-signalled RDMA writes and don't poll for a completion after
   every write - I don't know if ibv_poll_cq() uses the PCIe much.

I'd appreciate any other ideas to reduce PCIe traffic, or any
confirmation that my explanation of the bottleneck and the improvements
is correct.

--Anuj

On Fri, Nov 22, 2013 at 2:59 AM, Anuj Kalia <anujkaliai...@gmail.com> wrote:
> I had a related question regarding PCIe usage.
>
> How exactly does the userspace driver interact with the HCA? I'm
> reading the code for libmlx4 but I can't find any code for interaction
> with PCIe. There are some references to 'ringing a doorbell via PCI
> MMIO' - can someone please tell me how that works?
>
> In general, it would be great if someone could explain the CPU-HCA
> communication steps involved in doing an RDMA operation. If there is
> an online resource from where I can read about this, I'd appreciate a
> pointer.
>
> Thanks for your time!
>
> --Anuj
>
> On Thu, Nov 21, 2013 at 5:26 PM, Anuj Kalia <anujkaliai...@gmail.com> wrote:
>> I have machines with Mellanox ConnectX-3 cards connected to a
>> motherboard with PCIe 2.0. I had some questions regarding performance
>> of this system:
>>
>> 1. When multiple clients issue small (32 byte) RDMA writes to the
>> server, the combined throughput is about 22 million operations per
>> second. With ConnectX-3 I should be able to get 35 million (quote from
>> Mellanox).
>>
>> Is 22 million DMAs per second a PCIe 2.0 bottleneck?
>>
>> 2. When one client machine issues RDMA writes to multiple server
>> machines, it can issue at most 10-11 million writes per second. Is
>> this a PCIe bottleneck again?
>>
>> I believe it's a PCIe issue because an RDMA operation should involve 2
>> (or more) PCIe operations at the active side:
>> a. Write the work request to the HCA (or maybe the HCA reads the request).
>> b. The HCA reads the payload from the processor.
>>
>> Does this reasoning sound correct?
>>
>> 3. Is there a way to reduce the number of PCIe operations at the
>> active side? I don't think that posting a linked list of WQEs will
>> help because the HCA should read them one by one.
>>
>> --Anuj
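
A minimal sketch, in C, of how the two changes from the update at the top could be expressed when building each write: mark small payloads inline and request a completion only on every Nth write. The flag constants and both function names here are hypothetical stand-ins, not libibverbs API. In real code the corresponding flags would be IBV_SEND_INLINE and IBV_SEND_SIGNALED, set in the send_flags field of struct ibv_send_wr, and the inline limit would be the max_inline_data the QP was created with. The sketch also assumes the QP was created with sq_sig_all = 0, so that unsignalled writes are permitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for libibverbs' IBV_SEND_SIGNALED and
 * IBV_SEND_INLINE; the numeric values are illustrative only. */
enum {
    SKETCH_SEND_SIGNALED = 1 << 1,
    SKETCH_SEND_INLINE   = 1 << 3
};

/* Choose the send flags for the seq-th RDMA write (0-based).
 *  - The payload is marked inline when it fits within max_inline.
 *  - Only every signal_every-th write asks for a completion. */
unsigned write_flags(unsigned long seq, unsigned long signal_every,
                     size_t payload_len, size_t max_inline)
{
    unsigned flags = 0;

    assert(signal_every > 0);

    if (payload_len <= max_inline)
        flags |= SKETCH_SEND_INLINE;

    if ((seq + 1) % signal_every == 0)
        flags |= SKETCH_SEND_SIGNALED;

    return flags;
}

/* Number of completions the sender should expect to poll for after
 * posting n writes with the policy above. */
unsigned long completions_expected(unsigned long n,
                                   unsigned long signal_every)
{
    assert(signal_every > 0);
    return n / signal_every;
}
```

With signal_every = 64, for example, 1024 posted writes leave 16 completions to reap with ibv_poll_cq() instead of 1024. signal_every should stay below the send queue depth, because unsignalled work requests only free their queue slots once a later signalled completion has been polled.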