On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
> On 9/27/24 16:45, Sean Hefty wrote:
> > > > > I have met with the team from IONOS about their testing on actual IB
> > > > > hardware here at KVM Forum today and the requirements are starting
> > > > > to make more sense to me. I didn't say much in our previous thread
> > > > > because I misunderstood the requirements, so let me try to explain
> > > > > and see if we're all on the same page. There appears to be a
> > > > > fundamental limitation here with rsocket, for which I don't see how
> > > > > it is possible to overcome.
> > > > >
> > > > > The basic problem is that rsocket is trying to present a stream
> > > > > abstraction, a concept that is fundamentally incompatible with RDMA.
> > > > > The whole point of using RDMA in the first place is to avoid using
> > > > > the CPU, and to do that, all of the memory (potentially hundreds of
> > > > > gigabytes) needs to be registered with the hardware *in advance*
> > > > > (this is how the original implementation works).
> > > > >
> > > > > The need to fake a socket/bytestream abstraction eventually breaks
> > > > > down => There is a limit (a few GB) in rsocket (which the IONOS team
> > > > > previously reported in testing.... see that email), it appears that
> > > > > means that rsocket is only going to be able to map a certain limited
> > > > > amount of memory with the hardware until its internal "buffer" runs
> > > > > out before it can then unmap and remap the next batch of memory with
> > > > > the hardware to continue along with the fake bytestream. This is
> > > > > very much sticking a square peg in a round hole. If you were to
> > > > > "relax" the rsocket implementation to register the entire VM memory
> > > > > space (as my original implementation does), then there wouldn't be
> > > > > any need for rsocket in the first place.
> > >
> > > Yes, some test like this can be helpful.
> > >
> > > And thanks for the summary. That's definitely helpful.
> > >
> > > One question from my side (as someone who knows nothing about
> > > RDMA/rsocket): is that "a few GBs" limitation a software guard? Would
> > > it be possible for rsocket to provide some option that lets the user
> > > opt in to setting that value, so that it might work for the VM use
> > > case? Would that consume similar resources vs. the current QEMU impl
> > > but allow it to use rsockets with no perf regressions?
> >
> > Rsockets emulates the streaming socket API. The amount of memory
> > dedicated to a single rsocket is controlled through a wmem_default
> > configuration setting. It is also configurable via rsetsockopt()
> > SO_SNDBUF. Both of those are similar to TCP settings. The SW field used
> > to store this value is 32 bits.
> >
> > This internal buffer acts as a bounce buffer to convert the synchronous
> > socket API calls into the asynchronous RDMA transfers. Rsockets uses the
> > CPU for data copies, but the transport is offloaded to the NIC, including
> > kernel bypass.
>
> Understood.
>
> > Does your kernel allocate > 4 GBs of buffer space to an individual socket?
>
> Yes, it absolutely does. We're dealing with virtual machines here, right? It
> is possible (and likely) to have a virtual machine that is hundreds of GBs
> of RAM in size.
>
> A bounce buffer defeats the entire purpose of using RDMA in these cases.
> When using RDMA for very large transfers like this, the goal here is to map
> the entire memory region at once and avoid all CPU interactions (except for
> message management within libibverbs) so that the NIC is doing all of the
> work.
>
> I'm sure rsocket has its place with much smaller transfer sizes, but this is
> very different.
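
To put rough numbers on the 32-bit size field mentioned above, here is a back-of-the-envelope sketch. It assumes the field is unsigned (a signed field would halve the cap) and uses a made-up 256 GiB guest; buffer_refills() is only an illustrative helper, not anything from rsockets or QEMU.

```python
# Rough arithmetic only; nothing here calls into rsockets or QEMU.
GIB = 1 << 30

# Largest value an unsigned 32-bit field can hold (assumption: unsigned).
MAX_U32 = (1 << 32) - 1


def buffer_refills(guest_bytes: int, buf_bytes: int) -> int:
    """How many times a bounce buffer of buf_bytes has to be filled and
    drained to stream guest_bytes through it (ceiling division)."""
    if buf_bytes <= 0:
        raise ValueError("buffer size must be positive")
    return -(-guest_bytes // buf_bytes)


guest = 256 * GIB  # hypothetical VM size, chosen only for illustration
print(f"largest 32-bit buffer: {MAX_U32 / GIB:.2f} GiB")
print(f"refills for a {guest // GIB} GiB guest: {buffer_refills(guest, MAX_U32)}")
```

Even at that cap, every byte would still go through a CPU copy into the bounce buffer, which is the overhead described above.
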
Is it possible to make rsocket friendly to large buffers (>4GB), as in the
VM use case? I also wonder whether there are other applications outside of
QEMU that may benefit from this.

Thanks,

-- 
Peter Xu