On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
> On 9/27/24 16:45, Sean Hefty wrote:
> > 
> > > > > I have met with the team from IONOS about their testing on actual IB
> > > > > hardware here at KVM Forum today and the requirements are starting
> > > > > to make more sense to me. I didn't say much in our previous thread
> > > > > because I misunderstood the requirements, so let me try to explain
> > > > > and see if we're all on the same page. There appears to be a
> > > > > fundamental limitation here with rsocket, for which I don't see how
> > > > > it is possible to overcome.
> > > > > The basic problem is that rsocket is trying to present a stream
> > > > > abstraction, a concept that is fundamentally incompatible with RDMA.
> > > > > The whole point of using RDMA in the first place is to avoid using
> > > > > the CPU, and to do that, all of the memory (potentially hundreds of
> > > > > gigabytes) needs to be registered with the hardware *in advance*
> > > > > (this is how the original implementation works).
> > > > > The need to fake a socket/bytestream abstraction eventually breaks
> > > > > down => There is a limit (a few GB) in rsocket (which the IONOS team
> > > > > previously reported in testing; see that email). It appears that this
> > > > > means that rsocket is only going to be able to map a certain limited
> > > > > amount of memory with the hardware until its internal "buffer" runs
> > > > > out before it can then unmap and remap the next batch of memory with
> > > > > the hardware to continue along with the fake bytestream. This is
> > > > > very much sticking a square peg in a round hole. If you were to
> > > > > "relax" the rsocket implementation to register the entire VM memory
> > > > > space (as my original implementation does), then there wouldn't be
> > > > > any need for rsocket in the first place.
> > > 
> > > Yes, some test like this can be helpful.
> > > 
> > > And thanks for the summary.  That's definitely helpful.
> > > 
> > > One question from my side (as someone who knows nothing about
> > > RDMA/rsocket): is that "a few GBs" limitation a software guard?  Would it
> > > be possible for rsocket to provide some option that lets users opt in to
> > > setting that value, so that it might work for the VM use case?  Would that
> > > consume similar resources vs. the current QEMU impl while allowing it to
> > > use rsockets with no perf regressions?
> > Rsockets emulates the streaming socket API.  The amount of memory 
> > dedicated to a single rsocket is controlled through a wmem_default 
> > configuration setting.  It is also configurable via rsetsockopt() 
> > SO_SNDBUF.  Both of those are similar to TCP settings.  The SW field used 
> > to store this value is 32 bits.
> > 
> > This internal buffer acts as a bounce buffer to convert the synchronous 
> > socket API calls into the asynchronous RDMA transfers.  Rsockets uses the 
> > CPU for data copies, but the transport is offloaded to the NIC, including 
> > kernel bypass.
> Understood.
> > Does your kernel allocate > 4 GBs of buffer space to an individual socket?
> Yes, it absolutely does. We're dealing with virtual machines here, right? It
> is possible (and likely) to have a virtual machine that is hundreds of GBs
> of RAM in size.
> 
> A bounce buffer defeats the entire purpose of using RDMA in these cases.
> When using RDMA for very large transfers like this, the goal here is to map
> the entire memory region at once and avoid all CPU interactions (except for
> message management within libibverbs) so that the NIC is doing all of the
> work.
> 
> I'm sure rsocket has its place with much smaller transfer sizes, but this is
> very different.
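
To illustrate the bounce-buffer cost discussed above, here is a minimal
sketch. It is not rsocket code: the function name, the in-memory "transfer",
and the buffer sizes are all assumptions made for illustration. It only shows
that pushing a payload through a fixed-size staging buffer costs one CPU copy
per byte plus one round per buffer-sized chunk.

```python
def stream_through_bounce_buffer(payload: bytes, buf_size: int):
    """Simulate sending `payload` through a fixed-size staging buffer.

    Hypothetical model only: the append to `received` stands in for the
    NIC transfer; no RDMA or rsocket call is made.
    Returns (received_bytes, bytes_copied_by_cpu, rounds).
    """
    if buf_size <= 0:
        raise ValueError("buf_size must be positive")

    bounce = bytearray(buf_size)      # the fixed-size bounce buffer
    received = bytearray()            # what the "remote side" ends up with
    copied = 0                        # bytes the CPU had to copy
    rounds = 0                        # fill/drain cycles of the buffer
    view = memoryview(payload)

    for off in range(0, len(payload), buf_size):
        chunk = view[off:off + buf_size]
        n = len(chunk)
        bounce[:n] = chunk            # CPU copy into the staging buffer
        copied += n
        received += bounce[:n]        # stand-in for the offloaded transfer
        rounds += 1

    return bytes(received), copied, rounds


if __name__ == "__main__":
    data = bytes(range(256)) * 10     # 2560 bytes of sample data
    out, copied, rounds = stream_through_bounce_buffer(data, 1000)
    print(out == data, copied, rounds)
```

In this model the CPU copy count always equals the payload size, which is the
overhead that registering the whole region up front is meant to avoid.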

Is it possible to make rsocket friendly to large buffers (>4GB), as in
the VM use case?
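
To make the 4GB figure concrete: a 32-bit field, as described above, cannot
represent a buffer size of 4 GiB or more. The sketch below is plain
arithmetic, not rsocket code; the helper names are hypothetical, and it
assumes the field is unsigned, which the thread does not state.

```python
import ctypes

GIB = 1 << 30
U32_MAX = 0xFFFFFFFF


def fits_in_u32(nbytes: int) -> bool:
    """True if nbytes can be stored in a 32-bit unsigned field unchanged."""
    return 0 <= nbytes <= U32_MAX


def store_in_u32(nbytes: int) -> int:
    """Value that survives when nbytes is stored in a 32-bit unsigned field.

    ctypes integer types do no overflow checking, so the high bits are
    silently dropped, as they would be in a C uint32_t assignment.
    """
    return ctypes.c_uint32(nbytes).value


if __name__ == "__main__":
    for gib in (1, 3, 4, 256):
        requested = gib * GIB
        print(gib, fits_in_u32(requested), store_in_u32(requested))
```

Under that assumption, a VM with hundreds of GBs of RAM could not be
described by such a field without widening it.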

I also wonder whether there're other applications that may benefit from
this outside of QEMU.

Thanks,

-- 
Peter Xu

