Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On Thu, Apr 06, 2017 at 10:45:54PM +0300, Yuval Shaia wrote:
> > Just add my 2 cents. You didn't answer on my question about other possible
> > implementations. It can be SoftRoCE loopback optimizations, special ULP,
> > RDMA transport, virtual driver with multiple VFs and single PF.
>
> Please see my response to Jason's comments - eventually, when a support for
> VM to external host communication will be added - kdbr will become ULP as
> well.

So, is KDBR only to be used on the HV side? I.e. it never shows up in the VM?

That is even weirder, we certainly do not want to see a kernel RDMA ULP for any of this - the entire point of RDMA is to let user space implement their protocols without needing a unique kernel component!!

Jason
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On Thu, Apr 06, 2017 at 10:42:20PM +0300, Yuval Shaia wrote:
> > I'd rather see someone optimize the loopback path of soft roce than
> > see KDBR :)
>
> Can we assume that the optimized loopback path will be as fast as direct
> copy between one VM address space to another VM address space?

Well, you'd optimize it until it was a direct memory copy, so I think that is a reasonable starting assumption.

> > > 3. Our intention is for KDBR to be used in other contexts as well when we need
> > >    inter VM data exchange, e.g. backend for virtio devices. We didn't see how this
> > >    kind of requirement can be implemented inside SoftRoce as we don't see any
> > >    connection between them.
> >
> > KDBR looks like weak RDMA to me, so it is reasonable question why not
> > use full RDMA with loopback optimization instead of creating something
> > unique.
>
> True, KDBR exposes RDMA-like API because it's sole user is currently
> pvrdma device. But, by design it can be expand to support other
> clients for example virtio device which might have other attributes,
> can we expect the same from SoftRoCE?

RDMA handles all sorts of complex virtio-like protocols just fine. Unclear what 'other attributes' would be. Sounds like over designing??

> > IMHO, it also makes more sense for something like KDBR to live as a
> > RDMA transport, not as a unique char device, it is obviously very
> > RDMA-like.
>
> Can you elaborate more on this?
> What exactly it will solve?
> How it will be better than kdbr?

If you are going to do RDMA, then the uAPI for it from the kernel should be the RDMA subsystem, don't invent unique cdevs that overlap established kernel functionality without a very, very good reason.

> > .. and the char dev really can't be used when implementing user space
> > RDMA, that would just make a big mess..
>
> The position of kdbr is not to be a layer *between* user space and device -
> it is *the device* from point of view of the process.

Any RDMA device built on top of kdbr certainly needs to support /dev/uverbs0 and all the usual RDMA stuff, so again, I fail to see the point of the special cdev.. Trying to mix /dev/uverbs0 and /dev/kdbr in your provider would be too goofy and weird.

> > But obviously if you connect pvrdma to real hardware then the page pin
> > comes back.
>
> The fact that page pin is not needed with Soft RoCE device but is needed
> with real RoCE device is exactly where kdbr can help as it isolates this
> fact from user space process.

I don't see how KDBR helps at all.

To do virtual RDMA you must transfer RDMA objects and commands unmodified from VM to HV and implement a fairly complicated SW stack inside the HV. Once you do that, micro-optimizing for same-machine VM-to-VM copy is not really such a big deal, IMHO.

The big challenge is keeping the real HW (or soft RoCE) RDMA objects in sync with the VM ones and implementing some kind of RDMA-in-RDMA tunnel to enable migration when using today's HW offload. I see nothing in kdbr that helps with any of this.

All it seems to do is obfuscate the transfer of RDMA objects and commands to the hypervisor, and make the transition of a RDMA channel from loopback to network far, far, more complicated.

> Sorry, we didn't mean "easy" but "simple", and simplest solutions
> are always preferred. IMHO, currently there is no good solution to
> do data copy between two VMs.

Don't confuse 'simple' with under featured. :)

> Can you comment on the second point - migration? Please note that we need
> it to work both with Soft RoCE and with real device.

I don't see how kdbr helps with migration, you still have to setup the HW NIC and that needs sharing all the RDMA centric objects from VM to HV.

Jason
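A rough sketch of the object-sync problem raised above, under loudly stated assumptions: `HandleTable`, `create_qp`, `lookup` and `rebind` are hypothetical names invented for illustration, and the string "handles" stand in for whatever the hypervisor would really hold. Nothing here is the pvrdma, kdbr, or RDMA subsystem API. It only shows the idea that the guest keeps stable identifiers while the hypervisor re-creates the backing objects, for example after a migration.

```python
class HandleTable:
    """Toy model: guest-visible QP numbers stay stable, backend handles may change.

    Hypothetical illustration only; not taken from pvrdma, kdbr, or libibverbs.
    """

    def __init__(self):
        self._next_guest_qpn = 1
        self._backend = {}  # guest QP number -> current backend handle

    def create_qp(self, backend_handle):
        """Register a backend object and return the number the guest will see."""
        guest_qpn = self._next_guest_qpn
        self._next_guest_qpn += 1
        self._backend[guest_qpn] = backend_handle
        return guest_qpn

    def lookup(self, guest_qpn):
        """Translate a guest QP number into the backend handle currently behind it."""
        return self._backend[guest_qpn]

    def rebind(self, make_backend_handle):
        """Re-create every backend object (e.g. on a new host) without
        changing any guest-visible number."""
        for guest_qpn in list(self._backend):
            self._backend[guest_qpn] = make_backend_handle(guest_qpn)
```

For instance, a guest QP created against a handle on one host can be pointed at a freshly created handle on another host by calling `rebind`, while `lookup` keeps accepting the same guest number. The hard parts named in the message (tunnelling and keeping real hardware state consistent) are exactly what this toy leaves out.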
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On Tue, Apr 04, 2017 at 08:33:49PM +0300, Leon Romanovsky wrote:
>
> I'm not going to repeat Jason's answer, I'm completely agree with him.
>
> Just add my 2 cents. You didn't answer on my question about other possible
> implementations. It can be SoftRoCE loopback optimizations, special ULP,
> RDMA transport, virtual driver with multiple VFs and single PF.

Please see my response to Jason's comments - eventually, when a support for VM to external host communication will be added - kdbr will become ULP as well.

Marcel & Yuval
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On Tue, Apr 04, 2017 at 10:01:55AM -0600, Jason Gunthorpe wrote:
> On Tue, Apr 04, 2017 at 04:38:40PM +0300, Marcel Apfelbaum wrote:
>
> > Here are some thoughts regarding the Soft RoCE usage in our project.
> > We thought about using it as backend for QEMU pvrdma device
> > we didn't how it will support our requirements.
> >
> > 1. Does Soft RoCE support inter process (VM) fast path ? The KDBR
> >    removes the need for hw resources, emulated or not, concentrating
> >    on one copy from a VM to another.
>
> I'd rather see someone optimize the loopback path of soft roce than
> see KDBR :)

Can we assume that the optimized loopback path will be as fast as direct copy between one VM address space to another VM address space?

> > 3. Our intention is for KDBR to be used in other contexts as well when we need
> >    inter VM data exchange, e.g. backend for virtio devices. We didn't see how this
> >    kind of requirement can be implemented inside SoftRoce as we don't see any
> >    connection between them.
>
> KDBR looks like weak RDMA to me, so it is reasonable question why not
> use full RDMA with loopback optimization instead of creating something
> unique.

True, KDBR exposes RDMA-like API because it's sole user is currently pvrdma device. But, by design it can be expand to support other clients for example virtio device which might have other attributes, can we expect the same from SoftRoCE?

> IMHO, it also makes more sense for something like KDBR to live as a
> RDMA transport, not as a unique char device, it is obviously very
> RDMA-like.

Can you elaborate more on this?
What exactly it will solve?
How it will be better than kdbr?

As we see it - kdbr, when will be expand to support peers on external hosts, will be like a ULP.

> .. and the char dev really can't be used when implementing user space
> RDMA, that would just make a big mess..

The position of kdbr is not to be a layer *between* user space and device - it is *the device* from point of view of the process.

> > 4. We don't want all the VM memory to be pinned since it disable memory-over-commit
> >    which in turn will make the pvrdma device useless.
> >    We weren't sure how nice would play Soft RoCE with memory pinning and we wanted
> >    more control on memory management. It may be a solvable issue, but combined
> >    with the others lead us to our decision to come up with our kernel bridge (char
>
> soft roce certainly can be optimized to remove the page pin and always
> run in an ODP-like mode.
>
> But obviously if you connect pvrdma to real hardware then the page pin
> comes back.

The fact that page pin is not needed with Soft RoCE device but is needed with real RoCE device is exactly where kdbr can help as it isolates this fact from user space process.

> >    device or not, we went for it since it was the easiest to
> >    implement for a POC)
>
> I can see why it would be easy to implement, but not sure how this
> really improves the kernel..

Sorry, we didn't mean "easy" but "simple", and simplest solutions are always preferred. IMHO, currently there is no good solution to do data copy between two VMs.

> Jason

Can you comment on the second point - migration? Please note that we need it to work both with Soft RoCE and with real device.

Marcel & Yuval
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On Tue, Apr 04, 2017 at 04:38:40PM +0300, Marcel Apfelbaum wrote:

> Here are some thoughts regarding the Soft RoCE usage in our project.
> We thought about using it as backend for QEMU pvrdma device
> we didn't how it will support our requirements.
>
> 1. Does Soft RoCE support inter process (VM) fast path ? The KDBR
>    removes the need for hw resources, emulated or not, concentrating
>    on one copy from a VM to another.

I'd rather see someone optimize the loopback path of soft roce than see KDBR :)

> 3. Our intention is for KDBR to be used in other contexts as well when we need
>    inter VM data exchange, e.g. backend for virtio devices. We didn't see how this
>    kind of requirement can be implemented inside SoftRoce as we don't see any
>    connection between them.

KDBR looks like weak RDMA to me, so it is reasonable question why not use full RDMA with loopback optimization instead of creating something unique.

IMHO, it also makes more sense for something like KDBR to live as a RDMA transport, not as a unique char device, it is obviously very RDMA-like.

.. and the char dev really can't be used when implementing user space RDMA, that would just make a big mess..

> 4. We don't want all the VM memory to be pinned since it disable memory-over-commit
>    which in turn will make the pvrdma device useless.
>    We weren't sure how nice would play Soft RoCE with memory pinning and we wanted
>    more control on memory management. It may be a solvable issue, but combined
>    with the others lead us to our decision to come up with our kernel bridge (char

soft roce certainly can be optimized to remove the page pin and always run in an ODP-like mode.

But obviously if you connect pvrdma to real hardware then the page pin comes back.

>    device or not, we went for it since it was the easiest to
>    implement for a POC)

I can see why it would be easy to implement, but not sure how this really improves the kernel..

Jason
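To make the "one copy from a VM to another" idea from this exchange concrete, here is a minimal sketch of what a same-host loopback delivery reduces to. It is a toy under stated assumptions: `LoopbackQueuePair`, `post_recv` and `loopback_send` are hypothetical names, Python `bytearray` objects stand in for guest memory, and none of this is Soft RoCE, kdbr, or verbs code. It also ignores pinning, completion queues and protection checks entirely.

```python
from collections import deque


class LoopbackQueuePair:
    """Toy receive side: posted receive buffers are consumed in FIFO order.

    Hypothetical illustration only; not an RDMA or kdbr API.
    """

    def __init__(self):
        self.recv_queue = deque()

    def post_recv(self, buf):
        """Make a writable buffer available for the next incoming message."""
        self.recv_queue.append(buf)


def loopback_send(dest, payload):
    """Deliver payload with a single copy into the oldest posted buffer.

    Returns the number of bytes copied. Raises BufferError when no receive
    buffer is posted and ValueError when the buffer is too small.
    """
    if not dest.recv_queue:
        raise BufferError("receiver not ready: no receive buffer posted")
    buf = dest.recv_queue[0]
    if len(payload) > len(buf):
        raise ValueError("payload larger than posted receive buffer")
    dest.recv_queue.popleft()
    buf[:len(payload)] = payload  # the one and only data copy
    return len(payload)
```

The point of the sketch is only that, once sender and receiver live on the same host, the data path can collapse to one bounded copy into a pre-posted buffer, whichever kernel component ends up performing it.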
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On 04/03/2017 09:23 AM, Leon Romanovsky wrote:
> On Fri, Mar 31, 2017 at 06:45:43PM +0300, Marcel Apfelbaum wrote:
> > On 03/30/2017 11:28 PM, Doug Ledford wrote:
> > > On 3/30/17 9:13 AM, Leon Romanovsky wrote:
> > > > On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
> > > > > From: Yuval Shaia
> > > > >
> > > > > Hi,
> > > > >
> > > > > General description
> > > > > ===
> > > > > This is a very early RFC of a new RoCE emulated device
> > > > > that enables guests to use the RDMA stack without having
> > > > > a real hardware in the host.
> > > > >
> > > > > The current implementation supports only VM to VM communication
> > > > > on the same host.
> > > > > Down the road we plan to make possible to be able to support
> > > > > inter-machine communication by utilizing physical RoCE devices
> > > > > or Soft RoCE.
> > > > >
> > > > > The goals are:
> > > > > - Reach fast and secure loos-less Inter-VM data exchange.
> > > > > - Support remote VMs or bare metal machines.
> > > > > - Allow VMs migration.
> > > > > - Do not require to pin all VM memory.
> > > > >
> > > > >
> > > > > Objective
> > > > > =
> > > > > Have a QEMU implementation of the PVRDMA device. We aim to do so without
> > > > > any change in the PVRDMA guest driver which is already merged into the
> > > > > upstream kernel.
> > > > >
> > > > >
> > > > > RFC status
> > > > > ===
> > > > > The project is in early development stages and supports
> > > > > only basic send/receive operations.
> > > > >
> > > > > We present it so we can get feedbacks on design,
> > > > > feature demands and to receive comments from the
> > > > > community pointing us to the "right" direction.
> > > >
> > > > If to judge by the feedback which you got from RDMA community
> > > > for kernel proposal [1], this community failed to understand:
> > > > 1. Why do you need new module?
> > >
> > > In this case, this is a qemu module to allow qemu to provide a virt
> > > rdma device to guests that is compatible with the device provided by
> > > VMWare's ESX product. Right now, the vmware_pvrdma driver
> > > works only when the guest is running on a VMWare ESX server product,
> > > this would change that. Marcel mentioned that they are currently
> > > making it compatible because that's the easiest/quickest thing to
> > > do, but in the future they might extend beyond what VMWare's virt rdma
> > > driver provides/uses and might then need to either modify it to work
> > > with their extensions or fork and create their own virt
> > > client driver.
> > >
> > > > 2. Why existing solutions are not enough and can't be extended?
> > >
> > > This patch is against the qemu source code, not the kernel. There is
> > > no other solution in the qemu source code, so there is no existing
> > > solution to extend.
> > >
> > > > 3. Why RXE (SoftRoCE) can't be extended to perform this inter-VM
> > > >    communication via virtual NIC?
> > >
> > > Eventually they want this to work on real hardware, and to be more or
> > > less transparent to the guest. They will need to make it independent
> > > of the kernel hardware/driver in use. That means their own
> > > virt driver, then the virt driver will eventually hook into whatever
> > > hardware is present on the system, or failing that, fall back to soft
> > > RoCE or soft iWARP if that ever makes it in the kernel.
> >
> > Hi Leon and Doug,
> > Your feedback is much appreciated!
> >
> > As Doug mentioned, the RFC is a QEMU implementation of a pvrdma device,
> > so SoftRoCE can't help here (we are emulating a PCI device).
>
> I just responded to the latest email, but as you understood from my question,
> it was related to your KDBR module.
>
> > Regarding the new KDBR module (Kernel Data Bridge), as the name suggests is
> > a bridge between different VMs or between a VM and a hardware/software device
> > and does not replace it.
> >
> > Leon, utilizing the Soft RoCE is definitely part of our roadmap from the start,
> > we find the project a must since most of our systems don't even have real
> > RDMA hardware, and the question is how do best integrate with it.
>
> This is exactly the question, you chose as an implementation path to do
> it with new module over char device. I'm not against your approach,
> but would like to see the list with pros and cons for over possible
> solutions if any. Does it make sense to do special ULP to share the data
> between different drivers over shared memory?

Hi Leon,

Here are some thoughts regarding the Soft RoCE usage in our project.
We thought about using it as backend for QEMU pvrdma device
we didn't how it will support our requirements.

1. Does Soft RoCE support inter process (VM) fast path ? The KDBR
   removes the need for hw resources, emulated or not, concentrating
   on one copy from a VM to another.

2. We needed to support migration, meaning the PVRDMA device must preserve
   the RDMA resources between different hosts. Our solution includes a clear
   separation between the guest resources namespace and the actual hw/sw device.
   This is why the KDBR is intended to run outside the scope of the SoftRoCE
   so it can open/close hw connections independent from the VM.

3. Our intention is for KDBR to be used in other contexts as well when we need
   inter VM data exchange, e.g. backend for virtio devices. We didn't see how this
   kind of requirement can be implemented inside SoftRoce as we don't see any
   connection between them.

4. We don't want all the VM memory to be pinned since it disable
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On Thu, Mar 30, 2017 at 03:28:21PM -0500, Doug Ledford wrote:
> On 3/30/17 9:13 AM, Leon Romanovsky wrote:
> > On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
> > > From: Yuval Shaia
> > >
> > > Hi,
> > >
> > > General description
> > > ===
> > > This is a very early RFC of a new RoCE emulated device
> > > that enables guests to use the RDMA stack without having
> > > a real hardware in the host.
> > >
> > > The current implementation supports only VM to VM communication
> > > on the same host.
> > > Down the road we plan to make possible to be able to support
> > > inter-machine communication by utilizing physical RoCE devices
> > > or Soft RoCE.
> > >
> > > The goals are:
> > > - Reach fast and secure loos-less Inter-VM data exchange.
> > > - Support remote VMs or bare metal machines.
> > > - Allow VMs migration.
> > > - Do not require to pin all VM memory.
> > >
> > >
> > > Objective
> > > =
> > > Have a QEMU implementation of the PVRDMA device. We aim to do so without
> > > any change in the PVRDMA guest driver which is already merged into the
> > > upstream kernel.
> > >
> > >
> > > RFC status
> > > ===
> > > The project is in early development stages and supports
> > > only basic send/receive operations.
> > >
> > > We present it so we can get feedbacks on design,
> > > feature demands and to receive comments from the
> > > community pointing us to the "right" direction.
> >
> > If to judge by the feedback which you got from RDMA community
> > for kernel proposal [1], this community failed to understand:
> > 1. Why do you need new module?
>
> In this case, this is a qemu module to allow qemu to provide a virt rdma
> device to guests that is compatible with the device provided by VMWare's ESX
> product. Right now, the vmware_pvrdma driver works only when the guest is
> running on a VMWare ESX server product, this would change that. Marcel
> mentioned that they are currently making it compatible because that's the
> easiest/quickest thing to do, but in the future they might extend beyond
> what VMWare's virt rdma driver provides/uses and might then need to either
> modify it to work with their extensions or fork and create their own virt
> client driver.

Doug,

As I mentioned during OFA, I just responded to the latest email, but targeted my questions for their module. Sorry for not being clear about it.

> > 2. Why existing solutions are not enough and can't be extended?
>
> This patch is against the qemu source code, not the kernel. There is no
> other solution in the qemu source code, so there is no existing solution to
> extend.
>
> > 3. Why RXE (SoftRoCE) can't be extended to perform this inter-VM
> >    communication via virtual NIC?
>
> Eventually they want this to work on real hardware, and to be more or less
> transparent to the guest. They will need to make it independent of the
> kernel hardware/driver in use. That means their own virt driver, then the
> virt driver will eventually hook into whatever hardware is present on the
> system, or failing that, fall back to soft RoCE or soft iWARP if that ever
> makes it in the kernel.
>
> > Can you please help us to fill this knowledge gap?
> >
> > [1] http://marc.info/?l=linux-rdma=149063626907175=2
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On Fri, Mar 31, 2017 at 06:45:43PM +0300, Marcel Apfelbaum wrote:
> On 03/30/2017 11:28 PM, Doug Ledford wrote:
> > On 3/30/17 9:13 AM, Leon Romanovsky wrote:
> > > On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
> > > > From: Yuval Shaia
> > > >
> > > > Hi,
> > > >
> > > > General description
> > > > ===
> > > > This is a very early RFC of a new RoCE emulated device
> > > > that enables guests to use the RDMA stack without having
> > > > a real hardware in the host.
> > > >
> > > > The current implementation supports only VM to VM communication
> > > > on the same host.
> > > > Down the road we plan to make possible to be able to support
> > > > inter-machine communication by utilizing physical RoCE devices
> > > > or Soft RoCE.
> > > >
> > > > The goals are:
> > > > - Reach fast and secure loos-less Inter-VM data exchange.
> > > > - Support remote VMs or bare metal machines.
> > > > - Allow VMs migration.
> > > > - Do not require to pin all VM memory.
> > > >
> > > >
> > > > Objective
> > > > =
> > > > Have a QEMU implementation of the PVRDMA device. We aim to do so without
> > > > any change in the PVRDMA guest driver which is already merged into the
> > > > upstream kernel.
> > > >
> > > >
> > > > RFC status
> > > > ===
> > > > The project is in early development stages and supports
> > > > only basic send/receive operations.
> > > >
> > > > We present it so we can get feedbacks on design,
> > > > feature demands and to receive comments from the
> > > > community pointing us to the "right" direction.
> > >
> > > If to judge by the feedback which you got from RDMA community
> > > for kernel proposal [1], this community failed to understand:
> > > 1. Why do you need new module?
> >
> > In this case, this is a qemu module to allow qemu to provide a virt rdma
> > device to guests that is compatible with the device provided by VMWare's
> > ESX product. Right now, the vmware_pvrdma driver
> > works only when the guest is running on a VMWare ESX server product, this
> > would change that. Marcel mentioned that they are currently making it
> > compatible because that's the easiest/quickest thing to
> > do, but in the future they might extend beyond what VMWare's virt rdma
> > driver provides/uses and might then need to either modify it to work with
> > their extensions or fork and create their own virt
> > client driver.
> >
> > > 2. Why existing solutions are not enough and can't be extended?
> >
> > This patch is against the qemu source code, not the kernel. There is no
> > other solution in the qemu source code, so there is no existing solution to
> > extend.
> >
> > > 3. Why RXE (SoftRoCE) can't be extended to perform this inter-VM
> > >    communication via virtual NIC?
> >
> > Eventually they want this to work on real hardware, and to be more or less
> > transparent to the guest. They will need to make it independent of the
> > kernel hardware/driver in use. That means their own
> > virt driver, then the virt driver will eventually hook into whatever
> > hardware is present on the system, or failing that, fall back to soft RoCE
> > or soft iWARP if that ever makes it in the kernel.
>
> Hi Leon and Doug,
> Your feedback is much appreciated!
>
> As Doug mentioned, the RFC is a QEMU implementation of a pvrdma device,
> so SoftRoCE can't help here (we are emulating a PCI device).

I just responded to the latest email, but as you understood from my question, it was related to your KDBR module.

> Regarding the new KDBR module (Kernel Data Bridge), as the name suggests is
> a bridge between different VMs or between a VM and a hardware/software device
> and does not replace it.
>
> Leon, utilizing the Soft RoCE is definitely part of our roadmap from the start,
> we find the project a must since most of our systems don't even have real
> RDMA hardware, and the question is how do best integrate with it.

This is exactly the question, you chose as an implementation path to do it with new module over char device. I'm not against your approach, but would like to see the list with pros and cons for over possible solutions if any. Does it make sense to do special ULP to share the data between different drivers over shared memory?

Thanks

> Thanks,
> Marcel & Yuval
>
> > > Can you please help us to fill this knowledge gap?
> > >
> > > [1] http://marc.info/?l=linux-rdma=149063626907175=2
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On 03/31/2017 02:38 AM, Adit Ranadive wrote:
> On Thu Mar 30 2017 13:28:21 GMT-0700 (PDT), Doug Ledford wrote:
> > On 3/30/17 9:13 AM, Leon Romanovsky wrote:
> > > On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
> > > > From: Yuval Shaia
> > > >
> > > > Hi,
> > > >
> > > > General description
> > > > ===
> > > > This is a very early RFC of a new RoCE emulated device
> > > > that enables guests to use the RDMA stack without having
> > > > a real hardware in the host.
> > > >
> > > > The current implementation supports only VM to VM communication
> > > > on the same host.
> > > > Down the road we plan to make possible to be able to support
> > > > inter-machine communication by utilizing physical RoCE devices
> > > > or Soft RoCE.
> > > >
> > > > The goals are:
> > > > - Reach fast and secure loos-less Inter-VM data exchange.
> > > > - Support remote VMs or bare metal machines.
> > > > - Allow VMs migration.
> > > > - Do not require to pin all VM memory.
> > > >
> > > >
> > > > Objective
> > > > =
> > > > Have a QEMU implementation of the PVRDMA device. We aim to do so without
> > > > any change in the PVRDMA guest driver which is already merged into the
> > > > upstream kernel.
> > > >
> > > >
> > > > RFC status
> > > > ===
> > > > The project is in early development stages and supports
> > > > only basic send/receive operations.
> > > >
> > > > We present it so we can get feedbacks on design,
> > > > feature demands and to receive comments from the
> > > > community pointing us to the "right" direction.
> > >
> > > If to judge by the feedback which you got from RDMA community
> > > for kernel proposal [1], this community failed to understand:
> > > 1. Why do you need new module?
> >
> > In this case, this is a qemu module to allow qemu to provide a virt rdma
> > device to guests that is compatible with the device provided by VMWare's
> > ESX product. Right now, the vmware_pvrdma driver works only when the
> > guest is running on a VMWare ESX server product, this would change that.
> > Marcel mentioned that they are currently making it compatible because
> > that's the easiest/quickest thing to do, but in the future they might
> > extend beyond what VMWare's virt rdma driver provides/uses and might then
> > need to either modify it to work with their extensions or fork and create
> > their own virt client driver.
> >
> > > 2. Why existing solutions are not enough and can't be extended?
> >
> > This patch is against the qemu source code, not the kernel. There is no
> > other solution in the qemu source code, so there is no existing solution to
> > extend.
> >
> > > 3. Why RXE (SoftRoCE) can't be extended to perform this inter-VM
> > >    communication via virtual NIC?
> >
> > Eventually they want this to work on real hardware, and to be more or less
> > transparent to the guest. They will need to make it independent of the
> > kernel hardware/driver in use. That means their own virt driver, then the
> > virt driver will eventually hook into whatever hardware is present on the
> > system, or failing that, fall back to soft RoCE or soft iWARP if that ever
> > makes it in the kernel.

Hi Adit,

> Hmm, this looks quite interesting.

Thanks!!

> Though I'm not surprised, the PVRDMA device spec is relatively straightforward.

Indeed, the pvrdma driver is clear and well documented, which made our development much easier.

> I would have definitely mentioned this (if I knew about it) during my OFA
> workshop talk a couple of days ago :).

There is always a next OFA workshop :)

Thanks,
Marcel & Yuval

> Doug's right. I mean basically, this looks like a QEMU version of our PVRDMA
> backend.
>
> Thanks,
> Adit
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On 03/30/2017 11:28 PM, Doug Ledford wrote:
> On 3/30/17 9:13 AM, Leon Romanovsky wrote:
> > On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
> > > From: Yuval Shaia
> > >
> > > Hi,
> > >
> > > General description
> > > ===
> > > This is a very early RFC of a new RoCE emulated device
> > > that enables guests to use the RDMA stack without having
> > > a real hardware in the host.
> > >
> > > The current implementation supports only VM to VM communication
> > > on the same host.
> > > Down the road we plan to make possible to be able to support
> > > inter-machine communication by utilizing physical RoCE devices
> > > or Soft RoCE.
> > >
> > > The goals are:
> > > - Reach fast and secure loos-less Inter-VM data exchange.
> > > - Support remote VMs or bare metal machines.
> > > - Allow VMs migration.
> > > - Do not require to pin all VM memory.
> > >
> > >
> > > Objective
> > > =
> > > Have a QEMU implementation of the PVRDMA device. We aim to do so without
> > > any change in the PVRDMA guest driver which is already merged into the
> > > upstream kernel.
> > >
> > >
> > > RFC status
> > > ===
> > > The project is in early development stages and supports
> > > only basic send/receive operations.
> > >
> > > We present it so we can get feedbacks on design,
> > > feature demands and to receive comments from the
> > > community pointing us to the "right" direction.
> >
> > If to judge by the feedback which you got from RDMA community
> > for kernel proposal [1], this community failed to understand:
> > 1. Why do you need new module?
>
> In this case, this is a qemu module to allow qemu to provide a virt rdma
> device to guests that is compatible with the device provided by VMWare's ESX
> product. Right now, the vmware_pvrdma driver works only when the guest is
> running on a VMWare ESX server product, this would change that. Marcel
> mentioned that they are currently making it compatible because that's the
> easiest/quickest thing to do, but in the future they might extend beyond
> what VMWare's virt rdma driver provides/uses and might then need to either
> modify it to work with their extensions or fork and create their own virt
> client driver.
>
> > 2. Why existing solutions are not enough and can't be extended?
>
> This patch is against the qemu source code, not the kernel. There is no
> other solution in the qemu source code, so there is no existing solution to
> extend.
>
> > 3. Why RXE (SoftRoCE) can't be extended to perform this inter-VM
> >    communication via virtual NIC?
>
> Eventually they want this to work on real hardware, and to be more or less
> transparent to the guest. They will need to make it independent of the
> kernel hardware/driver in use. That means their own virt driver, then the
> virt driver will eventually hook into whatever hardware is present on the
> system, or failing that, fall back to soft RoCE or soft iWARP if that ever
> makes it in the kernel.

Hi Leon and Doug,
Your feedback is much appreciated!

As Doug mentioned, the RFC is a QEMU implementation of a pvrdma device, so SoftRoCE can't help here (we are emulating a PCI device).

Regarding the new KDBR module (Kernel Data Bridge), as the name suggests is a bridge between different VMs or between a VM and a hardware/software device and does not replace it.

Leon, utilizing the Soft RoCE is definitely part of our roadmap from the start, we find the project a must since most of our systems don't even have real RDMA hardware, and the question is how do best integrate with it.

Thanks,
Marcel & Yuval

> > Can you please help us to fill this knowledge gap?
> >
> > [1] http://marc.info/?l=linux-rdma=149063626907175=2
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On Thu Mar 30 2017 13:28:21 GMT-0700 (PDT), Doug Ledford wrote:
> On 3/30/17 9:13 AM, Leon Romanovsky wrote:
> > On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
> > > From: Yuval Shaia
> > >
> > > Hi,
> > >
> > > General description
> > > ===
> > > This is a very early RFC of a new RoCE emulated device
> > > that enables guests to use the RDMA stack without having
> > > real hardware in the host.
> > >
> > > The current implementation supports only VM to VM communication
> > > on the same host.
> > > Down the road we plan to support inter-machine communication
> > > by utilizing physical RoCE devices or Soft RoCE.
> > >
> > > The goals are:
> > > - Reach fast and secure lossless inter-VM data exchange.
> > > - Support remote VMs or bare metal machines.
> > > - Allow VM migration.
> > > - Do not require pinning all VM memory.
> > >
> > > Objective
> > > =
> > > Have a QEMU implementation of the PVRDMA device. We aim to do so without
> > > any change in the PVRDMA guest driver, which is already merged into the
> > > upstream kernel.
> > >
> > > RFC status
> > > ===
> > > The project is in early development stages and supports
> > > only basic send/receive operations.
> > >
> > > We present it so we can get feedback on design and
> > > feature demands, and receive comments from the
> > > community pointing us in the "right" direction.
> >
> > Judging by the feedback you got from the RDMA community
> > on the kernel proposal [1], this community failed to understand:
> > 1. Why do you need a new module?
>
> In this case, this is a qemu module to allow qemu to provide a virt rdma
> device to guests that is compatible with the device provided by VMware's ESX
> product. Right now, the vmware_pvrdma driver works only when the guest is
> running on a VMware ESX server product; this would change that. Marcel
> mentioned that they are currently making it compatible because that's the
> easiest/quickest thing to do, but in the future they might extend beyond what
> VMware's virt rdma driver provides/uses and might then need to either modify
> it to work with their extensions or fork and create their own virt client
> driver.
>
> > 2. Why are existing solutions not enough, and why can't they be extended?
>
> This patch is against the qemu source code, not the kernel. There is no
> other solution in the qemu source code, so there is no existing solution to
> extend.
>
> > 3. Why can't RXE (SoftRoCE) be extended to perform this inter-VM
> >    communication via a virtual NIC?
>
> Eventually they want this to work on real hardware, and to be more or less
> transparent to the guest. They will need to make it independent of the
> kernel hardware/driver in use. That means their own virt driver; the
> virt driver will then eventually hook into whatever hardware is present on the
> system or, failing that, fall back to soft RoCE or soft iWARP if that ever
> makes it into the kernel.
>

Hmm, this looks quite interesting. Though I'm not surprised, the PVRDMA
device spec is relatively straightforward. I would have definitely mentioned
this (if I knew about it) during my OFA workshop talk a couple of days ago :).

Doug's right. I mean basically, this looks like a QEMU version of our PVRDMA
backend.

Thanks,
Adit
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On 3/30/17 9:13 AM, Leon Romanovsky wrote:
> On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
> > From: Yuval Shaia
> >
> > Hi,
> >
> > General description
> > ===
> > This is a very early RFC of a new RoCE emulated device
> > that enables guests to use the RDMA stack without having
> > real hardware in the host.
> >
> > The current implementation supports only VM to VM communication
> > on the same host.
> > Down the road we plan to support inter-machine communication
> > by utilizing physical RoCE devices or Soft RoCE.
> >
> > The goals are:
> > - Reach fast and secure lossless inter-VM data exchange.
> > - Support remote VMs or bare metal machines.
> > - Allow VM migration.
> > - Do not require pinning all VM memory.
> >
> > Objective
> > =
> > Have a QEMU implementation of the PVRDMA device. We aim to do so without
> > any change in the PVRDMA guest driver, which is already merged into the
> > upstream kernel.
> >
> > RFC status
> > ===
> > The project is in early development stages and supports
> > only basic send/receive operations.
> >
> > We present it so we can get feedback on design and
> > feature demands, and receive comments from the
> > community pointing us in the "right" direction.
>
> Judging by the feedback you got from the RDMA community
> on the kernel proposal [1], this community failed to understand:
> 1. Why do you need a new module?

In this case, this is a qemu module to allow qemu to provide a virt rdma
device to guests that is compatible with the device provided by VMware's ESX
product. Right now, the vmware_pvrdma driver works only when the guest is
running on a VMware ESX server product; this would change that. Marcel
mentioned that they are currently making it compatible because that's the
easiest/quickest thing to do, but in the future they might extend beyond what
VMware's virt rdma driver provides/uses and might then need to either modify
it to work with their extensions or fork and create their own virt client
driver.

> 2. Why are existing solutions not enough, and why can't they be extended?

This patch is against the qemu source code, not the kernel. There is no
other solution in the qemu source code, so there is no existing solution to
extend.

> 3. Why can't RXE (SoftRoCE) be extended to perform this inter-VM
>    communication via a virtual NIC?

Eventually they want this to work on real hardware, and to be more or less
transparent to the guest. They will need to make it independent of the
kernel hardware/driver in use. That means their own virt driver; the
virt driver will then eventually hook into whatever hardware is present on the
system or, failing that, fall back to soft RoCE or soft iWARP if that ever
makes it into the kernel.

> Can you please help us to fill this knowledge gap?
>
> [1] http://marc.info/?l=linux-rdma&m=149063626907175&w=2
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On Thu, Mar 30, 2017 at 02:12:21PM +0300, Marcel Apfelbaum wrote:
> From: Yuval Shaia
>
> Hi,
>
> General description
> ===
> This is a very early RFC of a new RoCE emulated device
> that enables guests to use the RDMA stack without having
> real hardware in the host.
>
> The current implementation supports only VM to VM communication
> on the same host.
> Down the road we plan to support inter-machine communication
> by utilizing physical RoCE devices or Soft RoCE.
>
> The goals are:
> - Reach fast and secure lossless inter-VM data exchange.
> - Support remote VMs or bare metal machines.
> - Allow VM migration.
> - Do not require pinning all VM memory.
>
> Objective
> =
> Have a QEMU implementation of the PVRDMA device. We aim to do so without
> any change in the PVRDMA guest driver, which is already merged into the
> upstream kernel.
>
> RFC status
> ===
> The project is in early development stages and supports
> only basic send/receive operations.
>
> We present it so we can get feedback on design and
> feature demands, and receive comments from the
> community pointing us in the "right" direction.

Judging by the feedback you got from the RDMA community
on the kernel proposal [1], this community failed to understand:
1. Why do you need a new module?
2. Why are existing solutions not enough, and why can't they be extended?
3. Why can't RXE (SoftRoCE) be extended to perform this inter-VM
   communication via a virtual NIC?

Can you please help us to fill this knowledge gap?

[1] http://marc.info/?l=linux-rdma&m=149063626907175&w=2

Thanks

>
> What does work:
> - Tested with a basic unit test:
>   - https://github.com/yuvalshaia/kibpingpong
>   It works fine with two devices on a single VM, but has
>   some issues between two VMs on the same host.
>
> Design
> ==
> - Follows the behavior of VMware's pvrdma device, but is not tightly
>   coupled with it, and most of the code can be reused if we decide to
>   continue to a virtio-based RDMA device.
>
> - It exposes 3 BARs:
>     BAR 0 - MSI-X, uses 3 vectors for command ring, async events and
>             completions
>     BAR 1 - Configuration registers
>     BAR 2 - UAR, used to pass HW commands from the driver.
>
> - The device performs internal management of the RDMA
>   resources (PDs, CQs, QPs, ...), meaning the objects
>   are not directly coupled to a physical RDMA device's resources.
>
> - As backend, the pvrdma device uses KDBR, a new kernel module which
>   is also in RFC phase; read more on the linux-rdma list:
>   - https://www.spinics.net/lists/linux-rdma/msg47951.html
>
> - All RDMA operations are converted to KDBR module calls, which perform
>   the actual transfer between VMs or, in the future,
>   will utilize a RoCE device (either physical or soft) to
>   communicate with another host.
>
> Roadmap (out of order)
> ==
> - Utilize the RoCE host driver in order to support peers on external hosts.
> - Re-use the code for a virtio-based device.
>
> Any ideas, comments or suggestions would be highly appreciated.
>
> Thanks,
> Yuval Shaia & Marcel Apfelbaum
>
> Signed-off-by: Yuval Shaia
> (Mainly design, coding was done by Yuval)
> Signed-off-by: Marcel Apfelbaum
> ---
>  hw/net/Makefile.objs            |   5 +
>  hw/net/pvrdma/kdbr.h            | 104 +++
>  hw/net/pvrdma/pvrdma-uapi.h     | 261
>  hw/net/pvrdma/pvrdma.h          | 155 ++
>  hw/net/pvrdma/pvrdma_cmd.c      | 322 +++
>  hw/net/pvrdma/pvrdma_defs.h     | 301 ++
>  hw/net/pvrdma/pvrdma_dev_api.h  | 342
>  hw/net/pvrdma/pvrdma_ib_verbs.h | 469
>  hw/net/pvrdma/pvrdma_kdbr.c     | 395
>  hw/net/pvrdma/pvrdma_kdbr.h     |  53
>  hw/net/pvrdma/pvrdma_main.c     | 667
>  hw/net/pvrdma/pvrdma_qp_ops.c   | 174 +++
>  hw/net/pvrdma/pvrdma_qp_ops.h   |  25 ++
>  hw/net/pvrdma/pvrdma_ring.c     | 127
>  hw/net/pvrdma/pvrdma_ring.h     |  43 +++
>  hw/net/pvrdma/pvrdma_rm.c       | 529 +++
>  hw/net/pvrdma/pvrdma_rm.h       | 214 +
>  hw/net/pvrdma/pvrdma_types.h    |  37 +++
>  hw/net/pvrdma/pvrdma_utils.c    |  36 +++
>  hw/net/pvrdma/pvrdma_utils.h    |  49 +++
>  include/hw/pci/pci_ids.h        |   3 +
>  21 files changed, 4311 insertions(+)
>  create mode 100644 hw/net/pvrdma/kdbr.h
>  create mode 100644 hw/net/pvrdma/pvrdma-uapi.h
>  create mode 100644 hw/net/pvrdma/pvrdma.h
>  create mode 100644 hw/net/pvrdma/pvrdma_cmd.c
>  create mode 100644 hw/net/pvrdma/pvrdma_defs.h
>  create mode 100644 hw/net/pvrdma/pvrdma_dev_api.h
>  create mode 100644 hw/net/pvrdma/pvrdma_ib_verbs.h
>  create mode 100644 hw/net/pvrdma/pvrdma_kdbr.c
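As a rough illustration of the design notes quoted above (three BARs, three MSI-X vectors, and RDMA resources tracked inside the device rather than tied to physical hardware), here is a minimal standalone C sketch. It is not taken from the patch: every identifier (sketch_*) is hypothetical, the order of the MSI-X vectors is an assumption, and the fixed-size handle table merely stands in for whatever resource manager the real device uses.

```c
#include <assert.h>
#include <stddef.h>

/* BAR numbering as listed in the design notes. */
enum sketch_bar {
    SKETCH_BAR_MSIX = 0,   /* MSI-X */
    SKETCH_BAR_REGS = 1,   /* configuration registers */
    SKETCH_BAR_UAR  = 2,   /* UAR: driver passes HW commands here */
    SKETCH_BAR_COUNT
};

/* The three MSI-X vectors; their order here is an assumption. */
enum sketch_vector {
    SKETCH_VEC_CMD_RING = 0,
    SKETCH_VEC_ASYNC_EVENT,
    SKETCH_VEC_COMPLETION,
    SKETCH_VEC_COUNT
};

/* Resource kinds the device manages internally. */
enum sketch_res_type { SKETCH_RES_PD, SKETCH_RES_CQ, SKETCH_RES_QP };

#define SKETCH_MAX_RES 8   /* arbitrary table size for the sketch */

struct sketch_res {
    int in_use;
    enum sketch_res_type type;
};

/* Handles are indices into this table, independent of any host device. */
struct sketch_res_table {
    struct sketch_res slots[SKETCH_MAX_RES];
};

/* Allocate the lowest free handle; returns -1 when the table is full. */
int sketch_res_alloc(struct sketch_res_table *t, enum sketch_res_type type)
{
    assert(t != NULL);
    for (int i = 0; i < SKETCH_MAX_RES; i++) {
        if (!t->slots[i].in_use) {
            t->slots[i].in_use = 1;
            t->slots[i].type = type;
            return i;
        }
    }
    return -1;
}

/* Release a handle; returns 0 on success, -1 if it was not allocated. */
int sketch_res_free(struct sketch_res_table *t, int handle)
{
    assert(t != NULL);
    if (handle < 0 || handle >= SKETCH_MAX_RES || !t->slots[handle].in_use)
        return -1;
    t->slots[handle].in_use = 0;
    return 0;
}
```

The point of the table is only that a guest-visible PD, CQ or QP number can be handed out and recycled by the emulated device itself, so nothing in the guest's view depends on which backend (KDBR today, possibly a RoCE device later) carries the data.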
[Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
From: Yuval Shaia

Hi,

General description
===
This is a very early RFC of a new RoCE emulated device
that enables guests to use the RDMA stack without having
real hardware in the host.

The current implementation supports only VM to VM communication
on the same host.
Down the road we plan to support inter-machine communication
by utilizing physical RoCE devices or Soft RoCE.

The goals are:
- Reach fast and secure lossless inter-VM data exchange.
- Support remote VMs or bare metal machines.
- Allow VM migration.
- Do not require pinning all VM memory.

Objective
=
Have a QEMU implementation of the PVRDMA device. We aim to do so without
any change in the PVRDMA guest driver, which is already merged into the
upstream kernel.

RFC status
===
The project is in early development stages and supports
only basic send/receive operations.

We present it so we can get feedback on design and
feature demands, and receive comments from the
community pointing us in the "right" direction.

What does work:
- Tested with a basic unit test:
  - https://github.com/yuvalshaia/kibpingpong
  It works fine with two devices on a single VM, but has
  some issues between two VMs on the same host.

Design
==
- Follows the behavior of VMware's pvrdma device, but is not tightly
  coupled with it, and most of the code can be reused if we decide to
  continue to a virtio-based RDMA device.

- It exposes 3 BARs:
    BAR 0 - MSI-X, uses 3 vectors for command ring, async events and
            completions
    BAR 1 - Configuration registers
    BAR 2 - UAR, used to pass HW commands from the driver.

- The device performs internal management of the RDMA
  resources (PDs, CQs, QPs, ...), meaning the objects
  are not directly coupled to a physical RDMA device's resources.

- As backend, the pvrdma device uses KDBR, a new kernel module which
  is also in RFC phase; read more on the linux-rdma list:
  - https://www.spinics.net/lists/linux-rdma/msg47951.html

- All RDMA operations are converted to KDBR module calls, which perform
  the actual transfer between VMs or, in the future,
  will utilize a RoCE device (either physical or soft) to
  communicate with another host.

Roadmap (out of order)
==
- Utilize the RoCE host driver in order to support peers on external hosts.
- Re-use the code for a virtio-based device.

Any ideas, comments or suggestions would be highly appreciated.

Thanks,
Yuval Shaia & Marcel Apfelbaum

Signed-off-by: Yuval Shaia
(Mainly design, coding was done by Yuval)
Signed-off-by: Marcel Apfelbaum
---
 hw/net/Makefile.objs            |   5 +
 hw/net/pvrdma/kdbr.h            | 104 +++
 hw/net/pvrdma/pvrdma-uapi.h     | 261
 hw/net/pvrdma/pvrdma.h          | 155 ++
 hw/net/pvrdma/pvrdma_cmd.c      | 322 +++
 hw/net/pvrdma/pvrdma_defs.h     | 301 ++
 hw/net/pvrdma/pvrdma_dev_api.h  | 342
 hw/net/pvrdma/pvrdma_ib_verbs.h | 469
 hw/net/pvrdma/pvrdma_kdbr.c     | 395
 hw/net/pvrdma/pvrdma_kdbr.h     |  53
 hw/net/pvrdma/pvrdma_main.c     | 667
 hw/net/pvrdma/pvrdma_qp_ops.c   | 174 +++
 hw/net/pvrdma/pvrdma_qp_ops.h   |  25 ++
 hw/net/pvrdma/pvrdma_ring.c     | 127
 hw/net/pvrdma/pvrdma_ring.h     |  43 +++
 hw/net/pvrdma/pvrdma_rm.c       | 529 +++
 hw/net/pvrdma/pvrdma_rm.h       | 214 +
 hw/net/pvrdma/pvrdma_types.h    |  37 +++
 hw/net/pvrdma/pvrdma_utils.c    |  36 +++
 hw/net/pvrdma/pvrdma_utils.h    |  49 +++
 include/hw/pci/pci_ids.h        |   3 +
 21 files changed, 4311 insertions(+)
 create mode 100644 hw/net/pvrdma/kdbr.h
 create mode 100644 hw/net/pvrdma/pvrdma-uapi.h
 create mode 100644 hw/net/pvrdma/pvrdma.h
 create mode 100644 hw/net/pvrdma/pvrdma_cmd.c
 create mode 100644 hw/net/pvrdma/pvrdma_defs.h
 create mode 100644 hw/net/pvrdma/pvrdma_dev_api.h
 create mode 100644 hw/net/pvrdma/pvrdma_ib_verbs.h
 create mode 100644 hw/net/pvrdma/pvrdma_kdbr.c
 create mode 100644 hw/net/pvrdma/pvrdma_kdbr.h
 create mode 100644 hw/net/pvrdma/pvrdma_main.c
 create mode 100644 hw/net/pvrdma/pvrdma_qp_ops.c
 create mode 100644 hw/net/pvrdma/pvrdma_qp_ops.h
 create mode 100644 hw/net/pvrdma/pvrdma_ring.c
 create mode 100644 hw/net/pvrdma/pvrdma_ring.h
 create mode 100644 hw/net/pvrdma/pvrdma_rm.c
 create mode 100644 hw/net/pvrdma/pvrdma_rm.h
 create mode 100644 hw/net/pvrdma/pvrdma_types.h
 create mode 100644 hw/net/pvrdma/pvrdma_utils.c
 create mode 100644 hw/net/pvrdma/pvrdma_utils.h

diff --git a/hw/net/Makefile.objs b/hw/net/Makefile.objs
index 610ed3e..a962347 100644
--- a/hw/net/Makefile.objs
+++ b/hw/net/Makefile.objs
@@ -43,3 +43,8 @@