Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure
On Thu, 5 May 2016 13:23:11 -0700
Neo Jia wrote:

> > > I also noticed in another thread:
> > > -
> > > [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with
> > > iommu and without iommu
> > >
> > > Kirti did:
> > > 1. don't pin the pages in the map ioctl for the vGPU case.
> > > 2. export vfio_pin_pages and vfio_unpin_pages.
> > >
> > > Although their patches didn't show how these interfaces were used, I
> > > guess them can either use these interfaces to pin/unpin all of the
> > > guest memory, or pin/unpin memory on demand. So can I reuse their work
> > > to finish my #1? If the answer is yes, then I could change my plan and
> >
> > Yes, we would absolutely only want one vfio iommu backend doing this,
> > there's nothing device specific about it. We're looking at supporting
> > both modes of operation, fully pinned and pin-on-demand. NVIDIA vGPU
> > wants the on-demand approach while Intel vGPU wants to pin the entire
> > guest, at least for an initial solution. This iommu backend would need
> > to support both as determined by the mediated device backend.
>
> Right, we will add a new callback to mediated device backend interface for
> this purpose in v4 version patch.

Dear Neo:

Thanks for this information. What interests me most is the new vfio iommu
backend. Looking forward to your new patches. :>

>
> Thanks,
> Neo
>
> > >
> > > do:
> > > #1. Introduce a vfio_iommu_type1_ccw as the vfio iommu backend for ccw.
> > > When starting the guest, form the database.
> > >
> > > #2. In the driver of the ccw devices, when an I/O instruction was
> > > intercepted, call vfio_pin_pages (Kirti's version) to get the host
> > > physical address, then translate the ccw program for I/O operation.
> > >
> > > So which one is the right way to go?
> >
> > As above, I think we have a need to support both approaches in this new
> > iommu backend, it will be up to you to determine which is appropriate
> > for your devices and guest drivers. A fully pinned guest has a latency
> > advantage, but obviously there are numerous disadvantages for the
> > pinning itself. Pinning on-demand has overhead to setup each DMA
> > operations by the device but has a much smaller pinning footprint.

Dong Jia
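To make the trade-off between the two modes quoted above a bit more concrete, here is a toy C model. It is only a sketch and not kernel code: every name in it (`toy_backend`, `toy_map`, `toy_dma_begin`, `toy_dma_end`, `PIN_ALL`, `PIN_ON_DEMAND`) is invented for illustration, and a plain counter stands in for real page pinning. It assumes 4 KiB pages and a non-zero length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u   /* assumed page size for the toy model */

enum toy_pin_mode { PIN_ALL, PIN_ON_DEMAND };

/* Hypothetical backend state: counters only, nothing is really pinned. */
struct toy_backend {
    enum toy_pin_mode mode;
    size_t mapped_pages;   /* pages registered through "map" */
    size_t pinned_pages;   /* pages currently counted as pinned */
    size_t pin_calls;      /* pinning work done on the DMA path */
};

/* Number of pages touched by [addr, addr + len); len must be > 0. */
static size_t toy_pages(uint64_t addr, uint64_t len)
{
    uint64_t first = addr / TOY_PAGE_SIZE;
    uint64_t last = (addr + len - 1) / TOY_PAGE_SIZE;
    return (size_t)(last - first + 1);
}

/* "map": fully pinned mode pins right away, on-demand mode only records. */
static void toy_map(struct toy_backend *b, uint64_t iova, uint64_t len)
{
    size_t n = toy_pages(iova, len);
    b->mapped_pages += n;
    if (b->mode == PIN_ALL)
        b->pinned_pages += n;
}

/* DMA setup: returns how many pages were pinned for this request. */
static size_t toy_dma_begin(struct toy_backend *b, uint64_t iova, uint64_t len)
{
    if (b->mode == PIN_ALL)
        return 0;                 /* already pinned, no per-request work */
    size_t n = toy_pages(iova, len);
    b->pinned_pages += n;
    b->pin_calls++;
    return n;
}

/* DMA teardown: drop the pins taken by toy_dma_begin(). */
static void toy_dma_end(struct toy_backend *b, size_t n)
{
    b->pinned_pages -= n;
}
```

In this model the fully pinned backend pays once at map time and holds every page for as long as the mapping exists, while the on-demand backend pays on each request but only ever holds the pages of requests in flight.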
Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure
On Thu, 5 May 2016 13:19:45 -0600 Alex Williamson wrote: > [cc +Intel,NVIDIA] > > On Thu, 5 May 2016 18:29:08 +0800 > Dong Jia wrote: > > > On Wed, 4 May 2016 13:26:53 -0600 > > Alex Williamson wrote: > > > > > On Wed, 4 May 2016 17:26:29 +0800 > > > Dong Jia wrote: > > > > > > > On Fri, 29 Apr 2016 11:17:35 -0600 > > > > Alex Williamson wrote: > > > > > > > > Dear Alex: > > > > > > > > Thanks for the comments. > > > > > > > > [...] > > > > > > > > > > > > > > > > The user of vfio-ccw is not limited to Qemu, while Qemu is > > > > > > definitely a > > > > > > good example to get understand how these patches work. Here is a > > > > > > little > > > > > > bit more detail how an I/O request triggered by the Qemu guest will > > > > > > be > > > > > > handled (without error handling). > > > > > > > > > > > > Explanation: > > > > > > Q1-Q4: Qemu side process. > > > > > > K1-K6: Kernel side process. > > > > > > > > > > > > Q1. Intercept a ssch instruction. > > > > > > Q2. Translate the guest ccw program to a user space ccw program > > > > > > (u_ccwchain). > > > > > > > > > > Is this replacing guest physical address in the program with QEMU > > > > > virtual addresses? > > > > Yes. > > > > > > > > > > > > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb). > > > > > > K1. Copy from u_ccwchain to kernel (k_ccwchain). > > > > > > K2. Translate the user space ccw program to a kernel space ccw > > > > > > program, which becomes runnable for a real device. > > > > > > > > > > And here we translate and likely pin QEMU virtual address to physical > > > > > addresses to further modify the program sent into the channel? > > > > Yes. Exactly. > > > > > > > > > > > > > > > K3. With the necessary information contained in the orb passed > > > > > > in > > > > > > by Qemu, issue the k_ccwchain to the device, and wait event > > > > > > q > > > > > > for the I/O result. > > > > > > K4. Interrupt handler gets the I/O result, and wakes up the > > > > > > wait q. 
> > > > > > K5. CMD_REQUEST ioctl gets the I/O result, and uses the result > > > > > > to > > > > > > update the user space irb. > > > > > > K6. Copy irb and scsw back to user space. > > > > > > Q4. Update the irb for the guest. > > > > > > > > > > If the answers to my questions above are both yes, > > > > Yes, they are. > > > > > > > > > then this is really a mediated interface, not a direct assignment. > > > > Right. This is true. > > > > > > > > > We don't need an iommu > > > > > because we're policing and translating the program for the device > > > > > before it gets sent to hardware. I think there are better ways than > > > > > noiommu to handle such devices perhaps even with better performance > > > > > than this two-stage translation. In fact, I think the solution we > > > > > plan > > > > > to implement for vGPU support would work here. > > > > > > > > > > Like your device, a vGPU is mediated, we don't have IOMMU level > > > > > translation or isolation since a vGPU is largely a software construct, > > > > > but we do have software policing and translating how the GPU is > > > > > programmed. To do this we're creating a type1 compatible vfio iommu > > > > > backend that uses the existing map and unmap ioctls, but rather than > > > > > programming them into an IOMMU for a device, it simply stores the > > > > > translations for use by later requests. This means that a device > > > > > programmed in a VM with guest physical addresses can have the > > > > > vfio kernel convert that address to process virtual address, pin the > > > > > page and program the hardware with the host physical address in one > > > > > step. > > > > I've read through the mail threads those discuss how to add vGPU > > > > support in VFIO. I'm afraid that proposal could not be simply addressed > > > > to this case, especially if we want to make the vfio api completely > > > > compatible with the existing usage. 
> > > > > > > > AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and > > > > fixed range of address in the memory space for DMA operations. Any > > > > address inside this range will not be used for other purpose. Thus we > > > > can add memory listener on this range, and pin the pages for further > > > > use (DMA operation). And we can keep the pages pinned during the life > > > > cycle of the VM (not quite accurate, or I should say 'the target > > > > device'). > > > > > > That's not entirely accurate. Ignoring a guest IOMMU, current device > > > assignment pins all of guest memory, not just a dedicated, exclusive > > > range of it, in order to map it through the hardware IOMMU. That gives > > > the guest the ability to transparently perform DMA with the device > > > since the IOMMU maps the guest physical to host physical translations. > > Thanks for this explanation. > > > > I noticed in the Qem
Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure
[cc +Intel,NVIDIA] On Thu, 5 May 2016 18:29:08 +0800 Dong Jia wrote: > On Wed, 4 May 2016 13:26:53 -0600 > Alex Williamson wrote: > > > On Wed, 4 May 2016 17:26:29 +0800 > > Dong Jia wrote: > > > > > On Fri, 29 Apr 2016 11:17:35 -0600 > > > Alex Williamson wrote: > > > > > > Dear Alex: > > > > > > Thanks for the comments. > > > > > > [...] > > > > > > > > > > > > > The user of vfio-ccw is not limited to Qemu, while Qemu is definitely > > > > > a > > > > > good example to get understand how these patches work. Here is a > > > > > little > > > > > bit more detail how an I/O request triggered by the Qemu guest will be > > > > > handled (without error handling). > > > > > > > > > > Explanation: > > > > > Q1-Q4: Qemu side process. > > > > > K1-K6: Kernel side process. > > > > > > > > > > Q1. Intercept a ssch instruction. > > > > > Q2. Translate the guest ccw program to a user space ccw program > > > > > (u_ccwchain). > > > > > > > > Is this replacing guest physical address in the program with QEMU > > > > virtual addresses? > > > Yes. > > > > > > > > > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb). > > > > > K1. Copy from u_ccwchain to kernel (k_ccwchain). > > > > > K2. Translate the user space ccw program to a kernel space ccw > > > > > program, which becomes runnable for a real device. > > > > > > > > And here we translate and likely pin QEMU virtual address to physical > > > > addresses to further modify the program sent into the channel? > > > Yes. Exactly. > > > > > > > > > > > > K3. With the necessary information contained in the orb passed in > > > > > by Qemu, issue the k_ccwchain to the device, and wait event q > > > > > for the I/O result. > > > > > K4. Interrupt handler gets the I/O result, and wakes up the wait > > > > > q. > > > > > K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to > > > > > update the user space irb. > > > > > K6. Copy irb and scsw back to user space. > > > > > Q4. Update the irb for the guest. 
> > > > > > > > If the answers to my questions above are both yes, > > > Yes, they are. > > > > > > > then this is really a mediated interface, not a direct assignment. > > > Right. This is true. > > > > > > > We don't need an iommu > > > > because we're policing and translating the program for the device > > > > before it gets sent to hardware. I think there are better ways than > > > > noiommu to handle such devices perhaps even with better performance > > > > than this two-stage translation. In fact, I think the solution we plan > > > > to implement for vGPU support would work here. > > > > > > > > Like your device, a vGPU is mediated, we don't have IOMMU level > > > > translation or isolation since a vGPU is largely a software construct, > > > > but we do have software policing and translating how the GPU is > > > > programmed. To do this we're creating a type1 compatible vfio iommu > > > > backend that uses the existing map and unmap ioctls, but rather than > > > > programming them into an IOMMU for a device, it simply stores the > > > > translations for use by later requests. This means that a device > > > > programmed in a VM with guest physical addresses can have the > > > > vfio kernel convert that address to process virtual address, pin the > > > > page and program the hardware with the host physical address in one > > > > step. > > > I've read through the mail threads those discuss how to add vGPU > > > support in VFIO. I'm afraid that proposal could not be simply addressed > > > to this case, especially if we want to make the vfio api completely > > > compatible with the existing usage. > > > > > > AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and > > > fixed range of address in the memory space for DMA operations. Any > > > address inside this range will not be used for other purpose. Thus we > > > can add memory listener on this range, and pin the pages for further > > > use (DMA operation). 
And we can keep the pages pinned during the life > > > cycle of the VM (not quite accurate, or I should say 'the target > > > device'). > > > > That's not entirely accurate. Ignoring a guest IOMMU, current device > > assignment pins all of guest memory, not just a dedicated, exclusive > > range of it, in order to map it through the hardware IOMMU. That gives > > the guest the ability to transparently perform DMA with the device > > since the IOMMU maps the guest physical to host physical translations. > Thanks for this explanation. > > I noticed in the Qemu part, when we tried to introduce vfio-pci to the > s390 architecture, we set the IOMMU width by calling > memory_region_add_subregion before initializing the address_space of > the PCI device, which will be registered with the vfio_memory_listener > later. The 'width' of the subregion is what I called the 'range' in the > former r
Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure
On Wed, 4 May 2016 13:26:53 -0600 Alex Williamson wrote: > On Wed, 4 May 2016 17:26:29 +0800 > Dong Jia wrote: > > > On Fri, 29 Apr 2016 11:17:35 -0600 > > Alex Williamson wrote: > > > > Dear Alex: > > > > Thanks for the comments. > > > > [...] > > > > > > > > > > The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a > > > > good example to get understand how these patches work. Here is a little > > > > bit more detail how an I/O request triggered by the Qemu guest will be > > > > handled (without error handling). > > > > > > > > Explanation: > > > > Q1-Q4: Qemu side process. > > > > K1-K6: Kernel side process. > > > > > > > > Q1. Intercept a ssch instruction. > > > > Q2. Translate the guest ccw program to a user space ccw program > > > > (u_ccwchain). > > > > > > Is this replacing guest physical address in the program with QEMU > > > virtual addresses? > > Yes. > > > > > > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb). > > > > K1. Copy from u_ccwchain to kernel (k_ccwchain). > > > > K2. Translate the user space ccw program to a kernel space ccw > > > > program, which becomes runnable for a real device. > > > > > > And here we translate and likely pin QEMU virtual address to physical > > > addresses to further modify the program sent into the channel? > > Yes. Exactly. > > > > > > > > > K3. With the necessary information contained in the orb passed in > > > > by Qemu, issue the k_ccwchain to the device, and wait event q > > > > for the I/O result. > > > > K4. Interrupt handler gets the I/O result, and wakes up the wait q. > > > > K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to > > > > update the user space irb. > > > > K6. Copy irb and scsw back to user space. > > > > Q4. Update the irb for the guest. > > > > > > If the answers to my questions above are both yes, > > Yes, they are. > > > > > then this is really a mediated interface, not a direct assignment. > > Right. This is true. 
> > > > > We don't need an iommu > > > because we're policing and translating the program for the device > > > before it gets sent to hardware. I think there are better ways than > > > noiommu to handle such devices perhaps even with better performance > > > than this two-stage translation. In fact, I think the solution we plan > > > to implement for vGPU support would work here. > > > > > > Like your device, a vGPU is mediated, we don't have IOMMU level > > > translation or isolation since a vGPU is largely a software construct, > > > but we do have software policing and translating how the GPU is > > > programmed. To do this we're creating a type1 compatible vfio iommu > > > backend that uses the existing map and unmap ioctls, but rather than > > > programming them into an IOMMU for a device, it simply stores the > > > translations for use by later requests. This means that a device > > > programmed in a VM with guest physical addresses can have the > > > vfio kernel convert that address to process virtual address, pin the > > > page and program the hardware with the host physical address in one > > > step. > > I've read through the mail threads those discuss how to add vGPU > > support in VFIO. I'm afraid that proposal could not be simply addressed > > to this case, especially if we want to make the vfio api completely > > compatible with the existing usage. > > > > AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and > > fixed range of address in the memory space for DMA operations. Any > > address inside this range will not be used for other purpose. Thus we > > can add memory listener on this range, and pin the pages for further > > use (DMA operation). And we can keep the pages pinned during the life > > cycle of the VM (not quite accurate, or I should say 'the target > > device'). > > That's not entirely accurate. 
Ignoring a guest IOMMU, current device > assignment pins all of guest memory, not just a dedicated, exclusive > range of it, in order to map it through the hardware IOMMU. That gives > the guest the ability to transparently perform DMA with the device > since the IOMMU maps the guest physical to host physical translations. Thanks for this explanation. I noticed in the Qemu part, when we tried to introduce vfio-pci to the s390 architecture, we set the IOMMU width by calling memory_region_add_subregion before initializing the address_space of the PCI device, which will be registered with the vfio_memory_listener later. The 'width' of the subregion is what I called the 'range' in the former reply. The first reason we did that is, we know exactly the dma memory range, and we got the width by 'dma_addr_end - dma_addr_start'. The second reason we have to do that is, using the following statement will cause the initialization of the guest tremendously long: group = vfio_get_group(groupid, &address_space_memory); Because doing map o
Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure
On Wed, 4 May 2016 17:26:29 +0800
Dong Jia wrote:

> On Fri, 29 Apr 2016 11:17:35 -0600
> Alex Williamson wrote:
>
> Dear Alex:
>
> Thanks for the comments.
>
> [...]
>
> > >
> > > The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
> > > good example to get understand how these patches work. Here is a little
> > > bit more detail how an I/O request triggered by the Qemu guest will be
> > > handled (without error handling).
> > >
> > > Explanation:
> > > Q1-Q4: Qemu side process.
> > > K1-K6: Kernel side process.
> > >
> > > Q1. Intercept a ssch instruction.
> > > Q2. Translate the guest ccw program to a user space ccw program
> > > (u_ccwchain).
> >
> > Is this replacing guest physical address in the program with QEMU
> > virtual addresses?
> Yes.
>
> >
> > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > K2. Translate the user space ccw program to a kernel space ccw
> > > program, which becomes runnable for a real device.
> >
> > And here we translate and likely pin QEMU virtual address to physical
> > addresses to further modify the program sent into the channel?
> Yes. Exactly.
>
> >
> > > K3. With the necessary information contained in the orb passed in
> > > by Qemu, issue the k_ccwchain to the device, and wait event q
> > > for the I/O result.
> > > K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > update the user space irb.
> > > K6. Copy irb and scsw back to user space.
> > > Q4. Update the irb for the guest.
> >
> > If the answers to my questions above are both yes,
> Yes, they are.
>
> > then this is really a mediated interface, not a direct assignment.
> Right. This is true.
>
> > We don't need an iommu
> > because we're policing and translating the program for the device
> > before it gets sent to hardware. I think there are better ways than
> > noiommu to handle such devices perhaps even with better performance
> > than this two-stage translation. In fact, I think the solution we plan
> > to implement for vGPU support would work here.
> >
> > Like your device, a vGPU is mediated, we don't have IOMMU level
> > translation or isolation since a vGPU is largely a software construct,
> > but we do have software policing and translating how the GPU is
> > programmed. To do this we're creating a type1 compatible vfio iommu
> > backend that uses the existing map and unmap ioctls, but rather than
> > programming them into an IOMMU for a device, it simply stores the
> > translations for use by later requests. This means that a device
> > programmed in a VM with guest physical addresses can have the
> > vfio kernel convert that address to process virtual address, pin the
> > page and program the hardware with the host physical address in one
> > step.
> I've read through the mail threads those discuss how to add vGPU
> support in VFIO. I'm afraid that proposal could not be simply addressed
> to this case, especially if we want to make the vfio api completely
> compatible with the existing usage.
>
> AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
> fixed range of address in the memory space for DMA operations. Any
> address inside this range will not be used for other purpose. Thus we
> can add memory listener on this range, and pin the pages for further
> use (DMA operation). And we can keep the pages pinned during the life
> cycle of the VM (not quite accurate, or I should say 'the target
> device').

That's not entirely accurate. Ignoring a guest IOMMU, current device
assignment pins all of guest memory, not just a dedicated, exclusive
range of it, in order to map it through the hardware IOMMU. That gives
the guest the ability to transparently perform DMA with the device
since the IOMMU maps the guest physical to host physical translations.

That's not what vGPU is about. In the case of vGPU the proposal is to
use the same QEMU vfio MemoryListener API, but only for the purpose of
having an accurate database of guest physical to process virtual
translations for the VM. In your above example, this means step Q2 is
eliminated because step K2 has the information to perform both a guest
physical to process virtual translation and to pin the page to get a
host physical address. So you'd only need to modify the program once.

> Well, a Subchannel Device does not have such a range of address. The
> device driver simply calls kalloc() to get a piece of memory, and
> assembles a ccw program with it, before issuing the ccw program to
> perform an I/O operation. So the Qemu memory listener can't tell if an
> address is for an I/O operation, or for whatever else. And this makes
> the memory listener unnecessary for our case.

It's only unnecessary because QEMU is manipulating the program to repla
Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure
On Fri, 29 Apr 2016 11:17:35 -0600
Alex Williamson wrote:

Dear Alex:

Thanks for the comments.

[...]

> >
> > The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
> > good example to get understand how these patches work. Here is a little
> > bit more detail how an I/O request triggered by the Qemu guest will be
> > handled (without error handling).
> >
> > Explanation:
> > Q1-Q4: Qemu side process.
> > K1-K6: Kernel side process.
> >
> > Q1. Intercept a ssch instruction.
> > Q2. Translate the guest ccw program to a user space ccw program
> > (u_ccwchain).
>
> Is this replacing guest physical address in the program with QEMU
> virtual addresses?
Yes.

>
> > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > K2. Translate the user space ccw program to a kernel space ccw
> > program, which becomes runnable for a real device.
>
> And here we translate and likely pin QEMU virtual address to physical
> addresses to further modify the program sent into the channel?
Yes. Exactly.

>
> > K3. With the necessary information contained in the orb passed in
> > by Qemu, issue the k_ccwchain to the device, and wait event q
> > for the I/O result.
> > K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > update the user space irb.
> > K6. Copy irb and scsw back to user space.
> > Q4. Update the irb for the guest.
>
> If the answers to my questions above are both yes,
Yes, they are.

> then this is really a mediated interface, not a direct assignment.
Right. This is true.

> We don't need an iommu
> because we're policing and translating the program for the device
> before it gets sent to hardware. I think there are better ways than
> noiommu to handle such devices perhaps even with better performance
> than this two-stage translation. In fact, I think the solution we plan
> to implement for vGPU support would work here.
>
> Like your device, a vGPU is mediated, we don't have IOMMU level
> translation or isolation since a vGPU is largely a software construct,
> but we do have software policing and translating how the GPU is
> programmed. To do this we're creating a type1 compatible vfio iommu
> backend that uses the existing map and unmap ioctls, but rather than
> programming them into an IOMMU for a device, it simply stores the
> translations for use by later requests. This means that a device
> programmed in a VM with guest physical addresses can have the
> vfio kernel convert that address to process virtual address, pin the
> page and program the hardware with the host physical address in one
> step.
I've read through the mail threads that discuss how to add vGPU
support in VFIO. I'm afraid that proposal could not simply be applied
to this case, especially if we want to make the vfio api completely
compatible with the existing usage.

AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
fixed range of addresses in the memory space for DMA operations. Any
address inside this range will not be used for other purposes. Thus we
can add a memory listener on this range, and pin the pages for further
use (DMA operation). And we can keep the pages pinned during the life
cycle of the VM (not quite accurate, or I should say 'the target
device').

Well, a Subchannel Device does not have such a range of addresses. The
device driver simply calls kmalloc() to get a piece of memory, and
assembles a ccw program with it, before issuing the ccw program to
perform an I/O operation. So the Qemu memory listener can't tell if an
address is for an I/O operation, or for whatever else. And this makes
the memory listener unnecessary for our case.

The only point at which we know we should pin pages for I/O is when an
I/O instruction (e.g. ssch) is intercepted. At this point, we know the
address contained in the parameter of the ssch instruction points to a
piece of memory that contains a ccw program. Then we do:
pin the pages --> convert the ccw program --> perform the I/O -->
return the I/O result --> and unpin the pages.

>
> This architecture also makes the vfio api completely compatible with
> existing usage without tainting QEMU with support for noiommu devices.
> I would strongly suggest following a similar approach and dropping the
> noiommu interface. We really do not need to confuse users with noiommu
> devices that are safe and assignable and devices where noiommu should
> warn them to stay away. Thanks,
Understood. But as explained above, even if we introduce a new vfio
iommu backend, what it does would probably look quite like what the
no-iommu backend does. Any ideas about this?

>
> Alex
>

Dong Jia
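The "pin the pages --> convert the ccw program --> perform the I/O --> unpin the pages" sequence described above, combined with the stored-translation idea from the quoted vGPU proposal, could look roughly like the following toy C sketch. Everything here is an assumption made for illustration: `toy_db`, `toy_mapping`, `toy_find`, `toy_handle_io` and `toy_record_io` are invented names, the "pin" is just a counter, and the translation stops at a process virtual address rather than a host physical one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One stored translation: guest physical range -> process virtual range. */
struct toy_mapping {
    uint64_t gpa;    /* guest physical start */
    uint64_t hva;    /* process (host) virtual start */
    uint64_t size;   /* length of the range in bytes */
    unsigned pins;   /* outstanding pins; a stand-in for real pinning */
};

/* Hypothetical translation database filled by earlier "map" requests. */
struct toy_db {
    struct toy_mapping *maps;
    size_t count;
};

/* Find the mapping that fully contains [gpa, gpa + len), or NULL. */
static struct toy_mapping *toy_find(struct toy_db *db, uint64_t gpa,
                                    uint64_t len)
{
    for (size_t i = 0; i < db->count; i++) {
        struct toy_mapping *m = &db->maps[i];
        if (gpa >= m->gpa && len <= m->size &&
            gpa - m->gpa <= m->size - len)
            return m;
    }
    return NULL;
}

/*
 * Handle one intercepted request: pin, convert, perform the I/O, unpin.
 * Returns the callback's result, or -1 if the range is not mapped.
 */
static int toy_handle_io(struct toy_db *db, uint64_t gpa, uint64_t len,
                         int (*do_io)(uint64_t hva, uint64_t len))
{
    struct toy_mapping *m = toy_find(db, gpa, len);
    if (!m)
        return -1;
    m->pins++;                                 /* pin the pages */
    uint64_t hva = m->hva + (gpa - m->gpa);    /* convert the address */
    int ret = do_io(hva, len);                 /* perform the I/O */
    m->pins--;                                 /* unpin the pages */
    return ret;
}

/* Sample I/O callback for experiments: remembers the translated address. */
static uint64_t toy_last_hva;

static int toy_record_io(uint64_t hva, uint64_t len)
{
    (void)len;
    toy_last_hva = hva;
    return 0;
}
```

The point of the sketch is only that the lookup and the pin happen in one place at request time, so the program needs to be rewritten once rather than in two stages.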
Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure
On Fri, 29 Apr 2016 14:11:47 +0200 Dong Jia Shi wrote: > vfio: ccw: basic vfio-ccw infrastructure > > > Introduction > > > Here we describe the vfio support for Channel I/O devices (aka. CCW > devices) for Linux/s390. Motivation for vfio-ccw is to passthrough CCW > devices to a virtual machine, while vfio is the means. > > Different than other hardware architectures, s390 has defined a unified > I/O access method, which is so called Channel I/O. It has its own > access patterns: > - Channel programs run asynchronously on a separate (co)processor. > - The channel subsystem will access any memory designated by the caller > in the channel program directly, i.e. there is no iommu involved. > Thus when we introduce vfio support for these devices, we realize it > with a no-iommu vfio implementation. > > This document does not intend to explain the s390 hardware architecture > in every detail. More information/reference could be found here: > - A good start to know Channel I/O in general: > https://en.wikipedia.org/wiki/Channel_I/O > - s390 architecture: > s390 Principles of Operation manual (IBM Form. No. SA22-7832) > - The existing Qemu code which implements a simple emulated channel > subsystem could also be a good reference. It makes it easier to > follow the flow. > qemu/hw/s390x/css.c > > Motivation of vfio-ccw > -- > > Currently, a guest virtualized via qemu/kvm on s390 only sees > paravirtualized virtio devices via the "Virtio Over Channel I/O > (virtio-ccw)" transport. This makes virtio devices discoverable via > standard operating system algorithms for handling channel devices. > > However this is not enough. On s390 for the majority of devices, which > use the standard Channel I/O based mechanism, we also need to provide > the functionality of passing through them to a Qemu virtual machine. > This includes devices that don't have a virtio counterpart (e.g. tape > drives) or that have specific characteristics which guests want to > exploit. 
>
> For passing a device to a guest, we want to use the same interface as
> everybody else, namely vfio. Thus, we would like to introduce vfio
> support for channel devices. And we would like to name this new vfio
> device "vfio-ccw".
>
> Access patterns of CCW devices
> ------------------------------
>
> The s390 architecture implements a so-called channel subsystem that
> provides a unified view of the devices physically attached to the
> system. Though the s390 hardware platform knows about a huge variety of
> different peripheral attachments like disk devices (aka. DASDs), tapes,
> communication controllers, etc., they can all be accessed by a
> well-defined access method and they present I/O completion in a unified
> way: I/O interruptions.
>
> All I/O requires the use of channel command words (CCWs). A CCW is an
> instruction to a specialized I/O channel processor. A channel program
> is a sequence of CCWs which are executed by the I/O channel subsystem.
> To issue a CCW program to the channel subsystem, it is required to
> build an operation request block (ORB), which specifies the format of
> the CCWs and other control information to the system. The
> operating system signals the I/O channel subsystem to begin executing
> the channel program with an SSCH (start sub-channel) instruction. The
> central processor is then free to proceed with non-I/O instructions
> until interrupted. The I/O completion result is received by the
> interrupt handler in the form of an interrupt response block (IRB).
>
> Back to vfio-ccw, in short:
> - ORBs and CCW programs are built in user space (with virtual
>   addresses).
> - ORBs and CCW programs are passed to the kernel.
> - The kernel translates virtual addresses to real addresses and starts
>   the I/O by issuing a privileged Channel I/O instruction (e.g. SSCH).
> - CCW programs run asynchronously on a separate processor.
> - I/O completion is signaled to the host with I/O interruptions, and
>   the result is copied as an IRB to user space.
>
>
> vfio-ccw patches overview
> -------------------------
>
> It follows that we need vfio-ccw with a vfio no-iommu mode. For now,
> our patches are based on the current no-iommu implementation. That is
> a good starting point for the code review of vfio-ccw. Note that the
> implementation is far from complete, but we'd like to get feedback
> on the general architecture.
>
> The current no-iommu implementation treats vfio-ccw as unsupported
> and taints the kernel. This should not be the case for
> vfio-ccw. But whether the end result will be using the existing
> no-iommu code or a new module would be an implementation detail.
>
> * CCW translation APIs
> - Description:
>   These introduce a group of APIs (prefixed with 'ccwchain_') to do CCW
>   translation. The CCWs passed in by a user space program are organized
>   in a buffer, with their user virtual memory addre
[Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure
vfio: ccw: basic vfio-ccw infrastructure
========================================

Introduction
------------

Here we describe the vfio support for Channel I/O devices (aka. CCW
devices) for Linux/s390. The motivation for vfio-ccw is to pass CCW
devices through to a virtual machine, with vfio as the means.

Unlike other hardware architectures, s390 defines a unified I/O access
method, the so-called Channel I/O. It has its own access patterns:
- Channel programs run asynchronously on a separate (co)processor.
- The channel subsystem will access any memory designated by the caller
  in the channel program directly, i.e. there is no iommu involved.
Thus when we introduce vfio support for these devices, we realize it
with a no-iommu vfio implementation.

This document does not intend to explain the s390 hardware architecture
in every detail. More information and references can be found here:
- A good starting point for Channel I/O in general:
  https://en.wikipedia.org/wiki/Channel_I/O
- s390 architecture:
  s390 Principles of Operation manual (IBM Form. No. SA22-7832)
- The existing Qemu code which implements a simple emulated channel
  subsystem could also be a good reference. It makes it easier to
  follow the flow.
  qemu/hw/s390x/css.c

Motivation of vfio-ccw
----------------------

Currently, a guest virtualized via qemu/kvm on s390 only sees
paravirtualized virtio devices via the "Virtio Over Channel I/O
(virtio-ccw)" transport. This makes virtio devices discoverable via
standard operating system algorithms for handling channel devices.

However, this is not enough. For the majority of devices on s390, which
use the standard Channel I/O based mechanism, we also need to provide
a way to pass them through to a Qemu virtual machine.
This includes devices that don't have a virtio counterpart (e.g. tape
drives) or that have specific characteristics which guests want to
exploit.

For passing a device to a guest, we want to use the same interface as
everybody else, namely vfio. Thus, we would like to introduce vfio
support for channel devices. And we would like to name this new vfio
device "vfio-ccw".

Access patterns of CCW devices
------------------------------

The s390 architecture implements a so-called channel subsystem that
provides a unified view of the devices physically attached to the
system. Though the s390 hardware platform knows about a huge variety of
different peripheral attachments like disk devices (aka. DASDs), tapes,
communication controllers, etc., they can all be accessed by a
well-defined access method and they present I/O completion in a unified
way: I/O interruptions.

All I/O requires the use of channel command words (CCWs). A CCW is an
instruction to a specialized I/O channel processor. A channel program
is a sequence of CCWs which are executed by the I/O channel subsystem.
To issue a CCW program to the channel subsystem, it is required to
build an operation request block (ORB), which specifies the format of
the CCWs and other control information to the system. The
operating system signals the I/O channel subsystem to begin executing
the channel program with an SSCH (start sub-channel) instruction. The
central processor is then free to proceed with non-I/O instructions
until interrupted. The I/O completion result is received by the
interrupt handler in the form of an interrupt response block (IRB).

Back to vfio-ccw, in short:
- ORBs and CCW programs are built in user space (with virtual
  addresses).
- ORBs and CCW programs are passed to the kernel.
- The kernel translates virtual addresses to real addresses and starts
  the I/O by issuing a privileged Channel I/O instruction (e.g. SSCH).
- CCW programs run asynchronously on a separate processor.
- I/O completion is signaled to the host with I/O interruptions, and
  the result is copied as an IRB to user space.


vfio-ccw patches overview
-------------------------

It follows that we need vfio-ccw with a vfio no-iommu mode. For now,
our patches are based on the current no-iommu implementation. That is
a good starting point for the code review of vfio-ccw. Note that the
implementation is far from complete, but we'd like to get feedback
on the general architecture.

The current no-iommu implementation treats vfio-ccw as unsupported
and taints the kernel. This should not be the case for
vfio-ccw. But whether the end result will be using the existing
no-iommu code or a new module would be an implementation detail.

* CCW translation APIs
- Description:
  These introduce a group of APIs (prefixed with 'ccwchain_') to do CCW
  translation. The CCWs passed in by a user space program are organized
  in a buffer, with their user virtual memory addresses. These APIs will
  copy the CCWs into the kernel space, and assemble a runnable kernel
  CCW program by replacing the user virtual addresses with their
  corresponding physical addresses.
- Patches:
  vfio: ccw: introduce page array interfaces
  vfio: ccw: in