Re: RFC: vfio API changes needed for powerpc (v3)
On Tue, Apr 09, 2013 at 01:22:15AM +0000, Yoder Stuart-B08248 wrote:
> > What happens if a normal unmap call is done on the MSI iova?  Do we
> > need a separate unmap?
>
> I was thinking a normal unmap on an MSI window would be an error... but
> I'm not set on that.  I put the MSI unmap there to make things symmetric;
> a normal unmap would work as well, and then we could drop the MSI unmap.

Hmm, this API semantic isn't very clean. When you explicitly map the MSI banks, a clean API would also allow you to unmap them. But that is not possible in your design, because the kernel is responsible for mapping MSIs and you can't unmap an MSI bank that is in use by the kernel.

So since the kernel owns the MSI setup anyway, it should also take care of mapping the MSI banks. What is the reason not to let the kernel allocate the MSI banks top-down from the end of the DMA window space? Just let userspace know (or even set, if needed) in advance how many of the windows it configures the kernel will take for mapping MSI banks, and you are fine, no?

Joerg
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: RFC: vfio API changes needed for powerpc (v3)
-----Original Message-----
From: Joerg Roedel [mailto:j...@8bytes.org]
Sent: Thursday, April 11, 2013 7:57 AM
To: Yoder Stuart-B08248
Cc: Wood Scott-B07421; kvm@vger.kernel.org; qemu-de...@nongnu.org; io...@lists.linux-foundation.org; ag...@suse.de; Bhushan Bharat-R65777
Subject: Re: RFC: vfio API changes needed for powerpc (v3)

On Tue, Apr 09, 2013 at 01:22:15AM +0000, Yoder Stuart-B08248 wrote:
> > > What happens if a normal unmap call is done on the MSI iova?  Do we
> > > need a separate unmap?
> >
> > I was thinking a normal unmap on an MSI window would be an error... but
> > I'm not set on that.  I put the MSI unmap there to make things symmetric;
> > a normal unmap would work as well, and then we could drop the MSI unmap.
>
> Hmm, this API semantic isn't very clean.  When you explicitly map the
> MSI banks, a clean API would also allow you to unmap them.  But that is
> not possible in your design, because the kernel is responsible for
> mapping MSIs and you can't unmap an MSI bank that is in use by the
> kernel.

The mapping that the vfio API creates is specific only to the assigned device, so it can be unmapped without affecting any other device... there is nothing else in the kernel making the mapping "in use". Another device in use by the kernel using the same MSI bank would have its own independent mapping. So these mappings are not global but are device specific... just like normal DMA memory mappings.

> So since the kernel owns the MSI setup anyway, it should also take care
> of mapping the MSI banks.  What is the reason not to let the kernel
> allocate the MSI banks top-down from the end of the DMA window space?
> Just let userspace know (or even set, if needed) in advance how many of
> the windows it configures the kernel will take for mapping MSI banks,
> and you are fine, no?

As designed, the API lets user space determine the number of windows needed for MSIs, so they can be set.
The only difference between what we've proposed and what you described, I think, is that the proposal allows user space to set _which_ windows are used for which MSI banks.

Stuart
Re: RFC: vfio API changes needed for powerpc (v3)
On 04/11/2013 07:56:59 AM, Joerg Roedel wrote:
> On Tue, Apr 09, 2013 at 01:22:15AM +0000, Yoder Stuart-B08248 wrote:
> > > What happens if a normal unmap call is done on the MSI iova?  Do we
> > > need a separate unmap?
> >
> > I was thinking a normal unmap on an MSI window would be an error... but
> > I'm not set on that.  I put the MSI unmap there to make things symmetric;
> > a normal unmap would work as well, and then we could drop the MSI unmap.
>
> Hmm, this API semantic isn't very clean.  When you explicitly map the
> MSI banks, a clean API would also allow you to unmap them.  But that is
> not possible in your design, because the kernel is responsible for
> mapping MSIs and you can't unmap an MSI bank that is in use by the
> kernel.

Why is it not possible to unmap them? Once they've been mapped, they're just like any other IOMMU mapping. If the user breaks MSI for their own devices by unmapping the MSI page, that's their problem.

> So since the kernel owns the MSI setup anyway, it should also take care
> of mapping the MSI banks.  What is the reason not to let the kernel
> allocate the MSI banks top-down from the end of the DMA window space?

It's less flexible, and possibly more complicated.

-Scott
RE: RFC: vfio API changes needed for powerpc (v3)
So now the sequence would be something like:

1) VFIO_GROUP_SET_CONTAINER               // add groups to the container
2) VFIO_SET_IOMMU(VFIO_FSL_PAMU)          // set iommu model
3) count = VFIO_IOMMU_GET_MSI_BANK_COUNT  // returns max # of MSI banks
4) VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)     // set overall aperture
5) VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)      // set # of windows, including MSI banks
6) for (int i = 0; i < count; i++)
       VFIO_IOMMU_PAMU_MAP_MSI_BANK()     // map the MSI banks; do not enable the aperture here
7) memory listener calls VFIO_IOMMU_MAP_DMA  // map the guest's memory
   --- kernel enables the aperture here on the first VFIO_IOMMU_MAP_DMA
8) VFIO_DEVICE_SET_IRQS
   --- VFIO in the kernel makes the pci_enable_msix()/pci_enable_msi_block() calls, which set the actual MSI addr/data in the physical device.
   --- Since the address set by the above APIs is not what we want:
       - if using MSI-X, VFIO will update the address in the MSI-X table
       - if using MSI, VFIO will update the MSI address in PCI configuration space

Thanks
-Bharat

-----Original Message-----
From: Yoder Stuart-B08248
Sent: Friday, April 05, 2013 3:40 AM
To: Alex Williamson
Cc: Wood Scott-B07421; ag...@suse.de; Bhushan Bharat-R65777; Sethi Varun-B16395; kvm@vger.kernel.org; qemu-de...@nongnu.org; io...@lists.linux-foundation.org
Subject: RFC: vfio API changes needed for powerpc (v3)

-v3 updates:
  -made vfio_pamu_attr a union, added flags
  -s/VFIO_PAMU_/VFIO_IOMMU_PAMU_/ for the ioctls to make it more clear which fd is being operated on
  -added flags to vfio_pamu_msi_bank_map/unmap
  -VFIO_PAMU_GET_MSI_BANK_COUNT now just returns a __u32, not a struct
  -fixed some typos

The Freescale PAMU is an aperture-based IOMMU with the following characteristics. Each device has an entry in a table in memory describing the iova->phys mapping.
The mapping has:
- an overall aperture that is power of 2 sized, and has a start iova that is naturally aligned
- 1 or more windows within the aperture
- number of windows must be power of 2, max is 256
- size of each window is determined by aperture size / # of windows
- iova of each window is determined by aperture start iova / # of windows
- the mapped region in each window can be different than the window size... the mapping must be a power of 2
- physical address of the mapping must be naturally aligned with the mapping size

These ioctls operate on the VFIO file descriptor (/dev/vfio/vfio).

/*
 * VFIO_IOMMU_PAMU_GET_ATTR
 *
 * Gets the iommu attributes for the current vfio container.  This
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets argsz and attribute.  The ioctl fills in
 * the provided struct vfio_pamu_attr based on the attribute
 * value that was set.
 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_attr {
	__u32	argsz;
	__u32	flags;		/* no flags currently */
	__u32	attribute;

	union {
		/* VFIO_ATTR_GEOMETRY */
		struct {
			__u64 aperture_start;	/* first addr that can be mapped */
			__u64 aperture_end;	/* last addr that can be mapped */
		} attr;

		/* VFIO_ATTR_WINDOWS */
		__u32 windows;		/* number of windows in the aperture
					 * (initially this will be the max number
					 * of windows that can be set) */

		/* VFIO_ATTR_PAMU_STASH */
		struct {
			__u32 cpu;	/* CPU number for stashing */
			__u32 cache;	/* cache ID for stashing */
		} stash;
	};
};
#define VFIO_IOMMU_PAMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_attr)

/*
 * VFIO_IOMMU_PAMU_SET_ATTR
 *
 * Sets the iommu attributes for the current vfio container.  This
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets struct vfio_pamu_attr, including argsz and attribute and
 * setting any fields that are valid for the attribute.
 * Return: 0 on success, -errno on failure
 */
#define VFIO_IOMMU_PAMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_attr)

/*
 * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT
 *
 * Returns the number of MSI banks for this platform.  This tells user space
 * how many aperture windows should be reserved for MSI banks when setting
 * the PAMU geometry and window count.
 * Return: __u32 bank count on success, -errno on failure
 */
#define VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT  _IO(VFIO_TYPE, VFIO_BASE + x, __u32)

/*
 * VFIO_IOMMU_PAMU_MAP_MSI_BANK
RE: RFC: vfio API changes needed for powerpc (v3)
-----Original Message-----
From: Wood Scott-B07421
Sent: Friday, April 05, 2013 5:17 PM
To: Yoder Stuart-B08248
Cc: Alex Williamson; Wood Scott-B07421; ag...@suse.de; Bhushan Bharat-R65777; Sethi Varun-B16395; kvm@vger.kernel.org; qemu-de...@nongnu.org; io...@lists.linux-foundation.org
Subject: Re: RFC: vfio API changes needed for powerpc (v3)

On 04/04/2013 05:10:27 PM, Yoder Stuart-B08248 wrote:
> > /*
> >  * VFIO_IOMMU_PAMU_UNMAP_MSI_BANK
> >  *
> >  * Unmaps the MSI bank at the specified iova.
> >  * Caller provides struct vfio_pamu_msi_bank_unmap with all fields set.
> >  * Operates on the VFIO file descriptor (/dev/vfio/vfio).
> >  * Return: 0 on success, -errno on failure
> >  */
> > struct vfio_pamu_msi_bank_unmap {
> > 	__u32	argsz;
> > 	__u32	flags;	/* no flags currently */
> > 	__u64	iova;	/* the iova to be unmapped */
> > };
> > #define VFIO_IOMMU_PAMU_UNMAP_MSI_BANK  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_msi_bank_unmap)
>
> What happens if a normal unmap call is done on the MSI iova?  Do we
> need a separate unmap?

I was thinking a normal unmap on an MSI window would be an error... but I'm not set on that. I put the MSI unmap there to make things symmetric; a normal unmap would work as well, and then we could drop the MSI unmap.

Stuart
Re: RFC: vfio API changes needed for powerpc (v3)
On 04/04/2013 05:10:27 PM, Yoder Stuart-B08248 wrote:
> /*
>  * VFIO_IOMMU_PAMU_UNMAP_MSI_BANK
>  *
>  * Unmaps the MSI bank at the specified iova.
>  * Caller provides struct vfio_pamu_msi_bank_unmap with all fields set.
>  * Operates on the VFIO file descriptor (/dev/vfio/vfio).
>  * Return: 0 on success, -errno on failure
>  */
> struct vfio_pamu_msi_bank_unmap {
> 	__u32	argsz;
> 	__u32	flags;	/* no flags currently */
> 	__u64	iova;	/* the iova to be unmapped */
> };
> #define VFIO_IOMMU_PAMU_UNMAP_MSI_BANK  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_msi_bank_unmap)

What happens if a normal unmap call is done on the MSI iova? Do we need a separate unmap?

-Scott
RFC: vfio API changes needed for powerpc (v2)
Based on the email thread over the last couple of days, I have below a more concrete proposal (v2) for new ioctls supporting vfio-pci on SoCs with the Freescale PAMU.

Example usage is as described by Scott:

    count = VFIO_IOMMU_GET_MSI_BANK_COUNT
    VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
    VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
    // do other DMA maps now, or later, or not at all, doesn't matter
    for (i = 0; i < count; i++)
        VFIO_IOMMU_MAP_MSI_BANK(iova, i);
    // The kernel now knows where each bank has been mapped, and can
    // update PCI config space appropriately.

Thanks,
Stuart

The Freescale PAMU is an aperture-based IOMMU with the following characteristics. Each device has an entry in a table in memory describing the iova->phys mapping. The mapping has:
- an overall aperture that is power of 2 sized, and has a start iova that is naturally aligned
- 1 or more windows within the aperture
- number of windows must be power of 2, max is 256
- size of each window is determined by aperture size / # of windows
- iova of each window is determined by aperture start iova / # of windows
- the mapped region in each window can be different than the window size... the mapping must be a power of 2
- physical address of the mapping must be naturally aligned with the mapping size

/*
 * VFIO_PAMU_GET_ATTR
 *
 * Gets the iommu attributes for the current vfio container.  This
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets argsz and attribute.  The ioctl fills in
 * the provided struct vfio_pamu_attr based on the attribute
 * value that was set.
 * Operates on the VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_attr {
	__u32	argsz;
	__u32	attribute;
#define VFIO_ATTR_GEOMETRY    1
#define VFIO_ATTR_WINDOWS     2
#define VFIO_ATTR_PAMU_STASH  3

	/* following fields are for VFIO_ATTR_GEOMETRY */
	__u64	aperture_start;	/* first address that can be mapped */
	__u64	aperture_end;	/* last address that can be mapped */

	/* following fields are for VFIO_ATTR_WINDOWS */
	__u32	windows;	/* number of windows in the aperture
				 * (initially this will be the max number
				 * of windows that can be set) */

	/* following fields are for VFIO_ATTR_PAMU_STASH */
	__u32	cpu;		/* CPU number for stashing */
	__u32	cache;		/* cache ID for stashing */
};
#define VFIO_PAMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_attr)

/*
 * VFIO_PAMU_SET_ATTR
 *
 * Sets the iommu attributes for the current vfio container.  This
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets struct vfio_pamu_attr, including argsz and attribute and
 * setting any fields that are valid for the attribute.
 * Operates on the VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
#define VFIO_PAMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_attr)

/*
 * VFIO_PAMU_GET_MSI_BANK_COUNT
 *
 * Returns the number of MSI banks for this platform.  This tells user space
 * how many aperture windows should be reserved for MSI banks when setting
 * the PAMU geometry and window count.
 * Fills in the provided struct vfio_pamu_msi_banks.  Caller sets argsz.
 * Operates on the VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_msi_banks {
	__u32	argsz;
	__u32	bank_count;	/* the number of MSI banks */
};
#define VFIO_PAMU_GET_MSI_BANK_COUNT  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_msi_banks)

/*
 * VFIO_PAMU_MAP_MSI_BANK
 *
 * Maps the MSI bank at the specified index and iova.  User space must
 * call this ioctl once for each MSI bank (count of banks is returned by
 * VFIO_IOMMU_GET_MSI_BANK_COUNT).
 * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
 * Operates on the VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_msi_bank_map {
	__u32	argsz;
	__u32	msi_bank_index;	/* the index of the MSI bank */
	__u64	iova;		/* the iova the bank is to be mapped to */
};

/*
 * VFIO_PAMU_UNMAP_MSI_BANK
 *
 * Unmaps the MSI bank at the specified iova.
 * Caller provides struct vfio_pamu_msi_bank_unmap with all fields set.
 * Operates on the VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_msi_bank_unmap {
	__u32	argsz;
	__u64	iova;	/* the iova to be unmapped */
};
Re: RFC: vfio API changes needed for powerpc (v2)
On Thu, 2013-04-04 at 17:32 +0000, Yoder Stuart-B08248 wrote:
> Based on the email thread over the last couple of days, I have below
> a more concrete proposal (v2) for new ioctls supporting vfio-pci on
> SoCs with the Freescale PAMU.
>
> Example usage is as described by Scott:
>
>     count = VFIO_IOMMU_GET_MSI_BANK_COUNT
>     VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
>     VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
>     // do other DMA maps now, or later, or not at all, doesn't matter
>     for (i = 0; i < count; i++)
>         VFIO_IOMMU_MAP_MSI_BANK(iova, i);
>     // The kernel now knows where each bank has been mapped, and can
>     // update PCI config space appropriately.
>
> Thanks,
> Stuart
>
> The Freescale PAMU is an aperture-based IOMMU with the following
> characteristics.  Each device has an entry in a table in memory
> describing the iova->phys mapping.  The mapping has:
> - an overall aperture that is power of 2 sized, and has a start iova that is naturally aligned
> - 1 or more windows within the aperture
> - number of windows must be power of 2, max is 256
> - size of each window is determined by aperture size / # of windows
> - iova of each window is determined by aperture start iova / # of windows
> - the mapped region in each window can be different than the window size... the mapping must be a power of 2
> - physical address of the mapping must be naturally aligned with the mapping size
>
> /*
>  * VFIO_PAMU_GET_ATTR
>  *
>  * Gets the iommu attributes for the current vfio container.  This
>  * ioctl is applicable to an iommu type of VFIO_PAMU only.
>  * Caller sets argsz and attribute.  The ioctl fills in
>  * the provided struct vfio_pamu_attr based on the attribute
>  * value that was set.
>  * Operates on the VFIO file descriptor (/dev/vfio/vfio).
>  * Return: 0 on success, -errno on failure
>  */
> struct vfio_pamu_attr {
> 	__u32	argsz;
> 	__u32	attribute;
> #define VFIO_ATTR_GEOMETRY    1
> #define VFIO_ATTR_WINDOWS     2
> #define VFIO_ATTR_PAMU_STASH  3
>
> 	/* following fields are for VFIO_ATTR_GEOMETRY */
> 	__u64	aperture_start;	/* first address that can be mapped */
> 	__u64	aperture_end;	/* last address that can be mapped */
>
> 	/* following fields are for VFIO_ATTR_WINDOWS */
> 	__u32	windows;	/* number of windows in the aperture
> 				 * (initially this will be the max number
> 				 * of windows that can be set) */
>
> 	/* following fields are for VFIO_ATTR_PAMU_STASH */
> 	__u32	cpu;		/* CPU number for stashing */
> 	__u32	cache;		/* cache ID for stashing */

Can multiple attribute bits be set at once? If not, then the above could be a union and the attribute could be an enum. A flags field is probably still a good idea.

> };
> #define VFIO_PAMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_attr)
>
> /*
>  * VFIO_PAMU_SET_ATTR
>  *
>  * Sets the iommu attributes for the current vfio container.  This
>  * ioctl is applicable to an iommu type of VFIO_PAMU only.
>  * Caller sets struct vfio_pamu_attr, including argsz and attribute and
>  * setting any fields that are valid for the attribute.
>  * Operates on the VFIO file descriptor (/dev/vfio/vfio).
>  * Return: 0 on success, -errno on failure
>  */
> #define VFIO_PAMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_attr)
>
> /*
>  * VFIO_PAMU_GET_MSI_BANK_COUNT
>  *
>  * Returns the number of MSI banks for this platform.  This tells user space
>  * how many aperture windows should be reserved for MSI banks when setting
>  * the PAMU geometry and window count.
>  * Fills in the provided struct vfio_pamu_msi_banks.  Caller sets argsz.
>  * Operates on the VFIO file descriptor (/dev/vfio/vfio).
>  * Return: 0 on success, -errno on failure
>  */
> struct vfio_pamu_msi_banks {
> 	__u32	argsz;
> 	__u32	bank_count;	/* the number of MSI banks */
> };
> #define VFIO_PAMU_GET_MSI_BANK_COUNT  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_msi_banks)

argsz w/o flags doesn't really buy us much.
> /*
>  * VFIO_PAMU_MAP_MSI_BANK
>  *
>  * Maps the MSI bank at the specified index and iova.  User space must
>  * call this ioctl once for each MSI bank (count of banks is returned by
>  * VFIO_IOMMU_GET_MSI_BANK_COUNT).
>  * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
>  * Operates on the VFIO file descriptor (/dev/vfio/vfio).
>  * Return: 0 on success, -errno on failure
>  */
> struct vfio_pamu_msi_bank_map {
> 	__u32	argsz;
> 	__u32	msi_bank_index;	/* the index of the MSI bank */
> 	__u64	iova;		/* the iova the bank is to be mapped to */
> };

Again, flags. If we dynamically add or remove devices from a container the bank count can change, right? If bank count goes from 2 to 3, does userspace know to assume the new bank is [2]? If bank count goes from 3
RE: RFC: vfio API changes needed for powerpc (v2)
> > /*
> >  * VFIO_PAMU_MAP_MSI_BANK
> >  *
> >  * Maps the MSI bank at the specified index and iova.  User space must
> >  * call this ioctl once for each MSI bank (count of banks is returned
> >  * by VFIO_IOMMU_GET_MSI_BANK_COUNT).
> >  * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
> >  * Operates on the VFIO file descriptor (/dev/vfio/vfio).
> >  * Return: 0 on success, -errno on failure
> >  */
> > struct vfio_pamu_msi_bank_map {
> > 	__u32	argsz;
> > 	__u32	msi_bank_index;	/* the index of the MSI bank */
> > 	__u64	iova;		/* the iova the bank is to be mapped to */
> > };
>
> Again, flags.  If we dynamically add or remove devices from a container
> the bank count can change, right?  If bank count goes from 2 to 3, does
> userspace know to assume the new bank is [2]?  If bank count goes from 3
> to 2, which index was that?  If removing a device removes an MSI bank
> then vfio-pamu will automatically do the unmap, right?

My assumption is that the bank count returned by VFIO_IOMMU_GET_MSI_BANK_COUNT is the max bank count for a platform (the number will most likely always be 3 or 4). So it won't change as devices are added or removed. If devices are added or removed, the kernel side can enable or disable the corresponding MSI windows, but that is hidden from user space.

Stuart
RFC: vfio API changes needed for powerpc (v3)
-v3 updates:
  -made vfio_pamu_attr a union, added flags
  -s/VFIO_PAMU_/VFIO_IOMMU_PAMU_/ for the ioctls to make it more clear which fd is being operated on
  -added flags to vfio_pamu_msi_bank_map/unmap
  -VFIO_PAMU_GET_MSI_BANK_COUNT now just returns a __u32, not a struct
  -fixed some typos

The Freescale PAMU is an aperture-based IOMMU with the following characteristics. Each device has an entry in a table in memory describing the iova->phys mapping. The mapping has:
- an overall aperture that is power of 2 sized, and has a start iova that is naturally aligned
- 1 or more windows within the aperture
- number of windows must be power of 2, max is 256
- size of each window is determined by aperture size / # of windows
- iova of each window is determined by aperture start iova / # of windows
- the mapped region in each window can be different than the window size... the mapping must be a power of 2
- physical address of the mapping must be naturally aligned with the mapping size

These ioctls operate on the VFIO file descriptor (/dev/vfio/vfio).

/*
 * VFIO_IOMMU_PAMU_GET_ATTR
 *
 * Gets the iommu attributes for the current vfio container.  This
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets argsz and attribute.  The ioctl fills in
 * the provided struct vfio_pamu_attr based on the attribute
 * value that was set.
 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_attr {
	__u32	argsz;
	__u32	flags;		/* no flags currently */
	__u32	attribute;

	union {
		/* VFIO_ATTR_GEOMETRY */
		struct {
			__u64 aperture_start;	/* first addr that can be mapped */
			__u64 aperture_end;	/* last addr that can be mapped */
		} attr;

		/* VFIO_ATTR_WINDOWS */
		__u32 windows;		/* number of windows in the aperture
					 * (initially this will be the max number
					 * of windows that can be set) */

		/* VFIO_ATTR_PAMU_STASH */
		struct {
			__u32 cpu;	/* CPU number for stashing */
			__u32 cache;	/* cache ID for stashing */
		} stash;
	};
};
#define VFIO_IOMMU_PAMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_attr)

/*
 * VFIO_IOMMU_PAMU_SET_ATTR
 *
 * Sets the iommu attributes for the current vfio container.  This
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets struct vfio_pamu_attr, including argsz and attribute and
 * setting any fields that are valid for the attribute.
 * Return: 0 on success, -errno on failure
 */
#define VFIO_IOMMU_PAMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_attr)

/*
 * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT
 *
 * Returns the number of MSI banks for this platform.  This tells user space
 * how many aperture windows should be reserved for MSI banks when setting
 * the PAMU geometry and window count.
 * Return: __u32 bank count on success, -errno on failure
 */
#define VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT  _IO(VFIO_TYPE, VFIO_BASE + x, __u32)

/*
 * VFIO_IOMMU_PAMU_MAP_MSI_BANK
 *
 * Maps the MSI bank at the specified index and iova.  User space must
 * call this ioctl once for each MSI bank (count of banks is returned by
 * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT).
 * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_msi_bank_map {
	__u32	argsz;
	__u32	flags;		/* no flags currently */
	__u32	msi_bank_index;	/* the index of the MSI bank */
	__u64	iova;		/* the iova the bank is to be mapped to */
};
#define VFIO_IOMMU_PAMU_MAP_MSI_BANK  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_msi_bank_map)

/*
 * VFIO_IOMMU_PAMU_UNMAP_MSI_BANK
 *
 * Unmaps the MSI bank at the specified iova.
 * Caller provides struct vfio_pamu_msi_bank_unmap with all fields set.
 * Operates on the VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_msi_bank_unmap {
	__u32	argsz;
	__u32	flags;	/* no flags currently */
	__u64	iova;	/* the iova to be unmapped */
};
#define VFIO_IOMMU_PAMU_UNMAP_MSI_BANK  _IO(VFIO_TYPE, VFIO_BASE + x, struct vfio_pamu_msi_bank_unmap)
Re: RFC: vfio API changes needed for powerpc (v2)
On 04/04/2013 04:38:44 PM, Yoder Stuart-B08248 wrote:
> > > /*
> > >  * VFIO_PAMU_MAP_MSI_BANK
> > >  *
> > >  * Maps the MSI bank at the specified index and iova.  User space must
> > >  * call this ioctl once for each MSI bank (count of banks is returned
> > >  * by VFIO_IOMMU_GET_MSI_BANK_COUNT).
> > >  * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
> > >  * Operates on the VFIO file descriptor (/dev/vfio/vfio).
> > >  * Return: 0 on success, -errno on failure
> > >  */
> > > struct vfio_pamu_msi_bank_map {
> > > 	__u32	argsz;
> > > 	__u32	msi_bank_index;	/* the index of the MSI bank */
> > > 	__u64	iova;		/* the iova the bank is to be mapped to */
> > > };
> >
> > Again, flags.  If we dynamically add or remove devices from a container
> > the bank count can change, right?  If bank count goes from 2 to 3, does
> > userspace know to assume the new bank is [2]?  If bank count goes from 3
> > to 2, which index was that?  If removing a device removes an MSI bank
> > then vfio-pamu will automatically do the unmap, right?
>
> My assumption is that the bank count returned by
> VFIO_IOMMU_GET_MSI_BANK_COUNT is the max bank count for a platform (the
> number will most likely always be 3 or 4).  So it won't change as
> devices are added or removed.

It should be the actual number of banks used. This is required if we're going to have userspace do the iteration and specify the exact iovas to use -- and even if we aren't going to do that, it would be more restrictive on available iova-space than is necessary. Usually there will be only one bank in the group.

Actually mapping all of the MSI banks, all the time, would completely negate the point of using the separate alias pages.

-Scott
RE: RFC: vfio API changes needed for powerpc (v2)
-----Original Message-----
From: Wood Scott-B07421
Sent: Thursday, April 04, 2013 5:52 PM
To: Yoder Stuart-B08248
Cc: Alex Williamson; Wood Scott-B07421; ag...@suse.de; Bhushan Bharat-R65777; Sethi Varun-B16395; kvm@vger.kernel.org; qemu-de...@nongnu.org; io...@lists.linux-foundation.org
Subject: Re: RFC: vfio API changes needed for powerpc (v2)

On 04/04/2013 04:38:44 PM, Yoder Stuart-B08248 wrote:
> > > > /*
> > > >  * VFIO_PAMU_MAP_MSI_BANK
> > > >  *
> > > >  * Maps the MSI bank at the specified index and iova.  User space must
> > > >  * call this ioctl once for each MSI bank (count of banks is returned
> > > >  * by VFIO_IOMMU_GET_MSI_BANK_COUNT).
> > > >  * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
> > > >  * Operates on the VFIO file descriptor (/dev/vfio/vfio).
> > > >  * Return: 0 on success, -errno on failure
> > > >  */
> > > > struct vfio_pamu_msi_bank_map {
> > > > 	__u32	argsz;
> > > > 	__u32	msi_bank_index;	/* the index of the MSI bank */
> > > > 	__u64	iova;		/* the iova the bank is to be mapped to */
> > > > };
> > >
> > > Again, flags.  If we dynamically add or remove devices from a container
> > > the bank count can change, right?  If bank count goes from 2 to 3, does
> > > userspace know to assume the new bank is [2]?  If bank count goes from 3
> > > to 2, which index was that?  If removing a device removes an MSI bank
> > > then vfio-pamu will automatically do the unmap, right?
> >
> > My assumption is that the bank count returned by
> > VFIO_IOMMU_GET_MSI_BANK_COUNT is the max bank count for a platform (the
> > number will most likely always be 3 or 4).  So it won't change as
> > devices are added or removed.
>
> It should be the actual number of banks used.  This is required if we're
> going to have userspace do the iteration and specify the exact iovas to
> use -- and even if we aren't going to do that, it would be more
> restrictive on available iova-space than is necessary.  Usually there
> will be only one bank in the group.
>
> Actually mapping all of the MSI banks, all the time, would completely
> negate the point of using the separate alias pages.
The geometry, windows, DMA mappings, etc. are set on a "container", which may have multiple groups in it. So user space needs to determine the total number of MSI windows needed when setting the geometry and window count. In the flow you proposed:

    count = VFIO_IOMMU_GET_MSI_BANK_COUNT
    VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
    VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
    // do other DMA maps now, or later, or not at all, doesn't matter
    for (i = 0; i < count; i++)
        VFIO_IOMMU_MAP_MSI_BANK(iova, i);

...that count has to be the total, not the count for 1 of N possible groups. So the get-count ioctl is not done on a group. However, like you pointed out, we don't want to negate the isolation of the separate alias pages. All this API is doing is telling the kernel which windows to use for which MSI banks; it's up to the kernel to actually enable them as needed.

Say 3 MSI banks exist. If there are no groups added to the vfio container and all 3 MAP_MSI_BANK calls occurred, the picture may look like this (based on my earlier example):

    win #  enabled  iova         gphys/phys    size
    -----  -------  -----------  ------------  ----
      5       N     0x1000_0000  0xf_fe044000  4KB   // msi bank 1
      6       N     0x1400_0000  0xf_fe045000  4KB   // msi bank 2
      7       N     0x1800_0000  0xf_fe046000  4KB   // msi bank 3

User space adds 2 groups that use bank 1:

    win #  enabled  iova         gphys/phys    size
    -----  -------  -----------  ------------  ----
      5       Y     0x1000_0000  0xf_fe044000  4KB   // msi bank 1
      6       N     0x1400_0000  0xf_fe045000  4KB   // msi bank 2
      7       N     0x1800_0000  0xf_fe046000  4KB   // msi bank 3

User space adds another group that uses bank 3:

    win #  enabled  iova         gphys/phys    size
    -----  -------  -----------  ------------  ----
      5       Y     0x1000_0000  0xf_fe044000  4KB   // msi bank 1
      6       N     0x1400_0000  0xf_fe045000  4KB   // msi bank 2
      7       Y     0x1800_0000  0xf_fe046000  4KB   // msi bank 3

User space doesn't need to care what is actually enabled; it just needs to tell the kernel which windows to use, and the kernel can take care of the rest.

Stuart
Re: RFC: vfio API changes needed for powerpc
> > > Type1 is arbitrary.  It might as well be named "brown" and this one
> > > can be "blue".
> >
> > The difference is that type1 seems to refer to hardware that can do
> > arbitrary 4K page mappings, possibly constrained by an aperture but
> > nothing else.  More than one IOMMU can reasonably fit that.  The odds
> > that another IOMMU would have exactly the same restrictions as PAMU
> > seem smaller in comparison.  In any case, if you had to deal with some
> > Intel-only quirk, would it make sense to call it a "type1 attribute"?
> > I'm not advocating one way or the other on whether an abstraction is
> > viable here (though Stuart seems to think it's highly unlikely anything
> > but a PAMU will comply), just that if it is to be abstracted rather
> > than a hardware-specific interface, we need to document what is and is
> > not part of the abstraction.  Otherwise a non-PAMU-specific user won't
> > know what they can rely on, and someone adding support for a new
> > windowed IOMMU won't know if theirs is close enough, or whether they
> > need to introduce a "type3".
>
> So Alexey named the SPAPR IOMMU something related to spapr...
> surprisingly enough.  I'm fine with that.  If you think it's unique
> enough, name it something appropriately.  I haven't seen the code and
> don't know the architecture sufficiently to have an opinion.

The only reason I suggested "type 2" is that I thought that was the convention... we would enumerate different iommus. I think that calling it "pamu" is better and more clear.

Stuart
Re: RFC: vfio API changes needed for powerpc
On Tue, Apr 2, 2013 at 5:50 PM, Scott Wood scottw...@freescale.com wrote: On 04/02/2013 04:38:45 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote: On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote: C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. The flow would be something like this: -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD -user space mmaps the fd, getting a vaddr -user space does a normal DMA map for desired iova This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu). And the DMA_MAP of that mmap then allows userspace to select the window used? This one seems like a lot of overhead, adding a new ioctl, new fd, mmap, special mapping path, etc. There's going to be special stuff no matter what. This would keep it separated from the IOMMU map code. I'm not sure what you mean by overhead here... the runtime overhead of setting things up is not particularly relevant as long as it's reasonable. If you mean development and maintenance effort, keeping things well separated should help. We don't need to change DMA_MAP. If we can simply add a new type 2 ioctl that allows user space to set which windows are MSIs, it seems vastly less complex than an ioctl to supply a new fd, mmap of it, etc. So maybe 2 ioctls: VFIO_IOMMU_GET_MSI_COUNT Do you mean a count of actual MSIs or a count of MSI banks used by the whole VFIO group? I meant # of MSI banks, so VFIO_IOMMU_GET_MSI_BANK_COUNT would be better. VFIO_IOMMU_MAP_MSI(iova, size) Not sure how you mean size to be used -- for MPIC it would be 4K per bank, and you can only map one bank at a time (which bank you're mapping should be a parameter, if only so that the kernel doesn't have to keep iteration state for you). 
The intent was for user space to tell the kernel which windows to use for MSI. So I envisioned a total size of window-size * msi-bank-count. Stuart
Re: RFC: vfio API changes needed for powerpc
On 04/03/2013 01:32:26 PM, Stuart Yoder wrote: On Tue, Apr 2, 2013 at 5:50 PM, Scott Wood scottw...@freescale.com wrote: On 04/02/2013 04:38:45 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote: VFIO_IOMMU_MAP_MSI(iova, size) Not sure how you mean size to be used -- for MPIC it would be 4K per bank, and you can only map one bank at a time (which bank you're mapping should be a parameter, if only so that the kernel doesn't have to keep iteration state for you). The intent was for user space to tell the kernel which windows to use for MSI. So I envisioned a total size of window-size * msi-bank-count. Size doesn't tell the kernel *which* banks to use, only how many. If it already knows which banks are used by the group, then it also knows how many are used. And size is misleading because the mapping is not generally going to be contiguous. -Scott
Re: RFC: vfio API changes needed for powerpc
Would it be possible for userspace to simply leave room for MSI bank mapping (how much room could be determined by something like VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can DMA_MAP starting at the 0x0 address of the aperture, growing up, and VFIO will map banks on demand at the top of the aperture, growing down? Wouldn't that avoid a lot of issues with userspace needing to know anything about MSI banks (other than count) and coordinating irq numbers and enabling handlers? This is basically option #A in the original proposals sent. I like this approach, in that it is simpler and keeps user space mostly out of this...which is consistent with how things are done on x86. User space just needs to know how many windows to leave at the top of the aperture. The kernel then has the flexibility to use those windows how it wants. But one question is when the kernel should actually map (and unmap) the MSI banks. One thing we need to do is enable the aperture...and current thinking is that is done on the first DMA_MAP. Similarly, when the last mapping is unmapped we could also unmap the MSI banks. Sequence would be something like:

   VFIO_GROUP_SET_CONTAINER             // add groups to the container
   VFIO_SET_IOMMU(VFIO_FSL_PAMU)        // set iommu model
   cnt = VFIO_IOMMU_GET_MSI_BANK_COUNT  // returns max # of MSI banks
   VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)   // set overall aperture
   VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)    // set # of windows, including MSI banks
   VFIO_IOMMU_MAP_DMA                   // map the guest's memory
     --- kernel enables aperture and maps needed MSI banks here
   VFIO_DEVICE_SET_IRQS
     --- kernel sets actual MSI addr/data in physical device here (I think)

Stuart
Re: RFC: vfio API changes needed for powerpc
On 04/03/2013 02:09:45 PM, Stuart Yoder wrote: Would it be possible for userspace to simply leave room for MSI bank mapping (how much room could be determined by something like VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can DMA_MAP starting at the 0x0 address of the aperture, growing up, and VFIO will map banks on demand at the top of the aperture, growing down? Wouldn't that avoid a lot of issues with userspace needing to know anything about MSI banks (other than count) and coordinating irq numbers and enabling handlers? This is basically option #A in the original proposals sent. I like this approach, in that it is simpler and keeps user space mostly out of this...which is consistent with how things are done on x86. User space just needs to know how many windows to leave at the top of the aperture. The kernel then has the flexibility to use those windows how it wants. But one question is when the kernel should actually map (and unmap) the MSI banks. I think userspace should explicitly request it. Userspace still wouldn't need to know anything but the count:

   count = VFIO_IOMMU_GET_MSI_BANK_COUNT
   VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
   VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
   // do other DMA maps now, or later, or not at all, doesn't matter
   for (i = 0; i < count; i++)
           VFIO_IOMMU_MAP_MSI_BANK(iova, i);
   // The kernel now knows where each bank has been mapped,
   // and can update PCI config space appropriately.

One thing we need to do is enable the aperture...and current thinking is that is done on the first DMA_MAP. What if there are no other mappings required? -Scott
Re: RFC: vfio API changes needed for powerpc
On Wed, 2013-04-03 at 14:09 -0500, Stuart Yoder wrote: Would it be possible for userspace to simply leave room for MSI bank mapping (how much room could be determined by something like VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can DMA_MAP starting at the 0x0 address of the aperture, growing up, and VFIO will map banks on demand at the top of the aperture, growing down? Wouldn't that avoid a lot of issues with userspace needing to know anything about MSI banks (other than count) and coordinating irq numbers and enabling handlers? This is basically option #A in the original proposals sent. I like this approach, in that it is simpler and keeps user space mostly out of this...which is consistent with how things are done on x86. User space just needs to know how many windows to leave at the top of the aperture. The kernel then has the flexibility to use those windows how it wants. But one question is when the kernel should actually map (and unmap) the MSI banks. One thing we need to do is enable the aperture...and current thinking is that is done on the first DMA_MAP. Similarly, when the last mapping is unmapped we could also unmap the MSI banks. Sequence would be something like:

   VFIO_GROUP_SET_CONTAINER             // add groups to the container
   VFIO_SET_IOMMU(VFIO_FSL_PAMU)        // set iommu model
   cnt = VFIO_IOMMU_GET_MSI_BANK_COUNT  // returns max # of MSI banks
   VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)   // set overall aperture
   VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)    // set # of windows, including MSI banks
   VFIO_IOMMU_MAP_DMA                   // map the guest's memory
     --- kernel enables aperture and maps needed MSI banks here
   VFIO_DEVICE_SET_IRQS
     --- kernel sets actual MSI addr/data in physical device here (I think)

You could also make use of the IOMMU_ENABLE/DISABLE entry points that Alexey plans to use. Ideally I'd think that you'd want to enable the required MSI banks for a device on DEVICE_SET_IRQS. That's effectively what happens on x86.
Perhaps some information stored in the domain structure would let architecture hooks in MSI setup enable those mappings for you? Thanks, Alex
Re: RFC: vfio API changes needed for powerpc
On 04/03/2013 02:09:45 PM, Stuart Yoder wrote: Would it be possible for userspace to simply leave room for MSI bank mapping (how much room could be determined by something like VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can DMA_MAP starting at the 0x0 address of the aperture, growing up, and VFIO will map banks on demand at the top of the aperture, growing down? Wouldn't that avoid a lot of issues with userspace needing to know anything about MSI banks (other than count) and coordinating irq numbers and enabling handlers? This is basically option #A in the original proposals sent. I like this approach, in that it is simpler and keeps user space mostly out of this...which is consistent with how things are done on x86. User space just needs to know how many windows to leave at the top of the aperture. The kernel then has the flexibility to use those windows how it wants. But one question is when the kernel should actually map (and unmap) the MSI banks. One thing we need to do is enable the aperture...and current thinking is that is done on the first DMA_MAP. Similarly, when the last mapping is unmapped we could also unmap the MSI banks. Sequence would be something like:

   VFIO_GROUP_SET_CONTAINER             // add groups to the container
   VFIO_SET_IOMMU(VFIO_FSL_PAMU)        // set iommu model
   cnt = VFIO_IOMMU_GET_MSI_BANK_COUNT  // returns max # of MSI banks
   VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)   // set overall aperture
   VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)    // set # of windows, including MSI banks
   VFIO_IOMMU_MAP_DMA                   // map the guest's memory
     --- kernel enables aperture and maps needed MSI banks here

Maps them where? What if there is more than one explicit DMA mapping? What if DMA mappings are changed during operation? -Scott
Re: RFC: vfio API changes needed for powerpc
On Wed, Apr 3, 2013 at 2:18 PM, Scott Wood scottw...@freescale.com wrote: On 04/03/2013 02:09:45 PM, Stuart Yoder wrote: Would it be possible for userspace to simply leave room for MSI bank mapping (how much room could be determined by something like VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can DMA_MAP starting at the 0x0 address of the aperture, growing up, and VFIO will map banks on demand at the top of the aperture, growing down? Wouldn't that avoid a lot of issues with userspace needing to know anything about MSI banks (other than count) and coordinating irq numbers and enabling handlers? This is basically option #A in the original proposals sent. I like this approach, in that it is simpler and keeps user space mostly out of this...which is consistent with how things are done on x86. User space just needs to know how many windows to leave at the top of the aperture. The kernel then has the flexibility to use those windows how it wants. But one question is when the kernel should actually map (and unmap) the MSI banks. I think userspace should explicitly request it. Userspace still wouldn't need to know anything but the count:

   count = VFIO_IOMMU_GET_MSI_BANK_COUNT
   VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
   VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
   // do other DMA maps now, or later, or not at all, doesn't matter
   for (i = 0; i < count; i++)
           VFIO_IOMMU_MAP_MSI_BANK(iova, i);
   // The kernel now knows where each bank has been mapped,
   // and can update PCI config space appropriately.

And the overall aperture enable/disable would occur on the first dma/msi map() and last dma/msi unmap()? Stuart
Re: RFC: vfio API changes needed for powerpc
On 04/03/2013 02:43:06 PM, Stuart Yoder wrote: On Wed, Apr 3, 2013 at 2:18 PM, Scott Wood scottw...@freescale.com wrote: On 04/03/2013 02:09:45 PM, Stuart Yoder wrote: Would it be possible for userspace to simply leave room for MSI bank mapping (how much room could be determined by something like VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can DMA_MAP starting at the 0x0 address of the aperture, growing up, and VFIO will map banks on demand at the top of the aperture, growing down? Wouldn't that avoid a lot of issues with userspace needing to know anything about MSI banks (other than count) and coordinating irq numbers and enabling handlers? This is basically option #A in the original proposals sent. I like this approach, in that it is simpler and keeps user space mostly out of this...which is consistent with how things are done on x86. User space just needs to know how many windows to leave at the top of the aperture. The kernel then has the flexibility to use those windows how it wants. But one question is when the kernel should actually map (and unmap) the MSI banks. I think userspace should explicitly request it. Userspace still wouldn't need to know anything but the count:

   count = VFIO_IOMMU_GET_MSI_BANK_COUNT
   VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
   VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
   // do other DMA maps now, or later, or not at all, doesn't matter
   for (i = 0; i < count; i++)
           VFIO_IOMMU_MAP_MSI_BANK(iova, i);
   // The kernel now knows where each bank has been mapped,
   // and can update PCI config space appropriately.

And the overall aperture enable/disable would occur on the first dma/msi map() and last dma/msi unmap()? Yes. We may want the optional ability to do an overall enable/disable for reasons we discussed a while ago, but in the absence of an explicit disable the domain would be enabled on first map.
-Scott
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 10:37:20 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 17:50 -0500, Scott Wood wrote: On 04/02/2013 04:38:45 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote: On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote: C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. The flow would be something like this: -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD -user space mmaps the fd, getting a vaddr -user space does a normal DMA map for desired iova This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu). And the DMA_MAP of that mmap then allows userspace to select the window used? This one seems like a lot of overhead, adding a new ioctl, new fd, mmap, special mapping path, etc. There's going to be special stuff no matter what. This would keep it separated from the IOMMU map code. I'm not sure what you mean by overhead here... the runtime overhead of setting things up is not particularly relevant as long as it's reasonable. If you mean development and maintenance effort, keeping things well separated should help. We don't need to change DMA_MAP. If we can simply add a new type 2 ioctl that allows user space to set which windows are MSIs, it seems vastly less complex than an ioctl to supply a new fd, mmap of it, etc. So maybe 2 ioctls: VFIO_IOMMU_GET_MSI_COUNT Do you mean a count of actual MSIs or a count of MSI banks used by the whole VFIO group? I hope the latter, which would clarify how this is distinct from DEVICE_GET_IRQ_INFO. Is hotplug even on the table? Presumably dynamically adding a device could bring along additional MSI banks? 
I'm not sure -- maybe we could say that hotplug can add banks, but not remove them or change the order, so userspace would just need to check if the number of banks changed, and map the extras. The current VFIO MSI support has the host handling everything about MSI. The user never programs an MSI vector to the physical device, they set up everything through ioctl. On interrupt, we simply trigger an eventfd and leave it to things like KVM irqfd or QEMU to do the right thing in a virtual machine. Here the MSI vector has to go through a PAMU window to hit the correct MSI bank. So that means it has some component of the iova involved, which we're proposing here is controlled by userspace (whether that vector uses an offset from 0x1000 or 0x depending on which window slot is used to make the MSI bank). I assume we're still working in a model where the physical interrupt fires into the host and a host-based interrupt handler triggers an eventfd, right? Yes (subject to possible future optimizations). So that means the vector also has host components so we trigger the correct ISR. How is that coordinated? Everything but the iova component needs to come from the host MSI allocator. Would it be possible for userspace to simply leave room for MSI bank mapping (how much room could be determined by something like VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can DMA_MAP starting at the 0x0 address of the aperture, growing up, and VFIO will map banks on demand at the top of the aperture, growing down? Wouldn't that avoid a lot of issues with userspace needing to know anything about MSI banks (other than count) and coordinating irq numbers and enabling handlers? This would restrict a (possibly unlikely) use case where the user wants to map something near the top of the aperture but has another place MSIs can go (or is willing to live without MSIs).
Otherwise it could be workable, as long as we can require an explicit MSI enabling on a device to happen after the aperture and subwindow count are set up. I'm not sure it would really buy anything over having userspace iterate over the MSI bank count, though -- it would probably be a bit more complicated. On x86 MSI count is very device specific, which means it would be a VFIO_DEVICE_* ioctl (actually VFIO_DEVICE_GET_IRQ_INFO does this for us on x86). The trouble with it being a device ioctl is that you need to get the device FD, but the IOMMU protection needs to be established before you can get that... so there's an ordering problem if you need it from the device before configuring the IOMMU. Thanks, What do you mean by IOMMU protection needs to be established? Wouldn't we just start with no mappings in place? If no mappings blocks all DMA, sure, that's fine. Once the VFIO device FD is accessible by
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 10:12:31 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 17:44 -0500, Scott Wood wrote: On 04/02/2013 04:32:04 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote: On 04/02/2013 03:32:17 PM, Alex Williamson wrote: On x86 the interrupt remapper handles this transparently when MSI is enabled and userspace never gets direct access to the device MSI address/data registers. x86 has a totally different mechanism here, as far as I understand -- even before you get into restrictions on mappings. So what control will userspace have over programming the actual MSI vectors on PAMU? Not sure what you mean -- PAMU doesn't get explicitly involved in MSIs. It's just another 4K page mapping (per relevant MSI bank). If you want isolation, you need to make sure that an MSI group is only used by one VFIO group, and that you're on a chip that has alias pages with just one MSI bank register each (newer chips do, but the first chip to have a PAMU didn't). How does a user figure this out? The user's involvement could be limited to setting a policy knob of whether that degree of isolation is required (if required and unavailable, all devices using an MSI bank would be forced into the same group). We'd need to do something with MSI allocation so that we avoid using an MSI bank with more than one IOMMU group where possible. I'm not sure about the details yet, or how practical this is. There might need to be some MSI bank assignment done as part of the VFIO device binding process, if there are going to be more VFIO groups than there are MSI banks (reserving one bank for host use). -Scott
RFC: vfio API changes needed for powerpc
Alex, We are in the process of implementing vfio-pci support for the Freescale IOMMU (PAMU). It is an aperture/window-based IOMMU and is quite different than x86, and will involve creating a 'type 2' vfio implementation. For each device's DMA mappings, PAMU has an overall aperture and a number of windows. All sizes and window counts must be power of 2. To illustrate, below is a mapping for a 256MB guest, including guest memory (backed by 64MB huge pages) and some windows for MSIs:

   Total aperture: 512MB
   # of windows: 8

   win #  gphys/iova  phys          size
   -----  ----------  ----          ----
     0    0x          0xX_XX00      64MB
     1    0x0400      0xX_XX00      64MB
     2    0x0800      0xX_XX00      64MB
     3    0x0C00      0xX_XX00      64MB
     4    0x1000      0xf_fe044000  4KB   // msi bank 1
     5    0x1400      0xf_fe045000  4KB   // msi bank 2
     6    0x1800      0xf_fe046000  4KB   // msi bank 3
     7    -           -             disabled

There are a couple of updates needed to the vfio user-kernel interface that we would like your feedback on.

1. IOMMU geometry

The kernel IOMMU driver now has an interface (see domain_set_attr, domain_get_attr) that lets us set the domain geometry using attributes. We want to expose that to user space, so envision needing a couple of new ioctls to do this:

   VFIO_IOMMU_SET_ATTR
   VFIO_IOMMU_GET_ATTR

2. MSI window mappings

The more problematic question is how to deal with MSIs. We need to create mappings for up to 3 MSI banks that a device may need to target to generate interrupts. The Linux MSI driver can allocate MSIs from the 3 banks any way it wants, and currently user space has no way of knowing which bank may be used for a given device. There are 3 options we have discussed and would like your direction:

A. Implicit mappings -- with this approach user space would not explicitly map MSIs. User space would be required to set the geometry so that there are 3 unused windows (the last 3 windows) for MSIs, and it would be up to the kernel to create the mappings.
This approach requires some specific semantics (leaving 3 windows) and it potentially gets a little weird -- when should the kernel actually create the MSI mappings? When should they be unmapped? Some convention would need to be established.

B. Explicit mapping using DMA map flags. The idea is that a new flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a mapping is to be created for the supplied iova. No vaddr is given though. So in the above example there would be a dma map at 0x1000 for 24KB (and no vaddr). It's up to the kernel to determine which bank gets mapped where. So, this option puts user space in control of which windows are used for MSIs and when MSIs are mapped/unmapped. There would need to be some semantics as to how this is used -- it only makes sense

C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. The flow would be something like this:

- for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
- user space mmaps the fd, getting a vaddr
- user space does a normal DMA map for desired iova

This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu). Any feedback or direction? Thanks, Stuart
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote: Alex, We are in the process of implementing vfio-pci support for the Freescale IOMMU (PAMU). It is an aperture/window-based IOMMU and is quite different than x86, and will involve creating a 'type 2' vfio implementation. For each device's DMA mappings, PAMU has an overall aperture and a number of windows. All sizes and window counts must be power of 2. To illustrate, below is a mapping for a 256MB guest, including guest memory (backed by 64MB huge pages) and some windows for MSIs:

   Total aperture: 512MB
   # of windows: 8

   win #  gphys/iova  phys          size
   -----  ----------  ----          ----
     0    0x          0xX_XX00      64MB
     1    0x0400      0xX_XX00      64MB
     2    0x0800      0xX_XX00      64MB
     3    0x0C00      0xX_XX00      64MB
     4    0x1000      0xf_fe044000  4KB   // msi bank 1
     5    0x1400      0xf_fe045000  4KB   // msi bank 2
     6    0x1800      0xf_fe046000  4KB   // msi bank 3
     7    -           -             disabled

There are a couple of updates needed to the vfio user-kernel interface that we would like your feedback on.

1. IOMMU geometry

The kernel IOMMU driver now has an interface (see domain_set_attr, domain_get_attr) that lets us set the domain geometry using attributes. We want to expose that to user space, so envision needing a couple of new ioctls to do this:

   VFIO_IOMMU_SET_ATTR
   VFIO_IOMMU_GET_ATTR

Note that this means attributes need to be updated for user-API appropriateness, such as using fixed-size types.

2. MSI window mappings

The more problematic question is how to deal with MSIs. We need to create mappings for up to 3 MSI banks that a device may need to target to generate interrupts. The Linux MSI driver can allocate MSIs from the 3 banks any way it wants, and currently user space has no way of knowing which bank may be used for a given device. There are 3 options we have discussed and would like your direction:

A. Implicit mappings -- with this approach user space would not explicitly map MSIs. User space would be required to set the geometry so that there are 3 unused windows (the last 3 windows) Where does userspace get the number 3 from? E.g.
on newer chips there are 4 MSI banks. Maybe future chips have even more. B. Explicit mapping using DMA map flags. The idea is that a new flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a mapping is to be created for the supplied iova. No vaddr is given though. So in the above example there would be a dma map at 0x1000 for 24KB (and no vaddr). A single 24 KiB mapping wouldn't work (and why 24KB? What if only one MSI group is involved in this VFIO group? What if four MSI groups are involved?). You'd need to either have a naturally aligned, power-of-two sized mapping that covers exactly the pages you want to map and no more, or you'd need to create a separate mapping for each MSI bank, and due to PAMU subwindow alignment restrictions these mappings could not be contiguous in iova-space. C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. The flow would be something like this: -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD -user space mmaps the fd, getting a vaddr -user space does a normal DMA map for desired iova This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu). The new ioctl isn't really specific to PAMU (or whatever type2 is supposed to be, which nobody ever explains when I ask), so much as to the MSI implementation. It just exposes the MSI register as another device resource (well, technically a groupwide resource, unless we expose it on a per-device basis and provide enough information for userspace to recognize when it's the same for other devices in the group) to be mmapped, which userspace can choose to map in the IOMMU as well. Note that in the explicit case, userspace would have to program the MSI iova into the PCI device's config space (or communicate the chosen address to the kernel so it can set the config space registers).
-Scott
Re: RFC: vfio API changes needed for powerpc
Hi Stuart, On Tue, 2013-04-02 at 17:32 +, Yoder Stuart-B08248 wrote: Alex, We are in the process of implementing vfio-pci support for the Freescale IOMMU (PAMU). It is an aperture/window-based IOMMU and is quite different than x86, and will involve creating a 'type 2' vfio implementation. For each device's DMA mappings, PAMU has an overall aperture and a number of windows. All sizes and window counts must be power of 2. To illustrate, below is a mapping for a 256MB guest, including guest memory (backed by 64MB huge pages) and some windows for MSIs:

   Total aperture: 512MB
   # of windows: 8

   win #  gphys/iova  phys          size
   -----  ----------  ----          ----
     0    0x          0xX_XX00      64MB
     1    0x0400      0xX_XX00      64MB
     2    0x0800      0xX_XX00      64MB
     3    0x0C00      0xX_XX00      64MB
     4    0x1000      0xf_fe044000  4KB   // msi bank 1
     5    0x1400      0xf_fe045000  4KB   // msi bank 2
     6    0x1800      0xf_fe046000  4KB   // msi bank 3
     7    -           -             disabled

There are a couple of updates needed to the vfio user-kernel interface that we would like your feedback on.

1. IOMMU geometry

The kernel IOMMU driver now has an interface (see domain_set_attr, domain_get_attr) that lets us set the domain geometry using attributes. We want to expose that to user space, so envision needing a couple of new ioctls to do this:

   VFIO_IOMMU_SET_ATTR
   VFIO_IOMMU_GET_ATTR

Any ioctls to the vfiofd (/dev/vfio/vfio) not claimed by vfio-core are passed to the IOMMU driver. So you can effectively have your own type2 ioctl extensions. Alexey has already posted patches to do this for SPAPR that add VFIO_IOMMU_ENABLE/DISABLE to allow him access to VFIO_IOMMU_GET_INFO to examine locked page requirements. As Scott notes we need to come up with a clean userspace interface for these though.

2. MSI window mappings

The more problematic question is how to deal with MSIs. We need to create mappings for up to 3 MSI banks that a device may need to target to generate interrupts.
The Linux MSI driver can allocate MSIs from the 3 banks any way it wants, and currently user space has no way of knowing which bank may be used for a given device. There are 3 options we have discussed and would like your direction: A. Implicit mappings -- with this approach user space would not explicitly map MSIs. User space would be required to set the geometry so that there are 3 unused windows (the last 3 windows) for MSIs, and it would be up to the kernel to create the mappings. This approach requires some specific semantics (leaving 3 windows) and it potentially gets a little weird -- when should the kernel actually create the MSI mappings? When should they be unmapped? Some convention would need to be established. VFIO would have control of SET/GET_ATTR, right? So we could reduce the number exposed to userspace on GET and transparently add MSI entries on SET. On x86 the interrupt remapper handles this transparently when MSI is enabled and userspace never gets direct access to the device MSI address/data registers. What kind of restrictions do you have around adding and removing windows while the aperture is enabled? B. Explicit mapping using DMA map flags. The idea is that a new flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a mapping is to be created for the supplied iova. No vaddr is given though. So in the above example there would be a dma map at 0x1000 for 24KB (and no vaddr). It's up to the kernel to determine which bank gets mapped where. So, this option puts user space in control of which windows are used for MSIs and when MSIs are mapped/unmapped. There would need to be some semantics as to how this is used -- it only makes sense This could also be done as another type2 ioctl extension. What's the value to userspace in determining which windows are used by which banks? It sounds like the case that there are X banks and if userspace wants to use MSI it needs to leave X windows available for that.
Is this just buying userspace a few more windows to allow them the choice between MSI or RAM?

C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. The flow would be something like this:
   - for each group, user space calls new ioctl VFIO_GROUP_GET_MSI_FD
   - user space mmaps the fd, getting a vaddr
   - user space does a normal DMA map for desired iova
This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu).

And the DMA_MAP of that mmap then allows
Re: RFC: vfio API changes needed for powerpc
On Tue, Apr 2, 2013 at 2:39 PM, Scott Wood scottw...@freescale.com wrote:

On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:

Alex,

We are in the process of implementing vfio-pci support for the Freescale IOMMU (PAMU). It is an aperture/window-based IOMMU and is quite different than x86, and will involve creating a 'type 2' vfio implementation. For each device's DMA mappings, PAMU has an overall aperture and a number of windows. All sizes and window counts must be a power of 2. To illustrate, below is a mapping for a 256MB guest, including guest memory (backed by 64MB huge pages) and some windows for MSIs:

Total aperture: 512MB
# of windows: 8

win   gphys/
 #    iova          phys          size
---   ----------    -----------   --------
 0    0x00000000    0xX_XX00      64MB
 1    0x04000000    0xX_XX00      64MB
 2    0x08000000    0xX_XX00      64MB
 3    0x0C000000    0xX_XX00      64MB
 4    0x10000000    0xf_fe044000  4KB      // msi bank 1
 5    0x14000000    0xf_fe045000  4KB      // msi bank 2
 6    0x18000000    0xf_fe046000  4KB      // msi bank 3
 7    -             -             disabled

There are a couple of updates needed to the vfio user-kernel interface that we would like your feedback on.

1. IOMMU geometry

The kernel IOMMU driver now has an interface (see domain_set_attr, domain_get_attr) that lets us set the domain geometry using attributes. We want to expose that to user space, so envision needing a couple of new ioctls to do this:
   VFIO_IOMMU_SET_ATTR
   VFIO_IOMMU_GET_ATTR

Note that this means attributes need to be updated for user-API appropriateness, such as using fixed-size types.

2. MSI window mappings

The more problematic question is how to deal with MSIs. We need to create mappings for up to 3 MSI banks that a device may need to target to generate interrupts. The Linux MSI driver can allocate MSIs from the 3 banks any way it wants, and currently user space has no way of knowing which bank may be used for a given device. There are 3 options we have discussed and would like your direction:

A. Implicit mappings -- with this approach user space would not explicitly map MSIs.
User space would be required to set the geometry so that there are 3 unused windows (the last 3 windows)

Where does userspace get the number 3 from? E.g. on newer chips there are 4 MSI banks. Maybe future chips have even more.

Ok, then make the number 4. The chance of more MSI banks in future chips is nil, and if it ever happened user space could adjust. Also, practically speaking, since memory is typically allocated in power-of-2 sized chunks, you need to approximately double the window geometry anyway.

B. Explicit mapping using DMA map flags. The idea is that a new flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a mapping is to be created for the supplied iova. No vaddr is given though. So in the above example there would be a dma map at 0x10000000 for 24KB (and no vaddr).

A single 24 KiB mapping wouldn't work (and why 24KB? What if only one MSI group is involved in this VFIO group? What if four MSI groups are involved?). You'd need to either have a naturally aligned, power-of-two sized mapping that covers exactly the pages you want to map and no more, or you'd need to create a separate mapping for each MSI bank, and due to PAMU subwindow alignment restrictions these mappings could not be contiguous in iova-space.

You're right, a single 24KB mapping wouldn't work-- in the case of 3 MSI banks perhaps we could just do one 64MB*3 mapping to identify which windows are used for MSIs. If only one MSI bank was involved the kernel could get clever and only enable the banks actually needed.

Stuart
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 03:38:42 PM, Stuart Yoder wrote:

On Tue, Apr 2, 2013 at 2:39 PM, Scott Wood scottw...@freescale.com wrote:

On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:

Alex,

We are in the process of implementing vfio-pci support for the Freescale IOMMU (PAMU). It is an aperture/window-based IOMMU and is quite different than x86, and will involve creating a 'type 2' vfio implementation. For each device's DMA mappings, PAMU has an overall aperture and a number of windows. All sizes and window counts must be a power of 2. To illustrate, below is a mapping for a 256MB guest, including guest memory (backed by 64MB huge pages) and some windows for MSIs:

Total aperture: 512MB
# of windows: 8

win   gphys/
 #    iova          phys          size
---   ----------    -----------   --------
 0    0x00000000    0xX_XX00      64MB
 1    0x04000000    0xX_XX00      64MB
 2    0x08000000    0xX_XX00      64MB
 3    0x0C000000    0xX_XX00      64MB
 4    0x10000000    0xf_fe044000  4KB      // msi bank 1
 5    0x14000000    0xf_fe045000  4KB      // msi bank 2
 6    0x18000000    0xf_fe046000  4KB      // msi bank 3
 7    -             -             disabled

There are a couple of updates needed to the vfio user-kernel interface that we would like your feedback on.

1. IOMMU geometry

The kernel IOMMU driver now has an interface (see domain_set_attr, domain_get_attr) that lets us set the domain geometry using attributes. We want to expose that to user space, so envision needing a couple of new ioctls to do this:
   VFIO_IOMMU_SET_ATTR
   VFIO_IOMMU_GET_ATTR

Note that this means attributes need to be updated for user-API appropriateness, such as using fixed-size types.

2. MSI window mappings

The more problematic question is how to deal with MSIs. We need to create mappings for up to 3 MSI banks that a device may need to target to generate interrupts. The Linux MSI driver can allocate MSIs from the 3 banks any way it wants, and currently user space has no way of knowing which bank may be used for a given device. There are 3 options we have discussed and would like your direction:

A. Implicit mappings -- with this approach user space would not explicitly map MSIs.
User space would be required to set the geometry so that there are 3 unused windows (the last 3 windows)

Where does userspace get the number 3 from? E.g. on newer chips there are 4 MSI banks. Maybe future chips have even more.

Ok, then make the number 4. The chance of more MSI banks in future chips is nil,

What makes you so sure? Especially since you seem to be presenting this as not specifically an MPIC API.

and if it ever happened user space could adjust.

What bit of API is going to tell it that it needs to adjust?

Also, practically speaking, since memory is typically allocated in power-of-2 sized chunks, you need to approximately double the window geometry anyway.

Only if your existing mapping needs fit exactly in a power of two.

B. Explicit mapping using DMA map flags. The idea is that a new flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a mapping is to be created for the supplied iova. No vaddr is given though. So in the above example there would be a dma map at 0x10000000 for 24KB (and no vaddr).

A single 24 KiB mapping wouldn't work (and why 24KB? What if only one MSI group is involved in this VFIO group? What if four MSI groups are involved?). You'd need to either have a naturally aligned, power-of-two sized mapping that covers exactly the pages you want to map and no more, or you'd need to create a separate mapping for each MSI bank, and due to PAMU subwindow alignment restrictions these mappings could not be contiguous in iova-space.

You're right, a single 24KB mapping wouldn't work-- in the case of 3 MSI banks perhaps we could just do one 64MB*3 mapping to identify which windows are used for MSIs.

Where did the assumption of a 64MiB subwindow size come from?

If only one MSI bank was involved the kernel could get clever and only enable the banks actually needed.

I'd rather see cleverness kept in userspace.
-Scott
Re: RFC: vfio API changes needed for powerpc
On Tue, Apr 2, 2013 at 3:32 PM, Alex Williamson alex.william...@redhat.com wrote:

2. MSI window mappings

The more problematic question is how to deal with MSIs. We need to create mappings for up to 3 MSI banks that a device may need to target to generate interrupts. The Linux MSI driver can allocate MSIs from the 3 banks any way it wants, and currently user space has no way of knowing which bank may be used for a given device. There are 3 options we have discussed and would like your direction:

A. Implicit mappings -- with this approach user space would not explicitly map MSIs. User space would be required to set the geometry so that there are 3 unused windows (the last 3 windows) for MSIs, and it would be up to the kernel to create the mappings. This approach requires some specific semantics (leaving 3 windows) and it potentially gets a little weird-- when should the kernel actually create the MSI mappings? When should they be unmapped? Some convention would need to be established.

VFIO would have control of SET/GET_ATTR, right? So we could reduce the number exposed to userspace on GET and transparently add MSI entries on SET.

The number of windows is always a power of 2 (and max is 256). And to reduce PAMU cache pressure you want to use the fewest number of windows you can. So, I don't see practically how we could transparently steal entries to add the MSIs. Either user space knows to leave empty windows for MSIs and by convention the kernel knows which windows those are (as in option #A) or explicitly tell the kernel which windows (as in option #B).

On x86 the interrupt remapper handles this transparently when MSI is enabled and userspace never gets direct access to the device MSI address/data registers. What kind of restrictions do you have around adding and removing windows while the aperture is enabled?

The windows can be enabled/disabled even while the aperture is enabled (pretty sure)...

B. Explicit mapping using DMA map flags.
The idea is that a new flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a mapping is to be created for the supplied iova. No vaddr is given though. So in the above example there would be a dma map at 0x10000000 for 24KB (and no vaddr). It's up to the kernel to determine which bank gets mapped where. So, this option puts user space in control of which windows are used for MSIs and when MSIs are mapped/unmapped. There would need to be some semantics as to how this is used-- it only makes sense

This could also be done as another type2 ioctl extension. What's the value to userspace in determining which windows are used by which banks? It sounds like the case that there are X banks and if userspace wants to use MSI it needs to leave X windows available for that. Is this just buying userspace a few more windows to allow them the choice between MSI or RAM?

Yes, it would potentially give user space the flexibility of some more windows. It also makes more explicit when the MSI mappings are created. In option #A the MSI mappings would probably get created at the time of the first normal DMA map. So, you're saying with this approach you'd rather see a new type 2 ioctl instead of adding new flags to DMA map, right?

C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. The flow would be something like this:
   - for each group, user space calls new ioctl VFIO_GROUP_GET_MSI_FD
   - user space mmaps the fd, getting a vaddr
   - user space does a normal DMA map for desired iova
This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu).

And the DMA_MAP of that mmap then allows userspace to select the window used? This one seems like a lot of overhead, adding a new ioctl, new fd, mmap, special mapping path, etc.
It would be less overhead to just add an ioctl to enable MSI, maybe letting userspace pick which windows get used, but I'm still not sure what the value is to userspace in exposing it. Thanks,

Thanks,
Stuart
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 03:32:17 PM, Alex Williamson wrote:

On Tue, 2013-04-02 at 17:32 +, Yoder Stuart-B08248 wrote:

2. MSI window mappings

The more problematic question is how to deal with MSIs. We need to create mappings for up to 3 MSI banks that a device may need to target to generate interrupts. The Linux MSI driver can allocate MSIs from the 3 banks any way it wants, and currently user space has no way of knowing which bank may be used for a given device. There are 3 options we have discussed and would like your direction:

A. Implicit mappings -- with this approach user space would not explicitly map MSIs. User space would be required to set the geometry so that there are 3 unused windows (the last 3 windows) for MSIs, and it would be up to the kernel to create the mappings. This approach requires some specific semantics (leaving 3 windows) and it potentially gets a little weird-- when should the kernel actually create the MSI mappings? When should they be unmapped? Some convention would need to be established.

VFIO would have control of SET/GET_ATTR, right? So we could reduce the number exposed to userspace on GET and transparently add MSI entries on SET.

What do you mean by reduce the number exposed? Userspace decides how many entries there are, but it must be a power of two between 1 and 256.

On x86 the interrupt remapper handles this transparently when MSI is enabled and userspace never gets direct access to the device MSI address/data registers.

x86 has a totally different mechanism here, as far as I understand -- even before you get into restrictions on mappings.

What kind of restrictions do you have around adding and removing windows while the aperture is enabled?

Subwindows can be modified while the aperture is enabled, but the aperture size and number of subwindows cannot be changed.

B. Explicit mapping using DMA map flags. The idea is that a new flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a mapping is to be created for the supplied iova.
No vaddr is given though. So in the above example there would be a dma map at 0x10000000 for 24KB (and no vaddr). It's up to the kernel to determine which bank gets mapped where. So, this option puts user space in control of which windows are used for MSIs and when MSIs are mapped/unmapped. There would need to be some semantics as to how this is used-- it only makes sense

This could also be done as another type2 ioctl extension.

Again, what is type2, specifically? If someone else is adding their own IOMMU that is kind of, sort of like PAMU, how would they know if it's close enough? What assumptions can a user make when they see that they're dealing with type2?

What's the value to userspace in determining which windows are used by which banks?

That depends on who programs the MSI config space address. What is important is userspace controlling which iovas will be dedicated to this, in case it wants to put something else there.

It sounds like the case that there are X banks and if userspace wants to use MSI it needs to leave X windows available for that. Is this just buying userspace a few more windows to allow them the choice between MSI or RAM?

Well, there could be that. But also, userspace will generally have a much better idea of the type of mappings it's creating, so it's easier to keep everything explicit at the kernel/user interface than require more complicated code in the kernel to figure things out automatically (not just for MSIs but in general). If the kernel automatically creates the MSI mappings, when does it assume that userspace is done creating its own? What if userspace doesn't need any DMA other than the MSIs? What if userspace wants to continue dynamically modifying its other mappings?

C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped.
The flow would be something like this:
   - for each group, user space calls new ioctl VFIO_GROUP_GET_MSI_FD
   - user space mmaps the fd, getting a vaddr
   - user space does a normal DMA map for desired iova
This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu).

And the DMA_MAP of that mmap then allows userspace to select the window used? This one seems like a lot of overhead, adding a new ioctl, new fd, mmap, special mapping path, etc.

There's going to be special stuff no matter what. This would keep it separated from the IOMMU map code. I'm not sure what you mean by overhead here... the runtime overhead of setting things up is not particularly relevant as long
Re: RFC: vfio API changes needed for powerpc
On Tue, Apr 2, 2013 at 3:47 PM, Scott Wood scottw...@freescale.com wrote:

On 04/02/2013 03:38:42 PM, Stuart Yoder wrote:

On Tue, Apr 2, 2013 at 2:39 PM, Scott Wood scottw...@freescale.com wrote:

On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:

Alex,

We are in the process of implementing vfio-pci support for the Freescale IOMMU (PAMU). It is an aperture/window-based IOMMU and is quite different than x86, and will involve creating a 'type 2' vfio implementation. For each device's DMA mappings, PAMU has an overall aperture and a number of windows. All sizes and window counts must be a power of 2. To illustrate, below is a mapping for a 256MB guest, including guest memory (backed by 64MB huge pages) and some windows for MSIs:

Total aperture: 512MB
# of windows: 8

win   gphys/
 #    iova          phys          size
---   ----------    -----------   --------
 0    0x00000000    0xX_XX00      64MB
 1    0x04000000    0xX_XX00      64MB
 2    0x08000000    0xX_XX00      64MB
 3    0x0C000000    0xX_XX00      64MB
 4    0x10000000    0xf_fe044000  4KB      // msi bank 1
 5    0x14000000    0xf_fe045000  4KB      // msi bank 2
 6    0x18000000    0xf_fe046000  4KB      // msi bank 3
 7    -             -             disabled

There are a couple of updates needed to the vfio user-kernel interface that we would like your feedback on.

1. IOMMU geometry

The kernel IOMMU driver now has an interface (see domain_set_attr, domain_get_attr) that lets us set the domain geometry using attributes. We want to expose that to user space, so envision needing a couple of new ioctls to do this:
   VFIO_IOMMU_SET_ATTR
   VFIO_IOMMU_GET_ATTR

Note that this means attributes need to be updated for user-API appropriateness, such as using fixed-size types.

2. MSI window mappings

The more problematic question is how to deal with MSIs. We need to create mappings for up to 3 MSI banks that a device may need to target to generate interrupts. The Linux MSI driver can allocate MSIs from the 3 banks any way it wants, and currently user space has no way of knowing which bank may be used for a given device. There are 3 options we have discussed and would like your direction:

A.
Implicit mappings -- with this approach user space would not explicitly map MSIs. User space would be required to set the geometry so that there are 3 unused windows (the last 3 windows)

Where does userspace get the number 3 from? E.g. on newer chips there are 4 MSI banks. Maybe future chips have even more.

Ok, then make the number 4. The chance of more MSI banks in future chips is nil,

What makes you so sure? Especially since you seem to be presenting this as not specifically an MPIC API.

and if it ever happened user space could adjust.

What bit of API is going to tell it that it needs to adjust?

Haven't thought through that completely, but I guess we could add an API to return the number of MSI banks for type 2 iommus.

Also, practically speaking, since memory is typically allocated in power-of-2 sized chunks, you need to approximately double the window geometry anyway.

Only if your existing mapping needs fit exactly in a power of two.

B. Explicit mapping using DMA map flags. The idea is that a new flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a mapping is to be created for the supplied iova. No vaddr is given though. So in the above example there would be a dma map at 0x10000000 for 24KB (and no vaddr).

A single 24 KiB mapping wouldn't work (and why 24KB? What if only one MSI group is involved in this VFIO group? What if four MSI groups are involved?). You'd need to either have a naturally aligned, power-of-two sized mapping that covers exactly the pages you want to map and no more, or you'd need to create a separate mapping for each MSI bank, and due to PAMU subwindow alignment restrictions these mappings could not be contiguous in iova-space.

You're right, a single 24KB mapping wouldn't work-- in the case of 3 MSI banks perhaps we could just do one 64MB*3 mapping to identify which windows are used for MSIs.

Where did the assumption of a 64MiB subwindow size come from?

The example I was using.
User space would need to create a mapping for window_size * msi_bank_count.

Stuart
Re: RFC: vfio API changes needed for powerpc
On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote: This could also be done as another type2 ioctl extension. Again, what is type2, specifically? If someone else is adding their own IOMMU that is kind of, sort of like PAMU, how would they know if it's close enough? What assumptions can a user make when they see that they're dealing with type2? We will define that as part of the type2 implementation. Highly unlikely anything but a PAMU will comply. What's the value to userspace in determining which windows are used by which banks? That depends on who programs the MSI config space address. What is important is userspace controlling which iovas will be dedicated to this, in case it wants to put something else there. It sounds like the case that there are X banks and if userspace wants to use MSI it needs to leave X windows available for that. Is this just buying userspace a few more windows to allow them the choice between MSI or RAM? Well, there could be that. But also, userspace will generally have a much better idea of the type of mappings it's creating, so it's easier to keep everything explicit at the kernel/user interface than require more complicated code in the kernel to figure things out automatically (not just for MSIs but in general). If the kernel automatically creates the MSI mappings, when does it assume that userspace is done creating its own? What if userspace doesn't need any DMA other than the MSIs? What if userspace wants to continue dynamically modifying its other mappings? C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. 
The flow would be something like this:
   - for each group, user space calls new ioctl VFIO_GROUP_GET_MSI_FD
   - user space mmaps the fd, getting a vaddr
   - user space does a normal DMA map for desired iova
This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu).

And the DMA_MAP of that mmap then allows userspace to select the window used? This one seems like a lot of overhead, adding a new ioctl, new fd, mmap, special mapping path, etc.

There's going to be special stuff no matter what. This would keep it separated from the IOMMU map code. I'm not sure what you mean by overhead here... the runtime overhead of setting things up is not particularly relevant as long as it's reasonable. If you mean development and maintenance effort, keeping things well separated should help.

We don't need to change DMA_MAP. If we can simply add a new type 2 ioctl that allows user space to set which windows are MSIs, it seems vastly less complex than an ioctl to supply a new fd, mmap of it, etc. So maybe 2 ioctls:
   VFIO_IOMMU_GET_MSI_COUNT
   VFIO_IOMMU_MAP_MSI(iova, size)

Stuart
Re: RFC: vfio API changes needed for powerpc
On Tue, 2013-04-02 at 15:54 -0500, Stuart Yoder wrote:

On Tue, Apr 2, 2013 at 3:32 PM, Alex Williamson alex.william...@redhat.com wrote:

2. MSI window mappings

The more problematic question is how to deal with MSIs. We need to create mappings for up to 3 MSI banks that a device may need to target to generate interrupts. The Linux MSI driver can allocate MSIs from the 3 banks any way it wants, and currently user space has no way of knowing which bank may be used for a given device. There are 3 options we have discussed and would like your direction:

A. Implicit mappings -- with this approach user space would not explicitly map MSIs. User space would be required to set the geometry so that there are 3 unused windows (the last 3 windows) for MSIs, and it would be up to the kernel to create the mappings. This approach requires some specific semantics (leaving 3 windows) and it potentially gets a little weird-- when should the kernel actually create the MSI mappings? When should they be unmapped? Some convention would need to be established.

VFIO would have control of SET/GET_ATTR, right? So we could reduce the number exposed to userspace on GET and transparently add MSI entries on SET.

The number of windows is always a power of 2 (and max is 256). And to reduce PAMU cache pressure you want to use the fewest number of windows you can. So, I don't see practically how we could transparently steal entries to add the MSIs. Either user space knows to leave empty windows for MSIs and by convention the kernel knows which windows those are (as in option #A) or explicitly tell the kernel which windows (as in option #B).

Ok, apparently I don't understand the API. Is it something like userspace calls GET_ATTR and finds out that there are 256 available windows, userspace determines that it needs 8 for RAM and then it has an MSI device, so it needs to call SET_ATTR and ask for 16?
That seems prone to exploitation by the first userspace to allocate its aperture, but I'm also not sure why userspace couldn't specify the (non-power of 2) number of windows it needs for RAM, then VFIO would see that the devices attached have MSI and add those windows and align to a power of 2.

On x86 the interrupt remapper handles this transparently when MSI is enabled and userspace never gets direct access to the device MSI address/data registers. What kind of restrictions do you have around adding and removing windows while the aperture is enabled?

The windows can be enabled/disabled even while the aperture is enabled (pretty sure)...

B. Explicit mapping using DMA map flags. The idea is that a new flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a mapping is to be created for the supplied iova. No vaddr is given though. So in the above example there would be a dma map at 0x10000000 for 24KB (and no vaddr). It's up to the kernel to determine which bank gets mapped where. So, this option puts user space in control of which windows are used for MSIs and when MSIs are mapped/unmapped. There would need to be some semantics as to how this is used-- it only makes sense

This could also be done as another type2 ioctl extension. What's the value to userspace in determining which windows are used by which banks? It sounds like the case that there are X banks and if userspace wants to use MSI it needs to leave X windows available for that. Is this just buying userspace a few more windows to allow them the choice between MSI or RAM?

Yes, it would potentially give user space the flexibility of some more windows. It also makes more explicit when the MSI mappings are created. In option #A the MSI mappings would probably get created at the time of the first normal DMA map. So, you're saying with this approach you'd rather see a new type 2 ioctl instead of adding new flags to DMA map, right?

I'm not sure I know enough yet to have a suggestion.
What would be the purpose of userspace specifying the iova and size here? If userspace just needs to know that it needs X additional windows for MSI and can tell the kernel to use banks 0 through (X-1) for MSI, that sounds more like an ioctl interface than a DMA_MAP flag. Thanks,

Alex

C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. The flow would be something like this:
   - for each group, user space calls new ioctl VFIO_GROUP_GET_MSI_FD
   - user space mmaps the fd, getting a vaddr
   - user space does a normal DMA map for desired iova
This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu).
Re: RFC: vfio API changes needed for powerpc
On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:

On 04/02/2013 03:32:17 PM, Alex Williamson wrote:

On Tue, 2013-04-02 at 17:32 +, Yoder Stuart-B08248 wrote:

2. MSI window mappings

The more problematic question is how to deal with MSIs. We need to create mappings for up to 3 MSI banks that a device may need to target to generate interrupts. The Linux MSI driver can allocate MSIs from the 3 banks any way it wants, and currently user space has no way of knowing which bank may be used for a given device. There are 3 options we have discussed and would like your direction:

A. Implicit mappings -- with this approach user space would not explicitly map MSIs. User space would be required to set the geometry so that there are 3 unused windows (the last 3 windows) for MSIs, and it would be up to the kernel to create the mappings. This approach requires some specific semantics (leaving 3 windows) and it potentially gets a little weird-- when should the kernel actually create the MSI mappings? When should they be unmapped? Some convention would need to be established.

VFIO would have control of SET/GET_ATTR, right? So we could reduce the number exposed to userspace on GET and transparently add MSI entries on SET.

What do you mean by reduce the number exposed? Userspace decides how many entries there are, but it must be a power of two between 1 and 256.

I didn't understand the API.

On x86 the interrupt remapper handles this transparently when MSI is enabled and userspace never gets direct access to the device MSI address/data registers.

x86 has a totally different mechanism here, as far as I understand -- even before you get into restrictions on mappings.

So what control will userspace have over programming the actual MSI vectors on PAMU?

What kind of restrictions do you have around adding and removing windows while the aperture is enabled?

Subwindows can be modified while the aperture is enabled, but the aperture size and number of subwindows cannot be changed.

B.
Explicit mapping using DMA map flags. The idea is that a new flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a mapping is to be created for the supplied iova. No vaddr is given though. So in the above example there would be a dma map at 0x10000000 for 24KB (and no vaddr). It's up to the kernel to determine which bank gets mapped where. So, this option puts user space in control of which windows are used for MSIs and when MSIs are mapped/unmapped. There would need to be some semantics as to how this is used-- it only makes sense

This could also be done as another type2 ioctl extension.

Again, what is type2, specifically? If someone else is adding their own IOMMU that is kind of, sort of like PAMU, how would they know if it's close enough? What assumptions can a user make when they see that they're dealing with type2?

Naming always has and always will be a problem. I assume this is named type2 rather than PAMU because it's trying to expose a generic windowed IOMMU fitting the IOMMU API. Like type1, it doesn't really make sense to name it IOMMU API because that's a kernel internal interface and we're designing a userspace interface that just happens to use that. Tagging it to a piece of hardware makes it less reusable. Type1 is arbitrary. It might as well be named brown and this one can be blue.

What's the value to userspace in determining which windows are used by which banks?

That depends on who programs the MSI config space address. What is important is userspace controlling which iovas will be dedicated to this, in case it wants to put something else there.

So userspace is programming the MSI vectors, targeting a user programmed iova? But an iova selects a window and I thought there were some number of MSI banks and we don't really know which ones we'll need... still confused.

It sounds like the case that there are X banks and if userspace wants to use MSI it needs to leave X windows available for that.
Is this just buying userspace a few more windows to allow them the choice between MSI or RAM? Well, there could be that. But also, userspace will generally have a much better idea of the type of mappings it's creating, so it's easier to keep everything explicit at the kernel/user interface than require more complicated code in the kernel to figure things out automatically (not just for MSIs but in general). If the kernel automatically creates the MSI mappings, when does it assume that userspace is done creating its own? What if userspace doesn't need any DMA other than the MSIs? What if userspace wants to continue
Re: RFC: vfio API changes needed for powerpc
On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote: On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote:

C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. The flow would be something like this:
- for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
- user space mmaps the fd, getting a vaddr
- user space does a normal DMA map for desired iova
This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu).

And the DMA_MAP of that mmap then allows userspace to select the window used? This one seems like a lot of overhead, adding a new ioctl, new fd, mmap, special mapping path, etc.

There's going to be special stuff no matter what. This would keep it separated from the IOMMU map code. I'm not sure what you mean by overhead here... the runtime overhead of setting things up is not particularly relevant as long as it's reasonable. If you mean development and maintenance effort, keeping things well separated should help.

We don't need to change DMA_MAP. If we can simply add a new type 2 ioctl that allows user space to set which windows are MSIs, it seems vastly less complex than an ioctl to supply a new fd, mmap of it, etc. So maybe 2 ioctls: VFIO_IOMMU_GET_MSI_COUNT VFIO_IOMMU_MAP_MSI(iova, size)

How are MSIs related to devices on PAMU? On x86 MSI count is very device specific, which means it would be a VFIO_DEVICE_* ioctl (actually VFIO_DEVICE_GET_IRQ_INFO does this for us on x86). The trouble with it being a device ioctl is that you need to get the device FD, but the IOMMU protection needs to be established before you can get that... so there's an ordering problem if you need it from the device before configuring the IOMMU.
Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 04:08:27 PM, Stuart Yoder wrote: On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote: This could also be done as another type2 ioctl extension. Again, what is type2, specifically? If someone else is adding their own IOMMU that is kind of, sort of like PAMU, how would they know if it's close enough? What assumptions can a user make when they see that they're dealing with type2? We will define that as part of the type2 implementation. Highly unlikely anything but a PAMU will comply. So then why not just call it pamu instead of being obfuscatory? There's going to be special stuff no matter what. This would keep it separated from the IOMMU map code. I'm not sure what you mean by overhead here... the runtime overhead of setting things up is not particularly relevant as long as it's reasonable. If you mean development and maintenance effort, keeping things well separated should help. We don't need to change DMA_MAP. If we can simply add a new type 2 ioctl that allows user space to set which windows are MSIs, And what specifically does that ioctl do? It causes new mappings to be created, right? So you're changing (or at least adding to) the DMA map mechanism. it seems vastly less complex than an ioctl to supply a new fd, mmap of it, etc. I don't see enough complexity in the mmap approach for anything to be vastly less complex in comparison. I think you're building the mmap approach up in your head to be a lot worse than it would actually be. -Scott
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 04:16:11 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 15:54 -0500, Stuart Yoder wrote: The number of windows is always power of 2 (and max is 256). And to reduce PAMU cache pressure you want to use the fewest number of windows you can. So, I don't see practically how we could transparently steal entries to add the MSIs. Either user space knows to leave empty windows for MSIs and by convention the kernel knows which windows those are (as in option #A) or explicitly tell the kernel which windows (as in option #B). Ok, apparently I don't understand the API. Is it something like userspace calls GET_ATTR and finds out that there are 256 available windows, userspace determines that it needs 8 for RAM and then it has an MSI device, so it needs to call SET_ATTR and ask for 16? That seems prone to exploitation by the first userspace to allocate its aperture, What exploitation? It's not as if there is a pool of 256 global windows that users allocate from. The subwindow count is just how finely divided the aperture is. The only way one user will affect another is through cache contention (which is why we want the minimum number of subwindows that we can get away with). but I'm also not sure why userspace could specify the (non-power of 2) number of windows it needs for RAM, then VFIO would see that the devices attached have MSI and add those windows and align to a power of 2. If you double the subwindow count without userspace knowing, you have to double the aperture as well (and you may need to grow up or down depending on alignment). This means you also need to halve the maximum aperture that userspace can request. And you need to expose a different number of maximum subwindows in the IOMMU API based on whether we might have MSIs of this type. It's ugly and awkward, and removes the possibility for userspace to place the MSIs in some unused slot in the middle, or not use MSIs at all.
-Scott
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 04:32:04 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote: On 04/02/2013 03:32:17 PM, Alex Williamson wrote: On x86 the interrupt remapper handles this transparently when MSI is enabled and userspace never gets direct access to the device MSI address/data registers. x86 has a totally different mechanism here, as far as I understand -- even before you get into restrictions on mappings. So what control will userspace have over programming the actual MSI vectors on PAMU? Not sure what you mean -- PAMU doesn't get explicitly involved in MSIs. It's just another 4K page mapping (per relevant MSI bank). If you want isolation, you need to make sure that an MSI group is only used by one VFIO group, and that you're on a chip that has alias pages with just one MSI bank register each (newer chips do, but the first chip to have a PAMU didn't). This could also be done as another type2 ioctl extension. Again, what is type2, specifically? If someone else is adding their own IOMMU that is kind of, sort of like PAMU, how would they know if it's close enough? What assumptions can a user make when they see that they're dealing with type2? Naming always has and always will be a problem. I assume this is named type2 rather than PAMU because it's trying to expose a generic windowed IOMMU fitting the IOMMU API. But how closely is the MSI situation related to a generic windowed IOMMU, then? We could just as well have a highly flexible IOMMU in terms of arbitrary 4K page mappings, but still handle MSIs as pages to be mapped rather than a translation table. Or we could have a windowed IOMMU that has an MSI translation table. Like type1, it doesn't really make sense to name it IOMMU API because that's a kernel internal interface and we're designing a userspace interface that just happens to use that. Tagging it to a piece of hardware makes it less reusable. Well, that's my point. Is it reusable at all, anyway?
If not, then giving it a more obscure name won't change that. If it is reusable, then where is the line drawn between things that are PAMU-specific or MPIC-specific and things that are part of the generic windowed IOMMU abstraction? Type1 is arbitrary. It might as well be named brown and this one can be blue. The difference is that type1 seems to refer to hardware that can do arbitrary 4K page mappings, possibly constrained by an aperture but nothing else. More than one IOMMU can reasonably fit that. The odds that another IOMMU would have exactly the same restrictions as PAMU seem smaller in comparison. In any case, if you had to deal with some Intel-only quirk, would it make sense to call it a type1 attribute? I'm not advocating one way or the other on whether an abstraction is viable here (though Stuart seems to think it's highly unlikely anything but a PAMU will comply), just that if it is to be abstracted rather than a hardware-specific interface, we need to document what is and is not part of the abstraction. Otherwise a non-PAMU-specific user won't know what they can rely on, and someone adding support for a new windowed IOMMU won't know if theirs is close enough, or they need to introduce a type3. What's the value to userspace in determining which windows are used by which banks? That depends on who programs the MSI config space address. What is important is userspace controlling which iovas will be dedicated to this, in case it wants to put something else there. So userspace is programming the MSI vectors, targeting a user programmed iova? But an iova selects a window and I thought there were some number of MSI banks and we don't really know which ones we'll need... still confused. Userspace would also need a way to find out the page offset and data value. That may be an argument in favor of having the two ioctls Stuart later suggested (get MSI count, and map MSI). 
Would there be any complication in the VFIO code from tracking a mapping that doesn't have a userspace virtual address associated with it? There's going to be special stuff no matter what. This would keep it separated from the IOMMU map code. I'm not sure what you mean by overhead here... the runtime overhead of setting things up is not particularly relevant as long as it's reasonable. If you mean development and maintenance effort, keeping things well separated should help. Overhead in terms of code required and complexity. More things to reference count and shut down in the proper order on userspace exit. Thanks, That didn't stop others from having me convert the KVM device control API to use file descriptors instead of something more ad-hoc with a better-defined destruction order. :-) I don't know if it necessarily needs to be a separate fd -- it could be just another device resource like BARs, with some way for userspace to
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 04:38:45 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote: On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote: C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. The flow would be something like this: -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD -user space mmaps the fd, getting a vaddr -user space does a normal DMA map for desired iova This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu). And the DMA_MAP of that mmap then allows userspace to select the window used? This one seems like a lot of overhead, adding a new ioctl, new fd, mmap, special mapping path, etc. There's going to be special stuff no matter what. This would keep it separated from the IOMMU map code. I'm not sure what you mean by overhead here... the runtime overhead of setting things up is not particularly relevant as long as it's reasonable. If you mean development and maintenance effort, keeping things well separated should help. We don't need to change DMA_MAP. If we can simply add a new type 2 ioctl that allows user space to set which windows are MSIs, it seems vastly less complex than an ioctl to supply a new fd, mmap of it, etc. So maybe 2 ioctls: VFIO_IOMMU_GET_MSI_COUNT Do you mean a count of actual MSIs or a count of MSI banks used by the whole VFIO group? VFIO_IOMMU_MAP_MSI(iova, size) Not sure how you mean size to be used -- for MPIC it would be 4K per bank, and you can only map one bank at a time (which bank you're mapping should be a parameter, if only so that the kernel doesn't have to keep iteration state for you). How are MSIs related to devices on PAMU? PAMU doesn't care about MSIs. The relation of individual MSIs to a device is standard PCI stuff. Each MSI bank (which is part of the MPIC, not PAMU) can hold numerous MSIs. 
The VFIO user would want to map all MSI banks that are in use by any of the devices in the group. Ideally we'd let the VFIO grouping influence the allocation of MSIs. On x86 MSI count is very device specific, which means it would be a VFIO_DEVICE_* ioctl (actually VFIO_DEVICE_GET_IRQ_INFO does this for us on x86). The trouble with it being a device ioctl is that you need to get the device FD, but the IOMMU protection needs to be established before you can get that... so there's an ordering problem if you need it from the device before configuring the IOMMU. Thanks, What do you mean by IOMMU protection needs to be established? Wouldn't we just start with no mappings in place? -Scott
Re: RFC: vfio API changes needed for powerpc
On Tue, 2013-04-02 at 17:13 -0500, Scott Wood wrote: On 04/02/2013 04:16:11 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 15:54 -0500, Stuart Yoder wrote: The number of windows is always power of 2 (and max is 256). And to reduce PAMU cache pressure you want to use the fewest number of windows you can. So, I don't see practically how we could transparently steal entries to add the MSIs. Either user space knows to leave empty windows for MSIs and by convention the kernel knows which windows those are (as in option #A) or explicitly tell the kernel which windows (as in option #B). Ok, apparently I don't understand the API. Is it something like userspace calls GET_ATTR and finds out that there are 256 available windows, userspace determines that it needs 8 for RAM and then it has an MSI device, so it needs to call SET_ATTR and ask for 16? That seems prone to exploitation by the first userspace to allocate its aperture, What exploitation? It's not as if there is a pool of 256 global windows that users allocate from. The subwindow count is just how finely divided the aperture is. The only way one user will affect another is through cache contention (which is why we want the minimum number of subwindows that we can get away with). but I'm also not sure why userspace could specify the (non-power of 2) number of windows it needs for RAM, then VFIO would see that the devices attached have MSI and add those windows and align to a power of 2. If you double the subwindow count without userspace knowing, you have to double the aperture as well (and you may need to grow up or down depending on alignment). This means you also need to halve the maximum aperture that userspace can request. And you need to expose a different number of maximum subwindows in the IOMMU API based on whether we might have MSIs of this type. It's ugly and awkward, and removes the possibility for userspace to place the MSIs in some unused slot in the middle, or not use MSIs at all.
Ok, I missed this in Stuart's example:

Total aperture: 512MB
# of windows: 8

win #   iova          phys           size
-----   ----          ----           ----
  0     0x00000000    0xX_XX000000   64MB
  1     0x04000000    0xX_XX000000   64MB
  2     0x08000000    0xX_XX000000   64MB
  3     0x0C000000    0xX_XX000000   64MB
  4     0x10000000    0xf_fe044000   4KB    // msi bank 1
  5     0x14000000    0xf_fe045000   4KB    // msi bank 2
  6     0x18000000    0xf_fe046000   4KB    // msi bank 3
  7     -             -              disabled

So even though the MSI banks are 4k in this example, they're still on 64MB boundaries. If userspace were to leave this as 256 windows, each would be 2MB and we'd use 128 of them to map the same memory as these 4x64MB windows and thrash the iotlb harder. The picture is becoming clearer. Thanks, Alex
Re: RFC: vfio API changes needed for powerpc
On Tue, 2013-04-02 at 17:44 -0500, Scott Wood wrote: On 04/02/2013 04:32:04 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote: On 04/02/2013 03:32:17 PM, Alex Williamson wrote: On x86 the interrupt remapper handles this transparently when MSI is enabled and userspace never gets direct access to the device MSI address/data registers. x86 has a totally different mechanism here, as far as I understand -- even before you get into restrictions on mappings. So what control will userspace have over programming the actual MSI vectors on PAMU? Not sure what you mean -- PAMU doesn't get explicitly involved in MSIs. It's just another 4K page mapping (per relevant MSI bank). If you want isolation, you need to make sure that an MSI group is only used by one VFIO group, and that you're on a chip that has alias pages with just one MSI bank register each (newer chips do, but the first chip to have a PAMU didn't). How does a user figure this out? This could also be done as another type2 ioctl extension. Again, what is type2, specifically? If someone else is adding their own IOMMU that is kind of, sort of like PAMU, how would they know if it's close enough? What assumptions can a user make when they see that they're dealing with type2? Naming always has and always will be a problem. I assume this is named type2 rather than PAMU because it's trying to expose a generic windowed IOMMU fitting the IOMMU API. But how closely is the MSI situation related to a generic windowed IOMMU, then? We could just as well have a highly flexible IOMMU in terms of arbitrary 4K page mappings, but still handle MSIs as pages to be mapped rather than a translation table. Or we could have a windowed IOMMU that has an MSI translation table. Like type1, it doesn't really make sense to name it IOMMU API because that's a kernel internal interface and we're designing a userspace interface that just happens to use that. Tagging it to a piece of hardware makes it less reusable.
Well, that's my point. Is it reusable at all, anyway? If not, then giving it a more obscure name won't change that. If it is reusable, then where is the line drawn between things that are PAMU-specific or MPIC-specific and things that are part of the generic windowed IOMMU abstraction? Type1 is arbitrary. It might as well be named brown and this one can be blue. The difference is that type1 seems to refer to hardware that can do arbitrary 4K page mappings, possibly constrained by an aperture but nothing else. More than one IOMMU can reasonably fit that. The odds that another IOMMU would have exactly the same restrictions as PAMU seem smaller in comparison. In any case, if you had to deal with some Intel-only quirk, would it make sense to call it a type1 attribute? I'm not advocating one way or the other on whether an abstraction is viable here (though Stuart seems to think it's highly unlikely anything but a PAMU will comply), just that if it is to be abstracted rather than a hardware-specific interface, we need to document what is and is not part of the abstraction. Otherwise a non-PAMU-specific user won't know what they can rely on, and someone adding support for a new windowed IOMMU won't know if theirs is close enough, or they need to introduce a type3. So Alexey named the SPAPR IOMMU something related to spapr... surprisingly enough. I'm fine with that. If you think it's unique enough, name it something appropriately. I haven't seen the code and don't know the architecture sufficiently to have an opinion. What's the value to userspace in determining which windows are used by which banks? That depends on who programs the MSI config space address. What is important is userspace controlling which iovas will be dedicated to this, in case it wants to put something else there. So userspace is programming the MSI vectors, targeting a user programmed iova? 
But an iova selects a window and I thought there were some number of MSI banks and we don't really know which ones we'll need... still confused. Userspace would also need a way to find out the page offset and data value. That may be an argument in favor of having the two ioctls Stuart later suggested (get MSI count, and map MSI). Connecting the user-set iova and the host-kernel-assigned irq number is where I'm still lost, but I'll follow up with that question in the other thread. Would there be any complication in the VFIO code from tracking a mapping that doesn't have a userspace virtual address associated with it? Only the VFIO iommu driver tracks mappings, the QEMU userspace component doesn't (relies on the memory API for type1), nor does any of the kernel framework code. There's going to be special stuff no matter
Re: RFC: vfio API changes needed for powerpc
On Tue, 2013-04-02 at 17:50 -0500, Scott Wood wrote: On 04/02/2013 04:38:45 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote: On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote: C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. The flow would be something like this: -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD -user space mmaps the fd, getting a vaddr -user space does a normal DMA map for desired iova This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu). And the DMA_MAP of that mmap then allows userspace to select the window used? This one seems like a lot of overhead, adding a new ioctl, new fd, mmap, special mapping path, etc. There's going to be special stuff no matter what. This would keep it separated from the IOMMU map code. I'm not sure what you mean by overhead here... the runtime overhead of setting things up is not particularly relevant as long as it's reasonable. If you mean development and maintenance effort, keeping things well separated should help. We don't need to change DMA_MAP. If we can simply add a new type 2 ioctl that allows user space to set which windows are MSIs, it seems vastly less complex than an ioctl to supply a new fd, mmap of it, etc. So maybe 2 ioctls: VFIO_IOMMU_GET_MSI_COUNT Do you mean a count of actual MSIs or a count of MSI banks used by the whole VFIO group? I hope the latter, which would clarify how this is distinct from DEVICE_GET_IRQ_INFO. Is hotplug even on the table? Presumably dynamically adding a device could bring along additional MSI banks? 
VFIO_IOMMU_MAP_MSI(iova, size) Not sure how you mean size to be used -- for MPIC it would be 4K per bank, and you can only map one bank at a time (which bank you're mapping should be a parameter, if only so that the kernel doesn't have to keep iteration state for you). How are MSIs related to devices on PAMU? PAMU doesn't care about MSIs. The relation of individual MSIs to a device is standard PCI stuff. Each MSI bank (which is part of the MPIC, not PAMU) can hold numerous MSIs. The VFIO user would want to map all MSI banks that are in use by any of the devices in the group. Ideally we'd let the VFIO grouping influence the allocation of MSIs. The current VFIO MSI support has the host handling everything about MSI. The user never programs an MSI vector to the physical device, they set up everything through ioctl. On interrupt, we simply trigger an eventfd and leave it to things like KVM irqfd or QEMU to do the right thing in a virtual machine. Here the MSI vector has to go through a PAMU window to hit the correct MSI bank. So that means it has some component of the iova involved, which we're proposing here is controlled by userspace (whether that vector uses an offset from 0x1000 or 0x depending on which window slot is used to make the MSI bank). I assume we're still working in a model where the physical interrupt fires into the host and a host-based interrupt handler triggers an eventfd, right? So that means the vector also has host components so we trigger the correct ISR. How is that coordinated? Would it be possible for userspace to simply leave room for MSI bank mapping (how much room could be determined by something like VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can DMA_MAP starting at the 0x0 address of the aperture, growing up, and VFIO will map banks on demand at the top of the aperture, growing down?
Wouldn't that avoid a lot of issues with userspace needing to know anything about MSI banks (other than count) and coordinating irq numbers and enabling handlers? On x86 MSI count is very device specific, which means it would be a VFIO_DEVICE_* ioctl (actually VFIO_DEVICE_GET_IRQ_INFO does this for us on x86). The trouble with it being a device ioctl is that you need to get the device FD, but the IOMMU protection needs to be established before you can get that... so there's an ordering problem if you need it from the device before configuring the IOMMU. Thanks, What do you mean by IOMMU protection needs to be established? Wouldn't we just start with no mappings in place? If no mappings blocks all DMA, sure, that's fine. Once the VFIO device FD is accessible by userspace we have to protect the host against DMA. If any IOMMU_SET_ATTR calls temporarily disable DMA protection, that could be exploitable. Thanks, Alex