Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
On Sun, Oct 11, 2015 at 12:19:54PM +0300, Michael S. Tsirkin wrote:
> > But since you must pass the same value to open(), you already know
> > that you're using noiommu.
> >
> > > VFIO_IOMMU_MAP_DMA, VFIO_IOMMU_ENABLE and VFIO_IOMMU_DISABLE
> > > will probably also fail ...
> >
> > Don't you have to call MAP_DMA to pin the memory?
>
> Well, check it out - the patch in question doesn't implement this ioctl.
> In fact it doesn't implement anything except CHECK_EXTENSION.
>
> And this makes sense to me: MAP_DMA maps a virtual address to an io
> address, and that doesn't work for the dummy iommu.
>
> You can pin memory using many other ways, including
> mlock and hugetlbfs.
>
mlock() does not pin memory.

--
	Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
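Since the no-iommu RFC implements nothing but CHECK_EXTENSION, the only probe userspace can do on such a container looks roughly like the sketch below. This is a hedged illustration: the ioctl numbers follow the definitions in linux/vfio.h, but the extension ID for the no-iommu backend is an assumption here, since the RFC interface was still in flux at the time of this thread.

```c
#include <sys/ioctl.h>

/* ioctl numbers as defined in linux/vfio.h */
#define VFIO_TYPE            (';')
#define VFIO_BASE            100
#define VFIO_GET_API_VERSION _IO(VFIO_TYPE, VFIO_BASE + 0)
#define VFIO_CHECK_EXTENSION _IO(VFIO_TYPE, VFIO_BASE + 1)

/* Assumed extension ID for the no-iommu backend (the RFC was in flux). */
#define VFIO_NOIOMMU_IOMMU   8

/*
 * Ask a vfio container fd whether it supports the given extension.
 * VFIO_CHECK_EXTENSION returns > 0 for "supported"; 0 or a negative
 * value (including ioctl() failure on a bad fd) means "not supported".
 */
static int has_extension(int container_fd, int ext)
{
    int ret = ioctl(container_fd, VFIO_CHECK_EXTENSION, ext);
    return ret > 0;
}
```

A caller would open /dev/vfio/vfio and pass the resulting fd; every other ioctl (MAP_DMA and friends) is expected to fail on a no-iommu container.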
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 08:39:10PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2015 at 08:01:21PM +0300, Gleb Natapov wrote:
> > On Thu, Oct 08, 2015 at 07:43:04PM +0300, Michael S. Tsirkin wrote:
> > > On Thu, Oct 08, 2015 at 04:28:34PM +0300, Gleb Natapov wrote:
> > > > On Thu, Oct 08, 2015 at 04:20:04PM +0300, Michael S. Tsirkin wrote:
> > > > > On Thu, Oct 08, 2015 at 03:27:37PM +0300, Gleb Natapov wrote:
> > > > > > On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote:
> > > > > > > On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote:
> > > > > > > >
> > > > > > > > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote:
> > > > > > > > > On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote:
> > > > > > > > >> On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote:
> > > > > > > > >>> On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote:
> > > > > > > > >>>> It is good practice to defend against root oopsing the
> > > > > > > > >>>> kernel, but in some cases it cannot be achieved.
> > > > > > > > >>> Absolutely. That's one of the issues with these patches.
> > > > > > > > >>> They don't even try where it's absolutely possible.
> > > > > > > > >>
> > > > > > > > >> Are you referring to blocking the maps of the msix BAR areas?
> > > > > > > > > For example. There are more. I listed some of the issues on
> > > > > > > > > the mailing list, and I might have missed some. VFIO has code
> > > > > > > > > to address all this; people should share code to avoid
> > > > > > > > > duplication, or at least read it to understand the issues.
> > > > > > > >
> > > > > > > > All but one of those are unrelated to the patch that adds msix
> > > > > > > > support.
> > > > > > >
> > > > > > > They are related because msix support enables bus mastering.
> > > > > > > Without it the device is passive and can't harm anyone. With it,
> > > > > > > suddenly you need to be very careful with the device to avoid
> > > > > > > corrupting kernel memory.
> > > > > >
> > > > > > Most (if not all) uio_pci_generic users enable pci bus mastering.
> > > > > > The fact that they do that without even tainting the kernel like
> > > > > > the patch does makes the current situation much worse than with
> > > > > > the patch.
> > > > >
> > > > > It isn't worse. It's a sane interface. Whoever enables bus mastering
> > > > > must be careful. If userspace enables bus mastering then userspace
> > > > > needs to be very careful with the device to avoid corrupting kernel
> > > > > memory. If the kernel does it, it's the kernel's responsibility.
> > > >
> > > > Although this definition of sanity sounds strange to me, let's go
> > > > with it for the sake of this email: would it be OK if the proposed
> > > > interface refused to work if bus mastering is not already enabled by
> > > > userspace?
> > >
> > > An interface could be acceptable if there's a fallback where it
> > > works without BM but slower (e.g. poll pending bits).
> >
> > OK.
> >
> > > But not the proposed one.
> >
> > Why? Greg is against the ioctl interface so it will be reworked, but
> > besides that, what is wrong with the concept of binding an msi-x
> > interrupt to an eventfd?
>
> It's not the binding. Managing msi-x just needs more than the puny
> 2 ioctls to get # of vectors and set eventfd.
>
> It interacts in strange ways with reset, and with PM, and ...
>
Sorry, I need examples of what you mean. DMA also "interacts in strange
ways with reset, and with PM, and ..." and it does not have any special
handling anywhere in uio-generic.
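The eventfd binding debated above - userspace gets a file descriptor that becomes readable when an MSI-X vector fires - can be illustrated without any driver at all, since eventfd semantics are plain POSIX-ish Linux. The sketch below only simulates the kernel side with a write(); the ioctl that would hand the eventfd to uio_pci_generic is exactly the part under discussion and is not shown.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Wait for the eventfd to become readable and return the number of
 * signals accumulated since the last read.  An eventfd keeps a 64-bit
 * counter: each write() adds to it, and a read() returns the current
 * value and resets it to zero.  Returns 0 on timeout or error.
 */
static uint64_t wait_for_irq(int efd, int timeout_ms)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    if (poll(&pfd, 1, timeout_ms) != 1)
        return 0;

    uint64_t count = 0;
    if (read(efd, &count, sizeof(count)) != sizeof(count))
        return 0;
    return count;
}
```

In a real driver the kernel's interrupt handler would do the eventfd_signal(), and userspace would sit in poll()/epoll on the fd alongside its other descriptors; coalescing of multiple interrupts into one wakeup falls out of the counter semantics for free.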
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 07:43:04PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2015 at 04:28:34PM +0300, Gleb Natapov wrote:
> > On Thu, Oct 08, 2015 at 04:20:04PM +0300, Michael S. Tsirkin wrote:
> > > On Thu, Oct 08, 2015 at 03:27:37PM +0300, Gleb Natapov wrote:
> > > > On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote:
> > > > > On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote:
> > > > > >
> > > > > > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote:
> > > > > > > On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote:
> > > > > > >> On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote:
> > > > > > >>> On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote:
> > > > > > >>>> It is good practice to defend against root oopsing the
> > > > > > >>>> kernel, but in some cases it cannot be achieved.
> > > > > > >>> Absolutely. That's one of the issues with these patches. They
> > > > > > >>> don't even try where it's absolutely possible.
> > > > > > >>
> > > > > > >> Are you referring to blocking the maps of the msix BAR areas?
> > > > > > > For example. There are more. I listed some of the issues on the
> > > > > > > mailing list, and I might have missed some. VFIO has code to
> > > > > > > address all this; people should share code to avoid duplication,
> > > > > > > or at least read it to understand the issues.
> > > > > >
> > > > > > All but one of those are unrelated to the patch that adds msix
> > > > > > support.
> > > > >
> > > > > They are related because msix support enables bus mastering. Without
> > > > > it the device is passive and can't harm anyone. With it, suddenly
> > > > > you need to be very careful with the device to avoid corrupting
> > > > > kernel memory.
> > > >
> > > > Most (if not all) uio_pci_generic users enable pci bus mastering. The
> > > > fact that they do that without even tainting the kernel like the patch
> > > > does makes the current situation much worse than with the patch.
> > >
> > > It isn't worse. It's a sane interface. Whoever enables bus mastering
> > > must be careful. If userspace enables bus mastering then userspace
> > > needs to be very careful with the device to avoid corrupting kernel
> > > memory. If the kernel does it, it's the kernel's responsibility.
> >
> > Although this definition of sanity sounds strange to me, let's go
> > with it for the sake of this email: would it be OK if the proposed
> > interface refused to work if bus mastering is not already enabled by
> > userspace?
>
> An interface could be acceptable if there's a fallback where it
> works without BM but slower (e.g. poll pending bits).
>
OK.

> But not the proposed one.
>
Why? Greg is against the ioctl interface so it will be reworked, but
besides that, what is wrong with the concept of binding an msi-x
interrupt to an eventfd?

> Really, there's more to making msi-x work with
> userspace drivers than this patch. As I keep telling people, you would
> basically reimplement vfio/pci. Go over it, and see for yourself.
> Almost everything it does is relevant for msi-x. It's just wrong to
> duplicate so much code.
>
The patch is tested and works with msi-x. Restricting access to msi-x
registers, which vfio does, is not relevant here.

--
	Gleb.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 04:20:04PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2015 at 03:27:37PM +0300, Gleb Natapov wrote:
> > On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote:
> > > On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote:
> > > >
> > > > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote:
> > > > > On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote:
> > > > >> On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote:
> > > > >>> On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote:
> > > > >>>> It is good practice to defend against root oopsing the kernel,
> > > > >>>> but in some cases it cannot be achieved.
> > > > >>> Absolutely. That's one of the issues with these patches. They
> > > > >>> don't even try where it's absolutely possible.
> > > > >>
> > > > >> Are you referring to blocking the maps of the msix BAR areas?
> > > > > For example. There are more. I listed some of the issues on the
> > > > > mailing list, and I might have missed some. VFIO has code to
> > > > > address all this; people should share code to avoid duplication,
> > > > > or at least read it to understand the issues.
> > > >
> > > > All but one of those are unrelated to the patch that adds msix
> > > > support.
> > >
> > > They are related because msix support enables bus mastering. Without it
> > > the device is passive and can't harm anyone. With it, suddenly you need
> > > to be very careful with the device to avoid corrupting kernel memory.
> >
> > Most (if not all) uio_pci_generic users enable pci bus mastering. The
> > fact that they do that without even tainting the kernel like the patch
> > does makes the current situation much worse than with the patch.
>
> It isn't worse. It's a sane interface. Whoever enables bus mastering
> must be careful. If userspace enables bus mastering then userspace
> needs to be very careful with the device to avoid corrupting kernel
> memory. If the kernel does it, it's the kernel's responsibility.
>
Although this definition of sanity sounds strange to me, let's go with
it for the sake of this email: would it be OK if the proposed interface
refused to work if bus mastering is not already enabled by userspace?

--
	Gleb.
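Gleb's proposal - have the new interface refuse to work unless userspace has already enabled bus mastering - is cheap to check, because the Bus Master flag is bit 2 of the 16-bit command register at offset 4 of PCI config space, which sysfs exposes at /sys/bus/pci/devices/<BDF>/config. A hedged userspace sketch (the sysfs path in the usage note is illustrative; a kernel-side check would read the command register directly instead):

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Return 1 if the Bus Master bit is set in the given PCI config-space
 * file, 0 otherwise or on any error.  The command register is the
 * little-endian 16-bit word at offset 4; Bus Master is bit 2 (0x4),
 * so it lives in the low byte.
 */
static int bus_master_enabled(const char *config_path)
{
    FILE *f = fopen(config_path, "rb");
    if (!f)
        return 0;

    uint8_t cmd[2];
    int ok = fseek(f, 4, SEEK_SET) == 0 && fread(cmd, 1, 2, f) == 2;
    fclose(f);

    return ok && (cmd[0] & 0x4) != 0;
}
```

Usage would look like `bus_master_enabled("/sys/bus/pci/devices/0000:00:02.0/config")`; note that writing that bit through sysfs config space is itself how userspace drivers typically enable bus mastering without kernel help.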
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote:
> >
> > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote:
> > > On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote:
> > >> On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote:
> > >>> On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote:
> > >>>> It is good practice to defend against root oopsing the kernel,
> > >>>> but in some cases it cannot be achieved.
> > >>> Absolutely. That's one of the issues with these patches. They
> > >>> don't even try where it's absolutely possible.
> > >>
> > >> Are you referring to blocking the maps of the msix BAR areas?
> > > For example. There are more. I listed some of the issues on the
> > > mailing list, and I might have missed some. VFIO has code to
> > > address all this; people should share code to avoid duplication,
> > > or at least read it to understand the issues.
> >
> > All but one of those are unrelated to the patch that adds msix support.
>
> They are related because msix support enables bus mastering. Without it
> the device is passive and can't harm anyone. With it, suddenly you need
> to be very careful with the device to avoid corrupting kernel memory.
>
Most (if not all) uio_pci_generic users enable pci bus mastering. The
fact that they do that without even tainting the kernel like the patch
does makes the current situation much worse than with the patch.

> > I can't comment on iommu overhead; for my use case it is likely
> > negligible and we will use an iommu when available; but apparently it
> > matters for others.
>
> You and Vlad are the only ones who brought this up.
> So maybe you should not bring it up anymore.
>
Come on, you were CCed on at least this one:

  "We have a solution that makes use of IOMMU support with vfio. The
  problem is there are multiple cases where that support is either not
  available, or using the IOMMU provides excess overhead."

  http://dpdk.org/ml/archives/dev/2015-October/024560.html

--
	Gleb.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 12:38:28PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2015 at 10:59:10AM +0300, Gleb Natapov wrote:
> > I do not remember this being an issue when uio_generic was accepted
> > into the kernel. The reason was that it was meant to be accessible by
> > root only.
>
> No - because it does not need bus mastering. So it can be used safely
> with some devices.
>
It still can be used safely with the same devices. Admittedly I did not
look closely, but I am sure the patch does not enable bus mastering if
an MSI interrupt is not requested. If that is not so, it can be fixed.
But more importantly, it can be used unsafely in its current state. Not
only can it be, it is widely used so.

> [mst@robin linux]$ git grep pci_set_master | wc -l
> 533
> [mst@robin linux]$ git grep pci_enable | wc -l
> 1597
>
> Looks like about 2/3 of devices don't need to be bus masters.
>
> It's up to the admin not to bind it to devices, and that is unfortunate,
> but manually binding an incorrect driver to a device is generally
> a hard problem to solve.
>
> > > There's also drivers/vfio/virqfd.c which deals
> > > with sending interrupts over eventfds correctly.
> >
> > As opposed to this patch, which deals with them incorrectly? In what way?
>
> cleanup on fd close is not handled.
>
Have you commented about this on the patch and it was not fixed?

--
	Gleb.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 11:32:50AM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote:
> > On 08/10/15 00:05, Michael S. Tsirkin wrote:
> > > On Wed, Oct 07, 2015 at 07:39:16PM +0300, Avi Kivity wrote:
> > >> That's what I thought as well, but apparently adding msix support
> > >> to the already insecure uio drivers is even worse.
> > > I'm glad you finally agree what these drivers are doing is insecure.
> > >
> > > And basically the kernel cares about security; no one wants to
> > > maintain insecure stuff.
> > >
> > > So you guys should think harder whether this code makes any sense
> > > upstream.
> >
> > You simply ignore everything I write, cherry-picking the word
> > "insecure" as if it makes your point. That is very frustrating.
>
> And I'm sorry about the frustration. I didn't intend to twist your
> words. It's just that I had to spend literally hours trying to explain
> that security matters in the kernel, and all I was getting back was a
> summary "there's no security issue because there are other ways to
> corrupt memory".
>
That's not the (only) answer that you were given. The answer that you
constantly ignore is that the patch in question does not add any new
ways to corrupt memory which are not possible using the _upstream_
uio_pci_generic device, so the fact that uio_pci_generic can corrupt
memory cannot be used as a reason not to apply patches that do not
corrupt any memory. You seem to be constantly arguing that
uio_pci_generic itself is not suitable for upstream.

--
	Gleb.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 10:41:53AM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2015 at 07:19:13AM +0300, Gleb Natapov wrote:
> > Well, the alternative is to add /dev/vfio/nommu like you've said, but
> > what would be the difference between this and uio eludes me.
>
> Are you familiar with vfio, that you ask such a question?
>
Yes, I am, and I do not see anything of value that vfio can add to a
nommu setup besides complexity, but I do see why it would have to have
a special interface not applicable to regular vfio (hint: there is no
HW to translate virtual addresses to physical) and why it would have to
be accessible to the root user only.

> Here's the vfio pci code:
>
> $ wc -l drivers/vfio/pci/*
>    27 drivers/vfio/pci/Kconfig
>     4 drivers/vfio/pci/Makefile
>  1217 drivers/vfio/pci/vfio_pci.c
>  1602 drivers/vfio/pci/vfio_pci_config.c
>   675 drivers/vfio/pci/vfio_pci_intrs.c
>    92 drivers/vfio/pci/vfio_pci_private.h
>   238 drivers/vfio/pci/vfio_pci_rdwr.c
>  3855 total
>
> There's some code dealing with iommu groups in
> drivers/vfio/pci/vfio_pci.c, but most of it is validating input and
> presenting a consistent interface to userspace.
>
What does this have to do with the patch series in question? The
unpatched uio_generic code does not validate input either. If you think
it should, by all means write the code (without breaking existing use
cases while doing so), but the patch under discussion does not even
access the pci device from userspace, so it will not be affected by
said filtering.

> This is exactly what's missing here.
>
It is not missing in this patch series, it is missing from the upstream
code. I do not remember this being an issue when uio_generic was
accepted into the kernel. The reason was that it was meant to be
accessible by root only. VFIO was designed from the ground up to be
used by a regular user, so obviously unrestricted access to pci config
space was out of the question. Different use cases lead to different
designs, how surprising.

> There's also drivers/vfio/virqfd.c which deals
> with sending interrupts over eventfds correctly.
>
As opposed to this patch, which deals with them incorrectly? In what way?

--
	Gleb.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 11:32:50AM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote: > > On 08/10/15 00:05, Michael S. Tsirkin wrote: > > >On Wed, Oct 07, 2015 at 07:39:16PM +0300, Avi Kivity wrote: > > >>That's what I thought as well, but apparently adding msix support to the > > >>already insecure uio drivers is even worse. > > >I'm glad you finally agree what these drivers are doing is insecure. > > > > > >And basically kernel cares about security, no one wants to maintain > > >insecure stuff. > > > > > >So you guys should think harder whether this code makes any sense upstream. > > > > You simply ignore everything I write, cherry-picking the word "insecure" as > > if it makes your point. That is very frustrating. > > And I'm sorry about the frustration. I didn't intend to twist your > words. It's just that I had to spend literally hours trying to explain > that security matters in kernel, and all I was getting back was a > summary "there's no security issue because there are other way to > corrupt memory". > That's not the (only) answer that you were given. The answers that you constantly ignore is that the patch in question does not add any new ways to corrupt memory which are not possible using _upstream_ uio_pci_generic device, so the fact that uio_pci_generic can corrupt memory cannot be used as a reason to not apply patches that do not corrupt any memory. You seams to be constantly arguing that uio_pci_generic is not suitable for upstream. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 10:41:53AM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 08, 2015 at 07:19:13AM +0300, Gleb Natapov wrote: > > Well > > the alternative is to add /dev/vfio/nommu like you've said, but what > > would be the difference between this and uio eludes me. > > Are you familiar with vfio that you ask such a question? > Yes, I do and I do not see anything of value that vfio can add to nommu setup besides complexity, but I do see why it will have to have special interface not applicable to regular vfio (hint: there is not HW to translate virtual address to physical) and why it will have to be accessible to root user only. > Here's the vfio pci code: > > $ wc -l drivers/vfio/pci/* >27 drivers/vfio/pci/Kconfig > 4 drivers/vfio/pci/Makefile > 1217 drivers/vfio/pci/vfio_pci.c > 1602 drivers/vfio/pci/vfio_pci_config.c > 675 drivers/vfio/pci/vfio_pci_intrs.c >92 drivers/vfio/pci/vfio_pci_private.h > 238 drivers/vfio/pci/vfio_pci_rdwr.c > 3855 total > > There's some code dealing with iommu groups in > drivers/vfio/pci/vfio_pci.c, > but most of it is validating input and > presenting a consistent interface to userspace. > What is has to do with the patch series in question? Non patched uio_generic code does not validate input. If you think it should by all means write the code (don't break existing use cases while doing so), but the patch under discussion does not even access pci device from userspace, so it will not be affected by said filtering. > This is exactly what's missing here. It is not missing in this patch series, it is missing from upstream code. I do not remember this been an issue when uio_generic was accepted into the kernel. The reason was because it meant to be accessible by root only. VFIO was designed to be used by regular user from ground up, so obviously unrestricted access to pci space was out of the question. Different use cases lead to different designs, how surprising. 
> > There's also drivers/vfio/virqfd.c which deals > with sending interrupts over eventfds correctly. > As opposite to this patch that deals with them incorrectly? In what way? -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 12:38:28PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 08, 2015 at 10:59:10AM +0300, Gleb Natapov wrote: > > I do not remember this been an issue when uio_generic was accepted > > into the kernel. The reason was because it meant to be accessible by root > > only. > > No - because it does not need bus mastering. So it can be used safely > with some devices. > It still can be used safely with same devices. Admittedly I did not look close, but I am sure the patch does not enable bus mastering if MSI interrupt is not requested. If not, well that can be fixed. But more importantly it can be used unsafely in its current state. Not only can, it is widely used so. > [mst@robin linux]$ git grep pci_set_master|wc -l 533 > [mst@robin linux]$ git grep pci_enable|wc -l 1597 > > Looks like about 2/3 devices don't need to be bus masters. > > It's up to admin not to bind it to devices, and that is unfortunate, > but manually binding an incorrect driver to a device is generally > a hard problem to solve. > > > > There's also drivers/vfio/virqfd.c which deals > > > with sending interrupts over eventfds correctly. > > > > > As opposite to this patch that deals with them incorrectly? In what way? > > cleanup on fd close is not handled. > Have you commented about this on the patch and it was not fixed? -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote: > > > > > > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote: > > >On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote: > > >> > > >>On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote: > > >>>On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote: > > It is good practice to defend against root oopsing the kernel, but in > > some > > cases it cannot be achieved. > > >>>Absolutely. That's one of the issues with these patches. They don't even > > >>>try where it's absolutely possible. > > >>> > > >>Are you referring to blocking the maps of the msix BAR areas? > > >For example. There are more. I listed some of the issues on the mailing > > >list, and I might have missed some. VFIO has code to address all this, > > >people should share code to avoid duplication, or at least read it > > >to understand the issues. > > > > All but one of those are unrelated to the patch that adds msix support. > > They are related because msix support enables bus mastering. Without it > device is passive and can't harm anyone. With it, suddently you need to > be very careful with the device to avoid corrupting kernel memory. > Most (if not all) uio_pci_generic users enable pci bus mastering. The fact that they do that without even tainting the kernel like the patch does make current situation much worse that with the patch. > > I can't comment on iommu overhead; for my use case it is likely negligible > > and we will use an iommu when available; but apparently it matters for > > others. > > You and Vlad are the only ones who brought this up. > So maybe you should not bring it up anymore. > Common, you were CCed to at least this one: We have a solution that makes use of IOMMU support with vfio. The problem is there are multiple cases where that support is either not available, or using the IOMMU provides excess overhead. 
http://dpdk.org/ml/archives/dev/2015-October/024560.html -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 04:20:04PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 08, 2015 at 03:27:37PM +0300, Gleb Natapov wrote: > > On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote: > > > On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote: > > > > > > > > > > > > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote: > > > > >On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote: > > > > >> > > > > >>On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote: > > > > >>>On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote: > > > > >>>>It is good practice to defend against root oopsing the kernel, but > > > > >>>>in some > > > > >>>>cases it cannot be achieved. > > > > >>>Absolutely. That's one of the issues with these patches. They don't > > > > >>>even > > > > >>>try where it's absolutely possible. > > > > >>> > > > > >>Are you referring to blocking the maps of the msix BAR areas? > > > > >For example. There are more. I listed some of the issues on the mailing > > > > >list, and I might have missed some. VFIO has code to address all this, > > > > >people should share code to avoid duplication, or at least read it > > > > >to understand the issues. > > > > > > > > All but one of those are unrelated to the patch that adds msix support. > > > > > > They are related because msix support enables bus mastering. Without it > > > device is passive and can't harm anyone. With it, suddently you need to > > > be very careful with the device to avoid corrupting kernel memory. > > > > > Most (if not all) uio_pci_generic users enable pci bus mastering. The > > fact that they do that without even tainting the kernel like the patch > > does make current situation much worse that with the patch. > > It isn't worse. It's a sane interface. Whoever enables bus mastering > must be careful. If userspace enables bus mastering then userspace > needs to be very careful with the device to avoid corrupting kernel > memory. 
If kernel does it, it's kernel's responsibility. > Although this definition of sanity sounds strange to me, but lets flow with it for the sake of this email: would it be OK if proposed interface refused to work if bus mastering is not already enabled by userspace? -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 07:43:04PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 08, 2015 at 04:28:34PM +0300, Gleb Natapov wrote: > > On Thu, Oct 08, 2015 at 04:20:04PM +0300, Michael S. Tsirkin wrote: > > > On Thu, Oct 08, 2015 at 03:27:37PM +0300, Gleb Natapov wrote: > > > > On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote: > > > > > On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote: > > > > > > > > > > > > > > > > > > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote: > > > > > > >On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote: > > > > > > >> > > > > > > >>On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote: > > > > > > >>>On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote: > > > > > > >>>>It is good practice to defend against root oopsing the kernel, > > > > > > >>>>but in some > > > > > > >>>>cases it cannot be achieved. > > > > > > >>>Absolutely. That's one of the issues with these patches. They > > > > > > >>>don't even > > > > > > >>>try where it's absolutely possible. > > > > > > >>> > > > > > > >>Are you referring to blocking the maps of the msix BAR areas? > > > > > > >For example. There are more. I listed some of the issues on the > > > > > > >mailing > > > > > > >list, and I might have missed some. VFIO has code to address all > > > > > > >this, > > > > > > >people should share code to avoid duplication, or at least read it > > > > > > >to understand the issues. > > > > > > > > > > > > All but one of those are unrelated to the patch that adds msix > > > > > > support. > > > > > > > > > > They are related because msix support enables bus mastering. Without > > > > > it > > > > > device is passive and can't harm anyone. With it, suddently you need > > > > > to > > > > > be very careful with the device to avoid corrupting kernel memory. > > > > > > > > > Most (if not all) uio_pci_generic users enable pci bus mastering. 
The > > > > fact that they do that without even tainting the kernel like the patch > > > > does makes the current situation much worse than with the patch. > > > > > > It isn't worse. It's a sane interface. Whoever enables bus mastering > > > must be careful. If userspace enables bus mastering then userspace > > > needs to be very careful with the device to avoid corrupting kernel > > > memory. If kernel does it, it's kernel's responsibility. > > > > > Although this definition of sanity sounds strange to me, but lets > > flow with it for the sake of this email: would it be OK if proposed > > interface refused to work if bus mastering is not already enabled by > > userspace? > > An interface could be acceptable if there's a fallback where it > works without BM but slower (e.g. poll pending bits). > OK. > But not the proposed one. > Why? Greg is against the ioctl interface so it will be reworked, but besides that, what is wrong with the concept of binding an msi-x interrupt to an eventfd? > Really, there's more to making msi-x work with > userspace drivers than this patch. As I keep telling people, you would > basically reimplement vfio/pci. Go over it, and see for yourself. > Almost everything it does is relevant for msi-x. It's just wrong to > duplicate so much code. > The patch is tested and works with msi-x. Restricting access to msi-x registers, which vfio does, is not relevant here. -- Gleb.
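The concept being defended here — binding an MSI-X vector to an eventfd so a userspace driver can wait on interrupts — is easy to see in miniature. The sketch below is an illustration only, not the patch's actual ABI: in the real design the kernel interrupt handler would signal the eventfd (this is how vfio delivers MSI-X), and plain write() calls stand in for that here; both function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Drain the eventfd: in non-semaphore mode read() returns the
 * accumulated event count and resets it to zero. */
static uint64_t drain_events(int efd)
{
    uint64_t n = 0;
    if (read(efd, &n, sizeof(n)) != (ssize_t)sizeof(n))
        return 0;
    return n;
}

/* Model of the proposed flow: the kernel side would eventfd_signal()
 * this fd from its MSI-X handler; here three write()s play that role. */
static uint64_t demo_msix_eventfd(void)
{
    int efd = eventfd(0, EFD_NONBLOCK);
    uint64_t one = 1;

    write(efd, &one, sizeof(one));  /* "interrupt" 1 */
    write(efd, &one, sizeof(one));  /* "interrupt" 2 */
    write(efd, &one, sizeof(one));  /* "interrupt" 3 */

    uint64_t pending = drain_events(efd);
    close(efd);
    return pending;
}
```

Because the eventfd counter coalesces, a driver that reads 3 learns only that three interrupts fired since its last poll — which is exactly the edge-style semantics MSI-X gives you anyway.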
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 08:39:10PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 08, 2015 at 08:01:21PM +0300, Gleb Natapov wrote: > > On Thu, Oct 08, 2015 at 07:43:04PM +0300, Michael S. Tsirkin wrote: > > > On Thu, Oct 08, 2015 at 04:28:34PM +0300, Gleb Natapov wrote: > > > > On Thu, Oct 08, 2015 at 04:20:04PM +0300, Michael S. Tsirkin wrote: > > > > > On Thu, Oct 08, 2015 at 03:27:37PM +0300, Gleb Natapov wrote: > > > > > > On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote: > > > > > > > On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote: > > > > > > > > > > > > > > > > > > > > > > > > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote: > > > > > > > > >On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote: > > > > > > > > >> > > > > > > > > >>On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote: > > > > > > > > >>>On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote: > > > > > > > > >>>>It is good practice to defend against root oopsing the > > > > > > > > >>>>kernel, but in some > > > > > > > > >>>>cases it cannot be achieved. > > > > > > > > >>>Absolutely. That's one of the issues with these patches. > > > > > > > > >>>They don't even > > > > > > > > >>>try where it's absolutely possible. > > > > > > > > >>> > > > > > > > > >>Are you referring to blocking the maps of the msix BAR areas? > > > > > > > > >For example. There are more. I listed some of the issues on > > > > > > > > >the mailing > > > > > > > > >list, and I might have missed some. VFIO has code to address > > > > > > > > >all this, > > > > > > > > >people should share code to avoid duplication, or at least > > > > > > > > >read it > > > > > > > > >to understand the issues. > > > > > > > > > > > > > > > > All but one of those are unrelated to the patch that adds msix > > > > > > > > support. > > > > > > > > > > > > > > They are related because msix support enables bus mastering. > > > > > > > Without it > > > > > > > device is passive and can't harm anyone. 
With it, suddently you > > > > > > > need to > > > > > > > be very careful with the device to avoid corrupting kernel memory. > > > > > > > > > > > > > Most (if not all) uio_pci_generic users enable pci bus mastering. > > > > > > The > > > > > > fact that they do that without even tainting the kernel like the > > > > > > patch > > > > > > does make current situation much worse that with the patch. > > > > > > > > > > It isn't worse. It's a sane interface. Whoever enables bus mastering > > > > > must be careful. If userspace enables bus mastering then userspace > > > > > needs to be very careful with the device to avoid corrupting kernel > > > > > memory. If kernel does it, it's kernel's responsibility. > > > > > > > > > Although this definition of sanity sounds strange to me, but lets > > > > flow with it for the sake of this email: would it be OK if proposed > > > > interface refused to work if bus mastering is not already enabled by > > > > userspace? > > > > > > An interface could be acceptable if there's a fallback where it > > > works without BM but slower (e.g. poll pending bits). > > > > > OK. > > > > > But not the proposed one. > > > > > Why? Greg is against ioctl interface so it will be reworked, by besides > > that what is wrong with the concept of binding msi-x interrupt to > > eventfd? > > It's not the binding. Managing msi-x just needs more than the puny > 2 ioctls to get # of vectors and set eventfd. > > It interacts in strange ways with reset, and with PM, and ... > Sorry, I need examples of what you mean. DMA also "interacts in strange ways with reset, and with PM, and ..." and it does not have any special handling anywhere in uio-generic. S
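The fallback mentioned above — "works without BM but slower (e.g. poll pending bits)" — refers to the MSI-X Pending Bit Array (PBA), which the device exposes in one of its BARs with one bit per vector, packed into 64-bit words. A minimal model of that polling follows; the PBA contents are faked in ordinary memory here, whereas a real driver would mmap() the BAR region, and the function names are invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* The MSI-X PBA packs one pending bit per vector into qwords.  A driver
 * that cannot (or will not) enable bus mastering could poll these bits
 * instead of taking interrupt notifications. */
static int msix_vector_pending(const volatile uint64_t *pba, unsigned int vec)
{
    return (int)((pba[vec / 64] >> (vec % 64)) & 1);
}

/* Fake PBA covering 128 vectors, with vector 66 pending (bit 2 of the
 * second qword) -- stands in for a region mmap()ed from the device BAR. */
static const uint64_t fake_pba[2] = { 0, 1ULL << 2 };

static int demo_poll(unsigned int vec)
{
    return msix_vector_pending(fake_pba, vec);
}
```
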
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 12:05:11AM +0300, Michael S. Tsirkin wrote: > On Wed, Oct 07, 2015 at 07:39:16PM +0300, Avi Kivity wrote: > > That's what I thought as well, but apparently adding msix support to the > > already insecure uio drivers is even worse. > > I'm glad you finally agree what these drivers are doing is insecure. > Michael, please stop this meaningless word play. The above is said in the context of a device that is meant to be accessible by regular users, and obviously for that purpose uio is insecure (in its current state, btw). If you give a user access to your root block device, that device will be insecure too — so according to your logic the block device is insecure? Pushing the code from uio to vfio means that vfio will have to implement access policy by itself - allow iommu mode to regular users, but no-iommu to root only. Implementing policy in the kernel is bad. Well, the alternative is to add /dev/vfio/nommu like you've said, but what the difference between that and uio would be eludes me. -- Gleb.
Re: [PATCH v3 0/3] uio: add MSI/MSI-X support to uio_pci_generic driver
On Tue, Oct 06, 2015 at 06:15:54PM +0300, Michael S. Tsirkin wrote: > > While it is possible that userspace malfunctions and accidentally programs > > MSI incorrectly, the risk is dwarfed by the ability of userspace to program > > DMA incorrectly. > > That seems to imply that for the upstream kernel this is not a valid usecase > at all. > Are you implying that uio_pci_generic should be removed from the upstream kernel? Avi did not describe anything that cannot be done with the upstream kernel right now. -- Gleb.
Re: [PATCH v3 1/3] uio: add ioctl support
On Tue, Oct 06, 2015 at 06:19:34PM +0300, Michael S. Tsirkin wrote: > On Tue, Oct 06, 2015 at 05:30:31PM +0300, Gleb Natapov wrote: > > On Tue, Oct 06, 2015 at 05:19:22PM +0300, Michael S. Tsirkin wrote: > > > On Tue, Oct 06, 2015 at 11:33:56AM +0300, Vlad Zolotarov wrote: > > > > the solution u propose should be a matter of a separate patch and is > > > > obviously orthogonal to this series. > > > > > > Doesn't work this way, sorry. You want a patch enabling MSI merged, > > > you need to secure the MSI configuration. > > > > > MSI can be enabled right now without the patch by writing directly into > > PCI bar. > > By poking at config registers in sysfs? We can block this, or we > can log this, pretty easily. We don't ATM but it's not hard to do. > Blocking this will break the userspace API. As a maintainer you should know that we do not break userspace APIs. Logging this is fine, but how exactly does it help with "security"? The patch in question already taints the kernel, which is much stronger than logging. > > The only thing this patch adds is forwarding the interrupt to > > an eventfd. > > This one just adds a bunch of ioctls. The next ones do > more than you describe. > Yes, it adds a bunch of ioctls to do exactly what I wrote above. What point were you trying to make with that statement? It eludes me. -- Gleb.
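For reference, "poking at config registers in sysfs" really is that simple, which is the point about the capability already being exposed to root. Per the PCI specification, the 16-bit command register sits at offset 0x04 of configuration space and bit 2 is Bus Master Enable. The snippet below only models the register arithmetic; the sysfs read-modify-write in the comment is a sketch, and the device address in it is a placeholder.

```c
#include <assert.h>
#include <stdint.h>

/* PCI configuration space, per the PCI spec: command register at
 * offset 0x04, Bus Master Enable is bit 2. */
#define PCI_COMMAND        0x04
#define PCI_COMMAND_MASTER 0x0004

static uint16_t enable_bus_master(uint16_t command)
{
    return (uint16_t)(command | PCI_COMMAND_MASTER);
}

/* A root process could already apply this through sysfs (sketch only;
 * "0000:00:00.0" is a placeholder device address):
 *
 *   int fd = open("/sys/bus/pci/devices/0000:00:00.0/config", O_RDWR);
 *   uint16_t cmd;
 *   pread(fd, &cmd, sizeof(cmd), PCI_COMMAND);
 *   cmd = enable_bus_master(cmd);
 *   pwrite(fd, &cmd, sizeof(cmd), PCI_COMMAND);
 */
```
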
Re: [PATCH v3 1/3] uio: add ioctl support
On Tue, Oct 06, 2015 at 05:19:22PM +0300, Michael S. Tsirkin wrote: > On Tue, Oct 06, 2015 at 11:33:56AM +0300, Vlad Zolotarov wrote: > > the solution u propose should be a matter of a separate patch and is > > obviously orthogonal to this series. > > Doesn't work this way, sorry. You want a patch enabling MSI merged, > you need to secure the MSI configuration. > MSI can be enabled right now without the patch by writing directly into PCI bar. The only thing this patch adds is forwarding the interrupt to an eventfd. -- Gleb.
Re: [PATCH] KVM: ia64: remove
On Wed, Nov 19, 2014 at 10:05:43PM +0100, Paolo Bonzini wrote: > KVM for ia64 has been marked as broken not just once, but twice even, > and the last patch from the maintainer is now roughly 5 years old. > Time for it to rest in piece. > Acked-by: Gleb Natapov Next step is to move ioapic bits into arch :) > Signed-off-by: Paolo Bonzini > --- > The patch was edited to keep its size decent, by dropping > all the removed lines from the deleted files. > > MAINTAINERS |9 - > arch/ia64/Kconfig |3 - > arch/ia64/Makefile |1 - > arch/ia64/include/asm/kvm_host.h| 609 -- > arch/ia64/include/asm/pvclock-abi.h | 48 - > arch/ia64/include/uapi/asm/kvm.h| 268 - > arch/ia64/kvm/Kconfig | 66 -- > arch/ia64/kvm/Makefile | 67 -- > arch/ia64/kvm/asm-offsets.c | 241 > arch/ia64/kvm/irq.h | 33 - > arch/ia64/kvm/kvm-ia64.c| 1942 -- > arch/ia64/kvm/kvm_fw.c | 674 --- > arch/ia64/kvm/kvm_lib.c | 21 - > arch/ia64/kvm/kvm_minstate.h| 266 - > arch/ia64/kvm/lapic.h | 30 - > arch/ia64/kvm/memcpy.S |1 - > arch/ia64/kvm/memset.S |1 - > arch/ia64/kvm/misc.h| 94 -- > arch/ia64/kvm/mmio.c| 336 -- > arch/ia64/kvm/optvfault.S | 1090 - > arch/ia64/kvm/process.c | 1024 > arch/ia64/kvm/trampoline.S | 1038 > arch/ia64/kvm/vcpu.c| 2209 > --- > arch/ia64/kvm/vcpu.h| 752 > arch/ia64/kvm/vmm.c | 99 -- > arch/ia64/kvm/vmm_ivt.S | 1392 -- > arch/ia64/kvm/vti.h | 290 - > arch/ia64/kvm/vtlb.c| 640 -- > virt/kvm/ioapic.c | 5 - > virt/kvm/ioapic.h | 1 - > virt/kvm/irq_comm.c | 22 - > 31 files changed, 13272 deletions(-) > delete mode 100644 arch/ia64/include/asm/kvm_host.h > delete mode 100644 arch/ia64/include/asm/pvclock-abi.h > delete mode 100644 arch/ia64/include/uapi/asm/kvm.h > delete mode 100644 arch/ia64/kvm/Kconfig > delete mode 100644 arch/ia64/kvm/Makefile > delete mode 100644 arch/ia64/kvm/asm-offsets.c > delete mode 100644 arch/ia64/kvm/irq.h > delete mode 100644 arch/ia64/kvm/kvm-ia64.c > delete mode 100644 arch/ia64/kvm/kvm_fw.c > delete mode 100644 arch/ia64/kvm/kvm_lib.c > delete mode 100644 
arch/ia64/kvm/kvm_minstate.h > delete mode 100644 arch/ia64/kvm/lapic.h > delete mode 100644 arch/ia64/kvm/memcpy.S > delete mode 100644 arch/ia64/kvm/memset.S > delete mode 100644 arch/ia64/kvm/misc.h > delete mode 100644 arch/ia64/kvm/mmio.c > delete mode 100644 arch/ia64/kvm/optvfault.S > delete mode 100644 arch/ia64/kvm/process.c > delete mode 100644 arch/ia64/kvm/trampoline.S > delete mode 100644 arch/ia64/kvm/vcpu.c > delete mode 100644 arch/ia64/kvm/vcpu.h > delete mode 100644 arch/ia64/kvm/vmm.c > delete mode 100644 arch/ia64/kvm/vmm_ivt.S > delete mode 100644 arch/ia64/kvm/vti.h > delete mode 100644 arch/ia64/kvm/vtlb.c > > diff --git a/MAINTAINERS b/MAINTAINERS > index a12edf2624e5..56705138ca74 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -5244,15 +5244,6 @@ S: Supported > F: arch/powerpc/include/asm/kvm* > F: arch/powerpc/kvm/ > > -KERNEL VIRTUAL MACHINE For Itanium (KVM/IA64) > -M: Xiantao Zhang > -L: kvm-i...@vger.kernel.org > -W: http://kvm.qumranet.com > -S: Supported > -F: Documentation/ia64/kvm.txt > -F: arch/ia64/include/asm/kvm* > -F: arch/ia64/kvm/ > - > KERNEL VIRTUAL MACHINE for s390 (KVM/s390) > M: Christian Borntraeger > M: Cornelia Huck > diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig > index c84c88d7..11afe7ab1981 100644 > --- a/arch/ia64/Kconfig > +++ b/arch/ia64/Kconfig > @@ -21,7 +21,6 @@ config IA64 > select HAVE_DYNAMIC_FTRACE if (!ITANIUM) > select HAVE_FUNCTION_TRACER > select HAVE_DMA_ATTRS > - select HAVE_KVM > select TTY > select HAVE_ARCH_TRACEHOOK > select HAVE_DMA_API_DEBUG > @@ -640,8 +639,6 @@ source "security/Kconfig" > > source "crypto/Kconfig" > > -source "arch/ia64/kvm/Kconfig" > - > source "lib/Kconfig" > > config IOMMU_HELPER > diff --git a/arch/ia64/Makefile b/arch/ia64/Makefile > index 5441b14994fc
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On Mon, Sep 22, 2014 at 09:29:19PM +0200, Paolo Bonzini wrote: > On 22/09/2014 21:20, Christian Borntraeger wrote: > > "while using trinity to fuzz KVM, we noticed long stalls on invalid ioctls. > > Lets bail out early on invalid ioctls". or similar? > > Okay. David, can you explain how you found it so that I can make up my > mind? > > Gleb and Marcelo, a fourth and fifth opinion? :) > I agree with Christian that the simpler fix is better here. The overhead is minimal. If we ever notice this overhead we can revert the patch altogether, since the problem it fixes can only be inflicted on userspace by itself, and there are myriad other ways userspace can hurt itself. -- Gleb.
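The "bail out early on invalid ioctls" fix under discussion amounts to checking the ioctl number's type byte against KVM's namespace before taking the vcpu mutex. A userspace model of that predicate is below: KVMIO is 0xAE in include/uapi/linux/kvm.h, the type byte occupies bits 8–15 of an ioctl number in the Linux encoding, and the helper name here is invented for illustration.

```c
#include <assert.h>
#include <errno.h>

/* KVM's ioctl namespace: every KVM ioctl is built with type byte 0xAE. */
#define KVMIO 0xAE

/* Extract bits 8..15 of an ioctl number (the Linux _IOC_TYPE layout). */
#define IOC_TYPE(nr) (((nr) >> 8) & 0xff)

/* The early-bail check: reject requests from a foreign ioctl namespace
 * before ever touching vcpu->mutex, so a fuzzer spraying random ioctls
 * cannot stall legitimate vcpu threads behind the lock. */
static long vcpu_ioctl_precheck(unsigned int ioctl)
{
    if (IOC_TYPE(ioctl) != KVMIO)
        return -EINVAL;   /* bail out without taking the vcpu mutex */
    return 0;             /* fall through to the real handler */
}
```
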
Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem
On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote: > When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory > has been swapped out or is behind a filemap, this will trigger async > readahead and return immediately. The rationale is that KVM will kick > back the guest with an "async page fault" and allow for some other > guest process to take over. > > If async PFs are enabled the fault is retried asap from an async > workqueue. If not, it's retried immediately in the same code path. In > either case the retry will not relinquish the mmap semaphore and will > block on the IO. This is a bad thing, as other mmap semaphore users > now stall as a function of swap or filemap latency. > > This patch ensures both the regular and async PF path re-enter the > fault allowing for the mmap semaphore to be relinquished in the case > of IO wait. > Reviewed-by: Gleb Natapov > Reviewed-by: Radim Krčmář > Signed-off-by: Andres Lagar-Cavilla > > --- > v1 -> v2 > > * WARN_ON_ONCE -> VM_WARN_ON_ONCE > * pagep == NULL skips the final retry > * kvm_gup_retry -> kvm_gup_io > * Comment updates throughout > --- > include/linux/kvm_host.h | 11 +++ > include/linux/mm.h | 1 + > mm/gup.c | 4 > virt/kvm/async_pf.c | 4 +--- > virt/kvm/kvm_main.c | 49 > +--- > 5 files changed, 63 insertions(+), 6 deletions(-) > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 3addcbc..4c1991b 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -198,6 +198,17 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, > unsigned long hva, > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); > #endif > > +/* > + * Carry out a gup that requires IO. Allow the mm to relinquish the mmap > + * semaphore if the filemap/swap has to wait on a page lock. pagep == NULL > + * controls whether we retry the gup one more time to completion in that > case. 
> + * Typically this is called after a FAULT_FLAG_RETRY_NOWAIT in the main tdp > + * handler. > + */ > +int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm, > + unsigned long addr, bool write_fault, > + struct page **pagep); > + > enum { > OUTSIDE_GUEST_MODE, > IN_GUEST_MODE, > diff --git a/include/linux/mm.h b/include/linux/mm.h > index ebc5f90..13e585f7 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2011,6 +2011,7 @@ static inline struct page *follow_page(struct > vm_area_struct *vma, > #define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */ > #define FOLL_NUMA 0x200 /* force NUMA hinting page fault */ > #define FOLL_MIGRATION 0x400 /* wait for page to replace migration > entry */ > +#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */ > > typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr, > void *data); > diff --git a/mm/gup.c b/mm/gup.c > index 91d044b..af7ea3e 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct > vm_area_struct *vma, > fault_flags |= FAULT_FLAG_ALLOW_RETRY; > if (*flags & FOLL_NOWAIT) > fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT; > + if (*flags & FOLL_TRIED) { > + VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY); > + fault_flags |= FAULT_FLAG_TRIED; > + } > > ret = handle_mm_fault(mm, vma, address, fault_flags); > if (ret & VM_FAULT_ERROR) { > diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c > index d6a3d09..5ff7f7f 100644 > --- a/virt/kvm/async_pf.c > +++ b/virt/kvm/async_pf.c > @@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work) > > might_sleep(); > > - down_read(&mm->mmap_sem); > - get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL); > - up_read(&mm->mmap_sem); > + kvm_get_user_page_io(NULL, mm, addr, 1, NULL); > kvm_async_page_present_sync(vcpu, apf); > > spin_lock(&vcpu->async_pf.lock); > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 
7ef6b48..fa8a565 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1115,6 +1115,43 @@ static int get_user_page_nowait(struct task_struct > *tsk, struct mm_struct *mm, > return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL); > } > > +int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm, > +
Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem
On Thu, Sep 18, 2014 at 08:29:17AM +0800, Wanpeng Li wrote: > Hi Andres, > On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote: > [...] > > static inline int check_user_page_hwpoison(unsigned long addr) > > { > > int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE; > >@@ -1177,9 +1214,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool > >*async, bool write_fault, > > npages = get_user_page_nowait(current, current->mm, > > addr, write_fault, page); > > up_read(&current->mm->mmap_sem); > >-} else > >-npages = get_user_pages_fast(addr, 1, write_fault, > >- page); > >+} else { > >+/* > >+ * By now we have tried gup_fast, and possibly async_pf, and we > >+ * are certainly not atomic. Time to retry the gup, allowing > >+ * mmap semaphore to be relinquished in the case of IO. > >+ */ > >+npages = kvm_get_user_page_io(current, current->mm, addr, > >+ write_fault, page); > >+} > > try_async_pf > gfn_to_pfn_async > __gfn_to_pfnasync = false *async = false >__gfn_to_pfn_memslot > hva_to_pfn >hva_to_pfn_fast >hva_to_pfn_slow hva_to_pfn_slow checks async not *async. > kvm_get_user_page_io > > page will always be ready after kvm_get_user_page_io which leads to APF > don't need to work any more. > > Regards, > Wanpeng Li > > > if (npages != 1) > > return npages; > > > >-- > >2.1.0.rc2.206.gedb03e5 > > > >-- > >To unsubscribe, send a message with 'unsubscribe linux-mm' in > >the body to majord...@kvack.org. For more info on Linux MM, > >see: http://www.linux-mm.org/ . > >Don't email: em...@kvack.org -- Gleb.
Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem
On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:

When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory has been
swapped out or is behind a filemap, this will trigger async readahead and
return immediately. The rationale is that KVM will kick back the guest with an
async page fault and allow for some other guest process to take over.

If async PFs are enabled the fault is retried asap from an async workqueue. If
not, it's retried immediately in the same code path. In either case the retry
will not relinquish the mmap semaphore and will block on the IO. This is a bad
thing, as other mmap semaphore users now stall as a function of swap or
filemap latency.

This patch ensures both the regular and async PF path re-enter the fault
allowing for the mmap semaphore to be relinquished in the case of IO wait.

Reviewed-by: Gleb Natapov <g...@kernel.org>
Reviewed-by: Radim Krčmář <rkrc...@redhat.com>
Signed-off-by: Andres Lagar-Cavilla <andre...@google.com>
---
v1 -> v2

* WARN_ON_ONCE -> VM_WARN_ON_ONCE
* pagep == NULL skips the final retry
* kvm_gup_retry -> kvm_gup_io
* Comment updates throughout
---
 include/linux/kvm_host.h | 11 +++
 include/linux/mm.h       |  1 +
 mm/gup.c                 |  4
 virt/kvm/async_pf.c      |  4 +---
 virt/kvm/kvm_main.c      | 49 +---
 5 files changed, 63 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3addcbc..4c1991b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -198,6 +198,17 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif

+/*
+ * Carry out a gup that requires IO. Allow the mm to relinquish the mmap
+ * semaphore if the filemap/swap has to wait on a page lock. pagep == NULL
+ * controls whether we retry the gup one more time to completion in that case.
+ * Typically this is called after a FAULT_FLAG_RETRY_NOWAIT in the main tdp
+ * handler.
+ */
+int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm,
+			 unsigned long addr, bool write_fault,
+			 struct page **pagep);
+
 enum {
 	OUTSIDE_GUEST_MODE,
 	IN_GUEST_MODE,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ebc5f90..13e585f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2011,6 +2011,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
+#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */

 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..af7ea3e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
+	if (*flags & FOLL_TRIED) {
+		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+		fault_flags |= FAULT_FLAG_TRIED;
+	}

 	ret = handle_mm_fault(mm, vma, address, fault_flags);
 	if (ret & VM_FAULT_ERROR) {
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index d6a3d09..5ff7f7f 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)

 	might_sleep();

-	down_read(&mm->mmap_sem);
-	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
-	up_read(&mm->mmap_sem);
+	kvm_get_user_page_io(NULL, mm, addr, 1, NULL);
 	kvm_async_page_present_sync(vcpu, apf);

 	spin_lock(&vcpu->async_pf.lock);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7ef6b48..fa8a565 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1115,6 +1115,43 @@ static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
 	return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
 }

+int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm,
+			 unsigned long addr, bool write_fault,
+			 struct page **pagep)
+{
+	int npages;
+	int locked = 1;
+	int flags = FOLL_TOUCH | FOLL_HWPOISON |
+		    (pagep ? FOLL_GET : 0) |
+		    (write_fault ? FOLL_WRITE : 0);
+
+	/*
+	 * If retrying the fault, we get here *not* having allowed the filemap
+	 * to wait on the page lock. We should now allow waiting on the IO
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
On Wed, Sep 17, 2014 at 10:13:45AM -0700, Andres Lagar-Cavilla wrote:
> On Wed, Sep 17, 2014 at 10:08 AM, Gleb Natapov wrote:
> > On Wed, Sep 17, 2014 at 10:00:32AM -0700, Andres Lagar-Cavilla wrote:
> >> On Wed, Sep 17, 2014 at 4:42 AM, Gleb Natapov wrote:
> >> > On Wed, Sep 17, 2014 at 01:27:14PM +0200, Radim Krčmář wrote:
> >> >> 2014-09-17 13:26+0300, Gleb Natapov:
> >> >> > For async_pf_execute() you do not need to even retry. Next guest's
> >> >> > page fault will retry it for you.
> >> >>
> >> >> Wouldn't that be a waste of vmentries?
> >> > This is how it will work with or without this second gup. Page is not
> >> > mapped into a shadow page table on this path, it happens on a next fault.
> >>
> >> The point is that the gup in the async pf completion from the work
> >> queue will not relinquish the mmap semaphore. And it most definitely
> >> should, given that we are likely looking at swap/filemap.
> >>
> > I get this point and the patch looks good in general, but my point is
> > that when _retry() is called from async_pf_execute() second gup is not
> > needed. In the original code gup is called to do IO and nothing else.
> > In your patch this is accomplished by the first gup already, so you
> > can skip second gup if pagep == nullptr.
>
> I see. However, if this function were to be used elsewhere in the
> future, then the "if pagep == NULL don't retry" semantics may not
> match the new caller's intention. Would you prefer an explicit flag?
>
We can add explicit flag whenever such caller will be added, if ever.

--
	Gleb.
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
On Wed, Sep 17, 2014 at 10:00:32AM -0700, Andres Lagar-Cavilla wrote:
> On Wed, Sep 17, 2014 at 4:42 AM, Gleb Natapov wrote:
> > On Wed, Sep 17, 2014 at 01:27:14PM +0200, Radim Krčmář wrote:
> >> 2014-09-17 13:26+0300, Gleb Natapov:
> >> > For async_pf_execute() you do not need to even retry. Next guest's
> >> > page fault will retry it for you.
> >>
> >> Wouldn't that be a waste of vmentries?
> > This is how it will work with or without this second gup. Page is not
> > mapped into a shadow page table on this path, it happens on a next fault.
>
> The point is that the gup in the async pf completion from the work
> queue will not relinquish the mmap semaphore. And it most definitely
> should, given that we are likely looking at swap/filemap.
>
I get this point and the patch looks good in general, but my point is
that when _retry() is called from async_pf_execute() second gup is not
needed. In the original code gup is called to do IO and nothing else.
In your patch this is accomplished by the first gup already, so you
can skip second gup if pagep == nullptr.

--
	Gleb.
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
On Wed, Sep 17, 2014 at 01:27:14PM +0200, Radim Krčmář wrote:
> 2014-09-17 13:26+0300, Gleb Natapov:
> > For async_pf_execute() you do not need to even retry. Next guest's page
> > fault will retry it for you.
>
> Wouldn't that be a waste of vmentries?
This is how it will work with or without this second gup. Page is not
mapped into a shadow page table on this path, it happens on a next fault.

--
	Gleb.
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
On Mon, Sep 15, 2014 at 01:11:25PM -0700, Andres Lagar-Cavilla wrote:
> When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory has been
> swapped out or is behind a filemap, this will trigger async readahead and
> return immediately. The rationale is that KVM will kick back the guest with an
> "async page fault" and allow for some other guest process to take over.
>
> If async PFs are enabled the fault is retried asap from a workqueue, or
> immediately if no async PFs. The retry will not relinquish the mmap semaphore
> and will block on the IO. This is a bad thing, as other mmap semaphore users
> now stall. The fault could take a long time, depending on swap or filemap
> latency.
>
> This patch ensures both the regular and async PF path re-enter the fault
> allowing for the mmap semaphore to be relinquished in the case of IO wait.
>
> Signed-off-by: Andres Lagar-Cavilla
> ---
>  include/linux/kvm_host.h |  9 +
>  include/linux/mm.h       |  1 +
>  mm/gup.c                 |  4
>  virt/kvm/async_pf.c      |  4 +---
>  virt/kvm/kvm_main.c      | 45 ++---
>  5 files changed, 57 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3addcbc..704908d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -198,6 +198,15 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>
> +/*
> + * Retry a fault after a gup with FOLL_NOWAIT. This properly relinquishes mmap
> + * semaphore if the filemap/swap has to wait on page lock (and retries the gup
> + * to completion after that).
> + */
> +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
> +			    unsigned long addr, bool write_fault,
> +			    struct page **pagep);
> +
>  enum {
>  	OUTSIDE_GUEST_MODE,
>  	IN_GUEST_MODE,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ebc5f90..13e585f7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2011,6 +2011,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
>  #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
>  #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
>  #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
> +#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
>
>  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>  			void *data);
> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..332d1c3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
>  	if (*flags & FOLL_NOWAIT)
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
> +	if (*flags & FOLL_TRIED) {
> +		WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
> +		fault_flags |= FAULT_FLAG_TRIED;
> +	}
>
>  	ret = handle_mm_fault(mm, vma, address, fault_flags);
>  	if (ret & VM_FAULT_ERROR) {
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index d6a3d09..17b78b1 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
>
>  	might_sleep();
>
> -	down_read(&mm->mmap_sem);
> -	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
> -	up_read(&mm->mmap_sem);
> +	kvm_get_user_page_retry(NULL, mm, addr, 1, NULL);
>  	kvm_async_page_present_sync(vcpu, apf);
>
>  	spin_lock(&vcpu->async_pf.lock);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7ef6b48..43a9ab9 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1115,6 +1115,39 @@ static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
>  	return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
>  }
>
> +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
> +			    unsigned long addr, bool write_fault,
> +			    struct page **pagep)
> +{
> +	int npages;
> +	int locked = 1;
> +	int flags = FOLL_TOUCH | FOLL_HWPOISON |
> +		    (pagep ? FOLL_GET : 0) |
> +		    (write_fault ? FOLL_WRITE : 0);
> +
> +	/*
> +	 * Retrying fault, we get here *not* having allowed the filemap to wait
> +	 * on the page lock. We should now allow waiting on the IO with the
> +	 * mmap semaphore released.
> +	 */
> +	down_read(&mm->mmap_sem);
> +	npages = __get_user_pages(tsk, mm, addr, 1, flags, pagep, NULL,
> +				  &locked);
> +	if (!locked) {
> +		BUG_ON(npages != -EBUSY);
> +		/*
> +		 * The previous call has now waited on the IO. Now we can
> +		 * retry
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 04:37:39PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 16:31, Gleb Natapov ha scritto:
> >> > What if the page being swapped out is L1's APIC access page? We don't
> >> > run prepare_vmcs12 in that case because it's an L2->L0->L2 entry, so we
> >> > need to "do something".
> > We will do something on L2->L1 exit. We will call
> > kvm_reload_apic_access_page(). That is what patch 5 of this series is
> > doing.
>
> Sorry, I meant "the APIC access page prepared by L1" for L2's execution.
>
> You wrote:
>
> > if (!is_guest_mode() || !(vmcs12->secondary_vm_exec_control &
> >     SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
> > 	write(APIC_ACCESS_ADDR)
> >
> > In other words if L2 shares L1 apic access page then reload, otherwise do
> > nothing.
>
> but in that case you have to redo nested_get_page, so "do nothing"
> doesn't work.
>
Ah, 7/7 is new in this submission. Before that this page was still pinned.
Looking at 7/7 now I do not see how it can work since it has no code for mmu
notifier to detect that it deals with such page and call
kvm_reload_apic_access_page(). I said to Tang previously that nested kvm has
a bunch of pinned page that are hard to deal with and suggested to iron out
non nested case first :(

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 04:24:04PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 16:21, Gleb Natapov ha scritto:
> > As far as I can tell the if that is needed there is:
> >
> > if (!is_guest_mode() || !(vmcs12->secondary_vm_exec_control &
> >     SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
> > 	write(APIC_ACCESS_ADDR)
> >
> > In other words if L2 shares L1 apic access page then reload, otherwise do
> > nothing.
>
> What if the page being swapped out is L1's APIC access page? We don't
> run prepare_vmcs12 in that case because it's an L2->L0->L2 entry, so we
> need to "do something".
We will do something on L2->L1 exit. We will call
kvm_reload_apic_access_page(). That is what patch 5 of this series is doing.

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 04:06:58PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 15:59, Gleb Natapov ha scritto:
> >
> > Suppose vmcs01->APIC_ACCESS_ADDR = 0xf000. During L2 entry
> > vmcs02->APIC_ACCESS_ADDR is set to 0xf000 too (by prepare_vmcs02). Now
> > 0xf000 is migrated to 0x8000, mmu notifier is called, it forces vmexit,
> > but vcpu is in a guest mode so vmcs02->APIC_ACCESS_ADDR is never updated
> > to 0x8000 because of "if (!is_guest_mode(vcpu))" check. So what am I
> > missing here?
>
> Right, guest mode isn't left as soon as you execute nested_vmx_vmexit,
> because this isn't an L2->L1 exit. So we need an "else" for that "if
> (!is_guest_mode(vcpu))", in which case the hpa is ignored and
> vmcs12->apic_access_addr is used instead?
>
As far as I can tell the if that is needed there is:

if (!is_guest_mode() || !(vmcs12->secondary_vm_exec_control &
    SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
	write(APIC_ACCESS_ADDR)

In other words if L2 shares L1 apic access page then reload, otherwise do
nothing.

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 03:05:05PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 13:30, Gleb Natapov ha scritto:
> >> > +		vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
> >> > +		/*
> >> > +		 * Do not pin apic access page in memory so that memory
> >> > +		 * hotplug process is able to migrate it.
> >> > +		 */
> >> > +		put_page(page);
> >> > 	}
> > This code is in prepare_vmcs02() and is executed during L1->L2 vmentry.
> > What happens when apic access page is migrated while L2 is running? It
> > needs to be update somewhere.
>
> Before it is migrated, the MMU notifier is called and will force a
> vmexit on all CPUs. The reload code will call GUP again on the page
> again and swap it in.
>
This is how it will work without "if (!is_guest_mode(vcpu))". But, unless I
am missing something, with this check it will not work while vcpu is in L2.
Suppose vmcs01->APIC_ACCESS_ADDR = 0xf000. During L2 entry
vmcs02->APIC_ACCESS_ADDR is set to 0xf000 too (by prepare_vmcs02). Now
0xf000 is migrated to 0x8000, mmu notifier is called, it forces vmexit, but
vcpu is in a guest mode so vmcs02->APIC_ACCESS_ADDR is never updated to
0x8000 because of "if (!is_guest_mode(vcpu))" check. So what am I missing
here?

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 12:47:16PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 12:12, Gleb Natapov ha scritto:
> > On Thu, Sep 11, 2014 at 11:21:49AM +0200, Paolo Bonzini wrote:
> >> Il 11/09/2014 07:38, Tang Chen ha scritto:
> >>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>> index 63c4c3e..da6d55d 100644
> >>> --- a/arch/x86/kvm/vmx.c
> >>> +++ b/arch/x86/kvm/vmx.c
> >>> @@ -7093,6 +7093,11 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
> >>> 	vmx_set_msr_bitmap(vcpu);
> >>> }
> >>>
> >>> +static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
> >>> +{
> >>> +	vmcs_write64(APIC_ACCESS_ADDR, hpa);
> >>
> >> This has to be guarded by "if (!is_guest_mode(vcpu))".
> >>
> > We do need to write it if L1 and L2 share APIC_ACCESS_ADDR and skip
> > it otherwise, no?
>
> Yes, but this would be handled by patch 6:
>
>         } else if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
> +                struct page *page = gfn_to_page(vmx->vcpu.kvm,
> +                                APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>                  exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> -                vmcs_write64(APIC_ACCESS_ADDR,
> -                             page_to_phys(vcpu->kvm->arch.apic_access_page));
> +                vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
> +                /*
> +                 * Do not pin apic access page in memory so that memory
> +                 * hotplug process is able to migrate it.
> +                 */
> +                put_page(page);
>         }

This code is in prepare_vmcs02() and is executed during L1->L2 vmentry. What
happens when apic access page is migrated while L2 is running? It needs to
be update somewhere.

> However, this is also useless code duplication because the above snippet could
> reuse vcpu_reload_apic_access_page too.
>
> So I think you cannot do the is_guest_mode check in
> kvm_vcpu_reload_apic_access_page and also not in
> vmx_reload_apic_access_page. But you could do something like
>
> kvm_vcpu_reload_apic_access_page(...)
> {
> ...
> 	kvm_x86_ops->reload_apic_access_page(...);
> }
> EXPORT_SYMBOL_GPL(kvm_vcpu_reload_apic_access_page);
>
> /* used in vcpu_enter_guest only */
> vcpu_reload_apic_access_page(...)
> {
> 	if (!is_guest_mode(vcpu))
> 		kvm_vcpu_reload_apic_access_page(...)
> }
>
> Paolo

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 11:21:49AM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 07:38, Tang Chen ha scritto:
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 63c4c3e..da6d55d 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -7093,6 +7093,11 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
> > 	vmx_set_msr_bitmap(vcpu);
> > }
> >
> > +static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
> > +{
> > +	vmcs_write64(APIC_ACCESS_ADDR, hpa);
>
> This has to be guarded by "if (!is_guest_mode(vcpu))".
>
We do need to write it if L1 and L2 share APIC_ACCESS_ADDR and skip it
otherwise, no?

> > +}
> > +

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 11:21:49AM +0200, Paolo Bonzini wrote: Il 11/09/2014 07:38, Tang Chen ha scritto: diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 63c4c3e..da6d55d 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -7093,6 +7093,11 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set) vmx_set_msr_bitmap(vcpu); } +static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa) +{ + vmcs_write64(APIC_ACCESS_ADDR, hpa); This has to be guarded by if (!is_guest_mode(vcpu)). We do need to write it if L1 and L2 share APIC_ACCESS_ADDR and skip it otherwise, no? +} + -- Gleb. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 12:47:16PM +0200, Paolo Bonzini wrote: Il 11/09/2014 12:12, Gleb Natapov ha scritto: On Thu, Sep 11, 2014 at 11:21:49AM +0200, Paolo Bonzini wrote: Il 11/09/2014 07:38, Tang Chen ha scritto: diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 63c4c3e..da6d55d 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -7093,6 +7093,11 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set) vmx_set_msr_bitmap(vcpu); } +static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa) +{ + vmcs_write64(APIC_ACCESS_ADDR, hpa); This has to be guarded by if (!is_guest_mode(vcpu)). We do need to write it if L1 and L2 share APIC_ACCESS_ADDR and skip it otherwise, no? Yes, but this would be handled by patch 6: } else if (vm_need_virtualize_apic_accesses(vmx-vcpu.kvm)) { + struct page *page = gfn_to_page(vmx-vcpu.kvm, + APIC_DEFAULT_PHYS_BASE PAGE_SHIFT); exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; - vmcs_write64(APIC_ACCESS_ADDR, - page_to_phys(vcpu-kvm-arch.apic_access_page)); + vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page)); + /* + * Do not pin apic access page in memory so that memory + * hotplug process is able to migrate it. + */ + put_page(page); } This code is in prepare_vmcs02() and is executed during L1-L2 vmentry. What happens when apic access page is migrated while L2 is running? It needs to be update somewhere. However, this is also useless code duplication because the above snippet could reuse vcpu_reload_apic_access_page too. So I think you cannot do the is_guest_mode check in kvm_vcpu_reload_apic_access_page and also not in vmx_reload_apic_access_page. But you could do something like kvm_vcpu_reload_apic_access_page(...) { ... kvm_x86_ops-reload_apic_access_page(...); } EXPORT_SYMBOL_GPL(kvm_vcpu_reload_apic_access_page); /* used in vcpu_enter_guest only */ vcpu_reload_apic_access_page(...) { if (!is_guest_mode(vcpu)) kvm_vcpu_reload_apic_access_page(...) } Paolo -- Gleb. 
-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 03:05:05PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 13:30, Gleb Natapov ha scritto:
> > > > +		vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
> > > > +		/*
> > > > +		 * Do not pin apic access page in memory so that memory
> > > > +		 * hotplug process is able to migrate it.
> > > > +		 */
> > > > +		put_page(page);
> > > > 	}
> >
> > This code is in prepare_vmcs02() and is executed during L1->L2
> > vmentry. What happens when the apic access page is migrated while L2
> > is running? It needs to be updated somewhere.
>
> Before it is migrated, the MMU notifier is called and will force a
> vmexit on all CPUs.  The reload code will call GUP on the page again
> and swap it in.

This is how it will work without if (!is_guest_mode(vcpu)). But, unless
I am missing something, with this check it will not work while the vcpu
is in L2. Suppose vmcs01->APIC_ACCESS_ADDR = 0xf000. During L2 entry
vmcs02->APIC_ACCESS_ADDR is set to 0xf000 too (by prepare_vmcs02). Now
0xf000 is migrated to 0x8000, the mmu notifier is called, it forces a
vmexit, but the vcpu is in guest mode, so vmcs02->APIC_ACCESS_ADDR is
never updated to 0x8000 because of the if (!is_guest_mode(vcpu)) check.
So what am I missing here?

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 04:06:58PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 15:59, Gleb Natapov ha scritto:
> > Suppose vmcs01->APIC_ACCESS_ADDR = 0xf000. During L2 entry
> > vmcs02->APIC_ACCESS_ADDR is set to 0xf000 too (by prepare_vmcs02).
> > Now 0xf000 is migrated to 0x8000, the mmu notifier is called, it
> > forces a vmexit, but the vcpu is in guest mode, so
> > vmcs02->APIC_ACCESS_ADDR is never updated to 0x8000 because of the
> > if (!is_guest_mode(vcpu)) check. So what am I missing here?
>
> Right, guest mode isn't left as soon as you execute nested_vmx_vmexit,
> because this isn't an L2->L1 exit.  So we need an else for that
> if (!is_guest_mode(vcpu)), in which case the hpa is ignored and
> vmcs12->apic_access_addr is used instead?

As far as I can tell the if that is needed there is:

   if (!is_guest_mode() ||
       !(vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
           write(APIC_ACCESS_ADDR)

In other words, if L2 shares L1's apic access page then reload,
otherwise do nothing.

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 04:24:04PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 16:21, Gleb Natapov ha scritto:
> > As far as I can tell the if that is needed there is:
> >
> >    if (!is_guest_mode() ||
> >        !(vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
> >            write(APIC_ACCESS_ADDR)
> >
> > In other words, if L2 shares L1's apic access page then reload,
> > otherwise do nothing.
>
> What if the page being swapped out is L1's APIC access page?  We don't
> run prepare_vmcs12 in that case because it's an L2->L0->L2 entry, so we
> need to do something.

We will do something on L2->L1 exit. We will call
kvm_reload_apic_access_page(). That is what patch 5 of this series is
doing.

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 04:37:39PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 16:31, Gleb Natapov ha scritto:
> > > What if the page being swapped out is L1's APIC access page?  We
> > > don't run prepare_vmcs12 in that case because it's an L2->L0->L2
> > > entry, so we need to do something.
> >
> > We will do something on L2->L1 exit. We will call
> > kvm_reload_apic_access_page(). That is what patch 5 of this series
> > is doing.
>
> Sorry, I meant the APIC access page prepared by L1 for L2's execution.
>
> You wrote:
>
> > if (!is_guest_mode() ||
> >     !(vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
> >         write(APIC_ACCESS_ADDR)
> >
> > In other words, if L2 shares L1's apic access page then reload,
> > otherwise do nothing.
>
> but in that case you have to redo nested_get_page, so "do nothing"
> doesn't work.

Ah, 7/7 is new in this submission. Before that this page was still
pinned. Looking at 7/7 now, I do not see how it can work, since it has
no code for the mmu notifier to detect that it deals with such a page
and call kvm_reload_apic_access_page(). I said to Tang previously that
nested kvm has a bunch of pinned pages that are hard to deal with, and
suggested to iron out the non-nested case first :(

--
	Gleb.
Re: [PATCH v4 4/6] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Tue, Sep 09, 2014 at 03:13:07PM +0800, tangchen wrote: > Hi Gleb, > > On 09/03/2014 11:04 PM, Gleb Natapov wrote: > >On Wed, Sep 03, 2014 at 09:42:30AM +0800, tangchen wrote: > >>Hi Gleb, > >> > >>On 09/03/2014 12:00 AM, Gleb Natapov wrote: > >>>.. > >>>+static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) > >>>+{ > >>>+ /* > >>>+ * apic access page could be migrated. When the page is being migrated, > >>>+ * GUP will wait till the migrate entry is replaced with the new pte > >>>+ * entry pointing to the new page. > >>>+ */ > >>>+ vcpu->kvm->arch.apic_access_page = gfn_to_page(vcpu->kvm, > >>>+ APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); > >>>+ kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm, > >>>+ page_to_phys(vcpu->kvm->arch.apic_access_page)); > >>>I am a little bit worried that here all vcpus write to > >>>vcpu->kvm->arch.apic_access_page > >>>without any locking. It is probably benign since pointer write is atomic > >>>on x86. Paolo? > >>> > >>>Do we even need apic_access_page? Why not call > >>> gfn_to_page(APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT) > >>> put_page() > >>>on rare occasions we need to know its address? > >>Isn't it a necessary item defined in hardware spec ? > >> > >vcpu->kvm->arch.apic_access_page? No. This is internal kvm data structure. > > > >>I didn't read intel spec deeply, but according to the code, the page's > >>address is > >>written into vmcs. And it made me think that we cannnot remove it. > >> > >We cannot remove writing of apic page address into vmcs, but this is not > >done by > >assigning to vcpu->kvm->arch.apic_access_page, but by vmwrite in > >set_apic_access_page_addr(). > > OK, I'll try to remove kvm->arch.apic_access_page and send a patch for it > soon. > > BTW, if you don't have objection to the first two patches, would you please > help to > commit them first ? > I acked them and CCed Paolo to this reply. I hope he will look at the series too. -- Gleb. 
Re: [PATCH v4 2/6] kvm: Remove ept_identity_pagetable from struct kvm_arch.
On Wed, Aug 27, 2014 at 06:17:37PM +0800, Tang Chen wrote:
> kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
> it is never used to refer to the page at all.
>
> In vcpu initialization, it indicates two things:
> 1. indicates if ept page is allocated
> 2. indicates if a memory slot for identity page is initialized
>
> Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
> identity pagetable is initialized. So we can remove ept_identity_pagetable.
>
> NOTE: In the original code, ept identity pagetable page is pinned in memory.
> As a result, it cannot be migrated/hot-removed. After this patch, since
> kvm_arch->ept_identity_pagetable is removed, ept identity pagetable page
> is no longer pinned in memory. And it can be migrated/hot-removed.

Reviewed-by: Gleb Natapov

> Signed-off-by: Tang Chen
> ---
>  arch/x86/include/asm/kvm_host.h |  1 -
>  arch/x86/kvm/vmx.c              | 50 -
>  arch/x86/kvm/x86.c              |  2 --
>  3 files changed, 25 insertions(+), 28 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7c492ed..35171c7 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -580,7 +580,6 @@ struct kvm_arch {
>
>  	gpa_t wall_clock;
>
> -	struct page *ept_identity_pagetable;
>  	bool ept_identity_pagetable_done;
>  	gpa_t ept_identity_map_addr;
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 4b80ead..953d529 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -743,6 +743,7 @@ static u32 vmx_segment_access_rights(struct kvm_segment *var);
>  static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu *vcpu);
>  static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx);
>  static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx);
> +static int alloc_identity_pagetable(struct kvm *kvm);
>
>  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
>  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> @@ -3938,21 +3939,27 @@ out:
>
>  static int init_rmode_identity_map(struct kvm *kvm)
>  {
> -	int i, idx, r, ret;
> +	int i, idx, r, ret = 0;
>  	pfn_t identity_map_pfn;
>  	u32 tmp;
>
>  	if (!enable_ept)
>  		return 1;
> -	if (unlikely(!kvm->arch.ept_identity_pagetable)) {
> -		printk(KERN_ERR "EPT: identity-mapping pagetable "
> -			"haven't been allocated!\n");
> -		return 0;
> +
> +	/* Protect kvm->arch.ept_identity_pagetable_done. */
> +	mutex_lock(&kvm->slots_lock);
> +
> +	if (likely(kvm->arch.ept_identity_pagetable_done)) {
> +		ret = 1;
> +		goto out2;
>  	}
> -	if (likely(kvm->arch.ept_identity_pagetable_done))
> -		return 1;
> -	ret = 0;
> +
>  	identity_map_pfn = kvm->arch.ept_identity_map_addr >> PAGE_SHIFT;
> +
> +	r = alloc_identity_pagetable(kvm);
> +	if (r)
> +		goto out2;
> +
>  	idx = srcu_read_lock(&kvm->srcu);
>  	r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE);
>  	if (r < 0)
> @@ -3970,6 +3977,9 @@ static int init_rmode_identity_map(struct kvm *kvm)
>  	ret = 1;
> out:
>  	srcu_read_unlock(&kvm->srcu, idx);
> +
> +out2:
> +	mutex_unlock(&kvm->slots_lock);
>  	return ret;
>  }
>
> @@ -4019,31 +4029,23 @@ out:
>
>  static int alloc_identity_pagetable(struct kvm *kvm)
>  {
> -	struct page *page;
> +	/*
> +	 * In init_rmode_identity_map(), kvm->arch.ept_identity_pagetable_done
> +	 * is checked before calling this function and set to true after the
> +	 * calling. The access to kvm->arch.ept_identity_pagetable_done should
> +	 * be protected by kvm->slots_lock.
> +	 */
> +
>  	struct kvm_userspace_memory_region kvm_userspace_mem;
>  	int r = 0;
>
> -	mutex_lock(&kvm->slots_lock);
> -	if (kvm->arch.ept_identity_pagetable)
> -		goto out;
>  	kvm_userspace_mem.slot = IDENTITY_PAGETABLE_PRIVATE_MEMSLOT;
>  	kvm_userspace_mem.flags = 0;
>  	kvm_userspace_mem.guest_phys_addr =
>  		kvm->arch.ept_identity_map_addr;
>  	kvm_userspace_mem.memory_size = PAGE_SIZE;
>  	r = __kvm_set_memory_region(kvm, &kvm_userspace_mem);
> -	if (r)
> -		goto out;
>
> -	page = gfn_to_page(kvm, kvm->arch.ept_identity_map_addr >> PAGE_SHIFT);
> -	if (is_error_page(page)) {
> -		r = -EFAULT;
>
Re: [PATCH v4 1/6] kvm: Use APIC_DEFAULT_PHYS_BASE macro as the apic access page address.
On Wed, Aug 27, 2014 at 06:17:36PM +0800, Tang Chen wrote:
> We have APIC_DEFAULT_PHYS_BASE defined as 0xfee00000, which is also the
> address of apic access page. So use this macro.

Reviewed-by: Gleb Natapov

> Signed-off-by: Tang Chen
> ---
>  arch/x86/kvm/svm.c | 3 ++-
>  arch/x86/kvm/vmx.c | 6 +++---
>  2 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index ddf7427..1d941ad 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1257,7 +1257,8 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
>  	svm->asid_generation = 0;
>  	init_vmcb(svm);
>
> -	svm->vcpu.arch.apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;
> +	svm->vcpu.arch.apic_base = APIC_DEFAULT_PHYS_BASE |
> +				   MSR_IA32_APICBASE_ENABLE;
>  	if (kvm_vcpu_is_bsp(&svm->vcpu))
>  		svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP;
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index bfe11cf..4b80ead 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3999,13 +3999,13 @@ static int alloc_apic_access_page(struct kvm *kvm)
>  		goto out;
>  	kvm_userspace_mem.slot = APIC_ACCESS_PAGE_PRIVATE_MEMSLOT;
>  	kvm_userspace_mem.flags = 0;
> -	kvm_userspace_mem.guest_phys_addr = 0xfee00000ULL;
> +	kvm_userspace_mem.guest_phys_addr = APIC_DEFAULT_PHYS_BASE;
>  	kvm_userspace_mem.memory_size = PAGE_SIZE;
>  	r = __kvm_set_memory_region(kvm, &kvm_userspace_mem);
>  	if (r)
>  		goto out;
>
> -	page = gfn_to_page(kvm, 0xfee00);
> +	page = gfn_to_page(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>  	if (is_error_page(page)) {
>  		r = -EFAULT;
>  		goto out;
> @@ -4477,7 +4477,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
>
>  	vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
>  	kvm_set_cr8(&vmx->vcpu, 0);
> -	apic_base_msr.data = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;
> +	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
>  	if (kvm_vcpu_is_bsp(&vmx->vcpu))
>  		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
>  	apic_base_msr.host_initiated = true;
> --
> 1.8.3.1
>

--
	Gleb.
Re: [PATCH 3/4] KVM: x86: inject nested page faults on emulated instructions
On Thu, Sep 04, 2014 at 07:44:51PM +0200, Paolo Bonzini wrote:
> Il 04/09/2014 17:05, Gleb Natapov ha scritto:
> > > If you do that, KVM gets down to the "if (writeback)" and writes the
> > > ctxt->eip from L2 into the L1 EIP.
> >
> > Heh, that's a bummer. We should not write back if an instruction
> > caused a vmexit.
>
> You're right, that works.

Looks good! Reviewed-by: Gleb Natapov

> Paolo
>
> -- 8< -
> Subject: [PATCH] KVM: x86: skip writeback on injection of nested exception
>
> If a nested page fault happens during emulation, we will inject a vmexit,
> not a page fault.  However because writeback happens after the injection,
> we will write ctxt->eip from L2 into the L1 EIP.  We do not write back
> if an instruction caused an interception vmexit---do the same for page
> faults.
>
> Signed-off-by: Paolo Bonzini
> ---
>  arch/x86/include/asm/kvm_host.h |  1 -
>  arch/x86/kvm/x86.c              | 15 ++-
>  2 files changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 08cc299..c989651 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -893,7 +893,6 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
>  int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  			    gfn_t gfn, void *data, int offset, int len,
>  			    u32 access);
> -void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
>  bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
>
>  static inline int __kvm_irq_line_state(unsigned long *irq_state,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e4ed85e..3541946 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -408,12 +408,14 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
>  }
>  EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
>
> -void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
> +static bool kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
>  {
>  	if (mmu_is_nested(vcpu) && !fault->nested_page_fault)
>  		vcpu->arch.nested_mmu.inject_page_fault(vcpu, fault);
>  	else
>  		vcpu->arch.mmu.inject_page_fault(vcpu, fault);
> +
> +	return fault->nested_page_fault;
>  }
>
>  void kvm_inject_nmi(struct kvm_vcpu *vcpu)
> @@ -4929,16 +4931,18 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
>  	}
>  }
>
> -static void inject_emulated_exception(struct kvm_vcpu *vcpu)
> +static bool inject_emulated_exception(struct kvm_vcpu *vcpu)
>  {
>  	struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
>  	if (ctxt->exception.vector == PF_VECTOR)
> -		kvm_propagate_fault(vcpu, &ctxt->exception);
> -	else if (ctxt->exception.error_code_valid)
> +		return kvm_propagate_fault(vcpu, &ctxt->exception);
> +
> +	if (ctxt->exception.error_code_valid)
>  		kvm_queue_exception_e(vcpu, ctxt->exception.vector,
>  				      ctxt->exception.error_code);
>  	else
>  		kvm_queue_exception(vcpu, ctxt->exception.vector);
> +	return false;
>  }
>
>  static void init_emulate_ctxt(struct kvm_vcpu *vcpu)
> @@ -5300,8 +5304,9 @@ restart:
>  	}
>
>  	if (ctxt->have_exception) {
> -		inject_emulated_exception(vcpu);
>  		r = EMULATE_DONE;
> +		if (inject_emulated_exception(vcpu))
> +			return r;
>  	} else if (vcpu->arch.pio.count) {
>  		if (!vcpu->arch.pio.in) {
>  			/* FIXME: return into emulator if single-stepping. */
> --
> 1.9.3
>

--
	Gleb.
Re: [PATCH 3/4] KVM: x86: inject nested page faults on emulated instructions
On Thu, Sep 04, 2014 at 04:12:19PM +0200, Paolo Bonzini wrote:
> Il 04/09/2014 09:02, Gleb Natapov ha scritto:
> > On Tue, Sep 02, 2014 at 05:13:49PM +0200, Paolo Bonzini wrote:
> > > This is required for the following patch to work correctly.  If a
> > > nested page fault happens during emulation, we must inject a vmexit,
> > > not a page fault.  Luckily we already have the required machinery:
> > > it is enough to return X86EMUL_INTERCEPTED instead of
> > > X86EMUL_PROPAGATE_FAULT.
> >
> > I wonder why this patch is needed. X86EMUL_PROPAGATE_FAULT causes
> > ctxt->have_exception to be set to true in x86_emulate_insn().
> > x86_emulate_instruction() checks ctxt->have_exception and calls
> > inject_emulated_exception() if it is true. inject_emulated_exception()
> > calls kvm_propagate_fault() where we check if the fault was nested and
> > generate vmexit or a page fault accordingly.
>
> Good question. :)
>
> If you do that, KVM gets down to the "if (writeback)" and writes the
> ctxt->eip from L2 into the L1 EIP.

Heh, that's a bummer. We should not write back if an instruction caused
a vmexit.

> Possibly this patch can be replaced by just this?
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 022513b..475e979 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5312,7 +5312,7 @@ restart:
>
>  	if (ctxt->have_exception) {
>  		inject_emulated_exception(vcpu);
> -		r = EMULATE_DONE;
> +		return EMULATE_DONE;

If there was no vmexit we still want to writeback. Perhaps:

    writeback = inject_emulated_exception(vcpu);

and return false if there was a vmexit due to a nested page fault (or
any fault; can't L1 ask for a #GP/#UD intercept that needs to be
handled here too?)

>  	} else if (vcpu->arch.pio.count) {
>  		if (!vcpu->arch.pio.in) {
>  			/* FIXME: return into emulator if single-stepping. */
>
> But I'm not sure how to test it, and I like the idea of treating nested
> page faults like other nested vmexits during emulation (which is what
> this patch does).

IMO exits due to instruction intercept and exits due to other
interceptable events that may happen during instruction emulation are
sufficiently different to be handled slightly differently. If my
assumption about #GP above is correct, with the current approach it can
be easily handled inside inject_emulated_exception().

--
	Gleb.
Re: [PATCH 3/4] KVM: x86: inject nested page faults on emulated instructions
On Tue, Sep 02, 2014 at 05:13:49PM +0200, Paolo Bonzini wrote: > This is required for the following patch to work correctly. If a nested page > fault happens during emulation, we must inject a vmexit, not a page fault. > Luckily we already have the required machinery: it is enough to return > X86EMUL_INTERCEPTED instead of X86EMUL_PROPAGATE_FAULT. > I wonder why this patch is needed. X86EMUL_PROPAGATE_FAULT causes ctxt->have_exception to be set to true in x86_emulate_insn(). x86_emulate_instruction() checks ctxt->have_exception and calls inject_emulated_exception() if it is true. inject_emulated_exception() calls kvm_propagate_fault() where we check if the fault was nested and generate vmexit or a page fault accordingly. > Reported-by: Valentine Sinitsyn > Signed-off-by: Paolo Bonzini > --- > arch/x86/kvm/x86.c | 18 ++ > 1 file changed, 14 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index e4ed85e07a01..9e3b74c044ed 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -416,6 +416,16 @@ void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct > x86_exception *fault) > vcpu->arch.mmu.inject_page_fault(vcpu, fault); > } > > +static inline int kvm_propagate_or_intercept(struct kvm_vcpu *vcpu, > + struct x86_exception *exception) > +{ > + if (likely(!exception->nested_page_fault)) > + return X86EMUL_PROPAGATE_FAULT; > + > + vcpu->arch.mmu.inject_page_fault(vcpu, exception); > + return X86EMUL_INTERCEPTED; > +} > + > void kvm_inject_nmi(struct kvm_vcpu *vcpu) > { > atomic_inc(>arch.nmi_queued); > @@ -4122,7 +4132,7 @@ static int kvm_read_guest_virt_helper(gva_t addr, void > *val, unsigned int bytes, > int ret; > > if (gpa == UNMAPPED_GVA) > - return X86EMUL_PROPAGATE_FAULT; > + return kvm_propagate_or_intercept(vcpu, exception); > ret = kvm_read_guest_page(vcpu->kvm, gpa >> PAGE_SHIFT, data, > offset, toread); > if (ret < 0) { > @@ -4152,7 +4162,7 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt > *ctxt, > 
gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, > access|PFERR_FETCH_MASK, > exception); > if (unlikely(gpa == UNMAPPED_GVA)) > - return X86EMUL_PROPAGATE_FAULT; > + return kvm_propagate_or_intercept(vcpu, exception); > > offset = addr & (PAGE_SIZE-1); > if (WARN_ON(offset + bytes > PAGE_SIZE)) > @@ -4203,7 +4213,7 @@ int kvm_write_guest_virt_system(struct x86_emulate_ctxt > *ctxt, > int ret; > > if (gpa == UNMAPPED_GVA) > - return X86EMUL_PROPAGATE_FAULT; > + return kvm_propagate_or_intercept(vcpu, exception); > ret = kvm_write_guest(vcpu->kvm, gpa, data, towrite); > if (ret < 0) { > r = X86EMUL_IO_NEEDED; > @@ -4350,7 +4360,7 @@ static int emulator_read_write_onepage(unsigned long > addr, void *val, > ret = vcpu_mmio_gva_to_gpa(vcpu, addr, , exception, write); > > if (ret < 0) > - return X86EMUL_PROPAGATE_FAULT; > + return kvm_propagate_or_intercept(vcpu, exception); > > /* For APIC access vmexit */ > if (ret) > -- > 1.8.3.1 > > -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4 4/6] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Wed, Sep 03, 2014 at 09:42:30AM +0800, tangchen wrote: > Hi Gleb, > > On 09/03/2014 12:00 AM, Gleb Natapov wrote: > >.. > >+static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) > >+{ > >+/* > >+ * apic access page could be migrated. When the page is being migrated, > >+ * GUP will wait till the migrate entry is replaced with the new pte > >+ * entry pointing to the new page. > >+ */ > >+vcpu->kvm->arch.apic_access_page = gfn_to_page(vcpu->kvm, > >+APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); > >+kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm, > >+page_to_phys(vcpu->kvm->arch.apic_access_page)); > >I am a little bit worried that here all vcpus write to > >vcpu->kvm->arch.apic_access_page > >without any locking. It is probably benign since pointer write is atomic on > >x86. Paolo? > > > >Do we even need apic_access_page? Why not call > > gfn_to_page(APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT) > > put_page() > >on rare occasions we need to know its address? > > Isn't it a necessary item defined in hardware spec ? > vcpu->kvm->arch.apic_access_page? No. This is internal kvm data structure. > I didn't read intel spec deeply, but according to the code, the page's > address is > written into vmcs. And it made me think that we cannnot remove it. > We cannot remove writing of apic page address into vmcs, but this is not done by assigning to vcpu->kvm->arch.apic_access_page, but by vmwrite in set_apic_access_page_addr(). -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4 5/6] kvm, mem-hotplug: Reload L1's apic access page on migration when L2 is running.
On Wed, Aug 27, 2014 at 06:17:40PM +0800, Tang Chen wrote: > This patch only handle "L1 and L2 vm share one apic access page" situation. > > When L1 vm is running, if the shared apic access page is migrated, > mmu_notifier will > request all vcpus to exit to L0, and reload apic access page physical address > for > all the vcpus' vmcs (which is done by patch 5/6). And when it enters L2 vm, > L2's vmcs > will be updated in prepare_vmcs02() called by nested_vm_run(). So we need to > do > nothing. > > When L2 vm is running, if the shared apic access page is migrated, > mmu_notifier will > request all vcpus to exit to L0, and reload apic access page physical address > for > all L2 vmcs. And this patch requests apic access page reload in L2->L1 vmexit. > > Signed-off-by: Tang Chen > --- > arch/x86/include/asm/kvm_host.h | 1 + > arch/x86/kvm/svm.c | 6 ++ > arch/x86/kvm/vmx.c | 32 > arch/x86/kvm/x86.c | 3 +++ > virt/kvm/kvm_main.c | 1 + > 5 files changed, 43 insertions(+) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 514183e..13fbb62 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -740,6 +740,7 @@ struct kvm_x86_ops { > void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap); > void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set); > void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa); > + void (*set_nested_apic_page_migrated)(struct kvm_vcpu *vcpu, bool set); > void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector); > void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu); > int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c > index f2eacc4..da88646 100644 > --- a/arch/x86/kvm/svm.c > +++ b/arch/x86/kvm/svm.c > @@ -3624,6 +3624,11 @@ static void svm_set_apic_access_page_addr(struct kvm > *kvm, hpa_t hpa) > return; > } > > +static void svm_set_nested_apic_page_migrated(struct kvm_vcpu 
*vcpu, bool > set) > +{ > + return; > +} > + > static int svm_vm_has_apicv(struct kvm *kvm) > { > return 0; > @@ -4379,6 +4384,7 @@ static struct kvm_x86_ops svm_x86_ops = { > .update_cr8_intercept = update_cr8_intercept, > .set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode, > .set_apic_access_page_addr = svm_set_apic_access_page_addr, > + .set_nested_apic_page_migrated = svm_set_nested_apic_page_migrated, > .vm_has_apicv = svm_vm_has_apicv, > .load_eoi_exitmap = svm_load_eoi_exitmap, > .hwapic_isr_update = svm_hwapic_isr_update, > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index da6d55d..9035fd1 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -379,6 +379,16 @@ struct nested_vmx { >* we must keep them pinned while L2 runs. >*/ > struct page *apic_access_page; > + /* > + * L1's apic access page can be migrated. When L1 and L2 are sharing > + * the apic access page, after the page is migrated when L2 is running, > + * we have to reload it to L1 vmcs before we enter L1. > + * > + * When the shared apic access page is migrated in L1 mode, we don't > + * need to do anything else because we reload apic access page each > + * time when entering L2 in prepare_vmcs02(). > + */ > + bool apic_access_page_migrated; > u64 msr_ia32_feature_control; > > struct hrtimer preemption_timer; > @@ -7098,6 +7108,12 @@ static void vmx_set_apic_access_page_addr(struct kvm > *kvm, hpa_t hpa) > vmcs_write64(APIC_ACCESS_ADDR, hpa); > } > > +static void vmx_set_nested_apic_page_migrated(struct kvm_vcpu *vcpu, bool > set) > +{ > + struct vcpu_vmx *vmx = to_vmx(vcpu); > + vmx->nested.apic_access_page_migrated = set; > +} > + > static void vmx_hwapic_isr_update(struct kvm *kvm, int isr) > { > u16 status; > @@ -8796,6 +8812,21 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, > u32 exit_reason, > } > > /* > + * When shared (L1 & L2) apic access page is migrated during L2 is > + * running, mmu_notifier will force to reload the page's hpa for L2 > + * vmcs. 
Need to reload it for L1 before entering L1. > + */ > + if (vmx->nested.apic_access_page_migrated) { > + /* > + * Do not call kvm_reload_apic_access_page() because we are now > + * in L2. We should not call make_all_cpus_request() to exit to > + * L0, otherwise we will reload for L2 vmcs again. > + */ > + kvm_reload_apic_access_page(vcpu->kvm); > + vmx->nested.apic_access_page_migrated = false; > + } I would just call kvm_reload_apic_access_page() unconditionally, and only if it proves to be a performance problem would I optimize it further. Vmexit emulation is pretty heavy, so I doubt one more vmwrite will be noticeable. -- Gleb.
Re: [PATCH v4 4/6] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Wed, Aug 27, 2014 at 06:17:39PM +0800, Tang Chen wrote: > apic access page is pinned in memory. As a result, it cannot be > migrated/hot-removed. > Actually, it is not necessary to be pinned. > > The hpa of apic access page is stored in VMCS APIC_ACCESS_ADDR pointer. When > the page is migrated, kvm_mmu_notifier_invalidate_page() will invalidate the > corresponding ept entry. This patch introduces a new vcpu request named > KVM_REQ_APIC_PAGE_RELOAD, and makes this request to all the vcpus at this > time, > and force all the vcpus exit guest, and re-enter guest till they updates the > VMCS > APIC_ACCESS_ADDR pointer to the new apic access page address, and updates > kvm->arch.apic_access_page to the new page. > > Signed-off-by: Tang Chen > --- > arch/x86/include/asm/kvm_host.h | 1 + > arch/x86/kvm/svm.c | 6 ++ > arch/x86/kvm/vmx.c | 6 ++ > arch/x86/kvm/x86.c | 15 +++ > include/linux/kvm_host.h| 2 ++ > virt/kvm/kvm_main.c | 12 > 6 files changed, 42 insertions(+) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 35171c7..514183e 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -739,6 +739,7 @@ struct kvm_x86_ops { > void (*hwapic_isr_update)(struct kvm *kvm, int isr); > void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap); > void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set); > + void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa); > void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector); > void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu); > int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c > index 1d941ad..f2eacc4 100644 > --- a/arch/x86/kvm/svm.c > +++ b/arch/x86/kvm/svm.c > @@ -3619,6 +3619,11 @@ static void svm_set_virtual_x2apic_mode(struct > kvm_vcpu *vcpu, bool set) > return; > } > > +static void svm_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa) > +{ > + 
return; > +} > + > static int svm_vm_has_apicv(struct kvm *kvm) > { > return 0; > @@ -4373,6 +4378,7 @@ static struct kvm_x86_ops svm_x86_ops = { > .enable_irq_window = enable_irq_window, > .update_cr8_intercept = update_cr8_intercept, > .set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode, > + .set_apic_access_page_addr = svm_set_apic_access_page_addr, > .vm_has_apicv = svm_vm_has_apicv, > .load_eoi_exitmap = svm_load_eoi_exitmap, > .hwapic_isr_update = svm_hwapic_isr_update, > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 63c4c3e..da6d55d 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -7093,6 +7093,11 @@ static void vmx_set_virtual_x2apic_mode(struct > kvm_vcpu *vcpu, bool set) > vmx_set_msr_bitmap(vcpu); > } > > +static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa) > +{ > + vmcs_write64(APIC_ACCESS_ADDR, hpa); > +} > + > static void vmx_hwapic_isr_update(struct kvm *kvm, int isr) > { > u16 status; > @@ -8910,6 +8915,7 @@ static struct kvm_x86_ops vmx_x86_ops = { > .enable_irq_window = enable_irq_window, > .update_cr8_intercept = update_cr8_intercept, > .set_virtual_x2apic_mode = vmx_set_virtual_x2apic_mode, > + .set_apic_access_page_addr = vmx_set_apic_access_page_addr, > .vm_has_apicv = vmx_vm_has_apicv, > .load_eoi_exitmap = vmx_load_eoi_exitmap, > .hwapic_irr_update = vmx_hwapic_irr_update, > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index e05bd58..96f4188 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -5989,6 +5989,19 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu) > kvm_apic_update_tmr(vcpu, tmr); > } > > +static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) > +{ > + /* > + * apic access page could be migrated. When the page is being migrated, > + * GUP will wait till the migrate entry is replaced with the new pte > + * entry pointing to the new page. 
> + */ > + vcpu->kvm->arch.apic_access_page = gfn_to_page(vcpu->kvm, > + APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); > + kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm, > + page_to_phys(vcpu->kvm->arch.apic_access_page)); I am a little bit worried that here all vcpus write to vcpu->kvm->arch.apic_access_page without any locking. It is probably benign since pointer write is atomic on x86. Paolo? Do we even need apic_access_page? Why not call gfn_to_page(APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT) put_page() on rare occasions we need to know its address? > +} > + > /* > * Returns 1 to let __vcpu_run() continue the guest execution loop without > * exiting to the userspace. Otherwise, the value will be returned to the > @@
Re: GET_RNG_SEED hypercall ABI? (Re: [PATCH v5 0/5] random,x86,kvm: Rework arch RNG seeds and get some from kvm)
On Tue, Aug 26, 2014 at 04:58:34PM -0700, Andy Lutomirski wrote: > hpa pointed out that the ABI that I chose (an MSR from the KVM range > and a KVM cpuid bit) is unnecessarily KVM-specific. It would be nice > to allocate an MSR that everyone involved can agree on and, rather > than relying on a cpuid bit, just have the guest probe for the MSR. > CPUID part allows feature to be disabled for machine compatibility purpose during migration. Of course interface can explicitly state that one successful use of the MSR does not mean that next use will not result in a #GP, but that doesn't sound very elegant and is different from any other MSR out there. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] KVM-Use value reading from MSR when construct the eptp in VMX mode
On Mon, Aug 25, 2014 at 11:16:34AM +0800, Dennis Chen wrote: > On Sun, Aug 24, 2014 at 5:38 PM, Gleb Natapov wrote: > > On Sun, Aug 24, 2014 at 11:54:32AM +0800, Dennis Chen wrote: > >> This patch is used to construct the eptp in vmx mode with values > >> readed from MSR according to the intel x86 software developer's > >> manual. > >> > >> static u64 construct_eptp(unsigned long root_hpa) > >> { > >> -u64 eptp; > >> +u64 eptp, pwl; > >> + > >> +if (cpu_has_vmx_ept_4levels()) > >> +pwl = VMX_EPT_DEFAULT_GAW << VMX_EPT_GAW_EPTP_SHIFT; > >> +else { > >> +WARN(1, "Unsupported page-walk length of 4.\n"); > > Page-walk length of 4 is the only one that is supported. > > > Since there is a bit 6 in the IA32_VMX_EPT_VPID_CAP MSR indicating support > for the page-walk length, I think a sanity check is necessary. > But I just checked the code, it's already done in the hardware_setup() > function which will disable the ept feature if the page-walk length is not > 4. Gleb, any comments for the memory type check part? Looks fine, but are there CPUs out there that do not support WB for eptp? Since there have been no bug reports about it, I assume not. -- Gleb.
Re: [PATCH] KVM-Use value reading from MSR when construct the eptp in VMX mode
On Sun, Aug 24, 2014 at 11:54:32AM +0800, Dennis Chen wrote: > This patch is used to construct the eptp in vmx mode with values > readed from MSR according to the intel x86 software developer's > manual. > > Signed-off-by: Dennis Chen > --- > arch/x86/include/asm/vmx.h |1 + > arch/x86/kvm/vmx.c | 21 + > 2 files changed, 18 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h > index bcbfade..bf82a77 100644 > --- a/arch/x86/include/asm/vmx.h > +++ b/arch/x86/include/asm/vmx.h > @@ -417,6 +417,7 @@ enum vmcs_field { > #define VMX_EPT_GAW_EPTP_SHIFT3 > #define VMX_EPT_AD_ENABLE_BIT(1ull << 6) > #define VMX_EPT_DEFAULT_MT0x6ull > +#define VMX_EPT_UC_MT0x0ull > #define VMX_EPT_READABLE_MASK0x1ull > #define VMX_EPT_WRITABLE_MASK0x2ull > #define VMX_EPT_EXECUTABLE_MASK0x4ull > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index bfe11cf..7add5ce 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -3477,11 +3477,24 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, > unsigned long cr0) > > static u64 construct_eptp(unsigned long root_hpa) > { > -u64 eptp; > +u64 eptp, pwl; > + > +if (cpu_has_vmx_ept_4levels()) > +pwl = VMX_EPT_DEFAULT_GAW << VMX_EPT_GAW_EPTP_SHIFT; > +else { > +WARN(1, "Unsupported page-walk length of 4.\n"); Page-walk length of 4 is the only one that is supported. > +BUG(); > +} > + > +if (cpu_has_vmx_eptp_writeback()) > +eptp = VMX_EPT_DEFAULT_MT | pwl; > +else if (cpu_has_vmx_eptp_uncacheable()) > +eptp = VMX_EPT_UC_MT | pwl; > +else { > +WARN(1, "Unsupported memory type config in vmx eptp.\n"); > +BUG(); > +} > > -/* TODO write the value reading from MSR */ > -eptp = VMX_EPT_DEFAULT_MT | > -VMX_EPT_DEFAULT_GAW << VMX_EPT_GAW_EPTP_SHIFT; > if (enable_ept_ad_bits) > eptp |= VMX_EPT_AD_ENABLE_BIT; > eptp |= (root_hpa & PAGE_MASK); > -- > 1.7.9.5 -- Gleb. 
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Fri, Jul 18, 2014 at 05:05:20PM +0800, Tang Chen wrote: > Hi Gleb, > > On 07/17/2014 09:57 PM, Gleb Natapov wrote: > >On Thu, Jul 17, 2014 at 09:34:20PM +0800, Tang Chen wrote: > >>Hi Gleb, > >> > >>On 07/15/2014 08:40 PM, Gleb Natapov wrote: > >>.. > >>>> > >>>>And yes, we have the problem you said here. We can migrate the page while > >>>>L2 > >>>>vm is running. > >>>>So I think we should enforce L2 vm to exit to L1. Right ? > >>>> > >>>We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so > >>>if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for L1 > >>>too. > >>> > >> > >>Sorry, I think I don't quite understand the procedure you are talking about > >>here. > >> > >>Referring to the code, I think we have three machines: L0(host), L1 and L2. > >>And we have two types of vmexit: L2->L1 and L2->L0. Right ? > >> > >>We are now talking about this case: L2 and L1 shares the apic page. > >> > >>Using patch 5/5, when apic page is migrated on L0, mmu_notifier will notify > >>L1, > >>and update L1's VMCS. At this time, we are in L0, not L2. Why cannot we > >Using patch 5/5, when apic page is migrated on L0, mmu_notifier will notify > >L1 or L2 VMCS depending on which one happens to be running right now. > >If it is L1 then L2's VMCS will be updated during vmentry emulation, > > OK, this is easy to understand. > > >if it is > >L2 we need to request reload during vmexit emulation to make sure L1's VMCS > >is > >updated. > > > > I'm a little confused here. In patch 5/5, I called make_all_cpus_request() > to > force all vcpus exit to host. If we are in L2, where will the vcpus exit to > ? > L1 or L0 ? > L0. CPU always exits to L0. L1->L2 exit is emulated by nested_vmx_vmexit(). -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Thu, Jul 17, 2014 at 09:34:20PM +0800, Tang Chen wrote: > Hi Gleb, > > On 07/15/2014 08:40 PM, Gleb Natapov wrote: > .. > >> > >>And yes, we have the problem you said here. We can migrate the page while L2 > >>vm is running. > >>So I think we should enforce L2 vm to exit to L1. Right ? > >> > >We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so > >if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for L1 > >too. > > > > Sorry, I think I don't quite understand the procedure you are talking about > here. > > Referring to the code, I think we have three machines: L0(host), L1 and L2. > And we have two types of vmexit: L2->L1 and L2->L0. Right ? > > We are now talking about this case: L2 and L1 shares the apic page. > > Using patch 5/5, when apic page is migrated on L0, mmu_notifier will notify > L1, > and update L1's VMCS. At this time, we are in L0, not L2. Why cannot we Using patch 5/5, when apic page is migrated on L0, mmu_notifier will notify L1 or L2 VMCS depending on which one happens to be running right now. If it is L1 then L2's VMCS will be updated during vmentry emulation, if it is L2 we need to request reload during vmexit emulation to make sure L1's VMCS is updated. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] random,x86,kvm: Add and use MSR_KVM_GET_RNG_SEED
On Wed, Jul 16, 2014 at 09:13:23AM -0700, H. Peter Anvin wrote: > On 07/16/2014 09:08 AM, Paolo Bonzini wrote: > > Il 16/07/2014 18:03, H. Peter Anvin ha scritto: > >> I suggested emulating RDRAND *but not set the CPUID bit*. We already > >> developed a protocol in KVM/Qemu to enumerate emulated features (created > >> for MOVBE as I recall), specifically to service the semantic "feature X > >> will work but will be substantially slower than normal." > > > > But those will set the CPUID bit. There is currently no way for KVM > > guests to know if a CPUID bit is real or emulated. > > > > OK, so there wasn't any protocol implemented in the end. I sit corrected. > The protocol that was implemented is between qemu and kvm, not between kvm and the guest. -- Gleb.
Re: [PATCH 0/4] random,x86,kvm: Add and use MSR_KVM_GET_RNG_SEED
On Wed, Jul 16, 2014 at 04:32:19PM +0200, Paolo Bonzini wrote: > Il 16/07/2014 16:07, Andy Lutomirski ha scritto: > >This patch has nothing whatsoever to do with how much I trust the CPU > >vs the hypervisor. It's for the enormous installed base of machines > >without RDRAND. > > Ok. I think an MSR is fine, though I don't think it's useful for the guest > to use it if it already has RDRAND and/or RDSEED. > Agree. It is unfortunate that we are adding PV interfaces for hardware that will be extinct in a couple of years, though :( -- Gleb.
Re: [PATCH 0/4] random,x86,kvm: Add and use MSR_KVM_GET_RNG_SEED
On Wed, Jul 16, 2014 at 09:10:27AM +0200, Daniel Borkmann wrote: > On 07/16/2014 08:41 AM, Gleb Natapov wrote: > >On Tue, Jul 15, 2014 at 07:48:06PM -0700, Andy Lutomirski wrote: > >>virtio-rng is both too complicated and insufficient for initial rng > >>seeding. It's far too complicated to use for KASLR or any other > >>early boot random number needs. It also provides /dev/random-style > >>bits, which means that making guest boot wait for virtio-rng is > >>unacceptably slow, and doing it asynchronously means that > >>/dev/urandom might be predictable when userspace starts. > >> > >>This introduces a very simple synchronous mechanism to get > >>/dev/urandom-style bits. > > > >Why can't you use RDRAND instruction for that? > > You mean using it directly? I think simply for the very same reasons > as in c2557a303a ... So you trust your hypervisor vendor more than you trust your CPU vendor? :) -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] random,x86,kvm: Add and use MSR_KVM_GET_RNG_SEED
On Tue, Jul 15, 2014 at 07:48:06PM -0700, Andy Lutomirski wrote: > virtio-rng is both too complicated and insufficient for initial rng > seeding. It's far too complicated to use for KASLR or any other > early boot random number needs. It also provides /dev/random-style > bits, which means that making guest boot wait for virtio-rng is > unacceptably slow, and doing it asynchronously means that > /dev/urandom might be predictable when userspace starts. > > This introduces a very simple synchronous mechanism to get > /dev/urandom-style bits. Why can't you use RDRAND instruction for that? > > This is a KVM change: am I supposed to write a unit test somewhere? > > Andy Lutomirski (4): > x86,kvm: Add MSR_KVM_GET_RNG_SEED and a matching feature bit > random,x86: Add arch_get_slow_rng_u64 > random: Seed pools from arch_get_slow_rng_u64 at startup > x86,kaslr: Use MSR_KVM_GET_RNG_SEED for KASLR if available > > Documentation/virtual/kvm/cpuid.txt | 3 +++ > arch/x86/Kconfig | 4 > arch/x86/boot/compressed/aslr.c | 27 +++ > arch/x86/include/asm/archslowrng.h | 30 ++ > arch/x86/include/uapi/asm/kvm_para.h | 2 ++ > arch/x86/kernel/kvm.c| 22 ++ > arch/x86/kvm/cpuid.c | 3 ++- > arch/x86/kvm/x86.c | 4 > drivers/char/random.c| 14 +- > include/linux/random.h | 9 + > 10 files changed, 116 insertions(+), 2 deletions(-) > create mode 100644 arch/x86/include/asm/archslowrng.h > > -- > 1.9.3 > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Tue, Jul 15, 2014 at 08:54:01PM +0800, Tang Chen wrote: > On 07/15/2014 08:40 PM, Gleb Natapov wrote: > >On Tue, Jul 15, 2014 at 08:28:22PM +0800, Tang Chen wrote: > >>On 07/15/2014 08:09 PM, Gleb Natapov wrote: > >>>On Tue, Jul 15, 2014 at 01:52:40PM +0200, Jan Kiszka wrote: > >>.. > >>>> > >>>>I cannot follow your concerns yet. Specifically, how should > >>>>APIC_ACCESS_ADDR (the VMCS field, right?) change while L2 is running? We > >>>>currently pin/unpin on L1->L2/L2->L1, respectively. Or what do you mean? > >>>> > >>>I am talking about this case: > >>> if (cpu_has_secondary_exec_ctrls()) {a > >>> } else { > >>> exec_control |= > >>> SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; > >>> vmcs_write64(APIC_ACCESS_ADDR, > >>> page_to_phys(vcpu->kvm->arch.apic_access_page)); > >>> } > >>>We do not pin here. > >>> > >> > >>Hi Gleb, > >> > >> > >>7905 if (exec_control& > >>SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) { > >>.. > >>7912 if (vmx->nested.apic_access_page) /* shouldn't > >>happen */ > >>7913 nested_release_page(vmx->nested.apic_access_page); > >>7914 vmx->nested.apic_access_page = > >>7915 nested_get_page(vcpu, > >>vmcs12->apic_access_addr); > >> > >>I thought you were talking about the problem here. We pin > >>vmcs12->apic_access_addr > >>in memory. And I think we should do the same thing to this page as to L1 vm. > >>Right ? > >Nested kvm pins a lot of pages, it will probably be not easy to handle all > >of them, > >so for now I am concerned with non nested case only (but nested should > >continue to > >work obviously, just pin pages like it does now). > > True. I will work on it. > > And also, when using PCI passthrough, kvm_pin_pages() also pins some pages. > This is > also in my todo list. Those pages are (almost) directly accessible by assigned PCI devices, I am not sure this is even doable. > > But sorry, a little strange. I didn't find where vmcs12->apic_access_addr is > allocated > or initialized... Would you please tell me ? 
handle_vmwrite() writes it when guest is executing vmwrite(APIC_ACCESS_ADDR); > > > > >> > >>.. > >>7922 if (!vmx->nested.apic_access_page) > >>7923 exec_control&= > >>7924 ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; > >>7925 else > >>7926 vmcs_write64(APIC_ACCESS_ADDR, > >>7927 page_to_phys(vmx->nested.apic_access_page)); > >>7928 } else if > >>(vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) { > >>7929 exec_control |= > >>7930 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; > >>7931 vmcs_write64(APIC_ACCESS_ADDR, > >>7932 page_to_phys(vcpu->kvm->arch.apic_access_page)); > >>7933 } > >> > >>And yes, we have the problem you said here. We can migrate the page while L2 > >>vm is running. > >>So I think we should enforce L2 vm to exit to L1. Right ? > >> > >We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so > >if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for L1 > >too. > > > > apic pages for L2 and L1 are not the same page, right ? > If L2 guest enable apic access page then they are different, otherwise they are the same. > I think, just like we are doing in patch 5/5, we cannot wait for the next > L2->L1 vmexit. > We should enforce a L2->L1 vmexit in mmu_notifier, just like > make_all_cpus_request() does. > > Am I right ? > I do not see why forcing APIC_ACCESS_ADDR reload during L2->L1 exit is not enough. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Tue, Jul 15, 2014 at 03:10:15PM +0200, Jan Kiszka wrote: > On 2014-07-15 14:40, Gleb Natapov wrote: > >> > >> .. > >> 7922 if (!vmx->nested.apic_access_page) > >> 7923 exec_control &= > >> 7924 ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; > >> 7925 else > >> 7926 vmcs_write64(APIC_ACCESS_ADDR, > >> 7927 page_to_phys(vmx->nested.apic_access_page)); > >> 7928 } else if > >> (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) { > >> 7929 exec_control |= > >> 7930 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; > >> 7931 vmcs_write64(APIC_ACCESS_ADDR, > >> 7932 page_to_phys(vcpu->kvm->arch.apic_access_page)); > >> 7933 } > >> > >> And yes, we have the problem you said here. We can migrate the page while > >> L2 > >> vm is running. > >> So I think we should enforce L2 vm to exit to L1. Right ? > >> > > We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so > > if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for L1 > > too. > > How should this host-managed VMCS field possibly change while L2 is running? > That what "Do not pin apic access page in memory" patch is doing. It changes APIC_ACCESS_ADDR of a current vmcs. It kicks vcpu out of a guest mode of course, but vcpu may still be in L2 at this point, so only L2 vmcs will get new APIC_ACCESS_ADDR pointer, L1 vmcs will still have an old one. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Tue, Jul 15, 2014 at 08:28:22PM +0800, Tang Chen wrote:
> On 07/15/2014 08:09 PM, Gleb Natapov wrote:
> > On Tue, Jul 15, 2014 at 01:52:40PM +0200, Jan Kiszka wrote:
> ..
> > > I cannot follow your concerns yet. Specifically, how should
> > > APIC_ACCESS_ADDR (the VMCS field, right?) change while L2 is running? We
> > > currently pin/unpin on L1->L2/L2->L1, respectively. Or what do you mean?
> > >
> > I am talking about this case:
> >         if (cpu_has_secondary_exec_ctrls()) {
> >         } else {
> >                 exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> >                 vmcs_write64(APIC_ACCESS_ADDR,
> >                              page_to_phys(vcpu->kvm->arch.apic_access_page));
> >         }
> > We do not pin here.
> >
> Hi Gleb,
>
> 7905         if (exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) {
> ..
> 7912                 if (vmx->nested.apic_access_page) /* shouldn't happen */
> 7913                         nested_release_page(vmx->nested.apic_access_page);
> 7914                 vmx->nested.apic_access_page =
> 7915                         nested_get_page(vcpu, vmcs12->apic_access_addr);
>
> I thought you were talking about the problem here. We pin
> vmcs12->apic_access_addr in memory. And I think we should do the same thing
> to this page as to L1 vm. Right ?
>
Nested kvm pins a lot of pages, it will probably be not easy to handle all of
them, so for now I am concerned with non nested case only (but nested should
continue to work obviously, just pin pages like it does now).

> ..
> 7922                 if (!vmx->nested.apic_access_page)
> 7923                         exec_control &=
> 7924                                 ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> 7925                 else
> 7926                         vmcs_write64(APIC_ACCESS_ADDR,
> 7927                                 page_to_phys(vmx->nested.apic_access_page));
> 7928         } else if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
> 7929                 exec_control |=
> 7930                         SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> 7931                 vmcs_write64(APIC_ACCESS_ADDR,
> 7932                         page_to_phys(vcpu->kvm->arch.apic_access_page));
> 7933         }
>
> And yes, we have the problem you said here. We can migrate the page while L2
> vm is running. So I think we should enforce L2 vm to exit to L1. Right ?
>
We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so if
APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for L1 too.

--
	Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Tue, Jul 15, 2014 at 01:52:40PM +0200, Jan Kiszka wrote:
> On 2014-07-14 16:58, Gleb Natapov wrote:
> >>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>>> index ffbe557..7080eda 100644
> >>>> --- a/arch/x86/kvm/x86.c
> >>>> +++ b/arch/x86/kvm/x86.c
> >>>> @@ -5929,6 +5929,18 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
> >>>>  	kvm_apic_update_tmr(vcpu, tmr);
> >>>>  }
> >>>>
> >>>> +static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> >>>> +{
> >>>> +	/*
> >>>> +	 * When the page is being migrated, GUP will wait till the migrate
> >>>> +	 * entry is replaced with the new pte entry pointing to the new page.
> >>>> +	 */
> >>>> +	struct page *page = gfn_to_page_no_pin(vcpu->kvm,
> >>>> +				APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> >>> If you do not use kvm->arch.apic_access_page to get current address why
> >>> not drop it entirely?
> >>>
> >> I should also update kvm->arch.apic_access_page here. It is used in other
> >> places in kvm, so I don't think we should drop it. Will update the patch.
> >>
> > What other places? The only other place I see is in nested kvm code and
> > you can call gfn_to_page_no_pin() there instead of using
> > kvm->arch.apic_access_page directly. But as far as I see nested kvm code
> > cannot handle change of APIC_ACCESS_ADDR phys address. If APIC_ACCESS_ADDR
> > changes during nested guest run, non nested vmcs will still have old
> > physical address. One way to fix that is to set KVM_REQ_APIC_PAGE_RELOAD
> > during nested exit.
>
> I cannot follow your concerns yet. Specifically, how should
> APIC_ACCESS_ADDR (the VMCS field, right?) change while L2 is running? We
> currently pin/unpin on L1->L2/L2->L1, respectively. Or what do you mean?
>
I am talking about this case:
        if (cpu_has_secondary_exec_ctrls()) {
        } else {
                exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
                vmcs_write64(APIC_ACCESS_ADDR,
                             page_to_phys(vcpu->kvm->arch.apic_access_page));
        }
We do not pin here.

--
	Gleb.
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Tue, Jul 15, 2014 at 03:10:15PM +0200, Jan Kiszka wrote:
> On 2014-07-15 14:40, Gleb Natapov wrote:
> > > ..
> > > 7922                 if (!vmx->nested.apic_access_page)
> > > 7923                         exec_control &=
> > > 7924                                 ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> > > 7925                 else
> > > 7926                         vmcs_write64(APIC_ACCESS_ADDR,
> > > 7927                                 page_to_phys(vmx->nested.apic_access_page));
> > > 7928         } else if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
> > > 7929                 exec_control |=
> > > 7930                         SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> > > 7931                 vmcs_write64(APIC_ACCESS_ADDR,
> > > 7932                         page_to_phys(vcpu->kvm->arch.apic_access_page));
> > > 7933         }
> > >
> > > And yes, we have the problem you said here. We can migrate the page
> > > while L2 vm is running. So I think we should enforce L2 vm to exit to
> > > L1. Right ?
> > >
> > We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so
> > if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for
> > L1 too.
>
> How should this host-managed VMCS field possibly change while L2 is running?
>
That's what the "Do not pin apic access page in memory" patch is doing. It
changes APIC_ACCESS_ADDR of a current vmcs. It kicks vcpu out of a guest mode
of course, but vcpu may still be in L2 at this point, so only L2 vmcs will get
new APIC_ACCESS_ADDR pointer, L1 vmcs will still have an old one.

--
	Gleb.
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Tue, Jul 15, 2014 at 08:54:01PM +0800, Tang Chen wrote:
> On 07/15/2014 08:40 PM, Gleb Natapov wrote:
> > On Tue, Jul 15, 2014 at 08:28:22PM +0800, Tang Chen wrote:
> > > On 07/15/2014 08:09 PM, Gleb Natapov wrote:
> > > > On Tue, Jul 15, 2014 at 01:52:40PM +0200, Jan Kiszka wrote:
> > > > ..
> > > > > I cannot follow your concerns yet. Specifically, how should
> > > > > APIC_ACCESS_ADDR (the VMCS field, right?) change while L2 is
> > > > > running? We currently pin/unpin on L1->L2/L2->L1, respectively. Or
> > > > > what do you mean?
> > > > >
> > > > I am talking about this case:
> > > >         if (cpu_has_secondary_exec_ctrls()) {
> > > >         } else {
> > > >                 exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> > > >                 vmcs_write64(APIC_ACCESS_ADDR,
> > > >                              page_to_phys(vcpu->kvm->arch.apic_access_page));
> > > >         }
> > > > We do not pin here.
> > > >
> > > Hi Gleb,
> > >
> > > 7905         if (exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) {
> > > ..
> > > 7912                 if (vmx->nested.apic_access_page) /* shouldn't happen */
> > > 7913                         nested_release_page(vmx->nested.apic_access_page);
> > > 7914                 vmx->nested.apic_access_page =
> > > 7915                         nested_get_page(vcpu, vmcs12->apic_access_addr);
> > >
> > > I thought you were talking about the problem here. We pin
> > > vmcs12->apic_access_addr in memory. And I think we should do the same
> > > thing to this page as to L1 vm. Right ?
> > >
> > Nested kvm pins a lot of pages, it will probably be not easy to handle
> > all of them, so for now I am concerned with non nested case only (but
> > nested should continue to work obviously, just pin pages like it does
> > now).
>
> True. I will work on it. And also, when using PCI passthrough,
> kvm_pin_pages() also pins some pages. This is also in my todo list.
>
Those pages are (almost) directly accessible by assigned PCI devices, I am
not sure this is even doable.

> But sorry, a little strange. I didn't find where vmcs12->apic_access_addr
> is allocated or initialized... Would you please tell me ?
>
handle_vmwrite() writes it when guest is executing vmwrite(APIC_ACCESS_ADDR);

> > > ..
> > > 7922                 if (!vmx->nested.apic_access_page)
> > > 7923                         exec_control &=
> > > 7924                                 ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> > > 7925                 else
> > > 7926                         vmcs_write64(APIC_ACCESS_ADDR,
> > > 7927                                 page_to_phys(vmx->nested.apic_access_page));
> > > 7928         } else if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
> > > 7929                 exec_control |=
> > > 7930                         SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> > > 7931                 vmcs_write64(APIC_ACCESS_ADDR,
> > > 7932                         page_to_phys(vcpu->kvm->arch.apic_access_page));
> > > 7933         }
> > >
> > > And yes, we have the problem you said here. We can migrate the page
> > > while L2 vm is running. So I think we should enforce L2 vm to exit to
> > > L1. Right ?
> > >
> > We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation,
> > so if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded
> > for L1 too.
>
> apic pages for L2 and L1 are not the same page, right ?
>
If L2 guest enable apic access page then they are different, otherwise they
are the same.

> I think, just like we are doing in patch 5/5, we cannot wait for the next
> L2->L1 vmexit. We should enforce a L2->L1 vmexit in mmu_notifier, just like
> make_all_cpus_request() does. Am I right ?
>
I do not see why forcing APIC_ACCESS_ADDR reload during L2->L1 exit is not
enough.

--
	Gleb.
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
CCing Jan to check my nested kvm findings below.

On Mon, Jul 14, 2014 at 03:57:09PM +0800, Tang Chen wrote:
> Hi Gleb,
>
> Thanks for the reply. Please see below.
>
> On 07/12/2014 04:04 PM, Gleb Natapov wrote:
> > On Tue, Jul 08, 2014 at 09:01:32PM +0800, Tang Chen wrote:
> > > apic access page is pinned in memory. As a result, it cannot be
> > > migrated/hot-removed. Actually, it is not necessary to be pinned.
> > >
> > > The hpa of apic access page is stored in VMCS APIC_ACCESS_ADDR pointer.
> > > When the page is migrated, kvm_mmu_notifier_invalidate_page() will
> > > invalidate the corresponding ept entry. This patch introduces a new vcpu
> > > request named KVM_REQ_APIC_PAGE_RELOAD, and makes this request to all
> > > the vcpus at this time, and force all the vcpus exit guest, and re-enter
> > > guest till they updates the VMCS APIC_ACCESS_ADDR pointer to the new
> > > apic access page address, and updates kvm->arch.apic_access_page to the
> > > new page.
> > >
> > By default kvm Linux guest uses x2apic, so APIC_ACCESS_ADDR mechanism
> > is not used since no MMIO access to APIC is ever done. Have you tested
> > this with "-cpu modelname,-x2apic" qemu flag?
>
> I used the following commandline to test the patches.
>
> # /usr/libexec/qemu-kvm -m 512M -hda /home/tangchen/xxx.img -enable-kvm -smp 2
>
That most likely uses x2apic.

> And I think the guest used APIC_ACCESS_ADDR mechanism because the previous
> patch-set has some problem which will happen when the apic page is accessed.
> And it did happen.
>
> I'll test this patch-set with "-cpu modelname,-x2apic" flag.
>
Replace "modelname" with one of supported model names such as nehalem of
course :)

> > > Signed-off-by: Tang Chen
> > > ---
> > >  arch/x86/include/asm/kvm_host.h |  1 +
> > >  arch/x86/kvm/mmu.c              | 11 +++
> > >  arch/x86/kvm/svm.c              |  6 ++
> > >  arch/x86/kvm/vmx.c              |  8 +++-
> > >  arch/x86/kvm/x86.c              | 14 ++
> > >  include/linux/kvm_host.h        |  2 ++
> > >  virt/kvm/kvm_main.c             | 12
> > >  7 files changed, 53 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 62f973e..9ce6bfd 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -737,6 +737,7 @@ struct kvm_x86_ops {
> > >  	void (*hwapic_isr_update)(struct kvm *kvm, int isr);
> > >  	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
> > >  	void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
> > > +	void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa);
> > >  	void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
> > >  	void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
> > >  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > index 9314678..551693d 100644
> > > --- a/arch/x86/kvm/mmu.c
> > > +++ b/arch/x86/kvm/mmu.c
> > > @@ -3427,6 +3427,17 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
> > >  			 level, gfn, pfn, prefault);
> > >  	spin_unlock(&vcpu->kvm->mmu_lock);
> > >
> > > +	/*
> > > +	 * apic access page could be migrated. When the guest tries to access
> > > +	 * the apic access page, ept violation will occur, and we can use GUP
> > > +	 * to find the new page.
> > > +	 *
> > > +	 * GUP will wait till the migrate entry be replaced with the new page.
> > > +	 */
> > > +	if (gpa == APIC_DEFAULT_PHYS_BASE)
> > > +		vcpu->kvm->arch.apic_access_page = gfn_to_page_no_pin(vcpu->kvm,
> > > +			APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> > Shouldn't you make KVM_REQ_APIC_PAGE_RELOAD request here?
> >
> I don't think we need to make KVM_REQ_APIC_PAGE_RELOAD request here.
>
> In kvm_mmu_notifier_invalidate_page() I made the request. And the handler
> called gfn_to_page_no_pin() to get the new page, which will wait till the
> migration finished. And then updated the VMCS APIC_ACCESS_ADDR pointer. So,
> when the vcpus were forced to exit the guest mode, they would wait till the
> VMCS APIC_ACCESS_ADDR pointer was updated.
>
> As a result, we don't need to make the request here.
>
OK, I do not see what's the purpose of the code here then.
Re: [RESEND PATCH v2 4/5] kvm: Remove ept_identity_pagetable from struct kvm_arch.
On Mon, Jul 14, 2014 at 05:17:04PM +0800, Tang Chen wrote:
> On 07/12/2014 03:44 PM, Gleb Natapov wrote:
> > On Wed, Jul 09, 2014 at 10:08:03AM +0800, Tang Chen wrote:
> > > kvm_arch->ept_identity_pagetable holds the ept identity pagetable page.
> > > But it is never used to refer to the page at all.
> > >
> > > In vcpu initialization, it indicates two things:
> > > 1. indicates if ept page is allocated
> > > 2. indicates if a memory slot for identity page is initialized
> > >
> > > Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the
> > > ept identity pagetable is initialized. So we can remove
> > > ept_identity_pagetable.
> > >
> > > Signed-off-by: Tang Chen
> > > ---
> > >  arch/x86/include/asm/kvm_host.h |  1 -
> > >  arch/x86/kvm/vmx.c              | 25 +++--
> > >  2 files changed, 11 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 4931415..62f973e 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -578,7 +578,6 @@ struct kvm_arch {
> > >
> > >  	gpa_t wall_clock;
> > >
> > > -	struct page *ept_identity_pagetable;
> > >  	bool ept_identity_pagetable_done;
> > >  	gpa_t ept_identity_map_addr;
> > >
> > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > > index 0918635e..fe2e5f4 100644
> > > --- a/arch/x86/kvm/vmx.c
> > > +++ b/arch/x86/kvm/vmx.c
> > > @@ -741,6 +741,7 @@ static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu *vcpu);
> > >  static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx);
> > >  static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx);
> > >  static bool vmx_mpx_supported(void);
> > > +static int alloc_identity_pagetable(struct kvm *kvm);
> > >
> > >  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
> > >  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> > > @@ -3921,21 +3922,21 @@ out:
> > >
> > >  static int init_rmode_identity_map(struct kvm *kvm)
> > >  {
> > > -	int i, idx, r, ret;
> > > +	int i, idx, r, ret = 0;
> > >  	pfn_t identity_map_pfn;
> > >  	u32 tmp;
> > >
> > >  	if (!enable_ept)
> > >  		return 1;
> > > -	if (unlikely(!kvm->arch.ept_identity_pagetable)) {
> > > -		printk(KERN_ERR "EPT: identity-mapping pagetable "
> > > -			"haven't been allocated!\n");
> > > -		return 0;
> > > -	}
> > >  	if (likely(kvm->arch.ept_identity_pagetable_done))
> > >  		return 1;
> > > -	ret = 0;
> > >  	identity_map_pfn = kvm->arch.ept_identity_map_addr >> PAGE_SHIFT;
> > > +
> > > +	mutex_lock(&kvm->slots_lock);
> > Why move this out of alloc_identity_pagetable()?
> >
> Referring to the original code, I think mutex_lock(&kvm->slots_lock) is used
> to protect kvm->arch.ept_identity_pagetable. If two or more threads try to
> modify it at the same time, the mutex ensures that the identity table is
> only allocated once.
>
> Now, we dropped kvm->arch.ept_identity_pagetable. And use
> kvm->arch.ept_identity_pagetable_done to check if the identity table is
> allocated and initialized. So we should protect memory slot operation in
> alloc_identity_pagetable() and kvm->arch.ept_identity_pagetable_done with
> this mutex.
>
> Of course, I can see that the name "slots_lock" indicates that it may be
> used to protect the memory slot operation only. Maybe move it out here is
> not suitable.
>
> If I'm wrong, please tell me.
>
No, you are right that besides memory slot creation slots_lock protects
checking of ept_identity_pagetable here, but after your patch
ept_identity_pagetable_done is tested outside of slots_lock so the allocation
can happen twice, no?

--
	Gleb.