Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
On Sun, Oct 11, 2015 at 12:19:54PM +0300, Michael S. Tsirkin wrote:
> > But since you must pass the same value to open(), you already know
> > that you're using noiommu.
> >
> > > VFIO_IOMMU_MAP_DMA, VFIO_IOMMU_ENABLE and VFIO_IOMMU_DISABLE
> > > will probably also fail ...
> >
> > Don't you have to call MAP_DMA to pin the memory?
>
> Well, check it out - the patch in question doesn't implement this ioctl.
> In fact it doesn't implement anything except CHECK_EXTENSION.
>
> And this makes sense to me: MAP_DMA maps a virtual address to an io
> address, and that doesn't work for the dummy iommu.
>
> You can pin memory using many other ways, including
> mlock and hugetlbfs.
>
mlock() does not pin memory.

--
	Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
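Since the no-iommu RFC implements nothing but CHECK_EXTENSION, the only probe userspace can do on such a container looks roughly like the sketch below. This is a hedged illustration: the ioctl numbers follow the definitions in linux/vfio.h, but the extension ID for the no-iommu backend is an assumption here, since the RFC interface was still in flux at the time of this thread.

```c
#include <sys/ioctl.h>

/* ioctl numbers as defined in linux/vfio.h */
#define VFIO_TYPE            (';')
#define VFIO_BASE            100
#define VFIO_GET_API_VERSION _IO(VFIO_TYPE, VFIO_BASE + 0)
#define VFIO_CHECK_EXTENSION _IO(VFIO_TYPE, VFIO_BASE + 1)

/* Assumed extension ID for the no-iommu backend (the RFC was in flux). */
#define VFIO_NOIOMMU_IOMMU   8

/*
 * Ask a vfio container fd whether it supports the given extension.
 * VFIO_CHECK_EXTENSION returns > 0 for "supported"; 0 or a negative
 * value (including ioctl() failure on a bad fd) means "not supported".
 */
static int has_extension(int container_fd, int ext)
{
    int ret = ioctl(container_fd, VFIO_CHECK_EXTENSION, ext);
    return ret > 0;
}
```

A caller would open /dev/vfio/vfio and pass the resulting fd; every other ioctl (MAP_DMA and friends) is expected to fail on a no-iommu container.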
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 08:39:10PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2015 at 08:01:21PM +0300, Gleb Natapov wrote:
> > On Thu, Oct 08, 2015 at 07:43:04PM +0300, Michael S. Tsirkin wrote:
> > > On Thu, Oct 08, 2015 at 04:28:34PM +0300, Gleb Natapov wrote:
> > > > On Thu, Oct 08, 2015 at 04:20:04PM +0300, Michael S. Tsirkin wrote:
> > > > > On Thu, Oct 08, 2015 at 03:27:37PM +0300, Gleb Natapov wrote:
> > > > > > On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote:
> > > > > > > On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote:
> > > > > > > >
> > > > > > > > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote:
> > > > > > > > > On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote:
> > > > > > > > >> On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote:
> > > > > > > > >>> On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote:
> > > > > > > > >>>> It is good practice to defend against root oopsing the
> > > > > > > > >>>> kernel, but in some cases it cannot be achieved.
> > > > > > > > >>> Absolutely. That's one of the issues with these patches.
> > > > > > > > >>> They don't even try where it's absolutely possible.
> > > > > > > > >>
> > > > > > > > >> Are you referring to blocking the maps of the msix BAR areas?
> > > > > > > > > For example. There are more. I listed some of the issues on
> > > > > > > > > the mailing list, and I might have missed some. VFIO has code
> > > > > > > > > to address all this; people should share code to avoid
> > > > > > > > > duplication, or at least read it to understand the issues.
> > > > > > > >
> > > > > > > > All but one of those are unrelated to the patch that adds msix
> > > > > > > > support.
> > > > > > >
> > > > > > > They are related because msix support enables bus mastering.
> > > > > > > Without it the device is passive and can't harm anyone. With it,
> > > > > > > suddenly you need to be very careful with the device to avoid
> > > > > > > corrupting kernel memory.
> > > > > >
> > > > > > Most (if not all) uio_pci_generic users enable pci bus mastering.
> > > > > > The fact that they do that without even tainting the kernel like
> > > > > > the patch does makes the current situation much worse than with
> > > > > > the patch.
> > > > >
> > > > > It isn't worse. It's a sane interface. Whoever enables bus mastering
> > > > > must be careful. If userspace enables bus mastering then userspace
> > > > > needs to be very careful with the device to avoid corrupting kernel
> > > > > memory. If the kernel does it, it's the kernel's responsibility.
> > > >
> > > > Although this definition of sanity sounds strange to me, let's go
> > > > with it for the sake of this email: would it be OK if the proposed
> > > > interface refused to work if bus mastering is not already enabled by
> > > > userspace?
> > >
> > > An interface could be acceptable if there's a fallback where it
> > > works without BM but slower (e.g. poll pending bits).
> >
> > OK.
> >
> > > But not the proposed one.
> >
> > Why? Greg is against the ioctl interface so it will be reworked, but
> > besides that, what is wrong with the concept of binding an msi-x
> > interrupt to an eventfd?
>
> It's not the binding. Managing msi-x just needs more than the puny
> 2 ioctls to get # of vectors and set eventfd.
>
> It interacts in strange ways with reset, and with PM, and ...
>
Sorry, I need examples of what you mean. DMA also "interacts in strange
ways with reset, and with PM, and ..." and it does not have any special
handling anywhere in uio-generic.
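The eventfd binding debated above - userspace gets a file descriptor that becomes readable when an MSI-X vector fires - can be illustrated without any driver at all, since eventfd semantics are plain POSIX-ish Linux. The sketch below only simulates the kernel side with a write(); the ioctl that would hand the eventfd to uio_pci_generic is exactly the part under discussion and is not shown.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Wait for the eventfd to become readable and return the number of
 * signals accumulated since the last read.  An eventfd keeps a 64-bit
 * counter: each write() adds to it, and a read() returns the current
 * value and resets it to zero.  Returns 0 on timeout or error.
 */
static uint64_t wait_for_irq(int efd, int timeout_ms)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    if (poll(&pfd, 1, timeout_ms) != 1)
        return 0;

    uint64_t count = 0;
    if (read(efd, &count, sizeof(count)) != sizeof(count))
        return 0;
    return count;
}
```

In a real driver the kernel's interrupt handler would do the eventfd_signal(), and userspace would sit in poll()/epoll on the fd alongside its other descriptors; coalescing of multiple interrupts into one wakeup falls out of the counter semantics for free.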
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 07:43:04PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2015 at 04:28:34PM +0300, Gleb Natapov wrote:
> > On Thu, Oct 08, 2015 at 04:20:04PM +0300, Michael S. Tsirkin wrote:
> > > On Thu, Oct 08, 2015 at 03:27:37PM +0300, Gleb Natapov wrote:
> > > > On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote:
> > > > > On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote:
> > > > > >
> > > > > > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote:
> > > > > > > On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote:
> > > > > > >> On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote:
> > > > > > >>> On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote:
> > > > > > >>>> It is good practice to defend against root oopsing the
> > > > > > >>>> kernel, but in some cases it cannot be achieved.
> > > > > > >>> Absolutely. That's one of the issues with these patches. They
> > > > > > >>> don't even try where it's absolutely possible.
> > > > > > >>
> > > > > > >> Are you referring to blocking the maps of the msix BAR areas?
> > > > > > > For example. There are more. I listed some of the issues on the
> > > > > > > mailing list, and I might have missed some. VFIO has code to
> > > > > > > address all this; people should share code to avoid duplication,
> > > > > > > or at least read it to understand the issues.
> > > > > >
> > > > > > All but one of those are unrelated to the patch that adds msix
> > > > > > support.
> > > > >
> > > > > They are related because msix support enables bus mastering. Without
> > > > > it the device is passive and can't harm anyone. With it, suddenly
> > > > > you need to be very careful with the device to avoid corrupting
> > > > > kernel memory.
> > > >
> > > > Most (if not all) uio_pci_generic users enable pci bus mastering. The
> > > > fact that they do that without even tainting the kernel like the patch
> > > > does makes the current situation much worse than with the patch.
> > >
> > > It isn't worse. It's a sane interface. Whoever enables bus mastering
> > > must be careful. If userspace enables bus mastering then userspace
> > > needs to be very careful with the device to avoid corrupting kernel
> > > memory. If the kernel does it, it's the kernel's responsibility.
> >
> > Although this definition of sanity sounds strange to me, let's go
> > with it for the sake of this email: would it be OK if the proposed
> > interface refused to work if bus mastering is not already enabled by
> > userspace?
>
> An interface could be acceptable if there's a fallback where it
> works without BM but slower (e.g. poll pending bits).
>
OK.

> But not the proposed one.
>
Why? Greg is against the ioctl interface so it will be reworked, but
besides that, what is wrong with the concept of binding an msi-x
interrupt to an eventfd?

> Really, there's more to making msi-x work with
> userspace drivers than this patch. As I keep telling people, you would
> basically reimplement vfio/pci. Go over it, and see for yourself.
> Almost everything it does is relevant for msi-x. It's just wrong to
> duplicate so much code.
>
The patch is tested and works with msi-x. Restricting access to msi-x
registers, which vfio does, is not relevant here.

--
	Gleb.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 04:20:04PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2015 at 03:27:37PM +0300, Gleb Natapov wrote:
> > On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote:
> > > On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote:
> > > >
> > > > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote:
> > > > > On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote:
> > > > >> On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote:
> > > > >>> On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote:
> > > > >>>> It is good practice to defend against root oopsing the kernel,
> > > > >>>> but in some cases it cannot be achieved.
> > > > >>> Absolutely. That's one of the issues with these patches. They
> > > > >>> don't even try where it's absolutely possible.
> > > > >>
> > > > >> Are you referring to blocking the maps of the msix BAR areas?
> > > > > For example. There are more. I listed some of the issues on the
> > > > > mailing list, and I might have missed some. VFIO has code to
> > > > > address all this; people should share code to avoid duplication,
> > > > > or at least read it to understand the issues.
> > > >
> > > > All but one of those are unrelated to the patch that adds msix
> > > > support.
> > >
> > > They are related because msix support enables bus mastering. Without it
> > > the device is passive and can't harm anyone. With it, suddenly you need
> > > to be very careful with the device to avoid corrupting kernel memory.
> >
> > Most (if not all) uio_pci_generic users enable pci bus mastering. The
> > fact that they do that without even tainting the kernel like the patch
> > does makes the current situation much worse than with the patch.
>
> It isn't worse. It's a sane interface. Whoever enables bus mastering
> must be careful. If userspace enables bus mastering then userspace
> needs to be very careful with the device to avoid corrupting kernel
> memory. If the kernel does it, it's the kernel's responsibility.
>
Although this definition of sanity sounds strange to me, let's go with
it for the sake of this email: would it be OK if the proposed interface
refused to work if bus mastering is not already enabled by userspace?

--
	Gleb.
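Gleb's proposal - have the new interface refuse to work unless userspace has already enabled bus mastering - is cheap to check, because the Bus Master flag is bit 2 of the 16-bit command register at offset 4 of PCI config space, which sysfs exposes at /sys/bus/pci/devices/<BDF>/config. A hedged userspace sketch (the sysfs path in the usage note is illustrative; a kernel-side check would read the command register directly instead):

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Return 1 if the Bus Master bit is set in the given PCI config-space
 * file, 0 otherwise or on any error.  The command register is the
 * little-endian 16-bit word at offset 4; Bus Master is bit 2 (0x4),
 * so it lives in the low byte.
 */
static int bus_master_enabled(const char *config_path)
{
    FILE *f = fopen(config_path, "rb");
    if (!f)
        return 0;

    uint8_t cmd[2];
    int ok = fseek(f, 4, SEEK_SET) == 0 && fread(cmd, 1, 2, f) == 2;
    fclose(f);

    return ok && (cmd[0] & 0x4) != 0;
}
```

Usage would look like `bus_master_enabled("/sys/bus/pci/devices/0000:00:02.0/config")`; note that writing that bit through sysfs config space is itself how userspace drivers typically enable bus mastering without kernel help.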
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote:
> >
> > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote:
> > > On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote:
> > >> On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote:
> > >>> On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote:
> > >>>> It is good practice to defend against root oopsing the kernel,
> > >>>> but in some cases it cannot be achieved.
> > >>> Absolutely. That's one of the issues with these patches. They
> > >>> don't even try where it's absolutely possible.
> > >>
> > >> Are you referring to blocking the maps of the msix BAR areas?
> > > For example. There are more. I listed some of the issues on the
> > > mailing list, and I might have missed some. VFIO has code to
> > > address all this; people should share code to avoid duplication,
> > > or at least read it to understand the issues.
> >
> > All but one of those are unrelated to the patch that adds msix support.
>
> They are related because msix support enables bus mastering. Without it
> the device is passive and can't harm anyone. With it, suddenly you need
> to be very careful with the device to avoid corrupting kernel memory.
>
Most (if not all) uio_pci_generic users enable pci bus mastering. The
fact that they do that without even tainting the kernel like the patch
does makes the current situation much worse than with the patch.

> > I can't comment on iommu overhead; for my use case it is likely
> > negligible and we will use an iommu when available; but apparently it
> > matters for others.
>
> You and Vlad are the only ones who brought this up.
> So maybe you should not bring it up anymore.
>
Come on, you were CCed on at least this one:

  "We have a solution that makes use of IOMMU support with vfio. The
  problem is there are multiple cases where that support is either not
  available, or using the IOMMU provides excess overhead."

  http://dpdk.org/ml/archives/dev/2015-October/024560.html

--
	Gleb.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 12:38:28PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2015 at 10:59:10AM +0300, Gleb Natapov wrote:
> > I do not remember this being an issue when uio_generic was accepted
> > into the kernel. The reason was that it was meant to be accessible by
> > root only.
>
> No - because it does not need bus mastering. So it can be used safely
> with some devices.
>
It still can be used safely with the same devices. Admittedly I did not
look closely, but I am sure the patch does not enable bus mastering if
an MSI interrupt is not requested. If that is not so, it can be fixed.
But more importantly, it can be used unsafely in its current state. Not
only can it be, it is widely used so.

> [mst@robin linux]$ git grep pci_set_master | wc -l
> 533
> [mst@robin linux]$ git grep pci_enable | wc -l
> 1597
>
> Looks like about 2/3 of devices don't need to be bus masters.
>
> It's up to the admin not to bind it to devices, and that is unfortunate,
> but manually binding an incorrect driver to a device is generally
> a hard problem to solve.
>
> > > There's also drivers/vfio/virqfd.c which deals
> > > with sending interrupts over eventfds correctly.
> >
> > As opposed to this patch, which deals with them incorrectly? In what way?
>
> cleanup on fd close is not handled.
>
Have you commented about this on the patch and it was not fixed?

--
	Gleb.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 11:32:50AM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote:
> > On 08/10/15 00:05, Michael S. Tsirkin wrote:
> > > On Wed, Oct 07, 2015 at 07:39:16PM +0300, Avi Kivity wrote:
> > >> That's what I thought as well, but apparently adding msix support
> > >> to the already insecure uio drivers is even worse.
> > > I'm glad you finally agree what these drivers are doing is insecure.
> > >
> > > And basically the kernel cares about security; no one wants to
> > > maintain insecure stuff.
> > >
> > > So you guys should think harder whether this code makes any sense
> > > upstream.
> >
> > You simply ignore everything I write, cherry-picking the word
> > "insecure" as if it makes your point. That is very frustrating.
>
> And I'm sorry about the frustration. I didn't intend to twist your
> words. It's just that I had to spend literally hours trying to explain
> that security matters in the kernel, and all I was getting back was a
> summary "there's no security issue because there are other ways to
> corrupt memory".
>
That's not the (only) answer that you were given. The answer that you
constantly ignore is that the patch in question does not add any new
ways to corrupt memory which are not possible using the _upstream_
uio_pci_generic device, so the fact that uio_pci_generic can corrupt
memory cannot be used as a reason not to apply patches that do not
corrupt any memory. You seem to be constantly arguing that
uio_pci_generic itself is not suitable for upstream.

--
	Gleb.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 10:41:53AM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2015 at 07:19:13AM +0300, Gleb Natapov wrote:
> > Well, the alternative is to add /dev/vfio/nommu like you've said, but
> > what would be the difference between this and uio eludes me.
>
> Are you familiar with vfio, that you ask such a question?
>
Yes, I am, and I do not see anything of value that vfio can add to a
nommu setup besides complexity, but I do see why it would have to have
a special interface not applicable to regular vfio (hint: there is no
HW to translate virtual addresses to physical) and why it would have to
be accessible to the root user only.

> Here's the vfio pci code:
>
> $ wc -l drivers/vfio/pci/*
>    27 drivers/vfio/pci/Kconfig
>     4 drivers/vfio/pci/Makefile
>  1217 drivers/vfio/pci/vfio_pci.c
>  1602 drivers/vfio/pci/vfio_pci_config.c
>   675 drivers/vfio/pci/vfio_pci_intrs.c
>    92 drivers/vfio/pci/vfio_pci_private.h
>   238 drivers/vfio/pci/vfio_pci_rdwr.c
>  3855 total
>
> There's some code dealing with iommu groups in
> drivers/vfio/pci/vfio_pci.c, but most of it is validating input and
> presenting a consistent interface to userspace.
>
What does this have to do with the patch series in question? The
unpatched uio_generic code does not validate input either. If you think
it should, by all means write the code (without breaking existing use
cases while doing so), but the patch under discussion does not even
access the pci device from userspace, so it will not be affected by
said filtering.

> This is exactly what's missing here.
>
It is not missing in this patch series, it is missing from the upstream
code. I do not remember this being an issue when uio_generic was
accepted into the kernel. The reason was that it was meant to be
accessible by root only. VFIO was designed from the ground up to be
used by a regular user, so obviously unrestricted access to pci config
space was out of the question. Different use cases lead to different
designs, how surprising.

> There's also drivers/vfio/virqfd.c which deals
> with sending interrupts over eventfds correctly.
>
As opposed to this patch, which deals with them incorrectly? In what way?

--
	Gleb.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 11:32:50AM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote: > > On 08/10/15 00:05, Michael S. Tsirkin wrote: > > >On Wed, Oct 07, 2015 at 07:39:16PM +0300, Avi Kivity wrote: > > >>That's what I thought as well, but apparently adding msix support to the > > >>already insecure uio drivers is even worse. > > >I'm glad you finally agree what these drivers are doing is insecure. > > > > > >And basically kernel cares about security, no one wants to maintain > > >insecure stuff. > > > > > >So you guys should think harder whether this code makes any sense upstream. > > > > You simply ignore everything I write, cherry-picking the word "insecure" as > > if it makes your point. That is very frustrating. > > And I'm sorry about the frustration. I didn't intend to twist your > words. It's just that I had to spend literally hours trying to explain > that security matters in kernel, and all I was getting back was a > summary "there's no security issue because there are other way to > corrupt memory". > That's not the (only) answer that you were given. The answers that you constantly ignore is that the patch in question does not add any new ways to corrupt memory which are not possible using _upstream_ uio_pci_generic device, so the fact that uio_pci_generic can corrupt memory cannot be used as a reason to not apply patches that do not corrupt any memory. You seams to be constantly arguing that uio_pci_generic is not suitable for upstream. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 10:41:53AM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 08, 2015 at 07:19:13AM +0300, Gleb Natapov wrote: > > Well > > the alternative is to add /dev/vfio/nommu like you've said, but what > > would be the difference between this and uio eludes me. > > Are you familiar with vfio that you ask such a question? > Yes, I do and I do not see anything of value that vfio can add to nommu setup besides complexity, but I do see why it will have to have special interface not applicable to regular vfio (hint: there is not HW to translate virtual address to physical) and why it will have to be accessible to root user only. > Here's the vfio pci code: > > $ wc -l drivers/vfio/pci/* >27 drivers/vfio/pci/Kconfig > 4 drivers/vfio/pci/Makefile > 1217 drivers/vfio/pci/vfio_pci.c > 1602 drivers/vfio/pci/vfio_pci_config.c > 675 drivers/vfio/pci/vfio_pci_intrs.c >92 drivers/vfio/pci/vfio_pci_private.h > 238 drivers/vfio/pci/vfio_pci_rdwr.c > 3855 total > > There's some code dealing with iommu groups in > drivers/vfio/pci/vfio_pci.c, > but most of it is validating input and > presenting a consistent interface to userspace. > What is has to do with the patch series in question? Non patched uio_generic code does not validate input. If you think it should by all means write the code (don't break existing use cases while doing so), but the patch under discussion does not even access pci device from userspace, so it will not be affected by said filtering. > This is exactly what's missing here. It is not missing in this patch series, it is missing from upstream code. I do not remember this been an issue when uio_generic was accepted into the kernel. The reason was because it meant to be accessible by root only. VFIO was designed to be used by regular user from ground up, so obviously unrestricted access to pci space was out of the question. Different use cases lead to different designs, how surprising. 
> > There's also drivers/vfio/virqfd.c which deals > with sending interrupts over eventfds correctly. > As opposite to this patch that deals with them incorrectly? In what way? -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 12:38:28PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 08, 2015 at 10:59:10AM +0300, Gleb Natapov wrote: > > I do not remember this been an issue when uio_generic was accepted > > into the kernel. The reason was because it meant to be accessible by root > > only. > > No - because it does not need bus mastering. So it can be used safely > with some devices. > It still can be used safely with same devices. Admittedly I did not look close, but I am sure the patch does not enable bus mastering if MSI interrupt is not requested. If not, well that can be fixed. But more importantly it can be used unsafely in its current state. Not only can, it is widely used so. > [mst@robin linux]$ git grep pci_set_master|wc -l 533 > [mst@robin linux]$ git grep pci_enable|wc -l 1597 > > Looks like about 2/3 devices don't need to be bus masters. > > It's up to admin not to bind it to devices, and that is unfortunate, > but manually binding an incorrect driver to a device is generally > a hard problem to solve. > > > > There's also drivers/vfio/virqfd.c which deals > > > with sending interrupts over eventfds correctly. > > > > > As opposite to this patch that deals with them incorrectly? In what way? > > cleanup on fd close is not handled. > Have you commented about this on the patch and it was not fixed? -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote: > > > > > > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote: > > >On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote: > > >> > > >>On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote: > > >>>On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote: > > It is good practice to defend against root oopsing the kernel, but in > > some > > cases it cannot be achieved. > > >>>Absolutely. That's one of the issues with these patches. They don't even > > >>>try where it's absolutely possible. > > >>> > > >>Are you referring to blocking the maps of the msix BAR areas? > > >For example. There are more. I listed some of the issues on the mailing > > >list, and I might have missed some. VFIO has code to address all this, > > >people should share code to avoid duplication, or at least read it > > >to understand the issues. > > > > All but one of those are unrelated to the patch that adds msix support. > > They are related because msix support enables bus mastering. Without it > device is passive and can't harm anyone. With it, suddently you need to > be very careful with the device to avoid corrupting kernel memory. > Most (if not all) uio_pci_generic users enable pci bus mastering. The fact that they do that without even tainting the kernel like the patch does make current situation much worse that with the patch. > > I can't comment on iommu overhead; for my use case it is likely negligible > > and we will use an iommu when available; but apparently it matters for > > others. > > You and Vlad are the only ones who brought this up. > So maybe you should not bring it up anymore. > Common, you were CCed to at least this one: We have a solution that makes use of IOMMU support with vfio. The problem is there are multiple cases where that support is either not available, or using the IOMMU provides excess overhead. 
http://dpdk.org/ml/archives/dev/2015-October/024560.html -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 04:20:04PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 08, 2015 at 03:27:37PM +0300, Gleb Natapov wrote: > > On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote: > > > On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote: > > > > > > > > > > > > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote: > > > > >On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote: > > > > >> > > > > >>On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote: > > > > >>>On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote: > > > > >>>>It is good practice to defend against root oopsing the kernel, but > > > > >>>>in some > > > > >>>>cases it cannot be achieved. > > > > >>>Absolutely. That's one of the issues with these patches. They don't > > > > >>>even > > > > >>>try where it's absolutely possible. > > > > >>> > > > > >>Are you referring to blocking the maps of the msix BAR areas? > > > > >For example. There are more. I listed some of the issues on the mailing > > > > >list, and I might have missed some. VFIO has code to address all this, > > > > >people should share code to avoid duplication, or at least read it > > > > >to understand the issues. > > > > > > > > All but one of those are unrelated to the patch that adds msix support. > > > > > > They are related because msix support enables bus mastering. Without it > > > device is passive and can't harm anyone. With it, suddently you need to > > > be very careful with the device to avoid corrupting kernel memory. > > > > > Most (if not all) uio_pci_generic users enable pci bus mastering. The > > fact that they do that without even tainting the kernel like the patch > > does make current situation much worse that with the patch. > > It isn't worse. It's a sane interface. Whoever enables bus mastering > must be careful. If userspace enables bus mastering then userspace > needs to be very careful with the device to avoid corrupting kernel > memory. 
If kernel does it, it's kernel's responsibility. > Although this definition of sanity sounds strange to me, but lets flow with it for the sake of this email: would it be OK if proposed interface refused to work if bus mastering is not already enabled by userspace? -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 07:43:04PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 08, 2015 at 04:28:34PM +0300, Gleb Natapov wrote: > > On Thu, Oct 08, 2015 at 04:20:04PM +0300, Michael S. Tsirkin wrote: > > > On Thu, Oct 08, 2015 at 03:27:37PM +0300, Gleb Natapov wrote: > > > > On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote: > > > > > On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote: > > > > > > > > > > > > > > > > > > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote: > > > > > > >On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote: > > > > > > >> > > > > > > >>On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote: > > > > > > >>>On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote: > > > > > > >>>>It is good practice to defend against root oopsing the kernel, > > > > > > >>>>but in some > > > > > > >>>>cases it cannot be achieved. > > > > > > >>>Absolutely. That's one of the issues with these patches. They > > > > > > >>>don't even > > > > > > >>>try where it's absolutely possible. > > > > > > >>> > > > > > > >>Are you referring to blocking the maps of the msix BAR areas? > > > > > > >For example. There are more. I listed some of the issues on the > > > > > > >mailing > > > > > > >list, and I might have missed some. VFIO has code to address all > > > > > > >this, > > > > > > >people should share code to avoid duplication, or at least read it > > > > > > >to understand the issues. > > > > > > > > > > > > All but one of those are unrelated to the patch that adds msix > > > > > > support. > > > > > > > > > > They are related because msix support enables bus mastering. Without > > > > > it > > > > > device is passive and can't harm anyone. With it, suddently you need > > > > > to > > > > > be very careful with the device to avoid corrupting kernel memory. > > > > > > > > > Most (if not all) uio_pci_generic users enable pci bus mastering. 
The > > > > fact that they do that without even tainting the kernel like the patch > > > > does makes the current situation much worse than with the patch. > > > > > > It isn't worse. It's a sane interface. Whoever enables bus mastering > > > must be careful. If userspace enables bus mastering then userspace > > > needs to be very careful with the device to avoid corrupting kernel > > > memory. If kernel does it, it's kernel's responsibility. > > > > > Although this definition of sanity sounds strange to me, but lets > > flow with it for the sake of this email: would it be OK if proposed > > interface refused to work if bus mastering is not already enabled by > > userspace? > > An interface could be acceptable if there's a fallback where it > works without BM but slower (e.g. poll pending bits). > OK. > But not the proposed one. > Why? Greg is against the ioctl interface so it will be reworked, but besides that, what is wrong with the concept of binding an msi-x interrupt to an eventfd? > Really, there's more to making msi-x work with > userspace drivers than this patch. As I keep telling people, you would > basically reimplement vfio/pci. Go over it, and see for yourself. > Almost everything it does is relevant for msi-x. It's just wrong to > duplicate so much code. > The patch is tested and works with msi-x. Restricting access to msi-x registers, which vfio does, is not relevant here. -- Gleb.
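The concept being defended here — binding an MSI-X vector to an eventfd so a userspace driver can wait on interrupts — is easy to see in miniature. The sketch below is an illustration only, not the patch's actual ABI: in the real design the kernel interrupt handler would signal the eventfd (this is how vfio delivers MSI-X), and plain write() calls stand in for that here; both function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Drain the eventfd: in non-semaphore mode read() returns the
 * accumulated event count and resets it to zero. */
static uint64_t drain_events(int efd)
{
    uint64_t n = 0;
    if (read(efd, &n, sizeof(n)) != (ssize_t)sizeof(n))
        return 0;
    return n;
}

/* Model of the proposed flow: the kernel side would eventfd_signal()
 * this fd from its MSI-X handler; here three write()s play that role. */
static uint64_t demo_msix_eventfd(void)
{
    int efd = eventfd(0, EFD_NONBLOCK);
    uint64_t one = 1;

    write(efd, &one, sizeof(one));  /* "interrupt" 1 */
    write(efd, &one, sizeof(one));  /* "interrupt" 2 */
    write(efd, &one, sizeof(one));  /* "interrupt" 3 */

    uint64_t pending = drain_events(efd);
    close(efd);
    return pending;
}
```

Because the eventfd counter coalesces, a driver that reads 3 learns only that three interrupts fired since its last poll — which is exactly the edge-style semantics MSI-X gives you anyway.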
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 08:39:10PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 08, 2015 at 08:01:21PM +0300, Gleb Natapov wrote: > > On Thu, Oct 08, 2015 at 07:43:04PM +0300, Michael S. Tsirkin wrote: > > > On Thu, Oct 08, 2015 at 04:28:34PM +0300, Gleb Natapov wrote: > > > > On Thu, Oct 08, 2015 at 04:20:04PM +0300, Michael S. Tsirkin wrote: > > > > > On Thu, Oct 08, 2015 at 03:27:37PM +0300, Gleb Natapov wrote: > > > > > > On Thu, Oct 08, 2015 at 03:06:07PM +0300, Michael S. Tsirkin wrote: > > > > > > > On Thu, Oct 08, 2015 at 12:44:09PM +0300, Avi Kivity wrote: > > > > > > > > > > > > > > > > > > > > > > > > On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote: > > > > > > > > >On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote: > > > > > > > > >> > > > > > > > > >>On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote: > > > > > > > > >>>On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote: > > > > > > > > >>>>It is good practice to defend against root oopsing the > > > > > > > > >>>>kernel, but in some > > > > > > > > >>>>cases it cannot be achieved. > > > > > > > > >>>Absolutely. That's one of the issues with these patches. > > > > > > > > >>>They don't even > > > > > > > > >>>try where it's absolutely possible. > > > > > > > > >>> > > > > > > > > >>Are you referring to blocking the maps of the msix BAR areas? > > > > > > > > >For example. There are more. I listed some of the issues on > > > > > > > > >the mailing > > > > > > > > >list, and I might have missed some. VFIO has code to address > > > > > > > > >all this, > > > > > > > > >people should share code to avoid duplication, or at least > > > > > > > > >read it > > > > > > > > >to understand the issues. > > > > > > > > > > > > > > > > All but one of those are unrelated to the patch that adds msix > > > > > > > > support. > > > > > > > > > > > > > > They are related because msix support enables bus mastering. > > > > > > > Without it > > > > > > > device is passive and can't harm anyone. 
With it, suddently you > > > > > > > need to > > > > > > > be very careful with the device to avoid corrupting kernel memory. > > > > > > > > > > > > > Most (if not all) uio_pci_generic users enable pci bus mastering. > > > > > > The > > > > > > fact that they do that without even tainting the kernel like the > > > > > > patch > > > > > > does make current situation much worse that with the patch. > > > > > > > > > > It isn't worse. It's a sane interface. Whoever enables bus mastering > > > > > must be careful. If userspace enables bus mastering then userspace > > > > > needs to be very careful with the device to avoid corrupting kernel > > > > > memory. If kernel does it, it's kernel's responsibility. > > > > > > > > > Although this definition of sanity sounds strange to me, but lets > > > > flow with it for the sake of this email: would it be OK if proposed > > > > interface refused to work if bus mastering is not already enabled by > > > > userspace? > > > > > > An interface could be acceptable if there's a fallback where it > > > works without BM but slower (e.g. poll pending bits). > > > > > OK. > > > > > But not the proposed one. > > > > > Why? Greg is against ioctl interface so it will be reworked, by besides > > that what is wrong with the concept of binding msi-x interrupt to > > eventfd? > > It's not the binding. Managing msi-x just needs more than the puny > 2 ioctls to get # of vectors and set eventfd. > > It interacts in strange ways with reset, and with PM, and ... > Sorry, I need examples of what you mean. DMA also "interacts in strange ways with reset, and with PM, and ..." and it does not have any special handling anywhere in uio-generic. S
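The fallback mentioned above — "works without BM but slower (e.g. poll pending bits)" — refers to the MSI-X Pending Bit Array (PBA), which the device exposes in one of its BARs with one bit per vector, packed into 64-bit words. A minimal model of that polling follows; the PBA contents are faked in ordinary memory here, whereas a real driver would mmap() the BAR region, and the function names are invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* The MSI-X PBA packs one pending bit per vector into qwords.  A driver
 * that cannot (or will not) enable bus mastering could poll these bits
 * instead of taking interrupt notifications. */
static int msix_vector_pending(const volatile uint64_t *pba, unsigned int vec)
{
    return (int)((pba[vec / 64] >> (vec % 64)) & 1);
}

/* Fake PBA covering 128 vectors, with vector 66 pending (bit 2 of the
 * second qword) -- stands in for a region mmap()ed from the device BAR. */
static const uint64_t fake_pba[2] = { 0, 1ULL << 2 };

static int demo_poll(unsigned int vec)
{
    return msix_vector_pending(fake_pba, vec);
}
```
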
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On Thu, Oct 08, 2015 at 12:05:11AM +0300, Michael S. Tsirkin wrote: > On Wed, Oct 07, 2015 at 07:39:16PM +0300, Avi Kivity wrote: > > That's what I thought as well, but apparently adding msix support to the > > already insecure uio drivers is even worse. > > I'm glad you finally agree what these drivers are doing is insecure. > Michael, please stop this meaningless word play. The above is said in the context of a device that is meant to be accessible by regular users, and obviously for that purpose uio is insecure (in its current state, btw). If you give a user access to your root block device, that device will be insecure too — so according to your logic the block device is insecure? Pushing the code from uio to vfio means that vfio will have to implement access policy by itself - allow iommu mode to regular users, but no-iommu to root only. Implementing policy in the kernel is bad. Well, the alternative is to add /dev/vfio/nommu like you've said, but what the difference between that and uio would be eludes me. -- Gleb.
Re: [PATCH v3 0/3] uio: add MSI/MSI-X support to uio_pci_generic driver
On Tue, Oct 06, 2015 at 06:15:54PM +0300, Michael S. Tsirkin wrote: > > While it is possible that userspace malfunctions and accidentally programs > > MSI incorrectly, the risk is dwarfed by the ability of userspace to program > > DMA incorrectly. > > That seems to imply that for the upstream kernel this is not a valid usecase > at all. > Are you implying that uio_pci_generic should be removed from the upstream kernel? Avi did not describe anything that cannot be done with the upstream kernel right now. -- Gleb.
Re: [PATCH v3 1/3] uio: add ioctl support
On Tue, Oct 06, 2015 at 06:19:34PM +0300, Michael S. Tsirkin wrote: > On Tue, Oct 06, 2015 at 05:30:31PM +0300, Gleb Natapov wrote: > > On Tue, Oct 06, 2015 at 05:19:22PM +0300, Michael S. Tsirkin wrote: > > > On Tue, Oct 06, 2015 at 11:33:56AM +0300, Vlad Zolotarov wrote: > > > > the solution u propose should be a matter of a separate patch and is > > > > obviously orthogonal to this series. > > > > > > Doesn't work this way, sorry. You want a patch enabling MSI merged, > > > you need to secure the MSI configuration. > > > > > MSI can be enabled right now without the patch by writing directly into > > PCI bar. > > By poking at config registers in sysfs? We can block this, or we > can log this, pretty easily. We don't ATM but it's not hard to do. > Blocking this will break the userspace API. As a maintainer you should know that we do not break userspace APIs. Logging this is fine, but how exactly does it help with "security"? The patch in question already taints the kernel, which is much stronger than logging. > > The only thing this patch adds is forwarding the interrupt to > > an eventfd. > > This one just adds a bunch of ioctls. The next ones do > more than you describe. > Yes, it adds a bunch of ioctls to do exactly what I wrote above. What point were you trying to make with that statement? It eludes me. -- Gleb.
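For reference, "poking at config registers in sysfs" really is that simple, which is the point about the capability already being exposed to root. Per the PCI specification, the 16-bit command register sits at offset 0x04 of configuration space and bit 2 is Bus Master Enable. The snippet below only models the register arithmetic; the sysfs read-modify-write in the comment is a sketch, and the device address in it is a placeholder.

```c
#include <assert.h>
#include <stdint.h>

/* PCI configuration space, per the PCI spec: command register at
 * offset 0x04, Bus Master Enable is bit 2. */
#define PCI_COMMAND        0x04
#define PCI_COMMAND_MASTER 0x0004

static uint16_t enable_bus_master(uint16_t command)
{
    return (uint16_t)(command | PCI_COMMAND_MASTER);
}

/* A root process could already apply this through sysfs (sketch only;
 * "0000:00:00.0" is a placeholder device address):
 *
 *   int fd = open("/sys/bus/pci/devices/0000:00:00.0/config", O_RDWR);
 *   uint16_t cmd;
 *   pread(fd, &cmd, sizeof(cmd), PCI_COMMAND);
 *   cmd = enable_bus_master(cmd);
 *   pwrite(fd, &cmd, sizeof(cmd), PCI_COMMAND);
 */
```
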
Re: [PATCH v3 1/3] uio: add ioctl support
On Tue, Oct 06, 2015 at 05:19:22PM +0300, Michael S. Tsirkin wrote: > On Tue, Oct 06, 2015 at 11:33:56AM +0300, Vlad Zolotarov wrote: > > the solution u propose should be a matter of a separate patch and is > > obviously orthogonal to this series. > > Doesn't work this way, sorry. You want a patch enabling MSI merged, > you need to secure the MSI configuration. > MSI can be enabled right now without the patch by writing directly into PCI bar. The only thing this patch adds is forwarding the interrupt to an eventfd. -- Gleb.
Re: [PATCH] KVM: ia64: remove
On Wed, Nov 19, 2014 at 10:05:43PM +0100, Paolo Bonzini wrote: > KVM for ia64 has been marked as broken not just once, but twice even, > and the last patch from the maintainer is now roughly 5 years old. > Time for it to rest in piece. > Acked-by: Gleb Natapov Next step is to move ioapic bits into arch :) > Signed-off-by: Paolo Bonzini > --- > The patch was edited to keep its size decent, by dropping > all the removed lines from the deleted files. > > MAINTAINERS |9 - > arch/ia64/Kconfig |3 - > arch/ia64/Makefile |1 - > arch/ia64/include/asm/kvm_host.h| 609 -- > arch/ia64/include/asm/pvclock-abi.h | 48 - > arch/ia64/include/uapi/asm/kvm.h| 268 - > arch/ia64/kvm/Kconfig | 66 -- > arch/ia64/kvm/Makefile | 67 -- > arch/ia64/kvm/asm-offsets.c | 241 > arch/ia64/kvm/irq.h | 33 - > arch/ia64/kvm/kvm-ia64.c| 1942 -- > arch/ia64/kvm/kvm_fw.c | 674 --- > arch/ia64/kvm/kvm_lib.c | 21 - > arch/ia64/kvm/kvm_minstate.h| 266 - > arch/ia64/kvm/lapic.h | 30 - > arch/ia64/kvm/memcpy.S |1 - > arch/ia64/kvm/memset.S |1 - > arch/ia64/kvm/misc.h| 94 -- > arch/ia64/kvm/mmio.c| 336 -- > arch/ia64/kvm/optvfault.S | 1090 - > arch/ia64/kvm/process.c | 1024 > arch/ia64/kvm/trampoline.S | 1038 > arch/ia64/kvm/vcpu.c| 2209 > --- > arch/ia64/kvm/vcpu.h| 752 > arch/ia64/kvm/vmm.c | 99 -- > arch/ia64/kvm/vmm_ivt.S | 1392 -- > arch/ia64/kvm/vti.h | 290 - > arch/ia64/kvm/vtlb.c| 640 -- > virt/kvm/ioapic.c | 5 - > virt/kvm/ioapic.h | 1 - > virt/kvm/irq_comm.c | 22 - > 31 files changed, 13272 deletions(-) > delete mode 100644 arch/ia64/include/asm/kvm_host.h > delete mode 100644 arch/ia64/include/asm/pvclock-abi.h > delete mode 100644 arch/ia64/include/uapi/asm/kvm.h > delete mode 100644 arch/ia64/kvm/Kconfig > delete mode 100644 arch/ia64/kvm/Makefile > delete mode 100644 arch/ia64/kvm/asm-offsets.c > delete mode 100644 arch/ia64/kvm/irq.h > delete mode 100644 arch/ia64/kvm/kvm-ia64.c > delete mode 100644 arch/ia64/kvm/kvm_fw.c > delete mode 100644 arch/ia64/kvm/kvm_lib.c > delete mode 100644 
arch/ia64/kvm/kvm_minstate.h > delete mode 100644 arch/ia64/kvm/lapic.h > delete mode 100644 arch/ia64/kvm/memcpy.S > delete mode 100644 arch/ia64/kvm/memset.S > delete mode 100644 arch/ia64/kvm/misc.h > delete mode 100644 arch/ia64/kvm/mmio.c > delete mode 100644 arch/ia64/kvm/optvfault.S > delete mode 100644 arch/ia64/kvm/process.c > delete mode 100644 arch/ia64/kvm/trampoline.S > delete mode 100644 arch/ia64/kvm/vcpu.c > delete mode 100644 arch/ia64/kvm/vcpu.h > delete mode 100644 arch/ia64/kvm/vmm.c > delete mode 100644 arch/ia64/kvm/vmm_ivt.S > delete mode 100644 arch/ia64/kvm/vti.h > delete mode 100644 arch/ia64/kvm/vtlb.c > > diff --git a/MAINTAINERS b/MAINTAINERS > index a12edf2624e5..56705138ca74 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -5244,15 +5244,6 @@ S: Supported > F: arch/powerpc/include/asm/kvm* > F: arch/powerpc/kvm/ > > -KERNEL VIRTUAL MACHINE For Itanium (KVM/IA64) > -M: Xiantao Zhang > -L: kvm-i...@vger.kernel.org > -W: http://kvm.qumranet.com > -S: Supported > -F: Documentation/ia64/kvm.txt > -F: arch/ia64/include/asm/kvm* > -F: arch/ia64/kvm/ > - > KERNEL VIRTUAL MACHINE for s390 (KVM/s390) > M: Christian Borntraeger > M: Cornelia Huck > diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig > index c84c88d7..11afe7ab1981 100644 > --- a/arch/ia64/Kconfig > +++ b/arch/ia64/Kconfig > @@ -21,7 +21,6 @@ config IA64 > select HAVE_DYNAMIC_FTRACE if (!ITANIUM) > select HAVE_FUNCTION_TRACER > select HAVE_DMA_ATTRS > - select HAVE_KVM > select TTY > select HAVE_ARCH_TRACEHOOK > select HAVE_DMA_API_DEBUG > @@ -640,8 +639,6 @@ source "security/Kconfig" > > source "crypto/Kconfig" > > -source "arch/ia64/kvm/Kconfig" > - > source "lib/Kconfig" > > config IOMMU_HELPER > diff --git a/arch/ia64/Makefile b/arch/ia64/Makefile > index 5441b14994fc
Re: [PATCH] kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
On Mon, Sep 22, 2014 at 09:29:19PM +0200, Paolo Bonzini wrote: > On 22/09/2014 21:20, Christian Borntraeger wrote: > > "while using trinity to fuzz KVM, we noticed long stalls on invalid ioctls. > > Lets bail out early on invalid ioctls". or similar? > > Okay. David, can you explain how you found it so that I can make up my > mind? > > Gleb and Marcelo, a fourth and fifth opinion? :) > I agree with Christian that the simpler fix is better here. The overhead is minimal. If we ever notice this overhead we can revert the patch altogether, since the problem it fixes can only be inflicted on userspace by itself, and there are myriad other ways userspace can hurt itself. -- Gleb.
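The "bail out early on invalid ioctls" fix under discussion amounts to checking the ioctl number's type byte against KVM's namespace before taking the vcpu mutex. A userspace model of that predicate is below: KVMIO is 0xAE in include/uapi/linux/kvm.h, the type byte occupies bits 8–15 of an ioctl number in the Linux encoding, and the helper name here is invented for illustration.

```c
#include <assert.h>
#include <errno.h>

/* KVM's ioctl namespace: every KVM ioctl is built with type byte 0xAE. */
#define KVMIO 0xAE

/* Extract bits 8..15 of an ioctl number (the Linux _IOC_TYPE layout). */
#define IOC_TYPE(nr) (((nr) >> 8) & 0xff)

/* The early-bail check: reject requests from a foreign ioctl namespace
 * before ever touching vcpu->mutex, so a fuzzer spraying random ioctls
 * cannot stall legitimate vcpu threads behind the lock. */
static long vcpu_ioctl_precheck(unsigned int ioctl)
{
    if (IOC_TYPE(ioctl) != KVMIO)
        return -EINVAL;   /* bail out without taking the vcpu mutex */
    return 0;             /* fall through to the real handler */
}
```
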
Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem
On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote: > When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory > has been swapped out or is behind a filemap, this will trigger async > readahead and return immediately. The rationale is that KVM will kick > back the guest with an "async page fault" and allow for some other > guest process to take over. > > If async PFs are enabled the fault is retried asap from an async > workqueue. If not, it's retried immediately in the same code path. In > either case the retry will not relinquish the mmap semaphore and will > block on the IO. This is a bad thing, as other mmap semaphore users > now stall as a function of swap or filemap latency. > > This patch ensures both the regular and async PF path re-enter the > fault allowing for the mmap semaphore to be relinquished in the case > of IO wait. > Reviewed-by: Gleb Natapov > Reviewed-by: Radim Krčmář > Signed-off-by: Andres Lagar-Cavilla > > --- > v1 -> v2 > > * WARN_ON_ONCE -> VM_WARN_ON_ONCE > * pagep == NULL skips the final retry > * kvm_gup_retry -> kvm_gup_io > * Comment updates throughout > --- > include/linux/kvm_host.h | 11 +++ > include/linux/mm.h | 1 + > mm/gup.c | 4 > virt/kvm/async_pf.c | 4 +--- > virt/kvm/kvm_main.c | 49 > +--- > 5 files changed, 63 insertions(+), 6 deletions(-) > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 3addcbc..4c1991b 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -198,6 +198,17 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, > unsigned long hva, > int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); > #endif > > +/* > + * Carry out a gup that requires IO. Allow the mm to relinquish the mmap > + * semaphore if the filemap/swap has to wait on a page lock. pagep == NULL > + * controls whether we retry the gup one more time to completion in that > case. 
> + * Typically this is called after a FAULT_FLAG_RETRY_NOWAIT in the main tdp > + * handler. > + */ > +int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm, > + unsigned long addr, bool write_fault, > + struct page **pagep); > + > enum { > OUTSIDE_GUEST_MODE, > IN_GUEST_MODE, > diff --git a/include/linux/mm.h b/include/linux/mm.h > index ebc5f90..13e585f7 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2011,6 +2011,7 @@ static inline struct page *follow_page(struct > vm_area_struct *vma, > #define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */ > #define FOLL_NUMA 0x200 /* force NUMA hinting page fault */ > #define FOLL_MIGRATION 0x400 /* wait for page to replace migration > entry */ > +#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */ > > typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr, > void *data); > diff --git a/mm/gup.c b/mm/gup.c > index 91d044b..af7ea3e 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct > vm_area_struct *vma, > fault_flags |= FAULT_FLAG_ALLOW_RETRY; > if (*flags & FOLL_NOWAIT) > fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT; > + if (*flags & FOLL_TRIED) { > + VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY); > + fault_flags |= FAULT_FLAG_TRIED; > + } > > ret = handle_mm_fault(mm, vma, address, fault_flags); > if (ret & VM_FAULT_ERROR) { > diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c > index d6a3d09..5ff7f7f 100644 > --- a/virt/kvm/async_pf.c > +++ b/virt/kvm/async_pf.c > @@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work) > > might_sleep(); > > - down_read(&mm->mmap_sem); > - get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL); > - up_read(&mm->mmap_sem); > + kvm_get_user_page_io(NULL, mm, addr, 1, NULL); > kvm_async_page_present_sync(vcpu, apf); > > spin_lock(&vcpu->async_pf.lock); > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 
7ef6b48..fa8a565 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1115,6 +1115,43 @@ static int get_user_page_nowait(struct task_struct > *tsk, struct mm_struct *mm, > return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL); > } > > +int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm, > +
Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem
On Thu, Sep 18, 2014 at 08:29:17AM +0800, Wanpeng Li wrote: > Hi Andres, > On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote: > [...] > > static inline int check_user_page_hwpoison(unsigned long addr) > > { > > int rc, flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_WRITE; > >@@ -1177,9 +1214,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool > >*async, bool write_fault, > > npages = get_user_page_nowait(current, current->mm, > > addr, write_fault, page); > > up_read(&current->mm->mmap_sem); > >-} else > >-npages = get_user_pages_fast(addr, 1, write_fault, > >- page); > >+} else { > >+/* > >+ * By now we have tried gup_fast, and possibly async_pf, and we > >+ * are certainly not atomic. Time to retry the gup, allowing > >+ * mmap semaphore to be relinquished in the case of IO. > >+ */ > >+npages = kvm_get_user_page_io(current, current->mm, addr, > >+ write_fault, page); > >+} > > try_async_pf > gfn_to_pfn_async > __gfn_to_pfnasync = false *async = false >__gfn_to_pfn_memslot > hva_to_pfn >hva_to_pfn_fast >hva_to_pfn_slow hva_to_pfn_slow checks async not *async. > kvm_get_user_page_io > > page will always be ready after kvm_get_user_page_io which leads to APF > don't need to work any more. > > Regards, > Wanpeng Li > > > if (npages != 1) > > return npages; > > > >-- > >2.1.0.rc2.206.gedb03e5 > > > >-- > >To unsubscribe, send a message with 'unsubscribe linux-mm' in > >the body to majord...@kvack.org. For more info on Linux MM, > >see: http://www.linux-mm.org/ . > >Don't email: em...@kvack.org -- Gleb.
Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem
On Wed, Sep 17, 2014 at 10:51:48AM -0700, Andres Lagar-Cavilla wrote:

When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory has been
swapped out or is behind a filemap, this will trigger async readahead and
return immediately. The rationale is that KVM will kick back the guest with an
async page fault and allow for some other guest process to take over.

If async PFs are enabled the fault is retried asap from an async workqueue. If
not, it's retried immediately in the same code path. In either case the retry
will not relinquish the mmap semaphore and will block on the IO. This is a bad
thing, as other mmap semaphore users now stall as a function of swap or
filemap latency.

This patch ensures both the regular and async PF path re-enter the fault
allowing for the mmap semaphore to be relinquished in the case of IO wait.

Reviewed-by: Gleb Natapov <g...@kernel.org>
Reviewed-by: Radim Krčmář <rkrc...@redhat.com>
Signed-off-by: Andres Lagar-Cavilla <andre...@google.com>
---
v1 -> v2

* WARN_ON_ONCE -> VM_WARN_ON_ONCE
* pagep == NULL skips the final retry
* kvm_gup_retry -> kvm_gup_io
* Comment updates throughout
---
 include/linux/kvm_host.h | 11 +++
 include/linux/mm.h       |  1 +
 mm/gup.c                 |  4
 virt/kvm/async_pf.c      |  4 +---
 virt/kvm/kvm_main.c      | 49 +---
 5 files changed, 63 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3addcbc..4c1991b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -198,6 +198,17 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif

+/*
+ * Carry out a gup that requires IO. Allow the mm to relinquish the mmap
+ * semaphore if the filemap/swap has to wait on a page lock. pagep == NULL
+ * controls whether we retry the gup one more time to completion in that case.
+ * Typically this is called after a FAULT_FLAG_RETRY_NOWAIT in the main tdp
+ * handler.
+ */
+int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm,
+			 unsigned long addr, bool write_fault,
+			 struct page **pagep);
+
 enum {
 	OUTSIDE_GUEST_MODE,
 	IN_GUEST_MODE,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ebc5f90..13e585f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2011,6 +2011,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
+#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */

 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..af7ea3e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
+	if (*flags & FOLL_TRIED) {
+		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+		fault_flags |= FAULT_FLAG_TRIED;
+	}

 	ret = handle_mm_fault(mm, vma, address, fault_flags);
 	if (ret & VM_FAULT_ERROR) {
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index d6a3d09..5ff7f7f 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)

 	might_sleep();

-	down_read(&mm->mmap_sem);
-	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
-	up_read(&mm->mmap_sem);
+	kvm_get_user_page_io(NULL, mm, addr, 1, NULL);
 	kvm_async_page_present_sync(vcpu, apf);

 	spin_lock(&vcpu->async_pf.lock);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7ef6b48..fa8a565 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1115,6 +1115,43 @@ static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
 	return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
 }

+int kvm_get_user_page_io(struct task_struct *tsk, struct mm_struct *mm,
+			 unsigned long addr, bool write_fault,
+			 struct page **pagep)
+{
+	int npages;
+	int locked = 1;
+	int flags = FOLL_TOUCH | FOLL_HWPOISON |
+		    (pagep ? FOLL_GET : 0) |
+		    (write_fault ? FOLL_WRITE : 0);
+
+	/*
+	 * If retrying the fault, we get here *not* having allowed the filemap
+	 * to wait on the page lock. We should now allow waiting on the IO
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
On Wed, Sep 17, 2014 at 10:13:45AM -0700, Andres Lagar-Cavilla wrote:
> On Wed, Sep 17, 2014 at 10:08 AM, Gleb Natapov wrote:
> > On Wed, Sep 17, 2014 at 10:00:32AM -0700, Andres Lagar-Cavilla wrote:
> >> On Wed, Sep 17, 2014 at 4:42 AM, Gleb Natapov wrote:
> >> > On Wed, Sep 17, 2014 at 01:27:14PM +0200, Radim Krčmář wrote:
> >> >> 2014-09-17 13:26+0300, Gleb Natapov:
> >> >> > For async_pf_execute() you do not need to even retry. Next guest's
> >> >> > page fault will retry it for you.
> >> >>
> >> >> Wouldn't that be a waste of vmentries?
> >> > This is how it will work with or without this second gup. Page is not
> >> > mapped into a shadow page table on this path, it happens on a next fault.
> >>
> >> The point is that the gup in the async pf completion from the work
> >> queue will not relinquish the mmap semaphore. And it most definitely
> >> should, given that we are likely looking at swap/filemap.
> >>
> > I get this point and the patch looks good in general, but my point is
> > that when _retry() is called from async_pf_execute() second gup is not
> > needed. In the original code gup is called to do IO and nothing else.
> > In your patch this is accomplished by the first gup already, so you
> > can skip second gup if pagep == nullptr.
>
> I see. However, if this function were to be used elsewhere in the
> future, then the "if pagep == NULL don't retry" semantics may not
> match the new caller's intention. Would you prefer an explicit flag?
>
We can add explicit flag whenever such caller will be added, if ever.

--
	Gleb.
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
On Wed, Sep 17, 2014 at 10:00:32AM -0700, Andres Lagar-Cavilla wrote:
> On Wed, Sep 17, 2014 at 4:42 AM, Gleb Natapov wrote:
> > On Wed, Sep 17, 2014 at 01:27:14PM +0200, Radim Krčmář wrote:
> >> 2014-09-17 13:26+0300, Gleb Natapov:
> >> > For async_pf_execute() you do not need to even retry. Next guest's
> >> > page fault will retry it for you.
> >>
> >> Wouldn't that be a waste of vmentries?
> > This is how it will work with or without this second gup. Page is not
> > mapped into a shadow page table on this path, it happens on a next fault.
>
> The point is that the gup in the async pf completion from the work
> queue will not relinquish the mmap semaphore. And it most definitely
> should, given that we are likely looking at swap/filemap.
>
I get this point and the patch looks good in general, but my point is
that when _retry() is called from async_pf_execute() second gup is not
needed. In the original code gup is called to do IO and nothing else.
In your patch this is accomplished by the first gup already, so you
can skip second gup if pagep == nullptr.

--
	Gleb.
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
On Wed, Sep 17, 2014 at 01:27:14PM +0200, Radim Krčmář wrote:
> 2014-09-17 13:26+0300, Gleb Natapov:
> > For async_pf_execute() you do not need to even retry. Next guest's page
> > fault will retry it for you.
>
> Wouldn't that be a waste of vmentries?
This is how it will work with or without this second gup. Page is not
mapped into a shadow page table on this path, it happens on a next fault.

--
	Gleb.
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
On Mon, Sep 15, 2014 at 01:11:25PM -0700, Andres Lagar-Cavilla wrote:
> When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory has been
> swapped out or is behind a filemap, this will trigger async readahead and
> return immediately. The rationale is that KVM will kick back the guest with an
> "async page fault" and allow for some other guest process to take over.
>
> If async PFs are enabled the fault is retried asap from a workqueue, or
> immediately if no async PFs. The retry will not relinquish the mmap semaphore
> and will block on the IO. This is a bad thing, as other mmap semaphore users
> now stall. The fault could take a long time, depending on swap or filemap
> latency.
>
> This patch ensures both the regular and async PF path re-enter the fault
> allowing for the mmap semaphore to be relinquished in the case of IO wait.
>
> Signed-off-by: Andres Lagar-Cavilla
> ---
>  include/linux/kvm_host.h |  9 +
>  include/linux/mm.h       |  1 +
>  mm/gup.c                 |  4
>  virt/kvm/async_pf.c      |  4 +---
>  virt/kvm/kvm_main.c      | 45 ++---
>  5 files changed, 57 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3addcbc..704908d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -198,6 +198,15 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>
> +/*
> + * Retry a fault after a gup with FOLL_NOWAIT. This properly relinquishes mmap
> + * semaphore if the filemap/swap has to wait on page lock (and retries the gup
> + * to completion after that).
> + */
> +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
> +			    unsigned long addr, bool write_fault,
> +			    struct page **pagep);
> +
>  enum {
>  	OUTSIDE_GUEST_MODE,
>  	IN_GUEST_MODE,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ebc5f90..13e585f7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2011,6 +2011,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
>  #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
>  #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
>  #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
> +#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
>
>  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>  			void *data);
> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..332d1c3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
>  	if (*flags & FOLL_NOWAIT)
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
> +	if (*flags & FOLL_TRIED) {
> +		WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
> +		fault_flags |= FAULT_FLAG_TRIED;
> +	}
>
>  	ret = handle_mm_fault(mm, vma, address, fault_flags);
>  	if (ret & VM_FAULT_ERROR) {
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index d6a3d09..17b78b1 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
>
>  	might_sleep();
>
> -	down_read(&mm->mmap_sem);
> -	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
> -	up_read(&mm->mmap_sem);
> +	kvm_get_user_page_retry(NULL, mm, addr, 1, NULL);
>  	kvm_async_page_present_sync(vcpu, apf);
>
>  	spin_lock(&vcpu->async_pf.lock);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7ef6b48..43a9ab9 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1115,6 +1115,39 @@ static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
>  	return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
>  }
>
> +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
> +			    unsigned long addr, bool write_fault,
> +			    struct page **pagep)
> +{
> +	int npages;
> +	int locked = 1;
> +	int flags = FOLL_TOUCH | FOLL_HWPOISON |
> +		    (pagep ? FOLL_GET : 0) |
> +		    (write_fault ? FOLL_WRITE : 0);
> +
> +	/*
> +	 * Retrying fault, we get here *not* having allowed the filemap to wait
> +	 * on the page lock. We should now allow waiting on the IO with the
> +	 * mmap semaphore released.
> +	 */
> +	down_read(&mm->mmap_sem);
> +	npages = __get_user_pages(tsk, mm, addr, 1, flags, pagep, NULL,
> +				  &locked);
> +	if (!locked) {
> +		BUG_ON(npages != -EBUSY);
> +		/*
> +		 * The previous call has now waited on the IO. Now we can
> +		 * retry
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 04:37:39PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 16:31, Gleb Natapov ha scritto:
> >> > What if the page being swapped out is L1's APIC access page? We don't
> >> > run prepare_vmcs12 in that case because it's an L2->L0->L2 entry, so we
> >> > need to "do something".
> > We will do something on L2->L1 exit. We will call
> > kvm_reload_apic_access_page(). That is what patch 5 of this series is
> > doing.
>
> Sorry, I meant "the APIC access page prepared by L1" for L2's execution.
>
> You wrote:
>
> > if (!is_guest_mode() || !(vmcs12->secondary_vm_exec_control &
> >     SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
> > 	write(APIC_ACCESS_ADDR)
> >
> > In other words if L2 shares L1 apic access page then reload, otherwise do
> > nothing.
>
> but in that case you have to redo nested_get_page, so "do nothing"
> doesn't work.
>
Ah, 7/7 is new in this submission. Before that this page was still pinned.
Looking at 7/7 now I do not see how it can work since it has no code for mmu
notifier to detect that it deals with such page and call
kvm_reload_apic_access_page(). I said to Tang previously that nested kvm has
a bunch of pinned page that are hard to deal with and suggested to iron out
non nested case first :(

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 04:24:04PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 16:21, Gleb Natapov ha scritto:
> > As far as I can tell the if that is needed there is:
> >
> > if (!is_guest_mode() || !(vmcs12->secondary_vm_exec_control &
> >     SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
> > 	write(APIC_ACCESS_ADDR)
> >
> > In other words if L2 shares L1 apic access page then reload, otherwise do
> > nothing.
>
> What if the page being swapped out is L1's APIC access page? We don't
> run prepare_vmcs12 in that case because it's an L2->L0->L2 entry, so we
> need to "do something".
We will do something on L2->L1 exit. We will call
kvm_reload_apic_access_page(). That is what patch 5 of this series is doing.

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 04:06:58PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 15:59, Gleb Natapov ha scritto:
> >
> > Suppose vmcs01->APIC_ACCESS_ADDR = 0xf000. During L2 entry
> > vmcs02->APIC_ACCESS_ADDR is set to 0xf000 too (by prepare_vmcs02). Now
> > 0xf000 is migrated to 0x8000, mmu notifier is called, it forces vmexit,
> > but vcpu is in a guest mode so vmcs02->APIC_ACCESS_ADDR is never updated
> > to 0x8000 because of "if (!is_guest_mode(vcpu))" check. So what am I
> > missing here?
>
> Right, guest mode isn't left as soon as you execute nested_vmx_vmexit,
> because this isn't an L2->L1 exit. So we need an "else" for that "if
> (!is_guest_mode(vcpu))", in which case the hpa is ignored and
> vmcs12->apic_access_addr is used instead?
>
As far as I can tell the if that is needed there is:

if (!is_guest_mode() || !(vmcs12->secondary_vm_exec_control &
    SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
	write(APIC_ACCESS_ADDR)

In other words if L2 shares L1 apic access page then reload, otherwise do
nothing.

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 03:05:05PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 13:30, Gleb Natapov ha scritto:
> >> > +		vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
> >> > +		/*
> >> > +		 * Do not pin apic access page in memory so that memory
> >> > +		 * hotplug process is able to migrate it.
> >> > +		 */
> >> > +		put_page(page);
> >> > 	}
> > This code is in prepare_vmcs02() and is executed during L1->L2 vmentry.
> > What happens when apic access page is migrated while L2 is running? It
> > needs to be update somewhere.
>
> Before it is migrated, the MMU notifier is called and will force a
> vmexit on all CPUs. The reload code will call GUP again on the page
> again and swap it in.
>
This is how it will work without "if (!is_guest_mode(vcpu))". But, unless I
am missing something, with this check it will not work while vcpu is in L2.
Suppose vmcs01->APIC_ACCESS_ADDR = 0xf000. During L2 entry
vmcs02->APIC_ACCESS_ADDR is set to 0xf000 too (by prepare_vmcs02). Now
0xf000 is migrated to 0x8000, mmu notifier is called, it forces vmexit, but
vcpu is in a guest mode so vmcs02->APIC_ACCESS_ADDR is never updated to
0x8000 because of "if (!is_guest_mode(vcpu))" check. So what am I missing
here?

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 12:47:16PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 12:12, Gleb Natapov ha scritto:
> > On Thu, Sep 11, 2014 at 11:21:49AM +0200, Paolo Bonzini wrote:
> >> Il 11/09/2014 07:38, Tang Chen ha scritto:
> >>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>> index 63c4c3e..da6d55d 100644
> >>> --- a/arch/x86/kvm/vmx.c
> >>> +++ b/arch/x86/kvm/vmx.c
> >>> @@ -7093,6 +7093,11 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
> >>> 	vmx_set_msr_bitmap(vcpu);
> >>> }
> >>>
> >>> +static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
> >>> +{
> >>> +	vmcs_write64(APIC_ACCESS_ADDR, hpa);
> >>
> >> This has to be guarded by "if (!is_guest_mode(vcpu))".
> >>
> > We do need to write it if L1 and L2 share APIC_ACCESS_ADDR and skip
> > it otherwise, no?
>
> Yes, but this would be handled by patch 6:
>
>         } else if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
> +                struct page *page = gfn_to_page(vmx->vcpu.kvm,
> +                                APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>                  exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> -                vmcs_write64(APIC_ACCESS_ADDR,
> -                             page_to_phys(vcpu->kvm->arch.apic_access_page));
> +                vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
> +                /*
> +                 * Do not pin apic access page in memory so that memory
> +                 * hotplug process is able to migrate it.
> +                 */
> +                put_page(page);
>         }

This code is in prepare_vmcs02() and is executed during L1->L2 vmentry. What
happens when apic access page is migrated while L2 is running? It needs to
be update somewhere.

> However, this is also useless code duplication because the above snippet could
> reuse vcpu_reload_apic_access_page too.
>
> So I think you cannot do the is_guest_mode check in
> kvm_vcpu_reload_apic_access_page and also not in
> vmx_reload_apic_access_page. But you could do something like
>
> kvm_vcpu_reload_apic_access_page(...)
> {
> ...
> 	kvm_x86_ops->reload_apic_access_page(...);
> }
> EXPORT_SYMBOL_GPL(kvm_vcpu_reload_apic_access_page);
>
> /* used in vcpu_enter_guest only */
> vcpu_reload_apic_access_page(...)
> {
> 	if (!is_guest_mode(vcpu))
> 		kvm_vcpu_reload_apic_access_page(...)
> }
>
> Paolo

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 11:21:49AM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 07:38, Tang Chen ha scritto:
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 63c4c3e..da6d55d 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -7093,6 +7093,11 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
> > 	vmx_set_msr_bitmap(vcpu);
> > }
> >
> > +static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
> > +{
> > +	vmcs_write64(APIC_ACCESS_ADDR, hpa);
>
> This has to be guarded by "if (!is_guest_mode(vcpu))".
>
We do need to write it if L1 and L2 share APIC_ACCESS_ADDR and skip it
otherwise, no?

> > +}
> > +

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 11:21:49AM +0200, Paolo Bonzini wrote: Il 11/09/2014 07:38, Tang Chen ha scritto: diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 63c4c3e..da6d55d 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -7093,6 +7093,11 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set) vmx_set_msr_bitmap(vcpu); } +static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa) +{ + vmcs_write64(APIC_ACCESS_ADDR, hpa); This has to be guarded by if (!is_guest_mode(vcpu)). We do need to write it if L1 and L2 share APIC_ACCESS_ADDR and skip it otherwise, no? +} + -- Gleb. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 12:47:16PM +0200, Paolo Bonzini wrote: Il 11/09/2014 12:12, Gleb Natapov ha scritto: On Thu, Sep 11, 2014 at 11:21:49AM +0200, Paolo Bonzini wrote: Il 11/09/2014 07:38, Tang Chen ha scritto: diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 63c4c3e..da6d55d 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -7093,6 +7093,11 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set) vmx_set_msr_bitmap(vcpu); } +static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa) +{ + vmcs_write64(APIC_ACCESS_ADDR, hpa); This has to be guarded by if (!is_guest_mode(vcpu)). We do need to write it if L1 and L2 share APIC_ACCESS_ADDR and skip it otherwise, no? Yes, but this would be handled by patch 6: } else if (vm_need_virtualize_apic_accesses(vmx-vcpu.kvm)) { + struct page *page = gfn_to_page(vmx-vcpu.kvm, + APIC_DEFAULT_PHYS_BASE PAGE_SHIFT); exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; - vmcs_write64(APIC_ACCESS_ADDR, - page_to_phys(vcpu-kvm-arch.apic_access_page)); + vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page)); + /* + * Do not pin apic access page in memory so that memory + * hotplug process is able to migrate it. + */ + put_page(page); } This code is in prepare_vmcs02() and is executed during L1-L2 vmentry. What happens when apic access page is migrated while L2 is running? It needs to be update somewhere. However, this is also useless code duplication because the above snippet could reuse vcpu_reload_apic_access_page too. So I think you cannot do the is_guest_mode check in kvm_vcpu_reload_apic_access_page and also not in vmx_reload_apic_access_page. But you could do something like kvm_vcpu_reload_apic_access_page(...) { ... kvm_x86_ops-reload_apic_access_page(...); } EXPORT_SYMBOL_GPL(kvm_vcpu_reload_apic_access_page); /* used in vcpu_enter_guest only */ vcpu_reload_apic_access_page(...) { if (!is_guest_mode(vcpu)) kvm_vcpu_reload_apic_access_page(...) } Paolo -- Gleb. 
-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 03:05:05PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 13:30, Gleb Natapov ha scritto:
> > > > +		vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
> > > > +		/*
> > > > +		 * Do not pin apic access page in memory so that memory
> > > > +		 * hotplug process is able to migrate it.
> > > > +		 */
> > > > +		put_page(page);
> > > > 	}
> >
> > This code is in prepare_vmcs02() and is executed during L1->L2
> > vmentry. What happens when the apic access page is migrated while L2
> > is running? It needs to be updated somewhere.
>
> Before it is migrated, the MMU notifier is called and will force a
> vmexit on all CPUs.  The reload code will call GUP on the page again
> and swap it in.

This is how it will work without if (!is_guest_mode(vcpu)). But, unless
I am missing something, with this check it will not work while the vcpu
is in L2. Suppose vmcs01->APIC_ACCESS_ADDR = 0xf000. During L2 entry
vmcs02->APIC_ACCESS_ADDR is set to 0xf000 too (by prepare_vmcs02). Now
0xf000 is migrated to 0x8000, the mmu notifier is called, it forces a
vmexit, but the vcpu is in guest mode, so vmcs02->APIC_ACCESS_ADDR is
never updated to 0x8000 because of the if (!is_guest_mode(vcpu)) check.
So what am I missing here?

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 04:06:58PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 15:59, Gleb Natapov ha scritto:
> > Suppose vmcs01->APIC_ACCESS_ADDR = 0xf000. During L2 entry
> > vmcs02->APIC_ACCESS_ADDR is set to 0xf000 too (by prepare_vmcs02).
> > Now 0xf000 is migrated to 0x8000, the mmu notifier is called, it
> > forces a vmexit, but the vcpu is in guest mode, so
> > vmcs02->APIC_ACCESS_ADDR is never updated to 0x8000 because of the
> > if (!is_guest_mode(vcpu)) check. So what am I missing here?
>
> Right, guest mode isn't left as soon as you execute nested_vmx_vmexit,
> because this isn't an L2->L1 exit.  So we need an else for that
> if (!is_guest_mode(vcpu)), in which case the hpa is ignored and
> vmcs12->apic_access_addr is used instead?

As far as I can tell the if that is needed there is:

   if (!is_guest_mode() ||
       !(vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
           write(APIC_ACCESS_ADDR)

In other words, if L2 shares L1's apic access page then reload,
otherwise do nothing.

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 04:24:04PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 16:21, Gleb Natapov ha scritto:
> > As far as I can tell the if that is needed there is:
> >
> >    if (!is_guest_mode() ||
> >        !(vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
> >            write(APIC_ACCESS_ADDR)
> >
> > In other words, if L2 shares L1's apic access page then reload,
> > otherwise do nothing.
>
> What if the page being swapped out is L1's APIC access page?  We don't
> run prepare_vmcs12 in that case because it's an L2->L0->L2 entry, so we
> need to do something.

We will do something on L2->L1 exit. We will call
kvm_reload_apic_access_page(). That is what patch 5 of this series is
doing.

--
	Gleb.
Re: [PATCH v5 4/7] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Thu, Sep 11, 2014 at 04:37:39PM +0200, Paolo Bonzini wrote:
> Il 11/09/2014 16:31, Gleb Natapov ha scritto:
> > > What if the page being swapped out is L1's APIC access page?  We
> > > don't run prepare_vmcs12 in that case because it's an L2->L0->L2
> > > entry, so we need to do something.
> >
> > We will do something on L2->L1 exit. We will call
> > kvm_reload_apic_access_page(). That is what patch 5 of this series
> > is doing.
>
> Sorry, I meant the APIC access page prepared by L1 for L2's execution.
>
> You wrote:
>
> > if (!is_guest_mode() ||
> >     !(vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
> >         write(APIC_ACCESS_ADDR)
> >
> > In other words, if L2 shares L1's apic access page then reload,
> > otherwise do nothing.
>
> but in that case you have to redo nested_get_page, so "do nothing"
> doesn't work.

Ah, 7/7 is new in this submission. Before that this page was still
pinned. Looking at 7/7 now, I do not see how it can work, since it has
no code for the mmu notifier to detect that it deals with such a page
and call kvm_reload_apic_access_page(). I said to Tang previously that
nested kvm has a bunch of pinned pages that are hard to deal with, and
suggested to iron out the non-nested case first :(

--
	Gleb.
Re: [PATCH v4 4/6] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Tue, Sep 09, 2014 at 03:13:07PM +0800, tangchen wrote: > Hi Gleb, > > On 09/03/2014 11:04 PM, Gleb Natapov wrote: > >On Wed, Sep 03, 2014 at 09:42:30AM +0800, tangchen wrote: > >>Hi Gleb, > >> > >>On 09/03/2014 12:00 AM, Gleb Natapov wrote: > >>>.. > >>>+static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) > >>>+{ > >>>+ /* > >>>+ * apic access page could be migrated. When the page is being migrated, > >>>+ * GUP will wait till the migrate entry is replaced with the new pte > >>>+ * entry pointing to the new page. > >>>+ */ > >>>+ vcpu->kvm->arch.apic_access_page = gfn_to_page(vcpu->kvm, > >>>+ APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); > >>>+ kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm, > >>>+ page_to_phys(vcpu->kvm->arch.apic_access_page)); > >>>I am a little bit worried that here all vcpus write to > >>>vcpu->kvm->arch.apic_access_page > >>>without any locking. It is probably benign since pointer write is atomic > >>>on x86. Paolo? > >>> > >>>Do we even need apic_access_page? Why not call > >>> gfn_to_page(APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT) > >>> put_page() > >>>on rare occasions we need to know its address? > >>Isn't it a necessary item defined in hardware spec ? > >> > >vcpu->kvm->arch.apic_access_page? No. This is internal kvm data structure. > > > >>I didn't read intel spec deeply, but according to the code, the page's > >>address is > >>written into vmcs. And it made me think that we cannnot remove it. > >> > >We cannot remove writing of apic page address into vmcs, but this is not > >done by > >assigning to vcpu->kvm->arch.apic_access_page, but by vmwrite in > >set_apic_access_page_addr(). > > OK, I'll try to remove kvm->arch.apic_access_page and send a patch for it > soon. > > BTW, if you don't have objection to the first two patches, would you please > help to > commit them first ? > I acked them and CCed Paolo to this reply. I hope he will look at the series too. -- Gleb. 
Re: [PATCH v4 2/6] kvm: Remove ept_identity_pagetable from struct kvm_arch.
On Wed, Aug 27, 2014 at 06:17:37PM +0800, Tang Chen wrote:
> kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
> it is never used to refer to the page at all.
>
> In vcpu initialization, it indicates two things:
> 1. indicates if ept page is allocated
> 2. indicates if a memory slot for identity page is initialized
>
> Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
> identity pagetable is initialized. So we can remove ept_identity_pagetable.
>
> NOTE: In the original code, ept identity pagetable page is pinned in memory.
> As a result, it cannot be migrated/hot-removed. After this patch, since
> kvm_arch->ept_identity_pagetable is removed, ept identity pagetable page
> is no longer pinned in memory. And it can be migrated/hot-removed.

Reviewed-by: Gleb Natapov

> Signed-off-by: Tang Chen
> ---
>  arch/x86/include/asm/kvm_host.h |  1 -
>  arch/x86/kvm/vmx.c              | 50 -
>  arch/x86/kvm/x86.c              |  2 --
>  3 files changed, 25 insertions(+), 28 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7c492ed..35171c7 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -580,7 +580,6 @@ struct kvm_arch {
>
>  	gpa_t wall_clock;
>
> -	struct page *ept_identity_pagetable;
>  	bool ept_identity_pagetable_done;
>  	gpa_t ept_identity_map_addr;
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 4b80ead..953d529 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -743,6 +743,7 @@ static u32 vmx_segment_access_rights(struct kvm_segment *var);
>  static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu *vcpu);
>  static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx);
>  static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx);
> +static int alloc_identity_pagetable(struct kvm *kvm);
>
>  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
>  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> @@ -3938,21 +3939,27 @@ out:
>
>  static int init_rmode_identity_map(struct kvm *kvm)
>  {
> -	int i, idx, r, ret;
> +	int i, idx, r, ret = 0;
>  	pfn_t identity_map_pfn;
>  	u32 tmp;
>
>  	if (!enable_ept)
>  		return 1;
> -	if (unlikely(!kvm->arch.ept_identity_pagetable)) {
> -		printk(KERN_ERR "EPT: identity-mapping pagetable "
> -			"haven't been allocated!\n");
> -		return 0;
> +
> +	/* Protect kvm->arch.ept_identity_pagetable_done. */
> +	mutex_lock(&kvm->slots_lock);
> +
> +	if (likely(kvm->arch.ept_identity_pagetable_done)) {
> +		ret = 1;
> +		goto out2;
>  	}
> -	if (likely(kvm->arch.ept_identity_pagetable_done))
> -		return 1;
> -	ret = 0;
> +
>  	identity_map_pfn = kvm->arch.ept_identity_map_addr >> PAGE_SHIFT;
> +
> +	r = alloc_identity_pagetable(kvm);
> +	if (r)
> +		goto out2;
> +
>  	idx = srcu_read_lock(&kvm->srcu);
>  	r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE);
>  	if (r < 0)
> @@ -3970,6 +3977,9 @@ static int init_rmode_identity_map(struct kvm *kvm)
>  	ret = 1;
> out:
>  	srcu_read_unlock(&kvm->srcu, idx);
> +
> +out2:
> +	mutex_unlock(&kvm->slots_lock);
>  	return ret;
>  }
>
> @@ -4019,31 +4029,23 @@ out:
>
>  static int alloc_identity_pagetable(struct kvm *kvm)
>  {
> -	struct page *page;
> +	/*
> +	 * In init_rmode_identity_map(), kvm->arch.ept_identity_pagetable_done
> +	 * is checked before calling this function and set to true after the
> +	 * calling. The access to kvm->arch.ept_identity_pagetable_done should
> +	 * be protected by kvm->slots_lock.
> +	 */
> +
>  	struct kvm_userspace_memory_region kvm_userspace_mem;
>  	int r = 0;
>
> -	mutex_lock(&kvm->slots_lock);
> -	if (kvm->arch.ept_identity_pagetable)
> -		goto out;
>  	kvm_userspace_mem.slot = IDENTITY_PAGETABLE_PRIVATE_MEMSLOT;
>  	kvm_userspace_mem.flags = 0;
>  	kvm_userspace_mem.guest_phys_addr =
>  		kvm->arch.ept_identity_map_addr;
>  	kvm_userspace_mem.memory_size = PAGE_SIZE;
>  	r = __kvm_set_memory_region(kvm, &kvm_userspace_mem);
> -	if (r)
> -		goto out;
>
> -	page = gfn_to_page(kvm, kvm->arch.ept_identity_map_addr >> PAGE_SHIFT);
> -	if (is_error_page(page)) {
> -		r = -EFAULT;
>
Re: [PATCH v4 1/6] kvm: Use APIC_DEFAULT_PHYS_BASE macro as the apic access page address.
On Wed, Aug 27, 2014 at 06:17:36PM +0800, Tang Chen wrote:
> We have APIC_DEFAULT_PHYS_BASE defined as 0xfee00000, which is also the
> address of apic access page. So use this macro.

Reviewed-by: Gleb Natapov

> Signed-off-by: Tang Chen
> ---
>  arch/x86/kvm/svm.c | 3 ++-
>  arch/x86/kvm/vmx.c | 6 +++---
>  2 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index ddf7427..1d941ad 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1257,7 +1257,8 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
>  	svm->asid_generation = 0;
>  	init_vmcb(svm);
>
> -	svm->vcpu.arch.apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;
> +	svm->vcpu.arch.apic_base = APIC_DEFAULT_PHYS_BASE |
> +				   MSR_IA32_APICBASE_ENABLE;
>  	if (kvm_vcpu_is_bsp(&svm->vcpu))
>  		svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP;
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index bfe11cf..4b80ead 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3999,13 +3999,13 @@ static int alloc_apic_access_page(struct kvm *kvm)
>  		goto out;
>  	kvm_userspace_mem.slot = APIC_ACCESS_PAGE_PRIVATE_MEMSLOT;
>  	kvm_userspace_mem.flags = 0;
> -	kvm_userspace_mem.guest_phys_addr = 0xfee00000ULL;
> +	kvm_userspace_mem.guest_phys_addr = APIC_DEFAULT_PHYS_BASE;
>  	kvm_userspace_mem.memory_size = PAGE_SIZE;
>  	r = __kvm_set_memory_region(kvm, &kvm_userspace_mem);
>  	if (r)
>  		goto out;
>
> -	page = gfn_to_page(kvm, 0xfee00);
> +	page = gfn_to_page(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>  	if (is_error_page(page)) {
>  		r = -EFAULT;
>  		goto out;
> @@ -4477,7 +4477,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
>
>  	vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
>  	kvm_set_cr8(&vmx->vcpu, 0);
> -	apic_base_msr.data = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;
> +	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
>  	if (kvm_vcpu_is_bsp(&vmx->vcpu))
>  		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
>  	apic_base_msr.host_initiated = true;
> --
> 1.8.3.1
>

--
	Gleb.
Re: [PATCH 3/4] KVM: x86: inject nested page faults on emulated instructions
On Thu, Sep 04, 2014 at 07:44:51PM +0200, Paolo Bonzini wrote:
> Il 04/09/2014 17:05, Gleb Natapov ha scritto:
> > > If you do that, KVM gets down to the "if (writeback)" and writes the
> > > ctxt->eip from L2 into the L1 EIP.
> >
> > Heh, that's a bummer. We should not write back if an instruction
> > caused a vmexit.
>
> You're right, that works.

Looks good! Reviewed-by: Gleb Natapov

> Paolo
>
> -- 8< -
> Subject: [PATCH] KVM: x86: skip writeback on injection of nested exception
>
> If a nested page fault happens during emulation, we will inject a vmexit,
> not a page fault.  However because writeback happens after the injection,
> we will write ctxt->eip from L2 into the L1 EIP.  We do not write back
> if an instruction caused an interception vmexit---do the same for page
> faults.
>
> Signed-off-by: Paolo Bonzini
> ---
>  arch/x86/include/asm/kvm_host.h |  1 -
>  arch/x86/kvm/x86.c              | 15 ++-
>  2 files changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 08cc299..c989651 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -893,7 +893,6 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
>  int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  			    gfn_t gfn, void *data, int offset, int len,
>  			    u32 access);
> -void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
>  bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
>
>  static inline int __kvm_irq_line_state(unsigned long *irq_state,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e4ed85e..3541946 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -408,12 +408,14 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
>  }
>  EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
>
> -void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
> +static bool kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
>  {
>  	if (mmu_is_nested(vcpu) && !fault->nested_page_fault)
>  		vcpu->arch.nested_mmu.inject_page_fault(vcpu, fault);
>  	else
>  		vcpu->arch.mmu.inject_page_fault(vcpu, fault);
> +
> +	return fault->nested_page_fault;
>  }
>
>  void kvm_inject_nmi(struct kvm_vcpu *vcpu)
> @@ -4929,16 +4931,18 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
>  	}
>  }
>
> -static void inject_emulated_exception(struct kvm_vcpu *vcpu)
> +static bool inject_emulated_exception(struct kvm_vcpu *vcpu)
>  {
>  	struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
>  	if (ctxt->exception.vector == PF_VECTOR)
> -		kvm_propagate_fault(vcpu, &ctxt->exception);
> -	else if (ctxt->exception.error_code_valid)
> +		return kvm_propagate_fault(vcpu, &ctxt->exception);
> +
> +	if (ctxt->exception.error_code_valid)
>  		kvm_queue_exception_e(vcpu, ctxt->exception.vector,
>  				      ctxt->exception.error_code);
>  	else
>  		kvm_queue_exception(vcpu, ctxt->exception.vector);
> +	return false;
>  }
>
>  static void init_emulate_ctxt(struct kvm_vcpu *vcpu)
> @@ -5300,8 +5304,9 @@ restart:
>  	}
>
>  	if (ctxt->have_exception) {
> -		inject_emulated_exception(vcpu);
>  		r = EMULATE_DONE;
> +		if (inject_emulated_exception(vcpu))
> +			return r;
>  	} else if (vcpu->arch.pio.count) {
>  		if (!vcpu->arch.pio.in) {
>  			/* FIXME: return into emulator if single-stepping. */
> --
> 1.9.3
>

--
	Gleb.
Re: [PATCH 3/4] KVM: x86: inject nested page faults on emulated instructions
On Thu, Sep 04, 2014 at 04:12:19PM +0200, Paolo Bonzini wrote:
> Il 04/09/2014 09:02, Gleb Natapov ha scritto:
> > On Tue, Sep 02, 2014 at 05:13:49PM +0200, Paolo Bonzini wrote:
> > > This is required for the following patch to work correctly.  If a
> > > nested page fault happens during emulation, we must inject a vmexit,
> > > not a page fault.  Luckily we already have the required machinery:
> > > it is enough to return X86EMUL_INTERCEPTED instead of
> > > X86EMUL_PROPAGATE_FAULT.
> >
> > I wonder why this patch is needed. X86EMUL_PROPAGATE_FAULT causes
> > ctxt->have_exception to be set to true in x86_emulate_insn().
> > x86_emulate_instruction() checks ctxt->have_exception and calls
> > inject_emulated_exception() if it is true. inject_emulated_exception()
> > calls kvm_propagate_fault() where we check if the fault was nested and
> > generate vmexit or a page fault accordingly.
>
> Good question. :)
>
> If you do that, KVM gets down to the "if (writeback)" and writes the
> ctxt->eip from L2 into the L1 EIP.

Heh, that's a bummer. We should not write back if an instruction caused
a vmexit.

> Possibly this patch can be replaced by just this?
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 022513b..475e979 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5312,7 +5312,7 @@ restart:
>
>  	if (ctxt->have_exception) {
>  		inject_emulated_exception(vcpu);
> -		r = EMULATE_DONE;
> +		return EMULATE_DONE;

If there was no vmexit we still want to writeback. Perhaps:

    writeback = inject_emulated_exception(vcpu);

and return false if there was a vmexit due to a nested page fault (or
any fault; can't L1 ask for a #GP/#UD intercept that needs to be
handled here too?)

>  	} else if (vcpu->arch.pio.count) {
>  		if (!vcpu->arch.pio.in) {
>  			/* FIXME: return into emulator if single-stepping. */
>
> But I'm not sure how to test it, and I like the idea of treating nested
> page faults like other nested vmexits during emulation (which is what
> this patch does).

IMO exits due to instruction intercept and exits due to other
interceptable events that may happen during instruction emulation are
sufficiently different to be handled slightly differently. If my
assumption about #GP above is correct, with the current approach it can
be easily handled inside inject_emulated_exception().

--
	Gleb.
Re: [PATCH 3/4] KVM: x86: inject nested page faults on emulated instructions
On Tue, Sep 02, 2014 at 05:13:49PM +0200, Paolo Bonzini wrote: > This is required for the following patch to work correctly. If a nested page > fault happens during emulation, we must inject a vmexit, not a page fault. > Luckily we already have the required machinery: it is enough to return > X86EMUL_INTERCEPTED instead of X86EMUL_PROPAGATE_FAULT. > I wonder why this patch is needed. X86EMUL_PROPAGATE_FAULT causes ctxt->have_exception to be set to true in x86_emulate_insn(). x86_emulate_instruction() checks ctxt->have_exception and calls inject_emulated_exception() if it is true. inject_emulated_exception() calls kvm_propagate_fault() where we check if the fault was nested and generate vmexit or a page fault accordingly. > Reported-by: Valentine Sinitsyn > Signed-off-by: Paolo Bonzini > --- > arch/x86/kvm/x86.c | 18 ++ > 1 file changed, 14 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index e4ed85e07a01..9e3b74c044ed 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -416,6 +416,16 @@ void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct > x86_exception *fault) > vcpu->arch.mmu.inject_page_fault(vcpu, fault); > } > > +static inline int kvm_propagate_or_intercept(struct kvm_vcpu *vcpu, > + struct x86_exception *exception) > +{ > + if (likely(!exception->nested_page_fault)) > + return X86EMUL_PROPAGATE_FAULT; > + > + vcpu->arch.mmu.inject_page_fault(vcpu, exception); > + return X86EMUL_INTERCEPTED; > +} > + > void kvm_inject_nmi(struct kvm_vcpu *vcpu) > { > atomic_inc(>arch.nmi_queued); > @@ -4122,7 +4132,7 @@ static int kvm_read_guest_virt_helper(gva_t addr, void > *val, unsigned int bytes, > int ret; > > if (gpa == UNMAPPED_GVA) > - return X86EMUL_PROPAGATE_FAULT; > + return kvm_propagate_or_intercept(vcpu, exception); > ret = kvm_read_guest_page(vcpu->kvm, gpa >> PAGE_SHIFT, data, > offset, toread); > if (ret < 0) { > @@ -4152,7 +4162,7 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt > *ctxt, > 
gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, > access|PFERR_FETCH_MASK, > exception); > if (unlikely(gpa == UNMAPPED_GVA)) > - return X86EMUL_PROPAGATE_FAULT; > + return kvm_propagate_or_intercept(vcpu, exception); > > offset = addr & (PAGE_SIZE-1); > if (WARN_ON(offset + bytes > PAGE_SIZE)) > @@ -4203,7 +4213,7 @@ int kvm_write_guest_virt_system(struct x86_emulate_ctxt > *ctxt, > int ret; > > if (gpa == UNMAPPED_GVA) > - return X86EMUL_PROPAGATE_FAULT; > + return kvm_propagate_or_intercept(vcpu, exception); > ret = kvm_write_guest(vcpu->kvm, gpa, data, towrite); > if (ret < 0) { > r = X86EMUL_IO_NEEDED; > @@ -4350,7 +4360,7 @@ static int emulator_read_write_onepage(unsigned long > addr, void *val, > ret = vcpu_mmio_gva_to_gpa(vcpu, addr, , exception, write); > > if (ret < 0) > - return X86EMUL_PROPAGATE_FAULT; > + return kvm_propagate_or_intercept(vcpu, exception); > > /* For APIC access vmexit */ > if (ret) > -- > 1.8.3.1 > > -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4 4/6] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Wed, Sep 03, 2014 at 09:42:30AM +0800, tangchen wrote: > Hi Gleb, > > On 09/03/2014 12:00 AM, Gleb Natapov wrote: > >.. > >+static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) > >+{ > >+/* > >+ * apic access page could be migrated. When the page is being migrated, > >+ * GUP will wait till the migrate entry is replaced with the new pte > >+ * entry pointing to the new page. > >+ */ > >+vcpu->kvm->arch.apic_access_page = gfn_to_page(vcpu->kvm, > >+APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); > >+kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm, > >+page_to_phys(vcpu->kvm->arch.apic_access_page)); > >I am a little bit worried that here all vcpus write to > >vcpu->kvm->arch.apic_access_page > >without any locking. It is probably benign since pointer write is atomic on > >x86. Paolo? > > > >Do we even need apic_access_page? Why not call > > gfn_to_page(APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT) > > put_page() > >on rare occasions we need to know its address? > > Isn't it a necessary item defined in hardware spec ? > vcpu->kvm->arch.apic_access_page? No. This is internal kvm data structure. > I didn't read intel spec deeply, but according to the code, the page's > address is > written into vmcs. And it made me think that we cannnot remove it. > We cannot remove writing of apic page address into vmcs, but this is not done by assigning to vcpu->kvm->arch.apic_access_page, but by vmwrite in set_apic_access_page_addr(). -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4 5/6] kvm, mem-hotplug: Reload L1's apic access page on migration when L2 is running.
On Wed, Aug 27, 2014 at 06:17:40PM +0800, Tang Chen wrote: > This patch only handle "L1 and L2 vm share one apic access page" situation. > > When L1 vm is running, if the shared apic access page is migrated, > mmu_notifier will > request all vcpus to exit to L0, and reload apic access page physical address > for > all the vcpus' vmcs (which is done by patch 5/6). And when it enters L2 vm, > L2's vmcs > will be updated in prepare_vmcs02() called by nested_vm_run(). So we need to > do > nothing. > > When L2 vm is running, if the shared apic access page is migrated, > mmu_notifier will > request all vcpus to exit to L0, and reload apic access page physical address > for > all L2 vmcs. And this patch requests apic access page reload in L2->L1 vmexit. > > Signed-off-by: Tang Chen > --- > arch/x86/include/asm/kvm_host.h | 1 + > arch/x86/kvm/svm.c | 6 ++ > arch/x86/kvm/vmx.c | 32 > arch/x86/kvm/x86.c | 3 +++ > virt/kvm/kvm_main.c | 1 + > 5 files changed, 43 insertions(+) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 514183e..13fbb62 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -740,6 +740,7 @@ struct kvm_x86_ops { > void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap); > void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set); > void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa); > + void (*set_nested_apic_page_migrated)(struct kvm_vcpu *vcpu, bool set); > void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector); > void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu); > int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c > index f2eacc4..da88646 100644 > --- a/arch/x86/kvm/svm.c > +++ b/arch/x86/kvm/svm.c > @@ -3624,6 +3624,11 @@ static void svm_set_apic_access_page_addr(struct kvm > *kvm, hpa_t hpa) > return; > } > > +static void svm_set_nested_apic_page_migrated(struct kvm_vcpu 
*vcpu, bool > set) > +{ > + return; > +} > + > static int svm_vm_has_apicv(struct kvm *kvm) > { > return 0; > @@ -4379,6 +4384,7 @@ static struct kvm_x86_ops svm_x86_ops = { > .update_cr8_intercept = update_cr8_intercept, > .set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode, > .set_apic_access_page_addr = svm_set_apic_access_page_addr, > + .set_nested_apic_page_migrated = svm_set_nested_apic_page_migrated, > .vm_has_apicv = svm_vm_has_apicv, > .load_eoi_exitmap = svm_load_eoi_exitmap, > .hwapic_isr_update = svm_hwapic_isr_update, > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index da6d55d..9035fd1 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -379,6 +379,16 @@ struct nested_vmx { >* we must keep them pinned while L2 runs. >*/ > struct page *apic_access_page; > + /* > + * L1's apic access page can be migrated. When L1 and L2 are sharing > + * the apic access page, after the page is migrated when L2 is running, > + * we have to reload it to L1 vmcs before we enter L1. > + * > + * When the shared apic access page is migrated in L1 mode, we don't > + * need to do anything else because we reload apic access page each > + * time when entering L2 in prepare_vmcs02(). > + */ > + bool apic_access_page_migrated; > u64 msr_ia32_feature_control; > > struct hrtimer preemption_timer; > @@ -7098,6 +7108,12 @@ static void vmx_set_apic_access_page_addr(struct kvm > *kvm, hpa_t hpa) > vmcs_write64(APIC_ACCESS_ADDR, hpa); > } > > +static void vmx_set_nested_apic_page_migrated(struct kvm_vcpu *vcpu, bool > set) > +{ > + struct vcpu_vmx *vmx = to_vmx(vcpu); > + vmx->nested.apic_access_page_migrated = set; > +} > + > static void vmx_hwapic_isr_update(struct kvm *kvm, int isr) > { > u16 status; > @@ -8796,6 +8812,21 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, > u32 exit_reason, > } > > /* > + * When shared (L1 & L2) apic access page is migrated during L2 is > + * running, mmu_notifier will force to reload the page's hpa for L2 > + * vmcs. 
Need to reload it for L1 before entering L1. > + */ > + if (vmx->nested.apic_access_page_migrated) { > + /* > + * Do not call kvm_reload_apic_access_page() because we are now > + * in L2. We should not call make_all_cpus_request() to exit to > + * L0, otherwise we will reload for L2 vmcs again. > + */ > + kvm_reload_apic_access_page(vcpu->kvm); > + vmx->nested.apic_access_page_migrated = false; > + } I would just call kvm_reload_apic_access_page() unconditionally, and only if it proves to be a performance problem would I optimize it further. Vmexit emulation is pretty heavy, so I doubt one more vmwrite will be noticeable. -- Gleb.
Re: [PATCH v4 4/6] kvm, mem-hotplug: Reload L1' apic access page on migration in vcpu_enter_guest().
On Wed, Aug 27, 2014 at 06:17:39PM +0800, Tang Chen wrote: > apic access page is pinned in memory. As a result, it cannot be > migrated/hot-removed. > Actually, it is not necessary to be pinned. > > The hpa of apic access page is stored in VMCS APIC_ACCESS_ADDR pointer. When > the page is migrated, kvm_mmu_notifier_invalidate_page() will invalidate the > corresponding ept entry. This patch introduces a new vcpu request named > KVM_REQ_APIC_PAGE_RELOAD, and makes this request to all the vcpus at this > time, > and force all the vcpus exit guest, and re-enter guest till they updates the > VMCS > APIC_ACCESS_ADDR pointer to the new apic access page address, and updates > kvm->arch.apic_access_page to the new page. > > Signed-off-by: Tang Chen > --- > arch/x86/include/asm/kvm_host.h | 1 + > arch/x86/kvm/svm.c | 6 ++ > arch/x86/kvm/vmx.c | 6 ++ > arch/x86/kvm/x86.c | 15 +++ > include/linux/kvm_host.h| 2 ++ > virt/kvm/kvm_main.c | 12 > 6 files changed, 42 insertions(+) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 35171c7..514183e 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -739,6 +739,7 @@ struct kvm_x86_ops { > void (*hwapic_isr_update)(struct kvm *kvm, int isr); > void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap); > void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set); > + void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa); > void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector); > void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu); > int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c > index 1d941ad..f2eacc4 100644 > --- a/arch/x86/kvm/svm.c > +++ b/arch/x86/kvm/svm.c > @@ -3619,6 +3619,11 @@ static void svm_set_virtual_x2apic_mode(struct > kvm_vcpu *vcpu, bool set) > return; > } > > +static void svm_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa) > +{ > + 
return; > +} > + > static int svm_vm_has_apicv(struct kvm *kvm) > { > return 0; > @@ -4373,6 +4378,7 @@ static struct kvm_x86_ops svm_x86_ops = { > .enable_irq_window = enable_irq_window, > .update_cr8_intercept = update_cr8_intercept, > .set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode, > + .set_apic_access_page_addr = svm_set_apic_access_page_addr, > .vm_has_apicv = svm_vm_has_apicv, > .load_eoi_exitmap = svm_load_eoi_exitmap, > .hwapic_isr_update = svm_hwapic_isr_update, > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 63c4c3e..da6d55d 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -7093,6 +7093,11 @@ static void vmx_set_virtual_x2apic_mode(struct > kvm_vcpu *vcpu, bool set) > vmx_set_msr_bitmap(vcpu); > } > > +static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa) > +{ > + vmcs_write64(APIC_ACCESS_ADDR, hpa); > +} > + > static void vmx_hwapic_isr_update(struct kvm *kvm, int isr) > { > u16 status; > @@ -8910,6 +8915,7 @@ static struct kvm_x86_ops vmx_x86_ops = { > .enable_irq_window = enable_irq_window, > .update_cr8_intercept = update_cr8_intercept, > .set_virtual_x2apic_mode = vmx_set_virtual_x2apic_mode, > + .set_apic_access_page_addr = vmx_set_apic_access_page_addr, > .vm_has_apicv = vmx_vm_has_apicv, > .load_eoi_exitmap = vmx_load_eoi_exitmap, > .hwapic_irr_update = vmx_hwapic_irr_update, > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index e05bd58..96f4188 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -5989,6 +5989,19 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu) > kvm_apic_update_tmr(vcpu, tmr); > } > > +static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) > +{ > + /* > + * apic access page could be migrated. When the page is being migrated, > + * GUP will wait till the migrate entry is replaced with the new pte > + * entry pointing to the new page. 
> + */ > + vcpu->kvm->arch.apic_access_page = gfn_to_page(vcpu->kvm, > + APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); > + kvm_x86_ops->set_apic_access_page_addr(vcpu->kvm, > + page_to_phys(vcpu->kvm->arch.apic_access_page)); I am a little bit worried that here all vcpus write to vcpu->kvm->arch.apic_access_page without any locking. It is probably benign since pointer write is atomic on x86. Paolo? Do we even need apic_access_page? Why not call gfn_to_page(APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT) put_page() on rare occasions we need to know its address? > +} > + > /* > * Returns 1 to let __vcpu_run() continue the guest execution loop without > * exiting to the userspace. Otherwise, the value will be returned to the > @@
Re: GET_RNG_SEED hypercall ABI? (Re: [PATCH v5 0/5] random,x86,kvm: Rework arch RNG seeds and get some from kvm)
On Tue, Aug 26, 2014 at 04:58:34PM -0700, Andy Lutomirski wrote: > hpa pointed out that the ABI that I chose (an MSR from the KVM range > and a KVM cpuid bit) is unnecessarily KVM-specific. It would be nice > to allocate an MSR that everyone involved can agree on and, rather > than relying on a cpuid bit, just have the guest probe for the MSR. > CPUID part allows feature to be disabled for machine compatibility purpose during migration. Of course interface can explicitly state that one successful use of the MSR does not mean that next use will not result in a #GP, but that doesn't sound very elegant and is different from any other MSR out there. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] KVM-Use value reading from MSR when construct the eptp in VMX mode
On Mon, Aug 25, 2014 at 11:16:34AM +0800, Dennis Chen wrote: > On Sun, Aug 24, 2014 at 5:38 PM, Gleb Natapov wrote: > > On Sun, Aug 24, 2014 at 11:54:32AM +0800, Dennis Chen wrote: > >> This patch is used to construct the eptp in vmx mode with values > >> readed from MSR according to the intel x86 software developer's > >> manual. > >> > >> static u64 construct_eptp(unsigned long root_hpa) > >> { > >> -u64 eptp; > >> +u64 eptp, pwl; > >> + > >> +if (cpu_has_vmx_ept_4levels()) > >> +pwl = VMX_EPT_DEFAULT_GAW << VMX_EPT_GAW_EPTP_SHIFT; > >> +else { > >> +WARN(1, "Unsupported page-walk length of 4.\n"); > > Page-walk length of 4 is the only one that is supported. > > > Since there is a bit 6 in the IA32_VMX_EPT_VPID_CAP MSR indicating support > for the page-walk length, I think a sanity check is necessary. > But I just checked the code, it's already done in the hardware_setup() > function which will disable the ept feature if the page-walk length is not > 4. Gleb, any comments for the memory type check part? Looks fine, but are there CPUs out there that do not support WB for eptp? Since there have been no bug reports about it, I assume not. -- Gleb.
Re: [PATCH] KVM-Use value reading from MSR when construct the eptp in VMX mode
On Sun, Aug 24, 2014 at 11:54:32AM +0800, Dennis Chen wrote: > This patch is used to construct the eptp in vmx mode with values > readed from MSR according to the intel x86 software developer's > manual. > > Signed-off-by: Dennis Chen > --- > arch/x86/include/asm/vmx.h |1 + > arch/x86/kvm/vmx.c | 21 + > 2 files changed, 18 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h > index bcbfade..bf82a77 100644 > --- a/arch/x86/include/asm/vmx.h > +++ b/arch/x86/include/asm/vmx.h > @@ -417,6 +417,7 @@ enum vmcs_field { > #define VMX_EPT_GAW_EPTP_SHIFT3 > #define VMX_EPT_AD_ENABLE_BIT(1ull << 6) > #define VMX_EPT_DEFAULT_MT0x6ull > +#define VMX_EPT_UC_MT0x0ull > #define VMX_EPT_READABLE_MASK0x1ull > #define VMX_EPT_WRITABLE_MASK0x2ull > #define VMX_EPT_EXECUTABLE_MASK0x4ull > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index bfe11cf..7add5ce 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -3477,11 +3477,24 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, > unsigned long cr0) > > static u64 construct_eptp(unsigned long root_hpa) > { > -u64 eptp; > +u64 eptp, pwl; > + > +if (cpu_has_vmx_ept_4levels()) > +pwl = VMX_EPT_DEFAULT_GAW << VMX_EPT_GAW_EPTP_SHIFT; > +else { > +WARN(1, "Unsupported page-walk length of 4.\n"); Page-walk length of 4 is the only one that is supported. > +BUG(); > +} > + > +if (cpu_has_vmx_eptp_writeback()) > +eptp = VMX_EPT_DEFAULT_MT | pwl; > +else if (cpu_has_vmx_eptp_uncacheable()) > +eptp = VMX_EPT_UC_MT | pwl; > +else { > +WARN(1, "Unsupported memory type config in vmx eptp.\n"); > +BUG(); > +} > > -/* TODO write the value reading from MSR */ > -eptp = VMX_EPT_DEFAULT_MT | > -VMX_EPT_DEFAULT_GAW << VMX_EPT_GAW_EPTP_SHIFT; > if (enable_ept_ad_bits) > eptp |= VMX_EPT_AD_ENABLE_BIT; > eptp |= (root_hpa & PAGE_MASK); > -- > 1.7.9.5 -- Gleb. 
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Fri, Jul 18, 2014 at 05:05:20PM +0800, Tang Chen wrote: > Hi Gleb, > > On 07/17/2014 09:57 PM, Gleb Natapov wrote: > >On Thu, Jul 17, 2014 at 09:34:20PM +0800, Tang Chen wrote: > >>Hi Gleb, > >> > >>On 07/15/2014 08:40 PM, Gleb Natapov wrote: > >>.. > >>>> > >>>>And yes, we have the problem you said here. We can migrate the page while > >>>>L2 > >>>>vm is running. > >>>>So I think we should enforce L2 vm to exit to L1. Right ? > >>>> > >>>We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so > >>>if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for L1 > >>>too. > >>> > >> > >>Sorry, I think I don't quite understand the procedure you are talking about > >>here. > >> > >>Referring to the code, I think we have three machines: L0(host), L1 and L2. > >>And we have two types of vmexit: L2->L1 and L2->L0. Right ? > >> > >>We are now talking about this case: L2 and L1 shares the apic page. > >> > >>Using patch 5/5, when apic page is migrated on L0, mmu_notifier will notify > >>L1, > >>and update L1's VMCS. At this time, we are in L0, not L2. Why cannot we > >Using patch 5/5, when apic page is migrated on L0, mmu_notifier will notify > >L1 or L2 VMCS depending on which one happens to be running right now. > >If it is L1 then L2's VMCS will be updated during vmentry emulation, > > OK, this is easy to understand. > > >if it is > >L2 we need to request reload during vmexit emulation to make sure L1's VMCS > >is > >updated. > > > > I'm a little confused here. In patch 5/5, I called make_all_cpus_request() > to > force all vcpus exit to host. If we are in L2, where will the vcpus exit to > ? > L1 or L0 ? > L0. CPU always exits to L0. L1->L2 exit is emulated by nested_vmx_vmexit(). -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Thu, Jul 17, 2014 at 09:34:20PM +0800, Tang Chen wrote: > Hi Gleb, > > On 07/15/2014 08:40 PM, Gleb Natapov wrote: > .. > >> > >>And yes, we have the problem you said here. We can migrate the page while L2 > >>vm is running. > >>So I think we should enforce L2 vm to exit to L1. Right ? > >> > >We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so > >if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for L1 > >too. > > > > Sorry, I think I don't quite understand the procedure you are talking about > here. > > Referring to the code, I think we have three machines: L0(host), L1 and L2. > And we have two types of vmexit: L2->L1 and L2->L0. Right ? > > We are now talking about this case: L2 and L1 shares the apic page. > > Using patch 5/5, when apic page is migrated on L0, mmu_notifier will notify > L1, > and update L1's VMCS. At this time, we are in L0, not L2. Why cannot we Using patch 5/5, when apic page is migrated on L0, mmu_notifier will notify L1 or L2 VMCS depending on which one happens to be running right now. If it is L1 then L2's VMCS will be updated during vmentry emulation, if it is L2 we need to request reload during vmexit emulation to make sure L1's VMCS is updated. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] random,x86,kvm: Add and use MSR_KVM_GET_RNG_SEED
On Wed, Jul 16, 2014 at 09:13:23AM -0700, H. Peter Anvin wrote: > On 07/16/2014 09:08 AM, Paolo Bonzini wrote: > > Il 16/07/2014 18:03, H. Peter Anvin ha scritto: > >> I suggested emulating RDRAND *but not set the CPUID bit*. We already > >> developed a protocol in KVM/Qemu to enumerate emulated features (created > >> for MOVBE as I recall), specifically to service the semantic "feature X > >> will work but will be substantially slower than normal." > > > > But those will set the CPUID bit. There is currently no way for KVM > > guests to know if a CPUID bit is real or emulated. > > > > OK, so there wasn't any protocol implemented in the end. I sit corrected. > The protocol that was implemented is between qemu and kvm, not between kvm and the guest. -- Gleb.
Re: [PATCH 0/4] random,x86,kvm: Add and use MSR_KVM_GET_RNG_SEED
On Wed, Jul 16, 2014 at 04:32:19PM +0200, Paolo Bonzini wrote: > Il 16/07/2014 16:07, Andy Lutomirski ha scritto: > >This patch has nothing whatsoever to do with how much I trust the CPU > >vs the hypervisor. It's for the enormous installed base of machines > >without RDRAND. > > Ok. I think an MSR is fine, though I don't think it's useful for the guest > to use it if it already has RDRAND and/or RDSEED. > Agree. It is unfortunate that we are adding PV interfaces for hardware that will be extinct in a couple of years, though :( -- Gleb.
Re: [PATCH 0/4] random,x86,kvm: Add and use MSR_KVM_GET_RNG_SEED
On Wed, Jul 16, 2014 at 09:10:27AM +0200, Daniel Borkmann wrote: > On 07/16/2014 08:41 AM, Gleb Natapov wrote: > >On Tue, Jul 15, 2014 at 07:48:06PM -0700, Andy Lutomirski wrote: > >>virtio-rng is both too complicated and insufficient for initial rng > >>seeding. It's far too complicated to use for KASLR or any other > >>early boot random number needs. It also provides /dev/random-style > >>bits, which means that making guest boot wait for virtio-rng is > >>unacceptably slow, and doing it asynchronously means that > >>/dev/urandom might be predictable when userspace starts. > >> > >>This introduces a very simple synchronous mechanism to get > >>/dev/urandom-style bits. > > > >Why can't you use RDRAND instruction for that? > > You mean using it directly? I think simply for the very same reasons > as in c2557a303a ... So you trust your hypervisor vendor more than you trust your CPU vendor? :) -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] random,x86,kvm: Add and use MSR_KVM_GET_RNG_SEED
On Tue, Jul 15, 2014 at 07:48:06PM -0700, Andy Lutomirski wrote: > virtio-rng is both too complicated and insufficient for initial rng > seeding. It's far too complicated to use for KASLR or any other > early boot random number needs. It also provides /dev/random-style > bits, which means that making guest boot wait for virtio-rng is > unacceptably slow, and doing it asynchronously means that > /dev/urandom might be predictable when userspace starts. > > This introduces a very simple synchronous mechanism to get > /dev/urandom-style bits. Why can't you use RDRAND instruction for that? > > This is a KVM change: am I supposed to write a unit test somewhere? > > Andy Lutomirski (4): > x86,kvm: Add MSR_KVM_GET_RNG_SEED and a matching feature bit > random,x86: Add arch_get_slow_rng_u64 > random: Seed pools from arch_get_slow_rng_u64 at startup > x86,kaslr: Use MSR_KVM_GET_RNG_SEED for KASLR if available > > Documentation/virtual/kvm/cpuid.txt | 3 +++ > arch/x86/Kconfig | 4 > arch/x86/boot/compressed/aslr.c | 27 +++ > arch/x86/include/asm/archslowrng.h | 30 ++ > arch/x86/include/uapi/asm/kvm_para.h | 2 ++ > arch/x86/kernel/kvm.c| 22 ++ > arch/x86/kvm/cpuid.c | 3 ++- > arch/x86/kvm/x86.c | 4 > drivers/char/random.c| 14 +- > include/linux/random.h | 9 + > 10 files changed, 116 insertions(+), 2 deletions(-) > create mode 100644 arch/x86/include/asm/archslowrng.h > > -- > 1.9.3 > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Tue, Jul 15, 2014 at 08:54:01PM +0800, Tang Chen wrote: > On 07/15/2014 08:40 PM, Gleb Natapov wrote: > >On Tue, Jul 15, 2014 at 08:28:22PM +0800, Tang Chen wrote: > >>On 07/15/2014 08:09 PM, Gleb Natapov wrote: > >>>On Tue, Jul 15, 2014 at 01:52:40PM +0200, Jan Kiszka wrote: > >>.. > >>>> > >>>>I cannot follow your concerns yet. Specifically, how should > >>>>APIC_ACCESS_ADDR (the VMCS field, right?) change while L2 is running? We > >>>>currently pin/unpin on L1->L2/L2->L1, respectively. Or what do you mean? > >>>> > >>>I am talking about this case: > >>> if (cpu_has_secondary_exec_ctrls()) {a > >>> } else { > >>> exec_control |= > >>> SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; > >>> vmcs_write64(APIC_ACCESS_ADDR, > >>> page_to_phys(vcpu->kvm->arch.apic_access_page)); > >>> } > >>>We do not pin here. > >>> > >> > >>Hi Gleb, > >> > >> > >>7905 if (exec_control& > >>SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) { > >>.. > >>7912 if (vmx->nested.apic_access_page) /* shouldn't > >>happen */ > >>7913 nested_release_page(vmx->nested.apic_access_page); > >>7914 vmx->nested.apic_access_page = > >>7915 nested_get_page(vcpu, > >>vmcs12->apic_access_addr); > >> > >>I thought you were talking about the problem here. We pin > >>vmcs12->apic_access_addr > >>in memory. And I think we should do the same thing to this page as to L1 vm. > >>Right ? > >Nested kvm pins a lot of pages, it will probably be not easy to handle all > >of them, > >so for now I am concerned with non nested case only (but nested should > >continue to > >work obviously, just pin pages like it does now). > > True. I will work on it. > > And also, when using PCI passthrough, kvm_pin_pages() also pins some pages. > This is > also in my todo list. Those pages are (almost) directly accessible by assigned PCI devices, I am not sure this is even doable. > > But sorry, a little strange. I didn't find where vmcs12->apic_access_addr is > allocated > or initialized... Would you please tell me ? 
handle_vmwrite() writes it when guest is executing vmwrite(APIC_ACCESS_ADDR); > > > > >> > >>.. > >>7922 if (!vmx->nested.apic_access_page) > >>7923 exec_control&= > >>7924 ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; > >>7925 else > >>7926 vmcs_write64(APIC_ACCESS_ADDR, > >>7927 page_to_phys(vmx->nested.apic_access_page)); > >>7928 } else if > >>(vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) { > >>7929 exec_control |= > >>7930 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; > >>7931 vmcs_write64(APIC_ACCESS_ADDR, > >>7932 page_to_phys(vcpu->kvm->arch.apic_access_page)); > >>7933 } > >> > >>And yes, we have the problem you said here. We can migrate the page while L2 > >>vm is running. > >>So I think we should enforce L2 vm to exit to L1. Right ? > >> > >We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so > >if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for L1 > >too. > > > > apic pages for L2 and L1 are not the same page, right ? > If L2 guest enable apic access page then they are different, otherwise they are the same. > I think, just like we are doing in patch 5/5, we cannot wait for the next > L2->L1 vmexit. > We should enforce a L2->L1 vmexit in mmu_notifier, just like > make_all_cpus_request() does. > > Am I right ? > I do not see why forcing APIC_ACCESS_ADDR reload during L2->L1 exit is not enough. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Tue, Jul 15, 2014 at 03:10:15PM +0200, Jan Kiszka wrote: > On 2014-07-15 14:40, Gleb Natapov wrote: > >> > >> .. > >> 7922 if (!vmx->nested.apic_access_page) > >> 7923 exec_control &= > >> 7924 ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; > >> 7925 else > >> 7926 vmcs_write64(APIC_ACCESS_ADDR, > >> 7927 page_to_phys(vmx->nested.apic_access_page)); > >> 7928 } else if > >> (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) { > >> 7929 exec_control |= > >> 7930 SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; > >> 7931 vmcs_write64(APIC_ACCESS_ADDR, > >> 7932 page_to_phys(vcpu->kvm->arch.apic_access_page)); > >> 7933 } > >> > >> And yes, we have the problem you said here. We can migrate the page while > >> L2 > >> vm is running. > >> So I think we should enforce L2 vm to exit to L1. Right ? > >> > > We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so > > if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for L1 > > too. > > How should this host-managed VMCS field possibly change while L2 is running? > That what "Do not pin apic access page in memory" patch is doing. It changes APIC_ACCESS_ADDR of a current vmcs. It kicks vcpu out of a guest mode of course, but vcpu may still be in L2 at this point, so only L2 vmcs will get new APIC_ACCESS_ADDR pointer, L1 vmcs will still have an old one. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Tue, Jul 15, 2014 at 08:28:22PM +0800, Tang Chen wrote:
> On 07/15/2014 08:09 PM, Gleb Natapov wrote:
> > On Tue, Jul 15, 2014 at 01:52:40PM +0200, Jan Kiszka wrote:
> ..
> > > I cannot follow your concerns yet. Specifically, how should
> > > APIC_ACCESS_ADDR (the VMCS field, right?) change while L2 is running? We
> > > currently pin/unpin on L1->L2/L2->L1, respectively. Or what do you mean?
> > >
> > I am talking about this case:
> >         if (cpu_has_secondary_exec_ctrls()) {
> >         } else {
> >                 exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> >                 vmcs_write64(APIC_ACCESS_ADDR,
> >                              page_to_phys(vcpu->kvm->arch.apic_access_page));
> >         }
> > We do not pin here.
> >
> Hi Gleb,
>
> 7905         if (exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) {
> ..
> 7912                 if (vmx->nested.apic_access_page) /* shouldn't happen */
> 7913                         nested_release_page(vmx->nested.apic_access_page);
> 7914                 vmx->nested.apic_access_page =
> 7915                         nested_get_page(vcpu, vmcs12->apic_access_addr);
>
> I thought you were talking about the problem here. We pin
> vmcs12->apic_access_addr in memory. And I think we should do the same thing
> to this page as to L1 vm. Right ?
>
Nested kvm pins a lot of pages, it will probably be not easy to handle all of
them, so for now I am concerned with non nested case only (but nested should
continue to work obviously, just pin pages like it does now).

> ..
> 7922                 if (!vmx->nested.apic_access_page)
> 7923                         exec_control &=
> 7924                                 ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> 7925                 else
> 7926                         vmcs_write64(APIC_ACCESS_ADDR,
> 7927                                 page_to_phys(vmx->nested.apic_access_page));
> 7928         } else if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
> 7929                 exec_control |=
> 7930                         SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> 7931                 vmcs_write64(APIC_ACCESS_ADDR,
> 7932                         page_to_phys(vcpu->kvm->arch.apic_access_page));
> 7933         }
>
> And yes, we have the problem you said here. We can migrate the page while L2
> vm is running. So I think we should enforce L2 vm to exit to L1. Right ?
>
We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so if
APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for L1 too.

--
	Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Tue, Jul 15, 2014 at 01:52:40PM +0200, Jan Kiszka wrote:
> On 2014-07-14 16:58, Gleb Natapov wrote:
> >>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>>> index ffbe557..7080eda 100644
> >>>> --- a/arch/x86/kvm/x86.c
> >>>> +++ b/arch/x86/kvm/x86.c
> >>>> @@ -5929,6 +5929,18 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
> >>>>  	kvm_apic_update_tmr(vcpu, tmr);
> >>>>  }
> >>>>
> >>>> +static void vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> >>>> +{
> >>>> +	/*
> >>>> +	 * When the page is being migrated, GUP will wait till the migrate
> >>>> +	 * entry is replaced with the new pte entry pointing to the new page.
> >>>> +	 */
> >>>> +	struct page *page = gfn_to_page_no_pin(vcpu->kvm,
> >>>> +				APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> >>> If you do not use kvm->arch.apic_access_page to get current address why
> >>> not drop it entirely?
> >>>
> >> I should also update kvm->arch.apic_access_page here. It is used in other
> >> places in kvm, so I don't think we should drop it. Will update the patch.
> >>
> > What other places? The only other place I see is in nested kvm code and
> > you can call gfn_to_page_no_pin() there instead of using
> > kvm->arch.apic_access_page directly. But as far as I see nested kvm code
> > cannot handle change of APIC_ACCESS_ADDR phys address. If APIC_ACCESS_ADDR
> > changes during nested guest run, non nested vmcs will still have old
> > physical address. One way to fix that is to set KVM_REQ_APIC_PAGE_RELOAD
> > during nested exit.
>
> I cannot follow your concerns yet. Specifically, how should
> APIC_ACCESS_ADDR (the VMCS field, right?) change while L2 is running? We
> currently pin/unpin on L1->L2/L2->L1, respectively. Or what do you mean?
>
I am talking about this case:
        if (cpu_has_secondary_exec_ctrls()) {
        } else {
                exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
                vmcs_write64(APIC_ACCESS_ADDR,
                             page_to_phys(vcpu->kvm->arch.apic_access_page));
        }
We do not pin here.

--
	Gleb.
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Tue, Jul 15, 2014 at 03:10:15PM +0200, Jan Kiszka wrote:
> On 2014-07-15 14:40, Gleb Natapov wrote:
> > > ..
> > > 7922                 if (!vmx->nested.apic_access_page)
> > > 7923                         exec_control &=
> > > 7924                                 ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> > > 7925                 else
> > > 7926                         vmcs_write64(APIC_ACCESS_ADDR,
> > > 7927                                 page_to_phys(vmx->nested.apic_access_page));
> > > 7928         } else if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
> > > 7929                 exec_control |=
> > > 7930                         SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> > > 7931                 vmcs_write64(APIC_ACCESS_ADDR,
> > > 7932                         page_to_phys(vcpu->kvm->arch.apic_access_page));
> > > 7933         }
> > >
> > > And yes, we have the problem you said here. We can migrate the page
> > > while L2 vm is running. So I think we should enforce L2 vm to exit to
> > > L1. Right ?
> > >
> > We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation, so
> > if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded for
> > L1 too.
>
> How should this host-managed VMCS field possibly change while L2 is running?
>
That's what the "Do not pin apic access page in memory" patch is doing. It
changes APIC_ACCESS_ADDR of a current vmcs. It kicks vcpu out of a guest mode
of course, but vcpu may still be in L2 at this point, so only L2 vmcs will get
new APIC_ACCESS_ADDR pointer, L1 vmcs will still have an old one.

--
	Gleb.
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
On Tue, Jul 15, 2014 at 08:54:01PM +0800, Tang Chen wrote:
> On 07/15/2014 08:40 PM, Gleb Natapov wrote:
> > On Tue, Jul 15, 2014 at 08:28:22PM +0800, Tang Chen wrote:
> > > On 07/15/2014 08:09 PM, Gleb Natapov wrote:
> > > > On Tue, Jul 15, 2014 at 01:52:40PM +0200, Jan Kiszka wrote:
> > > > ..
> > > > > I cannot follow your concerns yet. Specifically, how should
> > > > > APIC_ACCESS_ADDR (the VMCS field, right?) change while L2 is
> > > > > running? We currently pin/unpin on L1->L2/L2->L1, respectively. Or
> > > > > what do you mean?
> > > > >
> > > > I am talking about this case:
> > > >         if (cpu_has_secondary_exec_ctrls()) {
> > > >         } else {
> > > >                 exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> > > >                 vmcs_write64(APIC_ACCESS_ADDR,
> > > >                              page_to_phys(vcpu->kvm->arch.apic_access_page));
> > > >         }
> > > > We do not pin here.
> > > >
> > > Hi Gleb,
> > >
> > > 7905         if (exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) {
> > > ..
> > > 7912                 if (vmx->nested.apic_access_page) /* shouldn't happen */
> > > 7913                         nested_release_page(vmx->nested.apic_access_page);
> > > 7914                 vmx->nested.apic_access_page =
> > > 7915                         nested_get_page(vcpu, vmcs12->apic_access_addr);
> > >
> > > I thought you were talking about the problem here. We pin
> > > vmcs12->apic_access_addr in memory. And I think we should do the same
> > > thing to this page as to L1 vm. Right ?
> > >
> > Nested kvm pins a lot of pages, it will probably be not easy to handle
> > all of them, so for now I am concerned with non nested case only (but
> > nested should continue to work obviously, just pin pages like it does
> > now).
>
> True. I will work on it. And also, when using PCI passthrough,
> kvm_pin_pages() also pins some pages. This is also in my todo list.
>
Those pages are (almost) directly accessible by assigned PCI devices, I am
not sure this is even doable.

> But sorry, a little strange. I didn't find where vmcs12->apic_access_addr
> is allocated or initialized... Would you please tell me ?
>
handle_vmwrite() writes it when guest is executing vmwrite(APIC_ACCESS_ADDR);

> > > ..
> > > 7922                 if (!vmx->nested.apic_access_page)
> > > 7923                         exec_control &=
> > > 7924                                 ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> > > 7925                 else
> > > 7926                         vmcs_write64(APIC_ACCESS_ADDR,
> > > 7927                                 page_to_phys(vmx->nested.apic_access_page));
> > > 7928         } else if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
> > > 7929                 exec_control |=
> > > 7930                         SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> > > 7931                 vmcs_write64(APIC_ACCESS_ADDR,
> > > 7932                         page_to_phys(vcpu->kvm->arch.apic_access_page));
> > > 7933         }
> > >
> > > And yes, we have the problem you said here. We can migrate the page
> > > while L2 vm is running. So I think we should enforce L2 vm to exit to
> > > L1. Right ?
> > >
> > We can request APIC_ACCESS_ADDR reload during L2->L1 vmexit emulation,
> > so if APIC_ACCESS_ADDR changes while L2 is running it will be reloaded
> > for L1 too.
>
> apic pages for L2 and L1 are not the same page, right ?
>
If L2 guest enable apic access page then they are different, otherwise they
are the same.

> I think, just like we are doing in patch 5/5, we cannot wait for the next
> L2->L1 vmexit. We should enforce a L2->L1 vmexit in mmu_notifier, just like
> make_all_cpus_request() does. Am I right ?
>
I do not see why forcing APIC_ACCESS_ADDR reload during L2->L1 exit is not
enough.

--
	Gleb.
Re: [PATCH v2 5/5] kvm, mem-hotplug: Do not pin apic access page in memory.
CCing Jan to check my nested kvm findings below.

On Mon, Jul 14, 2014 at 03:57:09PM +0800, Tang Chen wrote:
> Hi Gleb,
>
> Thanks for the reply. Please see below.
>
> On 07/12/2014 04:04 PM, Gleb Natapov wrote:
> > On Tue, Jul 08, 2014 at 09:01:32PM +0800, Tang Chen wrote:
> > > apic access page is pinned in memory. As a result, it cannot be
> > > migrated/hot-removed. Actually, it is not necessary to be pinned.
> > >
> > > The hpa of apic access page is stored in VMCS APIC_ACCESS_ADDR pointer.
> > > When the page is migrated, kvm_mmu_notifier_invalidate_page() will
> > > invalidate the corresponding ept entry. This patch introduces a new vcpu
> > > request named KVM_REQ_APIC_PAGE_RELOAD, and makes this request to all
> > > the vcpus at this time, and force all the vcpus exit guest, and re-enter
> > > guest till they updates the VMCS APIC_ACCESS_ADDR pointer to the new
> > > apic access page address, and updates kvm->arch.apic_access_page to the
> > > new page.
> > >
> > By default kvm Linux guest uses x2apic, so APIC_ACCESS_ADDR mechanism
> > is not used since no MMIO access to APIC is ever done. Have you tested
> > this with "-cpu modelname,-x2apic" qemu flag?
>
> I used the following commandline to test the patches.
>
> # /usr/libexec/qemu-kvm -m 512M -hda /home/tangchen/xxx.img -enable-kvm -smp 2
>
That most likely uses x2apic.

> And I think the guest used APIC_ACCESS_ADDR mechanism because the previous
> patch-set has some problem which will happen when the apic page is accessed.
> And it did happen.
>
> I'll test this patch-set with "-cpu modelname,-x2apic" flag.
>
Replace "modelname" with one of supported model names such as nehalem of
course :)

> > > Signed-off-by: Tang Chen
> > > ---
> > >  arch/x86/include/asm/kvm_host.h |  1 +
> > >  arch/x86/kvm/mmu.c              | 11 +++
> > >  arch/x86/kvm/svm.c              |  6 ++
> > >  arch/x86/kvm/vmx.c              |  8 +++-
> > >  arch/x86/kvm/x86.c              | 14 ++
> > >  include/linux/kvm_host.h        |  2 ++
> > >  virt/kvm/kvm_main.c             | 12
> > >  7 files changed, 53 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 62f973e..9ce6bfd 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -737,6 +737,7 @@ struct kvm_x86_ops {
> > >  	void (*hwapic_isr_update)(struct kvm *kvm, int isr);
> > >  	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
> > >  	void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
> > > +	void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa);
> > >  	void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
> > >  	void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
> > >  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > index 9314678..551693d 100644
> > > --- a/arch/x86/kvm/mmu.c
> > > +++ b/arch/x86/kvm/mmu.c
> > > @@ -3427,6 +3427,17 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
> > >  			 level, gfn, pfn, prefault);
> > >  	spin_unlock(&vcpu->kvm->mmu_lock);
> > >
> > > +	/*
> > > +	 * apic access page could be migrated. When the guest tries to access
> > > +	 * the apic access page, ept violation will occur, and we can use GUP
> > > +	 * to find the new page.
> > > +	 *
> > > +	 * GUP will wait till the migrate entry be replaced with the new page.
> > > +	 */
> > > +	if (gpa == APIC_DEFAULT_PHYS_BASE)
> > > +		vcpu->kvm->arch.apic_access_page = gfn_to_page_no_pin(vcpu->kvm,
> > > +			APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> > Shouldn't you make KVM_REQ_APIC_PAGE_RELOAD request here?
> >
> I don't think we need to make KVM_REQ_APIC_PAGE_RELOAD request here.
>
> In kvm_mmu_notifier_invalidate_page() I made the request. And the handler
> called gfn_to_page_no_pin() to get the new page, which will wait till the
> migration finished. And then updated the VMCS APIC_ACCESS_ADDR pointer. So,
> when the vcpus were forced to exit the guest mode, they would wait till the
> VMCS APIC_ACCESS_ADDR pointer was updated.
>
> As a result, we don't need to make the request here.
>
OK, I do not see what's the purpose of the code here then.
Re: [RESEND PATCH v2 4/5] kvm: Remove ept_identity_pagetable from struct kvm_arch.
On Mon, Jul 14, 2014 at 05:17:04PM +0800, Tang Chen wrote:
> On 07/12/2014 03:44 PM, Gleb Natapov wrote:
> > On Wed, Jul 09, 2014 at 10:08:03AM +0800, Tang Chen wrote:
> > > kvm_arch->ept_identity_pagetable holds the ept identity pagetable page.
> > > But it is never used to refer to the page at all.
> > >
> > > In vcpu initialization, it indicates two things:
> > > 1. indicates if ept page is allocated
> > > 2. indicates if a memory slot for identity page is initialized
> > >
> > > Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the
> > > ept identity pagetable is initialized. So we can remove
> > > ept_identity_pagetable.
> > >
> > > Signed-off-by: Tang Chen
> > > ---
> > >  arch/x86/include/asm/kvm_host.h |  1 -
> > >  arch/x86/kvm/vmx.c              | 25 +++--
> > >  2 files changed, 11 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 4931415..62f973e 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -578,7 +578,6 @@ struct kvm_arch {
> > >
> > >  	gpa_t wall_clock;
> > >
> > > -	struct page *ept_identity_pagetable;
> > >  	bool ept_identity_pagetable_done;
> > >  	gpa_t ept_identity_map_addr;
> > >
> > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > > index 0918635e..fe2e5f4 100644
> > > --- a/arch/x86/kvm/vmx.c
> > > +++ b/arch/x86/kvm/vmx.c
> > > @@ -741,6 +741,7 @@ static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu *vcpu);
> > >  static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx);
> > >  static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx);
> > >  static bool vmx_mpx_supported(void);
> > > +static int alloc_identity_pagetable(struct kvm *kvm);
> > >
> > >  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
> > >  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> > > @@ -3921,21 +3922,21 @@ out:
> > >
> > >  static int init_rmode_identity_map(struct kvm *kvm)
> > >  {
> > > -	int i, idx, r, ret;
> > > +	int i, idx, r, ret = 0;
> > >  	pfn_t identity_map_pfn;
> > >  	u32 tmp;
> > >
> > >  	if (!enable_ept)
> > >  		return 1;
> > > -	if (unlikely(!kvm->arch.ept_identity_pagetable)) {
> > > -		printk(KERN_ERR "EPT: identity-mapping pagetable "
> > > -			"haven't been allocated!\n");
> > > -		return 0;
> > > -	}
> > >  	if (likely(kvm->arch.ept_identity_pagetable_done))
> > >  		return 1;
> > > -	ret = 0;
> > >  	identity_map_pfn = kvm->arch.ept_identity_map_addr >> PAGE_SHIFT;
> > > +
> > > +	mutex_lock(&kvm->slots_lock);
> > Why move this out of alloc_identity_pagetable()?
> >
> Referring to the original code, I think mutex_lock(&kvm->slots_lock) is used
> to protect kvm->arch.ept_identity_pagetable. If two or more threads try to
> modify it at the same time, the mutex ensures that the identity table is
> only allocated once.
>
> Now, we dropped kvm->arch.ept_identity_pagetable. And use
> kvm->arch.ept_identity_pagetable_done to check if the identity table is
> allocated and initialized. So we should protect memory slot operation in
> alloc_identity_pagetable() and kvm->arch.ept_identity_pagetable_done with
> this mutex.
>
> Of course, I can see that the name "slots_lock" indicates that it may be
> used to protect the memory slot operation only. Maybe move it out here is
> not suitable.
>
> If I'm wrong, please tell me.
>
No, you are right that besides memory slot creation slots_lock protects
checking of ept_identity_pagetable here, but after your patch
ept_identity_pagetable_done is tested outside of slots_lock so the allocation
can happen twice, no?

--
	Gleb.