Re: [PATCH V6 0/6] Fast mmio eventfd fixes

2015-11-09 Thread Jason Wang


On 11/10/2015 04:19 AM, Michael S. Tsirkin wrote:
> On Mon, Nov 09, 2015 at 12:35:45PM +0800, Jason Wang wrote:
>> > 
>> > 
>> > On 11/09/2015 01:11 AM, Michael S. Tsirkin wrote:
>>> > > On Tue, Sep 15, 2015 at 02:41:53PM +0800, Jason Wang wrote:
 > >> Hi:
 > >>
 > >> This series fixes two issues of fast mmio eventfd:
 > >>
 > >> 1) A single iodev instance were registerd on two buses: KVM_MMIO_BUS
 > >>and KVM_FAST_MMIO_BUS. This will cause double in
 > >>ioeventfd_destructor()
 > >> 2) A zero length iodev on KVM_MMIO_BUS will never be found but
 > >>kvm_io_bus_cmp(). This will lead e.g the eventfd will be trapped by
 > >>qemu instead of host.
 > >>
 > >> 1 is fixed by allocating two instances of iodev and introduce a new
 > >> capability for userspace. 2 is fixed by ignore the actual length if
 > >> the length of iodev is zero in kvm_io_bus_cmp().
 > >>
 > >> Please review.
 > >> Changes from V5:
 > >> - move patch of explicitly checking for KVM_MMIO_BUS to patch 1 and
 > >>   remove the unnecessary checks
 > >> - even more grammar and typo fixes
 > >> - rabase to kvm.git
 > >> - document KVM_CAP_FAST_MMIO
>>> > > What's up with userspace using this capability?
>> > 
>> > It was renamed to KVM_CAP_IOEVENTFD_ANY_LENGTH.
>> > 
>>> > > Did patches ever get posted?
>> > 
>> > See https://lkml.org/lkml/2015/9/28/208
> Talking about userspace here.
> QEMU freeze is approaching, it really should
> use this to avoid regressions.
>

The patches were posted at
http://lists.gnu.org/archive/html/qemu-devel/2015-11/msg01276.html

(you were in cc list)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Benjamin Herrenschmidt
On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
> 
> Which leaves the special case of Xen, where even preexisting devices
> don't bypass the IOMMU.  Can we keep this specific to powerpc and
> sparc?  On x86, this problem is basically nonexistent, since the IOMMU
> is properly self-describing.
> 
> IOW, I think that on x86 we should assume that all virtio devices
> honor the IOMMU.

You don't like performances ? :-)

Cheers,
Ben.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Andy Lutomirski
On Mon, Nov 9, 2015 at 9:28 PM, Benjamin Herrenschmidt
 wrote:
> On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
>>
>> /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF.
>> */
>> static const struct pci_device_id virtio_pci_id_table[] = {
>> { PCI_DEVICE(0x1af4, PCI_ANY_ID) },
>> { 0 }
>> };
>>
>> Can we match on that range?
>
> We can, but the problem remains, how do we differenciate an existing
> device that does bypass vs. a newer one that needs the IOMMU and thus
> doesn't have the new "bypass" property in the device-tree.
>

We could do it the other way around: on powerpc, if a PCI device is in
that range and doesn't have the "bypass" property at all, then it's
assumed to bypass the IOMMU.  This means that everything that
currently works continues working.  If someone builds a physical
virtio device or uses another system in PCIe target mode speaking
virtio, then it won't work until they upgrade their firmware to set
bypass=0.  Meanwhile everyone using hypothetical new QEMU also gets
bypass=0 and no ambiguity.

vfio will presumably notice the bypass and correctly refuse to map any
current virtio devices.

Would that work?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Andy Lutomirski
On Mon, Nov 9, 2015 at 9:26 PM, Benjamin Herrenschmidt
 wrote:
> On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
>>
>> Which leaves the special case of Xen, where even preexisting devices
>> don't bypass the IOMMU.  Can we keep this specific to powerpc and
>> sparc?  On x86, this problem is basically nonexistent, since the IOMMU
>> is properly self-describing.
>>
>> IOW, I think that on x86 we should assume that all virtio devices
>> honor the IOMMU.
>
> You don't like performances ? :-)

This should have basically no effect.  Every non-experimental x86
virtio setup in existence either doesn't work at all (Xen) or has DMA
ops that are no-ops.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Benjamin Herrenschmidt
On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
> 
> /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF.
> */
> static const struct pci_device_id virtio_pci_id_table[] = {
>     { PCI_DEVICE(0x1af4, PCI_ANY_ID) },
>     { 0 }
> };
> 
> Can we match on that range?

We can, but the problem remains, how do we differenciate an existing
device that does bypass vs. a newer one that needs the IOMMU and thus
doesn't have the new "bypass" property in the device-tree.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Benjamin Herrenschmidt
On Mon, 2015-11-09 at 16:46 -0800, Andy Lutomirski wrote:
> The problem here is that in some of the problematic cases the virtio
> driver may not even be loaded.  If someone runs an L1 guest with an
> IOMMU-bypassing virtio device and assigns it to L2 using vfio, then
> *boom* L1 crashes.  (Same if, say, DPDK gets used, I think.)
> 
> >
> > The only way out of this while keeping the "platform" stuff would be to
> > also bump some kind of version in the virtio config (or PCI header). I
> > have no other way to differenciate between "this is an old qemu that
> > doesn't do the 'bypass property' yet" from "this is a virtio device
> > that doesn't bypass".
> >
> > Any better idea ?
> 
> I'd suggest that, in the absence of the new DT binding, we assume that
> any PCI device with the virtio vendor ID is passthrough on powerpc.  I
> can do this in the virtio driver, but if it's in the platform code
> then vfio gets it right too (i.e. fails to load).

The problem is there isn't *a* virtio vendor ID. It's the RedHat vendor
ID which will be used by more than just virtio, so we need to
specifically list the devices.

Additionally, that still means that once we have a virtio device that
actually uses the iommu, powerpc will not work since the "workaround"
above will kick in.

The "in absence of the new DT binding" doesn't make that much sense.

Those platforms use device-trees defined since the dawn of ages by
actual open firmware implementations, they either have no iommu
representation in there (Macs, the platform code hooks it all up) or
have various properties related to the iommu but no concept of "bypass"
in there.

We can *add* a new property under some circumstances that indicates a
bypass on a per-device basis, however that doesn't completely solve it:

  - As I said above, what does the absence of that property mean ? An
old qemu that does bypass on all virtio or a new qemu trying to tell
you that the virtio device actually does use the iommu (or some other
environment that isn't qemu) ?

  - On things like macs, the device-tree is generated by openbios, it
would have to have some added logic to try to figure that out, which
means it needs to know *via different means* that some or all virtio
devices bypass the iommu.

I thus go back to my original statement, it's a LOT easier to handle if
the device itself is self describing, indicating whether it is set to
bypass a host iommu or not. For L1->L2, well, that wouldn't be the
first time qemu/VFIO plays tricks with the passed through device
configuration space...

Note that the above can be solved via some kind of compromise: The
device self describes the ability to honor the iommu, along with the
property (or ACPI table entry) that indicates whether or not it does.

IE. We could use the revision or ProgIf field of the config space for
example. Or something in virtio config. If it's an "old" device, we
know it always bypass. If it's a new device, we know it only bypasses
if the corresponding property is in. I still would have to sort out the
openbios case for mac among others but it's at least a workable
direction.

BTW. Don't you have a similar problem on x86 that today qemu claims
that everything honors the iommu in ACPI ?

Unless somebody can come up with a better idea...

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Andy Lutomirski
On Mon, Nov 9, 2015 at 6:04 PM, Benjamin Herrenschmidt
 wrote:
> On Mon, 2015-11-09 at 16:46 -0800, Andy Lutomirski wrote:
>> The problem here is that in some of the problematic cases the virtio
>> driver may not even be loaded.  If someone runs an L1 guest with an
>> IOMMU-bypassing virtio device and assigns it to L2 using vfio, then
>> *boom* L1 crashes.  (Same if, say, DPDK gets used, I think.)
>>
>> >
>> > The only way out of this while keeping the "platform" stuff would be to
>> > also bump some kind of version in the virtio config (or PCI header). I
>> > have no other way to differenciate between "this is an old qemu that
>> > doesn't do the 'bypass property' yet" from "this is a virtio device
>> > that doesn't bypass".
>> >
>> > Any better idea ?
>>
>> I'd suggest that, in the absence of the new DT binding, we assume that
>> any PCI device with the virtio vendor ID is passthrough on powerpc.  I
>> can do this in the virtio driver, but if it's in the platform code
>> then vfio gets it right too (i.e. fails to load).
>
> The problem is there isn't *a* virtio vendor ID. It's the RedHat vendor
> ID which will be used by more than just virtio, so we need to
> specifically list the devices.

Really?

/* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
static const struct pci_device_id virtio_pci_id_table[] = {
{ PCI_DEVICE(0x1af4, PCI_ANY_ID) },
{ 0 }
};

Can we match on that range?

>
> Additionally, that still means that once we have a virtio device that
> actually uses the iommu, powerpc will not work since the "workaround"
> above will kick in.

I don't know how to solve that problem, though, especially since the
vendor of such a device (especially if it's real hardware) might not
set any new bit.

>
> The "in absence of the new DT binding" doesn't make that much sense.
>
> Those platforms use device-trees defined since the dawn of ages by
> actual open firmware implementations, they either have no iommu
> representation in there (Macs, the platform code hooks it all up) or
> have various properties related to the iommu but no concept of "bypass"
> in there.
>
> We can *add* a new property under some circumstances that indicates a
> bypass on a per-device basis, however that doesn't completely solve it:
>
>   - As I said above, what does the absence of that property mean ? An
> old qemu that does bypass on all virtio or a new qemu trying to tell
> you that the virtio device actually does use the iommu (or some other
> environment that isn't qemu) ?
>
>   - On things like macs, the device-tree is generated by openbios, it
> would have to have some added logic to try to figure that out, which
> means it needs to know *via different means* that some or all virtio
> devices bypass the iommu.
>
> I thus go back to my original statement, it's a LOT easier to handle if
> the device itself is self describing, indicating whether it is set to
> bypass a host iommu or not. For L1->L2, well, that wouldn't be the
> first time qemu/VFIO plays tricks with the passed through device
> configuration space...

Which leaves the special case of Xen, where even preexisting devices
don't bypass the IOMMU.  Can we keep this specific to powerpc and
sparc?  On x86, this problem is basically nonexistent, since the IOMMU
is properly self-describing.

IOW, I think that on x86 we should assume that all virtio devices
honor the IOMMU.

>
> Note that the above can be solved via some kind of compromise: The
> device self describes the ability to honor the iommu, along with the
> property (or ACPI table entry) that indicates whether or not it does.
>
> IE. We could use the revision or ProgIf field of the config space for
> example. Or something in virtio config. If it's an "old" device, we
> know it always bypass. If it's a new device, we know it only bypasses
> if the corresponding property is in. I still would have to sort out the
> openbios case for mac among others but it's at least a workable
> direction.
>
> BTW. Don't you have a similar problem on x86 that today qemu claims
> that everything honors the iommu in ACPI ?

Only on a single experimental configuration, and that can apparently
just be fixed going forward without any real problems being caused.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 2/3] target-i386: calculate vcpu's TSC rate to be migrated

2015-11-09 Thread Haozhong Zhang
On 11/09/15 14:01, Eduardo Habkost wrote:
> On Mon, Nov 09, 2015 at 08:33:55AM +0800, haozhong.zh...@intel.com wrote:
> > On 11/06/15 13:12, Eduardo Habkost wrote:
> > > On Fri, Nov 06, 2015 at 10:32:24AM +0800, haozhong.zh...@intel.com wrote:
> > > > On 11/05/15 14:05, Eduardo Habkost wrote:
> > > > > On Thu, Nov 05, 2015 at 09:30:51AM +0800, Haozhong Zhang wrote:
> > > > > > On 11/04/15 19:42, Eduardo Habkost wrote:
> > > [...]
> > > > > > > > +env->tsc_khz_saved = r;
> > > > > > > > +}
> > > > > > > 
> > > > > > > Why do you need a separate tsc_khz_saved field, and don't simply 
> > > > > > > use
> > > > > > > tsc_khz? It would have the additional feature of letting QMP 
> > > > > > > clients
> > > > > > > query the current TSC rate by asking for the tsc-freq property on 
> > > > > > > CPU
> > > > > > > objects.
> > > > > > >
> > > > > > 
> > > > > > It's to avoid overriding env->tsc_khz on the destination in the
> > > > > > migration. I can change this line to
> > > > > >  env->tsc_khz = env->tsc_khz_saved = r;
> > > > > 
> > > > > You are already avoiding overriding env->tsc_khz, because you use
> > > > > KVM_GET_TSC_KHZ only if tsc_khz is not set yet. I still don't see why
> > > > > you need a tsc_khz_saved field that requires duplicating the 
> > > > > SET_TSC_KHZ
> > > > > code, if you could just do this:
> > > > > 
> > > > > if (!env->tsc_khz) {
> > > > > env->tsc_khz = kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ);
> > > > > }
> > > > >
> > > > 
> > > > Consider an example that we migrate a VM from machine A to machine B
> > > > and then to machine C, and QEMU on machine B is launched with the cpu
> > > > option 'tsc-freq' (i.e. env->tsc_khz on B is non-zero at the
> > > > beginning):
> > > >  1) In the migration from B to C, the user-specified TSC frequency by
> > > > 'tsc-freq' on B is expected to be migrated to C. That is, the
> > > > value of env->tsc_khz on B is migrated.
> > > >  2) If TSC frequency is migrated through env->tsc_khz, then
> > > > env->tsc_khz on B will be overrode in the migration from A to B
> > > > before kvm_arch_setup_tsc_khz(). If the guest TSC frequency is
> > > > different than the user-specified TSC frequency on B, the
> > > > expectation in 1) will not be satisfied anymore.
> > > 
> > > Setting tsc-freq on B when tsc-freq was not used on A is invalid usage.
> > > This is not different from changing the CPU model and adding or removing
> > > CPU flags when migrating, which is also incorrect. The command-line
> > > parameters defining the VM must be the same when you migrate.
> > >
> > 
> > Good to know it's an invalid usage. Then the question is what QEMU is
> > expected to do for this invalid usage?
> > 
> >  1) Abort the migration? But I find that the current QEMU does not
> > abort the migration between different CPU models (e.g. Nehalem and
> > Haswell).
> > 
> >  2) Or do not abort the migration and ignore tsc-freq option? If so,
> > tsc_khz_saved will be not needed.
> 
> My first choice is to abort migration. If we decide to abort today and
> find it to cause problems, we can easily fix it. If we decide to
> continue without aborting, it is difficult to change that behavior
> without breaking existing setups.
> 
> -- 
> Eduardo

Agree, but I would like to relax the abort condition to "abort the
migration only if QEMU on the destination uses a different TSC
frequency than the migrated one" so that the following usages would be
still valid:
 1) Only QEMU on the destination has 'tsc-freq' option, but it' set to
the same value of the migrated one.
 2) Only QEMU on the source has 'tsc-freq' option.
 3) QEMU on both sides have 'tsc-freq' option, but they are set to the
same value.
In all above usages, TSC frequency on the destination is the same as
both the value on the source and the value explicitly expected by
users on the destination (by 'tsc-freq' on the destination).

And I still need tsc_khz_saved to tell on the destination whether
 1) both tsc-freq option and migrated TSC frequency are present, and
 2) above two values are the same.
Even though we restrictively requires QEMU on both sides use the same
CPU options, tsc_khz_saved is still needed because of 1).

Haozhong
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Andy Lutomirski
On Mon, Nov 9, 2015 at 2:58 PM, Benjamin Herrenschmidt
 wrote:
> So ...
>
> I've finally tried to sort that out for powerpc and I can't find a way
> to make that work that isn't a complete pile of stinking shit.
>
> I'm very tempted to go back to my original idea: virtio itself should
> indicate it's "bypassing ability" via the virtio config space or some
> other bit (like the ProgIf of the PCI header etc...).
>
> I don't see how I can make it work otherwise.
>
> The problem with the statement "it's a platform matter" is that:
>
>   - It's not entirely correct. It's actually a feature of a specific
> virtio implementation (qemu's) that it bypasses all the platform IOMMU
> facilities.
>
>   - The platforms (on powerpc there's at least 3 or 4 that have qemu
> emulation and support some form of PCI) don't have an existing way to
> convey the information that a device bypasses the IOMMU (if any).
>
>   - Even if we add such a mechanism (new property in the device-tree),
> we end up with a big turd: Because we need to be compatible with older
> qemus, we essentially need a quirk that will make all virtio devices
> assume that property is present. That will of course break whenever we
> try to use another implementation of virtio on powerpc which doesn't
> bypass the iommu.
>
> We don't have a way to differenciate a virtio device that does the
> bypass from one that doesn't and the backward compatibility requirement
> forces us to treat all existing virtio devices as doing bypass.

The problem here is that in some of the problematic cases the virtio
driver may not even be loaded.  If someone runs an L1 guest with an
IOMMU-bypassing virtio device and assigns it to L2 using vfio, then
*boom* L1 crashes.  (Same if, say, DPDK gets used, I think.)

>
> The only way out of this while keeping the "platform" stuff would be to
> also bump some kind of version in the virtio config (or PCI header). I
> have no other way to differenciate between "this is an old qemu that
> doesn't do the 'bypass property' yet" from "this is a virtio device
> that doesn't bypass".
>
> Any better idea ?

I'd suggest that, in the absence of the new DT binding, we assume that
any PCI device with the virtio vendor ID is passthrough on powerpc.  I
can do this in the virtio driver, but if it's in the platform code
then vfio gets it right too (i.e. fails to load).

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm-all: PAGE_SIZE should be real host page size

2015-11-09 Thread Andrew Jones
Just noticed this while grepping TARGET_PAGE_SIZE for an unrelated
reason. I didn't use qemu_real_host_page_size as kvm_set_phys_mem()
does, because we'd need to make sure page_size_init() has run first.

Signed-off-by: Andrew Jones 
---
 kvm-all.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 1bc12737723c3..de9ff5971fb3b 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -45,8 +45,10 @@
 #include 
 #endif
 
-/* KVM uses PAGE_SIZE in its definition of COALESCED_MMIO_MAX */
-#define PAGE_SIZE TARGET_PAGE_SIZE
+/* KVM uses PAGE_SIZE in its definition of KVM_COALESCED_MMIO_MAX. We
+ * need to use the real host PAGE_SIZE, as that's what KVM will use.
+ */
+#define PAGE_SIZE getpagesize()
 
 //#define DEBUG_KVM
 
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Benjamin Herrenschmidt
So ...

I've finally tried to sort that out for powerpc and I can't find a way
to make that work that isn't a complete pile of stinking shit.

I'm very tempted to go back to my original idea: virtio itself should
indicate it's "bypassing ability" via the virtio config space or some
other bit (like the ProgIf of the PCI header etc...).

I don't see how I can make it work otherwise.

The problem with the statement "it's a platform matter" is that:

  - It's not entirely correct. It's actually a feature of a specific
virtio implementation (qemu's) that it bypasses all the platform IOMMU
facilities.

  - The platforms (on powerpc there's at least 3 or 4 that have qemu
emulation and support some form of PCI) don't have an existing way to
convey the information that a device bypasses the IOMMU (if any).

  - Even if we add such a mechanism (new property in the device-tree),
we end up with a big turd: Because we need to be compatible with older
qemus, we essentially need a quirk that will make all virtio devices
assume that property is present. That will of course break whenever we
try to use another implementation of virtio on powerpc which doesn't
bypass the iommu.

We don't have a way to differenciate a virtio device that does the
bypass from one that doesn't and the backward compatibility requirement
forces us to treat all existing virtio devices as doing bypass.

The only way out of this while keeping the "platform" stuff would be to
also bump some kind of version in the virtio config (or PCI header). I
have no other way to differenciate between "this is an old qemu that
doesn't do the 'bypass property' yet" from "this is a virtio device
that doesn't bypass".

Any better idea ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM/arm64: enable enhanced armv8 fp/simd lazy switch

2015-11-09 Thread Mario Smarduch


On 11/5/2015 7:02 AM, Christoffer Dall wrote:
> On Fri, Oct 30, 2015 at 02:56:33PM -0700, Mario Smarduch wrote:
>> This patch enables arm64 lazy fp/simd switch, similar to arm described in
>> second patch. Change from previous version - restore function is moved to
>> host. 
>>
>> Signed-off-by: Mario Smarduch 
>> ---
>>  arch/arm64/include/asm/kvm_host.h |  2 +-
>>  arch/arm64/kernel/asm-offsets.c   |  1 +
>>  arch/arm64/kvm/hyp.S  | 37 +++--
>>  3 files changed, 33 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_host.h 
>> b/arch/arm64/include/asm/kvm_host.h
>> index 26a2347..dcecf92 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -251,11 +251,11 @@ static inline void kvm_arch_hardware_unsetup(void) {}
>>  static inline void kvm_arch_sync_events(struct kvm *kvm) {}
>>  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
>>  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
>> -static inline void kvm_restore_host_vfp_state(struct kvm_vcpu *vcpu) {}
>>  
>>  void kvm_arm_init_debug(void);
>>  void kvm_arm_setup_debug(struct kvm_vcpu *vcpu);
>>  void kvm_arm_clear_debug(struct kvm_vcpu *vcpu);
>>  void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu);
>> +void kvm_restore_host_vfp_state(struct kvm_vcpu *vcpu);
>>  
>>  #endif /* __ARM64_KVM_HOST_H__ */
>> diff --git a/arch/arm64/kernel/asm-offsets.c 
>> b/arch/arm64/kernel/asm-offsets.c
>> index 8d89cf8..c9c5242 100644
>> --- a/arch/arm64/kernel/asm-offsets.c
>> +++ b/arch/arm64/kernel/asm-offsets.c
>> @@ -124,6 +124,7 @@ int main(void)
>>DEFINE(VCPU_HCR_EL2,  offsetof(struct kvm_vcpu, 
>> arch.hcr_el2));
>>DEFINE(VCPU_MDCR_EL2, offsetof(struct kvm_vcpu, arch.mdcr_el2));
>>DEFINE(VCPU_IRQ_LINES,offsetof(struct kvm_vcpu, arch.irq_lines));
>> +  DEFINE(VCPU_VFP_DIRTY,offsetof(struct kvm_vcpu, arch.vfp_dirty));
>>DEFINE(VCPU_HOST_CONTEXT, offsetof(struct kvm_vcpu, 
>> arch.host_cpu_context));
>>DEFINE(VCPU_HOST_DEBUG_STATE, offsetof(struct kvm_vcpu, 
>> arch.host_debug_state));
>>DEFINE(VCPU_TIMER_CNTV_CTL,   offsetof(struct kvm_vcpu, 
>> arch.timer_cpu.cntv_ctl));
>> diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
>> index e583613..ed2c4cf 100644
>> --- a/arch/arm64/kvm/hyp.S
>> +++ b/arch/arm64/kvm/hyp.S
>> @@ -36,6 +36,28 @@
>>  #define CPU_SYSREG_OFFSET(x)(CPU_SYSREGS + 8*x)
>>  
>>  .text
>> +
>> +/**
>> + * void kvm_restore_host_vfp_state(struct vcpu *vcpu) - Executes lazy
>> + *  fp/simd switch, saves the guest, restores host. Called from host
>> + *  mode, placed outside of hyp section.
> 
> same comments on style as previous patch
> 
>> + */
>> +ENTRY(kvm_restore_host_vfp_state)
>> +pushxzr, lr
>> +
>> +add x2, x0, #VCPU_CONTEXT
>> +mov w3, #0
>> +strbw3, [x0, #VCPU_VFP_DIRTY]
> 
> I've been discussing with myself if it would make more sense to clear
> the dirty flag in the C-code...
> 
>> +
>> +bl __save_fpsimd
>> +
>> +ldr x2, [x0, #VCPU_HOST_CONTEXT]
>> +bl __restore_fpsimd
>> +
>> +pop xzr, lr
>> +ret
>> +ENDPROC(kvm_restore_host_vfp_state)
>> +
>>  .pushsection.hyp.text, "ax"
>>  .align  PAGE_SHIFT
>>  
>> @@ -482,7 +504,11 @@
>>  99:
>>  msr hcr_el2, x2
>>  mov x2, #CPTR_EL2_TTA
>> +
>> +ldrbw3, [x0, #VCPU_VFP_DIRTY]
>> +tbnzw3, #0, 98f
>>  orr x2, x2, #CPTR_EL2_TFP
>> +98:
> 
> mmm, don't you need to only set the fpexc32 when you're actually going
> to trap the guest accesses?
> 
> also, you can consider only setting this in vcpu_load (jumping quickly
> to EL2 to do so) if we're running a 32-bit guest.  Probably worth
> measuring the difference between the extra EL2 jump on vcpu_load
> compared to hitting this register on every entry/exit.
> 
> Code-wise, it will be nicer to do it on vcpu_load.
Hi Christoffer, Marc -
  just want to run this by you, I ran a test with typical number of
fp threads and couple lmbench benchmarks  the stride and bandwidth ones. The
ratio of exits to vcpu puts is high 50:1 or so. But of course that's subject
to the loads you run.

I substituted:
tbnz x2, #HCR_RW_SHIFT, 99f
mov x3, #(1 << 30)
msr fpexc32_el2, x3
isb

with vcpu_load hyp call and check for 32 bit guest in C
mov x1, #(1 << 30)
msr fpexc32_el2, x3
ret

And then
skip_fpsimd_state x8, 2f
mrs x6, fpexec_el2
str x6, [x3, #16]

with vcpu_put hyp call with check for 32 bit guest in C this is called
substantially less often then vcpu_load since fp/simd registers are not
always dirty on vcpu_put

kern_hyp_va x0
add x2, x0, #VCPU_CONTEXT
mrs x1, fpexec32_el2
str x1, [x2, #CPU_SYSREG_OFFSET(FPEXC32_EL2)]
ret

Of course each hyp call has additional overhead, at a high exit to
vcpu_put ratio hyp call appears better. But all this is very
highly dependent on exit rate and fp/simd usage. IMO hyp call
works better under extrem

[kvm-unit-tests PATCH 19/18] don't embed code inside asserts

2015-11-09 Thread Andrew Jones
assert() is classically a macro which could also be disabled, so if
somebody introduces a switch to "#define assert(...) /*nothing*/" in
the future, we'd lose code.

Suggested-by: Thomas Huth 
Signed-off-by: Andrew Jones 
---
 lib/arm/setup.c   | 19 ++-
 lib/arm/smp.c |  4 +++-
 lib/virtio-mmio.c |  4 +++-
 3 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/lib/arm/setup.c b/lib/arm/setup.c
index 02e81a689a8a6..da6edc1f9d8ff 100644
--- a/lib/arm/setup.c
+++ b/lib/arm/setup.c
@@ -39,8 +39,11 @@ static void cpu_set(int fdtnode __unused, u32 regval, void 
*info __unused)
 
 static void cpu_init(void)
 {
+   int ret;
+
nr_cpus = 0;
-   assert(dt_for_each_cpu_node(cpu_set, NULL) == 0);
+   ret = dt_for_each_cpu_node(cpu_set, NULL);
+   assert(ret == 0);
set_cpu_online(0, true);
 }
 
@@ -49,8 +52,10 @@ static void mem_init(phys_addr_t freemem_start)
/* we only expect one membank to be defined in the DT */
struct dt_pbus_reg regs[1];
phys_addr_t mem_start, mem_end;
+   int ret;
 
-   assert(dt_get_memory_params(regs, 1));
+   ret = dt_get_memory_params(regs, 1);
+   assert(ret != 0);
 
mem_start = regs[0].addr;
mem_end = mem_start + regs[0].size;
@@ -71,14 +76,17 @@ void setup(const void *fdt)
 {
const char *bootargs;
u32 fdt_size;
+   int ret;
 
/*
 * Move the fdt to just above the stack. The free memory
 * then starts just after the fdt.
 */
fdt_size = fdt_totalsize(fdt);
-   assert(fdt_move(fdt, &stacktop, fdt_size) == 0);
-   assert(dt_init(&stacktop) == 0);
+   ret = fdt_move(fdt, &stacktop, fdt_size);
+   assert(ret == 0);
+   ret = dt_init(&stacktop);
+   assert(ret == 0);
 
mem_init(PAGE_ALIGN((unsigned long)&stacktop + fdt_size));
io_init();
@@ -86,6 +94,7 @@ void setup(const void *fdt)
 
thread_info_init(current_thread_info(), 0);
 
-   assert(dt_get_bootargs(&bootargs) == 0);
+   ret = dt_get_bootargs(&bootargs);
+   assert(ret == 0);
setup_args(bootargs);
 }
diff --git a/lib/arm/smp.c b/lib/arm/smp.c
index 3cfc6d5ddedd0..390c53b5d84c3 100644
--- a/lib/arm/smp.c
+++ b/lib/arm/smp.c
@@ -44,11 +44,13 @@ secondary_entry_fn secondary_cinit(void)
 void smp_boot_secondary(int cpu, secondary_entry_fn entry)
 {
void *stack_base = memalign(THREAD_SIZE, THREAD_SIZE);
+   int ret;
 
secondary_data.stack = stack_base + THREAD_START_SP;
secondary_data.entry = entry;
mmu_mark_disabled(cpu);
-   assert(cpu_psci_cpu_boot(cpu) == 0);
+   ret = cpu_psci_cpu_boot(cpu);
+   assert(ret == 0);
 
while (!cpu_online(cpu))
wfe();
diff --git a/lib/virtio-mmio.c b/lib/virtio-mmio.c
index 5ccbd193a264a..fa8dd5b8d484d 100644
--- a/lib/virtio-mmio.c
+++ b/lib/virtio-mmio.c
@@ -123,10 +123,12 @@ static int vm_dt_match(const struct dt_device *dev, int 
fdtnode)
struct vm_dt_info *info = (struct vm_dt_info *)dev->info;
struct dt_pbus_reg base;
u32 magic;
+   int ret;
 
dt_device_bind_node((struct dt_device *)dev, fdtnode);
 
-   assert(dt_pbus_get_base(dev, &base) == 0);
+   ret = dt_pbus_get_base(dev, &base);
+   assert(ret == 0);
info->base = ioremap(base.addr, base.size);
 
magic = readl(info->base + VIRTIO_MMIO_MAGIC_VALUE);
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-unit-tests PATCH v2 02/19] trivial: lib: fail hard on failed mallocs

2015-11-09 Thread Thomas Huth
On 09/11/15 21:53, Andrew Jones wrote:
> It's pretty safe to not even bother checking for NULL when
> using malloc and friends, but if we do check, then fail
> hard.
> 
> Signed-off-by: Andrew Jones 
> ---
> v2: no code in asserts [Thomas Huth]
> 
>  lib/virtio-mmio.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/virtio-mmio.c b/lib/virtio-mmio.c
> index 043832299174e..5ccbd193a264a 100644
> --- a/lib/virtio-mmio.c
> +++ b/lib/virtio-mmio.c
> @@ -54,8 +54,7 @@ static struct virtqueue *vm_setup_vq(struct virtio_device 
> *vdev,
>  
>   vq = calloc(1, sizeof(*vq));
>   queue = memalign(PAGE_SIZE, VIRTIO_MMIO_QUEUE_SIZE_MIN);
> - if (!vq || !queue)
> - return NULL;
> + assert(vq && queue);
>  
>   writel(index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
>  
> @@ -162,8 +161,7 @@ static struct virtio_device *virtio_mmio_dt_bind(u32 
> devid)
>   return NULL;
>  
>   vm_dev = calloc(1, sizeof(*vm_dev));
> - if (!vm_dev)
> - return NULL;
> + assert(vm_dev != NULL);
>  
>   vm_dev->base = info.base;
>   vm_device_init(vm_dev);

Reviewed-by: Thomas Huth 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[kvm-unit-tests PATCH v2 02/19] trivial: lib: fail hard on failed mallocs

2015-11-09 Thread Andrew Jones
It's pretty safe to not even bother checking for NULL when
using malloc and friends, but if we do check, then fail
hard.

Signed-off-by: Andrew Jones 
---
v2: no code in asserts [Thomas Huth]

 lib/virtio-mmio.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/lib/virtio-mmio.c b/lib/virtio-mmio.c
index 043832299174e..5ccbd193a264a 100644
--- a/lib/virtio-mmio.c
+++ b/lib/virtio-mmio.c
@@ -54,8 +54,7 @@ static struct virtqueue *vm_setup_vq(struct virtio_device 
*vdev,
 
vq = calloc(1, sizeof(*vq));
queue = memalign(PAGE_SIZE, VIRTIO_MMIO_QUEUE_SIZE_MIN);
-   if (!vq || !queue)
-   return NULL;
+   assert(vq && queue);
 
writel(index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
 
@@ -162,8 +161,7 @@ static struct virtio_device *virtio_mmio_dt_bind(u32 devid)
return NULL;
 
vm_dev = calloc(1, sizeof(*vm_dev));
-   if (!vm_dev)
-   return NULL;
+   assert(vm_dev != NULL);
 
vm_dev->base = info.base;
vm_device_init(vm_dev);
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V6 0/6] Fast mmio eventfd fixes

2015-11-09 Thread Michael S. Tsirkin
On Mon, Nov 09, 2015 at 12:35:45PM +0800, Jason Wang wrote:
> 
> 
> On 11/09/2015 01:11 AM, Michael S. Tsirkin wrote:
> > On Tue, Sep 15, 2015 at 02:41:53PM +0800, Jason Wang wrote:
> >> Hi:
> >>
> >> This series fixes two issues of fast mmio eventfd:
> >>
> >> 1) A single iodev instance were registerd on two buses: KVM_MMIO_BUS
> >>and KVM_FAST_MMIO_BUS. This will cause double in
> >>ioeventfd_destructor()
> >> 2) A zero length iodev on KVM_MMIO_BUS will never be found but
> >>kvm_io_bus_cmp(). This will lead e.g the eventfd will be trapped by
> >>qemu instead of host.
> >>
> >> 1 is fixed by allocating two instances of iodev and introduce a new
> >> capability for userspace. 2 is fixed by ignore the actual length if
> >> the length of iodev is zero in kvm_io_bus_cmp().
> >>
> >> Please review.
> >> Changes from V5:
> >> - move patch of explicitly checking for KVM_MMIO_BUS to patch 1 and
> >>   remove the unnecessary checks
> >> - even more grammar and typo fixes
> >> - rabase to kvm.git
> >> - document KVM_CAP_FAST_MMIO
> > What's up with userspace using this capability?
> 
> It was renamed to KVM_CAP_IOEVENTFD_ANY_LENGTH.
> 
> > Did patches ever get posted?
> 
> See https://lkml.org/lkml/2015/9/28/208

Talking about userspace here.
QEMU freeze is approaching, it really should
use this to avoid regressions.


> >
> >> Changes from V4:
> >> - move the location of kvm_assign_ioeventfd() in patch 1 which reduce
> >>   the change set.
> >> - commit log typo fixes
> >> - switch to use kvm_deassign_ioeventfd_id) when fail to register to
> >>   fast mmio bus
> >> - change kvm_io_bus_cmp() as Paolo's suggestions
> >> - introduce a new capability to avoid new userspace crash old kernel
> >> - add a new patch that only try to register mmio eventfd on fast mmio
> >>   bus
> >>
> >> Changes from V3:
> >>
> >> - Don't do search on two buses when trying to do write on
> >>   KVM_MMIO_BUS. This fixes a small regression found by vmexit.flat.
> >> - Since we don't do search on two buses, change kvm_io_bus_cmp() to
> >>   let it can find zero length iodevs.
> >> - Fix the unnecessary lines in tracepoint patch.
> >>
> >> Changes from V2:
> >> - Tweak styles and comment suggested by Cornelia.
> >>
> >> Changes from v1:
> >> - change ioeventfd_bus_from_flags() to return KVM_FAST_MMIO_BUS when
> >>   needed to save lots of unnecessary changes.
> >>
> >> Jason Wang (6):
> >>   kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd
> >>   kvm: factor out core eventfd assign/deassign logic
> >>   kvm: fix double free for fast mmio eventfd
> >>   kvm: fix zero length mmio searching
> >>   kvm: add tracepoint for fast mmio
> >>   kvm: add fast mmio capabilitiy
> >>
> >>  Documentation/virtual/kvm/api.txt |   7 ++-
> >>  arch/x86/kvm/trace.h  |  18 ++
> >>  arch/x86/kvm/vmx.c|   1 +
> >>  arch/x86/kvm/x86.c|   1 +
> >>  include/uapi/linux/kvm.h  |   1 +
> >>  virt/kvm/eventfd.c| 124 
> >> ++
> >>  virt/kvm/kvm_main.c   |  20 +-
> >>  7 files changed, 118 insertions(+), 54 deletions(-)
> >>
> >> -- 
> >> 2.1.4
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 14/33] pc-dimm: drop the prefix of pc-dimm

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:08PM +0800, Xiao Guangrong wrote:
> This patch is generated by this script:
> 
> find ./ -name "*.[ch]" -o -name "*.json" -o -name "trace-events" -type f \
> | xargs sed -i "s/PC_DIMM/DIMM/g"
> 
> find ./ -name "*.[ch]" -o -name "*.json" -o -name "trace-events" -type f \
> | xargs sed -i "s/PCDIMM/DIMM/g"
> 
> find ./ -name "*.[ch]" -o -name "*.json" -o -name "trace-events" -type f \
> | xargs sed -i "s/pc_dimm/dimm/g"
> 
> find ./ -name "trace-events" -type f | xargs sed -i "s/pc-dimm/dimm/g"
> 
> It prepares the work which abstracts dimm device type for both pc-dimm and
> nvdimm
> 
> Signed-off-by: Xiao Guangrong 

I can see two ways this patchset can get merged
- merge refactorings first, nvdimm support on top
- merge nvdimm support first, refactor code on top

The way you put it in the middle of the series
allows neither.  And I definitely favor option 2:
it's easier to reason about the best way to refactor code
when you have multiple users before you.


> ---
>  hmp.c   |   2 +-
>  hw/acpi/ich9.c  |   6 +-
>  hw/acpi/memory_hotplug.c|  16 ++---
>  hw/acpi/piix4.c |   6 +-
>  hw/i386/pc.c|  32 -
>  hw/mem/pc-dimm.c| 148 
> 
>  hw/ppc/spapr.c  |  18 ++---
>  include/hw/mem/pc-dimm.h|  62 -
>  numa.c  |   2 +-
>  qapi-schema.json|   8 +--
>  qmp.c   |   2 +-
>  stubs/qmp_pc_dimm_device_list.c |   2 +-
>  trace-events|   8 +--
>  13 files changed, 156 insertions(+), 156 deletions(-)
> 
> diff --git a/hmp.c b/hmp.c
> index 5048eee..5c617d2 100644
> --- a/hmp.c
> +++ b/hmp.c
> @@ -1952,7 +1952,7 @@ void hmp_info_memory_devices(Monitor *mon, const QDict 
> *qdict)
>  MemoryDeviceInfoList *info_list = qmp_query_memory_devices(&err);
>  MemoryDeviceInfoList *info;
>  MemoryDeviceInfo *value;
> -PCDIMMDeviceInfo *di;
> +DIMMDeviceInfo *di;
>  
>  for (info = info_list; info; info = info->next) {
>  value = info->value;
> diff --git a/hw/acpi/ich9.c b/hw/acpi/ich9.c
> index 1c7fcfa..b0d6a67 100644
> --- a/hw/acpi/ich9.c
> +++ b/hw/acpi/ich9.c
> @@ -440,7 +440,7 @@ void ich9_pm_add_properties(Object *obj, ICH9LPCPMRegs 
> *pm, Error **errp)
>  void ich9_pm_device_plug_cb(ICH9LPCPMRegs *pm, DeviceState *dev, Error 
> **errp)
>  {
>  if (pm->acpi_memory_hotplug.is_enabled &&
> -object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> +object_dynamic_cast(OBJECT(dev), TYPE_DIMM)) {
>  acpi_memory_plug_cb(&pm->acpi_regs, pm->irq, 
> &pm->acpi_memory_hotplug,
>  dev, errp);
>  } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
> @@ -455,7 +455,7 @@ void ich9_pm_device_unplug_request_cb(ICH9LPCPMRegs *pm, 
> DeviceState *dev,
>Error **errp)
>  {
>  if (pm->acpi_memory_hotplug.is_enabled &&
> -object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> +object_dynamic_cast(OBJECT(dev), TYPE_DIMM)) {
>  acpi_memory_unplug_request_cb(&pm->acpi_regs, pm->irq,
>&pm->acpi_memory_hotplug, dev, errp);
>  } else {
> @@ -468,7 +468,7 @@ void ich9_pm_device_unplug_cb(ICH9LPCPMRegs *pm, 
> DeviceState *dev,
>Error **errp)
>  {
>  if (pm->acpi_memory_hotplug.is_enabled &&
> -object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> +object_dynamic_cast(OBJECT(dev), TYPE_DIMM)) {
>  acpi_memory_unplug_cb(&pm->acpi_memory_hotplug, dev, errp);
>  } else {
>  error_setg(errp, "acpi: device unplug for not supported device"
> diff --git a/hw/acpi/memory_hotplug.c b/hw/acpi/memory_hotplug.c
> index ce428df..e687852 100644
> --- a/hw/acpi/memory_hotplug.c
> +++ b/hw/acpi/memory_hotplug.c
> @@ -54,23 +54,23 @@ static uint64_t acpi_memory_hotplug_read(void *opaque, 
> hwaddr addr,
>  o = OBJECT(mdev->dimm);
>  switch (addr) {
>  case 0x0: /* Lo part of phys address where DIMM is mapped */
> -val = o ? object_property_get_int(o, PC_DIMM_ADDR_PROP, NULL) : 0;
> +val = o ? object_property_get_int(o, DIMM_ADDR_PROP, NULL) : 0;
>  trace_mhp_acpi_read_addr_lo(mem_st->selector, val);
>  break;
>  case 0x4: /* Hi part of phys address where DIMM is mapped */
> -val = o ? object_property_get_int(o, PC_DIMM_ADDR_PROP, NULL) >> 32 
> : 0;
> +val = o ? object_property_get_int(o, DIMM_ADDR_PROP, NULL) >> 32 : 0;
>  trace_mhp_acpi_read_addr_hi(mem_st->selector, val);
>  break;
>  case 0x8: /* Lo part of DIMM size */
> -val = o ? object_property_get_int(o, PC_DIMM_SIZE_PROP, NULL) : 0;
> +val = o ? object_property_get_int(o, DIMM_SIZE_PROP, NULL) : 0;
>  trace_mhp_acpi_read_size_lo(mem_st->selecto

Re: [PATCH v6 07/33] util: introduce qemu_file_get_page_size()

2015-11-09 Thread Michael S. Tsirkin
On Sat, Oct 31, 2015 at 04:09:56PM +0800, Xiao Guangrong wrote:
> 
> 
> On 10/30/2015 11:54 PM, Eduardo Habkost wrote:
> >On Fri, Oct 30, 2015 at 01:56:01PM +0800, Xiao Guangrong wrote:
> >>There are three places use the some logic to get the page size on
> >>the file path or file fd
> >>
> >>This patch introduces qemu_file_get_page_size() to unify the code
> >>
> >>Signed-off-by: Xiao Guangrong 
> >[...]
> >>diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> >>index 914cef5..ad94c5a 100644
> >>--- a/util/oslib-posix.c
> >>+++ b/util/oslib-posix.c
> >>@@ -360,6 +360,22 @@ static size_t fd_getpagesize(int fd)
> >>  return getpagesize();
> >>  }
> >>
> >>+size_t qemu_file_get_page_size(const char *path)
> >>+{
> >>+size_t size = 0;
> >>+int fd = qemu_open(path, O_RDONLY);
> >>+
> >>+if (fd < 0) {
> >>+fprintf(stderr, "Could not open %s.\n", path);
> >>+goto exit;
> >
> >Have you considered using a Error** argument here?
> >
> >>+}
> >>+
> >>+size = fd_getpagesize(fd);
> >>+qemu_close(fd);
> >>+exit:
> >>+return size;
> >>+}
> >>+
> >>diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
> >>index ac70f08..c661f1c 100644
> >>--- a/target-ppc/kvm.c
> >>+++ b/target-ppc/kvm.c
> >>@@ -308,28 +308,13 @@ static void kvm_get_smmu_info(PowerPCCPU *cpu, struct 
> >>kvm_ppc_smmu_info *info)
> >>
> >>  static long gethugepagesize(const char *mem_path)
> >>  {
> >>-struct statfs fs;
> >>-int ret;
> >>-
> >>-do {
> >>-ret = statfs(mem_path, &fs);
> >>-} while (ret != 0 && errno == EINTR);
> >>+long size = qemu_file_get_page_size(mem_path);
> >>
> >>-if (ret != 0) {
> >>-fprintf(stderr, "Couldn't statfs() memory path: %s\n",
> >>-strerror(errno));
> >>+if (!size) {
> >>  exit(1);
> >>  }
> >>
> >>-#define HUGETLBFS_MAGIC   0x958458f6
> >>-
> >>-if (fs.f_type != HUGETLBFS_MAGIC) {
> >>-/* Explicit mempath, but it's ordinary pages */
> >>-return getpagesize();
> >>-}
> >>-
> >>-/* It's hugepage, return the huge page size */
> >>-return fs.f_bsize;
> >>+return size;
> >>  }
> >
> >Why are you changing target-ppc/kvm.c:gethugepagesize() to use the new
> >funtion, but not the copy at exec.c? To make it simpler, we could
> >eliminate both gethugepagesize() functions completely and replace them
> >with qemu_file_get_page_size() calls (maybe as part of this patch, maybe
> >in a separate patch, I'm not sure).
> >
> 
> The gethugepagesize() in exec.c will be eliminated in later patch :).

That's why it's not a good idea to split patchset like this,
where patch 1 adds a new function, patch 2 uses it.
It's better if user is in the same patchset.
An exception if when a completely separate group of people
should review the function and the usage,
e.g. some logic in memory core versus caller in acpi.


> And the gethugepagesize() in ppc platform has error handling logic
> and has multiple caller. It's not so bad to keep it.
> 

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 11/35] util: introduce qemu_file_getlength()

2015-11-09 Thread Eduardo Habkost
On Mon, Nov 09, 2015 at 12:44:55PM +0800, Xiao Guangrong wrote:
> On 11/06/2015 11:50 PM, Eduardo Habkost wrote:
> >As this patch affects raw_getlength(), CCing the raw block driver
> >maintainer and the qemu-block mailing list.
> 
> Eduardo, thanks for your reminder. I will keep CCing Kevin and qemu-block mail
> list for future version.
> 
> >
> >On Mon, Nov 02, 2015 at 05:13:13PM +0800, Xiao Guangrong wrote:
> >>It is used to get the size of the specified file, also qemu_fd_getlength()
> >>is introduced to unify the code with raw_getlength() in block/raw-posix.c
> >>
> >>Signed-off-by: Xiao Guangrong 
> >>---
> >>  block/raw-posix.c|  7 +--
> >>  include/qemu/osdep.h |  2 ++
> >>  util/osdep.c | 31 +++
> >
> >I know I was the one who suggested osdep.c, but maybe oslib-posix.c is a
> >more appropriate place for the new function?
> >
> 
> Since the function we introduced here can work on both windows and posix, so
> i thing osdep.c is the right place. Otherwise we should implement it for 
> multiple
> platforms.

I didn't notice it was going to be used by a platform-independent
qemu_file_getlength() function in addition to the posix-specific
raw_getlength(). Have you tested it on Windows, though?

If you didn't test it on Windows, what about keeping
qemu_file_getlength() available only on posix, by now? The only
users are raw-posix.c and hostmem-file.c, currently. If in the
future somebody need it on Windows, they can decide between
moving the SEEK_END code to osdep.c (after testing it), or moving
the existing raw-win32.c:raw_getlength() code to a
oslib-win32.c:qemu_file_getlength() function.

-- 
Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] qemu, pkeys: add pkeys support for qemu xsave state handling

2015-11-09 Thread Eduardo Habkost
On Mon, Nov 09, 2015 at 07:55:33PM +0800, Huaitong Han wrote:
> This patch adds pkeys support for qemu xsave state handling.
> 
> Signed-off-by: Huaitong Han 
[...]
> @@ -1145,6 +1146,7 @@ static int kvm_put_xsave(X86CPU *cpu)
>  #ifdef TARGET_X86_64
>  memcpy(&xsave->region[XSAVE_Hi16_ZMM], &env->xmm_regs[16],
>  16 * sizeof env->xmm_regs[16]);
> +memcpy(&xsave->region[XSAVE_PKRU], &env->pkru, sizeof env->pkru);
>  #endif
>  r = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_XSAVE, xsave);
>  return r;
> @@ -1516,6 +1518,7 @@ static int kvm_get_xsave(X86CPU *cpu)
>  #ifdef TARGET_X86_64
>  memcpy(&env->xmm_regs[16], &xsave->region[XSAVE_Hi16_ZMM],
> 16 * sizeof env->xmm_regs[16]);
> +memcpy(&xsave->region[XSAVE_PKRU], &env->pkru, sizeof env->pkru);

Did you mean:
  memcpy(&env->pkru, &xsave->region[XSAVE_PKRU], sizeof env->pkru)

-- 
Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 12/35] util: let qemu_fd_getlength support block device

2015-11-09 Thread Eduardo Habkost
On Mon, Nov 09, 2015 at 01:58:27PM +0800, Xiao Guangrong wrote:
> 
> 
> On 11/06/2015 11:54 PM, Eduardo Habkost wrote:
> >On Mon, Nov 02, 2015 at 05:13:14PM +0800, Xiao Guangrong wrote:
> >>lseek can not work for all block devices as the man page says:
> >>| Some devices are incapable of seeking and POSIX does not specify
> >>| which devices must support lseek().
> >>
> >>This patch tries to add the support on Linux by using BLKGETSIZE64
> >>ioctl
> >>
> >>Signed-off-by: Xiao Guangrong 
> >
> >On which cases is this patch necessary? Do you know any examples of
> >Linux block devices that won't work with lseek(SEEK_END)?
> 
> To be honest, i have not checked all block device, this patch was made
> based on the man page. However, i do not mind drop this patch (and maybe
> other patches) to make this pachset smaller. BLKGETSIZE64 can be added
> in the future if we meet such device.

By looking at the Linux source code implementing BLKGETSIZE64, it looks
like it should work for all block devices where SEEK_END works:

* BLKGETSIZE64 returns i_size_read(bdev->bd_inode)
  (block/ioctl.c:blkdev_ioctl())
* llseek(SEEK_END) uses i_size_read(bd_inode) as the offset
  (fs/block_dev.c:block_llseek())

That's probably why raw_getlength() never needed a Linux-specific
BLKGETSIZE call.

-- 
Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 07/35] util: introduce qemu_file_get_page_size()

2015-11-09 Thread Eduardo Habkost
On Mon, Nov 09, 2015 at 12:36:36PM +0800, Xiao Guangrong wrote:
> On 11/06/2015 11:36 PM, Eduardo Habkost wrote:
> >On Mon, Nov 02, 2015 at 05:13:09PM +0800, Xiao Guangrong wrote:
> >>There are three places use the some logic to get the page size on
> >>the file path or file fd
> >>
> >>Windows did not support file hugepage, so it will return normal page
> >>for this case. And this interface has not been used on windows so far
> >>
> >>This patch introduces qemu_file_get_page_size() to unify the code
> >>
> >>Signed-off-by: Xiao Guangrong 
> >[...]
> >>diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> >>index 914cef5..51437ff 100644
> >>--- a/util/oslib-posix.c
> >>+++ b/util/oslib-posix.c
> >>@@ -340,7 +340,7 @@ static void sigbus_handler(int signal)
> >>  siglongjmp(sigjump, 1);
> >>  }
> >>
> >>-static size_t fd_getpagesize(int fd)
> >>+static size_t fd_getpagesize(int fd, Error **errp)
> >>  {
> >>  #ifdef CONFIG_LINUX
> >>  struct statfs fs;
> >>@@ -351,7 +351,12 @@ static size_t fd_getpagesize(int fd)
> >>  ret = fstatfs(fd, &fs);
> >>  } while (ret != 0 && errno == EINTR);
> >>
> >>-if (ret == 0 && fs.f_type == HUGETLBFS_MAGIC) {
> >>+if (ret) {
> >>+error_setg_errno(errp, errno, "fstatfs is failed");
> >>+return 0;
> >>+}
> >>+
> >>+if (fs.f_type == HUGETLBFS_MAGIC) {
> >>  return fs.f_bsize;
> >>  }
> >
> >You are changing os_mem_prealloc() behavior when fstatfs() fails, here.
> >Have you ensured there are no cases where fstatfs() fails but this code
> >is still expected to work?
> 
> stat() is supported for all kinds of files, so failed stat() is caused by
> file is not exist or kernel internal error (e,g memory is not enough) or
> security check is not passed. Whichever we should not do any operation on
> the file if stat() failed. The origin code did not check it but it is worth
> being fixed i think.

Note that this is fstatfs(), not stat(). It's possible go get
ENOSYS as error from statfs() if it is not implemented by the
filesystem, I just don't know if this really can happen in
practice.

(But the answer won't matter, as we already aborted on statfs()
errors on all codepaths that call fd_getpagesize(). See below.)

> 
> >
> >The change looks safe: gethugepagesize() seems to be always called in
> >the path that would make fd_getpagesize() be called from
> >os_mem_prealloc(), so allocation would abort much earlier if statfs()
> >failed. But I haven't confirmed that yet, and I wanted to be sure.
> >
> 
> Yes, I am entirely agree with you. :)
> 

I have just confirmed that this is the case, as:

* fd_getpagesize() is only called from os_mem_prealloc(),
* os_mem_prealloc() is called from:
  * host_memory_backend_set_prealloc()
* Using memory_region_get_fd() as the fd argument
  * host_memory_backend_memory_complete()
* Using memory_region_get_fd() as the fd argument
  * file_ram_alloc()
* After qemu_file_get_page_size()/gethugepagesize() was already called
  in the same fd (with errors checked)
* fd_getpagesize() checks for fd == -1
* The only code that sets the fd for a RAMBlock is file_ram_alloc(),
  which checks for qemu_file_get_page_size()/gethugepagesize()
  errors

So, it was already impossible to get os_mem_prealloc() called
with fd != -1 without having gethugepagesize() called first (and
gethugepagesize() already checked for statfs() errors).

Reviewed-by: Eduardo Habkost 

-- 
Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 2/3] target-i386: calculate vcpu's TSC rate to be migrated

2015-11-09 Thread Dr. David Alan Gilbert
* Eduardo Habkost (ehabk...@redhat.com) wrote:
> On Mon, Nov 09, 2015 at 08:33:55AM +0800, haozhong.zh...@intel.com wrote:
> > On 11/06/15 13:12, Eduardo Habkost wrote:
> > > On Fri, Nov 06, 2015 at 10:32:24AM +0800, haozhong.zh...@intel.com wrote:
> > > > On 11/05/15 14:05, Eduardo Habkost wrote:
> > > > > On Thu, Nov 05, 2015 at 09:30:51AM +0800, Haozhong Zhang wrote:
> > > > > > On 11/04/15 19:42, Eduardo Habkost wrote:
> > > [...]
> > > > > > > > +env->tsc_khz_saved = r;
> > > > > > > > +}
> > > > > > > 
> > > > > > > Why do you need a separate tsc_khz_saved field, and don't simply 
> > > > > > > use
> > > > > > > tsc_khz? It would have the additional feature of letting QMP 
> > > > > > > clients
> > > > > > > query the current TSC rate by asking for the tsc-freq property on 
> > > > > > > CPU
> > > > > > > objects.
> > > > > > >
> > > > > > 
> > > > > > It's to avoid overriding env->tsc_khz on the destination in the
> > > > > > migration. I can change this line to
> > > > > >  env->tsc_khz = env->tsc_khz_saved = r;
> > > > > 
> > > > > You are already avoiding overriding env->tsc_khz, because you use
> > > > > KVM_GET_TSC_KHZ only if tsc_khz is not set yet. I still don't see why
> > > > > you need a tsc_khz_saved field that requires duplicating the 
> > > > > SET_TSC_KHZ
> > > > > code, if you could just do this:
> > > > > 
> > > > > if (!env->tsc_khz) {
> > > > > env->tsc_khz = kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ);
> > > > > }
> > > > >
> > > > 
> > > > Consider an example that we migrate a VM from machine A to machine B
> > > > and then to machine C, and QEMU on machine B is launched with the cpu
> > > > option 'tsc-freq' (i.e. env->tsc_khz on B is non-zero at the
> > > > beginning):
> > > >  1) In the migration from B to C, the user-specified TSC frequency by
> > > > 'tsc-freq' on B is expected to be migrated to C. That is, the
> > > > value of env->tsc_khz on B is migrated.
> > > >  2) If TSC frequency is migrated through env->tsc_khz, then
> > > > env->tsc_khz on B will be overrode in the migration from A to B
> > > > before kvm_arch_setup_tsc_khz(). If the guest TSC frequency is
> > > > different than the user-specified TSC frequency on B, the
> > > > expectation in 1) will not be satisfied anymore.
> > > 
> > > Setting tsc-freq on B when tsc-freq was not used on A is invalid usage.
> > > This is not different from changing the CPU model and adding or removing
> > > CPU flags when migrating, which is also incorrect. The command-line
> > > parameters defining the VM must be the same when you migrate.
> > >
> > 
> > Good to know it's an invalid usage. Then the question is what QEMU is
> > expected to do for this invalid usage?
> > 
> >  1) Abort the migration? But I find that the current QEMU does not
> > abort the migration between different CPU models (e.g. Nehalem and
> > Haswell).
> > 
> >  2) Or do not abort the migration and ignore tsc-freq option? If so,
> > tsc_khz_saved will be not needed.
> 
> My first choice is to abort migration. If we decide to abort today and
> find it to cause problems, we can easily fix it. If we decide to
> continue without aborting, it is difficult to change that behavior
> without breaking existing setups.

Yes, if it's a bad config please abort the migration and put a clear
message in the log so we can tell easily.

Dave

> 
> -- 
> Eduardo
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 2/3] target-i386: calculate vcpu's TSC rate to be migrated

2015-11-09 Thread Eduardo Habkost
On Mon, Nov 09, 2015 at 08:33:55AM +0800, haozhong.zh...@intel.com wrote:
> On 11/06/15 13:12, Eduardo Habkost wrote:
> > On Fri, Nov 06, 2015 at 10:32:24AM +0800, haozhong.zh...@intel.com wrote:
> > > On 11/05/15 14:05, Eduardo Habkost wrote:
> > > > On Thu, Nov 05, 2015 at 09:30:51AM +0800, Haozhong Zhang wrote:
> > > > > On 11/04/15 19:42, Eduardo Habkost wrote:
> > [...]
> > > > > > > +env->tsc_khz_saved = r;
> > > > > > > +}
> > > > > > 
> > > > > > Why do you need a separate tsc_khz_saved field, and don't simply use
> > > > > > tsc_khz? It would have the additional feature of letting QMP clients
> > > > > > query the current TSC rate by asking for the tsc-freq property on 
> > > > > > CPU
> > > > > > objects.
> > > > > >
> > > > > 
> > > > > It's to avoid overriding env->tsc_khz on the destination in the
> > > > > migration. I can change this line to
> > > > >  env->tsc_khz = env->tsc_khz_saved = r;
> > > > 
> > > > You are already avoiding overriding env->tsc_khz, because you use
> > > > KVM_GET_TSC_KHZ only if tsc_khz is not set yet. I still don't see why
> > > > you need a tsc_khz_saved field that requires duplicating the SET_TSC_KHZ
> > > > code, if you could just do this:
> > > > 
> > > > if (!env->tsc_khz) {
> > > > env->tsc_khz = kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ);
> > > > }
> > > >
> > > 
> > > Consider an example that we migrate a VM from machine A to machine B
> > > and then to machine C, and QEMU on machine B is launched with the cpu
> > > option 'tsc-freq' (i.e. env->tsc_khz on B is non-zero at the
> > > beginning):
> > >  1) In the migration from B to C, the user-specified TSC frequency by
> > > 'tsc-freq' on B is expected to be migrated to C. That is, the
> > > value of env->tsc_khz on B is migrated.
> > >  2) If TSC frequency is migrated through env->tsc_khz, then
> > > env->tsc_khz on B will be overrode in the migration from A to B
> > > before kvm_arch_setup_tsc_khz(). If the guest TSC frequency is
> > > different than the user-specified TSC frequency on B, the
> > > expectation in 1) will not be satisfied anymore.
> > 
> > Setting tsc-freq on B when tsc-freq was not used on A is invalid usage.
> > This is not different from changing the CPU model and adding or removing
> > CPU flags when migrating, which is also incorrect. The command-line
> > parameters defining the VM must be the same when you migrate.
> >
> 
> Good to know it's an invalid usage. Then the question is what QEMU is
> expected to do for this invalid usage?
> 
>  1) Abort the migration? But I find that the current QEMU does not
> abort the migration between different CPU models (e.g. Nehalem and
> Haswell).
> 
>  2) Or do not abort the migration and ignore tsc-freq option? If so,
> tsc_khz_saved will be not needed.

My first choice is to abort migration. If we decide to abort today and
find it to cause problems, we can easily fix it. If we decide to
continue without aborting, it is difficult to change that behavior
without breaking existing setups.

-- 
Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] qemu, pkeys: add pkeys support for qemu cpuid handling

2015-11-09 Thread Andreas Färber
Am 09.11.2015 um 13:24 schrieb Paolo Bonzini:
> On 09/11/2015 12:55, Huaitong Han wrote:
>> @@ -351,6 +362,7 @@ static const char *cpuid_6_feature_name[] = {
>>CPUID_7_0_EBX_FSGSBASE, CPUID_7_0_EBX_HLE, CPUID_7_0_EBX_AVX2,
>>CPUID_7_0_EBX_ERMS, CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM,
>>CPUID_7_0_EBX_RDSEED */
>> +#define TCG_7_0_ECX_FEATURES (CPUID_7_0_ECX_PKU | CPUID_7_0_ECX_OSPKE)
> 
> This should be zero.  Apart from this detail, the QEMU parts look good.

...except for the subjects, which should be "target-i386: add pkeys
support for cpuid handling" etc. - no need to put qemu into a QEMU
commit subject, especially not twice.

Regards,
Andreas

-- 
SUSE Linux GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Felix Imendörffer, Jane Smithard, Graham Norton; HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] context_tracking: remove duplicate enabled check

2015-11-09 Thread Rik van Riel
On 10/27/2015 09:39 PM, Paolo Bonzini wrote:
> All calls to context_tracking_enter and context_tracking_exit
> are already checking context_tracking_is_enabled, except the
> context_tracking_user_enter and context_tracking_user_exit
> functions left in for the benefit of assembly calls.
> 
> Pull the check up to those functions, by making them simple
> wrappers around the user_enter and user_exit inline functions.
> 
> Cc: Andy Lutomirski 
> Cc: Frederic Weisbecker 
> Cc: Rik van Riel 
> Cc: Paul McKenney 
> Signed-off-by: Paolo Bonzini 

Reviewed-by: Rik van Riel 
Tested-by: Rik van Riel 

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] context_tracking: avoid irq_save/irq_restore on guest entry and exit

2015-11-09 Thread Rik van Riel
On 10/27/2015 09:39 PM, Paolo Bonzini wrote:
> guest_enter and guest_exit must be called with interrupts disabled,
> since they take the vtime_seqlock with write_seq{lock,unlock}.
> Therefore, it is not necessary to check for exceptions, nor to
> save/restore the IRQ state, when context tracking functions are
> called by guest_enter and guest_exit.
> 
> Split the body of context_tracking_entry and context_tracking_exit
> out to __-prefixed functions, and use them from KVM.
> 
> Rik van Riel has measured this to speed up a tight vmentry/vmexit
> loop by about 2%.
> 
> Cc: Andy Lutomirski 
> Cc: Frederic Weisbecker 
> Cc: Rik van Riel 
> Cc: Paul McKenney 
> Signed-off-by: Paolo Bonzini 

Reviewed-by: Rik van Riel 
Tested-by: Rik van Riel 

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/9] KVM, pkeys: add memory protection-key support

2015-11-09 Thread Paolo Bonzini


On 09/11/2015 12:54, Huaitong Han wrote:
> The protection-key feature provides an additional mechanism by which IA-32e
> paging controls access to usermode addresses.
> 
> Hardware support for protection keys for user pages is enumerated with CPUID
> feature flag CPUID.7.0.ECX[3]:PKU. Software support is CPUID.7.0.ECX[4]:OSPKE
> with the setting of CR4.PKE(bit 22).
> 
> When CR4.PKE = 1, every linear address is associated with the 4-bit protection
> key located in bits 62:59 of the paging-structure entry that mapped the page
> containing the linear address. The PKRU register determines, for each
> protection key, whether user-mode addresses with that protection key may be
> read or written.
> 
> The PKRU register (protection key rights for user pages) is a 32-bit register
> with the following format: for each i (0 ≤ i ≤ 15), PKRU[2i] is the
> access-disable bit for protection key i (ADi); PKRU[2i+1] is the write-disable
> bit for protection key i (WDi).
> 
> Software can use the RDPKRU and WRPKRU instructions with ECX = 0 to read and
> write PKRU. In addition, the PKRU register is XSAVE-managed state and can thus
> be read and written by instructions in the XSAVE feature set.

Hi, this looks more or less okay.  I made a few comments on the
individual patches.

Please add a test for PKRU to kvm-unit-tests' access.c.  I will _not_
merge this feature without unit tests.  I have merged nested VPID
without, and it was a mistake because they were never submitted and
probably never will.

Thanks,

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/9] KVM, pkeys: update memeory permission bitmask for pkeys

2015-11-09 Thread Paolo Bonzini


On 09/11/2015 12:54, Huaitong Han wrote:
>* Byte index: page fault error code [4:1]
>* Bit index: pte permissions in ACC_* format
> +  *
> +  * Add PFEC.PK (bit 5) for protection-key violations

Instead, change "[4:1]" to "[5:1]" in the "Byte index" line.

Paolo

>*/
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/9] KVM, pkeys: Add pkeys support for gva_to_gpa funcions

2015-11-09 Thread Paolo Bonzini


On 09/11/2015 12:54, Huaitong Han wrote:
> index 7a84b83..6e9156d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3960,6 +3960,8 @@ gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, 
> gva_t gva,
> struct x86_exception *exception)
>  {
>   u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
> + access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE)
> + ? PFERR_PK_MASK : 0;
>   return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);

I think checking is_long_mode is not necessary here, since is_long_mode
is not checked in update_permission_bitmask but (dynamically) in
permission_fault.
> 
> + gpa_t gpa;
> +
> + access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE)
> + ? PFERR_PK_MASK : 0;

Fetches never have PFERR_PK_MASK set.

Thanks,

Paolo

>   /* Inline kvm_read_guest_virt_helper for speed.  */
> - gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, 
> access|PFERR_FETCH_MASK,
> - exception);
> + gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr,
> + access | PFERR_FETCH_MASK, exception);
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/9] KVM, pkeys: add pkeys support for permission_fault logic

2015-11-09 Thread Paolo Bonzini


On 09/11/2015 13:43, Paolo Bonzini wrote:
> 
> 
> On 09/11/2015 12:54, Huaitong Han wrote:
>> Protection keys define a new 4-bit protection key field (PKEY) in bits
>> 62:59 of leaf entries of the page tables, the PKEY is an index to PKRU
>> register(16 domains), every domain has 2 bits(write disable bit, access
>> disable bit).
>>
>> Static logic has been produced in update_permission_bitmask, dynamic logic
>> need read pkey from page table entries, get pkru value, and deduce the
>> correct result.
>>
>> Signed-off-by: Huaitong Han 
>>
>> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>> index e4202e4..bbb 100644
>> --- a/arch/x86/kvm/mmu.h
>> +++ b/arch/x86/kvm/mmu.h
>> @@ -3,6 +3,7 @@
>>  
>>  #include 
>>  #include "kvm_cache_regs.h"
>> +#include "x86.h"
>>  
>>  #define PT64_PT_BITS 9
>>  #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS)
>> @@ -24,6 +25,11 @@
>>  #define PT_PAGE_SIZE_MASK (1ULL << PT_PAGE_SIZE_SHIFT)
>>  #define PT_PAT_MASK (1ULL << 7)
>>  #define PT_GLOBAL_MASK (1ULL << 8)
>> +
>> +#define PT64_PKEY_BIT0 (1ULL << _PAGE_BIT_PKEY_BIT0)
>> +#define PT64_PKEY_BIT1 (1ULL << _PAGE_BIT_PKEY_BIT1)
>> +#define PT64_PKEY_BIT2 (1ULL << _PAGE_BIT_PKEY_BIT2)
>> +#define PT64_PKEY_BIT3 (1ULL << _PAGE_BIT_PKEY_BIT3)
>>  #define PT64_NX_SHIFT 63
>>  #define PT64_NX_MASK (1ULL << PT64_NX_SHIFT)
>>  
>> @@ -45,6 +51,15 @@
>>  #define PT_PAGE_TABLE_LEVEL 1
>>  #define PT_MAX_HUGEPAGE_LEVEL (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES - 1)
>>  
>> +#define PKEYS_BIT0_VALUE (1ULL << 0)
>> +#define PKEYS_BIT1_VALUE (1ULL << 1)
>> +#define PKEYS_BIT2_VALUE (1ULL << 2)
>> +#define PKEYS_BIT3_VALUE (1ULL << 3)
>> +
>> +#define PKRU_READ   0
>> +#define PKRU_WRITE  1
>> +#define PKRU_ATTRS  2
>> +
>>  static inline u64 rsvd_bits(int s, int e)
>>  {
>>  return ((1ULL << (e - s + 1)) - 1) << s;
>> @@ -145,10 +160,44 @@ static inline bool is_write_protection(struct kvm_vcpu 
>> *vcpu)
>>   * fault with the given access (in ACC_* format)?
>>   */
>>  static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu 
>> *mmu,
>> -unsigned pte_access, unsigned pfec)
>> +unsigned pte_access, unsigned pte_pkeys, unsigned pfec)
>>  {
>> -int cpl = kvm_x86_ops->get_cpl(vcpu);
>> -unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
>> +unsigned long smap, rflags;
>> +u32 pkru;
>> +int cpl, index;
>> +bool wf, uf, pk, pkru_ad, pkru_wd;
>> +
>> +cpl = kvm_x86_ops->get_cpl(vcpu);
>> +rflags = kvm_x86_ops->get_rflags(vcpu);
>> +
>> +pkru = read_pkru();
>> +
>> +/*
>> +* PKRU defines 32 bits, there are 16 domains and 2 attribute bits per
>> +* domain in pkru, pkey is index to a defined domain, so the value of
>> +* pkey * PKRU_ATTRS + R/W is offset of a defined domain attribute.
>> +*/
>> +pkru_ad = (pkru >> (pte_pkeys * PKRU_ATTRS + PKRU_READ)) & 1;
>> +pkru_wd = (pkru >> (pte_pkeys * PKRU_ATTRS + PKRU_WRITE)) & 1;
>> +
>> +wf = pfec & PFERR_WRITE_MASK;
>> +uf = pfec & PFERR_USER_MASK;
>> +
>> +/*
>> +* PKeys 2nd and 6th conditions:
>> +* 2.EFER_LMA=1
>> +* 6.PKRU.AD=1
>> +*   or The access is a data write and PKRU.WD=1 and
>> +*   either CR0.WP=1 or it is a user mode access
>> +*/
>> +pk = is_long_mode(vcpu) && (pkru_ad ||
>> +(pkru_wd && wf && (is_write_protection(vcpu) || uf)));

A little more optimized:

pkru_bits = (pkru >> (pte_pkeys * PKRU_ATTRS)) & 3;

/*
 * Ignore PKRU.WD if not relevant to this access (a read,
 * or a supervisor mode access if CR0.WP=0).
 */
if (!wf || (!uf && !is_write_protection(vcpu)))
pkru_bits &= ~(1 << PKRU_WRITE);

... and then just check pkru_bits != 0.

>> +/*
>> +* PK bit right value in pfec equal to
>> +* PK bit current value in pfec and pk value.
>> +*/
>> +pfec &= (pk << PFERR_PK_BIT) + ~PFERR_PK_MASK;
> 
> PK is only applicable to guest page tables, but if you do not support
> PKRU without EPT (patch 9), none of this is necessary, is it?

Doh. :(  Sorry, this is of course needed for the emulation case.

I think you should optimize this for the common case where pkru is zero,
hence pk will always be zero.  So something like

pkru = is_long_mode(vcpu) ? read_pkru() : 0;
if (unlikely(pkru) && (pfec & PFERR_PK_MASK)) {
... from above ... */

/* Flip PFERR_PK_MASK if pkru_bits is non-zero */
pfec ^= -pkru_bits & PFERR_PK_MASK;
}

Paolo

>>  /*
>>   * If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1.
>> @@ -163,8 +212,8 @@ static inline bool permission_fault(struct kvm_vcpu 
>> *vcpu, struct kvm_mmu *mmu,
>>   * but it will be one in index if SMAP checks are being overridden.
>>   * It is important to keep this branchless.
>>   */
>> -unsigned long smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
>> -int i

Re: [PATCH 6/9] KVM, pkeys: add pkeys support for permission_fault logic

2015-11-09 Thread Paolo Bonzini


On 09/11/2015 12:54, Huaitong Han wrote:
> Protection keys define a new 4-bit protection key field (PKEY) in bits
> 62:59 of leaf entries of the page tables, the PKEY is an index to PKRU
> register(16 domains), every domain has 2 bits(write disable bit, access
> disable bit).
> 
> Static logic has been produced in update_permission_bitmask, dynamic logic
> need read pkey from page table entries, get pkru value, and deduce the
> correct result.
> 
> Signed-off-by: Huaitong Han 
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index e4202e4..bbb 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -3,6 +3,7 @@
>  
>  #include 
>  #include "kvm_cache_regs.h"
> +#include "x86.h"
>  
>  #define PT64_PT_BITS 9
>  #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS)
> @@ -24,6 +25,11 @@
>  #define PT_PAGE_SIZE_MASK (1ULL << PT_PAGE_SIZE_SHIFT)
>  #define PT_PAT_MASK (1ULL << 7)
>  #define PT_GLOBAL_MASK (1ULL << 8)
> +
> +#define PT64_PKEY_BIT0 (1ULL << _PAGE_BIT_PKEY_BIT0)
> +#define PT64_PKEY_BIT1 (1ULL << _PAGE_BIT_PKEY_BIT1)
> +#define PT64_PKEY_BIT2 (1ULL << _PAGE_BIT_PKEY_BIT2)
> +#define PT64_PKEY_BIT3 (1ULL << _PAGE_BIT_PKEY_BIT3)
>  #define PT64_NX_SHIFT 63
>  #define PT64_NX_MASK (1ULL << PT64_NX_SHIFT)
>  
> @@ -45,6 +51,15 @@
>  #define PT_PAGE_TABLE_LEVEL 1
>  #define PT_MAX_HUGEPAGE_LEVEL (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES - 1)
>  
> +#define PKEYS_BIT0_VALUE (1ULL << 0)
> +#define PKEYS_BIT1_VALUE (1ULL << 1)
> +#define PKEYS_BIT2_VALUE (1ULL << 2)
> +#define PKEYS_BIT3_VALUE (1ULL << 3)
> +
> +#define PKRU_READ   0
> +#define PKRU_WRITE  1
> +#define PKRU_ATTRS  2
> +
>  static inline u64 rsvd_bits(int s, int e)
>  {
>   return ((1ULL << (e - s + 1)) - 1) << s;
> @@ -145,10 +160,44 @@ static inline bool is_write_protection(struct kvm_vcpu 
> *vcpu)
>   * fault with the given access (in ACC_* format)?
>   */
>  static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu 
> *mmu,
> - unsigned pte_access, unsigned pfec)
> + unsigned pte_access, unsigned pte_pkeys, unsigned pfec)
>  {
> - int cpl = kvm_x86_ops->get_cpl(vcpu);
> - unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
> + unsigned long smap, rflags;
> + u32 pkru;
> + int cpl, index;
> + bool wf, uf, pk, pkru_ad, pkru_wd;
> +
> + cpl = kvm_x86_ops->get_cpl(vcpu);
> + rflags = kvm_x86_ops->get_rflags(vcpu);
> +
> + pkru = read_pkru();
> +
> + /*
> + * PKRU defines 32 bits, there are 16 domains and 2 attribute bits per
> + * domain in pkru, pkey is index to a defined domain, so the value of
> + * pkey * PKRU_ATTRS + R/W is offset of a defined domain attribute.
> + */
> + pkru_ad = (pkru >> (pte_pkeys * PKRU_ATTRS + PKRU_READ)) & 1;
> + pkru_wd = (pkru >> (pte_pkeys * PKRU_ATTRS + PKRU_WRITE)) & 1;
> +
> + wf = pfec & PFERR_WRITE_MASK;
> + uf = pfec & PFERR_USER_MASK;
> +
> + /*
> + * PKeys 2nd and 6th conditions:
> + * 2.EFER_LMA=1
> + * 6.PKRU.AD=1
> + *   or The access is a data write and PKRU.WD=1 and
> + *   either CR0.WP=1 or it is a user mode access
> + */
> + pk = is_long_mode(vcpu) && (pkru_ad ||
> + (pkru_wd && wf && (is_write_protection(vcpu) || uf)));
> +
> + /*
> + * PK bit right value in pfec equal to
> + * PK bit current value in pfec and pk value.
> + */
> + pfec &= (pk << PFERR_PK_BIT) + ~PFERR_PK_MASK;

PK is only applicable to guest page tables, but if you do not support
PKRU without EPT (patch 9), none of this is necessary, is it?

>   /*
>* If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1.
> @@ -163,8 +212,8 @@ static inline bool permission_fault(struct kvm_vcpu 
> *vcpu, struct kvm_mmu *mmu,
>* but it will be one in index if SMAP checks are being overridden.
>* It is important to keep this branchless.
>*/
> - unsigned long smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
> - int index = (pfec >> 1) +
> + smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
> + index = (pfec >> 1) +
>   (smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1));
>  
>   WARN_ON(pfec & PFERR_RSVD_MASK);
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 736e6ab..99563bc 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -253,6 +253,17 @@ static int FNAME(update_accessed_dirty_bits)(struct 
> kvm_vcpu *vcpu,
>   }
>   return 0;
>  }
> +static inline unsigned FNAME(gpte_pkeys)(struct kvm_vcpu *vcpu, u64 gpte)
> +{
> + unsigned pkeys = 0;
> +#if PTTYPE == 64
> + pkeys = ((gpte & PT64_PKEY_BIT0) ? PKEYS_BIT0_VALUE : 0) |
> + ((gpte & PT64_PKEY_BIT1) ? PKEYS_BIT1_VALUE : 0) |
> + ((gpte & PT64_PKEY_BIT2) ? PKEYS_BIT2_VALUE : 0) |
> + ((gpte & PT64_PKEY_BIT3) ? PKEYS_BIT3_VALUE : 0);

This is just pkeys = (gpte >> _PAGE_BIT_PKEY

Re: [PATCH v6 05/33] acpi: add aml_object_type

2015-11-09 Thread Igor Mammedov
On Mon, 9 Nov 2015 13:35:51 +0200
"Michael S. Tsirkin"  wrote:

> On Fri, Oct 30, 2015 at 01:55:59PM +0800, Xiao Guangrong wrote:
> > Implement ObjectType which is used by NVDIMM _DSM method in
> > later patch
> > 
> > Signed-off-by: Xiao Guangrong 
> 
> I had to go dig in the _DSM patch to see how it's used.
> And sure enough, callers have to know AML to make
> sense of it. So pls don't split out tiny patches like this.
> include the callee with the caller.
I'd prefer AML API patches as separate ones, as it makes
easier to review vs ACPI spec and also easier to reuse
in another series.
And once they are ok/reviewed we can merge them ahead of
users, so next series respins won't have to post the same
patches over and over and we won't have to review them
again every respin and others could use already merged API
instead of duplicating work that's been already done.

> 
> > ---
> >  hw/acpi/aml-build.c | 8 
> >  include/hw/acpi/aml-build.h | 1 +
> >  2 files changed, 9 insertions(+)
> > 
> > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> > index efc06ab..9f792ab 100644
> > --- a/hw/acpi/aml-build.c
> > +++ b/hw/acpi/aml-build.c
> > @@ -1178,6 +1178,14 @@ Aml *aml_concatenate(Aml *source1, Aml *source2, Aml 
> > *target)
> >  return var;
> >  }
> >  
> > +/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefObjectType */
> > +Aml *aml_object_type(Aml *object)
> > +{
> > +Aml *var = aml_opcode(0x8E /* ObjectTypeOp */);
> > +aml_append(var, object);
> > +return var;
> > +}
> > +
> 
> It would be better to have a higher level API
> that can be used without knowning AML.
> For example:
> 
>   aml_object_type_is_package()
Higher level API is fine but it's better be done
on top of AML one, i.e. add it in addition to
 aml_object_type()


> 
> 
> 
> >  void
> >  build_header(GArray *linker, GArray *table_data,
> >   AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
> > diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> > index 325782d..5b8a118 100644
> > --- a/include/hw/acpi/aml-build.h
> > +++ b/include/hw/acpi/aml-build.h
> > @@ -278,6 +278,7 @@ Aml *aml_derefof(Aml *arg);
> >  Aml *aml_sizeof(Aml *arg);
> >  Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name);
> >  Aml *aml_concatenate(Aml *source1, Aml *source2, Aml *target);
> > +Aml *aml_object_type(Aml *object);
> >  
> >  void
> >  build_header(GArray *linker, GArray *table_data,
> > -- 
> > 1.8.3.1
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM, pkeys: expose CPUID:OSPKE to guest

2015-11-09 Thread Paolo Bonzini


On 09/11/2015 12:54, Huaitong Han wrote:
> This patch exposes X86_FEATURE_OSPKE to guest, X86_FEATURE_OSPKE is
> software support for pkeys, enumerated with CPUID.7.0.ECX[4]:OSPKE,
> and it reflects the setting of CR4.PKE.
> 
> Signed-off-by: Huaitong Han 
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 29e6502..ece687b 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -81,6 +81,17 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
>   apic->lapic_timer.timer_mode_mask = 1 << 17;
>   }
>  
> + best = kvm_find_cpuid_entry(vcpu, 7, 0);
> + if (!best)
> + return 0;
> +
> + /*Update OSPKE bit */
> + if (boot_cpu_has(X86_FEATURE_PKU) && best->function == 0x7) {
> + best->ecx &= ~F(OSPKE);
> + if (kvm_read_cr4_bits(vcpu, X86_CR4_PKE))
> + best->ecx |= F(OSPKE);
> + }
> +
>   best = kvm_find_cpuid_entry(vcpu, 0xD, 0);
>   if (!best) {
>   vcpu->arch.guest_supported_xcr0 = 0;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d268da0..5181834 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -754,7 +754,7 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>   (!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
>   kvm_mmu_reset_context(vcpu);
>  
> - if ((cr4 ^ old_cr4) & X86_CR4_OSXSAVE)
> + if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
>   kvm_update_cpuid(vcpu);
>  
>   return 0;
> @@ -6841,7 +6841,7 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
>  
>   mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
>   kvm_x86_ops->set_cr4(vcpu, sregs->cr4);
> - if (sregs->cr4 & X86_CR4_OSXSAVE)
> + if (sregs->cr4 & (X86_CR4_OSXSAVE | X86_CR4_PKE))
>   kvm_update_cpuid(vcpu);
>  
>   idx = srcu_read_lock(&vcpu->kvm->srcu);
> 

Hi, please squash these first three patches and move them last.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Paolo Bonzini


On 09/11/2015 13:15, Michael S. Tsirkin wrote:
> Well that's not exactly true. I think we would like to make
> it possible to put virtio devices behind an IOMMU on x86,
> but if this means existing guests break, then many people won't be able
> to use this option: having to find out which kernel version your guest
> is running is a significant burden.
> 
> So on the host side, we need to detect guests that
> don't program the IOMMU and make QEMU ignore it.
> I think we need to figure out a way to do this
> before we commit to the guest change.

What is the usecase for putting virtio devices behind an IOMMU, apart from:

1) "because you can"

2) using VFIO within the guest

?  Case 1 can be ignored, and in case 2 the guest will do the right thing.

> Additionally, IOMMU overhead is very high when running within the VM.
> So for uses such as VFIO, we'd like a way to make something like
> iommu-pt the default.

That's not something that the kernel cares about.  It's just a
configuration issue.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch v2] vfio/pci: make an array larger

2015-11-09 Thread Dan Carpenter
Smatch complains about a possible out of bounds error:

drivers/vfio/pci/vfio_pci_config.c:1241 vfio_cap_init()
error: buffer overflow 'pci_cap_length' 20 <= 20

The problem is that pci_cap_length[] was defined as large enough to
hold "PCI_CAP_ID_AF + 1" elements.  The code in vfio_cap_init() assumes
it has PCI_CAP_ID_MAX + 1 elements.  Originally, PCI_CAP_ID_AF and
PCI_CAP_ID_MAX were the same but then we introduced PCI_CAP_ID_EA in
f80b0ba95964 ('PCI: Add Enhanced Allocation register entries') so now
the array is too small.

Let's fix this by making the array size PCI_CAP_ID_MAX + 1.  And let's
make a similar change to pci_ext_cap_length[] for consistency.  Also
both these arrays can be made const.

Signed-off-by: Dan Carpenter 
---
v2: more cleanups

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index a8657ef..fe2b470 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -46,7 +46,7 @@
  *   0: Removed from the user visible capability list
  *   FF: Variable length
  */
-static u8 pci_cap_length[] = {
+static const u8 pci_cap_length[PCI_CAP_ID_MAX + 1] = {
[PCI_CAP_ID_BASIC]  = PCI_STD_HEADER_SIZEOF, /* pci config header */
[PCI_CAP_ID_PM] = PCI_PM_SIZEOF,
[PCI_CAP_ID_AGP]= PCI_AGP_SIZEOF,
@@ -74,7 +74,7 @@ static u8 pci_cap_length[] = {
  *   0: Removed or masked from the user visible capabilty list
  *   FF: Variable length
  */
-static u16 pci_ext_cap_length[] = {
+static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = {
[PCI_EXT_CAP_ID_ERR]=   PCI_ERR_ROOT_COMMAND,
[PCI_EXT_CAP_ID_VC] =   0xFF,
[PCI_EXT_CAP_ID_DSN]=   PCI_EXT_CAP_DSN_SIZEOF,
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] qemu, pkeys: add pkeys support for qemu cpuid handling

2015-11-09 Thread Paolo Bonzini


On 09/11/2015 12:55, Huaitong Han wrote:
> @@ -351,6 +362,7 @@ static const char *cpuid_6_feature_name[] = {
>CPUID_7_0_EBX_FSGSBASE, CPUID_7_0_EBX_HLE, CPUID_7_0_EBX_AVX2,
>CPUID_7_0_EBX_ERMS, CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM,
>CPUID_7_0_EBX_RDSEED */
> +#define TCG_7_0_ECX_FEATURES (CPUID_7_0_ECX_PKU | CPUID_7_0_ECX_OSPKE)

This should be zero.  Apart from this detail, the QEMU parts look good.

Paolo

>  #define TCG_APM_FEATURES 0
>  #define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT
>  
> @@ -408,6 +420,13 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] 
> = {
>  .cpuid_reg = R_EBX,
>  .tcg_features = TCG_7_0_EBX_FEATURES,
>  },
> +[FEAT_7_0_ECX] = {
> +.feat_names = cpuid_7_0_ecx_feature_name,
> +.cpuid_eax = 7,
> +.cpuid_needs_ecx = true, .cpuid_ecx = 0,
> +.cpuid_reg = R_ECX,
> +.tcg_features = TCG_7_0_ECX_FEATURES,
> +},
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 107561] New: 4.2 breaks PCI passthrough in QEMU/KVM

2015-11-09 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=107561

Bug ID: 107561
   Summary: 4.2 breaks PCI passthrough in QEMU/KVM
   Product: Virtualization
   Version: unspecified
Kernel Version: 4.2
  Hardware: All
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: kvm
  Assignee: virtualization_...@kernel-bugs.osdl.org
  Reporter: fhor...@gmail.com
Regression: No

QEMU/KVM VM's using PCI pass-through are taking a long time to boot, or never
booting. The issue is only present when the VM has more then ~2.5G of ram. I
was able to get a VM with 6G to boot, but I had to let it run overnight.

More info can be found in this thread:
https://bbs.archlinux.org/viewtopic.php?id=203240

Someone in the forum seems to have narrowed it down to a a set of commits
regarding kvm and mtrr in 4.1-rc2 that are in the mainline kernel as of 4.2.

Specifically commit fa61213746a706dd975661151c35795ca4dd82c2 KVM: MTRR:
simplify kvm_mtrr_get_guest_memory_type.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Michael S. Tsirkin
On Thu, Oct 29, 2015 at 06:09:45PM -0700, Andy Lutomirski wrote:
> This switches virtio to use the DMA API unconditionally.  I'm sure
> it breaks things, but it seems to work on x86 using virtio-pci, with
> and without Xen, and using both the modern 1.0 variant and the
> legacy variant.
> 
> This appears to work on native and Xen x86_64 using both modern and
> legacy virtio-pci.  It also appears to work on arm and arm64.
> 
> It definitely won't work as-is on s390x, and I haven't been able to
> test Christian's patches because I can't get virtio-ccw to work in
> QEMU at all.  I don't know what I'm doing wrong.
> 
> It doesn't work on ppc64.  Ben, consider yourself pinged to send me
> a patch :)
> 
> It doesn't work on sparc64.  I didn't realize at Kernel Summit that
> sparc64 has the same problem as ppc64.
> 
> DaveM, for background, we're trying to fix virtio to use the DMA
> API.  That will require that every platform that uses virtio
> supplies valid DMA operations on devices that use virtio_ring.
> Unfortunately, QEMU historically ignores the IOMMU on virtio
> devices.
> 
> On x86, this isn't really a problem.  x86 has a nice way for the
> platform to describe which devices are behind an IOMMU, and QEMU
> will be adjusted accordingly.  The only thing that will break is a
> recently-added experimental mode.

Well that's not exactly true. I think we would like to make
it possible to put virtio devices behind an IOMMU on x86,
but if this means existing guests break, then many people won't be able
to use this option: having to find out which kernel version your guest
is running is a significant burden.


So on the host side, we need to detect guests that
don't program the IOMMU and make QEMU ignore it.
I think we need to figure out a way to do this
before we commit to the guest change.

Additionally, IOMMU overhead is very high when running within the VM.
So for uses such as VFIO, we'd like a way to make something like
iommu-pt the default.



> Ben's plan for powerpc is to add a quirk for existing virtio-pci
> devices and to eventually update the devicetree stuff to allow QEMU
> to tell the guest which devices use the IOMMU.
> 
> AFAICT sparc has a similar problem to powerpc.  DaveM, can you come
> up with a straightforward way to get sparc's DMA API to work
> correctly for virtio-pci devices?
> 
> NB: Sadly, the platforms I've successfully tested on don't include any
> big-endian platforms, so there could still be lurking endian problems.
> 
> Changes from v3:
>  - More big-endian fixes.
>  - Added better virtio-ring APIs that handle allocation and use them in
>virtio-mmio and virtio-pci.
>  - Switch to Michael's virtio-net patch.
> 
> Changes from v2:
>  - Fix vring_mapping_error incorrect argument
> 
> Changes from v1:
>  - Fix an endian conversion error causing a BUG to hit.
>  - Fix a DMA ordering issue (swiotlb=force works now).
>  - Minor cleanups.
> 
> Andy Lutomirski (5):
>   virtio_ring: Support DMA APIs
>   virtio_pci: Use the DMA API
>   virtio: Add improved queue allocation API
>   virtio_mmio: Use the DMA API
>   virtio_pci: Use the DMA API
> 
> Michael S. Tsirkin (1):
>   virtio-net: Stop doing DMA from the stack
> 
>  drivers/net/virtio_net.c   |  34 ++--
>  drivers/virtio/Kconfig |   2 +-
>  drivers/virtio/virtio_mmio.c   |  67 ++-
>  drivers/virtio/virtio_pci_common.h |   6 -
>  drivers/virtio/virtio_pci_legacy.c |  42 ++---
>  drivers/virtio/virtio_pci_modern.c |  61 ++-
>  drivers/virtio/virtio_ring.c   | 348 
> ++---
>  include/linux/virtio.h |  23 ++-
>  include/linux/virtio_ring.h|  35 
>  tools/virtio/linux/dma-mapping.h   |  17 ++
>  10 files changed, 426 insertions(+), 209 deletions(-)
>  create mode 100644 tools/virtio/linux/dma-mapping.h
> 
> -- 
> 2.4.3
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] qemu, pkeys: add memory protection-key support

2015-11-09 Thread Huaitong Han
The protection-key feature provides an additional mechanism by which IA-32e
paging controls access to usermode addresses.

Hardware support for protection keys for user pages is enumerated with CPUID
feature flag CPUID.7.0.ECX[3]:PKU. Software support is CPUID.7.0.ECX[4]:OSPKE
with the setting of CR4.PKE(bit 22).

The PKRU register is XSAVE-managed state CPUID.D.0.EAX[9], the size of XSAVE
state component for PKRU is 8 bytes, the offset is 0xa80.

The specification of Protection Keys can be found at SDM (4.6.2, volume 3)
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf.

Huaitong Han (3):
  qemu, pkeys: add pkeys support for qemu cpuid handling
  qemu, pkeys: add pkeys support for qemu xsave state handling
  qemu, pkeys: add pkeys support for qemu migration

 target-i386/cpu.c | 23 ++-
 target-i386/cpu.h |  7 +++
 target-i386/kvm.c |  3 +++
 target-i386/machine.c | 23 +++
 4 files changed, 55 insertions(+), 1 deletion(-)

-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 7/9] KVM, pkeys: Add pkeys support for gva_to_gpa funcions

2015-11-09 Thread Huaitong Han
This patch adds pkeys support for gva_to_gpa funcions.

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7a84b83..6e9156d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3960,6 +3960,8 @@ gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, 
gva_t gva,
  struct x86_exception *exception)
 {
u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
+   access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE)
+   ? PFERR_PK_MASK : 0;
return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
 }
 
@@ -3968,6 +3970,8 @@ gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, 
gva_t gva,
 {
u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
access |= PFERR_FETCH_MASK;
+   access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE)
+   ? PFERR_PK_MASK : 0;
return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
 }
 
@@ -3976,6 +3980,8 @@ gpa_t kvm_mmu_gva_to_gpa_write(struct kvm_vcpu *vcpu, 
gva_t gva,
 {
u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
access |= PFERR_WRITE_MASK;
+   access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE)
+   ? PFERR_PK_MASK : 0;
return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
 }
 
@@ -4026,10 +4032,14 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt 
*ctxt,
u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
unsigned offset;
int ret;
+   gpa_t gpa;
+
+   access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE)
+   ? PFERR_PK_MASK : 0;
 
/* Inline kvm_read_guest_virt_helper for speed.  */
-   gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, 
access|PFERR_FETCH_MASK,
-   exception);
+   gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr,
+   access | PFERR_FETCH_MASK, exception);
if (unlikely(gpa == UNMAPPED_GVA))
return X86EMUL_PROPAGATE_FAULT;
 
@@ -4050,6 +4060,8 @@ int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt,
 {
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
+   access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE)
+   ? PFERR_PK_MASK : 0;
 
return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
  exception);
@@ -4073,9 +4085,14 @@ int kvm_write_guest_virt_system(struct x86_emulate_ctxt 
*ctxt,
void *data = val;
int r = X86EMUL_CONTINUE;
 
+   u32 access = PFERR_WRITE_MASK;
+
+   access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE)
+   ? PFERR_PK_MASK : 0;
+
while (bytes) {
gpa_t gpa =  vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr,
-PFERR_WRITE_MASK,
+access,
 exception);
unsigned offset = addr & (PAGE_SIZE-1);
unsigned towrite = min(bytes, (unsigned)PAGE_SIZE - offset);
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] qemu, pkeys: add pkeys support for qemu xsave state handling

2015-11-09 Thread Huaitong Han
This patch adds pkeys support for qemu xsave state handling.

Signed-off-by: Huaitong Han 
---
 target-i386/cpu.c | 2 ++
 target-i386/cpu.h | 3 +++
 target-i386/kvm.c | 3 +++
 3 files changed, 8 insertions(+)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 575ad8d..7a6a3f8 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -487,6 +487,8 @@ static const ExtSaveArea ext_save_areas[] = {
 .offset = 0x480, .size = 0x200 },
 [7] = { .feature = FEAT_7_0_EBX, .bits = CPUID_7_0_EBX_AVX512F,
 .offset = 0x680, .size = 0x400 },
+[9] = { .feature = FEAT_7_0_ECX, .bits = CPUID_7_0_ECX_PKU,
+.offset = 0xA80, .size = 0x8 },
 };
 
 const char *get_register_name_32(unsigned int reg)
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index c2e7501..2230b3e 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -401,6 +401,7 @@
 #define XSTATE_OPMASK   (1ULL << 5)
 #define XSTATE_ZMM_Hi256(1ULL << 6)
 #define XSTATE_Hi16_ZMM (1ULL << 7)
+#define XSTATE_PKRU (1ULL << 9)
 
 
 /* CPUID feature words */
@@ -984,6 +985,8 @@ typedef struct CPUX86State {
 uint64_t xcr0;
 uint64_t xss;
 
+uint32_t pkru;
+
 TPRAccess tpr_access_type;
 } CPUX86State;
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 066d03d..12164a6 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1092,6 +1092,7 @@ static int kvm_put_fpu(X86CPU *cpu)
 #define XSAVE_OPMASK  272
 #define XSAVE_ZMM_Hi256   288
 #define XSAVE_Hi16_ZMM416
+#define XSAVE_PKRU672
 
 static int kvm_put_xsave(X86CPU *cpu)
 {
@@ -1145,6 +1146,7 @@ static int kvm_put_xsave(X86CPU *cpu)
 #ifdef TARGET_X86_64
 memcpy(&xsave->region[XSAVE_Hi16_ZMM], &env->xmm_regs[16],
 16 * sizeof env->xmm_regs[16]);
+memcpy(&xsave->region[XSAVE_PKRU], &env->pkru, sizeof env->pkru);
 #endif
 r = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_XSAVE, xsave);
 return r;
@@ -1516,6 +1518,7 @@ static int kvm_get_xsave(X86CPU *cpu)
 #ifdef TARGET_X86_64
 memcpy(&env->xmm_regs[16], &xsave->region[XSAVE_Hi16_ZMM],
16 * sizeof env->xmm_regs[16]);
+memcpy(&xsave->region[XSAVE_PKRU], &env->pkru, sizeof env->pkru);
 #endif
 return 0;
 }
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/9] KVM, pkeys: update memeory permission bitmask for pkeys

2015-11-09 Thread Huaitong Han
Pkeys define a new status bit in the PFEC. PFEC.PK (bit 5), if some
conditions is true, the fault is considered as a PKU violation.

This patch updates memeory permission bitmask for pkeys.

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3bbc1cb..3fc6ada 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -159,12 +159,14 @@ enum {
 #define PFERR_USER_BIT 2
 #define PFERR_RSVD_BIT 3
 #define PFERR_FETCH_BIT 4
+#define PFERR_PK_BIT 5
 
 #define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT)
 #define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT)
 #define PFERR_USER_MASK (1U << PFERR_USER_BIT)
 #define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT)
 #define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT)
+#define PFERR_PK_MASK (1U << PFERR_PK_BIT)
 
 /* apic attention bits */
 #define KVM_APIC_CHECK_VAPIC   0
@@ -290,8 +292,10 @@ struct kvm_mmu {
 * Bitmap; bit set = permission fault
 * Byte index: page fault error code [4:1]
 * Bit index: pte permissions in ACC_* format
+*
+* Add PFEC.PK (bit 5) for protection-key violations
 */
-   u8 permissions[16];
+   u8 permissions[32];
 
u64 *pae_root;
u64 *lm_root;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 69088a1..0568635 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3793,16 +3793,22 @@ static void update_permission_bitmask(struct kvm_vcpu 
*vcpu,
 {
unsigned bit, byte, pfec;
u8 map;
-   bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, cr4_smep, smap = 0;
+   bool fault, x, w, u, smap = 0, pku = 0;
+   bool wf, uf, ff, smapf, rsvdf, pkuf;
+   bool cr4_smap, cr4_smep, cr4_pku;
 
cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
+   cr4_pku = kvm_read_cr4_bits(vcpu, X86_CR4_PKE);
+
for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
pfec = byte << 1;
map = 0;
wf = pfec & PFERR_WRITE_MASK;
uf = pfec & PFERR_USER_MASK;
ff = pfec & PFERR_FETCH_MASK;
+   rsvdf = pfec & PFERR_RSVD_MASK;
+   pkuf = pfec & PFERR_PK_MASK;
/*
 * PFERR_RSVD_MASK bit is set in PFEC if the access is not
 * subject to SMAP restrictions, and cleared otherwise. The
@@ -3841,12 +3847,34 @@ static void update_permission_bitmask(struct kvm_vcpu 
*vcpu,
 *   clearer.
 */
smap = cr4_smap && u && !uf && !ff;
+
+   /*
+   * PKU:additional mechanism by which the paging
+   * controls access to user-mode addresses based
+   * on the value in the PKRU register. A fault is
+   * considered as a PKU violation if all of the
+   * following conditions are true:
+   * 1.CR4_PKE=1.
+   * 2.EFER_LMA=1.
+   * 3.page is present with no reserved bit
+   *   violations.
+   * 4.the access is not an instruction fetch.
+   * 5.the access is to a user page.
+   * 6.PKRU.AD=1
+   *   or The access is a data write and
+   *  PKRU.WD=1 and either CR0.WP=1
+   *  or it is a user access.
+   *
+   * The 2nd and 6th conditions are computed
+   * dynamically in permission_fault.
+   */
+   pku = cr4_pku && !rsvdf && !ff && u;
} else
/* Not really needed: no U/S accesses on ept  */
u = 1;
 
fault = (ff && !x) || (uf && !u) || (wf && !w) ||
-   (smapf && smap);
+   (smapf && smap) || (pkuf && pku);
map |= fault << bit;
}
mmu->permissions[byte] = map;
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/9] KVM, pkeys: add pkeys support for permission_fault logic

2015-11-09 Thread Huaitong Han
Protection keys define a new 4-bit protection key field (PKEY) in bits
62:59 of leaf entries of the page tables, the PKEY is an index to PKRU
register(16 domains), every domain has 2 bits(write disable bit, access
disable bit).

Static logic has been produced in update_permission_bitmask, dynamic logic
need read pkey from page table entries, get pkru value, and deduce the
correct result.

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e4202e4..bbb 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -3,6 +3,7 @@
 
 #include 
 #include "kvm_cache_regs.h"
+#include "x86.h"
 
 #define PT64_PT_BITS 9
 #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS)
@@ -24,6 +25,11 @@
 #define PT_PAGE_SIZE_MASK (1ULL << PT_PAGE_SIZE_SHIFT)
 #define PT_PAT_MASK (1ULL << 7)
 #define PT_GLOBAL_MASK (1ULL << 8)
+
+#define PT64_PKEY_BIT0 (1ULL << _PAGE_BIT_PKEY_BIT0)
+#define PT64_PKEY_BIT1 (1ULL << _PAGE_BIT_PKEY_BIT1)
+#define PT64_PKEY_BIT2 (1ULL << _PAGE_BIT_PKEY_BIT2)
+#define PT64_PKEY_BIT3 (1ULL << _PAGE_BIT_PKEY_BIT3)
 #define PT64_NX_SHIFT 63
 #define PT64_NX_MASK (1ULL << PT64_NX_SHIFT)
 
@@ -45,6 +51,15 @@
 #define PT_PAGE_TABLE_LEVEL 1
 #define PT_MAX_HUGEPAGE_LEVEL (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES - 1)
 
+#define PKEYS_BIT0_VALUE (1ULL << 0)
+#define PKEYS_BIT1_VALUE (1ULL << 1)
+#define PKEYS_BIT2_VALUE (1ULL << 2)
+#define PKEYS_BIT3_VALUE (1ULL << 3)
+
+#define PKRU_READ   0
+#define PKRU_WRITE  1
+#define PKRU_ATTRS  2
+
 static inline u64 rsvd_bits(int s, int e)
 {
return ((1ULL << (e - s + 1)) - 1) << s;
@@ -145,10 +160,44 @@ static inline bool is_write_protection(struct kvm_vcpu 
*vcpu)
  * fault with the given access (in ACC_* format)?
  */
 static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
-   unsigned pte_access, unsigned pfec)
+   unsigned pte_access, unsigned pte_pkeys, unsigned pfec)
 {
-   int cpl = kvm_x86_ops->get_cpl(vcpu);
-   unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
+   unsigned long smap, rflags;
+   u32 pkru;
+   int cpl, index;
+   bool wf, uf, pk, pkru_ad, pkru_wd;
+
+   cpl = kvm_x86_ops->get_cpl(vcpu);
+   rflags = kvm_x86_ops->get_rflags(vcpu);
+
+   pkru = read_pkru();
+
+   /*
+   * PKRU defines 32 bits, there are 16 domains and 2 attribute bits per
+   * domain in pkru, pkey is index to a defined domain, so the value of
+   * pkey * PKRU_ATTRS + R/W is offset of a defined domain attribute.
+   */
+   pkru_ad = (pkru >> (pte_pkeys * PKRU_ATTRS + PKRU_READ)) & 1;
+   pkru_wd = (pkru >> (pte_pkeys * PKRU_ATTRS + PKRU_WRITE)) & 1;
+
+   wf = pfec & PFERR_WRITE_MASK;
+   uf = pfec & PFERR_USER_MASK;
+
+   /*
+   * PKeys 2nd and 6th conditions:
+   * 2.EFER_LMA=1
+   * 6.PKRU.AD=1
+   *   or The access is a data write and PKRU.WD=1 and
+   *   either CR0.WP=1 or it is a user mode access
+   */
+   pk = is_long_mode(vcpu) && (pkru_ad ||
+   (pkru_wd && wf && (is_write_protection(vcpu) || uf)));
+
+   /*
+   * PK bit right value in pfec equal to
+   * PK bit current value in pfec and pk value.
+   */
+   pfec &= (pk << PFERR_PK_BIT) + ~PFERR_PK_MASK;
 
/*
 * If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1.
@@ -163,8 +212,8 @@ static inline bool permission_fault(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
 * but it will be one in index if SMAP checks are being overridden.
 * It is important to keep this branchless.
 */
-   unsigned long smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
-   int index = (pfec >> 1) +
+   smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
+   index = (pfec >> 1) +
(smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1));
 
WARN_ON(pfec & PFERR_RSVD_MASK);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 736e6ab..99563bc 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -253,6 +253,17 @@ static int FNAME(update_accessed_dirty_bits)(struct 
kvm_vcpu *vcpu,
}
return 0;
 }
+static inline unsigned FNAME(gpte_pkeys)(struct kvm_vcpu *vcpu, u64 gpte)
+{
+   unsigned pkeys = 0;
+#if PTTYPE == 64
+   pkeys = ((gpte & PT64_PKEY_BIT0) ? PKEYS_BIT0_VALUE : 0) |
+   ((gpte & PT64_PKEY_BIT1) ? PKEYS_BIT1_VALUE : 0) |
+   ((gpte & PT64_PKEY_BIT2) ? PKEYS_BIT2_VALUE : 0) |
+   ((gpte & PT64_PKEY_BIT3) ? PKEYS_BIT3_VALUE : 0);
+#endif
+   return pkeys;
+}
 
 /*
  * Fetch a guest pte for a guest virtual address
@@ -265,12 +276,13 @@ static int FNAME(walk_addr_generic)(struct guest_walker 
*walker,
pt_element_t pte;
pt_element_t __user *uninitialized_var(ptep_user);
gfn_t table_gfn;
-   unsigned index, pt_access, pte_access, accessed_dirty;
+   unsigned index,

[PATCH 3/3] qemu, pkeys: add pkeys support for qemu migration

2015-11-09 Thread Huaitong Han
This patch adds pkeys support for qemu migration.

Signed-off-by: Huaitong Han 
---
 target-i386/machine.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/target-i386/machine.c b/target-i386/machine.c
index a0df64b..1b190c7 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -725,6 +725,26 @@ static const VMStateDescription vmstate_xss = {
 VMSTATE_END_OF_LIST()
 }
 };
+#ifdef TARGET_X86_64
+static bool pkru_needed(void *opaque)
+{
+X86CPU *cpu = opaque;
+CPUX86State *env = &cpu->env;
+
+return env->pkru != 0;
+}
+
+static const VMStateDescription vmstate_pkru = {
+.name = "cpu/pkru",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = pkru_needed,
+.fields = (VMStateField[]){
+VMSTATE_UINT32(env.pkru, X86CPU),
+VMSTATE_END_OF_LIST()
+}
+};
+#endif
 
 VMStateDescription vmstate_x86_cpu = {
 .name = "cpu",
@@ -844,6 +864,9 @@ VMStateDescription vmstate_x86_cpu = {
 &vmstate_msr_hyperv_time,
 &vmstate_avx512,
 &vmstate_xss,
+#ifdef TARGET_X86_64
+&vmstate_pkru,
+#endif
 NULL
 }
 };
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 8/9] KVM, pkeys: add pkeys support for xsave state

2015-11-09 Thread Huaitong Han
This patch adds pkeys support for xsave state.

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f2afa5f..0f71d5d 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -182,7 +182,8 @@ bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu 
*vcpu, gfn_t gfn,
 
 #define KVM_SUPPORTED_XCR0 (XFEATURE_MASK_FP | XFEATURE_MASK_SSE \
| XFEATURE_MASK_YMM | XFEATURE_MASK_BNDREGS \
-   | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512)
+   | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
+   | XFEATURE_MASK_PKRU)
 extern u64 host_xcr0;
 
 extern u64 kvm_supported_xcr0(void);
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 9/9] KVM, pkeys: disable PKU feature without ept

2015-11-09 Thread Huaitong Han
This patch disables CPUID:PKU without ept.

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index ece687b..e1113ae 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -447,6 +447,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
entry->ebx |= F(TSC_ADJUST);
entry->ecx &= kvm_supported_word11_x86_features;
cpuid_mask(&entry->ecx, 13);
+   if (!tdp_enabled)
+   entry->ecx &= ~F(PKU);
} else {
entry->ebx = 0;
entry->ecx = 0;
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/9] KVM, pkeys: disable pkeys for guests in non-paging mode

2015-11-09 Thread Huaitong Han
Pkeys is disabled if CPU is in non-paging mode in hardware. However KVM
always uses paging mode to emulate guest non-paging, mode with TDP. To
emulate this behavior, pkeys needs to be manually disabled when guest
switches to non-paging mode.

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d019868..9b12c80 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3645,14 +3645,14 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned 
long cr4)
hw_cr4 &= ~X86_CR4_PAE;
hw_cr4 |= X86_CR4_PSE;
/*
-* SMEP/SMAP is disabled if CPU is in non-paging mode
-* in hardware. However KVM always uses paging mode to
-* emulate guest non-paging mode with TDP.
-* To emulate this behavior, SMEP/SMAP needs to be
+* SMEP/SMAP/PKU is disabled if CPU is in non-paging
+* mode in hardware. However KVM always uses paging
+* mode to emulate guest non-paging mode with TDP.
+* To emulate this behavior, SMEP/SMAP/PKU needs to be
 * manually disabled when guest switches to non-paging
 * mode.
 */
-   hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
+   hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
} else if (!(cr4 & X86_CR4_PAE)) {
hw_cr4 &= ~X86_CR4_PAE;
}
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] qemu, pkeys: add pkeys support for qemu cpuid handling

2015-11-09 Thread Huaitong Han
This patch adds pkeys support for qemu cpuid handling.

Signed-off-by: Huaitong Han 
---
 target-i386/cpu.c | 21 -
 target-i386/cpu.h |  4 
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 4d1b085..575ad8d 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -264,6 +264,17 @@ static const char *cpuid_7_0_ebx_feature_name[] = {
 NULL, NULL, "avx512pf", "avx512er", "avx512cd", NULL, NULL, NULL,
 };
 
+static const char *cpuid_7_0_ecx_feature_name[] = {
+NULL, NULL, "pku", "ospke",
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+};
+
 static const char *cpuid_apm_edx_feature_name[] = {
 NULL, NULL, NULL, NULL,
 NULL, NULL, NULL, NULL,
@@ -351,6 +362,7 @@ static const char *cpuid_6_feature_name[] = {
   CPUID_7_0_EBX_FSGSBASE, CPUID_7_0_EBX_HLE, CPUID_7_0_EBX_AVX2,
   CPUID_7_0_EBX_ERMS, CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM,
   CPUID_7_0_EBX_RDSEED */
+#define TCG_7_0_ECX_FEATURES (CPUID_7_0_ECX_PKU | CPUID_7_0_ECX_OSPKE)
 #define TCG_APM_FEATURES 0
 #define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT
 
@@ -408,6 +420,13 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
 .cpuid_reg = R_EBX,
 .tcg_features = TCG_7_0_EBX_FEATURES,
 },
+[FEAT_7_0_ECX] = {
+.feat_names = cpuid_7_0_ecx_feature_name,
+.cpuid_eax = 7,
+.cpuid_needs_ecx = true, .cpuid_ecx = 0,
+.cpuid_reg = R_ECX,
+.tcg_features = TCG_7_0_ECX_FEATURES,
+},
 [FEAT_8000_0007_EDX] = {
 .feat_names = cpuid_apm_edx_feature_name,
 .cpuid_eax = 0x8007,
@@ -2401,7 +2420,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 if (count == 0) {
 *eax = 0; /* Maximum ECX value for sub-leaves */
 *ebx = env->features[FEAT_7_0_EBX]; /* Feature flags */
-*ecx = 0; /* Reserved */
+*ecx = env->features[FEAT_7_0_ECX]; /* Feature flags */
 *edx = 0; /* Reserved */
 } else {
 *eax = 0;
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index ead2832..c2e7501 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -408,6 +408,7 @@ typedef enum FeatureWord {
 FEAT_1_EDX, /* CPUID[1].EDX */
 FEAT_1_ECX, /* CPUID[1].ECX */
 FEAT_7_0_EBX,   /* CPUID[EAX=7,ECX=0].EBX */
+FEAT_7_0_ECX,   /* CPUID[EAX=7,ECX=0].ECX */
 FEAT_8000_0001_EDX, /* CPUID[8000_0001].EDX */
 FEAT_8000_0001_ECX, /* CPUID[8000_0001].ECX */
 FEAT_8000_0007_EDX, /* CPUID[8000_0007].EDX */
@@ -576,6 +577,9 @@ typedef uint32_t FeatureWordArray[FEATURE_WORDS];
 #define CPUID_7_0_EBX_AVX512ER (1U << 27) /* AVX-512 Exponential and 
Reciprocal */
 #define CPUID_7_0_EBX_AVX512CD (1U << 28) /* AVX-512 Conflict Detection */
 
+#define CPUID_7_0_ECX_PKU  (1U << 3)
+#define CPUID_7_0_ECX_OSPKE(1U << 4)
+
 #define CPUID_XSAVE_XSAVEOPT   (1U << 0)
 #define CPUID_XSAVE_XSAVEC (1U << 1)
 #define CPUID_XSAVE_XGETBV1(1U << 2)
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/9] KVM, pkeys: add memory protection-key support

2015-11-09 Thread Huaitong Han
The protection-key feature provides an additional mechanism by which IA-32e
paging controls access to usermode addresses.

Hardware support for protection keys for user pages is enumerated with CPUID
feature flag CPUID.7.0.ECX[3]:PKU. Software support is CPUID.7.0.ECX[4]:OSPKE
with the setting of CR4.PKE(bit 22).

When CR4.PKE = 1, every linear address is associated with the 4-bit protection
key located in bits 62:59 of the paging-structure entry that mapped the page
containing the linear address. The PKRU register determines, for each
protection key, whether user-mode addresses with that protection key may be
read or written.

The PKRU register (protection key rights for user pages) is a 32-bit register
with the following format: for each i (0 ≤ i ≤ 15), PKRU[2i] is the
access-disable bit for protection key i (ADi); PKRU[2i+1] is the write-disable
bit for protection key i (WDi).

Software can use the RDPKRU and WRPKRU instructions with ECX = 0 to read and
write PKRU. In addition, the PKRU register is XSAVE-managed state and can thus
be read and written by instructions in the XSAVE feature set.

PFEC.PK (bit 5) is defined as protection key violations.

The specification of Protection Keys can be found at SDM (4.6.2, volume 3)
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf.

The kernel native patchset have not yet been merged to upstream, you can found
at git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-pkeys.git pkeys-v007.

Huaitong Han (9):
  KVM, pkeys: expose CPUID:PKU to guest
  KVM, pkeys: add pkeys support when setting CR4
  KVM, pkeys: expose CPUID:OSPKE to guest
  KVM, pkeys: disable pkeys for guests in non-paging mode
  KVM, pkeys: update memeory permission bitmask for pkeys
  KVM, pkeys: add pkeys support for permission_fault logic
  KVM, pkeys: Add pkeys support for gva_to_gpa funcions
  KVM, pkeys: add pkeys support for xsave state
  KVM, pkeys: disable PKU feature without ept

 arch/x86/include/asm/kvm_host.h |  9 +--
 arch/x86/kvm/cpuid.c| 23 ++--
 arch/x86/kvm/cpuid.h|  8 ++
 arch/x86/kvm/mmu.c  | 32 --
 arch/x86/kvm/mmu.h  | 59 +
 arch/x86/kvm/paging_tmpl.h  | 20 +++---
 arch/x86/kvm/vmx.c  | 10 +++
 arch/x86/kvm/x86.c  | 34 +++-
 arch/x86/kvm/x86.h  |  3 ++-
 9 files changed, 171 insertions(+), 27 deletions(-)

-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/9] KVM, pkeys: expose CPUID:OSPKE to guest

2015-11-09 Thread Huaitong Han
This patch exposes X86_FEATURE_OSPKE to guest, X86_FEATURE_OSPKE is
software support for pkeys, enumerated with CPUID.7.0.ECX[4]:OSPKE,
and it reflects the setting of CR4.PKE.

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 29e6502..ece687b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -81,6 +81,17 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
apic->lapic_timer.timer_mode_mask = 1 << 17;
}
 
+   best = kvm_find_cpuid_entry(vcpu, 7, 0);
+   if (!best)
+   return 0;
+
+   /*Update OSPKE bit */
+   if (boot_cpu_has(X86_FEATURE_PKU) && best->function == 0x7) {
+   best->ecx &= ~F(OSPKE);
+   if (kvm_read_cr4_bits(vcpu, X86_CR4_PKE))
+   best->ecx |= F(OSPKE);
+   }
+
best = kvm_find_cpuid_entry(vcpu, 0xD, 0);
if (!best) {
vcpu->arch.guest_supported_xcr0 = 0;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d268da0..5181834 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -754,7 +754,7 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
(!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
kvm_mmu_reset_context(vcpu);
 
-   if ((cr4 ^ old_cr4) & X86_CR4_OSXSAVE)
+   if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
kvm_update_cpuid(vcpu);
 
return 0;
@@ -6841,7 +6841,7 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 
mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
kvm_x86_ops->set_cr4(vcpu, sregs->cr4);
-   if (sregs->cr4 & X86_CR4_OSXSAVE)
+   if (sregs->cr4 & (X86_CR4_OSXSAVE | X86_CR4_PKE))
kvm_update_cpuid(vcpu);
 
idx = srcu_read_lock(&vcpu->kvm->srcu);
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/9] KVM, pkeys: add pkeys support when setting CR4

2015-11-09 Thread Huaitong Han
This patch adds pkeys support when setting CR4.PKE (bit 22).

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c12e845..3bbc1cb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -55,7 +55,8 @@
  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \
  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | 
X86_CR4_PCIDE \
  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
- | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP))
+ | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP \
+ | X86_CR4_PKE))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index dd05b9c..7775158 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -70,6 +70,14 @@ static inline bool guest_cpuid_has_fsgsbase(struct kvm_vcpu 
*vcpu)
return best && (best->ebx & bit(X86_FEATURE_FSGSBASE));
 }
 
+static inline bool guest_cpuid_has_pku(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 7, 0);
+   return best && (best->ecx & bit(X86_FEATURE_PKU));
+}
+
 static inline bool guest_cpuid_has_longmode(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2d4e54d..d268da0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -709,7 +709,7 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
unsigned long old_cr4 = kvm_read_cr4(vcpu);
unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
-  X86_CR4_SMEP | X86_CR4_SMAP;
+  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
 
if (cr4 & CR4_RESERVED_BITS)
return 1;
@@ -726,6 +726,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
if (!guest_cpuid_has_fsgsbase(vcpu) && (cr4 & X86_CR4_FSGSBASE))
return 1;
 
+   if (!guest_cpuid_has_pku(vcpu) && (cr4 & X86_CR4_PKE))
+   return 1;
+
if (is_long_mode(vcpu)) {
if (!(cr4 & X86_CR4_PAE))
return 1;
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/9] KVM, pkeys: expose CPUID:PKU to guest

2015-11-09 Thread Huaitong Han
This patch expose X86_FEATURE_PKU to guest, X86_FEATURE_PKU is referred to
as "PKU" in the hardware documentation: CPUID.7.0.ECX[3]:PKU.

Signed-off-by: Huaitong Han 

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 156441b..29e6502 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -354,6 +354,9 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
const u32 kvm_supported_word10_x86_features =
F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | f_xsaves;
 
+   /* cpuid 7.0.ecx*/
+   const u32 kvm_supported_word11_x86_features = F(PKU) | 0 /*OSPKE*/;
+
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
 
@@ -431,10 +434,13 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
cpuid_mask(&entry->ebx, 9);
// TSC_ADJUST is emulated
entry->ebx |= F(TSC_ADJUST);
-   } else
+   entry->ecx &= kvm_supported_word11_x86_features;
+   cpuid_mask(&entry->ecx, 13);
+   } else {
entry->ebx = 0;
+   entry->ecx = 0;
+   }
entry->eax = 0;
-   entry->ecx = 0;
entry->edx = 0;
break;
}
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 05/33] acpi: add aml_object_type

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:55:59PM +0800, Xiao Guangrong wrote:
> Implement ObjectType which is used by NVDIMM _DSM method in
> later patch
> 
> Signed-off-by: Xiao Guangrong 

I had to go dig in the _DSM patch to see how it's used.
And sure enough, callers have to know AML to make
sense of it. So pls don't split out tiny patches like this.
include the callee with the caller.

> ---
>  hw/acpi/aml-build.c | 8 
>  include/hw/acpi/aml-build.h | 1 +
>  2 files changed, 9 insertions(+)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index efc06ab..9f792ab 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -1178,6 +1178,14 @@ Aml *aml_concatenate(Aml *source1, Aml *source2, Aml 
> *target)
>  return var;
>  }
>  
> +/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefObjectType */
> +Aml *aml_object_type(Aml *object)
> +{
> +Aml *var = aml_opcode(0x8E /* ObjectTypeOp */);
> +aml_append(var, object);
> +return var;
> +}
> +

It would be better to have a higher level API
that can be used without knowning AML.
For example:

aml_object_type_is_package()



>  void
>  build_header(GArray *linker, GArray *table_data,
>   AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 325782d..5b8a118 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -278,6 +278,7 @@ Aml *aml_derefof(Aml *arg);
>  Aml *aml_sizeof(Aml *arg);
>  Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name);
>  Aml *aml_concatenate(Aml *source1, Aml *source2, Aml *target);
> +Aml *aml_object_type(Aml *object);
>  
>  void
>  build_header(GArray *linker, GArray *table_data,
> -- 
> 1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 06/33] acpi: add aml_method_serialized

2015-11-09 Thread Xiao Guangrong



On 11/09/2015 07:14 PM, Michael S. Tsirkin wrote:

On Fri, Oct 30, 2015 at 01:56:00PM +0800, Xiao Guangrong wrote:

It avoid explicit Mutex and will be used by NVDIMM ACPI

Signed-off-by: Xiao Guangrong 


I'd rather you squashed these utility patches in with where
the code is used. This is just making it harder to review
as I have to jump back and forth.


Okay to me.




---
  hw/acpi/aml-build.c | 26 --
  include/hw/acpi/aml-build.h |  1 +
  2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 9f792ab..8bee8b2 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -696,14 +696,36 @@ Aml *aml_while(Aml *predicate)
  }

  /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefMethod */
-Aml *aml_method(const char *name, int arg_count)
+static Aml *__aml_method(const char *name, int arg_count, bool serialized)


Please don't prefix names with __.
what should you call this?
For example, you can call it aml_method_serialized.


Igor disliked that aml_method_serialized() is spepated from aml_method(), so i
will unify them as a single function instead. :)




  {
  Aml *var = aml_bundle(0x14 /* MethodOp */, AML_PACKAGE);
+int methodflags;
+
+/*
+ * MethodFlags:
+ *   bit 0-2: ArgCount (0-7)
+ *   bit 3: SerializeFlag
+ * 0: NotSerialized
+ * 1: Serialized
+ *   bit 4-7: reserved (must be 0)
+ */
+assert(!(arg_count & ~7));


Or shorter assert(arg_count < 8);


Good to me.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 12/33] pc-dimm: remove DEFAULT_PC_DIMMSIZE

2015-11-09 Thread Xiao Guangrong



On 11/09/2015 06:40 PM, Michael S. Tsirkin wrote:

On Fri, Oct 30, 2015 at 01:56:06PM +0800, Xiao Guangrong wrote:

It's not used any more

Signed-off-by: Xiao Guangrong 


You should leave the renames and cleanups off for later.
This patchset is large enough as it is.


Okay.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 19/33] dimm: keep the state of the whole backend memory

2015-11-09 Thread Xiao Guangrong



On 11/09/2015 07:04 PM, Michael S. Tsirkin wrote:

On Fri, Oct 30, 2015 at 01:56:13PM +0800, Xiao Guangrong wrote:

QEMU keeps the state of memory of dimm device during live migration,
however, it is not enough for nvdimm device as its memory does not
contain its label data, so that we should protect the whole backend
memory instead

Signed-off-by: Xiao Guangrong 


It looks like there's now a difference between
host_memory_backend_get_memory and get_memory_region,
whereas previously they were exactly interchangeable.


Yes, it is. host_memory_backend_get_memory() is used to get the whole
backend memory region. get_memory_region() is used to get the memory region
which will be mapped to guest's address space. They are on the same page
for pc-dimm, but it's different for nvdimm since part of backend memory used
as label data which is not directly mapped to guest.



This needs some thought, in particular the theoretically
generic dimm.c has to do tricks to accomodate nvdimm.

The missing piece for NVDIMM is the 128k label space at the end,
right?  Can't nvdimm specific code just register that as a
separate RAM chunk in order to migrate it?


Yes, it is.

I thought this before, then we need to have addition callbackes:
- plug(), which is called when the dimm device is plugged, then nvdimm
  has the chance to do it own migration things.

- unplug(), let nvdimm undo the thing of migration.

It needs more codes but the logic seems more clear. If you like this way,
i will do.




---
  hw/mem/dimm.c | 14 --
  1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
index 498d380..7d1 100644
--- a/hw/mem/dimm.c
+++ b/hw/mem/dimm.c
@@ -134,9 +134,16 @@ void dimm_memory_plug(DeviceState *dev, MemoryHotplugState 
*hpms,
  }

  memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);
-vmstate_register_ram(mr, dev);
  numa_set_mem_node_id(addr, memory_region_size(mr), dimm->node);

+/*
+ * save the state only for @mr is not enough as it does not contain
+ * the label data of NVDIMM device, so that we keep the state of
+ * whole hostmem instead.
+ */
+vmstate_register_ram(host_memory_backend_get_memory(dimm->hostmem, errp),
+ dev);
+
  out:
  error_propagate(errp, local_err);
  }
@@ -145,10 +152,13 @@ void dimm_memory_unplug(DeviceState *dev, 
MemoryHotplugState *hpms,
 MemoryRegion *mr)
  {
  DIMMDevice *dimm = DIMM(dev);
+MemoryRegion *backend_mr;
+
+backend_mr = host_memory_backend_get_memory(dimm->hostmem, &error_abort);

  numa_unset_mem_node_id(dimm->addr, memory_region_size(mr), dimm->node);
  memory_region_del_subregion(&hpms->mr, mr);
-vmstate_unregister_ram(mr, dev);
+vmstate_unregister_ram(backend_mr, dev);
  }

  int qmp_dimm_device_list(Object *obj, void *opaque)
--
1.8.3.1



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 06/33] acpi: add aml_method_serialized

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:00PM +0800, Xiao Guangrong wrote:
> It avoid explicit Mutex and will be used by NVDIMM ACPI
> 
> Signed-off-by: Xiao Guangrong 

I'd rather you squashed these utility patches in with where
the code is used. This is just making it harder to review
as I have to jump back and forth.

> ---
>  hw/acpi/aml-build.c | 26 --
>  include/hw/acpi/aml-build.h |  1 +
>  2 files changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 9f792ab..8bee8b2 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -696,14 +696,36 @@ Aml *aml_while(Aml *predicate)
>  }
>  
>  /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefMethod */
> -Aml *aml_method(const char *name, int arg_count)
> +static Aml *__aml_method(const char *name, int arg_count, bool serialized)

Please don't prefix names with __.
what should you call this?
For example, you can call it aml_method_serialized.

>  {
>  Aml *var = aml_bundle(0x14 /* MethodOp */, AML_PACKAGE);
> +int methodflags;
> +
> +/*
> + * MethodFlags:
> + *   bit 0-2: ArgCount (0-7)
> + *   bit 3: SerializeFlag
> + * 0: NotSerialized
> + * 1: Serialized
> + *   bit 4-7: reserved (must be 0)
> + */
> +assert(!(arg_count & ~7));

Or shorter assert(arg_count < 8);

> +methodflags = arg_count | (serialized << 3);
>  build_append_namestring(var->buf, "%s", name);
> -build_append_byte(var->buf, arg_count); /* MethodFlags: ArgCount */
> +build_append_byte(var->buf, methodflags);
>  return var;
>  }
>  
> +Aml *aml_method(const char *name, int arg_count)
> +{
> +return __aml_method(name, arg_count, false);
> +}
> +
> +Aml *aml_method_serialized(const char *name, int arg_count)
> +{
> +return __aml_method(name, arg_count, true);
> +}
> +
>  /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefDevice */
>  Aml *aml_device(const char *name_format, ...)
>  {
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 5b8a118..00cf40e 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -263,6 +263,7 @@ Aml *aml_qword_memory(AmlDecode dec, AmlMinFixed 
> min_fixed,
>  Aml *aml_scope(const char *name_format, ...) GCC_FMT_ATTR(1, 2);
>  Aml *aml_device(const char *name_format, ...) GCC_FMT_ATTR(1, 2);
>  Aml *aml_method(const char *name, int arg_count);
> +Aml *aml_method_serialized(const char *name, int arg_count);
>  Aml *aml_if(Aml *predicate);
>  Aml *aml_else(void);
>  Aml *aml_while(Aml *predicate);
> -- 
> 1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v7 25/35] nvdimm acpi: init the resource used by NVDIMM ACPI

2015-11-09 Thread Igor Mammedov
On Fri, 6 Nov 2015 16:31:43 +0800
Xiao Guangrong  wrote:

> 
> 
> On 11/05/2015 10:49 PM, Igor Mammedov wrote:
> > On Thu, 5 Nov 2015 21:33:39 +0800
> > Xiao Guangrong  wrote:
> >
> >>
> >>
> >> On 11/05/2015 09:03 PM, Igor Mammedov wrote:
> >>> On Thu, 5 Nov 2015 18:15:31 +0800
> >>> Xiao Guangrong  wrote:
> >>>
> 
> 
>  On 11/05/2015 05:58 PM, Igor Mammedov wrote:
> > On Mon,  2 Nov 2015 17:13:27 +0800
> > Xiao Guangrong  wrote:
> >
> >> A page staring from 0xFF0 and IO port 0x0a18 - 0xa1b in guest are
> >   ^^ missing one 0???
> >
> >> reserved for NVDIMM ACPI emulation, refer to docs/specs/acpi_nvdimm.txt
> >> for detailed design
> >>
> >> A parameter, 'nvdimm-support', is introduced for PIIX4_PM and ICH9-LPC
> >> that controls if nvdimm support is enabled, it is true on default and
> >> it is false on 2.4 and its earlier version to keep compatibility
> >>
> >> Signed-off-by: Xiao Guangrong 
> > [...]
> >
> >> @@ -33,6 +33,15 @@
> >>  */
> >> #define MIN_NAMESPACE_LABEL_SIZE  (128UL << 10)
> >>
> >> +/*
> >> + * A page staring from 0xFF0 and IO port 0x0a18 - 0xa1b in guest 
> >> are
> > ^^^ missing 0 or value in define 
> > below has an extra 0
> >
> >> + * reserved for NVDIMM ACPI emulation, refer to 
> >> docs/specs/acpi_nvdimm.txt
> >> + * for detailed design.
> >> + */
> >> +#define NVDIMM_ACPI_MEM_BASE  0xFF00ULL
> > it still maps RAM at arbitrary place,
> > that's the reason why VMGenID patches hasn't been merged for
> > more than several months.
> > I'm not against of using (more exactly I'm for it) direct mapping
> > but we should reach consensus when and how to use it first.
> >
> > I'd wouldn't use addresses below 4G as it may be used firmware or
> > legacy hardware and I won't bet that 0xFF00ULL won't conflict
> > with anything.
> > An alternative place to allocate reserve from could be high memory.
> > For pc we have "reserved-memory-end" which currently makes sure
> > that hotpluggable memory range isn't used by firmware.
> >
> > How about making API that allows to map additional memory
> > ranges before reserved-memory-end and pushes it up as mappings are
> > added.
[...]

> 
> Really a good study case to me, i tried your patch and moved the 64 bit
> staffs to the private method, it worked. :)
> 
> Igor, is this the API you want?

Lets get ack from Michael on the idea of RAM mapping before
"reserved-memory-end" first.
If he rejects it then there isn't any other way except of switching
to MMIO instead.

> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 6bf569a..aba29df 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1291,6 +1291,38 @@ FWCfgState *xen_load_linux(PCMachineState *pcms,
>   return fw_cfg;
>   }
> 
> +static void pc_reserve_high_memory_init(PCMachineState *pcms,
> +uint64_t base, uint64_t align)
> +{
> +pcms->reserve_high_memory.current_addr = ROUND_UP(base, align);
> +}
> +
> +static uint64_t
> +pc_reserve_high_memory_end(PCMachineState *pcms, int64_t align)
> +{
> +return ROUND_UP(pcms->reserve_high_memory.current_addr, align);
> +}
> +
> +uint64_t pc_reserve_high_memory(PCMachineState *pcms, uint64_t size,
> +int64_t align, Error **errp)
> +{
> +uint64_t end_addr, current_addr = pcms->reserve_high_memory.current_addr;
> +
> +if (!current_addr) {
> +error_setg(errp, "reserved high memory is not initialized.");
> +return 0;
> +}
> +
> +end_addr = pc_reserve_high_memory_end(pcms, align) + size;
> +if (current_addr > end_addr) {
> +error_setg(errp, "reserved high memory is not enough.");
> +return 0;
> +}
> +
> +pcms->reserve_high_memory.current_addr = end_addr;
> +return end_addr;
> +}
> +
>   FWCfgState *pc_memory_init(PCMachineState *pcms,
>  MemoryRegion *system_memory,
>  MemoryRegion *rom_memory,
> @@ -1379,6 +1411,8 @@ FWCfgState *pc_memory_init(PCMachineState *pcms,
>  "hotplug-memory", hotplug_mem_size);
>   memory_region_add_subregion(system_memory, 
> pcms->hotplug_memory.base,
>   &pcms->hotplug_memory.mr);
> +pc_reserve_high_memory_init(pcms, pcms->hotplug_memory.base +
> +hotplug_mem_size, 1ULL << 30);
>   }
> 
>   /* Initialize PC system firmware */
> @@ -1403,7 +1437,7 @@ FWCfgState *pc_memory_init(PCMachineState *pcms,
>   uint64_t res_mem_end = pcms->hotplug_memory.base;
> 
>   if (!pcmc->broken_reserved_end) {
> -res_mem_end += memory_region_size(&pcms->hotplug_memory.mr);
> +res_mem_end = 

Re: [PATCH v6 08/33] exec: allow memory to be allocated from any kind of path

2015-11-09 Thread Xiao Guangrong



On 11/09/2015 06:39 PM, Michael S. Tsirkin wrote:

On Fri, Oct 30, 2015 at 01:56:02PM +0800, Xiao Guangrong wrote:

Currently file_ram_alloc() is designed for hugetlbfs, however, the memory
of nvdimm can come from either raw pmem device eg, /dev/pmem, or the file
locates at DAX enabled filesystem

So this patch let it work on any kind of path

Signed-off-by: Xiao Guangrong 


So this allows regular memory to be specified directly.
This needs to be split out and merged separately
from acpi/nvdimm bits.


Yup, that is in my plan.



Alternatively, if it's possible to use nvdimm with DAX fs
(similar to hugetlbfs), leave these patches off for now.



DAX is a filesystem mount options, it is not easy to get this
info from a file.




---
  exec.c | 56 +---
  1 file changed, 17 insertions(+), 39 deletions(-)

diff --git a/exec.c b/exec.c
index 8af2570..3ca7e50 100644
--- a/exec.c
+++ b/exec.c
@@ -1174,32 +1174,6 @@ void qemu_mutex_unlock_ramlist(void)
  }

  #ifdef __linux__
-
-#include 
-
-#define HUGETLBFS_MAGIC   0x958458f6
-
-static long gethugepagesize(const char *path, Error **errp)
-{
-struct statfs fs;
-int ret;
-
-do {
-ret = statfs(path, &fs);
-} while (ret != 0 && errno == EINTR);
-
-if (ret != 0) {
-error_setg_errno(errp, errno, "failed to get page size of file %s",
- path);
-return 0;
-}
-
-if (fs.f_type != HUGETLBFS_MAGIC)
-fprintf(stderr, "Warning: path not on HugeTLBFS: %s\n", path);
-
-return fs.f_bsize;
-}
-
  static void *file_ram_alloc(RAMBlock *block,
  ram_addr_t memory,
  const char *path,
@@ -1210,20 +1184,24 @@ static void *file_ram_alloc(RAMBlock *block,
  char *c;
  void *area;
  int fd;
-uint64_t hpagesize;
-Error *local_err = NULL;
+uint64_t pagesize;

-hpagesize = gethugepagesize(path, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
+pagesize = qemu_file_get_page_size(path);
+if (!pagesize) {
+error_setg(errp, "can't get page size for %s", path);
  goto error;
  }
-block->mr->align = hpagesize;

-if (memory < hpagesize) {
+if (pagesize == getpagesize()) {
+fprintf(stderr, "Memory is not allocated from HugeTlbfs.\n");
+}
+
+block->mr->align = pagesize;
+
+if (memory < pagesize) {
  error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
-   "or larger than huge page size 0x%" PRIx64,
-   memory, hpagesize);
+   "or larger than page size 0x%" PRIx64,
+   memory, pagesize);
  goto error;
  }

@@ -1247,14 +1225,14 @@ static void *file_ram_alloc(RAMBlock *block,
  fd = mkstemp(filename);
  if (fd < 0) {
  error_setg_errno(errp, errno,
- "unable to create backing store for hugepages");
+ "unable to create backing store for path %s", path);
  g_free(filename);
  goto error;
  }
  unlink(filename);
  g_free(filename);


Looks like we are still calling mkstemp/unlink here.
How does this work?


Hmm? We have got the fd so the file can be safely unlinked (kernel does not 
actually unlink
the file since mkstemp() holds a refcount.).

And this patch just renames the variables, no logic changed. Will drop it from 
the serials
for now on.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 19/33] dimm: keep the state of the whole backend memory

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:13PM +0800, Xiao Guangrong wrote:
> QEMU keeps the state of memory of dimm device during live migration,
> however, it is not enough for nvdimm device as its memory does not
> contain its label data, so that we should protect the whole backend
> memory instead
> 
> Signed-off-by: Xiao Guangrong 

It looks like there's now a difference between
host_memory_backend_get_memory and get_memory_region,
whereas previously they were exactly interchangeable.

This needs some thought, in particular the theoretically
generic dimm.c has to do tricks to accomodate nvdimm.

The missing piece for NVDIMM is the 128k label space at the end,
right?  Can't nvdimm specific code just register that as a
separate RAM chunk in order to migrate it?

> ---
>  hw/mem/dimm.c | 14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
> index 498d380..7d1 100644
> --- a/hw/mem/dimm.c
> +++ b/hw/mem/dimm.c
> @@ -134,9 +134,16 @@ void dimm_memory_plug(DeviceState *dev, 
> MemoryHotplugState *hpms,
>  }
>  
>  memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);
> -vmstate_register_ram(mr, dev);
>  numa_set_mem_node_id(addr, memory_region_size(mr), dimm->node);
>  
> +/*
> + * save the state only for @mr is not enough as it does not contain
> + * the label data of NVDIMM device, so that we keep the state of
> + * whole hostmem instead.
> + */
> +vmstate_register_ram(host_memory_backend_get_memory(dimm->hostmem, errp),
> + dev);
> +
>  out:
>  error_propagate(errp, local_err);
>  }
> @@ -145,10 +152,13 @@ void dimm_memory_unplug(DeviceState *dev, 
> MemoryHotplugState *hpms,
> MemoryRegion *mr)
>  {
>  DIMMDevice *dimm = DIMM(dev);
> +MemoryRegion *backend_mr;
> +
> +backend_mr = host_memory_backend_get_memory(dimm->hostmem, &error_abort);
>  
>  numa_unset_mem_node_id(dimm->addr, memory_region_size(mr), dimm->node);
>  memory_region_del_subregion(&hpms->mr, mr);
> -vmstate_unregister_ram(mr, dev);
> +vmstate_unregister_ram(backend_mr, dev);
>  }
>  
>  int qmp_dimm_device_list(Object *obj, void *opaque)
> -- 
> 1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 07/33] util: introduce qemu_file_get_page_size()

2015-11-09 Thread Xiao Guangrong



On 11/09/2015 06:33 PM, Michael S. Tsirkin wrote:

On Fri, Oct 30, 2015 at 01:56:01PM +0800, Xiao Guangrong wrote:

There are three places use the some logic to get the page size on
the file path or file fd

This patch introduces qemu_file_get_page_size() to unify the code

Signed-off-by: Xiao Guangrong 
---
  include/qemu/osdep.h |  1 +
  target-ppc/kvm.c | 21 +++--
  util/oslib-posix.c   | 16 
  util/oslib-win32.c   |  5 +
  4 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index b568424..d4dde02 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -302,4 +302,5 @@ int qemu_read_password(char *buf, int buf_size);
   */
  pid_t qemu_fork(Error **errp);

+size_t qemu_file_get_page_size(const char *mem_path);
  #endif
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index ac70f08..c661f1c 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -308,28 +308,13 @@ static void kvm_get_smmu_info(PowerPCCPU *cpu, struct 
kvm_ppc_smmu_info *info)

  static long gethugepagesize(const char *mem_path)
  {
-struct statfs fs;
-int ret;
-
-do {
-ret = statfs(mem_path, &fs);
-} while (ret != 0 && errno == EINTR);
+long size = qemu_file_get_page_size(mem_path);

-if (ret != 0) {
-fprintf(stderr, "Couldn't statfs() memory path: %s\n",
-strerror(errno));
+if (!size) {
  exit(1);
  }

-#define HUGETLBFS_MAGIC   0x958458f6
-
-if (fs.f_type != HUGETLBFS_MAGIC) {
-/* Explicit mempath, but it's ordinary pages */
-return getpagesize();
-}
-
-/* It's hugepage, return the huge page size */
-return fs.f_bsize;
+return size;
  }

  static int find_max_supported_pagesize(Object *obj, void *opaque)
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 914cef5..ad94c5a 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -360,6 +360,22 @@ static size_t fd_getpagesize(int fd)
  return getpagesize();
  }

+size_t qemu_file_get_page_size(const char *path)
+{
+size_t size = 0;
+int fd = qemu_open(path, O_RDONLY);
+
+if (fd < 0) {
+fprintf(stderr, "Could not open %s.\n", path);
+goto exit;
+}
+
+size = fd_getpagesize(fd);
+qemu_close(fd);
+exit:
+return size;
+}
+
  void os_mem_prealloc(int fd, char *area, size_t memory)
  {
  int ret;


So this is opening the file for the sole purpose of
doing the fstatfs on it. Seems strange, just do statfs instead.
In fact, maybe we want statfs_getpagesize.


It is just to reuse the code of fd_getpagesize() which already has
the logic to check pagesize.




diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index 09f9e98..a18aa87 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -462,6 +462,11 @@ size_t getpagesize(void)
  return system_info.dwPageSize;
  }

+size_t qemu_file_get_page_size(const char *path)
+{
+return getpagesize();
+}
+
  void os_mem_prealloc(int fd, char *area, size_t memory)
  {
  int i;


And why is this needed on win32?


It is not actually used, just make osdep.h happy which has
qemu_file_get_page_size() declare.

BTW, i will drop this patch for now on.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 11/33] hostmem-file: use whole file size if possible

2015-11-09 Thread Xiao Guangrong



On 11/09/2015 06:17 PM, Michael S. Tsirkin wrote:

On Fri, Oct 30, 2015 at 01:56:05PM +0800, Xiao Guangrong wrote:

Use the whole file size if @size is not specified which is useful
if we want to directly pass a file to guest

Signed-off-by: Xiao Guangrong 


Better split these simplifications off from the series.


Yep, i am also little tired to respin this long series, will drop this
one.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 09/33] exec: allow file_ram_alloc to work on file

2015-11-09 Thread Xiao Guangrong



On 11/09/2015 06:13 PM, Michael S. Tsirkin wrote:

On Fri, Oct 30, 2015 at 01:56:03PM +0800, Xiao Guangrong wrote:

Currently, file_ram_alloc() only works on directory - it creates a file
under @path and do mmap on it

This patch tries to allow it to work on file directly, if @path is a
directory it works as before, otherwise it treats @path as the target
file then directly allocate memory from it

Signed-off-by: Xiao Guangrong 
---
  exec.c | 80 ++
  1 file changed, 51 insertions(+), 29 deletions(-)

diff --git a/exec.c b/exec.c
index 3ca7e50..f219010 100644
--- a/exec.c
+++ b/exec.c
@@ -1174,14 +1174,60 @@ void qemu_mutex_unlock_ramlist(void)
  }

  #ifdef __linux__
+static bool path_is_dir(const char *path)
+{
+struct stat fs;
+
+return stat(path, &fs) == 0 && S_ISDIR(fs.st_mode);


This means file doesn't exist is treated as a file.
Can't figure out if that's intentional, should
be documented in any case.


The path is dir only if it passes S_ISDIR test. Otherwise, it is treated
as regular file. Any error on the file will be figured out by open() and
mmap() later.




+}
+
+static int open_file_path(RAMBlock *block, const char *path, size_t size)
+{
+char *filename;
+char *sanitized_name;
+char *c;
+int fd;
+
+if (!path_is_dir(path)) {
+int flags = (block->flags & RAM_SHARED) ? O_RDWR : O_RDONLY;


Why does this make sense?


RAM_SHARED means its write is invisible to others also it does not actual
change the content of the file. Readonly permission is enough for this case.
It is also help us to pass a readonly file to the guess but discard its change
after guest shutdown.




+
+flags |= O_EXCL;


And why does this makes sense?



It' used if we pass a block device as a NVDIMM backend memory:
  O_EXCL can be used without O_CREAT if pathname refers to a block
device.  If the block device is in use by the system (e.g., mounted),
open() fails with the error EBUSY

BTW, the similar logic has already been in upsteam QEMU which is
introduced by commit 8d31d6b6 (backends/hostmem-file: Allow to specify
full pathname for backing file). I will drop these patches:
  util: introduce qemu_file_get_page_size()
  exec: allow memory to be allocated from any kind of path
  exec: allow file_ram_alloc to work on file
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 12/33] pc-dimm: remove DEFAULT_PC_DIMMSIZE

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:06PM +0800, Xiao Guangrong wrote:
> It's not used any more
> 
> Signed-off-by: Xiao Guangrong 

You should leave the renames and cleanups off for later.
This patchset is large enough as it is.

> ---
>  include/hw/mem/pc-dimm.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
> index d83bf30..11a8937 100644
> --- a/include/hw/mem/pc-dimm.h
> +++ b/include/hw/mem/pc-dimm.h
> @@ -20,8 +20,6 @@
>  #include "sysemu/hostmem.h"
>  #include "hw/qdev.h"
>  
> -#define DEFAULT_PC_DIMMSIZE (1024*1024*1024)
> -
>  #define TYPE_PC_DIMM "pc-dimm"
>  #define PC_DIMM(obj) \
>  OBJECT_CHECK(PCDIMMDevice, (obj), TYPE_PC_DIMM)
> -- 
> 1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 08/33] exec: allow memory to be allocated from any kind of path

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:02PM +0800, Xiao Guangrong wrote:
> Currently file_ram_alloc() is designed for hugetlbfs, however, the memory
> of nvdimm can come from either raw pmem device eg, /dev/pmem, or the file
> locates at DAX enabled filesystem
> 
> So this patch let it work on any kind of path
> 
> Signed-off-by: Xiao Guangrong 

So this allows regular memory to be specified directly.
This needs to be split out and merged separately
from acpi/nvdimm bits.

Alternatively, if it's possible to use nvdimm with DAX fs
(similar to hugetlbfs), leave these patches off for now.


> ---
>  exec.c | 56 +---
>  1 file changed, 17 insertions(+), 39 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 8af2570..3ca7e50 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1174,32 +1174,6 @@ void qemu_mutex_unlock_ramlist(void)
>  }
>  
>  #ifdef __linux__
> -
> -#include 
> -
> -#define HUGETLBFS_MAGIC   0x958458f6
> -
> -static long gethugepagesize(const char *path, Error **errp)
> -{
> -struct statfs fs;
> -int ret;
> -
> -do {
> -ret = statfs(path, &fs);
> -} while (ret != 0 && errno == EINTR);
> -
> -if (ret != 0) {
> -error_setg_errno(errp, errno, "failed to get page size of file %s",
> - path);
> -return 0;
> -}
> -
> -if (fs.f_type != HUGETLBFS_MAGIC)
> -fprintf(stderr, "Warning: path not on HugeTLBFS: %s\n", path);
> -
> -return fs.f_bsize;
> -}
> -
>  static void *file_ram_alloc(RAMBlock *block,
>  ram_addr_t memory,
>  const char *path,
> @@ -1210,20 +1184,24 @@ static void *file_ram_alloc(RAMBlock *block,
>  char *c;
>  void *area;
>  int fd;
> -uint64_t hpagesize;
> -Error *local_err = NULL;
> +uint64_t pagesize;
>  
> -hpagesize = gethugepagesize(path, &local_err);
> -if (local_err) {
> -error_propagate(errp, local_err);
> +pagesize = qemu_file_get_page_size(path);
> +if (!pagesize) {
> +error_setg(errp, "can't get page size for %s", path);
>  goto error;
>  }
> -block->mr->align = hpagesize;
>  
> -if (memory < hpagesize) {
> +if (pagesize == getpagesize()) {
> +fprintf(stderr, "Memory is not allocated from HugeTlbfs.\n");
> +}
> +
> +block->mr->align = pagesize;
> +
> +if (memory < pagesize) {
>  error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
> -   "or larger than huge page size 0x%" PRIx64,
> -   memory, hpagesize);
> +   "or larger than page size 0x%" PRIx64,
> +   memory, pagesize);
>  goto error;
>  }
>  
> @@ -1247,14 +1225,14 @@ static void *file_ram_alloc(RAMBlock *block,
>  fd = mkstemp(filename);
>  if (fd < 0) {
>  error_setg_errno(errp, errno,
> - "unable to create backing store for hugepages");
> + "unable to create backing store for path %s", path);
>  g_free(filename);
>  goto error;
>  }
>  unlink(filename);
>  g_free(filename);

Looks like we are still calling mkstemp/unlink here.
How does this work?

>  
> -memory = ROUND_UP(memory, hpagesize);
> +memory = ROUND_UP(memory, pagesize);
>  
>  /*
>   * ftruncate is not supported by hugetlbfs in older
> @@ -1266,10 +1244,10 @@ static void *file_ram_alloc(RAMBlock *block,
>  perror("ftruncate");
>  }
>  
> -area = qemu_ram_mmap(fd, memory, hpagesize, block->flags & RAM_SHARED);
> +area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED);
>  if (area == MAP_FAILED) {
>  error_setg_errno(errp, errno,
> - "unable to map backing store for hugepages");
> + "unable to map backing store for path %s", path);
>  close(fd);
>  goto error;
>  }
> -- 
> 1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 07/33] util: introduce qemu_file_get_page_size()

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:01PM +0800, Xiao Guangrong wrote:
> There are three places use the some logic to get the page size on
> the file path or file fd
> 
> This patch introduces qemu_file_get_page_size() to unify the code
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  include/qemu/osdep.h |  1 +
>  target-ppc/kvm.c | 21 +++--
>  util/oslib-posix.c   | 16 
>  util/oslib-win32.c   |  5 +
>  4 files changed, 25 insertions(+), 18 deletions(-)
> 
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index b568424..d4dde02 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -302,4 +302,5 @@ int qemu_read_password(char *buf, int buf_size);
>   */
>  pid_t qemu_fork(Error **errp);
>  
> +size_t qemu_file_get_page_size(const char *mem_path);
>  #endif
> diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
> index ac70f08..c661f1c 100644
> --- a/target-ppc/kvm.c
> +++ b/target-ppc/kvm.c
> @@ -308,28 +308,13 @@ static void kvm_get_smmu_info(PowerPCCPU *cpu, struct 
> kvm_ppc_smmu_info *info)
>  
>  static long gethugepagesize(const char *mem_path)
>  {
> -struct statfs fs;
> -int ret;
> -
> -do {
> -ret = statfs(mem_path, &fs);
> -} while (ret != 0 && errno == EINTR);
> +long size = qemu_file_get_page_size(mem_path);
>  
> -if (ret != 0) {
> -fprintf(stderr, "Couldn't statfs() memory path: %s\n",
> -strerror(errno));
> +if (!size) {
>  exit(1);
>  }
>  
> -#define HUGETLBFS_MAGIC   0x958458f6
> -
> -if (fs.f_type != HUGETLBFS_MAGIC) {
> -/* Explicit mempath, but it's ordinary pages */
> -return getpagesize();
> -}
> -
> -/* It's hugepage, return the huge page size */
> -return fs.f_bsize;
> +return size;
>  }
>  
>  static int find_max_supported_pagesize(Object *obj, void *opaque)
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 914cef5..ad94c5a 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -360,6 +360,22 @@ static size_t fd_getpagesize(int fd)
>  return getpagesize();
>  }
>  
> +size_t qemu_file_get_page_size(const char *path)
> +{
> +size_t size = 0;
> +int fd = qemu_open(path, O_RDONLY);
> +
> +if (fd < 0) {
> +fprintf(stderr, "Could not open %s.\n", path);
> +goto exit;
> +}
> +
> +size = fd_getpagesize(fd);
> +qemu_close(fd);
> +exit:
> +return size;
> +}
> +
>  void os_mem_prealloc(int fd, char *area, size_t memory)
>  {
>  int ret;

So this is opening the file for the sole purpose of
doing the fstatfs on it. Seems strange, just do statfs instead.
In fact, maybe we want statfs_getpagesize.

> diff --git a/util/oslib-win32.c b/util/oslib-win32.c
> index 09f9e98..a18aa87 100644
> --- a/util/oslib-win32.c
> +++ b/util/oslib-win32.c
> @@ -462,6 +462,11 @@ size_t getpagesize(void)
>  return system_info.dwPageSize;
>  }
>  
> +size_t qemu_file_get_page_size(const char *path)
> +{
> +return getpagesize();
> +}
> +
>  void os_mem_prealloc(int fd, char *area, size_t memory)
>  {
>  int i;

And why is this needed on win32?

> -- 
> 1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 11/33] hostmem-file: use whole file size if possible

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:05PM +0800, Xiao Guangrong wrote:
> Use the whole file size if @size is not specified which is useful
> if we want to directly pass a file to guest
> 
> Signed-off-by: Xiao Guangrong 

Better split these simplifications off from the series.

> ---
>  backends/hostmem-file.c | 48 
>  1 file changed, 44 insertions(+), 4 deletions(-)
> 
> diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
> index 9097a57..e1bc9ff 100644
> --- a/backends/hostmem-file.c
> +++ b/backends/hostmem-file.c
> @@ -9,6 +9,9 @@
>   * This work is licensed under the terms of the GNU GPL, version 2 or later.
>   * See the COPYING file in the top-level directory.
>   */
> +#include 
> +#include 
> +
>  #include "qemu-common.h"
>  #include "sysemu/hostmem.h"
>  #include "sysemu/sysemu.h"
> @@ -33,20 +36,57 @@ struct HostMemoryBackendFile {
>  char *mem_path;
>  };
>  
> +static uint64_t get_file_size(const char *file)
> +{
> +struct stat stat_buf;
> +uint64_t size = 0;
> +int fd;
> +
> +fd = open(file, O_RDONLY);
> +if (fd < 0) {
> +return 0;
> +}
> +
> +if (stat(file, &stat_buf) < 0) {
> +goto exit;
> +}
> +
> +if ((S_ISBLK(stat_buf.st_mode)) && !ioctl(fd, BLKGETSIZE64, &size)) {

You must test S_ISDIR too.

> +goto exit;
> +}
> +
> +size = lseek(fd, 0, SEEK_END);
> +if (size == -1) {
> +size = 0;
> +}
> +exit:
> +close(fd);
> +return size;
> +}
> +
>  static void
>  file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>  {
>  HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend);
>  
> -if (!backend->size) {
> -error_setg(errp, "can't create backend with size 0");
> -return;
> -}
>  if (!fb->mem_path) {
>  error_setg(errp, "mem-path property not set");
>  return;
>  }
>  
> +if (!backend->size) {
> +/*
> + * use the whole file size if @size is not specified.
> + */
> +backend->size = get_file_size(fb->mem_path);
> +}
> +
> +if (!backend->size) {
> +error_setg(errp, "failed to get file size for %s, can't create "
> + "backend on it", mem_path);
> +return;
> +}
> +
>  backend->force_prealloc = mem_prealloc;
>  memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
>   object_get_canonical_path(OBJECT(backend)),
> -- 
> 1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5] KVM: x86: MMU: Clean up x86's mmu code for future work

2015-11-09 Thread Paolo Bonzini


On 06/11/2015 08:20, Takuya Yoshikawa wrote:
> Patch 1/2/3 are easy ones.
> 
> Following two, patch 4/5, may not be ideal solutions, but at least
> explain, or try to explain, the problems.

They are okay!  I replied to patch 5 with a suggestion for further
cleanup.  I'll apply them for 4.5.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] KVM: x86: MMU: Consolidate WARN_ON/BUG_ON checks for reverse-mapped sptes

2015-11-09 Thread Paolo Bonzini
On 06/11/2015 08:25, Takuya Yoshikawa wrote:
> At some call sites of rmap_get_first() and rmap_get_next(), BUG_ON is
> placed right after the call to detect unrelated sptes which should not
> be found in the reverse-mapping list.
> 
> Move this check in rmap_get_first/next() so that all call sites, not
> just the users of the for_each_rmap_spte() macro, will be checked the
> same way.  In addition, change the BUG_ON to WARN_ON since killing the
> whole host is the last thing that KVM should try.
> 
> One thing to keep in mind is that kvm_mmu_unlink_parents() also uses
> rmap_get_first() to handle parent sptes.  The change will not break it
> because parent sptes are present, at least until drop_parent_pte()
> actually unlinks them, and not mmio-sptes.

Can you also change kvm_mmu_mark_parents_unsync to use
for_each_rmap_spte instead of pte_list_walk?  It is the last use of
pte_list_walk, and it's nice if we have two uses of for_each_rmap_spte
with parent_ptes as the argument.

BTW, on my todo list is to change the rmap items to a struct (with a
single u64 inside) for type safety.  Since you are touching this code,
perhaps you can give it a shot?

Paolo

> Signed-off-by: Takuya Yoshikawa 
> ---
>  Documentation/virtual/kvm/mmu.txt |  4 ++--
>  arch/x86/kvm/mmu.c| 31 ++-
>  2 files changed, 24 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/mmu.txt 
> b/Documentation/virtual/kvm/mmu.txt
> index 3a4d681..daf9c0f 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -203,10 +203,10 @@ Shadow pages contain the following information:
>  page cannot be destroyed.  See role.invalid.
>parent_ptes:
>  The reverse mapping for the pte/ptes pointing at this page's spt. If
> -parent_ptes bit 0 is zero, only one spte points at this pages and
> +parent_ptes bit 0 is zero, only one spte points at this page and
>  parent_ptes points at this single spte, otherwise, there exists multiple
>  sptes pointing at this page and (parent_ptes & ~0x1) points at a data
> -structure with a list of parent_ptes.
> +structure with a list of parent sptes.
>unsync:
>  If true, then the translations in this page may not match the guest's
>  translation.  This is equivalent to the state of the tlb when a pte is
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index c5e2363..353d752 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1099,17 +1099,28 @@ struct rmap_iterator {
>   */
>  static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
>  {
> + u64 *sptep;
> +
>   if (!rmap)
>   return NULL;
>  
>   if (!(rmap & 1)) {
>   iter->desc = NULL;
> - return (u64 *)rmap;
> + sptep = (u64 *)rmap;
> + goto out;
>   }
>  
>   iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
>   iter->pos = 0;
> - return iter->desc->sptes[iter->pos];
> + sptep = iter->desc->sptes[iter->pos];
> +out:
> + /*
> +  * Parent sptes found in sp->parent_ptes lists are also checked here
> +  * since kvm_mmu_unlink_parents() uses this function.  If the condition
> +  * needs to be changed for them, make another wrapper function.
> +  */
> + WARN_ON(!is_shadow_present_pte(*sptep));
> + return sptep;
>  }
>  
>  /*
> @@ -1119,14 +1130,14 @@ static u64 *rmap_get_first(unsigned long rmap, struct 
> rmap_iterator *iter)
>   */
>  static u64 *rmap_get_next(struct rmap_iterator *iter)
>  {
> + u64 *sptep;
> +
>   if (iter->desc) {
>   if (iter->pos < PTE_LIST_EXT - 1) {
> - u64 *sptep;
> -
>   ++iter->pos;
>   sptep = iter->desc->sptes[iter->pos];
>   if (sptep)
> - return sptep;
> + goto out;
>   }
>  
>   iter->desc = iter->desc->more;
> @@ -1134,17 +1145,20 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
>   if (iter->desc) {
>   iter->pos = 0;
>   /* desc->sptes[0] cannot be NULL */
> - return iter->desc->sptes[iter->pos];
> + sptep = iter->desc->sptes[iter->pos];
> + goto out;
>   }
>   }
>  
>   return NULL;
> +out:
> + WARN_ON(!is_shadow_present_pte(*sptep));
> + return sptep;
>  }
>  
>  #define for_each_rmap_spte(_rmap_, _iter_, _spte_)   \
>  for (_spte_ = rmap_get_first(*_rmap_, _iter_);   \
> - _spte_ && ({BUG_ON(!is_shadow_present_pte(*_spte_)); 1;});  \
> - _spte_ = rmap_get_next(_iter_))
> + _spte_; _spte_ = rmap_get_next(_iter_))
>  
>  static void drop_spte(struct kvm *kvm, u64 *sptep)
>  {
> @@ -1358,7 +1372,6 @@ static bool kvm_zap_rmap

Re: [PATCH v6 09/33] exec: allow file_ram_alloc to work on file

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 01:56:03PM +0800, Xiao Guangrong wrote:
> Currently, file_ram_alloc() only works on directory - it creates a file
> under @path and do mmap on it
> 
> This patch tries to allow it to work on file directly, if @path is a
> directory it works as before, otherwise it treats @path as the target
> file then directly allocate memory from it
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  exec.c | 80 
> ++
>  1 file changed, 51 insertions(+), 29 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 3ca7e50..f219010 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1174,14 +1174,60 @@ void qemu_mutex_unlock_ramlist(void)
>  }
>  
>  #ifdef __linux__
> +static bool path_is_dir(const char *path)
> +{
> +struct stat fs;
> +
> +return stat(path, &fs) == 0 && S_ISDIR(fs.st_mode);

This means file doesn't exist is treated as a file.
Can't figure out if that's intentional, should
be documented in any case.

> +}
> +
> +static int open_file_path(RAMBlock *block, const char *path, size_t size)
> +{
> +char *filename;
> +char *sanitized_name;
> +char *c;
> +int fd;
> +
> +if (!path_is_dir(path)) {
> +int flags = (block->flags & RAM_SHARED) ? O_RDWR : O_RDONLY;

Why does this make sense?

> +
> +flags |= O_EXCL;

And why does this makes sense?

> +return open(path, flags);
> +}
> +
> +/* Make name safe to use with mkstemp by replacing '/' with '_'. */
> +sanitized_name = g_strdup(memory_region_name(block->mr));
> +for (c = sanitized_name; *c != '\0'; c++) {
> +if (*c == '/') {
> +*c = '_';
> +}
> +}
> +filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
> +   sanitized_name);
> +g_free(sanitized_name);
> +fd = mkstemp(filename);
> +if (fd >= 0) {
> +unlink(filename);
> +/*
> + * ftruncate is not supported by hugetlbfs in older
> + * hosts, so don't bother bailing out on errors.
> + * If anything goes wrong with it under other filesystems,
> + * mmap will fail.
> + */
> +if (ftruncate(fd, size)) {
> +perror("ftruncate");
> +}
> +}
> +g_free(filename);
> +
> +return fd;
> +}
> +
>  static void *file_ram_alloc(RAMBlock *block,
>  ram_addr_t memory,
>  const char *path,
>  Error **errp)
>  {
> -char *filename;
> -char *sanitized_name;
> -char *c;
>  void *area;
>  int fd;
>  uint64_t pagesize;
> @@ -1211,38 +1257,14 @@ static void *file_ram_alloc(RAMBlock *block,
>  goto error;
>  }
>  
> -/* Make name safe to use with mkstemp by replacing '/' with '_'. */
> -sanitized_name = g_strdup(memory_region_name(block->mr));
> -for (c = sanitized_name; *c != '\0'; c++) {
> -if (*c == '/')
> -*c = '_';
> -}
> -
> -filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
> -   sanitized_name);
> -g_free(sanitized_name);
> +memory = ROUND_UP(memory, pagesize);
>  
> -fd = mkstemp(filename);
> +fd = open_file_path(block, path, memory);
>  if (fd < 0) {
>  error_setg_errno(errp, errno,
>   "unable to create backing store for path %s", path);
> -g_free(filename);
>  goto error;
>  }
> -unlink(filename);
> -g_free(filename);
> -
> -memory = ROUND_UP(memory, pagesize);
> -
> -/*
> - * ftruncate is not supported by hugetlbfs in older
> - * hosts, so don't bother bailing out on errors.
> - * If anything goes wrong with it under other filesystems,
> - * mmap will fail.
> - */
> -if (ftruncate(fd, memory)) {
> -perror("ftruncate");
> -}
>  
>  area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED);
>  if (area == MAP_FAILED) {
> -- 
> 1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/4] KVM: VMX: enable LBR virtualization

2015-11-09 Thread Jian Zhou

On 2015/11/9 17:06, Paolo Bonzini wrote:



On 09/11/2015 02:33, Jian Zhou wrote:

Hi Paolo,

May I ask that any suggestion about the version 2 of VMX LBRV?
This version is updated following your advices in version 1.
BTW the kvm-unit-test for this feature has sent too, and I
have tested the CPUs emulated by QEMU.


Hi,

since these patches will not be part of Linux 4.4, I will review them
after the end of the merge window (or at least, after I've finished
sending material for 4.4).


  Thanks for your consideration.

  Jian



Paolo

.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/4] KVM: VMX: enable LBR virtualization

2015-11-09 Thread Paolo Bonzini


On 09/11/2015 02:33, Jian Zhou wrote:
> Hi Paolo,
> 
> May I ask that any suggestion about the version 2 of VMX LBRV?
> This version is updated following your advices in version 1.
> BTW the kvm-unit-test for this feature has sent too, and I
> have tested the CPUs emulated by QEMU.

Hi,

since these patches will not be part of Linux 4.4, I will review them
after the end of the merge window (or at least, after I've finished
sending material for 4.4).

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html