Re: [PATCH V6 0/6] Fast mmio eventfd fixes
On 11/10/2015 04:19 AM, Michael S. Tsirkin wrote: > On Mon, Nov 09, 2015 at 12:35:45PM +0800, Jason Wang wrote: >> > >> > >> > On 11/09/2015 01:11 AM, Michael S. Tsirkin wrote: >>> > > On Tue, Sep 15, 2015 at 02:41:53PM +0800, Jason Wang wrote: > >> Hi: > >> > >> This series fixes two issues of fast mmio eventfd: > >> > >> 1) A single iodev instance were registerd on two buses: KVM_MMIO_BUS > >>and KVM_FAST_MMIO_BUS. This will cause double in > >>ioeventfd_destructor() > >> 2) A zero length iodev on KVM_MMIO_BUS will never be found but > >>kvm_io_bus_cmp(). This will lead e.g the eventfd will be trapped by > >>qemu instead of host. > >> > >> 1 is fixed by allocating two instances of iodev and introduce a new > >> capability for userspace. 2 is fixed by ignore the actual length if > >> the length of iodev is zero in kvm_io_bus_cmp(). > >> > >> Please review. > >> Changes from V5: > >> - move patch of explicitly checking for KVM_MMIO_BUS to patch 1 and > >> remove the unnecessary checks > >> - even more grammar and typo fixes > >> - rabase to kvm.git > >> - document KVM_CAP_FAST_MMIO >>> > > What's up with userspace using this capability? >> > >> > It was renamed to KVM_CAP_IOEVENTFD_ANY_LENGTH. >> > >>> > > Did patches ever get posted? >> > >> > See https://lkml.org/lkml/2015/9/28/208 > Talking about userspace here. > QEMU freeze is approaching, it really should > use this to avoid regressions. > The patches were posted at http://lists.gnu.org/archive/html/qemu-devel/2015-11/msg01276.html (you were in cc list) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote: > > Which leaves the special case of Xen, where even preexisting devices > don't bypass the IOMMU. Can we keep this specific to powerpc and > sparc? On x86, this problem is basically nonexistent, since the IOMMU > is properly self-describing. > > IOW, I think that on x86 we should assume that all virtio devices > honor the IOMMU. You don't like performances ? :-) Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Mon, Nov 9, 2015 at 9:28 PM, Benjamin Herrenschmidt wrote: > On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote: >> >> /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. >> */ >> static const struct pci_device_id virtio_pci_id_table[] = { >> { PCI_DEVICE(0x1af4, PCI_ANY_ID) }, >> { 0 } >> }; >> >> Can we match on that range? > > We can, but the problem remains, how do we differenciate an existing > device that does bypass vs. a newer one that needs the IOMMU and thus > doesn't have the new "bypass" property in the device-tree. > We could do it the other way around: on powerpc, if a PCI device is in that range and doesn't have the "bypass" property at all, then it's assumed to bypass the IOMMU. This means that everything that currently works continues working. If someone builds a physical virtio device or uses another system in PCIe target mode speaking virtio, then it won't work until they upgrade their firmware to set bypass=0. Meanwhile everyone using hypothetical new QEMU also gets bypass=0 and no ambiguity. vfio will presumably notice the bypass and correctly refuse to map any current virtio devices. Would that work? --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Mon, Nov 9, 2015 at 9:26 PM, Benjamin Herrenschmidt wrote: > On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote: >> >> Which leaves the special case of Xen, where even preexisting devices >> don't bypass the IOMMU. Can we keep this specific to powerpc and >> sparc? On x86, this problem is basically nonexistent, since the IOMMU >> is properly self-describing. >> >> IOW, I think that on x86 we should assume that all virtio devices >> honor the IOMMU. > > You don't like performances ? :-) This should have basically no effect. Every non-experimental x86 virtio setup in existence either doesn't work at all (Xen) or has DMA ops that are no-ops. --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote: > > /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. > */ > static const struct pci_device_id virtio_pci_id_table[] = { > { PCI_DEVICE(0x1af4, PCI_ANY_ID) }, > { 0 } > }; > > Can we match on that range? We can, but the problem remains, how do we differenciate an existing device that does bypass vs. a newer one that needs the IOMMU and thus doesn't have the new "bypass" property in the device-tree. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Mon, 2015-11-09 at 16:46 -0800, Andy Lutomirski wrote: > The problem here is that in some of the problematic cases the virtio > driver may not even be loaded. If someone runs an L1 guest with an > IOMMU-bypassing virtio device and assigns it to L2 using vfio, then > *boom* L1 crashes. (Same if, say, DPDK gets used, I think.) > > > > > The only way out of this while keeping the "platform" stuff would be to > > also bump some kind of version in the virtio config (or PCI header). I > > have no other way to differenciate between "this is an old qemu that > > doesn't do the 'bypass property' yet" from "this is a virtio device > > that doesn't bypass". > > > > Any better idea ? > > I'd suggest that, in the absence of the new DT binding, we assume that > any PCI device with the virtio vendor ID is passthrough on powerpc. I > can do this in the virtio driver, but if it's in the platform code > then vfio gets it right too (i.e. fails to load). The problem is there isn't *a* virtio vendor ID. It's the RedHat vendor ID which will be used by more than just virtio, so we need to specifically list the devices. Additionally, that still means that once we have a virtio device that actually uses the iommu, powerpc will not work since the "workaround" above will kick in. The "in absence of the new DT binding" doesn't make that much sense. Those platforms use device-trees defined since the dawn of ages by actual open firmware implementations, they either have no iommu representation in there (Macs, the platform code hooks it all up) or have various properties related to the iommu but no concept of "bypass" in there. We can *add* a new property under some circumstances that indicates a bypass on a per-device basis, however that doesn't completely solve it: - As I said above, what does the absence of that property mean ? An old qemu that does bypass on all virtio or a new qemu trying to tell you that the virtio device actually does use the iommu (or some other environment that isn't qemu) ? - On things like macs, the device-tree is generated by openbios, it would have to have some added logic to try to figure that out, which means it needs to know *via different means* that some or all virtio devices bypass the iommu. I thus go back to my original statement, it's a LOT easier to handle if the device itself is self describing, indicating whether it is set to bypass a host iommu or not. For L1->L2, well, that wouldn't be the first time qemu/VFIO plays tricks with the passed through device configuration space... Note that the above can be solved via some kind of compromise: The device self describes the ability to honor the iommu, along with the property (or ACPI table entry) that indicates whether or not it does. IE. We could use the revision or ProgIf field of the config space for example. Or something in virtio config. If it's an "old" device, we know it always bypass. If it's a new device, we know it only bypasses if the corresponding property is in. I still would have to sort out the openbios case for mac among others but it's at least a workable direction. BTW. Don't you have a similar problem on x86 that today qemu claims that everything honors the iommu in ACPI ? Unless somebody can come up with a better idea... Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Mon, Nov 9, 2015 at 6:04 PM, Benjamin Herrenschmidt wrote: > On Mon, 2015-11-09 at 16:46 -0800, Andy Lutomirski wrote: >> The problem here is that in some of the problematic cases the virtio >> driver may not even be loaded. If someone runs an L1 guest with an >> IOMMU-bypassing virtio device and assigns it to L2 using vfio, then >> *boom* L1 crashes. (Same if, say, DPDK gets used, I think.) >> >> > >> > The only way out of this while keeping the "platform" stuff would be to >> > also bump some kind of version in the virtio config (or PCI header). I >> > have no other way to differenciate between "this is an old qemu that >> > doesn't do the 'bypass property' yet" from "this is a virtio device >> > that doesn't bypass". >> > >> > Any better idea ? >> >> I'd suggest that, in the absence of the new DT binding, we assume that >> any PCI device with the virtio vendor ID is passthrough on powerpc. I >> can do this in the virtio driver, but if it's in the platform code >> then vfio gets it right too (i.e. fails to load). > > The problem is there isn't *a* virtio vendor ID. It's the RedHat vendor > ID which will be used by more than just virtio, so we need to > specifically list the devices. Really? /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */ static const struct pci_device_id virtio_pci_id_table[] = { { PCI_DEVICE(0x1af4, PCI_ANY_ID) }, { 0 } }; Can we match on that range? > > Additionally, that still means that once we have a virtio device that > actually uses the iommu, powerpc will not work since the "workaround" > above will kick in. I don't know how to solve that problem, though, especially since the vendor of such a device (especially if it's real hardware) might not set any new bit. > > The "in absence of the new DT binding" doesn't make that much sense. > > Those platforms use device-trees defined since the dawn of ages by > actual open firmware implementations, they either have no iommu > representation in there (Macs, the platform code hooks it all up) or > have various properties related to the iommu but no concept of "bypass" > in there. > > We can *add* a new property under some circumstances that indicates a > bypass on a per-device basis, however that doesn't completely solve it: > > - As I said above, what does the absence of that property mean ? An > old qemu that does bypass on all virtio or a new qemu trying to tell > you that the virtio device actually does use the iommu (or some other > environment that isn't qemu) ? > > - On things like macs, the device-tree is generated by openbios, it > would have to have some added logic to try to figure that out, which > means it needs to know *via different means* that some or all virtio > devices bypass the iommu. > > I thus go back to my original statement, it's a LOT easier to handle if > the device itself is self describing, indicating whether it is set to > bypass a host iommu or not. For L1->L2, well, that wouldn't be the > first time qemu/VFIO plays tricks with the passed through device > configuration space... Which leaves the special case of Xen, where even preexisting devices don't bypass the IOMMU. Can we keep this specific to powerpc and sparc? On x86, this problem is basically nonexistent, since the IOMMU is properly self-describing. IOW, I think that on x86 we should assume that all virtio devices honor the IOMMU. > > Note that the above can be solved via some kind of compromise: The > device self describes the ability to honor the iommu, along with the > property (or ACPI table entry) that indicates whether or not it does. > > IE. We could use the revision or ProgIf field of the config space for > example. Or something in virtio config. If it's an "old" device, we > know it always bypass. If it's a new device, we know it only bypasses > if the corresponding property is in. I still would have to sort out the > openbios case for mac among others but it's at least a workable > direction. > > BTW. Don't you have a similar problem on x86 that today qemu claims > that everything honors the iommu in ACPI ? Only on a single experimental configuration, and that can apparently just be fixed going forward without any real problems being caused. --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 2/3] target-i386: calculate vcpu's TSC rate to be migrated
On 11/09/15 14:01, Eduardo Habkost wrote: > On Mon, Nov 09, 2015 at 08:33:55AM +0800, haozhong.zh...@intel.com wrote: > > On 11/06/15 13:12, Eduardo Habkost wrote: > > > On Fri, Nov 06, 2015 at 10:32:24AM +0800, haozhong.zh...@intel.com wrote: > > > > On 11/05/15 14:05, Eduardo Habkost wrote: > > > > > On Thu, Nov 05, 2015 at 09:30:51AM +0800, Haozhong Zhang wrote: > > > > > > On 11/04/15 19:42, Eduardo Habkost wrote: > > > [...] > > > > > > > > +env->tsc_khz_saved = r; > > > > > > > > +} > > > > > > > > > > > > > > Why do you need a separate tsc_khz_saved field, and don't simply > > > > > > > use > > > > > > > tsc_khz? It would have the additional feature of letting QMP > > > > > > > clients > > > > > > > query the current TSC rate by asking for the tsc-freq property on > > > > > > > CPU > > > > > > > objects. > > > > > > > > > > > > > > > > > > > It's to avoid overriding env->tsc_khz on the destination in the > > > > > > migration. I can change this line to > > > > > > env->tsc_khz = env->tsc_khz_saved = r; > > > > > > > > > > You are already avoiding overriding env->tsc_khz, because you use > > > > > KVM_GET_TSC_KHZ only if tsc_khz is not set yet. I still don't see why > > > > > you need a tsc_khz_saved field that requires duplicating the > > > > > SET_TSC_KHZ > > > > > code, if you could just do this: > > > > > > > > > > if (!env->tsc_khz) { > > > > > env->tsc_khz = kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ); > > > > > } > > > > > > > > > > > > > Consider an example that we migrate a VM from machine A to machine B > > > > and then to machine C, and QEMU on machine B is launched with the cpu > > > > option 'tsc-freq' (i.e. env->tsc_khz on B is non-zero at the > > > > beginning): > > > > 1) In the migration from B to C, the user-specified TSC frequency by > > > > 'tsc-freq' on B is expected to be migrated to C. That is, the > > > > value of env->tsc_khz on B is migrated. > > > > 2) If TSC frequency is migrated through env->tsc_khz, then > > > > env->tsc_khz on B will be overrode in the migration from A to B > > > > before kvm_arch_setup_tsc_khz(). If the guest TSC frequency is > > > > different than the user-specified TSC frequency on B, the > > > > expectation in 1) will not be satisfied anymore. > > > > > > Setting tsc-freq on B when tsc-freq was not used on A is invalid usage. > > > This is not different from changing the CPU model and adding or removing > > > CPU flags when migrating, which is also incorrect. The command-line > > > parameters defining the VM must be the same when you migrate. > > > > > > > Good to know it's an invalid usage. Then the question is what QEMU is > > expected to do for this invalid usage? > > > > 1) Abort the migration? But I find that the current QEMU does not > > abort the migration between different CPU models (e.g. Nehalem and > > Haswell). > > > > 2) Or do not abort the migration and ignore tsc-freq option? If so, > > tsc_khz_saved will be not needed. > > My first choice is to abort migration. If we decide to abort today and > find it to cause problems, we can easily fix it. If we decide to > continue without aborting, it is difficult to change that behavior > without breaking existing setups. > > -- > Eduardo Agree, but I would like to relax the abort condition to "abort the migration only if QEMU on the destination uses a different TSC frequency than the migrated one" so that the following usages would be still valid: 1) Only QEMU on the destination has 'tsc-freq' option, but it' set to the same value of the migrated one. 2) Only QEMU on the source has 'tsc-freq' option. 3) QEMU on both sides have 'tsc-freq' option, but they are set to the same value. In all above usages, TSC frequency on the destination is the same as both the value on the source and the value explicitly expected by users on the destination (by 'tsc-freq' on the destination). And I still need tsc_khz_saved to tell on the destination whether 1) both tsc-freq option and migrated TSC frequency are present, and 2) above two values are the same. Even though we restrictively requires QEMU on both sides use the same CPU options, tsc_khz_saved is still needed because of 1). Haozhong -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Mon, Nov 9, 2015 at 2:58 PM, Benjamin Herrenschmidt wrote: > So ... > > I've finally tried to sort that out for powerpc and I can't find a way > to make that work that isn't a complete pile of stinking shit. > > I'm very tempted to go back to my original idea: virtio itself should > indicate it's "bypassing ability" via the virtio config space or some > other bit (like the ProgIf of the PCI header etc...). > > I don't see how I can make it work otherwise. > > The problem with the statement "it's a platform matter" is that: > > - It's not entirely correct. It's actually a feature of a specific > virtio implementation (qemu's) that it bypasses all the platform IOMMU > facilities. > > - The platforms (on powerpc there's at least 3 or 4 that have qemu > emulation and support some form of PCI) don't have an existing way to > convey the information that a device bypasses the IOMMU (if any). > > - Even if we add such a mechanism (new property in the device-tree), > we end up with a big turd: Because we need to be compatible with older > qemus, we essentially need a quirk that will make all virtio devices > assume that property is present. That will of course break whenever we > try to use another implementation of virtio on powerpc which doesn't > bypass the iommu. > > We don't have a way to differenciate a virtio device that does the > bypass from one that doesn't and the backward compatibility requirement > forces us to treat all existing virtio devices as doing bypass. The problem here is that in some of the problematic cases the virtio driver may not even be loaded. If someone runs an L1 guest with an IOMMU-bypassing virtio device and assigns it to L2 using vfio, then *boom* L1 crashes. (Same if, say, DPDK gets used, I think.) > > The only way out of this while keeping the "platform" stuff would be to > also bump some kind of version in the virtio config (or PCI header). I > have no other way to differenciate between "this is an old qemu that > doesn't do the 'bypass property' yet" from "this is a virtio device > that doesn't bypass". > > Any better idea ? I'd suggest that, in the absence of the new DT binding, we assume that any PCI device with the virtio vendor ID is passthrough on powerpc. I can do this in the virtio driver, but if it's in the platform code then vfio gets it right too (i.e. fails to load). --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm-all: PAGE_SIZE should be real host page size
Just noticed this while grepping TARGET_PAGE_SIZE for an unrelated reason. I didn't use qemu_real_host_page_size as kvm_set_phys_mem() does, because we'd need to make sure page_size_init() has run first. Signed-off-by: Andrew Jones --- kvm-all.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/kvm-all.c b/kvm-all.c index 1bc12737723c3..de9ff5971fb3b 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -45,8 +45,10 @@ #include #endif -/* KVM uses PAGE_SIZE in its definition of COALESCED_MMIO_MAX */ -#define PAGE_SIZE TARGET_PAGE_SIZE +/* KVM uses PAGE_SIZE in its definition of KVM_COALESCED_MMIO_MAX. We + * need to use the real host PAGE_SIZE, as that's what KVM will use. + */ +#define PAGE_SIZE getpagesize() //#define DEBUG_KVM -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 0/6] virtio core DMA API conversion
So ... I've finally tried to sort that out for powerpc and I can't find a way to make that work that isn't a complete pile of stinking shit. I'm very tempted to go back to my original idea: virtio itself should indicate it's "bypassing ability" via the virtio config space or some other bit (like the ProgIf of the PCI header etc...). I don't see how I can make it work otherwise. The problem with the statement "it's a platform matter" is that: - It's not entirely correct. It's actually a feature of a specific virtio implementation (qemu's) that it bypasses all the platform IOMMU facilities. - The platforms (on powerpc there's at least 3 or 4 that have qemu emulation and support some form of PCI) don't have an existing way to convey the information that a device bypasses the IOMMU (if any). - Even if we add such a mechanism (new property in the device-tree), we end up with a big turd: Because we need to be compatible with older qemus, we essentially need a quirk that will make all virtio devices assume that property is present. That will of course break whenever we try to use another implementation of virtio on powerpc which doesn't bypass the iommu. We don't have a way to differenciate a virtio device that does the bypass from one that doesn't and the backward compatibility requirement forces us to treat all existing virtio devices as doing bypass. The only way out of this while keeping the "platform" stuff would be to also bump some kind of version in the virtio config (or PCI header). I have no other way to differenciate between "this is an old qemu that doesn't do the 'bypass property' yet" from "this is a virtio device that doesn't bypass". Any better idea ? Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] KVM/arm64: enable enhanced armv8 fp/simd lazy switch
On 11/5/2015 7:02 AM, Christoffer Dall wrote: > On Fri, Oct 30, 2015 at 02:56:33PM -0700, Mario Smarduch wrote: >> This patch enables arm64 lazy fp/simd switch, similar to arm described in >> second patch. Change from previous version - restore function is moved to >> host. >> >> Signed-off-by: Mario Smarduch >> --- >> arch/arm64/include/asm/kvm_host.h | 2 +- >> arch/arm64/kernel/asm-offsets.c | 1 + >> arch/arm64/kvm/hyp.S | 37 +++-- >> 3 files changed, 33 insertions(+), 7 deletions(-) >> >> diff --git a/arch/arm64/include/asm/kvm_host.h >> b/arch/arm64/include/asm/kvm_host.h >> index 26a2347..dcecf92 100644 >> --- a/arch/arm64/include/asm/kvm_host.h >> +++ b/arch/arm64/include/asm/kvm_host.h >> @@ -251,11 +251,11 @@ static inline void kvm_arch_hardware_unsetup(void) {} >> static inline void kvm_arch_sync_events(struct kvm *kvm) {} >> static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {} >> static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} >> -static inline void kvm_restore_host_vfp_state(struct kvm_vcpu *vcpu) {} >> >> void kvm_arm_init_debug(void); >> void kvm_arm_setup_debug(struct kvm_vcpu *vcpu); >> void kvm_arm_clear_debug(struct kvm_vcpu *vcpu); >> void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu); >> +void kvm_restore_host_vfp_state(struct kvm_vcpu *vcpu); >> >> #endif /* __ARM64_KVM_HOST_H__ */ >> diff --git a/arch/arm64/kernel/asm-offsets.c >> b/arch/arm64/kernel/asm-offsets.c >> index 8d89cf8..c9c5242 100644 >> --- a/arch/arm64/kernel/asm-offsets.c >> +++ b/arch/arm64/kernel/asm-offsets.c >> @@ -124,6 +124,7 @@ int main(void) >>DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_vcpu, >> arch.hcr_el2)); >>DEFINE(VCPU_MDCR_EL2, offsetof(struct kvm_vcpu, arch.mdcr_el2)); >>DEFINE(VCPU_IRQ_LINES,offsetof(struct kvm_vcpu, arch.irq_lines)); >> + DEFINE(VCPU_VFP_DIRTY,offsetof(struct kvm_vcpu, arch.vfp_dirty)); >>DEFINE(VCPU_HOST_CONTEXT, offsetof(struct kvm_vcpu, >> arch.host_cpu_context)); >>DEFINE(VCPU_HOST_DEBUG_STATE, offsetof(struct kvm_vcpu, >> arch.host_debug_state)); >>DEFINE(VCPU_TIMER_CNTV_CTL, offsetof(struct kvm_vcpu, >> arch.timer_cpu.cntv_ctl)); >> diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S >> index e583613..ed2c4cf 100644 >> --- a/arch/arm64/kvm/hyp.S >> +++ b/arch/arm64/kvm/hyp.S >> @@ -36,6 +36,28 @@ >> #define CPU_SYSREG_OFFSET(x)(CPU_SYSREGS + 8*x) >> >> .text >> + >> +/** >> + * void kvm_restore_host_vfp_state(struct vcpu *vcpu) - Executes lazy >> + * fp/simd switch, saves the guest, restores host. Called from host >> + * mode, placed outside of hyp section. > > same comments on style as previous patch > >> + */ >> +ENTRY(kvm_restore_host_vfp_state) >> +pushxzr, lr >> + >> +add x2, x0, #VCPU_CONTEXT >> +mov w3, #0 >> +strbw3, [x0, #VCPU_VFP_DIRTY] > > I've been discussing with myself if it would make more sense to clear > the dirty flag in the C-code... > >> + >> +bl __save_fpsimd >> + >> +ldr x2, [x0, #VCPU_HOST_CONTEXT] >> +bl __restore_fpsimd >> + >> +pop xzr, lr >> +ret >> +ENDPROC(kvm_restore_host_vfp_state) >> + >> .pushsection.hyp.text, "ax" >> .align PAGE_SHIFT >> >> @@ -482,7 +504,11 @@ >> 99: >> msr hcr_el2, x2 >> mov x2, #CPTR_EL2_TTA >> + >> +ldrbw3, [x0, #VCPU_VFP_DIRTY] >> +tbnzw3, #0, 98f >> orr x2, x2, #CPTR_EL2_TFP >> +98: > > mmm, don't you need to only set the fpexc32 when you're actually going > to trap the guest accesses? > > also, you can consider only setting this in vcpu_load (jumping quickly > to EL2 to do so) if we're running a 32-bit guest. Probably worth > measuring the difference between the extra EL2 jump on vcpu_load > compared to hitting this register on every entry/exit. > > Code-wise, it will be nicer to do it on vcpu_load. Hi Christoffer, Marc - just want to run this by you, I ran a test with typical number of fp threads and couple lmbench benchmarks the stride and bandwidth ones. The ratio of exits to vcpu puts is high 50:1 or so. But of course that's subject to the loads you run. I substituted: tbnz x2, #HCR_RW_SHIFT, 99f mov x3, #(1 << 30) msr fpexc32_el2, x3 isb with vcpu_load hyp call and check for 32 bit guest in C mov x1, #(1 << 30) msr fpexc32_el2, x3 ret And then skip_fpsimd_state x8, 2f mrs x6, fpexec_el2 str x6, [x3, #16] with vcpu_put hyp call with check for 32 bit guest in C this is called substantially less often then vcpu_load since fp/simd registers are not always dirty on vcpu_put kern_hyp_va x0 add x2, x0, #VCPU_CONTEXT mrs x1, fpexec32_el2 str x1, [x2, #CPU_SYSREG_OFFSET(FPEXC32_EL2)] ret Of course each hyp call has additional overhead, at a high exit to vcpu_put ratio hyp call appears better. But all this is very highly dependent on exit rate and fp/simd usage. IMO hyp call works better under extrem
[kvm-unit-tests PATCH 19/18] don't embed code inside asserts
assert() is classically a macro which could also be disabled, so if somebody introduces a switch to "#define assert(...) /*nothing*/" in the future, we'd lose code. Suggested-by: Thomas Huth Signed-off-by: Andrew Jones --- lib/arm/setup.c | 19 ++- lib/arm/smp.c | 4 +++- lib/virtio-mmio.c | 4 +++- 3 files changed, 20 insertions(+), 7 deletions(-) diff --git a/lib/arm/setup.c b/lib/arm/setup.c index 02e81a689a8a6..da6edc1f9d8ff 100644 --- a/lib/arm/setup.c +++ b/lib/arm/setup.c @@ -39,8 +39,11 @@ static void cpu_set(int fdtnode __unused, u32 regval, void *info __unused) static void cpu_init(void) { + int ret; + nr_cpus = 0; - assert(dt_for_each_cpu_node(cpu_set, NULL) == 0); + ret = dt_for_each_cpu_node(cpu_set, NULL); + assert(ret == 0); set_cpu_online(0, true); } @@ -49,8 +52,10 @@ static void mem_init(phys_addr_t freemem_start) /* we only expect one membank to be defined in the DT */ struct dt_pbus_reg regs[1]; phys_addr_t mem_start, mem_end; + int ret; - assert(dt_get_memory_params(regs, 1)); + ret = dt_get_memory_params(regs, 1); + assert(ret != 0); mem_start = regs[0].addr; mem_end = mem_start + regs[0].size; @@ -71,14 +76,17 @@ void setup(const void *fdt) { const char *bootargs; u32 fdt_size; + int ret; /* * Move the fdt to just above the stack. The free memory * then starts just after the fdt. */ fdt_size = fdt_totalsize(fdt); - assert(fdt_move(fdt, &stacktop, fdt_size) == 0); - assert(dt_init(&stacktop) == 0); + ret = fdt_move(fdt, &stacktop, fdt_size); + assert(ret == 0); + ret = dt_init(&stacktop); + assert(ret == 0); mem_init(PAGE_ALIGN((unsigned long)&stacktop + fdt_size)); io_init(); @@ -86,6 +94,7 @@ void setup(const void *fdt) thread_info_init(current_thread_info(), 0); - assert(dt_get_bootargs(&bootargs) == 0); + ret = dt_get_bootargs(&bootargs); + assert(ret == 0); setup_args(bootargs); } diff --git a/lib/arm/smp.c b/lib/arm/smp.c index 3cfc6d5ddedd0..390c53b5d84c3 100644 --- a/lib/arm/smp.c +++ b/lib/arm/smp.c @@ -44,11 +44,13 @@ secondary_entry_fn secondary_cinit(void) void smp_boot_secondary(int cpu, secondary_entry_fn entry) { void *stack_base = memalign(THREAD_SIZE, THREAD_SIZE); + int ret; secondary_data.stack = stack_base + THREAD_START_SP; secondary_data.entry = entry; mmu_mark_disabled(cpu); - assert(cpu_psci_cpu_boot(cpu) == 0); + ret = cpu_psci_cpu_boot(cpu); + assert(ret == 0); while (!cpu_online(cpu)) wfe(); diff --git a/lib/virtio-mmio.c b/lib/virtio-mmio.c index 5ccbd193a264a..fa8dd5b8d484d 100644 --- a/lib/virtio-mmio.c +++ b/lib/virtio-mmio.c @@ -123,10 +123,12 @@ static int vm_dt_match(const struct dt_device *dev, int fdtnode) struct vm_dt_info *info = (struct vm_dt_info *)dev->info; struct dt_pbus_reg base; u32 magic; + int ret; dt_device_bind_node((struct dt_device *)dev, fdtnode); - assert(dt_pbus_get_base(dev, &base) == 0); + ret = dt_pbus_get_base(dev, &base); + assert(ret == 0); info->base = ioremap(base.addr, base.size); magic = readl(info->base + VIRTIO_MMIO_MAGIC_VALUE); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvm-unit-tests PATCH v2 02/19] trivial: lib: fail hard on failed mallocs
On 09/11/15 21:53, Andrew Jones wrote: > It's pretty safe to not even bother checking for NULL when > using malloc and friends, but if we do check, then fail > hard. > > Signed-off-by: Andrew Jones > --- > v2: no code in asserts [Thomas Huth] > > lib/virtio-mmio.c | 6 ++ > 1 file changed, 2 insertions(+), 4 deletions(-) > > diff --git a/lib/virtio-mmio.c b/lib/virtio-mmio.c > index 043832299174e..5ccbd193a264a 100644 > --- a/lib/virtio-mmio.c > +++ b/lib/virtio-mmio.c > @@ -54,8 +54,7 @@ static struct virtqueue *vm_setup_vq(struct virtio_device > *vdev, > > vq = calloc(1, sizeof(*vq)); > queue = memalign(PAGE_SIZE, VIRTIO_MMIO_QUEUE_SIZE_MIN); > - if (!vq || !queue) > - return NULL; > + assert(vq && queue); > > writel(index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL); > > @@ -162,8 +161,7 @@ static struct virtio_device *virtio_mmio_dt_bind(u32 > devid) > return NULL; > > vm_dev = calloc(1, sizeof(*vm_dev)); > - if (!vm_dev) > - return NULL; > + assert(vm_dev != NULL); > > vm_dev->base = info.base; > vm_device_init(vm_dev); Reviewed-by: Thomas Huth -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[kvm-unit-tests PATCH v2 02/19] trivial: lib: fail hard on failed mallocs
It's pretty safe to not even bother checking for NULL when using malloc and friends, but if we do check, then fail hard. Signed-off-by: Andrew Jones --- v2: no code in asserts [Thomas Huth] lib/virtio-mmio.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/lib/virtio-mmio.c b/lib/virtio-mmio.c index 043832299174e..5ccbd193a264a 100644 --- a/lib/virtio-mmio.c +++ b/lib/virtio-mmio.c @@ -54,8 +54,7 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, vq = calloc(1, sizeof(*vq)); queue = memalign(PAGE_SIZE, VIRTIO_MMIO_QUEUE_SIZE_MIN); - if (!vq || !queue) - return NULL; + assert(vq && queue); writel(index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL); @@ -162,8 +161,7 @@ static struct virtio_device *virtio_mmio_dt_bind(u32 devid) return NULL; vm_dev = calloc(1, sizeof(*vm_dev)); - if (!vm_dev) - return NULL; + assert(vm_dev != NULL); vm_dev->base = info.base; vm_device_init(vm_dev); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V6 0/6] Fast mmio eventfd fixes
On Mon, Nov 09, 2015 at 12:35:45PM +0800, Jason Wang wrote: > > > On 11/09/2015 01:11 AM, Michael S. Tsirkin wrote: > > On Tue, Sep 15, 2015 at 02:41:53PM +0800, Jason Wang wrote: > >> Hi: > >> > >> This series fixes two issues of fast mmio eventfd: > >> > >> 1) A single iodev instance were registerd on two buses: KVM_MMIO_BUS > >>and KVM_FAST_MMIO_BUS. This will cause double in > >>ioeventfd_destructor() > >> 2) A zero length iodev on KVM_MMIO_BUS will never be found but > >>kvm_io_bus_cmp(). This will lead e.g the eventfd will be trapped by > >>qemu instead of host. > >> > >> 1 is fixed by allocating two instances of iodev and introduce a new > >> capability for userspace. 2 is fixed by ignore the actual length if > >> the length of iodev is zero in kvm_io_bus_cmp(). > >> > >> Please review. > >> Changes from V5: > >> - move patch of explicitly checking for KVM_MMIO_BUS to patch 1 and > >> remove the unnecessary checks > >> - even more grammar and typo fixes > >> - rabase to kvm.git > >> - document KVM_CAP_FAST_MMIO > > What's up with userspace using this capability? > > It was renamed to KVM_CAP_IOEVENTFD_ANY_LENGTH. > > > Did patches ever get posted? > > See https://lkml.org/lkml/2015/9/28/208 Talking about userspace here. QEMU freeze is approaching, it really should use this to avoid regressions. > > > >> Changes from V4: > >> - move the location of kvm_assign_ioeventfd() in patch 1 which reduce > >> the change set. > >> - commit log typo fixes > >> - switch to use kvm_deassign_ioeventfd_id) when fail to register to > >> fast mmio bus > >> - change kvm_io_bus_cmp() as Paolo's suggestions > >> - introduce a new capability to avoid new userspace crash old kernel > >> - add a new patch that only try to register mmio eventfd on fast mmio > >> bus > >> > >> Changes from V3: > >> > >> - Don't do search on two buses when trying to do write on > >> KVM_MMIO_BUS. This fixes a small regression found by vmexit.flat. > >> - Since we don't do search on two buses, change kvm_io_bus_cmp() to > >> let it can find zero length iodevs. > >> - Fix the unnecessary lines in tracepoint patch. > >> > >> Changes from V2: > >> - Tweak styles and comment suggested by Cornelia. > >> > >> Changes from v1: > >> - change ioeventfd_bus_from_flags() to return KVM_FAST_MMIO_BUS when > >> needed to save lots of unnecessary changes. > >> > >> Jason Wang (6): > >> kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd > >> kvm: factor out core eventfd assign/deassign logic > >> kvm: fix double free for fast mmio eventfd > >> kvm: fix zero length mmio searching > >> kvm: add tracepoint for fast mmio > >> kvm: add fast mmio capabilitiy > >> > >> Documentation/virtual/kvm/api.txt | 7 ++- > >> arch/x86/kvm/trace.h | 18 ++ > >> arch/x86/kvm/vmx.c| 1 + > >> arch/x86/kvm/x86.c| 1 + > >> include/uapi/linux/kvm.h | 1 + > >> virt/kvm/eventfd.c| 124 > >> ++ > >> virt/kvm/kvm_main.c | 20 +- > >> 7 files changed, 118 insertions(+), 54 deletions(-) > >> > >> -- > >> 2.1.4 > > -- > > To unsubscribe from this list: send the line "unsubscribe kvm" in > > the body of a message to majord...@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 14/33] pc-dimm: drop the prefix of pc-dimm
On Fri, Oct 30, 2015 at 01:56:08PM +0800, Xiao Guangrong wrote: > This patch is generated by this script: > > find ./ -name "*.[ch]" -o -name "*.json" -o -name "trace-events" -type f \ > | xargs sed -i "s/PC_DIMM/DIMM/g" > > find ./ -name "*.[ch]" -o -name "*.json" -o -name "trace-events" -type f \ > | xargs sed -i "s/PCDIMM/DIMM/g" > > find ./ -name "*.[ch]" -o -name "*.json" -o -name "trace-events" -type f \ > | xargs sed -i "s/pc_dimm/dimm/g" > > find ./ -name "trace-events" -type f | xargs sed -i "s/pc-dimm/dimm/g" > > It prepares the work which abstracts dimm device type for both pc-dimm and > nvdimm > > Signed-off-by: Xiao Guangrong I can see two ways this patchset can get merged - merge refactorings first, nvdimm support on top - merge nvdimm support first, refactor code on top The way you put it in the middle of the series allows neither. And I definitely favor option 2: it's easier to reason about the best way to refactor code when you have multiple users before you. > --- > hmp.c | 2 +- > hw/acpi/ich9.c | 6 +- > hw/acpi/memory_hotplug.c| 16 ++--- > hw/acpi/piix4.c | 6 +- > hw/i386/pc.c| 32 - > hw/mem/pc-dimm.c| 148 > > hw/ppc/spapr.c | 18 ++--- > include/hw/mem/pc-dimm.h| 62 - > numa.c | 2 +- > qapi-schema.json| 8 +-- > qmp.c | 2 +- > stubs/qmp_pc_dimm_device_list.c | 2 +- > trace-events| 8 +-- > 13 files changed, 156 insertions(+), 156 deletions(-) > > diff --git a/hmp.c b/hmp.c > index 5048eee..5c617d2 100644 > --- a/hmp.c > +++ b/hmp.c > @@ -1952,7 +1952,7 @@ void hmp_info_memory_devices(Monitor *mon, const QDict > *qdict) > MemoryDeviceInfoList *info_list = qmp_query_memory_devices(&err); > MemoryDeviceInfoList *info; > MemoryDeviceInfo *value; > -PCDIMMDeviceInfo *di; > +DIMMDeviceInfo *di; > > for (info = info_list; info; info = info->next) { > value = info->value; > diff --git a/hw/acpi/ich9.c b/hw/acpi/ich9.c > index 1c7fcfa..b0d6a67 100644 > --- a/hw/acpi/ich9.c > +++ b/hw/acpi/ich9.c > @@ -440,7 +440,7 @@ void ich9_pm_add_properties(Object *obj, ICH9LPCPMRegs > *pm, Error **errp) > void ich9_pm_device_plug_cb(ICH9LPCPMRegs *pm, DeviceState *dev, Error > **errp) > { > if (pm->acpi_memory_hotplug.is_enabled && > -object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) { > +object_dynamic_cast(OBJECT(dev), TYPE_DIMM)) { > acpi_memory_plug_cb(&pm->acpi_regs, pm->irq, > &pm->acpi_memory_hotplug, > dev, errp); > } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) { > @@ -455,7 +455,7 @@ void ich9_pm_device_unplug_request_cb(ICH9LPCPMRegs *pm, > DeviceState *dev, >Error **errp) > { > if (pm->acpi_memory_hotplug.is_enabled && > -object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) { > +object_dynamic_cast(OBJECT(dev), TYPE_DIMM)) { > acpi_memory_unplug_request_cb(&pm->acpi_regs, pm->irq, >&pm->acpi_memory_hotplug, dev, errp); > } else { > @@ -468,7 +468,7 @@ void ich9_pm_device_unplug_cb(ICH9LPCPMRegs *pm, > DeviceState *dev, >Error **errp) > { > if (pm->acpi_memory_hotplug.is_enabled && > -object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) { > +object_dynamic_cast(OBJECT(dev), TYPE_DIMM)) { > acpi_memory_unplug_cb(&pm->acpi_memory_hotplug, dev, errp); > } else { > error_setg(errp, "acpi: device unplug for not supported device" > diff --git a/hw/acpi/memory_hotplug.c b/hw/acpi/memory_hotplug.c > index ce428df..e687852 100644 > --- a/hw/acpi/memory_hotplug.c > +++ b/hw/acpi/memory_hotplug.c > @@ -54,23 +54,23 @@ static uint64_t acpi_memory_hotplug_read(void *opaque, > hwaddr addr, > o = OBJECT(mdev->dimm); > switch (addr) { > case 0x0: /* Lo part of phys address where DIMM is mapped */ > -val = o ? object_property_get_int(o, PC_DIMM_ADDR_PROP, NULL) : 0; > +val = o ? object_property_get_int(o, DIMM_ADDR_PROP, NULL) : 0; > trace_mhp_acpi_read_addr_lo(mem_st->selector, val); > break; > case 0x4: /* Hi part of phys address where DIMM is mapped */ > -val = o ? object_property_get_int(o, PC_DIMM_ADDR_PROP, NULL) >> 32 > : 0; > +val = o ? object_property_get_int(o, DIMM_ADDR_PROP, NULL) >> 32 : 0; > trace_mhp_acpi_read_addr_hi(mem_st->selector, val); > break; > case 0x8: /* Lo part of DIMM size */ > -val = o ? object_property_get_int(o, PC_DIMM_SIZE_PROP, NULL) : 0; > +val = o ? object_property_get_int(o, DIMM_SIZE_PROP, NULL) : 0; > trace_mhp_acpi_read_size_lo(mem_st->selecto
Re: [PATCH v6 07/33] util: introduce qemu_file_get_page_size()
On Sat, Oct 31, 2015 at 04:09:56PM +0800, Xiao Guangrong wrote: > > > On 10/30/2015 11:54 PM, Eduardo Habkost wrote: > >On Fri, Oct 30, 2015 at 01:56:01PM +0800, Xiao Guangrong wrote: > >>There are three places use the some logic to get the page size on > >>the file path or file fd > >> > >>This patch introduces qemu_file_get_page_size() to unify the code > >> > >>Signed-off-by: Xiao Guangrong > >[...] > >>diff --git a/util/oslib-posix.c b/util/oslib-posix.c > >>index 914cef5..ad94c5a 100644 > >>--- a/util/oslib-posix.c > >>+++ b/util/oslib-posix.c > >>@@ -360,6 +360,22 @@ static size_t fd_getpagesize(int fd) > >> return getpagesize(); > >> } > >> > >>+size_t qemu_file_get_page_size(const char *path) > >>+{ > >>+size_t size = 0; > >>+int fd = qemu_open(path, O_RDONLY); > >>+ > >>+if (fd < 0) { > >>+fprintf(stderr, "Could not open %s.\n", path); > >>+goto exit; > > > >Have you considered using a Error** argument here? > > > >>+} > >>+ > >>+size = fd_getpagesize(fd); > >>+qemu_close(fd); > >>+exit: > >>+return size; > >>+} > >>+ > >>diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c > >>index ac70f08..c661f1c 100644 > >>--- a/target-ppc/kvm.c > >>+++ b/target-ppc/kvm.c > >>@@ -308,28 +308,13 @@ static void kvm_get_smmu_info(PowerPCCPU *cpu, struct > >>kvm_ppc_smmu_info *info) > >> > >> static long gethugepagesize(const char *mem_path) > >> { > >>-struct statfs fs; > >>-int ret; > >>- > >>-do { > >>-ret = statfs(mem_path, &fs); > >>-} while (ret != 0 && errno == EINTR); > >>+long size = qemu_file_get_page_size(mem_path); > >> > >>-if (ret != 0) { > >>-fprintf(stderr, "Couldn't statfs() memory path: %s\n", > >>-strerror(errno)); > >>+if (!size) { > >> exit(1); > >> } > >> > >>-#define HUGETLBFS_MAGIC 0x958458f6 > >>- > >>-if (fs.f_type != HUGETLBFS_MAGIC) { > >>-/* Explicit mempath, but it's ordinary pages */ > >>-return getpagesize(); > >>-} > >>- > >>-/* It's hugepage, return the huge page size */ > >>-return fs.f_bsize; > >>+return size; > >> } > > > >Why are you changing target-ppc/kvm.c:gethugepagesize() to use the new > >funtion, but not the copy at exec.c? To make it simpler, we could > >eliminate both gethugepagesize() functions completely and replace them > >with qemu_file_get_page_size() calls (maybe as part of this patch, maybe > >in a separate patch, I'm not sure). > > > > The gethugepagesize() in exec.c will be eliminated in later patch :). That's why it's not a good idea to split patchset like this, where patch 1 adds a new function, patch 2 uses it. It's better if user is in the same patchset. An exception if when a completely separate group of people should review the function and the usage, e.g. some logic in memory core versus caller in acpi. > And the gethugepagesize() in ppc platform has error handling logic > and has multiple caller. It's not so bad to keep it. > -- MST -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 11/35] util: introduce qemu_file_getlength()
On Mon, Nov 09, 2015 at 12:44:55PM +0800, Xiao Guangrong wrote: > On 11/06/2015 11:50 PM, Eduardo Habkost wrote: > >As this patch affects raw_getlength(), CCing the raw block driver > >maintainer and the qemu-block mailing list. > > Eduardo, thanks for your reminder. I will keep CCing Kevin and qemu-block mail > list for future version. > > > > >On Mon, Nov 02, 2015 at 05:13:13PM +0800, Xiao Guangrong wrote: > >>It is used to get the size of the specified file, also qemu_fd_getlength() > >>is introduced to unify the code with raw_getlength() in block/raw-posix.c > >> > >>Signed-off-by: Xiao Guangrong > >>--- > >> block/raw-posix.c| 7 +-- > >> include/qemu/osdep.h | 2 ++ > >> util/osdep.c | 31 +++ > > > >I know I was the one who suggested osdep.c, but maybe oslib-posix.c is a > >more appropriate place for the new function? > > > > Since the function we introduced here can work on both windows and posix, so > i thing osdep.c is the right place. Otherwise we should implement it for > multiple > platforms. I didn't notice it was going to be used by a platform-independent qemu_file_getlength() function in addition to the posix-specific raw_getlength(). Have you tested it on Windows, though? If you didn't test it on Windows, what about keeping qemu_file_getlength() available only on posix, by now? The only users are raw-posix.c and hostmem-file.c, currently. If in the future somebody need it on Windows, they can decide between moving the SEEK_END code to osdep.c (after testing it), or moving the existing raw-win32.c:raw_getlength() code to a oslib-win32.c:qemu_file_getlength() function. -- Eduardo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] qemu, pkeys: add pkeys support for qemu xsave state handling
On Mon, Nov 09, 2015 at 07:55:33PM +0800, Huaitong Han wrote: > This patch adds pkeys support for qemu xsave state handling. > > Signed-off-by: Huaitong Han [...] > @@ -1145,6 +1146,7 @@ static int kvm_put_xsave(X86CPU *cpu) > #ifdef TARGET_X86_64 > memcpy(&xsave->region[XSAVE_Hi16_ZMM], &env->xmm_regs[16], > 16 * sizeof env->xmm_regs[16]); > +memcpy(&xsave->region[XSAVE_PKRU], &env->pkru, sizeof env->pkru); > #endif > r = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_XSAVE, xsave); > return r; > @@ -1516,6 +1518,7 @@ static int kvm_get_xsave(X86CPU *cpu) > #ifdef TARGET_X86_64 > memcpy(&env->xmm_regs[16], &xsave->region[XSAVE_Hi16_ZMM], > 16 * sizeof env->xmm_regs[16]); > +memcpy(&xsave->region[XSAVE_PKRU], &env->pkru, sizeof env->pkru); Did you mean: memcpy(&env->pkru, &xsave->region[XSAVE_PKRU], sizeof env->pkru) -- Eduardo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 12/35] util: let qemu_fd_getlength support block device
On Mon, Nov 09, 2015 at 01:58:27PM +0800, Xiao Guangrong wrote: > > > On 11/06/2015 11:54 PM, Eduardo Habkost wrote: > >On Mon, Nov 02, 2015 at 05:13:14PM +0800, Xiao Guangrong wrote: > >>lseek can not work for all block devices as the man page says: > >>| Some devices are incapable of seeking and POSIX does not specify > >>| which devices must support lseek(). > >> > >>This patch tries to add the support on Linux by using BLKGETSIZE64 > >>ioctl > >> > >>Signed-off-by: Xiao Guangrong > > > >On which cases is this patch necessary? Do you know any examples of > >Linux block devices that won't work with lseek(SEEK_END)? > > To be honest, i have not checked all block device, this patch was made > based on the man page. However, i do not mind drop this patch (and maybe > other patches) to make this pachset smaller. BLKGETSIZE64 can be added > in the future if we meet such device. By looking at the Linux source code implementing BLKGETSIZE64, it looks like it should work for all block devices where SEEK_END works: * BLKGETSIZE64 returns i_size_read(bdev->bd_inode) (block/ioctl.c:blkdev_ioctl()) * llseek(SEEK_END) uses i_size_read(bd_inode) as the offset (fs/block_dev.c:block_llseek()) That's probably why raw_getlength() never needed a Linux-specific BLKGETSIZE call. -- Eduardo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 07/35] util: introduce qemu_file_get_page_size()
On Mon, Nov 09, 2015 at 12:36:36PM +0800, Xiao Guangrong wrote: > On 11/06/2015 11:36 PM, Eduardo Habkost wrote: > >On Mon, Nov 02, 2015 at 05:13:09PM +0800, Xiao Guangrong wrote: > >>There are three places use the some logic to get the page size on > >>the file path or file fd > >> > >>Windows did not support file hugepage, so it will return normal page > >>for this case. And this interface has not been used on windows so far > >> > >>This patch introduces qemu_file_get_page_size() to unify the code > >> > >>Signed-off-by: Xiao Guangrong > >[...] > >>diff --git a/util/oslib-posix.c b/util/oslib-posix.c > >>index 914cef5..51437ff 100644 > >>--- a/util/oslib-posix.c > >>+++ b/util/oslib-posix.c > >>@@ -340,7 +340,7 @@ static void sigbus_handler(int signal) > >> siglongjmp(sigjump, 1); > >> } > >> > >>-static size_t fd_getpagesize(int fd) > >>+static size_t fd_getpagesize(int fd, Error **errp) > >> { > >> #ifdef CONFIG_LINUX > >> struct statfs fs; > >>@@ -351,7 +351,12 @@ static size_t fd_getpagesize(int fd) > >> ret = fstatfs(fd, &fs); > >> } while (ret != 0 && errno == EINTR); > >> > >>-if (ret == 0 && fs.f_type == HUGETLBFS_MAGIC) { > >>+if (ret) { > >>+error_setg_errno(errp, errno, "fstatfs is failed"); > >>+return 0; > >>+} > >>+ > >>+if (fs.f_type == HUGETLBFS_MAGIC) { > >> return fs.f_bsize; > >> } > > > >You are changing os_mem_prealloc() behavior when fstatfs() fails, here. > >Have you ensured there are no cases where fstatfs() fails but this code > >is still expected to work? > > stat() is supported for all kinds of files, so failed stat() is caused by > file is not exist or kernel internal error (e,g memory is not enough) or > security check is not passed. Whichever we should not do any operation on > the file if stat() failed. The origin code did not check it but it is worth > being fixed i think. Note that this is fstatfs(), not stat(). It's possible go get ENOSYS as error from statfs() if it is not implemented by the filesystem, I just don't know if this really can happen in practice. (But the answer won't matter, as we already aborted on statfs() errors on all codepaths that call fd_getpagesize(). See below.) > > > > >The change looks safe: gethugepagesize() seems to be always called in > >the path that would make fd_getpagesize() be called from > >os_mem_prealloc(), so allocation would abort much earlier if statfs() > >failed. But I haven't confirmed that yet, and I wanted to be sure. > > > > Yes, I am entirely agree with you. :) > I have just confirmed that this is the case, as: * fd_getpagesize() is only called from os_mem_prealloc(), * os_mem_prealloc() is called from: * host_memory_backend_set_prealloc() * Using memory_region_get_fd() as the fd argument * host_memory_backend_memory_complete() * Using memory_region_get_fd() as the fd argument * file_ram_alloc() * After qemu_file_get_page_size()/gethugepagesize() was already called in the same fd (with errors checked) * fd_getpagesize() checks for fd == -1 * The only code that sets the fd for a RAMBlock is file_ram_alloc(), which checks for qemu_file_get_page_size()/gethugepagesize() errors So, it was already impossible to get os_mem_prealloc() called with fd != -1 without having gethugepagesize() called first (and gethugepagesize() already checked for statfs() errors). Reviewed-by: Eduardo Habkost -- Eduardo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 2/3] target-i386: calculate vcpu's TSC rate to be migrated
* Eduardo Habkost (ehabk...@redhat.com) wrote: > On Mon, Nov 09, 2015 at 08:33:55AM +0800, haozhong.zh...@intel.com wrote: > > On 11/06/15 13:12, Eduardo Habkost wrote: > > > On Fri, Nov 06, 2015 at 10:32:24AM +0800, haozhong.zh...@intel.com wrote: > > > > On 11/05/15 14:05, Eduardo Habkost wrote: > > > > > On Thu, Nov 05, 2015 at 09:30:51AM +0800, Haozhong Zhang wrote: > > > > > > On 11/04/15 19:42, Eduardo Habkost wrote: > > > [...] > > > > > > > > +env->tsc_khz_saved = r; > > > > > > > > +} > > > > > > > > > > > > > > Why do you need a separate tsc_khz_saved field, and don't simply > > > > > > > use > > > > > > > tsc_khz? It would have the additional feature of letting QMP > > > > > > > clients > > > > > > > query the current TSC rate by asking for the tsc-freq property on > > > > > > > CPU > > > > > > > objects. > > > > > > > > > > > > > > > > > > > It's to avoid overriding env->tsc_khz on the destination in the > > > > > > migration. I can change this line to > > > > > > env->tsc_khz = env->tsc_khz_saved = r; > > > > > > > > > > You are already avoiding overriding env->tsc_khz, because you use > > > > > KVM_GET_TSC_KHZ only if tsc_khz is not set yet. I still don't see why > > > > > you need a tsc_khz_saved field that requires duplicating the > > > > > SET_TSC_KHZ > > > > > code, if you could just do this: > > > > > > > > > > if (!env->tsc_khz) { > > > > > env->tsc_khz = kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ); > > > > > } > > > > > > > > > > > > > Consider an example that we migrate a VM from machine A to machine B > > > > and then to machine C, and QEMU on machine B is launched with the cpu > > > > option 'tsc-freq' (i.e. env->tsc_khz on B is non-zero at the > > > > beginning): > > > > 1) In the migration from B to C, the user-specified TSC frequency by > > > > 'tsc-freq' on B is expected to be migrated to C. That is, the > > > > value of env->tsc_khz on B is migrated. > > > > 2) If TSC frequency is migrated through env->tsc_khz, then > > > > env->tsc_khz on B will be overrode in the migration from A to B > > > > before kvm_arch_setup_tsc_khz(). If the guest TSC frequency is > > > > different than the user-specified TSC frequency on B, the > > > > expectation in 1) will not be satisfied anymore. > > > > > > Setting tsc-freq on B when tsc-freq was not used on A is invalid usage. > > > This is not different from changing the CPU model and adding or removing > > > CPU flags when migrating, which is also incorrect. The command-line > > > parameters defining the VM must be the same when you migrate. > > > > > > > Good to know it's an invalid usage. Then the question is what QEMU is > > expected to do for this invalid usage? > > > > 1) Abort the migration? But I find that the current QEMU does not > > abort the migration between different CPU models (e.g. Nehalem and > > Haswell). > > > > 2) Or do not abort the migration and ignore tsc-freq option? If so, > > tsc_khz_saved will be not needed. > > My first choice is to abort migration. If we decide to abort today and > find it to cause problems, we can easily fix it. If we decide to > continue without aborting, it is difficult to change that behavior > without breaking existing setups. Yes, if it's a bad config please abort the migration and put a clear message in the log so we can tell easily. Dave > > -- > Eduardo -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 2/3] target-i386: calculate vcpu's TSC rate to be migrated
On Mon, Nov 09, 2015 at 08:33:55AM +0800, haozhong.zh...@intel.com wrote: > On 11/06/15 13:12, Eduardo Habkost wrote: > > On Fri, Nov 06, 2015 at 10:32:24AM +0800, haozhong.zh...@intel.com wrote: > > > On 11/05/15 14:05, Eduardo Habkost wrote: > > > > On Thu, Nov 05, 2015 at 09:30:51AM +0800, Haozhong Zhang wrote: > > > > > On 11/04/15 19:42, Eduardo Habkost wrote: > > [...] > > > > > > > +env->tsc_khz_saved = r; > > > > > > > +} > > > > > > > > > > > > Why do you need a separate tsc_khz_saved field, and don't simply use > > > > > > tsc_khz? It would have the additional feature of letting QMP clients > > > > > > query the current TSC rate by asking for the tsc-freq property on > > > > > > CPU > > > > > > objects. > > > > > > > > > > > > > > > > It's to avoid overriding env->tsc_khz on the destination in the > > > > > migration. I can change this line to > > > > > env->tsc_khz = env->tsc_khz_saved = r; > > > > > > > > You are already avoiding overriding env->tsc_khz, because you use > > > > KVM_GET_TSC_KHZ only if tsc_khz is not set yet. I still don't see why > > > > you need a tsc_khz_saved field that requires duplicating the SET_TSC_KHZ > > > > code, if you could just do this: > > > > > > > > if (!env->tsc_khz) { > > > > env->tsc_khz = kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ); > > > > } > > > > > > > > > > Consider an example that we migrate a VM from machine A to machine B > > > and then to machine C, and QEMU on machine B is launched with the cpu > > > option 'tsc-freq' (i.e. env->tsc_khz on B is non-zero at the > > > beginning): > > > 1) In the migration from B to C, the user-specified TSC frequency by > > > 'tsc-freq' on B is expected to be migrated to C. That is, the > > > value of env->tsc_khz on B is migrated. > > > 2) If TSC frequency is migrated through env->tsc_khz, then > > > env->tsc_khz on B will be overrode in the migration from A to B > > > before kvm_arch_setup_tsc_khz(). If the guest TSC frequency is > > > different than the user-specified TSC frequency on B, the > > > expectation in 1) will not be satisfied anymore. > > > > Setting tsc-freq on B when tsc-freq was not used on A is invalid usage. > > This is not different from changing the CPU model and adding or removing > > CPU flags when migrating, which is also incorrect. The command-line > > parameters defining the VM must be the same when you migrate. > > > > Good to know it's an invalid usage. Then the question is what QEMU is > expected to do for this invalid usage? > > 1) Abort the migration? But I find that the current QEMU does not > abort the migration between different CPU models (e.g. Nehalem and > Haswell). > > 2) Or do not abort the migration and ignore tsc-freq option? If so, > tsc_khz_saved will be not needed. My first choice is to abort migration. If we decide to abort today and find it to cause problems, we can easily fix it. If we decide to continue without aborting, it is difficult to change that behavior without breaking existing setups. -- Eduardo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] qemu, pkeys: add pkeys support for qemu cpuid handling
Am 09.11.2015 um 13:24 schrieb Paolo Bonzini: > On 09/11/2015 12:55, Huaitong Han wrote: >> @@ -351,6 +362,7 @@ static const char *cpuid_6_feature_name[] = { >>CPUID_7_0_EBX_FSGSBASE, CPUID_7_0_EBX_HLE, CPUID_7_0_EBX_AVX2, >>CPUID_7_0_EBX_ERMS, CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM, >>CPUID_7_0_EBX_RDSEED */ >> +#define TCG_7_0_ECX_FEATURES (CPUID_7_0_ECX_PKU | CPUID_7_0_ECX_OSPKE) > > This should be zero. Apart from this detail, the QEMU parts look good. ...except for the subjects, which should be "target-i386: add pkeys support for cpuid handling" etc. - no need to put qemu into a QEMU commit subject, especially not twice. Regards, Andreas -- SUSE Linux GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Felix Imendörffer, Jane Smithard, Graham Norton; HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] context_tracking: remove duplicate enabled check
On 10/27/2015 09:39 PM, Paolo Bonzini wrote: > All calls to context_tracking_enter and context_tracking_exit > are already checking context_tracking_is_enabled, except the > context_tracking_user_enter and context_tracking_user_exit > functions left in for the benefit of assembly calls. > > Pull the check up to those functions, by making them simple > wrappers around the user_enter and user_exit inline functions. > > Cc: Andy Lutomirski > Cc: Frederic Weisbecker > Cc: Rik van Riel > Cc: Paul McKenney > Signed-off-by: Paolo Bonzini Reviewed-by: Rik van Riel Tested-by: Rik van Riel -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] context_tracking: avoid irq_save/irq_restore on guest entry and exit
On 10/27/2015 09:39 PM, Paolo Bonzini wrote: > guest_enter and guest_exit must be called with interrupts disabled, > since they take the vtime_seqlock with write_seq{lock,unlock}. > Therefore, it is not necessary to check for exceptions, nor to > save/restore the IRQ state, when context tracking functions are > called by guest_enter and guest_exit. > > Split the body of context_tracking_entry and context_tracking_exit > out to __-prefixed functions, and use them from KVM. > > Rik van Riel has measured this to speed up a tight vmentry/vmexit > loop by about 2%. > > Cc: Andy Lutomirski > Cc: Frederic Weisbecker > Cc: Rik van Riel > Cc: Paul McKenney > Signed-off-by: Paolo Bonzini Reviewed-by: Rik van Riel Tested-by: Rik van Riel -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/9] KVM, pkeys: add memory protection-key support
On 09/11/2015 12:54, Huaitong Han wrote: > The protection-key feature provides an additional mechanism by which IA-32e > paging controls access to usermode addresses. > > Hardware support for protection keys for user pages is enumerated with CPUID > feature flag CPUID.7.0.ECX[3]:PKU. Software support is CPUID.7.0.ECX[4]:OSPKE > with the setting of CR4.PKE(bit 22). > > When CR4.PKE = 1, every linear address is associated with the 4-bit protection > key located in bits 62:59 of the paging-structure entry that mapped the page > containing the linear address. The PKRU register determines, for each > protection key, whether user-mode addresses with that protection key may be > read or written. > > The PKRU register (protection key rights for user pages) is a 32-bit register > with the following format: for each i (0 ≤ i ≤ 15), PKRU[2i] is the > access-disable bit for protection key i (ADi); PKRU[2i+1] is the write-disable > bit for protection key i (WDi). > > Software can use the RDPKRU and WRPKRU instructions with ECX = 0 to read and > write PKRU. In addition, the PKRU register is XSAVE-managed state and can thus > be read and written by instructions in the XSAVE feature set. Hi, this looks more or less okay. I made a few comments on the individual patches. Please add a test for PKRU to kvm-unit-tests' access.c. I will _not_ merge this feature without unit tests. I have merged nested VPID without, and it was a mistake because they were never submitted and probably never will. Thanks, Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/9] KVM, pkeys: update memeory permission bitmask for pkeys
On 09/11/2015 12:54, Huaitong Han wrote: >* Byte index: page fault error code [4:1] >* Bit index: pte permissions in ACC_* format > + * > + * Add PFEC.PK (bit 5) for protection-key violations Instead, change "[4:1]" to "[5:1]" in the "Byte index" line. Paolo >*/ -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 7/9] KVM, pkeys: Add pkeys support for gva_to_gpa funcions
On 09/11/2015 12:54, Huaitong Han wrote: > index 7a84b83..6e9156d 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -3960,6 +3960,8 @@ gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, > gva_t gva, > struct x86_exception *exception) > { > u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; > + access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE) > + ? PFERR_PK_MASK : 0; > return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception); I think checking is_long_mode is not necessary here, since is_long_mode is not checked in update_permission_bitmask but (dynamically) in permission_fault. > > + gpa_t gpa; > + > + access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE) > + ? PFERR_PK_MASK : 0; Fetches never have PFERR_PK_MASK set. Thanks, Paolo > /* Inline kvm_read_guest_virt_helper for speed. */ > - gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, > access|PFERR_FETCH_MASK, > - exception); > + gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, > + access | PFERR_FETCH_MASK, exception); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/9] KVM, pkeys: add pkeys support for permission_fault logic
On 09/11/2015 13:43, Paolo Bonzini wrote: > > > On 09/11/2015 12:54, Huaitong Han wrote: >> Protection keys define a new 4-bit protection key field (PKEY) in bits >> 62:59 of leaf entries of the page tables, the PKEY is an index to PKRU >> register(16 domains), every domain has 2 bits(write disable bit, access >> disable bit). >> >> Static logic has been produced in update_permission_bitmask, dynamic logic >> need read pkey from page table entries, get pkru value, and deduce the >> correct result. >> >> Signed-off-by: Huaitong Han >> >> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h >> index e4202e4..bbb 100644 >> --- a/arch/x86/kvm/mmu.h >> +++ b/arch/x86/kvm/mmu.h >> @@ -3,6 +3,7 @@ >> >> #include >> #include "kvm_cache_regs.h" >> +#include "x86.h" >> >> #define PT64_PT_BITS 9 >> #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS) >> @@ -24,6 +25,11 @@ >> #define PT_PAGE_SIZE_MASK (1ULL << PT_PAGE_SIZE_SHIFT) >> #define PT_PAT_MASK (1ULL << 7) >> #define PT_GLOBAL_MASK (1ULL << 8) >> + >> +#define PT64_PKEY_BIT0 (1ULL << _PAGE_BIT_PKEY_BIT0) >> +#define PT64_PKEY_BIT1 (1ULL << _PAGE_BIT_PKEY_BIT1) >> +#define PT64_PKEY_BIT2 (1ULL << _PAGE_BIT_PKEY_BIT2) >> +#define PT64_PKEY_BIT3 (1ULL << _PAGE_BIT_PKEY_BIT3) >> #define PT64_NX_SHIFT 63 >> #define PT64_NX_MASK (1ULL << PT64_NX_SHIFT) >> >> @@ -45,6 +51,15 @@ >> #define PT_PAGE_TABLE_LEVEL 1 >> #define PT_MAX_HUGEPAGE_LEVEL (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES - 1) >> >> +#define PKEYS_BIT0_VALUE (1ULL << 0) >> +#define PKEYS_BIT1_VALUE (1ULL << 1) >> +#define PKEYS_BIT2_VALUE (1ULL << 2) >> +#define PKEYS_BIT3_VALUE (1ULL << 3) >> + >> +#define PKRU_READ 0 >> +#define PKRU_WRITE 1 >> +#define PKRU_ATTRS 2 >> + >> static inline u64 rsvd_bits(int s, int e) >> { >> return ((1ULL << (e - s + 1)) - 1) << s; >> @@ -145,10 +160,44 @@ static inline bool is_write_protection(struct kvm_vcpu >> *vcpu) >> * fault with the given access (in ACC_* format)? >> */ >> static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu >> *mmu, >> -unsigned pte_access, unsigned pfec) >> +unsigned pte_access, unsigned pte_pkeys, unsigned pfec) >> { >> -int cpl = kvm_x86_ops->get_cpl(vcpu); >> -unsigned long rflags = kvm_x86_ops->get_rflags(vcpu); >> +unsigned long smap, rflags; >> +u32 pkru; >> +int cpl, index; >> +bool wf, uf, pk, pkru_ad, pkru_wd; >> + >> +cpl = kvm_x86_ops->get_cpl(vcpu); >> +rflags = kvm_x86_ops->get_rflags(vcpu); >> + >> +pkru = read_pkru(); >> + >> +/* >> +* PKRU defines 32 bits, there are 16 domains and 2 attribute bits per >> +* domain in pkru, pkey is index to a defined domain, so the value of >> +* pkey * PKRU_ATTRS + R/W is offset of a defined domain attribute. >> +*/ >> +pkru_ad = (pkru >> (pte_pkeys * PKRU_ATTRS + PKRU_READ)) & 1; >> +pkru_wd = (pkru >> (pte_pkeys * PKRU_ATTRS + PKRU_WRITE)) & 1; >> + >> +wf = pfec & PFERR_WRITE_MASK; >> +uf = pfec & PFERR_USER_MASK; >> + >> +/* >> +* PKeys 2nd and 6th conditions: >> +* 2.EFER_LMA=1 >> +* 6.PKRU.AD=1 >> +* or The access is a data write and PKRU.WD=1 and >> +* either CR0.WP=1 or it is a user mode access >> +*/ >> +pk = is_long_mode(vcpu) && (pkru_ad || >> +(pkru_wd && wf && (is_write_protection(vcpu) || uf))); A little more optimized: pkru_bits = (pkru >> (pte_pkeys * PKRU_ATTRS)) & 3; /* * Ignore PKRU.WD if not relevant to this access (a read, * or a supervisor mode access if CR0.WP=0). */ if (!wf || (!uf && !is_write_protection(vcpu))) pkru_bits &= ~(1 << PKRU_WRITE); ... and then just check pkru_bits != 0. >> +/* >> +* PK bit right value in pfec equal to >> +* PK bit current value in pfec and pk value. >> +*/ >> +pfec &= (pk << PFERR_PK_BIT) + ~PFERR_PK_MASK; > > PK is only applicable to guest page tables, but if you do not support > PKRU without EPT (patch 9), none of this is necessary, is it? Doh. :( Sorry, this is of course needed for the emulation case. I think you should optimize this for the common case where pkru is zero, hence pk will always be zero. So something like pkru = is_long_mode(vcpu) ? read_pkru() : 0; if (unlikely(pkru) && (pfec & PFERR_PK_MASK)) { ... from above ... */ /* Flip PFERR_PK_MASK if pkru_bits is non-zero */ pfec ^= -pkru_bits & PFERR_PK_MASK; } Paolo >> /* >> * If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1. >> @@ -163,8 +212,8 @@ static inline bool permission_fault(struct kvm_vcpu >> *vcpu, struct kvm_mmu *mmu, >> * but it will be one in index if SMAP checks are being overridden. >> * It is important to keep this branchless. >> */ >> -unsigned long smap = (cpl - 3) & (rflags & X86_EFLAGS_AC); >> -int i
Re: [PATCH 6/9] KVM, pkeys: add pkeys support for permission_fault logic
On 09/11/2015 12:54, Huaitong Han wrote: > Protection keys define a new 4-bit protection key field (PKEY) in bits > 62:59 of leaf entries of the page tables, the PKEY is an index to PKRU > register(16 domains), every domain has 2 bits(write disable bit, access > disable bit). > > Static logic has been produced in update_permission_bitmask, dynamic logic > need read pkey from page table entries, get pkru value, and deduce the > correct result. > > Signed-off-by: Huaitong Han > > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index e4202e4..bbb 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -3,6 +3,7 @@ > > #include > #include "kvm_cache_regs.h" > +#include "x86.h" > > #define PT64_PT_BITS 9 > #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS) > @@ -24,6 +25,11 @@ > #define PT_PAGE_SIZE_MASK (1ULL << PT_PAGE_SIZE_SHIFT) > #define PT_PAT_MASK (1ULL << 7) > #define PT_GLOBAL_MASK (1ULL << 8) > + > +#define PT64_PKEY_BIT0 (1ULL << _PAGE_BIT_PKEY_BIT0) > +#define PT64_PKEY_BIT1 (1ULL << _PAGE_BIT_PKEY_BIT1) > +#define PT64_PKEY_BIT2 (1ULL << _PAGE_BIT_PKEY_BIT2) > +#define PT64_PKEY_BIT3 (1ULL << _PAGE_BIT_PKEY_BIT3) > #define PT64_NX_SHIFT 63 > #define PT64_NX_MASK (1ULL << PT64_NX_SHIFT) > > @@ -45,6 +51,15 @@ > #define PT_PAGE_TABLE_LEVEL 1 > #define PT_MAX_HUGEPAGE_LEVEL (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES - 1) > > +#define PKEYS_BIT0_VALUE (1ULL << 0) > +#define PKEYS_BIT1_VALUE (1ULL << 1) > +#define PKEYS_BIT2_VALUE (1ULL << 2) > +#define PKEYS_BIT3_VALUE (1ULL << 3) > + > +#define PKRU_READ 0 > +#define PKRU_WRITE 1 > +#define PKRU_ATTRS 2 > + > static inline u64 rsvd_bits(int s, int e) > { > return ((1ULL << (e - s + 1)) - 1) << s; > @@ -145,10 +160,44 @@ static inline bool is_write_protection(struct kvm_vcpu > *vcpu) > * fault with the given access (in ACC_* format)? > */ > static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu > *mmu, > - unsigned pte_access, unsigned pfec) > + unsigned pte_access, unsigned pte_pkeys, unsigned pfec) > { > - int cpl = kvm_x86_ops->get_cpl(vcpu); > - unsigned long rflags = kvm_x86_ops->get_rflags(vcpu); > + unsigned long smap, rflags; > + u32 pkru; > + int cpl, index; > + bool wf, uf, pk, pkru_ad, pkru_wd; > + > + cpl = kvm_x86_ops->get_cpl(vcpu); > + rflags = kvm_x86_ops->get_rflags(vcpu); > + > + pkru = read_pkru(); > + > + /* > + * PKRU defines 32 bits, there are 16 domains and 2 attribute bits per > + * domain in pkru, pkey is index to a defined domain, so the value of > + * pkey * PKRU_ATTRS + R/W is offset of a defined domain attribute. > + */ > + pkru_ad = (pkru >> (pte_pkeys * PKRU_ATTRS + PKRU_READ)) & 1; > + pkru_wd = (pkru >> (pte_pkeys * PKRU_ATTRS + PKRU_WRITE)) & 1; > + > + wf = pfec & PFERR_WRITE_MASK; > + uf = pfec & PFERR_USER_MASK; > + > + /* > + * PKeys 2nd and 6th conditions: > + * 2.EFER_LMA=1 > + * 6.PKRU.AD=1 > + * or The access is a data write and PKRU.WD=1 and > + * either CR0.WP=1 or it is a user mode access > + */ > + pk = is_long_mode(vcpu) && (pkru_ad || > + (pkru_wd && wf && (is_write_protection(vcpu) || uf))); > + > + /* > + * PK bit right value in pfec equal to > + * PK bit current value in pfec and pk value. > + */ > + pfec &= (pk << PFERR_PK_BIT) + ~PFERR_PK_MASK; PK is only applicable to guest page tables, but if you do not support PKRU without EPT (patch 9), none of this is necessary, is it? > /* >* If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1. > @@ -163,8 +212,8 @@ static inline bool permission_fault(struct kvm_vcpu > *vcpu, struct kvm_mmu *mmu, >* but it will be one in index if SMAP checks are being overridden. >* It is important to keep this branchless. >*/ > - unsigned long smap = (cpl - 3) & (rflags & X86_EFLAGS_AC); > - int index = (pfec >> 1) + > + smap = (cpl - 3) & (rflags & X86_EFLAGS_AC); > + index = (pfec >> 1) + > (smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1)); > > WARN_ON(pfec & PFERR_RSVD_MASK); > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h > index 736e6ab..99563bc 100644 > --- a/arch/x86/kvm/paging_tmpl.h > +++ b/arch/x86/kvm/paging_tmpl.h > @@ -253,6 +253,17 @@ static int FNAME(update_accessed_dirty_bits)(struct > kvm_vcpu *vcpu, > } > return 0; > } > +static inline unsigned FNAME(gpte_pkeys)(struct kvm_vcpu *vcpu, u64 gpte) > +{ > + unsigned pkeys = 0; > +#if PTTYPE == 64 > + pkeys = ((gpte & PT64_PKEY_BIT0) ? PKEYS_BIT0_VALUE : 0) | > + ((gpte & PT64_PKEY_BIT1) ? PKEYS_BIT1_VALUE : 0) | > + ((gpte & PT64_PKEY_BIT2) ? PKEYS_BIT2_VALUE : 0) | > + ((gpte & PT64_PKEY_BIT3) ? PKEYS_BIT3_VALUE : 0); This is just pkeys = (gpte >> _PAGE_BIT_PKEY
Re: [PATCH v6 05/33] acpi: add aml_object_type
On Mon, 9 Nov 2015 13:35:51 +0200 "Michael S. Tsirkin" wrote: > On Fri, Oct 30, 2015 at 01:55:59PM +0800, Xiao Guangrong wrote: > > Implement ObjectType which is used by NVDIMM _DSM method in > > later patch > > > > Signed-off-by: Xiao Guangrong > > I had to go dig in the _DSM patch to see how it's used. > And sure enough, callers have to know AML to make > sense of it. So pls don't split out tiny patches like this. > include the callee with the caller. I'd prefer AML API patches as separate ones, as it makes easier to review vs ACPI spec and also easier to reuse in another series. And once they are ok/reviewed we can merge them ahead of users, so next series respins won't have to post the same patches over and over and we won't have to review them again every respin and others could use already merged API instead of duplicating work that's been already done. > > > --- > > hw/acpi/aml-build.c | 8 > > include/hw/acpi/aml-build.h | 1 + > > 2 files changed, 9 insertions(+) > > > > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c > > index efc06ab..9f792ab 100644 > > --- a/hw/acpi/aml-build.c > > +++ b/hw/acpi/aml-build.c > > @@ -1178,6 +1178,14 @@ Aml *aml_concatenate(Aml *source1, Aml *source2, Aml > > *target) > > return var; > > } > > > > +/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefObjectType */ > > +Aml *aml_object_type(Aml *object) > > +{ > > +Aml *var = aml_opcode(0x8E /* ObjectTypeOp */); > > +aml_append(var, object); > > +return var; > > +} > > + > > It would be better to have a higher level API > that can be used without knowning AML. > For example: > > aml_object_type_is_package() Higher level API is fine but it's better be done on top of AML one, i.e. add it in addition to aml_object_type() > > > > > void > > build_header(GArray *linker, GArray *table_data, > > AcpiTableHeader *h, const char *sig, int len, uint8_t rev) > > diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h > > index 325782d..5b8a118 100644 > > --- a/include/hw/acpi/aml-build.h > > +++ b/include/hw/acpi/aml-build.h > > @@ -278,6 +278,7 @@ Aml *aml_derefof(Aml *arg); > > Aml *aml_sizeof(Aml *arg); > > Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name); > > Aml *aml_concatenate(Aml *source1, Aml *source2, Aml *target); > > +Aml *aml_object_type(Aml *object); > > > > void > > build_header(GArray *linker, GArray *table_data, > > -- > > 1.8.3.1 > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM, pkeys: expose CPUID:OSPKE to guest
On 09/11/2015 12:54, Huaitong Han wrote: > This patch exposes X86_FEATURE_OSPKE to guest, X86_FEATURE_OSPKE is > software support for pkeys, enumerated with CPUID.7.0.ECX[4]:OSPKE, > and it reflects the setting of CR4.PKE. > > Signed-off-by: Huaitong Han > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c > index 29e6502..ece687b 100644 > --- a/arch/x86/kvm/cpuid.c > +++ b/arch/x86/kvm/cpuid.c > @@ -81,6 +81,17 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) > apic->lapic_timer.timer_mode_mask = 1 << 17; > } > > + best = kvm_find_cpuid_entry(vcpu, 7, 0); > + if (!best) > + return 0; > + > + /*Update OSPKE bit */ > + if (boot_cpu_has(X86_FEATURE_PKU) && best->function == 0x7) { > + best->ecx &= ~F(OSPKE); > + if (kvm_read_cr4_bits(vcpu, X86_CR4_PKE)) > + best->ecx |= F(OSPKE); > + } > + > best = kvm_find_cpuid_entry(vcpu, 0xD, 0); > if (!best) { > vcpu->arch.guest_supported_xcr0 = 0; > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index d268da0..5181834 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -754,7 +754,7 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) > (!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE))) > kvm_mmu_reset_context(vcpu); > > - if ((cr4 ^ old_cr4) & X86_CR4_OSXSAVE) > + if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE)) > kvm_update_cpuid(vcpu); > > return 0; > @@ -6841,7 +6841,7 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu, > > mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4; > kvm_x86_ops->set_cr4(vcpu, sregs->cr4); > - if (sregs->cr4 & X86_CR4_OSXSAVE) > + if (sregs->cr4 & (X86_CR4_OSXSAVE | X86_CR4_PKE)) > kvm_update_cpuid(vcpu); > > idx = srcu_read_lock(&vcpu->kvm->srcu); > Hi, please squash these first three patches and move them last. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 0/6] virtio core DMA API conversion
On 09/11/2015 13:15, Michael S. Tsirkin wrote: > Well that's not exactly true. I think we would like to make > it possible to put virtio devices behind an IOMMU on x86, > but if this means existing guests break, then many people won't be able > to use this option: having to find out which kernel version your guest > is running is a significant burden. > > So on the host side, we need to detect guests that > don't program the IOMMU and make QEMU ignore it. > I think we need to figure out a way to do this > before we commit to the guest change. What is the usecase for putting virtio devices behind an IOMMU, apart from: 1) "because you can" 2) using VFIO within the guest ? Case 1 can be ignored, and in case 2 the guest will do the right thing. > Additionally, IOMMU overhead is very high when running within the VM. > So for uses such as VFIO, we'd like a way to make something like > iommu-pt the default. That's not something that the kernel cares about. It's just a configuration issue. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch v2] vfio/pci: make an array larger
Smatch complains about a possible out of bounds error: drivers/vfio/pci/vfio_pci_config.c:1241 vfio_cap_init() error: buffer overflow 'pci_cap_length' 20 <= 20 The problem is that pci_cap_length[] was defined as large enough to hold "PCI_CAP_ID_AF + 1" elements. The code in vfio_cap_init() assumes it has PCI_CAP_ID_MAX + 1 elements. Originally, PCI_CAP_ID_AF and PCI_CAP_ID_MAX were the same but then we introduced PCI_CAP_ID_EA in f80b0ba95964 ('PCI: Add Enhanced Allocation register entries') so now the array is too small. Let's fix this by making the array size PCI_CAP_ID_MAX + 1. And let's make a similar change to pci_ext_cap_length[] for consistency. Also both these arrays can be made const. Signed-off-by: Dan Carpenter --- v2: more cleanups diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index a8657ef..fe2b470 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -46,7 +46,7 @@ * 0: Removed from the user visible capability list * FF: Variable length */ -static u8 pci_cap_length[] = { +static const u8 pci_cap_length[PCI_CAP_ID_MAX + 1] = { [PCI_CAP_ID_BASIC] = PCI_STD_HEADER_SIZEOF, /* pci config header */ [PCI_CAP_ID_PM] = PCI_PM_SIZEOF, [PCI_CAP_ID_AGP]= PCI_AGP_SIZEOF, @@ -74,7 +74,7 @@ static u8 pci_cap_length[] = { * 0: Removed or masked from the user visible capabilty list * FF: Variable length */ -static u16 pci_ext_cap_length[] = { +static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = { [PCI_EXT_CAP_ID_ERR]= PCI_ERR_ROOT_COMMAND, [PCI_EXT_CAP_ID_VC] = 0xFF, [PCI_EXT_CAP_ID_DSN]= PCI_EXT_CAP_DSN_SIZEOF, -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] qemu, pkeys: add pkeys support for qemu cpuid handling
On 09/11/2015 12:55, Huaitong Han wrote: > @@ -351,6 +362,7 @@ static const char *cpuid_6_feature_name[] = { >CPUID_7_0_EBX_FSGSBASE, CPUID_7_0_EBX_HLE, CPUID_7_0_EBX_AVX2, >CPUID_7_0_EBX_ERMS, CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM, >CPUID_7_0_EBX_RDSEED */ > +#define TCG_7_0_ECX_FEATURES (CPUID_7_0_ECX_PKU | CPUID_7_0_ECX_OSPKE) This should be zero. Apart from this detail, the QEMU parts look good. Paolo > #define TCG_APM_FEATURES 0 > #define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT > > @@ -408,6 +420,13 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] > = { > .cpuid_reg = R_EBX, > .tcg_features = TCG_7_0_EBX_FEATURES, > }, > +[FEAT_7_0_ECX] = { > +.feat_names = cpuid_7_0_ecx_feature_name, > +.cpuid_eax = 7, > +.cpuid_needs_ecx = true, .cpuid_ecx = 0, > +.cpuid_reg = R_ECX, > +.tcg_features = TCG_7_0_ECX_FEATURES, > +}, -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Bug 107561] New: 4.2 breaks PCI passthrough in QEMU/KVM
https://bugzilla.kernel.org/show_bug.cgi?id=107561 Bug ID: 107561 Summary: 4.2 breaks PCI passthrough in QEMU/KVM Product: Virtualization Version: unspecified Kernel Version: 4.2 Hardware: All OS: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: kvm Assignee: virtualization_...@kernel-bugs.osdl.org Reporter: fhor...@gmail.com Regression: No QEMU/KVM VM's using PCI pass-through are taking a long time to boot, or never booting. The issue is only present when the VM has more then ~2.5G of ram. I was able to get a VM with 6G to boot, but I had to let it run overnight. More info can be found in this thread: https://bbs.archlinux.org/viewtopic.php?id=203240 Someone in the forum seems to have narrowed it down to a a set of commits regarding kvm and mtrr in 4.1-rc2 that are in the mainline kernel as of 4.2. Specifically commit fa61213746a706dd975661151c35795ca4dd82c2 KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type. -- You are receiving this mail because: You are watching the assignee of the bug. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Thu, Oct 29, 2015 at 06:09:45PM -0700, Andy Lutomirski wrote: > This switches virtio to use the DMA API unconditionally. I'm sure > it breaks things, but it seems to work on x86 using virtio-pci, with > and without Xen, and using both the modern 1.0 variant and the > legacy variant. > > This appears to work on native and Xen x86_64 using both modern and > legacy virtio-pci. It also appears to work on arm and arm64. > > It definitely won't work as-is on s390x, and I haven't been able to > test Christian's patches because I can't get virtio-ccw to work in > QEMU at all. I don't know what I'm doing wrong. > > It doesn't work on ppc64. Ben, consider yourself pinged to send me > a patch :) > > It doesn't work on sparc64. I didn't realize at Kernel Summit that > sparc64 has the same problem as ppc64. > > DaveM, for background, we're trying to fix virtio to use the DMA > API. That will require that every platform that uses virtio > supplies valid DMA operations on devices that use virtio_ring. > Unfortunately, QEMU historically ignores the IOMMU on virtio > devices. > > On x86, this isn't really a problem. x86 has a nice way for the > platform to describe which devices are behind an IOMMU, and QEMU > will be adjusted accordingly. The only thing that will break is a > recently-added experimental mode. Well that's not exactly true. I think we would like to make it possible to put virtio devices behind an IOMMU on x86, but if this means existing guests break, then many people won't be able to use this option: having to find out which kernel version your guest is running is a significant burden. So on the host side, we need to detect guests that don't program the IOMMU and make QEMU ignore it. I think we need to figure out a way to do this before we commit to the guest change. Additionally, IOMMU overhead is very high when running within the VM. So for uses such as VFIO, we'd like a way to make something like iommu-pt the default. > Ben's plan for powerpc is to add a quirk for existing virtio-pci > devices and to eventually update the devicetree stuff to allow QEMU > to tell the guest which devices use the IOMMU. > > AFAICT sparc has a similar problem to powerpc. DaveM, can you come > up with a straightforward way to get sparc's DMA API to work > correctly for virtio-pci devices? > > NB: Sadly, the platforms I've successfully tested on don't include any > big-endian platforms, so there could still be lurking endian problems. > > Changes from v3: > - More big-endian fixes. > - Added better virtio-ring APIs that handle allocation and use them in >virtio-mmio and virtio-pci. > - Switch to Michael's virtio-net patch. > > Changes from v2: > - Fix vring_mapping_error incorrect argument > > Changes from v1: > - Fix an endian conversion error causing a BUG to hit. > - Fix a DMA ordering issue (swiotlb=force works now). > - Minor cleanups. > > Andy Lutomirski (5): > virtio_ring: Support DMA APIs > virtio_pci: Use the DMA API > virtio: Add improved queue allocation API > virtio_mmio: Use the DMA API > virtio_pci: Use the DMA API > > Michael S. Tsirkin (1): > virtio-net: Stop doing DMA from the stack > > drivers/net/virtio_net.c | 34 ++-- > drivers/virtio/Kconfig | 2 +- > drivers/virtio/virtio_mmio.c | 67 ++- > drivers/virtio/virtio_pci_common.h | 6 - > drivers/virtio/virtio_pci_legacy.c | 42 ++--- > drivers/virtio/virtio_pci_modern.c | 61 ++- > drivers/virtio/virtio_ring.c | 348 > ++--- > include/linux/virtio.h | 23 ++- > include/linux/virtio_ring.h| 35 > tools/virtio/linux/dma-mapping.h | 17 ++ > 10 files changed, 426 insertions(+), 209 deletions(-) > create mode 100644 tools/virtio/linux/dma-mapping.h > > -- > 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] qemu, pkeys: add memory protection-key support
The protection-key feature provides an additional mechanism by which IA-32e paging controls access to usermode addresses. Hardware support for protection keys for user pages is enumerated with CPUID feature flag CPUID.7.0.ECX[3]:PKU. Software support is CPUID.7.0.ECX[4]:OSPKE with the setting of CR4.PKE(bit 22). The PKRU register is XSAVE-managed state CPUID.D.0.EAX[9], the size of XSAVE state component for PKRU is 8 bytes, the offset is 0xa80. The specification of Protection Keys can be found at SDM (4.6.2, volume 3) http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf. Huaitong Han (3): qemu, pkeys: add pkeys support for qemu cpuid handling qemu, pkeys: add pkeys support for qemu xsave state handling qemu, pkeys: add pkeys support for qemu migration target-i386/cpu.c | 23 ++- target-i386/cpu.h | 7 +++ target-i386/kvm.c | 3 +++ target-i386/machine.c | 23 +++ 4 files changed, 55 insertions(+), 1 deletion(-) -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/9] KVM, pkeys: Add pkeys support for gva_to_gpa funcions
This patch adds pkeys support for gva_to_gpa funcions. Signed-off-by: Huaitong Han diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7a84b83..6e9156d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3960,6 +3960,8 @@ gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, gva_t gva, struct x86_exception *exception) { u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE) + ? PFERR_PK_MASK : 0; return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception); } @@ -3968,6 +3970,8 @@ gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, gva_t gva, { u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; access |= PFERR_FETCH_MASK; + access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE) + ? PFERR_PK_MASK : 0; return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception); } @@ -3976,6 +3980,8 @@ gpa_t kvm_mmu_gva_to_gpa_write(struct kvm_vcpu *vcpu, gva_t gva, { u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; access |= PFERR_WRITE_MASK; + access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE) + ? PFERR_PK_MASK : 0; return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception); } @@ -4026,10 +4032,14 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt, u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; unsigned offset; int ret; + gpa_t gpa; + + access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE) + ? PFERR_PK_MASK : 0; /* Inline kvm_read_guest_virt_helper for speed. */ - gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, access|PFERR_FETCH_MASK, - exception); + gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, + access | PFERR_FETCH_MASK, exception); if (unlikely(gpa == UNMAPPED_GVA)) return X86EMUL_PROPAGATE_FAULT; @@ -4050,6 +4060,8 @@ int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt, { struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE) + ? PFERR_PK_MASK : 0; return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access, exception); @@ -4073,9 +4085,14 @@ int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt, void *data = val; int r = X86EMUL_CONTINUE; + u32 access = PFERR_WRITE_MASK; + + access |= is_long_mode(vcpu) && kvm_read_cr4_bits(vcpu, X86_CR4_PKE) + ? PFERR_PK_MASK : 0; + while (bytes) { gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, -PFERR_WRITE_MASK, +access, exception); unsigned offset = addr & (PAGE_SIZE-1); unsigned towrite = min(bytes, (unsigned)PAGE_SIZE - offset); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] qemu, pkeys: add pkeys support for qemu xsave state handling
This patch adds pkeys support for qemu xsave state handling. Signed-off-by: Huaitong Han --- target-i386/cpu.c | 2 ++ target-i386/cpu.h | 3 +++ target-i386/kvm.c | 3 +++ 3 files changed, 8 insertions(+) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 575ad8d..7a6a3f8 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -487,6 +487,8 @@ static const ExtSaveArea ext_save_areas[] = { .offset = 0x480, .size = 0x200 }, [7] = { .feature = FEAT_7_0_EBX, .bits = CPUID_7_0_EBX_AVX512F, .offset = 0x680, .size = 0x400 }, +[9] = { .feature = FEAT_7_0_ECX, .bits = CPUID_7_0_ECX_PKU, +.offset = 0xA80, .size = 0x8 }, }; const char *get_register_name_32(unsigned int reg) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index c2e7501..2230b3e 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -401,6 +401,7 @@ #define XSTATE_OPMASK (1ULL << 5) #define XSTATE_ZMM_Hi256(1ULL << 6) #define XSTATE_Hi16_ZMM (1ULL << 7) +#define XSTATE_PKRU (1ULL << 9) /* CPUID feature words */ @@ -984,6 +985,8 @@ typedef struct CPUX86State { uint64_t xcr0; uint64_t xss; +uint32_t pkru; + TPRAccess tpr_access_type; } CPUX86State; diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 066d03d..12164a6 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -1092,6 +1092,7 @@ static int kvm_put_fpu(X86CPU *cpu) #define XSAVE_OPMASK 272 #define XSAVE_ZMM_Hi256 288 #define XSAVE_Hi16_ZMM416 +#define XSAVE_PKRU672 static int kvm_put_xsave(X86CPU *cpu) { @@ -1145,6 +1146,7 @@ static int kvm_put_xsave(X86CPU *cpu) #ifdef TARGET_X86_64 memcpy(&xsave->region[XSAVE_Hi16_ZMM], &env->xmm_regs[16], 16 * sizeof env->xmm_regs[16]); +memcpy(&xsave->region[XSAVE_PKRU], &env->pkru, sizeof env->pkru); #endif r = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_XSAVE, xsave); return r; @@ -1516,6 +1518,7 @@ static int kvm_get_xsave(X86CPU *cpu) #ifdef TARGET_X86_64 memcpy(&env->xmm_regs[16], &xsave->region[XSAVE_Hi16_ZMM], 16 * sizeof env->xmm_regs[16]); +memcpy(&xsave->region[XSAVE_PKRU], &env->pkru, sizeof env->pkru); #endif return 0; } -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/9] KVM, pkeys: update memeory permission bitmask for pkeys
Pkeys define a new status bit in the PFEC. PFEC.PK (bit 5), if some conditions is true, the fault is considered as a PKU violation. This patch updates memeory permission bitmask for pkeys. Signed-off-by: Huaitong Han diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 3bbc1cb..3fc6ada 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -159,12 +159,14 @@ enum { #define PFERR_USER_BIT 2 #define PFERR_RSVD_BIT 3 #define PFERR_FETCH_BIT 4 +#define PFERR_PK_BIT 5 #define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT) #define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT) #define PFERR_USER_MASK (1U << PFERR_USER_BIT) #define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT) #define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT) +#define PFERR_PK_MASK (1U << PFERR_PK_BIT) /* apic attention bits */ #define KVM_APIC_CHECK_VAPIC 0 @@ -290,8 +292,10 @@ struct kvm_mmu { * Bitmap; bit set = permission fault * Byte index: page fault error code [4:1] * Bit index: pte permissions in ACC_* format +* +* Add PFEC.PK (bit 5) for protection-key violations */ - u8 permissions[16]; + u8 permissions[32]; u64 *pae_root; u64 *lm_root; diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 69088a1..0568635 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3793,16 +3793,22 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu, { unsigned bit, byte, pfec; u8 map; - bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, cr4_smep, smap = 0; + bool fault, x, w, u, smap = 0, pku = 0; + bool wf, uf, ff, smapf, rsvdf, pkuf; + bool cr4_smap, cr4_smep, cr4_pku; cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP); cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP); + cr4_pku = kvm_read_cr4_bits(vcpu, X86_CR4_PKE); + for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) { pfec = byte << 1; map = 0; wf = pfec & PFERR_WRITE_MASK; uf = pfec & PFERR_USER_MASK; ff = pfec & PFERR_FETCH_MASK; + rsvdf = pfec & PFERR_RSVD_MASK; + pkuf = pfec & PFERR_PK_MASK; /* * PFERR_RSVD_MASK bit is set in PFEC if the access is not * subject to SMAP restrictions, and cleared otherwise. The @@ -3841,12 +3847,34 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu, * clearer. */ smap = cr4_smap && u && !uf && !ff; + + /* + * PKU:additional mechanism by which the paging + * controls access to user-mode addresses based + * on the value in the PKRU register. A fault is + * considered as a PKU violation if all of the + * following conditions are true: + * 1.CR4_PKE=1. + * 2.EFER_LMA=1. + * 3.page is present with no reserved bit + * violations. + * 4.the access is not an instruction fetch. + * 5.the access is to a user page. + * 6.PKRU.AD=1 + * or The access is a data write and + * PKRU.WD=1 and either CR0.WP=1 + * or it is a user access. + * + * The 2nd and 6th conditions are computed + * dynamically in permission_fault. + */ + pku = cr4_pku && !rsvdf && !ff && u; } else /* Not really needed: no U/S accesses on ept */ u = 1; fault = (ff && !x) || (uf && !u) || (wf && !w) || - (smapf && smap); + (smapf && smap) || (pkuf && pku); map |= fault << bit; } mmu->permissions[byte] = map; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/9] KVM, pkeys: add pkeys support for permission_fault logic
Protection keys define a new 4-bit protection key field (PKEY) in bits 62:59 of leaf entries of the page tables, the PKEY is an index to PKRU register(16 domains), every domain has 2 bits(write disable bit, access disable bit). Static logic has been produced in update_permission_bitmask, dynamic logic need read pkey from page table entries, get pkru value, and deduce the correct result. Signed-off-by: Huaitong Han diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index e4202e4..bbb 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -3,6 +3,7 @@ #include #include "kvm_cache_regs.h" +#include "x86.h" #define PT64_PT_BITS 9 #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS) @@ -24,6 +25,11 @@ #define PT_PAGE_SIZE_MASK (1ULL << PT_PAGE_SIZE_SHIFT) #define PT_PAT_MASK (1ULL << 7) #define PT_GLOBAL_MASK (1ULL << 8) + +#define PT64_PKEY_BIT0 (1ULL << _PAGE_BIT_PKEY_BIT0) +#define PT64_PKEY_BIT1 (1ULL << _PAGE_BIT_PKEY_BIT1) +#define PT64_PKEY_BIT2 (1ULL << _PAGE_BIT_PKEY_BIT2) +#define PT64_PKEY_BIT3 (1ULL << _PAGE_BIT_PKEY_BIT3) #define PT64_NX_SHIFT 63 #define PT64_NX_MASK (1ULL << PT64_NX_SHIFT) @@ -45,6 +51,15 @@ #define PT_PAGE_TABLE_LEVEL 1 #define PT_MAX_HUGEPAGE_LEVEL (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES - 1) +#define PKEYS_BIT0_VALUE (1ULL << 0) +#define PKEYS_BIT1_VALUE (1ULL << 1) +#define PKEYS_BIT2_VALUE (1ULL << 2) +#define PKEYS_BIT3_VALUE (1ULL << 3) + +#define PKRU_READ 0 +#define PKRU_WRITE 1 +#define PKRU_ATTRS 2 + static inline u64 rsvd_bits(int s, int e) { return ((1ULL << (e - s + 1)) - 1) << s; @@ -145,10 +160,44 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu) * fault with the given access (in ACC_* format)? */ static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, - unsigned pte_access, unsigned pfec) + unsigned pte_access, unsigned pte_pkeys, unsigned pfec) { - int cpl = kvm_x86_ops->get_cpl(vcpu); - unsigned long rflags = kvm_x86_ops->get_rflags(vcpu); + unsigned long smap, rflags; + u32 pkru; + int cpl, index; + bool wf, uf, pk, pkru_ad, pkru_wd; + + cpl = kvm_x86_ops->get_cpl(vcpu); + rflags = kvm_x86_ops->get_rflags(vcpu); + + pkru = read_pkru(); + + /* + * PKRU defines 32 bits, there are 16 domains and 2 attribute bits per + * domain in pkru, pkey is index to a defined domain, so the value of + * pkey * PKRU_ATTRS + R/W is offset of a defined domain attribute. + */ + pkru_ad = (pkru >> (pte_pkeys * PKRU_ATTRS + PKRU_READ)) & 1; + pkru_wd = (pkru >> (pte_pkeys * PKRU_ATTRS + PKRU_WRITE)) & 1; + + wf = pfec & PFERR_WRITE_MASK; + uf = pfec & PFERR_USER_MASK; + + /* + * PKeys 2nd and 6th conditions: + * 2.EFER_LMA=1 + * 6.PKRU.AD=1 + * or The access is a data write and PKRU.WD=1 and + * either CR0.WP=1 or it is a user mode access + */ + pk = is_long_mode(vcpu) && (pkru_ad || + (pkru_wd && wf && (is_write_protection(vcpu) || uf))); + + /* + * PK bit right value in pfec equal to + * PK bit current value in pfec and pk value. + */ + pfec &= (pk << PFERR_PK_BIT) + ~PFERR_PK_MASK; /* * If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1. @@ -163,8 +212,8 @@ static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, * but it will be one in index if SMAP checks are being overridden. * It is important to keep this branchless. */ - unsigned long smap = (cpl - 3) & (rflags & X86_EFLAGS_AC); - int index = (pfec >> 1) + + smap = (cpl - 3) & (rflags & X86_EFLAGS_AC); + index = (pfec >> 1) + (smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1)); WARN_ON(pfec & PFERR_RSVD_MASK); diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 736e6ab..99563bc 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -253,6 +253,17 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu, } return 0; } +static inline unsigned FNAME(gpte_pkeys)(struct kvm_vcpu *vcpu, u64 gpte) +{ + unsigned pkeys = 0; +#if PTTYPE == 64 + pkeys = ((gpte & PT64_PKEY_BIT0) ? PKEYS_BIT0_VALUE : 0) | + ((gpte & PT64_PKEY_BIT1) ? PKEYS_BIT1_VALUE : 0) | + ((gpte & PT64_PKEY_BIT2) ? PKEYS_BIT2_VALUE : 0) | + ((gpte & PT64_PKEY_BIT3) ? PKEYS_BIT3_VALUE : 0); +#endif + return pkeys; +} /* * Fetch a guest pte for a guest virtual address @@ -265,12 +276,13 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, pt_element_t pte; pt_element_t __user *uninitialized_var(ptep_user); gfn_t table_gfn; - unsigned index, pt_access, pte_access, accessed_dirty; + unsigned index,
[PATCH 3/3] qemu, pkeys: add pkeys support for qemu migration
This patch adds pkeys support for qemu migration. Signed-off-by: Huaitong Han --- target-i386/machine.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/target-i386/machine.c b/target-i386/machine.c index a0df64b..1b190c7 100644 --- a/target-i386/machine.c +++ b/target-i386/machine.c @@ -725,6 +725,26 @@ static const VMStateDescription vmstate_xss = { VMSTATE_END_OF_LIST() } }; +#ifdef TARGET_X86_64 +static bool pkru_needed(void *opaque) +{ +X86CPU *cpu = opaque; +CPUX86State *env = &cpu->env; + +return env->pkru != 0; +} + +static const VMStateDescription vmstate_pkru = { +.name = "cpu/pkru", +.version_id = 1, +.minimum_version_id = 1, +.needed = pkru_needed, +.fields = (VMStateField[]){ +VMSTATE_UINT32(env.pkru, X86CPU), +VMSTATE_END_OF_LIST() +} +}; +#endif VMStateDescription vmstate_x86_cpu = { .name = "cpu", @@ -844,6 +864,9 @@ VMStateDescription vmstate_x86_cpu = { &vmstate_msr_hyperv_time, &vmstate_avx512, &vmstate_xss, +#ifdef TARGET_X86_64 +&vmstate_pkru, +#endif NULL } }; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 8/9] KVM, pkeys: add pkeys support for xsave state
This patch adds pkeys support for xsave state. Signed-off-by: Huaitong Han diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index f2afa5f..0f71d5d 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -182,7 +182,8 @@ bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn, #define KVM_SUPPORTED_XCR0 (XFEATURE_MASK_FP | XFEATURE_MASK_SSE \ | XFEATURE_MASK_YMM | XFEATURE_MASK_BNDREGS \ - | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512) + | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \ + | XFEATURE_MASK_PKRU) extern u64 host_xcr0; extern u64 kvm_supported_xcr0(void); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 9/9] KVM, pkeys: disable PKU feature without ept
This patch disables CPUID:PKU without ept. Signed-off-by: Huaitong Han diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index ece687b..e1113ae 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -447,6 +447,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, entry->ebx |= F(TSC_ADJUST); entry->ecx &= kvm_supported_word11_x86_features; cpuid_mask(&entry->ecx, 13); + if (!tdp_enabled) + entry->ecx &= ~F(PKU); } else { entry->ebx = 0; entry->ecx = 0; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/9] KVM, pkeys: disable pkeys for guests in non-paging mode
Pkeys is disabled if CPU is in non-paging mode in hardware. However KVM always uses paging mode to emulate guest non-paging, mode with TDP. To emulate this behavior, pkeys needs to be manually disabled when guest switches to non-paging mode. Signed-off-by: Huaitong Han diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index d019868..9b12c80 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3645,14 +3645,14 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) hw_cr4 &= ~X86_CR4_PAE; hw_cr4 |= X86_CR4_PSE; /* -* SMEP/SMAP is disabled if CPU is in non-paging mode -* in hardware. However KVM always uses paging mode to -* emulate guest non-paging mode with TDP. -* To emulate this behavior, SMEP/SMAP needs to be +* SMEP/SMAP/PKU is disabled if CPU is in non-paging +* mode in hardware. However KVM always uses paging +* mode to emulate guest non-paging mode with TDP. +* To emulate this behavior, SMEP/SMAP/PKU needs to be * manually disabled when guest switches to non-paging * mode. */ - hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP); + hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE); } else if (!(cr4 & X86_CR4_PAE)) { hw_cr4 &= ~X86_CR4_PAE; } -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] qemu, pkeys: add pkeys support for qemu cpuid handling
This patch adds pkeys support for qemu cpuid handling. Signed-off-by: Huaitong Han --- target-i386/cpu.c | 21 - target-i386/cpu.h | 4 2 files changed, 24 insertions(+), 1 deletion(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 4d1b085..575ad8d 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -264,6 +264,17 @@ static const char *cpuid_7_0_ebx_feature_name[] = { NULL, NULL, "avx512pf", "avx512er", "avx512cd", NULL, NULL, NULL, }; +static const char *cpuid_7_0_ecx_feature_name[] = { +NULL, NULL, "pku", "ospke", +NULL, NULL, NULL, NULL, +NULL, NULL, NULL, NULL, +NULL, NULL, NULL, NULL, +NULL, NULL, NULL, NULL, +NULL, NULL, NULL, NULL, +NULL, NULL, NULL, NULL, +NULL, NULL, NULL, NULL, +}; + static const char *cpuid_apm_edx_feature_name[] = { NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, @@ -351,6 +362,7 @@ static const char *cpuid_6_feature_name[] = { CPUID_7_0_EBX_FSGSBASE, CPUID_7_0_EBX_HLE, CPUID_7_0_EBX_AVX2, CPUID_7_0_EBX_ERMS, CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM, CPUID_7_0_EBX_RDSEED */ +#define TCG_7_0_ECX_FEATURES (CPUID_7_0_ECX_PKU | CPUID_7_0_ECX_OSPKE) #define TCG_APM_FEATURES 0 #define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT @@ -408,6 +420,13 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] = { .cpuid_reg = R_EBX, .tcg_features = TCG_7_0_EBX_FEATURES, }, +[FEAT_7_0_ECX] = { +.feat_names = cpuid_7_0_ecx_feature_name, +.cpuid_eax = 7, +.cpuid_needs_ecx = true, .cpuid_ecx = 0, +.cpuid_reg = R_ECX, +.tcg_features = TCG_7_0_ECX_FEATURES, +}, [FEAT_8000_0007_EDX] = { .feat_names = cpuid_apm_edx_feature_name, .cpuid_eax = 0x8007, @@ -2401,7 +2420,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, if (count == 0) { *eax = 0; /* Maximum ECX value for sub-leaves */ *ebx = env->features[FEAT_7_0_EBX]; /* Feature flags */ -*ecx = 0; /* Reserved */ +*ecx = env->features[FEAT_7_0_ECX]; /* Feature flags */ *edx = 0; /* Reserved */ } else { *eax = 0; diff --git a/target-i386/cpu.h b/target-i386/cpu.h index ead2832..c2e7501 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -408,6 +408,7 @@ typedef enum FeatureWord { FEAT_1_EDX, /* CPUID[1].EDX */ FEAT_1_ECX, /* CPUID[1].ECX */ FEAT_7_0_EBX, /* CPUID[EAX=7,ECX=0].EBX */ +FEAT_7_0_ECX, /* CPUID[EAX=7,ECX=0].ECX */ FEAT_8000_0001_EDX, /* CPUID[8000_0001].EDX */ FEAT_8000_0001_ECX, /* CPUID[8000_0001].ECX */ FEAT_8000_0007_EDX, /* CPUID[8000_0007].EDX */ @@ -576,6 +577,9 @@ typedef uint32_t FeatureWordArray[FEATURE_WORDS]; #define CPUID_7_0_EBX_AVX512ER (1U << 27) /* AVX-512 Exponential and Reciprocal */ #define CPUID_7_0_EBX_AVX512CD (1U << 28) /* AVX-512 Conflict Detection */ +#define CPUID_7_0_ECX_PKU (1U << 3) +#define CPUID_7_0_ECX_OSPKE(1U << 4) + #define CPUID_XSAVE_XSAVEOPT (1U << 0) #define CPUID_XSAVE_XSAVEC (1U << 1) #define CPUID_XSAVE_XGETBV1(1U << 2) -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/9] KVM, pkeys: add memory protection-key support
The protection-key feature provides an additional mechanism by which IA-32e paging controls access to usermode addresses. Hardware support for protection keys for user pages is enumerated with CPUID feature flag CPUID.7.0.ECX[3]:PKU. Software support is CPUID.7.0.ECX[4]:OSPKE with the setting of CR4.PKE(bit 22). When CR4.PKE = 1, every linear address is associated with the 4-bit protection key located in bits 62:59 of the paging-structure entry that mapped the page containing the linear address. The PKRU register determines, for each protection key, whether user-mode addresses with that protection key may be read or written. The PKRU register (protection key rights for user pages) is a 32-bit register with the following format: for each i (0 ≤ i ≤ 15), PKRU[2i] is the access-disable bit for protection key i (ADi); PKRU[2i+1] is the write-disable bit for protection key i (WDi). Software can use the RDPKRU and WRPKRU instructions with ECX = 0 to read and write PKRU. In addition, the PKRU register is XSAVE-managed state and can thus be read and written by instructions in the XSAVE feature set. PFEC.PK (bit 5) is defined as protection key violations. The specification of Protection Keys can be found at SDM (4.6.2, volume 3) http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf. The kernel native patchset have not yet been merged to upstream, you can found at git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-pkeys.git pkeys-v007. Huaitong Han (9): KVM, pkeys: expose CPUID:PKU to guest KVM, pkeys: add pkeys support when setting CR4 KVM, pkeys: expose CPUID:OSPKE to guest KVM, pkeys: disable pkeys for guests in non-paging mode KVM, pkeys: update memeory permission bitmask for pkeys KVM, pkeys: add pkeys support for permission_fault logic KVM, pkeys: Add pkeys support for gva_to_gpa funcions KVM, pkeys: add pkeys support for xsave state KVM, pkeys: disable PKU feature without ept arch/x86/include/asm/kvm_host.h | 9 +-- arch/x86/kvm/cpuid.c| 23 ++-- arch/x86/kvm/cpuid.h| 8 ++ arch/x86/kvm/mmu.c | 32 -- arch/x86/kvm/mmu.h | 59 + arch/x86/kvm/paging_tmpl.h | 20 +++--- arch/x86/kvm/vmx.c | 10 +++ arch/x86/kvm/x86.c | 34 +++- arch/x86/kvm/x86.h | 3 ++- 9 files changed, 171 insertions(+), 27 deletions(-) -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/9] KVM, pkeys: expose CPUID:OSPKE to guest
This patch exposes X86_FEATURE_OSPKE to guest, X86_FEATURE_OSPKE is software support for pkeys, enumerated with CPUID.7.0.ECX[4]:OSPKE, and it reflects the setting of CR4.PKE. Signed-off-by: Huaitong Han diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 29e6502..ece687b 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -81,6 +81,17 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) apic->lapic_timer.timer_mode_mask = 1 << 17; } + best = kvm_find_cpuid_entry(vcpu, 7, 0); + if (!best) + return 0; + + /*Update OSPKE bit */ + if (boot_cpu_has(X86_FEATURE_PKU) && best->function == 0x7) { + best->ecx &= ~F(OSPKE); + if (kvm_read_cr4_bits(vcpu, X86_CR4_PKE)) + best->ecx |= F(OSPKE); + } + best = kvm_find_cpuid_entry(vcpu, 0xD, 0); if (!best) { vcpu->arch.guest_supported_xcr0 = 0; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d268da0..5181834 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -754,7 +754,7 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) (!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE))) kvm_mmu_reset_context(vcpu); - if ((cr4 ^ old_cr4) & X86_CR4_OSXSAVE) + if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE)) kvm_update_cpuid(vcpu); return 0; @@ -6841,7 +6841,7 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu, mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4; kvm_x86_ops->set_cr4(vcpu, sregs->cr4); - if (sregs->cr4 & X86_CR4_OSXSAVE) + if (sregs->cr4 & (X86_CR4_OSXSAVE | X86_CR4_PKE)) kvm_update_cpuid(vcpu); idx = srcu_read_lock(&vcpu->kvm->srcu); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/9] KVM, pkeys: add pkeys support when setting CR4
This patch adds pkeys support when setting CR4.PKE (bit 22). Signed-off-by: Huaitong Han diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c12e845..3bbc1cb 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -55,7 +55,8 @@ | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \ | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \ | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \ - | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP)) + | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP \ + | X86_CR4_PKE)) #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR) diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index dd05b9c..7775158 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -70,6 +70,14 @@ static inline bool guest_cpuid_has_fsgsbase(struct kvm_vcpu *vcpu) return best && (best->ebx & bit(X86_FEATURE_FSGSBASE)); } +static inline bool guest_cpuid_has_pku(struct kvm_vcpu *vcpu) +{ + struct kvm_cpuid_entry2 *best; + + best = kvm_find_cpuid_entry(vcpu, 7, 0); + return best && (best->ecx & bit(X86_FEATURE_PKU)); +} + static inline bool guest_cpuid_has_longmode(struct kvm_vcpu *vcpu) { struct kvm_cpuid_entry2 *best; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2d4e54d..d268da0 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -709,7 +709,7 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) { unsigned long old_cr4 = kvm_read_cr4(vcpu); unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE | - X86_CR4_SMEP | X86_CR4_SMAP; + X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE; if (cr4 & CR4_RESERVED_BITS) return 1; @@ -726,6 +726,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) if (!guest_cpuid_has_fsgsbase(vcpu) && (cr4 & X86_CR4_FSGSBASE)) return 1; + if (!guest_cpuid_has_pku(vcpu) && (cr4 & X86_CR4_PKE)) + return 1; + if (is_long_mode(vcpu)) { if (!(cr4 & X86_CR4_PAE)) return 1; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/9] KVM, pkeys: expose CPUID:PKU to guest
This patch expose X86_FEATURE_PKU to guest, X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation: CPUID.7.0.ECX[3]:PKU. Signed-off-by: Huaitong Han diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 156441b..29e6502 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -354,6 +354,9 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, const u32 kvm_supported_word10_x86_features = F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | f_xsaves; + /* cpuid 7.0.ecx*/ + const u32 kvm_supported_word11_x86_features = F(PKU) | 0 /*OSPKE*/; + /* all calls to cpuid_count() should be made on the same cpu */ get_cpu(); @@ -431,10 +434,13 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, cpuid_mask(&entry->ebx, 9); // TSC_ADJUST is emulated entry->ebx |= F(TSC_ADJUST); - } else + entry->ecx &= kvm_supported_word11_x86_features; + cpuid_mask(&entry->ecx, 13); + } else { entry->ebx = 0; + entry->ecx = 0; + } entry->eax = 0; - entry->ecx = 0; entry->edx = 0; break; } -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 05/33] acpi: add aml_object_type
On Fri, Oct 30, 2015 at 01:55:59PM +0800, Xiao Guangrong wrote: > Implement ObjectType which is used by NVDIMM _DSM method in > later patch > > Signed-off-by: Xiao Guangrong I had to go dig in the _DSM patch to see how it's used. And sure enough, callers have to know AML to make sense of it. So pls don't split out tiny patches like this. include the callee with the caller. > --- > hw/acpi/aml-build.c | 8 > include/hw/acpi/aml-build.h | 1 + > 2 files changed, 9 insertions(+) > > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c > index efc06ab..9f792ab 100644 > --- a/hw/acpi/aml-build.c > +++ b/hw/acpi/aml-build.c > @@ -1178,6 +1178,14 @@ Aml *aml_concatenate(Aml *source1, Aml *source2, Aml > *target) > return var; > } > > +/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefObjectType */ > +Aml *aml_object_type(Aml *object) > +{ > +Aml *var = aml_opcode(0x8E /* ObjectTypeOp */); > +aml_append(var, object); > +return var; > +} > + It would be better to have a higher level API that can be used without knowning AML. For example: aml_object_type_is_package() > void > build_header(GArray *linker, GArray *table_data, > AcpiTableHeader *h, const char *sig, int len, uint8_t rev) > diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h > index 325782d..5b8a118 100644 > --- a/include/hw/acpi/aml-build.h > +++ b/include/hw/acpi/aml-build.h > @@ -278,6 +278,7 @@ Aml *aml_derefof(Aml *arg); > Aml *aml_sizeof(Aml *arg); > Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name); > Aml *aml_concatenate(Aml *source1, Aml *source2, Aml *target); > +Aml *aml_object_type(Aml *object); > > void > build_header(GArray *linker, GArray *table_data, > -- > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 06/33] acpi: add aml_method_serialized
On 11/09/2015 07:14 PM, Michael S. Tsirkin wrote: On Fri, Oct 30, 2015 at 01:56:00PM +0800, Xiao Guangrong wrote: It avoid explicit Mutex and will be used by NVDIMM ACPI Signed-off-by: Xiao Guangrong I'd rather you squashed these utility patches in with where the code is used. This is just making it harder to review as I have to jump back and forth. Okay to me. --- hw/acpi/aml-build.c | 26 -- include/hw/acpi/aml-build.h | 1 + 2 files changed, 25 insertions(+), 2 deletions(-) diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c index 9f792ab..8bee8b2 100644 --- a/hw/acpi/aml-build.c +++ b/hw/acpi/aml-build.c @@ -696,14 +696,36 @@ Aml *aml_while(Aml *predicate) } /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefMethod */ -Aml *aml_method(const char *name, int arg_count) +static Aml *__aml_method(const char *name, int arg_count, bool serialized) Please don't prefix names with __. what should you call this? For example, you can call it aml_method_serialized. Igor disliked that aml_method_serialized() is spepated from aml_method(), so i will unify them as a single function instead. :) { Aml *var = aml_bundle(0x14 /* MethodOp */, AML_PACKAGE); +int methodflags; + +/* + * MethodFlags: + * bit 0-2: ArgCount (0-7) + * bit 3: SerializeFlag + * 0: NotSerialized + * 1: Serialized + * bit 4-7: reserved (must be 0) + */ +assert(!(arg_count & ~7)); Or shorter assert(arg_count < 8); Good to me. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 12/33] pc-dimm: remove DEFAULT_PC_DIMMSIZE
On 11/09/2015 06:40 PM, Michael S. Tsirkin wrote: On Fri, Oct 30, 2015 at 01:56:06PM +0800, Xiao Guangrong wrote: It's not used any more Signed-off-by: Xiao Guangrong You should leave the renames and cleanups off for later. This patchset is large enough as it is. Okay. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 19/33] dimm: keep the state of the whole backend memory
On 11/09/2015 07:04 PM, Michael S. Tsirkin wrote: On Fri, Oct 30, 2015 at 01:56:13PM +0800, Xiao Guangrong wrote: QEMU keeps the state of memory of dimm device during live migration, however, it is not enough for nvdimm device as its memory does not contain its label data, so that we should protect the whole backend memory instead Signed-off-by: Xiao Guangrong It looks like there's now a difference between host_memory_backend_get_memory and get_memory_region, whereas previously they were exactly interchangeable. Yes, it is. host_memory_backend_get_memory() is used to get the whole backend memory region. get_memory_region() is used to get the memory region which will be mapped to guest's address space. They are on the same page for pc-dimm, but it's different for nvdimm since part of backend memory used as label data which is not directly mapped to guest. This needs some thought, in particular the theoretically generic dimm.c has to do tricks to accomodate nvdimm. The missing piece for NVDIMM is the 128k label space at the end, right? Can't nvdimm specific code just register that as a separate RAM chunk in order to migrate it? Yes, it is. I thought this before, then we need to have addition callbackes: - plug(), which is called when the dimm device is plugged, then nvdimm has the chance to do it own migration things. - unplug(), let nvdimm undo the thing of migration. It needs more codes but the logic seems more clear. If you like this way, i will do. --- hw/mem/dimm.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c index 498d380..7d1 100644 --- a/hw/mem/dimm.c +++ b/hw/mem/dimm.c @@ -134,9 +134,16 @@ void dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms, } memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr); -vmstate_register_ram(mr, dev); numa_set_mem_node_id(addr, memory_region_size(mr), dimm->node); +/* + * save the state only for @mr is not enough as it does not contain + * the label data of NVDIMM device, so that we keep the state of + * whole hostmem instead. + */ +vmstate_register_ram(host_memory_backend_get_memory(dimm->hostmem, errp), + dev); + out: error_propagate(errp, local_err); } @@ -145,10 +152,13 @@ void dimm_memory_unplug(DeviceState *dev, MemoryHotplugState *hpms, MemoryRegion *mr) { DIMMDevice *dimm = DIMM(dev); +MemoryRegion *backend_mr; + +backend_mr = host_memory_backend_get_memory(dimm->hostmem, &error_abort); numa_unset_mem_node_id(dimm->addr, memory_region_size(mr), dimm->node); memory_region_del_subregion(&hpms->mr, mr); -vmstate_unregister_ram(mr, dev); +vmstate_unregister_ram(backend_mr, dev); } int qmp_dimm_device_list(Object *obj, void *opaque) -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 06/33] acpi: add aml_method_serialized
On Fri, Oct 30, 2015 at 01:56:00PM +0800, Xiao Guangrong wrote: > It avoid explicit Mutex and will be used by NVDIMM ACPI > > Signed-off-by: Xiao Guangrong I'd rather you squashed these utility patches in with where the code is used. This is just making it harder to review as I have to jump back and forth. > --- > hw/acpi/aml-build.c | 26 -- > include/hw/acpi/aml-build.h | 1 + > 2 files changed, 25 insertions(+), 2 deletions(-) > > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c > index 9f792ab..8bee8b2 100644 > --- a/hw/acpi/aml-build.c > +++ b/hw/acpi/aml-build.c > @@ -696,14 +696,36 @@ Aml *aml_while(Aml *predicate) > } > > /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefMethod */ > -Aml *aml_method(const char *name, int arg_count) > +static Aml *__aml_method(const char *name, int arg_count, bool serialized) Please don't prefix names with __. what should you call this? For example, you can call it aml_method_serialized. > { > Aml *var = aml_bundle(0x14 /* MethodOp */, AML_PACKAGE); > +int methodflags; > + > +/* > + * MethodFlags: > + * bit 0-2: ArgCount (0-7) > + * bit 3: SerializeFlag > + * 0: NotSerialized > + * 1: Serialized > + * bit 4-7: reserved (must be 0) > + */ > +assert(!(arg_count & ~7)); Or shorter assert(arg_count < 8); > +methodflags = arg_count | (serialized << 3); > build_append_namestring(var->buf, "%s", name); > -build_append_byte(var->buf, arg_count); /* MethodFlags: ArgCount */ > +build_append_byte(var->buf, methodflags); > return var; > } > > +Aml *aml_method(const char *name, int arg_count) > +{ > +return __aml_method(name, arg_count, false); > +} > + > +Aml *aml_method_serialized(const char *name, int arg_count) > +{ > +return __aml_method(name, arg_count, true); > +} > + > /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefDevice */ > Aml *aml_device(const char *name_format, ...) > { > diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h > index 5b8a118..00cf40e 100644 > --- a/include/hw/acpi/aml-build.h > +++ b/include/hw/acpi/aml-build.h > @@ -263,6 +263,7 @@ Aml *aml_qword_memory(AmlDecode dec, AmlMinFixed > min_fixed, > Aml *aml_scope(const char *name_format, ...) GCC_FMT_ATTR(1, 2); > Aml *aml_device(const char *name_format, ...) GCC_FMT_ATTR(1, 2); > Aml *aml_method(const char *name, int arg_count); > +Aml *aml_method_serialized(const char *name, int arg_count); > Aml *aml_if(Aml *predicate); > Aml *aml_else(void); > Aml *aml_while(Aml *predicate); > -- > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH v7 25/35] nvdimm acpi: init the resource used by NVDIMM ACPI
On Fri, 6 Nov 2015 16:31:43 +0800 Xiao Guangrong wrote: > > > On 11/05/2015 10:49 PM, Igor Mammedov wrote: > > On Thu, 5 Nov 2015 21:33:39 +0800 > > Xiao Guangrong wrote: > > > >> > >> > >> On 11/05/2015 09:03 PM, Igor Mammedov wrote: > >>> On Thu, 5 Nov 2015 18:15:31 +0800 > >>> Xiao Guangrong wrote: > >>> > > > On 11/05/2015 05:58 PM, Igor Mammedov wrote: > > On Mon, 2 Nov 2015 17:13:27 +0800 > > Xiao Guangrong wrote: > > > >> A page staring from 0xFF0 and IO port 0x0a18 - 0xa1b in guest are > > ^^ missing one 0??? > > > >> reserved for NVDIMM ACPI emulation, refer to docs/specs/acpi_nvdimm.txt > >> for detailed design > >> > >> A parameter, 'nvdimm-support', is introduced for PIIX4_PM and ICH9-LPC > >> that controls if nvdimm support is enabled, it is true on default and > >> it is false on 2.4 and its earlier version to keep compatibility > >> > >> Signed-off-by: Xiao Guangrong > > [...] > > > >> @@ -33,6 +33,15 @@ > >> */ > >> #define MIN_NAMESPACE_LABEL_SIZE (128UL << 10) > >> > >> +/* > >> + * A page staring from 0xFF0 and IO port 0x0a18 - 0xa1b in guest > >> are > > ^^^ missing 0 or value in define > > below has an extra 0 > > > >> + * reserved for NVDIMM ACPI emulation, refer to > >> docs/specs/acpi_nvdimm.txt > >> + * for detailed design. > >> + */ > >> +#define NVDIMM_ACPI_MEM_BASE 0xFF00ULL > > it still maps RAM at arbitrary place, > > that's the reason why VMGenID patches hasn't been merged for > > more than several months. > > I'm not against of using (more exactly I'm for it) direct mapping > > but we should reach consensus when and how to use it first. > > > > I'd wouldn't use addresses below 4G as it may be used firmware or > > legacy hardware and I won't bet that 0xFF00ULL won't conflict > > with anything. > > An alternative place to allocate reserve from could be high memory. > > For pc we have "reserved-memory-end" which currently makes sure > > that hotpluggable memory range isn't used by firmware. > > > > How about making API that allows to map additional memory > > ranges before reserved-memory-end and pushes it up as mappings are > > added. [...] > > Really a good study case to me, i tried your patch and moved the 64 bit > staffs to the private method, it worked. :) > > Igor, is this the API you want? Lets get ack from Michael on the idea of RAM mapping before "reserved-memory-end" first. If he rejects it then there isn't any other way except of switching to MMIO instead. > > diff --git a/hw/i386/pc.c b/hw/i386/pc.c > index 6bf569a..aba29df 100644 > --- a/hw/i386/pc.c > +++ b/hw/i386/pc.c > @@ -1291,6 +1291,38 @@ FWCfgState *xen_load_linux(PCMachineState *pcms, > return fw_cfg; > } > > +static void pc_reserve_high_memory_init(PCMachineState *pcms, > +uint64_t base, uint64_t align) > +{ > +pcms->reserve_high_memory.current_addr = ROUND_UP(base, align); > +} > + > +static uint64_t > +pc_reserve_high_memory_end(PCMachineState *pcms, int64_t align) > +{ > +return ROUND_UP(pcms->reserve_high_memory.current_addr, align); > +} > + > +uint64_t pc_reserve_high_memory(PCMachineState *pcms, uint64_t size, > +int64_t align, Error **errp) > +{ > +uint64_t end_addr, current_addr = pcms->reserve_high_memory.current_addr; > + > +if (!current_addr) { > +error_setg(errp, "reserved high memory is not initialized."); > +return 0; > +} > + > +end_addr = pc_reserve_high_memory_end(pcms, align) + size; > +if (current_addr > end_addr) { > +error_setg(errp, "reserved high memory is not enough."); > +return 0; > +} > + > +pcms->reserve_high_memory.current_addr = end_addr; > +return end_addr; > +} > + > FWCfgState *pc_memory_init(PCMachineState *pcms, > MemoryRegion *system_memory, > MemoryRegion *rom_memory, > @@ -1379,6 +1411,8 @@ FWCfgState *pc_memory_init(PCMachineState *pcms, > "hotplug-memory", hotplug_mem_size); > memory_region_add_subregion(system_memory, > pcms->hotplug_memory.base, > &pcms->hotplug_memory.mr); > +pc_reserve_high_memory_init(pcms, pcms->hotplug_memory.base + > +hotplug_mem_size, 1ULL << 30); > } > > /* Initialize PC system firmware */ > @@ -1403,7 +1437,7 @@ FWCfgState *pc_memory_init(PCMachineState *pcms, > uint64_t res_mem_end = pcms->hotplug_memory.base; > > if (!pcmc->broken_reserved_end) { > -res_mem_end += memory_region_size(&pcms->hotplug_memory.mr); > +res_mem_end =
Re: [PATCH v6 08/33] exec: allow memory to be allocated from any kind of path
On 11/09/2015 06:39 PM, Michael S. Tsirkin wrote: On Fri, Oct 30, 2015 at 01:56:02PM +0800, Xiao Guangrong wrote: Currently file_ram_alloc() is designed for hugetlbfs, however, the memory of nvdimm can come from either raw pmem device eg, /dev/pmem, or the file locates at DAX enabled filesystem So this patch let it work on any kind of path Signed-off-by: Xiao Guangrong So this allows regular memory to be specified directly. This needs to be split out and merged separately from acpi/nvdimm bits. Yup, that is in my plan. Alternatively, if it's possible to use nvdimm with DAX fs (similar to hugetlbfs), leave these patches off for now. DAX is a filesystem mount options, it is not easy to get this info from a file. --- exec.c | 56 +--- 1 file changed, 17 insertions(+), 39 deletions(-) diff --git a/exec.c b/exec.c index 8af2570..3ca7e50 100644 --- a/exec.c +++ b/exec.c @@ -1174,32 +1174,6 @@ void qemu_mutex_unlock_ramlist(void) } #ifdef __linux__ - -#include - -#define HUGETLBFS_MAGIC 0x958458f6 - -static long gethugepagesize(const char *path, Error **errp) -{ -struct statfs fs; -int ret; - -do { -ret = statfs(path, &fs); -} while (ret != 0 && errno == EINTR); - -if (ret != 0) { -error_setg_errno(errp, errno, "failed to get page size of file %s", - path); -return 0; -} - -if (fs.f_type != HUGETLBFS_MAGIC) -fprintf(stderr, "Warning: path not on HugeTLBFS: %s\n", path); - -return fs.f_bsize; -} - static void *file_ram_alloc(RAMBlock *block, ram_addr_t memory, const char *path, @@ -1210,20 +1184,24 @@ static void *file_ram_alloc(RAMBlock *block, char *c; void *area; int fd; -uint64_t hpagesize; -Error *local_err = NULL; +uint64_t pagesize; -hpagesize = gethugepagesize(path, &local_err); -if (local_err) { -error_propagate(errp, local_err); +pagesize = qemu_file_get_page_size(path); +if (!pagesize) { +error_setg(errp, "can't get page size for %s", path); goto error; } -block->mr->align = hpagesize; -if (memory < hpagesize) { +if (pagesize == getpagesize()) { +fprintf(stderr, "Memory is not allocated from HugeTlbfs.\n"); +} + +block->mr->align = pagesize; + +if (memory < pagesize) { error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to " - "or larger than huge page size 0x%" PRIx64, - memory, hpagesize); + "or larger than page size 0x%" PRIx64, + memory, pagesize); goto error; } @@ -1247,14 +1225,14 @@ static void *file_ram_alloc(RAMBlock *block, fd = mkstemp(filename); if (fd < 0) { error_setg_errno(errp, errno, - "unable to create backing store for hugepages"); + "unable to create backing store for path %s", path); g_free(filename); goto error; } unlink(filename); g_free(filename); Looks like we are still calling mkstemp/unlink here. How does this work? Hmm? We have got the fd so the file can be safely unlinked (kernel does not actually unlink the file since mkstemp() holds a refcount.). And this patch just renames the variables, no logic changed. Will drop it from the serials for now on. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 19/33] dimm: keep the state of the whole backend memory
On Fri, Oct 30, 2015 at 01:56:13PM +0800, Xiao Guangrong wrote: > QEMU keeps the state of memory of dimm device during live migration, > however, it is not enough for nvdimm device as its memory does not > contain its label data, so that we should protect the whole backend > memory instead > > Signed-off-by: Xiao Guangrong It looks like there's now a difference between host_memory_backend_get_memory and get_memory_region, whereas previously they were exactly interchangeable. This needs some thought, in particular the theoretically generic dimm.c has to do tricks to accomodate nvdimm. The missing piece for NVDIMM is the 128k label space at the end, right? Can't nvdimm specific code just register that as a separate RAM chunk in order to migrate it? > --- > hw/mem/dimm.c | 14 -- > 1 file changed, 12 insertions(+), 2 deletions(-) > > diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c > index 498d380..7d1 100644 > --- a/hw/mem/dimm.c > +++ b/hw/mem/dimm.c > @@ -134,9 +134,16 @@ void dimm_memory_plug(DeviceState *dev, > MemoryHotplugState *hpms, > } > > memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr); > -vmstate_register_ram(mr, dev); > numa_set_mem_node_id(addr, memory_region_size(mr), dimm->node); > > +/* > + * save the state only for @mr is not enough as it does not contain > + * the label data of NVDIMM device, so that we keep the state of > + * whole hostmem instead. > + */ > +vmstate_register_ram(host_memory_backend_get_memory(dimm->hostmem, errp), > + dev); > + > out: > error_propagate(errp, local_err); > } > @@ -145,10 +152,13 @@ void dimm_memory_unplug(DeviceState *dev, > MemoryHotplugState *hpms, > MemoryRegion *mr) > { > DIMMDevice *dimm = DIMM(dev); > +MemoryRegion *backend_mr; > + > +backend_mr = host_memory_backend_get_memory(dimm->hostmem, &error_abort); > > numa_unset_mem_node_id(dimm->addr, memory_region_size(mr), dimm->node); > memory_region_del_subregion(&hpms->mr, mr); > -vmstate_unregister_ram(mr, dev); > +vmstate_unregister_ram(backend_mr, dev); > } > > int qmp_dimm_device_list(Object *obj, void *opaque) > -- > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 07/33] util: introduce qemu_file_get_page_size()
On 11/09/2015 06:33 PM, Michael S. Tsirkin wrote: On Fri, Oct 30, 2015 at 01:56:01PM +0800, Xiao Guangrong wrote: There are three places use the some logic to get the page size on the file path or file fd This patch introduces qemu_file_get_page_size() to unify the code Signed-off-by: Xiao Guangrong --- include/qemu/osdep.h | 1 + target-ppc/kvm.c | 21 +++-- util/oslib-posix.c | 16 util/oslib-win32.c | 5 + 4 files changed, 25 insertions(+), 18 deletions(-) diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h index b568424..d4dde02 100644 --- a/include/qemu/osdep.h +++ b/include/qemu/osdep.h @@ -302,4 +302,5 @@ int qemu_read_password(char *buf, int buf_size); */ pid_t qemu_fork(Error **errp); +size_t qemu_file_get_page_size(const char *mem_path); #endif diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c index ac70f08..c661f1c 100644 --- a/target-ppc/kvm.c +++ b/target-ppc/kvm.c @@ -308,28 +308,13 @@ static void kvm_get_smmu_info(PowerPCCPU *cpu, struct kvm_ppc_smmu_info *info) static long gethugepagesize(const char *mem_path) { -struct statfs fs; -int ret; - -do { -ret = statfs(mem_path, &fs); -} while (ret != 0 && errno == EINTR); +long size = qemu_file_get_page_size(mem_path); -if (ret != 0) { -fprintf(stderr, "Couldn't statfs() memory path: %s\n", -strerror(errno)); +if (!size) { exit(1); } -#define HUGETLBFS_MAGIC 0x958458f6 - -if (fs.f_type != HUGETLBFS_MAGIC) { -/* Explicit mempath, but it's ordinary pages */ -return getpagesize(); -} - -/* It's hugepage, return the huge page size */ -return fs.f_bsize; +return size; } static int find_max_supported_pagesize(Object *obj, void *opaque) diff --git a/util/oslib-posix.c b/util/oslib-posix.c index 914cef5..ad94c5a 100644 --- a/util/oslib-posix.c +++ b/util/oslib-posix.c @@ -360,6 +360,22 @@ static size_t fd_getpagesize(int fd) return getpagesize(); } +size_t qemu_file_get_page_size(const char *path) +{ +size_t size = 0; +int fd = qemu_open(path, O_RDONLY); + +if (fd < 0) { +fprintf(stderr, "Could not open %s.\n", path); +goto exit; +} + +size = fd_getpagesize(fd); +qemu_close(fd); +exit: +return size; +} + void os_mem_prealloc(int fd, char *area, size_t memory) { int ret; So this is opening the file for the sole purpose of doing the fstatfs on it. Seems strange, just do statfs instead. In fact, maybe we want statfs_getpagesize. It is just to reuse the code of fd_getpagesize() which already has the logic to check pagesize. diff --git a/util/oslib-win32.c b/util/oslib-win32.c index 09f9e98..a18aa87 100644 --- a/util/oslib-win32.c +++ b/util/oslib-win32.c @@ -462,6 +462,11 @@ size_t getpagesize(void) return system_info.dwPageSize; } +size_t qemu_file_get_page_size(const char *path) +{ +return getpagesize(); +} + void os_mem_prealloc(int fd, char *area, size_t memory) { int i; And why is this needed on win32? It is not actually used, just make osdep.h happy which has qemu_file_get_page_size() declare. BTW, i will drop this patch for now on. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 11/33] hostmem-file: use whole file size if possible
On 11/09/2015 06:17 PM, Michael S. Tsirkin wrote: On Fri, Oct 30, 2015 at 01:56:05PM +0800, Xiao Guangrong wrote: Use the whole file size if @size is not specified which is useful if we want to directly pass a file to guest Signed-off-by: Xiao Guangrong Better split these simplifications off from the series. Yep, i am also little tired to respin this long series, will drop this one. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 09/33] exec: allow file_ram_alloc to work on file
On 11/09/2015 06:13 PM, Michael S. Tsirkin wrote: On Fri, Oct 30, 2015 at 01:56:03PM +0800, Xiao Guangrong wrote: Currently, file_ram_alloc() only works on directory - it creates a file under @path and do mmap on it This patch tries to allow it to work on file directly, if @path is a directory it works as before, otherwise it treats @path as the target file then directly allocate memory from it Signed-off-by: Xiao Guangrong --- exec.c | 80 ++ 1 file changed, 51 insertions(+), 29 deletions(-) diff --git a/exec.c b/exec.c index 3ca7e50..f219010 100644 --- a/exec.c +++ b/exec.c @@ -1174,14 +1174,60 @@ void qemu_mutex_unlock_ramlist(void) } #ifdef __linux__ +static bool path_is_dir(const char *path) +{ +struct stat fs; + +return stat(path, &fs) == 0 && S_ISDIR(fs.st_mode); This means file doesn't exist is treated as a file. Can't figure out if that's intentional, should be documented in any case. The path is dir only if it passes S_ISDIR test. Otherwise, it is treated as regular file. Any error on the file will be figured out by open() and mmap() later. +} + +static int open_file_path(RAMBlock *block, const char *path, size_t size) +{ +char *filename; +char *sanitized_name; +char *c; +int fd; + +if (!path_is_dir(path)) { +int flags = (block->flags & RAM_SHARED) ? O_RDWR : O_RDONLY; Why does this make sense? RAM_SHARED means its write is invisible to others also it does not actual change the content of the file. Readonly permission is enough for this case. It is also help us to pass a readonly file to the guess but discard its change after guest shutdown. + +flags |= O_EXCL; And why does this makes sense? It' used if we pass a block device as a NVDIMM backend memory: O_EXCL can be used without O_CREAT if pathname refers to a block device. If the block device is in use by the system (e.g., mounted), open() fails with the error EBUSY BTW, the similar logic has already been in upsteam QEMU which is introduced by commit 8d31d6b6 (backends/hostmem-file: Allow to specify full pathname for backing file). I will drop these patches: util: introduce qemu_file_get_page_size() exec: allow memory to be allocated from any kind of path exec: allow file_ram_alloc to work on file -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 12/33] pc-dimm: remove DEFAULT_PC_DIMMSIZE
On Fri, Oct 30, 2015 at 01:56:06PM +0800, Xiao Guangrong wrote: > It's not used any more > > Signed-off-by: Xiao Guangrong You should leave the renames and cleanups off for later. This patchset is large enough as it is. > --- > include/hw/mem/pc-dimm.h | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h > index d83bf30..11a8937 100644 > --- a/include/hw/mem/pc-dimm.h > +++ b/include/hw/mem/pc-dimm.h > @@ -20,8 +20,6 @@ > #include "sysemu/hostmem.h" > #include "hw/qdev.h" > > -#define DEFAULT_PC_DIMMSIZE (1024*1024*1024) > - > #define TYPE_PC_DIMM "pc-dimm" > #define PC_DIMM(obj) \ > OBJECT_CHECK(PCDIMMDevice, (obj), TYPE_PC_DIMM) > -- > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 08/33] exec: allow memory to be allocated from any kind of path
On Fri, Oct 30, 2015 at 01:56:02PM +0800, Xiao Guangrong wrote: > Currently file_ram_alloc() is designed for hugetlbfs, however, the memory > of nvdimm can come from either raw pmem device eg, /dev/pmem, or the file > locates at DAX enabled filesystem > > So this patch let it work on any kind of path > > Signed-off-by: Xiao Guangrong So this allows regular memory to be specified directly. This needs to be split out and merged separately from acpi/nvdimm bits. Alternatively, if it's possible to use nvdimm with DAX fs (similar to hugetlbfs), leave these patches off for now. > --- > exec.c | 56 +--- > 1 file changed, 17 insertions(+), 39 deletions(-) > > diff --git a/exec.c b/exec.c > index 8af2570..3ca7e50 100644 > --- a/exec.c > +++ b/exec.c > @@ -1174,32 +1174,6 @@ void qemu_mutex_unlock_ramlist(void) > } > > #ifdef __linux__ > - > -#include > - > -#define HUGETLBFS_MAGIC 0x958458f6 > - > -static long gethugepagesize(const char *path, Error **errp) > -{ > -struct statfs fs; > -int ret; > - > -do { > -ret = statfs(path, &fs); > -} while (ret != 0 && errno == EINTR); > - > -if (ret != 0) { > -error_setg_errno(errp, errno, "failed to get page size of file %s", > - path); > -return 0; > -} > - > -if (fs.f_type != HUGETLBFS_MAGIC) > -fprintf(stderr, "Warning: path not on HugeTLBFS: %s\n", path); > - > -return fs.f_bsize; > -} > - > static void *file_ram_alloc(RAMBlock *block, > ram_addr_t memory, > const char *path, > @@ -1210,20 +1184,24 @@ static void *file_ram_alloc(RAMBlock *block, > char *c; > void *area; > int fd; > -uint64_t hpagesize; > -Error *local_err = NULL; > +uint64_t pagesize; > > -hpagesize = gethugepagesize(path, &local_err); > -if (local_err) { > -error_propagate(errp, local_err); > +pagesize = qemu_file_get_page_size(path); > +if (!pagesize) { > +error_setg(errp, "can't get page size for %s", path); > goto error; > } > -block->mr->align = hpagesize; > > -if (memory < hpagesize) { > +if (pagesize == getpagesize()) { > +fprintf(stderr, "Memory is not allocated from HugeTlbfs.\n"); > +} > + > +block->mr->align = pagesize; > + > +if (memory < pagesize) { > error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to " > - "or larger than huge page size 0x%" PRIx64, > - memory, hpagesize); > + "or larger than page size 0x%" PRIx64, > + memory, pagesize); > goto error; > } > > @@ -1247,14 +1225,14 @@ static void *file_ram_alloc(RAMBlock *block, > fd = mkstemp(filename); > if (fd < 0) { > error_setg_errno(errp, errno, > - "unable to create backing store for hugepages"); > + "unable to create backing store for path %s", path); > g_free(filename); > goto error; > } > unlink(filename); > g_free(filename); Looks like we are still calling mkstemp/unlink here. How does this work? > > -memory = ROUND_UP(memory, hpagesize); > +memory = ROUND_UP(memory, pagesize); > > /* > * ftruncate is not supported by hugetlbfs in older > @@ -1266,10 +1244,10 @@ static void *file_ram_alloc(RAMBlock *block, > perror("ftruncate"); > } > > -area = qemu_ram_mmap(fd, memory, hpagesize, block->flags & RAM_SHARED); > +area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED); > if (area == MAP_FAILED) { > error_setg_errno(errp, errno, > - "unable to map backing store for hugepages"); > + "unable to map backing store for path %s", path); > close(fd); > goto error; > } > -- > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 07/33] util: introduce qemu_file_get_page_size()
On Fri, Oct 30, 2015 at 01:56:01PM +0800, Xiao Guangrong wrote: > There are three places use the some logic to get the page size on > the file path or file fd > > This patch introduces qemu_file_get_page_size() to unify the code > > Signed-off-by: Xiao Guangrong > --- > include/qemu/osdep.h | 1 + > target-ppc/kvm.c | 21 +++-- > util/oslib-posix.c | 16 > util/oslib-win32.c | 5 + > 4 files changed, 25 insertions(+), 18 deletions(-) > > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h > index b568424..d4dde02 100644 > --- a/include/qemu/osdep.h > +++ b/include/qemu/osdep.h > @@ -302,4 +302,5 @@ int qemu_read_password(char *buf, int buf_size); > */ > pid_t qemu_fork(Error **errp); > > +size_t qemu_file_get_page_size(const char *mem_path); > #endif > diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c > index ac70f08..c661f1c 100644 > --- a/target-ppc/kvm.c > +++ b/target-ppc/kvm.c > @@ -308,28 +308,13 @@ static void kvm_get_smmu_info(PowerPCCPU *cpu, struct > kvm_ppc_smmu_info *info) > > static long gethugepagesize(const char *mem_path) > { > -struct statfs fs; > -int ret; > - > -do { > -ret = statfs(mem_path, &fs); > -} while (ret != 0 && errno == EINTR); > +long size = qemu_file_get_page_size(mem_path); > > -if (ret != 0) { > -fprintf(stderr, "Couldn't statfs() memory path: %s\n", > -strerror(errno)); > +if (!size) { > exit(1); > } > > -#define HUGETLBFS_MAGIC 0x958458f6 > - > -if (fs.f_type != HUGETLBFS_MAGIC) { > -/* Explicit mempath, but it's ordinary pages */ > -return getpagesize(); > -} > - > -/* It's hugepage, return the huge page size */ > -return fs.f_bsize; > +return size; > } > > static int find_max_supported_pagesize(Object *obj, void *opaque) > diff --git a/util/oslib-posix.c b/util/oslib-posix.c > index 914cef5..ad94c5a 100644 > --- a/util/oslib-posix.c > +++ b/util/oslib-posix.c > @@ -360,6 +360,22 @@ static size_t fd_getpagesize(int fd) > return getpagesize(); > } > > +size_t qemu_file_get_page_size(const char *path) > +{ > +size_t size = 0; > +int fd = qemu_open(path, O_RDONLY); > + > +if (fd < 0) { > +fprintf(stderr, "Could not open %s.\n", path); > +goto exit; > +} > + > +size = fd_getpagesize(fd); > +qemu_close(fd); > +exit: > +return size; > +} > + > void os_mem_prealloc(int fd, char *area, size_t memory) > { > int ret; So this is opening the file for the sole purpose of doing the fstatfs on it. Seems strange, just do statfs instead. In fact, maybe we want statfs_getpagesize. > diff --git a/util/oslib-win32.c b/util/oslib-win32.c > index 09f9e98..a18aa87 100644 > --- a/util/oslib-win32.c > +++ b/util/oslib-win32.c > @@ -462,6 +462,11 @@ size_t getpagesize(void) > return system_info.dwPageSize; > } > > +size_t qemu_file_get_page_size(const char *path) > +{ > +return getpagesize(); > +} > + > void os_mem_prealloc(int fd, char *area, size_t memory) > { > int i; And why is this needed on win32? > -- > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 11/33] hostmem-file: use whole file size if possible
On Fri, Oct 30, 2015 at 01:56:05PM +0800, Xiao Guangrong wrote: > Use the whole file size if @size is not specified which is useful > if we want to directly pass a file to guest > > Signed-off-by: Xiao Guangrong Better split these simplifications off from the series. > --- > backends/hostmem-file.c | 48 > 1 file changed, 44 insertions(+), 4 deletions(-) > > diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c > index 9097a57..e1bc9ff 100644 > --- a/backends/hostmem-file.c > +++ b/backends/hostmem-file.c > @@ -9,6 +9,9 @@ > * This work is licensed under the terms of the GNU GPL, version 2 or later. > * See the COPYING file in the top-level directory. > */ > +#include > +#include > + > #include "qemu-common.h" > #include "sysemu/hostmem.h" > #include "sysemu/sysemu.h" > @@ -33,20 +36,57 @@ struct HostMemoryBackendFile { > char *mem_path; > }; > > +static uint64_t get_file_size(const char *file) > +{ > +struct stat stat_buf; > +uint64_t size = 0; > +int fd; > + > +fd = open(file, O_RDONLY); > +if (fd < 0) { > +return 0; > +} > + > +if (stat(file, &stat_buf) < 0) { > +goto exit; > +} > + > +if ((S_ISBLK(stat_buf.st_mode)) && !ioctl(fd, BLKGETSIZE64, &size)) { You must test S_ISDIR too. > +goto exit; > +} > + > +size = lseek(fd, 0, SEEK_END); > +if (size == -1) { > +size = 0; > +} > +exit: > +close(fd); > +return size; > +} > + > static void > file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) > { > HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend); > > -if (!backend->size) { > -error_setg(errp, "can't create backend with size 0"); > -return; > -} > if (!fb->mem_path) { > error_setg(errp, "mem-path property not set"); > return; > } > > +if (!backend->size) { > +/* > + * use the whole file size if @size is not specified. > + */ > +backend->size = get_file_size(fb->mem_path); > +} > + > +if (!backend->size) { > +error_setg(errp, "failed to get file size for %s, can't create " > + "backend on it", mem_path); > +return; > +} > + > backend->force_prealloc = mem_prealloc; > memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), > object_get_canonical_path(OBJECT(backend)), > -- > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] KVM: x86: MMU: Clean up x86's mmu code for future work
On 06/11/2015 08:20, Takuya Yoshikawa wrote: > Patch 1/2/3 are easy ones. > > Following two, patch 4/5, may not be ideal solutions, but at least > explain, or try to explain, the problems. They are okay! I replied to patch 5 with a suggestion for further cleanup. I'll apply them for 4.5. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: x86: MMU: Consolidate WARN_ON/BUG_ON checks for reverse-mapped sptes
On 06/11/2015 08:25, Takuya Yoshikawa wrote: > At some call sites of rmap_get_first() and rmap_get_next(), BUG_ON is > placed right after the call to detect unrelated sptes which should not > be found in the reverse-mapping list. > > Move this check in rmap_get_first/next() so that all call sites, not > just the users of the for_each_rmap_spte() macro, will be checked the > same way. In addition, change the BUG_ON to WARN_ON since killing the > whole host is the last thing that KVM should try. > > One thing to keep in mind is that kvm_mmu_unlink_parents() also uses > rmap_get_first() to handle parent sptes. The change will not break it > because parent sptes are present, at least until drop_parent_pte() > actually unlinks them, and not mmio-sptes. Can you also change kvm_mmu_mark_parents_unsync to use for_each_rmap_spte instead of pte_list_walk? It is the last use of pte_list_walk, and it's nice if we have two uses of for_each_rmap_spte with parent_ptes as the argument. BTW, on my todo list is to change the rmap items to a struct (with a single u64 inside) for type safety. Since you are touching this code, perhaps you can give it a shot? Paolo > Signed-off-by: Takuya Yoshikawa > --- > Documentation/virtual/kvm/mmu.txt | 4 ++-- > arch/x86/kvm/mmu.c| 31 ++- > 2 files changed, 24 insertions(+), 11 deletions(-) > > diff --git a/Documentation/virtual/kvm/mmu.txt > b/Documentation/virtual/kvm/mmu.txt > index 3a4d681..daf9c0f 100644 > --- a/Documentation/virtual/kvm/mmu.txt > +++ b/Documentation/virtual/kvm/mmu.txt > @@ -203,10 +203,10 @@ Shadow pages contain the following information: > page cannot be destroyed. See role.invalid. >parent_ptes: > The reverse mapping for the pte/ptes pointing at this page's spt. If > -parent_ptes bit 0 is zero, only one spte points at this pages and > +parent_ptes bit 0 is zero, only one spte points at this page and > parent_ptes points at this single spte, otherwise, there exists multiple > sptes pointing at this page and (parent_ptes & ~0x1) points at a data > -structure with a list of parent_ptes. > +structure with a list of parent sptes. >unsync: > If true, then the translations in this page may not match the guest's > translation. This is equivalent to the state of the tlb when a pte is > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index c5e2363..353d752 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -1099,17 +1099,28 @@ struct rmap_iterator { > */ > static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter) > { > + u64 *sptep; > + > if (!rmap) > return NULL; > > if (!(rmap & 1)) { > iter->desc = NULL; > - return (u64 *)rmap; > + sptep = (u64 *)rmap; > + goto out; > } > > iter->desc = (struct pte_list_desc *)(rmap & ~1ul); > iter->pos = 0; > - return iter->desc->sptes[iter->pos]; > + sptep = iter->desc->sptes[iter->pos]; > +out: > + /* > + * Parent sptes found in sp->parent_ptes lists are also checked here > + * since kvm_mmu_unlink_parents() uses this function. If the condition > + * needs to be changed for them, make another wrapper function. > + */ > + WARN_ON(!is_shadow_present_pte(*sptep)); > + return sptep; > } > > /* > @@ -1119,14 +1130,14 @@ static u64 *rmap_get_first(unsigned long rmap, struct > rmap_iterator *iter) > */ > static u64 *rmap_get_next(struct rmap_iterator *iter) > { > + u64 *sptep; > + > if (iter->desc) { > if (iter->pos < PTE_LIST_EXT - 1) { > - u64 *sptep; > - > ++iter->pos; > sptep = iter->desc->sptes[iter->pos]; > if (sptep) > - return sptep; > + goto out; > } > > iter->desc = iter->desc->more; > @@ -1134,17 +1145,20 @@ static u64 *rmap_get_next(struct rmap_iterator *iter) > if (iter->desc) { > iter->pos = 0; > /* desc->sptes[0] cannot be NULL */ > - return iter->desc->sptes[iter->pos]; > + sptep = iter->desc->sptes[iter->pos]; > + goto out; > } > } > > return NULL; > +out: > + WARN_ON(!is_shadow_present_pte(*sptep)); > + return sptep; > } > > #define for_each_rmap_spte(_rmap_, _iter_, _spte_) \ > for (_spte_ = rmap_get_first(*_rmap_, _iter_); \ > - _spte_ && ({BUG_ON(!is_shadow_present_pte(*_spte_)); 1;}); \ > - _spte_ = rmap_get_next(_iter_)) > + _spte_; _spte_ = rmap_get_next(_iter_)) > > static void drop_spte(struct kvm *kvm, u64 *sptep) > { > @@ -1358,7 +1372,6 @@ static bool kvm_zap_rmap
Re: [PATCH v6 09/33] exec: allow file_ram_alloc to work on file
On Fri, Oct 30, 2015 at 01:56:03PM +0800, Xiao Guangrong wrote: > Currently, file_ram_alloc() only works on directory - it creates a file > under @path and do mmap on it > > This patch tries to allow it to work on file directly, if @path is a > directory it works as before, otherwise it treats @path as the target > file then directly allocate memory from it > > Signed-off-by: Xiao Guangrong > --- > exec.c | 80 > ++ > 1 file changed, 51 insertions(+), 29 deletions(-) > > diff --git a/exec.c b/exec.c > index 3ca7e50..f219010 100644 > --- a/exec.c > +++ b/exec.c > @@ -1174,14 +1174,60 @@ void qemu_mutex_unlock_ramlist(void) > } > > #ifdef __linux__ > +static bool path_is_dir(const char *path) > +{ > +struct stat fs; > + > +return stat(path, &fs) == 0 && S_ISDIR(fs.st_mode); This means file doesn't exist is treated as a file. Can't figure out if that's intentional, should be documented in any case. > +} > + > +static int open_file_path(RAMBlock *block, const char *path, size_t size) > +{ > +char *filename; > +char *sanitized_name; > +char *c; > +int fd; > + > +if (!path_is_dir(path)) { > +int flags = (block->flags & RAM_SHARED) ? O_RDWR : O_RDONLY; Why does this make sense? > + > +flags |= O_EXCL; And why does this makes sense? > +return open(path, flags); > +} > + > +/* Make name safe to use with mkstemp by replacing '/' with '_'. */ > +sanitized_name = g_strdup(memory_region_name(block->mr)); > +for (c = sanitized_name; *c != '\0'; c++) { > +if (*c == '/') { > +*c = '_'; > +} > +} > +filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path, > + sanitized_name); > +g_free(sanitized_name); > +fd = mkstemp(filename); > +if (fd >= 0) { > +unlink(filename); > +/* > + * ftruncate is not supported by hugetlbfs in older > + * hosts, so don't bother bailing out on errors. > + * If anything goes wrong with it under other filesystems, > + * mmap will fail. > + */ > +if (ftruncate(fd, size)) { > +perror("ftruncate"); > +} > +} > +g_free(filename); > + > +return fd; > +} > + > static void *file_ram_alloc(RAMBlock *block, > ram_addr_t memory, > const char *path, > Error **errp) > { > -char *filename; > -char *sanitized_name; > -char *c; > void *area; > int fd; > uint64_t pagesize; > @@ -1211,38 +1257,14 @@ static void *file_ram_alloc(RAMBlock *block, > goto error; > } > > -/* Make name safe to use with mkstemp by replacing '/' with '_'. */ > -sanitized_name = g_strdup(memory_region_name(block->mr)); > -for (c = sanitized_name; *c != '\0'; c++) { > -if (*c == '/') > -*c = '_'; > -} > - > -filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path, > - sanitized_name); > -g_free(sanitized_name); > +memory = ROUND_UP(memory, pagesize); > > -fd = mkstemp(filename); > +fd = open_file_path(block, path, memory); > if (fd < 0) { > error_setg_errno(errp, errno, > "unable to create backing store for path %s", path); > -g_free(filename); > goto error; > } > -unlink(filename); > -g_free(filename); > - > -memory = ROUND_UP(memory, pagesize); > - > -/* > - * ftruncate is not supported by hugetlbfs in older > - * hosts, so don't bother bailing out on errors. > - * If anything goes wrong with it under other filesystems, > - * mmap will fail. > - */ > -if (ftruncate(fd, memory)) { > -perror("ftruncate"); > -} > > area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED); > if (area == MAP_FAILED) { > -- > 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/4] KVM: VMX: enable LBR virtualization
On 2015/11/9 17:06, Paolo Bonzini wrote: On 09/11/2015 02:33, Jian Zhou wrote: Hi Paolo, May I ask that any suggestion about the version 2 of VMX LBRV? This version is updated following your advices in version 1. BTW the kvm-unit-test for this feature has sent too, and I have tested the CPUs emulated by QEMU. Hi, since these patches will not be part of Linux 4.4, I will review them after the end of the merge window (or at least, after I've finished sending material for 4.4). Thanks for your consideration. Jian Paolo . -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/4] KVM: VMX: enable LBR virtualization
On 09/11/2015 02:33, Jian Zhou wrote: > Hi Paolo, > > May I ask that any suggestion about the version 2 of VMX LBRV? > This version is updated following your advices in version 1. > BTW the kvm-unit-test for this feature has sent too, and I > have tested the CPUs emulated by QEMU. Hi, since these patches will not be part of Linux 4.4, I will review them after the end of the merge window (or at least, after I've finished sending material for 4.4). Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html