memory-hotplug : possible circular locking dependency detected
When I offline memory on linux-3.6-rc5, "possible circular locking dependency detected" messages are shown. Are these messages a known problem?

[ 201.596363] Offlined Pages 32768
[ 201.596373] remove from free list 14 1024 148000
[ 201.596493] remove from free list 140400 1024 148000
[ 201.596612] remove from free list 140800 1024 148000
[ 201.596730] remove from free list 140c00 1024 148000
[ 201.596849] remove from free list 141000 1024 148000
[ 201.596968] remove from free list 141400 1024 148000
[ 201.597049] remove from free list 141800 1024 148000
[ 201.597049] remove from free list 141c00 1024 148000
[ 201.597049] remove from free list 142000 1024 148000
[ 201.597049] remove from free list 142400 1024 148000
[ 201.597049] remove from free list 142800 1024 148000
[ 201.597049] remove from free list 142c00 1024 148000
[ 201.597049] remove from free list 143000 1024 148000
[ 201.597049] remove from free list 143400 1024 148000
[ 201.597049] remove from free list 143800 1024 148000
[ 201.597049] remove from free list 143c00 1024 148000
[ 201.597049] remove from free list 144000 1024 148000
[ 201.597049] remove from free list 144400 1024 148000
[ 201.597049] remove from free list 144800 1024 148000
[ 201.597049] remove from free list 144c00 1024 148000
[ 201.597049] remove from free list 145000 1024 148000
[ 201.597049] remove from free list 145400 1024 148000
[ 201.597049] remove from free list 145800 1024 148000
[ 201.597049] remove from free list 145c00 1024 148000
[ 201.597049] remove from free list 146000 1024 148000
[ 201.597049] remove from free list 146400 1024 148000
[ 201.597049] remove from free list 146800 1024 148000
[ 201.597049] remove from free list 146c00 1024 148000
[ 201.597049] remove from free list 147000 1024 148000
[ 201.597049] remove from free list 147400 1024 148000
[ 201.597049] remove from free list 147800 1024 148000
[ 201.597049] remove from free list 147c00 1024 148000
[ 201.602143]
[ 201.602150] ======================================================
[ 201.602153] [ INFO: possible circular locking dependency detected ]
[ 201.602157] 3.6.0-rc5 #1 Not tainted
[ 201.602159] -------------------------------------------------------
[ 201.602162] bash/2789 is trying to acquire lock:
[ 201.602164]  ((memory_chain).rwsem){.+.+.+}, at: [8109fe16] __blocking_notifier_call_chain+0x66/0xd0
[ 201.602180]
[ 201.602180] but task is already holding lock:
[ 201.602182]  (ksm_thread_mutex/1){+.+.+.}, at: [811b41fa] ksm_memory_callback+0x3a/0xc0
[ 201.602194]
[ 201.602194] which lock already depends on the new lock.
[ 201.602194]
[ 201.602197]
[ 201.602197] the existing dependency chain (in reverse order) is:
[ 201.602200]
[ 201.602200] -> #1 (ksm_thread_mutex/1){+.+.+.}:
[ 201.602208]        [810dbee9] validate_chain+0x6d9/0x7e0
[ 201.602214]        [810dc2e6] __lock_acquire+0x2f6/0x4f0
[ 201.602219]        [810dc57d] lock_acquire+0x9d/0x190
[ 201.602223]        [8166b4fc] __mutex_lock_common+0x5c/0x420
[ 201.602229]        [8166ba2a] mutex_lock_nested+0x4a/0x60
[ 201.602234]        [811b41fa] ksm_memory_callback+0x3a/0xc0
[ 201.602239]        [81673447] notifier_call_chain+0x67/0x150
[ 201.602244]        [8109fe2b] __blocking_notifier_call_chain+0x7b/0xd0
[ 201.602250]        [8109fe96] blocking_notifier_call_chain+0x16/0x20
[ 201.602255]        [8144c53b] memory_notify+0x1b/0x20
[ 201.602261]        [81653c51] offline_pages+0x1b1/0x470
[ 201.602267]        [811bfcae] remove_memory+0x1e/0x20
[ 201.602273]        [8144c661] memory_block_action+0xa1/0x190
[ 201.602278]        [8144c7c9] memory_block_change_state+0x79/0xe0
[ 201.602282]        [8144c8f2] store_mem_state+0xc2/0xd0
[ 201.602287]        [81436980] dev_attr_store+0x20/0x30
[ 201.602293]        [812498d3] sysfs_write_file+0xa3/0x100
[ 201.602299]        [811cba80] vfs_write+0xd0/0x1a0
[ 201.602304]        [811cbc54] sys_write+0x54/0xa0
[ 201.602309]        [81678529] system_call_fastpath+0x16/0x1b
[ 201.602315]
[ 201.602315] -> #0 ((memory_chain).rwsem){.+.+.+}:
[ 201.602322]        [810db7e7] check_prev_add+0x527/0x550
[ 201.602326]        [810dbee9] validate_chain+0x6d9/0x7e0
[ 201.602331]        [810dc2e6] __lock_acquire+0x2f6/0x4f0
[ 201.602335]        [810dc57d] lock_acquire+0x9d/0x190
[ 201.602340]        [8166c1a1] down_read+0x51/0xa0
[ 201.602345]        [8109fe16] __blocking_notifier_call_chain+0x66/0xd0
[ 201.602350]        [8109fe96] blocking_notifier_call_chain+0x16/0x20
[ 201.602355]        [8144c53b] memory_notify+0x1b/0x20
[ 201.602360]        [81653e67] offline_pages+0x3c7/0x470
[ 201.602365]        [811bfcae] remove_memory+0x1e/0x20
[ 201.602370]
Re: memtest 4.20+ does not work with -cpu host
On 10.09.2012 14:32, Avi Kivity wrote: On 09/10/2012 03:29 PM, Peter Lieven wrote: On 09/10/12 14:21, Gleb Natapov wrote: On Mon, Sep 10, 2012 at 02:15:49PM +0200, Paolo Bonzini wrote: Il 10/09/2012 13:52, Peter Lieven ha scritto: dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd it only works without the skip. but the msr device returns all zeroes. Hmm, the strange API of the MSR device doesn't work well with dd (dd skips to 0x194 * 8 because bs is 8. You can try this program: There is rdmsr/wrmsr in msr-tools. rdmsr returns it cannot read those MSRs. regardless if I use -cpu host or -cpu qemu64. On the host. did you get my output? #rdmsr -0 0x194 00011100 #rdmsr -0 0xce 0c0004011103 cheers, peter -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: memtest 4.20+ does not work with -cpu host
Il 13/09/2012 09:53, Peter Lieven ha scritto: rdmsr returns it cannot read those MSRs. regardless if I use -cpu host or -cpu qemu64. On the host. did you get my output? #rdmsr -0 0x194 00011100 #rdmsr -0 0xce 0c0004011103 Yes, that can help implementing it in KVM. But without a spec to understand what the bits actually mean, it's just as risky... Peter, do you have any idea where to get the spec of the memory controller MSRs in Nehalem and newer processors? Apparently, memtest is using them (and in particular 0x194) to find the speed of the FSB, or something like that. Paolo
Re: memtest 4.20+ does not work with -cpu host
On Thu, Sep 13, 2012 at 09:55:06AM +0200, Paolo Bonzini wrote: Il 13/09/2012 09:53, Peter Lieven ha scritto: rdmsr returns it cannot read those MSRs. regardless if I use -cpu host or -cpu qemu64. On the host. did you get my output? #rdmsr -0 0x194 00011100 #rdmsr -0 0xce 0c0004011103 Yes, that can help implementing it in KVM. But without a spec to understand what the bits actually mean, it's just as risky... Peter, do you have any idea where to get the spec of the memory controller MSRs in Nehalem and newer processors? Apparently, memtest is using them (and in particular 0x194) to find the speed of the FSB, or something like that. Why would anyone want to run memtest in a vm? Maybe just add those MSRs to the ignore list and that's it. -- Gleb.
Re: Multi-dimensional Paging in Nested virtualization
On Tue, Sep 11, 2012, siddhesh phadke wrote about Multi-dimensional Paging in Nested virtualization: I read the turtles project paper where they have explained how multi-dimensional page tables are built on L0. L2 is launched with an empty EPT 0-2 and EPT 0-2 is built on-the-fly. I tried to find out how this is done in kvm code but I could not find where EPT 0-2 is built. Nested EPT is not yet included in the mainline KVM. The original nested EPT code that we had written as part of the Turtles paper became obsolete when much of KVM's MMU code was rewritten. I have since rewritten the nested EPT code for the modern KVM. I sent the second (latest) version of these patches to the KVM mailing list in August, and you can find them in, for example, http://comments.gmane.org/gmane.comp.emulators.kvm.devel/95395 These patches were not yet accepted into KVM. They have bugs in various setups (which I have not yet found the time to fix, unfortunately), and some known issues found by Avi Kivity on this mailing list. Does L1 handle the ept violation first and then L0 updates its EPT0-2? How is this done? This is explained in the turtles paper, but here's the short story: L1 defines an EPT table for L2, which we call EPT12. L0 builds from this an EPT02, with L1 addresses changed to L0. Now, when L2 runs and we get an EPT violation, we exit to L0 (in nested vmx, any exit first gets to L0). L0 checks whether the translation is missing in EPT12, and if it is, it emulates an exit into L1 and injects the EPT violation into L1. But if the translation wasn't missing in EPT12, then it's L0's problem, and we just need to update EPT02. Can anybody give me some pointers about where to look into the code? Please look at the patches above. Each patch is also documented. Nadav.
--
Nadav Har'El                        |    Thursday, Sep 13 2012, 26 Elul 5772
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |error compiling committee.c: too many
http://nadav.harel.org.il           |arguments to function
Re: memtest 4.20+ does not work with -cpu host
Il 13/09/2012 09:57, Gleb Natapov ha scritto: #rdmsr -0 0x194 00011100 #rdmsr -0 0xce 0c0004011103 Yes, that can help implementing it in KVM. But without a spec to understand what the bits actually mean, it's just as risky... Peter, do you have any idea where to get the spec of the memory controller MSRs in Nehalem and newer processors? Apparently, memtest is using them (and in particular 0x194) to find the speed of the FSB, or something like that. Why would anyone will want to run memtest in a vm? May be just add those MSRs to ignore list and that's it. From the output it looks like it's basically a list of bits. Returning something sensible is better, same as for the speed scaling MSRs. Paolo
Re: memtest 4.20+ does not work with -cpu host
On Thu, Sep 13, 2012 at 10:00:26AM +0200, Paolo Bonzini wrote: Il 13/09/2012 09:57, Gleb Natapov ha scritto: #rdmsr -0 0x194 00011100 #rdmsr -0 0xce 0c0004011103 Yes, that can help implementing it in KVM. But without a spec to understand what the bits actually mean, it's just as risky... Peter, do you have any idea where to get the spec of the memory controller MSRs in Nehalem and newer processors? Apparently, memtest is using them (and in particular 0x194) to find the speed of the FSB, or something like that. Why would anyone want to run memtest in a vm? Maybe just add those MSRs to the ignore list and that's it. From the output it looks like it's basically a list of bits. Returning something sensible is better, same as for the speed scaling MSRs. Everything is a list of bits in computers :) At least 0xce is documented in the SDM. It cannot be implemented in a migration safe manner. -- Gleb.
Re: [PATCH 0/5] Optimize page table walk
On 09/13/2012 01:20 AM, Marcelo Tosatti wrote: On Wed, Sep 12, 2012 at 05:29:49PM +0300, Avi Kivity wrote: (resend due to mail server malfunction) The page table walk has gotten crufty over the years and is threatening to become even more crufty when SMAP is introduced. Clean it up (and optimize it) somewhat. What is SMAP? Supervisor Mode Access Prevention, see http://software.intel.com/sites/default/files/319433-014.pdf. -- error compiling committee.c: too many arguments to function
Re: [PATCH 0/3] Prepare kvm for lto
On 09/12/2012 10:17 PM, Andi Kleen wrote: On Wed, Sep 12, 2012 at 05:50:41PM +0300, Avi Kivity wrote: vmx.c has an lto-unfriendly bit, fix it up. While there, clean up our asm code. Avi Kivity (3): KVM: VMX: Make lto-friendly KVM: VMX: Make use of asm.h KVM: SVM: Make use of asm.h Works for me in my LTO build, thanks Avi. I cannot guarantee I always hit the unit splitting case, but it looks good so far. Actually I think patch 1 is missing a .global vmx_return. -- error compiling committee.c: too many arguments to function
Re: graphics card pci passthrough success report
On 2012-09-13 07:55, Gerd Hoffmann wrote: Hi, - Apply the patches at the end of this mail to kvm and SeaBIOS to allow for more BAR space under 4G. (The relevant BARs on the graphics cards _are_ 64 bit BARs, but kvm seemed to turn those into 32 bit BARs in the guest.) Which qemu/seabios versions have you used? qemu-1.2 (+ bundled seabios) should handle that just fine without patching. There is no fixed I/O window any more, all memory space above lowmem is available for pci, i.e. if you give 2G to your guest everything above 0x80000000. And if there isn't enough address space below 4G (if you assign lots of memory to your guest, so qemu keeps only the 0xe0000000 - 0xffffffff window free) seabios should try to map 64bit bars above 4G. - Apply the hacky patch at the end of this mail to SeaBIOS to always skip initialising the Radeon's option ROMs, or the VM would hang inside the Radeon option ROM if you boot the VM without the default cirrus video. A better way to handle that would probably be to add a pci passthrough config option to not expose the rom to the guest. -device pci-assign,option-rom=,... Any clue *why* the rom doesn't run? Maybe because we are not passing through the legacy VGA I/O ranges, maybe because the card is accessing one of the famous side channels to configure its mappings, and we do not virtualize them (as we usually do not know them). Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
[PATCHv3] KVM: optimize apic interrupt delivery
Most interrupts are delivered to only one vcpu. Use pre-built tables to find the interrupt destination instead of looping through all vcpus. In case of logical mode, loop only through vcpus in the logical cluster the irq is sent to.

Signed-off-by: Gleb Natapov g...@redhat.com
---
Changelog:

 - v2->v3
  * sparse annotation for rcu usage
  * move mutex above map
  * use mask/shift to calculate cluster/dst ids
  * use gotos
  * add comment about logic behind logical table creation

 - v1->v2
  * fix race Avi noticed
  * rcu_read_lock() out of the block as per Avi
  * fix rcu issues pointed to by MST. All but one. Still use call_rcu(). I do not think this is a serious issue; if it is, it should be solved by the RCU subsystem.
  * fix phys_map overflow pointed to by MST
  * recalculate_apic_map() does not return error any more
  * add optimization for low prio logical mode with one cpu as dst (it happens)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 64adb61..9dcfd3e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
 	struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
 };
 
+struct kvm_apic_map {
+	struct rcu_head rcu;
+	u8 ldr_bits;
+	u32 cid_shift, cid_mask, lid_mask;
+	struct kvm_lapic *phys_map[256];
+	struct kvm_lapic *logical_map[16][16];
+};
+
 struct kvm_arch {
 	unsigned int n_used_mmu_pages;
 	unsigned int n_requested_mmu_pages;
@@ -528,6 +536,8 @@ struct kvm_arch {
 	struct kvm_ioapic *vioapic;
 	struct kvm_pit *vpit;
 	int vapics_in_nmi_mode;
+	struct mutex apic_map_lock;
+	struct kvm_apic_map *apic_map;
 
 	unsigned int tss_addr;
 	struct page *apic_access_page;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 07ad628..a03d4aa 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -139,11 +139,105 @@ static inline int apic_enabled(struct kvm_lapic *apic)
 	(LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
 	 APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
 
+static inline int apic_x2apic_mode(struct kvm_lapic *apic)
+{
+	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
+}
+
 static inline int kvm_apic_id(struct kvm_lapic *apic)
 {
 	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
 }
 
+static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
+{
+	ldr >>= 32 - map->ldr_bits;
+	return (ldr >> map->cid_shift) & map->cid_mask;
+}
+
+static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
+{
+	ldr >>= (32 - map->ldr_bits);
+	return ldr & map->lid_mask;
+}
+
+static inline void recalculate_apic_map(struct kvm *kvm)
+{
+	struct kvm_apic_map *new, *old = NULL;
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
+
+	mutex_lock(&kvm->arch.apic_map_lock);
+
+	if (!new)
+		goto out;
+
+	new->ldr_bits = 8;
+	/* flat mode is default */
+	new->cid_shift = 8;
+	new->cid_mask = 0;
+	new->lid_mask = 0xff;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		struct kvm_lapic *apic = vcpu->arch.apic;
+		u16 cid, lid;
+		u32 ldr;
+
+		if (!kvm_apic_present(vcpu))
+			continue;
+
+		/*
+		 * All APICs have to be configured in the same mode by an OS.
+		 * We take advantage of this while building logical id lookup
+		 * table. After reset APICs are in xapic/flat mode, so if we
+		 * find apic with different setting we assume this is the mode
+		 * OS wants all apics to be in and build lookup table
+		 * accordingly.
+		 */
+		if (apic_x2apic_mode(apic)) {
+			new->ldr_bits = 32;
+			new->cid_shift = 16;
+			new->cid_mask = new->lid_mask = 0xffff;
+		} else if (kvm_apic_sw_enabled(apic) &&
+				!new->cid_mask /* flat mode */ &&
+				kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER) {
+			new->cid_shift = 4;
+			new->cid_mask = 0xf;
+			new->lid_mask = 0xf;
+		}
+
+		new->phys_map[kvm_apic_id(apic)] = apic;
+
+		ldr = kvm_apic_get_reg(apic, APIC_LDR);
+		cid = apic_cluster_id(new, ldr);
+		lid = apic_logical_id(new, ldr);
+
+		if (lid)
+			new->logical_map[cid][ffs(lid) - 1] = apic;
+	}
+out:
+	old = rcu_dereference_protected(kvm->arch.apic_map, 1);
+	rcu_assign_pointer(kvm->arch.apic_map, new);
+	mutex_unlock(&kvm->arch.apic_map_lock);
+
+	if (old)
+		kfree_rcu(old, rcu);
+}
+
+static inline void
Re: [PATCHv3] KVM: optimize apic interrupt delivery
On 09/13/2012 12:00 PM, Gleb Natapov wrote: Most interrupts are delivered to only one vcpu. Use pre-built tables to find the interrupt destination instead of looping through all vcpus. In case of logical mode, loop only through vcpus in the logical cluster the irq is sent to. Looks good. -- error compiling committee.c: too many arguments to function
[Bug 47451] New: need to re-load driver in guest to make a hot-plug VF work
https://bugzilla.kernel.org/show_bug.cgi?id=47451 Summary: need to re-load driver in guest to make a hot-plug VF work Product: Virtualization Version: unspecified Kernel Version: 3.5.0 Platform: All OS/Version: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: kvm AssignedTo: virtualization_...@kernel-bugs.osdl.org ReportedBy: yongjie@intel.com Regression: Yes

Environment:
Host OS (ia32/ia32e/IA64): ia32e
Guest OS (ia32/ia32e/IA64): ia32e
Guest OS Type (Linux/Windows): Linux (RHEL6u3)
kvm.git Commit: 37e41afa97307a3e54b200a5c9179ada1632a844 (master branch)
qemu-kvm Commit: 28c3a9b197900c88f27b14f8862a7a15c00dc7f0 (master branch)
Host Kernel Version: 3.5.0-rc6 (Also exists in 3.6.0-rc3)
Hardware: Romley-EP (SandyBridge system)

Bug detailed description:
--------------------------
After hot plugging a VF to a Linux guest (e.g. RHEL6.3) in the qemu monitor, the VF cannot work in the guest by default. I need to remove the VF driver (e.g. igbvf, ixgbevf) and probe it again, then the VF can work in the guest. NIC: Intel 82599 NIC, Intel 82576 NIC

The VF driver needn't be reloaded in the hot-plug case when using an old kernel, so this is a regression in the kernel. (commits are in the kvm.git and qemu-kvm.git trees)
kvm + qemu-kvm = result
37e41afa + 28c3a9b1 = bad
322728e5 + 28c3a9b1 = good

Note:
1. When assigning a VF in the qemu-kvm command line (not hot-plug), the VF can work fine after boot-up.
2. It's easier to reproduce this in a guest with 512/1024MB memory and 1/2 vCPUs.
3. Can't always reproduce with 2048MB and 2 vCPUs. (Not very stable.)

Reproduce steps:
1. start up a host with kvm
2. qemu-system-x86_64 -m 512 -smp 2 -net none -hda /root/rhel6u3.img
3. switch to the qemu monitor (Ctrl+Alt+2)
4. device_add pci-assign,host=02:10.0,id=mynic (02:10.0 is the VF's BDF number.)
5. switch to the guest (Ctrl+Alt+1)
6. check network of the VF. (it can't work)
7. remove the VF driver in the guest ('rmmod igbvf')
8. re-probe the VF driver in the guest ('modprobe igbvf')
9. check network of the VF. (It should work this time.)
Current result: The VF cannot work in the guest by default. Need to re-load the VF driver in the guest.

Expected result: VF works well in the guest by default after hot-plug.

-- Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are watching the assignee of the bug.
Biweekly KVM Test report, kernel 9a781977... qemu 4c3e02be...
Hi All, This is the KVM upstream test result against the kvm.git next branch and the qemu-kvm.git master branch.

kvm.git next branch: 9a7819774e4236e8736a074b7e85276967911924 based on kernel 3.6.0-rc3
qemu-kvm.git master branch: 4c3e02beed9878a5f760eeceb6cd42c475cf0127

We found 1 new bug and no bug fixed in the past two weeks.

New issue (1):
1. need to re-load driver in guest to make a hot-plug VF work
   https://bugzilla.kernel.org/show_bug.cgi?id=47451
   -- It's a regression in the kernel side about 2 or 3 months ago.

Fixed issue (0):

Old issues (5):
--------------
1. Nested-virt: L1 (kvm on kvm) guest panic with parameter -cpu host in qemu command line.
   https://bugs.launchpad.net/qemu/+bug/994378
2. Can't install or boot up 32bit win8 guest.
   https://bugs.launchpad.net/qemu/+bug/1007269
3. vCPU hot-add makes the guest abort.
   https://bugs.launchpad.net/qemu/+bug/1019179
4. Nested Virt: VMX can't be initialized in L1 Xen (Xen on KVM)
   https://bugzilla.kernel.org/show_bug.cgi?id=45931
5. Guest has no xsave feature with parameter -cpu qemu64,+xsave in qemu command line.
   https://bugs.launchpad.net/qemu/+bug/1042561

Test environment:
=================
Platform        Westmere-EP     Sandybridge-EP
CPU Cores       24              32
Memory size     24G             32G

Best Regards, Yongjie Ren (Jay)
Re: graphics card pci passthrough success report
On Thu, Sep 13, 2012 at 07:55:00AM +0200, Gerd Hoffmann wrote: Hi, Hi, - Apply the patches at the end of this mail to kvm and SeaBIOS to allow for more BAR space under 4G. (The relevant BARs on the graphics cards _are_ 64 bit BARs, but kvm seemed to turn those into 32 bit BARs in the guest.) Which qemu/seabios versions have you used? qemu-1.2 (+ bundled seabios) should handle that just fine without patching. There is no fixed I/O window any more, all memory space above lowmem is available for pci, i.e. if you give 2G to your guest everything above 0x80000000. And if there isn't enough address space below 4G (if you assign lots of memory to your guest, so qemu keeps only the 0xe0000000 - 0xffffffff window free) seabios should try to map 64bit bars above 4G. This was some time ago, on (L)ubuntu 12.04, which has qemu-kvm 1.0 and seabios 0.6.2. We'll retry on a newer distro soon. - Apply the hacky patch at the end of this mail to SeaBIOS to always skip initialising the Radeon's option ROMs, or the VM would hang inside the Radeon option ROM if you boot the VM without the default cirrus video. A better way to handle that would probably be to add a pci passthrough config option to not expose the rom to the guest. Any clue *why* the rom doesn't run? No idea, we didn't look into that -- this was just a one afternoon hacking session. thanks, Lennert
Re: [PATCH 6/5] KVM: MMU: Optimize is_last_gpte()
On 09/12/2012 09:03 PM, Avi Kivity wrote: On 09/12/2012 08:49 PM, Avi Kivity wrote: Instead of branchy code depending on level, gpte.ps, and mmu configuration, prepare everything in a bitmap during mode changes and look it up during runtime. 6/5 is buggy, sorry, will update it tomorrow.

8<--------------------8<--------------------
From: Avi Kivity a...@redhat.com
Date: Wed, 12 Sep 2012 20:46:56 +0300
Subject: [PATCH v2 6/5] KVM: MMU: Optimize is_last_gpte()

Instead of branchy code depending on level, gpte.ps, and mmu configuration, prepare everything in a bitmap during mode changes and look it up during runtime.

Signed-off-by: Avi Kivity a...@redhat.com
---
v2: rearrange bitmap (one less shift)
    avoid stomping on local variable
    fix index calculation
    move check back to a function

 arch/x86/include/asm/kvm_host.h |  7 +++++++
 arch/x86/kvm/mmu.c              | 31 +++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu.h              |  3 ++-
 arch/x86/kvm/paging_tmpl.h      | 22 +---------------------
 4 files changed, 41 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3318bde..f9a48cf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -298,6 +298,13 @@ struct kvm_mmu {
 	u64 *lm_root;
 	u64 rsvd_bits_mask[2][4];
 
+	/*
+	 * Bitmap: bit set = last pte in walk
+	 * index[0:1]: level
+	 * index[2]: pte.ps
+	 */
+	u8 last_pte_bitmap;
+
 	bool nx;
 
 	u64 pdptrs[4]; /* pae */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ce78408..32fe597 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3447,6 +3447,15 @@ static inline unsigned gpte_access(struct kvm_vcpu *vcpu, u64 gpte)
 	return access;
 }
 
+static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gpte)
+{
+	unsigned index;
+
+	index = level - 1;
+	index |= (gpte & PT_PAGE_SIZE_MASK) >> (PT_PAGE_SIZE_SHIFT - 2);
+	return mmu->last_pte_bitmap & (1 << index);
+}
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
@@ -3548,6 +3557,24 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu
 	}
 }
 
+static void update_last_pte_bitmap(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
+{
+	u8 map;
+	unsigned level, root_level = mmu->root_level;
+	const unsigned ps_set_index = 1 << 2;  /* bit 2 of index: ps */
+
+	if (root_level == PT32E_ROOT_LEVEL)
+		--root_level;
+	/* PT_PAGE_TABLE_LEVEL always terminates */
+	map = 1 | (1 << ps_set_index);
+	for (level = PT_DIRECTORY_LEVEL; level <= root_level; ++level) {
+		if (level <= PT_PDPE_LEVEL
+		    && (mmu->root_level >= PT32E_ROOT_LEVEL || is_pse(vcpu)))
+			map |= 1 << (ps_set_index | (level - 1));
+	}
+	mmu->last_pte_bitmap = map;
+}
+
 static int paging64_init_context_common(struct kvm_vcpu *vcpu,
 					struct kvm_mmu *context,
 					int level)
@@ -3557,6 +3584,7 @@ static int paging64_init_context_common(struct kvm_vcpu *vcpu,
 	reset_rsvds_bits_mask(vcpu, context);
 
 	update_permission_bitmask(vcpu, context);
+	update_last_pte_bitmap(vcpu, context);
 
 	ASSERT(is_pae(vcpu));
 	context->new_cr3 = paging_new_cr3;
@@ -3586,6 +3614,7 @@ static int paging32_init_context(struct kvm_vcpu *vcpu,
 	reset_rsvds_bits_mask(vcpu, context);
 
 	update_permission_bitmask(vcpu, context);
+	update_last_pte_bitmap(vcpu, context);
 
 	context->new_cr3 = paging_new_cr3;
 	context->page_fault = paging32_page_fault;
@@ -3647,6 +3676,7 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 	}
 
 	update_permission_bitmask(vcpu, context);
+	update_last_pte_bitmap(vcpu, context);
 
 	return 0;
 }
@@ -3724,6 +3754,7 @@ static int init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
 	}
 
 	update_permission_bitmask(vcpu, g_context);
+	update_last_pte_bitmap(vcpu, g_context);
 
 	return 0;
 }
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 143ee70..b08dd34 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -20,7 +20,8 @@
 #define PT_ACCESSED_MASK (1ULL << PT_ACCESSED_SHIFT)
 #define PT_DIRTY_SHIFT 6
 #define PT_DIRTY_MASK (1ULL << PT_DIRTY_SHIFT)
-#define PT_PAGE_SIZE_MASK (1ULL << 7)
+#define PT_PAGE_SIZE_SHIFT 7
+#define PT_PAGE_SIZE_MASK (1ULL << PT_PAGE_SIZE_SHIFT)
 #define PT_PAT_MASK (1ULL << 7)
 #define PT_GLOBAL_MASK (1ULL << 8)
 #define PT64_NX_SHIFT 63
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index eb4a668..ec1e101 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -101,24 +101,6 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	return (ret != orig_pte);
 }
 
-static bool
Re: Windows VM slow boot
On Wed, Sep 12, 2012 at 05:46:15PM +0100, Richard Davies wrote: Hi Mel - thanks for replying to my underhand bcc! Mel Gorman wrote: I see that this is an old-ish bug but I did not read the full history. Is it now booting faster than 3.5.0 was? I'm asking because I'm interested to see if commit c67fe375 helped your particular case. Yes, I think 3.6.0-rc5 is already better than 3.5.x but can still be improved, as discussed. What are the boot times for each kernel? PATCH SNIPPED I have applied and tested again - perf results below. isolate_migratepages_range is indeed much reduced. There is now a lot of time in isolate_freepages_block and still quite a lot of lock contention, although in a different place. This on top please. ---8--- From: Shaohua Li s...@fusionio.com compaction: abort compaction loop if lock is contended or run too long isolate_migratepages_range() might isolate none pages, for example, when zone-lru_lock is contended and compaction is async. In this case, we should abort compaction, otherwise, compact_zone will run a useless loop and make zone-lru_lock is even contended. V2: only abort the compaction if lock is contended or run too long Rearranged the code by Andrea Arcangeli. 
[minc...@kernel.org: Putback pages isolated for migration if aborting] [a...@linux-foundation.org: Fixup one contended usage site] Signed-off-by: Andrea Arcangeli aarca...@redhat.com Signed-off-by: Shaohua Li s...@fusionio.com Signed-off-by: Mel Gorman mgor...@suse.de --- mm/compaction.c | 17 - mm/internal.h |2 +- 2 files changed, 13 insertions(+), 6 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 7fcd3a5..a8de20d 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -70,8 +70,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags, /* async aborts if taking too long or contended */ if (!cc-sync) { - if (cc-contended) - *cc-contended = true; + cc-contended = true; return false; } @@ -634,7 +633,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone, /* Perform the isolation */ low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn); - if (!low_pfn) + if (!low_pfn || cc-contended) return ISOLATE_ABORT; cc-migrate_pfn = low_pfn; @@ -787,6 +786,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc) switch (isolate_migratepages(zone, cc)) { case ISOLATE_ABORT: ret = COMPACT_PARTIAL; + putback_lru_pages(cc-migratepages); + cc-nr_migratepages = 0; goto out; case ISOLATE_NONE: continue; @@ -831,6 +832,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order, gfp_t gfp_mask, bool sync, bool *contended) { + unsigned long ret; struct compact_control cc = { .nr_freepages = 0, .nr_migratepages = 0, @@ -838,12 +840,17 @@ static unsigned long compact_zone_order(struct zone *zone, .migratetype = allocflags_to_migratetype(gfp_mask), .zone = zone, .sync = sync, - .contended = contended, }; INIT_LIST_HEAD(cc.freepages); INIT_LIST_HEAD(cc.migratepages); - return compact_zone(zone, cc); + ret = compact_zone(zone, cc); + + VM_BUG_ON(!list_empty(cc.freepages)); + VM_BUG_ON(!list_empty(cc.migratepages)); + + *contended = cc.contended; + return ret; } int sysctl_extfrag_threshold = 
500;
diff --git a/mm/internal.h b/mm/internal.h
index b8c91b3..4bd7c0e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -130,7 +130,7 @@ struct compact_control {
 	int order;			/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
-	bool *contended;		/* True if a lock was contended */
+	bool contended;			/* True if a lock was contended */
 };

 unsigned long
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
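The control flow this patch introduces can be modeled outside the kernel. The sketch below is a hypothetical simplification (made-up struct and function names, no real locking): the point is only that async compaction records contention in the control struct itself, and the driver loop aborts when the scan made no progress or contention was recorded, instead of spinning on an already-contended lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical simplification of the pattern in the patch above:
 * the control struct carries a plain bool `contended` (not a bool *),
 * the scan reports contention by setting it, and the driver aborts. */
struct compact_ctl {
    bool sync;       /* synchronous compaction may block on the lock */
    bool contended;  /* set when an async scan hit lock contention  */
};

/* Pretend lock acquisition: async mode refuses to wait. */
static bool checklock(struct compact_ctl *cc, bool lock_busy)
{
    if (lock_busy && !cc->sync) {
        cc->contended = true;  /* record it in the struct itself */
        return false;
    }
    return true;
}

/* Scan step: returns 0 on failure, a nonzero "pfn" on progress. */
static unsigned long isolate_step(struct compact_ctl *cc, bool lock_busy)
{
    if (!checklock(cc, lock_busy))
        return 0;
    return 0x1000;  /* made-up progress marker */
}

/* Driver: abort when the scan failed *or* contention was recorded,
 * mirroring the `if (!low_pfn || cc->contended)` hunk. */
static int compact(struct compact_ctl *cc, bool lock_busy)
{
    unsigned long pfn = isolate_step(cc, lock_busy);
    if (!pfn || cc->contended)
        return -1;  /* ISOLATE_ABORT: caller puts back isolated pages */
    return 0;
}
```

The caller can then read `cc.contended` after the run, which is exactly what the rewritten compact_zone_order() does via `*contended = cc.contended`.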
Re: qemu-kvm loops after kernel update
On 09/12/2012 09:11 PM, Jiri Slaby wrote: On 09/12/2012 10:18 AM, Avi Kivity wrote: On 09/12/2012 11:13 AM, Jiri Slaby wrote: Please provide the output of vmxcap (http://goo.gl/c5lUO), Unrestricted guest no The big real mode fixes. and a snapshot of kvm_stat while the guest is hung. kvm statistics exits 6778198 615942 host_state_reload 1988 187 irq_exits 1523 138 mmu_cache_miss 4 0 fpu_reload 1 0 Please run this as root so we get the tracepoint based output; and press 'x' when it's running so we get more detailed output. kvm statistics kvm_exit 13798699 330708 kvm_entry 13799110 330708 kvm_page_fault13793650 330604 kvm_exit(EXCEPTION_NMI)6188458 330604 kvm_exit(EXTERNAL_INTERRUPT) 2169 105 kvm_exit(TPR_BELOW_THRESHOLD) 82 0 kvm_exit(IO_INSTRUCTION) 6 0 Strange, it's unable to fault in the very first page. Please provide a trace as per http://www.linux-kvm.org/page/Tracing (but append -e kvmmmu to the command line). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv3] KVM: optimize apic interrupt delivery
On 2012-09-13 11:00, Gleb Natapov wrote: Most interrupt are delivered to only one vcpu. Use pre-build tables to find interrupt destination instead of looping through all vcpus. In case of logical mode loop only through vcpus in a logical cluster irq is sent to. Signed-off-by: Gleb Natapov g...@redhat.com --- Changelog: - v2-v3 * sparse annotation for rcu usage * move mutex above map * use mask/shift to calculate cluster/dst ids * use gotos * add comment about logic behind logical table creation - v1-v2 * fix race Avi noticed * rcu_read_lock() out of the block as per Avi * fix rcu issues pointed to by MST. All but one. Still use call_rcu(). Do not think this is serious issue. If it is should be solved by RCU subsystem. * Fix phys_map overflow pointed to by MST * recalculate_apic_map() does not return error any more. * add optimization for low prio logical mode with one cpu as dst (it happens) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 64adb61..9dcfd3e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -511,6 +511,14 @@ struct kvm_arch_memory_slot { struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1]; }; +struct kvm_apic_map { + struct rcu_head rcu; + u8 ldr_bits; + u32 cid_shift, cid_mask, lid_mask; + struct kvm_lapic *phys_map[256]; + struct kvm_lapic *logical_map[16][16]; +}; + struct kvm_arch { unsigned int n_used_mmu_pages; unsigned int n_requested_mmu_pages; @@ -528,6 +536,8 @@ struct kvm_arch { struct kvm_ioapic *vioapic; struct kvm_pit *vpit; int vapics_in_nmi_mode; + struct mutex apic_map_lock; + struct kvm_apic_map *apic_map; unsigned int tss_addr; struct page *apic_access_page; diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 07ad628..a03d4aa 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -139,11 +139,105 @@ static inline int apic_enabled(struct kvm_lapic *apic) (LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \ APIC_LVT_REMOTE_IRR | 
APIC_LVT_LEVEL_TRIGGER) +static inline int apic_x2apic_mode(struct kvm_lapic *apic) +{ + return apic-vcpu-arch.apic_base X2APIC_ENABLE; +} + static inline int kvm_apic_id(struct kvm_lapic *apic) { return (kvm_apic_get_reg(apic, APIC_ID) 24) 0xff; } +static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr) +{ + ldr = 32 - map-ldr_bits; + return (ldr map-cid_shift) map-cid_mask; +} + +static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr) +{ + ldr = (32 - map-ldr_bits); + return ldr map-lid_mask; +} + +static inline void recalculate_apic_map(struct kvm *kvm) Inline? No recent compiler will respect it anyway, but it still looks strange for this function. +{ + struct kvm_apic_map *new, *old = NULL; + struct kvm_vcpu *vcpu; + int i; + + new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL); + + mutex_lock(kvm-arch.apic_map_lock); + + if (!new) + goto out; + + new-ldr_bits = 8; + /* flat mode is deafult */ + new-cid_shift = 8; + new-cid_mask = 0; + new-lid_mask = 0xff; + + kvm_for_each_vcpu(i, vcpu, kvm) { + struct kvm_lapic *apic = vcpu-arch.apic; + u16 cid, lid; + u32 ldr; + + if (!kvm_apic_present(vcpu)) + continue; + + /* + * All APICs have to be configured in the same mode by an OS. + * We take advatage of this while building logical id loockup + * table. After reset APICs are in xapic/flat mode, so if we + * find apic with different setting we assume this is the mode + * os wants all apics to be in and build lookup table + * accordingly. 
+ */ + if (apic_x2apic_mode(apic)) { + new-ldr_bits = 32; + new-cid_shift = 16; + new-cid_mask = new-lid_mask = 0x; + } else if (kvm_apic_sw_enabled(apic) + !new-cid_mask /* flat mode */ + kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER) { + new-cid_shift = 4; + new-cid_mask = 0xf; + new-lid_mask = 0xf; + } + + new-phys_map[kvm_apic_id(apic)] = apic; + + ldr = kvm_apic_get_reg(apic, APIC_LDR); + cid = apic_cluster_id(new, ldr); + lid = apic_logical_id(new, ldr); + + if (lid) + new-logical_map[cid][ffs(lid) - 1] = apic; + } +out: + old = rcu_dereference_protected(kvm-arch.apic_map, 1); + rcu_assign_pointer(kvm-arch.apic_map,
Re: [PATCHv3] KVM: optimize apic interrupt delivery
On Thu, Sep 13, 2012 at 12:29:44PM +0200, Jan Kiszka wrote: On 2012-09-13 11:00, Gleb Natapov wrote: Most interrupt are delivered to only one vcpu. Use pre-build tables to find interrupt destination instead of looping through all vcpus. In case of logical mode loop only through vcpus in a logical cluster irq is sent to. Signed-off-by: Gleb Natapov g...@redhat.com --- Changelog: - v2-v3 * sparse annotation for rcu usage * move mutex above map * use mask/shift to calculate cluster/dst ids * use gotos * add comment about logic behind logical table creation - v1-v2 * fix race Avi noticed * rcu_read_lock() out of the block as per Avi * fix rcu issues pointed to by MST. All but one. Still use call_rcu(). Do not think this is serious issue. If it is should be solved by RCU subsystem. * Fix phys_map overflow pointed to by MST * recalculate_apic_map() does not return error any more. * add optimization for low prio logical mode with one cpu as dst (it happens) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 64adb61..9dcfd3e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -511,6 +511,14 @@ struct kvm_arch_memory_slot { struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1]; }; +struct kvm_apic_map { + struct rcu_head rcu; + u8 ldr_bits; + u32 cid_shift, cid_mask, lid_mask; + struct kvm_lapic *phys_map[256]; + struct kvm_lapic *logical_map[16][16]; +}; + struct kvm_arch { unsigned int n_used_mmu_pages; unsigned int n_requested_mmu_pages; @@ -528,6 +536,8 @@ struct kvm_arch { struct kvm_ioapic *vioapic; struct kvm_pit *vpit; int vapics_in_nmi_mode; + struct mutex apic_map_lock; + struct kvm_apic_map *apic_map; unsigned int tss_addr; struct page *apic_access_page; diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 07ad628..a03d4aa 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -139,11 +139,105 @@ static inline int apic_enabled(struct kvm_lapic *apic) (LVT_MASK | 
APIC_MODE_MASK | APIC_INPUT_POLARITY | \ APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER) +static inline int apic_x2apic_mode(struct kvm_lapic *apic) +{ + return apic-vcpu-arch.apic_base X2APIC_ENABLE; +} + static inline int kvm_apic_id(struct kvm_lapic *apic) { return (kvm_apic_get_reg(apic, APIC_ID) 24) 0xff; } +static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr) +{ + ldr = 32 - map-ldr_bits; + return (ldr map-cid_shift) map-cid_mask; +} + +static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr) +{ + ldr = (32 - map-ldr_bits); + return ldr map-lid_mask; +} + +static inline void recalculate_apic_map(struct kvm *kvm) Inline? No recent compiler will respect it anyway, but it still looks strange for this function. Agree. I marked it inline when it was much smaller. Avi/Marcelo should I resend or you can edit before applying? +{ + struct kvm_apic_map *new, *old = NULL; + struct kvm_vcpu *vcpu; + int i; + + new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL); + + mutex_lock(kvm-arch.apic_map_lock); + + if (!new) + goto out; + + new-ldr_bits = 8; + /* flat mode is deafult */ + new-cid_shift = 8; + new-cid_mask = 0; + new-lid_mask = 0xff; + + kvm_for_each_vcpu(i, vcpu, kvm) { + struct kvm_lapic *apic = vcpu-arch.apic; + u16 cid, lid; + u32 ldr; + + if (!kvm_apic_present(vcpu)) + continue; + + /* +* All APICs have to be configured in the same mode by an OS. +* We take advatage of this while building logical id loockup +* table. After reset APICs are in xapic/flat mode, so if we +* find apic with different setting we assume this is the mode +* os wants all apics to be in and build lookup table +* accordingly. 
+*/ + if (apic_x2apic_mode(apic)) { + new-ldr_bits = 32; + new-cid_shift = 16; + new-cid_mask = new-lid_mask = 0x; + } else if (kvm_apic_sw_enabled(apic) + !new-cid_mask /* flat mode */ + kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER) { + new-cid_shift = 4; + new-cid_mask = 0xf; + new-lid_mask = 0xf; + } + + new-phys_map[kvm_apic_id(apic)] = apic; + + ldr = kvm_apic_get_reg(apic, APIC_LDR); + cid = apic_cluster_id(new, ldr); + lid = apic_logical_id(new, ldr); + +
Re: [PATCHv3] KVM: optimize apic interrupt delivery
On 2012-09-13 12:33, Gleb Natapov wrote: So, this can be the foundation for direct MSI delivery as well, right? What do you mean by direct MSI delivery? kvm_irq_delivery_to_apic() is called by MSI. If you mean delivery from irq context, then yes, mst plans to do so. Yes, that's what I was aiming at. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [PATCH 1/5] KVM: MMU: Push clean gpte write protection out of gpte_access()
On 09/12/2012 10:29 PM, Avi Kivity wrote: gpte_access() computes the access permissions of a guest pte and also write-protects clean gptes. This is wrong when we are servicing a write fault (since we'll be setting the dirty bit momentarily) but correct when instantiating a speculative spte, or when servicing a read fault (since we'll want to trap a following write in order to set the dirty bit). It doesn't seem to hurt in practice, but in order to make the code readable, push the write protection out of gpte_access() and into a new protect_clean_gpte() which is called explicitly when needed. In the current code, it seems that we will get two #PF if the guest writes memory through a clean pte: one to mark the dirty bit, then it faults again to set the W bit. Reviewed-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
Re: [PATCH 2/5] KVM: MMU: Optimize gpte_access() slightly
On 09/12/2012 10:29 PM, Avi Kivity wrote: If nx is disabled, then if gpte[63] is set we will hit a reserved-bit-set fault before checking permissions; so we can ignore the setting of efer.nxe. Reviewed-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
Re: [PATCH 3/5] KVM: MMU: Move gpte_access() out of paging_tmpl.h
On 09/12/2012 10:29 PM, Avi Kivity wrote:

 static bool FNAME(is_last_gpte)(struct guest_walker *walker, struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, pt_element_t gpte)
@@ -217,7 +206,7 @@ retry_walk:
 		last_gpte = FNAME(is_last_gpte)(walker, vcpu, mmu, pte);
 		if (last_gpte) {
-			pte_access = pt_access & FNAME(gpte_access)(vcpu, pte);
+			pte_access = pt_access & gpte_access(vcpu, pte);

It can pass a 32bit variable to gpte_access without a cast, with no warning?
Re: [PATCHv3] KVM: optimize apic interrupt delivery
On Thu, Sep 13, 2012 at 12:00:59PM +0300, Gleb Natapov wrote: Most interrupt are delivered to only one vcpu. Use pre-build tables to find interrupt destination instead of looping through all vcpus. In case of logical mode loop only through vcpus in a logical cluster irq is sent to. Signed-off-by: Gleb Natapov g...@redhat.com Some comments below. The code's pretty complex now, I think adding some comments will be helpful. Below, I noted where this would be especially beneficial. Thanks! --- Changelog: - v2-v3 * sparse annotation for rcu usage * move mutex above map * use mask/shift to calculate cluster/dst ids * use gotos * add comment about logic behind logical table creation - v1-v2 * fix race Avi noticed * rcu_read_lock() out of the block as per Avi * fix rcu issues pointed to by MST. All but one. Still use call_rcu(). Do not think this is serious issue. If it is should be solved by RCU subsystem. * Fix phys_map overflow pointed to by MST * recalculate_apic_map() does not return error any more. * add optimization for low prio logical mode with one cpu as dst (it happens) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 64adb61..9dcfd3e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -511,6 +511,14 @@ struct kvm_arch_memory_slot { struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1]; }; +struct kvm_apic_map { + struct rcu_head rcu; + u8 ldr_bits; ldr_bits are never used directly, always 32 - ldr_bits. It might be a good idea to just store 32 - ldr_bits. I am not sure. + u32 cid_shift, cid_mask, lid_mask; + struct kvm_lapic *phys_map[256]; + struct kvm_lapic *logical_map[16][16]; Would be nice to add documentation for structure fields: what does each field include? For example what are the index values into logical_map? When is each mode used? I am guessing this will address some questions below. 16 is used in sevral places it code. We also have 0xf which is really 16 - 1. 
Would be nice to have defines here. +}; + struct kvm_arch { unsigned int n_used_mmu_pages; unsigned int n_requested_mmu_pages; @@ -528,6 +536,8 @@ struct kvm_arch { struct kvm_ioapic *vioapic; struct kvm_pit *vpit; int vapics_in_nmi_mode; + struct mutex apic_map_lock; + struct kvm_apic_map *apic_map; unsigned int tss_addr; struct page *apic_access_page; diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 07ad628..a03d4aa 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -139,11 +139,105 @@ static inline int apic_enabled(struct kvm_lapic *apic) (LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \ APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER) +static inline int apic_x2apic_mode(struct kvm_lapic *apic) +{ + return apic-vcpu-arch.apic_base X2APIC_ENABLE; +} + static inline int kvm_apic_id(struct kvm_lapic *apic) { return (kvm_apic_get_reg(apic, APIC_ID) 24) 0xff; } +static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr) Why is this u16? It seems the only legal values are 0-15 since this is used as index in lookup in logical_map. Maybe add a comment explaning legal values are 0-15. Or maybe BUG_ON to check result is 0 to 15. +{ + ldr = 32 - map-ldr_bits; + return (ldr map-cid_shift) map-cid_mask; +} + +static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr) +{ + ldr = (32 - map-ldr_bits); + return ldr map-lid_mask; +} + +static inline void recalculate_apic_map(struct kvm *kvm) +{ + struct kvm_apic_map *new, *old = NULL; + struct kvm_vcpu *vcpu; + int i; + + new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL); + + mutex_lock(kvm-arch.apic_map_lock); + + if (!new) + goto out; + + new-ldr_bits = 8; + /* flat mode is deafult */ Typo + new-cid_shift = 8; + new-cid_mask = 0; + new-lid_mask = 0xff; + + kvm_for_each_vcpu(i, vcpu, kvm) { + struct kvm_lapic *apic = vcpu-arch.apic; + u16 cid, lid; + u32 ldr; + + if (!kvm_apic_present(vcpu)) + continue; + + /* + * All APICs have to be configured in the same mode by an OS. 
+ * We take advatage of this while building logical id loockup + * table. After reset APICs are in xapic/flat mode, so if we + * find apic with different setting we assume this is the mode + * os wants all apics to be in s/os/OS (for consistency). and build lookup table accordingly. A bit clearer: ; we build the lookup table accordingly. (otherwise it reads as if os builds the lookup table) + */ + if (apic_x2apic_mode(apic)) { +
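The mask/shift helpers under review above lost their `>>` and `&` operators in this archive, which makes the id extraction hard to follow. Below is a hedged userspace reconstruction of just those two helpers, with the field values the patch assigns for xAPIC flat and cluster modes; the struct is a cut-down stand-in for the real kvm_apic_map, not KVM code.

```c
#include <assert.h>
#include <stdint.h>

/* Cut-down model of kvm_apic_map's lookup parameters. Per the patch:
 * xAPIC uses 8 LDR bits (the LDR lives in the register's top byte);
 * flat mode has an 8-bit logical id and no cluster id (cid_mask 0),
 * cluster mode has a 4-bit cluster id above a 4-bit logical id. */
struct apic_map {
    uint8_t  ldr_bits;
    uint32_t cid_shift, cid_mask, lid_mask;
};

static uint16_t apic_cluster_id(const struct apic_map *m, uint32_t ldr)
{
    ldr >>= 32 - m->ldr_bits;            /* drop the unused low bits */
    return (ldr >> m->cid_shift) & m->cid_mask;
}

static uint16_t apic_logical_id(const struct apic_map *m, uint32_t ldr)
{
    ldr >>= 32 - m->ldr_bits;
    return ldr & m->lid_mask;
}
```

In cluster mode an LDR of 0x23000000 (register value, id in the top byte) decodes to cluster 2, logical id 3; in flat mode the whole top byte is the logical id and the cluster index collapses to 0, which matches the single-row logical_map use the patch describes.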
Re: [PATCH 3/5] KVM: MMU: Move gpte_access() out of paging_tmpl.h
On 09/13/2012 02:48 PM, Xiao Guangrong wrote: On 09/12/2012 10:29 PM, Avi Kivity wrote:

 static bool FNAME(is_last_gpte)(struct guest_walker *walker, struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, pt_element_t gpte)
@@ -217,7 +206,7 @@ retry_walk:
 		last_gpte = FNAME(is_last_gpte)(walker, vcpu, mmu, pte);
 		if (last_gpte) {
-			pte_access = pt_access & FNAME(gpte_access)(vcpu, pte);
+			pte_access = pt_access & gpte_access(vcpu, pte);

It can pass a 32bit variable to gpte_access without a cast, no warning? No warning. -- error compiling committee.c: too many arguments to function
Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
* Andrew Theurer haban...@linux.vnet.ibm.com [2012-09-11 13:27:41]: On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote: On 09/11/2012 01:42 AM, Andrew Theurer wrote: On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote: On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p) +{ + if (!curr-sched_class-yield_to_task) + return false; + + if (curr-sched_class != p-sched_class) + return false; Peter, Should we also add a check if the runq has a skip buddy (as pointed out by Raghu) and return if the skip buddy is already set. Oh right, I missed that suggestion.. the performance improvement went from 81% to 139% using this, right? It might make more sense to keep that separate, outside of this function, since its not a strict prerequisite. + if (task_running(p_rq, p) || p-state) + return false; + + return true; +} @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt) rq = this_rq(); again: + /* optimistic test to avoid taking locks */ + if (!__yield_to_candidate(curr, p)) + goto out_irq; + So add something like: /* Optimistic, if we 'raced' with another yield_to(), don't bother */ if (p_rq-cfs_rq-skip) goto out_irq; p_rq = task_rq(p); double_rq_lock(rq, p_rq); But I do have a question on this optimization though,.. Why do we check p_rq-cfs_rq-skip and not rq-cfs_rq-skip ? That is, I'd like to see this thing explained a little better. Does it go something like: p_rq is the runqueue of the task we'd like to yield to, rq is our own, they might be the same. If we have a -skip, there's nothing we can do about it, OTOH p_rq having a -skip and failing the yield_to() simply means us picking the next VCPU thread, which might be running on an entirely different cpu (rq) and could succeed? Here's two new versions, both include a __yield_to_candidate(): v3 uses the check for p_rq-curr in guest mode, and v4 uses the cfs_rq skip check. 
Raghu, I am not sure if this is exactly what you want implemented in v4. Andrew, Yes that is what I had. I think there was a misunderstanding. My intention was: if a directed yield has already happened in a runqueue (say rqA), do not bother to yield to it again. But unfortunately, as PeterZ pointed out, that would have resulted in setting the next buddy of a different run queue than rqA. So we can drop this skip idea. Pondering more over what to do? can we use next buddy itself ... thinking.. As I mentioned earlier today, I did not have your changes from the kvm.git tree when I tested my changes. Here are your changes and my changes compared, throughput in MB/sec: kvm_vcpu_on_spin changes: 4636 +/- 15.74%; yield_to changes: 4515 +/- 12.73%. I would be inclined to stick with your changes which are kept in kvm code. I did try both combined, and did not get good results: both changes: 4074 +/- 19.12%. So, having both is probably not a good idea. However, I feel like there's more work to be done. With no over-commit (10 VMs), total throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some overhead, but a reduction to ~4500 is still terrible. By contrast, 8-way VMs with 2x over-commit have a total throughput roughly 10% less than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread host). We still have what appears to be scalability problems, but now it's not so much in runqueue locks for yield_to(), but now get_pid_task(): Hi Andrew, IMHO, reducing the double runqueue lock overhead is a good idea, and maybe we will see the benefits when we increase the overcommit further. The explanation for not seeing a good benefit on top of the PLE handler optimization patch is that we filter the yield_to candidates, resulting in less contention for the double runqueue lock, and the extra code overhead during a genuine yield_to might have caused some degradation in the case you tested. However, did you use cfs.next also? I hope it helps when we combine.
Here is the result that is showing a positive benefit. I experimented on a 32 core (no HT) PLE machine with 32 vcpu guest(s).

+----+----------+----------+----------+----------+----------+
          kernbench time in sec, lower is better
+----+----------+----------+----------+----------+----------+
       base       stddev     patched    stddev     %improve
+----+----------+----------+----------+----------+----------+
  1x   44.3880    1.8699    40.8180    1.9173     8.04271
  2x   96.7580    4.2787    93.4188    3.5150     3.45108
+----+----------+----------+----------+----------+----------+
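The optimistic-check pattern debated in this thread can be sketched in isolation. Everything below is illustrative (a toy task struct and a counter standing in for double_rq_lock()); it only demonstrates the idea under discussion: a cheap lock-free candidate check first, the expensive locks only if it passes, and a re-check under the locks because the state may have raced.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of __yield_to_candidate(): reject obvious non-candidates
 * without touching any lock. Field names are illustrative. */
struct task { int klass; bool running; };

static bool yield_to_candidate(const struct task *curr, const struct task *p)
{
    if (curr->klass != p->klass)   /* same sched class required */
        return false;
    if (p->running)                /* already on a cpu: nothing to do */
        return false;
    return true;
}

static int locks_taken;  /* counts how often we paid for the locks */

static bool try_yield_to(struct task *curr, struct task *p)
{
    /* optimistic test to avoid taking locks */
    if (!yield_to_candidate(curr, p))
        return false;

    ++locks_taken;                          /* double_rq_lock() here */
    bool ok = yield_to_candidate(curr, p);  /* re-check under locks */
    /* double_rq_unlock() here */
    return ok;
}
```

The payoff measured in the thread comes from the first check: hopeless candidates never touch the runqueue locks at all, so lock hold times and contention drop.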
Re: memtest 4.20+ does not work with -cpu host
On 13.09.2012 10:05, Gleb Natapov wrote: On Thu, Sep 13, 2012 at 10:00:26AM +0200, Paolo Bonzini wrote: Il 13/09/2012 09:57, Gleb Natapov ha scritto: #rdmsr -0 0x194 00011100 #rdmsr -0 0xce 0c0004011103 Yes, that can help implementing it in KVM. But without a spec to understand what the bits actually mean, it's just as risky... Peter, do you have any idea where to get the spec of the memory controller MSRs in Nehalem and newer processors? Apparently, memtest is using them (and in particular 0x194) to find the speed of the FSB, or something like that. Why would anyone will want to run memtest in a vm? May be just add those MSRs to ignore list and that's it. From the output it looks like it's basically a list of bits. Returning something sensible is better, same as for the speed scaling MSRs. Everything is list of bits in computers :) At least 0xce is documented in SDM. It cannot be implemented in a migration safe manner. What do you suggest just say memtest does not work? I am wondering why it is working with -cpu qemu64. Peter -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] KVM: MMU: Optimize pte permission checks
On 09/12/2012 10:29 PM, Avi Kivity wrote: walk_addr_generic() permission checks are a maze of branchy code, which is performed four times per lookup. It depends on the type of access, efer.nxe, cr0.wp, cr4.smep, and in the near future, cr4.smap. Optimize this away by precalculating all variants and storing them in a bitmap. The bitmap is recalculated when rarely-changing variables change (cr0, cr4) and is indexed by the often-changing variables (page fault error code, pte access permissions). Really graceful! The result is short, branch-free code. Signed-off-by: Avi Kivity a...@redhat.com +static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu) +{ + unsigned bit, byte, pfec; + u8 map; + bool fault, x, w, u, wf, uf, ff, smep; + + smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP); + for (byte = 0; byte ARRAY_SIZE(mmu-permissions); ++byte) { + pfec = byte 1; + map = 0; + wf = pfec PFERR_WRITE_MASK; + uf = pfec PFERR_USER_MASK; + ff = pfec PFERR_FETCH_MASK; + for (bit = 0; bit 8; ++bit) { + x = bit ACC_EXEC_MASK; + w = bit ACC_WRITE_MASK; + u = bit ACC_USER_MASK; + + /* Not really needed: !nx will cause pte.nx to fault */ + x |= !mmu-nx; + /* Allow supervisor writes if !cr0.wp */ + w |= !is_write_protection(vcpu) !uf; + /* Disallow supervisor fetches if cr4.smep */ + x = !(smep !uf); In the case of smep, supervisor mode can fetch the memory if pte.u == 0, so, it should be x = !(smep !uf u)? @@ -3672,20 +3672,18 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, gpa_t *gpa, struct x86_exception *exception, bool write) { - u32 access = (kvm_x86_ops-get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; + u32 access = ((kvm_x86_ops-get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0) + | (write ? 
PFERR_WRITE_MASK : 0);
+	u8 bit = vcpu->arch.access;

-	if (vcpu_match_mmio_gva(vcpu, gva) &&
-	    check_write_user_access(vcpu, write, access,
-				    vcpu->arch.access)) {
+	if (vcpu_match_mmio_gva(vcpu, gva) &&
+	    ((vcpu->arch.walk_mmu->permissions[access >> 1] >> bit) & 1)) {

Shouldn't this be !((vcpu->arch.walk_mmu->permissions[access >> 1] >> bit) & 1)? Would it be better to introduce a function to do the permission check?
Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
On 09/11/2012 09:27 PM, Andrew Theurer wrote: So, having both is probably not a good idea. However, I feel like there's more work to be done. With no over-commit (10 VMs), total throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some overhead, but a reduction to ~4500 is still terrible. By contrast, 8-way VMs with 2x over-commit have a total throughput roughly 10% less than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread host). We still have what appears to be scalability problems, but now it's not so much in runqueue locks for yield_to(), but now get_pid_task(): perf on host: 32.10% 320131 qemu-system-x86 [kernel.kallsyms] [k] get_pid_task 11.60% 115686 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock 10.28% 102522 qemu-system-x86 [kernel.kallsyms] [k] yield_to 9.17% 91507 qemu-system-x86 [kvm] [k] kvm_vcpu_on_spin 7.74% 77257 qemu-system-x86 [kvm] [k] kvm_vcpu_yield_to 3.56% 35476 qemu-system-x86 [kernel.kallsyms] [k] __srcu_read_lock 3.00% 29951 qemu-system-x86 [kvm] [k] __vcpu_run 2.93% 29268 qemu-system-x86 [kvm_intel] [k] vmx_vcpu_run 2.88% 28783 qemu-system-x86 [kvm] [k] vcpu_enter_guest 2.59% 25827 qemu-system-x86 [kernel.kallsyms] [k] __schedule 1.40% 13976 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock_irq 1.28% 12823 qemu-system-x86 [kernel.kallsyms] [k] resched_task 1.14% 11376 qemu-system-x86 [kvm_intel] [k] vmcs_writel 0.85% 8502 qemu-system-x86 [kernel.kallsyms] [k] pick_next_task_fair 0.53% 5315 qemu-system-x86 [kernel.kallsyms] [k] native_write_msr_safe 0.46% 4553 qemu-system-x86 [kernel.kallsyms] [k] native_load_tr_desc get_pid_task() uses some rcu fucntions, wondering how scalable this is I tend to think of rcu as -not- having issues like this... is there a rcu stat/tracing tool which would help identify potential problems? It's not, it's the atomics + cache line bouncing. We're basically guaranteed to bounce here. Here we're finally paying for the ioctl() based interface. 
A syscall based interface would have a 1:1 correspondence between vcpus and tasks, so these games would be unnecessary. -- error compiling committee.c: too many arguments to function
Re: [PATCH 4/5] KVM: MMU: Optimize pte permission checks
On 09/13/2012 03:09 PM, Xiao Guangrong wrote: The result is short, branch-free code. Signed-off-by: Avi Kivity a...@redhat.com +static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu) +{ +unsigned bit, byte, pfec; +u8 map; +bool fault, x, w, u, wf, uf, ff, smep; + +smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP); +for (byte = 0; byte ARRAY_SIZE(mmu-permissions); ++byte) { +pfec = byte 1; +map = 0; +wf = pfec PFERR_WRITE_MASK; +uf = pfec PFERR_USER_MASK; +ff = pfec PFERR_FETCH_MASK; +for (bit = 0; bit 8; ++bit) { +x = bit ACC_EXEC_MASK; +w = bit ACC_WRITE_MASK; +u = bit ACC_USER_MASK; + +/* Not really needed: !nx will cause pte.nx to fault */ +x |= !mmu-nx; +/* Allow supervisor writes if !cr0.wp */ +w |= !is_write_protection(vcpu) !uf; +/* Disallow supervisor fetches if cr4.smep */ +x = !(smep !uf); In the case of smep, supervisor mode can fetch the memory if pte.u == 0, so, it should be x = !(smep !uf u)? Good catch, will fix. @@ -3672,20 +3672,18 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, gpa_t *gpa, struct x86_exception *exception, bool write) { -u32 access = (kvm_x86_ops-get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0; +u32 access = ((kvm_x86_ops-get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0) +| (write ? PFERR_WRITE_MASK : 0); +u8 bit = vcpu-arch.access; -if (vcpu_match_mmio_gva(vcpu, gva) - check_write_user_access(vcpu, write, access, - vcpu-arch.access)) { +if (vcpu_match_mmio_gva(vcpu, gva) + ((vcpu-arch.walk_mmu-permissions[access 1] bit) 1)) { !((vcpu-arch.walk_mmu-permissions[access 1] bit) 1) ? It is better introducing a function to do the permission check? Probably, I'll rethink it. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] KVM: MMU: Optimize pte permission checks
On 09/12/2012 10:29 PM, Avi Kivity wrote:

+		pte_access = pt_access & gpte_access(vcpu, pte);
+		eperm |= (mmu->permissions[access >> 1] >> pte_access) & 1;

 		last_gpte = FNAME(is_last_gpte)(walker, vcpu, mmu, pte);
-		if (last_gpte) {
-			pte_access = pt_access & gpte_access(vcpu, pte);
-			/* check if the kernel is fetching from user page */
-			if (unlikely(pte_access & PT_USER_MASK) &&
-			    kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))
-				if (fetch_fault && !user_fault)
-					eperm = true;
-		}

I see this in the SDM: If CR4.SMEP = 1, instructions may be fetched from any linear address with a valid translation for which the U/S flag (bit 2) is 0 in at least one of the paging-structure entries controlling the translation. This patch checks smep at every level, which breaks this rule. (The current code checks smep only on the last level.)
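A small userspace model may make the precalculated permission bitmap concrete. This is a sketch, not the kernel code: it uses the real x86 PFERR/ACC bit values but simplified inputs, and it folds in the SMEP correction raised in this thread (only pages with the user bit set become unfetchable in supervisor mode). Names and the 16-entry table size are taken from the patch; everything else is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* pte access bits and page-fault error-code bits (real x86 values) */
enum { ACC_EXEC = 1, ACC_WRITE = 2, ACC_USER = 4 };
enum { PFERR_WRITE = 2, PFERR_USER = 4, PFERR_FETCH = 16 };

/* One byte per fault-type combination (index: pfec >> 1, since bit 0
 * of pfec is the present bit), one bit per pte access combination. */
static uint8_t permissions[16];

static void update_permission_bitmap(bool nx, bool cr0_wp, bool smep)
{
    for (unsigned byte = 0; byte < 16; ++byte) {
        unsigned pfec = byte << 1;
        bool wf = pfec & PFERR_WRITE;
        bool uf = pfec & PFERR_USER;
        bool ff = pfec & PFERR_FETCH;
        uint8_t map = 0;
        for (unsigned bit = 0; bit < 8; ++bit) {
            bool x = bit & ACC_EXEC;
            bool w = bit & ACC_WRITE;
            bool u = bit & ACC_USER;

            x |= !nx;               /* !nx: pte.nx faults elsewhere */
            w |= !cr0_wp && !uf;    /* supervisor writes if !cr0.wp */
            if (smep && !uf && u)   /* SMEP: supervisor fetch from a */
                x = false;          /* user page faults (per review) */

            bool fault = (ff && !x) || (wf && !w) || (uf && !u);
            map |= (uint8_t)(fault << bit);
        }
        permissions[byte] = map;
    }
}

/* The per-walk check collapses to one shift-and-mask, branch-free. */
static bool check_fault(unsigned pfec, unsigned pte_access)
{
    return (permissions[pfec >> 1] >> pte_access) & 1;
}
```

The point of the design is visible in check_fault(): the branchy logic runs once per cr0/cr4 change, while the hot path indexes a table by the often-changing error code and pte access bits.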
Re: memtest 4.20+ does not work with -cpu host
On Thu, Sep 13, 2012 at 02:05:23PM +0200, Peter Lieven wrote: On 13.09.2012 10:05, Gleb Natapov wrote: On Thu, Sep 13, 2012 at 10:00:26AM +0200, Paolo Bonzini wrote: Il 13/09/2012 09:57, Gleb Natapov ha scritto: #rdmsr -0 0x194 00011100 #rdmsr -0 0xce 0c0004011103 Yes, that can help implementing it in KVM. But without a spec to understand what the bits actually mean, it's just as risky... Peter, do you have any idea where to get the spec of the memory controller MSRs in Nehalem and newer processors? Apparently, memtest is using them (and in particular 0x194) to find the speed of the FSB, or something like that. Why would anyone will want to run memtest in a vm? May be just add those MSRs to ignore list and that's it. From the output it looks like it's basically a list of bits. Returning something sensible is better, same as for the speed scaling MSRs. Everything is list of bits in computers :) At least 0xce is documented in SDM. It cannot be implemented in a migration safe manner. What do you suggest just say memtest does not work? Why do you want to run it in a guest? I am wondering why it is working with -cpu qemu64. Because memtest has different code for different cpu models. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] KVM: PPC: bookehv: Allow duplicate calls of DO_KVM macro
-----Original Message----- From: Wood Scott-B07421 Sent: Thursday, September 13, 2012 12:54 AM To: Alexander Graf Cc: Caraman Mihai Claudiu-B02008; kvm-...@vger.kernel.org; linuxppc-d...@lists.ozlabs.org; kvm@vger.kernel.org Subject: Re: [PATCH] KVM: PPC: bookehv: Allow duplicate calls of DO_KVM macro On 09/12/2012 04:45 PM, Alexander Graf wrote: On 12.09.2012, at 23:38, Scott Wood scottw...@freescale.com wrote: On 09/12/2012 01:56 PM, Alexander Graf wrote: On 12.09.2012, at 15:18, Mihai Caraman mihai.cara...@freescale.com wrote: The current form of the DO_KVM macro restricts its use to one call per input parameter set. This is caused by the kvmppc_resume_\intno\()_\srr1 symbol definition. Duplicate calls of DO_KVM are required by distinct implementations of exception handlers which are delegated at runtime. Not sure I understand what you're trying to achieve here. Please elaborate ;) On 64-bit book3e we compile multiple versions of the TLB miss handlers, and choose from them at runtime. The exception handler patching has been active in the __early_init_mmu() function in powerpc/mm/tlb_nohash.c for quite a few years. For TLB miss exceptions there are three handler versions: standard, HW tablewalk and bolted. I posted a patch to add another variant, for e6500-style hardware tablewalk, which shares the bolted prolog/epilog (besides prolog/epilog performance, e6500 is incompatible with the IBM tablewalk code for various reasons). That caused us to have two DO_KVMs for the same exception type. Sorry, I forgot to cc the kvm-ppc mailing list when I replied to that discussion thread. -Mike
Re: memtest 4.20+ does not work with -cpu host
On 13.09.2012 14:42, Gleb Natapov wrote: On Thu, Sep 13, 2012 at 02:05:23PM +0200, Peter Lieven wrote: On 13.09.2012 10:05, Gleb Natapov wrote: On Thu, Sep 13, 2012 at 10:00:26AM +0200, Paolo Bonzini wrote: On 13/09/2012 09:57, Gleb Natapov wrote: # rdmsr -0 0x194 00011100 # rdmsr -0 0xce 0c0004011103 Yes, that can help implementing it in KVM. But without a spec to understand what the bits actually mean, it's just as risky... Peter, do you have any idea where to get the spec of the memory controller MSRs in Nehalem and newer processors? Apparently, memtest is using them (and in particular 0x194) to find the speed of the FSB, or something like that. Why would anyone want to run memtest in a VM? Maybe just add those MSRs to the ignore list and that's it. From the output it looks like it's basically a list of bits. Returning something sensible is better, same as for the speed scaling MSRs. Everything is a list of bits in computers :) At least 0xce is documented in the SDM. It cannot be implemented in a migration safe manner. What do you suggest, just say memtest does not work? Why do you want to run it in a guest? Testing memory throughput of different host memory layouts/settings (hugepages, ksm etc.). Stress testing new settings and qemu-kvm builds. Testing new nodes with a VM which claims all available pages. It's a lot easier than booting a node with a CD and attaching to the console. This, of course, is all not mission critical and can also be done with cpu model qemu64. I just came across memtest no longer working and was wondering if there is a general regression. 
BTW, from http://opensource.apple.com/source/xnu/xnu-1228.15.4/osfmk/i386/tsc.c?txt

#define MSR_FLEX_RATIO		0x194
#define MSR_PLATFORM_INFO	0x0ce
#define BASE_NHM_CLOCK_SOURCE	133333333ULL
#define CPUID_MODEL_NEHALEM	26

	switch (cpuid_info()->cpuid_model) {
	case CPUID_MODEL_NEHALEM: {
		uint64_t cpu_mhz;
		uint64_t msr_flex_ratio;
		uint64_t msr_platform_info;

		/* See if FLEX_RATIO is being used */
		msr_flex_ratio = rdmsr64(MSR_FLEX_RATIO);
		msr_platform_info = rdmsr64(MSR_PLATFORM_INFO);
		flex_ratio_min = (uint32_t)bitfield(msr_platform_info, 47, 40);
		flex_ratio_max = (uint32_t)bitfield(msr_platform_info, 15, 8);
		/* No BIOS-programmed flex ratio. Use hardware max as default */
		tscGranularity = flex_ratio_max;
		if (msr_flex_ratio & bit(16)) {
			/* Flex Enabled: Use this MSR if less than max */
			flex_ratio = (uint32_t)bitfield(msr_flex_ratio, 15, 8);
			if (flex_ratio < flex_ratio_max)
				tscGranularity = flex_ratio;
		}

		/* If EFI isn't configured correctly, use a constant
		 * value. See 6036811.
		 */
		if (busFreq == 0)
			busFreq = BASE_NHM_CLOCK_SOURCE;

		cpu_mhz = tscGranularity * BASE_NHM_CLOCK_SOURCE;

		kprintf("[NHM] Maximum Non-Turbo Ratio = [%d]\n", (uint32_t)tscGranularity);
		kprintf("[NHM] CPU: Frequency = %6d.%04dMhz\n", (uint32_t)(cpu_mhz / Mega), (uint32_t)(cpu_mhz % Mega));
		break;
	}

Peter
Re: [PATCH 4/5] KVM: MMU: Optimize pte permission checks
On 09/13/2012 03:41 PM, Xiao Guangrong wrote: On 09/12/2012 10:29 PM, Avi Kivity wrote:

+		pte_access = pt_access & gpte_access(vcpu, pte);
+		eperm |= (mmu->permissions[access >> 1] >> pte_access) & 1;
 		last_gpte = FNAME(is_last_gpte)(walker, vcpu, mmu, pte);
-		if (last_gpte) {
-			pte_access = pt_access & gpte_access(vcpu, pte);
-			/* check if the kernel is fetching from user page */
-			if (unlikely((pte_access & PT_USER_MASK) &&
-				     kvm_read_cr4_bits(vcpu, X86_CR4_SMEP)))
-				if (fetch_fault && !user_fault)
-					eperm = true;
-		}

I see this in the SDM: "If CR4.SMEP = 1, instructions may be fetched from any linear address with a valid translation for which the U/S flag (bit 2) is 0 in at least one of the paging-structure entries controlling the translation." Another good catch. This patch checks SMEP at every level, which breaks this rule (the current code checks SMEP only on the last level). We can just move the permission check to the end of the loop. We used to terminate the loop on a permission error, but now we do the whole thing anyway. It does mean that we'll need to set accessed bits after the loop is complete. -- error compiling committee.c: too many arguments to function
Re: [PATCH 6/5] KVM: MMU: Optimize is_last_gpte()
On 09/13/2012 05:47 PM, Avi Kivity wrote: On 09/12/2012 09:03 PM, Avi Kivity wrote: On 09/12/2012 08:49 PM, Avi Kivity wrote: Instead of branchy code depending on level, gpte.ps, and mmu configuration, prepare everything in a bitmap during mode changes and look it up during runtime. 6/5 is buggy, sorry, will update it tomorrow.

--8<------------8<------------

From: Avi Kivity a...@redhat.com
Date: Wed, 12 Sep 2012 20:46:56 +0300
Subject: [PATCH v2 6/5] KVM: MMU: Optimize is_last_gpte()

Instead of branchy code depending on level, gpte.ps, and mmu configuration, prepare everything in a bitmap during mode changes and look it up during runtime.

Signed-off-by: Avi Kivity a...@redhat.com
---
v2: rearrange bitmap (one less shift)
    avoid stomping on local variable
    fix index calculation
    move check back to a function

 arch/x86/include/asm/kvm_host.h | 7 +++
 arch/x86/kvm/mmu.c | 31 +++
 arch/x86/kvm/mmu.h | 3 ++-
 arch/x86/kvm/paging_tmpl.h | 22 +-
 4 files changed, 41 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3318bde..f9a48cf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -298,6 +298,13 @@ struct kvm_mmu {
 	u64 *lm_root;
 	u64 rsvd_bits_mask[2][4];

+	/*
+	 * Bitmap: bit set = last pte in walk
+	 * index[0]: pte.ps
+	 * index[1:2]: level
+	 */

Opposite? index[2]: pte.pse? Reviewed-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
Re: [Qemu-devel] [PATCH 1/2] KVM: fix i8259 interrupt high to low transition logic
On Wed, 12 Sep 2012, Matthew Ogilvie wrote: Also, how big of a concern is a very rare gained or lost IRQ0 actually? Under normal conditions, I would expect this to at most cause a one time clock drift in the guest OS of a fraction of a second. If that only happens when rebooting or migrating the guest... It depends on how you define very rare. Once per month or probably even per day is probably acceptable although you'll see a disruption in the system clock. This is still likely unwanted if the system is used as a clock reference and not just wants to keep its clock right for own purposes. Anything more frequent and NTP does care very much; an accurate system clock is important in many uses, starting from basic ones such as where timestamps of files exported over NFS are concerned. Speaking of real hw -- I don't know whether that really matters for emulated systems. Thanks for looking into the 8254 PIT in details. Maciej
Re: memtest 4.20+ does not work with -cpu host
On Thu, Sep 13, 2012 at 02:56:33PM +0200, Peter Lieven wrote: On 13.09.2012 14:42, Gleb Natapov wrote: On Thu, Sep 13, 2012 at 02:05:23PM +0200, Peter Lieven wrote: On 13.09.2012 10:05, Gleb Natapov wrote: On Thu, Sep 13, 2012 at 10:00:26AM +0200, Paolo Bonzini wrote: On 13/09/2012 09:57, Gleb Natapov wrote: # rdmsr -0 0x194 00011100 # rdmsr -0 0xce 0c0004011103 Yes, that can help implementing it in KVM. But without a spec to understand what the bits actually mean, it's just as risky... Peter, do you have any idea where to get the spec of the memory controller MSRs in Nehalem and newer processors? Apparently, memtest is using them (and in particular 0x194) to find the speed of the FSB, or something like that. Why would anyone want to run memtest in a VM? Maybe just add those MSRs to the ignore list and that's it. From the output it looks like it's basically a list of bits. Returning something sensible is better, same as for the speed scaling MSRs. Everything is a list of bits in computers :) At least 0xce is documented in the SDM. It cannot be implemented in a migration safe manner. What do you suggest, just say memtest does not work? Why do you want to run it in a guest? Testing memory throughput of different host memory layouts/settings (hugepages, ksm etc.). In my day memtest looked for memory errors. That does not make much sense in a virtualized environment. What does it do today? Calculate throughput? Does it prefault memory before doing so? Otherwise the numbers will not be very meaningful when running inside a VM. But since memtest works on physical memory I doubt it prefaults. Stress testing new settings and qemu-kvm builds. Why would a guest accessing memory stress qemu-kvm? Testing new nodes with a VM which claims all available pages. It's a lot easier than booting a node with a CD and attaching to the console. Boot Windows, it accesses all memory :) or run with qemu64 like you say below. 
This, of course, is all not mission critical and can also be done with cpu model qemu64. I just came across memtest no longer working and was wondering if there is a general regression. If it is a regression it is likely in memtest. BTW, from http://opensource.apple.com/source/xnu/xnu-1228.15.4/osfmk/i386/tsc.c?txt You can send them a patch to check that it runs in a VM and skip all that.

#define MSR_FLEX_RATIO		0x194
#define MSR_PLATFORM_INFO	0x0ce
#define BASE_NHM_CLOCK_SOURCE	133333333ULL
#define CPUID_MODEL_NEHALEM	26

	switch (cpuid_info()->cpuid_model) {
	case CPUID_MODEL_NEHALEM: {
		uint64_t cpu_mhz;
		uint64_t msr_flex_ratio;
		uint64_t msr_platform_info;

		/* See if FLEX_RATIO is being used */
		msr_flex_ratio = rdmsr64(MSR_FLEX_RATIO);
		msr_platform_info = rdmsr64(MSR_PLATFORM_INFO);
		flex_ratio_min = (uint32_t)bitfield(msr_platform_info, 47, 40);
		flex_ratio_max = (uint32_t)bitfield(msr_platform_info, 15, 8);
		/* No BIOS-programmed flex ratio. Use hardware max as default */
		tscGranularity = flex_ratio_max;
		if (msr_flex_ratio & bit(16)) {
			/* Flex Enabled: Use this MSR if less than max */
			flex_ratio = (uint32_t)bitfield(msr_flex_ratio, 15, 8);
			if (flex_ratio < flex_ratio_max)
				tscGranularity = flex_ratio;
		}

		/* If EFI isn't configured correctly, use a constant
		 * value. See 6036811.
		 */
		if (busFreq == 0)
			busFreq = BASE_NHM_CLOCK_SOURCE;

		cpu_mhz = tscGranularity * BASE_NHM_CLOCK_SOURCE;

		kprintf("[NHM] Maximum Non-Turbo Ratio = [%d]\n", (uint32_t)tscGranularity);
		kprintf("[NHM] CPU: Frequency = %6d.%04dMhz\n", (uint32_t)(cpu_mhz / Mega), (uint32_t)(cpu_mhz % Mega));
		break;
	}

Peter -- Gleb.
Re: [PATCH v7 3/3] KVM: perf: kvm events analysis tool
On Wed, Sep 12, 2012 at 10:56:44PM -0600, David Ahern wrote:

 static const char * const kvm_usage[] = {
+	"perf kvm [<options>] {top|record|report|diff|buildid-list|stat}",

The usage for the report/record sub commands of stat is never shown. e.g., $ perf kvm stat -- shows help for perf-stat $ perf kvm -- shows the above and perf-kvm's usage [I deleted this thread, so having to reply to one of my responses. Hopefully no one is unduly harmed by this.] I've been using this command a bit lately -- especially on nested virtualization -- and I think the syntax is quirky - meaning wrong. In my case I always follow up a record with a report and end up using a shell script wrapper that combines the 2 and running it repeatedly. e.g.,

$PERF kvm stat record -o $FILE -p $pid -- sleep $time
[ $? -eq 0 ] && $PERF --no-pager kvm -i $FILE stat report

As my daughter likes to say - awkward. That suggests what is really needed is a 'live' mode - a continual updating of the output like perf top, not a record and analyze later mode. Which does come back to why I responded to this email -- the syntax is clunky and awkward. So, I spent a fair amount of time today implementing a live mode. And after a lot of swearing at the tracepoint processing code I What kind of swearing? I'm working on 'perf test' entries for tracepoints to make sure we don't regress on the perf/libtraceevent junction, doing that as prep work for further simplifying tracepoint tools like sched, kvm, kmem, etc. finally have it working. And the format extends easily (meaning day and the next step) to a perf-based kvm_stat replacement. Example syntax is: perf kvm stat [-p pid|-a|...] which defaults to an update delay of 1 second, and vmexit analysis. The guts of the processing logic come from the existing kvm-events code. The changes focus on combining the record and report paths into one. The display needs some help (Arnaldo?), but it seems to work well. I'd like to get opinions on what next? 
IMO, the record/report path should not get a foothold from a backward compatibility perspective, and we should not have to maintain those options. I am willing to take the existing patches into git to maintain authorship and from there apply patches to make the live mode work - which includes a bit of refactoring of perf code (like the stats changes). Before I march down this path, any objections, opinions, etc? Can I see the code? - Arnaldo
Re: [Qemu-devel] [PATCH 1/2] KVM: fix i8259 interrupt high to low transition logic
On 2012-09-13 15:41, Maciej W. Rozycki wrote: On Wed, 12 Sep 2012, Matthew Ogilvie wrote: Also, how big of a concern is a very rare gained or lost IRQ0 actually? Under normal conditions, I would expect this to at most cause a one time clock drift in the guest OS of a fraction of a second. If that only happens when rebooting or migrating the guest... It depends on how you define very rare. Once per month or probably even per day is probably acceptable although you'll see a disruption in the system clock. This is still likely unwanted if the system is used as a clock reference and not just wants to keep its clock right for own purposes. Anything more frequent and NTP does care very much; an accurate system clock is important in many uses, starting from basic ones such as where timestamps of files exported over NFS are concerned. Speaking of real hw -- I don't know whether that really matters for emulated systems. Thanks for looking into the 8254 PIT in details. First correct, then fast. That rule applies at least to the conceptual phase. Also, for rarely used PIT modes, I would refrain from optimizing them away from the specified behaviour. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SDP-DE Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 1/2] KVM: fix i8259 interrupt high to low transition logic
On 2012-09-13 07:49, Matthew Ogilvie wrote: On Wed, Sep 12, 2012 at 10:57:57AM +0200, Jan Kiszka wrote: On 2012-09-12 10:51, Avi Kivity wrote: On 09/12/2012 11:48 AM, Jan Kiszka wrote: On 2012-09-12 10:01, Avi Kivity wrote: On 09/10/2012 04:29 AM, Matthew Ogilvie wrote: Intel's definition of edge triggered means: asserted with a low-to-high transition at the time an interrupt is registered and then kept high until the interrupt is served via one of the EOI mechanisms or goes away unhandled. So the only difference between edge triggered and level triggered is in the leading edge, with no difference in the trailing edge. This bug manifested itself when the guest was Microport UNIX System V/386 v2.1 (ca. 1987), because it would sometimes mask off IRQ14 in the slave IMR after it had already been asserted. The master would still try to deliver an interrupt even though IRQ2 had dropped again, resulting in a spurious interrupt (IRQ15) and a panicked UNIX kernel.

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index adba28f..5cbba99 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -302,8 +302,12 @@ static void pit_do_work(struct kthread_work *work)
 	}
 	spin_unlock(&ps->inject_lock);
 	if (inject) {
-		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
+		/* Clear previous interrupt, then create a rising
+		 * edge to request another interrupt, and leave it at
+		 * level=1 until time to inject another one.
+		 */
 		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
+		kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);

I thought I understood this, now I'm not sure. How can this be correct? Real hardware doesn't act like this. What if the PIT is disabled after this? You're injecting a spurious interrupt then. Yes, the PIT has to raise the output as long as specified, i.e. according to the datasheet. That's important now due to the corrections to the PIC. We can then carefully check if there is room for simplifications / optimizations. 
I also cannot imagine that the above already fulfills these requirements. And if the PIT is disabled by the HPET, we need to clear the output explicitly as we inject the IRQ#0 under a different source ID than userspace HPET does (which will logically take over IRQ#0 control). The kernel would otherwise OR both sources to an incorrect result. I guess we need to double the hrtimer rate then in order to generate a square wave. It's getting ridiculous how accurate our model needs to be. I would suggest to solve this for the userspace model first, ensure that it works properly in all modes, maybe optimize it, and then decide how to map all this on kernel space. As long as we have two models, we can also make use of them. Thoughts about the 8254 PIT: First, this summary of (real) 8254 PIT behavior seems fairly good, as far it goes: On Tue, Sep 04, 2012 at 07:27:38PM +0100, Maciej W. Rozycki wrote: * The 8254 PIT is normally configured in mode 2 or 3 in the PC/AT architecture. In the former its output is high (active) all the time except from one (last) clock cycle. In the latter a wave that has a duty cycle close or equal to 0.5 (depending on whether the divider is odd or even) is produced, so no short pulses either. I don't remember the other four modes -- have a look at the datasheet if interested, but I reckon they're not really compatible with the wiring anyway, e.g. the gate is hardwired enabled. I've also just skimmed parts of the 8254 section of The Indispensable PC Hardware Book, by Hans-Peter Messmer, Copyright 1994 Addison-Wesley, although I probably ought to read it more carefully. http://download.intel.com/design/archives/periphrl/docs/23124406.pdf should be the primary reference - as long as it leaves no open questions. Under normal conditions, the 8254 part of the patch above should be indistinguishable from previous behavior. 
The 8259's IRR will still show up as 1 until the interrupt is actually serviced, and no new interrupt will be serviced after one is serviced until another edge is injected via the high-low-high transition of the new code. (Unless the guest resets the 8259 or maybe messes with IMR, but real hardware would generate extra interrupts in such cases as well.) The new code sounds much closer to mode 2 described by Maciej, compared to the old code - except the duty cycle is effectively 100 percent instead of 99.[some number of 9's] percent. - But there might be some concerns in abnormal conditions: * If some guest is actually depending on a 50 percent duty cycle (maybe some kind of polling rather than interrupts), I would expect it to be just as broken before this patch as after, unless it is really weird (handles
Re: [PATCH v7 3/3] KVM: perf: kvm events analysis tool
On 9/13/12 7:45 AM, Arnaldo Carvalho de Melo wrote: On Wed, Sep 12, 2012 at 10:56:44PM -0600, David Ahern wrote:

 static const char * const kvm_usage[] = {
+	"perf kvm [<options>] {top|record|report|diff|buildid-list|stat}",

The usage for the report/record sub commands of stat is never shown. e.g., $ perf kvm stat -- shows help for perf-stat $ perf kvm -- shows the above and perf-kvm's usage [I deleted this thread, so having to reply to one of my responses. Hopefully no one is unduly harmed by this.] I've been using this command a bit lately -- especially on nested virtualization -- and I think the syntax is quirky - meaning wrong. In my case I always follow up a record with a report and end up using a shell script wrapper that combines the 2 and running it repeatedly. e.g.,

$PERF kvm stat record -o $FILE -p $pid -- sleep $time
[ $? -eq 0 ] && $PERF --no-pager kvm -i $FILE stat report

As my daughter likes to say - awkward. That suggests what is really needed is a 'live' mode - a continual updating of the output like perf top, not a record and analyze later mode. Which does come back to why I responded to this email -- the syntax is clunky and awkward. So, I spent a fair amount of time today implementing a live mode. And after a lot of swearing at the tracepoint processing code I What kind of swearing? I'm working on 'perf test' entries for tracepoints to make sure we don't regress on the perf/libtraceevent junction, doing that as prep work for further simplifying tracepoint tools like sched, kvm, kmem, etc. Have you seen how the tracing initialization is done? Ugly. record generates tracing data events and report uses those to do the init so you can access the raw_data. 
I ended up writing this:

static int perf_kvm__tracing_init(void)
{
	struct tracing_data *tdata;
	char temp_file[] = "/tmp/perf-XXXXXX";
	int fd;

	fd = mkstemp(temp_file);
	if (fd < 0) {
		pr_err("mkstemp failed\n");
		return -1;
	}
	unlink(temp_file);
	tdata = tracing_data_get(&kvm_events.evlist->entries, fd, false);
	if (!tdata)
		return -1;
	lseek(fd, 0, SEEK_SET);
	(void) trace_report(fd, &kvm_events.session->pevent, false);
	tracing_data_put(tdata);
	return 0;
}

finally have it working. And the format extends easily (meaning day and the next step) to a perf-based kvm_stat replacement. Example syntax is: perf kvm stat [-p pid|-a|...] which defaults to an update delay of 1 second, and vmexit analysis. The guts of the processing logic come from the existing kvm-events code. The changes focus on combining the record and report paths into one. The display needs some help (Arnaldo?), but it seems to work well. I'd like to get opinions on what next? IMO, the record/report path should not get a foothold from a backward compatibility perspective, and we should not have to maintain those options. I am willing to take the existing patches into git to maintain authorship and from there apply patches to make the live mode work - which includes a bit of refactoring of perf code (like the stats changes). Before I march down this path, any objections, opinions, etc? Can I see the code? Let me clean it up over the weekend and send out an RFC for it. David
[PATCHv4] KVM: optimize apic interrupt delivery
Most interrupts are delivered to only one vcpu. Use pre-built tables to find the interrupt destination instead of looping through all vcpus. In logical mode, loop only through the vcpus in the logical cluster the irq is sent to.

Signed-off-by: Gleb Natapov g...@redhat.com
---
Changelog:
 v3->v4:
  * remove inline from recalculate_apic_map()
  * add BUG_ON() to apic_cluster_id()
  * add comments to non self explanatory kvm_apic_map fields
  * MST convinced me that we do not need to optimize low prio logical mode with one cpu as dst, so drop it
  * fix some typos and comments
  * remove unneeded cast
 v2->v3:
  * sparse annotation for rcu usage
  * move mutex above map
  * use mask/shift to calculate cluster/dst ids
  * use gotos
  * add comment about logic behind logical table creation
 v1->v2:
  * fix race Avi noticed
  * rcu_read_lock() out of the block as per Avi
  * fix rcu issues pointed to by MST. All but one. Still use call_rcu(). Do not think this is a serious issue. If it is, it should be solved by the RCU subsystem.
  * Fix phys_map overflow pointed to by MST
  * recalculate_apic_map() does not return an error any more. 
  * add optimization for low prio logical mode with one cpu as dst (it happens)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 64adb61..742f91b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -511,6 +511,16 @@ struct kvm_arch_memory_slot {
 	struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
 };

+struct kvm_apic_map {
+	struct rcu_head rcu;
+	u8 ldr_bits;
+	/* fields below are used to decode ldr values in different modes */
+	u32 cid_shift, cid_mask, lid_mask;
+	struct kvm_lapic *phys_map[256];
+	/* first index is cluster id second is cpu id in a cluster */
+	struct kvm_lapic *logical_map[16][16];
+};
+
 struct kvm_arch {
 	unsigned int n_used_mmu_pages;
 	unsigned int n_requested_mmu_pages;
@@ -528,6 +538,8 @@ struct kvm_arch {
 	struct kvm_ioapic *vioapic;
 	struct kvm_pit *vpit;
 	int vapics_in_nmi_mode;
+	struct mutex apic_map_lock;
+	struct kvm_apic_map *apic_map;

 	unsigned int tss_addr;
 	struct page *apic_access_page;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 07ad628..6e12ddd 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -139,11 +139,110 @@ static inline int apic_enabled(struct kvm_lapic *apic)
 	(LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
 	 APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)

+static inline int apic_x2apic_mode(struct kvm_lapic *apic)
+{
+	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
+}
+
 static inline int kvm_apic_id(struct kvm_lapic *apic)
 {
 	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
 }

+static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
+{
+	u16 cid;
+	ldr >>= 32 - map->ldr_bits;
+	cid = (ldr >> map->cid_shift) & map->cid_mask;
+
+	BUG_ON(cid >= ARRAY_SIZE(map->logical_map));
+
+	return cid;
+}
+
+static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
+{
+	ldr >>= (32 - map->ldr_bits);
+	return ldr & map->lid_mask;
+}
+
+static void recalculate_apic_map(struct kvm *kvm)
+{
+	struct kvm_apic_map *new, *old = NULL;
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
+
+	mutex_lock(&kvm->arch.apic_map_lock);
+
+	if (!new)
+		goto out;
+
+	new->ldr_bits = 8;
+	/* flat mode is default */
+	new->cid_shift = 8;
+	new->cid_mask = 0;
+	new->lid_mask = 0xff;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		struct kvm_lapic *apic = vcpu->arch.apic;
+		u16 cid, lid;
+		u32 ldr;
+
+		if (!kvm_apic_present(vcpu))
+			continue;
+
+		/*
+		 * All APICs have to be configured in the same mode by an OS.
+		 * We take advantage of this while building logical id lookup
+		 * table. After reset APICs are in xapic/flat mode, so if we
+		 * find apic with different setting we assume this is the mode
+		 * OS wants all apics to be in; build lookup table accordingly.
+		 */
+		if (apic_x2apic_mode(apic)) {
+			new->ldr_bits = 32;
+			new->cid_shift = 16;
+			new->cid_mask = new->lid_mask = 0xffff;
+		} else if (kvm_apic_sw_enabled(apic) &&
+				!new->cid_mask /* flat mode */ &&
+				kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER) {
+			new->cid_shift = 4;
+			new->cid_mask = 0xf;
+			new->lid_mask = 0xf;
+		}
+
Re: [PATCH v7 3/3] KVM: perf: kvm events analysis tool
On Thu, Sep 13, 2012 at 08:14:37AM -0600, David Ahern wrote: On 9/13/12 7:45 AM, Arnaldo Carvalho de Melo wrote: On Wed, Sep 12, 2012 at 10:56:44PM -0600, David Ahern wrote: So, I spent a fair amount of time today implementing a live mode. And after a lot of swearing at the tracepoint processing code I What kind of swearing? I'm working on 'perf test' entries for tracepoints to make sure we don't regress on the perf/libtraceevent junction, doing that as prep work for further simplifying tracepoint tools like sched, kvm, kmem, etc. Have you seen how the tracing initialization is done? Ugly. record generates tracing data events and report uses those to do the init so you can access the raw_data. I ended up writing this: And all we need is the list of fields so that we can use perf_evsel__{int,str}val like I did in my 'perf sched' patch series (in my perf/core branch), and even those accessors I'll tweak some more as we don't need to check the endianness of the events, it's in the same machine, etc. I'm trying to get by without using a 'pevent', just using 'event_format'; it's doable when everything is local, as a single machine top tool is. I want to just create the tracepoint events and process them like in 'top', using code more or less like what is in test__PERF_RECORD. This still needs more work, so I think you can continue in your path and eventually we'll have infrastructure to do it the way I'm describing, optimizing the case where the record and top are in the same machine, i.e. a short circuited 'live mode' with the top machinery completely reused for tools, be it written in C, like 'sched', 'kvm', 'kmem', etc, or in perl or python. 
- Arnaldo

static int perf_kvm__tracing_init(void)
{
	struct tracing_data *tdata;
	char temp_file[] = "/tmp/perf-XXXXXX";
	int fd;

	fd = mkstemp(temp_file);
	if (fd < 0) {
		pr_err("mkstemp failed\n");
		return -1;
	}
	unlink(temp_file);
	tdata = tracing_data_get(&kvm_events.evlist->entries, fd, false);
	if (!tdata)
		return -1;
	lseek(fd, 0, SEEK_SET);
	(void) trace_report(fd, &kvm_events.session->pevent, false);
	tracing_data_put(tdata);
	return 0;
}

finally have it working. And the format extends easily (meaning day and the next step) to a perf-based kvm_stat replacement. Example syntax is: perf kvm stat [-p pid|-a|...] which defaults to an update delay of 1 second, and vmexit analysis. The guts of the processing logic come from the existing kvm-events code. The changes focus on combining the record and report paths into one. The display needs some help (Arnaldo?), but it seems to work well. I'd like to get opinions on what next? IMO, the record/report path should not get a foothold from a backward compatibility perspective, and we should not have to maintain those options. I am willing to take the existing patches into git to maintain authorship and from there apply patches to make the live mode work - which includes a bit of refactoring of perf code (like the stats changes). Before I march down this path, any objections, opinions, etc? Can I see the code? Let me clean it up over the weekend and send out an RFC for it. David
Re: [PATCH] KVM: PPC: bookehv: Allow duplicate calls of DO_KVM macro
On 09/12/2012 03:18 PM, Mihai Caraman wrote: The current form of the DO_KVM macro restricts its use to one call per input parameter set. This is caused by the kvmppc_resume_\intno\()_\srr1 symbol definition. Duplicate calls of DO_KVM are required by distinct implementations of exception handlers which are delegated at runtime. Use a rare label number to avoid conflicts with the calling contexts. Signed-off-by: Mihai Caraman mihai.cara...@freescale.com Thanks, applied to kvm-ppc-next. Alex
Re: [Qemu-devel] [PATCH 1/2] KVM: fix i8259 interrupt high to low transition logic
On Thu, 13 Sep 2012, Jan Kiszka wrote: I've also just skimmed parts of the 8254 section of The Indispensable PC Hardware Book, by Hans-Peter Messmer, Copyright 1994 Addison-Wesley, although I probably ought to read it more carefully. http://download.intel.com/design/archives/periphrl/docs/23124406.pdf should be the primary reference - as long as it leaves no open questions. Oh, I'm glad they've put it online after all, so there's an ultimate place to refer to. I've only got a copy of this datasheet I got from Intel on a CD some 15 years ago. And for the record -- they used to publish the 8259A datasheet as well, but it appears to have gone from its place. However, it can be easily tracked down by an Internet search engine of your choice by referring to its order # as 231468.pdf (no revision number embedded in the file name, as there was none when it was originally published either). Maciej
Re: [PATCH 0/3] Prepare kvm for lto
On Thu, Sep 13, 2012 at 11:27:43AM +0300, Avi Kivity wrote: On 09/12/2012 10:17 PM, Andi Kleen wrote: On Wed, Sep 12, 2012 at 05:50:41PM +0300, Avi Kivity wrote: vmx.c has an lto-unfriendly bit, fix it up. While there, clean up our asm code. Avi Kivity (3): KVM: VMX: Make lto-friendly KVM: VMX: Make use of asm.h KVM: SVM: Make use of asm.h Works for me in my LTO build, thanks Avi. I cannot guarantee I always hit the unit splitting case, but it looks good so far. Actually I think patch 1 is missing a .global vmx_return. Ok can you add it please? It always depends how the LTO partitioner decides to split the subunits. I can run it with randomconfig in a loop over night. That's the best way I know to try to cover these cases. -Andi
Re: [PATCH v3] kvm/fpu: Enable fully eager restore kvm FPU
On Wed, Sep 12, 2012 at 04:10:24PM +0800, Xudong Hao wrote: Enable KVM FPU fully eager restore, if there is other FPU state which isn't tracked by CR0.TS bit. v3 changes from v2: - Make fpu active explicitly while guest xsave is enabling and non-lazy xstate bit exist. How about a guest_xcr0_can_lazy_saverestore bool to control this? It only needs to be updated when guest xcr0 is updated. That seems cleaner. Avi? v2 changes from v1: - Expand KVM_XSTATE_LAZY to 64 bits before negating it. Signed-off-by: Xudong Hao xudong@intel.com
---
 arch/x86/include/asm/kvm.h |  4
 arch/x86/kvm/vmx.c         |  2 ++
 arch/x86/kvm/x86.c         | 15 ++-
 3 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h
index 521bf25..4c27056 100644
--- a/arch/x86/include/asm/kvm.h
+++ b/arch/x86/include/asm/kvm.h
@@ -8,6 +8,8 @@
 #include <linux/types.h>
 #include <linux/ioctl.h>
+#include <asm/user.h>
+#include <asm/xsave.h>
 /* Select x86 specific features in linux/kvm.h */
 #define __KVM_HAVE_PIT
@@ -30,6 +32,8 @@
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
+#define KVM_XSTATE_LAZY	(XSTATE_FP | XSTATE_SSE | XSTATE_YMM)
+
 struct kvm_memory_alias {
 	__u32 slot;  /* this has a different namespace than memory slots */
 	__u32 flags;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 248c2b4..853e875 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3028,6 +3028,8 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	if (!vcpu->fpu_active)
 		hw_cr0 |= X86_CR0_TS | X86_CR0_MP;
+	else
+		hw_cr0 &= ~(X86_CR0_TS | X86_CR0_MP);
 	vmcs_writel(CR0_READ_SHADOW, cr0);
 	vmcs_writel(GUEST_CR0, hw_cr0);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 20f2266..183cf60 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -560,6 +560,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 		return 1;
 	if (xcr0 & ~host_xcr0)
 		return 1;
+	if (xcr0 & ~((u64)KVM_XSTATE_LAZY))
+		vcpu->fpu_active = 1;
 	vcpu->arch.xcr0 = xcr0;
 	vcpu->guest_xcr0_loaded = 0;
 	return 0;
@@ -5969,7 +5971,18 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 	vcpu->guest_fpu_loaded = 0;
 	fpu_save_init(&vcpu->arch.guest_fpu);
 	++vcpu->stat.fpu_reload;
-	kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
+	/*
+	 * Currently KVM trigger FPU restore by #NM (via CR0.TS),
+	 * till now only XCR0.bit0, XCR0.bit1, XCR0.bit2 is tracked
+	 * by TS bit, there might be other FPU state is not tracked
+	 * by TS bit. Here it only make FPU deactivate request and do
+	 * FPU lazy restore for these cases: 1)xsave isn't enabled
+	 * in guest, 2)all guest FPU states can be tracked by TS bit.
+	 * For others, doing fully FPU eager restore.
+	 */
+	if (!kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) ||
+	    !(vcpu->arch.xcr0 & ~((u64)KVM_XSTATE_LAZY)))
+		kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
 	trace_kvm_fpu(0);
 }
--
1.5.5
Re: [PATCH v3] kvm/fpu: Enable fully eager restore kvm FPU
On Thu, Sep 13, 2012 at 01:26:36PM -0300, Marcelo Tosatti wrote: On Wed, Sep 12, 2012 at 04:10:24PM +0800, Xudong Hao wrote: Enable KVM FPU fully eager restore, if there is other FPU state which isn't tracked by CR0.TS bit. v3 changes from v2: - Make fpu active explicitly while guest xsave is enabling and non-lazy xstate bit exist. How about a guest_xcr0_can_lazy_saverestore bool to control this? It only needs to be updated when guest xcr0 is updated. That seems cleaner. Avi? Reasoning below. v2 changes from v1: - Expand KVM_XSTATE_LAZY to 64 bits before negating it. Signed-off-by: Xudong Hao xudong@intel.com --- arch/x86/include/asm/kvm.h | 4 arch/x86/kvm/vmx.c | 2 ++ arch/x86/kvm/x86.c | 15 ++- 3 files changed, 20 insertions(+), 1 deletions(-) diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h index 521bf25..4c27056 100644 --- a/arch/x86/include/asm/kvm.h +++ b/arch/x86/include/asm/kvm.h @@ -8,6 +8,8 @@ #include <linux/types.h> #include <linux/ioctl.h> +#include <asm/user.h> +#include <asm/xsave.h> /* Select x86 specific features in linux/kvm.h */ #define __KVM_HAVE_PIT @@ -30,6 +32,8 @@ /* Architectural interrupt line count. */ #define KVM_NR_INTERRUPTS 256 +#define KVM_XSTATE_LAZY (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) + struct kvm_memory_alias { __u32 slot; /* this has a different namespace than memory slots */ __u32 flags; diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 248c2b4..853e875 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3028,6 +3028,8 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) if (!vcpu->fpu_active) hw_cr0 |= X86_CR0_TS | X86_CR0_MP; + else + hw_cr0 &= ~(X86_CR0_TS | X86_CR0_MP); vmcs_writel(CR0_READ_SHADOW, cr0); vmcs_writel(GUEST_CR0, hw_cr0); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 20f2266..183cf60 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -560,6 +560,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr) return 1; if (xcr0 & ~host_xcr0) return 1; + if (xcr0 & ~((u64)KVM_XSTATE_LAZY)) + vcpu->fpu_active = 1; This is confusing. The variable allows to decrease the number of places the decision is made. vcpu->arch.xcr0 = xcr0; vcpu->guest_xcr0_loaded = 0; return 0; @@ -5969,7 +5971,18 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) vcpu->guest_fpu_loaded = 0; fpu_save_init(&vcpu->arch.guest_fpu); ++vcpu->stat.fpu_reload; - kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); + /* + * Currently KVM trigger FPU restore by #NM (via CR0.TS), + * till now only XCR0.bit0, XCR0.bit1, XCR0.bit2 is tracked + * by TS bit, there might be other FPU state is not tracked + * by TS bit. Here it only make FPU deactivate request and do + * FPU lazy restore for these cases: 1)xsave isn't enabled + * in guest, 2)all guest FPU states can be tracked by TS bit. + * For others, doing fully FPU eager restore. + */ + if (!kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) || + !(vcpu->arch.xcr0 & ~((u64)KVM_XSTATE_LAZY))) + kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); trace_kvm_fpu(0); } -- 1.5.5
Re: [PATCH v3] kvm/fpu: Enable fully eager restore kvm FPU
On 09/12/2012 11:10 AM, Xudong Hao wrote: Enable KVM FPU fully eager restore, if there is other FPU state which isn't tracked by CR0.TS bit. v3 changes from v2: - Make fpu active explicitly while guest xsave is enabling and non-lazy xstate bit exist. v2 changes from v1: - Expand KVM_XSTATE_LAZY to 64 bits before negating it. diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 248c2b4..853e875 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3028,6 +3028,8 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) if (!vcpu->fpu_active) hw_cr0 |= X86_CR0_TS | X86_CR0_MP; + else + hw_cr0 &= ~(X86_CR0_TS | X86_CR0_MP); Why? The guest may wish to receive #NM faults. vmcs_writel(CR0_READ_SHADOW, cr0); vmcs_writel(GUEST_CR0, hw_cr0); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 20f2266..183cf60 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -560,6 +560,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr) return 1; if (xcr0 & ~host_xcr0) return 1; + if (xcr0 & ~((u64)KVM_XSTATE_LAZY)) + vcpu->fpu_active = 1; vcpu->arch.xcr0 = xcr0; vcpu->guest_xcr0_loaded = 0; return 0; @@ -5969,7 +5971,18 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) vcpu->guest_fpu_loaded = 0; fpu_save_init(&vcpu->arch.guest_fpu); ++vcpu->stat.fpu_reload; - kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); + /* + * Currently KVM trigger FPU restore by #NM (via CR0.TS), + * till now only XCR0.bit0, XCR0.bit1, XCR0.bit2 is tracked currently, till now, don't tell someone reading the code in six months anything. Just say how the code works. + * by TS bit, there might be other FPU state is not tracked + * by TS bit. Here it only make FPU deactivate request and do + * FPU lazy restore for these cases: 1)xsave isn't enabled + * in guest, 2)all guest FPU states can be tracked by TS bit. + * For others, doing fully FPU eager restore. + */ + if (!kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) || + !(vcpu->arch.xcr0 & ~((u64)KVM_XSTATE_LAZY))) + kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); trace_kvm_fpu(0); } -- error compiling committee.c: too many arguments to function
Re: [PATCH v3] kvm/fpu: Enable fully eager restore kvm FPU
On 09/13/2012 07:29 PM, Marcelo Tosatti wrote: On Thu, Sep 13, 2012 at 01:26:36PM -0300, Marcelo Tosatti wrote: On Wed, Sep 12, 2012 at 04:10:24PM +0800, Xudong Hao wrote: Enable KVM FPU fully eager restore, if there is other FPU state which isn't tracked by CR0.TS bit. v3 changes from v2: - Make fpu active explicitly while guest xsave is enabling and non-lazy xstate bit exist. How about a guest_xcr0_can_lazy_saverestore bool to control this? It only needs to be updated when guest xcr0 is updated. That seems cleaner. Avi? Reasoning below. v2 changes from v1: - Expand KVM_XSTATE_LAZY to 64 bits before negating it. Signed-off-by: Xudong Hao xudong@intel.com --- arch/x86/include/asm/kvm.h | 4 arch/x86/kvm/vmx.c | 2 ++ arch/x86/kvm/x86.c | 15 ++- 3 files changed, 20 insertions(+), 1 deletions(-) diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h index 521bf25..4c27056 100644 --- a/arch/x86/include/asm/kvm.h +++ b/arch/x86/include/asm/kvm.h @@ -8,6 +8,8 @@ #include <linux/types.h> #include <linux/ioctl.h> +#include <asm/user.h> +#include <asm/xsave.h> /* Select x86 specific features in linux/kvm.h */ #define __KVM_HAVE_PIT @@ -30,6 +32,8 @@ /* Architectural interrupt line count. */ #define KVM_NR_INTERRUPTS 256 +#define KVM_XSTATE_LAZY (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) + struct kvm_memory_alias { __u32 slot; /* this has a different namespace than memory slots */ __u32 flags; diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 248c2b4..853e875 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3028,6 +3028,8 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) if (!vcpu->fpu_active) hw_cr0 |= X86_CR0_TS | X86_CR0_MP; + else + hw_cr0 &= ~(X86_CR0_TS | X86_CR0_MP); vmcs_writel(CR0_READ_SHADOW, cr0); vmcs_writel(GUEST_CR0, hw_cr0); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 20f2266..183cf60 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -560,6 +560,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr) return 1; if (xcr0 & ~host_xcr0) return 1; + if (xcr0 & ~((u64)KVM_XSTATE_LAZY)) + vcpu->fpu_active = 1; This is confusing. The variable allows to decrease the number of places the decision is made. Better to have a helper function (lazy_fpu_allowed(), for example). Variables raise the question of whether they are maintained correctly. -- error compiling committee.c: too many arguments to function
Re: [PATCH 0/3] KVM: PPC: Memslot handling improvements
On 09/12/2012 01:26 AM, Paul Mackerras wrote: This series of 3 patches fixes up the memslot handling for Book3S HV style KVM on powerpc, making slot deletion and modification work and making sure we have the appropriate SRCU synchronization against updates. The series is against the next branch of the kvm tree. These patches have all been posted before, but I am reposting them now because Marcelo's patches that are a prerequisite for the third patch (2df72e9bc4, KVM: split kvm_arch_flush_shadow and 12d6e7538e, KVM: perform an invalid memslot step for gpa base change) have now gone into the kvm next branch. Thanks, applied all to kvm-ppc-next. Alex
Re: graphics card pci passthrough success report
On Thu, 2012-09-13 at 11:40 +0200, Lennert Buytenhek wrote: On Thu, Sep 13, 2012 at 07:55:00AM +0200, Gerd Hoffmann wrote: Hi, Hi, - Apply the patches at the end of this mail to kvm and SeaBIOS to allow for more BAR space under 4G. (The relevant BARs on the graphics cards _are_ 64 bit BARs, but kvm seemed to turn those into 32 bit BARs in the guest.) Which qemu/seabios versions have you used? qemu-1.2 (+ bundled seabios) should handle that just fine without patching. There is no fixed I/O window any more, all memory space above lowmem is available for pci, i.e. if you give 2G to your guest, everything above 0x80000000. And if there isn't enough address space below 4G (if you assign a lot of memory to your guest, so qemu keeps only the 0xe0000000 - 0xffffffff window free) seabios should try to map 64bit bars above 4G. This was some time ago, on (L)ubuntu 12.04, which has qemu-kvm 1.0 and seabios 0.6.2. We'll retry on a newer distro soon. - Apply the hacky patch at the end of this mail to SeaBIOS to always skip initialising the Radeon's option ROMs, or the VM would hang inside the Radeon option ROM if you boot the VM without the default cirrus video. A better way to handle that would probably be to add a pci passthrough config option to not expose the rom to the guest. Any clue *why* the rom doesn't run? No idea, we didn't look into that -- this was just a one-afternoon hacking session. Thanks for the report. Spawned by your success, I tested a Radeon HD 5450 using VFIO based device assignment. I can get it to work on Windows XP with no changes (from the version I'll post soon), but Win7 dies (I still need to play around more with your suggestions of cpu type). For skipping the option rom, is it sufficient to not expose it (rombar=0) or does the guest OS driver need it as well? Thanks, Alex
Re: graphics card pci passthrough success report
On Thu, Sep 13, 2012 at 11:05:07AM -0600, Alex Williamson wrote: - Apply the hacky patch at the end of this mail to SeaBIOS to always skip initialising the Radeon's option ROMs, or the VM would hang inside the Radeon option ROM if you boot the VM without the default cirrus video. A better way to handle that would probably be to add an pci passthrough config option to not expose the rom to the guest. Any clue *why* the rom doesn't run? No idea, we didn't look into that -- this was just a one afternoon hacking session. Thanks for the report. Spawned by your success, I tested a Radeon HD 5450 using VFIO based device assignment. I can get it to work on Windows XP, with no changes (from the version I'll post soon), Yay! but Win7 dies (still need to play around more with your suggestions of cpu type). ACK. That's a nasty one, don't ask how we found that out... Note that the bluescreen I described when the cpu type is wrong only actually happened for us if the AMD drivers were installed in the VM -- maybe you can try without AMD drivers to see if that makes it go away. If you still have a bluescreen without the AMD drivers installed, it's probably a different issue. For skipping the option rom, is it sufficient to not expose it (rombar=0) or does the guest OS driver need it as well? I don't actually know. Something to try out when we get round to testing this again, I suppose.. cheers, Lennert
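For reference, the `rombar=0` idea Alex raises maps to a generic qemu PCI device property: setting it hides the option-ROM BAR from the guest entirely, while `romfile=` substitutes a ROM image. A sketch of the invocations (the `host=01:00.0` address and the elided `...` options are placeholders, not from the thread):

```
# Hide the assigned GPU's option ROM from the guest (legacy PCI assignment):
qemu-system-x86_64 ... -device pci-assign,host=01:00.0,rombar=0

# Same idea with VFIO-based assignment:
qemu-system-x86_64 ... -device vfio-pci,host=01:00.0,rombar=0
```

Whether the guest driver later needs ROM contents anyway is exactly the open question left at the end of the thread.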
Re: qemu-kvm loops after kernel update
On 09/13/2012 11:59 AM, Avi Kivity wrote: On 09/12/2012 09:11 PM, Jiri Slaby wrote: On 09/12/2012 10:18 AM, Avi Kivity wrote: On 09/12/2012 11:13 AM, Jiri Slaby wrote: Please provide the output of vmxcap (http://goo.gl/c5lUO), Unrestricted guest: no The big real mode fixes. and a snapshot of kvm_stat while the guest is hung.

kvm statistics
 exits               6778198  615942
 host_state_reload      1988     187
 irq_exits              1523     138
 mmu_cache_miss            4       0
 fpu_reload                1       0

Please run this as root so we get the tracepoint based output; and press 'x' when it's running so we get more detailed output.

kvm statistics
 kvm_exit                       13798699  330708
 kvm_entry                      13799110  330708
 kvm_page_fault                 13793650  330604
 kvm_exit(EXCEPTION_NMI)         6188458  330604
 kvm_exit(EXTERNAL_INTERRUPT)       2169     105
 kvm_exit(TPR_BELOW_THRESHOLD)        82       0
 kvm_exit(IO_INSTRUCTION)              6       0

Strange, it's unable to fault in the very first page. Please provide a trace as per http://www.linux-kvm.org/page/Tracing (but append -e kvmmmu to the command line). Attached. Does it make sense? It wrote things like: failed to read event print fmt for kvm_mmu_unsync_page to the stderr.
thanks, -- js suse labs

version = 6
CPU 0 is empty
cpus=2
qemu-kvm-6170 [001] 457.811896: kvm_mmu_get_page: [FAILED TO PARSE] gfn=0 role=122882 root_count=0 unsync=0 created=1
qemu-kvm-6170 [001] 457.811899: kvm_mmu_get_page: [FAILED TO PARSE] gfn=262144 role=122882 root_count=0 unsync=0 created=1
qemu-kvm-6170 [001] 457.811900: kvm_mmu_get_page: [FAILED TO PARSE] gfn=524288 role=122882 root_count=0 unsync=0 created=1
qemu-kvm-6170 [001] 457.811902: kvm_mmu_get_page: [FAILED TO PARSE] gfn=786432 role=122882 root_count=0 unsync=0 created=1
qemu-kvm-6171 [001] 462.416705: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=786432 role=122882 root_count=1 unsync=0
qemu-kvm-6171 [001] 462.416712: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=524288 role=122882 root_count=1 unsync=0
qemu-kvm-6171 [001] 462.416715: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=262144 role=122882 root_count=1 unsync=0
qemu-kvm-6171 [001] 462.416717: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=0 role=122882 root_count=1 unsync=0
qemu-kvm-6171 [001] 462.485197: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=0 role=253954 root_count=0 unsync=0
qemu-kvm-6171 [001] 462.485202: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=262144 role=253954 root_count=0 unsync=0
qemu-kvm-6171 [001] 462.485205: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=524288 role=253954 root_count=0 unsync=0
qemu-kvm-6171 [001] 462.485209: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=786432 role=253954 root_count=0 unsync=0
Re: [PATCH v2 3/4] target-i386: Allow changing of Hypervisor CPUIDs.
On 09/12/12 13:55, Marcelo Tosatti wrote: The problem with integrating this is that it has little or no assurance from documentation. The Linux kernel source is a good source, then say accordingly to VMWare guest support code in version xyz in the changelog. I will work on getting a list of the documentation and sources used to generate this. Also extracting this information in a text file (or comment in the code) would be better than just adding code. I am not sure what information you are talking about here. Are you asking about the known Hypervisor CPUIDs, or what a lot of Linux versions look at to determine the Hypervisor they are on, or something else? On Tue, Sep 11, 2012 at 10:07:46AM -0400, Don Slutz wrote: This is primarily done so that the guest will think it is running under vmware when hypervisor-vendor=vmware is specified as a property of a cpu. Signed-off-by: Don Slutz d...@cloudswitch.com --- target-i386/cpu.c | 214 + target-i386/cpu.h | 21 + target-i386/kvm.c | 33 +++- 3 files changed, 262 insertions(+), 6 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 5f9866a..9f1f390 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1135,6 +1135,36 @@ static void x86_cpuid_set_model_id(Object *obj, const char *model_id, } } +static void x86_cpuid_set_vmware_extra(Object *obj) +{ + X86CPU *cpu = X86_CPU(obj); + + if ((cpu->env.tsc_khz != 0) && (cpu->env.cpuid_hv_level == CPUID_HV_LEVEL_VMARE_4) && (cpu->env.cpuid_hv_vendor1 == CPUID_HV_VENDOR_VMWARE_1) && (cpu->env.cpuid_hv_vendor2 == CPUID_HV_VENDOR_VMWARE_2) && (cpu->env.cpuid_hv_vendor3 == CPUID_HV_VENDOR_VMWARE_3)) { + const uint32_t apic_khz = 100L; + + /* + * From article.gmane.org/gmane.comp.emulators.kvm.devel/22643 + * + * Leaf 0x40000010, Timing Information. + * + * VMware has defined the first generic leaf to provide timing + * information. This leaf returns the current TSC frequency and + * current Bus frequency in kHz. + * + * # EAX: (Virtual) TSC frequency in kHz. + * # EBX: (Virtual) Bus (local apic timer) frequency in kHz. + * # ECX, EDX: RESERVED (Per above, reserved fields are set to zero). + */ + cpu->env.cpuid_hv_extra = 0x40000010; + cpu->env.cpuid_hv_extra_a = (uint32_t)cpu->env.tsc_khz; + cpu->env.cpuid_hv_extra_b = apic_khz; + } +} What happens in case you migrate the vmware guest to a host with different frequency? How is that transmitted to the vmware-guest-running-on-kvm? Or is migration not supported? As far as I know, it would be the same as for a non-vmware guest. http://lists.nongnu.org/archive/html/qemu-devel/2011-07/msg01656.html is related to this. I did not look to see if this has been done since then. All this change does is to allow the guest to read the tsc-frequency instead of trying to calculate it. I will look into the current state of migration when tsc_freq=X is specified. The machine I have been doing most of the testing on (Intel Xeon E3-1260L), when I add tsc_freq=2.0G or tsc_freq=2.4G, the guest does not see any difference in accel=kvm. +static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque, + const char *name, Error **errp) +{ + X86CPU *cpu = X86_CPU(obj); + uint32_t value; + + visit_type_uint32(v, &value, name, errp); + if (error_is_set(errp)) { + return; + } + + if ((value != 0) && (value < 0x40000000)) { + value += 0x40000000; + } + cpu->env.cpuid_hv_level = value; +} + +static char *x86_cpuid_get_hv_vendor(Object *obj, Error **errp) +{ + X86CPU *cpu = X86_CPU(obj); + CPUX86State *env = &cpu->env; + char *value; + int i; + + value = (char *)g_malloc(CPUID_VENDOR_SZ + 1); + for (i = 0; i < 4; i++) { + value[i + 0] = env->cpuid_hv_vendor1 >> (8 * i); + value[i + 4] = env->cpuid_hv_vendor2 >> (8 * i); + value[i + 8] = env->cpuid_hv_vendor3 >> (8 * i); + } + value[CPUID_VENDOR_SZ] = '\0'; + + /* Convert known names */ + if (!strcmp(value, CPUID_HV_VENDOR_VMWARE)) { + if (env->cpuid_hv_level == CPUID_HV_LEVEL_VMARE_4) { + pstrcpy(value, sizeof(value), "vmware4"); + } else if (env->cpuid_hv_level == CPUID_HV_LEVEL_VMARE_3) { + pstrcpy(value, sizeof(value), "vmware3"); + } + } else if (!strcmp(value, CPUID_HV_VENDOR_XEN) && env->cpuid_hv_level == CPUID_HV_LEVEL_XEN) { + pstrcpy(value, sizeof(value), "xen"); + } else if (!strcmp(value, CPUID_HV_VENDOR_KVM) && env->cpuid_hv_level == 0) { + pstrcpy(value, sizeof(value), "kvm"); + } + return value; +} + +static void x86_cpuid_set_hv_vendor(Object *obj, const char *value, + Error **errp) +{ + X86CPU
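The vendor-string handling above packs a 12-character hypervisor signature into three 32-bit registers, 4 bytes per register, least-significant byte first. A standalone sketch of the round-trip (the helper names are illustrative; the "KVMKVMKVM" signature and its EBX encoding are the well-known ones):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack a 12-byte hypervisor signature into EBX/ECX/EDX-style words. */
static void pack_hv_vendor(const char sig[12], uint32_t out[3])
{
    for (int w = 0; w < 3; w++) {
        out[w] = 0;
        for (int i = 0; i < 4; i++)
            out[w] |= (uint32_t)(unsigned char)sig[w * 4 + i] << (8 * i);
    }
}

/* Unpack back into a NUL-terminated string, as the property getter does. */
static void unpack_hv_vendor(const uint32_t in[3], char out[13])
{
    for (int i = 0; i < 4; i++) {
        out[i + 0] = (char)(in[0] >> (8 * i));
        out[i + 4] = (char)(in[1] >> (8 * i));
        out[i + 8] = (char)(in[2] >> (8 * i));
    }
    out[12] = '\0';
}
```

The getter in the patch is exactly the unpack direction: it shifts each vendor word right by `8 * i` and keeps the low byte.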
Re: Multi-dimensional Paging in Nested virtualization
Thanks a lot Nadav. This was really helpful. Siddhesh On Thu, Sep 13, 2012 at 3:49 AM, Nadav Har'El n...@math.technion.ac.il wrote: On Tue, Sep 11, 2012, siddhesh phadke wrote about Multi-dimensional Paging in Nested virtualization: I read the turtles project paper where they have explained how multi-dimensional page tables are built on L0. L2 is launched with empty EPT 0-2 and EPT 0-2 is built on-the-fly. I tried to find out how this is done in the kvm code but I could not find where EPT 0-2 is built. Nested EPT is not yet included in the mainline KVM. The original nested EPT code that we had written as part of the Turtles paper became obsolete when much of KVM's MMU code was rewritten. I have since rewritten the nested EPT code for the modern KVM. I sent the second (latest) version of these patches to the KVM mailing list in August, and you can find them in, for example, http://comments.gmane.org/gmane.comp.emulators.kvm.devel/95395 These patches were not yet accepted into KVM. They have bugs in various setups (which I have not yet found the time to fix, unfortunately), and some known issues found by Avi Kivity on this mailing list. Does L1 handle ept violation first and then L0 updates its EPT0-2? How is this done? This is explained in the turtles paper, but here's the short story: L1 defines an EPT table for L2 which we call EPT12. L0 builds from this an EPT02, with L1 addresses changed to L0. Now, when L2 runs and we get an EPT violation, we exit to L0 (in nested vmx, any exit first gets to L0). L0 checks if the translation is missing already in EPT12, and if it is, it emulates an exit into L1 - and injects the EPT violation into L1. But if the translation wasn't missing in EPT12, then it's L0's problem, and we just need to update EPT02. Can anybody give me some pointers about where to look in the code? Please look at the patches above. Each patch is also documented. Nadav.
--
Nadav Har'El                 | Thursday, Sep 13 2012, 26 Elul 5772
n...@math.technion.ac.il    | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il   | error compiling committee.c: too many arguments to function
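The exit-routing rule Nadav describes can be sketched as a toy model. A real implementation walks hardware EPT structures, but the decision itself is just a lookup in the L1-maintained table; every name and type here is illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 16
#define NO_XLATE UINT64_MAX

/* Toy single-level "EPT": maps a guest-physical page to a translation. */
struct toy_ept {
    uint64_t entry[TABLE_SIZE];   /* NO_XLATE = no translation present */
};

enum violation_action { REFLECT_TO_L1, FIX_IN_EPT02 };

/*
 * On an L2 EPT violation, L0 consults EPT12 (built by L1):
 *  - translation absent in EPT12  -> L1's fault to handle: reflect the
 *    violation by emulating an exit into L1;
 *  - translation present in EPT12 -> L0's problem: fill the shadow
 *    EPT02 entry and resume L2.
 */
static enum violation_action handle_l2_ept_violation(struct toy_ept *ept12,
                                                     struct toy_ept *ept02,
                                                     uint64_t gpa2)
{
    uint64_t gpa1 = ept12->entry[gpa2 % TABLE_SIZE];
    if (gpa1 == NO_XLATE)
        return REFLECT_TO_L1;
    /* A real L0 would translate L1-physical to L0-physical here;
     * the toy uses the value unchanged. */
    ept02->entry[gpa2 % TABLE_SIZE] = gpa1;
    return FIX_IN_EPT02;
}
```

This is the "multi-dimensional paging" shortcut: EPT02 is a cache of composed translations, refilled lazily on violations.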
[PATCH 1/2] Revert "mm: have order > 0 compaction start near a pageblock with free pages"
On Wed, 12 Sep 2012 17:46:15 +0100 Richard Davies rich...@arachsys.com wrote: Mel Gorman wrote: I see that this is an old-ish bug but I did not read the full history. Is it now booting faster than 3.5.0 was? I'm asking because I'm interested to see if commit c67fe375 helped your particular case. Yes, I think 3.6.0-rc5 is already better than 3.5.x but can still be improved, as discussed. Re-reading Mel's commit de74f1cc3b1e9730d9b58580cd11361d30cd182d, I believe it re-introduces the quadratic behaviour that the code was suffering from before, by not moving zone-compact_cached_free_pfn down when no more free pfns are found in a page block. This mail reverts that changeset, the next introduces what I hope to be the proper fix. Richard, would you be willing to give these patches a try, since your system seems to reproduce this bug easily? ---8--- Revert mm: have order 0 compaction start near a pageblock with free pages This reverts commit de74f1cc3b1e9730d9b58580cd11361d30cd182d. Mel found a real issue with my skip ahead logic in the compaction code, but unfortunately his approach appears to have re-introduced quadratic behaviour in that the value of zone-compact_cached_free_pfn is never advanced until the compaction run wraps around the start of the zone. This merely moved the starting point for the quadratic behaviour further into the zone, but the behaviour has still been observed. It looks like another fix is required. Signed-off-by: Rik van Riel r...@redhat.com Reported-by: Richard Davies rich...@daviesmail.org diff --git a/mm/compaction.c b/mm/compaction.c index 7fcd3a5..771775d 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -431,20 +431,6 @@ static bool suitable_migration_target(struct page *page) } /* - * Returns the start pfn of the last page block in a zone. This is the starting - * point for full compaction of a zone. 
Compaction searches for free pages from - * the end of each zone, while isolate_freepages_block scans forward inside each - * page block. - */ -static unsigned long start_free_pfn(struct zone *zone) -{ - unsigned long free_pfn; - free_pfn = zone-zone_start_pfn + zone-spanned_pages; - free_pfn = ~(pageblock_nr_pages-1); - return free_pfn; -} - -/* * Based on information in the current compact_control, find blocks * suitable for isolating free pages from and then isolate them. */ @@ -483,6 +469,17 @@ static void isolate_freepages(struct zone *zone, pfn -= pageblock_nr_pages) { unsigned long isolated; + /* +* Skip ahead if another thread is compacting in the area +* simultaneously. If we wrapped around, we can only skip +* ahead if zone-compact_cached_free_pfn also wrapped to +* above our starting point. +*/ + if (cc-order 0 (!cc-wrapped || + zone-compact_cached_free_pfn + cc-start_free_pfn)) + pfn = min(pfn, zone-compact_cached_free_pfn); + if (!pfn_valid(pfn)) continue; @@ -533,15 +530,7 @@ static void isolate_freepages(struct zone *zone, */ if (isolated) { high_pfn = max(high_pfn, pfn); - - /* -* If the free scanner has wrapped, update -* compact_cached_free_pfn to point to the highest -* pageblock with free pages. This reduces excessive -* scanning of full pageblocks near the end of the -* zone -*/ - if (cc-order 0 cc-wrapped) + if (cc-order 0) zone-compact_cached_free_pfn = high_pfn; } } @@ -551,11 +540,6 @@ static void isolate_freepages(struct zone *zone, cc-free_pfn = high_pfn; cc-nr_freepages = nr_freepages; - - /* If compact_cached_free_pfn is reset then set it now */ - if (cc-order 0 !cc-wrapped - zone-compact_cached_free_pfn == start_free_pfn(zone)) - zone-compact_cached_free_pfn = high_pfn; } /* @@ -642,6 +626,20 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone, return ISOLATE_SUCCESS; } +/* + * Returns the start pfn of the last page block in a zone. This is the starting + * point for full compaction of a zone. 
Compaction searches for free pages from + * the end of each zone, while isolate_freepages_block scans forward inside each + * page block. + */ +static unsigned long start_free_pfn(struct zone *zone) +{ + unsigned long free_pfn; + free_pfn = zone->zone_start_pfn + zone->spanned_pages; + free_pfn &= ~(pageblock_nr_pages-1); + return free_pfn; +} +
[PATCH 2/2] make the compaction skip ahead logic robust
Make the skip ahead logic in compaction resistant to compaction wrapping around to the end of the zone. This can lead to less efficient compaction when one thread has wrapped around to the end of the zone, and another simultaneous compactor has not done so yet. However, it should ensure that we do not suffer quadratic behaviour any more. Signed-off-by: Rik van Riel r...@redhat.com Reported-by: Richard Davies rich...@daviesmail.org diff --git a/mm/compaction.c b/mm/compaction.c index 771775d..0656759 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -431,6 +431,24 @@ static bool suitable_migration_target(struct page *page) } /* + * We scan the zone in a circular fashion, starting at + * zone-compact_cached_free_pfn. Be careful not to skip if + * one compacting thread has just wrapped back to the end of the + * zone, but another thread has not. + */ +static bool compaction_may_skip(struct zone *zone, + struct compact_control *cc) +{ + if (!cc-wrapped zone-compact_free_pfn cc-start_pfn) + return true; + + if (cc-wrapped zone_compact_free_pfn cc-start_pfn) + return true; + + return false; +} + +/* * Based on information in the current compact_control, find blocks * suitable for isolating free pages from and then isolate them. */ @@ -471,13 +489,9 @@ static void isolate_freepages(struct zone *zone, /* * Skip ahead if another thread is compacting in the area -* simultaneously. If we wrapped around, we can only skip -* ahead if zone-compact_cached_free_pfn also wrapped to -* above our starting point. +* simultaneously, and has finished with this page block. */ - if (cc-order 0 (!cc-wrapped || - zone-compact_cached_free_pfn - cc-start_free_pfn)) + if (cc-order 0 compaction_may_skip(zone, cc)) pfn = min(pfn, zone-compact_cached_free_pfn); if (!pfn_valid(pfn)) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v2 2/2] make the compaction skip ahead logic robust
Argh. And of course I send out the version from _before_ the compile test, instead of the one after! I am not used to caffeine any more and have had way too much tea... ---8--- Make the skip ahead logic in compaction resistant to compaction wrapping around to the end of the zone. This can lead to less efficient compaction when one thread has wrapped around to the end of the zone, and another simultaneous compactor has not done so yet. However, it should ensure that we do not suffer quadratic behaviour any more. Signed-off-by: Rik van Riel r...@redhat.com Reported-by: Richard Davies rich...@daviesmail.org diff --git a/mm/compaction.c b/mm/compaction.c index 771775d..0656759 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -431,6 +431,24 @@ static bool suitable_migration_target(struct page *page) } /* + * We scan the zone in a circular fashion, starting at + * zone->compact_cached_free_pfn. Be careful not to skip if + * one compacting thread has just wrapped back to the end of the + * zone, but another thread has not. + */ +static bool compaction_may_skip(struct zone *zone, + struct compact_control *cc) +{ + if (!cc->wrapped && zone->compact_cached_free_pfn < cc->start_free_pfn) + return true; + + if (cc->wrapped && zone->compact_cached_free_pfn > cc->start_free_pfn) + return true; + + return false; +} + +/* * Based on information in the current compact_control, find blocks * suitable for isolating free pages from and then isolate them. */ @@ -471,13 +489,9 @@ static void isolate_freepages(struct zone *zone, /* * Skip ahead if another thread is compacting in the area -* simultaneously. If we wrapped around, we can only skip -* ahead if zone->compact_cached_free_pfn also wrapped to -* above our starting point. +* simultaneously, and has finished with this page block.
*/ - if (cc->order > 0 && (!cc->wrapped || - zone->compact_cached_free_pfn > cc->start_free_pfn)) + if (cc->order > 0 && compaction_may_skip(zone, cc)) pfn = min(pfn, zone->compact_cached_free_pfn); if (!pfn_valid(pfn))
[PATCH v4 0/4] VFIO-based PCI device assignment
Here's an updated version of the VFIO PCI device assignment series. Now that we're targeting QEMU 1.3, I've opened up support so that vfio-pci is added to all softmmu targets supporting PCI on Linux hosts. Only some printf format changes were required to make this build. I also added a workaround for INTx support. Ideally we'd like to know when an EOI is written to the interrupt controller to know when to de-assert and unmask an interrupt, but as a substitute we can consider a BAR access to be a response to an interrupt and do the de-assert and unmask then. The device will re-assert the interrupt until it's been handled. The benefit is that the solution is generic, the drawback is that we can't make use of the mmap'd memory region in this mode. The memory API conveniently has a way to toggle enabling the mmap'd region that fits nicely with this usage. I've also added an x-intx=off option to disable INTx support for a device, which can be useful for devices that don't make use of any interrupts and for which the overhead of trapping BAR access is too high (graphics cards, including a Radeon HD 5450 which I was able to get working under WinXP with this version). This option should be considered experimental, thus the x- prefix. Future EOI acceleration should make this option unnecessary where KVM is available. I was also successful in passing through both a tg3 and e1000e NIC from an x86 host to a powerpc guest (g3beige) using this series. This guest machine doesn't appear to support MSI, so the INTx mechanism above is necessary to trigger an EOI. In addition to the series here, the code is available at: git://github.com/awilliam/qemu-vfio.git branch vfio-for-qemu as well as in the signed tag vfio-pci-for-qemu-v4.
Thanks,
Alex

---

Alex Williamson (4):
      vfio: Enable vfio-pci and mark supported
      vfio: vfio-pci device assignment driver
      Update Linux kernel headers
      Update kernel header script to include vfio

 MAINTAINERS                     |    5
 configure                       |    6
 hw/Makefile.objs                |    3
 hw/vfio_pci.c                   | 1860 +++
 hw/vfio_pci_int.h               |  114 ++
 linux-headers/linux/vfio.h      |  368
 scripts/update-linux-headers.sh |    2
 7 files changed, 2356 insertions(+), 2 deletions(-)
 create mode 100644 hw/vfio_pci.c
 create mode 100644 hw/vfio_pci_int.h
 create mode 100644 linux-headers/linux/vfio.h
[PATCH v4 1/4] Update kernel header script to include vfio
Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 scripts/update-linux-headers.sh |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index a639c5b..605102f 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -43,7 +43,7 @@ done
 rm -rf $output/linux-headers/linux
 mkdir -p $output/linux-headers/linux
-for header in kvm.h kvm_para.h vhost.h virtio_config.h virtio_ring.h; do
+for header in kvm.h kvm_para.h vfio.h vhost.h virtio_config.h virtio_ring.h; do
     cp $tmpdir/include/linux/$header $output/linux-headers/linux
 done
 rm -rf $output/linux-headers/asm-generic
[PATCH v4 2/4] Update Linux kernel headers
Based on Linux as of 1a95620. Signed-off-by: Alex Williamson alex.william...@redhat.com --- linux-headers/linux/vfio.h | 368 1 file changed, 368 insertions(+) create mode 100644 linux-headers/linux/vfio.h diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h new file mode 100644 index 000..f787b72 --- /dev/null +++ b/linux-headers/linux/vfio.h @@ -0,0 +1,368 @@ +/* + * VFIO API definition + * + * Copyright (C) 2012 Red Hat, Inc. All rights reserved. + * Author: Alex Williamson alex.william...@redhat.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ +#ifndef VFIO_H +#define VFIO_H + +#include linux/types.h +#include linux/ioctl.h + +#define VFIO_API_VERSION 0 + + +/* Kernel User level defines for VFIO IOCTLs. */ + +/* Extensions */ + +#define VFIO_TYPE1_IOMMU 1 + +/* + * The IOCTL interface is designed for extensibility by embedding the + * structure length (argsz) and flags into structures passed between + * kernel and userspace. We therefore use the _IO() macro for these + * defines to avoid implicitly embedding a size into the ioctl request. + * As structure fields are added, argsz will increase to match and flag + * bits will be defined to indicate additional fields with valid data. + * It's *always* the caller's responsibility to indicate the size of + * the structure passed by setting argsz appropriately. + */ + +#define VFIO_TYPE (';') +#define VFIO_BASE 100 + +/* IOCTLs for VFIO file descriptor (/dev/vfio/vfio) */ + +/** + * VFIO_GET_API_VERSION - _IO(VFIO_TYPE, VFIO_BASE + 0) + * + * Report the version of the VFIO API. This allows us to bump the entire + * API version should we later need to add or change features in incompatible + * ways. 
+ * Return: VFIO_API_VERSION + * Availability: Always + */ +#define VFIO_GET_API_VERSION _IO(VFIO_TYPE, VFIO_BASE + 0) + +/** + * VFIO_CHECK_EXTENSION - _IOW(VFIO_TYPE, VFIO_BASE + 1, __u32) + * + * Check whether an extension is supported. + * Return: 0 if not supported, 1 (or some other positive integer) if supported. + * Availability: Always + */ +#define VFIO_CHECK_EXTENSION _IO(VFIO_TYPE, VFIO_BASE + 1) + +/** + * VFIO_SET_IOMMU - _IOW(VFIO_TYPE, VFIO_BASE + 2, __s32) + * + * Set the iommu to the given type. The type must be supported by an + * iommu driver as verified by calling CHECK_EXTENSION using the same + * type. A group must be set to this file descriptor before this + * ioctl is available. The IOMMU interfaces enabled by this call are + * specific to the value set. + * Return: 0 on success, -errno on failure + * Availability: When VFIO group attached + */ +#define VFIO_SET_IOMMU _IO(VFIO_TYPE, VFIO_BASE + 2) + +/* IOCTLs for GROUP file descriptors (/dev/vfio/$GROUP) */ + +/** + * VFIO_GROUP_GET_STATUS - _IOR(VFIO_TYPE, VFIO_BASE + 3, + * struct vfio_group_status) + * + * Retrieve information about the group. Fills in provided + * struct vfio_group_info. Caller sets argsz. + * Return: 0 on succes, -errno on failure. + * Availability: Always + */ +struct vfio_group_status { + __u32 argsz; + __u32 flags; +#define VFIO_GROUP_FLAGS_VIABLE(1 0) +#define VFIO_GROUP_FLAGS_CONTAINER_SET (1 1) +}; +#define VFIO_GROUP_GET_STATUS _IO(VFIO_TYPE, VFIO_BASE + 3) + +/** + * VFIO_GROUP_SET_CONTAINER - _IOW(VFIO_TYPE, VFIO_BASE + 4, __s32) + * + * Set the container for the VFIO group to the open VFIO file + * descriptor provided. Groups may only belong to a single + * container. Containers may, at their discretion, support multiple + * groups. Only when a container is set are all of the interfaces + * of the VFIO file descriptor and the VFIO group file descriptor + * available to the user. + * Return: 0 on success, -errno on failure. 
+ * Availability: Always + */ +#define VFIO_GROUP_SET_CONTAINER _IO(VFIO_TYPE, VFIO_BASE + 4) + +/** + * VFIO_GROUP_UNSET_CONTAINER - _IO(VFIO_TYPE, VFIO_BASE + 5) + * + * Remove the group from the attached container. This is the + * opposite of the SET_CONTAINER call and returns the group to + * an initial state. All device file descriptors must be released + * prior to calling this interface. When removing the last group + * from a container, the IOMMU will be disabled and all state lost, + * effectively also returning the VFIO file descriptor to an initial + * state. + * Return: 0 on success, -errno on failure. + * Availability: When attached to container + */ +#define VFIO_GROUP_UNSET_CONTAINER _IO(VFIO_TYPE, VFIO_BASE + 5) + +/** + * VFIO_GROUP_GET_DEVICE_FD - _IOW(VFIO_TYPE, VFIO_BASE + 6, char) + * + * Return a new file descriptor for the device object described by + * the provided string. The
[PATCH v4 4/4] vfio: Enable vfio-pci and mark supported
Enabled for all softmmu guests supporting PCI on Linux hosts. Note that currently only x86 hosts have the kernel side VFIO IOMMU support for this. PPC (g3beige) is the only non-x86 guest known to work. ARM (veratile) hangs in firmware, others untested. Signed-off-by: Alex Williamson alex.william...@redhat.com --- MAINTAINERS |5 + configure|6 ++ hw/Makefile.objs |3 ++- 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 61f8b45..fd3eca0 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -474,6 +474,11 @@ M: Gerd Hoffmann kra...@redhat.com S: Maintained F: hw/usb* +VFIO +M: Alex Williamson alex.william...@redhat.com +S: Supported +F: hw/vfio* + vhost M: Michael S. Tsirkin m...@redhat.com S: Supported diff --git a/configure b/configure index 30be784..b56e61f 100755 --- a/configure +++ b/configure @@ -167,6 +167,7 @@ attr= libattr= xfs= +vfio_pci=no vhost_net=no kvm=no gprof=no @@ -528,6 +529,7 @@ Haiku) usb=linux kvm=yes vhost_net=yes + vfio_pci=yes if [ $cpu = i386 -o $cpu = x86_64 ] ; then audio_possible_drivers=$audio_possible_drivers fmod fi @@ -3180,6 +3182,7 @@ echo libiscsi support $libiscsi echo build guest agent $guest_agent echo seccomp support $seccomp echo coroutine backend $coroutine_backend +echo VFIO PCI support $vfio_pci if test $sdl_too_old = yes; then echo - Your SDL version is too old - please upgrade to have SDL support @@ -3921,6 +3924,9 @@ if test $target_softmmu = yes ; then if test $smartcard_nss = yes ; then echo subdir-$target: subdir-libcacard $config_host_mak fi + if test $vfio_pci = yes ; then +echo CONFIG_VFIO_PCI=y $config_target_mak + fi case $target_arch2 in i386|x86_64) echo CONFIG_HAVE_CORE_DUMP=y $config_target_mak diff --git a/hw/Makefile.objs b/hw/Makefile.objs index 6dfebd2..7f8d3e4 100644 --- a/hw/Makefile.objs +++ b/hw/Makefile.objs @@ -198,7 +198,8 @@ obj-$(CONFIG_VGA) += vga.o obj-$(CONFIG_SOFTMMU) += device-hotplug.o obj-$(CONFIG_XEN) += xen_domainbuild.o xen_machine_pv.o -# Inter-VM PCI 
shared memory +# Inter-VM PCI shared memory & VFIO PCI device assignment ifeq ($(CONFIG_PCI), y) obj-$(CONFIG_KVM) += ivshmem.o +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o endif
Re: [PATCH 0/5] vhost-scsi: Add support for host virtualized target
On Tue, 2012-09-11 at 12:36 +0800, Asias He wrote: Hello Nicholas, Hello Asias! On 09/07/2012 02:48 PM, Nicholas A. Bellinger wrote: From: Nicholas Bellinger n...@linux-iscsi.org Hello Anthony & Co, This is the fourth installment to add host virtualized target support for the mainline tcm_vhost fabric driver using Linux v3.6-rc into QEMU 1.3.0-rc. The series is available directly from the following git branch: git://git.kernel.org/pub/scm/virt/kvm/nab/qemu-kvm.git vhost-scsi-for-1.3 Note the code is cut against yesterday's QEMU head, and despite the name of the tree is based upon mainline qemu.org git code + has thus far been running overnight with 100K IOPs small block 4k workloads using v3.6-rc2+ based target code with RAMDISK_DR backstores. Are you still seeing the performance degradation discussed in the thread "vhost-scsi port to v1.1.0 + MSI-X performance regression"? So the performance regression reported here with QEMU v1.2-rc + virtio-scsi ended up being related to virtio interrupts being delivered across multiple CPUs. After explicitly setting the IRQ affinity of the virtio0-request MSI-X vector to a specific CPU, the small block (4k) mixed random I/O performance jumped back up to the expected ~100K IOPs for a single LUN. FYI, I just tried this again with the most recent QEMU v1.2.50 (v1.3-rc) code, and both cases appear to be performing as expected once again regardless of the explicit IRQ affinity setting. --nab
Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote: * Andrew Theurer haban...@linux.vnet.ibm.com [2012-09-11 13:27:41]: On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote: On 09/11/2012 01:42 AM, Andrew Theurer wrote: On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote: On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p) +{ + if (!curr-sched_class-yield_to_task) + return false; + + if (curr-sched_class != p-sched_class) + return false; Peter, Should we also add a check if the runq has a skip buddy (as pointed out by Raghu) and return if the skip buddy is already set. Oh right, I missed that suggestion.. the performance improvement went from 81% to 139% using this, right? It might make more sense to keep that separate, outside of this function, since its not a strict prerequisite. + if (task_running(p_rq, p) || p-state) + return false; + + return true; +} @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt) rq = this_rq(); again: + /* optimistic test to avoid taking locks */ + if (!__yield_to_candidate(curr, p)) + goto out_irq; + So add something like: /* Optimistic, if we 'raced' with another yield_to(), don't bother */ if (p_rq-cfs_rq-skip) goto out_irq; p_rq = task_rq(p); double_rq_lock(rq, p_rq); But I do have a question on this optimization though,.. Why do we check p_rq-cfs_rq-skip and not rq-cfs_rq-skip ? That is, I'd like to see this thing explained a little better. Does it go something like: p_rq is the runqueue of the task we'd like to yield to, rq is our own, they might be the same. If we have a -skip, there's nothing we can do about it, OTOH p_rq having a -skip and failing the yield_to() simply means us picking the next VCPU thread, which might be running on an entirely different cpu (rq) and could succeed? 
Here's two new versions, both include a __yield_to_candidate(): v3 uses the check for p_rq-curr in guest mode, and v4 uses the cfs_rq skip check. Raghu, I am not sure if this is exactly what you want implemented in v4. Andrew, Yes that is what I had. I think there was a mis-understanding. My intention was to if there is a directed_yield happened in runqueue (say rqA), do not bother to directed yield to that. But unfortunately as PeterZ pointed that would have resulted in setting next buddy of a different run queue than rqA. So we can drop this skip idea. Pondering more over what to do? can we use next buddy itself ... thinking.. As I mentioned earlier today, I did not have your changes from kvm.git tree when I tested my changes. Here are your changes and my changes compared: throughput in MB/sec kvm_vcpu_on_spin changes: 4636 +/- 15.74% yield_to changes: 4515 +/- 12.73% I would be inclined to stick with your changes which are kept in kvm code. I did try both combined, and did not get good results: both changes: 4074 +/- 19.12% So, having both is probably not a good idea. However, I feel like there's more work to be done. With no over-commit (10 VMs), total throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some overhead, but a reduction to ~4500 is still terrible. By contrast, 8-way VMs with 2x over-commit have a total throughput roughly 10% less than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread host). We still have what appears to be scalability problems, but now it's not so much in runqueue locks for yield_to(), but now get_pid_task(): Hi Andrew, IMHO, reducing the double runqueue lock overhead is a good idea, and may be we see the benefits when we increase the overcommit further. The explaination for not seeing good benefit on top of PLE handler optimization patch is because we filter the yield_to candidates, and hence resulting in less contention for double runqueue lock. 
and extra code overhead during genuine yield_to might have resulted in some degradation in the case you tested. However, did you use cfs.next also?. I hope it helps, when we combine. Here is the result that is showing positive benefit. I experimented on a 32 core (no HT) PLE machine with 32 vcpu guest(s). +---+---+---++---+ kernbench time in sec, lower is better +---+---+---++---+ base stddev patched stddev %improve
Re: [PATCH 4/5] virtio-scsi: Add start/stop functionality for vhost-scsi
On Tue, 2012-09-11 at 18:07 +0300, Michael S. Tsirkin wrote: On Tue, Sep 11, 2012 at 08:46:34AM -0500, Anthony Liguori wrote: On 09/10/2012 01:24 AM, Michael S. Tsirkin wrote: On Mon, Sep 10, 2012 at 08:16:54AM +0200, Paolo Bonzini wrote: Il 09/09/2012 00:40, Michael S. Tsirkin ha scritto: On Fri, Sep 07, 2012 at 06:00:50PM +0200, Paolo Bonzini wrote: SNIP Please create a completely separate device vhost-scsi-pci instead (or virtio-scsi-tcm-pci, or something like that). It is used completely differently from virtio-scsi-pci, it does not make sense to conflate the two. Ideally the name would say how it is different, not what backend it uses. Any good suggestions? I chose the backend name because, ideally, there would be no other difference. QEMU _could_ implement all the goodies in vhost-scsi (such as reservations or ALUA), it just doesn't do that yet. Paolo Then why do you say It is used completely differently from virtio-scsi-pci? Isn't it just a different backend? If yes then it should be a backend option, like it is for virtio-net. I don't mean to bike shed here so don't take this as a nack on making it a backend option, but in retrospect, the way we did vhost-net was a mistake even though I strongly advocated for it to be a backend option. The code to do it is really, really ugly. I think it would have made a lot more sense to just make it a device and then have it not use a netdev backend or any other kind of backend split. For instance: qemu -device vhost-net-pci,tapfd=X I know this breaks the model of separate backends and frontends but since vhost-net absolutely requires a tap fd, I think it's better in the long run to not abuse the netdev backend to prevent user confusion. Having a dedicated backend type that only has one possible option and can only be used by one device is a bit silly too. So I would be in favor of dropping/squashing 3/5 and radically simplifying how this was exposed to the user. 
I would just take qemu_vhost_scsi_opts and make them device properties. Regards, Anthony Liguori I'd like to clarify that I'm fine with either approach. Even a separate device is OK if this is what others want though I like it the least. Hi MST, Paolo Co, I've been out the better part of the week with the flu, and am just now catching up on emails from the last days.. So to better understand the reasoning for adding an separate PCI device for vhost-scsi ahead of implementing the code changes, here are main points from folk's comments: *) Convert vhost-scsi into a separate standalone vhost-scsi-pci device - Lets userspace know that virtio-scsi + QEMU block and virtio-scsi + tcm_vhost do not track SCSI state (such as reservations + ALUA), and hence are not interchangeable during live-migration. - Reduces complexity of adding vhost-scsi related logic into existing virtio-scsi-pci code path. - Having backends with one possible option doesn’t make much sense. *) Keep vhost-scsi as a backend to virtio-scsi-pci - Reduces duplicated code amongst multiple virtio-scsi backends. - Follows the split for what existing vhost-net code already does. So that said, two quick questions for Paolo Co.. For the standalone vhost-scsi-pci device case, can you give a brief idea as to what extent you'd like to see virtio-scsi.c code/defs duplicated and/or shared amongst a new vhost-scsi-pci device..? Also to help me along, can you give an example based on the current usage below how the QEMU command line arguments would change with a standalone vhost-scsi-pci device..? ./x86_64-softmmu/qemu-system-x86_64 -enable-kvm -smp 4 -m 2048 \ -hda /usr/src/qemu-vhost.git/debian_squeeze_amd64_standard-old.qcow2 \ -vhost-scsi id=vhost-scsi0,wwpn=naa.600140579ad21088,tpgt=1 \ -device virtio-scsi-pci,vhost-scsi=vhost-scsi0,event_idx=off Thank you! 
--nab
Re: [PATCH 1/2] KVM: PPC: Book3S HV: Get/set guest SPRs using the GET/SET_ONE_REG interface
On 12.09.2012, at 02:18, Paul Mackerras wrote: This enables userspace to get and set various SPRs (special-purpose registers) using the KVM_[GS]ET_ONE_REG ioctls. With this, userspace can get and set all the SPRs that are part of the guest state, either through the KVM_[GS]ET_REGS ioctls, the KVM_[GS]ET_SREGS ioctls, or the KVM_[GS]ET_ONE_REG ioctls. The SPRs that are added here are: - DABR: Data address breakpoint register - DSCR: Data stream control register - PURR: Processor utilization of resources register - SPURR: Scaled PURR - DAR: Data address register - DSISR: Data storage interrupt status register - AMR: Authority mask register - UAMOR: User authority mask override register - MMCR0, MMCR1, MMCRA: Performance monitor unit control registers - PMC1..PMC8: Performance monitor unit counter registers Signed-off-by: Paul Mackerras pau...@samba.org --- arch/powerpc/include/asm/kvm.h | 21 arch/powerpc/kvm/book3s_hv.c | 106 Documentation/virtual/kvm/api.txt | +++ :) Alex 2 files changed, 127 insertions(+) diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h index 3c14202..9557576 100644 --- a/arch/powerpc/include/asm/kvm.h +++ b/arch/powerpc/include/asm/kvm.h @@ -338,5 +338,26 @@ struct kvm_book3e_206_tlb_params { #define KVM_REG_PPC_IAC4 (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x5) #define KVM_REG_PPC_DAC1 (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x6) #define KVM_REG_PPC_DAC2 (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x7) +#define KVM_REG_PPC_DABR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x8) +#define KVM_REG_PPC_DSCR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x9) +#define KVM_REG_PPC_PURR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa) +#define KVM_REG_PPC_SPURR(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb) +#define KVM_REG_PPC_DAR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc) +#define KVM_REG_PPC_DSISR(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xd) +#define KVM_REG_PPC_AMR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xe) +#define KVM_REG_PPC_UAMOR(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xf) + +#define 
KVM_REG_PPC_MMCR0(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x10) +#define KVM_REG_PPC_MMCR1(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x11) +#define KVM_REG_PPC_MMCRA(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12) + +#define KVM_REG_PPC_PMC1 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x18) +#define KVM_REG_PPC_PMC2 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x19) +#define KVM_REG_PPC_PMC3 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1a) +#define KVM_REG_PPC_PMC4 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1b) +#define KVM_REG_PPC_PMC5 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1c) +#define KVM_REG_PPC_PMC6 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1d) +#define KVM_REG_PPC_PMC7 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1e) +#define KVM_REG_PPC_PMC8 (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1f) #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 83e929e..7fe5c9a 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -538,11 +538,53 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu, int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu, struct kvm_one_reg *reg) { int r = -EINVAL; + long int i; switch (reg-id) { case KVM_REG_PPC_HIOR: r = put_user(0, (u64 __user *)reg-addr); break; + case KVM_REG_PPC_DABR: + r = put_user(vcpu-arch.dabr, (u64 __user *)reg-addr); + break; + case KVM_REG_PPC_DSCR: + r = put_user(vcpu-arch.dscr, (u64 __user *)reg-addr); + break; + case KVM_REG_PPC_PURR: + r = put_user(vcpu-arch.purr, (u64 __user *)reg-addr); + break; + case KVM_REG_PPC_SPURR: + r = put_user(vcpu-arch.spurr, (u64 __user *)reg-addr); + break; + case KVM_REG_PPC_DAR: + r = put_user(vcpu-arch.shregs.dar, (u64 __user *)reg-addr); + break; + case KVM_REG_PPC_DSISR: + r = put_user(vcpu-arch.shregs.dsisr, (u32 __user *)reg-addr); + break; + case KVM_REG_PPC_AMR: + r = put_user(vcpu-arch.amr, (u64 __user *)reg-addr); + break; + case KVM_REG_PPC_UAMOR: + r = put_user(vcpu-arch.uamor, (u64 __user *)reg-addr); + break; + case KVM_REG_PPC_MMCR0: + case KVM_REG_PPC_MMCR1: + 
case KVM_REG_PPC_MMCRA: + i = reg-id - KVM_REG_PPC_MMCR0; + r = put_user(vcpu-arch.mmcr[i], (u64 __user *)reg-addr); + break; + case KVM_REG_PPC_PMC1: + case KVM_REG_PPC_PMC2: + case KVM_REG_PPC_PMC3: + case KVM_REG_PPC_PMC4: + case KVM_REG_PPC_PMC5: + case KVM_REG_PPC_PMC6: + case KVM_REG_PPC_PMC7: + case KVM_REG_PPC_PMC8: + i = reg-id -
Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions
On 12.09.2012, at 02:19, Paul Mackerras wrote: Currently the KVM_GET_FPU and KVM_SET_FPU ioctls return an EOPNOTSUPP error on powerpc. This implements them for Book 3S processors. Since those processors have more than just the 32 basic floating-point registers, this extends the kvm_fpu structure to have space for the additional registers -- the 32 vector registers (128 bits each) for VMX/Altivec and the 32 additional 64-bit registers that were added on POWER7 for the vector-scalar extension (VSX). It also adds a `valid' field, which is a bitmap indicating which elements contain valid data. The layout of the floating-point register data in the vcpu struct is mostly the same between different flavors of KVM on Book 3S processors, but the set of supported registers may differ depending on what the CPU hardware supports and how much is emulated. Therefore we have a flavor-specific function to work out which set of registers to supply for the get function. On POWER7 processors using the Book 3S HV flavor of KVM, we save the standard floating-point registers together with their corresponding VSX extension register in the vcpu-arch.vsr[] array, since each pair can be loaded or stored with one instruction. This is different to other flavors of KVM, and to other processors (i.e. PPC970) with HV KVM, which store the standard FPRs in vcpu-arch.fpr[]. To cope with this, we use the kvmppc_core_get_fpu_valid() and kvmppc_core_set_fpu_valid() functions to sync between the arch.fpr[] and arch.vsr[] arrays as needed. Signed-off-by: Paul Mackerras pau...@samba.org Any reason to not use ONE_REG here? Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] KVM: PPC: Book3S HV: More flexible allocator for linear memory
On 12.09.2012, at 02:34, Paul Mackerras wrote: This series of 3 patches makes it possible for guests to allocate whatever size of HPT they need from linear memory preallocated at boot, rather than being restricted to a single size of HPT (by default, 16MB) and having to use the kernel page allocator for anything else -- which in practice limits them to at most 16MB given the default value for the maximum page order. Instead of allocating many individual pieces of memory, this allocates a single contiguous area and uses a simple bitmap-based allocator to hand out pieces of it as required. Have you tried to play with CMA for this? It sounds like it could buy us exactly what we need. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions
On Fri, Sep 14, 2012 at 01:30:51AM +0200, Alexander Graf wrote: On 12.09.2012, at 02:19, Paul Mackerras wrote: Currently the KVM_GET_FPU and KVM_SET_FPU ioctls return an EOPNOTSUPP error on powerpc. This implements them for Book 3S processors. Since those processors have more than just the 32 basic floating-point registers, this extends the kvm_fpu structure to have space for the additional registers -- the 32 vector registers (128 bits each) for VMX/Altivec and the 32 additional 64-bit registers that were added on POWER7 for the vector-scalar extension (VSX). It also adds a `valid' field, which is a bitmap indicating which elements contain valid data. The layout of the floating-point register data in the vcpu struct is mostly the same between different flavors of KVM on Book 3S processors, but the set of supported registers may differ depending on what the CPU hardware supports and how much is emulated. Therefore we have a flavor-specific function to work out which set of registers to supply for the get function. On POWER7 processors using the Book 3S HV flavor of KVM, we save the standard floating-point registers together with their corresponding VSX extension register in the vcpu->arch.vsr[] array, since each pair can be loaded or stored with one instruction. This is different to other flavors of KVM, and to other processors (i.e. PPC970) with HV KVM, which store the standard FPRs in vcpu->arch.fpr[]. To cope with this, we use the kvmppc_core_get_fpu_valid() and kvmppc_core_set_fpu_valid() functions to sync between the arch.fpr[] and arch.vsr[] arrays as needed. Signed-off-by: Paul Mackerras pau...@samba.org Any reason to not use ONE_REG here? Just consistency with x86 -- they have an xmm[][] field in their struct kvm_fpu which looks like it contains their vector state. Paul.
Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions
On 14.09.2012, at 01:58, Paul Mackerras wrote: On Fri, Sep 14, 2012 at 01:30:51AM +0200, Alexander Graf wrote: On 12.09.2012, at 02:19, Paul Mackerras wrote: Currently the KVM_GET_FPU and KVM_SET_FPU ioctls return an EOPNOTSUPP error on powerpc. This implements them for Book 3S processors. Since those processors have more than just the 32 basic floating-point registers, this extends the kvm_fpu structure to have space for the additional registers -- the 32 vector registers (128 bits each) for VMX/Altivec and the 32 additional 64-bit registers that were added on POWER7 for the vector-scalar extension (VSX). It also adds a `valid' field, which is a bitmap indicating which elements contain valid data. The layout of the floating-point register data in the vcpu struct is mostly the same between different flavors of KVM on Book 3S processors, but the set of supported registers may differ depending on what the CPU hardware supports and how much is emulated. Therefore we have a flavor-specific function to work out which set of registers to supply for the get function. On POWER7 processors using the Book 3S HV flavor of KVM, we save the standard floating-point registers together with their corresponding VSX extension register in the vcpu->arch.vsr[] array, since each pair can be loaded or stored with one instruction. This is different to other flavors of KVM, and to other processors (i.e. PPC970) with HV KVM, which store the standard FPRs in vcpu->arch.fpr[]. To cope with this, we use the kvmppc_core_get_fpu_valid() and kvmppc_core_set_fpu_valid() functions to sync between the arch.fpr[] and arch.vsr[] arrays as needed. Signed-off-by: Paul Mackerras pau...@samba.org Any reason to not use ONE_REG here? Just consistency with x86 -- they have an xmm[][] field in their struct kvm_fpu which looks like it contains their vector state. 
Yup, Considering how different the FPU state on different ppc cores is, I'd be more happy with shoving it into something that allows for more dynamic control. Otherwise we'd end up with yet another struct sregs that can contain SPE registers, altivec, and a dozen additions to it :). Please just use one_reg for all of the register synchronization you want to add, unless there's a compelling reason to do it differently. It will make our lives a lot easier in the future. If we need to transfer too much data and actually run into performance trouble, we can always add a GET_MANY_REG ioctl. Alex
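For reference, the ONE_REG interface Alex is advocating passes one (register id, buffer address) pair per call rather than one large fixed struct. A self-contained sketch of that calling convention, with a toy register file standing in for the kernel side -- the struct layout mirrors struct kvm_one_reg from <linux/kvm.h>, but everything else here is illustrative:

```c
#include <stdint.h>
#include <string.h>

/* Same shape as struct kvm_one_reg in the KVM uapi headers. */
struct one_reg {
	uint64_t id;    /* which register: encodes size, arch, index */
	uint64_t addr;  /* userspace buffer the kernel reads/writes  */
};

/* Toy register file standing in for vcpu state; in real KVM these
 * functions are the KVM_SET_ONE_REG / KVM_GET_ONE_REG ioctls. */
static uint64_t regfile[4];

static int set_one_reg_sim(const struct one_reg *r)
{
	if (r->id >= 4)
		return -1;
	memcpy(&regfile[r->id], (void *)(uintptr_t)r->addr, sizeof(uint64_t));
	return 0;
}

static int get_one_reg_sim(const struct one_reg *r)
{
	if (r->id >= 4)
		return -1;
	memcpy((void *)(uintptr_t)r->addr, &regfile[r->id], sizeof(uint64_t));
	return 0;
}
```

Because each register is addressed individually, new register classes (SPE, Altivec, VSX, ...) only add id definitions, never a new struct, which is the flexibility being argued for in this thread.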
Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions
On Fri, Sep 14, 2012 at 02:03:15AM +0200, Alexander Graf wrote: Yup, Considering how different the FPU state on different ppc cores is, I'd be more happy with shoving it into something that allows for more dynamic control. Otherwise we'd end up with yet another struct sregs that can contain SPE registers, altivec, and a dozen additions to it :). Please just use one_reg for all of the register synchronization you want to add, unless there's a compelling reason to do it differently. It will make our lives a lot easier in the future. If we need to transfer too much data and actually run into performance trouble, we can always add a GET_MANY_REG ioctl. It just seems perverse to ignore the existing interface that every other architecture uses, and instead do something unique that is actually slower, but whatever... Paul.
Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions
On 14.09.2012, at 03:36, Paul Mackerras wrote: On Fri, Sep 14, 2012 at 02:03:15AM +0200, Alexander Graf wrote: Yup, Considering how different the FPU state on different ppc cores is, I'd be more happy with shoving it into something that allows for more dynamic control. Otherwise we'd end up with yet another struct sregs that can contain SPE registers, altivec, and a dozen additions to it :). Please just use one_reg for all of the register synchronization you want to add, unless there's a compelling reason to do it differently. It will make our lives a lot easier in the future. If we need to transfer too much data and actually run into performance trouble, we can always add a GET_MANY_REG ioctl. It just seems perverse to ignore the existing interface that every other architecture uses, and instead do something unique that is actually slower, but whatever... We're slowly moving towards ONE_REG. ARM is already going full steam ahead and I'd like to have every new register in PPC be modeled with it as well. The old interface broke on us one time too often now :). As I said, if we run into performance problems, we will implement ways to improve performance. At the end of the day, x86 will be the odd one out. Alex
Re: [PATCH 2/2] KVM: PPC: Book3S: Implement floating-point state get/set functions
On 14.09.2012, at 03:44, Alexander Graf wrote: On 14.09.2012, at 03:36, Paul Mackerras wrote: On Fri, Sep 14, 2012 at 02:03:15AM +0200, Alexander Graf wrote: Yup, Considering how different the FPU state on different ppc cores is, I'd be more happy with shoving it into something that allows for more dynamic control. Otherwise we'd end up with yet another struct sregs that can contain SPE registers, altivec, and a dozen additions to it :). Please just use one_reg for all of the register synchronization you want to add, unless there's a compelling reason to do it differently. It will make our lives a lot easier in the future. If we need to transfer too much data and actually run into performance trouble, we can always add a GET_MANY_REG ioctl. It just seems perverse to ignore the existing interface that every other architecture uses, and instead do something unique that is actually slower, but whatever... We're slowly moving towards ONE_REG. ARM is already going full steam ahead and I'd like to have every new register in PPC be modeled with it as well. The old interface broke on us one time too often now :). As I said, if we run into performance problems, we will implement ways to improve performance. At the end of the day, x86 will be the odd one out. (plus your patch breaks ABI compatibility with old user space) Alex
Re: [PATCH v7 3/3] KVM: perf: kvm events analysis tool
On 09/13/2012 12:56 PM, David Ahern wrote: That suggests what is really needed is a 'live' mode - a continual updating of the output like perf top, not a record and analyze later mode. Which does come back to why I responded to this email -- the syntax is klunky and awkward. So, I spent a fair amount of time today implementing a live mode. And after a lot of swearing at the tracepoint processing code I finally have it working. And the format extends easily (meaning day and the next step) to a perf-based kvm_stat replacement. Example syntax is: perf kvm stat [-p pid|-a|...] which defaults to an update delay of 1 second, and vmexit analysis. Hi David, I am very glad to see the live mode; it is very similar to kvm_stat (*). I think the kvm guys will like it. The guts of the processing logic come from the existing kvm-events code. The changes focus on combining the record and report paths into one. The display needs some help (Arnaldo?), but it seems to work well. I'd like to get opinions on what next? IMO, the record/report path should not get a foothold from a backward compatibility perspective and having to maintain those options. I am willing to take the existing patches into git to maintain authorship and from there apply patches to make the live mode work - which includes a bit of refactoring of perf code (like the stats changes). We'd better keep the record/report functionality: sometimes we can only get a perf.data file from customers whose machines we cannot reach. In particular, other tracepoints are also interesting to us when customers hit a performance issue; we often ask them to use perf kvm stat -e xxx to record additional events, like lock:*. Then we can get not only the kvm event information via 'perf kvm stat' but also other information via 'perf lock' or 'perf script' to see the whole sequence. Before I march down this path, any objections, opinions, etc? 
And, I think live mode would also be useful for 'perf lock/sched'; could you implement it in the perf core? By the way, the new version of our patchset is ready; do you want to add your implementation after it is accepted by Arnaldo, or are you going to post it together with our patchset? Thanks! * kvm_stat can be found at scripts/kvm/kvm_stat in the QEMU source tree, located at https://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git.
Re: memory-hotplug : possible circular locking dependency detected
At 09/13/2012 02:19 PM, Yasuaki Ishimatsu Wrote: When I offline memory on linux-3.6-rc5, possible circular locking dependency detected messages are shown. Are the messages a known problem? It is a known problem, but it doesn't cause a deadlock. There are three locks involved: memory hotplug's lock, the memory hotplug notifier's lock, and ksm_thread_mutex. ksm_thread_mutex is locked when the memory is going offline and is unlocked when the memory is offlined or the offlining is cancelled. So we get the warning messages. But it doesn't cause a deadlock, because we lock mem_hotplug_mutex first. Thanks Wen Congyang
[ 201.596363] Offlined Pages 32768
[ 201.596373] remove from free list 14 1024 148000
[ 201.596493] remove from free list 140400 1024 148000
[ 201.596612] remove from free list 140800 1024 148000
[ 201.596730] remove from free list 140c00 1024 148000
[ 201.596849] remove from free list 141000 1024 148000
[ 201.596968] remove from free list 141400 1024 148000
[ 201.597049] remove from free list 141800 1024 148000
[ 201.597049] remove from free list 141c00 1024 148000
[ 201.597049] remove from free list 142000 1024 148000
[ 201.597049] remove from free list 142400 1024 148000
[ 201.597049] remove from free list 142800 1024 148000
[ 201.597049] remove from free list 142c00 1024 148000
[ 201.597049] remove from free list 143000 1024 148000
[ 201.597049] remove from free list 143400 1024 148000
[ 201.597049] remove from free list 143800 1024 148000
[ 201.597049] remove from free list 143c00 1024 148000
[ 201.597049] remove from free list 144000 1024 148000
[ 201.597049] remove from free list 144400 1024 148000
[ 201.597049] remove from free list 144800 1024 148000
[ 201.597049] remove from free list 144c00 1024 148000
[ 201.597049] remove from free list 145000 1024 148000
[ 201.597049] remove from free list 145400 1024 148000
[ 201.597049] remove from free list 145800 1024 148000
[ 201.597049] remove from free list 145c00 1024 148000
[ 201.597049] remove from free list 146000 1024 148000
[ 201.597049] remove from free list 146400 1024 148000
[ 201.597049] remove from free list 146800 1024 148000
[ 201.597049] remove from free list 146c00 1024 148000
[ 201.597049] remove from free list 147000 1024 148000
[ 201.597049] remove from free list 147400 1024 148000
[ 201.597049] remove from free list 147800 1024 148000
[ 201.597049] remove from free list 147c00 1024 148000
[ 201.602143]
[ 201.602150] ==
[ 201.602153] [ INFO: possible circular locking dependency detected ]
[ 201.602157] 3.6.0-rc5 #1 Not tainted
[ 201.602159] ---
[ 201.602162] bash/2789 is trying to acquire lock:
[ 201.602164] ((memory_chain).rwsem){.+.+.+}, at: [8109fe16] __blocking_notifier_call_chain+0x66/0xd0
[ 201.602180]
[ 201.602180] but task is already holding lock:
[ 201.602182] (ksm_thread_mutex/1){+.+.+.}, at: [811b41fa] ksm_memory_callback+0x3a/0xc0
[ 201.602194]
[ 201.602194] which lock already depends on the new lock.
[ 201.602194]
[ 201.602197]
[ 201.602197] the existing dependency chain (in reverse order) is:
[ 201.602200]
[ 201.602200] - #1 (ksm_thread_mutex/1){+.+.+.}:
[ 201.602208] [810dbee9] validate_chain+0x6d9/0x7e0
[ 201.602214] [810dc2e6] __lock_acquire+0x2f6/0x4f0
[ 201.602219] [810dc57d] lock_acquire+0x9d/0x190
[ 201.602223] [8166b4fc] __mutex_lock_common+0x5c/0x420
[ 201.602229] [8166ba2a] mutex_lock_nested+0x4a/0x60
[ 201.602234] [811b41fa] ksm_memory_callback+0x3a/0xc0
[ 201.602239] [81673447] notifier_call_chain+0x67/0x150
[ 201.602244] [8109fe2b] __blocking_notifier_call_chain+0x7b/0xd0
[ 201.602250] [8109fe96] blocking_notifier_call_chain+0x16/0x20
[ 201.602255] [8144c53b] memory_notify+0x1b/0x20
[ 201.602261] [81653c51] offline_pages+0x1b1/0x470
[ 201.602267] [811bfcae] remove_memory+0x1e/0x20
[ 201.602273] [8144c661] memory_block_action+0xa1/0x190
[ 201.602278] [8144c7c9] memory_block_change_state+0x79/0xe0
[ 201.602282] [8144c8f2] store_mem_state+0xc2/0xd0
[ 201.602287] [81436980] dev_attr_store+0x20/0x30
[ 201.602293] [812498d3] sysfs_write_file+0xa3/0x100
[ 201.602299] [811cba80] vfs_write+0xd0/0x1a0
[ 201.602304] [811cbc54] sys_write+0x54/0xa0
[ 201.602309] [81678529] system_call_fastpath+0x16/0x1b
[ 201.602315]
[ 201.602315] - #0 ((memory_chain).rwsem){.+.+.+}:
[ 201.602322] [810db7e7] check_prev_add+0x527/0x550
[ 201.602326] [810dbee9] validate_chain+0x6d9/0x7e0
[ 201.602331] [810dc2e6]
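Wen's point can be illustrated with a toy model of the locking in offline_pages(): the notifier callback takes ksm_thread_mutex while the notifier chain's rwsem is held, and the chain's rwsem is later re-taken for MEM_OFFLINE while ksm_thread_mutex is still held -- the inversion lockdep flags. Both orders, however, happen under the outer memory-hotplug mutex, so two threads can never hold the pair in opposite orders. All names here are stand-ins, not kernel code:

```c
/* Minimal lock-order model of the offline path in the report. */
enum lock_id { MEM_HOTPLUG, MEMORY_CHAIN, KSM_THREAD, NLOCKS };

static int held[NLOCKS];

static void take(enum lock_id l)    { held[l] = 1; }
static void release(enum lock_id l) { held[l] = 0; }

/* Simulate offline_pages(): MEM_GOING_OFFLINE leaves ksm_thread_mutex
 * held, then MEM_OFFLINE re-takes the chain's rwsem. Returns nonzero
 * if the "inverted" acquisition happened under the outer mutex. */
static int offline_pages_sim(void)
{
	int serialized;

	take(MEM_HOTPLUG);         /* memory hotplug's outer lock      */
	take(MEMORY_CHAIN);        /* memory_notify(MEM_GOING_OFFLINE) */
	take(KSM_THREAD);          /* ksm_memory_callback() locks it   */
	release(MEMORY_CHAIN);

	/* ... pages are isolated and removed from the free lists ... */

	take(MEMORY_CHAIN);        /* memory_notify(MEM_OFFLINE): the  */
	serialized = held[MEM_HOTPLUG]; /* inversion, but still under  */
	release(MEMORY_CHAIN);          /* the outer mutex             */
	release(KSM_THREAD);       /* ksm_memory_callback(MEM_OFFLINE) */
	release(MEM_HOTPLUG);
	return serialized;
}
```

Lockdep only records pairwise ordering, so it cannot see that the outer mutex serializes the two orders; hence a warning but no actual deadlock.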
Re: [3.5.0 BUG] vmx_handle_exit: unexpected, valid vectoring info (0x80000b0e)
On 09/12/2012 04:15 PM, Avi Kivity wrote: On 09/12/2012 07:40 AM, Fengguang Wu wrote: Hi, 3 of my test boxes running the v3.5 kernel became inaccessible and I find two of them kept emitting this dmesg: vmx_handle_exit: unexpected, valid vectoring info (0x80000b0e) and exit reason is 0x31 The other one has frozen and the above lines are the last dmesg. Any ideas? First, that printk should be rate-limited. Second, we should add EXIT_REASON_EPT_MISCONFIG (0x31) to if ((vectoring_info & VECTORING_INFO_VALID_MASK) && (exit_reason != EXIT_REASON_EXCEPTION_NMI && exit_reason != EXIT_REASON_EPT_VIOLATION && exit_reason != EXIT_REASON_TASK_SWITCH)) printk(KERN_WARNING "%s: unexpected, valid vectoring info (0x%x) and exit reason is 0x%x\n", __func__, vectoring_info, exit_reason); since it's easily caused by the guest. Yes, I will do these. Third, it's really unexpected. It seems the guest was attempting to deliver a page fault exception (0x0e) but encountered an mmio page during delivery (in the IDT, TSS, stack, or page tables). Is this reproducible? If so it's easy to patch kvm to halt in that case and allow examining the guest via qemu. I have no idea yet why the box was frozen in this case; I will try to write a test case and hope it helps me find the reason. Maybe we should do so regardless (return a KVM_EXIT_INTERNAL_ERROR). I think this is reasonable. Thanks!
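Avi's suggestion amounts to adding one more exemption to the warning condition. A testable predicate form of the check with EXIT_REASON_EPT_MISCONFIG folded in -- the constants match the architectural basic exit reason numbers, but the surrounding kernel context is paraphrased, not copied:

```c
#include <stdint.h>

/* Bit 31 of the IDT-vectoring info field marks it as valid. */
#define VECTORING_INFO_VALID_MASK   (1u << 31)

/* Basic VM-exit reason numbers (Intel SDM). */
#define EXIT_REASON_EXCEPTION_NMI    0
#define EXIT_REASON_TASK_SWITCH      9
#define EXIT_REASON_EPT_VIOLATION   48
#define EXIT_REASON_EPT_MISCONFIG   49  /* 0x31, the case Avi wants exempted */

/* Predicate form of the vmx_handle_exit() warning condition, with the
 * suggested EPT misconfig exemption added. */
static int should_warn(uint32_t vectoring_info, uint32_t exit_reason)
{
	return (vectoring_info & VECTORING_INFO_VALID_MASK) &&
	       exit_reason != EXIT_REASON_EXCEPTION_NMI &&
	       exit_reason != EXIT_REASON_EPT_VIOLATION &&
	       exit_reason != EXIT_REASON_TASK_SWITCH &&
	       exit_reason != EXIT_REASON_EPT_MISCONFIG;
}
```

With this change the exact case Fengguang reported (vectoring info 0x80000b0e, exit reason 0x31) no longer triggers the message, and rate-limiting handles any remaining noisy paths.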
RE: [PATCH] KVM: PPC: bookehv: Allow duplicate calls of DO_KVM macro
-Original Message- From: Wood Scott-B07421 Sent: Thursday, September 13, 2012 12:54 AM To: Alexander Graf Cc: Caraman Mihai Claudiu-B02008; kvm-ppc@vger.kernel.org; linuxppc-d...@lists.ozlabs.org; k...@vger.kernel.org Subject: Re: [PATCH] KVM: PPC: bookehv: Allow duplicate calls of DO_KVM macro On 09/12/2012 04:45 PM, Alexander Graf wrote: On 12.09.2012, at 23:38, Scott Wood scottw...@freescale.com wrote: On 09/12/2012 01:56 PM, Alexander Graf wrote: On 12.09.2012, at 15:18, Mihai Caraman mihai.cara...@freescale.com wrote: The current form of the DO_KVM macro restricts its use to one call per input parameter set. This is caused by the kvmppc_resume_\intno\()_\srr1 symbol definition. Duplicate calls of DO_KVM are required by distinct implementations of exception handlers which are delegated at runtime. Not sure I understand what you're trying to achieve here. Please elaborate ;) On 64-bit book3e we compile multiple versions of the TLB miss handlers, and choose from them at runtime. The exception handler patching has been active in the __early_init_mmu() function in powerpc/mm/tlb_nohash.c for quite a few years. For TLB miss exceptions there are three handler versions: standard, HW tablewalk and bolted. I posted a patch to add another variant, for e6500-style hardware tablewalk, which shares the bolted prolog/epilog (besides prolog/epilog performance, e6500 is incompatible with the IBM tablewalk code for various reasons). That caused us to have two DO_KVMs for the same exception type. Sorry, I forgot to cc the kvm-ppc mailing list when I replied to that discussion thread. -Mike
Re: [PATCH] KVM: PPC: bookehv: Allow duplicate calls of DO_KVM macro
On 09/12/2012 03:18 PM, Mihai Caraman wrote: The current form of the DO_KVM macro restricts its use to one call per input parameter set. This is caused by the kvmppc_resume_\intno\()_\srr1 symbol definition. Duplicate calls of DO_KVM are required by distinct implementations of exception handlers which are delegated at runtime. Use a rare label number to avoid conflicts with the calling contexts. Signed-off-by: Mihai Caraman mihai.cara...@freescale.com Thanks, applied to kvm-ppc-next. Alex