Re: question about napi_disable (was Re: [PATCH] virtio_net: set/cancel work on ndo_open/ndo_stop)
On 04/04/2012 05:32 PM, Michael S. Tsirkin wrote:
On Thu, Dec 29, 2011 at 09:12:38PM +1030, Rusty Russell wrote:
Michael S. Tsirkin noticed that we could run the refill work after ndo_close, which can re-enable napi - we don't disable it until virtnet_remove. This is clearly wrong, so move the workqueue control to ndo_open and ndo_stop (aka virtnet_open and virtnet_close). One subtle point: virtnet_probe() could simply fail if it couldn't allocate a receive buffer, but that's less polite in virtnet_open(), so we schedule a refill as we do in the normal receive path if we run out of memory.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au

Doh. napi_disable does not prevent the following napi_schedule, does it? Can someone confirm that I am not seeing things please?

Looks like napi_disable() does prevent the following scheduling, as napi_schedule_prep() returns true only when there is a 0 -> 1 transition of the NAPI_STATE_SCHED bit.

-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
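The answer above hinges on napi_disable() leaving NAPI_STATE_SCHED set, so that napi_schedule_prep()'s test-and-set never sees a 0 -> 1 transition afterwards. A toy user-space model of that handshake (all names here are illustrative stand-ins, not the real kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the NAPI_STATE_SCHED handshake discussed above. */
#define NAPI_STATE_SCHED (1u << 0)

struct napi_model { unsigned state; };

/* Models test_and_set_bit(): returns the previous value of the bit. */
static bool test_and_set_sched(struct napi_model *n)
{
    bool old = n->state & NAPI_STATE_SCHED;
    n->state |= NAPI_STATE_SCHED;
    return old;
}

/* napi_schedule_prep() succeeds only on a 0 -> 1 transition of SCHED. */
static bool napi_schedule_prep_model(struct napi_model *n)
{
    return !test_and_set_sched(n);
}

/* napi_disable() grabs SCHED and never releases it, so every later
 * schedule attempt sees the bit already set and fails. */
static void napi_disable_model(struct napi_model *n)
{
    while (test_and_set_sched(n))
        ; /* wait for an in-flight poll to clear the bit */
}
```

Running the model shows why the refill work cannot re-arm NAPI once napi_disable() has completed: the disable path owns the SCHED bit.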
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On 2012-04-05 05:42, Alex Williamson wrote: We've hit a kernel host panic, when issuing a 'system_reset' with an 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815. [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993 [Hardware Error]: APEI generic hardware error status [Hardware Error]: severity: 1, fatal [Hardware Error]: section: 0, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 [Hardware Error]: section: 1, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 Kernel panic - not syncing: Fatal hardware error! Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1 Call Trace: NMI [814f2fe5] ? 
panic+0xa0/0x168
[812f919c] ? ghes_notify_nmi+0x17c/0x180
[814f91d5] ? notifier_call_chain+0x55/0x80
[814f923a] ? atomic_notifier_call_chain+0x1a/0x20
[8109667e] ? notify_die+0x2e/0x30
[814f6e81] ? do_nmi+0x1a1/0x2b0
[814f6760] ? nmi+0x20/0x30
[8103762b] ? native_safe_halt+0xb/0x10
EOE [8101495d] ? default_idle+0x4d/0xb0
[81009e06] ? cpu_idle+0xb6/0x110
[814da63a] ? rest_init+0x7a/0x80
[81c1ff7b] ? start_kernel+0x424/0x430
[81c1f33a] ? x86_64_start_reservations+0x125/0x129
[81c1f438] ? x86_64_start_kernel+0xfa/0x109

The root cause of the problem is that the 'reset_assigned_device()' code first writes a 0 to the command register. Then, when qemu subsequently does a kvm_deassign_irq() (called by assign_irq(), in the system_reset path), the kernel ends up calling '__msix_mask_irq()', which performs a write to the memory-mapped msi vector space. Since we've explicitly told the device to disallow mmio access (via the 0 write to the command register), we end up with the above 'Unsupported Request'.

The fix here is to first disable MSI-X, before doing the reset. We also disable MSI, leaving the device in INTx mode. In this way, the device is in a known state after reset, and we avoid touching msi memory-mapped space on any subsequent 'kvm_deassign_irq()'.

Thanks to Michael S. Tsirkin for help in understanding what was going on here, and to Jason Baron, the original debugger of this problem.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
Jason is out of the office for a couple of weeks, so I'll try to resolve this while he's away. Somehow the emulated config updates were lost in Jason's original posting, so I've fixed that and taken Jan's suggestion to simply call into the update functions instead of open coding the interrupt disable.
I think there still may be some disagreements about how to handle guest-generated errors in the host, but that's a large project, whereas this is something we should be doing at reset anyway, and even if only a workaround, it resolves the problem above.

 hw/device-assignment.c | 23 +++
 1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 89823f1..2e6b93e 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev)
     const char reset[] = "1";
     int fd, ret;

+    /*
+     * If a guest is reset without being shutdown, MSI/MSI-X can still
+     * be running. We want to return the device to a known state on
+     * reset, so disable those here. We especially do not want MSI-X

To be precise: it is the specified state after reset which we fail to restore so far.

+     * enabled since it lives
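The ordering bug described in the commit message can be captured in a few lines. Below is a toy user-space model (all struct and function names are invented for illustration; this is not the QEMU/KVM code): masking an MSI-X vector is an MMIO write, so it must happen while the command register still permits memory access.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the reset-ordering bug: an MSI-X mask write is MMIO,
 * so issuing it after memory space is disabled yields an
 * Unsupported Request on the bus. Names are illustrative only. */
struct dev_model {
    bool mem_enabled;   /* PCI_COMMAND memory-space bit */
    bool msix_enabled;
    bool ur_error;      /* Unsupported Request observed */
};

static void msix_mask_vectors(struct dev_model *d)
{
    if (!d->mem_enabled)
        d->ur_error = true;   /* MMIO write to a device with decode off */
    d->msix_enabled = false;
}

/* Buggy order: clear the command register first, mask vectors later
 * (the kvm_deassign_irq() path in the report). */
static void reset_buggy(struct dev_model *d)
{
    d->mem_enabled = false;   /* write 0 to PCI_COMMAND */
    msix_mask_vectors(d);
}

/* Fixed order: drop back to INTx before touching the command register. */
static void reset_fixed(struct dev_model *d)
{
    msix_mask_vectors(d);
    d->mem_enabled = false;
}
```

The model makes the fix's invariant explicit: after reset_fixed(), MSI-X is off and no MMIO access was attempted against a disabled device.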
Re: KVM and Hylafax
We have a Hylafax 5.2.5 CentOS 5 installation hosted inside a Xen virtual machine. It works quite well, but now I'm in the process of upgrading/migrating it to a KVM virtual machine running Ubuntu 10.04. The problem I'm having is that - while receiving works fine - sending faxes is extremely unreliable: I get lots of "No response to MPS repeated 3 tries" or "Failure to transmit clean ECM image data", just what would happen if the phone line were noisy.

For the record, I switched to a CentOS VM instead of an Ubuntu one, and now it's working much better.
[Bug 42636] PCI passthrough does not work with AMD iommu for PCI device
https://bugzilla.kernel.org/show_bug.cgi?id=42636

--- Comment #9 from Klaus Mueller kmuel...@justmail.de 2012-04-05 08:08:58 ---

I tested the same device here and got the same error as you reported. Good to know that Xen doesn't have any problem. This really means that it is most probably a kvm, qemu, or AMD iommu bug.

kvm 0.15, kernel: linux 3.1.10 (openSUSE)
Re: CPU softlockup due to smp_call_function()
On Thu, Apr 5, 2012 at 6:06 AM, Milton Miller milt...@bga.com wrote:
On Wed, 4 Apr 2012 about 22:12:36 +0200, Sasha Levin wrote:
I've started seeing soft lockups resulting from smp_call_function() calls. I've attached two different backtraces of this happening with different code paths. This is running inside a KVM guest with the trinity fuzzer, using today's linux-next kernel.

Hi Sasha. You have two different call sites (arch/x86/mm/pageattr.c cpa_flush_range and net/core/dev.c netdev_run_todo), and both call on_each_cpu with wait=1. I tried a few options but can't get close enough to your compiled length of 2a0 to know if the code is spinning on the first csd_lock_wait in csd_lock or in the second csd_lock_wait after the call to arch_send_call_function_ipi_mask (aka smp_ops + 0x44 in my x86_64 compile). Please check your disassembly and report.

Hi Milton,

# addr2line -i -e /usr/src/linux/vmlinux 8111f30e
/usr/src/linux/kernel/smp.c:102
/usr/src/linux/kernel/smp.c:554

So it's the 2nd lock. I'll work on adding the debug code mentioned in your mail.
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
On 04/01/2012 07:23 PM, Avi Kivity wrote:
On 04/01/2012 04:48 PM, Raghavendra K T wrote:
I have a patch something like below in mind to try:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d3b98b1..5127668 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * else and called schedule in __vcpu_run. Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
+	 * Priority is given to vcpus that are unhalted.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 3 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			struct task_struct *task = NULL;
 			struct pid *pid;
-			if (!pass && i <= last_boosted_vcpu) {
+			if (!pass && !vcpu->pv_unhalted)
+				continue;
+			else if (pass == 1 && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
-			} else if (pass && i > last_boosted_vcpu)
+			} else if (pass == 2 && i > last_boosted_vcpu)
 				break;
 			if (vcpu == me)
 				continue;

[...]

I'm interested in how PLE does vs. your patches, both with PLE enabled and disabled.

Here is the result taken on a PLE machine. The results seem to support all our assumptions. Following are the observations from the results:
1) There is a huge benefit for the non-PLE-based configuration (base_nople vs pv_ple) (around 90%).
2) The ticketlock + kvm patches go well along with PLE (more benefit is seen, not degradation) (base_ple vs pv_ple).
3) The ticketlock + kvm patches make it behave almost like a PLE-enabled machine (base_ple vs pv_nople).
4) The ple handler modification patches seem to give an advantage (pv_ple vs pv_ple_optimized). More study needed, probably with the higher M/N ratio Avi pointed out.
configurations:
base_nople = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=n - PLE
base_ple = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=n + PLE
pv_ple = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=y + PLE + ticketlock + kvm patches
pv_nople = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=y - PLE + ticketlock + kvm patches
pv_ple_optimized = 3.3-rc6 with CONFIG_PARAVIRT_SPINLOCK=y + PLE + optimization patch + ticketlock + kvm patches + posted ple_handler modification (yield to kicked vcpu)

Machine: IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU with 32 core, with 8 online cores and 4*64GB RAM. 3 guests running with 2GB RAM, 8vCPU.

Results:
---
case A)
1x: 1 kernbench 2 idle
2x: 1 kernbench 1 while1 hog 1 idle
3x: 1 kernbench 2 while1 hog

Average time taken in sec for kernbench run (std). [ lower the value better ]

     base_nople         base_ple           pv_ple             pv_nople           pv_ple_optimized
1x   72.8284 (89.8757)  70.475 (85.6979)   63.5033 (72.7041)  65.7634 (77.0504)  64.3284 (73.2688)
2x   823.053 (1113.05)  110.971 (132.829)  105.099 (128.738)  139.058 (165.156)  106.268 (129.611)
3x   3244.37 (4707.61)  150.265 (184.766)  138.341 (172.69)   139.106 (163.549)  133.238 (168.388)

Percentage improvement calculation w.r.t base_nople:
     base_ple  pv_ple   pv_nople  pv_ple_optimized
1x   3.23143   12.8042  9.70089   11.6713
2x   86.5172   87.2306  83.1046   87.0886
3x   95.3684   95.736   95.7124   95.8933
---
Percentage improvement calculation w.r.t base_ple:
     base_nople  pv_ple   pv_nople  pv_ple_optimized
1x   -3.3393     9.89244  6.68549   8.72167
2x   -641.683    5.29147  -25.3102  4.23804
3x   -2059.1     7.93531  7.42621   11.3313

case B) all 3 guests running kernbench.
Average time taken in sec for kernbench run (std). [ lower the value better ].
Note that std is calculated over the 6*3 run average from all 3 guests given by kernbench.

base_nople           base_ple               pv_ple                 pv_nople             pv_ple_opt
2886.92 (18.289131)  204.80333 (7.1784039)  200.22517 (10.134804)  202.091 (12.249673)  201.60683 (7.881737)

Percentage improvement calculation w.r.t base_nople:
base_ple  pv_ple   pv_nople  pv_ple_optimized
92.9058   93.0644  93        93.0166

Percentage improvement calculation w.r.t base_ple:
base_nople  pv_ple  pv_nople  pv_ple_optimized
-1309.606   2.2354  1.324     1.5607

I hope the experimental results convey the same message if somebody else does the benchmarking. Also, as Ian pointed out in the thread, the earlier results from Attilio and me were to convince that
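The percentage-improvement rows in the tables above follow directly from the raw kernbench times: improvement relative to a baseline is (base - new) / base * 100. A one-line helper reproduces the published figures from the raw numbers:

```c
#include <assert.h>
#include <math.h>

/* Percentage improvement of new_time over base: positive means faster.
 * This is the formula behind the "w.r.t base_nople" / "w.r.t base_ple"
 * rows in the kernbench tables. */
static double improvement_pct(double base, double new_time)
{
    return (base - new_time) / base * 100.0;
}
```

Feeding in the 1x case-A numbers, for example, recovers the 12.8042% figure for pv_ple against base_nople.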
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
On 04/02/2012 12:51 PM, Raghavendra K T wrote:
On 04/01/2012 07:23 PM, Avi Kivity wrote:
On 04/01/2012 04:48 PM, Raghavendra K T wrote:
I have a patch something like below in mind to try:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d3b98b1..5127668 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * else and called schedule in __vcpu_run. Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
+	 * Priority is given to vcpus that are unhalted.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 3 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			struct task_struct *task = NULL;
 			struct pid *pid;
-			if (!pass && i <= last_boosted_vcpu) {
+			if (!pass && !vcpu->pv_unhalted)
+				continue;
+			else if (pass == 1 && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
-			} else if (pass && i > last_boosted_vcpu)
+			} else if (pass == 2 && i > last_boosted_vcpu)
 				break;
 			if (vcpu == me)
 				continue;

Actually I think this is unneeded. The loop tries to find vcpus that are runnable but not running (!waitqueue_active(&vcpu->wq)), and halted vcpus don't match this condition.

Oh! I think I misinterpreted your statement. hmm I got it. you told to remove the if (vcpu == me) condition.

No, the entire patch is unneeded. My original comment was: from the PLE handler, don't wake up a vcpu that is sleeping because it is waiting for a kick. But the PLE handler never wakes up sleeping vcpus anyway.

There's still a conflict with PLE in that it may trigger during the spin phase and send a random yield_to() somewhere. Maybe it's sufficient to tune the PLE timeout to be longer than the spinlock timeout.

-- error compiling committee.c: too many arguments to function
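The three-pass candidate selection being debated in this thread can be condensed into a small user-space model (heavily simplified; `pick_yield_target`, the fixed-size vcpu array, and the boolean flags are all invented for illustration, not the real kvm code): pass 0 considers only kicked-but-not-yet-run vcpus, and passes 1 and 2 fall back to the old round-robin starting after the last boosted vcpu.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposed three-pass yield-candidate loop. */
#define NVCPUS 4

struct vcpu_model {
    bool pv_unhalted;   /* kicked by a lock holder, not yet run */
    bool runnable;      /* runnable but not currently running */
};

static int pick_yield_target(struct vcpu_model *v, int me, int last_boosted)
{
    for (int pass = 0; pass < 3; pass++) {
        for (int i = 0; i < NVCPUS; i++) {
            if (pass == 0 && !v[i].pv_unhalted)
                continue;                 /* priority to kicked vcpus */
            if (pass == 1 && i <= last_boosted)
                continue;                 /* round-robin: after last boosted */
            if (pass == 2 && i > last_boosted)
                break;                    /* wrap-around part of the scan */
            if (i == me || !v[i].runnable)
                continue;
            return i;                     /* would yield_to(v[i]) */
        }
    }
    return -1;                            /* nobody to boost */
}
```

The model shows the property Raghavendra is after: a kicked vcpu wins over plain round-robin whenever one exists, which is exactly what Avi argues is (or is not) worth the extra pass.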
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
On 04/02/2012 12:26 PM, Thomas Gleixner wrote:
One thing about it is that it can give many false positives. Consider a fine-grained spinlock that is being accessed by many threads. That is, the lock is taken and released with high frequency, but there is no contention, because each vcpu is accessing a different instance. So the host scheduler will needlessly delay preemption of vcpus that happen to be holding a lock, even though this gains nothing.

You're talking about per cpu locks, right? I can see the point there, but per cpu stuff might be worth annotating to avoid this.

Or any lock which is simply uncontended. Say a single process is running, the rest of the system is idle. It will take and release many locks; but it can be preempted at any point by the hypervisor with no performance loss. The overhead is arming a timer twice and an extra exit per deferred context switch. Perhaps not much given that you don't see tons of context switches on virt workloads, at least without threaded interrupts (or maybe interrupt threads should override this mechanism, by being realtime threads).

A second issue may happen with a lock that is taken and released with high frequency, with a high hold percentage. The host scheduler may always sample the guest in a held state, leading it to conclude that it's exceeding its timeout when in fact the lock is held for a short time only.

Well, no. You can avoid that.

host                      VCPU
                          spin_lock()
                            modify_global_state()
exit
check_global_state()
mark_global_state()
reschedule vcpu
                          spin_unlock()
                            check_global_state()
                              trigger_exit()

So when an exit occurs in the locked section, then the host can mark the global state to tell the guest to issue a trap on unlock.

Right. How does this nest? Do we trigger the exit on the outermost unlock?
-- error compiling committee.c: too many arguments to function
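One answer to the nesting question above: count lock depth per cpu and only fire the trap on the outermost unlock. A toy model of the handshake sketched in the diagram (every name here is invented; this is one possible design, not an agreed protocol):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the preempt-while-locked handshake: the guest tracks
 * lock nesting; if the host preempts it inside a locked section, it
 * marks "trap on unlock", and only the outermost guest unlock then
 * forces an exit so the host can reschedule a waiter. */
struct pcpu_model {
    int lock_depth;        /* spinlock nesting on this vcpu */
    bool trap_on_unlock;   /* set by the host across a preemption */
    int exits;             /* exits triggered by unlock */
};

static void guest_lock(struct pcpu_model *c)
{
    c->lock_depth++;       /* modify_global_state() */
}

static void host_preempt(struct pcpu_model *c)
{
    if (c->lock_depth > 0)
        c->trap_on_unlock = true;   /* mark_global_state() */
}

static void guest_unlock(struct pcpu_model *c)
{
    c->lock_depth--;
    if (c->lock_depth == 0 && c->trap_on_unlock) {
        c->exits++;                 /* trigger_exit() */
        c->trap_on_unlock = false;
    }
}
```

With this rule, nested unlocks inside the critical region stay cheap and only the outermost release pays for the exit.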
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Wed, Apr 04, 2012 at 09:42:32PM -0600, Alex Williamson wrote: We've hit a kernel host panic, when issuing a 'system_reset' with an 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815. [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993 [Hardware Error]: APEI generic hardware error status [Hardware Error]: severity: 1, fatal [Hardware Error]: section: 0, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 [Hardware Error]: section: 1, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 Kernel panic - not syncing: Fatal hardware error! Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1 Call Trace: NMI [814f2fe5] ? 
panic+0xa0/0x168 [812f919c] ? ghes_notify_nmi+0x17c/0x180 [814f91d5] ? notifier_call_chain+0x55/0x80 [814f923a] ? atomic_notifier_call_chain+0x1a/0x20 [8109667e] ? notify_die+0x2e/0x30 [814f6e81] ? do_nmi+0x1a1/0x2b0 [814f6760] ? nmi+0x20/0x30 [8103762b] ? native_safe_halt+0xb/0x10 EOE [8101495d] ? default_idle+0x4d/0xb0 [81009e06] ? cpu_idle+0xb6/0x110 [814da63a] ? rest_init+0x7a/0x80 [81c1ff7b] ? start_kernel+0x424/0x430 [81c1f33a] ? x86_64_start_reservations+0x125/0x129 [81c1f438] ? x86_64_start_kernel+0xfa/0x109 The root cause of the problem is that the 'reset_assigned_device()' code first writes a 0 to the command register. Then, when qemu subsequently does a kvm_deassign_irq() (called by assign_irq(), in the system_reset path), the kernel ends up calling '__msix_mask_irq()', which performs a write to the memory mapped msi vector space. Since, we've explicitly told the device to disallow mmio access (via the 0 write to the command register), we end up with the above 'Unsupported Request'. The fix here is to first disable MSI-X, before doing the reset. We also disable MSI, leaving the device in INTx mode. In this way, the device is a known state after reset, and we avoid touching msi memory mapped space on any subsequent 'kvm_deassign_irq()'. Thanks to Michael S. Tsirkin for help in understanding what was going on here and Jason Baron, the original debugger of this problem. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Jason is out of the office for a couple weeks, so I'll try to resolve this while he's away. Somehow the emulated config updates were lost in Jason's original posting, so I've fixed that and taken Jan's suggestion to simply call into the update functions instead of open coding the interrupt disable. 
I think there still may be some disagreements about how to handle guest generated errors in the host, but that's a large project whereas this is something we should be doing at reset anyway, and even if only a workaround, resolves the problem above. I don't think there's a disagreement: don't we all agree they should be forwarded to qemu and on the guest if possible, ignored if not? hw/device-assignment.c | 23 +++ 1 files changed, 23 insertions(+), 0 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..2e6b93e 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1613,6 +1613,29 @@ static void reset_assigned_device(DeviceState *dev) const char reset[] = 1; int fd, ret; +/* + * If a guest is reset without being shutdown, MSI/MSI-X can still + * be running. We want to return the device to a known state on + * reset, so disable those
Re: [PATCH v4 1/4] kvm: Update kernel headers against kvm.git
On Thu, Mar 08, 2012 at 11:10:24AM +0100, Jan Kiszka wrote: Required for INTx masking / host IRQ sharing support of assigned devices. Corresponding kvm.git hash: 186195928e Signed-off-by: Jan Kiszka jan.kis...@siemens.com Acked-by: Michael S. Tsirkin m...@redhat.com --- linux-headers/asm-s390/kvm.h |2 ++ linux-headers/linux/kvm.h|6 ++ 2 files changed, 8 insertions(+), 0 deletions(-) diff --git a/linux-headers/asm-s390/kvm.h b/linux-headers/asm-s390/kvm.h index 9acbde4..9607667 100644 --- a/linux-headers/asm-s390/kvm.h +++ b/linux-headers/asm-s390/kvm.h @@ -44,10 +44,12 @@ struct kvm_guest_debug_arch { #define KVM_SYNC_PREFIX (1UL 0) #define KVM_SYNC_GPRS (1UL 1) #define KVM_SYNC_ACRS (1UL 2) +#define KVM_SYNC_CRS(1UL 3) /* definition of registers in kvm_run */ struct kvm_sync_regs { __u64 prefix; /* prefix register */ __u64 gprs[16]; /* general purpose registers */ __u32 acrs[16]; /* access registers */ + __u64 crs[16]; /* control registers */ }; #endif diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h index f6b5343..cf7b098 100644 --- a/linux-headers/linux/kvm.h +++ b/linux-headers/linux/kvm.h @@ -588,6 +588,7 @@ struct kvm_ppc_pvinfo { #define KVM_CAP_TSC_DEADLINE_TIMER 72 #define KVM_CAP_S390_UCONTROL 73 #define KVM_CAP_SYNC_REGS 74 +#define KVM_CAP_PCI_2_3 75 #ifdef KVM_CAP_IRQ_ROUTING @@ -784,6 +785,9 @@ struct kvm_s390_ucas_mapping { /* Available with KVM_CAP_TSC_CONTROL */ #define KVM_SET_TSC_KHZ _IO(KVMIO, 0xa2) #define KVM_GET_TSC_KHZ _IO(KVMIO, 0xa3) +/* Available with KVM_CAP_PCI_2_3 */ +#define KVM_ASSIGN_SET_INTX_MASK _IOW(KVMIO, 0xa4, \ +struct kvm_assigned_pci_dev) /* * ioctls for vcpu fds @@ -857,6 +861,8 @@ struct kvm_s390_ucas_mapping { #define KVM_SET_ONE_REG_IOW(KVMIO, 0xac, struct kvm_one_reg) #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 0) +#define KVM_DEV_ASSIGN_PCI_2_3 (1 1) +#define KVM_DEV_ASSIGN_MASK_INTX (1 2) struct kvm_assigned_pci_dev { __u32 assigned_dev_id; -- 1.7.3.4 -- To unsubscribe from this list: send the line 
unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 2/4] kvm: Introduce kvm_has_intx_set_mask
On Thu, Mar 08, 2012 at 11:10:25AM +0100, Jan Kiszka wrote: Will be used by PCI device assignment code. Signed-off-by: Jan Kiszka jan.kis...@siemens.com Acked-by: Michael S. Tsirkin m...@redhat.com --- kvm-all.c |8 kvm.h |1 + 2 files changed, 9 insertions(+), 0 deletions(-) diff --git a/kvm-all.c b/kvm-all.c index 7fed5c9..471293d 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -78,6 +78,7 @@ struct KVMState int xsave, xcrs; int many_ioeventfds; int pit_state2; +int intx_set_mask; int irqchip_inject_ioctl; #ifdef KVM_CAP_IRQ_ROUTING struct kvm_irq_routing *irq_routes; @@ -960,6 +961,8 @@ int kvm_init(void) s-pit_state2 = kvm_check_extension(s, KVM_CAP_PIT_STATE2); #endif +s-intx_set_mask = kvm_check_extension(s, KVM_CAP_PCI_2_3); + ret = kvm_arch_init(s); if (ret 0) { goto err; @@ -1319,6 +1322,11 @@ int kvm_has_gsi_routing(void) #endif } +int kvm_has_intx_set_mask(void) +{ +return kvm_state-intx_set_mask; +} + int kvm_allows_irq0_override(void) { return !kvm_irqchip_in_kernel() || kvm_has_gsi_routing(); diff --git a/kvm.h b/kvm.h index 43d4c68..1885be1 100644 --- a/kvm.h +++ b/kvm.h @@ -57,6 +57,7 @@ int kvm_has_xcrs(void); int kvm_has_many_ioeventfds(void); int kvm_has_pit_state2(void); int kvm_has_gsi_routing(void); +int kvm_has_intx_set_mask(void); int kvm_allows_irq0_override(void); -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 3/4] pci-assign: Use PCI-2.3-based shared legacy interrupts
On Thu, Mar 08, 2012 at 11:10:26AM +0100, Jan Kiszka wrote: Enable the new KVM feature that allows legacy interrupt sharing for PCI-2.3-compliant devices. This requires to synchronize any guest change of the INTx mask bit to the kernel. The feature is controlled by the property 'share_intx' and is off by default for now. Signed-off-by: Jan Kiszka jan.kis...@siemens.com Acked-by: Michael S. Tsirkin m...@redhat.com --- hw/device-assignment.c | 25 + hw/device-assignment.h | 10 ++ qemu-kvm.c |9 + qemu-kvm.h |2 ++ 4 files changed, 42 insertions(+), 4 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index a5f1abb..b7cabd4 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -782,6 +782,13 @@ static int assign_device(AssignedDevice *dev) cause host memory corruption if the device issues DMA write requests!\n); } +if (dev-features ASSIGNED_DEVICE_SHARE_INTX_MASK) { +assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; + +/* hide host-side INTx masking from the guest */ +dev-emulate_config_read[PCI_COMMAND + 1] |= +PCI_COMMAND_INTX_DISABLE 8; +} r = kvm_assign_pci_device(kvm_state, assigned_dev_data); if (r 0) { @@ -1121,10 +1128,26 @@ static void assigned_dev_pci_write_config(PCIDevice *pci_dev, uint32_t address, uint32_t val, int len) { AssignedDevice *assigned_dev = DO_UPCAST(AssignedDevice, dev, pci_dev); +uint16_t old_cmd = pci_get_word(pci_dev-config + PCI_COMMAND); uint32_t emulate_mask, full_emulation_mask; +int ret; pci_default_write_config(pci_dev, address, val, len); +if (kvm_has_intx_set_mask() +range_covers_byte(address, len, PCI_COMMAND + 1)) { +bool intx_masked = (pci_get_word(pci_dev-config + PCI_COMMAND) +PCI_COMMAND_INTX_DISABLE); + +if (intx_masked != !!(old_cmd PCI_COMMAND_INTX_DISABLE)) { +ret = kvm_device_intx_set_mask(kvm_state, + calc_assigned_dev_id(assigned_dev), + intx_masked); +if (ret) { +perror(assigned_dev_pci_write_config: set intx mask); +} +} +} if (assigned_dev-cap.available ASSIGNED_DEVICE_CAP_MSI) { 
if (range_covers_byte(address, len, pci_dev-msi_cap + PCI_MSI_FLAGS)) { @@ -1748,6 +1771,8 @@ static Property da_properties[] = ASSIGNED_DEVICE_USE_IOMMU_BIT, true), DEFINE_PROP_BIT(prefer_msi, AssignedDevice, features, ASSIGNED_DEVICE_PREFER_MSI_BIT, true), +DEFINE_PROP_BIT(share_intx, AssignedDevice, features, +ASSIGNED_DEVICE_SHARE_INTX_BIT, false), DEFINE_PROP_INT32(bootindex, AssignedDevice, bootindex, -1), DEFINE_PROP_STRING(configfd, AssignedDevice, configfd_name), DEFINE_PROP_END_OF_LIST(), diff --git a/hw/device-assignment.h b/hw/device-assignment.h index b4bcfa6..5d271d5 100644 --- a/hw/device-assignment.h +++ b/hw/device-assignment.h @@ -74,11 +74,13 @@ typedef struct { PCIRegion *region; } AssignedDevRegion; -#define ASSIGNED_DEVICE_USE_IOMMU_BIT0 -#define ASSIGNED_DEVICE_PREFER_MSI_BIT 1 +#define ASSIGNED_DEVICE_USE_IOMMU_BIT 0 +#define ASSIGNED_DEVICE_PREFER_MSI_BIT 1 +#define ASSIGNED_DEVICE_SHARE_INTX_BIT 2 -#define ASSIGNED_DEVICE_USE_IOMMU_MASK (1 ASSIGNED_DEVICE_USE_IOMMU_BIT) -#define ASSIGNED_DEVICE_PREFER_MSI_MASK (1 ASSIGNED_DEVICE_PREFER_MSI_BIT) +#define ASSIGNED_DEVICE_USE_IOMMU_MASK (1 ASSIGNED_DEVICE_USE_IOMMU_BIT) +#define ASSIGNED_DEVICE_PREFER_MSI_MASK (1 ASSIGNED_DEVICE_PREFER_MSI_BIT) +#define ASSIGNED_DEVICE_SHARE_INTX_MASK (1 ASSIGNED_DEVICE_SHARE_INTX_BIT) typedef struct { uint32_t addr_lo; diff --git a/qemu-kvm.c b/qemu-kvm.c index 09a35f0..2047ebb 100644 --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -54,6 +54,15 @@ static int kvm_old_assign_irq(KVMState *s, return kvm_vm_ioctl(s, KVM_ASSIGN_IRQ, assigned_irq); } +int kvm_device_intx_set_mask(KVMState *s, uint32_t dev_id, bool masked) +{ +struct kvm_assigned_pci_dev assigned_dev; + +assigned_dev.assigned_dev_id = dev_id; +assigned_dev.flags = masked ? 
KVM_DEV_ASSIGN_MASK_INTX : 0; +return kvm_vm_ioctl(s, KVM_ASSIGN_SET_INTX_MASK, assigned_dev); +} + #ifdef KVM_CAP_ASSIGN_DEV_IRQ int kvm_assign_irq(KVMState *s, struct kvm_assigned_irq *assigned_irq) { diff --git a/qemu-kvm.h b/qemu-kvm.h index 3c4f023..2b23daf 100644 --- a/qemu-kvm.h +++ b/qemu-kvm.h @@ -114,6 +114,8 @@ int kvm_assign_irq(KVMState *s, struct kvm_assigned_irq *assigned_irq); */ int kvm_deassign_irq(KVMState *s,
Re: [PATCH v4 4/4] pci_assign: Flip defaults of prefer_msi and share_intx
On Thu, Mar 08, 2012 at 11:10:27AM +0100, Jan Kiszka wrote: INTx sharing is a bit more expensive than exclusive host interrupts, but this channel is not supposed to be used for high-performance scenarios anyway. Modern devices support MSI/MSI-X and do not depend on using INTx under critical workload, real old devices do not support INTx sharing anyway. For those in the middle, the user experience is much better if they just work even when IRQ sharing is required. If there is nothing to share, share_intx=off can still be applied as tuning parameter. With INTx sharing as default, the primary reason for prefer_msi=on is gone. Make it default off, specifically as it is known to cause troubles with devices that have incomplete/broken MSI support or otherwise stumble if host IRQ configuration does not match guest driver expectation. Signed-off-by: Jan Kiszka jan.kis...@siemens.com I'm not familiar enough with what prefer_msi does so no comment here. Maybe Alex can ack. If not I'll try to find the time to understand it a bit better but will take some time. 
 hw/device-assignment.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index b7cabd4..caa3602 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1770,9 +1770,9 @@ static Property da_properties[] =
     DEFINE_PROP_BIT(iommu, AssignedDevice, features,
                     ASSIGNED_DEVICE_USE_IOMMU_BIT, true),
     DEFINE_PROP_BIT(prefer_msi, AssignedDevice, features,
-                    ASSIGNED_DEVICE_PREFER_MSI_BIT, true),
+                    ASSIGNED_DEVICE_PREFER_MSI_BIT, false),
     DEFINE_PROP_BIT(share_intx, AssignedDevice, features,
-                    ASSIGNED_DEVICE_SHARE_INTX_BIT, false),
+                    ASSIGNED_DEVICE_SHARE_INTX_BIT, true),
     DEFINE_PROP_INT32(bootindex, AssignedDevice, bootindex, -1),
     DEFINE_PROP_STRING(configfd, AssignedDevice, configfd_name),
     DEFINE_PROP_END_OF_LIST(),
--
1.7.3.4
Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
On 04/05/2012 02:31 PM, Avi Kivity wrote:
On 04/02/2012 12:51 PM, Raghavendra K T wrote:
On 04/01/2012 07:23 PM, Avi Kivity wrote:
On 04/01/2012 04:48 PM, Raghavendra K T wrote:
I have a patch something like below in mind to try:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d3b98b1..5127668 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * else and called schedule in __vcpu_run. Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
+	 * Priority is given to vcpus that are unhalted.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
+	for (pass = 0; pass < 3 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			struct task_struct *task = NULL;
 			struct pid *pid;
-			if (!pass && i <= last_boosted_vcpu) {
+			if (!pass && !vcpu->pv_unhalted)
+				continue;
+			else if (pass == 1 && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
-			} else if (pass && i > last_boosted_vcpu)
+			} else if (pass == 2 && i > last_boosted_vcpu)
 				break;
 			if (vcpu == me)
 				continue;

Actually I think this is unneeded. The loop tries to find vcpus that are runnable but not running (!waitqueue_active(&vcpu->wq)), and halted vcpus don't match this condition.

Oh! I think I misinterpreted your statement. hmm I got it. you told to remove the if (vcpu == me) condition.

No, the entire patch is unneeded. My original comment was: from the PLE handler, don't wake up a vcpu that is sleeping because it is waiting for a kick. But the PLE handler never wakes up sleeping vcpus anyway.

I agree with you. It is already doing that. But my approach here is a little different. In the 2 classes of vcpus we have (one is a subset of the other when we try to do yield_to), viz,
1) runnable and kicked (subset of)
2) just runnable
what we are trying to do here is targeting 1) first so that we get good lock progress. Here was the sequence I was talking about.
vcpu1 releases lock -> finds that vcpu2 is the next candidate -> kick hypercall -> kvm_pv_kick_cpu_op -> set kicked flag -> vcpu->kick(vcpu2). At this point we have vcpu2 waiting to get scheduled. But the above yield call can wake *anybody*. I agree this is not serious (rather, it is overhead) when there is a small number of vcpus, but when we have a larger number of vcpus/VMs it may not scale well. My attempt was to fix that. Let me know if I am completely missing something. There's still a conflict with PLE in that it may trigger during the spin phase and send a random yield_to() somewhere. Maybe it's sufficient to tune the PLE timeout to be longer than the spinlock timeout. Ok ... But we should also be cautious that we may do more halts, even though we are about to get the spinlock. Need more study on this.
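The two-class candidate selection Raghavendra describes — boost vcpus that are runnable *and* kicked before falling back to any merely runnable vcpu — can be sketched as a standalone function. The struct and field names below are invented for illustration; they are not the kernel's real types:

```c
struct vcpu {
    int runnable;     /* ready to run but not currently on a physical cpu */
    int pv_unhalted;  /* was kicked by the lock holder's kick hypercall   */
};

/*
 * Pass 0: consider only vcpus that are runnable AND kicked, so the
 * likely next lock holder is boosted first.  Pass 1: any runnable
 * vcpu.  Returns the index of the chosen candidate, or -1 if none.
 */
static int pick_yield_candidate(struct vcpu *v, int n, int me)
{
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < n; i++) {
            if (i == me || !v[i].runnable)
                continue;
            if (pass == 0 && !v[i].pv_unhalted)
                continue;
            return i;
        }
    }
    return -1;
}
```

The point of the two passes is exactly the scaling argument made above: with many vcpus, a yield to a random runnable vcpu is wasted work, while a yield to a kicked vcpu makes lock progress.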
Re: CPU softlockup due to smp_call_function()
On 04/04/2012 11:12 PM, Sasha Levin wrote: Hi all, I've started seeing soft lockups resulting from smp_call_function() calls. I've attached two different backtraces of this happening with different code paths. This is running inside a KVM guest with the trinity fuzzer, using today's linux-next kernel. [ 6540.134009] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u:1:38] [ 6540.134048] irq event stamp: 286811770 [ 6540.134048] hardirqs last enabled at (286811769): [82669e74] restore_args+0x0/0x30 [ 6540.134048] hardirqs last disabled at (286811770): [8266b3ea] apic_timer_interrupt+0x6a/0x80 [ 6540.134048] softirqs last enabled at (286811768): [810b746e] __do_softirq+0x16e/0x190 [ 6540.134048] softirqs last disabled at (286811749): [8266bdec] call_softirq+0x1c/0x30 [ 6540.134048] CPU 0 [ 6540.134048] Pid: 38, comm: kworker/u:1 Tainted: GW 3.4.0-rc1-next-20120404-sasha-dirty #72 [ 6540.134048] RIP: 0010:[8111f30e] [8111f30e] smp_call_function_many+0x27e/0x2a0 This cpu is waiting for some other cpu to process a function (likely rps_trigger_softirq(), from the trace). Can you get a backtrace on all cpus when this happens? It would be good to enhance smp_call_function_*() to do this automatically when it happens - it's spinning there anyway, so it might as well count the iterations and NMI the lagging cpu if it waits for too long. -- error compiling committee.c: too many arguments to function
Re: Questing regarding KVM Guest PMU
On 04/04/2012 01:29 PM, Gleb Natapov wrote: ok. seems to be. will move over to perf as it's working fine inside the guest. Good riddance IMO. I managed to run it on a guest (but not on my host!). The thing is buggy. It does not use the global ctrl MSR to enable counters, and kvm has all of them disabled by default. I didn't find what value this MSR should have after reset, so this may be either a kvm bug or real BIOSes enable all counters in the global ctrl MSR for PMUv1 compatibility. Doing wrmsr 0x38f 0x7000f solves this problem. The second problem is that oprofile reprograms PMU counters without disabling them first, and this is explicitly prohibited by the Intel SDM. The patch below solves that, but oprofile is the one who should be fixed. Both should be fixed, there may be other profilers affected.

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index a73f0c1..be05028 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -396,6 +396,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
 		    (pmc = get_fixed_pmc(pmu, index))) {
 			data = (s64)(s32)data;
 			pmc->counter += data - read_pmc(pmc);
+			reprogram_gp_counter(pmc, pmc->eventsel);
 			return 0;
 		} else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
 			if (data == pmc->eventsel)

-- error compiling committee.c: too many arguments to function
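The SDM rule being discussed boils down to a three-step sequence: clear the enable bit of IA32_PERFEVTSELx, write IA32_PMCx, then restore the eventsel. A minimal sketch of that ordering, with the MSRs modeled as plain arrays rather than real wrmsr/rdmsr accesses:

```c
#include <stdint.h>

#define EVTSEL_EN (1ull << 22)

/* Hypothetical MSR store, for illustration only. */
static uint64_t evtsel[4], pmc[4];

/*
 * Per the SDM rule quoted in this thread, the event logic must be
 * disabled (EN = 0 in IA32_PERFEVTSELx) before IA32_PMCx is
 * rewritten, then re-enabled.  oprofile skips the disable step,
 * which is what the kvm patch above works around.
 */
static void safe_write_pmc(int i, uint64_t count)
{
    uint64_t saved = evtsel[i];

    evtsel[i] = saved & ~EVTSEL_EN;  /* stop the counter         */
    pmc[i] = count;                  /* now safe to write        */
    evtsel[i] = saved;               /* restore, re-enabling EN  */
}
```

In the real kernel this would be three wrmsr calls; the array model just makes the required ordering explicit.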
Re: [RFC PATCH 0/1] NUMA aware scheduling per vhost thread patch
On Tue, Mar 27, 2012 at 10:43:03AM -0700, Shirley Ma wrote: On Tue, 2012-03-27 at 18:09 +0800, Jason Wang wrote: Hi: Thanks for the work and it looks very reasonable, some questions below. Yes, I am happy to see the per-cpu work resurrected. Some comments below. On 03/23/2012 07:48 AM, Shirley Ma wrote: Sorry for being late to submit this patch. I have spent lots of time trying to find the best approach. This effort is still going on... This patch is built against the net-next tree. This is an experimental RFC patch. The purpose of this patch is to address the KVM networking scalability and NUMA scheduling issue. It also needs testing on non-NUMA machines; I see that you just choose the cpu that initiates the work on a non-numa machine, which seems sub-optimal. Good suggestions. I don't have any non-numa systems, but KK ran some tests on a non-numa system. He could see around a 20% performance gain for a single VM, local host to guest. I hope we can run a full test on a non-numa system. On a non-numa system, the same per-cpu vhost thread will always be picked up consistently for a particular vq, since all cores are on the same cpu socket. So there will be two per-cpu vhost threads handling TX/RX simultaneously. The existing implementation of vhost creates a vhost thread per device (virtio_net). RX and TX work of a VM's device is handled by the same vhost thread. One limitation of this implementation is that with an increasing number of VMs or virtio-net interfaces, more vhost threads are created; this consumes more kernel resources and induces more thread context switch/scheduling overhead. We noticed that KVM network performance doesn't scale with an increasing number of VMs. The other limitation is that with a single vhost thread processing both RX and TX, one kind of work can block the other. So we created this per-cpu vhost thread implementation. The number of vhost cpu threads is limited to the number of cpus on the host.
To address these limitations, we are proposing a per-cpu vhost thread model where the number of vhost threads is limited and equal to the number of online cpus on the host. The number of vhost threads needs more consideration. Consider that we have a 1024-core host with a card that has 16 tx/rx queues: do we really need 1024 vhost threads? In this case, we could add a module parameter to limit the number of cores/sockets to be used. Hmm. And then which cores would we run on? Also, is the parameter different between guests? Another idea is to scale the # of threads on demand. Sharing the same thread between guests is also an interesting approach; if we did this then per-cpu won't be so expensive, but making this work well with cgroups would be a challenge. Based on our testing experience, the vcpus can be scheduled across cpu sockets even when the number of vcpus is smaller than the number of cores per cpu socket and there is no other activity besides the KVM networking workload. We found that if the vhost thread is scheduled on the same socket where the work is received, the performance will be better. So in this per-cpu vhost thread implementation, a vhost thread is selected dynamically based on where the TX/RX work is initiated. A vhost thread on the same cpu socket is selected, but not on the same cpu as the vcpu/interrupt thread that initiated the TX/RX work. When we tested this RFC patch, the other interesting thing we found is that the performance results also seem related to NIC flow steering. We are spending time evaluating different NICs' flow director implementations now. We will enhance this patch based on our findings later. We have tried different scheduling: per-device based, per-vq based, and per work type (tx_kick, rx_kick, tx_net, rx_net) based vhost scheduling; we found that so far the per-vq based scheduling is good enough for now. Could you please explain more about those scheduling strategies?
Does per-device based mean a dedicated vhost thread handles all work from that vhost device? As you mentioned, maybe the scheduling could be improved to take the flow steering info (queue mapping, rxhash, etc.) of the skb on the host into account. Yes, per-device scheduling means one per-cpu vhost thread handles all work from one particular vhost device. Yes, we think scheduling that takes flow steering info into account would help performance. I am studying this now. Did anything interesting turn up? We also tried different algorithms to select which per-cpu vhost thread runs on a specific cpu socket: avg_load balance, and randomly... It may be worth accounting for out-of-order packets during the test, as for a single stream a different cpu/vhost/physical queue may be chosen to do the packet transmission/reception? Good point. I haven't gone through all the data yet. netstat
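The socket-local selection policy described in this thread — pick a vhost thread on the initiator's socket, but not on the initiator's own cpu — might look roughly like the sketch below. The topology table is an invented two-socket example, not derived from the patch:

```c
/* cpu -> numa node map; invented 2-socket, 4-cores-per-socket topology. */
static const int cpu_to_node[8] = { 0, 0, 0, 0, 1, 1, 1, 1 };
#define NR_CPUS 8

/*
 * Pick a per-cpu vhost thread on the same node as the cpu that
 * initiated the TX/RX work, preferring any core other than the
 * initiator so the vcpu and the vhost thread do not compete for
 * one core.
 */
static int pick_vhost_cpu(int initiator)
{
    int node = cpu_to_node[initiator];

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu != initiator && cpu_to_node[cpu] == node)
            return cpu;
    return initiator;  /* single-cpu node: fall back to the initiator */
}
```

A real implementation would also weigh per-cpu load (the avg_load balancing mentioned above) instead of taking the first same-node core.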
Re: CPU softlockup due to smp_call_function()
On Thu, Apr 5, 2012 at 2:24 PM, Avi Kivity a...@redhat.com wrote: On 04/04/2012 11:12 PM, Sasha Levin wrote: Hi all, I've starting seeing soft lockups resulting from smp_call_function() calls. I've attached two different backtraces of this happening with different code paths. This is running inside a KVM guest with the trinity fuzzer, using today's linux-next kernel. [ 6540.134009] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u:1:38] [ 6540.134048] irq event stamp: 286811770 [ 6540.134048] hardirqs last enabled at (286811769): [82669e74] restore_args+0x0/0x30 [ 6540.134048] hardirqs last disabled at (286811770): [8266b3ea] apic_timer_interrupt+0x6a/0x80 [ 6540.134048] softirqs last enabled at (286811768): [810b746e] __do_softirq+0x16e/0x190 [ 6540.134048] softirqs last disabled at (286811749): [8266bdec] call_softirq+0x1c/0x30 [ 6540.134048] CPU 0 [ 6540.134048] Pid: 38, comm: kworker/u:1 Tainted: G W 3.4.0-rc1-next-20120404-sasha-dirty #72 [ 6540.134048] RIP: 0010:[8111f30e] [8111f30e] smp_call_function_many+0x27e/0x2a0 This cpu is waiting for some other cpu to process a function (likely rps_trigger_softirq(), from the trace). Can you get a backtrace on all cpus when this happens? It would be good to enhance smp_call_function_*() to do this automatically when it happens - it's spinning there anyway, so it might as well count the iterations and NMI the lagging cpu if it waits for too long. What do you think about modifying the softlockup detector to NMI all CPUs if it's going to panic because it detected a lockup?
Re: Questing regarding KVM Guest PMU
On Thu, Apr 05, 2012 at 03:27:18PM +0300, Avi Kivity wrote: On 04/04/2012 01:29 PM, Gleb Natapov wrote: ok. seems to be. will move over to perf as its working fine inside guest. Good riddance IMO. I managed to run it on a guest (but not on my host!). The thing is buggy. It does not use global ctrl MSR to enable counters and kvm has all of them disabled by default. I didn't find what value this MSR should have after reset, so this may be either kvm bug or real BIOSes enable all counters in global ctrl MSR for PMUv1 compatibility. Doing wrmsr 0x38f 0x7000f solves this problem. The second problem is that oprofile reprograms PMU counters without disabling them first and this is explicitly prohibited by Intel SDM. The patch below solves that, but oprofile is the one who should be fixed. Both should be fixed, there may be other profilers affected. The global ctrl msr issue I need to investigate further, but the second one is a bug that should be fixed in oprofile. The Intel spec clearly says: EN (Enable Counters) Flag (bit 22) — When set, performance counting is enabled in the corresponding performance-monitoring counter; when clear, the corresponding counter is disabled. The event logic unit for a UMASK must be disabled by setting IA32_PERFEVTSELx[bit 22] = 0, before writing to IA32_PMCx. I suspect that on real HW they get wrong results too. It is simply subtly wrong, so it is not as noticeable as with kvm.

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index a73f0c1..be05028 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -396,6 +396,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
 		    (pmc = get_fixed_pmc(pmu, index))) {
 			data = (s64)(s32)data;
 			pmc->counter += data - read_pmc(pmc);
+			reprogram_gp_counter(pmc, pmc->eventsel);
 			return 0;
 		} else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
 			if (data == pmc->eventsel)

-- Gleb.
Re: Questing regarding KVM Guest PMU
On 04/05/2012 03:37 PM, Gleb Natapov wrote: On Thu, Apr 05, 2012 at 03:27:18PM +0300, Avi Kivity wrote: On 04/04/2012 01:29 PM, Gleb Natapov wrote: ok. seems to be. will move over to perf as its working fine inside guest. Good riddance IMO. I managed to run it on a guest (but not on my host!). The thing is buggy. It does not use global ctrl MSR to enable counters and kvm has all of them disabled by default. I didn't find what value this MSR should have after reset, so this may be either kvm bug or real BIOSes enable all counters in global ctrl MSR for PMUv1 compatibility. Doing wrmsr 0x38f 0x7000f solves this problem. The second problem is that oprofile reprogram PMU counters without disabling them first and this is explicitly prohibited by Intel SDM. The patch below solve that, but oprofile is the one who should be fixed. Both should be fixed, there may be other profilers affected. Global ctrl msr issue I need to investigate further, but second one is bug that should be fixed in oprofile. Intel spec clearly says: EN (Enable Counters) Flag (bit 22) — When set, performance counting is enabled in the corresponding performance-monitoring counter; when clear, the corresponding counter is disabled. The event logic unit for a UMASK must be disabled by setting IA32_PERFEVTSELx[bit 22] = 0, before writing to IA32_PMCx. I suspect that on real HW they got wrong result too. It simply subtly wrong, so it is not as noticeable as with kvm. First, there is the standard Linus rant to never break userspace (and for us a guest kernel is userspace). Second, the Intel rule may have been added later (as preparation for a change in behaviour), so it may actually work correctly on older hardware. -- error compiling committee.c: too many arguments to function
Re: CPU softlockup due to smp_call_function()
On 04/05/2012 03:32 PM, Sasha Levin wrote: It would be good to enhance smp_call_function_*() to do this automatically when it happens - it's spinning there anyway, so it might as well count the iterations and NMI the lagging cpu if it waits for too long. What do you think about modifying the softlockup detector to NMI all CPUs if it's going to panic because it detected a lockup? Fine by me. The triggering cpu should be dumped first though. We lose knowledge of exactly which cpu is hung, but it should usually be easily seen from the traces (the exceptional case would be a busy server that is handling a lot of IPIs at the moment we trigger the traces). -- error compiling committee.c: too many arguments to function
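The count-the-iterations idea from this thread reduces to a pattern like the following sketch. `trigger_backtrace()` is a hypothetical stand-in for sending the NMI/backtrace IPI to the laggard, and the limits are arbitrary:

```c
/* Invoked in place of the real NMI-backtrace IPI in this sketch. */
static int backtrace_fired;
static void trigger_backtrace(void) { backtrace_fired = 1; }

#define SPIN_LIMIT 1000000L

/*
 * Spin until *done becomes nonzero, as the smp_call_function wait
 * loop does.  After SPIN_LIMIT iterations, fire the diagnostic once
 * and keep waiting (here we give up shortly after, so the sketch
 * terminates).  Returns the number of iterations spent spinning.
 */
static long spin_wait(volatile int *done)
{
    long iters = 0;

    while (!*done) {
        if (++iters == SPIN_LIMIT)
            trigger_backtrace();     /* laggard gets dumped */
        if (iters > 2 * SPIN_LIMIT)  /* bail out in this sketch only */
            break;
    }
    return iters;
}
```

The kernel loop would of course never break out; the point is only that the waiter is already burning cycles, so counting them and kicking a diagnostic costs nothing.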
Re: Questing regarding KVM Guest PMU
On Thu, Apr 05, 2012 at 03:48:51PM +0300, Avi Kivity wrote: On 04/05/2012 03:37 PM, Gleb Natapov wrote: On Thu, Apr 05, 2012 at 03:27:18PM +0300, Avi Kivity wrote: On 04/04/2012 01:29 PM, Gleb Natapov wrote: ok. seems to be. will move over to perf as its working fine inside guest. Good riddance IMO. I managed to run it on a guest (but not on my host!). The thing is buggy. It does not use global ctrl MSR to enable counters and kvm has all of them disabled by default. I didn't find what value this MSR should have after reset, so this may be either kvm bug or real BIOSes enable all counters in global ctrl MSR for PMUv1 compatibility. Doing wrmsr 0x38f 0x7000f solves this problem. The second problem is that oprofile reprogram PMU counters without disabling them first and this is explicitly prohibited by Intel SDM. The patch below solve that, but oprofile is the one who should be fixed. Both should be fixed, there may be other profilers affected. Global ctrl msr issue I need to investigate further, but second one is bug that should be fixed in oprofile. Intel spec clearly says: EN (Enable Counters) Flag (bit 22) — When set, performance counting is enabled in the corresponding performance-monitoring counter; when clear, the corresponding counter is disabled. The event logic unit for a UMASK must be disabled by setting IA32_PERFEVTSELx[bit 22] = 0, before writing to IA32_PMCx. I suspect that on real HW they got wrong result too. It simply subtly wrong, so it is not as noticeable as with kvm. First, there is the standard Linus rant to never break userspace (and We broke nothing. Oprofile is broken and never could have worked according to Intel spec. (It didn't manage to get any result from it on real CPU, but may be this is unrelated). for us a guest kernel is userspace). Second, the Intel rule may have been added later (as preparation for a change in behaviour), so it may actually work correctly on older hardware. 
May be (the rule is specified for PMUv1 BTW), but we emulate newer HW, the one that the spec describes, and we report this in cpuid leaf 10. I highly doubt Intel would change the CPU in a way that may make old software work incorrectly on newer cpus. -- Gleb.
Re: Questing regarding KVM Guest PMU
On 04/05/2012 04:26 PM, Gleb Natapov wrote: On Thu, Apr 05, 2012 at 03:48:51PM +0300, Avi Kivity wrote: On 04/05/2012 03:37 PM, Gleb Natapov wrote: On Thu, Apr 05, 2012 at 03:27:18PM +0300, Avi Kivity wrote: On 04/04/2012 01:29 PM, Gleb Natapov wrote: ok. seems to be. will move over to perf as its working fine inside guest. Good riddance IMO. I managed to run it on a guest (but not on my host!). The thing is buggy. It does not use global ctrl MSR to enable counters and kvm has all of them disabled by default. I didn't find what value this MSR should have after reset, so this may be either kvm bug or real BIOSes enable all counters in global ctrl MSR for PMUv1 compatibility. Doing wrmsr 0x38f 0x7000f solves this problem. The second problem is that oprofile reprogram PMU counters without disabling them first and this is explicitly prohibited by Intel SDM. The patch below solve that, but oprofile is the one who should be fixed. Both should be fixed, there may be other profilers affected. Global ctrl msr issue I need to investigate further, but second one is bug that should be fixed in oprofile. Intel spec clearly says: EN (Enable Counters) Flag (bit 22) — When set, performance counting is enabled in the corresponding performance-monitoring counter; when clear, the corresponding counter is disabled. The event logic unit for a UMASK must be disabled by setting IA32_PERFEVTSELx[bit 22] = 0, before writing to IA32_PMCx. I suspect that on real HW they got wrong result too. It simply subtly wrong, so it is not as noticeable as with kvm. First, there is the standard Linus rant to never break userspace (and We broke nothing. Oprofile is broken and never could have worked according to Intel spec. (It didn't manage to get any result from it on real CPU, but may be this is unrelated). But it did work sometime in the past, I've used it. for us a guest kernel is userspace). 
Second, the Intel rule may have been added later (as preparation for a change in behaviour), so it may actually work correctly on older hardware. May be (the rule is specified for PMUv1 BTW), but we emulate newer HW, the one that the spec describes, and we report this in cpuid leaf 10. I highly doubt Intel would change the CPU in a way that may make old software work incorrectly on newer cpus. The non-architectural PMU is version specific, so they could have made this change before the arch PMU was introduced. -- error compiling committee.c: too many arguments to function
Re: Questing regarding KVM Guest PMU
On Thu, Apr 05, 2012 at 04:28:48PM +0300, Avi Kivity wrote: On 04/05/2012 04:26 PM, Gleb Natapov wrote: On Thu, Apr 05, 2012 at 03:48:51PM +0300, Avi Kivity wrote: On 04/05/2012 03:37 PM, Gleb Natapov wrote: On Thu, Apr 05, 2012 at 03:27:18PM +0300, Avi Kivity wrote: On 04/04/2012 01:29 PM, Gleb Natapov wrote: ok. seems to be. will move over to perf as its working fine inside guest. Good riddance IMO. I managed to run it on a guest (but not on my host!). The thing is buggy. It does not use global ctrl MSR to enable counters and kvm has all of them disabled by default. I didn't find what value this MSR should have after reset, so this may be either kvm bug or real BIOSes enable all counters in global ctrl MSR for PMUv1 compatibility. Doing wrmsr 0x38f 0x7000f solves this problem. The second problem is that oprofile reprogram PMU counters without disabling them first and this is explicitly prohibited by Intel SDM. The patch below solve that, but oprofile is the one who should be fixed. Both should be fixed, there may be other profilers affected. Global ctrl msr issue I need to investigate further, but second one is bug that should be fixed in oprofile. Intel spec clearly says: EN (Enable Counters) Flag (bit 22) — When set, performance counting is enabled in the corresponding performance-monitoring counter; when clear, the corresponding counter is disabled. The event logic unit for a UMASK must be disabled by setting IA32_PERFEVTSELx[bit 22] = 0, before writing to IA32_PMCx. I suspect that on real HW they got wrong result too. It simply subtly wrong, so it is not as noticeable as with kvm. First, there is the standard Linus rant to never break userspace (and We broke nothing. Oprofile is broken and never could have worked according to Intel spec. (It didn't manage to get any result from it on real CPU, but may be this is unrelated). But it did work sometime in the past, I've used it. May be it used NMI based profiling. We should ask oprofile developers. 
As I said I am almost sure my inability to run it on a host is probably PEBKAC, although I ran the same script exactly on the host and the guest (the script is from the first email of this thread) for us a guest kernel is userspace). Second, the Intel rule may have been added later (as preparation for a change in behaviour), so it may actually work correctly on older hardware. May be (the rule is specified for PMUv1 BTW), but we emulate newer HW, the one that the spec describes, and we report this in cpuid leaf 10. I highly doubt Intel would change the CPU in a way that may make old software work incorrectly on newer cpus. The non-architectural PMU is version specific, so they could have made this change before the arch PMU was introduced. So Table 18-11 in section 18.4.4.3, Writing a PEBS Interrupt Service Routine, has: For Processors based on Intel Core microarchitecture, Counters MUST be stopped before writing. MUST is written in all caps in the spec. For NetBurst this is optional. Really, I fail to see how this is not an oprofile bug. -- Gleb.
Re: Questing regarding KVM Guest PMU
On Thu, Apr 05, 2012 at 04:48:57PM +0300, Gleb Natapov wrote: On Thu, Apr 05, 2012 at 04:28:48PM +0300, Avi Kivity wrote: On 04/05/2012 04:26 PM, Gleb Natapov wrote: On Thu, Apr 05, 2012 at 03:48:51PM +0300, Avi Kivity wrote: On 04/05/2012 03:37 PM, Gleb Natapov wrote: On Thu, Apr 05, 2012 at 03:27:18PM +0300, Avi Kivity wrote: On 04/04/2012 01:29 PM, Gleb Natapov wrote: ok. seems to be. will move over to perf as its working fine inside guest. Good riddance IMO. I managed to run it on a guest (but not on my host!). The thing is buggy. It does not use global ctrl MSR to enable counters and kvm has all of them disabled by default. I didn't find what value this MSR should have after reset, so this may be either kvm bug or real BIOSes enable all counters in global ctrl MSR for PMUv1 compatibility. Doing wrmsr 0x38f 0x7000f solves this problem. The second problem is that oprofile reprogram PMU counters without disabling them first and this is explicitly prohibited by Intel SDM. The patch below solve that, but oprofile is the one who should be fixed. Both should be fixed, there may be other profilers affected. Global ctrl msr issue I need to investigate further, but second one is bug that should be fixed in oprofile. Intel spec clearly says: EN (Enable Counters) Flag (bit 22) — When set, performance counting is enabled in the corresponding performance-monitoring counter; when clear, the corresponding counter is disabled. The event logic unit for a UMASK must be disabled by setting IA32_PERFEVTSELx[bit 22] = 0, before writing to IA32_PMCx. I suspect that on real HW they got wrong result too. It simply subtly wrong, so it is not as noticeable as with kvm. First, there is the standard Linus rant to never break userspace (and We broke nothing. Oprofile is broken and never could have worked according to Intel spec. (It didn't manage to get any result from it on real CPU, but may be this is unrelated). But it did work sometime in the past, I've used it. 
May be it used NMI based profiling. We should ask oprofile developers. As I said I am almost sure my inability to run it on a host is probably PEBKAC, although I ran the same script exactly on the host and the guest (the script is from the first email of this thread) After upgrading the kernel to latest git from whatever was there, the same script works and counts CPU_CLK_UNHALT events. -- Gleb.
Re: Questing regarding KVM Guest PMU
On 04/05/2012 04:57 PM, Gleb Natapov wrote: May be it used NMI based profiling. We should ask oprofile developers. As I said I am almost sure my inability to run it on a host is probably PEBKAC, although I ran the same script exactly on the host and the guest (the script is from the first email of this thread) After upgrading the kernel to latest git from whatever it was there the same script works and counts CPU_CLK_UNHALT events. This is even while it violates the Intel guidelines? -- error compiling committee.c: too many arguments to function
Re: Questing regarding KVM Guest PMU
On Thu, Apr 05, 2012 at 05:38:40PM +0300, Avi Kivity wrote: On 04/05/2012 04:57 PM, Gleb Natapov wrote: May be it used NMI based profiling. We should ask oprofile developers. As I said I am almost sure my inability to run it on a host is probably PEBKAC, although I ran the same script exactly on the host and the guest (the script is from the first email of this thread) After upgrading the kernel to latest git from whatever it was there the same script works and counts CPU_CLK_UNHALT events. This is even while it violates the Intel guidelines? Yes, but who says the result is correct :) It seems that we handle the global ctrl msr wrong. That is, a counter can be enabled either in global ctrl or in eventsel. Trying to confirm that. -- Gleb.
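The two candidate semantics being debated here can be written out explicitly. Which one matches real hardware is exactly the open question Gleb is trying to confirm, so both readings are shown side by side as a sketch, not as a statement of the architecture:

```c
#include <stdint.h>

#define EVTSEL_EN (1ull << 22)

/* AND semantics: the counter counts only when both its bit in
 * IA32_PERF_GLOBAL_CTRL and EN in IA32_PERFEVTSELx are set. */
static int enabled_and(uint64_t global_ctrl, uint64_t evtsel, int idx)
{
    return ((global_ctrl >> idx) & 1) && (evtsel & EVTSEL_EN);
}

/* OR semantics: Gleb's hypothesis for PMUv1-compatible behaviour,
 * where either enable bit alone suffices. */
static int enabled_or(uint64_t global_ctrl, uint64_t evtsel, int idx)
{
    return ((global_ctrl >> idx) & 1) || (evtsel & EVTSEL_EN);
}
```

Under the OR reading, oprofile's failure to program the global ctrl MSR would go unnoticed on hardware but break under kvm's AND-style emulation, which is consistent with the symptoms reported earlier in the thread.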
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Thu, 2012-04-05 at 12:34 +0300, Michael S. Tsirkin wrote: On Wed, Apr 04, 2012 at 09:42:32PM -0600, Alex Williamson wrote: We've hit a kernel host panic, when issuing a 'system_reset' with an 82576 nic assigned and a Windows guest. Host system is a PowerEdge R815. [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993 [Hardware Error]: APEI generic hardware error status [Hardware Error]: severity: 1, fatal [Hardware Error]: section: 0, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 [Hardware Error]: section: 1, severity: 1, fatal [Hardware Error]: flags: 0x01 [Hardware Error]: primary [Hardware Error]: section_type: PCIe error [Hardware Error]: port_type: 0, PCIe end point [Hardware Error]: version: 1.0 [Hardware Error]: command: 0x, status: 0x0010 [Hardware Error]: device_id: :08:00.0 [Hardware Error]: slot: 1 [Hardware Error]: secondary_bus: 0x00 [Hardware Error]: vendor_id: 0x8086, device_id: 0x10c9 [Hardware Error]: class_code: 02 [Hardware Error]: aer_status: 0x0010, aer_mask: 0x00018000 [Hardware Error]: Unsupported Request [Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID [Hardware Error]: aer_uncor_severity: 0x00067011 [Hardware Error]: aer_tlp_header: 40001001 002f edbf800c 0100 Kernel panic - not syncing: Fatal hardware error! 
Pid: 0, comm: swapper Not tainted 2.6.32-242.el6.x86_64 #1 Call Trace: NMI [814f2fe5] ? panic+0xa0/0x168 [812f919c] ? ghes_notify_nmi+0x17c/0x180 [814f91d5] ? notifier_call_chain+0x55/0x80 [814f923a] ? atomic_notifier_call_chain+0x1a/0x20 [8109667e] ? notify_die+0x2e/0x30 [814f6e81] ? do_nmi+0x1a1/0x2b0 [814f6760] ? nmi+0x20/0x30 [8103762b] ? native_safe_halt+0xb/0x10 EOE [8101495d] ? default_idle+0x4d/0xb0 [81009e06] ? cpu_idle+0xb6/0x110 [814da63a] ? rest_init+0x7a/0x80 [81c1ff7b] ? start_kernel+0x424/0x430 [81c1f33a] ? x86_64_start_reservations+0x125/0x129 [81c1f438] ? x86_64_start_kernel+0xfa/0x109 The root cause of the problem is that the 'reset_assigned_device()' code first writes a 0 to the command register. Then, when qemu subsequently does a kvm_deassign_irq() (called by assign_irq(), in the system_reset path), the kernel ends up calling '__msix_mask_irq()', which performs a write to the memory mapped msi vector space. Since, we've explicitly told the device to disallow mmio access (via the 0 write to the command register), we end up with the above 'Unsupported Request'. The fix here is to first disable MSI-X, before doing the reset. We also disable MSI, leaving the device in INTx mode. In this way, the device is a known state after reset, and we avoid touching msi memory mapped space on any subsequent 'kvm_deassign_irq()'. Thanks to Michael S. Tsirkin for help in understanding what was going on here and Jason Baron, the original debugger of this problem. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Jason is out of the office for a couple weeks, so I'll try to resolve this while he's away. Somehow the emulated config updates were lost in Jason's original posting, so I've fixed that and taken Jan's suggestion to simply call into the update functions instead of open coding the interrupt disable. 
I think there still may be some disagreements about how to handle guest-generated errors in the host, but that's a large project, whereas this is something we should be doing at reset anyway; even if only a workaround, it resolves the problem above. I don't think there's a disagreement: don't we all agree they should be forwarded to qemu and on to the guest if possible, ignored if not? With AER perhaps, but I'm not sure we have that flexibility with APEI; I'd need to read up on the spec. It seems APEI defines that the BIOS gets first shot at handling the error and gets to dictate how the OS handles it. In this case, deciding that it's fatal.

 hw/device-assignment.c | 23 +++
 1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/hw/device-assignment.c
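The root cause above boils down to an ordering rule: masking an MSI-X vector is an MMIO write into the vector table, so it must happen while the device still decodes memory space. A toy user-space sketch of the two orderings (struct toy_dev and the helpers are invented for illustration; this is not the actual qemu or kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_MEMORY 0x2  /* MEM space decode enable bit */

/* Toy model of the assigned device: just the bits that matter here. */
struct toy_dev {
    uint16_t command;      /* PCI command register */
    bool msix_enabled;
    bool unsupported_req;  /* set if we touch MMIO while decode is off */
};

/* Masking an MSI-X vector is an MMIO write into the vector table. */
static void msix_mask_vector(struct toy_dev *d)
{
    if (!(d->command & PCI_COMMAND_MEMORY))
        d->unsupported_req = true;  /* the AER 'Unsupported Request' above */
}

/* Buggy order: clear the command register first, deassign IRQs later. */
static bool reset_then_deassign(struct toy_dev *d)
{
    d->command = 0;
    if (d->msix_enabled) {
        msix_mask_vector(d);  /* __msix_mask_irq() via kvm_deassign_irq() */
        d->msix_enabled = false;
    }
    return !d->unsupported_req;
}

/* Fixed order: disable MSI-X (masking vectors while decode is still on),
 * then clear the command register as part of the reset. */
static bool deassign_then_reset(struct toy_dev *d)
{
    if (d->msix_enabled) {
        msix_mask_vector(d);
        d->msix_enabled = false;
    }
    d->command = 0;
    return !d->unsupported_req;
}
```

Either way the device ends up with MSI-X off and the command register cleared; only the second ordering avoids the fatal MMIO access.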
Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path
On Thu, Apr 05, 2012 at 08:42:03AM -0600, Alex Williamson wrote: So far so good.

+ * We especially do not want MSI-X
+ * enabled since it lives in MMIO space, which is about to get
+ * disabled.

I think we are better off dropping the above, because it's a bug that we disable MMIO space on the physical device: we did not enable it (kvm did), so let's not disable it. I'd like to investigate removing the command register clearing, but it did solve a bug at the time it went in. That will be a separate patch, so we can remove the above sentence when that's removed. It may be a bug, but it's describing the current behavior. Fair enough. Acked-by: Michael S. Tsirkin m...@redhat.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/1] NUMA aware scheduling per cpu vhost thread patch
On Thu, 2012-04-05 at 15:28 +0300, Michael S. Tsirkin wrote: On Tue, Mar 27, 2012 at 10:43:03AM -0700, Shirley Ma wrote: On Tue, 2012-03-27 at 18:09 +0800, Jason Wang wrote: Hi: Thanks for the work, it looks very reasonable; some questions below. Yes, I am happy to see the per-cpu work resurrected. Some comments below. Glad to see you have time for reviewing this. On 03/23/2012 07:48 AM, Shirley Ma wrote: Sorry for being late to submit this patch. I have spent lots of time trying to find the best approach. This effort is still going on... This patch is built against the net-next tree. This is an experimental RFC patch. The purpose of this patch is to address KVM networking scalability and NUMA scheduling issues. This also needs testing on non-NUMA machines; I see that you just choose the cpu that initiates the work on non-NUMA machines, which seems suboptimal. Good suggestion. I don't have any non-NUMA systems, but KK ran some tests on a non-NUMA system. He could see around a 20% performance gain for a single VM, local host to guest. I hope we can run a full test on a non-NUMA system. On a non-NUMA system, the same per-cpu vhost thread will always be picked consistently for a particular vq, since all cores are on the same cpu socket. So there will be two per-cpu vhost threads handling TX/RX simultaneously. The existing vhost implementation creates one vhost thread per device (virtio_net). RX and TX work for a VM's device is handled by the same vhost thread. One limitation of this implementation is that as the number of VMs or virtio-net interfaces increases, more vhost threads are created; they consume more kernel resources and induce more thread context-switch/scheduling overhead. We noticed that KVM network performance doesn't scale with an increasing number of VMs. The other limitation is that with a single vhost thread processing both RX and TX, one kind of work can block the other. So we created this per-cpu vhost thread implementation.
To address these limitations, we are proposing a per-cpu vhost thread model in which the number of vhost threads is limited to the number of online cpus on the host. The number of vhost threads needs more consideration. Consider a 1024-core host with a card that has 16 tx/rx queues: do we really need 1024 vhost threads? In this case, we could add a module parameter to limit the number of cores/sockets to be used. Hmm. And then which cores would we run on? Also, is the parameter different between guests? Another idea is to scale the number of threads on demand. If we are able to pass the number of guests/vcpus to vhost, we can scale the vhost threads. Any API to get this info? Sharing the same thread between guests is also an interesting approach; if we did this then per-cpu wouldn't be so expensive, but making this work well with cgroups would be a challenge. Yes, I am comparing the vhost-thread-pool-shared-among-guests approach with the per-cpu vhost approach now. It's a challenge to make it work with cgroups anyway. Based on our testing experience, the vcpus can be scheduled across cpu sockets even when the number of vcpus is smaller than the number of cores per socket and there is no other activity besides the KVM networking workload. We found that if the vhost thread is scheduled on the same socket where the work is received, performance is better. So in this per-cpu vhost thread implementation, a vhost thread is selected dynamically based on where the TX/RX work is initiated: a vhost thread on the same cpu socket is selected, but not on the same cpu as the vcpu/interrupt thread that initiated the TX/RX work. When we tested this RFC patch, the other interesting thing we found is that the performance results also seem related to NIC flow steering. We are spending time evaluating different NICs' flow director implementations now. We will enhance this patch based on our findings later.
We have tried different scheduling: per-device based, per-vq based, and per work type (tx_kick, rx_kick, tx_net, rx_net) based vhost scheduling; we found that so far the per-vq based scheduling is good enough for now. Could you please explain more about those scheduling strategies? Does per-device based mean letting a dedicated vhost thread handle all work from that vhost device? As you mentioned, maybe an improvement to the scheduling would take the flow steering info (queue mapping, rxhash etc.) of the skb in the host into account. Yes, per-device scheduling means one per-cpu vhost thread handles all work from one particular vhost device. Yes, we think
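The selection policy described above (a cpu on the same socket as the one that initiated the TX/RX work, but not that cpu itself) can be sketched as a plain function. This is a user-space toy: cpu_to_node[] is a made-up stand-in for the kernel's cpu_to_node()/cpumask helpers, not the actual patch code:

```c
#include <assert.h>

/* Pick a cpu for the vhost thread: same NUMA node as the initiating
 * cpu, but a different cpu, so the vhost thread does not compete with
 * the vcpu/interrupt thread that kicked the work. */
static int pick_vhost_cpu(const int *cpu_to_node, int ncpus, int initiator)
{
    int node = cpu_to_node[initiator];
    int cpu;

    for (cpu = 0; cpu < ncpus; cpu++)
        if (cpu != initiator && cpu_to_node[cpu] == node)
            return cpu;        /* same socket, different core */

    return initiator;          /* fall back (e.g. single-cpu host) */
}
```

A real implementation would also spread load across the node's cpus rather than always taking the first match; this only shows the placement constraint.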
Re: [GIT PULL] Please pull powerpc KVM fixes
On 04/03/2012 03:22 PM, Paul Mackerras wrote: On Tue, Apr 03, 2012 at 10:03:05PM +1000, Paul Mackerras wrote: The following changes since commit 592f5d87b3feee9d60411f19d583038c0c7670ad: OK, I messed up the git request-pull command. The request should have looked like this:

The following changes since commit b1a808ff436343956a6ae63178ea1810c5e5a3a1:

  Merge branch 'for-next' of git://gitorious.org/kernel-hsi/kernel-hsi (2012-04-02 09:50:40 -0700)

are available in the git repository at:

  git://github.com/paulusmack/linux tags/powerpc-fixes

Thanks, pulled, will push shortly. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] KVM call minutes April 3
On 04/04/2012 05:39 PM, Michael Roth wrote: Will the command line take into account hot-plugged devices? No, that's a good point. We'd probably need to generate the options required to ensure the devices are created on the target, and we'd only be able to do that just before sending the device state. That means we need a way to create machines after we've completed tasks like memory migration, which probably has similar requirements to just being able to instantiate a machine from a serialized QOM composition tree. Or we forward the hotplug events over the migration protocol. This is needed for devices that include memory (like ivshmem.c):

  page, page, page, hotplug event, page, page for the new device, page
[GIT PULL] KVM fixes for 3.4-rc1
Linus, please pull a few fixes from

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git kvm-updates/3.4

A bunch of powerpc KVM fixes, a guest and a host RCU fix (unrelated), and a small build fix.

Alexander Graf (3):
      KVM: PPC: Book3S: Compile fix for ppc32 in HIOR access code
      KVM: PPC: Save/Restore CR over vcpu_run
      KVM: PPC: Book3S: PR: Fix preemption

Avi Kivity (1):
      Merge tag 'powerpc-fixes' of git://github.com/paulusmack/linux into new/master

Gleb Natapov (1):
      KVM: Resolve RCU vs. async page fault problem

Marcelo Tosatti (1):
      KVM: VMX: vmx_set_cr0 expects kvm->srcu locked

Paul Mackerras (2):
      KVM: PPC: Book3S HV: Fix kvm_alloc_linear in case where no linears exist
      KVM: PPC: Book3S HV: Save and restore CR in __kvmppc_vcore_entry

Sasikantha babu (1):
      KVM: PMU: Fix integer constant is too large warning in kvm_pmu_set_msr()
Re: [RFC PATCH 0/1] NUMA aware scheduling per cpu vhost thread patch
On Thu, 2012-04-05 at 08:22 -0700, Shirley Ma wrote: Haven't had time to focus on the single-stream result yet. I forgot to mention that if I switch the vhost scheduling from per-vq based to per-device based, this minor single-stream test regression goes away. However, the improvement in tcp_rrs, udp_rrs and the other stream cases is also gone. Shirley
New git workflow
In a recent conversation, Linus persuaded me that it's time for a change in our git workflow; the following will bring it in line with the current practices of most trees. The current 'master' branch will be abandoned (still available for reviewing history). The new branch structure will be as follows:

'master': contains patches destined for fixes during the current merge window and cycle; for example, after v3.4 has been released, master contains fixes which will be merged during the 3.4 cycle.

'next': contains patches destined for the next merge window; for example, if v3.4 has just been released, then 'next' will be based on the v3.5-rc1 tag (or thereabouts), and will be accepting patches targeted at the 3.6 merge window (when 3.5 is released).

'queue': holding area for patches which have not been tested yet; may be rebased; usually merged into 'next'.

'kvm-updates/3.x': unchanged, patches destined for a stable branch.

Since 'next' is based on an -rc1 release, it may lack upstream non-kvm fixes (or features) needed for development. To work around that, an 'auto-next' branch will be provided that is a merge of current upstream and 'next' (this is more or less the equivalent of the old 'master' branch). That branch of course will be rebased often.

How to work with the new layout: If you're a developer, developing against 'next' is usually okay. If 'next' is unstable for you or you need a new upstream API, work against 'auto-next', but let the maintainers know when posting your patch. If you're working on a fix for the current cycle, work against upstream or 'master' (they should be equivalent most of the time). If you're a submaintainer, post git pull requests against 'next' or 'master', according to your merge target. Please ping Marcelo or myself with any question or suggestion. It's better to ask a silly question than to develop against a stale tree accidentally and waste your efforts. I'll publish the new branches tomorrow, with any luck.
Re: [PATCH 08/13] KVM: MMU: fast check whether page is writable
Hi Avi, Thanks very much for your review! Sorry for the delayed reply; I was on vacation. On 04/01/2012 11:52 PM, Avi Kivity wrote: On 03/29/2012 11:25 AM, Xiao Guangrong wrote: Using the PTE_LIST_WRITE_PROTECT bit in the rmap to store the write-protect status avoids unnecessary shadow page walking. Also, if no shadow page is indirect, the page is write-free.

@@ -2262,6 +2291,9 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 	}
 	if (need_unsync)
 		kvm_unsync_pages(vcpu, gfn);
+
+	*rmap &= ~PTE_LIST_WRITE_PROTECT;
+

So what are the rules for PTE_LIST_WRITE_PROTECT? Is it a cache for mmu_need_write_protect? I'd like to understand it. I guess it can be set while write protection is unneeded, and cleared on the next check? Yes, it is used as a cache for mmu_need_write_protect. When the gfn is protected by a sync sp or a read-only host page we set this bit, and it is cleared when the sp becomes unsync or the host page becomes writable. Maybe split into two functions, one the normal mmu_need_write_protect (but renamed) and a new mmu_need_write_protect(), with locked and unlocked variants, calling the old one. Okay, I will split it by introducing a new function named mmu_unsync_gfn_sp, which checks whether the sp can become unsync and unsyncs the sp if it is allowed, under the protection of mmu-lock.
Re: [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock
On 04/01/2012 11:53 PM, Avi Kivity wrote: On 03/29/2012 11:25 AM, Xiao Guangrong wrote: It depends on the PTE_LIST_WRITE_PROTECT bit in the rmap, which lets us quickly know whether the page is writable out of mmu-lock. Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/mmu.c | 17 + arch/x86/kvm/paging_tmpl.h | 2 +- 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3887a07..c029185 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1148,6 +1148,12 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 	*rmapp |= PTE_LIST_WRITE_PROTECT;
+
+	/*
+	 * Setting the PTE_LIST_WRITE_PROTECT bit before doing page
+	 * write-protect.
+	 */
+	smp_mb();
+

Only a wmb is needed. We should ensure this bit is set before reading the spte; it cooperates with the fast page fault path to avoid this case:

On fast page fault path:              On rmap_write_protect path:
                                        read spte: old_spte = *spte
                                        (reading the spte is reordered to
                                         before setting the
                                         PTE_LIST_WRITE_PROTECT bit)
  set spte.identification
  smp_mb
  if (!rmap.PTE_LIST_WRITE_PROTECT)
                                        set rmap.PTE_LIST_WRITE_PROTECT
      cmpxchg(sptep, spte, spte | WRITABLE)
                                        see old_spte.identification is not
                                        set, so it does not write-protect
                                        this page
OOPS!!!

Would it be better to store this bit in all the sptes instead? We're touching them in any case. More work to clear them, but un-write-protecting a page is beneficial anyway, as it can save a fault. There are two reasons:

- if we set this bit in the rmap, we can quickly check whether the page is writable before doing the shadow page walk.

- since a full barrier is needed, we would have to use smp_mb for every spte, like this:

	while ((spte = rmap_next(rmapp, spte))) {
		read spte
		smp_mb
		write-protect spte
	}

smp_mb called in the loop is not good, I think, yes? If you just want to save the fault, we can make all sptes writable in mmu_need_write_protect, but we should cache the gpte access bits into the spte first. That should be another patchset, I think.
:)
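The ordering contract being argued over can be modeled in user space with C11 atomics. This is a single-threaded toy (the bit layout and the function names are invented for illustration, and the kernel's smp_mb() calls are only marked as comments), showing the shape of the protocol rather than the actual KVM code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Made-up bit layout: bit 0 = writable (W), bit 1 = the vcpu's
 * identification mark; PTE_LIST_WRITE_PROTECT lives in the rmap word. */
#define SPTE_W   (1u << 0)
#define SPTE_ID  (1u << 1)
#define RMAP_WP  (1u << 0)

/* Fast page fault path: mark the spte, re-check the rmap bit, then try
 * to switch in the writable bit with a single cmpxchg. */
static int fast_page_fault(_Atomic uint32_t *spte, _Atomic uint32_t *rmap)
{
    uint32_t old = atomic_load(spte);

    atomic_store(spte, old | SPTE_ID);  /* set identification */
    /* smp_mb() here in the real code */
    if (atomic_load(rmap) & RMAP_WP)
        return 0;                       /* fall back to the slow path */

    old |= SPTE_ID;
    return atomic_compare_exchange_strong(spte, &old, old | SPTE_W);
}

/* rmap_write_protect path: publish the rmap bit first, then clear W on
 * any spte that is writable or carries an identification mark. */
static void rmap_write_protect(_Atomic uint32_t *spte, _Atomic uint32_t *rmap)
{
    atomic_fetch_or(rmap, RMAP_WP);
    /* smp_mb() here in the real code: the bit must be visible before we
     * read the spte, or a concurrent fast fault can slip through. */
    uint32_t v = atomic_load(spte);
    if (v & (SPTE_W | SPTE_ID))
        atomic_store(spte, v & ~(SPTE_W | SPTE_ID));
}
```

The point of the full barrier (rather than a wmb) is exactly the diagram above: the write-protect side both writes the rmap bit and then reads the spte, so a store-load fence is needed between the two.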
smp option of qemu-kvm
Hi, I started a kvm VM with the -smp 2 option. From inside the guest, I can see that /proc/cpuinfo shows 2 cores. However, on the host, I only observe one qemu-kvm process for that VM. Does that mean this VM is actually running on one core? If so, how do I make a VM run on 2 or more cores? Thanks. -ha
Re: smp option of qemu-kvm
On Thu, Apr 05, 2012 at 02:28:51PM -0400, Steven wrote: Hi, I started a kvm VM by adding the -smp 2 option. From inside the guest, I can see that /proc/cpuinfo outputs 2 cores. However, in the host, I only observe one qemu-kvm process for that VM. Does that mean this VM is actually running on one core? If so, how to make a VM run on 2 or more cores? Thanks. Each VCPU in KVM corresponds to a separate thread in the process. The 'ps' command only ever shows the thread leader by default - so you don't see those VCPU threads in the process list. Use e.g. 'ps -eLf' to see all threads. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: smp option of qemu-kvm
Hi Daniel, Thanks for your quick response. However, ps -eLf shows 4 threads for the VM, and I checked that the 4 threads have the same tgid. But the VM I created is with the -smp 2 option. Could you explain this? Thanks. On Thu, Apr 5, 2012 at 2:40 PM, Daniel P. Berrange berra...@redhat.com wrote: On Thu, Apr 05, 2012 at 02:28:51PM -0400, Steven wrote: Hi, I started a kvm VM by adding -smp 2 option. From inside the guest, I can see that /proc/cpuinfo outputs 2 cores. However, in the host, I only observe one qemu-kvm process for that VM. Does that mean this VM is actually running on one core? If so, how to make a VM to run on 2 or more cores? Thanks. Each VCPU in KVM corresponds to a separate thread in the process. The 'ps' command only ever shows the thread leader by default - so you don't see those VCPU threads in the process list. eg ps -eLf to see all threads Daniel
Re: smp option of qemu-kvm
On Thu, Apr 05, 2012 at 02:52:40PM -0400, Steven wrote: Hi, Daniel, Thanks for your quick response. However, the ps -eLf show 4 threads for the VM and I checked 4 threads have the same tgid. But the VM I created is with -smp 2 option. Could you explain this? Thanks. As well as the vCPU threads, QEMU creates other threads as needed, typically for I/O - indeed the count of threads may vary over time. Daniel
Re: [PATCH 11/13] KVM: MMU: fast path of handling guest page fault
On 04/02/2012 12:23 AM, Avi Kivity wrote: On 03/29/2012 11:27 AM, Xiao Guangrong wrote: If the present bit of the page fault error code is set, it indicates that the shadow page is populated on all levels; it means all we do is modify the access bits, which can be done out of mmu-lock. The trick in this patch is avoiding the race between the fast page fault path and the write-protect path; the write-protect path is a read-check-modify path: read the spte, check the W bit, then clear the W bit. What we do is populate an identification in the spte; if write-protect meets it, it modifies the spte even if the spte is read-only. See the comment in the code for more information.

+
+static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, gfn_t gfn,
+				   u32 error_code)
+{
+	unsigned long *rmap;
+	bool write = error_code & PFERR_WRITE_MASK;
+
+	/*
+	 * #PF can be fast only if the shadow page table is present, that
+	 * means we just need to change the access bits (e.g: R/W, U/S...)
+	 * which can be done out of mmu-lock.
+	 */
+	if (!(error_code & PFERR_PRESENT_MASK))
+		return false;
+
+	if (unlikely(vcpu->vcpu_id > max_vcpu_spte()))
+		return false;
+
+	rmap = gfn_to_rmap(vcpu->kvm, gfn, PT_PAGE_TABLE_LEVEL);

What prevents sp->gfns from being freed while this is going on? Did I miss the patch that makes mmu pages freed by rcu? sp->gfns is not used here; what we are using is the rmap, which comes from the memslot, protected by srcu. In this patch, we call kvm_mmu_page_get_gfn in fast_pf_fetch_direct_spte, which is only used for direct sps, so sp->gfns is also not touched in this case. But I want to use it in the next version; it should be freed in an rcu context, see: http://thread.gmane.org/gmane.comp.emulators.kvm.devel/88934 Also need barriers in kvm_mmu_get_page() to make sure sp->gfns is made visible before the page is hashed in.
We do not walk the hash tables of shadow pages; we walk the shadow page table currently being used by the vcpu, so we just need to care about the order of setting the spte and setting sp->gfns, but that has a data dependence in the current code:

	set spte
	if (is_shadow_present_pte(*sptep))
		set sp->gfns[]

Setting sp->gfns can not be reordered to before "set spte". + smp_rmb() Actually, I do not want to use a barrier here since a read barrier is not a compiler barrier on X86. And we just quickly check whether the page can be writable before the shadow page table walk; a real barrier is used before updating the spte.

+
+	/* Quickly check whether the page can be writable. */
+	if (write && (ACCESS_ONCE(*rmap) & PTE_LIST_WRITE_PROTECT))
+		return false;
+
+	return true;
+}
+
+typedef bool (*fast_pf_fetch_spte)(struct kvm_vcpu *vcpu, u64 *sptep,
+				   u64 *new_spte, gfn_t gfn, u32 expect_access,
+				   u64 spte);
+
+static bool
+fast_pf_fetch_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 *new_spte,
+			  gfn_t gfn, u32 expect_access, u64 spte)
+{
+	struct kvm_mmu_page *sp = page_header(__pa(sptep));
+

Why is this stable? Another cpu could have removed it. It is ok for the removed sp, since it is not freed on this path. And if the sp is removed, cmpxchg can see this change, since the spte is zapped.

+	WARN_ON(!sp->role.direct);
+
+	if (kvm_mmu_page_get_gfn(sp, sptep - sp->spt) != gfn)
+		return false;

Simpler to check if the page is sync; if it is, gfn has not changed. (but the check for a sync page is itself unstable). Checking sp.sync is unsafe, since there is a window between guest page write emulation and the spte update:

At the beginning:
	gpte = gfn1
	spte = gfn1's pfn
	(gfn1 is write-protected, gfn2 is write-free)

VCPU 1                                  VCPU 2
modify gpte and let it point to gfn2
page fault
write-emulation: gpte = gfn2
                                        page fault on gpte: gfn = gfn2
                                        fast page fault: see gfn2 is write-free
                                        and update spte: spte = gfn1's pfn + W
OOPS!!!
zap spte

+
+	set_spte(vcpu, new_spte, sp->role.access,
+		 expect_access & ACC_USER_MASK, expect_access & ACC_WRITE_MASK,
+		 sp->role.level, gfn, spte_to_pfn(spte), false, false,
+		 spte & SPTE_HOST_WRITEABLE, true);
+

What if the conditions have changed meanwhile? Should we set the spte atomically via set_spte? For example, an mmu notifier clearing the spte in parallel. All of these cases are ok; we do not directly update the spte, it is updated like this:

	new_spte = 0x0ull /* set it to nonpresent, so the rmap path can not be touched. */
	set_spte(vcpu, &new_spte, ...) /* the expected spte is temporarily saved to new_spte. */
	cmpxchg(spte, spte_old, new_spte)

It depends on cmpxchg. What if another code path has read *spte, and did
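The update rule Xiao describes (never poke bits into the live spte; build the complete new value in a local and publish it with one cmpxchg against a validated snapshot) reduces to a one-line sketch. User-space C11 atomics stand in for the kernel's cmpxchg here, and the name is invented:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Publish a fully-built spte value. Fails, and so forces a retry of
 * the fault, if anything (an mmu notifier zap, write emulation, a
 * concurrent write-protect) changed the spte since the snapshot was
 * taken and checked. */
static int publish_spte(_Atomic uint64_t *sptep, uint64_t snapshot,
                        uint64_t new_spte)
{
    return atomic_compare_exchange_strong(sptep, &snapshot, new_spte);
}
```

Because the new value is assembled off to the side (starting from a nonpresent 0), no reader of the live spte ever observes a half-built entry; the cmpxchg is the single linearization point.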
Re: WARNING: at arch/x86/kernel/smp.c:119 native_smp_send_reschedule+0x25/0x43()
A plain v3.3 kernel hits this when I just type reboot on a 32 cpu (2 socket * 8 core * 2 HT) system:

sd 0:0:0:0: [sda] Synchronizing SCSI cache
Restarting system.
machine restart
[ cut here ]
WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x5c/0x60()
Hardware name: S2600CP
Modules linked in:
Pid: 10068, comm: reboot Not tainted 3.3.0- #1
Call Trace:
IRQ [8104c37f] warn_slowpath_common+0x7f/0xc0
[8104c3da] warn_slowpath_null+0x1a/0x20
[810314bc] native_smp_send_reschedule+0x5c/0x60
[81084824] trigger_load_balance+0x244/0x2f0
[8107bf71] scheduler_tick+0x101/0x160
[8105b1de] update_process_times+0x6e/0x90
[8109e6a6] tick_sched_timer+0x66/0xc0
[81072d63] __run_hrtimer+0x83/0x1d0
[8109e640] ? tick_nohz_handler+0xf0/0xf0
[81073146] hrtimer_interrupt+0x106/0x240
[8157bf69] smp_apic_timer_interrupt+0x69/0x99
[8157ac1e] apic_timer_interrupt+0x6e/0x80
EOI [81033276] ? default_send_IPI_mask_allbutself_phys+0xd6/0x110
[81036397] physflat_send_IPI_allbutself+0x17/0x20
[81031849] native_nmi_stop_other_cpus+0xa9/0x110
[81030e24] native_machine_shutdown+0x64/0x90
[81030a97] native_machine_restart+0x27/0x40
[810309cf] machine_restart+0xf/0x20
[810637de] kernel_restart+0x3e/0x60
[810639e0] sys_reboot+0x1c0/0x240
[811734af] ? __d_free+0x4f/0x70
[8117353c] ? d_free+0x6c/0x80
[81174dbd] ? d_kill+0xad/0x110
[8117b603] ? mntput+0x23/0x40
[8115f9b7] ? fput+0x197/0x260
[8115ba63] ? filp_close+0x63/0x90
[8157a169] system_call_fastpath+0x16/0x1b
---[ end trace 5e0dddabdbb21c7e ]---
Re: [PATCH 00/13] KVM: MMU: fast page fault
On 04/01/2012 08:58 PM, Avi Kivity wrote: On 03/30/2012 12:18 PM, Xiao Guangrong wrote: On 03/29/2012 08:57 PM, Avi Kivity wrote: On 03/29/2012 01:40 PM, Xiao Guangrong wrote: * Implementation We can freely walk the page tables between walk_shadow_page_lockless_begin and walk_shadow_page_lockless_end; that ensures all the shadow pages are valid. In most cases, cmpxchg is good enough to change the access bits of the spte, but the write-protect path on softmmu/nested mmu is a special case: it is a read-check-modify path: read the spte, check the W bit, then clear the W bit. We also set gpte.D and gpte.A, no? How do you handle that? We still need to walk the guest page table before the fast page fault to check whether the access is valid. Ok. But that's outside the mmu lock. We can also skip that: if !sp->unsync, copy the access rights and gpte.A/D into spare bits in the spte, and use that to check. Great! gpte.A need not be copied into the spte, since PFEC.P = 1 means the shadow page table is present; gpte.A must be set in this case. And we do not need to cache the gpte access rights into the spte; instead, we can atomically read the gpte to get this information (anyway, we need to read the gpte to get the gfn.) Why do we need the gfn? If !sp->unsync then it is unchanged (it's also in sp->gfns). Oh, needed for mark_page_dirty(). Caching the gpte access in the spte will save decoding the access bits and sp->role.access. Yes, we can directly get the gfn from sp->gfns (it is also safe for an unsync sp, I think), but we want to have a quick check (whether the gfn can be writable) before the shadow page table walk. I mentioned it in the previous reply, see: http://thread.gmane.org/gmane.comp.emulators.kvm.devel/88934 What is your comment? I will do it in the next version. In order to avoid marking the spte writable after/during page write-protect, we do the trick like below:

fast page fault path:
	lock RCU
	set identification in the spte

What if you can't (already taken)? Spin? Slow path?
In this patch, concurrent access to the same spte is allowed: a vcpu freely sets its identification on the spte, because I did not want to introduce other atomic operations. You remind me that there may be a risk: if many vcpus fault on the same spte, they will retry the spte forever. Hmm, how about fixing it like this:

	if (spte.identification == 0) {
		set spte.identification = vcpu.id
		goto cmpxchg-path
	}
	if (spte.identification == vcpu.id)
		goto cmpxchg-path
	return to guest and retry the address again

cmpxchg-path:
	do checks and cmpxchg

It can ensure the spte can be updated. What's the difference between this and

	if test_and_set_bit(spte.lock)
		return_to_guest
	else
		do checks and cmpxchg

? test_and_set_bit is an atomic operation, which is what I want to avoid. Right. Can you check what the effect is (with say test_and_set_bit(spte, 0) in the same path)? I'm not convinced it's significant. Okay, if you prefer test_and_set_bit, we may introduce two bits: fast_pf and write_protect, and do things like this:

on the fast page fault path:

	old_spte = *spte
	if (test_and_set_bit(spte.fast_pf))
		return_to_guest
	old_spte |= fast_pf
	if (old_spte.wp)
		clear fast_pf bit and return_to_guest
	if (!rmap.PTE_LIST_WRITE_PROTECT)
		cmpxchg(spte, old_spte, old_spte + w - fast_pf)
	else
		clear fast_pf bit

on the page write-protect path:

	lock mmu-lock
	set rmap.PTE_LIST_WRITE_PROTECT
	smp_mb()
	if (spte.w || spte.fast_pf)
		spte - w - fast_pf + write_protect
	unlock mmu-lock

I think it works, but it can not make the page fault fast for an unsync sp (e.g.: two sptes point to the gfn; a page fault on one of the sptes makes the gfn's sp become unsync). It is not too bad, since we can cache the access bits in the spte and properly make all sptes writable on the mmu_need_write_protect path in a new patchset. The identification should be unique to avoid the below race:

VCPU 0                          VCPU 1                          VCPU 2
lock RCU
spte + identification
check conditions
                                do write-protect, clear
                                identification
                                                                lock RCU
                                                                set identification
cmpxchg + w - identification
OOPS!!!
Is it not sufficient to use just two bits?

	pf_lock - taken by the page fault path
	wp_lock - taken by the write-protect path

The pf cmpxchg checks both bits. If we use just two bits, it has to use atomic operations to maintain these bits? Or did I misunderstand? Ah, you're using byte writes to set the identification? Remember, I didn't read the patches yet. Yes. Please check the optimization manual; I remember there are warnings there about using different sized accesses for the
buildbot failure in kvm on next-ppc64
The Buildbot has detected a new failure on builder next-ppc64 while building kvm. Full details are available at: http://buildbot.b1-systems.de/kvm/builders/next-ppc64/builds/495 Buildbot URL: http://buildbot.b1-systems.de/kvm/ Buildslave for this Build: b1_kvm_1 Build Reason: The Nightly scheduler named 'nightly_next' triggered this build Build Source Stamp: [branch next] HEAD Blamelist: BUILD FAILED: failed compile sincerely, -The Buildbot
buildbot failure in kvm on next-ppc44x
The Buildbot has detected a new failure on builder next-ppc44x while building kvm. Full details are available at: http://buildbot.b1-systems.de/kvm/builders/next-ppc44x/builds/494 Buildbot URL: http://buildbot.b1-systems.de/kvm/ Buildslave for this Build: b1_kvm_1 Build Reason: The Nightly scheduler named 'nightly_next' triggered this build Build Source Stamp: [branch next] HEAD Blamelist: BUILD FAILED: failed compile sincerely, -The Buildbot
Re: Questing regarding KVM Guest PMU
On Thu, Apr 5, 2012 at 8:11 PM, Gleb Natapov g...@redhat.com wrote: On Thu, Apr 05, 2012 at 05:38:40PM +0300, Avi Kivity wrote: On 04/05/2012 04:57 PM, Gleb Natapov wrote: Maybe it used NMI-based profiling. We should ask the oprofile developers. As I said, I am almost sure my inability to run it on a host is probably PEBKAC, although I ran exactly the same script on the host and the guest (the script is from the first email of this thread). After upgrading the kernel to the latest git from whatever was there, the same script works and counts CPU_CLK_UNHALT events. This is even while it violates the Intel guidelines? Yes, but who says the result is correct :) It seems that we handle the global ctrl MSR wrongly: a counter can be enabled either in global ctrl or in eventsel. Trying to confirm that. If that turns out to be true, will the global ctrl MSR have any significance? -- Gleb.
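For reference, the architectural semantics under discussion: a general-purpose counter should count only when both its enable bit in IA32_PERF_GLOBAL_CTRL and the EN bit of its IA32_PERFEVTSELx are set, so the global ctrl MSR stays significant as one leg of an AND, and the suspected KVM bug is treating either bit alone as enabling. A minimal sketch of the correct check (names are illustrative, not KVM code):

```c
#include <assert.h>
#include <stdint.h>

#define EVENTSEL_EN (1ull << 22)  /* IA32_PERFEVTSELx.EN */

/* A general-purpose counter i counts only when BOTH its per-counter
 * bit in IA32_PERF_GLOBAL_CTRL and the EN bit of IA32_PERFEVTSELi
 * are set; enabling via either register alone is not sufficient. */
static int counter_enabled(uint64_t global_ctrl, uint64_t eventsel, int i)
{
    return ((global_ctrl >> i) & 1) && (eventsel & EVENTSEL_EN) != 0;
}
```

This applies to architectural performance monitoring version 2 and later, where IA32_PERF_GLOBAL_CTRL exists at all.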
Re: [PATCH 00/13] KVM: MMU: fast page fault
On 04/06/2012 05:57 AM, Xiao Guangrong wrote:

What's the difference between this and

    if test_and_set_bit(spte.lock)
        return_to_guest
    else
        do checks and cmpxchg

?

test_and_set_bit is an atomic operation, which is what I want to avoid.

Right. Can you check what the effect is (with, say, test_and_set_bit(spte, 0) in the same path)? I'm not convinced it's significant.

Okay, if you prefer test_and_set_bit, we may introduce two bits, fast_pf and write_protect, and do things like this.

On the fast page fault path:

    old_spte = *spte
    if (test_and_set_bit(spte.fast_pf))
        return_to_guest
    old_spte |= fast_pf
    if (old_spte.wp)
        clear fast_pf bit and return_to_guest
    if (!rmap.PTE_LIST_WRITE_PROTECT)
        cmpxchg(spte, old_spte, old_spte + w - fast_pf)
    else
        clear fast_pf bit

On the page write-protect path:

    lock mmu-lock
    set rmap.PTE_LIST_WRITE_PROTECT
    smp_mb()
    if (spte.w || spte.fast_pf)
        spte - w - fast_pf + write_protect
    unlock mmu-lock

I think it works, but it cannot make page faults fast for unsync sps (e.g. two sptes point to the gfn, and a page fault on one of the sptes makes the gfn's sp become unsync). That is not too bad, since we can cache access bits in the spte and properly make all sptes writable on the mmu_need_write_protect path in a new patchset.

Foolish me, I should be crazy. Sorry for my mistake. :(

Unfortunately, it cannot work: we cannot get a stable gfn from the gpte or sp->gfns[]. For example, starting from:

    Gpte = Gfn1
    gfn_to_pfn(Gfn1) = Pfn
    Spte = Pfn
    Gfn1 is write-free
    Gfn2 is write-protected

the following interleaving goes wrong:

    VCPU 0: fault on gpte; fast page fault path:
                set Spte.fast_pf
                get Gfn1 from Gpte/sp->gfns[]
                if (Gfn1 is writable) ...
    VCPU 1:     Pfn is swapped out: Spte = 0
    VCPU 2:     Gpte is modified to Gfn2, and Pfn is realloced and
                remapped to Gfn2, so: Spte = Pfn
    VCPU 0:     cmpxchg Spte+w

OOPS!!! We see that Spte has not changed and happily make it writable, so Gfn2 becomes writable.

It seems only a unique identification can prevent this.
:(
Re: [GIT PULL] Please pull powerpc KVM fixes
On 04/03/2012 03:22 PM, Paul Mackerras wrote:

On Tue, Apr 03, 2012 at 10:03:05PM +1000, Paul Mackerras wrote:

The following changes since commit 592f5d87b3feee9d60411f19d583038c0c7670ad:

OK, I messed up the git request-pull command. The request should have looked like this:

The following changes since commit b1a808ff436343956a6ae63178ea1810c5e5a3a1:

  Merge branch 'for-next' of git://gitorious.org/kernel-hsi/kernel-hsi (2012-04-02 09:50:40 -0700)

are available in the git repository at:

  git://github.com/paulusmack/linux tags/powerpc-fixes

Thanks, pulled, will push shortly.

--
error compiling committee.c: too many arguments to function
New git workflow
In a recent conversation, Linus persuaded me that it's time for a change in our git workflow; the following will bring it in line with the current practices of most trees.

The current 'master' branch will be abandoned (still available for reviewing history). The new branch structure will be as follows:

  'master': contains patches destined for fixes during the current cycle; for example, after v3.4 has been released, master contains fixes which will be merged during the 3.4 cycle.

  'next': contains patches destined for the next merge window; for example, if v3.4 has just been released, then 'next' will be based on the v3.5-rc1 tag (or thereabouts), and will be accepting patches targeted at the 3.6 merge window (when 3.5 is released).

  'queue': holding area for patches which have not been tested yet; may be rebased; usually merged into 'next'.

  'kvm-updates/3.x': unchanged, patches destined for a stable branch.

Since 'next' is based on an -rc1 release, it may lack upstream non-kvm fixes (or features) needed for development. To work around that, an 'auto-next' branch will be provided that is a merge of current upstream and 'next' (this is more or less the equivalent of the old 'master' branch). That branch of course will be rebased often.

How to work with the new layout:

  - If you're a developer, usually developing against 'next' is okay. If 'next' is unstable for you or you need a new upstream API, work against 'auto-next', but let the maintainers know that when posting your patch.

  - If you're working on a fix for the current cycle, work against upstream or 'master' (they should be equivalent most of the time).

  - If you're a submaintainer, post git pull requests against 'next' or 'master', according to your merge target.

Please ping Marcelo or myself with any question or suggestion. It's better to ask a silly question than develop against a stale tree accidentally and waste your efforts. I'll publish the new branches tomorrow, with any luck.
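To make the layout concrete, here is a tiny throwaway simulation of the branch topology, built from made-up empty commits in a temporary directory (the real branches live in the kvm tree; "v3.4" here is only a stand-in for an upstream release):

```shell
# Demo of the branch layout: master (current-cycle fixes), next (next
# merge window), queue (untested, may be rebased), and auto-next as a
# disposable merge of upstream + next.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master      # make sure the demo uses 'master'
git config user.email dev@example.com
git config user.name dev
git commit -q --allow-empty -m "stand-in for upstream v3.4"
git branch next                              # would really be based on v3.5-rc1
git branch queue next                        # holding area, merged into next
git checkout -q -b auto-next master
git merge -q --no-edit next                  # auto-next = upstream + next
git for-each-ref --format='%(refname:short)' refs/heads
```

Since 'auto-next' is recreated by merging, it is safe to delete and rebuild it whenever 'next' or upstream moves, which is why it may be rebased often.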
--
error compiling committee.c: too many arguments to function