[PATCH] virtio-blk: Disable callback in virtblk_done()
This reduces unnecessary interrupts that the host could send to the guest
while the guest is in the process of handling an irq.  If one vcpu is
handling the irq while another interrupt comes in, handle_edge_irq() in the
guest will mask the interrupt via mask_msi_irq(), which is a very heavy
operation that goes all the way down to the host.

Signed-off-by: Asias He
---
 drivers/block/virtio_blk.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 53b81d5..0bdde8f 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -274,15 +274,18 @@ static void virtblk_done(struct virtqueue *vq)
 	unsigned int len;
 
 	spin_lock_irqsave(vblk->disk->queue->queue_lock, flags);
-	while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
-		if (vbr->bio) {
-			virtblk_bio_done(vbr);
-			bio_done = true;
-		} else {
-			virtblk_request_done(vbr);
-			req_done = true;
+	do {
+		virtqueue_disable_cb(vq);
+		while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
+			if (vbr->bio) {
+				virtblk_bio_done(vbr);
+				bio_done = true;
+			} else {
+				virtblk_request_done(vbr);
+				req_done = true;
+			}
 		}
-	}
+	} while (!virtqueue_enable_cb(vq));
 
 	/* In case queue is stopped waiting for more buffers. */
 	if (req_done)
 		blk_start_queue(vblk->disk->queue);
--
1.7.11.4
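The disable/drain/re-enable loop above is the standard virtio idiom for
closing the race between draining the ring and a new completion arriving.
A minimal generic sketch of the pattern, assuming only the stock virtqueue
API; handle_buf() is a hypothetical per-driver completion handler, not part
of this patch:

#include <linux/virtio.h>

extern void handle_buf(void *buf, unsigned int len);	/* hypothetical */

static void drain_virtqueue(struct virtqueue *vq)
{
	void *buf;
	unsigned int len;

	do {
		/* suppress further callbacks while we drain */
		virtqueue_disable_cb(vq);

		while ((buf = virtqueue_get_buf(vq, &len)) != NULL)
			handle_buf(buf, len);

		/*
		 * virtqueue_enable_cb() returns false if buffers were
		 * added while callbacks were off, in which case we must
		 * loop and drain again rather than rely on an interrupt
		 * that may never come.
		 */
	} while (!virtqueue_enable_cb(vq));
}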
RE: [PATCH v4] kvm/fpu: Enable fully eager restore kvm FPU
> -----Original Message-----
> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
> Behalf Of Avi Kivity
> Sent: Monday, September 24, 2012 10:17 PM
> To: Hao, Xudong
> Cc: kvm@vger.kernel.org; Zhang, Xiantao
> Subject: Re: [PATCH v4] kvm/fpu: Enable fully eager restore kvm FPU
>
> On 09/24/2012 05:28 AM, Xudong Hao wrote:
> > Enable KVM FPU fully eager restore, if there is other FPU state which isn't
> > tracked by CR0.TS bit.
> >
> > v4 changes from v3:
> > - Wrap up some confused code with a clear function lazy_fpu_allowed()
> > - Update fpu while updating cr4 too.
> >
> > v3 changes from v2:
> > - Make fpu active explicitly while guest xsave is enabling and non-lazy
> >   xstate bit exist.
> >
> > v2 changes from v1:
> > - Expand KVM_XSTATE_LAZY to 64 bits before negating it.
> >
> >  int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
> >  int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index 818fceb..fbdb44a 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -2941,6 +2941,7 @@ static int cr_interception(struct vcpu_svm *svm)
> >  		break;
> >  	case 4:
> >  		err = kvm_set_cr4(&svm->vcpu, val);
> > +		update_lazy_fpu(vcpu);
> >  		break;
> >  	case 8:
> >  		err = kvm_set_cr8(&svm->vcpu, val);
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 30bcb95..b3880c0 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -4488,6 +4488,7 @@ static int handle_cr(struct kvm_vcpu *vcpu)
> >  			return 1;
> >  		case 4:
> >  			err = handle_set_cr4(vcpu, val);
> > +			update_lazy_fpu(vcpu);
> >  			kvm_complete_insn_gp(vcpu, err);
> >  			return 1;
>
> Why not in kvm_set_cr4()?
>
> >  		case 8: {
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index fc2a0a1..2e14cec 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -544,6 +544,31 @@ void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_lmsw);
> >
> > +/*
> > + * KVM trigger FPU restore by #NM (via CR0.TS),
> > + * only XCR0.bit0, XCR0.bit1, XCR0.bit2 is tracked
> > + * by TS bit, there might be other FPU state is not tracked
> > + * by TS bit.
> > + * This function lazy_fpu_allowed() return true with any of
> > + * the following cases: 1)xsave isn't enabled in guest;
> > + * 2)all guest FPU states can be tracked by TS bit.
> > + */
> > +static bool lazy_fpu_allowed(struct kvm_vcpu *vcpu)
> > +{
> > +	return !!(!kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) ||
> > +		  !(vcpu->arch.xcr0 & ~((u64)KVM_XSTATE_LAZY)));
> > +}
>
> !! is !needed, bool conversion takes care of it.
>
> > +
> > +void update_lazy_fpu(struct kvm_vcpu *vcpu)
> > +{
> > +	if (lazy_fpu_allowed(vcpu)) {
> > +		vcpu->fpu_active = 0;
> > +		kvm_x86_ops->fpu_deactivate(vcpu);
> > +	}
>
> There is no need to deactivate the fpu here.  We try to deactivate the
> fpu as late as possible, preempt notifiers or vcpu_put will do that for us.
>
> > +	else
> > +		kvm_x86_ops->fpu_activate(vcpu);
> > +}
> > +
> >  int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
> >  {
> >  	u64 xcr0;
> > @@ -571,6 +596,7 @@ int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
> >  		kvm_inject_gp(vcpu, 0);
> >  		return 1;
> >  	}
> > +	update_lazy_fpu(vcpu);
> >  	return 0;
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_set_xcr);
> > @@ -5985,7 +6011,11 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
> >  	vcpu->guest_fpu_loaded = 0;
> >  	fpu_save_init(&vcpu->arch.guest_fpu);
> >  	++vcpu->stat.fpu_reload;
> > -	kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
> > +	/*
> > +	 * Make deactivate request while allow fpu lazy restore.
> > +	 */
> > +	if (lazy_fpu_allowed(vcpu))
> > +		kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
> >  	trace_kvm_fpu(0);
> >  }
> >
>
> btw, it is clear that long term the fpu will always be eagerly loaded,
> as hosts and guests (and hardware) are updated.  At that time it will
> make sense to remove the lazy fpu code entirely.  But maybe that time is
> here already, since exits are rare and so the guest has a lot of chance
> to use the fpu, so eager fpu saves the #NM vmexit.
>
> Can you check a kernel compile on a westmere system?  If eager fpu is
> faster there than lazy fpu, we can just make the fpu always eager and
> remove quite a bit of code.

I remember westmere does not support xsave; do you want performance of
fxsave/fxrstor?

> It will slow down guests not using in-kernel apic, or guests that just
> process interrupts in the kernel and then HLT, or maybe i386 guests, but
> I think it's worth it.
>
> --
> error compiling committee.c: too many arguments to function
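For reference, a minimal sketch of what Avi's review point implies for the
v4 helper, assuming the patch's own context (kvm_read_cr4_bits() and
KVM_XSTATE_LAZY as defined there): the bool return type already normalizes
the expression, so the !! can simply be dropped:

static bool lazy_fpu_allowed(struct kvm_vcpu *vcpu)
{
	/* bool conversion takes care of normalization; no !! needed */
	return !kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) ||
	       !(vcpu->arch.xcr0 & ~((u64)KVM_XSTATE_LAZY));
}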
Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
On Mon, 24 Sep 2012 10:39:38 +0100 Mel Gorman wrote:

> On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> > Also, what has to be done to avoid the polling altogether?  eg/ie, zap
> > a pageblock's PB_migrate_skip synchronously, when something was done to
> > that pageblock which justifies repolling it?
> >
>
> The "something" event you are looking for is pages being freed or
> allocated in the page allocator.  A movable page being allocated in block
> or a page being freed should clear the PB_migrate_skip bit if it's set.
> Unfortunately this would impact the fast path of the alloc and free paths
> of the page allocator.  I felt that that was too high a price to pay.

We already do a similar thing in the page allocator: clearing of
->all_unreclaimable and ->pages_scanned.  But that isn't on the "fast
path" really - it happens once per pcp unload.  Can we do something like
that?  Drop some hint into the zone without having to visit each page?

> > > ...
> > >
> > > +static void reset_isolation_suitable(struct zone *zone)
> > > +{
> > > +	unsigned long start_pfn = zone->zone_start_pfn;
> > > +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > > +	unsigned long pfn;
> > > +
> > > +	/*
> > > +	 * Do not reset more than once every five seconds. If allocations are
> > > +	 * failing sufficiently quickly to allow this to happen then continually
> > > +	 * scanning for compaction is not going to help. The choice of five
> > > +	 * seconds is arbitrary but will mitigate excessive scanning.
> > > +	 */
> > > +	if (time_before(jiffies, zone->compact_blockskip_expire))
> > > +		return;
> > > +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
> > > +
> > > +	/* Walk the zone and mark every pageblock as suitable for isolation */
> > > +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> > > +		struct page *page;
> > > +		if (!pfn_valid(pfn))
> > > +			continue;
> > > +
> > > +		page = pfn_to_page(pfn);
> > > +		if (zone != page_zone(page))
> > > +			continue;
> > > +
> > > +		clear_pageblock_skip(page);
> > > +	}
> >
> > What's the worst-case loop count here?
> >
>
> zone->spanned_pages >> pageblock_order

What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)
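To put a rough number on that worst case, here is a back-of-envelope
sketch under assumed x86 defaults (4 KiB pages, pageblock_order = 9, so
512 pages per pageblock); the helper name is illustrative, not from the
patch:

/* Iterations of the reset walk for a zone, ignoring pfn_valid() holes. */
static unsigned long reset_walk_iterations(struct zone *zone)
{
	return zone->spanned_pages >> pageblock_order;
}

/*
 * Example: a 64 GiB zone spans (64 << 30) / 4096 = 16M pages, so the
 * walk visits 16M >> 9 = 32768 pageblocks per reset, i.e. one
 * clear_pageblock_skip() per 2 MiB of the zone, at most once per 5s.
 */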
Re: [PATCH 1/6] KVM: PPC: booke: use vcpu reference from thread_struct
On 21.08.2012, at 15:51, Bharat Bhushan wrote:

> Like other places, use thread_struct to get vcpu reference.

Please remove the definition of SPRN_SPRG_R/WVCPU as well.

Alex

>
> Signed-off-by: Bharat Bhushan
> ---
>  arch/powerpc/kernel/asm-offsets.c   |    2 +-
>  arch/powerpc/kvm/booke_interrupts.S |    6 ++----
>  2 files changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
> index 85b05c4..fbb999c 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -116,7 +116,7 @@ int main(void)
>  #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
>  	DEFINE(THREAD_KVM_SVCPU, offsetof(struct thread_struct, kvm_shadow_vcpu));
>  #endif
> -#ifdef CONFIG_KVM_BOOKE_HV
> +#if defined(CONFIG_KVM) && defined(CONFIG_BOOKE)
>  	DEFINE(THREAD_KVM_VCPU, offsetof(struct thread_struct, kvm_vcpu));
>  #endif
>
> diff --git a/arch/powerpc/kvm/booke_interrupts.S b/arch/powerpc/kvm/booke_interrupts.S
> index bb46b32..ca16d57 100644
> --- a/arch/powerpc/kvm/booke_interrupts.S
> +++ b/arch/powerpc/kvm/booke_interrupts.S
> @@ -56,7 +56,8 @@
>  _GLOBAL(kvmppc_handler_\ivor_nr)
>  	/* Get pointer to vcpu and record exit number. */
>  	mtspr	\scratch , r4
> -	mfspr	r4, SPRN_SPRG_RVCPU
> +	mfspr	r4, SPRN_SPRG_THREAD
> +	lwz	r4, THREAD_KVM_VCPU(r4)
>  	stw	r3, VCPU_GPR(R3)(r4)
>  	stw	r5, VCPU_GPR(R5)(r4)
>  	stw	r6, VCPU_GPR(R6)(r4)
> @@ -402,9 +403,6 @@ lightweight_exit:
>  	lwz	r8, kvmppc_booke_handlers@l(r8)
>  	mtspr	SPRN_IVPR, r8
>
> -	/* Save vcpu pointer for the exception handlers. */
> -	mtspr	SPRN_SPRG_WVCPU, r4
> -
>  	lwz	r5, VCPU_SHARED(r4)
>
>  	/* Can't switch the stack pointer until after IVPR is switched,
> --
> 1.7.0.4
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On 09/24/2012 06:14 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:06 +0200, Avi Kivity wrote:
>>
>> We would probably need a ->sched_exit() preempt notifier to make this
>> work.  Peter, I know how much you love those, would it be acceptable?
>
> Where exactly do you want this? TASK_DEAD? or another exit?

TASK_DEAD of the task that registered the preempt notifier.

The idea is that I want to hold on to the notifier even when exiting to
userspace.  Since userspace is under no obligation to call kvm again, I
need a chance to unregister the notifier and otherwise clean up.

Eh, looking at the code, we'll have a ->sched_out() after the state is
set to TASK_DEAD.  So all we need to do is examine the state.  We'll
need to examine the state anyway to see if we were preempted or blocking.

--
error compiling committee.c: too many arguments to function
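A minimal sketch of that state check, assuming the existing preempt
notifier plumbing in virt/kvm/kvm_main.c (preempt_notifier_to_vcpu() is
the helper that already lives there); kvm_vcpu_task_dead() and the
preempted flag are hypothetical names for this illustration:

static void kvm_sched_out(struct preempt_notifier *pn,
			  struct task_struct *next)
{
	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

	if (current->state == TASK_DEAD) {
		/* final switch-out of the dying vcpu thread: last
		 * chance to unregister the notifier and clean up */
		kvm_vcpu_task_dead(vcpu);	/* hypothetical hook */
		return;
	}

	/* TASK_RUNNING here means we were preempted; any other
	 * state means we blocked voluntarily */
	vcpu->preempted = (current->state == TASK_RUNNING);
}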
Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
On 09/24/2012 06:13 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:10 +0200, Avi Kivity wrote:
>> > Its also still a LAPIC write -- disguised as an MSR though :/
>>
>> It's probably a whole lot faster though.
>
> I've been told its not, I haven't tried it.

I'll see if I can find a machine with it (don't see it on my Westmere,
it's probably on one of the Bridges).  Or maybe the other Peter knows.

--
error compiling committee.c: too many arguments to function
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/24/2012 06:03 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote:
>> On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
>> > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>> >> However Rik had a genuine concern in the cases where runqueue is not
>> >> equally distributed and lockholder might actually be on a different run
>> >> queue but not running.
>> >
>> > Load should eventually get distributed equally -- that's what the
>> > load-balancer is for -- so this is a temporary situation.
>>
>> What's the expected latency?  This is the whole problem.  Eventually the
>> scheduler would pick the lock holder as well, the problem is that it's
>> in the millisecond scale while lock hold times are in the microsecond
>> scale, leading to a 1000x slowdown.
>
> Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes
> something like accurate or fast, never both.
>
>> If we want to yield, we really want to boost someone.
>
> Now if only you knew which someone ;-) This non-modified guest nonsense
> is such a snake pit.. but you know how I feel about all that.

Actually if I knew that in addition to boosting someone, I also unboost
myself enough to be preempted, it wouldn't matter.  While boosting the
lock holder is good, the main point is not spinning and doing useful
work instead.  We can detect spinners and avoid boosting them.

That's the motivation for the "donate vruntime" approach I wanted earlier.

>> > We already try and favour the non running vcpu in this case, that's what
>> > yield_to_task_fair() is about. If its still not eligible to run, tough
>> > luck.
>>
>> Crazy idea: instead of yielding, just run that other vcpu in the thread
>> that would otherwise spin.  I can see about a million objections to this
>> already though.
>
> Yah.. you want me to list a few? :-) It would require synchronization
> with the other cpu to pull its task -- one really wants to avoid it also
> running it.

Yeah, it's quite a horrible idea.

> Do this at a high enough frequency and you're dead too.
>
> Anyway, you can do this inside the KVM stuff, simply flip the vcpu state
> associated with a vcpu thread and use the preemption notifiers to sort
> things against the scheduler or somesuch.

That's what I thought when I wrote this, but I can't, I might be
preempted in random kvm code.  So my state includes the host stack and
registers.  Maybe we can special-case when we interrupt guest mode.

>> >> Do you think instead of using rq->nr_running, we could get a global
>> >> sense of load using avenrun (something like avenrun/num_onlinecpus)
>> >
>> > To what purpose? Also, global stuff is expensive, so you should try and
>> > stay away from it as hard as you possibly can.
>>
>> Spinning is also expensive.  How about we do the global stuff every N
>> times, to amortize the cost (and reduce contention)?
>
> Nah, spinning isn't expensive, its a waste of time, similar end result
> for someone who wants to do useful work though, but not the same cause.
>
> Pick N and I'll come up with a scenario for which its wrong ;-)

Sure.  But if it's rare enough, then that's okay for us.

> Anyway, its an ugly problem and one I really want to contain inside the
> insanity that created it (virt), lets not taint the rest of the kernel
> more than we need to.

Agreed.  Though given that postgres and others use userspace spinlocks,
maybe it's not just virt.

--
error compiling committee.c: too many arguments to function
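Raghavendra's avenrun idea combined with the every-N amortization could
look like the following minimal sketch; all names are illustrative, N = 64
is an arbitrary placeholder (Peter's point stands: any fixed N has a bad
case), while get_avenrun(), FIXED_1 and the per-cpu ops are the existing
kernel interfaces:

static DEFINE_PER_CPU(unsigned long, cached_load);
static DEFINE_PER_CPU(unsigned int, ple_calls);

static bool guest_overcommitted(void)
{
	unsigned long loads[3];

	/* refresh the global estimate only on every 64th call per cpu */
	if ((__this_cpu_inc_return(ple_calls) & 63) == 0) {
		get_avenrun(loads, FIXED_1 / 200, 0);
		__this_cpu_write(cached_load, loads[0]);
	}

	/* avenrun is fixed-point: load > ncpus <=> loads[0] > ncpus * FIXED_1 */
	return __this_cpu_read(cached_load) > num_online_cpus() * FIXED_1;
}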
Re: [PATCH 6/6] KVM: booke/bookehv: Add debug stub support
On 21.08.2012, at 15:52, Bharat Bhushan wrote:

> This patch adds the debug stub support on booke/bookehv.
> Now QEMU debug stub can use hw breakpoint, watchpoint and
> software breakpoint to debug guest.
>
> Signed-off-by: Bharat Bhushan
> ---
>  arch/powerpc/include/asm/kvm.h        |   29 ++-
>  arch/powerpc/include/asm/kvm_host.h   |    5 +
>  arch/powerpc/kernel/asm-offsets.c     |   26 ++
>  arch/powerpc/kvm/booke.c              |  144 +
>  arch/powerpc/kvm/booke_interrupts.S   |  110 +
>  arch/powerpc/kvm/bookehv_interrupts.S |  141 +++-
>  arch/powerpc/kvm/e500mc.c             |    3 +-
>  7 files changed, 435 insertions(+), 23 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
> index 61b197e..53479ea 100644
> --- a/arch/powerpc/include/asm/kvm.h
> +++ b/arch/powerpc/include/asm/kvm.h
> @@ -25,6 +25,7 @@
>  /* Select powerpc specific features in */
>  #define __KVM_HAVE_SPAPR_TCE
>  #define __KVM_HAVE_PPC_SMT
> +#define __KVM_HAVE_GUEST_DEBUG
>
>  struct kvm_regs {
>  	__u64 pc;
> @@ -264,7 +265,31 @@ struct kvm_fpu {
>  	__u64 fpr[32];
>  };
>
> +
> +/*
> + * Defines for h/w breakpoint, watchpoint (read, write or both) and
> + * software breakpoint.
> + * These are used as "type" in KVM_SET_GUEST_DEBUG ioctl and "status"
> + * for KVM_DEBUG_EXIT.
> + */
> +#define KVMPPC_DEBUG_NONE		0x0
> +#define KVMPPC_DEBUG_BREAKPOINT		(1UL << 1)
> +#define KVMPPC_DEBUG_WATCH_WRITE	(1UL << 2)
> +#define KVMPPC_DEBUG_WATCH_READ		(1UL << 3)
>  struct kvm_debug_exit_arch {
> +	__u64 pc;
> +	/*
> +	 * exception -> returns the exception number. If the KVM_DEBUG_EXIT
> +	 * exit is not handled (say not h/w breakpoint or software breakpoint
> +	 * set for this address) by qemu then it is supposed to inject this
> +	 * exception to guest.
> +	 */
> +	__u32 exception;
> +	/*
> +	 * exiting to userspace because of h/w breakpoint, watchpoint
> +	 * (read, write or both) and software breakpoint.
> +	 */
> +	__u32 status;
>  };
>
>  /* for KVM_SET_GUEST_DEBUG */
> @@ -276,10 +301,6 @@ struct kvm_guest_debug_arch {
>  	 * Type denotes h/w breakpoint, read watchpoint, write
>  	 * watchpoint or watchpoint (both read and write).
>  	 */
> -#define KVMPPC_DEBUG_NOTYPE	0x0
> -#define KVMPPC_DEBUG_BREAKPOINT	(1UL << 1)
> -#define KVMPPC_DEBUG_WATCH_WRITE	(1UL << 2)
> -#define KVMPPC_DEBUG_WATCH_READ	(1UL << 3)
>  	__u32 type;
>  	__u32 pad1;
>  	__u64 pad2;
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index c7219c1..3ba465a 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -496,7 +496,12 @@ struct kvm_vcpu_arch {
>  	u32 mmucfg;
>  	u32 epr;
>  	u32 crit_save;
> +	/* guest debug registers*/
>  	struct kvmppc_booke_debug_reg dbg_reg;
> +	/* shadow debug registers */
> +	struct kvmppc_booke_debug_reg shadow_dbg_reg;
> +	/* host debug registers*/
> +	struct kvmppc_booke_debug_reg host_dbg_reg;
>  #endif
>  	gpa_t paddr_accessed;
>  	gva_t vaddr_accessed;
> diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
> index 555448e..6987821 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -564,6 +564,32 @@ int main(void)
>  	DEFINE(VCPU_FAULT_DEAR, offsetof(struct kvm_vcpu, arch.fault_dear));
>  	DEFINE(VCPU_FAULT_ESR, offsetof(struct kvm_vcpu, arch.fault_esr));
>  	DEFINE(VCPU_CRIT_SAVE, offsetof(struct kvm_vcpu, arch.crit_save));
> +	DEFINE(VCPU_DBSR, offsetof(struct kvm_vcpu, arch.dbsr));
> +	DEFINE(VCPU_SHADOW_DBG, offsetof(struct kvm_vcpu, arch.shadow_dbg_reg));
> +	DEFINE(VCPU_HOST_DBG, offsetof(struct kvm_vcpu, arch.host_dbg_reg));
> +	DEFINE(KVMPPC_DBG_DBCR0, offsetof(struct kvmppc_booke_debug_reg,
> +					  dbcr0));
> +	DEFINE(KVMPPC_DBG_DBCR1, offsetof(struct kvmppc_booke_debug_reg,
> +					  dbcr1));
> +	DEFINE(KVMPPC_DBG_DBCR2, offsetof(struct kvmppc_booke_debug_reg,
> +					  dbcr2));
> +#ifdef CONFIG_KVM_E500MC
> +	DEFINE(KVMPPC_DBG_DBCR4, offsetof(struct kvmppc_booke_debug_reg,
> +					  dbcr4));
> +#endif
> +	DEFINE(KVMPPC_DBG_IAC1, offsetof(struct kvmppc_booke_debug_reg,
> +					 iac[0]));
> +	DEFINE(KVMPPC_DBG_IAC2, offsetof(struct kvmppc_booke_debug_reg,
> +					 iac[1]));
> +	DEFINE(KVMPPC_DBG_IAC3, offsetof(struct kvmppc_booke_debug_reg,
> +					 iac[2]));
> +	DEFINE(KVMPPC_DBG_IAC4, offsetof(struct kvmppc_booke_debug_reg,
> +					 iac[3]));
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On Mon, 2012-09-24 at 18:06 +0200, Avi Kivity wrote:
>
> We would probably need a ->sched_exit() preempt notifier to make this
> work.  Peter, I know how much you love those, would it be acceptable?

Where exactly do you want this? TASK_DEAD? or another exit?
Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
On Mon, 2012-09-24 at 18:10 +0200, Avi Kivity wrote:
> > Its also still a LAPIC write -- disguised as an MSR though :/
>
> It's probably a whole lot faster though.

I've been told its not, I haven't tried it.
Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
On 09/24/2012 06:05 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:58 +0200, Avi Kivity wrote:
>> There is the TSC deadline timer mode of newer Intels.  Programming the
>> timer is a simple wrmsr, and it will fire immediately if it already
>> expired.  Unfortunately on AMDs it is not available, and on virtual
>> hardware it will be slow (~1-2 usec).
>
> Its also still a LAPIC write -- disguised as an MSR though :/

It's probably a whole lot faster though.

> Also, who gives a hoot about virtual crap ;-)

I only mentioned it to see if your virtual crap detector is still
working.  Looks like it's still in top condition, low latency and 100%
hit rate.

--
error compiling committee.c: too many arguments to function
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On 09/24/2012 05:41 PM, Avi Kivity wrote:
>
>> case 2)
>> rq1 : vcpu1->wait(lockA) (spinning)
>> rq2 : vcpu3 (running) , vcpu2->holding(lockA) [scheduled out]
>>
>> I agree that checking rq1 length is not proper in this case, and as you
>> rightly pointed out, we are in trouble here.
>> nr_running()/num_online_cpus() would give more accurate picture here,
>> but it seemed costly. May be load balancer save us a bit here in not
>> running to such sort of cases. ( I agree load balancer is far too
>> complex).
>
> In theory preempt notifiers can tell us whether a vcpu is preempted or
> not (except for exits to userspace), so we can keep track of whether
> we're overcommitted in kvm itself.  It also avoids false positives from
> other guests and/or processes being overcommitted while our vm is fine.

It also allows us to cheaply skip running vcpus.

We would probably need a ->sched_exit() preempt notifier to make this
work.  Peter, I know how much you love those, would it be acceptable?

We'd still need yield_to() but the pressure on it might be reduced.

--
error compiling committee.c: too many arguments to function
Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
On Mon, 2012-09-24 at 17:58 +0200, Avi Kivity wrote:
> There is the TSC deadline timer mode of newer Intels.  Programming the
> timer is a simple wrmsr, and it will fire immediately if it already
> expired.  Unfortunately on AMDs it is not available, and on virtual
> hardware it will be slow (~1-2 usec).

Its also still a LAPIC write -- disguised as an MSR though :/

Also, who gives a hoot about virtual crap ;-)
[PATCH 05/30] KVM: x86: Export svm/vmx exit code and vector code to userspace
From: Xiao Guangrong

Exporting KVM exit information to userspace to be consumed by perf.

Signed-off-by: Dong Hao
[ Dong Hao: rebase it on acme's git tree ]
Signed-off-by: Xiao Guangrong
Acked-by: Marcelo Tosatti
Cc: Avi Kivity
Cc: David Ahern
Cc: Ingo Molnar
Cc: Marcelo Tosatti
Cc: kvm@vger.kernel.org
Cc: Runzhen Wang
Link: http://lkml.kernel.org/r/1347870675-31495-2-git-send-email-haod...@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo
---
 arch/x86/include/asm/kvm.h      |  16 +++
 arch/x86/include/asm/kvm_host.h |  16 ---
 arch/x86/include/asm/svm.h      | 205 +--
 arch/x86/include/asm/vmx.h      | 127
 arch/x86/kvm/trace.h            |  89 -
 5 files changed, 230 insertions(+), 223 deletions(-)

diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h
index 246617e..41e08cb 100644
--- a/arch/x86/include/asm/kvm.h
+++ b/arch/x86/include/asm/kvm.h
@@ -9,6 +9,22 @@
 #include
 #include
 
+#define DE_VECTOR 0
+#define DB_VECTOR 1
+#define BP_VECTOR 3
+#define OF_VECTOR 4
+#define BR_VECTOR 5
+#define UD_VECTOR 6
+#define NM_VECTOR 7
+#define DF_VECTOR 8
+#define TS_VECTOR 10
+#define NP_VECTOR 11
+#define SS_VECTOR 12
+#define GP_VECTOR 13
+#define PF_VECTOR 14
+#define MF_VECTOR 16
+#define MC_VECTOR 18
+
 /* Select x86 specific features in */
 #define __KVM_HAVE_PIT
 #define __KVM_HAVE_IOAPIC
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09155d6..1eaa6b0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -75,22 +75,6 @@
 #define KVM_HPAGE_MASK(x)	(~(KVM_HPAGE_SIZE(x) - 1))
 #define KVM_PAGES_PER_HPAGE(x)	(KVM_HPAGE_SIZE(x) / PAGE_SIZE)
 
-#define DE_VECTOR 0
-#define DB_VECTOR 1
-#define BP_VECTOR 3
-#define OF_VECTOR 4
-#define BR_VECTOR 5
-#define UD_VECTOR 6
-#define NM_VECTOR 7
-#define DF_VECTOR 8
-#define TS_VECTOR 10
-#define NP_VECTOR 11
-#define SS_VECTOR 12
-#define GP_VECTOR 13
-#define PF_VECTOR 14
-#define MF_VECTOR 16
-#define MC_VECTOR 18
-
 #define SELECTOR_TI_MASK (1 << 2)
 #define SELECTOR_RPL_MASK 0x03
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index f2b83bc..cdf5674 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -1,6 +1,135 @@
 #ifndef __SVM_H
 #define __SVM_H
 
+#define SVM_EXIT_READ_CR0	0x000
+#define SVM_EXIT_READ_CR3	0x003
+#define SVM_EXIT_READ_CR4	0x004
+#define SVM_EXIT_READ_CR8	0x008
+#define SVM_EXIT_WRITE_CR0	0x010
+#define SVM_EXIT_WRITE_CR3	0x013
+#define SVM_EXIT_WRITE_CR4	0x014
+#define SVM_EXIT_WRITE_CR8	0x018
+#define SVM_EXIT_READ_DR0	0x020
+#define SVM_EXIT_READ_DR1	0x021
+#define SVM_EXIT_READ_DR2	0x022
+#define SVM_EXIT_READ_DR3	0x023
+#define SVM_EXIT_READ_DR4	0x024
+#define SVM_EXIT_READ_DR5	0x025
+#define SVM_EXIT_READ_DR6	0x026
+#define SVM_EXIT_READ_DR7	0x027
+#define SVM_EXIT_WRITE_DR0	0x030
+#define SVM_EXIT_WRITE_DR1	0x031
+#define SVM_EXIT_WRITE_DR2	0x032
+#define SVM_EXIT_WRITE_DR3	0x033
+#define SVM_EXIT_WRITE_DR4	0x034
+#define SVM_EXIT_WRITE_DR5	0x035
+#define SVM_EXIT_WRITE_DR6	0x036
+#define SVM_EXIT_WRITE_DR7	0x037
+#define SVM_EXIT_EXCP_BASE	0x040
+#define SVM_EXIT_INTR		0x060
+#define SVM_EXIT_NMI		0x061
+#define SVM_EXIT_SMI		0x062
+#define SVM_EXIT_INIT		0x063
+#define SVM_EXIT_VINTR		0x064
+#define SVM_EXIT_CR0_SEL_WRITE	0x065
+#define SVM_EXIT_IDTR_READ	0x066
+#define SVM_EXIT_GDTR_READ	0x067
+#define SVM_EXIT_LDTR_READ	0x068
+#define SVM_EXIT_TR_READ	0x069
+#define SVM_EXIT_IDTR_WRITE	0x06a
+#define SVM_EXIT_GDTR_WRITE	0x06b
+#define SVM_EXIT_LDTR_WRITE	0x06c
+#define SVM_EXIT_TR_WRITE	0x06d
+#define SVM_EXIT_RDTSC		0x06e
+#define SVM_EXIT_RDPMC		0x06f
+#define SVM_EXIT_PUSHF		0x070
+#define SVM_EXIT_POPF		0x071
+#define SVM_EXIT_CPUID		0x072
+#define SVM_EXIT_RSM		0x073
+#define SVM_EXIT_IRET		0x074
+#define SVM_EXIT_SWINT		0x075
+#define SVM_EXIT_INVD		0x076
+#define SVM_EXIT_PAUSE		0x077
+#define SVM_EXIT_HLT		0x078
+#define SVM_EXIT_INVLPG		0x079
+#define SVM_EXIT_INVLPGA	0x07a
+#define SVM_EXIT_IOIO		0x07b
+#define SVM_EXIT_MSR		0x07c
+#define SVM_EXIT_TASK_SWITCH	0x07d
+#define SVM_EXIT_FERR_FREEZE	0x07e
+#define SVM_EXIT_SHUTDOWN	0x07f
+#define SVM_EXIT_VMRUN		0x080
+#define SVM_EXIT_VMMCALL	0x081
+#define SVM_EXIT_VMLOAD		0x082
+#define SVM_EXIT_VMSAVE		0x083
+#define SVM_EXIT_STGI		0x084
+#define SVM_EXIT_CLGI		0x085
+#define SVM_EXIT_SKINIT		0x086
+#define SVM_EXIT_RDTSCP		0x087
+#define SVM_EXIT_ICEBP		0x088
+#define SVM_EXIT_WBINVD		0x089
+#define SVM_EXIT_MONITOR	0x08a
+#define SVM_EXIT_MWAIT		0x08b
[PATCH 06/30] perf kvm: Events analysis tool
From: Xiao Guangrong

Add 'perf kvm stat' support to analyze kvm vmexit/mmio/ioport smartly

Usage:
- kvm stat
  run a command and gather performance counter statistics, it is the alias
  of perf stat

- trace kvm events:
  perf kvm stat record, or, if other tracepoints are interesting as well,
  we can append the events like this:
  perf kvm stat record -e timer:* -a

  If many guests are running, we can track the specified guest by using -p
  or --pid, -a is used to track events generated by all guests.

- show the result:
  perf kvm stat report

The output example is following:

13005
13059

total 2 guests are running on the host

Then, track the guest whose pid is 13059:
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.253 MB perf.data.guest (~11065 samples) ]

See the vmexit events:

Analyze events for all VCPUs:

             VM-EXIT    Samples  Samples%     Time%         Avg time

         APIC_ACCESS        460    70.55%     0.01%         22.44us ( +-   1.75% )
                 HLT         93    14.26%    99.98%     832077.26us ( +-  10.42% )
  EXTERNAL_INTERRUPT         64     9.82%     0.00%         35.35us ( +-  14.21% )
   PENDING_INTERRUPT         24     3.68%     0.00%          9.29us ( +-  31.39% )
           CR_ACCESS          7     1.07%     0.00%          8.12us ( +-   5.76% )
      IO_INSTRUCTION          3     0.46%     0.00%         18.00us ( +-  11.79% )
       EXCEPTION_NMI          1     0.15%     0.00%          5.83us ( +-   -nan% )

Total Samples:652, Total events handled time:77396109.80us.

See the mmio events:

Analyze events for all VCPUs:

         MMIO Access    Samples  Samples%     Time%         Avg time

        0xfee00380:W        387    84.31%    79.28%          8.29us ( +-   3.32% )
        0xfee00300:W         24     5.23%     9.96%         16.79us ( +-   1.97% )
        0xfee00300:R         24     5.23%     7.83%         13.20us ( +-   3.00% )
        0xfee00310:W         24     5.23%     2.93%          4.94us ( +-   3.84% )

Total Samples:459, Total events handled time:4044.59us.

See the ioport event:

Analyze events for all VCPUs:

      IO Port Access    Samples  Samples%     Time%         Avg time

         0xc050:POUT          3   100.00%   100.00%         13.75us ( +-  10.83% )

Total Samples:3, Total events handled time:41.26us.

And, --vcpu is used to track the specified vcpu and --key is used to sort
the result:

Analyze events for VCPU 0:

             VM-EXIT    Samples  Samples%     Time%         Avg time

                 HLT         27    13.85%    99.97%     405790.24us ( +-  12.70% )
  EXTERNAL_INTERRUPT         13     6.67%     0.00%         27.94us ( +-  22.26% )
         APIC_ACCESS        146    74.87%     0.03%         21.69us ( +-   2.91% )
      IO_INSTRUCTION          2     1.03%     0.00%         17.77us ( +-  20.56% )
           CR_ACCESS          2     1.03%     0.00%          8.55us ( +-   6.47% )
   PENDING_INTERRUPT          5     2.56%     0.00%          6.27us ( +-   3.94% )

Total Samples:195, Total events handled time:10959950.90us.

Signed-off-by: Dong Hao
Signed-off-by: Runzhen Wang
[ Dong Hao, Runzhen Wang:
  - rebase it on current acme's tree
  - fix the compiling-error on i386 ]
Signed-off-by: Xiao Guangrong
Acked-by: David Ahern
Cc: Avi Kivity
Cc: David Ahern
Cc: Ingo Molnar
Cc: Marcelo Tosatti
Cc: kvm@vger.kernel.org
Cc: Runzhen Wang
Link: http://lkml.kernel.org/r/1347870675-31495-4-git-send-email-haod...@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/perf/Documentation/perf-kvm.txt |  30 ++-
 tools/perf/MANIFEST                   |   3 +
 tools/perf/builtin-kvm.c              | 840 -
 tools/perf/util/header.c              |  59 +++-
 tools/perf/util/header.h              |   1 +
 tools/perf/util/thread.h              |   2 +
 6 files changed, 929 insertions(+), 6 deletions(-)

diff --git a/tools/perf/Documentation/perf-kvm.txt b/tools/perf/Documentation/perf-kvm.txt
index dd84cb2..326f2cb 100644
--- a/tools/perf/Documentation/perf-kvm.txt
+++ b/tools/perf/Documentation/perf-kvm.txt
@@ -12,7 +12,7 @@ SYNOPSIS
 	[--guestkallsyms= --guestmodules= | --guestvmlinux=]]
 	{top|record|report|diff|buildid-list}
 'perf kvm' [--host] [--guest] [--guestkallsyms= --guestmodules=
-	| --guestvmlinux=] {top|record|report|diff|buildid-list}
+	| --guestvmlinux=] {top|record|report|diff|buildid-list|stat}
 
 DESCRIPTION
 ---
@@ -38,6 +38,18 @@ There are a couple of variants of perf kvm:
   so that other tools can be used to fetch packages with matching symbol tables
   for use by perf report.
 
+  'perf kvm stat ' to run a command and gather performance counter
+  statistics.
+  Especially, perf 'kvm stat record/report' generates a statistical analysis
+  of KVM events. Currently, vmexit, mmio and ioport events are supported.
+  'perf kvm stat record ' records kvm events and the events between
+  start and end .
+  And this command produces a file which contains tracing results of kvm
+  events.
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote:
> On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
> > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
> >> However Rik had a genuine concern in the cases where runqueue is not
> >> equally distributed and lockholder might actually be on a different run
> >> queue but not running.
> >
> > Load should eventually get distributed equally -- that's what the
> > load-balancer is for -- so this is a temporary situation.
>
> What's the expected latency?  This is the whole problem.  Eventually the
> scheduler would pick the lock holder as well, the problem is that it's
> in the millisecond scale while lock hold times are in the microsecond
> scale, leading to a 1000x slowdown.

Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes
something like accurate or fast, never both.

> If we want to yield, we really want to boost someone.

Now if only you knew which someone ;-) This non-modified guest nonsense
is such a snake pit.. but you know how I feel about all that.

> > We already try and favour the non running vcpu in this case, that's what
> > yield_to_task_fair() is about. If its still not eligible to run, tough
> > luck.
>
> Crazy idea: instead of yielding, just run that other vcpu in the thread
> that would otherwise spin.  I can see about a million objections to this
> already though.

Yah.. you want me to list a few? :-) It would require synchronization
with the other cpu to pull its task -- one really wants to avoid it also
running it.

Do this at a high enough frequency and you're dead too.

Anyway, you can do this inside the KVM stuff, simply flip the vcpu state
associated with a vcpu thread and use the preemption notifiers to sort
things against the scheduler or somesuch.

> >> Do you think instead of using rq->nr_running, we could get a global
> >> sense of load using avenrun (something like avenrun/num_onlinecpus)
> >
> > To what purpose? Also, global stuff is expensive, so you should try and
> > stay away from it as hard as you possibly can.
>
> Spinning is also expensive.  How about we do the global stuff every N
> times, to amortize the cost (and reduce contention)?

Nah, spinning isn't expensive, its a waste of time, similar end result
for someone who wants to do useful work though, but not the same cause.

Pick N and I'll come up with a scenario for which its wrong ;-)

Anyway, its an ugly problem and one I really want to contain inside the
insanity that created it (virt), lets not taint the rest of the kernel
more than we need to.
[GIT PULL 00/30] perf/core improvements and fixes
Hi Ingo,

Please consider pulling,

- Arnaldo

The following changes since commit 1e6dd8adc78d4a153db253d051fd4ef6c49c9019:

  perf: Fix off by one test in perf_reg_value() (2012-09-19 17:08:40 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux tags/perf-core-for-mingo

for you to fetch changes up to b1ac754b67b5a875d63bee880f60ccb0c6bd8899:

  tools lib traceevent: Handle alloc_arg failure (2012-09-24 12:31:52 -0300)

----------------------------------------------------------------
perf/core improvements and fixes:

. Convert the trace builtins to use the growing evsel/evlist tracepoint
  infrastructure, removing several open coded constructs like switch like
  series of strcmp to dispatch events, etc.  Basically what had already
  been showcased in 'perf sched'.

. Add evsel constructor for tracepoints, that uses libtraceevent just to
  parse the /format events file, use it in a new 'perf test' to make sure
  the libtraceevent format parsing regressions can be more readily caught.

. Some strange errors were happening in some builds, but not on the next,
  reported by several people, problem was some parser related files,
  generated during the build, didn't had proper make deps, fix from Eric
  Sandeen.

. Fix some compiling errors on 32-bit, from Feng Tang.

. Don't use sscanf extension %as, not available on bionic,
  reimplementation by Irina Tirdea.

. Fix bfd.h/libbfd detection with recent binutils, from Markus
  Trippelsdorf.

. Introduce struct and cache information about the environment where a
  perf.data file was captured, from Namhyung Kim.

. Fix several error paths in libtraceevent, from Namhyung Kim.

. Print event causing perf_event_open() to fail in 'perf record', from
  Stephane Eranian.

. New 'kvm' analysis tool, from Xiao Guangrong.

Signed-off-by: Arnaldo Carvalho de Melo

----------------------------------------------------------------
Arnaldo Carvalho de Melo (11):
      perf kvm: Use perf_evsel__intval
      perf kmem: Use perf_evsel__intval and perf_session__set_tracepoints_handlers
      perf lock: Use perf_evsel__intval and perf_session__set_tracepoints_handlers
      perf timechart: Use zalloc and fix a couple leaks
      tools lib traceevent: Use asprintf were applicable
      tools lib traceevent: Use calloc were applicable
      tools lib traceevent: Fix afterlife gotos
      tools lib traceevent: Remove some die() calls
      tools lib traceevent: Carve out events format parsing routine
      perf evsel: Provide a new constructor for tracepoints
      perf test: Add test for the sched tracepoint format fields

Eric Sandeen (1):
      perf tools: Fix parallel build

Feng Tang (2):
      perf tools: Fix a compiling error in trace-event-perl.c for 32 bits machine
      perf tools: Fix a compiling error in util/map.c

Irina Tirdea (1):
      perf tools: remove sscanf extension %as

Markus Trippelsdorf (1):
      perf tools: bfd.h/libbfd detection fails with recent binutils

Namhyung Kim (11):
      perf header: Add struct perf_session_env
      perf header: Add ->process callbacks to most of features
      perf header: Use pre-processed session env when printing
      perf header: Remove unused @feat arg from ->process callback
      perf kvm: Use perf_session_env for reading cpuid
      perf header: Remove perf_header__read_feature
      tools lib traceevent: Fix error path on process_array()
      tools lib traceevent: Make sure that arg->op.right is set properly
      tools lib traceevent: Free field if an error occurs on process_fields
      tools lib traceevent: Free field if an error occurs on process_flags/symbols
      tools lib traceevent: Handle alloc_arg failure

Stephane Eranian (1):
      perf record: Print event causing perf_event_open() to fail

Xiao Guangrong (2):
      KVM: x86: Export svm/vmx exit code and vector code to userspace
      perf kvm: Events analysis tool

 arch/x86/include/asm/kvm.h            |  16 +
 arch/x86/include/asm/kvm_host.h       |  16 -
 arch/x86/include/asm/svm.h            | 205 +++--
 arch/x86/include/asm/vmx.h            | 127 ++-
 arch/x86/kvm/trace.h                  |  89 ---
 tools/lib/traceevent/event-parse.c    | 570 +
 tools/lib/traceevent/event-parse.h    |   3 +
 tools/perf/Documentation/perf-kvm.txt |  30 +-
 tools/perf/MANIFEST                   |   3 +
 tools/perf/Makefile                   |   6 +-
 tools/perf/builtin-kmem.c             |  90 +--
 tools/perf/builtin-kvm.c              | 836 +++-
 tools/perf/builtin-lock.c             | 233 ++
 tools/perf/builtin-record.c           |   6 +-
 tools/perf/builtin-test.c             |  86 ++
 tools/perf/builtin-timechart.c        |  40 +-
 tools/perf/uti
Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
On 09/24/2012 05:52 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:43 +0200, Avi Kivity wrote:
>> Wouldn't this correspond to the scheduler interrupt firing and causing a
>> reschedule?  I thought the timer was programmed for exactly the point in
>> time that CFS considers the right time for a switch.  But I'm basing
>> this on my mental model of CFS, not CFS itself.
>
> No, we tried this for hrtimer kernels for a while, but programming
> hrtimers the whole time (every actual task-switch) turns out to be far
> too expensive. So we're back to HZ ticks and 'polling' the preemption
> state.

Ok, so I wasn't completely off base.  With HZ=1000, an interrupt-driven
schedule() can beat the polling by at most a millisecond, and we need to
be a lot faster.

> Even if we remove all the hrtimer infrastructure overhead (can do with a
> few hacks) setting the hardware requires going out to the LAPIC, which
> is stupid slow.
>
> Some hardware actually has fast/reliable/usable timers, sadly none of it
> is popular.

There is the TSC deadline timer mode of newer Intels.  Programming the
timer is a simple wrmsr, and it will fire immediately if it already
expired.  Unfortunately on AMDs it is not available, and on virtual
hardware it will be slow (~1-2 usec).

--
error compiling committee.c: too many arguments to function
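For the record, arming the TSC deadline timer really is one MSR write.  A
minimal sketch, assuming the LAPIC LVT timer entry has already been
switched to TSC-deadline mode and CPUID advertises the feature; this is
illustrative, not existing kernel code:

#include <asm/msr.h>

static void arm_tsc_deadline(u64 delta_cycles)
{
	u64 now = native_read_tsc();

	/*
	 * A deadline in the past fires immediately, so there is no
	 * race between reading the TSC and writing the MSR.
	 */
	wrmsrl(MSR_IA32_TSC_DEADLINE, now + delta_cycles);
}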
Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
On Mon, 2012-09-24 at 17:43 +0200, Avi Kivity wrote:
> Wouldn't this correspond to the scheduler interrupt firing and causing a
> reschedule?  I thought the timer was programmed for exactly the point in
> time that CFS considers the right time for a switch.  But I'm basing
> this on my mental model of CFS, not CFS itself.

No, we tried this for hrtimer kernels for a while, but programming
hrtimers the whole time (every actual task-switch) turns out to be far
too expensive. So we're back to HZ ticks and 'polling' the preemption
state.

Even if we remove all the hrtimer infrastructure overhead (can do with a
few hacks) setting the hardware requires going out to the LAPIC, which
is stupid slow.

Some hardware actually has fast/reliable/usable timers, sadly none of it
is popular.
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>> However Rik had a genuine concern in the cases where runqueue is not
>> equally distributed and lockholder might actually be on a different run
>> queue but not running.
>
> Load should eventually get distributed equally -- that's what the
> load-balancer is for -- so this is a temporary situation.

What's the expected latency?  This is the whole problem.  Eventually the
scheduler would pick the lock holder as well, the problem is that it's
in the millisecond scale while lock hold times are in the microsecond
scale, leading to a 1000x slowdown.

If we want to yield, we really want to boost someone.

> We already try and favour the non running vcpu in this case, that's what
> yield_to_task_fair() is about. If its still not eligible to run, tough
> luck.

Crazy idea: instead of yielding, just run that other vcpu in the thread
that would otherwise spin.  I can see about a million objections to this
already though.

>> Do you think instead of using rq->nr_running, we could get a global
>> sense of load using avenrun (something like avenrun/num_onlinecpus)
>
> To what purpose? Also, global stuff is expensive, so you should try and
> stay away from it as hard as you possibly can.

Spinning is also expensive.  How about we do the global stuff every N
times, to amortize the cost (and reduce contention)?

--
error compiling committee.c: too many arguments to function
Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
On 09/24/2012 05:34 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:26 +0200, Avi Kivity wrote:
>> I think this is a no-op these (CFS) days.  To get schedule() to do
>> anything, you need to wake up a task, or let time pass, or block.
>> Otherwise it will see that nothing has changed and as far as it's
>> concerned you're still the best task to be running (otherwise it
>> wouldn't have picked you in the first place).
>
> Time could have passed enough before calling this that there's now a
> different/more eligible task around to schedule.

Wouldn't this correspond to the scheduler interrupt firing and causing a
reschedule?  I thought the timer was programmed for exactly the point in
time that CFS considers the right time for a switch.  But I'm basing
this on my mental model of CFS, not CFS itself.

> Esp. for a !PREEMPT kernel this could be significant.

--
error compiling committee.c: too many arguments to function
Re: [PATCH v5 0/4] VFIO-based PCI device assignment
On Fri, Sep 14, 2012 at 05:01:35PM -0600, Alex Williamson wrote:
> Same goodness as v4, plus:
>
>  - Addressed comments by Blue Swirl (thanks for the review)
>    (hopefully w/o breaking anything wrt slow bar endianness)
>  - Fixed a couple checkpatch warnings that snuck in
>
> BTW, this works fine with Jason's Q35 patches though we will
> need to add INTx routing support for KVM accelerated INTx
> (and pci-assign).  Thanks,
>
> Alex

OK this is well contained, let's merge :).

Acked-by: Michael S. Tsirkin

> ---
>
> Alex Williamson (4):
>       vfio: Enable vfio-pci and mark supported
>       vfio: vfio-pci device assignment driver
>       Update Linux kernel headers
>       Update kernel header script to include vfio
>
>  MAINTAINERS                     |    5
>  configure                       |    6
>  hw/Makefile.objs                |    3
>  hw/vfio_pci.c                   | 1864 +++
>  hw/vfio_pci_int.h               |  114 ++
>  linux-headers/linux/vfio.h      |  368
>  scripts/update-linux-headers.sh |    2
>  7 files changed, 2360 insertions(+), 2 deletions(-)
>  create mode 100644 hw/vfio_pci.c
>  create mode 100644 hw/vfio_pci_int.h
>  create mode 100644 linux-headers/linux/vfio.h
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On 09/21/2012 08:24 PM, Raghavendra K T wrote:
> On 09/21/2012 06:32 PM, Rik van Riel wrote:
>> On 09/21/2012 08:00 AM, Raghavendra K T wrote:
>>> From: Raghavendra K T
>>>
>>> When total number of VCPUs of system is less than or equal to physical
>>> CPUs, PLE exits become costly since each VCPU can have a dedicated
>>> PCPU, and trying to find a target VCPU to yield_to just burns time in
>>> the PLE handler.
>>>
>>> This patch reduces overhead, by simply doing a return in such
>>> scenarios by checking the length of current cpu runqueue.
>>
>> I am not convinced this is the way to go.
>>
>> The VCPU that is holding the lock, and is not releasing it,
>> probably got scheduled out.  That implies that VCPU is on a
>> runqueue with at least one other task.
>
> I see your point here, we have two cases:
>
> case 1)
>
> rq1 : vcpu1->wait(lockA) (spinning)
> rq2 : vcpu2->holding(lockA) (running)
>
> Here ideally vcpu1 should not enter the PLE handler, since it would
> surely get the lock within ple_window cycles.  (assuming ple_window is
> tuned for that workload perfectly).
>
> May be this explains why we are not seeing benefit with kernbench.
>
> On the other side, since we cannot have a perfect ple_window tuned for
> all types of workloads, for those workloads which may need more than
> 4096 cycles, we gain.  Is that what we are seeing in the benefited
> cases?

Maybe we need to increase the ple window regardless.  4096 cycles is 2
microseconds or less (call it t_spin).  The overhead from
kvm_vcpu_on_spin() and the associated task switches is at least a few
microseconds, increasing as contention is added (call it t_yield).  The
time for a natural context switch is several milliseconds (call it
t_slice).  There is also the time the lock holder owns the lock,
assuming no contention (t_hold).

If t_yield > t_spin, then in the undercommitted case it dominates
t_spin.  If t_hold > t_spin we lose badly.

If t_spin > t_yield, then the undercommitted case doesn't suffer as much,
as most of the spinning happens in the guest instead of the host, so it
can pick up the unlock timely.  We don't lose too much in the
overcommitted case provided the values aren't too far apart (say a
factor of 3).

Obviously t_spin must be significantly smaller than t_slice, otherwise
it accomplishes nothing.

Regarding t_hold: if it is small, then a larger t_spin helps avoid false
exits.  If it is large, then we're not very sensitive to t_spin.  It
doesn't matter if it takes us 2 usec or 20 usec to yield, if we end up
yielding for several milliseconds.

So I think it's worth trying again with ple_window of 2-4.

> case 2)
> rq1 : vcpu1->wait(lockA) (spinning)
> rq2 : vcpu3 (running) , vcpu2->holding(lockA) [scheduled out]
>
> I agree that checking rq1 length is not proper in this case, and as you
> rightly pointed out, we are in trouble here.
> nr_running()/num_online_cpus() would give more accurate picture here,
> but it seemed costly. May be load balancer save us a bit here in not
> running to such sort of cases. ( I agree load balancer is far too
> complex).

In theory preempt notifiers can tell us whether a vcpu is preempted or
not (except for exits to userspace), so we can keep track of whether
we're overcommitted in kvm itself.  It also avoids false positives from
other guests and/or processes being overcommitted while our vm is fine.

--
error compiling committee.c: too many arguments to function
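A worked example of the trade-off above, with illustrative numbers rather
than measurements (a 3 GHz TSC and an assumed 5 us yield cost); standalone
C, just to make the orders of magnitude concrete:

#include <stdio.h>

int main(void)
{
	double hz = 3e9;
	double t_spin  = 4096 / hz * 1e6;  /* ple_window in usec: ~1.37 */
	double t_yield = 5.0;              /* assumed PLE exit + yield_to */
	double t_slice = 3000.0;           /* ms-scale scheduler slice */

	/*
	 * Spinning only pays off while t_spin stays well below t_slice
	 * and the expected lock hold time is below t_yield; with these
	 * numbers t_yield > t_spin, which is the losing case Avi
	 * describes for undercommit.
	 */
	printf("t_spin=%.2fus t_yield=%.2fus t_slice=%.0fus\n",
	       t_spin, t_yield, t_slice);
	return 0;
}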
Re: [PATCH 4/6] KVM: PPC: debug stub interface parameter defined
On 21.08.2012, at 15:51, Bharat Bhushan wrote:

> This patch defines the interface parameter for KVM_SET_GUEST_DEBUG
> ioctl support. Follow up patches will use this for setting up
> hardware breakpoints, watchpoints and software breakpoints.
>
> Signed-off-by: Bharat Bhushan
> ---
>  arch/powerpc/include/asm/kvm.h |   33 +
>  arch/powerpc/kvm/book3s.c      |    6 ++
>  arch/powerpc/kvm/booke.c       |    6 ++
>  arch/powerpc/kvm/powerpc.c     |    6 --
>  4 files changed, 45 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
> index 3c14202..61b197e 100644
> --- a/arch/powerpc/include/asm/kvm.h
> +++ b/arch/powerpc/include/asm/kvm.h
> @@ -269,8 +269,41 @@ struct kvm_debug_exit_arch {
>
>  /* for KVM_SET_GUEST_DEBUG */
>  struct kvm_guest_debug_arch {
> +	struct {
> +		/* H/W breakpoint/watchpoint address */
> +		__u64 addr;
> +		/*
> +		 * Type denotes h/w breakpoint, read watchpoint, write
> +		 * watchpoint or watchpoint (both read and write).
> +		 */
> +#define KVMPPC_DEBUG_NOTYPE		0x0
> +#define KVMPPC_DEBUG_BREAKPOINT		(1UL << 1)
> +#define KVMPPC_DEBUG_WATCH_WRITE	(1UL << 2)
> +#define KVMPPC_DEBUG_WATCH_READ		(1UL << 3)
> +		__u32 type;
> +		__u32 pad1;

Why the padding?

> +		__u64 pad2;
> +	} bp[16];

Why 16?

>  };
>
> +/* Debug related defines */
> +/*
> + * kvm_guest_debug->control is a 32 bit field. The lower 16 bits are generic
> + * and upper 16 bits are architecture specific. Architecture specific defines
> + * that ioctl is for setting hardware breakpoint or software breakpoint.
> + */
> +#define KVM_GUESTDBG_USE_SW_BP		0x0001
> +#define KVM_GUESTDBG_USE_HW_BP		0x0002
> +
> +/* When setting software breakpoint, Change the software breakpoint
> + * instruction to special trap instruction and set KVM_GUESTDBG_USE_SW_BP
> + * flag in kvm_guest_debug->control. KVM does keep track of software
> + * breakpoints. So when KVM_GUESTDBG_USE_SW_BP flag is set and special trap
> + * instruction is executed by guest then exit to userspace.
> + * NOTE: A Nice interface can be added to get the special trap instruction.
> + */
> +#define KVMPPC_INST_GUEST_GDB	0x7C00021C	/* ehpriv OC=0 */

This definitely has to be passed to user space (which writes that
instruction into guest phys memory).  Other PPC subarchs will use
different instructions.  Just model it as a read-only ONE_REG.

Alex
Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
On Mon, 2012-09-24 at 17:26 +0200, Avi Kivity wrote:
> I think this is a no-op these (CFS) days.  To get schedule() to do
> anything, you need to wake up a task, or let time pass, or block.
> Otherwise it will see that nothing has changed and as far as it's
> concerned you're still the best task to be running (otherwise it
> wouldn't have picked you in the first place).

Time could have passed enough before calling this that there's now a
different/more eligible task around to schedule.

Esp. for a !PREEMPT kernel this could be significant.
Re: [RFC PATCH v3 08/19] pc: calculate dimm physical addresses and adjust memory map
On Sat, Sep 22, 2012 at 02:15:28PM +0000, Blue Swirl wrote:
> > +
> > +/* Function to configure memory offsets of hotpluggable dimms */
> > +
> > +target_phys_addr_t pc_set_hp_memory_offset(uint64_t size)
> > +{
> > +    target_phys_addr_t ret;
> > +
> > +    /* on first call, initialize ram_hp_offset */
> > +    if (!ram_hp_offset) {
> > +        if (ram_size >= PCI_HOLE_START ) {
> > +            ram_hp_offset = 0x100000000LL + (ram_size - PCI_HOLE_START);
> > +        } else {
> > +            ram_hp_offset = ram_size;
> > +        }
> > +    }
> > +
> > +    if (ram_hp_offset >= 0x100000000LL) {
> > +        ret = ram_hp_offset;
> > +        above_4g_hp_mem_size += size;
> > +        ram_hp_offset += size;
> > +    }
> > +    /* if dimm fits before pci hole, append it normally */
> > +    else if (ram_hp_offset + size <= PCI_HOLE_START) {
>
> } else if ...
>
> > +        ret = ram_hp_offset;
> > +        below_4g_hp_mem_size += size;
> > +        ram_hp_offset += size;
> > +    }
> > +    /* otherwise place it above 4GB */
> > +    else {
>
> } else {
>
> > +        ret = 0x100000000LL;
> > +        above_4g_hp_mem_size += size;
> > +        ram_hp_offset = 0x100000000LL + size;
> > +    }
> > +
> > +    return ret;
> > +}
>
> But the function and use of lots of global variables is ugly. The dimm
> devices should be just created in piix_pci.c (i440fx) directly with
> correct offsets and sizes, so all below_4g_mem_size etc. calculations
> should be moved there. That would implement the PMC part of i440fx.
>
> For ISA PC, probably the board should create the DIMMs since there may
> not be a memory controller. The >4G logic does not make sense there
> anyway.

What about moving the implementation to pc_piix.c?  Initial RAM and pci
windows are already calculated in pc_init1, and then passed to
i440fx_init.  The memory bus could be attached to i440fx for pci-enabled
pc and to isabus-bridge for isa-pc (isa-pc not tested yet).  Something
like the following:

---
 hw/pc.h      |    1 +
 hw/pc_piix.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 52 insertions(+), 6 deletions(-)

diff --git a/hw/pc.h b/hw/pc.h
index e4db071..d6cc43b 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -10,6 +10,7 @@
 #include "memory.h"
 #include "ioapic.h"
 
+#define PCI_HOLE_START 0xe0000000
 /* PC-style peripherals (also used by other machines).  */
 
 /* serial.c */
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index 88ff041..17db95a 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -43,6 +43,7 @@
 #include "xen.h"
 #include "memory.h"
 #include "exec-memory.h"
+#include "dimm.h"
 #ifdef CONFIG_XEN
 #  include
 #endif
@@ -52,6 +53,8 @@ static const int ide_iobase[MAX_IDE_BUS] = { 0x1f0, 0x170 };
 static const int ide_iobase2[MAX_IDE_BUS] = { 0x3f6, 0x376 };
 static const int ide_irq[MAX_IDE_BUS] = { 14, 15 };
 
+static ram_addr_t below_4g_hp_mem_size = 0;
+static ram_addr_t above_4g_hp_mem_size = 0;
 
 static void kvm_piix3_setup_irq_routing(bool pci_enabled)
 {
@@ -117,6 +120,41 @@ static void ioapic_init(GSIState *gsi_state)
     }
 }
 
+static target_phys_addr_t pc_set_hp_memory_offset(uint64_t size)
+{
+    target_phys_addr_t ret;
+    static ram_addr_t ram_hp_offset = 0;
+
+    /* on first call, initialize ram_hp_offset */
+    if (!ram_hp_offset) {
+        if (ram_size >= PCI_HOLE_START ) {
+            ram_hp_offset = 0x100000000LL + (ram_size - PCI_HOLE_START);
+        } else {
+            ram_hp_offset = ram_size;
+        }
+    }
+
+    if (ram_hp_offset >= 0x100000000LL) {
+        ret = ram_hp_offset;
+        above_4g_hp_mem_size += size;
+        ram_hp_offset += size;
+    }
+    /* if dimm fits before pci hole, append it normally */
+    else if (ram_hp_offset + size <= PCI_HOLE_START) {
+        ret = ram_hp_offset;
+        below_4g_hp_mem_size += size;
+        ram_hp_offset += size;
+    }
+    /* otherwise place it above 4GB */
+    else {
+        ret = 0x100000000LL;
+        above_4g_hp_mem_size += size;
+        ram_hp_offset = 0x100000000LL + size;
+    }
+
+    return ret;
+}
+
 /* PC hardware initialisation */
 static void pc_init1(MemoryRegion *system_memory,
                      MemoryRegion *system_io,
@@ -155,9 +193,9 @@ static void pc_init1(MemoryRegion *system_memory,
         kvmclock_create();
     }
 
-    if (ram_size >= 0xe0000000 ) {
-        above_4g_mem_size = ram_size - 0xe0000000;
-        below_4g_mem_size = 0xe0000000;
+    if (ram_size >= PCI_HOLE_START ) {
+        above_4g_mem_size = ram_size - PCI_HOLE_START;
+        below_4g_mem_size = PCI_HOLE_START;
     } else {
         above_4g_mem_size = 0;
         below_4g_mem_size = ram_size;
@@ -172,6 +210,9 @@ static void pc_init1(MemoryRegion *system_memory,
         rom_memory = system_memory;
     }
 
+    /* adjust memory map for hotplug dimms */
+    dimm_calc_offsets(pc_set_hp_memory_offset);
+
     /* allocate ram and load rom/bios */
     if (!xen_enabled()) {
         fw_cfg = pc_memory_init(syst
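A standalone sketch of the placement rule above, with illustrative sizes;
it mirrors pc_set_hp_memory_offset() but is a simplified simulation, not a
drop-in replacement:

#include <stdio.h>
#include <stdint.h>

#define PCI_HOLE_START 0xe0000000ULL
#define FOUR_G         0x100000000ULL

static uint64_t hp_offset;

static uint64_t place_dimm(uint64_t size)
{
    uint64_t addr;

    /* a dimm that would cross the PCI hole moves to/above 4G */
    if (hp_offset < FOUR_G && hp_offset + size > PCI_HOLE_START)
        hp_offset = FOUR_G;

    addr = hp_offset;
    hp_offset += size;
    return addr;
}

int main(void)
{
    hp_offset = 3ULL << 30;     /* ram_size = 3 GiB, below the hole */

    /* 256M dimm fits below the hole: placed at 0xc0000000 */
    printf("256M dimm at %#llx\n",
           (unsigned long long)place_dimm(256ULL << 20));
    /* 512M dimm would cross 0xe0000000: placed at 0x100000000 */
    printf("512M dimm at %#llx\n",
           (unsigned long long)place_dimm(512ULL << 20));
    return 0;
}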
Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
On 09/21/2012 03:00 PM, Raghavendra K T wrote: > From: Raghavendra K T > > When PLE handler fails to find a better candidate to yield_to, it > goes back and does spin again. This is acceptable when we do not > have overcommit. > But in overcommitted scenarios (especially when we have large > number of small guests), it is better to yield. > > Reviewed-by: Srikar Dronamraju > Signed-off-by: Raghavendra K T > --- > virt/kvm/kvm_main.c |4 > 1 files changed, 4 insertions(+), 0 deletions(-) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 8323685..713b677 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me) > } > } > } > + /* In overcommitted cases, yield instead of spinning */ > + if (!yielded && rq_nr_running() > 1) > + schedule(); > + I think this is a no-op these (CFS) days. To get schedule() to do anything, you need to wake up a task, or let time pass, or block. Otherwise it will see that nothing has changed and as far as it's concerned you're still the best task to be running (otherwise it wouldn't have picked you in the first place). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[GIT PULL] VFIO fixes for 3.6
Hi Linus, The following changes since commit c46de2263f42fb4bbde411b9126f471e9343cb22: Merge branch 'for-linus' of git://git.kernel.dk/linux-block (2012-09-19 11:04:34 -0700) are available in the git repository at: git://github.com/awilliam/linux-vfio.git tags/vfio-for-linus for you to fetch changes up to b68e7fa879cd3b1126a7c455d9da1b70299efc0d: vfio: Fix virqfd release race (2012-09-21 10:48:28 -0600) VFIO doc update and virqfd race fix Alex Williamson (2): vfio: Trivial Documentation correction vfio: Fix virqfd release race Documentation/vfio.txt| 2 +- drivers/vfio/pci/vfio_pci_intrs.c | 76 --- 2 files changed, 57 insertions(+), 21 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: x86: Skip request checking branches in vcpu_enter_guest() more effectively
On 09/24/2012 04:32 PM, Takuya Yoshikawa wrote: > On Mon, 24 Sep 2012 12:09:00 +0200 > Avi Kivity wrote: > >> > while (vcpu->request) { >> >xchg(vcpu->request, request); >> > >> >for_each_set_bit(request) { >> >clear_bit(X); >> > >> >.. >> >} >> > >> > } >> >> In fact I had something like that in one of the earlier versions, but it >> was problematic. I'll try to see what the issue was. > > Unless there are many requests, the cost of for_each_set_bit() and a few > added code may exceed that of the original code. > (Looping using __ffs() is an alternative because requests is "long".) > > So I wanted to know the most common requests pattern. See other branch of this thread. But in short, I now think you are right and the special-case is warranted. (not for STEAL_UPDATE - that's likely a bug, it should only happen on overcommit) -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: x86: Skip request checking branches in vcpu_enter_guest() more effectively
On 09/24/2012 04:14 PM, Takuya Yoshikawa wrote: > On Mon, 24 Sep 2012 12:18:15 +0200 > Avi Kivity wrote: > >> On 09/24/2012 08:24 AM, Takuya Yoshikawa wrote: >> > This is an RFC since I have not done any comparison with the approach >> > using for_each_set_bit() which can be seen in Avi's work. >> > >> >Takuya >> > --- >> > >> > We did a simple test to see which requests we would get at the same time >> > in vcpu_enter_guest() and got the following numbers: >> > >> > (N) (E) (S) (ES) others >> > 22.3% 30.7% 16.0% 29.5% 1.4% >> > >> > (N) : Nothing >> > (E) : Only KVM_REQ_EVENT >> > (S) : Only KVM_REQ_STEAL_UPDATE >> > (ES): Only KVM_REQ_EVENT and KVM_REQ_STEAL_UPDATE >> > >> > * Note that the exact numbers can change for other guests. >> >> What was the workload? STEAL_UPDATE is done on schedules and >> heavyweight exit (erroneously), so it should be rare. > > Just ping'ing to the host. But even without that, I got many > STEAL_UPDATE's and KVM_REQ_EVENT's. Those are only relative measurements. Sure on an idle guest practically all of your exits are special, but there's so few of them that optimizing them doesn't matter much. However the trend is that as hardware improves exits get more and more special. Before NPT/EPT, most of the exits were page faults, and we optimized for them. Afterwards, most exits are APIC and interrupt related, HLT, and MMIO. Of these, some are special (HLT, interrupt injection) and some are not (read/write most APIC registers). I don't think one group dominates the other. So already vcpu->requests processing is not such a slow path, it is relatively common. We still see a lot of page faults during boot and during live migration though. With AVIC/APIC-V (still in the future) the mix will change again, with both special and non-special exits eliminated. We'll be left mostly with APIC timer and HLT (and ICR writes for APIC-V). So maybe the direction of your patch makes sense. Things like KVM_REQ_EVENT (or anything above 2-3% of exits) shouldn't be in vcpu->requests or maybe they deserve special treatment. > >> Or maybe we're recording HLT time as steal time? > > Not sure. I think we do, but on the other hand qemu doesn't even expose it so we can't measure it. > > BTW, schedule() is really rare? We do either cond_resched() or > heavy weight exit, no? If 25% of exits are HLT (like a ping workload), then 25% of your exits end up in schedule(). On modern hardware, a relatively larger percentage of exits are heavyweight (same analysis as above). On AVIC hardware most exits will be mmio, HLT, and host interrupts. Of these only host interrupts that don't lead to a context switch will be lightweight. > > I always see vcpu threads actively move around the cores. > (When I do not pin them.) Sure, but the frequency is quite low. If not that's a bug. > >> > This motivated us to optimize the following code in vcpu_enter_guest(): >> > >> > if (vcpu->requests) { /** (1) **/ >> > ... >> > if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu)) /** (2) **/ >> > record_steal_time(vcpu); >> > ... >> > } >> > >> > if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) { >> > ... >> > } >> > >> > - For case (E), we do kvm_check_request() for every request other than >> >KVM_REQ_EVENT in the block (1) and always get false. >> > - For case (S) and (ES), the only difference from case (E) is that we >> >get true for KVM_REQ_STEAL_UPDATE.
>> > >> > This means that we were wasting a lot of time for the many if branches >> > in the block (1) even for these trivial three cases which dominated more >> > than 75% in total. >> > >> > This patch mitigates the issue as follows: >> > - For case (E), we change the first if condition to >> >if (vcpu->requests & ~(1 << KVM_REQ_EVENT)) /** (1') **/ >> >so that we can skip the block completely. >> > - For case (S) and (ES), we move the check (2) upwards, out of the >> >block (1), to clear the KVM_REQ_STEAL_UPDATE flag before doing the >> >check (1'). >> > >> > Although this adds one if branch for case (N), the fact that steal >> >update occurs frequently enough except when we give each vcpu a >> >dedicated core justifies its tiny cost. >> >> Modern processors will eliminate KVM_REQ_EVENT in many cases, so the >> optimization is wasted on them. > > Then, my Nehalem server was not so modern. Well I was referring to APIC-v/AVIC hardware which nobody has. On current hardware they're very common. So stuffing it in the vcpu->requests slow path is not warranted. My patch is cleaner than yours as it handles the problem generically, but yours matches reality better. > I did something like this: > > if requests == KVM_REQ_EVENT > ++counter1; > if requests == KVM_REQ_STEAL_UPDATE >
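To make the control flow of the proposed fast path concrete, here is a small self-contained demo of the hoisted-check variant described above. The bit numbers are illustrative, not the kernel's KVM_REQ_* values, and check_request() stands in for kvm_check_request():

    #include <stdio.h>

    enum { REQ_EVENT = 0, REQ_STEAL_UPDATE = 1 };   /* illustrative numbering */

    static unsigned long requests;

    static int check_request(int bit)               /* test-and-clear */
    {
        if (requests & (1UL << bit)) {
            requests &= ~(1UL << bit);
            return 1;
        }
        return 0;
    }

    int main(void)
    {
        requests = (1UL << REQ_EVENT) | (1UL << REQ_STEAL_UPDATE);  /* case (ES) */

        if (check_request(REQ_STEAL_UPDATE))        /* check (2), moved up */
            printf("record_steal_time()\n");

        if (requests & ~(1UL << REQ_EVENT))         /* check (1') */
            printf("slow path: scan the remaining requests\n");
        else
            printf("fast path: block (1) skipped entirely\n");

        if (check_request(REQ_EVENT))               /* event injection check */
            printf("inject pending events\n");
        return 0;
    }

For cases (E) and (ES) the program takes the fast path, which is exactly the saving the patch claims for the dominant request patterns.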
Re: [PATCH 6/6] KVM: booke/bookehv: Add debug stub support
On 07.09.2012, at 00:56, Scott Wood wrote: > On 09/06/2012 09:56 AM, Bhushan Bharat-R65777 wrote: >> >> >>> -Original Message- >>> From: Wood Scott-B07421 >>> Sent: Thursday, September 06, 2012 4:57 AM >>> To: Bhushan Bharat-R65777 >>> Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; ag...@suse.de; Bhushan >>> Bharat- >>> R65777 >>> Subject: Re: [PATCH 6/6] KVM: booke/bookehv: Add debug stub support >>> >>> On 09/05/2012 06:23 PM, Scott Wood wrote: On 08/21/2012 08:52 AM, Bharat Bhushan wrote: > This patch adds the debug stub support on booke/bookehv. > Now QEMU debug stub can use hw breakpoint, watchpoint and software > breakpoint to debug guest. > > Signed-off-by: Bharat Bhushan > --- > arch/powerpc/include/asm/kvm.h| 29 ++- > arch/powerpc/include/asm/kvm_host.h |5 + > arch/powerpc/kernel/asm-offsets.c | 26 ++ > arch/powerpc/kvm/booke.c | 144 > +-- >>> -- > arch/powerpc/kvm/booke_interrupts.S | 110 + > arch/powerpc/kvm/bookehv_interrupts.S | 141 >>> +++- > arch/powerpc/kvm/e500mc.c |3 +- > 7 files changed, 435 insertions(+), 23 deletions(-) > > diff --git a/arch/powerpc/include/asm/kvm.h > b/arch/powerpc/include/asm/kvm.h index 61b197e..53479ea 100644 > --- a/arch/powerpc/include/asm/kvm.h > +++ b/arch/powerpc/include/asm/kvm.h > @@ -25,6 +25,7 @@ > /* Select powerpc specific features in */ #define > __KVM_HAVE_SPAPR_TCE #define __KVM_HAVE_PPC_SMT > +#define __KVM_HAVE_GUEST_DEBUG > > struct kvm_regs { > __u64 pc; > @@ -264,7 +265,31 @@ struct kvm_fpu { > __u64 fpr[32]; > }; > > + > +/* > + * Defines for h/w breakpoint, watchpoint (read, write or both) and > + * software breakpoint. > + * These are used as "type" in KVM_SET_GUEST_DEBUG ioctl and "status" > + * for KVM_DEBUG_EXIT. > + */ > +#define KVMPPC_DEBUG_NONE0x0 > +#define KVMPPC_DEBUG_BREAKPOINT (1UL << 1) > +#define KVMPPC_DEBUG_WATCH_WRITE (1UL << 2) > +#define KVMPPC_DEBUG_WATCH_READ (1UL << 3) > struct kvm_debug_exit_arch { That says "arch", but it's not in an arch-specific file. >>> >>> Sigh, I can't read today apparently. >>> > + __u64 pc; > + /* > + * exception -> returns the exception number. If the KVM_DEBUG_EXIT > + * exit is not handled (say not h/w breakpoint or software breakpoint > + * set for this address) by qemu then it is supposed to inject this > + * exception to guest. > + */ > + __u32 exception; > + /* > + * exiting to userspace because of h/w breakpoint, watchpoint > + * (read, write or both) and software breakpoint. > + */ > + __u32 status; > }; What does "exception number" mean in a generic API? >>> >>> Still, "exception number" is not a well-defined concept powerpc-wide. >> >> Just for background why we added is that, on x86 this exception number is >> used to inject the exception to guest if QEMU is not able to handle the >> debug exception. >> >> Should we just through a print with clearing the exception condition? Or >> something else you would like to suggest? > > We can pass up the exception type; it just needs more documentation > about what exactly you're referring to, and probably some enumeration > that says which exception numberspace it is. > > For booke the exception number should probably be related to the fixed > offsets rather than the IVOR number, as IVORs are phased out. Jan, I would like to get your comment on this one. Since we don't have standardized exception vectors like x86 does, we need to convert things between different semantics in user space if we want to make use of the exception type. 
Do we actually need to know about it in user space or do we only need to store it in case we get a migration at that point? If it's the latter, can we maybe keep the reinjection logic internal to KVM and make DEBUG exits non-migratable similar to how we already handle MMIO/PIO exits today? Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v3 11/19] Implement qmp and hmp commands for notification lists
Hi, On Fri, Sep 21, 2012 at 04:03:26PM -0600, Eric Blake wrote: > On 09/21/2012 05:17 AM, Vasilis Liaskovitis wrote: > > Guest can respond to ACPI hotplug events e.g. with _EJ or _OST method. > > This patch implements a tail queue to store guest notifications for memory > > hot-add and hot-remove requests. > > > > Guest responses for memory hotplug command on a per-dimm basis can be > > detected > > with the new hmp command "info memhp" or the new qmp command "query-memhp" > > Naming doesn't match the QMP code. Will fix. > > > Examples: > > > > (qemu) device_add dimm,id=ram0 > > > > > These notification items should probably be part of migration state (not yet > > implemented). > > In the case of libvirt driving migration, you already said in 10/19 that > libvirt has to start the destination with the populated=on|off fields > correct for each dimm according to the state it was in at the time the That patch actually alleviates this restriction for the off->on direction i.e. it allows for the target-VM to not have its args updated for dimm hot-add. (e.g. Let's say the source was started with a dimm, initially off. The dimm is hot-plugged, and then migrated. With patch 10/19, the populated arg doesn't have to be updated on the target) The other direction (on->off) still needs correct arg change. If libvirt/management layers guarantee the dimm arguments are correctly changed, I don't see that we need the 10/19 patch eventually. What I think is needed is another hmp/qmp command, that will report which dimms are on/off at any given time e.g. (monitor) info memory-hotplug dimm0: off dimm1: on ... dimmN: off This can be used on the source by libvirt / other layers to find out the populated dimms, and construct the correct command line on the destination. Does this make sense to you? The current patch only deals with success/failure event notifications (not on-off state of dimms) and should probably be renamed to "query-memory-hotplug-events". > > host started the update. Can the host hot unplug memory after migration > has started? Good testcase. I would rather not allow any hotplug operations while the migration is happening. What do we do with pci hotplug during migration currently? I found a discussion dating from a year ago, suggesting the same as the simplest solution, but I don't know what's currently implemented. http://lists.nongnu.org/archive/html/qemu-devel/2011-07/msg01204.html > > > + > > +## > > +# @MemHpInfo: > > +# > > +# Information about status of a memory hotplug command > > +# > > +# @dimm: the Dimm associated with the result > > +# > > +# @result: the result of the hotplug command > > +# > > +# Since: 1.3 > > +# > > +## > > +{ 'type': 'MemHpInfo', > > + 'data': {'dimm': 'str', 'request': 'str', 'result': 'str'} } > > Should 'result' be a bool (true for success, false for still pending) or > an enum, instead of a free-form string? Likewise, isn't 'request' going > to be exactly one of two values (plug or unplug)? agreed with 'request'. For 'result' it is also a boolean, but with 'success' and 'failure' (rather than 'pending'). Items are only queued when the guest has given us a definite _OST or _EJ result which is either success or fail. If an operation is pending, nothing is queued here. Perhaps queueing pending operations also has a usecase, but this isn't addressed in this patch. thanks, - Vasilis -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
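For what the proposed state-query could look like on the wire, here is a sketch in the usual QMP notation. The command name and field names follow the suggestion above; none of this is an implemented QEMU interface:

    -> { "execute": "query-memory-hotplug" }
    <- { "return": [ { "dimm": "dimm0", "state": "off" },
                     { "dimm": "dimm1", "state": "on" } ] }

A management layer could walk the returned list on the source and generate matching populated=on|off arguments for the destination's command line.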
[PATCH v6 16/16] target-i386: Use Hypervisor leaf extra in -machine pc,accel=tcg.
Signed-off-by: Don Slutz --- target-i386/cpu.c | 11 +++ 1 files changed, 11 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index b77dbfe..1d81f00 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -2001,6 +2001,17 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, *ecx = 0; *edx = 0; break; +case 0x40000002 ... 0x400000FF: +if (index == env->cpuid_hv_extra) { +*eax = env->cpuid_hv_extra_a; +*ebx = env->cpuid_hv_extra_b; +} else { +*eax = 0; +*ebx = 0; +} +*ecx = 0; +*edx = 0; +break; case 0x80000000: *eax = env->cpuid_xlevel; *ebx = env->cpuid_vendor1; -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 15/16] target-i386: Use Hypervisor leaf extra in -machine pc,accel=kvm.
Signed-off-by: Don Slutz --- target-i386/kvm.c | 19 +++ 1 files changed, 19 insertions(+), 0 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index f8a5177..ff82034 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -454,6 +454,25 @@ int kvm_arch_init_vcpu(CPUX86State *env) c->ebx = signature[0]; c->ecx = signature[1]; c->edx = signature[2]; +} else if (env->cpuid_hv_extra != 0) { +for (i = KVM_CPUID_FEATURES + 1; i <= env->cpuid_hv_level; i++) { +c = &cpuid_data.entries[cpuid_i++]; +memset(c, 0, sizeof(*c)); +c->function = i; +if (i == env->cpuid_hv_extra) { +c->eax = env->cpuid_hv_extra_a; +c->ebx = env->cpuid_hv_extra_b; +} +} + +c = &cpuid_data.entries[cpuid_i++]; +memset(c, 0, sizeof(*c)); +c->function = KVM_CPUID_SIGNATURE_NEXT; +memcpy(signature, "KVMKVMKVM\0\0\0", 12); +c->eax = 0; +c->ebx = signature[0]; +c->ecx = signature[1]; +c->edx = signature[2]; } has_msr_async_pf_en = c->eax & (1 << KVM_FEATURE_ASYNC_PF); -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 13/16] target-i386: Add cpu object access routines for Hypervisor leaf extra.
Signed-off-by: Don Slutz --- target-i386/cpu.c | 66 + 1 files changed, 66 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 9ab29a7..8bb20c7 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1279,6 +1279,63 @@ static void x86_cpuid_set_hv_vendor(Object *obj, const char *value, env->cpuid_hv_vendor_set = true; } +static void x86_cpuid_get_hv_extra(Object *obj, Visitor *v, void *opaque, +const char *name, Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); + +visit_type_uint32(v, &cpu->env.cpuid_hv_extra, name, errp); +} + +static void x86_cpuid_set_hv_extra(Object *obj, Visitor *v, void *opaque, +const char *name, Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); +uint32_t value; + +visit_type_uint32(v, &value, name, errp); +if (error_is_set(errp)) { +return; +} + +if ((value != 0) && (value < 0x40000000)) { +value += 0x40000000; +} +cpu->env.cpuid_hv_extra = value; +} + +static void x86_cpuid_get_hv_extra_a(Object *obj, Visitor *v, void *opaque, +const char *name, Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); + +visit_type_uint32(v, &cpu->env.cpuid_hv_extra_a, name, errp); +} + +static void x86_cpuid_set_hv_extra_a(Object *obj, Visitor *v, void *opaque, +const char *name, Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); + +visit_type_uint32(v, &cpu->env.cpuid_hv_extra_a, name, errp); +} + +static void x86_cpuid_get_hv_extra_b(Object *obj, Visitor *v, void *opaque, +const char *name, Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); + +visit_type_uint32(v, &cpu->env.cpuid_hv_extra_b, name, errp); +} + +static void x86_cpuid_set_hv_extra_b(Object *obj, Visitor *v, void *opaque, +const char *name, Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); + +visit_type_uint32(v, &cpu->env.cpuid_hv_extra_b, name, errp); +} + #if !defined(CONFIG_USER_ONLY) static void x86_set_hyperv(Object *obj, Error **errp) { @@ -2225,6 +2282,15 @@ static void x86_cpu_initfn(Object *obj) object_property_add_str(obj, "hypervisor-vendor", x86_cpuid_get_hv_vendor, x86_cpuid_set_hv_vendor, NULL); +object_property_add(obj, "hypervisor-extra", "int", +x86_cpuid_get_hv_extra, +x86_cpuid_set_hv_extra, NULL, NULL, NULL); +object_property_add(obj, "hypervisor-extra-a", "int", +x86_cpuid_get_hv_extra_a, +x86_cpuid_set_hv_extra_a, NULL, NULL, NULL); +object_property_add(obj, "hypervisor-extra-b", "int", +x86_cpuid_get_hv_extra_b, +x86_cpuid_set_hv_extra_b, NULL, NULL, NULL); #if !defined(CONFIG_USER_ONLY) object_property_add(obj, "hv_spinlocks", "int", x86_get_hv_spinlocks, -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 14/16] target-i386: Add setting of Hypervisor leaf extra for known vmware4.
This was taken from: http://article.gmane.org/gmane.comp.emulators.kvm.devel/22643 Signed-off-by: Don Slutz --- target-i386/cpu.c | 32 1 files changed, 32 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 8bb20c7..b77dbfe 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1135,6 +1135,36 @@ static void x86_cpuid_set_model_id(Object *obj, const char *model_id, } } +static void x86_cpuid_set_vmware_extra(Object *obj) +{ +X86CPU *cpu = X86_CPU(obj); + +if ((cpu->env.tsc_khz != 0) && +(cpu->env.cpuid_hv_level == CPUID_HV_LEVEL_VMWARE_4) && +(cpu->env.cpuid_hv_vendor1 == CPUID_HV_VENDOR_VMWARE_1) && +(cpu->env.cpuid_hv_vendor2 == CPUID_HV_VENDOR_VMWARE_2) && +(cpu->env.cpuid_hv_vendor3 == CPUID_HV_VENDOR_VMWARE_3)) { +const uint32_t apic_khz = 1000000L; + +/* + * From article.gmane.org/gmane.comp.emulators.kvm.devel/22643 + * + *Leaf 0x40000010, Timing Information. + * + *VMware has defined the first generic leaf to provide timing + *information. This leaf returns the current TSC frequency and + *current Bus frequency in kHz. + * + *# EAX: (Virtual) TSC frequency in kHz. + *# EBX: (Virtual) Bus (local apic timer) frequency in kHz. + *# ECX, EDX: RESERVED (Per above, reserved fields are set to zero). + */ +cpu->env.cpuid_hv_extra = 0x40000010; +cpu->env.cpuid_hv_extra_a = (uint32_t)cpu->env.tsc_khz; +cpu->env.cpuid_hv_extra_b = apic_khz; +} +} + static void x86_cpuid_get_tsc_freq(Object *obj, Visitor *v, void *opaque, const char *name, Error **errp) { @@ -1164,6 +1194,7 @@ static void x86_cpuid_set_tsc_freq(Object *obj, Visitor *v, void *opaque, } cpu->env.tsc_khz = value / 1000; +x86_cpuid_set_vmware_extra(obj); } static void x86_cpuid_get_hv_level(Object *obj, Visitor *v, void *opaque, @@ -1277,6 +1308,7 @@ static void x86_cpuid_set_hv_vendor(Object *obj, const char *value, env->cpuid_hv_vendor3 |= ((uint8_t)adj_value[i + 8]) << (8 * i); } env->cpuid_hv_vendor_set = true; +x86_cpuid_set_vmware_extra(obj); } static void x86_cpuid_get_hv_extra(Object *obj, Visitor *v, void *opaque, -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
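As a guest-side illustration of what this leaf provides, the following sketch reads the timing leaf directly with the CPUID instruction. It assumes the hypervisor actually implements leaf 0x40000010 as described above; a careful guest would first verify the hypervisor signature at leaf 0x40000000:

    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* Hypervisor leaves are outside the range that __get_cpuid()
         * validates against, so issue the instruction directly. */
        __cpuid(0x40000010, eax, ebx, ecx, edx);
        printf("TSC: %u kHz, APIC bus: %u kHz\n", eax, ebx);
        return 0;
    }

This is the mechanism that lets a guest obtain its TSC frequency without calibrating it against another timer.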
[PATCH v6 12/16] target-i386: Add optional Hypervisor leaf extra.
Signed-off-by: Don Slutz --- target-i386/cpu.h |4 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index ebb3498..254ddef 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -812,6 +812,10 @@ typedef struct CPUX86State { uint32_t cpuid_hv_vendor1; uint32_t cpuid_hv_vendor2; uint32_t cpuid_hv_vendor3; +/* VMware extra data */ +uint32_t cpuid_hv_extra; +uint32_t cpuid_hv_extra_a; +uint32_t cpuid_hv_extra_b; /* MTRRs */ uint64_t mtrr_fixed[11]; -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 11/16] target-i386: Add some known names to Hypervisor vendor.
Signed-off-by: Don Slutz --- target-i386/cpu.c | 57 +++- target-i386/cpu.h | 19 + 2 files changed, 74 insertions(+), 2 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index a929b64..9ab29a7 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1207,6 +1207,24 @@ static char *x86_cpuid_get_hv_vendor(Object *obj, Error **errp) } value[CPUID_VENDOR_SZ] = '\0'; +/* Convert known names */ +if (!strcmp(value, CPUID_HV_VENDOR_HYPERV) && + env->cpuid_hv_level == CPUID_HV_LEVEL_HYPERV) { +pstrcpy(value, sizeof(value), "hyperv"); +} else if (!strcmp(value, CPUID_HV_VENDOR_VMWARE)) { +if (env->cpuid_hv_level == CPUID_HV_LEVEL_VMWARE_4) { +pstrcpy(value, sizeof(value), "vmware4"); +} else if (env->cpuid_hv_level == CPUID_HV_LEVEL_VMWARE_3) { +pstrcpy(value, sizeof(value), "vmware3"); +} +} else if (!strcmp(value, CPUID_HV_VENDOR_XEN) && + env->cpuid_hv_level == CPUID_HV_LEVEL_XEN) { +pstrcpy(value, sizeof(value), "xen"); +} else if (!strcmp(value, CPUID_HV_VENDOR_KVM) && + (env->cpuid_hv_level == 0 || +env->cpuid_hv_level == CPUID_HV_LEVEL_KVM)) { +pstrcpy(value, sizeof(value), "kvm"); +} return value; } @@ -1220,7 +1238,35 @@ static void x86_cpuid_set_hv_vendor(Object *obj, const char *value, memset(adj_value, 0, sizeof(adj_value)); -pstrcpy(adj_value, sizeof(adj_value), value); +/* Convert known names */ +if (!strcmp(value, "hyperv")) { +if (!env->cpuid_hv_level_set) { +env->cpuid_hv_level = CPUID_HV_LEVEL_HYPERV; +} +pstrcpy(adj_value, sizeof(adj_value), CPUID_HV_VENDOR_HYPERV); +} else if (!strcmp(value, "vmware") || !strcmp(value, "vmware4")) { +if (!env->cpuid_hv_level_set) { +env->cpuid_hv_level = CPUID_HV_LEVEL_VMWARE_4; +} +pstrcpy(adj_value, sizeof(adj_value), CPUID_HV_VENDOR_VMWARE); +} else if (!strcmp(value, "vmware3")) { +if (!env->cpuid_hv_level_set) { +env->cpuid_hv_level = CPUID_HV_LEVEL_VMWARE_3; +} +pstrcpy(adj_value, sizeof(adj_value), CPUID_HV_VENDOR_VMWARE); +} else if (!strcmp(value, "xen")) { +if (!env->cpuid_hv_level_set) { +env->cpuid_hv_level = CPUID_HV_LEVEL_XEN; +} +pstrcpy(adj_value, sizeof(adj_value), CPUID_HV_VENDOR_XEN); +} else if (!strcmp(value, "kvm")) { +if (!env->cpuid_hv_level_set) { +env->cpuid_hv_level = CPUID_HV_LEVEL_KVM; +} +pstrcpy(adj_value, sizeof(adj_value), CPUID_HV_VENDOR_KVM); +} else { +pstrcpy(adj_value, sizeof(adj_value), value); +} env->cpuid_hv_vendor1 = 0; env->cpuid_hv_vendor2 = 0; @@ -1700,7 +1746,14 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, /* Handle Hypervisor CPUIDs */ if (real_level < 0x40000000) { -real_level = 0x40000000; +if (env->cpuid_hv_vendor1 == CPUID_HV_VENDOR_KVM_1 && +env->cpuid_hv_vendor2 == CPUID_HV_VENDOR_KVM_2 && +env->cpuid_hv_vendor3 == CPUID_HV_VENDOR_KVM_3 && +real_level == 0) { +real_level = CPUID_HV_LEVEL_KVM; +} else { +real_level = 0x40000000; +} } if (index > real_level) { index = real_level; } diff --git a/target-i386/cpu.h b/target-i386/cpu.h index eb6aa4a..ebb3498 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -490,8 +490,27 @@ #define CPUID_HV_VENDOR_HYPERV "Microsoft Hv" +#define CPUID_HV_VENDOR_VMWARE_1 0x61774d56 /* "VMwa" */ +#define CPUID_HV_VENDOR_VMWARE_2 0x4d566572 /* "reVM" */ +#define CPUID_HV_VENDOR_VMWARE_3 0x65726177 /* "ware" */ +#define CPUID_HV_VENDOR_VMWARE "VMwareVMware" + +#define CPUID_HV_VENDOR_XEN "XenVMMXenVMM" + +#define CPUID_HV_VENDOR_KVM_1 0x4b4d564b /* "KVMK" */ +#define CPUID_HV_VENDOR_KVM_2 0x564b4d56 /* "VMKV" */ +#define CPUID_HV_VENDOR_KVM_3 0x0000004d /* "M\0\0\0" */ +#define CPUID_HV_VENDOR_KVM "KVMKVMKVM" + #define CPUID_HV_LEVEL_HYPERV 0x40000005 +#define CPUID_HV_LEVEL_VMWARE_3 0x40000002 +#define CPUID_HV_LEVEL_VMWARE_4 0x40000010 + +#define CPUID_HV_LEVEL_XEN 0x40000002 + +#define CPUID_HV_LEVEL_KVM 0x40000001 + #define CPUID_MWAIT_IBE (1 << 1) /* Interrupts can exit capability */ #define CPUID_MWAIT_EMX (1 << 0) /* enumeration supported */ -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 10/16] target-i386: Use Hypervisor vendor in -machine pc,accel=tcg.
Also known as Paravirtualization vendor. This change is based on: Microsoft Hypervisor CPUID Leaves: http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx Linux kernel change starts with: http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html Also: http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html This is where the 0 is the same as 0x40000001 is defined. VMware documentation on CPUIDs (Mechanisms to determine if software is running in a VMware virtual machine): http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458 Signed-off-by: Don Slutz --- target-i386/cpu.c |8  1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 964877f..a929b64 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1695,7 +1695,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, } } } else if (index & 0x40000000) { -if (env->cpuid_hv_level_set) { +if (env->cpuid_hv_level_set || env->cpuid_hv_vendor_set) { uint32_t real_level = env->cpuid_hv_level; /* Handle Hypervisor CPUIDs */ @@ -1849,9 +1849,9 @@ break; case 0x40000000: *eax = env->cpuid_hv_level; -*ebx = 0; -*ecx = 0; -*edx = 0; +*ebx = env->cpuid_hv_vendor1; +*ecx = env->cpuid_hv_vendor2; +*edx = env->cpuid_hv_vendor3; break; case 0x40000001: *eax = env->cpuid_kvm_features; -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 09/16] target-i386: Use Hypervisor vendor in -machine pc,accel=kvm.
Also known as Paravirtualization vendor. This change is based on: Microsoft Hypervisor CPUID Leaves: http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx Linux kernel change starts with: http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html Also: http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html VMware documentation on CPUIDs (Mechanisms to determine if software is running in a VMware virtual machine): http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458 Signed-off-by: Don Slutz --- target-i386/kvm.c | 12 +++- 1 files changed, 7 insertions(+), 5 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 8462c75..f8a5177 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -389,16 +389,18 @@ int kvm_arch_init_vcpu(CPUX86State *env) c = &cpuid_data.entries[cpuid_i++]; memset(c, 0, sizeof(*c)); c->function = KVM_CPUID_SIGNATURE; -if (!env->cpuid_hv_level_set) { +if (!env->cpuid_hv_level_set && !env->cpuid_hv_vendor_set) { memcpy(signature, "KVMKVMKVM\0\0\0", 12); c->eax = 0; +c->ebx = signature[0]; +c->ecx = signature[1]; +c->edx = signature[2]; } else { -memcpy(signature, "Microsoft Hv", 12); c->eax = env->cpuid_hv_level; +c->ebx = env->cpuid_hv_vendor1; +c->ecx = env->cpuid_hv_vendor2; +c->edx = env->cpuid_hv_vendor3; } -c->ebx = signature[0]; -c->ecx = signature[1]; -c->edx = signature[2]; c = &cpuid_data.entries[cpuid_i++]; memset(c, 0, sizeof(*c)); -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: x86: Skip request checking branches in vcpu_enter_guest() more effectively
On Mon, 24 Sep 2012 12:09:00 +0200 Avi Kivity wrote: > > while (vcpu->request) { > > xchg(vcpu->request, request); > > > > for_each_set_bit(request) { > > clear_bit(X); > > > > .. > > } > > > > } > > In fact I had something like that in one of the earlier versions, but it > was problematic. I'll try to see what the issue was. Unless there are many requests, the cost of for_each_set_bit() and a few added code may exceed that of the original code. (Looping using __ffs() is an alternative because requests is "long".) So I wanted to know the most common requests pattern. Thanks, Takuya -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
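For reference, a self-contained version of the loop sketched above: the word of pending requests is claimed atomically, then walked bit by bit, and the outer loop re-checks for requests raised in the meantime. Bit values and the dispatch are illustrative; in the kernel the exchange would be xchg(&vcpu->requests, 0):

    #include <stdio.h>

    static unsigned long requests = 0x5;    /* pretend bits 0 and 2 are pending */

    int main(void)
    {
        while (requests) {
            /* atomically claim every pending request at once */
            unsigned long pending =
                __atomic_exchange_n(&requests, 0, __ATOMIC_SEQ_CST);

            while (pending) {
                int bit = __builtin_ctzl(pending);  /* like __ffs() */
                pending &= pending - 1;             /* clear lowest set bit */
                printf("handle request %d\n", bit); /* dispatch on bit here */
            }
        }
        return 0;
    }

The per-iteration cost Takuya worries about is the ffs/ctz plus the mask update, which is what makes the comparison against the straight-line kvm_check_request() chain worth measuring.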
[PATCH v6 08/16] target-i386: Add cpu object access routines for Hypervisor vendor.
These are modeled after x86_cpuid_set_vendor and x86_cpuid_get_vendor. Since kvm's vendor is shorter, the test for correct size is removed and zero padding is added. Set Microsoft's Vendor now that we can. Value defined in: http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx And matches what is in target-i386/kvm.c Signed-off-by: Don Slutz --- target-i386/cpu.c | 46 ++ target-i386/cpu.h |2 ++ 2 files changed, 48 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 920278b..964877f 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1192,11 +1192,54 @@ static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque, cpu->env.cpuid_hv_level_set = true; } +static char *x86_cpuid_get_hv_vendor(Object *obj, Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); +CPUX86State *env = &cpu->env; +char *value; +int i; + +value = (char *)g_malloc(CPUID_VENDOR_SZ + 1); +for (i = 0; i < 4; i++) { +value[i + 0] = env->cpuid_hv_vendor1 >> (8 * i); +value[i + 4] = env->cpuid_hv_vendor2 >> (8 * i); +value[i + 8] = env->cpuid_hv_vendor3 >> (8 * i); +} +value[CPUID_VENDOR_SZ] = '\0'; + +return value; +} + +static void x86_cpuid_set_hv_vendor(Object *obj, const char *value, +Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); +CPUX86State *env = &cpu->env; +int i; +char adj_value[CPUID_VENDOR_SZ + 1]; + +memset(adj_value, 0, sizeof(adj_value)); + +pstrcpy(adj_value, sizeof(adj_value), value); + +env->cpuid_hv_vendor1 = 0; +env->cpuid_hv_vendor2 = 0; +env->cpuid_hv_vendor3 = 0; +for (i = 0; i < 4; i++) { +env->cpuid_hv_vendor1 |= ((uint8_t)adj_value[i + 0]) << (8 * i); +env->cpuid_hv_vendor2 |= ((uint8_t)adj_value[i + 4]) << (8 * i); +env->cpuid_hv_vendor3 |= ((uint8_t)adj_value[i + 8]) << (8 * i); +} +env->cpuid_hv_vendor_set = true; +} + #if !defined(CONFIG_USER_ONLY) static void x86_set_hyperv(Object *obj, Error **errp) { object_property_set_int(obj, CPUID_HV_LEVEL_HYPERV, "hypervisor-level", errp); +object_property_set_str(obj, CPUID_HV_VENDOR_HYPERV, +"hypervisor-vendor", errp); } static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque, @@ -2126,6 +2169,9 @@ static void x86_cpu_initfn(Object *obj) object_property_add(obj, "hypervisor-level", "int", x86_cpuid_get_hv_level, x86_cpuid_set_hv_level, NULL, NULL, NULL); +object_property_add_str(obj, "hypervisor-vendor", +x86_cpuid_get_hv_vendor, +x86_cpuid_set_hv_vendor, NULL); #if !defined(CONFIG_USER_ONLY) object_property_add(obj, "hv_spinlocks", "int", x86_get_hv_spinlocks, diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 11730b2..eb6aa4a 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -488,6 +488,8 @@ #define CPUID_VENDOR_VIA "CentaurHauls" +#define CPUID_HV_VENDOR_HYPERV "Microsoft Hv" + #define CPUID_HV_LEVEL_HYPERV 0x40000005 #define CPUID_MWAIT_IBE (1 << 1) /* Interrupts can exit capability */ -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
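The byte packing done by the setter above can be verified with a quick standalone program: each 4-byte chunk of the 12-character vendor string lands in one register value, least-significant byte first:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        const char *vendor = "VMwareVMware";
        uint32_t r[3] = { 0, 0, 0 };

        for (int i = 0; i < 4; i++) {
            r[0] |= (uint32_t)(uint8_t)vendor[i + 0] << (8 * i);  /* EBX */
            r[1] |= (uint32_t)(uint8_t)vendor[i + 4] << (8 * i);  /* ECX */
            r[2] |= (uint32_t)(uint8_t)vendor[i + 8] << (8 * i);  /* EDX */
        }
        /* prints 0x61774d56 0x4d566572 0x65726177, matching the
         * CPUID_HV_VENDOR_VMWARE_* constants added in patch 11/16 */
        printf("0x%08x 0x%08x 0x%08x\n", r[0], r[1], r[2]);
        return 0;
    }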
[PATCH v6 07/16] target-i386: Add Hypervisor vendor.
Also known as Paravirtualization vendor. This is EBX, ECX, EDX data for 0x40000000. QEMU knows this is KVM_CPUID_SIGNATURE (0x40000000). This is based on: Microsoft Hypervisor CPUID Leaves: http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx Linux kernel change starts with: http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html Also: http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html VMware documentation on CPUIDs (Mechanisms to determine if software is running in a VMware virtual machine): http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458 Signed-off-by: Don Slutz --- target-i386/cpu.h |4  1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 3152a4e..11730b2 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -780,6 +780,7 @@ typedef struct CPUX86State { uint32_t cpuid_apic_id; bool cpuid_vendor_override; bool cpuid_hv_level_set; +bool cpuid_hv_vendor_set; /* Store the results of Centaur's CPUID instructions */ uint32_t cpuid_xlevel2; uint32_t cpuid_ext4_features; @@ -787,6 +788,9 @@ uint32_t cpuid_7_0_ebx; /* Hypervisor CPUIDs */ uint32_t cpuid_hv_level; +uint32_t cpuid_hv_vendor1; +uint32_t cpuid_hv_vendor2; +uint32_t cpuid_hv_vendor3; /* MTRRs */ uint64_t mtrr_fixed[11]; -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 06/16] target-i386: Use Hypervisor level in -machine pc,accel=tcg.
Also known as Paravirtualization level. This change is based on: Microsoft Hypervisor CPUID Leaves: http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx Linux kernel change starts with: http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html Also: http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html VMware documentation on CPUIDs (Mechanisms to determine if software is running in a VMware virtual machine): http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458 QEMU knows this as KVM_CPUID_SIGNATURE (0x40000000) in kvm on linux. This does not provide vendor support in tcg yet. Signed-off-by: Don Slutz --- target-i386/cpu.c | 27 +++ 1 files changed, 27 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 48bdaf9..920278b 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1651,6 +1651,21 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, index = env->cpuid_xlevel; } } +} else if (index & 0x40000000) { +if (env->cpuid_hv_level_set) { +uint32_t real_level = env->cpuid_hv_level; + +/* Handle Hypervisor CPUIDs */ +if (real_level < 0x40000000) { +real_level = 0x40000000; +} +if (index > real_level) { +index = real_level; +} +} else { +if (index > env->cpuid_level) +index = env->cpuid_level; +} } else { if (index > env->cpuid_level) index = env->cpuid_level; @@ -1789,6 +1804,18 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, *edx = 0; } break; +case 0x40000000: +*eax = env->cpuid_hv_level; +*ebx = 0; +*ecx = 0; +*edx = 0; +break; +case 0x40000001: +*eax = env->cpuid_kvm_features; +*ebx = 0; +*ecx = 0; +*edx = 0; +break; case 0x80000000: *eax = env->cpuid_xlevel; *ebx = env->cpuid_vendor1; -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 05/16] target-i386: Use Hypervisor level in -machine pc,accel=kvm.
Also known as Paravirtualization level. This change is based on: Microsoft Hypervisor CPUID Leaves: http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx Linux kernel change starts with: http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html Also: http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html VMware documentation on CPUIDs (Mechanisms to determine if software is running in a VMware virtual machine): http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458 QEMU knows this is KVM_CPUID_SIGNATURE (0x40000000). Signed-off-by: Don Slutz --- target-i386/kvm.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 895d848..8462c75 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -389,12 +389,12 @@ int kvm_arch_init_vcpu(CPUX86State *env) c = &cpuid_data.entries[cpuid_i++]; memset(c, 0, sizeof(*c)); c->function = KVM_CPUID_SIGNATURE; -if (!hyperv_enabled()) { +if (!env->cpuid_hv_level_set) { memcpy(signature, "KVMKVMKVM\0\0\0", 12); c->eax = 0; } else { memcpy(signature, "Microsoft Hv", 12); -c->eax = HYPERV_CPUID_MIN; +c->eax = env->cpuid_hv_level; } c->ebx = signature[0]; c->ecx = signature[1]; -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 04/16] target-i386: Add x86_set_hyperv.
This is used to set the cpu object's hypervisor level to the default for Microsoft's Hypervisor. Signed-off-by: Don Slutz --- target-i386/cpu.c |9 + target-i386/cpu.h |2 ++ 2 files changed, 11 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 451de12..48bdaf9 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1193,6 +1193,12 @@ } #if !defined(CONFIG_USER_ONLY) +static void x86_set_hyperv(Object *obj, Error **errp) +{ +object_property_set_int(obj, CPUID_HV_LEVEL_HYPERV, +"hypervisor-level", errp); +} + static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque, const char *name, Error **errp) { @@ -1215,6 +1221,7 @@ static void x86_set_hv_spinlocks(Object *obj, Visitor *v, void *opaque, return; } hyperv_set_spinlock_retries(value); +x86_set_hyperv(obj, errp); } static void x86_get_hv_relaxed(Object *obj, Visitor *v, void *opaque, @@ -1235,6 +1242,7 @@ static void x86_set_hv_relaxed(Object *obj, Visitor *v, void *opaque, return; } hyperv_enable_relaxed_timing(value); +x86_set_hyperv(obj, errp); } static void x86_get_hv_vapic(Object *obj, Visitor *v, void *opaque, @@ -1255,6 +1263,7 @@ static void x86_set_hv_vapic(Object *obj, Visitor *v, void *opaque, return; } hyperv_enable_vapic_recommended(value); +x86_set_hyperv(obj, errp); } #endif diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 1899f69..3152a4e 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -488,6 +488,8 @@ #define CPUID_VENDOR_VIA "CentaurHauls" +#define CPUID_HV_LEVEL_HYPERV 0x40000005 + #define CPUID_MWAIT_IBE (1 << 1) /* Interrupts can exit capability */ #define CPUID_MWAIT_EMX (1 << 0) /* enumeration supported */ -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 03/16] target-i386: Add cpu object access routines for Hypervisor level.
These are modeled after x86_cpuid_get_xlevel and x86_cpuid_set_xlevel. Signed-off-by: Don Slutz --- target-i386/cpu.c | 29 + 1 files changed, 29 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 25ca986..451de12 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -1166,6 +1166,32 @@ static void x86_cpuid_set_tsc_freq(Object *obj, Visitor *v, void *opaque, cpu->env.tsc_khz = value / 1000; } +static void x86_cpuid_get_hv_level(Object *obj, Visitor *v, void *opaque, +const char *name, Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); + +visit_type_uint32(v, &cpu->env.cpuid_hv_level, name, errp); +} + +static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque, +const char *name, Error **errp) +{ +X86CPU *cpu = X86_CPU(obj); +uint32_t value; + +visit_type_uint32(v, &value, name, errp); +if (error_is_set(errp)) { +return; +} + +if (value != 0 && value < 0x40000000) { +value += 0x40000000; +} +cpu->env.cpuid_hv_level = value; +cpu->env.cpuid_hv_level_set = true; +} + #if !defined(CONFIG_USER_ONLY) static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque, const char *name, Error **errp) @@ -2061,6 +2087,9 @@ static void x86_cpu_initfn(Object *obj) object_property_add(obj, "enforce", "bool", x86_cpuid_get_enforce, x86_cpuid_set_enforce, NULL, NULL, NULL); +object_property_add(obj, "hypervisor-level", "int", +x86_cpuid_get_hv_level, +x86_cpuid_set_hv_level, NULL, NULL, NULL); #if !defined(CONFIG_USER_ONLY) object_property_add(obj, "hv_spinlocks", "int", x86_get_hv_spinlocks, -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 02/16] target-i386: Add Hypervisor level.
Also known as Paravirtualization level or maximum cpuid function present in this leaf. This is just the EAX value for 0x40000000. QEMU knows this is KVM_CPUID_SIGNATURE (0x40000000). This is based on: Microsoft Hypervisor CPUID Leaves: http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx Linux kernel change starts with: http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html Also: http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html VMware documentation on CPUIDs (Mechanisms to determine if software is running in a VMware virtual machine): http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458 QEMU has the value HYPERV_CPUID_MIN defined. Signed-off-by: Don Slutz --- target-i386/cpu.h |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 5265c5a..1899f69 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -777,11 +777,14 @@ typedef struct CPUX86State { uint32_t cpuid_ext3_features; uint32_t cpuid_apic_id; bool cpuid_vendor_override; +bool cpuid_hv_level_set; /* Store the results of Centaur's CPUID instructions */ uint32_t cpuid_xlevel2; uint32_t cpuid_ext4_features; /* Flags from CPUID[EAX=7,ECX=0].EBX */ uint32_t cpuid_7_0_ebx; +/* Hypervisor CPUIDs */ +uint32_t cpuid_hv_level; /* MTRRs */ uint64_t mtrr_fixed[11]; -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 01/16] target-i386: Add missing kvm bits.
Signed-off-by: Don Slutz --- target-i386/cpu.c | 12 1 files changed, 8 insertions(+), 4 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 0313cf5..25ca986 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -87,10 +87,14 @@ static const char *ext3_feature_name[] = { }; static const char *kvm_feature_name[] = { -"kvmclock", "kvm_nopiodelay", "kvm_mmu", "kvmclock", "kvm_asyncpf", NULL, "kvm_pv_eoi", NULL, -NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, -NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, -NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, +"kvmclock", "kvm_nopiodelay", "kvm_mmu", "kvmclock", +"kvm_asyncpf", "kvm_steal_time", "kvm_pv_eoi", NULL, +NULL, NULL, NULL, NULL, +NULL, NULL, NULL, NULL, +NULL, NULL, NULL, NULL, +NULL, NULL, NULL, NULL, +"kvm_clock_stable", NULL, NULL, NULL, +NULL, NULL, NULL, NULL, }; static const char *svm_feature_name[] = { -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6 00/16] Allow changing of Hypervisor CPUIDs.
Also known as Paravirtualization CPUIDs. This is primarily done so that the guest will think it is running under vmware when hypervisor-vendor=vmware is specified as a property of a cpu. This depends on: http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg01400.html As far as I know it is #4. It depends on (1) and (2) and (3). This change is based on: Microsoft Hypervisor CPUID Leaves: http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx Linux kernel change starts with: http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html Also: http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html VMware documentation on CPUIDs (Mechanisms to determine if software is running in a VMware virtual machine): http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458 Changes from v5 to v6: Split out 01/17: target-i386: Allow tsc-frequency to be larger then 2.147G It has been accepted as a trivial patch: http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg03959.html Blue Swirl: Fix 2 checkpatch.pl "WARNING: line over 80 characters". Changes from v4 to v5: Undo kvm_clock2 change. Add cpuid_hv_level_set; cpuid_hv_level == 0 is now valid. Add cpuid_hv_vendor_set; the null string is now valid. Handle kvm and cpuid_hv_level == 0. hypervisor-vendor=kvm,hypervisor-level=0 and hypervisor-level=0,hypervisor-vendor=kvm now do the same thing. Changes from v3 to v4: Added CPUID_HV_LEVEL_HYPERV, CPUID_HV_LEVEL_KVM. Added CPUID_HV_VENDOR_HYPERV. Added hyperv as known hypervisor-vendor. Allow hypervisor-level to be 0. Changes from v2 to v3: Clean post to qemu-devel. Changes from v1 to v2: 1) Added 1/4 from http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg05153.html Because Fred is changing jobs and so will not be pushing to get this in. It needed to be rebased, And I needed it to complete the testing of this change. 2) Added 2/4 because of the re-work I needed a way to clear all KVM bits, 3) The rework of v1. Make it fit into the object model re-work of cpu.c for x86. 4) Added 3/4 -- The split out of the code that is not needed for accel=kvm. Changes from v2 to v3: Marcelo Tosatti: Its one big patch, better split in logically correlated patches (with better changelog). This would help reviewers. So split 3 and 4 into 3 to 17. More info in change log. No code change. Don Slutz (16): target-i386: Add missing kvm bits. target-i386: Add Hypervisor level. target-i386: Add cpu object access routines for Hypervisor level. target-i386: Add x86_set_hyperv. target-i386: Use Hypervisor level in -machine pc,accel=kvm. target-i386: Use Hypervisor level in -machine pc,accel=tcg. target-i386: Add Hypervisor vendor. target-i386: Add cpu object access routines for Hypervisor vendor. target-i386: Use Hypervisor vendor in -machine pc,accel=kvm. target-i386: Use Hypervisor vendor in -machine pc,accel=tcg. target-i386: Add some known names to Hypervisor vendor. target-i386: Add optional Hypervisor leaf extra. target-i386: Add cpu object access routines for Hypervisor leaf extra. target-i386: Add setting of Hypervisor leaf extra for known vmware4. target-i386: Use Hypervisor leaf extra in -machine pc,accel=kvm. target-i386: Use Hypervisor leaf extra in -machine pc,accel=tcg.
target-i386/cpu.c | 285 - target-i386/cpu.h | 34 +++ target-i386/kvm.c | 33 +- 3 files changed, 342 insertions(+), 10 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
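As a usage sketch, the VMware case that motivates the series would be requested roughly like this (illustrative command line; the exact -cpu plumbing for these properties depends on the prerequisite series referenced above):

    qemu-system-x86_64 -machine pc,accel=kvm \
        -cpu qemu64,hypervisor-vendor=vmware4,tsc-frequency=2000000000

With hypervisor-vendor=vmware4 the hypervisor level defaults to 0x40000010 (patch 11/16), and with a non-zero TSC frequency patch 14/16 fills in the timing leaf, so a guest's VMware detection and TSC calibration paths should both find what they expect.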
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/24/2012 07:24 PM, Peter Zijlstra wrote: > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote: >> However Rik had a genuine concern in the cases where runqueue is not >> equally distributed and lockholder might actually be on a different run >> queue but not running. > Load should eventually get distributed equally -- that's what the > load-balancer is for -- so this is a temporary situation. We already try > and favour the non running vcpu in this case, that's what > yield_to_task_fair() is about. If its still not eligible to run, tough luck. Yes, I agree. >> Do you think instead of using rq->nr_running, we could get a global sense >> of load using avenrun (something like avenrun/num_onlinecpus) > To what purpose? Also, global stuff is expensive, so you should try and > stay away from it as hard as you possibly can. Yes, that concern is what made me fall back to rq->nr_running. Will come back with the result soon. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4] kvm/fpu: Enable fully eager restore kvm FPU
On 09/24/2012 05:28 AM, Xudong Hao wrote: > Enable KVM FPU fully eager restore, if there is other FPU state which isn't > tracked by CR0.TS bit. > > v4 changes from v3: > - Wrap up some confused code with a clear functio lazy_fpu_allowed() > - Update fpu while update cr4 too. > > v3 changes from v2: > - Make fpu active explicitly while guest xsave is enabling and non-lazy xstate > bit exist. > > v2 changes from v1: > - Expand KVM_XSTATE_LAZY to 64 bits before negating it. > > int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata); > int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data); > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c > index 818fceb..fbdb44a 100644 > --- a/arch/x86/kvm/svm.c > +++ b/arch/x86/kvm/svm.c > @@ -2941,6 +2941,7 @@ static int cr_interception(struct vcpu_svm *svm) > break; > case 4: > err = kvm_set_cr4(&svm->vcpu, val); > + update_lazy_fpu(vcpu); > break; > case 8: > err = kvm_set_cr8(&svm->vcpu, val); > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 30bcb95..b3880c0 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -4488,6 +4488,7 @@ static int handle_cr(struct kvm_vcpu *vcpu) > return 1; > case 4: > err = handle_set_cr4(vcpu, val); > + update_lazy_fpu(vcpu); > kvm_complete_insn_gp(vcpu, err); > return 1; Why not in kvm_set_cr4()? > case 8: { > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index fc2a0a1..2e14cec 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -544,6 +544,31 @@ void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw) > } > EXPORT_SYMBOL_GPL(kvm_lmsw); > > +/* > + * KVM trigger FPU restore by #NM (via CR0.TS), > + * only XCR0.bit0, XCR0.bit1, XCR0.bit2 is tracked > + * by TS bit, there might be other FPU state is not tracked > + * by TS bit. > + * This function lazy_fpu_allowed() return true with any of > + * the following cases: 1)xsave isn't enabled in guest; > + * 2)all guest FPU states can be tracked by TS bit. > + */ > +static bool lazy_fpu_allowed(struct kvm_vcpu *vcpu) > +{ > + return !!(!kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) || > + !(vcpu->arch.xcr0 & ~((u64)KVM_XSTATE_LAZY))); > +} !! is !needed, bool conversion takes care of it. > + > +void update_lazy_fpu(struct kvm_vcpu *vcpu) > +{ > + if (lazy_fpu_allowed(vcpu)) { > + vcpu->fpu_active = 0; > + kvm_x86_ops->fpu_deactivate(vcpu); > + } There is no need to deactivate the fpu here. We try to deactivate the fpu as late as possible, preempt notifiers or vcpu_put will do that for us. > + else > + kvm_x86_ops->fpu_activate(vcpu); > +} > + > int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr) > { > u64 xcr0; > @@ -571,6 +596,7 @@ int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr) > kvm_inject_gp(vcpu, 0); > return 1; > } > + update_lazy_fpu(vcpu); > return 0; > } > EXPORT_SYMBOL_GPL(kvm_set_xcr); > @@ -5985,7 +6011,11 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) > vcpu->guest_fpu_loaded = 0; > fpu_save_init(&vcpu->arch.guest_fpu); > ++vcpu->stat.fpu_reload; > - kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); > + /* > + * Make deactivate request while allow fpu lazy restore. > + */ > + if (lazy_fpu_allowed(vcpu)) > + kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); > trace_kvm_fpu(0); > } > > btw, it is clear that long term the fpu will always be eagerly loaded, as hosts and guests (and hardware) are updated. At that time it will make sense to remove the lazy fpu code entirely. 
But maybe that time is here already, since exits are rare and so the guest has a lot of chance to use the fpu, so eager fpu saves the #NM vmexit. Can you check a kernel compile on a westmere system? If eager fpu is faster there than lazy fpu, we can just make the fpu always eager and remove quite a bit of code. It will slow down guests not using in-kernel apic, or guests that just process interrupts in the kernel and then HLT, or maybe i386 guests, but I think it's worth it. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: x86: Skip request checking branches in vcpu_enter_guest() more effectively
On Mon, 24 Sep 2012 12:18:15 +0200 Avi Kivity wrote: > On 09/24/2012 08:24 AM, Takuya Yoshikawa wrote: > > This is an RFC since I have not done any comparison with the approach > > using for_each_set_bit() which can be seen in Avi's work. > > > > Takuya > > --- > > > > We did a simple test to see which requests we would get at the same time > > in vcpu_enter_guest() and got the following numbers: > > > > (N) (E) (S) (ES) others > > 22.3% 30.7% 16.0% 29.5% 1.4% > > > > (N) : Nothing > > (E) : Only KVM_REQ_EVENT > > (S) : Only KVM_REQ_STEAL_UPDATE > > (ES): Only KVM_REQ_EVENT and KVM_REQ_STEAL_UPDATE > > > > * Note that the exact numbers can change for other guests. > > What was the workload? STEAL_UPDATE is done on schedules and > heavyweight exit (erroneously), so it should be rare. Just ping'ing to the host. But even without that, I got many STEAL_UPDATE's and KVM_REQ_EVENT's. > Or maybe we're recording HLT time as steal time? Not sure. BTW, schedule() is really rare? We do either cond_resched() or heavy weight exit, no? I always see vcpu threads actively move around the cores. (When I do not pin them.) > > This motivated us to optimize the following code in vcpu_enter_guest(): > > > > if (vcpu->requests) { /** (1) **/ > > ... > > if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu)) /** (2) **/ > > record_steal_time(vcpu); > > ... > > } > > > > if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) { > > ... > > } > > > > - For case (E), we do kvm_check_request() for every request other than > >KVM_REQ_EVENT in the block (1) and always get false. > > - For case (S) and (ES), the only difference from case (E) is that we > >get true for KVM_REQ_STEAL_UPDATE. > > > > This means that we were wasting a lot of time for the many if branches > > in the block (1) even for these trivial three cases which dominated more > > than 75% in total. > > > > This patch mitigates the issue as follows: > > - For case (E), we change the first if condition to > >if (vcpu->requests & ~(1 << KVM_REQ_EVENT)) /** (1') **/ > >so that we can skip the block completely. > > - For case (S) and (ES), we move the check (2) upwards, out of the > >block (1), to clear the KVM_REQ_STEAL_UPDATE flag before doing the > >check (1'). > > > > Although this adds one if branch for case (N), the fact that steal > > update occurs frequently enough except when we give each vcpu a > > dedicated core justifies its tiny cost. > > Modern processors will eliminate KVM_REQ_EVENT in many cases, so the > optimization is wasted on them. Then, my Nehalem server was not so modern. I did something like this: if requests == KVM_REQ_EVENT ++counter1; if requests == KVM_REQ_STEAL_UPDATE ++counter2; ... in vcpu_enter_guest() and saw KVM_REQ_EVENT many times. > Do you have numbers? Just for your patch, not my alternative. Not now. Thanks, Takuya -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: x86: Skip request checking branches in vcpu_enter_guest() more effectively
On Mon, 24 Sep 2012 09:16:12 +0200 Gleb Natapov wrote:

> Yes, for guests that do not enable steal time KVM_REQ_STEAL_UPDATE
> should never be set, but currently it is. The patch (not tested) should
> fix this.

Nice!

Takuya

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 901ad00..01572f5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1544,6 +1544,8 @@ static void accumulate_steal_time(struct kvm_vcpu *vcpu)
>  	delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
>  	vcpu->arch.st.last_steal = current->sched_info.run_delay;
>  	vcpu->arch.st.accum_steal = delta;
> +
> +	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
>  }
>  
>  static void record_steal_time(struct kvm_vcpu *vcpu)
> @@ -1673,8 +1675,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>  		accumulate_steal_time(vcpu);
>  		preempt_enable();
>  
> -		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
> -
>  		break;
>  	case MSR_KVM_PV_EOI_EN:
>  		if (kvm_lapic_enable_pv_eoi(vcpu, data))
> @@ -2336,7 +2336,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	}
>  
>  	accumulate_steal_time(vcpu);
> -	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
>  }
>  
>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>
> -- 
> 	Gleb.

-- 
Takuya Yoshikawa
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
> However Rik had a genuine concern in the cases where the runqueue is not
> equally distributed and the lock holder might actually be on a different run
> queue but not running.

Load should eventually get distributed equally -- that's what the load-balancer is for -- so this is a temporary situation. We already try and favour the non-running vcpu in this case; that's what yield_to_task_fair() is about. If it's still not eligible to run, tough luck.

> Do you think instead of using rq->nr_running, we could get a global
> sense of load using avenrun (something like avenrun/num_online_cpus)

To what purpose? Also, global stuff is expensive, so you should try and stay away from it as hard as you possibly can.
Re: [RFC PATCH] KVM: x86: Skip request checking branches in vcpu_enter_guest() more effectively
On Mon, 24 Sep 2012 14:59:44 +0800 Xiao Guangrong wrote:

> On 09/24/2012 02:24 PM, Takuya Yoshikawa wrote:
> > This is an RFC since I have not done any comparison with the approach
> > using for_each_set_bit() which can be seen in Avi's work.
>
> Why not compare it? I think for_each_set_bit is better and it can

I did not have enough time. Sure, I should do some comparison.

But I also wanted to hear whether the request pattern I got was normal. So this is not a request for merging: I rather wanted to share the numbers I got.

Thanks,
Takuya
Re: [PATCH RFC 2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler
On Fri, 21 Sep 2012 23:15:40 +0530 Raghavendra K T wrote:

> >> How about doing cond_resched() instead?
> >
> > Actually, an actual call to yield() may be better.
> >
> > That will set scheduler hints to make the scheduler pick
> > another task for one round, while preserving this task's
> > top position in the runqueue.
>
> I am not a scheduler expert, but I am also inclined towards
> Rik's suggestion here since we set the skip buddy here. Takuya?

Yes, I think it's better. But I hope that the experts in Cc will suggest the best way.

Thanks,
Takuya
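A minimal sketch of the substitution being discussed, assuming it sits in the PLE handler's "nothing to yield to" fallback; the placement and the yielded variable are assumptions, not the posted patch:

  /* What the RFC does today (roughly): */
  if (!yielded)
  	cond_resched();	/* only drops in if NEED_RESCHED is already set */

  /* What Rik suggests instead: */
  if (!yielded)
  	yield();	/* marks us with the CFS skip buddy, so the scheduler
  			 * picks another runnable task for one round while
  			 * keeping our position in the runqueue */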
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/24/2012 06:06 PM, Peter Zijlstra wrote:

On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:

On 09/24/2012 05:04 PM, Peter Zijlstra wrote:

On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:

In some special scenarios like #vcpu <= #pcpu, PLE handler may prove very costly, because there is no need to iterate over vcpus and do unsuccessful yield_to burning CPU.

What's the costly thing? The vm-exit, the yield (which should be a nop if it's the only task there) or something else entirely?

Both the vmexit and yield_to() actually, because an unsuccessful yield_to() overall is costly in the PLE handler.

This is because when we have large guests, say 32/16 vcpus, and one vcpu is holding the lock while the rest of the vcpus are waiting for it, when they do a PL-exit each vcpu tries to iterate over the rest of the vcpu list in the VM and do a directed yield (unsuccessful). (O(n^2) tries.)

This results in a fairly high amount of cpu burning and double run queue lock contention.

(If they were spinning, probably lock progress would have been faster.) As Avi/Chegu Vinod had felt, it is better to avoid the vmexit itself, which seems a little complex to achieve currently.

OK, so the vmexit stays and we need to improve yield_to. How about something like the below, that would allow breaking out of the for-each-vcpu loop and simply going back into the vm, right?

---
 kernel/sched/core.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b38f00e..5d5b355 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4272,7 +4272,10 @@ EXPORT_SYMBOL(yield);
  * It's the caller's job to ensure that the target task struct
  * can't go away on us before we can do any checks.
  *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *	true (>0) if we indeed boosted the target task.
+ *	false (0) if we failed to boost the target.
+ *	-ESRCH if there's no task to yield to.
  */
 bool __sched yield_to(struct task_struct *p, bool preempt)
 {
@@ -4284,6 +4287,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 	local_irq_save(flags);
 	rq = this_rq();
 
+	/*
+	 * If we're the only runnable task on the rq, there's absolutely no
+	 * point in yielding.
+	 */
+	if (rq->nr_running == 1) {
+		yielded = -ESRCH;
+		goto out_irq;
+	}
+
 again:
 	p_rq = task_rq(p);
 	double_rq_lock(rq, p_rq);
@@ -4293,13 +4305,13 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 	}
 
 	if (!curr->sched_class->yield_to_task)
-		goto out;
+		goto out_unlock;
 
 	if (curr->sched_class != p->sched_class)
-		goto out;
+		goto out_unlock;
 
 	if (task_running(p_rq, p) || p->state)
-		goto out;
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4312,11 +4324,12 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 		resched_task(p_rq->curr);
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
-	if (yielded > 0)
+	if (yielded > 0)
 		schedule();
 
 	return yielded;

Yes, I think this is a nice idea. Any future users of yield_to would also benefit from this. We will have to iterate only till the first attempt to yield_to. I'll run the test with this patch.

However Rik had a genuine concern in the cases where the runqueue is not equally distributed and the lock holder might actually be on a different run queue but not running.

Do you think instead of using rq->nr_running, we could get a global sense of load using avenrun (something like avenrun/num_online_cpus)?
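For concreteness, a hedged sketch of how the PLE handler could consume the new tri-state return value; kvm_for_each_vcpu(), kvm_vcpu_yield_to() and last_boosted_vcpu exist in virt/kvm/kvm_main.c, but the exact break-out placement and the elided eligibility checks are assumptions, not a posted patch:

  /* Inside kvm_vcpu_on_spin(me), iterating candidate boost targets. */
  kvm_for_each_vcpu(i, vcpu, kvm) {
  	if (vcpu == me)
  		continue;
  	if (waitqueue_active(&vcpu->wq))
  		continue;		/* sleeping vcpus are not lock holders */

  	yielded = kvm_vcpu_yield_to(vcpu);	/* propagates yield_to()'s value */
  	if (yielded > 0) {
  		kvm->last_boosted_vcpu = i;
  		break;			/* boosted someone; done */
  	}
  	if (yielded == -ESRCH)
  		break;			/* alone on our runqueue: stop trying
  					 * and simply go back into the guest */
  }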
Re: [Qemu-devel] [PATCH v5 00/17] Allow changing of Hypervisor CPUIDs.
On 09/22/12 09:18, Blue Swirl wrote:

On Sat, Sep 22, 2012 at 12:13 AM, Don Slutz wrote:

Also known as Paravirtualization CPUIDs.

This is primarily done so that the guest will think it is running under vmware when hypervisor-vendor=vmware is specified as a property of a cpu.

Please use checkpatch.pl to check for missing braces etc.

I have been. Somehow I missed the warnings on the posted patches. v6 is in the works.

This depends on:
http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg01400.html
As far as I know it is #4. It depends on (1) and (2) and (3).

This change is based on:

Microsoft Hypervisor CPUID Leaves:
http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx

Linux kernel change starts with:
http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html
Also:
http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html

VMware documentation on CPUIDs (Mechanisms to determine if software is running in a VMware virtual machine):
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458

Changes from v4 to v5:
  Undo kvm_clock2 change.
  Add cpuid_hv_level_set; cpuid_hv_level == 0 is now valid.
  Add cpuid_hv_vendor_set; the null string is now valid.
  Handle kvm and cpuid_hv_level == 0.
  hypervisor-vendor=kvm,hypervisor-level=0 and hypervisor-level=0,hypervisor-vendor=kvm now do the same thing.

Changes from v3 to v4:
  Added CPUID_HV_LEVEL_HYPERV, CPUID_HV_LEVEL_KVM.
  Added CPUID_HV_VENDOR_HYPERV.
  Added hyperv as a known hypervisor-vendor.
  Allow hypervisor-level to be 0.

Changes from v2 to v3:
  Clean post to qemu-devel.

Changes from v1 to v2:
  1) Added 1/4 from http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg05153.html
     because Fred is changing jobs and so will not be pushing to get this in.
     It needed to be rebased, and I needed it to complete the testing of this change.
  2) Added 2/4 because of the re-work I needed a way to clear all KVM bits.
  3) The rework of v1. Make it fit into the object model re-work of cpu.c for x86.
  4) Added 3/4 -- the split out of the code that is not needed for accel=kvm.

Changes from v2 to v3:
  Marcelo Tosatti: It's one big patch, better split into logically correlated patches (with better changelog). This would help reviewers.
  So split 3 and 4 into 3 to 17. More info in the change log. No code change.

Don Slutz (17):
  target-i386: Allow tsc-frequency to be larger then 2.147G
  target-i386: Add missing kvm bits.
  target-i386: Add Hypervisor level.
  target-i386: Add cpu object access routines for Hypervisor level.
  target-i386: Add cpu object access routines for Hypervisor level.
  target-i386: Use Hypervisor level in -machine pc,accel=kvm.
  target-i386: Use Hypervisor level in -machine pc,accel=tcg.
  target-i386: Add Hypervisor vendor.
  target-i386: Add cpu object access routines for Hypervisor vendor.
  target-i386: Use Hypervisor vendor in -machine pc,accel=kvm.
  target-i386: Use Hypervisor vendor in -machine pc,accel=tcg.
  target-i386: Add some known names to Hypervisor vendor.
  target-i386: Add optional Hypervisor leaf extra.
  target-i386: Add cpu object access routines for Hypervisor leaf extra.
  target-i386: Add setting of Hypervisor leaf extra for known vmare4.
  target-i386: Use Hypervisor leaf extra in -machine pc,accel=kvm.
  target-i386: Use Hypervisor leaf extra in -machine pc,accel=tcg.

 target-i386/cpu.c | 285 -
 target-i386/cpu.h |  34 +++
 target-i386/kvm.c |  33 +-
 3 files changed, 341 insertions(+), 11 deletions(-)

-Don
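A hedged illustration of the command-line surface described in this cover letter; the property spellings come from the changelog above, but treat the exact final syntax as an assumption until the series is merged:

  # make the guest see VMware's hypervisor CPUID leaves
  qemu-system-x86_64 -machine pc,accel=kvm \
      -cpu qemu64,hypervisor-vendor=vmware

  # per the v5 notes, these two spellings now do the same thing
  -cpu qemu64,hypervisor-vendor=kvm,hypervisor-level=0
  -cpu qemu64,hypervisor-level=0,hypervisor-vendor=kvm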
Re: [PATCH 10/10] KVM: PPC: Book3S HV: Fix calculation of guest phys address for MMIO emulation
On 21.09.2012, at 07:39, Paul Mackerras wrote:

> In the case where the host kernel is using a 64kB base page size and
> the guest uses a 4k HPTE (hashed page table entry) to map an emulated
> MMIO device, we were calculating the guest physical address wrongly.
> We were calculating a gfn as the guest physical address shifted right
> 16 bits (PAGE_SHIFT) but then only adding back in 12 bits from the
> effective address, since the HPTE had a 4k page size. Thus the gpa
> reported to userspace was missing 4 bits.
>
> Instead, we now compute the guest physical address from the HPTE
> without reference to the host page size, and then compute the gfn
> by shifting the gpa right PAGE_SHIFT bits.
>
> Reported-by: Alexey Kardashevskiy
> Signed-off-by: Paul Mackerras

Thanks, applied to kvm-ppc-next.

Alex
Re: [PATCH 04/10] KVM: PPC: Book3S HV: Remove bogus update of physical thread IDs
On 21.09.2012, at 07:36, Paul Mackerras wrote:

> When making a vcpu non-runnable we incorrectly changed the
> thread IDs of all other threads on the core, just remove that
> code.
>
> Signed-off-by: Benjamin Herrenschmidt
> Signed-off-by: Paul Mackerras

Thanks, applied to kvm-ppc-next.

Alex
Re: [PATCH 03/10] KVM: PPC: Book3S HV: Fix updates of vcpu->cpu
On 21.09.2012, at 07:35, Paul Mackerras wrote:

> This removes the powerpc "generic" updates of vcpu->cpu in load and
> put, and moves them to the various backends.
>
> The reason is that "HV" KVM does its own sauce with that field
> and the generic updates might corrupt it. The field contains the
> CPU# of the -first- HW CPU of the core always for all the VCPU
> threads of a core (the one that's online from a host Linux
> perspective).
>
> However, the preempt notifiers are going to be called on the
> threads VCPUs when they are running (due to them sleeping on our
> private waitqueue) causing unload to be called, potentially
> clobbering the value.
>
> Signed-off-by: Benjamin Herrenschmidt
> Signed-off-by: Paul Mackerras

Thanks, applied to kvm-ppc-next.

Alex
Re: [PATCH 06/10] KVM: PPC: Book3s HV: Don't access runnable threads list without vcore lock
On 21.09.2012, at 07:37, Paul Mackerras wrote:

> There were a few places where we were traversing the list of runnable
> threads in a virtual core, i.e. vc->runnable_threads, without holding
> the vcore spinlock. This extends the places where we hold the vcore
> spinlock to cover everywhere that we traverse that list.
>
> Since we possibly need to sleep inside kvmppc_book3s_hv_page_fault,
> this moves the call of it from kvmppc_handle_exit out to
> kvmppc_vcpu_run, where we don't hold the vcore lock.
>
> In kvmppc_vcore_blocked, we don't actually need to check whether
> all vcpus are ceded and don't have any pending exceptions, since the
> caller has already done that. The caller (kvmppc_run_vcpu) wasn't
> actually checking for pending exceptions, so we add that.
>
> The change of if to while in kvmppc_run_vcpu is to make sure that we
> never call kvmppc_remove_runnable() when the vcore state is RUNNING or
> EXITING.
>
> Signed-off-by: Paul Mackerras
> ---
>  arch/powerpc/include/asm/kvm_asm.h |  1 +
>  arch/powerpc/kvm/book3s_hv.c       | 64 +---
>  2 files changed, 31 insertions(+), 34 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_asm.h b/arch/powerpc/include/asm/kvm_asm.h
> index 76fdcfe..fb99a21 100644
> --- a/arch/powerpc/include/asm/kvm_asm.h
> +++ b/arch/powerpc/include/asm/kvm_asm.h
> @@ -123,6 +123,7 @@
>  #define RESUME_GUEST_NV		RESUME_FLAG_NV
>  #define RESUME_HOST		RESUME_FLAG_HOST
>  #define RESUME_HOST_NV		(RESUME_FLAG_HOST|RESUME_FLAG_NV)
> +#define RESUME_PAGE_FAULT	(1<<2)

I would actually prefer if you could move this to core-specific code. How about

  #define RESUME_ARCH1	(1 << 2)

and then in book3s_hv.c:

  #define RESUME_PAGE_FAULT	(RESUME_GUEST | RESUME_ARCH1)

Alex
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:
> On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
> > On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
> >> In some special scenarios like #vcpu <= #pcpu, PLE handler may
> >> prove very costly, because there is no need to iterate over vcpus
> >> and do unsuccessful yield_to burning CPU.
> >
> > What's the costly thing? The vm-exit, the yield (which should be a nop
> > if it's the only task there) or something else entirely?
>
> Both the vmexit and yield_to() actually, because an unsuccessful
> yield_to() overall is costly in the PLE handler.
>
> This is because when we have large guests, say 32/16 vcpus, and one
> vcpu is holding the lock while the rest of the vcpus are waiting for it,
> when they do a PL-exit each vcpu tries to iterate over the rest of the
> vcpu list in the VM and do a directed yield (unsuccessful). (O(n^2) tries.)
>
> This results in a fairly high amount of cpu burning and double run queue
> lock contention.
>
> (If they were spinning, probably lock progress would have been faster.)
> As Avi/Chegu Vinod had felt, it is better to avoid the vmexit itself,
> which seems a little complex to achieve currently.

OK, so the vmexit stays and we need to improve yield_to. How about something like the below, that would allow breaking out of the for-each-vcpu loop and simply going back into the vm, right?

---
 kernel/sched/core.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b38f00e..5d5b355 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4272,7 +4272,10 @@ EXPORT_SYMBOL(yield);
  * It's the caller's job to ensure that the target task struct
  * can't go away on us before we can do any checks.
  *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *	true (>0) if we indeed boosted the target task.
+ *	false (0) if we failed to boost the target.
+ *	-ESRCH if there's no task to yield to.
  */
 bool __sched yield_to(struct task_struct *p, bool preempt)
 {
@@ -4284,6 +4287,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 	local_irq_save(flags);
 	rq = this_rq();
 
+	/*
+	 * If we're the only runnable task on the rq, there's absolutely no
+	 * point in yielding.
+	 */
+	if (rq->nr_running == 1) {
+		yielded = -ESRCH;
+		goto out_irq;
+	}
+
 again:
 	p_rq = task_rq(p);
 	double_rq_lock(rq, p_rq);
@@ -4293,13 +4305,13 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 	}
 
 	if (!curr->sched_class->yield_to_task)
-		goto out;
+		goto out_unlock;
 
 	if (curr->sched_class != p->sched_class)
-		goto out;
+		goto out_unlock;
 
 	if (task_running(p_rq, p) || p->state)
-		goto out;
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4312,11 +4324,12 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 		resched_task(p_rq->curr);
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
-	if (yielded)
+	if (yielded > 0)
 		schedule();
 
 	return yielded;
Re: [PATCH v3 1/7] KVM: MMU: fix release noslot pfn
On 09/24/2012 08:04 PM, Gleb Natapov wrote:
> On Mon, Sep 24, 2012 at 07:49:37PM +0800, Xiao Guangrong wrote:
>> On 09/24/2012 07:24 PM, Gleb Natapov wrote:
>>> On Mon, Sep 24, 2012 at 12:59:32PM +0800, Xiao Guangrong wrote:
>>>> On 09/23/2012 05:13 PM, Gleb Natapov wrote:
>>>>> On Fri, Sep 21, 2012 at 02:57:19PM +0800, Xiao Guangrong wrote:
>>>>>> We can not directly call kvm_release_pfn_clean to release the pfn
>>>>>> since we can meet a noslot pfn, which is used to cache mmio info
>>>>>> into the spte
>>>>>
>>>>> Wouldn't it be better to move the check into kvm_release_pfn_clean()?
>>>>
>>>> I think there is no reason for us to prefer adding this branch in
>>>> the common code. :)
>>>
>>> Is the function performance critical? Is the function called without the
>>> check on a hot path? The function already contains the much heavier
>>> kvm_is_mmio_pfn() check. If most/all invocations require the check before
>>> the call it's better to move it inside.
>>
>> It is not most/all callers that need to do this check - it is only needed
>> on the x86 mmu page-fault/prefetch path.
>
> At least on x86 there are 7 calls to kvm_release_pfn_clean(), 5 of them
> guarded by is_noslot_pfn() (after this patch)

3 places after the whole patchset. (There are some cleanups after this patch.)

> and one by the even stronger is_error_pfn().

This one is:

|	if (!is_error_pfn(pfn)) {
|		kvm_release_pfn_clean(pfn);
|		return true;
|	}
|
|	return false;

We can change it to:

|	if (is_error_pfn(pfn))
|		return false;
|
|	kvm_release_pfn_clean(pfn);
|	return true;

> I guess when/if other architectures add MMIO MMU caching they will need to
> guard kvm_release_pfn_clean() by is_noslot_pfn() too in most cases. I am
> not insisting, but as this patch shows it is easy to miss the check before
> calling the function.

Sounds reasonable. I will consider it if Avi/Marcelo have no objection to it.
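For concreteness, a hedged sketch of Gleb's suggestion -- guarding inside the helper so callers cannot forget the check. is_noslot_pfn() and kvm_is_mmio_pfn() are real helpers from this era of virt/kvm/kvm_main.c, but the exact body and placement here are assumptions, not a posted patch:

  void kvm_release_pfn_clean(pfn_t pfn)
  {
  	if (is_noslot_pfn(pfn))		/* mmio-cache sentinel: no page behind it */
  		return;

  	if (!kvm_is_mmio_pfn(pfn))
  		put_page(pfn_to_page(pfn));
  }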
Re: [PATCH v3 1/2] KVM: PPC: Book3S: Get/set guest SPRs using the GET/SET_ONE_REG interface
On 24.09.2012, at 14:16, Paul Mackerras wrote:

> On Fri, Sep 21, 2012 at 11:15:51AM +0200, Alexander Graf wrote:
>
>> So how about something like
>>
>> #define kvmppc_set_reg(id, val, reg) { \
>>     switch (one_reg_size(id)) { \
>>     case 4: val.wval = reg; break; \
>>     case 8: val.dval = reg; break; \
>>     default: BUG(); \
>>     } \
>> }
>>
>> case KVM_REG_PPC_DAR:
>>     kvmppc_set_reg(reg->id, &val, vcpu->arch.shared->dar);
>>     break;
>
> I tried that and it was fine for the wval vs. dval thing, and gcc even
> compiled it to the same number of instructions as what I had before.
> But it breaks down when you get to VMX and VSX -- firstly because
> there are two different types that are both 16 bytes, and secondly
> because when you do this:
>
> #define kvmppc_get_reg(id, val, reg) { \
>     switch (one_reg_size(id)) { \
>     case 4: val.wval = reg; break; \
>     case 8: val.dval = reg; break; \
>     case 16: val.vval = reg; break; \
>     default: BUG(); \
>     } \
> }
>
> you get compile errors on things like:
>
> kvmppc_set_reg(reg->id, val, vcpu->arch.shared->dar);
>
> because val.vval is a vector128, and vcpu->arch.shared->dar is a u64,
> and you can't assign a u64 to a vector128 since vector128 is a struct.
> In other words, all of the cases of the switch have to satisfy the
> type rules even though only one of them will ever be used. Basically
> that trick will only work for integer types, and we don't have 128 bit
> integer support in gcc.

What a shame. Could you please repost a version that only handles 32/64-bit setting with the above helper and leaves the 128-bit ones the way they are implemented now?

> Given all that, I would like to see my patches go in while we continue
> to search for a way to avoid the potential mistakes you're talking
> about.

... that way we can at least make the "common case" of 32-bit and 64-bit registers error-proof and only have to worry about double-checking the 128-bit ones, which are far fewer.

Alex
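A hedged sketch of the integer-only helper Alex asks for here -- essentially his macro from the quoted text minus the 16-byte case, so every remaining switch arm type-checks for integer registers. The union layout and one_reg_size() follow the thread; treat the exact form as an assumption:

  union kvmppc_one_reg {
  	u32	wval;	/*  4-byte registers */
  	u64	dval;	/*  8-byte registers */
  	/* 128-bit VMX/VSX registers stay open-coded for now */
  };

  #define kvmppc_set_reg(id, val, reg) do {		\
  	switch (one_reg_size(id)) {			\
  	case 4: (val).wval = (reg); break;		\
  	case 8: (val).dval = (reg); break;		\
  	default: BUG();					\
  	}						\
  } while (0)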
Re: [PATCH 02/10] KVM: PPC: Book3S HV: Allow KVM guests to stop secondary threads coming online
On 21.09.2012, at 07:35, Paul Mackerras wrote:

> When a Book3S HV KVM guest is running, we need the host to be in
> single-thread mode, that is, all of the cores (or at least all of
> the cores where the KVM guest could run) to be running only one
> active hardware thread. This is because of the hardware restriction
> in POWER processors that all of the hardware threads in the core
> must be in the same logical partition. Complying with this restriction
> is much easier if, from the host kernel's point of view, only one
> hardware thread is active.
>
> This adds two hooks in the SMP hotplug code to allow the KVM code to
> make sure that secondary threads (i.e. hardware threads other than
> thread 0) cannot come online while any KVM guest exists. The KVM
> code still has to check that any core where it runs a guest has the
> secondary threads offline, but having done that check it can now be
> sure that they will not come online while the guest is running.
>
> Signed-off-by: Paul Mackerras

Ben, since this touches generic ppc code, could you please ack?

Alex

> ---
>  arch/powerpc/include/asm/smp.h |  8 +++
>  arch/powerpc/kernel/smp.c      | 46 
>  arch/powerpc/kvm/book3s_hv.c   | 12 +--
>  3 files changed, 64 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
> index ebc24dc..b625a1a 100644
> --- a/arch/powerpc/include/asm/smp.h
> +++ b/arch/powerpc/include/asm/smp.h
> @@ -66,6 +66,14 @@ void generic_cpu_die(unsigned int cpu);
>  void generic_mach_cpu_die(void);
>  void generic_set_cpu_dead(unsigned int cpu);
>  int generic_check_cpu_restart(unsigned int cpu);
> +
> +extern void inhibit_secondary_onlining(void);
> +extern void uninhibit_secondary_onlining(void);
> +
> +#else /* HOTPLUG_CPU */
> +static inline void inhibit_secondary_onlining(void) {}
> +static inline void uninhibit_secondary_onlining(void) {}
> +
>  #endif
>
>  #ifdef CONFIG_PPC64
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 0321007..c45f51d 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -410,6 +410,45 @@ int generic_check_cpu_restart(unsigned int cpu)
>  {
>  	return per_cpu(cpu_state, cpu) == CPU_UP_PREPARE;
>  }
> +
> +static atomic_t secondary_inhibit_count;
> +
> +/*
> + * Don't allow secondary CPU threads to come online
> + */
> +void inhibit_secondary_onlining(void)
> +{
> +	/*
> +	 * This makes secondary_inhibit_count stable during cpu
> +	 * online/offline operations.
> +	 */
> +	get_online_cpus();
> +
> +	atomic_inc(&secondary_inhibit_count);
> +	put_online_cpus();
> +}
> +EXPORT_SYMBOL_GPL(inhibit_secondary_onlining);
> +
> +/*
> + * Allow secondary CPU threads to come online again
> + */
> +void uninhibit_secondary_onlining(void)
> +{
> +	get_online_cpus();
> +	atomic_dec(&secondary_inhibit_count);
> +	put_online_cpus();
> +}
> +EXPORT_SYMBOL_GPL(uninhibit_secondary_onlining);
> +
> +static int secondaries_inhibited(void)
> +{
> +	return atomic_read(&secondary_inhibit_count);
> +}
> +
> +#else /* HOTPLUG_CPU */
> +
> +#define secondaries_inhibited()	0
> +
>  #endif
>
>  static void cpu_idle_thread_init(unsigned int cpu, struct task_struct *idle)
> @@ -428,6 +467,13 @@ int __cpuinit __cpu_up(unsigned int cpu, struct task_struct *tidle)
>  {
>  	int rc, c;
>
> +	/*
> +	 * Don't allow secondary threads to come online if inhibited
> +	 */
> +	if (threads_per_core > 1 && secondaries_inhibited() &&
> +	    cpu % threads_per_core != 0)
> +		return -EBUSY;
> +
>  	if (smp_ops == NULL ||
>  	    (smp_ops->cpu_bootable && !smp_ops->cpu_bootable(cpu)))
>  		return -EINVAL;
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index bebf9cb..6fe1410 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -47,6 +47,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -918,8 +919,6 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
>  	/*
>  	 * Make sure we are running on thread 0, and that
>  	 * secondary threads are offline.
> -	 * XXX we should also block attempts to bring any
> -	 * secondary threads online.
>  	 */
>  	if (threads_per_core > 1 && !on_primary_thread()) {
>  		list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
> @@ -1632,11 +1631,20 @@ int kvmppc_core_init_vm(struct kvm *kvm)
>
>  	kvm->arch.using_mmu_notifiers = !!cpu_has_feature(CPU_FTR_ARCH_206);
>  	spin_lock_init(&kvm->arch.slot_phys_lock);
> +
> +	/*
> +	 * Don't allow secondary CPU threads to come online
> +	 * while any KVM VMs exist.
> +	 */
> +	inhibit_secondary_onlining();
> +
>  	return 0;
>  }
>
>  void kvmppc_core_destroy_vm(struct kvm *kvm)
>  {
> +	uninhibit_secondary_onlining(
Re: [PATCH 01/10] KVM: PPC: Book3S HV: Provide a way for userspace to get/set per-vCPU areas
On 21.09.2012, at 07:33, Paul Mackerras wrote:

> The PAPR paravirtualization interface lets guests register three
> different types of per-vCPU buffer areas in its memory for communication
> with the hypervisor. These are called virtual processor areas (VPAs).
> Currently the hypercalls to register and unregister VPAs are handled
> by KVM in the kernel, and userspace has no way to know about or save
> and restore these registrations across a migration.
>
> This adds get and set ioctls to allow userspace to see what addresses
> have been registered, and to register or unregister them. This will
> be needed for guest hibernation and migration, and is also needed so
> that userspace can unregister them on reset (otherwise we corrupt
> guest memory after reboot by writing to the VPAs registered by the
> previous kernel). We also add a capability to indicate that the
> ioctls are supported.
>
> This also fixes a bug where we were calling init_vpa unconditionally,
> leading to an oops when unregistering the VPA.
>
> Signed-off-by: Paul Mackerras

Do you think it'd be possible to map these onto ONE_REG as well? I'm slightly wary of adding an interface that is restricted to only a limited number of entries. What if some future PAPR spec wants to add another VPA to the game? We'd have to do a completely new ioctl for that one then.

However, if instead we could have 3 REGs:

  64-bit  VPA_ADDR
  128-bit VPA_SLB
  128-bit VPA_DTL

where you'd have to set VPA_ADDR first, then the other two, it gives us nice extensibility for the future too, right? We could just add another REG and everyone's happy.

Alex
Re: [PATCH v3 1/2] KVM: PPC: Book3S: Get/set guest SPRs using the GET/SET_ONE_REG interface
On Fri, Sep 21, 2012 at 11:15:51AM +0200, Alexander Graf wrote:

> So how about something like
>
> #define kvmppc_set_reg(id, val, reg) { \
>     switch (one_reg_size(id)) { \
>     case 4: val.wval = reg; break; \
>     case 8: val.dval = reg; break; \
>     default: BUG(); \
>     } \
> }
>
> case KVM_REG_PPC_DAR:
>     kvmppc_set_reg(reg->id, &val, vcpu->arch.shared->dar);
>     break;

I tried that and it was fine for the wval vs. dval thing, and gcc even compiled it to the same number of instructions as what I had before. But it breaks down when you get to VMX and VSX -- firstly because there are two different types that are both 16 bytes, and secondly because when you do this:

#define kvmppc_get_reg(id, val, reg) { \
    switch (one_reg_size(id)) { \
    case 4: val.wval = reg; break; \
    case 8: val.dval = reg; break; \
    case 16: val.vval = reg; break; \
    default: BUG(); \
    } \
}

you get compile errors on things like:

kvmppc_set_reg(reg->id, val, vcpu->arch.shared->dar);

because val.vval is a vector128, and vcpu->arch.shared->dar is a u64, and you can't assign a u64 to a vector128 since vector128 is a struct. In other words, all of the cases of the switch have to satisfy the type rules even though only one of them will ever be used. Basically that trick will only work for integer types, and we don't have 128 bit integer support in gcc.

Given all that, I would like to see my patches go in while we continue to search for a way to avoid the potential mistakes you're talking about.

Paul.
Re: [PATCH v3 0/6] Optimize vcpu->requests processing
On 09/24/2012 01:16 PM, Xiao Guangrong wrote:
> On 09/24/2012 06:52 PM, Avi Kivity wrote:
>> On 09/24/2012 12:19 PM, Xiao Guangrong wrote:
>>> On 09/24/2012 05:48 PM, Avi Kivity wrote:
>>>> On 09/24/2012 07:55 AM, Xiao Guangrong wrote:
>>>>> On 07/10/2012 01:05 AM, Avi Kivity wrote:
>>>>>> Currently, any time a request bit is set (not too uncommon) we check
>>>>>> all of them. This patchset optimizes the process slightly by skipping
>>>>>> over unset bits using for_each_set_bit().
>>>>>
>>>>> I also notice that kvm_check_request costs lots of cpu time. What is
>>>>> the status of this patchset?
>>>>
>>>> I had problems getting rid of KVM_REQ_PENDING_TIMER. I'll try again.
>>>>
>>>> In what workloads did you see kvm_check_request()?
>>>
>>> Run kernbench on the guest, and use perf to sample; this function is
>>> really hot.
>>
>> I don't see it at all. Westmere, 4-way guest.
>
> This is the result I got on my laptop:
>
> # ./perf report | grep "\[kvm\]"
>     85.59%  qemu-kvm  [kvm]  [k] arch_local_irq_enable

Something's wrong here; it should be inlined and certainly not take 86% of cpu time.

>      0.18%  qemu-kvm  [kvm]  [k] kvm_arch_vcpu_ioctl_run
>      0.10%  qemu-kvm  [kvm]  [k] paging64_walk_addr_generic
>      0.10%  qemu-kvm  [kvm]  [k] x86_decode_insn
>      0.08%  qemu-kvm  [kvm]  [k] kvm_check_request

0.08% is hardly hot.

>      0.06%  qemu-kvm  [kvm]  [k] apic_clear_vector
>
> My box: i5-2540M CPU @ 2.60GHz * 4 + 4G
> guest: 2 vcpu + 1G

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/24/2012 02:12 PM, Dor Laor wrote:

In order to help PLE and pvticketlock converge, I thought that a small test code should be developed to test this in a predictable, deterministic way.

The idea is to have a guest kernel module that spawns a new thread each time you write to a /sys/ entry. Each such thread spins over a spin lock. The specific spin lock is also chosen by the /sys/ interface. Let's say we have an array of spin locks, 10 times the number of vcpus.

All the threads are running a

  while (1) {
      spin_lock(my_lock);
      sum += execute_dummy_cpu_computation(time);
      spin_unlock(my_lock);
      if (sys_tells_thread_to_die())
          break;
  }

  print_result(sum);

Instead of calling the kernel's spin_lock functions, clone them and make the ticket lock order deterministic and known (like a linear walk of all the threads trying to catch that lock).

By cloning do you mean a hierarchy of the locks? Also I believe the time should be passed via sysfs / hardcoded for each type of lock we are mimicking.

This way you can easily calculate:
1. the score of a single vcpu running a single thread
2. the score of the sum of all thread scores when #threads == #vcpus, all taking the same spin lock. The overall sum should be as close as possible to #1.
3. Like #2 but #threads > #vcpus, and other versions of #total vcpus (belonging to all VMs) > #pcpus.
4. Create #threads == #vcpus but let each thread have its own spin lock
5. Like 4 + 2

Hopefully this way will allow you to judge and evaluate the exact overhead of scheduling VMs and threads, since you have the ideal result in hand and you know what the threads are doing.

My 2 cents, Dor

Thank you, I think this is an excellent idea. (Though I am still trying to put together all the pieces you mentioned.)

So overall we should be able to measure the performance of pvspinlock/PLE improvements with a deterministic load in the guest. The only thing I am missing is how to generate the different combinations of locks.

Okay, let me see if I can come up with a solid model for this.
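A hedged userspace approximation of the harness Dor sketches, compiled with -lpthread; the in-guest version would be a kernel module driven through /sys, and this sketch deliberately uses plain pthread spinlocks rather than the cloned deterministic ticket locks the proposal calls for, so all names and sizes here are illustrative:

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define NR_LOCKS 40                     /* ~10 * #vcpus, per the proposal */

  static pthread_spinlock_t locks[NR_LOCKS];
  static volatile int die;

  struct worker {
          int lock_idx;
          unsigned long long sum;
  };

  static unsigned long long dummy_computation(void)
  {
          unsigned long long s = 0;
          int i;

          for (i = 0; i < 100000; i++)    /* hold time is the knob */
                  s += i;
          return s;
  }

  static void *worker_fn(void *arg)
  {
          struct worker *w = arg;

          while (!die) {
                  pthread_spin_lock(&locks[w->lock_idx]);
                  w->sum += dummy_computation();
                  pthread_spin_unlock(&locks[w->lock_idx]);
          }
          return NULL;
  }

  int main(int argc, char **argv)
  {
          int nthreads = argc > 1 ? atoi(argv[1]) : 4;
          int shared = argc > 2 ? atoi(argv[2]) : 1;  /* 1: all spin on lock 0 */
          pthread_t tids[nthreads];
          struct worker w[nthreads];
          int i;

          for (i = 0; i < NR_LOCKS; i++)
                  pthread_spin_init(&locks[i], PTHREAD_PROCESS_PRIVATE);
          for (i = 0; i < nthreads; i++) {
                  w[i].lock_idx = shared ? 0 : i % NR_LOCKS;  /* cases 2 vs 4 */
                  w[i].sum = 0;
                  pthread_create(&tids[i], NULL, worker_fn, &w[i]);
          }
          sleep(10);                      /* fixed measurement window */
          die = 1;
          for (i = 0; i < nthreads; i++) {
                  pthread_join(tids[i], NULL);
                  printf("thread %d: sum=%llu\n", i, w[i].sum);  /* per-thread score */
          }
          return 0;
  }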
Re: [PATCH v3 1/7] KVM: MMU: fix release noslot pfn
On Mon, Sep 24, 2012 at 07:49:37PM +0800, Xiao Guangrong wrote:
> On 09/24/2012 07:24 PM, Gleb Natapov wrote:
> > On Mon, Sep 24, 2012 at 12:59:32PM +0800, Xiao Guangrong wrote:
> >> On 09/23/2012 05:13 PM, Gleb Natapov wrote:
> >>> On Fri, Sep 21, 2012 at 02:57:19PM +0800, Xiao Guangrong wrote:
> >>>> We can not directly call kvm_release_pfn_clean to release the pfn
> >>>> since we can meet a noslot pfn, which is used to cache mmio info
> >>>> into the spte
> >>>
> >>> Wouldn't it be better to move the check into kvm_release_pfn_clean()?
> >>
> >> I think there is no reason for us to prefer adding this branch in
> >> the common code. :)
> >
> > Is the function performance critical? Is the function called without the
> > check on a hot path? The function already contains the much heavier
> > kvm_is_mmio_pfn() check. If most/all invocations require the check before
> > the call it's better to move it inside.
>
> It is not most/all callers that need to do this check - it is only needed
> on the x86 mmu page-fault/prefetch path.

At least on x86 there are 7 calls to kvm_release_pfn_clean(), 5 of them guarded by is_noslot_pfn() (after this patch) and one by the even stronger is_error_pfn(). I guess when/if other architectures add MMIO MMU caching they will need to guard kvm_release_pfn_clean() by is_noslot_pfn() too in most cases. I am not insisting, but as this patch shows it is easy to miss the check before calling the function.

-- 
	Gleb.
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/24/2012 05:04 PM, Peter Zijlstra wrote:

On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:

In some special scenarios like #vcpu <= #pcpu, PLE handler may prove very costly, because there is no need to iterate over vcpus and do unsuccessful yield_to burning CPU.

What's the costly thing? The vm-exit, the yield (which should be a nop if it's the only task there) or something else entirely?

Both the vmexit and yield_to() actually, because an unsuccessful yield_to() overall is costly in the PLE handler.

This is because when we have large guests, say 32/16 vcpus, and one vcpu is holding the lock while the rest of the vcpus are waiting for it, when they do a PL-exit each vcpu tries to iterate over the rest of the vcpu list in the VM and do a directed yield (unsuccessful). (O(n^2) tries.)

This results in a fairly high amount of cpu burning and double run queue lock contention.

(If they were spinning, probably lock progress would have been faster.) As Avi/Chegu Vinod had felt, it is better to avoid the vmexit itself, which seems a little complex to achieve currently.
Re: [PATCH v3 1/7] KVM: MMU: fix release noslot pfn
On 09/24/2012 07:24 PM, Gleb Natapov wrote:
> On Mon, Sep 24, 2012 at 12:59:32PM +0800, Xiao Guangrong wrote:
>> On 09/23/2012 05:13 PM, Gleb Natapov wrote:
>>> On Fri, Sep 21, 2012 at 02:57:19PM +0800, Xiao Guangrong wrote:
>>>> We can not directly call kvm_release_pfn_clean to release the pfn
>>>> since we can meet a noslot pfn, which is used to cache mmio info
>>>> into the spte
>>>
>>> Wouldn't it be better to move the check into kvm_release_pfn_clean()?
>>
>> I think there is no reason for us to prefer adding this branch in
>> the common code. :)
>
> Is the function performance critical? Is the function called without the
> check on a hot path? The function already contains the much heavier
> kvm_is_mmio_pfn() check. If most/all invocations require the check before
> the call it's better to move it inside.

It is not most/all callers that need to do this check - it is only needed on the x86 mmu page-fault/prefetch path.
Re: KVM call agenda for September 25th
On 24/09/2012 13:28, Juan Quintela wrote:
> Hi
>
> Please send in any agenda items you are interested in covering.

URI parsing library for glusterfs: libxml2 vs. an in-tree "fork" of the same code.

Paolo
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On 09/24/2012 05:03 PM, Peter Zijlstra wrote:

On Fri, 2012-09-21 at 17:30 +0530, Raghavendra K T wrote:

+unsigned long rq_nr_running(void)
+{
+	return this_rq()->nr_running;
+}
+EXPORT_SYMBOL(rq_nr_running);

Uhm,.. no, that's a horrible thing to export.

True.. I had the same fear :).

Other options I thought of were something like nr_running()/num_online_cpus(), this_cpu_load(), etc. Could you please let me know if there is some good metric we can rely on to say whether the system is overcommitted or not?
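One less-invasive shape this could take, sketched here as an assumption rather than anything proposed in the thread: export only a boolean predicate instead of the raw runqueue counter, so no scheduler internals leak out:

  /* kernel/sched/core.c -- sketch; the name and placement are assumptions */
  bool single_task_running(void)
  {
  	return this_rq()->nr_running == 1;
  }
  EXPORT_SYMBOL_GPL(single_task_running);

  /* KVM side: bail out of the PLE handler early in the undercommitted case */
  if (single_task_running())
  	return;		/* nothing else runs on this cpu; yielding is pointless */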
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
> In some special scenarios like #vcpu <= #pcpu, PLE handler may
> prove very costly, because there is no need to iterate over vcpus
> and do unsuccessful yield_to burning CPU.

What's the costly thing? The vm-exit, the yield (which should be a nop if it's the only task there) or something else entirely?
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On Fri, 2012-09-21 at 17:30 +0530, Raghavendra K T wrote:
> +unsigned long rq_nr_running(void)
> +{
> +	return this_rq()->nr_running;
> +}
> +EXPORT_SYMBOL(rq_nr_running);

Uhm,.. no, that's a horrible thing to export.
Re: [PATCH v3 1/7] KVM: MMU: fix release noslot pfn
On Mon, Sep 24, 2012 at 12:59:32PM +0800, Xiao Guangrong wrote:
> On 09/23/2012 05:13 PM, Gleb Natapov wrote:
> > On Fri, Sep 21, 2012 at 02:57:19PM +0800, Xiao Guangrong wrote:
> >> We can not directly call kvm_release_pfn_clean to release the pfn
> >> since we can meet a noslot pfn, which is used to cache mmio info
> >> into the spte
> >>
> > Wouldn't it be better to move the check into kvm_release_pfn_clean()?
>
> I think there is no reason for us to prefer adding this branch in
> the common code. :)

Is the function performance critical? Is the function called without the check on a hot path? The function already contains the much heavier kvm_is_mmio_pfn() check. If most/all invocations require the check before the call it's better to move it inside.

-- 
	Gleb.
Re: [PATCH v3 0/6] Optimize vcpu->requests processing
On 09/24/2012 06:52 PM, Avi Kivity wrote:
> On 09/24/2012 12:19 PM, Xiao Guangrong wrote:
>> On 09/24/2012 05:48 PM, Avi Kivity wrote:
>>> On 09/24/2012 07:55 AM, Xiao Guangrong wrote:
>>>> On 07/10/2012 01:05 AM, Avi Kivity wrote:
>>>>> Currently, any time a request bit is set (not too uncommon) we check
>>>>> all of them. This patchset optimizes the process slightly by skipping
>>>>> over unset bits using for_each_set_bit().
>>>>
>>>> I also notice that kvm_check_request costs lots of cpu time. What is
>>>> the status of this patchset?
>>>
>>> I had problems getting rid of KVM_REQ_PENDING_TIMER. I'll try again.
>>>
>>> In what workloads did you see kvm_check_request()?
>>
>> Run kernbench on the guest, and use perf to sample; this function is
>> really hot.
>
> I don't see it at all. Westmere, 4-way guest.

This is the result I got on my laptop:

# ./perf report | grep "\[kvm\]"
    85.59%  qemu-kvm  [kvm]  [k] arch_local_irq_enable
     0.18%  qemu-kvm  [kvm]  [k] kvm_arch_vcpu_ioctl_run
     0.10%  qemu-kvm  [kvm]  [k] paging64_walk_addr_generic
     0.10%  qemu-kvm  [kvm]  [k] x86_decode_insn
     0.08%  qemu-kvm  [kvm]  [k] kvm_check_request
     0.06%  qemu-kvm  [kvm]  [k] apic_clear_vector

My box: i5-2540M CPU @ 2.60GHz * 4 + 4G
guest: 2 vcpu + 1G
Re: [RFC PATCH] KVM: x86: Skip request checking branches in vcpu_enter_guest() more effectively
On 09/24/2012 08:59 AM, Xiao Guangrong wrote:
> On 09/24/2012 02:24 PM, Takuya Yoshikawa wrote:
>> This is an RFC since I have not done any comparison with the approach
>> using for_each_set_bit() which can be seen in Avi's work.
>
> Why not compare it? I think for_each_set_bit is better and it can
> improve all cases (in your patch, you did not consider all cases;
> for example, if the guest is under mm overcommit, it should generate
> lots of TLB flush/RELOAD requests).
>
> Actually, I think Avi's way can be improved further; we can
> just use one atomic operation to avoid the cache miss. Maybe like this:
>
>   while (vcpu->requests) {
>       xchg(&vcpu->requests, request);
>
>       for_each_set_bit(request) {
>           clear_bit(X);
>           ...
>       }
>   }

In fact I had something like that in one of the earlier versions, but it was problematic. I'll try to see what the issue was.

-- 
error compiling committee.c: too many arguments to function
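A hedged, fleshed-out sketch of the xchg-batching idea quoted above -- grab every pending request in one atomic operation, then dispatch with for_each_set_bit(). The dispatch details are illustrative, not Avi's earlier version, and note the caveat that any handler which itself raises a request needs the outer loop to pick it up on the next pass:

  unsigned long requests = xchg(&vcpu->requests, 0UL);
  unsigned int req;

  for_each_set_bit(req, &requests, BITS_PER_LONG) {
  	switch (req) {
  	case KVM_REQ_MMU_RELOAD:
  		kvm_mmu_unload(vcpu);
  		break;
  	case KVM_REQ_STEAL_UPDATE:
  		record_steal_time(vcpu);
  		break;
  	default:
  		/* ... remaining requests ... */
  		break;
  	}
  }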
Re: [RFC PATCH] KVM: x86: Skip request checking branches in vcpu_enter_guest() more effectively
On 09/24/2012 09:16 AM, Gleb Natapov wrote:
> On Mon, Sep 24, 2012 at 03:24:47PM +0900, Takuya Yoshikawa wrote:
>> This is an RFC since I have not done any comparison with the approach
>> using for_each_set_bit() which can be seen in Avi's work.
>>
>> Takuya
>> ---
>>
>> We did a simple test to see which requests we would get at the same time
>> in vcpu_enter_guest() and got the following numbers:
>>
>>   (N)  Nothing                                          22.3%
>>   (E)  Only KVM_REQ_EVENT                               30.7%
>>   (S)  Only KVM_REQ_STEAL_UPDATE                        16.0%
>>   (ES) Only KVM_REQ_EVENT and KVM_REQ_STEAL_UPDATE      29.5%
>>   others                                                 1.4%
>>
>> * Note that the exact numbers can change for other guests.
>
> Yes, for guests that do not enable steal time KVM_REQ_STEAL_UPDATE
> should never be set, but currently it is. The patch (not tested) should
> fix this.
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 901ad00..01572f5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1544,6 +1544,8 @@ static void accumulate_steal_time(struct kvm_vcpu *vcpu)
>  	delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
>  	vcpu->arch.st.last_steal = current->sched_info.run_delay;
>  	vcpu->arch.st.accum_steal = delta;
> +
> +	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
>  }
>  
>  static void record_steal_time(struct kvm_vcpu *vcpu)
> @@ -1673,8 +1675,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>  		accumulate_steal_time(vcpu);
>  		preempt_enable();
>  
> -		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
> -
>  		break;
>  	case MSR_KVM_PV_EOI_EN:
>  		if (kvm_lapic_enable_pv_eoi(vcpu, data))
> @@ -2336,7 +2336,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	}
>  
>  	accumulate_steal_time(vcpu);
> -	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
>  }

Even better would be not to account heavyweight exits as steal time. But we'd need a sched_exit preempt notifier for that.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH v3 0/6] Optimize vcpu->requests processing
On 09/24/2012 12:19 PM, Xiao Guangrong wrote:
> On 09/24/2012 05:48 PM, Avi Kivity wrote:
>> On 09/24/2012 07:55 AM, Xiao Guangrong wrote:
>>> On 07/10/2012 01:05 AM, Avi Kivity wrote:
>>>> Currently, any time a request bit is set (not too uncommon) we check
>>>> all of them. This patchset optimizes the process slightly by skipping
>>>> over unset bits using for_each_set_bit().
>>>
>>> I also notice that kvm_check_request costs lots of cpu time. What is
>>> the status of this patchset?
>>
>> I had problems getting rid of KVM_REQ_PENDING_TIMER. I'll try again.
>>
>> In what workloads did you see kvm_check_request()?
>
> Run kernbench on the guest, and use perf to sample; this function is
> really hot.

I don't see it at all. Westmere, 4-way guest.

-- 
error compiling committee.c: too many arguments to function
Re: [RFC PATCH v3 20/19][SeaBIOS] alternative: Use paravirt interface for pci windows
On Mon, Sep 24, 2012 at 02:35:30PM +0800, Wen Congyang wrote:
> At 09/21/2012 07:20 PM, Vasilis Liaskovitis Wrote:
>> Initialize the 32-bit and 64-bit pci starting offsets from values passed
>> in by the qemu paravirt interface QEMU_CFG_PCI_WINDOW. Qemu calculates
>> the starting offsets based on initial memory and hotplug-able dimms.
>
> This patch can't be applied if I apply the other patches for seabios. And I
> don't find this patch in your tree.

To test these alternative patches, please try these trees:

https://github.com/vliaskov/seabios/commits/memhp-v3-alt
https://github.com/vliaskov/qemu-kvm/commits/memhp-v3-alt

thanks,

- Vasilis
Re: [RFC PATCH v3 06/19] Implement "-dimm" command line option
On Sat, Sep 22, 2012 at 01:46:57PM, Blue Swirl wrote:
> On Fri, Sep 21, 2012 at 11:17 AM, Vasilis Liaskovitis wrote:
>> Example:
>> "-dimm id=dimm0,size=512M,node=0,populated=off"
>
> There should not be a need to introduce a new top level option,
> instead you should just use -device, like
> -device dimm,base=0,id=dimm0,size=512M,node=0,populated=off
>
> That would also specify the start address.

What is "base"? The start address? I think the start address should be calculated by the chipset / board, not by the user. The "-dimm" option is supposed to specify the dimm/memory layout, and not create any devices.

If we don't want this new option, I have a question: a "-device/device_add" means we create a new qdev device at startup or as a hotplug operation respectively. So, the semantics of "-device dimm,id=dimm0,size=512M,node=0,populated=on" are clear to me. What does "-device dimm,populated=off" mean from a qdev perspective? There are 2 alternatives:

- The device is created on the dimmbus, but is not used/populated yet. Then the activation/acpi-hotplug of the dimm may require a separate command (we used to have "dimm_add" in versions < 3). "device_add" handling always hotplugs a new qdev device, so this wouldn't fit this use case, because the device already exists. In this case, the actual "acpi hotplug" operation is decoupled from qdev device creation.

- The dimmdevice is not created when "-device dimm,populated=off" (this would require some ugly checking in normal -device argument handling). Only the dimm layout is saved. The hotplug is triggered from a normal device_add later. So in this case, the "acpi hotplug" happens at the same time as the qdev hotplug.

Do you see a simpler alternative without introducing a new option? Using the "-dimm" option follows the second semantic and avoids changing the "-device" semantics. Dimm layout description is decoupled from dimmdevice creation, and qdev hotplug coincides with acpi hotplug.

thanks,

- Vasilis
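For concreteness, a hedged illustration of the two flows being contrasted; the spellings follow this cover letter and the pre-v3 "dimm_add" command it mentions, so treat both as illustrative rather than final syntax:

  # Proposed "-dimm" flow: layout declared up front, and a later
  # device_add is both the qdev hotplug and the ACPI hotplug
  qemu-system-x86_64 ... -dimm id=dimm0,size=512M,node=0,populated=off
  (qemu) device_add dimm,id=dimm0

  # First alternative above: device pre-created via -device, so
  # activation would need a separate command in the dimm_add style
  qemu-system-x86_64 ... -device dimm,id=dimm0,size=512M,node=0,populated=off
  (qemu) dimm_add dimm0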
Re: [PATCH v3 0/6] Optimize vcpu->requests processing
On 09/24/2012 05:48 PM, Avi Kivity wrote:
> On 09/24/2012 07:55 AM, Xiao Guangrong wrote:
>> On 07/10/2012 01:05 AM, Avi Kivity wrote:
>>> Currently, any time a request bit is set (not too uncommon) we check
>>> all of them. This patchset optimizes the process slightly by skipping
>>> over unset bits using for_each_set_bit().
>>
>> I also notice that kvm_check_request costs lots of cpu time. What is
>> the status of this patchset?
>
> I had problems getting rid of KVM_REQ_PENDING_TIMER. I'll try again.
>
> In what workloads did you see kvm_check_request()?

Run kernbench on the guest, and use perf to sample; this function is really hot.
Re: [RFC PATCH] KVM: x86: Skip request checking branches in vcpu_enter_guest() more effectively
On 09/24/2012 08:24 AM, Takuya Yoshikawa wrote:
> This is an RFC since I have not done any comparison with the approach
> using for_each_set_bit() which can be seen in Avi's work.
>
> Takuya
> ---
>
> We did a simple test to see which requests we would get at the same time
> in vcpu_enter_guest() and got the following numbers:
>
>   (N)  Nothing                                          22.3%
>   (E)  Only KVM_REQ_EVENT                               30.7%
>   (S)  Only KVM_REQ_STEAL_UPDATE                        16.0%
>   (ES) Only KVM_REQ_EVENT and KVM_REQ_STEAL_UPDATE      29.5%
>   others                                                 1.4%
>
> * Note that the exact numbers can change for other guests.

What was the workload? STEAL_UPDATE is done on schedules and heavyweight exit (erroneously), so it should be rare. Or maybe we're recording HLT time as steal time?

> This motivated us to optimize the following code in vcpu_enter_guest():
>
>   if (vcpu->requests) { /** (1) **/
>       ...
>       if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu)) /** (2) **/
>           record_steal_time(vcpu);
>       ...
>   }
>
>   if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
>       ...
>   }
>
> - For case (E), we do kvm_check_request() for every request other than
>   KVM_REQ_EVENT in the block (1) and always get false.
> - For case (S) and (ES), the only difference from case (E) is that we
>   get true for KVM_REQ_STEAL_UPDATE.
>
> This means that we were wasting a lot of time on the many if branches
> in the block (1) even for these trivial three cases which dominated more
> than 75% in total.
>
> This patch mitigates the issue as follows:
> - For case (E), we change the first if condition to
>   if (vcpu->requests & ~(1 << KVM_REQ_EVENT)) /** (1') **/
>   so that we can skip the block completely.
> - For case (S) and (ES), we move the check (2) upwards, out of the
>   block (1), to clear the KVM_REQ_STEAL_UPDATE flag before doing the
>   check (1').
>
> Although this adds one if branch for case (N), the fact that steal
> update occurs frequently enough except when we give each vcpu a
> dedicated core justifies its tiny cost.

Modern processors will eliminate KVM_REQ_EVENT in many cases, so the optimization is wasted on them. Do you have numbers? Just for your patch, not my alternative.

> Signed-off-by: Takuya Yoshikawa
> ---
> [My email address change is not a mistake.]
>
>  arch/x86/kvm/x86.c | 11 ---
>  1 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fc2a0a1..e351902 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5233,7 +5233,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  		vcpu->run->request_interrupt_window;
>  	bool req_immediate_exit = 0;
>  
> -	if (vcpu->requests) {
> +	/*
> +	 * Getting KVM_REQ_STEAL_UPDATE alone is so common that clearing it now
> +	 * will hopefully result in skipping the following checks.
> +	 */
> +	if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
> +		record_steal_time(vcpu);
> +
> +	if (vcpu->requests & ~(1 << KVM_REQ_EVENT)) {
>  		if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
>  			kvm_mmu_unload(vcpu);
>  		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
> @@ -5267,8 +5274,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  			r = 1;
>  			goto out;
>  		}
> -		if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
> -			record_steal_time(vcpu);
>  		if (kvm_check_request(KVM_REQ_NMI, vcpu))
>  			process_nmi(vcpu);
>  		req_immediate_exit =

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH v3 0/6] Optimize vcpu->requests processing
On 09/24/2012 07:55 AM, Xiao Guangrong wrote: > On 07/10/2012 01:05 AM, Avi Kivity wrote: >> Currently, any time a request bit is set (not too uncommon) we check all of >> them. >> This patchset optimizes the process slightly by skipping over unset bits >> using for_each_set_bit(). >> > > I also notice that kvm_check_request costs a lot of CPU time. What is the > status of this patchset? > I had problems getting rid of KVM_REQ_PENDING_TIMER. I'll try again. In what workloads did you see kvm_check_request() show up? -- error compiling committee.c: too many arguments to function
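For context, the shape of the for_each_set_bit() approach under discussion is to iterate only over the request bits that are actually set instead of testing every request in turn. A rough sketch with the dispatch body abbreviated (the exact snapshot-and-clear ordering varied across revisions of the series):

	unsigned long requests = vcpu->requests;
	unsigned int req_bit;

	for_each_set_bit(req_bit, &requests, BITS_PER_LONG) {
		clear_bit(req_bit, &vcpu->requests);
		switch (req_bit) {
		case KVM_REQ_MMU_RELOAD:
			kvm_mmu_unload(vcpu);
			break;
		/* ... one case per request bit ... */
		}
	}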
Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote: > On Fri, 21 Sep 2012 11:46:22 +0100 > Mel Gorman wrote: > > > When compaction was implemented it was known that scanning could potentially > > be excessive. The ideal was that a counter be maintained for each pageblock > > but maintaining this information would incur a severe penalty due to a > > shared writable cache line. It has reached the point where the scanning > > costs are a serious problem, particularly on long-lived systems where a > > large process starts and allocates a large number of THPs at the same time. > > > > Instead of using a shared counter, this patch adds another bit to the > > pageblock flags called PG_migrate_skip. If a pageblock is scanned by > > either the migrate or free scanner and 0 pages were isolated, the pageblock > > is marked to be skipped in the future. When scanning, this bit is checked > > before any scanning takes place and the block skipped if set. > > > > The main difficulty with a patch like this is "when to ignore the cached > > information?" If it's ignored too often, the scanning rates will still > > be excessive. If the information is too stale then allocations will fail > > that might have otherwise succeeded. In this patch > > > > o CMA always ignores the information > > o If the migrate and free scanner meet then the cached information will > > be discarded if at least 5 seconds have passed since the last time the > > cache was discarded > > o If there are a large number of allocation failures, discard the cache. > > > > The time-based heuristic is very clumsy but there are few choices for a > > better event. Depending solely on multiple allocation failures still allows > > excessive scanning when THP allocations are failing in quick succession > > due to memory pressure. Waiting until memory pressure is relieved would > > cause compaction to continually fail instead of using reclaim/compaction > > to try to allocate the page. The time-based mechanism is clumsy but a better > > option is not obvious. > > ick. > I know. I was being generous when I described it as "clumsy". > Wall time has sooo little relationship to what's happening in there. > If we *have* to use polling, can't we clock the poll with some metric > which is at least vaguely related to the amount of activity? Initially I wanted to depend on just this /* Clear pageblock skip if there are numerous alloc failures */ if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT) reset_isolation_suitable(zone); because this is at least related to activity, but it's weak for two reasons. One, it depends on failures to make the decision - i.e. when it is already too late. Two, even this condition can be hit very quickly and can result in many resets per second in the worst case. > Number > (or proportion) of pages scanned, for example? Or reset everything on > the Nth trip around the zone? For a full compaction failure we have scanned all pages in the zone so there is no proportion to use there. Resetting everything every Nth trip around the zone is similar to the above check except it would look like if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT && zone->compact_reset_laps == COMPACT_MAX_LAPS) reset_isolation_suitable(zone) but it's weak for the same reasons - it depends on failures to make decisions and can happen too quickly.
I also considered using the PGFREE vmstat to reset if a pageblock's worth of pages had been freed since the last reset, but this happens very quickly under memory pressure and would not throttle enough. I also considered deferring until NR_FREE_PAGES was high enough, but this would severely impact allocation success rates under memory pressure. > Or even a combination of one of these > *and* of wall time, so the system will at least work harder when MM is > under load. > We sort of do that now - we are depending on a number of failures and time before clearing the bits. > Also, what has to be done to avoid the polling altogether? eg/ie, zap > a pageblock's PB_migrate_skip synchronously, when something was done to > that pageblock which justifies repolling it? > The "something" event you are looking for is pages being freed or allocated in the page allocator. A movable page being allocated in the block or a page being freed should clear the PB_migrate_skip bit if it's set. Unfortunately this would impact the page allocator's fast alloc and free paths. I felt that was too high a price to pay. > > > > ... > > > > +static void reset_isolation_suitable(struct zone *zone) > > +{ > > + unsigned long start_pfn = zone->zone_start_pfn; > > + unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages; > > + unsigned long pfn; > > + > > + /* > > +* Do not reset more than once every five seconds. If allocations are > > +* failing sufficiently quickly to allow this to happen then continually > > +
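Putting the pieces together, the reset policy being discussed amounts to something like the sketch below. The compact_blockskip_expire field is an illustrative name for wherever the timestamp would be stored, not the actual field from the patch; the pageblock-flag helper follows the accessors the patch introduces:

	static void reset_isolation_suitable(struct zone *zone)
	{
		unsigned long start_pfn = zone->zone_start_pfn;
		unsigned long end_pfn = start_pfn + zone->spanned_pages;
		unsigned long pfn;

		/* Do not reset more than once every five seconds */
		if (time_before(jiffies, zone->compact_blockskip_expire))
			return;
		zone->compact_blockskip_expire = jiffies + 5 * HZ;

		/* Walk the zone and clear the cached skip bit per pageblock */
		for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
			if (!pfn_valid(pfn))
				continue;
			clear_pageblock_skip(pfn_to_page(pfn));
		}
	}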
Re: [PATCH 6/9] mm: compaction: Acquire the zone->lock as late as possible
On Fri, Sep 21, 2012 at 02:35:57PM -0700, Andrew Morton wrote: > On Fri, 21 Sep 2012 11:46:20 +0100 > Mel Gorman wrote: > > > Compaction's free scanner acquires the zone->lock when checking for PageBuddy > > pages and isolating them. It does this even if there are no PageBuddy pages > > in the range. > > > > This patch defers acquiring the zone lock for as long as possible. In the > > event there are no free pages in the pageblock, the lock will not be > > acquired at all, which reduces contention on zone->lock. > > > > ... > > > > --- a/mm/compaction.c > > +++ b/mm/compaction.c > > @@ -93,6 +93,28 @@ static inline bool compact_trylock_irqsave(spinlock_t > > *lock, > > return compact_checklock_irqsave(lock, flags, false, cc); > > } > > > > +/* Returns true if the page is within a block suitable for migration to */ > > +static bool suitable_migration_target(struct page *page) > > +{ > > + > > stray newline > Fixed. > > + int migratetype = get_pageblock_migratetype(page); > > + > > + /* Don't interfere with memory hot-remove or the min_free_kbytes blocks > > */ > > + if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE) > > + return false; > > + > > + /* If the page is a large free page, then allow migration */ > > + if (PageBuddy(page) && page_order(page) >= pageblock_order) > > + return true; > > + > > + /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */ > > + if (migrate_async_suitable(migratetype)) > > + return true; > > + > > + /* Otherwise skip the block */ > > + return false; > > +} > > + > > > > ... > > > > @@ -168,23 +193,38 @@ static unsigned long isolate_freepages_block(unsigned > > long blockpfn, > > int isolated, i; > > struct page *page = cursor; > > > > - if (!pfn_valid_within(blockpfn)) { > > - if (strict) > > - return 0; > > - continue; > > - } > > + if (!pfn_valid_within(blockpfn)) > > + goto strict_check; > > nr_scanned++; > > > > - if (!PageBuddy(page)) { > > - if (strict) > > - return 0; > > - continue; > > - } > > + if (!PageBuddy(page)) > > + goto strict_check; > > + > > + /* > > +* The zone lock must be held to isolate freepages. This > > +* unfortunately this is a very coarse lock and can be > > this this > Fixed. > > +* heavily contended if there are parallel allocations > > +* or parallel compactions. For async compaction do not > > +* spin on the lock and we acquire the lock as late as > > +* possible.
> > +*/ > > + locked = compact_checklock_irqsave(&cc->zone->lock, &flags, > > + locked, cc); > > + if (!locked) > > + break; > > + > > + /* Recheck this is a suitable migration target under lock */ > > + if (!strict && !suitable_migration_target(page)) > > + break; > > + > > + /* Recheck this is a buddy page under lock */ > > + if (!PageBuddy(page)) > > + goto strict_check; > > > > /* Found a free page, break it into order-0 pages */ > > isolated = split_free_page(page); > > if (!isolated && strict) > > - return 0; > > + goto strict_check; > > total_isolated += isolated; > > for (i = 0; i < isolated; i++) { > > list_add(&page->lru, freelist); > > @@ -196,9 +236,23 @@ static unsigned long isolate_freepages_block(unsigned > > long blockpfn, > > blockpfn += isolated - 1; > > cursor += isolated - 1; > > } > > + > > + continue; > > + > > +strict_check: > > + /* Abort isolation if the caller requested strict isolation */ > > + if (strict) { > > + total_isolated = 0; > > + goto out; > > + } > > } > > > > trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated); > > + > > +out: > > + if (locked) > > + spin_unlock_irqrestore(&cc->zone->lock, flags); > > + > > return total_isolated; > > } > > A few things about this function. > > Would it be cleaner if we did > > if (!strict) { > if (!suitable_migration_target(page)) > break; > } else { > if (!PageBuddy(page)) > goto strict_check; > } > > and then remove the test of `strict' from strict_check (which then > should be renamed)? > I was not able to implement what you suggested without breakage. However, I can do something very similar if "strict" mode
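For reference, the deferred-locking pattern the changelog describes reduces, inside the scan loop, to something like the following simplified sketch of the compact_checklock_irqsave() semantics (cc->contended is an illustrative flag; the real helper also drops and retakes the lock around cond_resched()):

	/* Take zone->lock only once a candidate page is found */
	if (!locked) {
		if (cc->sync) {
			spin_lock_irqsave(&cc->zone->lock, flags);
		} else if (!spin_trylock_irqsave(&cc->zone->lock, flags)) {
			/* async compaction refuses to spin on a contended lock */
			cc->contended = true;
			break;
		}
		locked = true;
	}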
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
In order to help PLE and pvticketlock converge, I thought that a small test code should be developed to test this in a predictable, deterministic way. The idea is to have a guest kernel module that spawns a new thread each time you write to a /sys/ entry. Each such thread spins on a spin lock. The specific spin lock is also chosen through the /sys/ interface. Let's say we have an array of spin locks, 10 times the number of vcpus. All the threads are running:

	while (1) {
		spin_lock(my_lock);
		sum += execute_dummy_cpu_computation(time);
		spin_unlock(my_lock);
		if (sys_tells_thread_to_die())
			break;
	}
	print_result(sum);

Instead of calling the kernel's spin_lock functions, clone them and make the ticket lock order deterministic and known (like a linear walk of all the threads trying to catch that lock). This way you can easily calculate:

1. the score of a single vcpu running a single thread
2. the sum of all thread scores when #threads == #vcpus, all taking the same spin lock. The overall sum should be as close as possible to #1.
3. Like #2 but with #threads > #vcpus, and other variants where #total vcpus (belonging to all VMs) > #pcpus.
4. Create #threads == #vcpus but let each thread have its own spin lock.
5. Like #4 + #2.

Hopefully this will allow you to judge and evaluate the exact overhead of scheduling VMs and threads, since you have the ideal result in hand and you know what the threads are doing. My 2 cents, Dor

On 09/21/2012 08:36 PM, Raghavendra K T wrote: On 09/21/2012 06:48 PM, Chegu Vinod wrote: On 9/21/2012 4:59 AM, Raghavendra K T wrote: In some special scenarios like #vcpu <= #pcpu, the PLE handler may prove very costly, Yes. because there is no need to iterate over vcpus and do unsuccessful yield_to, burning CPU. An idea to solve this is: 1) As Avi had proposed, we can modify the hardware ple_window dynamically to avoid frequent PLE exits. Yes. We had to do this to get around some scaling issues for large (>20-way) guests (with no overcommitment) Do you mean you already have some solution tested for this? As part of some experimentation we even tried "switching off" PLE too :( Honestly, your experiment and Andrew Theurer's observations were the motivation for this patch. (IMHO, it is difficult to decide when we have mixed types of VMs). Agree. Not sure if the following alternatives have also been looked at: - Could the behavior associated with the "ple_window" be modified to be a function of some [new] per-guest attribute (which can be conveyed to the host as part of the guest launch sequence)? The user can choose to set this [new] attribute for a given guest. This would help avoid the frequent exits due to PLE (as Avi had mentioned earlier)? Ccing Drew also. We had a good discussion on this idea last time. (sorry that I forgot to include it in the patch series) May be a good idea when we know the load in advance.. - Can the PLE feature (in VT) be "enhanced" to be made a per-guest attribute? IMHO, the approach of not taking a frequent exit is better than taking an exit and returning back from the handler etc. I entirely agree on this point. (though I have not tried the above approaches). Hope to see more expert opinions pouring in.
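A minimal sketch of what the proposed guest module's per-thread loop could look like as a kthread; execute_dummy_cpu_computation(), the loops parameter, and the way each thread is handed a lock from the shared array are assumptions taken from the proposal above, not existing code:

	#include <linux/kthread.h>
	#include <linux/spinlock.h>

	/* 'loops' would be a module parameter controlling the amount of
	   dummy work done inside each critical section */
	static int spinner_fn(void *data)
	{
		spinlock_t *my_lock = data;	/* one lock from the shared array */
		u64 sum = 0;

		while (!kthread_should_stop()) {
			spin_lock(my_lock);
			sum += execute_dummy_cpu_computation(loops);
			spin_unlock(my_lock);
		}
		pr_info("spinner score: %llu\n", sum);
		return 0;
	}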
Re: [RFC PATCH] KVM: x86: Skip request checking branches in vcpu_enter_guest() more effectively
On Mon, Sep 24, 2012 at 03:24:47PM +0900, Takuya Yoshikawa wrote: > This is an RFC since I have not done any comparison with the approach > using for_each_set_bit() which can be seen in Avi's work. > > Takuya > --- > > We did a simple test to see which requests we would get at the same time > in vcpu_enter_guest() and got the following numbers: > >
>     (N)      (E)      (S)     (ES)   others
>    22.3%    30.7%    16.0%   29.5%    1.4%
>
> (N) : Nothing > (E) : Only KVM_REQ_EVENT > (S) : Only KVM_REQ_STEAL_UPDATE > (ES): Only KVM_REQ_EVENT and KVM_REQ_STEAL_UPDATE > > * Note that the exact numbers can change for other guests. > Yes, for guests that do not enable steal time, KVM_REQ_STEAL_UPDATE should never be set, but currently it is. The patch below (not tested) should fix this. diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 901ad00..01572f5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1544,6 +1544,8 @@ static void accumulate_steal_time(struct kvm_vcpu *vcpu) delta = current->sched_info.run_delay - vcpu->arch.st.last_steal; vcpu->arch.st.last_steal = current->sched_info.run_delay; vcpu->arch.st.accum_steal = delta; + + kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); } static void record_steal_time(struct kvm_vcpu *vcpu) @@ -1673,8 +1675,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) accumulate_steal_time(vcpu); preempt_enable(); - kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); - break; case MSR_KVM_PV_EOI_EN: if (kvm_lapic_enable_pv_eoi(vcpu, data)) @@ -2336,7 +2336,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) } accumulate_steal_time(vcpu); - kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); } void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) -- Gleb.
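The patch works because accumulate_steal_time() already bails out early when the guest has not enabled steal time, so raising the request there makes it conditional on the guest having opted in. Roughly, from the x86.c of the period:

	static void accumulate_steal_time(struct kvm_vcpu *vcpu)
	{
		u64 delta;

		/* guest never enabled steal time via MSR_KVM_STEAL_TIME */
		if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
			return;

		delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
		/* ... */
	}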
Re: [RFC PATCH] KVM: x86: Skip request checking branches in vcpu_enter_guest() more effectively
On 09/24/2012 02:24 PM, Takuya Yoshikawa wrote: > This is an RFC since I have not done any comparison with the approach > using for_each_set_bit() which can be seen in Avi's work. > Why not compare it? I think for_each_set_bit is better and it can improve all cases (your patch does not consider all cases; for example, a guest under mm overcommit will generate lots of TLB FLUSH/RELOAD requests). Actually, I think Avi's way can be improved further: we can use a single atomic operation to avoid cache misses. Maybe like this:

	while (vcpu->requests) {
		unsigned long requests = xchg(&vcpu->requests, 0);
		unsigned int bit;

		/* the xchg clears every pending bit in one atomic operation,
		   so no per-request clear_bit is needed */
		for_each_set_bit(bit, &requests, BITS_PER_LONG) {
			/* handle request 'bit' */
			...
		}
	}