* Avi Kivity <[EMAIL PROTECTED]> wrote:

> >2-task context-switch performance (in microseconds, lower is better):
> >
> >    native:                  1.11
> >    ----------------------------------
> >    Qemu:                   61.18
> >    KVM upstream:           53.01
> >    KVM trunk:               6.36
> >    KVM trunk+paravirt/cr3:  1.60
> >
> >i.e. 2-task context-switch performance is faster by a factor of 4,
> >and is now quite close to native speed!
>
> Very impressive! The gain probably comes not only from avoiding the
> vmentry/vmexit, but also from avoiding the flushing of the global page
> tlb entries.

90% of the win comes from the avoidance of the VM exit. To quantify
this more precisely i added an artificial __flush_tlb_global() call
right after switch_to(), just to see how much impact an extra global
flush has on the native kernel: context-switch cost went from 1.11
usecs to 1.65 usecs. Then i added a __flush_tlb() as well, which made
the cost go to 1.75 usecs - so the global-flush component is at around
0.5 usecs.
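a minimal sketch of that instrumentation hack, for reference - the
exact placement inside context_switch() in kernel/sched.c is
illustrative, not taken from the actual measurement patch:

	switch_to(prev, next, prev);

	/*
	 * Artificial flushes, purely to measure their cost on native
	 * hardware: __flush_tlb_global() toggles CR4.PGE and thus zaps
	 * global TLB entries too (1.11 -> 1.65 usecs); __flush_tlb()
	 * only reloads cr3 (1.65 -> 1.75 usecs):
	 */
	__flush_tlb_global();
	__flush_tlb();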
> >"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):
> >
> >    native:            0.25
> >    ----------------------------------
> >    Qemu:              7.8
> >    KVM upstream:      2.8
> >    KVM trunk:         0.55
> >    KVM paravirt/cr3:  0.36
> >
> >almost twice as fast.
> >
> >"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):
> >
> >    native:            0.9
> >    ----------------------------------
> >    Qemu:              35.2
> >    KVM upstream:      9.4
> >    KVM trunk:         2.8
> >    KVM paravirt/cr3:  2.2
> >
> >still a 30% improvement - which isn't too bad considering that 200
> >tasks are context-switching in this workload and the cr3 cache in
> >current CPUs is only 4 entries.
>
> This is a little too good to be true. Were both runs with the same
> KVM_NUM_MMU_PAGES?

yes, both had the same elevated KVM_NUM_MMU_PAGES of 2048. The 'trunk'
run should have been labeled as: 'cr3 tree with paravirt turned off'.
That's not completely 'trunk', but close to it, and all other changes
(like the elimination of unnecessary TLB flushes) apply equally to
both.

i also did a run with far fewer MMU cache pages (256): hackbench 1
stayed the same, while the hackbench 5 numbers started fluctuating
badly (i think that workload is thrashing the MMU cache).

> I'm also concerned that at this point in time the cr3 optimizations
> will only show an improvement in microbenchmarks. In real life
> workloads a context switch is usually preceded by an I/O, and with the
> current sorry state of kvm I/O the context switch time would be
> dominated by the I/O time.

oh, i agree completely - but in my opinion accelerating virtual I/O is
really easy, while accelerating the context-switch path (and basic
syscall overhead, like KVM does) is /hard/. So i wanted to see whether
KVM runs well in all the hard cases, before looking at the low-hanging
performance fruit in the I/O area =B-)

also note that there are lots of internal reasons why application
workloads can be heavily context-switching - it's not just I/O that
generates context switches (pipes, critical sections / futexes, etc.).
So having near-native performance for context-switches is very
important.

> >+	if (irq & 8) {
> >+		outb(cached_slave_mask, PIC_SLAVE_IMR);
> >+		outb(0x60+(irq&7), PIC_SLAVE_CMD);	/* 'Specific EOI' to slave */
> >+		outb(0x60+PIC_CASCADE_IR, PIC_MASTER_CMD); /* 'Specific EOI' to master-IRQ2 */
> >+	} else {
> >+		outb(cached_master_mask, PIC_MASTER_IMR);
> >+		/* 'Specific EOI' to master: */
> >+		outb(0x60+irq, PIC_MASTER_CMD);
> >+	}
> >+	spin_unlock_irqrestore(&i8259A_lock, flags);
> >+}
>
> Any reason this can't be applied to mainline? There's probably no
> downside to native, and it would benefit all virtualization solutions
> equally.

this is legacy stuff ...

> >-	u64 *pae_root;
> >+	u64 *pae_root[KVM_CR3_CACHE_SIZE];
>
> hmm. wouldn't it be simpler to have pae_root always point at the
> current root?

does that guarantee that it's available? I wanted to 'pin' the root
itself this way, to make sure that if a guest switches to it via the
cache, it's truly available and a valid root. cr3 addresses are
physical, not virtual, so pinning is the only mechanism available to
guarantee that the host-side memory truly contains a root pagetable.
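a minimal sketch of what the pinned-root cache could look like on the
host side - the cr3_cache array and the lookup helper are hypothetical,
only KVM_CR3_CACHE_SIZE, gpa_t/hpa_t and INVALID_PAGE exist in the kvm
code:

	/* hypothetical per-vcpu cache of pinned shadow roots: */
	struct kvm_cr3_cache_entry {
		gpa_t	guest_cr3;	/* guest cr3 value (physical)  */
		hpa_t	root_hpa;	/* pinned shadow root for it   */
	};

	/*
	 * On a guest cr3 switch that does exit, see whether a pinned
	 * shadow root already exists for the new cr3 - if so it can be
	 * reused directly, without rebuilding the shadow pagetables:
	 */
	static hpa_t cr3_cache_lookup(struct kvm_vcpu *vcpu, gpa_t cr3)
	{
		struct kvm_cr3_cache_entry *e;
		int i;

		for (i = 0; i < KVM_CR3_CACHE_SIZE; i++) {
			e = &vcpu->cr3_cache[i];
			if (e->guest_cr3 == cr3 && e->root_hpa != INVALID_PAGE)
				return e->root_hpa;
		}
		return INVALID_PAGE;
	}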
> >+			vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> >+		}
> > 	}
> > 	vcpu->mmu.root_hpa = INVALID_PAGE;
> > }
>
> You keep the page directories pinned here. [...]

yes.

> [...] This can be a problem if a guest frees a page directory, and
> then starts using it as a regular page. kvm sometimes chooses not to
> emulate a write to a guest page table, but instead to zap it, which is
> impossible when the page is freed. You need to either unpin the page
> when that happens, or add a hypercall to let kvm know when a page
> directory is freed.

the cache is zapped upon pagefaults anyway, so unpinning ought to be
possible. Which one would you prefer?

> >-	for (i = 0; i < 4; ++i)
> >-		vcpu->mmu.pae_root[i] = INVALID_PAGE;
> >+	for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) {
> >+		/*
> >+		 * When emulating 32-bit mode, cr3 is only 32 bits even on
> >+		 * x86_64. Therefore we need to allocate shadow page tables
> >+		 * in the first 4GB of memory, which happens to fit the DMA32
> >+		 * zone:
> >+		 */
> >+		page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> >+		if (!page)
> >+			goto error_1;
> >+
> >+		ASSERT(!vcpu->mmu.pae_root[j]);
> >+		vcpu->mmu.pae_root[j] = page_address(page);
> >+		for (i = 0; i < 4; ++i)
> >+			vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> >+	}
>
> Since a pae root uses just 32 bytes, you can store all cache entries
> in a single page. Not that it matters much.

yeah - i wanted to extend the current code in a safe way, before
optimizing it.

> >+#define KVM_API_MAGIC 0x87654321
> >+
>
> <linux/kvm.h> is the vmm userspace interface. The guest/host
> interface should probably go somewhere else.

yeah. kvm_para.h?

	Ingo
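for completeness, a hypothetical sketch of the hypercall variant
discussed above - the guest notifies the host when it frees a page
directory, so the corresponding pinned shadow root can be dropped. The
hypercall number (KVM_HC_RELEASE_PGD) and the register convention are
assumptions, not part of the posted patches:

	/* made-up hypercall number, for illustration only: */
	#define KVM_HC_RELEASE_PGD	1

	/* guest side - would be called from the pgd freeing path: */
	static void kvm_release_pgd(unsigned long pgd_phys)
	{
		/*
		 * Single VMCALL with the hypercall number in eax and
		 * the physical address of the freed pgd in ebx
		 * (assumed ABI). The host would unpin and zap any
		 * cached shadow root it holds for this pgd:
		 */
		asm volatile("vmcall"
			     : /* no outputs */
			     : "a" (KVM_HC_RELEASE_PGD), "b" (pgd_phys)
			     : "memory");
	}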