Re: [patch 10/10] KVM: MMU: speed up mmu_unsync_walk

2008-09-20 Thread Marcelo Tosatti
On Fri, Sep 19, 2008 at 06:26:56PM -0700, Avi Kivity wrote:
>>  --- kvm.orig/include/asm-x86/kvm_host.h
>> +++ kvm/include/asm-x86/kvm_host.h
>> @@ -201,6 +201,7 @@ struct kvm_mmu_page {
>>  		u64 *parent_pte;   /* !multimapped */
>>  		struct hlist_head parent_ptes; /* multimapped, kvm_pte_chain */
>>  	};
>> +	DECLARE_BITMAP(unsync_child_bitmap, 512);
>>  };
>>  
>
> Later, we can throw this bitmap out to a separate object. 

Yeah.

> Also, it may make sense to replace it with an array of u16s.

Why?
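
For reference, a walk over that bitmap looks roughly like this (a
sketch, not the patch itself; find_first_bit/find_next_bit are the
linux/bitops.h helpers):

	unsigned int i;

	for (i = find_first_bit(sp->unsync_child_bitmap, 512);
	     i < 512;
	     i = find_next_bit(sp->unsync_child_bitmap, 512, i + 1)) {
		struct kvm_mmu_page *child;

		/* each set bit names a shadow entry with an unsync child */
		child = page_header(sp->spt[i] & PT64_BASE_ADDR_MASK);
		/* ... sync or descend into child ... */
	}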


Re: [patch 09/10] KVM: MMU: out of sync shadow core v2

2008-09-20 Thread Marcelo Tosatti
On Fri, Sep 19, 2008 at 06:22:52PM -0700, Avi Kivity wrote:
> Instead of private, have an object contain both callback and private  
> data, and use container_of().  Reduces the chance of type errors.

OK.

>> +	while (parent->unsync_children) {
>> +		for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>> +			u64 ent = sp->spt[i];
>> +
>> +			if (is_shadow_present_pte(ent)) {
>> +				struct kvm_mmu_page *child;
>> +				child = page_header(ent & PT64_BASE_ADDR_MASK);
>
> What does this do?

It walks all children of a given page, with no attempt at efficiency.
It's replaced later by the bitmap version (patch 10).

>> +static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>> +{
>> +	if (sp->role.glevels != vcpu->arch.mmu.root_level) {
>> +		kvm_mmu_zap_page(vcpu->kvm, sp);
>> +		return 1;
>> +	}
>>   
>
> Suppose we switch to real mode, touch a pte, switch back.  Is this handled?

The shadow page will go unsync on the pte touch and be resynced as soon
as it's visible again (after the return to paging).

Or, while still in real mode, it might be zapped by
kvm_mmu_get_page->kvm_sync_page.

Am I missing something?

>> @@ -991,8 +1066,18 @@ static struct kvm_mmu_page *kvm_mmu_get_
>>  		     gfn, role.word);
>>  	index = kvm_page_table_hashfn(gfn);
>>  	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
>> -	hlist_for_each_entry(sp, node, bucket, hash_link)
>> -		if (sp->gfn == gfn && sp->role.word == role.word) {
>> +	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
>> +		if (sp->gfn == gfn) {
>> +			if (sp->unsync)
>> +				if (kvm_sync_page(vcpu, sp))
>> +					continue;
>> +
>> +			if (sp->role.word != role.word)
>> +				continue;
>> +
>> +			if (sp->unsync_children)
>> +				vcpu->arch.mmu.need_root_sync = 1;
>>   
>
> mmu_reload() maybe?

Hmm, will think about it.
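
If it does move to the reload path, something like this might do
(purely hypothetical; it assumes a KVM_REQ_MMU_RELOAD-style request bit
is available in this tree):

	/* hypothetical: let the generic reload path resync the roots on
	 * the next guest entry instead of carrying a private flag */
	if (sp->unsync_children)
		set_bit(KVM_REQ_MMU_RELOAD, &vcpu->requests);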

>>  static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
>> -return 0;
>> +return ret;
>>  }
>>   
>
> Why does the caller care if zap also zapped some other random pages?  To  
> restart walking the list?

Yes. The next element that for_each_entry_safe saved could have been zapped.
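
So the caller restarts the walk instead of trusting the saved next
pointer, roughly like this (a sketch, not the patch itself):

restart:
	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
		if (sp->gfn == gfn && kvm_mmu_zap_page(kvm, sp))
			/* zap took other pages with it; the iterator's
			 * saved next pointer may now be stale */
			goto restart;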

>> +	/* don't unsync if pagetable is shadowed with multiple roles */
>> +	hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
>> +		if (s->gfn != sp->gfn || s->role.metaphysical)
>> +			continue;
>> +		if (s->role.word != sp->role.word)
>> +			return 1;
>> +	}
>>   
>
> This will happen for nonpae paging.  But why not allow it?  Zap all  
> unsynced pages on mode switch.
>
> Oh, if a page is both a page directory and page table, yes.  

Yes. 

> So to allow nonpae oos, check the level instead.

Windows 2008 64-bit does all sorts of sharing of a pagetable at
multiple levels too.



Re: [patch 07/10] KVM: MMU: mmu_parent_walk

2008-09-20 Thread Marcelo Tosatti
On Fri, Sep 19, 2008 at 05:56:46PM -0700, Avi Kivity wrote:
>> +	} while (level > start_level-1);
>> +}
>> +
>
> Could be much simplified with recursion, no?  As the depth is limited to  
> 4, there's no stack overflow problem.

An early version was recursive, but since it's a generic helper I
preferred a non-recursive function.
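
For comparison, the recursive form would be roughly this (a sketch; it
handles only the !multimapped case — the real walker must also iterate
the parent_ptes chain, and mmu_parent_walk_fn is a hypothetical
callback type):

	static void mmu_parent_walk(struct kvm_vcpu *vcpu,
				    struct kvm_mmu_page *sp,
				    mmu_parent_walk_fn fn)
	{
		struct kvm_mmu_page *parent;

		if (sp->parent_pte) {
			/* recursion depth is bounded by the number of
			 * paging levels, i.e. at most 4 */
			parent = page_header(__pa(sp->parent_pte));
			mmu_parent_walk(vcpu, parent, fn);
		}
		fn(vcpu, sp);
	}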



Re: [patch 06/10] KVM: x86: trap invlpg

2008-09-20 Thread Marcelo Tosatti
On Fri, Sep 19, 2008 at 05:53:22PM -0700, Avi Kivity wrote:
> Marcelo Tosatti wrote:
>> +static int FNAME(shadow_invlpg_entry)(struct kvm_shadow_walk *_sw,
>> +				      struct kvm_vcpu *vcpu, u64 addr,
>> +				      u64 *sptep, int level)
>> +{
>> +	if (level == PT_PAGE_TABLE_LEVEL) {
>> +		if (is_shadow_present_pte(*sptep))
>> +			rmap_remove(vcpu->kvm, sptep);
>> +		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
>>   
>
> Need to flush the real tlb as well.

The local TLB you mean?

+void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
+{
+	spin_lock(&vcpu->kvm->mmu_lock);
+	vcpu->arch.mmu.invlpg(vcpu, gva);
+	spin_unlock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_flush_tlb(vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_invlpg);



Re: [patch 03/10] KVM: MMU: do not write-protect large mappings

2008-09-20 Thread Marcelo Tosatti
On Fri, Sep 19, 2008 at 05:29:28PM -0700, Avi Kivity wrote:
> Marcelo Tosatti wrote:
>> There is not much point in write-protecting large mappings. This
>> can only happen when a page is shadowed during the window between
>> is_largepage_backed and the mmu_lock acquisition. Zap the entry
>> instead, so the next pagefault will find a shadowed page via
>> is_largepage_backed and fall back to 4k translations.
>>
>> Simplifies out of sync shadow.
>>
>> Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
>>
>> Index: kvm/arch/x86/kvm/mmu.c
>> ===================================================================
>> --- kvm.orig/arch/x86/kvm/mmu.c
>> +++ kvm/arch/x86/kvm/mmu.c
>> @@ -1180,11 +1180,16 @@ static int set_spte(struct kvm_vcp
>>  	    || (write_fault && !is_write_protection(vcpu) && !user_fault)) {
>>  		struct kvm_mmu_page *shadow;
>> +
>> +		if (largepage && has_wrprotected_page(vcpu->kvm, gfn)) {
>> +			ret = 1;
>> +			spte = shadow_trap_nonpresent_pte;
>> +			goto set_pte;
>> +		}
>
> Don't we need to drop a reference to the page?

mmu_set_spte will do it if necessary:

	if (!was_rmapped) {
		rmap_add(vcpu, shadow_pte, gfn, largepage);
		if (!is_rmap_pte(*shadow_pte))
			kvm_release_pfn_clean(pfn);
	}


But as noted, this part is wrong:

@@ -307,11 +307,10 @@ static int FNAME(shadow_walk_entry)(stru
 		return 1;
 	}
 
-	if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
+	if (is_shadow_present_pte(*sptep))
 		return 0;
 
-	if (is_large_pte(*sptep))
-		rmap_remove(vcpu->kvm, sptep);
+	WARN_ON(is_large_pte(*sptep));

Since it might still be necessary to replace a read-only large mapping
(which rmap_write_protect will not nuke) with a normal pte pointer.
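
Presumably the WARN_ON turns back into a removal for that case — a
sketch:

	/* a read-only large spte can still be present here; drop it
	 * before installing a page-table pointer in its place */
	if (is_large_pte(*sptep))
		rmap_remove(vcpu->kvm, sptep);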



Re: [patch 02/10] KVM: MMU: move local TLB flush to mmu_set_spte

2008-09-20 Thread Marcelo Tosatti
On Fri, Sep 19, 2008 at 05:21:09PM -0700, Avi Kivity wrote:
> Marcelo Tosatti wrote:
>> Since the sync page path can collapse flushes.
>>
>> Also only flush if the spte was writable before.
>>
>> Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
>>
>> @@ -1241,9 +1239,12 @@ static void mmu_set_spte(struct kvm_vcpu
>>  		}
>>  	}
>>  	if (set_spte(vcpu, shadow_pte, pte_access, user_fault, write_fault,
>> -		      dirty, largepage, gfn, pfn, speculative))
>> +		      dirty, largepage, gfn, pfn, speculative)) {
>>  		if (write_fault)
>>  			*ptwrite = 1;
>> +		if (was_writeble)
>> +			kvm_x86_ops->tlb_flush(vcpu);
>> +	}
>>
>
> I think we had cases where the spte.pfn contents changed, for example  
> when a large page was replaced by a normal page,

True. And the TLB is not flushed now for a large->normal replace, in
case the pte that's faulting is read-only. The local (and remote) TLBs
must be flushed on a large->normal replace.
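
A sketch of how that extra flush might look in mmu_set_spte
(hypothetical; the "flush" local is an assumption, the rest uses names
from this patchset):

	int flush = was_writeble;

	/* a large spte replaced by a page-table pointer changes spte.pfn,
	 * so a flush is needed even if the old mapping was read-only */
	if (is_shadow_present_pte(*shadow_pte) &&
	    is_large_pte(*shadow_pte) && !largepage)
		flush = 1;

	/* ... later, once set_spte() has installed the new entry ... */
	if (flush)
		kvm_x86_ops->tlb_flush(vcpu);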

(BTW the largepage patch is wrong, will reply to that soon).

> and also:
>
>} else if (pfn != spte_to_pfn(*shadow_pte)) {

That one is likely to crash the guest anyway, so I don't see the need
for a flush there:

> > Did you find out what's causing the errors in the first place (if
> > zap is not used)?  It worries me greatly.
> 
> Yes, the problem is that the rmap code does not handle the qemu process
> mappings vanishing while there is a present rmap. If that happens, and
> there is a fault for a gfn whose qemu mapping has been removed, a
> different physical zero page will be allocated:
> 
>  rmap a -> gfn 0 -> physical host page 0
>  mapping for gfn 0 gets removed
>  guest faults in gfn 0 through the same pte "chain"
>  rmap a -> gfn 0 -> physical host page 1
> 
> When instantiating the shadow mapping for the second time, the
> "is_rmap_pte" check succeeds, so we release the reference grabbed by
> gfn_to_page() at mmu_set_spte(). We now have a shadow mapping pointing
> to a physical page without having an additional reference on that page.
> 
> The following makes the host not crash under such a condition, but the
> condition itself is invalid, leading to inconsistent state on the
> guest. So IMHO it shouldn't be allowed to happen in the first place.



Re: Graceful shutdown for OpenBSD

2008-09-20 Thread Avi Kivity

[EMAIL PROTECTED] wrote:

> Maybe "power button pressed" or something?

Yes. If OpenBSD supports power buttons, this should work.



Re: [PATCH 7/9] register mmio slots

2008-09-20 Thread Avi Kivity

Glauber Costa wrote:

> By analysing phys_offset, we know whether a region is an mmio region
> or not. If it is, register it as such. We don't reuse the existing
> slot infrastructure, because there is a relationship between the slot
> number for kvm the kernel module and the index in the slots vector
> for libkvm. However, we can do better in the future and use only a
> single data structure for both.

Why is kvm interested in emulated mmio regions at all?



> @@ -1032,11 +1042,9 @@ void kvm_mutex_lock(void)
>  
>  int qemu_kvm_register_coalesced_mmio(target_phys_addr_t addr, unsigned int size)
>  {
> -	return kvm_register_coalesced_mmio(kvm_context, addr, size);
>  }
>  
>  int qemu_kvm_unregister_coalesced_mmio(target_phys_addr_t addr,
> 				       unsigned int size)
>  {
> -	return kvm_unregister_coalesced_mmio(kvm_context, addr, size);
>  }


Why?



Re: [PATCH 8/9] coalesce mmio regions with an explicit call

2008-09-20 Thread Avi Kivity

Glauber Costa wrote:

> Remove explicit calls to mmio coalescing. Rather, include it in the
> registration functions.
>
> index 5ae3960..2d97b34 100644
> --- a/qemu/hw/e1000.c
> +++ b/qemu/hw/e1000.c
> @@ -942,18 +942,6 @@ e1000_mmio_map(PCIDevice *pci_dev, int region_num,
>  
>      d->mmio_base = addr;
>      cpu_register_physical_memory(addr, PNPMMIO_SIZE, d->mmio_index);
> -
> -    if (kvm_enabled()) {
> -        int i;
> -        uint32_t excluded_regs[] = {
> -            E1000_MDIC, E1000_ICR, E1000_ICS, E1000_IMS,
> -            E1000_IMC, E1000_TCTL, E1000_TDT, PNPMMIO_SIZE
> -        };
> -        qemu_kvm_register_coalesced_mmio(addr, excluded_regs[0]);
> -        for (i = 0; excluded_regs[i] != PNPMMIO_SIZE; i++)
> -            qemu_kvm_register_coalesced_mmio(addr + excluded_regs[i] + 4,
> -                excluded_regs[i + 1] - excluded_regs[i] - 4);
> -    }
>  }


Where did all of this go?



Re: [PATCH 5/9] add debuging facilities to memory registration at libkvm

2008-09-20 Thread Avi Kivity

Glauber Costa wrote:

> +#ifdef DEBUG_MEMREG
> +	fprintf(stderr, "%s, memory: gpa: %llx, size: %llx, uaddr: %llx, slot: %x, flags: %lx\n",
> +		__func__, memory.guest_phys_addr, memory.memory_size,
> +		memory.userspace_addr, memory.slot, memory.flags);
> +#endif

Please wrap in a dprintf() macro or similar.
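
Something along these lines (a sketch of the usual pattern, not
existing libkvm code):

	#ifdef DEBUG_MEMREG
	#define dprintf(fmt, args...) \
		do { fprintf(stderr, "%s: " fmt, __func__, ##args); } while (0)
	#else
	#define dprintf(fmt, args...) do {} while (0)
	#endif

Then the call site shrinks to a single unconditional dprintf().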




Re: [PATCH 4/9] substitute is_allocated_mem with more general is_containing_region

2008-09-20 Thread Avi Kivity

Glauber Costa wrote:

> is_allocated_mem is a function that checks whether every relevant
> aspect of the memory slot matches (start and size). Replace it with
> a more generic function that checks whether a memory region is
> totally contained within another. The former case is also covered.

I think enabling dirty page tracking requires the slot to match exactly.



Re: [PATCH 3/9] allow intersecting region to be on the boundary.

2008-09-20 Thread Avi Kivity

Glauber Costa wrote:

> Signed-off-by: Glauber Costa <[EMAIL PROTECTED]>
> ---
>  libkvm/libkvm.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/libkvm/libkvm.c b/libkvm/libkvm.c
> index e768e44..fa65c30 100644
> --- a/libkvm/libkvm.c
> +++ b/libkvm/libkvm.c
> @@ -130,8 +130,8 @@ int get_intersecting_slot(unsigned long phys_addr)
>  	int i;
>  
>  	for (i = 0; i < KVM_MAX_NUM_MEM_REGIONS ; ++i)
> -		if (slots[i].len && slots[i].phys_addr < phys_addr &&
> -		    (slots[i].phys_addr + slots[i].len) > phys_addr)
> +		if (slots[i].len && slots[i].phys_addr <= phys_addr &&
> +		    (slots[i].phys_addr + slots[i].len) >= phys_addr)
>  			return i;
>  	return -1;

Consider:

  slots[i].phys_addr = 0
  slots[i].len = 1
  phys_addr = 1

With the new calculation, i (well, not me personally) will be
considered an intersecting slot.

Not that I (me this time) can understand how you can calculate interval
intersection without the entire interval.
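
For reference, the usual overlap test on half-open intervals
[start, start + len) needs both lengths — a minimal sketch:

	static int ranges_intersect(unsigned long a_start, unsigned long a_len,
				    unsigned long b_start, unsigned long b_len)
	{
		/* half-open intervals overlap iff each starts before
		 * the other one ends */
		return a_start < b_start + b_len && b_start < a_start + a_len;
	}

With a slot of len 1 at address 0 and a query interval starting at
address 1, this correctly reports no intersection.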

