On Mon, Dec 10, 2007 at 11:27:45PM +0200, Avi Kivity wrote:
> Marcelo Tosatti wrote:
> > On Mon, Dec 10, 2007 at 07:07:54PM +0200, Avi Kivity wrote:
> >   
> >> Marcelo Tosatti wrote:
> >>     
> >>> There is a race where VCPU0 is shadowing a pagetable entry while VCPU1
> >>> is updating it, which results in a stale shadow copy.
> >>>
> >>> Fix that by comparing the contents of the cached guest pte with the
> >>> current guest pte after write-protecting the guest pagetable.
> >>>
> >>> Attached program kvm_shadow_race.c demonstrates the problem.
> >>>
> >>>  
> >>>       
> >> Where is it?
> >>     
> >
> > Attached.
> >
> >   
> 
> Can you explain what it does?  I get the same results on both host and 
> guest (successful completion).

What it does is:

1) mmaps a 128MB file with (PROT_READ|PROT_WRITE)
2) starts 32 threads, each reading its 128MB/32 slice of the region
3) calls mprotect(PROT_READ) on that region

2) and 3) run simultaneously.
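For reference, a minimal sketch along those lines (the real
kvm_shadow_race.c is attached upthread; the file name, per-page stride
and exit message here are illustrative):

/*
 * Sketch of the reproducer described above.
 * Build with: gcc -O2 -o shadow_race shadow_race.c -lpthread
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_LEN  (128UL << 20)            /* 128MB mapping */
#define NTHREADS 32
#define SLICE    (MAP_LEN / NTHREADS)     /* 4MB per thread */

static char *map;
static volatile unsigned long sink;       /* keep the reads alive */

static void *reader(void *arg)
{
        char *p = map + (long)arg * SLICE;
        unsigned long sum = 0;
        size_t i;

        for (i = 0; i < SLICE; i += 4096)  /* touch every page */
                sum += p[i];
        sink += sum;
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        long i;
        int fd = open("shadow_race.dat", O_RDWR | O_CREAT, 0644);

        if (fd < 0 || ftruncate(fd, MAP_LEN) < 0) {
                perror("setup");
                return 1;
        }
        map = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, reader, (void *)i);

        /* races with the readers faulting their slices in */
        if (mprotect(map, MAP_LEN, PROT_READ) < 0)
                perror("mprotect");

        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        close(fd);
        printf("completed\n");
        return 0;
}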

I added a printk inside the fault handler to read back and compare the
original guest pte with the one just shadowed, and you can sometimes
see them differ: the shadowed pte has the writable bit set but the
original doesn't. It takes about 10-20 runs of the program to hit the
race on a 4-way host with a 4-way guest. The program always completes
successfully even when the race happens.

It could be a scenario similar to the one you mentioned earlier: the
guest kernel is nuking a pte to reclaim a page while another vcpu is
instantiating the shadow pte for it, in which case there would be a
shadow mapping for a now-freed page, resulting in data corruption.

To reclaim the page, the guest kernel sets the pte to zero and then
flushes the TLB, but under KVM that TLB flush does nothing to the
stale shadow entry.
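To make the re-check concrete, here is a rough sketch of the kind of
comparison the patch description implies. This is illustrative only,
not the actual hunk: the helper name is made up, and it assumes the
gpa-based kvm_read_guest() accessor.

/*
 * After write-protecting the guest page table, re-read the guest pte
 * and compare it against the copy the walker cached earlier.
 */
static int FNAME(guest_pte_stale)(struct kvm_vcpu *vcpu,
                                  struct guest_walker *walker)
{
        pt_element_t cur_pte;

        if (kvm_read_guest(vcpu->kvm, walker->pte_gpa,
                           &cur_pte, sizeof(cur_pte)))
                return 1;       /* can't read it back: assume stale */

        /*
         * If they differ (e.g. the cached copy has the writable bit
         * but the guest has since cleared it), the shadow pte we just
         * instantiated is stale and must be thrown away.
         */
        return cur_pte != walker->pte;
}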

> >>> diff --git a/drivers/kvm/paging_tmpl.h b/drivers/kvm/paging_tmpl.h
> >>> index 72d4816..4fece01 100644
> >>> --- a/drivers/kvm/paging_tmpl.h
> >>> +++ b/drivers/kvm/paging_tmpl.h
> >>> @@ -66,6 +66,7 @@ struct guest_walker {
> >>>   int level;
> >>>   gfn_t table_gfn[PT_MAX_FULL_LEVELS];
> >>>   pt_element_t pte;
> >>> + gpa_t pte_gpa;
> >>>  
> >>>       
> >> I think this needs to be an array like table_gfn[].  The guest may play 
> >> with the pde (and upper entries) as well as the pte.
> >>     
> >
> > I was working under the assumption that the only significant bits of
> > upper entries (WRITEABLE and PRESENT) that the guest can change must
> > first be reflected in the lower-level ptes.
> >
> > Isn't that a fair assumption to make?
> >
> >   
> 
> The other bits (including the physical addresses) may change too.  There
> is no requirement that changes to the pde write/present bits be
> reflected in the pte write/present bits.
>
> Consider a Unix kernel implementing fork() by write-protecting the pud
> tables.  It can write-protect the entire user address space by clearing
> the write bit on the first 256 pgd entries.
> 
> (I don't think Linux does that; maybe that's a worthwhile optimization)

OK, I'll do as you suggest and compare at mmu_get_page() just after
shadowing (which also gets rid of the copy/test when the page is
already shadowed).

Thanks

