On Fri, Jun 27, 2008 at 07:43:17PM -0300, Marcelo Tosatti wrote:
> These two functions share the same memslot iteration code, you could 
> avoid duplication.

Ok, I'll look into cleaning this up a bit.

> > +   if (unlikely(atomic_read(&vcpu->kvm->mmu_notifier_count)))
> > +           return;
> > +   smp_rmb();
> 
> I don't think you need smp_rmb() on x86 since atomic operations 
> serialize. barrier() should suffice.

atomic_read() doesn't issue any atomic op, and reads are out of order on
x86, so the smp_rmb() is needed.

> Wrap this sequence in a well documented function?

It would be nice cleanup, agreed.
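
Something like the below is what I have in mind (just a sketch, the
helper name is invented and "mmu_seq" is assumed to be the
mmu_notifier_seq value sampled before get_user_pages):

/*
 * Return 1 if the spte must not be established because an invalidate
 * ran (or is still running) since mmu_seq was sampled.
 */
static inline int kvm_mmu_invalidate_retry(struct kvm *kvm, int mmu_seq)
{
	if (unlikely(atomic_read(&kvm->mmu_notifier_count)))
		return 1;
	/*
	 * Read count before seq, pairing with the
	 * seq-increase-before-count-decrease write order in range_end.
	 */
	smp_rmb();
	if (atomic_read(&kvm->mmu_notifier_seq) != mmu_seq)
		return 1;
	return 0;
}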

> > +static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > +                                               struct mm_struct *mm,
> > +                                               unsigned long start,
> > +                                               unsigned long end)
> > +{
> > +   struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > +   int need_tlb_flush = 0;
> > +
> > +   /*
> > +    * The count increase must become visible at unlock time as no
> > +    * spte can be established without taking the mmu_lock and
> > +    * count is also read inside the mmu_lock critical section.
> > +    */
> > +   atomic_inc(&kvm->mmu_notifier_count);
> > +
> > +   spin_lock(&kvm->mmu_lock);
> > +   for (; start < end; start += PAGE_SIZE)
> > +           need_tlb_flush |= kvm_unmap_hva(kvm, start);
> > +   spin_unlock(&kvm->mmu_lock);
> 
> You don't handle large mappings here at all, which means that there
> might be external mappings even after ->range_start, ->range_end.

Well, if I don't handle large mappings, it won't be because of the above
function. The above function is common code; it doesn't know anything
about the low-level spte implementation. If anything, it's
kvm_unmap_hva's business to handle large mappings. I thought it did,
but maybe not, and then we need to fix it.

> This is not a problem now because QEMU kills all the shadow mappings
> before munmap() on hugetlbfs, but it will be a practical problem if
> ballooning supports largepages (which will probably happen in the
> future), or with fancy hugetlb features.

We need to handle large mappings too, especially if we intend to
remove page pinning for them as well.

> > +   atomic_inc(&kvm->mmu_notifier_seq);
> > +   /*
> > +    * The sequence increase must be visible before count
> > +    * decrease. The page fault has to read count before sequence
> > +    * for this write order to be effective.
> > +    */
> > +   wmb();
> 
> smp_mb_after_atomic_inc ?

Sure, that's more fine-grained, agreed.
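
I.e. something along these lines in range_end (sketch only):

	atomic_inc(&kvm->mmu_notifier_seq);
	/*
	 * The seq increase must be visible before the count decrease;
	 * this pairs with the smp_rmb() between the count read and the
	 * seq read on the page fault side.
	 */
	smp_mb__after_atomic_inc();
	atomic_dec(&kvm->mmu_notifier_count);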

> Is it worth to flush remote TLB's just due to the young bit? Aging
> can happen often.

This is more important for the Linux ptes than for the sptes. Because we
do it for the main pagetables, I do it for the secondary ones too, to be
consistent. For both, if we don't flush the TLB, the young bit will
never be set again until the entry is evicted from the TLB for some
other reason.

> - mmu_notifier_count could be a non-atomic type (range_end() does not grab
> mmu_lock but could).

Yes, I considered this but I preferred not to take the lock in
range_end. Not sure if it makes any real difference but this looks
nicer to me.

> - why the MMU notifier API pass mm_struct instead of vma ? As it stands,
>   VM pte aging/swapping/nuking of QEMU non-guest mappings interferes with
>   guest pagefault processing for no reason.

It doesn't interfere with guest pagefaults, with the special exception
of do_wp_page, which is far from a fast path for KVM unless we truly
need the mmu notifier invoked anyway. The memslot <-> vma relation
isn't one-to-one. Registering in the vma would waste more RAM and
complicate the unregister path. Initially we considered doing it in the
vma, but that idea was dropped quickly.

> - isnt the logic susceptible to mmu_seq wraparound ? :-)

You mean a false negative because 4G (2^32) invalidate events happened
between get_user_pages and the spte establishment? I doubt it's a
practical concern, but I like theoretically safe things, so that would
be the best argument to make mmu_notifier_seq 64bit instead of an
atomic_t.
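
Roughly (sketch only, glossing over how the 64bit value would be read
atomically on 32bit hosts):

	u64 mmu_seq;

	mmu_seq = kvm->mmu_notifier_seq;
	smp_rmb();
	pfn = gfn_to_pfn(kvm, gfn);

	spin_lock(&kvm->mmu_lock);
	if (kvm->mmu_notifier_seq != mmu_seq)
		/* an invalidate ran in between, don't map the stale pfn */
		goto out_unlock;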