Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On 02/16/2012 06:50 AM, Xiao Guangrong wrote:
> I think we do not need to handle all tlb-flush requests here, since all
> of these requests can be delayed to the point where mmu_lock is
> released.  We can simply do:
>
> void kvm_mmu_defer_remote_flush(struct kvm *kvm, bool need_flush)
> {
> 	if (need_flush)
> 		++kvm->tlbs_dirty;
> }
>
> void kvm_mmu_commit_remote_flush(struct kvm *kvm)
> {
> 	int dirty_count = kvm->tlbs_dirty;
>
> 	smp_mb();
> 	if (!dirty_count)
> 		return;
> 	if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
> 		++kvm->stat.remote_tlb_flush;

			<-- point A

> 	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
> }
>
> If this is ok, we only need a small change to the current code, since
> kvm_mmu_commit_remote_flush() is very similar to kvm_flush_remote_tlbs().

Suppose at point A another thread executes defer_remote_flush(),
commit_remote_flush(), and defer_remote_flush() again.  This brings the
value of tlbs_dirty back to 1, with the tlbs dirty.  The cmpxchg() then
resets tlbs_dirty, leaving the actual tlbs dirty.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On 02/16/2012 07:57 PM, Avi Kivity wrote:
> Suppose at point A another thread executes defer_remote_flush(),
> commit_remote_flush(), and defer_remote_flush() again.  This brings the
> value of tlbs_dirty back to 1, with the tlbs dirty.  The cmpxchg() then
> resets tlbs_dirty, leaving the actual tlbs dirty.

Oh, right, sorry for my carelessness!
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On 02/14/2012 09:43 PM, Marcelo Tosatti wrote:
> Also it should not be necessary for these flushes to be inside mmu_lock
> in the EPT/NPT case (since there is no write protection there).

We do write protect with TDP, if nested virt is active.  The question is
whether we have indirect pages or not, not whether TDP is active or not
(even without TDP, if you don't enable paging in the guest, you don't
have to write protect).

> But it would be awkward to differentiate the unlock position based on
> EPT/NPT.

I would really like to move the IPI back out of the lock.

How about something like a sequence lock:

	spin_lock(&kvm->mmu_lock);
	need_flush = write_protect_stuff();
	atomic_add(need_flush, &kvm->want_flush_counter);
	spin_unlock(&kvm->mmu_lock);

	while ((done = atomic_read(&kvm->done_flush_counter)) <
	       (want = atomic_read(&kvm->want_flush_counter))) {
		kvm_make_request(flush);
		atomic_cmpxchg(&kvm->done_flush_counter, done, want);
	}

This (or maybe a corrected and optimized version) ensures that any
need_flush cannot pass the while () barrier, no matter which thread
encounters it first.  However it violates the "do not invent new locking
techniques" commandment.  Can we map it to some existing method?
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On 02/15/2012 11:18 AM, Avi Kivity wrote:
> I would really like to move the IPI back out of the lock.
>
> How about something like a sequence lock:
>
> [...]
>
> This (or maybe a corrected and optimized version) ensures that any
> need_flush cannot pass the while () barrier, no matter which thread
> encounters it first.  However it violates the "do not invent new locking
> techniques" commandment.  Can we map it to some existing method?

There is no need to advance 'want' in the loop.  So we could do:

/* must call with mmu_lock held */
void kvm_mmu_defer_remote_flush(kvm, need_flush)
{
	if (need_flush)
		++kvm->flush_counter.want;
}

/* may call without mmu_lock */
void kvm_mmu_commit_remote_flush(kvm)
{
	want = ACCESS_ONCE(kvm->flush_counter.want);

	while ((done = atomic_read(&kvm->flush_counter.done)) < want) {
		kvm_make_request(flush);
		atomic_cmpxchg(&kvm->flush_counter.done, done, want);
	}
}
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On 02/15/2012 05:47 PM, Avi Kivity wrote:
> There is no need to advance 'want' in the loop.  So we could do:
>
> /* must call with mmu_lock held */
> void kvm_mmu_defer_remote_flush(kvm, need_flush)
> {
> 	if (need_flush)
> 		++kvm->flush_counter.want;
> }
>
> /* may call without mmu_lock */
> void kvm_mmu_commit_remote_flush(kvm)
> {
> 	want = ACCESS_ONCE(kvm->flush_counter.want);
>
> 	while ((done = atomic_read(&kvm->flush_counter.done)) < want) {
> 		kvm_make_request(flush);
> 		atomic_cmpxchg(&kvm->flush_counter.done, done, want);
> 	}
> }

Hmm, we already have kvm->tlbs_dirty, so we can do it like this:

#define SPTE_INVALID_UNCLEAN	(1ull << 63)

In the invalidate page path:

	lock mmu_lock
	if (spte is invalid)
		kvm->tlbs_dirty |= SPTE_INVALID_UNCLEAN;
	need_tlb_flush = kvm->tlbs_dirty;
	unlock mmu_lock
	if (need_tlb_flush)
		kvm_flush_remote_tlbs()

And in the page write-protect path:

	lock mmu_lock
	if (a spte was changed to readonly ||
	    kvm->tlbs_dirty & SPTE_INVALID_UNCLEAN)
		kvm_flush_remote_tlbs()
	unlock mmu_lock

How about this?
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On 02/15/2012 01:37 PM, Xiao Guangrong wrote:
> Hmm, we already have kvm->tlbs_dirty, so we can do it like this:
>
> #define SPTE_INVALID_UNCLEAN	(1ull << 63)
>
> [...]
>
> How about this?

Well, it still has flushes inside the lock.  And it seems to be more
complicated, but maybe that's because I thought of my idea and didn't
fully grok yours yet.
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On Wed, Feb 15, 2012 at 04:07:49PM +0200, Avi Kivity wrote:
> Well, it still has flushes inside the lock.  And it seems to be more
> complicated, but maybe that's because I thought of my idea and didn't
> fully grok yours yet.

If we go more complicated, I prefer Avi's suggestion to move them all
outside the lock.

Yesterday I was also thinking about the regular pagetables and how we
do not have similar issues there.  On the regular pagetables we just do
an unconditional flush in fork when we make them readonly, and KSM (the
other place that turns ptes readonly so they can later be COWed) uses
ptep_clear_flush, which does an unconditional flush, and furthermore it
does it inside the PT lock.  So generally we don't optimize for those
things on the regular pagetables.  But then these events don't happen
as frequently as they can on KVM without EPT/NPT.
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On 02/15/2012 10:07 PM, Avi Kivity wrote:
> On 02/15/2012 01:37 PM, Xiao Guangrong wrote:
>> Hmm, we already have kvm->tlbs_dirty, so we can do it like this:
>>
>> [...]
>>
>> How about this?
>
> Well, it still has flushes inside the lock.  And it seems to be more
> complicated, but maybe that's because I thought of my idea and didn't
> fully grok yours yet.

Oh, I was not thinking of flushing all tlbs outside of mmu_lock, just
in the invalidate page path.  But there are still some paths that need
to flush tlbs inside mmu_lock (like sync_children, get_page).

In your code:

/* must call with mmu_lock held */
void kvm_mmu_defer_remote_flush(kvm, need_flush)
{
	if (need_flush)
		++kvm->flush_counter.want;
}

/* may call without mmu_lock */
void kvm_mmu_commit_remote_flush(kvm)
{
	want = ACCESS_ONCE(kvm->flush_counter.want);

	while ((done = atomic_read(&kvm->flush_counter.done)) < want) {
		kvm_make_request(flush);
		atomic_cmpxchg(&kvm->flush_counter.done, done, want);
	}
}

I think we do not need to handle all tlb-flush requests here, since all
of these requests can be delayed to the point where mmu_lock is
released.  We can simply do:

void kvm_mmu_defer_remote_flush(struct kvm *kvm, bool need_flush)
{
	if (need_flush)
		++kvm->tlbs_dirty;
}

void kvm_mmu_commit_remote_flush(struct kvm *kvm)
{
	int dirty_count = kvm->tlbs_dirty;

	smp_mb();
	if (!dirty_count)
		return;
	if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
		++kvm->stat.remote_tlb_flush;
	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
}

If this is ok, we only need a small change to the current code, since
kvm_mmu_commit_remote_flush() is very similar to kvm_flush_remote_tlbs().
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On Fri, Feb 10, 2012 at 03:28:31PM +0900, Takuya Yoshikawa wrote:
> Other threads may process the same page in that small window and skip
> TLB flush and then return before these functions do flush.

It's correct to flush the shadow MMU TLB without the mmu_lock only in
the context of mmu notifier methods.  So the below, while it won't
hurt, is a performance regression and shouldn't be applied (and it
obfuscates the code by not being strict anymore).

To the contrary, every other place that does a shadow/secondary MMU smp
tlb flush _must_ happen inside the mmu_lock, otherwise the
serialization isn't correct anymore against the very mmu_lock taken in
the below quoted patch by kvm_mmu_notifier_invalidate_page/
kvm_mmu_notifier_invalidate_range_start.

The explanation is in commit 4539b35881ae9664b0e2953438dd83f5ee02c0b4.

I'll try to explain it more clearly: the moment you drop mmu_lock,
pages can be freed.  So if you invalidate a spte in any place inside
the KVM code (except the mmu notifier methods, where a reference on the
page is implicitly held by the caller and so the page can't go away
under a mmu notifier method by design), then the below
kvm_mmu_notifier_invalidate_page/kvm_mmu_notifier_invalidate_range_start
won't get their need_tlb_flush set anymore, and they won't run the tlb
flush before freeing the page.

So every other place (except mmu notifier) must flush the secondary MMU
smp tlb _before_ releasing the mmu_lock.  Only mmu notifier is safe to
flush the secondary MMU TLB _after_ releasing the mmu_lock.

If we changed the mmu notifier methods to unconditionally flush the
shadow TLB (regardless of whether a spte was present or not), we might
not need to hold the mmu_lock for every tlb flush outside the context
of the mmu notifier methods.  But then the mmu notifier methods would
become more expensive; I didn't fully evaluate what the side effects of
such a change would be.
Also note, only kvm_mmu_notifier_invalidate_page and
kvm_mmu_notifier_invalidate_range_start would need to do that, because
they're the only two where the page reference gets dropped.

Even shorter: because with the mmu notifier an implicit reference on
the page exists and is held by the caller, they can flush outside the
mmu_lock.  Every other place in KVM holds an implicit valid reference
on the page only as long as you hold the mmu_lock, or while a spte is
still established.

Well, it's not easy logic, so it's not surprising it wasn't totally
clear.  It's probably not heavily documented, and the fact you changed
it is still good so we refresh our minds on the exact rules of mmu
notifier locking, thanks!

Andrea

> Signed-off-by: Takuya Yoshikawa <yoshikawa.tak...@oss.ntt.co.jp>
> ---
>  virt/kvm/kvm_main.c |   19 ++++++++++---------
>  1 files changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 470e305..2b4bc77 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -289,15 +289,15 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>  	 */
>  	idx = srcu_read_lock(&kvm->srcu);
>  	spin_lock(&kvm->mmu_lock);
> +	kvm->mmu_notifier_seq++;
>  	need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
> -	spin_unlock(&kvm->mmu_lock);
> -	srcu_read_unlock(&kvm->srcu, idx);
> -
>  	/* we've to flush the tlb before the pages can be freed */
>  	if (need_tlb_flush)
>  		kvm_flush_remote_tlbs(kvm);
> +	spin_unlock(&kvm->mmu_lock);
> +	srcu_read_unlock(&kvm->srcu, idx);
>  }
>
>  static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> @@ -335,12 +335,12 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	for (; start < end; start += PAGE_SIZE)
>  		need_tlb_flush |= kvm_unmap_hva(kvm, start);
>  	need_tlb_flush |= kvm->tlbs_dirty;
> -	spin_unlock(&kvm->mmu_lock);
> -	srcu_read_unlock(&kvm->srcu, idx);
> -
>  	/* we've to flush the tlb before the pages can be freed */
>  	if (need_tlb_flush)
>  		kvm_flush_remote_tlbs(kvm);
> +
> +	spin_unlock(&kvm->mmu_lock);
> +	srcu_read_unlock(&kvm->srcu, idx);
>  }
>
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> @@ -378,13 +378,14 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
>  	idx = srcu_read_lock(&kvm->srcu);
>  	spin_lock(&kvm->mmu_lock);
> -	young = kvm_age_hva(kvm, address);
> -	spin_unlock(&kvm->mmu_lock);
> -	srcu_read_unlock(&kvm->srcu, idx);
> +	young = kvm_age_hva(kvm, address);
>  	if (young)
>  		kvm_flush_remote_tlbs(kvm);
> +	spin_unlock(&kvm->mmu_lock);
> +	srcu_read_unlock(&kvm->srcu, idx);
> +
>  	return young;
>  }
> -- 
> 1.7.5.4
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On Fri, Feb 10, 2012 at 03:52:49PM +0800, Xiao Guangrong wrote:
> On 02/10/2012 02:28 PM, Takuya Yoshikawa wrote:
>> Other threads may process the same page in that small window and skip
>> TLB flush and then return before these functions do flush.
>
> Is it possible to flush the tlb inside the mmu_lock only when a
> writable spte is invalidated?  Sometimes kvm_flush_remote_tlbs needs a
> long time to wait.

Readonly isn't enough to defer the flush until after mmu_lock is
released...  If you do it only for writable sptes, then what can happen
is that the guest may read random data and crash.

However, for this case the mmu_notifier methods (and only them) are
perfectly safe to flush the shadow MMU TLB after the mmu_lock is
released, because the page reference is guaranteed to be held by the
caller.  That is not the case for any other place where a spte gets
dropped in KVM: all other places dropping sptes can only rely on the
mmu notifier blocking on the mmu_lock in order to have a guarantee of
the page not being freed under them.  So in every other place the
shadow MMU TLB flush must happen before releasing the mmu_lock, so that
the mmu notifier will wait and prevent the page from being freed until
all other CPUs running in guest mode have stopped accessing it.
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On Tue, Feb 14, 2012 at 06:10:44PM +0100, Andrea Arcangeli wrote:
> It's correct to flush the shadow MMU TLB without the mmu_lock only in
> the context of mmu notifier methods.
>
> [...]
>
> Well, it's not easy logic, so it's not surprising it wasn't totally
> clear.  It's probably not heavily documented, and the fact you changed
> it is still good so we refresh our minds on the exact rules of mmu
> notifier locking, thanks!

The problem the patch is fixing is not related to page freeing, but to
rmap_write_protect.  From 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d
(replace A (get_dirty_log) with mmu_notifier_invalidate_page):

    During protecting pages for dirty logging, other threads may also
    try to protect a page in mmu_sync_children() or kvm_mmu_get_page().

    In such a case, because get_dirty_log releases mmu_lock before
    flushing TLB's, the following race condition can happen:

      A (get_dirty_log)        B (another thread)

      lock(mmu_lock)
      clear pte.w
      unlock(mmu_lock)
                               lock(mmu_lock)
                               pte.w is already cleared
                               unlock(mmu_lock)
                               skip TLB flush
                               return
      ...
      TLB flush

    Though thread B assumes the page has already been protected when it
    returns, the remaining TLB entry will break that assumption.
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On Tue, Feb 14, 2012 at 03:29:47PM -0200, Marcelo Tosatti wrote:
> The problem the patch is fixing is not related to page freeing, but to
> rmap_write_protect.  From 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d

Can't find the commit on kvm.git.

> (replace A (get_dirty_log) with mmu_notifier_invalidate_page):
>
>     In such a case, because get_dirty_log releases mmu_lock before
>     flushing TLB's, the following race condition can happen:
>
>     [...]
>                                skip TLB flush

Not sure which tree it is, but in kvm and upstream I see an
unconditional tlb flush here, not a skip (both
kvm_mmu_slot_remove_write_access and kvm_mmu_rmap_write_protect).  So I
assume this has been updated in your tree to be conditional.

Also note kvm_mmu_rmap_write_protect flushes outside of the mmu_lock
(like in the quoted description), so two write_protect_slot runs in
parallel against each other may not be ok; but that may be enforced by
design if qemu won't ever call that ioctl from two different userland
threads (it doesn't sound security related, so it should be ok to
enforce its safety by userland design).

>     Though thread B assumes the page has already been protected when
>     it returns, the remaining TLB entry will break that assumption.

Now I get the question of why not running the TLB flush inside the
mmu_lock only if the spte was writable :).

kvm_mmu_get_page, as long as it only runs in the context of a kvm page
fault, is ok, because the page fault would be inhibited by the mmu
notifier invalidates, so maybe it's safe.  mmu_sync_children seems to
have a problem instead; in your tree get_dirty_log also has an issue if
it has been updated to skip the flush on readonly sptes, like I guess.
Interesting how the spte is already non-present and the page is just
going to be freed shortly later, but we still need to trigger write
faults synchronously and prevent other CPUs in guest mode from further
modifying the page, to avoid losing dirty bit updates or updates to
pagetables that map pagetables in the non-NPT/EPT case.  If the page
was really only going to be freed, it would be ok if the other cpus
kept writing to it for a little longer until the page was freed.

Like I wrote in the previous email, I was thinking that if we changed
the mmu notifier methods to do an unconditional flush, then every other
flush could also run outside of the mmu_lock.  But then I didn't think
about this enough to be sure.

My guess is we could move all flushes outside the mmu_lock if we stop
flushing the tlb conditionally on the current spte values.  It'd
clearly be slower for a UP guest though :).  Large SMP guests might
benefit, if that is feasible at all...  It depends how problematic the
mmu_lock is on large SMP guests and how much we're saving by doing
conditional TLB flushes.
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On Tue, Feb 14, 2012 at 07:53:56PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 14, 2012 at 03:29:47PM -0200, Marcelo Tosatti wrote:
>> The problem the patch is fixing is not related to page freeing, but
>> to rmap_write_protect.  From 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d
>
> Can't find the commit on kvm.git.

Sorry, we got kvm.git out of sync.  But you can see an equivalent
below.

> Not sure which tree it is, but in kvm and upstream I see an
> unconditional tlb flush here, not a skip (both
> kvm_mmu_slot_remove_write_access and kvm_mmu_rmap_write_protect).  So
> I assume this has been updated in your tree to be conditional.

	if (!direct) {
		if (rmap_write_protect(vcpu->kvm, gfn))
			kvm_flush_remote_tlbs(vcpu->kvm);

> Also note kvm_mmu_rmap_write_protect flushes outside of the mmu_lock
> (like in the quoted description), so two write_protect_slot runs in
> parallel against each other may not be ok [...]

Yes, here is the fix:

http://git.kernel.org/?p=virt/kvm/kvm.git;a=commit;h=02b48d00d7f1853bdf8a06da19ca5413ebe334c6

This is an equivalent of 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d.
> Now I get the question of why not running the TLB flush inside the
> mmu_lock only if the spte was writable :).
>
> kvm_mmu_get_page, as long as it only runs in the context of a kvm page
> fault, is ok, because the page fault would be inhibited by the mmu
> notifier invalidates, so maybe it's safe.

Ah, perhaps, but this was not taken into account before.  Can you
confirm this is the case so we can revert the invalidate_page patch?

> mmu_sync_children seems to have a problem instead; in your tree
> get_dirty_log also has an issue if it has been updated to skip the
> flush on readonly sptes, like I guess.
>
> [...]
>
> It depends how problematic the mmu_lock is on large SMP guests and how
> much we're saving by doing conditional TLB flushes.

Also it should not be necessary for these flushes to be inside mmu_lock
in the EPT/NPT case (since there is no write protection there).  But it
would be awkward to differentiate the unlock position based on EPT/NPT.
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
(2012/02/10 16:52), Xiao Guangrong wrote: On 02/10/2012 02:28 PM, Takuya Yoshikawa wrote: Other threads may process the same page in that small window and skip TLB flush and then return before these functions do flush.

Is it possible to flush the tlb inside the mmu lock only when a writeable spte is invalidated? Sometimes, kvm_flush_remote_tlbs needs a long time to wait.

I am sorry, but I don't see the point of adding yet another complexity for that.

OK, let me clarify my position. When I made this patch, I did a simple test to see if the mmu notifier would work normally after my patch was applied: I allocated large memory behind the guest and made the host call mmu_notifier functions. In that situation the host as a whole became slow, e.g. I saw really bad response for some seconds.

In such a case, do you mind the additional time VCPU threads may have to wait until the mmu notifier finishes the flush, only when they are trying to take the mmu_lock?

Takuya
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On Fri, Feb 10, 2012 at 03:28:31PM +0900, Takuya Yoshikawa wrote: Other threads may process the same page in that small window and skip TLB flush and then return before these functions do flush.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 virt/kvm/kvm_main.c | 19 ++-
 1 files changed, 10 insertions(+), 9 deletions(-)

Applied, thanks.
[PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
Other threads may process the same page in that small window and skip TLB flush and then return before these functions do flush.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 virt/kvm/kvm_main.c | 19 ++-
 1 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 470e305..2b4bc77 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -289,15 +289,15 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 	 */
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
+
 	kvm->mmu_notifier_seq++;
 	need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
-	spin_unlock(&kvm->mmu_lock);
-	srcu_read_unlock(&kvm->srcu, idx);
-
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush)
 		kvm_flush_remote_tlbs(kvm);
+
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
 }
 
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
@@ -335,12 +335,12 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	for (; start < end; start += PAGE_SIZE)
 		need_tlb_flush |= kvm_unmap_hva(kvm, start);
 	need_tlb_flush |= kvm->tlbs_dirty;
-	spin_unlock(&kvm->mmu_lock);
-	srcu_read_unlock(&kvm->srcu, idx);
-
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush)
 		kvm_flush_remote_tlbs(kvm);
+
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
@@ -378,13 +378,14 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
-	young = kvm_age_hva(kvm, address);
-	spin_unlock(&kvm->mmu_lock);
-	srcu_read_unlock(&kvm->srcu, idx);
 
+	young = kvm_age_hva(kvm, address);
 	if (young)
 		kvm_flush_remote_tlbs(kvm);
 
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
+
 	return young;
 }
-- 
1.7.5.4
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On 02/10/2012 02:28 PM, Takuya Yoshikawa wrote: Other threads may process the same page in that small window and skip TLB flush and then return before these functions do flush.

Is it possible to flush the tlb inside the mmu lock only when a writeable spte is invalidated? Sometimes, kvm_flush_remote_tlbs needs a long time to wait.
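One reading of Xiao's suggestion, as a kernel-style sketch. kvm_unmap_hva_report_writable is hypothetical, a variant of kvm_unmap_hva() imagined to report whether a writable spte was torn down; whether the no-writable-spte case could really skip or defer its flush is exactly the open question in this thread:

```
static void invalidate_page_sketch(struct kvm *kvm, unsigned long address)
{
	int zapped_writable;

	spin_lock(&kvm->mmu_lock);
	/* Hypothetical helper: like kvm_unmap_hva(), but nonzero only
	 * if a *writable* spte was zapped. */
	zapped_writable = kvm_unmap_hva_report_writable(kvm, address);
	if (zapped_writable | kvm->tlbs_dirty)
		kvm_flush_remote_tlbs(kvm);	/* still under mmu_lock */
	spin_unlock(&kvm->mmu_lock);
	/* Read-only sptes cannot leave writable TLB entries behind, so
	 * the common case would no longer hold mmu_lock across the IPI,
	 * shortening the wait Xiao is worried about. */
}
```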