Re: hrtimer possible issue
On 02/05/2013 12:20 PM, Thomas Gleixner wrote:

On Mon, 4 Feb 2013, Leonid Shatz wrote:

I assume the race can also happen between hrtimer cancel and start. In both cases a timer base switch can happen.

Izik, please check if you can arrange the patch in the standard format (do we need to do it against the latest kernel version?)

Yes please.

But I sent it already yesterday with git-send-email and checkpatch.pl :-)

Thanks,
	tglx

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH] fix hrtimer_enqueue_reprogram race
From: Leonid Shatz

It seems like hrtimer_enqueue_reprogram() contains a race which could result in a timer.base switch during an unlock/lock sequence. See the code at __hrtimer_start_range_ns() where it calls hrtimer_enqueue_reprogram(). The latter releases the lock protecting the timer base for a short time, and a timer base switch can occur from a different CPU thread during that window. Later, when __hrtimer_start_range_ns() calls unlock_hrtimer_base(), a base switch may already have happened, and this causes the bug.

To reproduce: try to start the same hrtimer from two different threads in the kernel, each running on a different CPU. Eventually one of the calls will cause a timer base switch while the other thread is not expecting it. This can happen in a virtualized environment where one thread can be delayed by the underlying hypervisor; due to the time delay, a different CPU takes care of the missed timer start and runs the timer start logic on its own.

Signed-off-by: Leonid Shatz
Signed-off-by: Izik Eidus
---
 kernel/hrtimer.c |   32 ++++++++++++++------------------
 1 file changed, 14 insertions(+), 18 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 6db7a5e..0c8c6cd 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -640,21 +640,9 @@ static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base)
  * and expiry check is done in the hrtimer_interrupt or in the softirq.
  */
 static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer,
-					    struct hrtimer_clock_base *base,
-					    int wakeup)
+					    struct hrtimer_clock_base *base)
 {
-	if (base->cpu_base->hres_active && hrtimer_reprogram(timer, base)) {
-		if (wakeup) {
-			raw_spin_unlock(&base->cpu_base->lock);
-			raise_softirq_irqoff(HRTIMER_SOFTIRQ);
-			raw_spin_lock(&base->cpu_base->lock);
-		} else
-			__raise_softirq_irqoff(HRTIMER_SOFTIRQ);
-
-		return 1;
-	}
-
-	return 0;
+	return base->cpu_base->hres_active && hrtimer_reprogram(timer, base);
 }
 
 static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
@@ -735,8 +723,7 @@ static inline int hrtimer_switch_to_hres(void) { return 0; }
 static inline void hrtimer_force_reprogram(struct hrtimer_cpu_base *base,
 					   int skip_equal) { }
 static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer,
-					    struct hrtimer_clock_base *base,
-					    int wakeup)
+					    struct hrtimer_clock_base *base)
 {
 	return 0;
 }
@@ -995,8 +982,17 @@ int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 	 *
 	 * XXX send_remote_softirq() ?
 	 */
-	if (leftmost && new_base->cpu_base == &__get_cpu_var(hrtimer_bases))
-		hrtimer_enqueue_reprogram(timer, new_base, wakeup);
+	if (leftmost && new_base->cpu_base == &__get_cpu_var(hrtimer_bases)
+		&& hrtimer_enqueue_reprogram(timer, new_base)) {
+		if (wakeup) {
+			raw_spin_unlock(&new_base->cpu_base->lock);
+			raise_softirq_irqoff(HRTIMER_SOFTIRQ);
+			local_irq_restore(flags);
+			return ret;
+		} else {
+			__raise_softirq_irqoff(HRTIMER_SOFTIRQ);
+		}
+	}
 
 	unlock_hrtimer_base(timer, &flags);
-- 
1.7.10.4
Re: [PATCH 0/11] ksm: NUMA trees and page migration
On 01/29/2013 02:49 AM, Izik Eidus wrote:

On 01/29/2013 01:54 AM, Andrew Morton wrote:

On Fri, 25 Jan 2013 17:53:10 -0800 (PST) Hugh Dickins wrote:

Here's a KSM series

Sanity check: do you have a feeling for how useful KSM is? Performance/space improvements for typical (or atypical) workloads? Are people using it? Successfully? IOW, is it justifying itself?

Hi, I think it is mostly used for virtualization. I know at least two products that use it: RHEV (Red Hat Enterprise Virtualization) and my current place (Ravello Systems), which uses it for VM consolidation on top of cloud environments (running multiple unmodified VMs on top of one VM you get from EC2 / Rackspace / whatever). For Ravello it is highly critical in achieving a high consolidation ratio...

BTW, after thinking a bit about the word "people", I wanted to see if normal users of Linux, who just download and install Linux (without using a special virtualization product), are able to use it. So I googled a bit and found some nice results from users: http://serverascode.com/2012/11/11/ksm-kvm.html

But I do agree that it provides justifying value only for virtualization users...
Re: [PATCH 0/11] ksm: NUMA trees and page migration
On 01/29/2013 01:54 AM, Andrew Morton wrote:

On Fri, 25 Jan 2013 17:53:10 -0800 (PST) Hugh Dickins wrote:

Here's a KSM series

Sanity check: do you have a feeling for how useful KSM is? Performance/space improvements for typical (or atypical) workloads? Are people using it? Successfully? IOW, is it justifying itself?

Hi, I think it is mostly used for virtualization. I know at least two products that use it: RHEV (Red Hat Enterprise Virtualization) and my current place (Ravello Systems), which uses it for VM consolidation on top of cloud environments (running multiple unmodified VMs on top of one VM you get from EC2 / Rackspace / whatever). For Ravello it is highly critical in achieving a high consolidation ratio...
Re: [kvm-devel] [PATCH 7/8] MMU: add TDP support to the KVM MMU
Joerg Roedel wrote:

On Thu, Feb 07, 2008 at 03:27:19PM +0200, Izik Eidus wrote:

Joerg Roedel wrote:

This patch contains the changes to the KVM MMU necessary for support of the Nested Paging feature in AMD Barcelona and Phenom Processors.

Good patch, it looks like things will be very fixable with it.

Signed-off-by: Joerg Roedel <[EMAIL PROTECTED]>
---
 arch/x86/kvm/mmu.c |   79 ++--
 arch/x86/kvm/mmu.h |    6
 2 files changed, 82 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5e76963..5304d55 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1081,6 +1081,7 @@ static void mmu_alloc_roots(struct kvm_vcpu *vcpu)
 	int i;
 	gfn_t root_gfn;
 	struct kvm_mmu_page *sp;
+	int metaphysical = 0;
 
 	root_gfn = vcpu->arch.cr3 >> PAGE_SHIFT;
 
@@ -1089,14 +1090,20 @@ static void mmu_alloc_roots(struct kvm_vcpu *vcpu)
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 
 		ASSERT(!VALID_PAGE(root));
+		if (tdp_enabled)
+			metaphysical = 1;
 		sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
-				      PT64_ROOT_LEVEL, 0, ACC_ALL, NULL, NULL);
+				      PT64_ROOT_LEVEL, metaphysical,
+				      ACC_ALL, NULL, NULL);
 		root = __pa(sp->spt);
 		++sp->root_count;
 		vcpu->arch.mmu.root_hpa = root;
 		return;
 	}
 #endif
+	metaphysical = !is_paging(vcpu);
+	if (tdp_enabled)
+		metaphysical = 1;
 	for (i = 0; i < 4; ++i) {
 		hpa_t root = vcpu->arch.mmu.pae_root[i];
@@ -1110,7 +1117,7 @@ static void mmu_alloc_roots(struct kvm_vcpu *vcpu)
 		} else if (vcpu->arch.mmu.root_level == 0)
 			root_gfn = 0;
 		sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
-				      PT32_ROOT_LEVEL, !is_paging(vcpu),
+				      PT32_ROOT_LEVEL, metaphysical,
 				      ACC_ALL, NULL, NULL);
 		root = __pa(sp->spt);
 		++sp->root_count;
@@ -1144,6 +1151,36 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 			     error_code & PFERR_WRITE_MASK, gfn);
 }
 
+static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
+			  u32 error_code)

you probably mean gpa_t ?

Yes. But the function is assigned to a function pointer, and the type of that pointer expects gva_t there. So I named the parameter gpa to describe that a guest physical address is meant there.

+{
+	struct page *page;
+	int r;
+
+	ASSERT(vcpu);
+	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
+
+	r = mmu_topup_memory_caches(vcpu);
+	if (r)
+		return r;
+
+	down_read(&current->mm->mmap_sem);
+	page = gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT);
+	if (is_error_page(page)) {
+		kvm_release_page_clean(page);
+		up_read(&current->mm->mmap_sem);
+		return 1;
+	}

I don't know if it is worth checking it here; in the worst case we will map the error page and the host will be safe.

Looking at the nonpaging_map function, it is the right place to check for the error page.

Thinking about it again, you are right (for some reason I was thinking about old kvm code that was replaced already); the is_error_page check should be here.

	Joerg Roedel

-- 
woof.
Re: [kvm-devel] [PATCH 7/8] MMU: add TDP support to the KVM MMU
Joerg Roedel wrote:

This patch contains the changes to the KVM MMU necessary for support of the Nested Paging feature in AMD Barcelona and Phenom Processors.

Good patch, it looks like things will be very fixable with it.

Signed-off-by: Joerg Roedel <[EMAIL PROTECTED]>
---
 arch/x86/kvm/mmu.c |   79 ++--
 arch/x86/kvm/mmu.h |    6
 2 files changed, 82 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5e76963..5304d55 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1081,6 +1081,7 @@ static void mmu_alloc_roots(struct kvm_vcpu *vcpu)
 	int i;
 	gfn_t root_gfn;
 	struct kvm_mmu_page *sp;
+	int metaphysical = 0;
 
 	root_gfn = vcpu->arch.cr3 >> PAGE_SHIFT;
 
@@ -1089,14 +1090,20 @@ static void mmu_alloc_roots(struct kvm_vcpu *vcpu)
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 
 		ASSERT(!VALID_PAGE(root));
+		if (tdp_enabled)
+			metaphysical = 1;
 		sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
-				      PT64_ROOT_LEVEL, 0, ACC_ALL, NULL, NULL);
+				      PT64_ROOT_LEVEL, metaphysical,
+				      ACC_ALL, NULL, NULL);
 		root = __pa(sp->spt);
 		++sp->root_count;
 		vcpu->arch.mmu.root_hpa = root;
 		return;
 	}
 #endif
+	metaphysical = !is_paging(vcpu);
+	if (tdp_enabled)
+		metaphysical = 1;
 	for (i = 0; i < 4; ++i) {
 		hpa_t root = vcpu->arch.mmu.pae_root[i];
@@ -1110,7 +1117,7 @@ static void mmu_alloc_roots(struct kvm_vcpu *vcpu)
 		} else if (vcpu->arch.mmu.root_level == 0)
 			root_gfn = 0;
 		sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
-				      PT32_ROOT_LEVEL, !is_paging(vcpu),
+				      PT32_ROOT_LEVEL, metaphysical,
 				      ACC_ALL, NULL, NULL);
 		root = __pa(sp->spt);
 		++sp->root_count;
@@ -1144,6 +1151,36 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 			     error_code & PFERR_WRITE_MASK, gfn);
 }
 
+static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
+			  u32 error_code)

you probably mean gpa_t ?

+{
+	struct page *page;
+	int r;
+
+	ASSERT(vcpu);
+	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
+
+	r = mmu_topup_memory_caches(vcpu);
+	if (r)
+		return r;
+
+	down_read(&current->mm->mmap_sem);
+	page = gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT);
+	if (is_error_page(page)) {
+		kvm_release_page_clean(page);
+		up_read(&current->mm->mmap_sem);
+		return 1;
+	}

I don't know if it is worth checking it here; in the worst case we will map the error page and the host will be safe.

+	spin_lock(&vcpu->kvm->mmu_lock);
+	kvm_mmu_free_some_pages(vcpu);
+	r = __direct_map(vcpu, gpa, error_code & PFERR_WRITE_MASK,
+			 gpa >> PAGE_SHIFT, page, TDP_ROOT_LEVEL);
+	spin_unlock(&vcpu->kvm->mmu_lock);
+	up_read(&current->mm->mmap_sem);
+
+	return r;
+}
+
 static void nonpaging_free(struct kvm_vcpu *vcpu)
 {
 	mmu_free_roots(vcpu);
@@ -1237,7 +1274,35 @@ static int paging32E_init_context(struct kvm_vcpu *vcpu)
 	return paging64_init_context_common(vcpu, PT32E_ROOT_LEVEL);
 }
 
-static int init_kvm_mmu(struct kvm_vcpu *vcpu)
+static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
+{
+	struct kvm_mmu *context = &vcpu->arch.mmu;
+
+	context->new_cr3 = nonpaging_new_cr3;
+	context->page_fault = tdp_page_fault;
+	context->free = nonpaging_free;
+	context->prefetch_page = nonpaging_prefetch_page;
+	context->shadow_root_level = TDP_ROOT_LEVEL;
+	context->root_hpa = INVALID_PAGE;
+
+	if (!is_paging(vcpu)) {
+		context->gva_to_gpa = nonpaging_gva_to_gpa;
+		context->root_level = 0;
+	} else if (is_long_mode(vcpu)) {
+		context->gva_to_gpa = paging64_gva_to_gpa;
+		context->root_level = PT64_ROOT_LEVEL;
+	} else if (is_pae(vcpu)) {
+		context->gva_to_gpa = paging64_gva_to_gpa;
+		context->root_level = PT32E_ROOT_LEVEL;
+	} else {
+		context->gva_to_gpa = paging32_gva_to_gpa;
+		context->root_level = PT32_ROOT_LEVEL;
+	}
+
+	return 0;
+}
+
+static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
 {
 	ASSERT(vcpu);
 	ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
@@ -1252,6 +1317,14 @@ static int init_kvm_mmu(struct kvm_vcpu *vcpu)
 	return paging32_init_context(vcpu);
 }
 
+static int init_kvm_mmu(struct kvm_vcpu *vcpu)
+{
+	if (tdp_enabled)
+
Re: [patch 0/4] [RFC] MMU Notifiers V1
Andrea Arcangeli wrote:

On Thu, Jan 24, 2008 at 09:56:06PM -0800, Christoph Lameter wrote:

Andrea's mmu_notifier #4 -> RFC V1

- Merge subsystem rmap based with Linux rmap based approach
- Move Linux rmap based notifiers out of macro
- Try to account for what locks are held while the notifiers are called.
- Develop a patch sequence that separates out the different types of hooks so that it is easier to review their use.
- Avoid adding #include to linux/mm_types.h
- Integrate RCU logic suggested by Peter.

I'm glad you're converging on something a bit saner and much much closer to my code, plus perfectly usable by the KVM optimal rmap design too. I would have preferred if you had sent me patches like Peter did, for review and merging etc.; that would have made review especially easier. Anyway, I'm used to that on lkml so it's ok; I just need this patch to be included in mainline, everything else is irrelevant to me. On technical merit this still partially makes me sick, and I think it's the last issue to debate.

@@ -971,6 +974,9 @@ int try_to_unmap(struct page *page, int
 	else
 		ret = try_to_unmap_file(page, migration);
 
+	if (unlikely(PageExternalRmap(page)))
+		mmu_rmap_notifier(invalidate_page, page);
+
 	if (!page_mapped(page))
 		ret = SWAP_SUCCESS;
 
 	return ret;

I find the above hard to accept, because the moment you work with physical pages and not "mm+address" I think you couldn't possibly care if page_mapped is true or false, and I think the above notifier should be called _outside_ try_to_unmap. In fact I'd call mmu_rmap_notifier(invalidate_page, page); only if page_unmapped is false and the linux pte is gone already (practically just before the page_count == 2 check and after try_to_unmap).

I don't understand how that is better than notification on TLB flush. I mean, CPUs have a very similar problem to ours, so why not deal with the notification at the same place as they do (the TLB flush)?

Moreover, notification on TLB flush allows unmodified applications to work with us; for example, the memory merging driver that I wrote was able to work with KVM without any change to its source code.
Re: [PATCH] mmu notifiers #v2
Andrea Arcangeli wrote:

On Wed, Jan 16, 2008 at 07:48:06PM +0200, Izik Eidus wrote:

Rik van Riel wrote:

On Sun, 13 Jan 2008 17:24:18 +0100 Andrea Arcangeli <[EMAIL PROTECTED]> wrote:

In my basic initial patch I only track the tlb flushes, which should be the minimum required to have a nice linux-VM controlled swapping behavior of the KVM gphysical memory.

I have a vaguely related question on KVM swapping. Do page accesses inside KVM guests get propagated to the host OS, so Linux can choose a reasonable page for eviction, or is the pageout of KVM guest pages essentially random?

Right, selection of the guest OS pages to swap is partly random, but wait: _only_ for the long-cached and hot spte entries. It's certainly not entirely random. As the shadow-cache is a bit dynamic, every newly instantiated spte will refresh the PG_referenced bit in follow_page already (through minor faults). Not-present faults of swapped non-present sptes can trigger minor faults from swapcache too, and they'll refresh young regular ptes.

Right now, when kvm removes a pte from the shadow cache, it marks as accessed the page that this pte pointed to.

Yes: the referenced bit in the mmu-notifier invalidate case isn't useful, because it's set right before freeing the page.

It was a good solution until the mmu notifiers, because the pages were pinned and couldn't be swapped to disk.

It probably still makes sense for sptes removed for other reasons (not mmu notifier invalidates).

Agree.

So now it will have to do something more sophisticated, or at least mark as accessed every page pointed to by a pte that gets inserted into the shadow cache.

I think that should already be the case; see the mark_page_accessed in follow_page. FOLL_TOUCH is set, isn't it?

Yes, you are right, FOLL_TOUCH is set.

The only thing we clearly miss is logic that refreshes the PG_referenced bitflag for "hot" sptes that remain instantiated and cached for a long time. For regular linux ptes this is done by the cpu through the young bitflag. But note that not all architectures have young-bitflag support in hardware! So I suppose the swapping of the KVM task is like the swapping of any other task, but on an alpha CPU. It works well enough in practice, even if we clearly have room for further optimizations in this area (as there would be on archs w/o the young bit updated in hardware too).

To refresh the PG_referenced bit for long-lived hot sptes, I think the easiest solution is to chain the sptes in an lru, and to start dropping them when memory pressure starts. We could drop one spte every X pages collected by the VM. So the "age" time factor depends on the VM velocity, and we totally avoid useless shadow page faults when there's no VM pressure. When VM pressure increases, the kvm non-present fault will then take care to refresh the PG_referenced bit. This should solve the aging issue for long-lived and hot sptes. It should improve the responsiveness of the guest OS during "initial" swap pressure (after the initial swap pressure, the working set finds itself in ram again), so it should avoid some not-required swapout/swapin jitter during the initial swap. I see this mostly as a kvm-internal optimization, not strictly related to the mmu notifiers though.

Ohh, I like it; this is a clever solution, and I guess the cost of the vmexits won't be too high if it is not too aggressive.
Re: [PATCH] mmu notifiers #v2
Rik van Riel wrote: On Sun, 13 Jan 2008 17:24:18 +0100 Andrea Arcangeli <[EMAIL PROTECTED]> wrote: In my basic initial patch I only track the tlb flushes which should be the minimum required to have a nice linux-VM controlled swapping behavior of the KVM gphysical memory.

I have a vaguely related question on KVM swapping. Do page accesses inside KVM guests get propagated to the host OS, so Linux can choose a reasonable page for eviction, or is the pageout of KVM guest pages essentially random?

Right now, when kvm removes a pte from the shadow cache, it marks the page that this pte pointed to as accessed. It was a good solution until the mmu notifiers, because the pages were pinned and couldn't be swapped to disk. So now it will have to do something more sophisticated, or at least mark as accessed every page pointed to by a pte that gets inserted into the shadow cache.
Re: [kvm-devel] [PATCH 2/3] kvmclock - the host part.
Glauber de Oliveira Costa wrote:

Dong, Eddie escreveu:

+static void kvm_write_guest_time(struct kvm_vcpu *vcpu)
+{
+	struct timespec ts;
+	int r;
+
+	if (!vcpu->clock_gpa)
+		return;
+
+	/* Updates version to the next odd number, indicating we're writing */
+	vcpu->hv_clock.version++;
+	kvm_write_guest(vcpu->kvm, vcpu->clock_gpa, &vcpu->hv_clock, PAGE_SIZE);
+
+	kvm_get_msr(vcpu, MSR_IA32_TIME_STAMP_COUNTER,
+		    &vcpu->hv_clock.last_tsc);
+
+	ktime_get_ts(&ts);

Do we need to disable preemption here?

After thinking for a little while, you are theoretically right. In the current state, we could even be preempted between all the operations ;-) Maybe after avi's suggestion of moving the call it will end up in a preempt-safe region, but anyway, it's safer to add the preempt markers here. I'll put it in the next version, thanks.

Note that you can't call kvm_write_guest with preemption disabled (which you do a few lines below).