what is the state about "[v2] ppc64 boot: Wait for boot cpu to show up if nr_cpus limit is about to hit"
Hi, I cannot find the history of this patch: https://patchwork.ozlabs.org/patch/577193/ Can we have this patch? Thanks, Pingfan
Re: [PATCH] powerpc/pseries hotplug: prevent the reserved mem from removing
On Fri, Apr 28, 2017 at 2:06 AM, Hari Bathini wrote: > Hi Pingfan, > > > On Thursday 27 April 2017 01:13 PM, Pingfan Liu wrote: >> >> E.g. after fadump reserves mem regions, these regions should not be removed >> before fadump explicitly frees them. >> Signed-off-by: Pingfan Liu >> --- >> arch/powerpc/platforms/pseries/hotplug-memory.c | 5 +++-- >> 1 file changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c >> b/arch/powerpc/platforms/pseries/hotplug-memory.c >> index e104c71..201be23 100644 >> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c >> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c >> @@ -346,6 +346,8 @@ static int pseries_remove_memblock(unsigned long base, >> unsigned int memblock_siz >> >> if (!pfn_valid(start_pfn)) >> goto out; >> + if (memblock_is_reserved(base)) >> + return -EINVAL; > > > I think memblock reserved regions are not hot removed even without this > patch. > So, can you elaborate on when/why this patch is needed? > I have not found any code that prevents the reserved regions from being freed. Am I missing anything? I will try to reserve a ppc machine to run a test. Thx, Pingfan
Re: a question about powervm
On Mon, Dec 19, 2016 at 5:58 PM, Michal Suchanek <hramr...@gmail.com> wrote: > Hello, > > On 19 December 2016 at 10:09, Liu ping fan <kernelf...@gmail.com> wrote: >> Hi guys, >> >> I am using a pSeries. It runs on PowerVM. >> My question is whether the cpuX under /sys/devices/system/cpu >> corresponds to a real cpu or not. >> I think it is not the same as the cpu on a kvm-guest, which is emulated >> by a linux process. > > In general the hypervisor (PowerVM or KVM) needs to disable access of > the guest to some CPU features in order to manage multiple guests. So > some CPU features are either disabled or emulated by the hypervisor to > prevent the guest from interfering with the isolation of the guests from > each other and the host. On the other hand, emulating the whole CPU is > inefficient, so to achieve reasonable guest speed it must run on a real > CPU most of the time. That means that in most cases the CPU you see in > the guest will have a subset of the features of the CPU the host sees. > Thank you for the explanation, and sorry for not explaining my question clearly. I wonder about the mapping between a guest CPU and a physical cpu. On PowerVM, aside from the features masked by the host, is it a 1:1 mapping? > Some virtualization solutions allow overcommitting CPU time, so you can > allocate more CPUs to guests than you physically have. Then when > running, the guest will use an actual physical CPU, but when it is > sleeping another guest may use that CPU. > Yes, agreed. Thanks!
a question about powervm
Hi guys, I am using a pSeries. It runs on powerVM. My question is whether the cpuX under /sys/devices/system/cpu corresponds to a real cpu or not. I think it is not the same as the cpu on kvm-guest, which is emulated by a linux process. Thx.
Re: fadump: a question about "ibm,configure-kernel-dump"
Thank you very much. It is much clearer to me now. On Fri, Nov 18, 2016 at 5:29 PM, Michael Ellerman <m...@ellerman.id.au> wrote: > Liu ping fan <kernelf...@gmail.com> writes: > >> I have an ibm-p8-garrison machine. But I cannot find a node >> "ibm,configure-kernel-dump" under /proc/device-tree. Does the garrison >> machine support fadump? And normally, which component in open-power >> presents "ibm,configure-kernel-dump"? I had thought it was in the >> skiboot git tree or the garrison-xml git tree, but found nothing. > > fadump only exists for guests running under PowerVM (LPARs). It sounds > like you're running bare metal (because you mention skiboot). > > If you're talking about guests on top of KVM then it *could* be > implemented, but AFAIK Qemu/KVM do not implement it. > > cheers
fadump: a question about "ibm,configure-kernel-dump"
I have an ibm-p8-garrison machine. But I cannot find a node "ibm,configure-kernel-dump" under /proc/device-tree. Does the garrison machine support fadump? And normally, which component in open-power presents "ibm,configure-kernel-dump"? I had thought it was in the skiboot git tree or the garrison-xml git tree, but found nothing. Can anyone help? Thx
Re: [RFC 01/11] sched: introduce sys_cpumask in tsk to adapt asymmetric system
On Wed, Nov 12, 2014 at 5:22 PM, Srikar Dronamraju sri...@linux.vnet.ibm.com wrote: * kernelf...@gmail.com kernelf...@gmail.com [2014-10-16 15:29:50]: On some systems such as powerpc, some tasks (vcpu threads) can only run on dedicated cpus, since we adopt an asymmetric method to monitor the whole physical cpu (powerKVM only allows the primary hwthread to set up the runtime env for the secondary when entering the guest). Nowadays, powerKVM runs with all the secondary hwthreads offline to ensure the vcpu threads only run on the primary thread. But we plan to keep all cpus online when running powerKVM to give more power when switching back to the host, so introduce a sys_allowed cpumask to reflect the cpuset which the vcpu thread can run on. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- include/linux/init_task.h | 1 + include/linux/sched.h | 6 ++ kernel/sched/core.c | 10 -- 3 files changed, 15 insertions(+), 2 deletions(-) diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 2bb4c4f3..c56f69e 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -172,6 +172,7 @@ extern struct task_group root_task_group; .normal_prio = MAX_PRIO-20, \ .policy = SCHED_NORMAL, \ .cpus_allowed = CPU_MASK_ALL, \ + .sys_allowed = CPU_MASK_ALL, \ Do we really need another mask, can't we just use cpus_allowed itself? I think it is not easy to cast two requests, the chip's constraint and the user's configuration, onto one mask. .nr_cpus_allowed = NR_CPUS, \ .mm = NULL, \ .active_mm = init_mm, \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 5c2c885..ce429f3 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1260,7 +1260,10 @@ struct task_struct { unsigned int policy; int nr_cpus_allowed; + /* ANDed user and sys_allowed */ cpumask_t cpus_allowed; + /* due to the asymmetric feature, some tasks can only run on certain cpus */ + cpumask_t sys_allowed; #ifdef CONFIG_PREEMPT_RCU int rcu_read_lock_nesting; @@ -2030,6 +2033,9 @@ static inline void tsk_restore_flags(struct task_struct *task, } #ifdef CONFIG_SMP +extern void set_cpus_sys_allowed(struct task_struct *p, + const struct cpumask *new_mask); + extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ec1a286..2cd1ae3 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4596,13 +4596,19 @@ void init_idle(struct task_struct *idle, int cpu) } #ifdef CONFIG_SMP +void set_cpus_sys_allowed(struct task_struct *p, + const struct cpumask *new_mask) +{ + cpumask_copy(&p->sys_allowed, new_mask); +} + This function doesn't seem to be used anywhere... Not sure why it is introduced. Not layered the patches well :( It is used later in the series. Thx, Fan void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask) { if (p->sched_class && p->sched_class->set_cpus_allowed) p->sched_class->set_cpus_allowed(p, new_mask); - cpumask_copy(&p->cpus_allowed, new_mask); - p->nr_cpus_allowed = cpumask_weight(new_mask); + cpumask_and(&p->cpus_allowed, &p->sys_allowed, new_mask); + p->nr_cpus_allowed = cpumask_weight(&p->cpus_allowed); } /* -- 1.8.3.1 -- Thanks and Regards Srikar Dronamraju ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC 03/11] powerpc: kvm: add interface to control kvm function on a core
On Mon, Oct 27, 2014 at 12:04 PM, Preeti U Murthy pre...@linux.vnet.ibm.com wrote: Hi Liu, On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote: When kvm is enabled on a core, we migrate all external irqs to the primary thread, since currently the kvm irq logic is handled by the primary hwthread. Todo: this patch lacks re-enabling of irq balancing when kvm is disabled on the core. Why is a sysfs file introduced to trigger irq migration? Why is it not done during kvm module insert? And similarly spread interrupts when the module is removed? Isn't this a saner way? Consider the scene: coreA and coreB, where we want to enable KVM on coreA while keeping coreB unchanged. In fact, I try to achieve something asymmetric on the platform. Do you think that is a justification? Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kernel/sysfs.c | 39 ++ arch/powerpc/sysdev/xics/xics-common.c | 12 +++ 2 files changed, 51 insertions(+) diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c index 67fd2fd..a2595dd 100644 --- a/arch/powerpc/kernel/sysfs.c +++ b/arch/powerpc/kernel/sysfs.c @@ -552,6 +552,45 @@ static void sysfs_create_dscr_default(void) if (cpu_has_feature(CPU_FTR_DSCR)) err = device_create_file(cpu_subsys.dev_root, &dev_attr_dscr_default); } + +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY +#define NR_CORES (CONFIG_NR_CPUS/threads_per_core) +static DECLARE_BITMAP(kvm_on_core, NR_CORES) __read_mostly; + +static ssize_t show_kvm_enable(struct device *dev, + struct device_attribute *attr, char *buf) +{ +} + +static ssize_t __used store_kvm_enable(struct device *dev, + struct device_attribute *attr, const char *buf, + size_t count) +{ + struct cpumask stop_cpus; + unsigned long core, thr; + + sscanf(buf, "%lx", &core); + if (core >= NR_CORES) + return -1; + if (!test_bit(core, kvm_on_core)) + for (thr = 1; thr < threads_per_core; thr++) + if (cpu_online(thr * threads_per_core + thr)) + cpumask_set_cpu(thr * threads_per_core + thr, &stop_cpus); What is the above logic trying to do? Did you mean cpu_online(threads_per_core * core + thr)? Yeah, my mistake; it should be cpumask_set_cpu(core * threads_per_core + thr, &stop_cpus) + + stop_machine(xics_migrate_irqs_away_secondary, NULL, &stop_cpus); + set_bit(core, kvm_on_core); + return count; +} + +static DEVICE_ATTR(kvm_enable, 0600, + show_kvm_enable, store_kvm_enable); + +static void sysfs_create_kvm_enable(void) +{ + device_create_file(cpu_subsys.dev_root, &dev_attr_kvm_enable); +} +#endif + #endif /* CONFIG_PPC64 */ #ifdef HAS_PPC_PMC_PA6T diff --git a/arch/powerpc/sysdev/xics/xics-common.c b/arch/powerpc/sysdev/xics/xics-common.c index fe0cca4..68b33d8 100644 --- a/arch/powerpc/sysdev/xics/xics-common.c +++ b/arch/powerpc/sysdev/xics/xics-common.c @@ -258,6 +258,18 @@ unlock: raw_spin_unlock_irqrestore(&desc->lock, flags); } } + +int xics_migrate_irqs_away_secondary(void *data) +{ + int cpu = smp_processor_id(); + if (cpu % threads_per_core != 0) { + WARN(condition, format...); + return 0; + } + /* in fact, if we can migrate the primary, it will be better */ + (); Isn't the aim of the patch to migrate irqs away from the secondary onto the primary? But from the above it looks like we return when we find out that we are on a secondary thread, isn't it? Yes, will fix in the next version. + return 0; +} #endif /* CONFIG_HOTPLUG_CPU */ Note that xics_migrate_irqs_away() is defined under CONFIG_CPU_HOTPLUG, but we will need this option on PowerKVM even when hotplug is not configured in. Yes, will fix the dependency in the next version. Thx, Fan Regards Preeti U Murthy #ifdef CONFIG_SMP
Re: [RFC 04/11] powerpc: kvm: introduce a kthread on primary thread to anti tickless
On Mon, Oct 27, 2014 at 12:45 PM, Preeti U Murthy pre...@linux.vnet.ibm.com wrote: On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote: (This patch is a place holder.) If there is only one vcpu thread is ready(the other vcpu thread can wait for it to execute), the primary thread can enter tickless mode, We do not configure NOHZ_FULL to y by default. Hence no thread would enter tickless mode. But NOHZ_FULL can be chosen by user, and we should survive from it :) which causes the primary keeps running, so the secondary has no opportunity to exit to host, even they have other tsk on them. The secondary threads can still get scheduling ticks. The decrementer of the secondary threads is still active. So as long as secondary threads are busy, scheduling ticks will fire and try to schedule a new task on them. No. As my original thought, after enable KVM on core, the HDEC on secondary is disabled, otherwise the host exit will be too frequent. Any suggestion? Thx, Fan Regards Preeti U Murthy Introduce a kthread (anti_tickless) on primary, so when there is only one vcpu thread on primary, the secondary can resort to anti_tickless to keep the primary out of tickless mode. (I thought that anti_tickless thread can goto NAP, so we can let the secondary run). 
Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kernel/sysfs.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c index a2595dd..f0b110e 100644 --- a/arch/powerpc/kernel/sysfs.c +++ b/arch/powerpc/kernel/sysfs.c @@ -575,9 +575,11 @@ static ssize_t __used store_kvm_enable(struct device *dev, if (!test_bit(core, kvm_on_core)) for (thr = 1; thr threads_per_core; thr++) if (cpu_online(thr * threads_per_core + thr)) - cpumask_set_cpu(thr * threads_per_core + thr, stop_cpus); + cpumask_set_cpu(core * threads_per_core + thr, stop_cpus); stop_machine(xics_migrate_irqs_away_secondary, NULL, stop_cpus); + /* fixme, create a kthread on primary hwthread to handle tickless mode */ + //kthread_create_on_cpu(prevent_tickless, NULL, core * threads_per_core, ppckvm_prevent_tickless); set_bit(core, kvm_on_core); return count; } ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC 06/11] powerpc: kvm: introduce online in paca to indicate whether cpu is needed by host
On Mon, Oct 27, 2014 at 1:32 PM, Preeti U Murthy pre...@linux.vnet.ibm.com wrote: Hi Liu, On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote: Nowadays, powerKVM runs with secondary hwthread offline. Although we can make all secondary hwthread online later, we still preserve this behavior for dedicated KVM env. Achieve this by setting paca-online as false. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/include/asm/paca.h | 3 +++ arch/powerpc/kernel/asm-offsets.c | 3 +++ arch/powerpc/kernel/smp.c | 3 +++ arch/powerpc/kvm/book3s_hv_rmhandlers.S | 12 4 files changed, 21 insertions(+) diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h index a5139ea..67c2500 100644 --- a/arch/powerpc/include/asm/paca.h +++ b/arch/powerpc/include/asm/paca.h @@ -84,6 +84,9 @@ struct paca_struct { u8 cpu_start; /* At startup, processor spins until */ /* this becomes non-zero. */ u8 kexec_state; /* set when kexec down has irqs off */ +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY + u8 online; +#endif #ifdef CONFIG_PPC_STD_MMU_64 struct slb_shadow *slb_shadow_ptr; struct dtl_entry *dispatch_log; diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 9d7dede..0faa8fe 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -182,6 +182,9 @@ int main(void) DEFINE(PACATOC, offsetof(struct paca_struct, kernel_toc)); DEFINE(PACAKBASE, offsetof(struct paca_struct, kernelbase)); DEFINE(PACAKMSR, offsetof(struct paca_struct, kernel_msr)); +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY + DEFINE(PACAONLINE, offsetof(struct paca_struct, online)); +#endif DEFINE(PACASOFTIRQEN, offsetof(struct paca_struct, soft_enabled)); DEFINE(PACAIRQHAPPENED, offsetof(struct paca_struct, irq_happened)); DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id)); diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index a0738af..4c3843e 100644 --- a/arch/powerpc/kernel/smp.c +++ 
b/arch/powerpc/kernel/smp.c @@ -736,6 +736,9 @@ void start_secondary(void *unused) cpu_startup_entry(CPUHP_ONLINE); +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY + get_paca()-online = true; +#endif BUG(); } diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index f0c4db7..d5594b0 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -322,6 +322,13 @@ kvm_no_guest: li r0, KVM_HWTHREAD_IN_NAP stb r0, HSTATE_HWTHREAD_STATE(r13) kvm_do_nap: +#ifdef PPCKVM_ENABLE_SECONDARY + /* check the cpu is needed by host or not */ + ld r2, PACAONLINE(r13) + ld r3, 0 + cmp r2, r3 + bne kvm_secondary_exit_trampoline +#endif /* Clear the runlatch bit before napping */ mfspr r2, SPRN_CTRLF clrrdi r2, r2, 1 @@ -340,6 +347,11 @@ kvm_do_nap: nap b . +#ifdef PPCKVM_ENABLE_SECONDARY +kvm_secondary_exit_trampoline: + b . Uh? When we have no vcpu to run, we loop here instead of doing a nap? What are we achieving? If I understand the intention of the patch well, we are looking to provide a knob whereby the host can indicate if it needs the secondaries at all. Yes, you catch it :) Today the host does boot with all threads online. There are some init scripts which take the secondaries down. So today the host does not have a say in preventing this, compile time or runtime. So lets see how we can switch between the two behaviors if we don't have the init script, which looks like a saner thing to do. We should set the paca-online flag to false by default. If KVM_PPC_ENABLE_SECONDARY is configured, we need to set this flag to true. So at compile time, we resolve the flag. While booting, we look at the flag and decide whether to get the secondaries online. So we get the current behavior if we have not configured KVM_PPC_ENABLE_SECONDARY. Will this achieve the purpose of this patch? At boot time, KVM can not run. So we can achieve the change of the flag by soft cpu hotplug on/off. I think this is a more flexible way. 
Thx, Fan Regards Preeti U Murthy ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC 09/11] powerpc: kvm: handle time base on secondary hwthread
On Mon, Oct 27, 2014 at 2:40 PM, Preeti U Murthy pre...@linux.vnet.ibm.com wrote: On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote: (This is a place holder patch.) We need to store the time base for host on secondary hwthread. Later when switching back, we need to reprogram it with elapse time. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 89ea16c..a817ba6 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -371,6 +371,8 @@ _GLOBAL_TOC(kvmppc_secondary_stopper_enter) /* fixme: store other register such as msr */ + /* fixme: store the tb, and set it as MAX, so we cease the tick on secondary */ This can lead to serious consequences. First of all, even if we set the decrementer(not tb) to MAX, it is bound to fire after 4s. That is the maximum time till which you can keep off the decrementer from firing. In the hotplug path, the offline cpus nap and their decrementers are programmed to fire at MAX too. But the difference is that we clear the LPCR_PECE1 wakeup bit which prevents cpus from waking up on a decrementer interrupt. We cannot afford to do this here though because there are other tasks on the secondary threads' runqueue. They need to be scheduled in. There are also timers besides the tick_sched one, which can be queued on these secondary threads. These patches have not taken care to migrate timers or tasks before entering guest as far as I observed. Hence we cannot just turn off time base like this and expect to handle the above mentioned events the next time the primary thread decides to exit to the host. Yes, that is the nut in this series. My plan is to compensate the hrtimer when the secondary exit. But as to the scheduler on secondary, if it is ceased, what is side-effect? 
Thx, Fan Regards Preeti U Murthy + /* prevent us to enter kernel */ li r0, 1 stb r0, HSTATE_HWTHREAD_REQ(r13) @@ -382,6 +384,10 @@ _GLOBAL_TOC(kvmppc_secondary_stopper_enter) /* enter with vmode */ kvmppc_secondary_stopper_exit: + /* fixme: restore the tb, with the orig val plus time elapse + * so we can fire the hrtimer as soon as possible + */ + /* fixme, restore the stack which we store on lpaca */ ld r0, 112+PPC_LR_STKOFF(r1) ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC 11/11] powerpc: kvm: Kconfig add an option for enabling secondary hwthread
On Mon, Oct 27, 2014 at 2:44 PM, Preeti U Murthy pre...@linux.vnet.ibm.com wrote: On 10/17/2014 01:00 AM, kernelf...@gmail.com wrote: Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kvm/Kconfig | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig index 602eb51..de38566 100644 --- a/arch/powerpc/kvm/Kconfig +++ b/arch/powerpc/kvm/Kconfig @@ -93,6 +93,10 @@ config KVM_BOOK3S_64_HV If unsure, say N. +config KVMPPC_ENABLE_SECONDARY + tristate "KVM support for running on secondary hwthread in host" + depends on KVM_BOOK3S_64_HV This patch is required on top of all the rest :) The earlier patches won't compile without this one. Every patch in a patchset should compile successfully without the aid of the patches that come after it. I think there is a conflict here. If we do so, then we should make an effort to prevent the independent patches from taking effect before the whole patchset is applied. Thx, Fan Regards Preeti U Murthy
Re: [PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT
On Tue, Jul 29, 2014 at 2:57 PM, Benjamin Herrenschmidt b...@kernel.crashing.org wrote: On Mon, 2014-07-28 at 15:58 +0800, Liu ping fan wrote: Hope I am right. Take the following seq as an example if (hptep[0] HPTE_V_VALID) { /* HPTE was previously valid, so we need to invalidate it */ unlock_rmap(rmap); hptep[0] |= HPTE_V_ABSENT; kvmppc_invalidate_hpte(kvm, hptep, index); /* don't lose previous R and C bits */ r |= hptep[1] (HPTE_R_R | HPTE_R_C); } else { kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0); } - if we try_to_unmap on pfn at here, then @r contains a invalid pfn hptep[1] = r; eieio(); hptep[0] = hpte[0]; asm volatile(ptesync : : : memory); If that was the case we would have the same race in kvmppc_do_h_enter(). I think the fact that the HPTE is locked will prevent the race, ie, HPTE_V_HVLOCK is set until hptep[0] is written to. If I look at at the unmap case, my understanding is that it uses kvm_unmap_rmapp() which will also lock the HPTE (try_lock_hpte) and so shouldn't have a race vs the above code. Yes, you are right :) Thx, Fan Or do you see a race I don't ? Cheers, Ben. Thx. Fan On Mon, Jul 28, 2014 at 2:42 PM, Benjamin Herrenschmidt b...@kernel.crashing.org wrote: On Mon, 2014-07-28 at 14:09 +0800, Liu Ping Fan wrote: In current code, the setup of hpte is under the risk of race with mmu_notifier_invalidate, i.e we may setup a hpte with a invalid pfn. Resolve this issue by sync the two actions by KVMPPC_RMAP_LOCK_BIT. Please describe the race you think you see. I'm quite sure both Paul and I went over that code and somewhat convinced ourselves that it was ok but it's possible that we were both wrong :-) Cheers, Ben. 
Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 8056107..e6dcff4 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, if (hptep[0] HPTE_V_VALID) { /* HPTE was previously valid, so we need to invalidate it */ - unlock_rmap(rmap); hptep[0] |= HPTE_V_ABSENT; kvmppc_invalidate_hpte(kvm, hptep, index); /* don't lose previous R and C bits */ r |= hptep[1] (HPTE_R_R | HPTE_R_C); + + hptep[1] = r; + eieio(); + hptep[0] = hpte[0]; + asm volatile(ptesync : : : memory); + unlock_rmap(rmap); } else { + hptep[1] = r; + eieio(); + hptep[0] = hpte[0]; + asm volatile(ptesync : : : memory); kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0); } - hptep[1] = r; - eieio(); - hptep[0] = hpte[0]; - asm volatile(ptesync : : : memory); preempt_enable(); if (page hpte_is_writable(r)) SetPageDirty(page); ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT
In the current code, the setup of an hpte risks racing with mmu_notifier_invalidate, i.e. we may set up an hpte with an invalid pfn. Resolve this issue by synchronizing the two actions with KVMPPC_RMAP_LOCK_BIT. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 8056107..e6dcff4 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, if (hptep[0] & HPTE_V_VALID) { /* HPTE was previously valid, so we need to invalidate it */ - unlock_rmap(rmap); hptep[0] |= HPTE_V_ABSENT; kvmppc_invalidate_hpte(kvm, hptep, index); /* don't lose previous R and C bits */ r |= hptep[1] & (HPTE_R_R | HPTE_R_C); + + hptep[1] = r; + eieio(); + hptep[0] = hpte[0]; + asm volatile("ptesync" : : : "memory"); + unlock_rmap(rmap); } else { + hptep[1] = r; + eieio(); + hptep[0] = hpte[0]; + asm volatile("ptesync" : : : "memory"); kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0); } - hptep[1] = r; - eieio(); - hptep[0] = hpte[0]; - asm volatile("ptesync" : : : "memory"); preempt_enable(); if (page && hpte_is_writable(r)) SetPageDirty(page); -- 1.8.1.4
Re: [PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT
Hope I am right. Take the following seq as an example if (hptep[0] HPTE_V_VALID) { /* HPTE was previously valid, so we need to invalidate it */ unlock_rmap(rmap); hptep[0] |= HPTE_V_ABSENT; kvmppc_invalidate_hpte(kvm, hptep, index); /* don't lose previous R and C bits */ r |= hptep[1] (HPTE_R_R | HPTE_R_C); } else { kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0); } - if we try_to_unmap on pfn at here, then @r contains a invalid pfn hptep[1] = r; eieio(); hptep[0] = hpte[0]; asm volatile(ptesync : : : memory); Thx. Fan On Mon, Jul 28, 2014 at 2:42 PM, Benjamin Herrenschmidt b...@kernel.crashing.org wrote: On Mon, 2014-07-28 at 14:09 +0800, Liu Ping Fan wrote: In current code, the setup of hpte is under the risk of race with mmu_notifier_invalidate, i.e we may setup a hpte with a invalid pfn. Resolve this issue by sync the two actions by KVMPPC_RMAP_LOCK_BIT. Please describe the race you think you see. I'm quite sure both Paul and I went over that code and somewhat convinced ourselves that it was ok but it's possible that we were both wrong :-) Cheers, Ben. 
Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 8056107..e6dcff4 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, if (hptep[0] HPTE_V_VALID) { /* HPTE was previously valid, so we need to invalidate it */ - unlock_rmap(rmap); hptep[0] |= HPTE_V_ABSENT; kvmppc_invalidate_hpte(kvm, hptep, index); /* don't lose previous R and C bits */ r |= hptep[1] (HPTE_R_R | HPTE_R_C); + + hptep[1] = r; + eieio(); + hptep[0] = hpte[0]; + asm volatile(ptesync : : : memory); + unlock_rmap(rmap); } else { + hptep[1] = r; + eieio(); + hptep[0] = hpte[0]; + asm volatile(ptesync : : : memory); kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0); } - hptep[1] = r; - eieio(); - hptep[0] = hpte[0]; - asm volatile(ptesync : : : memory); preempt_enable(); if (page hpte_is_writable(r)) SetPageDirty(page); ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v4] powerpc: kvm: make _PAGE_NUMA take effect
Numa fault is a method that helps achieve auto numa balancing. When such a page fault takes place, the page fault handler will check whether the page is placed correctly. If not, migration should be involved to cut down the distance between the cpu and the pages. A pte marked with _PAGE_NUMA helps implement numa fault: it means the MMU is not allowed to access the page directly, so a page fault is triggered and the numa fault handler gets the opportunity to run the placement checker. As for the access of the MMU, we need special handling for powernv's guests. When we mark a pte with _PAGE_NUMA, we already call mmu_notifier to invalidate it in the guest's htab, but when we try to re-insert it, we first try to map it in real mode. Only after this fails do we fall back to virt mode, and most importantly, we run the numa fault handler in virt mode. This patch guards the real-mode path to ensure that if a pte is marked with _PAGE_NUMA, it will NOT be mapped in real mode; instead, it will be mapped in virt mode and have the opportunity to have its placement checked. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com --- v4: more detailed description --- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 1d6c56a..8fcc363 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -234,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, pte_size = psize; pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size); - if (pte_present(pte)) { + if (pte_present(pte) && !pte_numa(pte)) { if (writing && !pte_write(pte)) /* make the actual HPTE be read-only */ ptel = hpte_make_readonly(ptel); -- 1.8.1.4
Re: [PATCH v3] powerpc: kvm: make _PAGE_NUMA take effect
On Mon, Apr 14, 2014 at 2:43 PM, Alexander Graf ag...@suse.de wrote: On 13.04.14 04:27, Liu ping fan wrote: On Fri, Apr 11, 2014 at 10:03 PM, Alexander Graf ag...@suse.de wrote: On 11.04.2014, at 13:45, Liu Ping Fan pingf...@linux.vnet.ibm.com wrote: When we mark pte with _PAGE_NUMA we already call mmu_notifier_invalidate_range_start and mmu_notifier_invalidate_range_end, which will mark existing guest hpte entry as HPTE_V_ABSENT. Now we need to do that when we are inserting new guest hpte entries. What happens when we don't? Why do we need the check? Why isn't it done implicitly? What happens when we treat a NUMA marked page as non-present? Why does it work out for us? Assume you have no idea what PAGE_NUMA is, but try to figure out what this patch does and whether you need to cherry-pick it into your downstream kernel. The description as is still is not very helpful for that. It doesn't even explain what really changes with this patch applied. Yeah. what about appending the following description? Can it make the context clear? Guest should not setup a hpte for the page whose pte is marked with _PAGE_NUMA, so on the host, the numa-fault mechanism can take effect to check whether the page is placed correctly or not. Try to come up with a text that answers the following questions in order: I divide them into 3 groups, and answer them by 3 sections. Seems that it has the total story :) Please take a look. - What does _PAGE_NUMA mean? Group 1 - section 2 - How does page migration with _PAGE_NUMA work? - Why should we not map pages when _PAGE_NUMA is set? Group 2 - section 1 (Note: for the 1st question in this group, I am not sure about the details, except that we can fix numa balancing by moving task or moving page. So I comment as migration should be involved to cut down the distance between the cpu and pages) - Which part of what needs to be done did the previous _PAGE_NUMA patch address? - What's the situation without this patch? 
- Which scenario does this patch fix? Group 3 - section 3 NUMA fault is a method which helps to achieve automatic NUMA balancing. When such a page fault takes place, the page fault handler will check whether the page is placed correctly. If not, migration should be involved to cut down the distance between the cpu and the pages. A pte marked with _PAGE_NUMA helps to implement NUMA fault: it does not allow the MMU to access the page directly, so a page fault is triggered and the NUMA fault handler gets the opportunity to run the placement check. As for the MMU access, we need special handling for the guest on powernv. When we mark a pte with _PAGE_NUMA, we already call the mmu_notifier to invalidate it in the guest's htab, but when we try to re-insert it, we first try to fix it up in real mode. Only after this fails do we fall back to virt mode, and most importantly, we run the NUMA fault handler in virt mode. This patch guards the real-mode path to ensure that if a pte is marked with _PAGE_NUMA, it will NOT be fixed up in real mode; instead, it will be fixed up in virt mode and so have the opportunity to have its placement checked. Thx, Fan Once you have a text that answers those, you should have a good patch description :). Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3] powerpc: kvm: make _PAGE_NUMA take effect
On Fri, Apr 11, 2014 at 10:03 PM, Alexander Graf ag...@suse.de wrote: On 11.04.2014, at 13:45, Liu Ping Fan pingf...@linux.vnet.ibm.com wrote: When we mark pte with _PAGE_NUMA we already call mmu_notifier_invalidate_range_start and mmu_notifier_invalidate_range_end, which will mark existing guest hpte entry as HPTE_V_ABSENT. Now we need to do that when we are inserting new guest hpte entries. What happens when we don't? Why do we need the check? Why isn't it done implicitly? What happens when we treat a NUMA marked page as non-present? Why does it work out for us? Assume you have no idea what PAGE_NUMA is, but try to figure out what this patch does and whether you need to cherry-pick it into your downstream kernel. The description as is still is not very helpful for that. It doesn't even explain what really changes with this patch applied. Yeah. what about appending the following description? Can it make the context clear? Guest should not setup a hpte for the page whose pte is marked with _PAGE_NUMA, so on the host, the numa-fault mechanism can take effect to check whether the page is placed correctly or not. 
Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com --- v3: rebased onto Alex's tree, branch kvm-ppc-next substitute commit log with the clearer description in Aneesh's reply (thanks, Aneesh) --- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 1d6c56a..1117525 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -234,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, pte_size = psize; pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size); - if (pte_present(pte)) { + if (pte_present(pte) && !pte_numa(pte)) { Spaces missing Will fix, Thx, Fan
[PATCH v3] powerpc: kvm: make _PAGE_NUMA take effect
When we mark pte with _PAGE_NUMA we already call mmu_notifier_invalidate_range_start and mmu_notifier_invalidate_range_end, which will mark existing guest hpte entry as HPTE_V_ABSENT. Now we need to do that when we are inserting new guest hpte entries. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com --- v3: rebased onto Alex's tree, branch kvm-ppc-next substitute commit log with the clearer description in Aneesh's reply (thanks, Aneesh) --- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 1d6c56a..1117525 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -234,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, pte_size = psize; pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size); - if (pte_present(pte)) { + if (pte_present(pte) && !pte_numa(pte)) { if (writing && !pte_write(pte)) /* make the actual HPTE be read-only */ ptel = hpte_make_readonly(ptel); -- 1.8.1.4
Re: [PATCH v2] powernv: kvm: make _PAGE_NUMA take effect
On Mon, Apr 7, 2014 at 4:36 PM, Alexander Graf ag...@suse.de wrote: On 07.04.14 09:42, Aneesh Kumar K.V wrote: Alexander Graf ag...@suse.com writes: On 03.04.14 04:36, Liu ping fan wrote: Hi Alex, could you help to pick up this patch? since v3.14 kernel can enable numa fault for powerpc. What bad happens without this patch? We map a page even though it was declared to get NUMA migrated? What happens next? Nothing much, we won't be properly accounting the numa access in the host. What we want to achieve is to convert a guest access of the page to a host fault so that we can do proper numa access accounting in the host. This would enable us to migrate the page to the correct numa node. Ok, so no breakages, just less performance. I wouldn't consider it stable material then :). Sorry to reply late, since I am half out of office during this period. I am puzzled by your reply. Without this patch, the guest can NOT sense the numa changes and ask the host to put the pages in the right place. So the pages used by the guest will always be misplaced. The numa-fault method was inspired by a real requirement to improve performance, so we should also consider the performance drop of the guest. Right? Regards, Fan Alex
Re: [PATCH v2] powernv: kvm: make _PAGE_NUMA take effect
Hi Alex, could you help to pick up this patch? Since the v3.14 kernel can enable numa fault for powerpc. Thx, Fan On Mon, Jan 27, 2014 at 5:11 PM, Alexander Graf ag...@suse.de wrote: On 21.01.2014, at 10:42, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Liu Ping Fan kernelf...@gmail.com writes: To make sure that on the host, the pages marked with _PAGE_NUMA result in a fault when the guest accesses them, we should force the checking when the guest uses a hypercall to set up an hpte. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com When we mark pte with _PAGE_NUMA we already call mmu_notifier_invalidate_range_start and mmu_notifier_invalidate_range_end, which will mark existing guest hpte entry as HPTE_V_ABSENT. Now we need to do that when we are inserting new guest hpte entries. This patch does that. So what happens next? We insert a page into the HTAB without HPTE_V_VALID set, so the guest will fail to use it. If the guest does an H_READ on it, it will suddenly turn to V_VALID though? I might need a crash course in the use of HPTE_V_ABSENT. Alex
Re: [PATCH] mm: numa: bugfix for LAST_CPUPID_NOT_IN_PAGE_FLAGS
On Fri, Feb 28, 2014 at 7:41 AM, Andrew Morton a...@linux-foundation.org wrote: On Wed, 26 Feb 2014 13:22:16 +0530 Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Andrew Morton a...@linux-foundation.org writes: On Wed, 5 Feb 2014 09:25:46 +0800 Liu Ping Fan qemul...@gmail.com wrote: When doing some numa tests on powerpc, I triggered an oops bug. I found it is caused by using page->_last_cpupid. It should be initialized as -1 & LAST_CPUPID_MASK, not -1. Otherwise, in task_numa_fault(), we will miss the check (last_cpupid == (-1 & LAST_CPUPID_MASK)). And finally cause an oops bug in task_numa_group(), since the online cpus are fewer than the possible cpus. I grabbed this. I added this to the changelog: : PPC needs the LAST_CPUPID_NOT_IN_PAGE_FLAGS case because ppc needs to : support a large physical address region, up to 2^46, but a small section size : (2^24). So when NR_CPUS grows up, it is easy to cause : not-in-page-flags. to hopefully address Peter's observation. How should we proceed with this? I'm getting the impression that numa balancing on ppc is a dead duck in 3.14, so perhaps this and powerpc-mm-add-new-set-flag-argument-to-pte-pmd-update-function.patch mm-dirty-accountable-change-only-apply-to-non-prot-numa-case.patch mm-use-ptep-pmdp_set_numa-for-updating-_page_numa-bit.patch All these are already in 3.14 ? Yes. are 3.15-rc1 material? We should push the first hunk to 3.14. I will wait for Liu to redo the patch. BTW this should happen only when SPARSE_VMEMMAP is not specified. Srikar had reported the issue here http://mid.gmane.org/20140219180200.ga29...@linux.vnet.ibm.com #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) #define SECTIONS_WIDTH SECTIONS_SHIFT #else #define SECTIONS_WIDTH 0 #endif I'm lost. What patch are you talking about? The first hunk of what? I think Aneesh was talking about the chunk of the patch which modified the file page-flags-layout.h.
I tried to collapse and simplify the logic, but that would make LAST_CPUPID_WIDTH depend on CONFIG_NUMA_BALANCING, which is an error since we need LAST_CPUPID_WIDTH even without CONFIG_NUMA_BALANCING. (Sorry, I compiled and ran the kernel, but did not catch this.) Thanks and best regards, Fan I assume we're talking about mm-numa-bugfix-for-last_cpupid_not_in_page_flags.patch, which I had queued for 3.14. I'll put it on hold until there's some clarity here.
Re: [PATCH] mm: numa: bugfix for LAST_CPUPID_NOT_IN_PAGE_FLAGS
On Fri, Feb 28, 2014 at 12:47 PM, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Andrew Morton a...@linux-foundation.org writes: On Wed, 26 Feb 2014 13:22:16 +0530 Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Andrew Morton a...@linux-foundation.org writes: On Wed, 5 Feb 2014 09:25:46 +0800 Liu Ping Fan qemul...@gmail.com wrote: When doing some numa tests on powerpc, I triggered an oops bug. I found it is caused by using page->_last_cpupid. It should be initialized as -1 & LAST_CPUPID_MASK, not -1. Otherwise, in task_numa_fault(), we will miss the check (last_cpupid == (-1 & LAST_CPUPID_MASK)). And finally cause an oops bug in task_numa_group(), since the online cpus are fewer than the possible cpus. I grabbed this. I added this to the changelog: : PPC needs the LAST_CPUPID_NOT_IN_PAGE_FLAGS case because ppc needs to : support a large physical address region, up to 2^46, but a small section size : (2^24). So when NR_CPUS grows up, it is easy to cause : not-in-page-flags. to hopefully address Peter's observation. How should we proceed with this? I'm getting the impression that numa balancing on ppc is a dead duck in 3.14, so perhaps this and powerpc-mm-add-new-set-flag-argument-to-pte-pmd-update-function.patch mm-dirty-accountable-change-only-apply-to-non-prot-numa-case.patch mm-use-ptep-pmdp_set_numa-for-updating-_page_numa-bit.patch All these are already in 3.14 ? Yes. are 3.15-rc1 material? We should push the first hunk to 3.14. I will wait for Liu to redo the patch. BTW this should happen only when SPARSE_VMEMMAP is not specified. Srikar had reported the issue here http://mid.gmane.org/20140219180200.ga29...@linux.vnet.ibm.com #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) #define SECTIONS_WIDTH SECTIONS_SHIFT #else #define SECTIONS_WIDTH 0 #endif I'm lost. What patch are you talking about? The first hunk of what? The patch in this thread.
I assume we're talking about mm-numa-bugfix-for-last_cpupid_not_in_page_flags.patch, which I had queued for 3.14. I'll put it on hold until there's some clarity here. We don't need the complete patch, it is just the first hunk that we need to fix the crash ie. we only need diff --git a/include/linux/mm.h b/include/linux/mm.h index a7b4e31..ddc66df4 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -727,7 +727,7 @@ static inline int page_cpupid_last(struct page *page) } static inline void page_cpupid_reset_last(struct page *page) { - page->_last_cpupid = -1; + page->_last_cpupid = -1 & LAST_CPUPID_MASK; } #else static inline int page_cpupid_last(struct page *page) Also the issue will only happen when SPARSE_VMEMMAP is not enabled. I will send a proper patch with an updated changelog. I was hoping Liu would get to that quickly. Thanks for sending V2. Since the ppc machine env was changed by others, I was blocked on setting up the env to re-test this patch, and so did not send it out quickly. Best regards, Fan -aneesh
[PATCH] powerpc: ftrace: bugfix for test_24bit_addr
The branch target should be the func addr, not the addr of the func_descr_t. So use ppc_function_entry() to generate the right target addr. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- This bug will make ftrace fail to work. It can be triggered when the kernel size grows. --- arch/powerpc/kernel/ftrace.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/ftrace.c b/arch/powerpc/kernel/ftrace.c index 9b27b29..b0ded97 100644 --- a/arch/powerpc/kernel/ftrace.c +++ b/arch/powerpc/kernel/ftrace.c @@ -74,6 +74,7 @@ ftrace_modify_code(unsigned long ip, unsigned int old, unsigned int new) */ static int test_24bit_addr(unsigned long ip, unsigned long addr) { + addr = ppc_function_entry((void *)addr); /* use the create_branch to verify that this offset can be branched */ return create_branch((unsigned int *)ip, addr, 0); -- 1.8.1.4
Re: [PATCH 0/4] powernv: kvm: numa fault improvement
); } } if (clock_gettime(CLOCK_REALTIME, &now) == -1) handle_error("clock_gettime"); /* Create a CLOCK_REALTIME absolute timer with initial expiration and interval as specified in command line */ new_value.it_value.tv_sec = now.tv_sec + elapsed; new_value.it_value.tv_nsec = now.tv_nsec; new_value.it_interval.tv_sec = 0; new_value.it_interval.tv_nsec = 0; if (timerfd_settime(fd, TFD_TIMER_ABSTIME, &new_value, NULL) == -1) handle_error("timerfd_settime"); pfd.fd = fd; pfd.events = POLLIN; pfd.revents = 0; /* -1: infinite wait */ poll(&pfd, 1, -1); /* ask children to stop and get back cnt */ *(pmap + SHM_CMD_OFF) = CMD_STOP; wait(NULL); dst_info = (char *)(pmap + SHM_MESSAGE_OFF); printf(dst_info); printf("total cnt:%lu\n", *(pmap + SHM_CNT_OFF)); munmap(pmap, PAGE_SIZE); shm_unlink(SHM_FNAME); } On Mon, Jan 20, 2014 at 10:48 PM, Alexander Graf ag...@suse.de wrote: On 15.01.2014, at 07:36, Liu ping fan kernelf...@gmail.com wrote: On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote: On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote: This series is based on Aneesh's series [PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64 For this series, I apply the same idea from the previous thread [PATCH 0/3] optimize for powerpc _PAGE_NUMA (for which, I still try to get a machine to show nums) But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host, which is well known. This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it. Sorry for the unclear message. After introducing the _PAGE_NUMA, kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it should rely on host's kvmppc_book3s_hv_page_fault() to call do_numa_page() to do the numa fault check. 
This incurs the overhead when exiting from rmode to vmode. My idea is that in kvmppc_do_h_enter(), we do a quick check: if the page is correctly placed, there is no need to exit to vmode (i.e. saving htab, slab switching). If my supposition is correct, I will CC k...@vger.kernel.org from the next version. This translates to me as This is an RFC? Yes, I am not quite sure about it. I have no bare metal to verify it. So I hope that, at least in theory, it is correct. Paul, could you please give this some thought and maybe benchmark it? Alex
Re: [PATCH] powerpc: ftrace: bugfix for test_24bit_addr
On Wed, Feb 26, 2014 at 12:35 PM, Ananth N Mavinakayanahalli ana...@in.ibm.com wrote: On Wed, Feb 26, 2014 at 10:23:01AM +0800, Liu Ping Fan wrote: The branch target should be the func addr, not the addr of the func_descr_t. So use ppc_function_entry() to generate the right target addr. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- This bug will make ftrace fail to work. It can be triggered when the kernel size grows. --- arch/powerpc/kernel/ftrace.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/ftrace.c b/arch/powerpc/kernel/ftrace.c index 9b27b29..b0ded97 100644 --- a/arch/powerpc/kernel/ftrace.c +++ b/arch/powerpc/kernel/ftrace.c @@ -74,6 +74,7 @@ ftrace_modify_code(unsigned long ip, unsigned int old, unsigned int new) */ static int test_24bit_addr(unsigned long ip, unsigned long addr) { + addr = ppc_function_entry((void *)addr); Won't this break on LE? How? I cannot figure it out. Anyway, ppc_function_entry() is already used in other places with LE. Thx, Fan
Re: [PATCH 0/4] powernv: kvm: numa fault improvement
On Wed, Jan 22, 2014 at 1:18 PM, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Paul Mackerras pau...@samba.org writes: On Mon, Jan 20, 2014 at 03:48:36PM +0100, Alexander Graf wrote: On 15.01.2014, at 07:36, Liu ping fan kernelf...@gmail.com wrote: On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote: On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote: This series is based on Aneesh's series [PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64 For this series, I apply the same idea from the previous thread [PATCH 0/3] optimize for powerpc _PAGE_NUMA (for which, I still try to get a machine to show nums) But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host, which is well known. This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it. Sorry for the unclear message. After introducing the _PAGE_NUMA, kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it should rely on host's kvmppc_book3s_hv_page_fault() to call do_numa_page() to do the numa fault check. This incurs the overhead when exiting from rmode to vmode. My idea is that in kvmppc_do_h_enter(), we do a quick check, if the page is right placed, there is no need to exit to vmode (i.e saving htab, slab switching) If my suppose is correct, will CCing k...@vger.kernel.org from next version. This translates to me as This is an RFC? Yes, I am not quite sure about it. I have no bare-metal to verify it. So I hope at least, from the theory, it is correct. Paul, could you please give this some thought and maybe benchmark it? OK, once I get Aneesh to tell me how I get to have ptes with _PAGE_NUMA set in the first place. :) I guess we want patch 2, Which Liu has sent separately and I have reviewed. 
http://article.gmane.org/gmane.comp.emulators.kvm.powerpc.devel/8619 I am not sure about the rest of the patches in the series. We definitely don't want to numa migrate on henter. We may want to do that on fault. But even there, IMHO, we should let the host take the fault and do the numa migration instead of doing this in guest context. My patch does NOT do the numa migration in guest context (h_enter). Instead, it just does a pre-check to see whether the numa migration is needed. If needed, the host will take the fault and do the numa migration as it currently does. Otherwise, h_enter can directly set up the hpte without HPTE_V_ABSENT. And since pte_mknuma() is called system-wide periodically, it is more likely that the guest will suffer from HPTE_V_ABSENT. (As in my previous reply, I think we should also place the quick check in kvmppc_hpte_hv_fault.) Thx, Fan -aneesh
[PATCH v2] powernv: kvm: make _PAGE_NUMA take effect
To make sure that on the host, the pages marked with _PAGE_NUMA result in a fault when the guest accesses them, we should force the checking when the guest uses a hypercall to set up an hpte. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- v2: It should be the reply to [PATCH 2/4] powernv: kvm: make _PAGE_NUMA take effect And I improve the changelog according to Aneesh's suggestion. --- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 9c51544..af8602d 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -232,7 +232,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, /* Look up the Linux PTE for the backing page */ pte_size = psize; pte = lookup_linux_pte(pgdir, hva, writing, &pte_size); - if (pte_present(pte)) { + if (pte_present(pte) && !pte_numa(pte)) { if (writing && !pte_write(pte)) /* make the actual HPTE be read-only */ ptel = hpte_make_readonly(ptel); -- 1.8.1.4
Re: [PATCH 0/4] powernv: kvm: numa fault improvement
On Tue, Jan 21, 2014 at 11:40 AM, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Liu ping fan kernelf...@gmail.com writes: On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Liu ping fan kernelf...@gmail.com writes: On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote: On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote: This series is based on Aneesh's series [PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64 For this series, I apply the same idea from the previous thread [PATCH 0/3] optimize for powerpc _PAGE_NUMA (for which, I still try to get a machine to show nums) But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host, which is well known. This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it. Sorry for the unclear message. After introducing the _PAGE_NUMA, kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it should rely on host's kvmppc_book3s_hv_page_fault() to call do_numa_page() to do the numa fault check. This incurs the overhead when exiting from rmode to vmode. My idea is that in kvmppc_do_h_enter(), we do a quick check, if the page is right placed, there is no need to exit to vmode (i.e saving htab, slab switching) Can you explain more. Are we looking at hcall from guest and hypervisor handling them in real mode ? If so why would guest issue a hcall on a pte entry that have PAGE_NUMA set. Or is this about hypervisor handling a missing hpte, because of host swapping this page out ? In that case how we end up in h_enter ? IIUC for that case we should get to kvmppc_hpte_hv_fault. After setting _PAGE_NUMA, we should flush out all hptes both in host's htab and guest's. 
So when the guest tries to access memory, the host finds that there is no hpte ready for the guest in the guest's htab, and the host should raise a dsi to the guest. Now the guest receives that fault, removes the PAGE_NUMA bit and does an hpte_insert. So before we do an hpte_insert (or H_ENTER) we should have cleared the PAGE_NUMA bit. This is how the guest ends up in h_enter. And you can see in the current code, we also try this quick path first. Only if it fails do we resort to the slow path -- kvmppc_hpte_hv_fault. hmm ? hpte_hv_fault is the hypervisor handling the fault. After we discussed on irc, I think we should also do the fast check in kvmppc_hpte_hv_fault() for the case of HPTE_V_ABSENT, and let H_ENTER take care of the rest, i.e. no hpte when pte_mknuma. Right? Thanks and regards, Fan -aneesh
Re: [PATCH 0/4] powernv: kvm: numa fault improvement
On Tue, Jan 21, 2014 at 5:07 PM, Liu ping fan kernelf...@gmail.com wrote: On Tue, Jan 21, 2014 at 11:40 AM, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Liu ping fan kernelf...@gmail.com writes: On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Liu ping fan kernelf...@gmail.com writes: On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote: On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote: This series is based on Aneesh's series [PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64 For this series, I apply the same idea from the previous thread [PATCH 0/3] optimize for powerpc _PAGE_NUMA (for which, I still try to get a machine to show nums) But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host, which is well known. This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it. Sorry for the unclear message. After introducing the _PAGE_NUMA, kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it should rely on host's kvmppc_book3s_hv_page_fault() to call do_numa_page() to do the numa fault check. This incurs the overhead when exiting from rmode to vmode. My idea is that in kvmppc_do_h_enter(), we do a quick check, if the page is right placed, there is no need to exit to vmode (i.e saving htab, slab switching) Can you explain more. Are we looking at hcall from guest and hypervisor handling them in real mode ? If so why would guest issue a hcall on a pte entry that have PAGE_NUMA set. Or is this about hypervisor handling a missing hpte, because of host swapping this page out ? In that case how we end up in h_enter ? IIUC for that case we should get to kvmppc_hpte_hv_fault. 
After setting _PAGE_NUMA, we should flush out all hptes both in the host's htab and the guest's. So when the guest tries to access memory, the host finds that there is no hpte ready for the guest in the guest's htab, and the host should raise a dsi to the guest. Now the guest receives that fault, removes the PAGE_NUMA bit and does an hpte_insert. So before we do an hpte_insert (or H_ENTER) we should have cleared the PAGE_NUMA bit. This is how the guest ends up in h_enter. And you can see in the current code, we also try this quick path first. Only if it fails do we resort to the slow path -- kvmppc_hpte_hv_fault. hmm ? hpte_hv_fault is the hypervisor handling the fault. After we discussed on irc, I think we should also do the fast check in kvmppc_hpte_hv_fault() for the case of HPTE_V_ABSENT, and let H_ENTER take care of the rest, i.e. no hpte when pte_mknuma. Right? Or we can delay the quick fix in H_ENTER, and let the host fault again, then do the fix in kvmppc_hpte_hv_fault() Thanks and regards, Fan -aneesh
Re: [PATCH 0/4] powernv: kvm: numa fault improvement
On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Liu ping fan kernelf...@gmail.com writes: On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote: On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote: This series is based on Aneesh's series [PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64 For this series, I apply the same idea from the previous thread [PATCH 0/3] optimize for powerpc _PAGE_NUMA (for which, I still try to get a machine to show nums) But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host, which is well known. This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it. Sorry for the unclear message. After introducing the _PAGE_NUMA, kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it should rely on host's kvmppc_book3s_hv_page_fault() to call do_numa_page() to do the numa fault check. This incurs the overhead when exiting from rmode to vmode. My idea is that in kvmppc_do_h_enter(), we do a quick check, if the page is right placed, there is no need to exit to vmode (i.e saving htab, slab switching) Can you explain more. Are we looking at hcall from guest and hypervisor handling them in real mode ? If so why would guest issue a hcall on a pte entry that have PAGE_NUMA set. Or is this about hypervisor handling a missing hpte, because of host swapping this page out ? In that case how we end up in h_enter ? IIUC for that case we should get to kvmppc_hpte_hv_fault. After setting _PAGE_NUMA, we should flush out all hptes both in host's htab and guest's. So when guest tries to access memory, host finds that there is not hpte ready for guest in guest's htab. And host should raise dsi to guest. 
This is how the guest ends up in h_enter. And you can see in the current code, we also try this quick path first. Only if it fails do we resort to the slow path -- kvmppc_hpte_hv_fault. Thanks and regards, Fan If my supposition is correct, I will CC k...@vger.kernel.org from the next version. This translates to me as This is an RFC? Yes, I am not quite sure about it. I have no bare metal to verify it. So I hope that, at least in theory, it is correct. -aneesh
Re: [PATCH 0/4] powernv: kvm: numa fault improvement
On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote: On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote: This series is based on Aneesh's series [PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64 For this series, I apply the same idea from the previous thread [PATCH 0/3] optimize for powerpc _PAGE_NUMA (for which, I still try to get a machine to show nums) But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host, which is well known. This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it. Sorry for the unclear message. After introducing the _PAGE_NUMA, kvmppc_do_h_enter() cannot fill up the hpte for the guest. Instead, it should rely on the host's kvmppc_book3s_hv_page_fault() to call do_numa_page() to do the numa fault check. This incurs the overhead when exiting from rmode to vmode. My idea is that in kvmppc_do_h_enter(), we do a quick check: if the page is correctly placed, there is no need to exit to vmode (i.e. saving htab, slab switching). If my supposition is correct, I will CC k...@vger.kernel.org from the next version. This translates to me as This is an RFC? Yes, I am not quite sure about it. I have no bare metal to verify it. So I hope that, at least in theory, it is correct. Thanks and regards, Ping Fan Alex
[PATCH 0/4] powernv: kvm: numa fault improvement
This series is based on Aneesh's series "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64". For this series, I apply the same idea from the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA" (for which I am still trying to get a machine to show numbers). But for this series, I think I have a good justification -- the well-known heavy cost of switching context between guest and host. If my supposition is correct, I will CC k...@vger.kernel.org from the next version. Liu Ping Fan (4): mm: export numa_migrate_prep() powernv: kvm: make _PAGE_NUMA take effect powernv: kvm: extend input param for lookup_linux_pte powernv: kvm: make the handling of _PAGE_NUMA faster for guest arch/powerpc/kvm/book3s_hv_rm_mmu.c | 38 ++--- include/linux/mm.h | 2 ++ 2 files changed, 37 insertions(+), 3 deletions(-) -- 1.8.1.4
[PATCH 2/4] powernv: kvm: make _PAGE_NUMA take effect
To make _PAGE_NUMA take effect, we should force the checking when guest uses hypercall to setup hpte. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 9c51544..af8602d 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -232,7 +232,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, /* Look up the Linux PTE for the backing page */ pte_size = psize; pte = lookup_linux_pte(pgdir, hva, writing, &pte_size); - if (pte_present(pte)) { + if (pte_present(pte) && !pte_numa(pte)) { if (writing && !pte_write(pte)) /* make the actual HPTE be read-only */ ptel = hpte_make_readonly(ptel); -- 1.8.1.4
[PATCH 1/4] mm: export numa_migrate_prep()
powerpc will use it in fast path. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- include/linux/mm.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 5ab0e22..420fb77 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1092,6 +1092,8 @@ extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long extern int mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, unsigned long start, unsigned long end, unsigned long newflags); +extern int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, + unsigned long addr, int page_nid); /* * doesn't attempt to fault and will return short. -- 1.8.1.4
[PATCH 3/4] powernv: kvm: extend input param for lookup_linux_pte
It will be helpful for the next patch. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- Can it be merged with the next patch? --- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index af8602d..ae46052 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -135,7 +135,8 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index, } static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva, - int writing, unsigned long *pte_sizep) + int writing, unsigned long *pte_sizep, + pte_t **ptepp) { pte_t *ptep; unsigned long ps = *pte_sizep; @@ -144,6 +145,8 @@ static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva, ptep = find_linux_pte_or_hugepte(pgdir, hva, &hugepage_shift); if (!ptep) return __pte(0); + if (ptepp != NULL) + *ptepp = ptep; if (hugepage_shift) *pte_sizep = 1ul << hugepage_shift; else @@ -231,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, /* Look up the Linux PTE for the backing page */ pte_size = psize; - pte = lookup_linux_pte(pgdir, hva, writing, &pte_size); + pte = lookup_linux_pte(pgdir, hva, writing, &pte_size, NULL); if (pte_present(pte) && !pte_numa(pte)) { if (writing && !pte_write(pte)) /* make the actual HPTE be read-only */ ptel = hpte_make_readonly(ptel); @@ -671,7 +674,8 @@ long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags, memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn); if (memslot) { hva = __gfn_to_hva_memslot(memslot, gfn); - pte = lookup_linux_pte(pgdir, hva, 1, &psize); + pte = lookup_linux_pte(pgdir, hva, 1, &psize, + NULL); if (pte_present(pte) && !pte_write(pte)) r = hpte_make_readonly(r); } -- 1.8.1.4
[PATCH 4/4] powernv: kvm: make the handling of _PAGE_NUMA faster for guest
The periodic check of _PAGE_NUMA can land on a correctly placed page. For this case, when the guest tries to set up an hpte in real mode, we try to resolve the numa fault in real mode, since the switch between guest context and host context costs too much. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 32 ++-- 1 file changed, 30 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index ae46052..a06b199 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -179,6 +179,11 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, unsigned int writing; unsigned long mmu_seq; unsigned long rcbits; + struct mm_struct *mm = kvm->mm; + struct vm_area_struct *vma; + int page_nid, target_nid; + struct page *test_page; + pte_t *ptep; psize = hpte_page_size(pteh, ptel); if (!psize) @@ -234,8 +239,26 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, /* Look up the Linux PTE for the backing page */ pte_size = psize; - pte = lookup_linux_pte(pgdir, hva, writing, &pte_size, NULL); - if (pte_present(pte) && !pte_numa(pte)) { + pte = lookup_linux_pte(pgdir, hva, writing, &pte_size, &ptep); + if (pte_present(pte)) { + if (pte_numa(pte)) { + /* If fail, let gup handle it */ + if (unlikely(!down_read_trylock(&mm->mmap_sem))) + goto pte_check; + + vma = find_vma(mm, hva); + up_read(&mm->mmap_sem); + test_page = pte_page(pte); + page_nid = page_to_nid(test_page); + target_nid = numa_migrate_prep(test_page, vma, + hva, page_nid); + put_page(test_page); + if (unlikely(target_nid != -1)) { + /* If fail, let gup handle it */ + goto pte_check; + } + } + if (writing && !pte_write(pte)) /* make the actual HPTE be read-only */ ptel = hpte_make_readonly(ptel); @@ -244,6 +267,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, } } +pte_check: if (pte_size < psize) return H_PARAMETER; if (pa && pte_size > psize) @@ -339,6 +363,10 @@
long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, pteh &= ~HPTE_V_VALID; unlock_rmap(rmap); } else { + if (pte_numa(pte) && pa) { + pte = pte_mknonnuma(pte); + *ptep = pte; + } kvmppc_add_revmap_chain(kvm, rev, rmap, pte_index, realmode); /* Only set R/C in real HPTE if already set in *rmap */ -- 1.8.1.4
Re: [PATCH 1/3] powerpc: mm: make _PAGE_NUMA take effect
On Thu, Dec 5, 2013 at 6:53 PM, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Liu Ping Fan kernelf...@gmail.com writes:
To enable the do_numa_page(), we should not fix _PAGE_NUMA in hash_page(), so bail out for the case of pte_numa(). Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/mm/hash_utils_64.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index fb176e9..9bf1195 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -1033,7 +1033,7 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap) /* Get PTE and page size from page tables */ ptep = find_linux_pte_or_hugepte(pgdir, ea, &hugeshift); - if (ptep == NULL || !pte_present(*ptep)) { + if (ptep == NULL || !pte_present(*ptep) || pte_numa(*ptep)) { DBG_LOW(" no PTE !\n"); rc = 1; goto bail;
Why? All the hash routines do check for _PAGE_PRESENT via the access variable.
Going through __hash_page_4K (4k on 4k HW), I do not find such a check. Am I wrong? Otherwise I will send out a patch to fix that. Thanks and regards, Pingfan
-aneesh
Re: [PATCH 1/3] powerpc: mm: make _PAGE_NUMA take effect
On Wed, Dec 11, 2013 at 5:50 PM, Benjamin Herrenschmidt b...@kernel.crashing.org wrote: On Wed, 2013-12-11 at 16:50 +0800, Liu ping fan wrote:
Why? All the hash routines do check for _PAGE_PRESENT via the access variable. Going through __hash_page_4K (4k on 4k HW), I do not find such a check. Am I wrong? Otherwise I will send out a patch to fix that.
We pass a bitmask of flags to check, which are tested by doing an andc of the PTE on that mask and checking if anything is left...
Oh, I see it now, thank you very much. Regards, Pingfan
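Ben's description of the permission check can be sketched in plain C: `andc` (and-with-complement) of the access mask against the PTE leaves exactly the required bits the PTE lacks, so a nonzero result means the access must fault. This is an illustrative sketch with made-up flag values, not the real ppc64 PTE bit layout:

```c
#include <stdint.h>

/* Illustrative flag values only -- not the real ppc64 PTE layout. */
#define MY_PAGE_PRESENT 0x1UL
#define MY_PAGE_USER    0x2UL
#define MY_PAGE_RW      0x4UL

/* The "andc" test Ben describes: andc(access, pte) keeps exactly the
 * required bits that the PTE lacks; nonzero means the check fails. */
static uint64_t missing_bits(uint64_t access, uint64_t pte)
{
        return access & ~pte;   /* andc rX, access, pte */
}
```

So a routine that folds _PAGE_PRESENT into the access mask implicitly checks presence too, which is why no explicit pte_present() test appears in the low-level hash path.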
Re: [PATCH 1/3] powerpc: mm: make _PAGE_NUMA take effect
On Mon, Dec 9, 2013 at 8:31 AM, Benjamin Herrenschmidt b...@kernel.crashing.org wrote: On Thu, 2013-12-05 at 16:23 +0530, Aneesh Kumar K.V wrote: Liu Ping Fan kernelf...@gmail.com writes:
To enable the do_numa_page(), we should not fix _PAGE_NUMA in hash_page(), so bail out for the case of pte_numa().
For some reason I don't have 2/3 and 3/3 in my mbox (though I do have them on patchwork) so I'll reply to this one. Overall, your statement that this is a faster path needs to be backed up with numbers. The code is complicated enough as it is; such additional mess in the low level hashing code requires a good justification, and also a demonstration that it doesn't add overhead to the normal hash path.
For the test, is it ok to have a user application copy pages where all pages are PG_mlocked? Thanks and regards, Pingfan
Re: [PATCH 3/3] powerpc: mm: optimize for the correctly placed page
On Thu, Dec 5, 2013 at 6:58 PM, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com wrote: Liu Ping Fan kernelf...@gmail.com writes:
The periodic check of _PAGE_NUMA can land on a correctly placed page. For this case, we can just insert the hpte and do a fast exception return.
I still don't understand why we need to handle numa faults in hash_page(). Are you trying to optimize the code path? If so, can you explain the benefits? Some numbers showing it is helping?
When returning from hash_page(), we take fast_exc_return_irq, while from do_page_fault() we take ret_from_except. "Fast" implies skipping the more complicated logic that syncs the interrupt state in ret_from_except, which costs a lot. Do you think so?
Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/mm/hash_utils_64.c | 34 +- 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index 9bf1195..735678c 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -965,6 +965,10 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap) const struct cpumask *tmp; int rc, user_region = 0, local = 0; int psize, ssize; + pte_t old, new; + struct vm_area_struct *vma; + int page_nid, target_nid; + struct page *test_page; DBG_LOW("hash_page(ea=%016lx, access=%lx, trap=%lx\n", ea, access, trap); @@ -1033,12 +1037,40 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap) /* Get PTE and page size from page tables */ ptep = find_linux_pte_or_hugepte(pgdir, ea, &hugeshift); - if (ptep == NULL || !pte_present(*ptep) || pte_numa(*ptep)) { + if (ptep == NULL || !pte_present(*ptep)) { DBG_LOW(" no PTE !\n"); rc = 1; goto bail; } + old = pte_val(*ptep); + if (pte_numa(old)) { + /* If fail to lock, let do_page_fault() to handle it */ + if (down_read_trylock(&mm->mmap_sem)) {
hmm, is that something we want to do in hash_page?
Yes, the function has no relation to hash. But I think it depends on whether it is worth optimizing or not. Thanks and regards, Pingfan
+ vma = find_vma(mm, ea); + up_read(&mm->mmap_sem); + test_page = pte_page(old); + page_nid = page_to_nid(test_page); + target_nid = numa_migrate_prep(test_page, vma, ea, + page_nid); + if (target_nid < 0) { + new = pte_mknonnuma(old); + /* If ptep is modified under us, + * just retry the access + */ + if (unlikely(cmpxchg(ptep, old, new) != old)) { + put_page(test_page); + return 0; + } + put_page(test_page); + } + } else { + put_page(test_page); + rc = 1; + goto bail; + } + } + /* Add _PAGE_PRESENT to the required access perm */ access |= _PAGE_PRESENT;
-aneesh
[PATCH 0/3] optimize for powerpc _PAGE_NUMA
I saw Aneesh had implemented Numa faults on ppc64, and most of those patches already have Reviewed-by tags. So I just diffed mine against his to form some trivial modifications. Based on Aneesh's series "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64" Liu Ping Fan (3): powerpc: mm: make _PAGE_NUMA take effect mm: export numa_migrate_prep() powerpc: mm: optimize for the correctly placed page arch/powerpc/mm/hash_utils_64.c | 32 include/linux/mm.h | 2 ++ 2 files changed, 34 insertions(+) -- 1.8.1.4
[PATCH 2/3] mm: export numa_migrate_prep()
powerpc will use it in fast path. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- include/linux/mm.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 5ab0e22..420fb77 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1092,6 +1092,8 @@ extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long extern int mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, unsigned long start, unsigned long end, unsigned long newflags); +extern int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, + unsigned long addr, int page_nid); /* * doesn't attempt to fault and will return short. -- 1.8.1.4
[PATCH 1/3] powerpc: mm: make _PAGE_NUMA take effect
To enable the do_numa_page(), we should not fix _PAGE_NUMA in hash_page(), so bail out for the case of pte_numa(). Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/mm/hash_utils_64.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index fb176e9..9bf1195 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -1033,7 +1033,7 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap) /* Get PTE and page size from page tables */ ptep = find_linux_pte_or_hugepte(pgdir, ea, &hugeshift); - if (ptep == NULL || !pte_present(*ptep)) { + if (ptep == NULL || !pte_present(*ptep) || pte_numa(*ptep)) { DBG_LOW(" no PTE !\n"); rc = 1; goto bail; -- 1.8.1.4
[PATCH 3/3] powerpc: mm: optimize for the correctly placed page
The period check of _PAGE_NUMA can probably trigger the check on the correctly placed page. For this case, we can just insert hpte and do fast exception return. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/mm/hash_utils_64.c | 34 +- 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index 9bf1195..735678c 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -965,6 +965,10 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap) const struct cpumask *tmp; int rc, user_region = 0, local = 0; int psize, ssize; + pte_t old, new; + struct vm_area_struct *vma; + int page_nid, target_nid; + struct page *test_page; DBG_LOW("hash_page(ea=%016lx, access=%lx, trap=%lx\n", ea, access, trap); @@ -1033,12 +1037,40 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap) /* Get PTE and page size from page tables */ ptep = find_linux_pte_or_hugepte(pgdir, ea, &hugeshift); - if (ptep == NULL || !pte_present(*ptep) || pte_numa(*ptep)) { + if (ptep == NULL || !pte_present(*ptep)) { DBG_LOW(" no PTE !\n"); rc = 1; goto bail; } + old = pte_val(*ptep); + if (pte_numa(old)) { + /* If fail to lock, let do_page_fault() to handle it */ + if (down_read_trylock(&mm->mmap_sem)) { + vma = find_vma(mm, ea); + up_read(&mm->mmap_sem); + test_page = pte_page(old); + page_nid = page_to_nid(test_page); + target_nid = numa_migrate_prep(test_page, vma, ea, + page_nid); + if (target_nid < 0) { + new = pte_mknonnuma(old); + /* If ptep is modified under us, + * just retry the access + */ + if (unlikely(cmpxchg(ptep, old, new) != old)) { + put_page(test_page); + return 0; + } + put_page(test_page); + } + } else { + put_page(test_page); + rc = 1; + goto bail; + } + } + /* Add _PAGE_PRESENT to the required access perm */ access |= _PAGE_PRESENT; -- 1.8.1.4
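The core of the patch above is the "clear _PAGE_NUMA, but retry if the PTE changed under us" step. That compare-and-swap retry pattern can be sketched in plain C; this is an illustrative sketch only, using a GCC builtin in place of the kernel's cmpxchg() and a made-up NUMA bit:

```c
#include <stdint.h>

typedef uint64_t pte_t_sim;        /* stand-in for pte_t */
#define SIM_PAGE_NUMA 0x8UL        /* illustrative bit, not the real layout */

/* Mirror of the patch's fast path: drop the NUMA bit with a single
 * compare-and-swap.  Returns 1 on success, 0 if the entry changed
 * underneath us and the caller must fall back to the slow path. */
static int clear_numa_bit(pte_t_sim *ptep)
{
        pte_t_sim oldv = *ptep;
        pte_t_sim newv = oldv & ~SIM_PAGE_NUMA;  /* pte_mknonnuma() analogue */

        /* cmpxchg(ptep, old, new) != old  =>  we lost the race */
        return __sync_val_compare_and_swap(ptep, oldv, newv) == oldv;
}
```

The single cmpxchg is what makes the optimization safe without holding the page table lock: a concurrent update simply makes the swap fail and the access is retried.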
[PATCH 1/2] powerpc: mm: use macro PGTABLE_EADDR_SIZE instead of digital
In case of extending the eaddr in future, use this macro PGTABLE_EADDR_SIZE to ease the maintenance of the code. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/mm/slb_low.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/slb_low.S b/arch/powerpc/mm/slb_low.S index 17aa6df..e0b3cf4 100644 --- a/arch/powerpc/mm/slb_low.S +++ b/arch/powerpc/mm/slb_low.S @@ -35,7 +35,7 @@ _GLOBAL(slb_allocate_realmode) * check for bad kernel/user address * (ea & ~REGION_MASK) >= PGTABLE_RANGE */ - rldicr. r9,r3,4,(63 - 46 - 4) + rldicr. r9,r3,4,(63 - PGTABLE_EADDR_SIZE - 4) bne- 8f srdi r9,r3,60 /* get region */ -- 1.8.1.4
[PATCH 2/2] powerpc: mm: change pgtable index size for 64K page
For 64K page, we waste half of the pte_t page. With this patch, after changing PGD_INDEX_SIZE from 12 to 11 and PTE_INDEX_SIZE from 8 to 9, we can improve the usage of the pte_t page and shrink the contiguous physical size for pgd_t. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/include/asm/pgtable-ppc64-64k.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable-ppc64-64k.h b/arch/powerpc/include/asm/pgtable-ppc64-64k.h index a56b82f..f6955ff 100644 --- a/arch/powerpc/include/asm/pgtable-ppc64-64k.h +++ b/arch/powerpc/include/asm/pgtable-ppc64-64k.h @@ -4,10 +4,10 @@ #include <asm-generic/pgtable-nopud.h> -#define PTE_INDEX_SIZE 8 +#define PTE_INDEX_SIZE 9 #define PMD_INDEX_SIZE 10 #define PUD_INDEX_SIZE 0 -#define PGD_INDEX_SIZE 12 +#define PGD_INDEX_SIZE 11 #ifndef __ASSEMBLY__ #define PTE_TABLE_SIZE (sizeof(real_pte_t) << PTE_INDEX_SIZE) -- 1.8.1.4
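As a sanity check on the arithmetic (not part of the patch): the change moves one bit of index from the PGD level to the PTE level, so with 64K pages (PAGE_SHIFT = 16) and PMD_INDEX_SIZE unchanged at 10, the total mappable effective-address range stays at 46 bits while a PTE page holds twice as many entries:

```c
/* Index sizes taken from the patch; PAGE_SHIFT = 16 for 64K pages. */
enum { PAGE_SHIFT = 16, PMD = 10,
       OLD_PTE = 8,  OLD_PGD = 12,
       NEW_PTE = 9,  NEW_PGD = 11 };

/* Total effective-address bits covered by the page table levels:
 * page offset + PTE index + PMD index + PGD index (PUD is 0 here). */
static int ea_bits(int pte, int pmd, int pgd)
{
        return PAGE_SHIFT + pte + pmd + pgd;
}
```

So the covered range is unchanged (46 bits either way), while each pte_t page now holds 2^9 = 512 entries instead of 256, which is the improved usage the commit message refers to.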
Re: [PATCH RESEND v4] powerpc: kvm: fix rare but potential deadlock scene
On Tue, Nov 19, 2013 at 6:39 PM, Alexander Graf ag...@suse.de wrote: On 19.11.2013, at 07:12, Liu Ping Fan kernelf...@gmail.com wrote:
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and realmode, it can trigger a deadlock. Suppose the following scene: two physical cpus cpuM and cpuN, two VM instances A and B, each VM with a group of vcpus. If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched out, and on cpuN, vcpu_A_2 tries to lock X in realmode, then cpuN will be caught in realmode for a long time. What makes things even worse is if the following happens: on cpuM, bitlock X is held and on cpuN, Y is held; vcpu_B_2 tries to lock Y on cpuM in realmode, and vcpu_A_2 tries to lock X on cpuN in realmode. Oops! Deadlock happens. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
Any particular reason for the resend? The patch is already applied, no?
Oh, it seems that I misunderstood your meaning. You said "Actually, I've changed my mind and moved the patch to the for-3.13 branch instead. Please make sure to CC kvm@vger on all patches you submit though." So I thought it was necessary to resend with a CC to kvm@vger. Regards, Pingfan
[PATCH RESEND v4] powerpc: kvm: fix rare but potential deadlock scene
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and realmode, it can trigger a deadlock. Suppose the following scene: two physical cpus cpuM and cpuN, two VM instances A and B, each VM with a group of vcpus. If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched out, and on cpuN, vcpu_A_2 tries to lock X in realmode, then cpuN will be caught in realmode for a long time. What makes things even worse is if the following happens: on cpuM, bitlock X is held and on cpuN, Y is held; vcpu_B_2 tries to lock Y on cpuM in realmode, and vcpu_A_2 tries to lock X on cpuN in realmode. Oops! Deadlock happens. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 4 ++++ 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 842f081..abf81fe 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -473,11 +473,14 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr, slb_v = vcpu->kvm->arch.vrma_slb_v; } + preempt_disable(); /* Find the HPTE in the hash table */ index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v, HPTE_V_VALID | HPTE_V_ABSENT); - if (index < 0) + if (index < 0) { + preempt_enable(); return -ENOENT; + } hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4)); v = hptep[0] & ~HPTE_V_HVLOCK; gr = kvm->arch.revmap[index].guest_rpte; @@ -485,6 +488,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr, /* Unlock the HPTE */ asm volatile("lwsync" : : : "memory"); hptep[0] = v; + preempt_enable(); gpte->eaddr = eaddr; gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff); diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 9c51544..ea17b30 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -749,6 +749,10 @@ static int slb_base_page_shift[4] = { 20,
/* 1M, unsupported */ }; +/* When called from virtmode, this function should be protected by + * preempt_disable(); otherwise, holding HPTE_V_HVLOCK + * can trigger a deadlock issue. + */ long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v, unsigned long valid) { -- 1.8.1.4
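For readers unfamiliar with HPTE_V_HVLOCK: it is a bit-spinlock stored inside the HPTE word itself, so virtmode and realmode code contend on the same bit. If the virtmode holder is preempted, a realmode spinner can wait unboundedly, which is why the patch wraps the virtmode critical section in preempt_disable(). A minimal user-space sketch of such a bit-lock (illustrative bit position, GCC builtins in place of the kernel's atomics):

```c
#include <stdint.h>

#define SIM_HVLOCK (1UL << 0)   /* stand-in for HPTE_V_HVLOCK */

/* Try to take the lock bit embedded in the HPTE word.
 * Returns 1 if we acquired it, 0 if someone else holds it
 * (the real code would keep spinning in that case). */
static int try_lock_hpte(uint64_t *hptep)
{
        uint64_t old = __sync_fetch_and_or(hptep, SIM_HVLOCK);
        return !(old & SIM_HVLOCK);
}

/* Release: clear the lock bit, leaving the payload bits intact. */
static void unlock_hpte(uint64_t *hptep)
{
        __sync_fetch_and_and(hptep, ~SIM_HVLOCK);
}
```

Because the lock is just a bit in shared data with no owner field and no queue, the only way to bound a realmode spinner's wait is to guarantee the virtmode holder cannot be scheduled out while holding it, which is exactly what the preempt_disable()/preempt_enable() pair added above achieves.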
[PATCH v3] powerpc: kvm: optimize sc 1 as fast return
In some scenes, e.g. openstack CI, a PR guest can trigger sc 1 frequently. This patch optimizes the path by directly delivering BOOK3S_INTERRUPT_SYSCALL to the HV guest, so powernv can return to the HV guest without a heavy exit, i.e. no need to swap TLB, HTAB, etc. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- v3: add some documentation --- arch/powerpc/kvm/book3s_hv.c | 10 -- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 19 ++- 2 files changed, 22 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 62a2b5a..1addb1a 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -628,12 +628,10 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, /* hcall - punt to userspace */ int i; - if (vcpu->arch.shregs.msr & MSR_PR) { - /* sc 1 from userspace - reflect to guest syscall */ - kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL); - r = RESUME_GUEST; - break; - } + /* hypercall with MSR_PR has already been handled in rmode, + * and never reaches here. + */ + run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3); for (i = 0; i < 9; ++i) run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i); diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index c71103b..eea7ca7 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -534,6 +534,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206) 5: mtspr SPRN_SRR0, r6 mtspr SPRN_SRR1, r7 +/* + * Required state: + * R4 = vcpu + * R10: value for HSRR0 + * R11: value for HSRR1 + * R13 = PACA + */ fast_guest_return: li r0,0 stb r0,VCPU_CEDED(r4) /* cancel cede */ @@ -1388,7 +1395,8 @@ kvmppc_hisi: hcall_try_real_mode: ld r3,VCPU_GPR(R3)(r9) andi. r0,r11,MSR_PR - bne guest_exit_cont + /* sc 1 from userspace - reflect to guest syscall */ + bne sc_1_fast_return clrrdi r3,r3,2 cmpldi r3,hcall_real_table_end - hcall_real_table bge guest_exit_cont @@ -1409,6 +1417,15 @@ hcall_try_real_mode: ld r11,VCPU_MSR(r4) b fast_guest_return +sc_1_fast_return: + mtspr SPRN_SRR0,r10 + mtspr SPRN_SRR1,r11 + li r10, BOOK3S_INTERRUPT_SYSCALL + li r11, (MSR_ME << 1) | 1 /* synthesize MSR_SF | MSR_ME */ + rotldi r11, r11, 63 + mr r4,r9 + b fast_guest_return + /* We've attempted a real mode hcall, but it's punted it back * to userspace. We need to restore some clobbered volatiles * before resuming the pass-it-to-qemu path */ -- 1.8.1.4
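The li/rotldi pair in sc_1_fast_return builds the 64-bit value MSR_SF | MSR_ME without needing a 64-bit immediate: rotating ((MSR_ME << 1) | 1) left by 63 bits moves the low 1 into bit 63 (MSR_SF) and shifts MSR_ME back down into place. A quick C check of that arithmetic, assuming the usual ppc64 definitions MSR_SF = 1ULL << 63 and MSR_ME = 1ULL << 12:

```c
#include <stdint.h>

#define MSR_SF (1ULL << 63)
#define MSR_ME (1ULL << 12)

static uint64_t rotl64(uint64_t v, unsigned n)
{
        return (v << n) | (v >> (64 - n));
}

/* What the asm computes in r11:
 *   li     r11, (MSR_ME << 1) | 1
 *   rotldi r11, r11, 63
 * A rotate left by 63 is a rotate right by 1, so the trailing 1
 * lands in bit 63 and (MSR_ME << 1) drops back to MSR_ME. */
static uint64_t synth_msr(void)
{
        uint64_t r11 = (MSR_ME << 1) | 1;
        return rotl64(r11, 63);
}
```

The trick works because the 16-bit signed immediate of `li` can hold (MSR_ME << 1) | 1 = 0x2001, while MSR_SF | MSR_ME itself cannot be loaded in one instruction.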
Re: [PATCH v2] powerpc: kvm: optimize sc 1 as fast return
On Sat, Nov 16, 2013 at 3:00 PM, Paul Mackerras pau...@samba.org wrote: On Fri, Nov 15, 2013 at 04:35:01PM +0800, Liu Ping Fan wrote:
+sc_1_fast_return: + mtspr SPRN_SRR0,r10 + mtspr SPRN_SRR1,r11 + li r10, BOOK3S_INTERRUPT_SYSCALL + li r11, (MSR_ME << 1) | 1 /* synthesize MSR_SF | MSR_ME */ + rotldi r11, r11, 63
You need a "mr r4, r9" instruction here, because fast_guest_return needs the vcpu pointer in r4. Apart from that this looks fine.
Will fix it. Thanks and regards, Pingfan
+ b fast_guest_return
Paul.
[PATCH v3] powerpc: kvm: optimize sc 1 as fast return
In some scene, e.g openstack CI, PR guest can trigger sc 1 frequently, this patch optimizes the path by directly delivering BOOK3S_INTERRUPT_SYSCALL to HV guest, so powernv can return to HV guest without heavy exit, i.e, no need to swap TLB, HTAB,.. etc Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_hv.c | 6 -- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 12 +++- 2 files changed, 11 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 62a2b5a..73dc852 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, /* hcall - punt to userspace */ int i; - if (vcpu->arch.shregs.msr & MSR_PR) { - /* sc 1 from userspace - reflect to guest syscall */ - kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL); - r = RESUME_GUEST; - break; - } run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3); for (i = 0; i < 9; ++i) run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i); diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index c71103b..0d1e2c2 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1388,7 +1388,8 @@ kvmppc_hisi: hcall_try_real_mode: ld r3,VCPU_GPR(R3)(r9) andi. r0,r11,MSR_PR - bne guest_exit_cont + /* sc 1 from userspace - reflect to guest syscall */ + bne sc_1_fast_return clrrdi r3,r3,2 cmpldi r3,hcall_real_table_end - hcall_real_table bge guest_exit_cont @@ -1409,6 +1410,15 @@ hcall_try_real_mode: ld r11,VCPU_MSR(r4) b fast_guest_return +sc_1_fast_return: + mtspr SPRN_SRR0,r10 + mtspr SPRN_SRR1,r11 + li r10, BOOK3S_INTERRUPT_SYSCALL + li r11, (MSR_ME << 1) | 1 /* synthesize MSR_SF | MSR_ME */ + rotldi r11, r11, 63 + mr r4,r9 + b fast_guest_return + /* We've attempted a real mode hcall, but it's punted it back * to userspace. We need to restore some clobbered volatiles * before resuming the pass-it-to-qemu path */ -- 1.8.1.4
[PATCH v4] powerpc: kvm: fix rare but potential deadlock scene
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and realmode, so it can trigger the deadlock. Suppose the following scene: Two physical cpuM, cpuN, two VM instances A, B, each VM has a group of vcpus. If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK), then is switched out, and on cpuN, vcpu_A_2 try to lock X in realmode, then cpuN will be caught in realmode for a long time. What makes things even worse if the following happens, On cpuM, bitlock X is held, on cpuN, Y is held. vcpu_B_2 try to lock Y on cpuM in realmode vcpu_A_2 try to lock X on cpuN in realmode Oops! deadlock happens Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- v4: remove the over-engineered part and keep it simple, also add some notes. --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 4 ++++ 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 842f081..abf81fe 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -473,11 +473,14 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr, slb_v = vcpu->kvm->arch.vrma_slb_v; } + preempt_disable(); /* Find the HPTE in the hash table */ index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v, HPTE_V_VALID | HPTE_V_ABSENT); - if (index < 0) + if (index < 0) { + preempt_enable(); return -ENOENT; + } hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4)); v = hptep[0] & ~HPTE_V_HVLOCK; gr = kvm->arch.revmap[index].guest_rpte; @@ -485,6 +488,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr, /* Unlock the HPTE */ asm volatile("lwsync" : : : "memory"); hptep[0] = v; + preempt_enable(); gpte->eaddr = eaddr; gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff); diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 9c51544..ea17b30 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -749,6 +749,10 @@ static int slb_base_page_shift[4] = { 20, /* 1M, unsupported */ }; +/* When called from virtmode, this func should be protected by + * preempt_disable(), otherwise, the holding of HPTE_V_HVLOCK + * can trigger deadlock issue. + */ long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v, unsigned long valid) { -- 1.8.1.4
[PATCH v2] powerpc: kvm: optimize sc 1 as fast return
In some scene, e.g openstack CI, PR guest can trigger sc 1 frequently, this patch optimizes the path by directly delivering BOOK3S_INTERRUPT_SYSCALL to HV guest, so powernv can return to HV guest without heavy exit, i.e, no need to swap TLB, HTAB,.. etc Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_hv.c | 6 -- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 ++- 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 62a2b5a..73dc852 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, /* hcall - punt to userspace */ int i; - if (vcpu->arch.shregs.msr & MSR_PR) { - /* sc 1 from userspace - reflect to guest syscall */ - kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL); - r = RESUME_GUEST; - break; - } run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3); for (i = 0; i < 9; ++i) run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i); diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index c71103b..a463f08 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1388,7 +1388,8 @@ kvmppc_hisi: hcall_try_real_mode: ld r3,VCPU_GPR(R3)(r9) andi. r0,r11,MSR_PR - bne guest_exit_cont + /* sc 1 from userspace - reflect to guest syscall */ + bne sc_1_fast_return clrrdi r3,r3,2 cmpldi r3,hcall_real_table_end - hcall_real_table bge guest_exit_cont @@ -1409,6 +1410,14 @@ hcall_try_real_mode: ld r11,VCPU_MSR(r4) b fast_guest_return +sc_1_fast_return: + mtspr SPRN_SRR0,r10 + mtspr SPRN_SRR1,r11 + li r10, BOOK3S_INTERRUPT_SYSCALL + li r11, (MSR_ME << 1) | 1 /* synthesize MSR_SF | MSR_ME */ + rotldi r11, r11, 63 + b fast_guest_return + /* We've attempted a real mode hcall, but it's punted it back * to userspace. We need to restore some clobbered volatiles * before resuming the pass-it-to-qemu path */ -- 1.8.1.4
Re: [PATCH] powerpc: kvm: optimize sc 0 as fast return
On Fri, Nov 8, 2013 at 7:12 PM, Paul Mackerras pau...@samba.org wrote:
> On Fri, Nov 08, 2013 at 10:44:16AM +0800, Liu Ping Fan wrote:
>> syscall is a very common behavior inside guest, and this patch
>> optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL, so
>> hypervisor can return to guest without heavy exit, i.e, no need to
>> swap TLB, HTAB,.. etc
>
> Many interrupts that are caused by guest code go directly to the guest
> and don't come to the hypervisor at all.  That includes system call
> (sc 0), alignment interrupts, program interrupts, SLB miss interrupts,
> etc.  See section 6.5 of Book 3S of the Power ISA specification; all
> the interrupts with '-' in the 'HV' column of the table there get
> delivered directly to the guest when they occur inside a guest.

Oh, got it, thanks! That is an important thing I tried to find out but missed all these days.

>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> @@ -1388,7 +1388,8 @@ kvmppc_hisi:
>>  hcall_try_real_mode:
>>  	ld	r3,VCPU_GPR(R3)(r9)
>>  	andi.	r0,r11,MSR_PR
>> -	bne	guest_exit_cont
>> +	/* sc 1 from userspace - reflect to guest syscall */
>> +	bne	sc_0_fast_return
>
> Discrepancy between comment and code here.  In fact we would only take
> the branch for a sc 1 instruction in userspace, which occurs when a PR
> KVM guest nested inside a HV KVM guest does a hypercall (i.e., not for

I made a big mistake from the beginning, and now have a clearer understanding of the scene. Thanks!

> normal system calls).  It is probably worthwhile to speed those up.
>
>> +sc_0_fast_return:
>> +	ld	r10,VCPU_PC(r9)
>> +	ld	r11,VCPU_MSR(r9)
>
> r11 must already contain this since you just did andi. r0,r11,MSR_PR.
> In fact r10 already contains VCPU_PC(r9) at this point also, though
> that is not so obvious.
>
>> +	mtspr	SPRN_SRR0,r10
>> +	mtspr	SPRN_SRR1,r11
>> +	li	r10, BOOK3S_INTERRUPT_SYSCALL
>> +	LOAD_REG_IMMEDIATE(r3,0x87a0)	/* zero 33:36,42:47 */
>> +	and	r11,r11,r3
>
> This is not correct, since you don't even clear PR.  In fact what you

Yes.

> need is to load up MSR_SF | MSR_ME, though that value changes with

Is it enough to only set MSR_SF | MSR_ME? Should the HV guest (PR KVM) need to fake the MSR, so that the PR guest feels that sc 1 is trapped by PR KVM directly? I.e., according to ISA Figure 51, "MSR setting due to interrupt", for System Call we need to keep MSR_IR/_DR unchanged. If that is true, then the HV needs to help the HV guest make this fake. Right?

> little-endian support and changes again with transactional memory
> support for POWER8.  There is an idiom for loading that MSR value,
> which is:
>
> 	li	r11, (MSR_ME << 1) | 1	/* synthesize MSR_SF | MSR_ME */
> 	rotldi	r11, r11, 63

Why do we define MSR_SF_LG as bit 63, not like the ISA, which says bit 0 is SF? And could you enlighten me briefly on what the effect on the value is when LE and transactional memory are introduced?

Thanks and best regards,
Pingfan

> which you could use for now, but it will need to be changed when
> Anton's LE patch gets accepted.
>
> Paul.
Re: [PATCH v3 0/2] powerpc kvm: fix deadlock scene
On Fri, Nov 8, 2013 at 6:58 PM, Paul Mackerras pau...@samba.org wrote:
> On Fri, Nov 08, 2013 at 03:29:52PM +0800, Liu Ping Fan wrote:
>> v2->v3:
>>   introduce kvmppc_hv_unlock_hpte() to pair with kvmppc_hv_find_lock_hpte()
>>   and hide the preemption detail inside this pair from the callers
>
> Actually, I preferred v2.  This version seems a bit over-engineered.
> Making a kvmppc_hv_unlock_hpte() is not such a bad idea, though I
> would make it identical to the existing unlock_hpte() from

Do you think it is helpful to distinguish HPTE_V_LOCK from HPTE_V_HVLOCK at an API level? If it is, I will keep patch 1/2 and just fix patch 2/2.

> book3s_hv_rm_mmu.c, just in a header.  I'm really not convinced about
> putting the preempt_disable/enable inside the lock/unlock functions,
> with the consequent need to pass in a 'vmode' parameter, given that
> there is just one caller that needs to do the preempt_disable/enable.

Ok, will fix patch 2/2.

Thanks and regards,
Pingfan
Re: [PATCH] powerpc: kvm: optimize sc 0 as fast return
On Fri, Nov 8, 2013 at 12:19 PM, Liu ping fan kernelf...@gmail.com wrote: On Fri, Nov 8, 2013 at 11:10 AM, Alexander Graf ag...@suse.de wrote: On 08.11.2013, at 03:44, Liu Ping Fan kernelf...@gmail.com wrote: syscall is a very common behavior inside guest, and this patch optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL, so hypervisor can return to guest without heavy exit, i.e, no need to swap TLB, HTAB,.. etc The syscall exit you touch here only happens when you do an sc 0 with MSR_PR set inside the guest. The only case you realistically see this is when you run PR KVM inside of an HV KVM guest. Maybe I misunderstood the ISA spec, but refer for 6.5.14 System Call Interrupt, no description about the MSR_PR when sc trigger a syscall interrupt. So I think, guest application sc 0 will also fall to the kernel who owns hypervisor mode. Am I right? Some further comment: I think the essential of the problem is whether we switch RMA from guest to HV when interrupts raise. DSI/ISI will be redirected to HDSI and RMA switch. But what about SYSCALL, and DEC, external interrupt, ...etc? I don't think we should optimize for that case. Instead, we should rather try to not bounce to the 1st hypervisor in the first place in that scenario :). Sorry, but just want to make clear about the idiom: 0 - kernel run with NV, and 1st - kernel run on HV-KVM and provide PR-KVM to up layer? Right? When you say try to not bounce to the 1st hypervisor , what is the exact meaning and how can we achieve this? I am a quite newer on powerpc, and hope that I can get more clear figure about it :) Thanks Pingfan ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2 1/2] powerpc/kvm: fix rare but potential deadlock scene
On Thu, Nov 7, 2013 at 5:54 PM, Alexander Graf ag...@suse.de wrote:
> On 07.11.2013, at 07:22, Liu Ping Fan kernelf...@gmail.com wrote:
>> Since kvmppc_hv_find_lock_hpte() is called from both virtmode and
>> realmode, it can trigger a deadlock. Suppose the following scene:
>>
>> Two physical cpus, cpuM and cpuN, and two VM instances, A and B, each
>> with a group of vcpus. If, on cpuM, vcpu_A_1 holds bitlock X
>> (HPTE_V_HVLOCK) and is then switched out, while on cpuN vcpu_A_2
>> tries to lock X in realmode, then cpuN will be caught in realmode for
>> a long time.
>>
>> What makes things even worse is if the following happens: on cpuM
>> bitlock X is held, and on cpuN Y is held.
>>   vcpu_B_2 tries to lock Y on cpuM in realmode
>>   vcpu_A_2 tries to lock X on cpuN in realmode
>> Oops! Deadlock happens.
>>
>> Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
>
> Very nice catch :). I think it makes a lot of sense to document the
> fact that kvmppc_hv_find_lock_hpte() should be called with preemption
> disabled in a comment above the function, so that next time when
> someone potentially calls it, he knows that he needs to put
> preempt_disable() around it.

Ok, I will document them in v3.

> Thanks a lot for finding this pretty subtle issue. May I ask how you
> got there? Did you actually see systems deadlock because of this?

Intuition :). Then I began to model a scene which causes the deadlock. And fortunately, I found a case to verify my suspicion.

Regards,
Pingfan

> Alex

>> ---
>>  arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
>> index 043eec8..dbc1478 100644
>> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
>> @@ -474,10 +474,13 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
>>  	}
>>
>>  	/* Find the HPTE in the hash table */
>> +	preempt_disable();
>>  	index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
>>  					 HPTE_V_VALID | HPTE_V_ABSENT);
>> -	if (index < 0)
>> +	if (index < 0) {
>> +		preempt_enable();
>>  		return -ENOENT;
>> +	}
>>  	hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
>>  	v = hptep[0] & ~HPTE_V_HVLOCK;
>>  	gr = kvm->arch.revmap[index].guest_rpte;
>> @@ -485,6 +488,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
>>  	/* Unlock the HPTE */
>>  	asm volatile("lwsync" : : : "memory");
>>  	hptep[0] = v;
>> +	preempt_enable();
>>
>>  	gpte->eaddr = eaddr;
>>  	gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff);
>> --
>> 1.8.1.4
Re: [PATCH v2 2/2] powerpc/kvm: remove redundant assignment
On Thu, Nov 7, 2013 at 6:06 PM, Alexander Graf ag...@suse.de wrote:
> On 07.11.2013, at 07:22, Liu Ping Fan kernelf...@gmail.com wrote:
>> ret is assigned twice with the same value, so remove the later one.
>>
>> Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
>> Acked-by: Paul Mackerras pau...@samba.org
>
> I suppose my last request for a patch description was slightly too
> abbreviated :). Sorry about that.
>
> Imagine you are a Linux-stable maintainer. You have about 5000 patches
> in front of you and you want to figure out whether a patch should get
> backported into a stable tree or not. It's very easy to read through
> the patch description. It's reasonably easy to do a git show on the
> patch. It's very hard to look at the actual surrounding code that was
> changed.
>
> If I open a text editor on the file, I immediately see what you're
> saying:
>
> 	ret = RESUME_GUEST;
> 	preempt_disable();
> 	while (!try_lock_hpte(hptep, HPTE_V_HVLOCK))
> 		cpu_relax();
> 	if ((hptep[0] & ~HPTE_V_HVLOCK) != hpte[0] || hptep[1] != hpte[1] ||
> 	    rev->guest_rpte != hpte[2])
> 		/* HPTE has been changed under us; let the guest retry */
> 		goto out_unlock;
> 	hpte[0] = (hpte[0] & ~HPTE_V_ABSENT) | HPTE_V_VALID;
>
> 	rmap = &memslot->arch.rmap[gfn - memslot->base_gfn];
> 	lock_rmap(rmap);
>
> 	/* Check if we might have been invalidated; let the guest retry if so */
> 	ret = RESUME_GUEST;
>
> However, that scope is not given in the actual patch itself. If you
> look at the diff below, you have no idea whether the patch is fixing a
> bug or just removes duplication and doesn't actually have any effect.
> In fact, the compiled assembly should be the same with this patch and
> without. But you can't tell from the diff below.
>
> So what I would like to see in the patch description is something that
> makes it easy to understand what's going on without the need to check
> out the source file. Something like "We redundantly set ret to
> RESUME_GUEST twice without changing it in between. Only do it once:",
> followed by the code snippet above.
>
> If I look at that patch description it immediately tells me "Ah, no
> need to worry, it's not a critical bug I need to backport."
>
> If you have a better idea how to express that I'm more than happy to
> take that too. Otherwise just let me know whether you like the
> description above and I'll modify it to the one that includes the code
> snippet when applying the patch.

Oh, yes. Thank you very much. And now I have a better understanding of the heavy work of maintainers :) Will keep this in mind.

Best regards,
Pingfan
[PATCH] powerpc: kvm: optimize sc 0 as fast return
syscall is a very common behavior inside a guest, and this patch optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL, so the hypervisor can return to the guest without a heavy exit, i.e. no need to swap TLB, HTAB, etc.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
Compiled, but for lack of bare metal, I have not tested it yet.
---
 arch/powerpc/kvm/book3s_hv.c            |  6 ------
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 13 ++++++++++++-
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 62a2b5a..73dc852 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
 		/* hcall - punt to userspace */
 		int i;
 
-		if (vcpu->arch.shregs.msr & MSR_PR) {
-			/* sc 1 from userspace - reflect to guest syscall */
-			kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL);
-			r = RESUME_GUEST;
-			break;
-		}
 		run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3);
 		for (i = 0; i < 9; ++i)
 			run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c71103b..9f626c3 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1388,7 +1388,8 @@ kvmppc_hisi:
 hcall_try_real_mode:
 	ld	r3,VCPU_GPR(R3)(r9)
 	andi.	r0,r11,MSR_PR
-	bne	guest_exit_cont
+	/* sc 1 from userspace - reflect to guest syscall */
+	bne	sc_0_fast_return
 	clrrdi	r3,r3,2
 	cmpldi	r3,hcall_real_table_end - hcall_real_table
 	bge	guest_exit_cont
@@ -1409,6 +1410,16 @@ hcall_try_real_mode:
 	ld	r11,VCPU_MSR(r4)
 	b	fast_guest_return
 
+sc_0_fast_return:
+	ld	r10,VCPU_PC(r9)
+	ld	r11,VCPU_MSR(r9)
+	mtspr	SPRN_SRR0,r10
+	mtspr	SPRN_SRR1,r11
+	li	r10, BOOK3S_INTERRUPT_SYSCALL
+	LOAD_REG_IMMEDIATE(r3,0x87a0)	/* zero 33:36,42:47 */
+	and	r11,r11,r3
+	b	fast_guest_return
+
 /* We've attempted a real mode hcall, but it's punted it back
  * to userspace.  We need to restore some clobbered volatiles
  * before resuming the pass-it-to-qemu path */
--
1.8.1.4
Re: [PATCH] powerpc: kvm: optimize sc 0 as fast return
On Fri, Nov 8, 2013 at 11:10 AM, Alexander Graf ag...@suse.de wrote: On 08.11.2013, at 03:44, Liu Ping Fan kernelf...@gmail.com wrote: syscall is a very common behavior inside guest, and this patch optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL, so hypervisor can return to guest without heavy exit, i.e, no need to swap TLB, HTAB,.. etc The syscall exit you touch here only happens when you do an sc 0 with MSR_PR set inside the guest. The only case you realistically see this is when you run PR KVM inside of an HV KVM guest. Maybe I misunderstood the ISA spec, but refer for 6.5.14 System Call Interrupt, no description about the MSR_PR when sc trigger a syscall interrupt. So I think, guest application sc 0 will also fall to the kernel who owns hypervisor mode. Am I right? I don't think we should optimize for that case. Instead, we should rather try to not bounce to the 1st hypervisor in the first place in that scenario :). Sorry, but just want to make clear about the idiom: 0 - kernel run with NV, and 1st - kernel run on HV-KVM and provide PR-KVM to up layer? Right? When you say try to not bounce to the 1st hypervisor , what is the exact meaning and how can we achieve this? I am a quite newer on powerpc, and hope that I can get more clear figure about it :) Thanks Pingfan Alex Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- Compiled, but lack of bare metal, I have not tested it yet. 
--- arch/powerpc/kvm/book3s_hv.c| 6 -- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 13 - 2 files changed, 12 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 62a2b5a..73dc852 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, /* hcall - punt to userspace */ int i; - if (vcpu-arch.shregs.msr MSR_PR) { - /* sc 1 from userspace - reflect to guest syscall */ - kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_SYSCALL); - r = RESUME_GUEST; - break; - } run-papr_hcall.nr = kvmppc_get_gpr(vcpu, 3); for (i = 0; i 9; ++i) run-papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i); diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index c71103b..9f626c3 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -1388,7 +1388,8 @@ kvmppc_hisi: hcall_try_real_mode: ld r3,VCPU_GPR(R3)(r9) andi. r0,r11,MSR_PR - bne guest_exit_cont + /* sc 1 from userspace - reflect to guest syscall */ + bne sc_0_fast_return clrrdi r3,r3,2 cmpldi r3,hcall_real_table_end - hcall_real_table bge guest_exit_cont @@ -1409,6 +1410,16 @@ hcall_try_real_mode: ld r11,VCPU_MSR(r4) b fast_guest_return +sc_0_fast_return: + ld r10,VCPU_PC(r9) + ld r11,VCPU_MSR(r9) + mtspr SPRN_SRR0,r10 + mtspr SPRN_SRR1,r11 + li r10, BOOK3S_INTERRUPT_SYSCALL + LOAD_REG_IMMEDIATE(r3,0x87a0) /* zero 33:36,42:47 */ + and r11,r11,r3 + b fast_guest_return + /* We've attempted a real mode hcall, but it's punted it back * to userspace. We need to restore some clobbered volatiles * before resuming the pass-it-to-qemu path */ -- 1.8.1.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: kvm: optimize sc 0 as fast return
On Fri, Nov 8, 2013 at 12:11 PM, Benjamin Herrenschmidt b...@kernel.crashing.org wrote: On Fri, 2013-11-08 at 15:05 +1100, Benjamin Herrenschmidt wrote: On Fri, 2013-11-08 at 04:10 +0100, Alexander Graf wrote: On 08.11.2013, at 03:44, Liu Ping Fan kernelf...@gmail.com wrote: syscall is a very common behavior inside guest, and this patch optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL, so hypervisor can return to guest without heavy exit, i.e, no need to swap TLB, HTAB,.. etc The syscall exit you touch here only happens when you do an sc 0 with MSR_PR set inside the guest. The only case you realistically see this is when you run PR KVM inside of an HV KVM guest. I don't think we should optimize for that case. Instead, we should rather try to not bounce to the 1st hypervisor in the first place in that scenario :). Well, so unfortunately openstack CI uses PR inside HV pretty heavily it *might* be worthwhile optimizing that path if the patch is simple enough... I'd make that Paul's call. Note that this is a statement of value for the idea ... not the implementation ;-) From a quick look with Paulus, the patch is quite broken. I'll let Paul comment in details. Thank you very much, Regards, Pingfan ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 1/2] powerpc: kvm: pair kvmppc_hv_find_lock_hpte with _unlock_hpte
Highlight the lock pair for the reader. (And later it will be the place to hide the detail about disabling preemption.)

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/kvm_book3s.h |  1 +
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |  7 ++-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   | 13 ++---
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index fa19e2f..a818932 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -129,6 +129,7 @@ extern void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu);
 extern int kvmppc_book3s_hv_page_fault(struct kvm_run *run,
 			struct kvm_vcpu *vcpu, unsigned long addr,
 			unsigned long status);
+extern void kvmppc_hv_unlock_hpte(ulong *hptep, ulong *hpte_val);
 extern long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr,
 			unsigned long slb_v, unsigned long valid);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 842f081..97685e7 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -479,12 +479,9 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 	if (index < 0)
 		return -ENOENT;
 	hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
-	v = hptep[0] & ~HPTE_V_HVLOCK;
+	v = hptep[0];
 	gr = kvm->arch.revmap[index].guest_rpte;
-
-	/* Unlock the HPTE */
-	asm volatile("lwsync" : : : "memory");
-	hptep[0] = v;
+	kvmppc_hv_unlock_hpte(hptep, &v);
 
 	gpte->eaddr = eaddr;
 	gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..0ff9e91 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -749,6 +749,14 @@ static int slb_base_page_shift[4] = {
 	20,	/* 1M, unsupported */
 };
 
+void kvmppc_hv_unlock_hpte(unsigned long *hptep, unsigned long *hpte_val)
+{
+	*hpte_val = *hpte_val & ~HPTE_V_HVLOCK;
+	asm volatile("lwsync" : : : "memory");
+	*hptep = *hpte_val;
+}
+EXPORT_SYMBOL(kvmppc_hv_unlock_hpte);
+
 long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v,
 			      unsigned long valid)
 {
@@ -863,12 +871,11 @@ long kvmppc_hpte_hv_fault(struct kvm_vcpu *vcpu, unsigned long addr,
 			return 0;	/* for prot fault, HPTE disappeared */
 	}
 	hpte = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
-	v = hpte[0] & ~HPTE_V_HVLOCK;
+	v = hpte[0];
 	r = hpte[1];
 	rev = real_vmalloc_addr(&kvm->arch.revmap[index]);
 	gr = rev->guest_rpte;
-
-	unlock_hpte(hpte, v);
+	kvmppc_hv_unlock_hpte(hpte, &v);
 
 	/* For not found, if the HPTE is valid by now, retry the instruction */
 	if ((status & DSISR_NOHPTE) && (v & HPTE_V_VALID))
--
1.8.1.4
[PATCH v3 0/2] powerpc kvm: fix deadlock scene
v2->v3:
  introduce kvmppc_hv_unlock_hpte() to pair with kvmppc_hv_find_lock_hpte()
  and hide the preemption detail inside this pair from the callers

Liu Ping Fan (2):
  powerpc: kvm: pair kvmppc_hv_find_lock_hpte with _unlock_hpte
  powerpc: kvm: fix rare but potential deadlock scene

 arch/powerpc/include/asm/kvm_book3s.h |  3 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 10 ++++++----
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   | 29 +++++++++++++++++++++++------
 3 files changed, 30 insertions(+), 12 deletions(-)
--
1.8.1.4
[PATCH v3 2/2] powerpc: kvm: fix rare but potential deadlock scene
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and realmode, it can trigger a deadlock. Suppose the following scene:

Two physical cpus, cpuM and cpuN, and two VM instances, A and B, each with a group of vcpus. If, on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched out, while on cpuN vcpu_A_2 tries to lock X in realmode, then cpuN will be caught in realmode for a long time.

What makes things even worse is if the following happens: on cpuM bitlock X is held, and on cpuN Y is held.
  vcpu_B_2 tries to lock Y on cpuM in realmode
  vcpu_A_2 tries to lock X on cpuN in realmode
Oops! Deadlock happens.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/kvm_book3s.h |  4 ++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |  5 +++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   | 20 ++++++++++++++++----
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index a818932..3d710ba 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -129,9 +129,9 @@ extern void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu);
 extern int kvmppc_book3s_hv_page_fault(struct kvm_run *run,
 			struct kvm_vcpu *vcpu, unsigned long addr,
 			unsigned long status);
-extern void kvmppc_hv_unlock_hpte(ulong *hptep, ulong *hpte_val);
+extern void kvmppc_hv_unlock_hpte(ulong *hptep, ulong *hpte_val, bool vmode);
 extern long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr,
-			unsigned long slb_v, unsigned long valid);
+			unsigned long slb_v, unsigned long valid, bool vmode);
 extern void kvmppc_mmu_hpte_cache_map(struct kvm_vcpu *vcpu, struct hpte_cache *pte);
 extern struct hpte_cache *kvmppc_mmu_hpte_cache_next(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 97685e7..12d9635 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -475,13 +475,14 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 
 	/* Find the HPTE in the hash table */
 	index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
-					 HPTE_V_VALID | HPTE_V_ABSENT);
+					 HPTE_V_VALID | HPTE_V_ABSENT,
+					 true);
 	if (index < 0)
 		return -ENOENT;
 	hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
 	v = hptep[0];
 	gr = kvm->arch.revmap[index].guest_rpte;
-	kvmppc_hv_unlock_hpte(hptep, &v);
+	kvmppc_hv_unlock_hpte(hptep, &v, true);
 
 	gpte->eaddr = eaddr;
 	gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 0ff9e91..18a9425 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -749,16 +749,22 @@ static int slb_base_page_shift[4] = {
 	20,	/* 1M, unsupported */
 };
 
-void kvmppc_hv_unlock_hpte(unsigned long *hptep, unsigned long *hpte_val)
+void kvmppc_hv_unlock_hpte(unsigned long *hptep, unsigned long *hpte_val,
+		bool vmode)
 {
 	*hpte_val = *hpte_val & ~HPTE_V_HVLOCK;
 	asm volatile("lwsync" : : : "memory");
 	*hptep = *hpte_val;
+	if (unlikely(vmode))
+		preempt_enable();
 }
 EXPORT_SYMBOL(kvmppc_hv_unlock_hpte);
 
+/* If called from virtmode and the lock is taken successfully, the
+ * context is left with preemption disabled.
+ */
 long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v,
-			      unsigned long valid)
+			      unsigned long valid, bool vmode)
 {
 	unsigned int i;
 	unsigned int pshift;
@@ -796,6 +802,9 @@ long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v,
 		avpn &= ~0x7fUL;
 	val |= avpn;
 
+	if (unlikely(vmode))
+		preempt_disable();
+
 	for (;;) {
 		hpte = (unsigned long *)(kvm->arch.hpt_virt + (hash << 7));
 
@@ -833,6 +842,9 @@ long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v,
 			val |= HPTE_V_SECONDARY;
 		hash = hash ^ kvm->arch.hpt_mask;
 	}
+
+	if (unlikely(vmode))
+		preempt_enable();
 	return -1;
 }
 EXPORT_SYMBOL(kvmppc_hv_find_lock_hpte);
@@ -864,7 +876,7 @@ long kvmppc_hpte_hv_fault(struct kvm_vcpu *vcpu, unsigned long addr,
 	if (status & DSISR_NOHPTE)
 		valid |= HPTE_V_ABSENT;
 
-	index = kvmppc_hv_find_lock_hpte(kvm, addr, slb_v, valid);
+	index = kvmppc_hv_find_lock_hpte(kvm, addr, slb_v, valid, false);
 	if (index < 0) {
 		if (status & DSISR_NOHPTE)
 			return status
Re: [PATCH 2/3] powerpc/kvm: fix rare but potential deadlock scene
On Wed, Nov 6, 2013 at 7:18 PM, Paul Mackerras pau...@samba.org wrote:
> On Wed, Nov 06, 2013 at 02:02:07PM +0800, Liu ping fan wrote:
>> On Wed, Nov 6, 2013 at 1:04 PM, Paul Mackerras pau...@samba.org wrote:
>>> On Tue, Nov 05, 2013 at 03:42:43PM +0800, Liu Ping Fan wrote:
>>>> Since kvmppc_hv_find_lock_hpte() is called from both virtmode and
>>>> realmode, so it can trigger the deadlock.
>>>
>>> Good catch, we should have preemption disabled while ever we have a
>>> HPTE locked.
>>>
>>>> @@ -474,8 +474,10 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
>>>>  	}
>>>>
>>>>  	/* Find the HPTE in the hash table */
>>>> +	preempt_disable();
>>>>  	index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
>>>>  					 HPTE_V_VALID | HPTE_V_ABSENT);
>>>> +	preempt_enable();
>>>
>>> Which means we need to add the preempt_enable after unlocking the
>>> HPTE, not here.
>>
>> Yes. Sorry, but I am not sure whether we can call
>> preempt_disable/enable() in realmode. I think since thread_info is
>> allocated at a linear address, we can use preempt_disable/enable()
>> inside kvmppc_hv_find_lock_hpte(), right?
>
> Your analysis correctly pointed out that we can get a deadlock if we
> can be preempted while holding a lock on a HPTE.  That means that we
> have to disable preemption before taking an HPTE lock and keep it
> disabled until after we unlock the HPTE.  Since the point of
> kvmppc_hv_find_lock_hpte() is to lock the HPTE and return with it
> locked, we can't have the preempt_enable() inside it.  The
> preempt_enable() has to come after we have unlocked the HPTE.
>
> That is also why we can't have the preempt_enable() where your patch
> put it; it needs to be about 9 lines further down, after the statement
> hptep[0] = v.  (We also need to make sure to re-enable preemption in
> the index < 0 case.)

Oh, yes, will fix it like you said. My attention was attracted by the trick of calling a kernel function in realmode, and I missed the exact point where the lock is released.

Thanks and regards,
Pingfan
[PATCH v2 1/2] powerpc/kvm: fix rare but potential deadlock scene
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and realmode, it can trigger a deadlock. Suppose the following scene:

Two physical cpus, cpuM and cpuN, and two VM instances, A and B, each with a group of vcpus. If, on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched out, while on cpuN vcpu_A_2 tries to lock X in realmode, then cpuN will be caught in realmode for a long time.

What makes things even worse is if the following happens: on cpuM bitlock X is held, and on cpuN Y is held.
  vcpu_B_2 tries to lock Y on cpuM in realmode
  vcpu_A_2 tries to lock X on cpuN in realmode
Oops! Deadlock happens.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 043eec8..dbc1478 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -474,10 +474,13 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 	}
 
 	/* Find the HPTE in the hash table */
+	preempt_disable();
 	index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
 					 HPTE_V_VALID | HPTE_V_ABSENT);
-	if (index < 0)
+	if (index < 0) {
+		preempt_enable();
 		return -ENOENT;
+	}
 	hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
 	v = hptep[0] & ~HPTE_V_HVLOCK;
 	gr = kvm->arch.revmap[index].guest_rpte;
@@ -485,6 +488,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 	/* Unlock the HPTE */
 	asm volatile("lwsync" : : : "memory");
 	hptep[0] = v;
+	preempt_enable();
 
 	gpte->eaddr = eaddr;
 	gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff);
--
1.8.1.4
[PATCH v2 2/2] powerpc/kvm: remove redundant assignment
ret is assigned twice with the same value, so remove the later one.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
Acked-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index dbc1478..9b97b42 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -733,7 +733,6 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	lock_rmap(rmap);
 
 	/* Check if we might have been invalidated; let the guest retry if so */
-	ret = RESUME_GUEST;
 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) {
 		unlock_rmap(rmap);
 		goto out_unlock;
--
1.8.1.4
Re: [PATCH 2/3] powerpc/kvm: fix rare but potential deadlock scene
On Wed, Nov 6, 2013 at 1:04 PM, Paul Mackerras pau...@samba.org wrote: On Tue, Nov 05, 2013 at 03:42:43PM +0800, Liu Ping Fan wrote: Since kvmppc_hv_find_lock_hpte() is called from both virtmode and realmode, so it can trigger the deadlock. Good catch, we should have preemption disabled while ever we have a HPTE locked. @@ -474,8 +474,10 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr, } /* Find the HPTE in the hash table */ + preempt_disable(); index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v, HPTE_V_VALID | HPTE_V_ABSENT); + preempt_enable(); Which means we need to add the preempt_enable after unlocking the HPTE, not here. Yes. Sorry, but I am not sure about whether we can call preempt_disable/enable() in realmode. I think since thread_info is allocated with linear address, so we can use preempt_disable/enable() inside kvmppc_hv_find_lock_hpte(), right? Thanks and regards, Pingfan Regards, Paul. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/3] powerpc/kvm: simplify the entering logic for secondary thread
On Wed, Nov 6, 2013 at 1:01 PM, Paul Mackerras pau...@samba.org wrote:
> On Tue, Nov 05, 2013 at 03:42:42PM +0800, Liu Ping Fan wrote:
>> After the primary vcpu changes vcore_state to VCORE_RUNNING, there is
>> very little chance to schedule to secondary vcpu. So if we change the
>
> Why do you say there is very little chance to run the secondary vcpu?
>
>> code sequence around set vcore_state to VCORE_RUNNING and disable
>> preemption, we lost little. But we simplify the entering logic, based
>> on the fact that if primary vcpu runs, the secondary vcpu can not be
>> scheduled.
>
> If a vcpu does a hypercall or something else that requires the host
> (kernel or userspace) to do something, that can happen in the context
> of the vcpu task for that vcpu.  That vcpu task can run on another
> core (unless it has been pinned).  When it is finished we would like
> the vcpu to continue executing in the guest as soon as possible.  The
> code that you remove in this patch enables that to happen without
> having to wait until the other threads exit the guest.  So I don't
> think it is a good idea to remove this code.
>
Yes, you are right.

Thanks and regards,
Pingfan

> Regards,
> Paul.
[PATCH 2/3] powerpc/kvm: fix rare but potential deadlock scene
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and
realmode, it can trigger the deadlock.

Suppose the following scene:

Two physical cpus cpuM, cpuN, two VM instances A, B, each VM has a
group of vcpus.

If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK), then is switched
out, and on cpuN, vcpu_A_2 tries to lock X in realmode, then cpuN will
be caught in realmode for a long time.

What makes things even worse is if the following happens:
  On cpuM, bitlock X is held; on cpuN, Y is held.
  vcpu_B_2 tries to lock Y on cpuM in realmode
  vcpu_A_2 tries to lock X on cpuN in realmode
Oops! Deadlock happens.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 043eec8..28160ac 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -474,8 +474,10 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 	}
 
 	/* Find the HPTE in the hash table */
+	preempt_disable();
 	index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v, HPTE_V_VALID |
 					 HPTE_V_ABSENT);
+	preempt_enable();
 	if (index < 0)
 		return -ENOENT;
 	hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
--
1.8.1.4
[PATCH 1/3] powerpc/kvm: simplify the entering logic for secondary thread
After the primary vcpu changes vcore_state to VCORE_RUNNING, there is
very little chance to schedule to the secondary vcpu. So if we change
the code sequence around setting vcore_state to VCORE_RUNNING and
disabling preemption, we lose little. But we simplify the entering
logic, based on the fact that if the primary vcpu runs, the secondary
vcpu cannot be scheduled.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 62a2b5a..38b1fc0 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1222,8 +1222,8 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
 		kvmppc_create_dtl_entry(vcpu, vc);
 	}
 
-	vc->vcore_state = VCORE_RUNNING;
 	preempt_disable();
+	vc->vcore_state = VCORE_RUNNING;
 	spin_unlock(&vc->lock);
 	kvm_guest_enter();
@@ -1351,12 +1351,7 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 	 * this thread straight away and have it join in.
 	 */
 	if (!signal_pending(current)) {
-		if (vc->vcore_state == VCORE_RUNNING &&
-		    VCORE_EXIT_COUNT(vc) == 0) {
-			vcpu->arch.ptid = vc->n_runnable - 1;
-			kvmppc_create_dtl_entry(vcpu, vc);
-			kvmppc_start_thread(vcpu);
-		} else if (vc->vcore_state == VCORE_SLEEPING) {
+		if (vc->vcore_state == VCORE_SLEEPING) {
 			wake_up(&vc->wq);
 		}
--
1.8.1.4
[PATCH 3/3] powerpc/kvm: remove redundant assignment
Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 28160ac..7682837 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -731,7 +731,6 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	lock_rmap(rmap);
 	/* Check if we might have been invalidated; let the guest retry if so */
-	ret = RESUME_GUEST;
 	if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) {
 		unlock_rmap(rmap);
 		goto out_unlock;
--
1.8.1.4