what is the state about "[v2] ppc64 boot: Wait for boot cpu to show up if nr_cpus limit is about to hit"

2017-11-27 Thread Liu ping fan
Hi,

I cannot find the history of this patch:
https://patchwork.ozlabs.org/patch/577193/


Can we have this patch?

Thanks,
Pingfan


Re: [PATCH] powerpc/pseries hotplug: prevent the reserved mem from removing

2017-04-27 Thread Liu ping fan
On Fri, Apr 28, 2017 at 2:06 AM, Hari Bathini
 wrote:
> Hi Pingfan,
>
>
> On Thursday 27 April 2017 01:13 PM, Pingfan Liu wrote:
>>
>> E.g. after fadump reserves mem regions, these regions should not be removed
>> before fadump explicitly frees them.
>> Signed-off-by: Pingfan Liu 
>> ---
>>   arch/powerpc/platforms/pseries/hotplug-memory.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c
>> b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> index e104c71..201be23 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> @@ -346,6 +346,8 @@ static int pseries_remove_memblock(unsigned long base,
>> unsigned int memblock_siz
>>
>> if (!pfn_valid(start_pfn))
>> goto out;
>> +   if (memblock_is_reserved(base))
>> +   return -EINVAL;
>
>
> I think memblock reserved regions are not hot removed even without this
> patch.
> So, can you elaborate on when/why this patch is needed?
>
I have not found code that prevents the reserved regions from being freed. Am I
missing anything?
I will try to reserve a ppc machine to test this.

Thx,
Pingfan


Re: a question about powervm

2016-12-19 Thread Liu ping fan
On Mon, Dec 19, 2016 at 5:58 PM, Michal Suchanek <hramr...@gmail.com> wrote:
> Hello,
>
> On 19 December 2016 at 10:09, Liu ping fan <kernelf...@gmail.com> wrote:
>> Hi guys,
>>
>> I am using a pSeries. It runs on powerVM.
>> My question is whether the cpuX under /sys/devices/system/cpu
>> corresponds to a real cpu or not.
>> I think it is not the same as the cpu on kvm-guest, which is emulated
>> by a linux process.
>
> In general the hypervisor (powerVM or KVM) needs to disable access of
> the guest to some CPU features in order to manage multiple guests. So
> some CPU features are either disabled or emulated by the hypervisor to
> prevent the guest from interfering with isolation of the guests from
> each other and the host. On the other hand, emulating the whole CPU is
> ineffective so to achieve reasonable guest speed it must run on a real
> CPU most of the time. That means that in most cases the CPU you see in
> the guest will have a subset of the features of the CPU the host sees.
>
Thank you for the explanation. And sorry for not explaining my question clearly.
I wonder about the mapping between a guest CPU and a physical cpu. On
powerVM, apart from the features exposed by the host, is it a 1:1 mapping?

> Some virtualization solutions allow you to overcommit CPU time so you can
> allocate more CPUs to guests than you physically have. Then, when
> running, the guest will use an actual physical CPU, but when it is
> sleeping another guest may use the CPU.
>
Yes, agree.

Thanks!


a question about powervm

2016-12-19 Thread Liu ping fan
Hi guys,

I am using a pSeries. It runs on powerVM.
My question is whether the cpuX under /sys/devices/system/cpu
corresponds to a real cpu or not.
I think it is not the same as the cpu on kvm-guest, which is emulated
by a linux process.

Thx.


Re: fadump: a question about "ibm,configure-kernel-dump"

2016-11-18 Thread Liu ping fan
Thank you very much. I am clearer about it now.

On Fri, Nov 18, 2016 at 5:29 PM, Michael Ellerman <m...@ellerman.id.au> wrote:
> Liu ping fan <kernelf...@gmail.com> writes:
>
>> I have an ibm-p8-garrison machine. But I cannot find a node
>> "ibm,configure-kernel-dump" under /proc/device-tree.  Does the garrison
>> machine support fadump? And normally, which component in open-power
>> presents "ibm,configure-kernel-dump"?  I had thought it was in the
>> skiboot git tree or the garrison-xml git tree, but found nothing.
>
> fadump only exists for guests running under PowerVM (LPARs). It sounds
> like you're running bare metal (because you mention skiboot).
>
> If you're talking about guests on top of KVM then it *could* be
> implemented, but AFAIK Qemu/KVM do not implement it.
>
> cheers


fadump: a question about "ibm,configure-kernel-dump"

2016-11-17 Thread Liu ping fan
I have an ibm-p8-garrison machine. But I cannot find a node
"ibm,configure-kernel-dump" under /proc/device-tree.  Does the garrison
machine support fadump? And normally, which component in open-power
presents "ibm,configure-kernel-dump"?  I had thought it was in the
skiboot git tree or the garrison-xml git tree, but found nothing.

Can anyone help?
Thx


Re: [RFC 01/11] sched: introduce sys_cpumask in tsk to adapt asymmetric system

2014-11-17 Thread Liu ping fan
On Wed, Nov 12, 2014 at 5:22 PM, Srikar Dronamraju
sri...@linux.vnet.ibm.com wrote:
 * kernelf...@gmail.com kernelf...@gmail.com [2014-10-16 15:29:50]:

 On some systems such as powerpc, some tasks (vcpu threads) can only run on
 a dedicated cpu, since we adopt an asymmetric method to manage the
 whole physical cpu (powerKVM only allows the primary hwthread to
 set up the runtime env for the secondaries when entering the guest).

 Nowadays, powerKVM runs with all the secondary hwthreads offline to ensure
 the vcpu threads only run on the primary thread. But we plan to keep all
 cpus online when running powerKVM to give more power when switching back
 to the host, so introduce a sys_allowed cpumask to reflect the cpuset on which
 the vcpu thread can run.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  include/linux/init_task.h |  1 +
  include/linux/sched.h |  6 ++
  kernel/sched/core.c   | 10 --
  3 files changed, 15 insertions(+), 2 deletions(-)

 diff --git a/include/linux/init_task.h b/include/linux/init_task.h
 index 2bb4c4f3..c56f69e 100644
 --- a/include/linux/init_task.h
 +++ b/include/linux/init_task.h
 @@ -172,6 +172,7 @@ extern struct task_group root_task_group;
   .normal_prio= MAX_PRIO-20,  \
   .policy = SCHED_NORMAL, \
   .cpus_allowed   = CPU_MASK_ALL, \
 + .sys_allowed = CPU_MASK_ALL,\

 Do we really need another mask, can't we just use cpus_allowed itself?

I think it is not easy to fold the two requirements -- the chip-inherited
constraint and the user's configuration -- onto one mask.

   .nr_cpus_allowed= NR_CPUS,  \
   .mm = NULL, \
   .active_mm  = init_mm, \
 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index 5c2c885..ce429f3 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -1260,7 +1260,10 @@ struct task_struct {

   unsigned int policy;
   int nr_cpus_allowed;
 + /* Anded user and sys_allowed */
   cpumask_t cpus_allowed;
 + /* due to the feature of asymmetric, some tsk can only run on such cpu 
 */
 + cpumask_t sys_allowed;

  #ifdef CONFIG_PREEMPT_RCU
   int rcu_read_lock_nesting;
 @@ -2030,6 +2033,9 @@ static inline void tsk_restore_flags(struct 
 task_struct *task,
  }

  #ifdef CONFIG_SMP
 +extern void set_cpus_sys_allowed(struct task_struct *p,
 + const struct cpumask *new_mask);
 +
  extern void do_set_cpus_allowed(struct task_struct *p,
  const struct cpumask *new_mask);

 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index ec1a286..2cd1ae3 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -4596,13 +4596,19 @@ void init_idle(struct task_struct *idle, int cpu)
  }

  #ifdef CONFIG_SMP
 +void set_cpus_sys_allowed(struct task_struct *p,
 + const struct cpumask *new_mask)
 +{
 + cpumask_copy(&p->sys_allowed, new_mask);
 +}
 +

 This function doesnt seem to be used anywhere... Not sure why it is
 introduced

Not layered the patches well :(  It is used later in the series.

Thx,
Fan

  void do_set_cpus_allowed(struct task_struct *p, const struct cpumask 
 *new_mask)
  {
  if (p->sched_class && p->sched_class->set_cpus_allowed)
  p->sched_class->set_cpus_allowed(p, new_mask);

 - cpumask_copy(&p->cpus_allowed, new_mask);
 - p->nr_cpus_allowed = cpumask_weight(new_mask);
 + cpumask_and(&p->cpus_allowed, &p->sys_allowed, new_mask);
 + p->nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);
  }

  /*
 --
 1.8.3.1
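
To make the intended split concrete, here is a minimal stand-alone sketch (not the kernel code itself) of how the two masks discussed above compose: sys_allowed carries the chip-imposed constraint, and the user-requested affinity is ANDed against it, exactly as do_set_cpus_allowed() does in the hunk above. The mask values are made up for illustration.

/*
 * Illustration only: models the mask composition with plain bitmasks
 * instead of cpumask_t.  "sys_allowed" is the chip-imposed constraint
 * (e.g. only the primary hwthread of each 4-thread core), "user_request"
 * is what sched_setaffinity() asks for.
 */
#include <stdio.h>

int main(void)
{
        unsigned long sys_allowed  = 0x11111111; /* thread 0 of each core  */
        unsigned long user_request = 0x000000ff; /* user asks for cpus 0-7 */

        /* what do_set_cpus_allowed() ends up with: user AND sys */
        unsigned long cpus_allowed = sys_allowed & user_request;

        printf("effective cpus_allowed = 0x%lx\n", cpus_allowed); /* 0x11 */
        return 0;
}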



 --
 Thanks and Regards
 Srikar Dronamraju


Re: [RFC 03/11] powerpc: kvm: add interface to control kvm function on a core

2014-11-17 Thread Liu ping fan
On Mon, Oct 27, 2014 at 12:04 PM, Preeti U Murthy
pre...@linux.vnet.ibm.com wrote:
 Hi Liu,

 On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote:
 When kvm is enabled on a core, we migrate all external irqs to the primary
 thread, since currently the kvm irq logic is handled by the primary
 hwthread.

 Todo: this patch lacks re-enabling of irq balancing when kvm is disabled on
 the core.

 Why is a sysfs file introduced to trigger irq migration? Why is it not
 done during kvm module insert ? And similarly spread interrupts when the
 module is removed? Isn't this a saner way ?

Consider the scenario: with coreA and coreB, we want to enable KVM on coreA
while keeping coreB unchanged.
In fact, I am trying to achieve something asymmetric on the platform. Do
you think that is a justification?

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  arch/powerpc/kernel/sysfs.c| 39 
 ++
  arch/powerpc/sysdev/xics/xics-common.c | 12 +++
  2 files changed, 51 insertions(+)

 diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
 index 67fd2fd..a2595dd 100644
 --- a/arch/powerpc/kernel/sysfs.c
 +++ b/arch/powerpc/kernel/sysfs.c
 @@ -552,6 +552,45 @@ static void sysfs_create_dscr_default(void)
   if (cpu_has_feature(CPU_FTR_DSCR))
   err = device_create_file(cpu_subsys.dev_root, &dev_attr_dscr_default);
  }
 +
 +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
 +#define NR_CORES (CONFIG_NR_CPUS/threads_per_core)
 +static DECLARE_BITMAP(kvm_on_core, NR_CORES) __read_mostly
 +
 +static ssize_t show_kvm_enable(struct device *dev,
 + struct device_attribute *attr, char *buf)
 +{
 +}
 +
 +static ssize_t __used store_kvm_enable(struct device *dev,
 + struct device_attribute *attr, const char *buf,
 + size_t count)
 +{
 + struct cpumask stop_cpus;
 + unsigned long core, thr;
 +
 + sscanf(buf, "%lx", &core);
 + if (core > NR_CORES)
 + return -1;
 + if (!test_bit(core, kvm_on_core))
 + for (thr = 1; thr < threads_per_core; thr++)
 + if (cpu_online(thr * threads_per_core + thr))
 + cpumask_set_cpu(thr * threads_per_core + thr, &stop_cpus);

 What is the above logic trying to do? Did you mean
 cpu_online(threads_per_core * core + thr) ?

Yeah. My mistake, it should be cpumask_set_cpu(core * threads_per_core +
thr, &stop_cpus).

 +
 + stop_machine(xics_migrate_irqs_away_secondary, NULL, &stop_cpus);
 + set_bit(core, kvm_on_core);
 + return count;
 +}
 +
 +static DEVICE_ATTR(kvm_enable, 0600,
 + show_kvm_enable, store_kvm_enable);
 +
 +static void sysfs_create_kvm_enable(void)
 +{
 + device_create_file(cpu_subsys.dev_root, &dev_attr_kvm_enable);
 +}
 +#endif
 +
  #endif /* CONFIG_PPC64 */

  #ifdef HAS_PPC_PMC_PA6T
 diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
 b/arch/powerpc/sysdev/xics/xics-common.c
 index fe0cca4..68b33d8 100644
 --- a/arch/powerpc/sysdev/xics/xics-common.c
 +++ b/arch/powerpc/sysdev/xics/xics-common.c
 @@ -258,6 +258,18 @@ unlock:
  raw_spin_unlock_irqrestore(&desc->lock, flags);
   }
  }
 +
 +int xics_migrate_irqs_away_secondary(void *data)
 +{
 + int cpu = smp_processor_id();
 + if(cpu%thread_per_core != 0) {
 + WARN(condition, format...);
 + return 0;
 + }
 + /* In fact, if we can migrate the primary, it will be more fine */
 + ();

 Isn't the aim of the patch to migrate irqs away from the secondary onto
 the primary? But from above it looks like we are returning when we find
 out that we are secondary threads, isn't it?

Yes, will fix in next version.

 + return 0;
 +}
  #endif /* CONFIG_HOTPLUG_CPU */

 Note that xics_migrate_irqs_away() is defined under CONFIG_CPU_HOTPLUG.
 But we will need this option on PowerKVM even when hotplug is not
 configured in.

Yes, will fix the dependency in next version

Thx,
Fan
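
As a purely illustrative sketch of how the placeholder above might be filled in for the next version (reusing the existing xics_migrate_irqs_away() helper from this file; the exact check is an assumption, not the eventual patch):

/* Sketch only: intended to run via stop_machine() on the secondary
 * hwthreads of the core that is being switched into KVM mode. */
int xics_migrate_irqs_away_secondary(void *data)
{
        int cpu = smp_processor_id();

        /* Only secondary hwthreads (non-zero thread index) should do this. */
        if (cpu % threads_per_core == 0) {
                WARN(1, "called on a primary hwthread (cpu %d)\n", cpu);
                return 0;
        }

        /* Push this thread's external interrupts over to the remaining cpus
         * (in practice, to the core's primary hwthread). */
        xics_migrate_irqs_away();
        return 0;
}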

 Regards
 Preeti U Murthy
  #ifdef CONFIG_SMP



Re: [RFC 04/11] powerpc: kvm: introduce a kthread on primary thread to anti tickless

2014-11-17 Thread Liu ping fan
On Mon, Oct 27, 2014 at 12:45 PM, Preeti U Murthy
pre...@linux.vnet.ibm.com wrote:
 On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote:
 (This patch is a place holder.)

 If only one vcpu thread is ready (the other vcpu threads can
 wait for it to execute), the primary thread can enter tickless mode,

 We do not configure NOHZ_FULL to y by default. Hence no thread would
 enter tickless mode.

But NOHZ_FULL can be chosen by the user, and we should survive it :)

 which causes the primary to keep running, so the secondaries have no
 opportunity to exit to the host, even if they have other tasks on them.

 The secondary threads can still get scheduling ticks. The decrementer of
 the secondary threads is still active. So as long as secondary threads
 are busy, scheduling ticks will fire and try to schedule a new task on them.

No. My original thought was that after enabling KVM on a core, the HDEC on
the secondaries is disabled, otherwise host exits would be too frequent.
Any suggestions?

Thx,
Fan

 Regards
 Preeti U Murthy

 Introduce a kthread (anti_tickless) on the primary, so when there is only
 one vcpu thread on the primary, the secondaries can resort to anti_tickless
 to keep the primary out of tickless mode.
 (I thought the anti_tickless thread could go to NAP, so we can let the
 secondaries run).

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  arch/powerpc/kernel/sysfs.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

 diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
 index a2595dd..f0b110e 100644
 --- a/arch/powerpc/kernel/sysfs.c
 +++ b/arch/powerpc/kernel/sysfs.c
 @@ -575,9 +575,11 @@ static ssize_t __used store_kvm_enable(struct device 
 *dev,
   if (!test_bit(core, kvm_on_core))
   for (thr = 1; thr < threads_per_core; thr++)
   if (cpu_online(thr * threads_per_core + thr))
 - cpumask_set_cpu(thr * threads_per_core + thr, &stop_cpus);
 + cpumask_set_cpu(core * threads_per_core + thr, &stop_cpus);

   stop_machine(xics_migrate_irqs_away_secondary, NULL, &stop_cpus);
 + /* fixme, create a kthread on primary hwthread to handle tickless mode 
 */
 + //kthread_create_on_cpu(prevent_tickless, NULL, core * 
 threads_per_core, ppckvm_prevent_tickless);
   set_bit(core, kvm_on_core);
   return count;
  }
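
For context on the commented-out fixme above: kthread_create_on_cpu() is the kernel API one would reach for to pin such a helper thread to the core's primary hwthread. A minimal sketch, assuming a hypothetical prevent_tickless() thread function (the name is taken from the commented-out line; none of this is in the posted patch):

/* Sketch only: bind an "anti-tickless" kthread to the core's primary hwthread. */
static int prevent_tickless(void *data)
{
        while (!kthread_should_stop()) {
                /* periodic activity keeps the primary from going fully tickless */
                schedule_timeout_interruptible(HZ);
        }
        return 0;
}

static struct task_struct *start_anti_tickless(unsigned long core)
{
        return kthread_create_on_cpu(prevent_tickless, NULL,
                                     core * threads_per_core,
                                     "ppckvm_prevent_tickless/%u");
}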



Re: [RFC 06/11] powerpc: kvm: introduce online in paca to indicate whether cpu is needed by host

2014-11-17 Thread Liu ping fan
On Mon, Oct 27, 2014 at 1:32 PM, Preeti U Murthy
pre...@linux.vnet.ibm.com wrote:
 Hi Liu,

 On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote:
 Nowadays, powerKVM runs with the secondary hwthreads offline. Although
 we can make all secondary hwthreads online later, we still preserve
 this behavior for a dedicated KVM env. Achieve this by setting
 paca->online to false.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  arch/powerpc/include/asm/paca.h |  3 +++
  arch/powerpc/kernel/asm-offsets.c   |  3 +++
  arch/powerpc/kernel/smp.c   |  3 +++
  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 12 
  4 files changed, 21 insertions(+)

 diff --git a/arch/powerpc/include/asm/paca.h 
 b/arch/powerpc/include/asm/paca.h
 index a5139ea..67c2500 100644
 --- a/arch/powerpc/include/asm/paca.h
 +++ b/arch/powerpc/include/asm/paca.h
 @@ -84,6 +84,9 @@ struct paca_struct {
   u8 cpu_start;   /* At startup, processor spins until */
   /* this becomes non-zero. */
   u8 kexec_state; /* set when kexec down has irqs off */
 +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
 + u8 online;
 +#endif
  #ifdef CONFIG_PPC_STD_MMU_64
   struct slb_shadow *slb_shadow_ptr;
   struct dtl_entry *dispatch_log;
 diff --git a/arch/powerpc/kernel/asm-offsets.c 
 b/arch/powerpc/kernel/asm-offsets.c
 index 9d7dede..0faa8fe 100644
 --- a/arch/powerpc/kernel/asm-offsets.c
 +++ b/arch/powerpc/kernel/asm-offsets.c
 @@ -182,6 +182,9 @@ int main(void)
   DEFINE(PACATOC, offsetof(struct paca_struct, kernel_toc));
   DEFINE(PACAKBASE, offsetof(struct paca_struct, kernelbase));
   DEFINE(PACAKMSR, offsetof(struct paca_struct, kernel_msr));
 +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
 + DEFINE(PACAONLINE, offsetof(struct paca_struct, online));
 +#endif
   DEFINE(PACASOFTIRQEN, offsetof(struct paca_struct, soft_enabled));
   DEFINE(PACAIRQHAPPENED, offsetof(struct paca_struct, irq_happened));
   DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id));
 diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
 index a0738af..4c3843e 100644
 --- a/arch/powerpc/kernel/smp.c
 +++ b/arch/powerpc/kernel/smp.c
 @@ -736,6 +736,9 @@ void start_secondary(void *unused)

   cpu_startup_entry(CPUHP_ONLINE);

 +#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
 + get_paca()->online = true;
 +#endif
   BUG();
  }

 diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
 b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 index f0c4db7..d5594b0 100644
 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 @@ -322,6 +322,13 @@ kvm_no_guest:
   li  r0, KVM_HWTHREAD_IN_NAP
   stb r0, HSTATE_HWTHREAD_STATE(r13)
  kvm_do_nap:
 +#ifdef PPCKVM_ENABLE_SECONDARY
 + /* check the cpu is needed by host or not */
 + ld  r2, PACAONLINE(r13)
 + ld  r3, 0
 + cmp r2, r3
 + bne kvm_secondary_exit_trampoline
 +#endif
   /* Clear the runlatch bit before napping */
   mfspr   r2, SPRN_CTRLF
   clrrdi  r2, r2, 1
 @@ -340,6 +347,11 @@ kvm_do_nap:
   nap
   b   .

 +#ifdef PPCKVM_ENABLE_SECONDARY
 +kvm_secondary_exit_trampoline:
 + b   .

 Uh? When we have no vcpu to run, we loop here instead of doing a nap?
 What are we achieving?

 If I understand the intention of the patch well, we are looking to
 provide a knob whereby the host can indicate if it needs the secondaries
 at all.

Yes, you catch it :)

 Today the host does boot with all threads online. There are some init
 scripts which take the secondaries down. So today the host does not have
 a say in preventing this, compile time or runtime. So lets see how we
 can switch between the two behaviors if we don't have the init script,
 which looks like a saner thing to do.

 We should set the paca->online flag to false by default. If
 KVM_PPC_ENABLE_SECONDARY is configured, we need to set this flag to
 true. So at compile time, we resolve the flag.

 While booting, we look at the flag and decide whether to get the
 secondaries online. So we get the current behavior if we have not
 configured KVM_PPC_ENABLE_SECONDARY. Will this achieve the purpose of
 this patch?

At boot time, KVM cannot run. So we can achieve the change of the
flag by soft cpu hotplug on/off (a small sketch follows below).
I think this is a more flexible way.

Thx,
Fan
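
For reference, "soft cpu hotplug on/off" here just means toggling the standard sysfs online file; a minimal user-space sketch (the cpu number is an arbitrary example):

/* Toggle a secondary hwthread via the standard cpu-hotplug sysfs interface. */
#include <stdio.h>

static int set_cpu_online(int cpu, int online)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/online", cpu);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%d\n", online);
        fclose(f);
        return 0;
}

int main(void)
{
        return set_cpu_online(1, 0); /* soft-offline cpu1 (a secondary hwthread) */
}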

 Regards
 Preeti U Murthy


Re: [RFC 09/11] powerpc: kvm: handle time base on secondary hwthread

2014-11-17 Thread Liu ping fan
On Mon, Oct 27, 2014 at 2:40 PM, Preeti U Murthy
pre...@linux.vnet.ibm.com wrote:
 On 10/17/2014 12:59 AM, kernelf...@gmail.com wrote:
 (This is a place holder patch.)
 We need to store the time base for the host on the secondary hwthread.
 Later when switching back, we need to reprogram it with the elapsed
 time.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 ++
  1 file changed, 6 insertions(+)

 diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
 b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 index 89ea16c..a817ba6 100644
 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 @@ -371,6 +371,8 @@ _GLOBAL_TOC(kvmppc_secondary_stopper_enter)

   /* fixme: store other register such as msr */

 + /* fixme: store the tb, and set it as MAX, so we cease the tick on 
 secondary */

 This can lead to serious consequences. First of all, even if we set the
 decrementer(not tb) to MAX, it is bound to fire after 4s. That is the
 maximum time till which you can keep off the decrementer from firing.

 In the hotplug path, the offline cpus nap and their decrementers are
 programmed to fire at MAX too. But the difference is that we clear the
 LPCR_PECE1 wakeup bit which prevents cpus from waking up on a
 decrementer interrupt.

 We cannot afford to do this here though because there are other tasks on
 the secondary threads' runqueue. They need to be scheduled in.
 There are also timers besides the tick_sched one, which can be queued on
 these secondary threads. These patches have not taken care to migrate
 timers or tasks before entering guest as far as I observed. Hence we
 cannot just turn off time base like this and expect to handle the above
 mentioned events the next time the primary thread decides to exit to the
 host.

Yes, that is the hard nut to crack in this series. My plan is to compensate the
hrtimer when the secondary exits (roughly as sketched below).
But as to the scheduler on the secondary, if it is ceased, what is the side-effect?

Thx,
Fan
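
A very rough sketch of the compensation idea mentioned above, purely illustrative (it ignores the decrementer/LPCR wakeup issues Preeti raises and simply assumes the saved timebase is still available when the secondary exits; get_tb()/set_dec() are the usual powerpc helpers):

/* Sketch only: remember the timebase before the secondary cedes to the
 * guest, and on exit force an early decrementer tick so any hrtimer that
 * expired meanwhile is handled promptly. */
static u64 saved_tb;

static void secondary_enter_guest(void)
{
        saved_tb = get_tb();            /* where the host clock was */
}

static void secondary_exit_to_host(void)
{
        u64 elapsed = get_tb() - saved_tb;

        pr_debug("secondary spent %llu tb ticks in guest\n",
                 (unsigned long long)elapsed);
        set_dec(1);                     /* fire the timer interrupt soon */
}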

 Regards
 Preeti U Murthy
 +
   /* prevent us to enter kernel */
   li  r0, 1
   stb r0, HSTATE_HWTHREAD_REQ(r13)
 @@ -382,6 +384,10 @@ _GLOBAL_TOC(kvmppc_secondary_stopper_enter)

  /* enter with vmode */
  kvmppc_secondary_stopper_exit:
 + /* fixme: restore the tb, with the orig val plus time elapse
 + * so we can fire the hrtimer as soon as possible
 + */
 +
   /* fixme, restore the stack which we store on lpaca */

   ld  r0, 112+PPC_LR_STKOFF(r1)



Re: [RFC 11/11] powerpc: kvm: Kconfig add an option for enabling secondary hwthread

2014-11-17 Thread Liu ping fan
On Mon, Oct 27, 2014 at 2:44 PM, Preeti U Murthy
pre...@linux.vnet.ibm.com wrote:
 On 10/17/2014 01:00 AM, kernelf...@gmail.com wrote:
 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  arch/powerpc/kvm/Kconfig | 4 
  1 file changed, 4 insertions(+)

 diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
 index 602eb51..de38566 100644
 --- a/arch/powerpc/kvm/Kconfig
 +++ b/arch/powerpc/kvm/Kconfig
 @@ -93,6 +93,10 @@ config KVM_BOOK3S_64_HV

 If unsure, say N.

 +config KVMPPC_ENABLE_SECONDARY
 + tristate "KVM support for running on secondary hwthread in host"
 + depends on KVM_BOOK3S_64_HV

 This patch is required on top of all the rest :) The top patches won't
 compile without this one. Every patch in the patchset should be able to
 compile successfully without the aid of the patches that come after it.

I think there is a conflict here. If we do so, then we should make an effort to
prevent the independent patches from taking effect before the whole patchset
is applied.

Thx,
Fan

 Regards
 Preeti U Murthy


Re: [PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT

2014-07-29 Thread Liu ping fan
On Tue, Jul 29, 2014 at 2:57 PM, Benjamin Herrenschmidt
b...@kernel.crashing.org wrote:
 On Mon, 2014-07-28 at 15:58 +0800, Liu ping fan wrote:
I hope I am right.  Take the following sequence as an example:

if (hptep[0] & HPTE_V_VALID) {
/* HPTE was previously valid, so we need to invalidate it */
unlock_rmap(rmap);
hptep[0] |= HPTE_V_ABSENT;
kvmppc_invalidate_hpte(kvm, hptep, index);
/* don't lose previous R and C bits */
r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
} else {
kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
}
  <- if we try_to_unmap on
the pfn here, then @r contains an invalid pfn
hptep[1] = r;
eieio();
hptep[0] = hpte[0];
asm volatile("ptesync" : : : "memory");

 If that was the case we would have the same race in kvmppc_do_h_enter().

 I think the fact that the HPTE is locked will prevent the race, ie,
 HPTE_V_HVLOCK is set until hptep[0] is written to.

 If I look at the unmap case, my understanding is that it uses
 kvm_unmap_rmapp() which will also lock the HPTE (try_lock_hpte)
 and so shouldn't have a race vs the above code.

Yes, you are right :)

Thx,
Fan
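
To spell out the ordering Ben is relying on, here is a schematic fragment (not the actual kvm source; only the helper names mirror the real code) of why the intermediate state is never visible to the unmap path:

/* HPTE_V_HVLOCK serialises the updater and kvm_unmap_rmapp():
 * try_lock_hpte() sets HPTE_V_HVLOCK in hptep[0]; the final store of
 * hpte[0] (which does not contain HVLOCK) both publishes the new entry
 * and releases the lock, so the unmap side spins until then. */

while (!try_lock_hpte(hptep, HPTE_V_HVLOCK))    /* writer takes the lock  */
        cpu_relax();

hptep[1] = r;                                   /* second word first      */
eieio();                                        /* order the two stores   */
hptep[0] = hpte[0];                             /* publish + drop HVLOCK  */
asm volatile("ptesync" : : : "memory");

/* A concurrent kvm_unmap_rmapp() also begins with try_lock_hpte(), so it
 * cannot read hptep[1] before the completed hptep[0] is visible. */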


 Or do you see a race I don't ?

 Cheers,
 Ben.

 Thx.
 Fan

 On Mon, Jul 28, 2014 at 2:42 PM, Benjamin Herrenschmidt
 b...@kernel.crashing.org wrote:
  On Mon, 2014-07-28 at 14:09 +0800, Liu Ping Fan wrote:
  In the current code, the setup of the hpte risks racing with
  mmu_notifier_invalidate, i.e. we may set up an hpte with an invalid pfn.
  Resolve this issue by syncing the two actions with KVMPPC_RMAP_LOCK_BIT.
 
  Please describe the race you think you see. I'm quite sure both Paul and
  I went over that code and somewhat convinced ourselves that it was ok
  but it's possible that we were both wrong :-)
 
  Cheers,
  Ben.
 
  Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
  ---
   arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++-
   1 file changed, 10 insertions(+), 5 deletions(-)
 
  diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
  b/arch/powerpc/kvm/book3s_64_mmu_hv.c
  index 8056107..e6dcff4 100644
  --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
  +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
  @@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run 
  *run, struct kvm_vcpu *vcpu,
 
if (hptep[0] & HPTE_V_VALID) {
/* HPTE was previously valid, so we need to invalidate it */
  - unlock_rmap(rmap);
hptep[0] |= HPTE_V_ABSENT;
kvmppc_invalidate_hpte(kvm, hptep, index);
/* don't lose previous R and C bits */
r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
  +
  + hptep[1] = r;
  + eieio();
  + hptep[0] = hpte[0];
  + asm volatile("ptesync" : : : "memory");
  + unlock_rmap(rmap);
} else {
  + hptep[1] = r;
  + eieio();
  + hptep[0] = hpte[0];
  + asm volatile("ptesync" : : : "memory");
kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
}
 
  - hptep[1] = r;
  - eieio();
  - hptep[0] = hpte[0];
  - asm volatile("ptesync" : : : "memory");
preempt_enable();
if (page && hpte_is_writable(r))
SetPageDirty(page);
 
 



[PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT

2014-07-28 Thread Liu Ping Fan
In the current code, the setup of the hpte risks racing with
mmu_notifier_invalidate, i.e. we may set up an hpte with an invalid pfn.
Resolve this issue by syncing the two actions with KVMPPC_RMAP_LOCK_BIT.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 8056107..e6dcff4 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
 
if (hptep[0] & HPTE_V_VALID) {
/* HPTE was previously valid, so we need to invalidate it */
-   unlock_rmap(rmap);
hptep[0] |= HPTE_V_ABSENT;
kvmppc_invalidate_hpte(kvm, hptep, index);
/* don't lose previous R and C bits */
r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
+
+   hptep[1] = r;
+   eieio();
+   hptep[0] = hpte[0];
+   asm volatile("ptesync" : : : "memory");
+   unlock_rmap(rmap);
} else {
+   hptep[1] = r;
+   eieio();
+   hptep[0] = hpte[0];
+   asm volatile("ptesync" : : : "memory");
kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
}
 
-   hptep[1] = r;
-   eieio();
-   hptep[0] = hpte[0];
-   asm volatile("ptesync" : : : "memory");
preempt_enable();
if (page && hpte_is_writable(r))
SetPageDirty(page);
-- 
1.8.1.4


Re: [PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT

2014-07-28 Thread Liu ping fan
I hope I am right.  Take the following sequence as an example:

if (hptep[0] & HPTE_V_VALID) {
/* HPTE was previously valid, so we need to invalidate it */
unlock_rmap(rmap);
hptep[0] |= HPTE_V_ABSENT;
kvmppc_invalidate_hpte(kvm, hptep, index);
/* don't lose previous R and C bits */
r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
} else {
kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
}
  <- if we try_to_unmap on
the pfn here, then @r contains an invalid pfn
hptep[1] = r;
eieio();
hptep[0] = hpte[0];
asm volatile("ptesync" : : : "memory");

Thx.
Fan

On Mon, Jul 28, 2014 at 2:42 PM, Benjamin Herrenschmidt
b...@kernel.crashing.org wrote:
 On Mon, 2014-07-28 at 14:09 +0800, Liu Ping Fan wrote:
 In the current code, the setup of the hpte risks racing with
 mmu_notifier_invalidate, i.e. we may set up an hpte with an invalid pfn.
 Resolve this issue by syncing the two actions with KVMPPC_RMAP_LOCK_BIT.

 Please describe the race you think you see. I'm quite sure both Paul and
 I went over that code and somewhat convinced ourselves that it was ok
 but it's possible that we were both wrong :-)

 Cheers,
 Ben.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++-
  1 file changed, 10 insertions(+), 5 deletions(-)

 diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
 b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 index 8056107..e6dcff4 100644
 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
 +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 @@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
 struct kvm_vcpu *vcpu,

   if (hptep[0] & HPTE_V_VALID) {
   /* HPTE was previously valid, so we need to invalidate it */
 - unlock_rmap(rmap);
   hptep[0] |= HPTE_V_ABSENT;
   kvmppc_invalidate_hpte(kvm, hptep, index);
   /* don't lose previous R and C bits */
   r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
 +
 + hptep[1] = r;
 + eieio();
 + hptep[0] = hpte[0];
 + asm volatile("ptesync" : : : "memory");
 + unlock_rmap(rmap);
   } else {
 + hptep[1] = r;
 + eieio();
 + hptep[0] = hpte[0];
 + asm volatile("ptesync" : : : "memory");
   kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
   }

 - hptep[1] = r;
 - eieio();
 - hptep[0] = hpte[0];
 - asm volatile("ptesync" : : : "memory");
   preempt_enable();
   if (page && hpte_is_writable(r))
   SetPageDirty(page);



[PATCH v4] powerpc: kvm: make _PAGE_NUMA take effect

2014-04-15 Thread Liu Ping Fan
Numa fault is a method which helps to achieve auto numa balancing.
When such a page fault takes place, the page fault handler will check
whether the page is placed correctly. If not, migration should be
involved to cut down the distance between the cpu and pages.

A pte with _PAGE_NUMA helps to implement numa fault. It means not to
allow the MMU to access the page directly. So a page fault is triggered
and the numa fault handler gets the opportunity to run the checker.

As for the access of the MMU, we need special handling for the powernv guest.
When we mark a pte with _PAGE_NUMA, we already call mmu_notifier to
invalidate it in the guest's htab, but when we try to re-insert it,
we first try to map it in real mode. Only after this fails do we fall back
to virt mode, and most importantly, we run the numa fault handler in virt
mode.  This patch guards the real-mode path to ensure that if a pte is
marked with _PAGE_NUMA, it will NOT be mapped in real mode; instead, it will
be mapped in virt mode and have the opportunity to be checked for placement.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
v4:
  more detail description 
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 1d6c56a..8fcc363 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -234,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
pte_size = psize;
pte = lookup_linux_pte_and_update(pgdir, hva, writing,
  &pte_size);
-   if (pte_present(pte)) {
+   if (pte_present(pte) && !pte_numa(pte)) {
if (writing  !pte_write(pte))
/* make the actual HPTE be read-only */
ptel = hpte_make_readonly(ptel);
-- 
1.8.1.4
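
For readers new to this area, the flow the changelog describes can be sketched roughly as follows (schematic pseudocode, not the kvm source; install_hpte_valid() and fall_back_to_virt_mode() are invented placeholders for the real kvmppc paths, and only the pte_numa() test is what this patch actually adds):

/* Schematic: why a _PAGE_NUMA pte must not be mapped in real mode. */
long h_enter_real_mode(pte_t pte /* Linux pte backing the guest page */)
{
        /* Real-mode fast path: only map pages that are present and not
         * marked for NUMA-hinting faults. */
        if (pte_present(pte) && !pte_numa(pte))
                return install_hpte_valid();     /* guest uses it directly */

        /* Otherwise the HPTE stays absent, the guest access faults, and
         * the host handles it in virt mode, where do_numa_page() can
         * check placement and migrate the page if needed. */
        return fall_back_to_virt_mode();
}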


Re: [PATCH v3] powerpc: kvm: make _PAGE_NUMA take effect

2014-04-14 Thread liu ping fan
On Mon, Apr 14, 2014 at 2:43 PM, Alexander Graf ag...@suse.de wrote:

 On 13.04.14 04:27, Liu ping fan wrote:

 On Fri, Apr 11, 2014 at 10:03 PM, Alexander Graf ag...@suse.de wrote:

 On 11.04.2014, at 13:45, Liu Ping Fan pingf...@linux.vnet.ibm.com
 wrote:

 When we mark pte with _PAGE_NUMA we already call
 mmu_notifier_invalidate_range_start
 and mmu_notifier_invalidate_range_end, which will mark existing guest
 hpte
 entry as HPTE_V_ABSENT. Now we need to do that when we are inserting new
 guest hpte entries.

 What happens when we don't? Why do we need the check? Why isn't it done
 implicitly? What happens when we treat a NUMA marked page as non-present?
 Why does it work out for us?

 Assume you have no idea what PAGE_NUMA is, but try to figure out what
 this patch does and whether you need to cherry-pick it into your downstream
 kernel. The description as is still is not very helpful for that. It doesn't
 even explain what really changes with this patch applied.

 Yeah.  What about appending the following description?  Does it make
 the context clear?
 The guest should not set up an hpte for a page whose pte is marked with
 _PAGE_NUMA, so that on the host the numa-fault mechanism can take effect
 to check whether the page is placed correctly or not.


 Try to come up with a text that answers the following questions in order:

I divided them into 3 groups and answer them in 3 sections. It seems that
covers the whole story :)
Please take a look.

   - What does _PAGE_NUMA mean?
Group 1 - section 2

   - How does page migration with _PAGE_NUMA work?
   - Why should we not map pages when _PAGE_NUMA is set?
Group 2 - section 1
(Note: for the 1st question in this group, I am not sure about the
details, except that we can fix numa balancing by moving the task or
moving the page.  So I describe it as "migration should be involved to cut
down the distance between the cpu and pages".)

   - Which part of what needs to be done did the previous _PAGE_NUMA patch
 address?
   - What's the situation without this patch?
   - Which scenario does this patch fix?

Group 3 - section 3


Numa fault is a method which helps to achieve auto numa balancing.
When such a page fault takes place, the page fault handler will check
whether the page is placed correctly. If not, migration should be
involved to cut down the distance between the cpu and pages.

A pte with _PAGE_NUMA helps to implement numa fault. It means not to
allow the MMU to access the page directly. So a page fault is triggered
and the numa fault handler gets the opportunity to run the checker.

As for the access of the MMU, we need special handling for the powernv guest.
When we mark a pte with _PAGE_NUMA, we already call mmu_notifier to
invalidate it in the guest's htab, but when we try to re-insert it,
we first try to fix it in real mode. Only after this fails do we fall back
to virt mode, and most importantly, we run the numa fault handler in virt
mode.  This patch guards the real-mode path to ensure that if a pte is
marked with _PAGE_NUMA, it will NOT be fixed in real mode; instead, it will
be fixed in virt mode and have the opportunity to be checked for placement.


Thx,
Fan


 Once you have a text that answers those, you should have a good patch
 description :).

 Alex



Re: [PATCH v3] powerpc: kvm: make _PAGE_NUMA take effect

2014-04-12 Thread Liu ping fan
On Fri, Apr 11, 2014 at 10:03 PM, Alexander Graf ag...@suse.de wrote:

 On 11.04.2014, at 13:45, Liu Ping Fan pingf...@linux.vnet.ibm.com wrote:

 When we mark pte with _PAGE_NUMA we already call 
 mmu_notifier_invalidate_range_start
 and mmu_notifier_invalidate_range_end, which will mark existing guest hpte
 entry as HPTE_V_ABSENT. Now we need to do that when we are inserting new
 guest hpte entries.

 What happens when we don't? Why do we need the check? Why isn't it done 
 implicitly? What happens when we treat a NUMA marked page as non-present? Why 
 does it work out for us?

 Assume you have no idea what PAGE_NUMA is, but try to figure out what this 
 patch does and whether you need to cherry-pick it into your downstream 
 kernel. The description as is still is not very helpful for that. It doesn't 
 even explain what really changes with this patch applied.

Yeah.  What about appending the following description?  Does it make
the context clear?
The guest should not set up an hpte for a page whose pte is marked with
_PAGE_NUMA, so that on the host the numa-fault mechanism can take effect
to check whether the page is placed correctly or not.


 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 ---
 v3:
  rebased onto Alex's tree, branch kvm-ppc-next
  substitue commit log with the more clear description in Aneesh's reply 
 (thanks, Aneesh)
 ---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
 b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
 index 1d6c56a..1117525 100644
 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
 +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
 @@ -234,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long 
 flags,
   pte_size = psize;
   pte = lookup_linux_pte_and_update(pgdir, hva, writing,
 &pte_size);
 - if (pte_present(pte)) {
 + if (pte_present(pte)&&!pte_numa(pte)) {

 Spaces missing

Will fix,

Thx,
Fan

[PATCH v3] powerpc: kvm: make _PAGE_NUMA take effect

2014-04-11 Thread Liu Ping Fan
When we mark pte with _PAGE_NUMA we already call 
mmu_notifier_invalidate_range_start
and mmu_notifier_invalidate_range_end, which will mark existing guest hpte
entry as HPTE_V_ABSENT. Now we need to do that when we are inserting new
guest hpte entries.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
v3:
  rebased onto Alex's tree, branch kvm-ppc-next
  substitue commit log with the more clear description in Aneesh's reply 
(thanks, Aneesh)
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 1d6c56a..1117525 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -234,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
pte_size = psize;
pte = lookup_linux_pte_and_update(pgdir, hva, writing,
  &pte_size);
-   if (pte_present(pte)) {
+   if (pte_present(pte)&&!pte_numa(pte)) {
if (writing  !pte_write(pte))
/* make the actual HPTE be read-only */
ptel = hpte_make_readonly(ptel);
-- 
1.8.1.4


Re: [PATCH v2] powernv: kvm: make _PAGE_NUMA take effect

2014-04-09 Thread Liu ping fan
On Mon, Apr 7, 2014 at 4:36 PM, Alexander Graf ag...@suse.de wrote:

 On 07.04.14 09:42, Aneesh Kumar K.V wrote:

 Alexander Graf ag...@suse.com writes:

 On 03.04.14 04:36, Liu ping fan wrote:

 Hi Alex, could you help pick up this patch, since the v3.14 kernel can
 enable numa fault for powerpc?

 What bad happens without this patch? We map a page even though it was
 declared to get NUMA migrated? What happens next?

 Nothing much, we won't be properly accounting the numa access in the
 host.  What we want to achieve is to convert a guest access of the page to
 a host fault so that we can do proper numa access accounting in the
 host. This would enable us to migrate the page to the correct numa
 node.


 Ok, so no breakages, just less performance. I wouldn't consider it stable
 material then :).

Sorry for the late reply; I am half out of the office during this period.
I am puzzled about your reply. Without this patch, the guest can
NOT sense the numa changes and ask the host to put the pages in the right
place.  So the pages used by the guest will always be misplaced.
The numa-fault method is motivated by a real requirement to improve
performance, so we should also consider the performance drop of the guest.
Right?

Regards,
Fan


 Alex


Re: [PATCH v2] powernv: kvm: make _PAGE_NUMA take effect

2014-04-02 Thread Liu ping fan
Hi Alex, could you help pick up this patch, since the v3.14 kernel can
enable numa fault for powerpc?

Thx,
Fan

On Mon, Jan 27, 2014 at 5:11 PM, Alexander Graf ag...@suse.de wrote:

 On 21.01.2014, at 10:42, Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com 
 wrote:

 Liu Ping Fan kernelf...@gmail.com writes:

 To make sure that on the host the pages marked with _PAGE_NUMA result in a
 fault when the guest accesses them, we should force the checking when the
 guest uses a hypercall to set up an hpte.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

 When we mark pte with _PAGE_NUMA we already call 
 mmu_notifier_invalidate_range_start and
 mmu_notifier_invalidate_range_end, which will mark existing guest hpte
 entry as HPTE_V_ABSENT. Now we need to do that when we are inserting new
 guest hpte entries. This patch does that.

 So what happens next? We insert a page into the HTAB without HPTE_V_VALID 
 set, so the guest will fail to use it. If the guest does an H_READ on it it 
 will suddenly turn to V_VALID though?

 I might need a crash course in the use of HPTE_V_ABSENT.


 Alex


Re: [PATCH] mm: numa: bugfix for LAST_CPUPID_NOT_IN_PAGE_FLAGS

2014-02-27 Thread liu ping fan
On Fri, Feb 28, 2014 at 7:41 AM, Andrew Morton
a...@linux-foundation.org wrote:
 On Wed, 26 Feb 2014 13:22:16 +0530 Aneesh Kumar K.V 
 aneesh.ku...@linux.vnet.ibm.com wrote:

 Andrew Morton a...@linux-foundation.org writes:

  On Wed,  5 Feb 2014 09:25:46 +0800 Liu Ping Fan qemul...@gmail.com wrote:
 
  When doing some numa tests on powerpc, I triggered an oops bug. I found
  it is caused by using page->_last_cpupid.  It should be initialized as
  -1 & LAST_CPUPID_MASK, not -1. Otherwise, in task_numa_fault(),
  we will miss the check (last_cpupid == (-1 & LAST_CPUPID_MASK)),
  and finally cause an oops bug in task_numa_group(), since the online
  cpus are fewer than the possible cpus.
 
  I grabbed this.  I added this to the changelog:
 
  : PPC needs the LAST_CPUPID_NOT_IN_PAGE_FLAGS case because ppc needs to
  : support a large physical address region, up to 2^46, but a small section
  : size (2^24).  So when NR_CPUS grows, it is easy to cause
  : not-in-page-flags.
 
  to hopefully address Peter's observation.
 
  How should we proceed with this?  I'm getting the impression that numa
  balancing on ppc is a dead duck in 3.14, so perhaps this and
 
  powerpc-mm-add-new-set-flag-argument-to-pte-pmd-update-function.patch
  mm-dirty-accountable-change-only-apply-to-non-prot-numa-case.patch
  mm-use-ptep-pmdp_set_numa-for-updating-_page_numa-bit.patch
 

 All these are already in 3.14  ?

 Yes.

  are 3.15-rc1 material?
 

 We should push the first hunk to 3.14. I will wait for Liu to redo the
 patch. BTW this should happen only when SPARSE_VMEMMAP is not
 specified. Srikar had reported the issue here

 http://mid.gmane.org/20140219180200.ga29...@linux.vnet.ibm.com

 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTIONS_WIDTHSECTIONS_SHIFT
 #else
 #define SECTIONS_WIDTH0
 #endif


 I'm lost.  What patch are you talking about?  The first hunk of what?

I think Aneesh was talking about the chunk of the patch which modified
the file page-flags-layout.h.
I tried to collapse and simplify the logic, but that would make
LAST_CPUPID_WIDTH depend on CONFIG_NUMA_BALANCING.
That is an error since we need LAST_CPUPID_WIDTH even without
CONFIG_NUMA_BALANCING. (Sorry, I compiled and ran the kernel, but did not
find this.)

Thanks and best regards,
Fan

 I assume we're talking about
 mm-numa-bugfix-for-last_cpupid_not_in_page_flags.patch, which I had
 queued for 3.14.  I'll put it on hold until there's some clarity here.


Re: [PATCH] mm: numa: bugfix for LAST_CPUPID_NOT_IN_PAGE_FLAGS

2014-02-27 Thread liu ping fan
On Fri, Feb 28, 2014 at 12:47 PM, Aneesh Kumar K.V
aneesh.ku...@linux.vnet.ibm.com wrote:
 Andrew Morton a...@linux-foundation.org writes:

 On Wed, 26 Feb 2014 13:22:16 +0530 Aneesh Kumar K.V 
 aneesh.ku...@linux.vnet.ibm.com wrote:

 Andrew Morton a...@linux-foundation.org writes:

  On Wed,  5 Feb 2014 09:25:46 +0800 Liu Ping Fan qemul...@gmail.com 
  wrote:
 
   When doing some numa tests on powerpc, I triggered an oops bug. I found
   it is caused by using page->_last_cpupid.  It should be initialized as
   -1 & LAST_CPUPID_MASK, not -1. Otherwise, in task_numa_fault(),
   we will miss the check (last_cpupid == (-1 & LAST_CPUPID_MASK)),
   and finally cause an oops bug in task_numa_group(), since the online
   cpus are fewer than the possible cpus.
 
  I grabbed this.  I added this to the changelog:
 
   : PPC needs the LAST_CPUPID_NOT_IN_PAGE_FLAGS case because ppc needs to
   : support a large physical address region, up to 2^46, but a small section
   : size (2^24).  So when NR_CPUS grows, it is easy to cause
   : not-in-page-flags.
 
  to hopefully address Peter's observation.
 
  How should we proceed with this?  I'm getting the impression that numa
  balancing on ppc is a dead duck in 3.14, so perhaps this and
 
  powerpc-mm-add-new-set-flag-argument-to-pte-pmd-update-function.patch
  mm-dirty-accountable-change-only-apply-to-non-prot-numa-case.patch
  mm-use-ptep-pmdp_set_numa-for-updating-_page_numa-bit.patch
 

 All these are already in 3.14  ?

 Yes.

  are 3.15-rc1 material?
 

 We should push the first hunk to 3.14. I will wait for Liu to redo the
 patch. BTW this should happen only when SPARSE_VMEMMAP is not
 specified. Srikar had reported the issue here

 http://mid.gmane.org/20140219180200.ga29...@linux.vnet.ibm.com

 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTIONS_WIDTH   SECTIONS_SHIFT
 #else
 #define SECTIONS_WIDTH   0
 #endif


 I'm lost.  What patch are you talking about?  The first hunk of what?

 The patch in this thread.


 I assume we're talking about
 mm-numa-bugfix-for-last_cpupid_not_in_page_flags.patch, which I had
 queued for 3.14.  I'll put it on hold until there's some clarity here.

 We don't need the complete patch, it is just the first hunk that we need
 to fix the crash ie. we only need

 diff --git a/include/linux/mm.h b/include/linux/mm.h
 index a7b4e31..ddc66df4 100644
 --- a/include/linux/mm.h
 +++ b/include/linux/mm.h
 @@ -727,7 +727,7 @@ static inline int page_cpupid_last(struct page *page)
  }
  static inline void page_cpupid_reset_last(struct page *page)
  {
 -   page->_last_cpupid = -1;
 +   page->_last_cpupid = -1 & LAST_CPUPID_MASK;
  }
  #else
  static inline int page_cpupid_last(struct page *page)

 Also the issue will only happen when SPARSE_VMEMMAP is not enabled. I
 will send a proper patch with updated changelog. I was hoping Liu will
 get to that quickly

Thanks for sending V2.  Since the ppc machine env was changed by
others, I am blocked on setting up the env to re-test this patch,
and so did not send it out quickly.

Best regards,
Fan
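
A tiny stand-alone illustration of why the masked initializer in the hunk above matters (the 21-bit width is just an example value; the real LAST_CPUPID_SHIFT depends on NR_CPUS):

/* Why page_cpupid_reset_last() must store -1 & LAST_CPUPID_MASK:
 * task_numa_fault() compares against the masked value, so a raw -1 in
 * the full int field never matches "never faulted before". */
#include <stdio.h>

#define LAST_CPUPID_SHIFT 21                        /* example width only */
#define LAST_CPUPID_MASK  ((1 << LAST_CPUPID_SHIFT) - 1)

int main(void)
{
        int raw    = -1;                     /* what the buggy reset stored  */
        int masked = -1 & LAST_CPUPID_MASK;  /* what the check compares with */

        printf("raw=0x%x masked=0x%x equal=%d\n",
               raw, masked, raw == masked);  /* equal=0 -> check is missed */
        return 0;
}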

 -aneesh


[PATCH] powerpc: ftrace: bugfix for test_24bit_addr

2014-02-25 Thread Liu Ping Fan
The branch target should be the func addr, not the addr of the func_descr_t.
So use ppc_function_entry() to generate the right target addr.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
This bug will make ftrace fail to work. It can be triggered when the kernel
size grows.
---
 arch/powerpc/kernel/ftrace.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/ftrace.c b/arch/powerpc/kernel/ftrace.c
index 9b27b29..b0ded97 100644
--- a/arch/powerpc/kernel/ftrace.c
+++ b/arch/powerpc/kernel/ftrace.c
@@ -74,6 +74,7 @@ ftrace_modify_code(unsigned long ip, unsigned int old, 
unsigned int new)
  */
 static int test_24bit_addr(unsigned long ip, unsigned long addr)
 {
+   addr = ppc_function_entry((void *)addr);
 
/* use the create_branch to verify that this offset can be branched */
return create_branch((unsigned int *)ip, addr, 0);
-- 
1.8.1.4
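
Background for the fix, as a sketch: on ppc64 with the original ABI, a function symbol points at a function descriptor rather than at the code itself, which is why the branch-range test has to look at the entry address. Schematic only; the real definitions live in the kernel headers:

/* Sketch of the ppc64 ABIv1 layout that makes this fix necessary. */
typedef struct {
        unsigned long entry;   /* address of the actual code        */
        unsigned long toc;     /* TOC pointer for the function      */
        unsigned long env;     /* environment pointer (unused by C) */
} func_descr_t;

/* What ppc_function_entry() conceptually does on ABIv1: dereference the
 * descriptor to get the code address -- that is the target a relative
 * branch must reach, not the descriptor's own address. */
static unsigned long function_entry(void *func)
{
        return ((func_descr_t *)func)->entry;
}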


Re: [PATCH 0/4] powernv: kvm: numa fault improvement

2014-02-25 Thread Liu ping fan
);
}
}

if (clock_gettime(CLOCK_REALTIME, &now) == -1)
handle_error("clock_gettime");

/* Create a CLOCK_REALTIME absolute timer with initial
expiration and interval as specified in command line */

new_value.it_value.tv_sec = now.tv_sec + elapsed;
new_value.it_value.tv_nsec = now.tv_nsec;
new_value.it_interval.tv_sec = 0;
new_value.it_interval.tv_nsec = 0;

if (timerfd_settime(fd, TFD_TIMER_ABSTIME, &new_value, NULL) == -1)
handle_error("timerfd_settime");

pfd.fd = fd;
pfd.events = POLLIN;
pfd.revents = 0;
/* -1: infinite wait */
poll(&pfd, 1, -1);



/* ask children to stop and get back cnt */

*(pmap + SHM_CMD_OFF) = CMD_STOP;

wait(NULL);
dst_info = (char *)(pmap + SHM_MESSAGE_OFF);
printf(dst_info);
printf("total cnt:%lu\n", *(pmap + SHM_CNT_OFF));

munmap(pmap, PAGE_SIZE);
shm_unlink(SHM_FNAME);
}




On Mon, Jan 20, 2014 at 10:48 PM, Alexander Graf ag...@suse.de wrote:

 On 15.01.2014, at 07:36, Liu ping fan kernelf...@gmail.com wrote:

 On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote:

 On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote:

 This series is based on Aneesh's series  [PATCH -V2 0/5] powerpc: mm: 
 Numa faults support for ppc64

 For this series, I apply the same idea from the previous thread [PATCH 
 0/3] optimize for powerpc _PAGE_NUMA
 (for which, I still try to get a machine to show nums)

 But for this series, I think that I have a good justification -- the fact 
 of heavy cost when switching context between guest and host,
 which is  well known.

 This cover letter isn't really telling me anything. Please put a proper 
 description of what you're trying to achieve, why you're trying to achieve 
 what you're trying and convince your readers that it's a good idea to do it 
 the way you do it.

 Sorry for the unclear message. After introducing the _PAGE_NUMA,
 kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
 should rely on host's kvmppc_book3s_hv_page_fault() to call
 do_numa_page() to do the numa fault check. This incurs the overhead
 when exiting from rmode to vmode.  My idea is that in
 kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
 there is no need to exit to vmode (i.e saving htab, slab switching)

 If my suppose is correct, will CCing k...@vger.kernel.org from next 
 version.

 This translates to me as "This is an RFC"?

 Yes, I am not quite sure about it. I have no bare-metal to verify it.
 So I hope at least, from the theory, it is correct.

 Paul, could you please give this some thought and maybe benchmark it?


 Alex


Re: [PATCH] powerpc: ftrace: bugfix for test_24bit_addr

2014-02-25 Thread Liu ping fan
On Wed, Feb 26, 2014 at 12:35 PM, Ananth N Mavinakayanahalli
ana...@in.ibm.com wrote:
 On Wed, Feb 26, 2014 at 10:23:01AM +0800, Liu Ping Fan wrote:
 The branch target should be the func addr, not the addr of the func_descr_t.
 So use ppc_function_entry() to generate the right target addr.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
 This bug will make ftrace fail to work. It can be triggered when the kernel
 size grows.
 ---
  arch/powerpc/kernel/ftrace.c | 1 +
  1 file changed, 1 insertion(+)

 diff --git a/arch/powerpc/kernel/ftrace.c b/arch/powerpc/kernel/ftrace.c
 index 9b27b29..b0ded97 100644
 --- a/arch/powerpc/kernel/ftrace.c
 +++ b/arch/powerpc/kernel/ftrace.c
 @@ -74,6 +74,7 @@ ftrace_modify_code(unsigned long ip, unsigned int old, 
 unsigned int new)
   */
  static int test_24bit_addr(unsigned long ip, unsigned long addr)
  {
 + addr = ppc_function_entry((void *)addr);

 Won't this break on LE?

How? I cannot figure it out. Anyway, ppc_function_entry() is already
used in other places with LE.

Thx,
Fan

Re: [PATCH 0/4] powernv: kvm: numa fault improvement

2014-01-22 Thread Liu ping fan
On Wed, Jan 22, 2014 at 1:18 PM, Aneesh Kumar K.V
aneesh.ku...@linux.vnet.ibm.com wrote:
 Paul Mackerras pau...@samba.org writes:

 On Mon, Jan 20, 2014 at 03:48:36PM +0100, Alexander Graf wrote:

 On 15.01.2014, at 07:36, Liu ping fan kernelf...@gmail.com wrote:

  On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote:
 
  On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote:
 
  This series is based on Aneesh's series  [PATCH -V2 0/5] powerpc: mm: 
  Numa faults support for ppc64
 
  For this series, I apply the same idea from the previous thread [PATCH 
  0/3] optimize for powerpc _PAGE_NUMA
  (for which, I still try to get a machine to show nums)
 
  But for this series, I think that I have a good justification -- the 
  fact of heavy cost when switching context between guest and host,
  which is  well known.
 
  This cover letter isn't really telling me anything. Please put a proper 
  description of what you're trying to achieve, why you're trying to 
  achieve what you're trying and convince your readers that it's a good 
  idea to do it the way you do it.
 
  Sorry for the unclear message. After introducing the _PAGE_NUMA,
  kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
  should rely on host's kvmppc_book3s_hv_page_fault() to call
  do_numa_page() to do the numa fault check. This incurs the overhead
  when exiting from rmode to vmode.  My idea is that in
  kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
  there is no need to exit to vmode (i.e saving htab, slab switching)
 
  If my suppose is correct, will CCing k...@vger.kernel.org from next 
  version.
 
  This translates to me as "This is an RFC"?
 
  Yes, I am not quite sure about it. I have no bare-metal to verify it.
  So I hope at least, from the theory, it is correct.

 Paul, could you please give this some thought and maybe benchmark it?

 OK, once I get Aneesh to tell me how I get to have ptes with
 _PAGE_NUMA set in the first place. :)


 I guess we want patch 2, Which Liu has sent separately and I have
 reviewed. http://article.gmane.org/gmane.comp.emulators.kvm.powerpc.devel/8619
 I am not sure about the rest of the patches in the series.
 We definitely don't want to numa migrate on henter. We may want to do
 that on fault. But even there, IMHO, we should let the host take the
 fault and do the numa migration instead of doing this in guest context.

My patch does NOT do the numa migration in guest context (h_enter).
Instead it just does a pre-check to see whether the numa migration is
needed. If needed, the host will take the fault and do the numa
migration as it currently does. Otherwise, h_enter can directly set up the
hpte without HPTE_V_ABSENT.
And since pte_mknuma() is called system-wide periodically, there is a higher
possibility that the guest will suffer from HPTE_V_ABSENT. (As in my previous
reply, I think we should also place the quick check in
kvmppc_hpte_hv_fault.)

Thx,
Fan
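
As a rough illustration of the pre-check being debated here (purely a sketch: whether such a lookup is safe from the real-mode hcall path is exactly the open question, and quick_placement_ok() is an invented name):

/* Sketch of the proposed fast path in kvmppc_do_h_enter(): only skip the
 * virt-mode numa handling when the backing page is already well placed. */
static bool quick_placement_ok(struct page *page)
{
        return page_to_nid(page) == numa_node_id();
}

/*
 * ...
 * if (pte_numa(pte) && !quick_placement_ok(page))
 *         return H_TOO_HARD;   // punt to virtual mode / host fault path
 * // otherwise install the HPTE with HPTE_V_VALID as usual
 */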

 -aneesh



[PATCH v2] powernv: kvm: make _PAGE_NUMA take effect

2014-01-21 Thread Liu Ping Fan
To make sure that on the host the pages marked with _PAGE_NUMA result in a fault
when the guest accesses them, we should force the checking when the guest uses a
hypercall to set up an hpte.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
v2:
  It should be the reply to [PATCH 2/4] powernv: kvm: make _PAGE_NUMA take 
effect
  And I improve the changelog according to Aneesh's suggestion.
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..af8602d 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -232,7 +232,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
/* Look up the Linux PTE for the backing page */
pte_size = psize;
pte = lookup_linux_pte(pgdir, hva, writing, &pte_size);
-   if (pte_present(pte)) {
+   if (pte_present(pte) && !pte_numa(pte)) {
if (writing && !pte_write(pte))
/* make the actual HPTE be read-only */
ptel = hpte_make_readonly(ptel);
-- 
1.8.1.4



Re: [PATCH 0/4] powernv: kvm: numa fault improvement

2014-01-21 Thread Liu ping fan
On Tue, Jan 21, 2014 at 11:40 AM, Aneesh Kumar K.V
aneesh.ku...@linux.vnet.ibm.com wrote:
 Liu ping fan kernelf...@gmail.com writes:

 On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V
 aneesh.ku...@linux.vnet.ibm.com wrote:
 Liu ping fan kernelf...@gmail.com writes:

 On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote:

 On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote:

 This series is based on Aneesh's series  [PATCH -V2 0/5] powerpc: mm: 
 Numa faults support for ppc64

 For this series, I apply the same idea from the previous thread [PATCH 
 0/3] optimize for powerpc _PAGE_NUMA
 (for which, I still try to get a machine to show nums)

 But for this series, I think that I have a good justification -- the 
 fact of heavy cost when switching context between guest and host,
 which is  well known.

 This cover letter isn't really telling me anything. Please put a proper 
 description of what you're trying to achieve, why you're trying to 
 achieve what you're trying and convince your readers that it's a good 
 idea to do it the way you do it.

 Sorry for the unclear message. After introducing the _PAGE_NUMA,
 kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
 should rely on host's kvmppc_book3s_hv_page_fault() to call
 do_numa_page() to do the numa fault check. This incurs the overhead
 when exiting from rmode to vmode.  My idea is that in
 kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
 there is no need to exit to vmode (i.e saving htab, slab switching)

 Can you explain more. Are we looking at hcall from guest  and
 hypervisor handling them in real mode ? If so why would guest issue a
 hcall on a pte entry that have PAGE_NUMA set. Or is this about
 hypervisor handling a missing hpte, because of host swapping this page
 out ? In that case how we end up in h_enter ? IIUC for that case we
 should get to kvmppc_hpte_hv_fault.

 After setting _PAGE_NUMA, we should flush out all hptes both in host's
 htab and guest's. So when guest tries to access memory, host finds
 that there is not hpte ready for guest in guest's htab. And host
 should raise dsi to guest.

 Now guest receive that fault, removes the PAGE_NUMA bit and do an
 hpte_insert. So before we do an hpte_insert (or H_ENTER) we should have
 cleared PAGE_NUMA bit.

This incurs that guest ends up in h_enter.
 And you can see in current code, we also try this quick path firstly.
 Only if fail, we will resort to slow path --  kvmppc_hpte_hv_fault.

 hmm ? hpte_hv_fault is the hypervisor handling the fault.

After our discussion on irc, I think we should also do the fast check in
kvmppc_hpte_hv_fault() for the HPTE_V_ABSENT case,
and let H_ENTER take care of the remaining case, i.e. no hpte after pte_mknuma() has run. Right?

Thanks and regards,
Fan
 -aneesh



Re: [PATCH 0/4] powernv: kvm: numa fault improvement

2014-01-21 Thread Liu ping fan
On Tue, Jan 21, 2014 at 5:07 PM, Liu ping fan kernelf...@gmail.com wrote:
 On Tue, Jan 21, 2014 at 11:40 AM, Aneesh Kumar K.V
 aneesh.ku...@linux.vnet.ibm.com wrote:
 Liu ping fan kernelf...@gmail.com writes:

 On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V
 aneesh.ku...@linux.vnet.ibm.com wrote:
 Liu ping fan kernelf...@gmail.com writes:

 On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote:

 On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote:

 This series is based on Aneesh's series  [PATCH -V2 0/5] powerpc: mm: 
 Numa faults support for ppc64

 For this series, I apply the same idea from the previous thread [PATCH 
 0/3] optimize for powerpc _PAGE_NUMA
 (for which, I still try to get a machine to show nums)

 But for this series, I think that I have a good justification -- the 
 fact of heavy cost when switching context between guest and host,
 which is  well known.

 This cover letter isn't really telling me anything. Please put a proper 
 description of what you're trying to achieve, why you're trying to 
 achieve what you're trying and convince your readers that it's a good 
 idea to do it the way you do it.

 Sorry for the unclear message. After introducing the _PAGE_NUMA,
 kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
 should rely on host's kvmppc_book3s_hv_page_fault() to call
 do_numa_page() to do the numa fault check. This incurs the overhead
 when exiting from rmode to vmode.  My idea is that in
 kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
 there is no need to exit to vmode (i.e saving htab, slab switching)

 Can you explain more. Are we looking at hcall from guest  and
 hypervisor handling them in real mode ? If so why would guest issue a
 hcall on a pte entry that have PAGE_NUMA set. Or is this about
 hypervisor handling a missing hpte, because of host swapping this page
 out ? In that case how we end up in h_enter ? IIUC for that case we
 should get to kvmppc_hpte_hv_fault.

 After setting _PAGE_NUMA, we should flush out all hptes both in host's
 htab and guest's. So when guest tries to access memory, host finds
 that there is not hpte ready for guest in guest's htab. And host
 should raise dsi to guest.

 Now guest receive that fault, removes the PAGE_NUMA bit and do an
 hpte_insert. So before we do an hpte_insert (or H_ENTER) we should have
 cleared PAGE_NUMA bit.

This incurs that guest ends up in h_enter.
 And you can see in current code, we also try this quick path firstly.
 Only if fail, we will resort to slow path --  kvmppc_hpte_hv_fault.

 hmm ? hpte_hv_fault is the hypervisor handling the fault.

 After we discuss in irc. I think we should also do the fast check in
 kvmppc_hpte_hv_fault() for the case of HPTE_V_ABSENT,
 and let H_ENTER take care of the rest case i.e. no hpte when pte_mknuma. 
 Right?

Or we can defer the quick fix out of H_ENTER, let the host fault
again, and do the fix in kvmppc_hpte_hv_fault() instead.

 Thanks and regards,
 Fan
 -aneesh



Re: [PATCH 0/4] powernv: kvm: numa fault improvement

2014-01-20 Thread Liu ping fan
On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V
aneesh.ku...@linux.vnet.ibm.com wrote:
 Liu ping fan kernelf...@gmail.com writes:

 On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote:

 On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote:

 This series is based on Aneesh's series  [PATCH -V2 0/5] powerpc: mm: 
 Numa faults support for ppc64

 For this series, I apply the same idea from the previous thread [PATCH 
 0/3] optimize for powerpc _PAGE_NUMA
 (for which, I still try to get a machine to show nums)

 But for this series, I think that I have a good justification -- the fact 
 of heavy cost when switching context between guest and host,
 which is  well known.

 This cover letter isn't really telling me anything. Please put a proper 
 description of what you're trying to achieve, why you're trying to achieve 
 what you're trying and convince your readers that it's a good idea to do it 
 the way you do it.

 Sorry for the unclear message. After introducing the _PAGE_NUMA,
 kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
 should rely on host's kvmppc_book3s_hv_page_fault() to call
 do_numa_page() to do the numa fault check. This incurs the overhead
 when exiting from rmode to vmode.  My idea is that in
 kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
 there is no need to exit to vmode (i.e saving htab, slab switching)

 Can you explain more. Are we looking at hcall from guest  and
 hypervisor handling them in real mode ? If so why would guest issue a
 hcall on a pte entry that have PAGE_NUMA set. Or is this about
 hypervisor handling a missing hpte, because of host swapping this page
 out ? In that case how we end up in h_enter ? IIUC for that case we
 should get to kvmppc_hpte_hv_fault.

After setting _PAGE_NUMA, we should flush out all hptes, both in the host's
htab and in the guest's. So when the guest tries to access memory, the host
finds there is no hpte ready for the guest in the guest's htab, and the host
raises a DSI to the guest. As a result the guest ends up in h_enter.
And you can see that in the current code we also try this quick path first;
only if it fails do we resort to the slow path -- kvmppc_hpte_hv_fault.

Thanks and regards,
Fan


 If my suppose is correct, will CCing k...@vger.kernel.org from next 
 version.

 This translates to me as This is an RFC?

 Yes, I am not quite sure about it. I have no bare-metal to verify it.
 So I hope at least, from the theory, it is correct.


 -aneesh



Re: [PATCH 0/4] powernv: kvm: numa fault improvement

2014-01-14 Thread Liu ping fan
On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf ag...@suse.de wrote:

 On 11.12.2013, at 09:47, Liu Ping Fan kernelf...@gmail.com wrote:

 This series is based on Aneesh's series  [PATCH -V2 0/5] powerpc: mm: Numa 
 faults support for ppc64

 For this series, I apply the same idea from the previous thread [PATCH 0/3] 
 optimize for powerpc _PAGE_NUMA
 (for which, I still try to get a machine to show nums)

 But for this series, I think that I have a good justification -- the fact of 
 heavy cost when switching context between guest and host,
 which is  well known.

 This cover letter isn't really telling me anything. Please put a proper 
 description of what you're trying to achieve, why you're trying to achieve 
 what you're trying and convince your readers that it's a good idea to do it 
 the way you do it.

Sorry for the unclear message. After introducing _PAGE_NUMA,
kvmppc_do_h_enter() cannot fill in the hpte for the guest. Instead, it
has to rely on the host's kvmppc_book3s_hv_page_fault() calling
do_numa_page() to do the numa fault check. This incurs the overhead of
exiting from rmode to vmode.  My idea is that in
kvmppc_do_h_enter() we do a quick check: if the page is correctly placed,
there is no need to exit to vmode (i.e. we save the htab and SLB switching).

 If my suppose is correct, will CCing k...@vger.kernel.org from next version.

 This translates to me as This is an RFC?

Yes, I am not quite sure about it. I have no bare metal to verify it on,
so I hope that, at least in theory, it is correct.

Thanks and regards,
Ping Fan

 Alex



[PATCH 0/4] powernv: kvm: numa fault improvement

2013-12-11 Thread Liu Ping Fan
This series is based on Aneesh's series  [PATCH -V2 0/5] powerpc: mm: Numa 
faults support for ppc64

For this series, I apply the same idea as in the previous thread "[PATCH 0/3]
optimize for powerpc _PAGE_NUMA"
(for which I am still trying to get a machine to show numbers).

But for this series, I think that I have a good justification -- the fact of
the heavy cost of switching context between guest and host,
which is well known.

If my assumption is correct, I will CC k...@vger.kernel.org from the next version.


Liu Ping Fan (4):
  mm: export numa_migrate_prep()
  powernv: kvm: make _PAGE_NUMA take effect
  powernv: kvm: extend input param for lookup_linux_pte
  powernv: kvm: make the handling of _PAGE_NUMA faster for guest

 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 38 ++---
 include/linux/mm.h  |  2 ++
 2 files changed, 37 insertions(+), 3 deletions(-)

-- 
1.8.1.4



[PATCH 2/4] powernv: kvm: make _PAGE_NUMA take effect

2013-12-11 Thread Liu Ping Fan
To make _PAGE_NUMA take effect, we should force the check when the
guest uses a hypercall to set up an hpte.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..af8602d 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -232,7 +232,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
/* Look up the Linux PTE for the backing page */
pte_size = psize;
pte = lookup_linux_pte(pgdir, hva, writing, &pte_size);
-   if (pte_present(pte)) {
+   if (pte_present(pte) && !pte_numa(pte)) {
if (writing && !pte_write(pte))
/* make the actual HPTE be read-only */
ptel = hpte_make_readonly(ptel);
-- 
1.8.1.4



[PATCH 1/4] mm: export numa_migrate_prep()

2013-12-11 Thread Liu Ping Fan
powerpc will use it in fast path.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/linux/mm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5ab0e22..420fb77 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1092,6 +1092,8 @@ extern unsigned long change_protection(struct 
vm_area_struct *vma, unsigned long
 extern int mprotect_fixup(struct vm_area_struct *vma,
  struct vm_area_struct **pprev, unsigned long start,
  unsigned long end, unsigned long newflags);
+extern int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+   unsigned long addr, int page_nid);
 
 /*
  * doesn't attempt to fault and will return short.
-- 
1.8.1.4



[PATCH 3/4] powernv: kvm: extend input param for lookup_linux_pte

2013-12-11 Thread Liu Ping Fan
It will be helpful for next patch

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
Can it be merged with the next patch?
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index af8602d..ae46052 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -135,7 +135,8 @@ static void remove_revmap_chain(struct kvm *kvm, long 
pte_index,
 }
 
 static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
- int writing, unsigned long *pte_sizep)
+ int writing, unsigned long *pte_sizep,
+ pte_t **ptepp)
 {
pte_t *ptep;
unsigned long ps = *pte_sizep;
@@ -144,6 +145,8 @@ static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long 
hva,
ptep = find_linux_pte_or_hugepte(pgdir, hva, &hugepage_shift);
if (!ptep)
return __pte(0);
+   if (ptepp != NULL)
+   *ptepp = ptep;
if (hugepage_shift)
*pte_sizep = 1ul << hugepage_shift;
else
@@ -231,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 
/* Look up the Linux PTE for the backing page */
pte_size = psize;
-   pte = lookup_linux_pte(pgdir, hva, writing, &pte_size);
+   pte = lookup_linux_pte(pgdir, hva, writing, &pte_size, NULL);
if (pte_present(pte) && !pte_numa(pte)) {
if (writing && !pte_write(pte))
/* make the actual HPTE be read-only */
@@ -671,7 +674,8 @@ long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long 
flags,
memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn);
if (memslot) {
hva = __gfn_to_hva_memslot(memslot, gfn);
-   pte = lookup_linux_pte(pgdir, hva, 1, &psize);
+   pte = lookup_linux_pte(pgdir, hva, 1, &psize,
+   NULL);
if (pte_present(pte) && !pte_write(pte))
r = hpte_make_readonly(r);
}
-- 
1.8.1.4



[PATCH 4/4] powernv: kvm: make the handling of _PAGE_NUMA faster for guest

2013-12-11 Thread Liu Ping Fan
The periodic _PAGE_NUMA check can easily hit a page that is already correctly
placed. For that case, when the guest tries to set up an hpte in real mode,
we try to resolve the numa fault in real mode too, since the switch between
guest context and host context costs too much.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index ae46052..a06b199 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -179,6 +179,11 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long 
flags,
unsigned int writing;
unsigned long mmu_seq;
unsigned long rcbits;
+   struct mm_struct *mm = kvm->mm;
+   struct vm_area_struct *vma;
+   int page_nid, target_nid;
+   struct page *test_page;
+   pte_t *ptep;
 
psize = hpte_page_size(pteh, ptel);
if (!psize)
@@ -234,8 +239,26 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long 
flags,
 
/* Look up the Linux PTE for the backing page */
pte_size = psize;
-   pte = lookup_linux_pte(pgdir, hva, writing, &pte_size, NULL);
-   if (pte_present(pte) && !pte_numa(pte)) {
+   pte = lookup_linux_pte(pgdir, hva, writing, &pte_size, &ptep);
+   if (pte_present(pte)) {
+   if (pte_numa(pte)) {
+   /* If fail, let gup handle it */
+   if (unlikely(!down_read_trylock(&mm->mmap_sem)))
+   goto pte_check;
+
+   vma = find_vma(mm, hva);
+   up_read(&mm->mmap_sem);
+   test_page = pte_page(pte);
+   page_nid = page_to_nid(test_page);
+   target_nid = numa_migrate_prep(test_page, vma,
+hva, page_nid);
+   put_page(test_page);
+   if (unlikely(target_nid != -1)) {
+   /* If fail, let gup handle it */
+   goto pte_check;
+   }
+   }
+
if (writing && !pte_write(pte))
/* make the actual HPTE be read-only */
ptel = hpte_make_readonly(ptel);
@@ -244,6 +267,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
}
}
 
+pte_check:
if (pte_size < psize)
return H_PARAMETER;
if (pa && pte_size > psize)
@@ -339,6 +363,10 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long 
flags,
pteh &= ~HPTE_V_VALID;
unlock_rmap(rmap);
} else {
+   if (pte_numa(pte) && pa) {
+   pte = pte_mknonnuma(pte);
+   *ptep = pte;
+   }
kvmppc_add_revmap_chain(kvm, rev, rmap, pte_index,
realmode);
/* Only set R/C in real HPTE if already set in *rmap */
-- 
1.8.1.4



Re: [PATCH 1/3] powerpc: mm: make _PAGE_NUMA take effect

2013-12-11 Thread Liu ping fan
On Thu, Dec 5, 2013 at 6:53 PM, Aneesh Kumar K.V
aneesh.ku...@linux.vnet.ibm.com wrote:
 Liu Ping Fan kernelf...@gmail.com writes:

 To enable the do_numa_page(), we should not fix _PAGE_NUMA in
 hash_page(), so bail out for the case of pte_numa().

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  arch/powerpc/mm/hash_utils_64.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/arch/powerpc/mm/hash_utils_64.c 
 b/arch/powerpc/mm/hash_utils_64.c
 index fb176e9..9bf1195 100644
 --- a/arch/powerpc/mm/hash_utils_64.c
 +++ b/arch/powerpc/mm/hash_utils_64.c
 @@ -1033,7 +1033,7 @@ int hash_page(unsigned long ea, unsigned long access, 
 unsigned long trap)

   /* Get PTE and page size from page tables */
   ptep = find_linux_pte_or_hugepte(pgdir, ea, hugeshift);
 - if (ptep == NULL || !pte_present(*ptep)) {
 + if (ptep == NULL || !pte_present(*ptep) || pte_numa(*ptep)) {
   DBG_LOW( no PTE !\n);
   rc = 1;
   goto bail;

 why ? , All the hash routines do check for _PAGE_PRESENT via access
 variable.

Going through __hash_page_4K (4K pages on 4K HW), I do not find such a check.
Am I wrong? Otherwise I will send out a patch to fix that.

Thanks and regards,
Pingfan
 -aneesh



Re: [PATCH 1/3] powerpc: mm: make _PAGE_NUMA take effect

2013-12-11 Thread Liu ping fan
On Wed, Dec 11, 2013 at 5:50 PM, Benjamin Herrenschmidt
b...@kernel.crashing.org wrote:
 On Wed, 2013-12-11 at 16:50 +0800, Liu ping fan wrote:
  why ? , All the hash routines do check for _PAGE_PRESENT via access
  variable.
 
 Going through __hash_page_4K(4k on 4k HW), I do not find such check.
 Am I wrong? Or I will send out a patch to fix that.

 We pass a bitmask of flags to check which are tested by doing an andc
 of the PTE on that mask and checking if anything is left...
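
In C terms, the check being described amounts to roughly the following (an
illustration only -- the real test is done with andc in the __hash_page_*
assembly helpers, not in C):

	/* 'access' carries the permission bits the faulting access requires;
	 * hash_page() has already OR'ed _PAGE_PRESENT into it. */
	if (access & ~pte_val(*ptep))
		return 1;	/* a required bit is missing: bail to do_page_fault() */

Since pte_mknuma() clears _PAGE_PRESENT (at least in the generic
implementation), a pte_numa() PTE already fails this test.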

Oh, I see it, thank you very much.

Regards,
Pingfan


Re: [PATCH 1/3] powerpc: mm: make _PAGE_NUMA take effect

2013-12-08 Thread Liu ping fan
On Mon, Dec 9, 2013 at 8:31 AM, Benjamin Herrenschmidt
b...@kernel.crashing.org wrote:
 On Thu, 2013-12-05 at 16:23 +0530, Aneesh Kumar K.V wrote:
 Liu Ping Fan kernelf...@gmail.com writes:

  To enable the do_numa_page(), we should not fix _PAGE_NUMA in
  hash_page(), so bail out for the case of pte_numa().

 For some reason I don't have 2/3 and 3/3 in my mbox (though I do have
 them on patchwork) so I'll reply to this one.

 Overall, your statement that this is a faster path needs to be backed up
 with numbers.

 The code is complicated enough as it-is, such additional mess in the low
 level hashing code requires a good justification, and also a
 demonstration that it doesn't add overhead to the normal hash path.

For the test, is it OK to have a user application copy pages where
all the pages are PG_mlocked?

Thanks and regards,
Pingfan


Re: [PATCH 3/3] powerpc: mm: optimize for the correctly placed page

2013-12-05 Thread Liu ping fan
On Thu, Dec 5, 2013 at 6:58 PM, Aneesh Kumar K.V
aneesh.ku...@linux.vnet.ibm.com wrote:
 Liu Ping Fan kernelf...@gmail.com writes:

 The period check of _PAGE_NUMA can probably trigger the check on
 the correctly placed page. For this case, we can just insert hpte and
 do fast exception return.

 I still don't understand why we need to handle numa faults in hash
 page ? Are you trying to optimize the code path ? If so can you explain
 the benefits ? Some numbers showing it is helping  ?

When returning from hash_page(), we take fast_exc_return_irq, while
from do_page_fault() we take ret_from_except.
As the "fast" name implies, ret_from_except has more complicated logic to
sync the interrupt state, and that costs a lot.
Do you think so?


 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  arch/powerpc/mm/hash_utils_64.c | 34 +-
  1 file changed, 33 insertions(+), 1 deletion(-)

 diff --git a/arch/powerpc/mm/hash_utils_64.c 
 b/arch/powerpc/mm/hash_utils_64.c
 index 9bf1195..735678c 100644
 --- a/arch/powerpc/mm/hash_utils_64.c
 +++ b/arch/powerpc/mm/hash_utils_64.c
 @@ -965,6 +965,10 @@ int hash_page(unsigned long ea, unsigned long access, 
 unsigned long trap)
   const struct cpumask *tmp;
   int rc, user_region = 0, local = 0;
   int psize, ssize;
 + pte_t old, new;
 + struct vm_area_struct *vma;
 + int page_nid, target_nid;
 + struct page *test_page;

   DBG_LOW(hash_page(ea=%016lx, access=%lx, trap=%lx\n,
   ea, access, trap);
 @@ -1033,12 +1037,40 @@ int hash_page(unsigned long ea, unsigned long 
 access, unsigned long trap)

   /* Get PTE and page size from page tables */
   ptep = find_linux_pte_or_hugepte(pgdir, ea, hugeshift);
 - if (ptep == NULL || !pte_present(*ptep) || pte_numa(*ptep)) {
 + if (ptep == NULL || !pte_present(*ptep)) {
   DBG_LOW( no PTE !\n);
   rc = 1;
   goto bail;
   }

 + old = pte_val(*ptep);
 + if (pte_numa(old)) {
 + /* If fail to lock, let do_page_fault() to handle it */
 + if (down_read_trylock(mm-mmap_sem)) {

 hmm is that something we want to do in hash_page ?

Yes, the function has no relation to hashing. But I think it depends on
whether it is worth optimizing or not.

Thanks and regards,
Pingfan
 + vma = find_vma(mm, ea);
 + up_read(mm-mmap_sem);
 + test_page = pte_page(old);
 + page_nid = page_to_nid(test_page);
 + target_nid = numa_migrate_prep(test_page, vma, ea,
 + page_nid);
 + if (target_nid  0) {
 + new = pte_mknonnuma(old);
 + /* If ptep is modified under us,
 +  * just retry the access
 +  */
 + if (unlikely(cmpxchg(ptep, old, new) != old)) {
 + put_page(test_page);
 + return 0;
 + }
 + put_page(test_page);
 + }
 + } else {
 + put_page(test_page);
 + rc = 1;
 + goto bail;
 + }
 + }
 +
   /* Add _PAGE_PRESENT to the required access perm */
   access |= _PAGE_PRESENT;


 -aneesh



[PATCH 0/3] optimize for powerpc _PAGE_NUMA

2013-12-03 Thread Liu Ping Fan
I saw that Aneesh had implemented NUMA faults on ppc64, and most of those
patches already have Reviewed-by tags.
So I just diffed mine against his to form these trivial modifications.

Based on Aneesh's series  [PATCH -V2 0/5] powerpc: mm: Numa faults support for 
ppc64

Liu Ping Fan (3):
  powerpc: mm: make _PAGE_NUMA take effect
  mm: export numa_migrate_prep()
  powerpc: mm: optimize for the correctly placed page

 arch/powerpc/mm/hash_utils_64.c | 32 
 include/linux/mm.h  |  2 ++
 2 files changed, 34 insertions(+)

-- 
1.8.1.4



[PATCH 2/3] mm: export numa_migrate_prep()

2013-12-03 Thread Liu Ping Fan
powerpc will use it in fast path.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/linux/mm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5ab0e22..420fb77 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1092,6 +1092,8 @@ extern unsigned long change_protection(struct 
vm_area_struct *vma, unsigned long
 extern int mprotect_fixup(struct vm_area_struct *vma,
  struct vm_area_struct **pprev, unsigned long start,
  unsigned long end, unsigned long newflags);
+extern int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+   unsigned long addr, int page_nid);
 
 /*
  * doesn't attempt to fault and will return short.
-- 
1.8.1.4



[PATCH 1/3] powerpc: mm: make _PAGE_NUMA take effect

2013-12-03 Thread Liu Ping Fan
To enable the do_numa_page(), we should not fix _PAGE_NUMA in
hash_page(), so bail out for the case of pte_numa().

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/mm/hash_utils_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index fb176e9..9bf1195 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1033,7 +1033,7 @@ int hash_page(unsigned long ea, unsigned long access, 
unsigned long trap)
 
/* Get PTE and page size from page tables */
ptep = find_linux_pte_or_hugepte(pgdir, ea, &hugeshift);
-   if (ptep == NULL || !pte_present(*ptep)) {
+   if (ptep == NULL || !pte_present(*ptep) || pte_numa(*ptep)) {
DBG_LOW(" no PTE !\n");
rc = 1;
goto bail;
-- 
1.8.1.4



[PATCH 3/3] powerpc: mm: optimize for the correctly placed page

2013-12-03 Thread Liu Ping Fan
The periodic _PAGE_NUMA check can easily trigger on a page that is already
correctly placed. For that case, we can just insert the hpte and
do a fast exception return.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/mm/hash_utils_64.c | 34 +-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 9bf1195..735678c 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -965,6 +965,10 @@ int hash_page(unsigned long ea, unsigned long access, 
unsigned long trap)
const struct cpumask *tmp;
int rc, user_region = 0, local = 0;
int psize, ssize;
+   pte_t old, new;
+   struct vm_area_struct *vma;
+   int page_nid, target_nid;
+   struct page *test_page;
 
DBG_LOW("hash_page(ea=%016lx, access=%lx, trap=%lx\n",
ea, access, trap);
@@ -1033,12 +1037,40 @@ int hash_page(unsigned long ea, unsigned long access, 
unsigned long trap)
 
/* Get PTE and page size from page tables */
ptep = find_linux_pte_or_hugepte(pgdir, ea, &hugeshift);
-   if (ptep == NULL || !pte_present(*ptep) || pte_numa(*ptep)) {
+   if (ptep == NULL || !pte_present(*ptep)) {
DBG_LOW(" no PTE !\n");
rc = 1;
goto bail;
}
 
+   old = pte_val(*ptep);
+   if (pte_numa(old)) {
+   /* If fail to lock, let do_page_fault() to handle it */
+   if (down_read_trylock(&mm->mmap_sem)) {
+   vma = find_vma(mm, ea);
+   up_read(&mm->mmap_sem);
+   test_page = pte_page(old);
+   page_nid = page_to_nid(test_page);
+   target_nid = numa_migrate_prep(test_page, vma, ea,
+   page_nid);
+   if (target_nid < 0) {
+   new = pte_mknonnuma(old);
+   /* If ptep is modified under us,
+* just retry the access
+*/
+   if (unlikely(cmpxchg(ptep, old, new) != old)) {
+   put_page(test_page);
+   return 0;
+   }
+   put_page(test_page);
+   }
+   } else {
+   put_page(test_page);
+   rc = 1;
+   goto bail;
+   }
+   }
+
/* Add _PAGE_PRESENT to the required access perm */
access |= _PAGE_PRESENT;
 
-- 
1.8.1.4



[PATCH 1/2] powerpc: mm: use macro PGTABLE_EADDR_SIZE instead of digital

2013-11-20 Thread Liu Ping Fan
In case the eaddr range is extended in the future, use the macro
PGTABLE_EADDR_SIZE instead of a literal constant to ease maintenance of the code.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/mm/slb_low.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/slb_low.S b/arch/powerpc/mm/slb_low.S
index 17aa6df..e0b3cf4 100644
--- a/arch/powerpc/mm/slb_low.S
+++ b/arch/powerpc/mm/slb_low.S
@@ -35,7 +35,7 @@ _GLOBAL(slb_allocate_realmode)
 * check for bad kernel/user address
 * (ea & ~REGION_MASK) >= PGTABLE_RANGE
 */
-   rldicr. r9,r3,4,(63 - 46 - 4)
+   rldicr. r9,r3,4,(63 - PGTABLE_EADDR_SIZE - 4)
bne-8f
 
srdi	r9,r3,60		/* get region */
-- 
1.8.1.4
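
For readers decoding the masked-rotate test touched here: the rldicr. is the
real-mode equivalent of the C-style check in the comment above it, roughly
the following (an illustration only, taking REGION_MASK as 0xfUL << 60 and
PGTABLE_RANGE as 1UL << PGTABLE_EADDR_SIZE, with PGTABLE_EADDR_SIZE
currently 46):

	static inline int bad_kernel_user_address(unsigned long ea)
	{
		/* bad if any bit between the 4 region bits and bit 46 is set */
		return (ea & ~(0xfUL << 60)) >= (1UL << 46);
	}

The rotate-left-by-4 moves the 4 region bits down to the bottom (where the
mask clears them), and the rldicr. mask keeps exactly EA bits 46..59, so CR0
is non-zero -- and we branch to the failure path -- iff the address is out of
range.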



[PATCH 2/2] powerpc: mm: change pgtable index size for 64K page

2013-11-20 Thread Liu Ping Fan
With 64K pages, we waste half of the pte_t page. With this patch, after
changing PGD_INDEX_SIZE from 12 to 11 and PTE_INDEX_SIZE from 8 to 9,
we improve the usage of the pte_t page and shrink the contiguous physical
size needed for the pgd_t table.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/pgtable-ppc64-64k.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64-64k.h 
b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
index a56b82f..f6955ff 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64-64k.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
@@ -4,10 +4,10 @@
#include <asm-generic/pgtable-nopud.h>
 
 
-#define PTE_INDEX_SIZE  8
+#define PTE_INDEX_SIZE  9
 #define PMD_INDEX_SIZE  10
 #define PUD_INDEX_SIZE 0
-#define PGD_INDEX_SIZE  12
+#define PGD_INDEX_SIZE  11
 
 #ifndef __ASSEMBLY__
#define PTE_TABLE_SIZE (sizeof(real_pte_t) << PTE_INDEX_SIZE)
-- 
1.8.1.4
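
As a rough cross-check of the arithmetic in the changelog above, a standalone
sketch (the 16-byte real_pte_t and 8-byte pgd_t sizes are assumptions about
this 64K configuration, not taken from the patch):

	#include <stdio.h>

	int main(void)
	{
		int page_shift = 16;                    /* 64K pages */
		int pte_old = 8, pte_new = 9;           /* PTE_INDEX_SIZE */
		int pmd = 10, pud = 0;                  /* PMD/PUD_INDEX_SIZE */
		int pgd_old = 12, pgd_new = 11;         /* PGD_INDEX_SIZE */

		printf("old: PTE table %lu bytes, PGD table %lu bytes, %d-bit EA\n",
		       16UL << pte_old, 8UL << pgd_old,
		       page_shift + pte_old + pmd + pud + pgd_old);
		printf("new: PTE table %lu bytes, PGD table %lu bytes, %d-bit EA\n",
		       16UL << pte_new, 8UL << pgd_new,
		       page_shift + pte_new + pmd + pud + pgd_new);
		return 0;
	}

It prints a 4K -> 8K PTE table and a 32K -> 16K PGD table, with the total
addressable range staying at 46 bits either way, which matches the
PGTABLE_EADDR_SIZE used by the companion slb_low.S patch.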



Re: [PATCH RESEND v4] powerpc: kvm: fix rare but potential deadlock scene

2013-11-19 Thread Liu ping fan
On Tue, Nov 19, 2013 at 6:39 PM, Alexander Graf ag...@suse.de wrote:

 On 19.11.2013, at 07:12, Liu Ping Fan kernelf...@gmail.com wrote:

 Since kvmppc_hv_find_lock_hpte() is called from both virtmode and
 realmode, so it can trigger the deadlock.

 Suppose the following scene:

 Two physical cpuM, cpuN, two VM instances A, B, each VM has a group of
 vcpus.

 If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK), then is switched
 out, and on cpuN, vcpu_A_2 try to lock X in realmode, then cpuN will be
 caught in realmode for a long time.

 What makes things even worse if the following happens,
  On cpuM, bitlockX is hold, on cpuN, Y is hold.
  vcpu_B_2 try to lock Y on cpuM in realmode
  vcpu_A_2 try to lock X on cpuN in realmode

 Oops! deadlock happens

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Any particular reason for the resend? The patch is already applied, no?

Oh, it seems that I misunderstood your meaning. You said "Actually, I've
changed my mind and moved the patch to the for-3.13 branch instead.
Please make sure to CC kvm@vger on all patches you submit though." So
I think it is necessary to resend with kvm@vger on CC.

Regards,
Pingfan


[PATCH RESEND v4] powerpc: kvm: fix rare but potential deadlock scene

2013-11-18 Thread Liu Ping Fan
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and
realmode, it can trigger a deadlock.

Suppose the following scenario:

Two physical cpus cpuM and cpuN, two VM instances A and B, each VM with a
group of vcpus.

If, on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched
out, and on cpuN, vcpu_A_2 tries to lock X in realmode, then cpuN will be
stuck in realmode for a long time.

What makes things even worse is if the following happens:
  On cpuM, bitlock X is held; on cpuN, Y is held.
  vcpu_B_2 tries to lock Y on cpuM in realmode
  vcpu_A_2 tries to lock X on cpuN in realmode

Oops! deadlock happens

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 4 
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 842f081..abf81fe 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -473,11 +473,14 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
slb_v = vcpu->kvm->arch.vrma_slb_v;
}
 
+   preempt_disable();
/* Find the HPTE in the hash table */
index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
 HPTE_V_VALID | HPTE_V_ABSENT);
-   if (index < 0)
+   if (index < 0) {
+   preempt_enable();
return -ENOENT;
+   }
hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
v = hptep[0] & ~HPTE_V_HVLOCK;
gr = kvm->arch.revmap[index].guest_rpte;
@@ -485,6 +488,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
/* Unlock the HPTE */
asm volatile("lwsync" : : : "memory");
hptep[0] = v;
+   preempt_enable();
 
gpte->eaddr = eaddr;
gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..ea17b30 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -749,6 +749,10 @@ static int slb_base_page_shift[4] = {
20, /* 1M, unsupported */
 };
 
+/* When called from virtmode, this func should be protected by
+ * preempt_disable(), otherwise, the holding of HPTE_V_HVLOCK
+ * can trigger deadlock issue.
+ */
 long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long 
slb_v,
  unsigned long valid)
 {
-- 
1.8.1.4



[PATCH v3] powerpc: kvm: optimize sc 1 as fast return

2013-11-18 Thread Liu Ping Fan
In some scenarios, e.g. openstack CI, a PR guest can trigger sc 1 frequently.
This patch optimizes that path by directly delivering BOOK3S_INTERRUPT_SYSCALL
to the HV guest, so powernv can return to the HV guest without a heavy exit, i.e.
no need to swap the TLB, HTAB, etc.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
v3: add some document
---
 arch/powerpc/kvm/book3s_hv.c| 10 --
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 19 ++-
 2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 62a2b5a..1addb1a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -628,12 +628,10 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
/* hcall - punt to userspace */
int i;
 
-   if (vcpu->arch.shregs.msr & MSR_PR) {
-   /* sc 1 from userspace - reflect to guest syscall */
-   kvmppc_book3s_queue_irqprio(vcpu, 
BOOK3S_INTERRUPT_SYSCALL);
-   r = RESUME_GUEST;
-   break;
-   }
+   /* hypercall with MSR_PR has already been handled in rmode,
+* and never reaches here.
+*/
+
run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3);
for (i = 0; i < 9; ++i)
run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c71103b..eea7ca7 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -534,6 +534,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
 5: mtspr   SPRN_SRR0, r6
mtspr   SPRN_SRR1, r7
 
+/*
+ * Required state:
+ * R4 = vcpu
+ * R10: value for HSRR0
+ * R11: value for HSRR1
+ * R13 = PACA
+ */
 fast_guest_return:
li  r0,0
stb r0,VCPU_CEDED(r4)   /* cancel cede */
@@ -1388,7 +1395,8 @@ kvmppc_hisi:
 hcall_try_real_mode:
ld  r3,VCPU_GPR(R3)(r9)
andi.   r0,r11,MSR_PR
-   bne guest_exit_cont
+   /* sc 1 from userspace - reflect to guest syscall */
+   bne sc_1_fast_return
clrrdi  r3,r3,2
cmpldi  r3,hcall_real_table_end - hcall_real_table
bge guest_exit_cont
@@ -1409,6 +1417,15 @@ hcall_try_real_mode:
ld  r11,VCPU_MSR(r4)
b   fast_guest_return
 
+sc_1_fast_return:
+   mtspr   SPRN_SRR0,r10
+   mtspr   SPRN_SRR1,r11
+   li  r10, BOOK3S_INTERRUPT_SYSCALL
+   li  r11, (MSR_ME << 1) | 1  /* synthesize MSR_SF | MSR_ME */
+   rotldi  r11, r11, 63
+   mr  r4,r9
+   b   fast_guest_return
+
/* We've attempted a real mode hcall, but it's punted it back
 * to userspace.  We need to restore some clobbered volatiles
 * before resuming the pass-it-to-qemu path */
-- 
1.8.1.4
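
For anyone puzzling over the last two instructions of sc_1_fast_return (the
MSR_SF_LG = 63 bit-numbering question also comes up elsewhere in this
thread), a worked expansion, assuming the kernel's Linux-style bit numbers
MSR_SF_LG = 63 and MSR_ME_LG = 12:

	#include <stdio.h>

	int main(void)
	{
		unsigned long v = (1UL << 13) | 1UL;	/* li r11,(MSR_ME << 1) | 1 */
		v = (v << 63) | (v >> 1);		/* rotldi r11,r11,63 */
		printf("0x%016lx\n", v);		/* 0x8000000000001000 = MSR_SF | MSR_ME */
		return 0;
	}

Linux numbers MSR bits from the least-significant end, while the ISA numbers
them from the most-significant end, so ISA bit 0 (SF) is bit 63 here. The
rotate by 63 moves bit 0 up to bit 63 (SF) and bit 13 down to bit 12 (ME),
which is how the pair of instructions synthesizes MSR_SF | MSR_ME without
loading a 64-bit immediate.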



Re: [PATCH v2] powerpc: kvm: optimize sc 1 as fast return

2013-11-17 Thread liu ping fan
On Sat, Nov 16, 2013 at 3:00 PM, Paul Mackerras pau...@samba.org wrote:
 On Fri, Nov 15, 2013 at 04:35:01PM +0800, Liu Ping Fan wrote:

 +sc_1_fast_return:
 + mtspr   SPRN_SRR0,r10
 + mtspr   SPRN_SRR1,r11
 + li  r10, BOOK3S_INTERRUPT_SYSCALL
 + li  r11, (MSR_ME  1) | 1  /* synthesize MSR_SF | MSR_ME */
 + rotldi  r11, r11, 63

 You need a mr r4, r9 instruction here, because fast_guest_return
 needs the vcpu pointer in r4.  Apart from that this looks fine.

Will fix it.

Thanks and regards,
Pingfan
 + b   fast_guest_return

 Paul.


[PATCH v3] powerpc: kvm: optimize sc 1 as fast return

2013-11-17 Thread Liu Ping Fan
In some scenarios, e.g. openstack CI, a PR guest can trigger sc 1 frequently.
This patch optimizes that path by directly delivering BOOK3S_INTERRUPT_SYSCALL
to the HV guest, so powernv can return to the HV guest without a heavy exit, i.e.
no need to swap the TLB, HTAB, etc.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv.c|  6 --
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 12 +++-
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 62a2b5a..73dc852 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
/* hcall - punt to userspace */
int i;
 
-   if (vcpu->arch.shregs.msr & MSR_PR) {
-   /* sc 1 from userspace - reflect to guest syscall */
-   kvmppc_book3s_queue_irqprio(vcpu, 
BOOK3S_INTERRUPT_SYSCALL);
-   r = RESUME_GUEST;
-   break;
-   }
run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3);
for (i = 0; i < 9; ++i)
run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c71103b..0d1e2c2 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1388,7 +1388,8 @@ kvmppc_hisi:
 hcall_try_real_mode:
ld  r3,VCPU_GPR(R3)(r9)
andi.   r0,r11,MSR_PR
-   bne guest_exit_cont
+   /* sc 1 from userspace - reflect to guest syscall */
+   bne sc_1_fast_return
clrrdi  r3,r3,2
cmpldi  r3,hcall_real_table_end - hcall_real_table
bge guest_exit_cont
@@ -1409,6 +1410,15 @@ hcall_try_real_mode:
ld  r11,VCPU_MSR(r4)
b   fast_guest_return
 
+sc_1_fast_return:
+   mtspr   SPRN_SRR0,r10
+   mtspr   SPRN_SRR1,r11
+   li  r10, BOOK3S_INTERRUPT_SYSCALL
+   li  r11, (MSR_ME << 1) | 1  /* synthesize MSR_SF | MSR_ME */
+   rotldi  r11, r11, 63
+   mr  r4,r9
+   b   fast_guest_return
+
/* We've attempted a real mode hcall, but it's punted it back
 * to userspace.  We need to restore some clobbered volatiles
 * before resuming the pass-it-to-qemu path */
-- 
1.8.1.4



[PATCH v4] powerpc: kvm: fix rare but potential deadlock scene

2013-11-15 Thread Liu Ping Fan
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and
realmode, it can trigger a deadlock.

Suppose the following scenario:

Two physical cpus cpuM and cpuN, two VM instances A and B, each VM with a
group of vcpus.

If, on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched
out, and on cpuN, vcpu_A_2 tries to lock X in realmode, then cpuN will be
stuck in realmode for a long time.

What makes things even worse is if the following happens:
  On cpuM, bitlock X is held; on cpuN, Y is held.
  vcpu_B_2 tries to lock Y on cpuM in realmode
  vcpu_A_2 tries to lock X on cpuN in realmode

Oops! deadlock happens

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
v4: remove the over-engineered part and keep it simple, also add some notes.
  
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 4 
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 842f081..abf81fe 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -473,11 +473,14 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
slb_v = vcpu->kvm->arch.vrma_slb_v;
}
 
+   preempt_disable();
/* Find the HPTE in the hash table */
index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
 HPTE_V_VALID | HPTE_V_ABSENT);
-   if (index < 0)
+   if (index < 0) {
+   preempt_enable();
return -ENOENT;
+   }
hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
v = hptep[0] & ~HPTE_V_HVLOCK;
gr = kvm->arch.revmap[index].guest_rpte;
@@ -485,6 +488,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
/* Unlock the HPTE */
asm volatile("lwsync" : : : "memory");
hptep[0] = v;
+   preempt_enable();
 
gpte->eaddr = eaddr;
gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..ea17b30 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -749,6 +749,10 @@ static int slb_base_page_shift[4] = {
20, /* 1M, unsupported */
 };
 
+/* When called from virtmode, this func should be protected by
+ * preempt_disable(), otherwise, the holding of HPTE_V_HVLOCK
+ * can trigger deadlock issue.
+ */
 long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long 
slb_v,
  unsigned long valid)
 {
-- 
1.8.1.4



[PATCH v2] powerpc: kvm: optimize sc 1 as fast return

2013-11-15 Thread Liu Ping Fan
In some scenarios, e.g. openstack CI, a PR guest can trigger sc 1 frequently.
This patch optimizes that path by directly delivering BOOK3S_INTERRUPT_SYSCALL
to the HV guest, so powernv can return to the HV guest without a heavy exit, i.e.
no need to swap the TLB, HTAB, etc.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv.c|  6 --
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 ++-
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 62a2b5a..73dc852 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
/* hcall - punt to userspace */
int i;
 
-   if (vcpu->arch.shregs.msr & MSR_PR) {
-   /* sc 1 from userspace - reflect to guest syscall */
-   kvmppc_book3s_queue_irqprio(vcpu, 
BOOK3S_INTERRUPT_SYSCALL);
-   r = RESUME_GUEST;
-   break;
-   }
run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3);
for (i = 0; i < 9; ++i)
run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c71103b..a463f08 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1388,7 +1388,8 @@ kvmppc_hisi:
 hcall_try_real_mode:
ld  r3,VCPU_GPR(R3)(r9)
andi.   r0,r11,MSR_PR
-   bne guest_exit_cont
+   /* sc 1 from userspace - reflect to guest syscall */
+   bne sc_1_fast_return
clrrdi  r3,r3,2
cmpldi  r3,hcall_real_table_end - hcall_real_table
bge guest_exit_cont
@@ -1409,6 +1410,14 @@ hcall_try_real_mode:
ld  r11,VCPU_MSR(r4)
b   fast_guest_return
 
+sc_1_fast_return:
+   mtspr   SPRN_SRR0,r10
+   mtspr   SPRN_SRR1,r11
+   li  r10, BOOK3S_INTERRUPT_SYSCALL
+   li  r11, (MSR_ME << 1) | 1  /* synthesize MSR_SF | MSR_ME */
+   rotldi  r11, r11, 63
+   b   fast_guest_return
+
/* We've attempted a real mode hcall, but it's punted it back
 * to userspace.  We need to restore some clobbered volatiles
 * before resuming the pass-it-to-qemu path */
-- 
1.8.1.4



Re: [PATCH] powerpc: kvm: optimize sc 0 as fast return

2013-11-10 Thread Liu ping fan
On Fri, Nov 8, 2013 at 7:12 PM, Paul Mackerras pau...@samba.org wrote:
 On Fri, Nov 08, 2013 at 10:44:16AM +0800, Liu Ping Fan wrote:
 syscall is a very common behavior inside guest, and this patch
 optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL,
 so hypervisor can return to guest without heavy exit, i.e, no need
 to swap TLB, HTAB,.. etc

 Many interrupts that are caused by guest code go directly to the guest
 and don't come to the hypervisor at all.  That includes system call
 (sc 0), alignment interrupts, program interrupts, SLB miss interrupts,
 etc.  See section 6.5 of Book 3S of the Power ISA specification; all
 the interrupts with '-' in the 'HV' column of the table there get
 delivered directly to the guest when they occur inside a guest.

Oh, got it, thanks! That is an important thing I had tried to find out but
missed all these days.

 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 @@ -1388,7 +1388,8 @@ kvmppc_hisi:
  hcall_try_real_mode:
   ld  r3,VCPU_GPR(R3)(r9)
   andi.   r0,r11,MSR_PR
 - bne guest_exit_cont
 + /* sc 1 from userspace - reflect to guest syscall */
 + bne sc_0_fast_return

 Discrepancy between comment and code here.  In fact we would only take
 the branch for a sc 1 instruction in userspace, which occurs when a PR
 KVM guest nested inside a HV KVM guest does a hypercall (i.e., not for

I made a big mistake from the beginning, and now have a clearer
understanding of the scenario. Thanks!

 normal system calls).  It is probably worthwhile to speed those up.

 +sc_0_fast_return:
 + ld  r10,VCPU_PC(r9)
 + ld  r11,VCPU_MSR(r9)

 r11 must already contain this since you just did andi. r0,r11,MSR_PR.
 In fact r10 already contains VCPU_PC(r9) at this point also, though
 that is not so obvious.

 + mtspr   SPRN_SRR0,r10
 + mtspr   SPRN_SRR1,r11
 + li  r10, BOOK3S_INTERRUPT_SYSCALL
 + LOAD_REG_IMMEDIATE(r3,0x87a0)   /* zero 33:36,42:47 */
 + and r11,r11,r3

 This is not correct, since you don't even clear PR.  In fact what you

Yes.
 need is to load up MSR_SF | MSR_ME, though that value changes with

Is it enough to only set MSR_SF | MSR_ME? Does the HV guest (PR KVM)
need to fake the MSR, so that the PR guest feels that sc 1 is trapped by PR
KVM directly? I.e., according to ISA Figure 51, "MSR setting due to
interrupt", for System Call we need to keep MSR_IR/_DR unchanged.
If that is true, then HV needs to help the HV guest make this fake. Right?

 little-endian support and changes again with transactional memory
 support for POWER8.  There is an idiom for loading that MSR value,
 which is:

 li  r11, (MSR_ME  1) | 1  /* synthesize MSR_SF | MSR_ME */
 rotldi  r11, r11, 63

Why do we define MSR_SF_LG as bit 63, rather than bit 0 as the ISA numbers SF?
And could you briefly enlighten me on how the value is affected
when LE and transactional memory support are introduced?

Thanks and best regards,
Pingfan

 which you could use for now, but it will need to be changed when
 Anton's LE patch gets accepted.

 Paul.


Re: [PATCH v3 0/2] powerpc kvm: fix deadlock scene

2013-11-10 Thread Liu ping fan
On Fri, Nov 8, 2013 at 6:58 PM, Paul Mackerras pau...@samba.org wrote:
 On Fri, Nov 08, 2013 at 03:29:52PM +0800, Liu Ping Fan wrote:
 v2-v3:
   introduce kvmppc_hv_unlock_hpte() to pair with kvmppc_hv_find_lock_hpte()
   and hide the preemption detail inside this pair from the callers

 Actually, I preferred v2.  This version seems a bit over-engineered.

 Making a kvmppc_hv_unlock_hpte() is not such a bad idea, though I
 would make it identical to the existing unlock_hpte() from

Do you think it is helpful to distinguish HPTE_V_LOCK from
HPTE_V_HVLOCK at an API level?
If it is, I will keep patch 1/2 and just fix patch 2/2.

 book3s_hv_rm_mmu.c, just in a header.  I'm really not convinced about
 putting the preempt_disable/enable inside the lock/unlock functions,
 with the consequent need to pass in a 'vmode' parameter, given that
 there is just one caller that needs to do the preempt_disable/enable.

Ok, will fix patch 2/2

Thanks and regards,
Pingfan


Re: [PATCH] powerpc: kvm: optimize sc 0 as fast return

2013-11-08 Thread Liu ping fan
On Fri, Nov 8, 2013 at 12:19 PM, Liu ping fan kernelf...@gmail.com wrote:
 On Fri, Nov 8, 2013 at 11:10 AM, Alexander Graf ag...@suse.de wrote:

 On 08.11.2013, at 03:44, Liu Ping Fan kernelf...@gmail.com wrote:

 syscall is a very common behavior inside guest, and this patch
 optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL,
 so hypervisor can return to guest without heavy exit, i.e, no need
 to swap TLB, HTAB,.. etc

 The syscall exit you touch here only happens when you do an sc  0 with 
 MSR_PR set inside the guest. The only case you realistically see this is 
 when you run PR KVM inside of an HV KVM guest.

 Maybe I misunderstood the ISA spec, but refer for 6.5.14 System Call
 Interrupt, no description about the MSR_PR when sc trigger a syscall
 interrupt. So I think, guest application sc 0 will also fall to the
 kernel who owns hypervisor mode.  Am I right?

A further comment: I think the essence of the problem is whether
we switch the RMA from guest to HV when an interrupt is raised.
DSI/ISI will be redirected to HDSI with an RMA switch.  But what about
SYSCALL, DEC, external interrupts, etc.?

 I don't think we should optimize for that case. Instead, we should rather 
 try to not bounce to the 1st hypervisor in the first place in that scenario 
 :).

 Sorry, but just want to make clear about the idiom:  0 - kernel run
 with NV, and 1st - kernel run on HV-KVM and provide PR-KVM to up
 layer? Right?

 When you say try to not bounce to the 1st hypervisor , what is the
 exact meaning and how can we achieve this?  I am a quite newer on
 powerpc, and hope that I can get more clear figure about it  :)


Thanks
Pingfan


Re: [PATCH v2 1/2] powerpc/kvm: fix rare but potential deadlock scene

2013-11-07 Thread Liu ping fan
On Thu, Nov 7, 2013 at 5:54 PM, Alexander Graf ag...@suse.de wrote:

 On 07.11.2013, at 07:22, Liu Ping Fan kernelf...@gmail.com wrote:

 Since kvmppc_hv_find_lock_hpte() is called from both virtmode and
 realmode, so it can trigger the deadlock.

 Suppose the following scene:

 Two physical cpuM, cpuN, two VM instances A, B, each VM has a group of vcpus.

 If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK), then is switched out,
 and on cpuN, vcpu_A_2 try to lock X in realmode, then cpuN will be caught in
 realmode for a long time.

 What makes things even worse if the following happens,
  On cpuM, bitlockX is hold, on cpuN, Y is hold.
  vcpu_B_2 try to lock Y on cpuM in realmode
  vcpu_A_2 try to lock X on cpuN in realmode

 Oops! deadlock happens

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Very nice catch :).

 I think it makes a lot of sense to document the fact that 
 kvmppc_hv_find_lock_hpte() should be called with preemption disabled in a 
 comment above the function, so that next time when someone potentially calls 
 it, he knows that he needs to put preempt_disable() around it.

Ok, I will document them in v3

 Thanks a lot for finding this pretty subtle issue. May I ask how you got 
 there? Did you actually see systems deadlock because of this?

Intuition :). Then I began trying to model a scenario which causes the
deadlock, and fortunately I found a case to verify my suspicion.

Regards,
Pingfan

 Alex

 ---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

 diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
 b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 index 043eec8..dbc1478 100644
 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
 +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 @@ -474,10 +474,13 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct 
 kvm_vcpu *vcpu, gva_t eaddr,
   }

   /* Find the HPTE in the hash table */
 + preempt_disable();
   index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
HPTE_V_VALID | HPTE_V_ABSENT);
 - if (index  0)
 + if (index  0) {
 + preempt_enable();
   return -ENOENT;
 + }
   hptep = (unsigned long *)(kvm-arch.hpt_virt + (index  4));
   v = hptep[0]  ~HPTE_V_HVLOCK;
   gr = kvm-arch.revmap[index].guest_rpte;
 @@ -485,6 +488,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
 *vcpu, gva_t eaddr,
   /* Unlock the HPTE */
   asm volatile(lwsync : : : memory);
   hptep[0] = v;
 + preempt_enable();

   gpte-eaddr = eaddr;
   gpte-vpage = ((v  HPTE_V_AVPN)  4) | ((eaddr  12)  0xfff);
 --
 1.8.1.4




Re: [PATCH v2 2/2] powerpc/kvm: remove redundant assignment

2013-11-07 Thread Liu ping fan
On Thu, Nov 7, 2013 at 6:06 PM, Alexander Graf ag...@suse.de wrote:

 On 07.11.2013, at 07:22, Liu Ping Fan kernelf...@gmail.com wrote:

 ret is assigned twice with the same value, so remove the later one.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 Acked-by: Paul Mackerras pau...@samba.org

 I suppose my last request for a patch description was slightly too 
 abbreviated :). Sorry about that.

 Imagine you are a Linux-stable maintainer. You have about 5000 patches in 
 front of you and you want to figure out whether a patch should get backported 
 into a stable tree or not.

 It's very easy to read through the patch description.
 It's reasonably easy to do a git show on the patch.
 It's very hard to look at the actual surrounding code that was changed.

 If I open a text editor on the file, I immediately see what you're saying:

 ret = RESUME_GUEST;
 preempt_disable();
 while (!try_lock_hpte(hptep, HPTE_V_HVLOCK))
 cpu_relax();
 if ((hptep[0] & ~HPTE_V_HVLOCK) != hpte[0] || hptep[1] != hpte[1] ||
 rev->guest_rpte != hpte[2])
 /* HPTE has been changed under us; let the guest retry */
 goto out_unlock;
 hpte[0] = (hpte[0] & ~HPTE_V_ABSENT) | HPTE_V_VALID;

 rmap = memslot->arch.rmap[gfn - memslot->base_gfn];
 lock_rmap(rmap);

 /* Check if we might have been invalidated; let the guest retry if 
 so */
 ret = RESUME_GUEST;

 However, that scope is not given in the actual patch itself. If you look at 
 the diff below, you have no idea whether the patch is fixing a bug or just 
 removes duplication and doesn't actually have any effect. In fact, the 
 compiled assembly should be the same with this patch and without. But you 
 can't tell from the diff below.

 So what I would like to see in the patch description is something that makes 
 it easy to understand what's going on without the need to check out the 
 source file. Something like

 We redundantly set ret to RESUME_GUEST twice without changing it in between. 
 Only do it once:

 ret = RESUME_GUEST;
 preempt_disable();
 while (!try_lock_hpte(hptep, HPTE_V_HVLOCK))
 cpu_relax();
 if ((hptep[0] & ~HPTE_V_HVLOCK) != hpte[0] || hptep[1] != hpte[1] ||
 rev->guest_rpte != hpte[2])
 /* HPTE has been changed under us; let the guest retry */
 goto out_unlock;
 hpte[0] = (hpte[0] & ~HPTE_V_ABSENT) | HPTE_V_VALID;

 rmap = memslot->arch.rmap[gfn - memslot->base_gfn];
 lock_rmap(rmap);

 /* Check if we might have been invalidated; let the guest retry if 
 so */
 ret = RESUME_GUEST;


 If I look at that patch description it immediately tells me Ah, no need to 
 worry, it's not a critical bug I need to backport. If you have a better idea 
 how to express that I'm more than happy to take that too. Otherwise just let 
 me know whether you like the description above and I'll modify it to the one 
 that includes the code snippet when applying the patch.

Oh, yes. Thank you very much.
Now I have a better understanding of the heavy work of maintainers :)
I will keep this in mind.

Best regards,
Pingfan


[PATCH] powerpc: kvm: optimize sc 0 as fast return

2013-11-07 Thread Liu Ping Fan
syscall is a very common operation inside a guest, and this patch
optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL,
so the hypervisor can return to the guest without a heavy exit, i.e. no need
to swap TLB, HTAB, etc.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
Compiled, but for lack of bare metal I have not tested it yet.
---
 arch/powerpc/kvm/book3s_hv.c|  6 --
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 13 -
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 62a2b5a..73dc852 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
/* hcall - punt to userspace */
int i;
 
-   if (vcpu->arch.shregs.msr & MSR_PR) {
-   /* sc 1 from userspace - reflect to guest syscall */
-   kvmppc_book3s_queue_irqprio(vcpu, 
BOOK3S_INTERRUPT_SYSCALL);
-   r = RESUME_GUEST;
-   break;
-   }
run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3);
for (i = 0; i < 9; ++i)
run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c71103b..9f626c3 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1388,7 +1388,8 @@ kvmppc_hisi:
 hcall_try_real_mode:
ld  r3,VCPU_GPR(R3)(r9)
andi.   r0,r11,MSR_PR
-   bne guest_exit_cont
+   /* sc 1 from userspace - reflect to guest syscall */
+   bne sc_0_fast_return
clrrdi  r3,r3,2
cmpldi  r3,hcall_real_table_end - hcall_real_table
bge guest_exit_cont
@@ -1409,6 +1410,16 @@ hcall_try_real_mode:
ld  r11,VCPU_MSR(r4)
b   fast_guest_return
 
+sc_0_fast_return:
+   ld  r10,VCPU_PC(r9)
+   ld  r11,VCPU_MSR(r9)
+   mtspr   SPRN_SRR0,r10
+   mtspr   SPRN_SRR1,r11
+   li  r10, BOOK3S_INTERRUPT_SYSCALL
+   LOAD_REG_IMMEDIATE(r3,0x87a0)   /* zero 33:36,42:47 */
+   and r11,r11,r3
+   b   fast_guest_return
+
/* We've attempted a real mode hcall, but it's punted it back
 * to userspace.  We need to restore some clobbered volatiles
 * before resuming the pass-it-to-qemu path */
-- 
1.8.1.4



Re: [PATCH] powerpc: kvm: optimize sc 0 as fast return

2013-11-07 Thread Liu ping fan
On Fri, Nov 8, 2013 at 11:10 AM, Alexander Graf ag...@suse.de wrote:

 On 08.11.2013, at 03:44, Liu Ping Fan kernelf...@gmail.com wrote:

 syscall is a very common operation inside a guest, and this patch
 optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL,
 so the hypervisor can return to the guest without a heavy exit, i.e. no need
 to swap TLB, HTAB, etc.

 The syscall exit you touch here only happens when you do an sc  0 with 
 MSR_PR set inside the guest. The only case you realistically see this is when 
 you run PR KVM inside of an HV KVM guest.

Maybe I misunderstood the ISA spec, but referring to 6.5.14 System Call
Interrupt, there is no description of MSR_PR when sc triggers a syscall
interrupt. So I think a guest application's sc 0 will also fall through to the
kernel that owns hypervisor mode.  Am I right?

 I don't think we should optimize for that case. Instead, we should rather try 
 to not bounce to the 1st hypervisor in the first place in that scenario :).

Sorry, I just want to make sure about the idiom: 0 - the kernel running
with NV, and 1st - the kernel running on HV-KVM and providing PR-KVM to the
upper layer? Right?

When you say try to not bounce to the 1st hypervisor, what is the
exact meaning and how can we achieve this?  I am quite new to
powerpc, and hope that I can get a clearer picture of it :)

Thanks

Pingfan

 Alex


 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
 Compiled, but lack of bare metal, I have not tested it yet.
 ---
 arch/powerpc/kvm/book3s_hv.c|  6 --
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 13 -
 2 files changed, 12 insertions(+), 7 deletions(-)

 diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
 index 62a2b5a..73dc852 100644
 --- a/arch/powerpc/kvm/book3s_hv.c
 +++ b/arch/powerpc/kvm/book3s_hv.c
 @@ -628,12 +628,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, 
 struct kvm_vcpu *vcpu,
   /* hcall - punt to userspace */
   int i;

 - if (vcpu->arch.shregs.msr & MSR_PR) {
 - /* sc 1 from userspace - reflect to guest syscall */
 - kvmppc_book3s_queue_irqprio(vcpu, 
 BOOK3S_INTERRUPT_SYSCALL);
 - r = RESUME_GUEST;
 - break;
 - }
   run->papr_hcall.nr = kvmppc_get_gpr(vcpu, 3);
   for (i = 0; i < 9; ++i)
   run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i);
 diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
 b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 index c71103b..9f626c3 100644
 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
 @@ -1388,7 +1388,8 @@ kvmppc_hisi:
 hcall_try_real_mode:
   ld  r3,VCPU_GPR(R3)(r9)
   andi.   r0,r11,MSR_PR
 - bne guest_exit_cont
 + /* sc 1 from userspace - reflect to guest syscall */
 + bne sc_0_fast_return
   clrrdi  r3,r3,2
   cmpldi  r3,hcall_real_table_end - hcall_real_table
   bge guest_exit_cont
 @@ -1409,6 +1410,16 @@ hcall_try_real_mode:
   ld  r11,VCPU_MSR(r4)
   b   fast_guest_return

 +sc_0_fast_return:
 + ld  r10,VCPU_PC(r9)
 + ld  r11,VCPU_MSR(r9)
 + mtspr   SPRN_SRR0,r10
 + mtspr   SPRN_SRR1,r11
 + li  r10, BOOK3S_INTERRUPT_SYSCALL
 + LOAD_REG_IMMEDIATE(r3,0x87a0)   /* zero 33:36,42:47 */
 + and r11,r11,r3
 + b   fast_guest_return
 +
   /* We've attempted a real mode hcall, but it's punted it back
* to userspace.  We need to restore some clobbered volatiles
* before resuming the pass-it-to-qemu path */
 --
 1.8.1.4




Re: [PATCH] powerpc: kvm: optimize sc 0 as fast return

2013-11-07 Thread Liu ping fan
On Fri, Nov 8, 2013 at 12:11 PM, Benjamin Herrenschmidt
b...@kernel.crashing.org wrote:
 On Fri, 2013-11-08 at 15:05 +1100, Benjamin Herrenschmidt wrote:
 On Fri, 2013-11-08 at 04:10 +0100, Alexander Graf wrote:
  On 08.11.2013, at 03:44, Liu Ping Fan kernelf...@gmail.com wrote:
 
   syscall is a very common behavior inside guest, and this patch
   optimizes the path for the emulation of BOOK3S_INTERRUPT_SYSCALL,
   so hypervisor can return to guest without heavy exit, i.e, no need
   to swap TLB, HTAB,.. etc
 
  The syscall exit you touch here only happens when you do an sc  0
  with MSR_PR set inside the guest. The only case you realistically see
  this is when you run PR KVM inside of an HV KVM guest.
 
  I don't think we should optimize for that case. Instead, we should
  rather try to not bounce to the 1st hypervisor in the first place in
  that scenario :).

 Well, so unfortunately openstack CI uses PR inside HV pretty
 heavily  it *might* be worthwhile optimizing that path if the patch
 is simple enough... I'd make that Paul's call.

 Note that this is a statement of value for the idea ... not the
 implementation ;-) From a quick look with Paulus, the patch is quite
 broken. I'll let Paul comment in details.

Thank you very much,

Regards,
Pingfan


[PATCH v3 1/2] powerpc: kvm: pair kvmppc_hv_find_lock_hpte with _unlock_hpte

2013-11-07 Thread Liu Ping Fan
Highlight the lock pair for the reader. (And later it will be the
place to hide the detail of preemption disabling.)

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/kvm_book3s.h |  1 +
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |  7 ++-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   | 13 ++---
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index fa19e2f..a818932 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -129,6 +129,7 @@ extern void kvmppc_mmu_flush_segments(struct kvm_vcpu 
*vcpu);
 extern int kvmppc_book3s_hv_page_fault(struct kvm_run *run,
struct kvm_vcpu *vcpu, unsigned long addr,
unsigned long status);
+extern void kvmppc_hv_unlock_hpte(ulong *hptep, ulong *hpte_val);
 extern long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr,
unsigned long slb_v, unsigned long valid);
 
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 842f081..97685e7 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -479,12 +479,9 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
if (index < 0)
return -ENOENT;
hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
-   v = hptep[0] & ~HPTE_V_HVLOCK;
+   v = hptep[0];
gr = kvm->arch.revmap[index].guest_rpte;
-
-   /* Unlock the HPTE */
-   asm volatile("lwsync" : : : "memory");
-   hptep[0] = v;
+   kvmppc_hv_unlock_hpte(hptep, &v);
 
gpte->eaddr = eaddr;
gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..0ff9e91 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -749,6 +749,14 @@ static int slb_base_page_shift[4] = {
20, /* 1M, unsupported */
 };
 
+void kvmppc_hv_unlock_hpte(unsigned long *hptep, unsigned long *hpte_val)
+{
+   *hpte_val = *hpte_val & ~HPTE_V_HVLOCK;
+   asm volatile("lwsync" : : : "memory");
+   *hptep = *hpte_val;
+}
+EXPORT_SYMBOL(kvmppc_hv_unlock_hpte);
+
 long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long 
slb_v,
  unsigned long valid)
 {
@@ -863,12 +871,11 @@ long kvmppc_hpte_hv_fault(struct kvm_vcpu *vcpu, unsigned 
long addr,
return 0;   /* for prot fault, HPTE disappeared */
}
hpte = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
-   v = hpte[0] & ~HPTE_V_HVLOCK;
+   v = hpte[0];
r = hpte[1];
rev = real_vmalloc_addr(&kvm->arch.revmap[index]);
gr = rev->guest_rpte;
-
-   unlock_hpte(hpte, v);
+   kvmppc_hv_unlock_hpte(hpte, &v);
 
/* For not found, if the HPTE is valid by now, retry the instruction */
if ((status & DSISR_NOHPTE) && (v & HPTE_V_VALID))
-- 
1.8.1.4



[PATCH v3 0/2] powerpc kvm: fix deadlock scene

2013-11-07 Thread Liu Ping Fan
v2->v3:
  introduce kvmppc_hv_unlock_hpte() to pair with kvmppc_hv_find_lock_hpte()
  and hide the preemption detail inside this pair from the callers
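
For readers of the archive, a minimal sketch of the resulting virtual-mode
caller (it only condenses what patch 2/2 below does in
kvmppc_mmu_book3s_64_hv_xlate(), nothing new):

	/* With vmode == true the pair disables/enables preemption internally,
	 * so the HPTE bitlock can no longer be held across a preemption.
	 */
	index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
					 HPTE_V_VALID | HPTE_V_ABSENT,
					 true);
	if (index < 0)
		return -ENOENT;
	hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
	v = hptep[0];
	gr = kvm->arch.revmap[index].guest_rpte;
	kvmppc_hv_unlock_hpte(hptep, &v, true);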

Liu Ping Fan (2):
  powerpc: kvm: pair kvmppc_hv_find_lock_hpte with _unlock_hpte
  powerpc: kvm: fix rare but potential deadlock scene

 arch/powerpc/include/asm/kvm_book3s.h |  3 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 10 --
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   | 29 -
 3 files changed, 30 insertions(+), 12 deletions(-)

-- 
1.8.1.4



[PATCH v3 2/2] powerpc: kvm: fix rare but potential deadlock scene

2013-11-07 Thread Liu Ping Fan
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and
realmode, it can trigger a deadlock.

Suppose the following scene:

Two physical cpus cpuM and cpuN, two VM instances A and B, each VM with a group of vcpus.

If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched out,
and on cpuN, vcpu_A_2 tries to lock X in realmode, then cpuN will be stuck in
realmode for a long time.

What makes things even worse is if the following happens:
  On cpuM, bitlock X is held; on cpuN, Y is held.
  vcpu_B_2 tries to lock Y on cpuM in realmode
  vcpu_A_2 tries to lock X on cpuN in realmode

Oops! deadlock happens

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/kvm_book3s.h |  4 ++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |  5 +++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   | 20 
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index a818932..3d710ba 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -129,9 +129,9 @@ extern void kvmppc_mmu_flush_segments(struct kvm_vcpu 
*vcpu);
 extern int kvmppc_book3s_hv_page_fault(struct kvm_run *run,
struct kvm_vcpu *vcpu, unsigned long addr,
unsigned long status);
-extern void kvmppc_hv_unlock_hpte(ulong *hptep, ulong *hpte_val);
+extern void kvmppc_hv_unlock_hpte(ulong *hptep, ulong *hpte_val, bool vmode);
 extern long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr,
-   unsigned long slb_v, unsigned long valid);
+   unsigned long slb_v, unsigned long valid, bool vmode);
 
 extern void kvmppc_mmu_hpte_cache_map(struct kvm_vcpu *vcpu, struct hpte_cache 
*pte);
 extern struct hpte_cache *kvmppc_mmu_hpte_cache_next(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 97685e7..12d9635 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -475,13 +475,14 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
 
/* Find the HPTE in the hash table */
index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
-HPTE_V_VALID | HPTE_V_ABSENT);
+HPTE_V_VALID | HPTE_V_ABSENT,
+true);
if (index < 0)
return -ENOENT;
hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
v = hptep[0];
gr = kvm->arch.revmap[index].guest_rpte;
-   kvmppc_hv_unlock_hpte(hptep, &v);
+   kvmppc_hv_unlock_hpte(hptep, &v, true);
 
gpte->eaddr = eaddr;
gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 0ff9e91..18a9425 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -749,16 +749,22 @@ static int slb_base_page_shift[4] = {
20, /* 1M, unsupported */
 };
 
-void kvmppc_hv_unlock_hpte(unsigned long *hptep, unsigned long *hpte_val)
+void kvmppc_hv_unlock_hpte(unsigned long *hptep, unsigned long *hpte_val,
+   bool vmode)
 {
*hpte_val = *hpte_val & ~HPTE_V_HVLOCK;
asm volatile("lwsync" : : : "memory");
*hptep = *hpte_val;
+   if (unlikely(vmode))
+   preempt_enable();
 }
 EXPORT_SYMBOL(kvmppc_hv_unlock_hpte);
 
+/* If called from virtmode and success to lock, then the context will be set
+ * as preemption disabled
+ */
 long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long 
slb_v,
- unsigned long valid)
+ unsigned long valid, bool vmode)
 {
unsigned int i;
unsigned int pshift;
@@ -796,6 +802,9 @@ long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, 
unsigned long slb_v,
avpn &= ~0x7fUL;
val |= avpn;
 
+   if (unlikely(vmode))
+   preempt_disable();
+
for (;;) {
hpte = (unsigned long *)(kvm->arch.hpt_virt + (hash << 7));
 
@@ -833,6 +842,9 @@ long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, 
unsigned long slb_v,
val |= HPTE_V_SECONDARY;
hash = hash ^ kvm->arch.hpt_mask;
}
+
+   if (unlikely(vmode))
+   preempt_enable();
return -1;
 }
 EXPORT_SYMBOL(kvmppc_hv_find_lock_hpte);
@@ -864,7 +876,7 @@ long kvmppc_hpte_hv_fault(struct kvm_vcpu *vcpu, unsigned 
long addr,
if (status & DSISR_NOHPTE)
valid |= HPTE_V_ABSENT;
 
-   index = kvmppc_hv_find_lock_hpte(kvm, addr, slb_v, valid);
+   index = kvmppc_hv_find_lock_hpte(kvm, addr, slb_v, valid, false);
if (index < 0) {
if (status & DSISR_NOHPTE)
return status

Re: [PATCH 2/3] powerpc/kvm: fix rare but potential deadlock scene

2013-11-06 Thread Liu ping fan
On Wed, Nov 6, 2013 at 7:18 PM, Paul Mackerras pau...@samba.org wrote:
 On Wed, Nov 06, 2013 at 02:02:07PM +0800, Liu ping fan wrote:
 On Wed, Nov 6, 2013 at 1:04 PM, Paul Mackerras pau...@samba.org wrote:
  On Tue, Nov 05, 2013 at 03:42:43PM +0800, Liu Ping Fan wrote:
  Since kvmppc_hv_find_lock_hpte() is called from both virtmode and
  realmode, so it can trigger the deadlock.
 
  Good catch, we should have preemption disabled while ever we have a
  HPTE locked.
 
  @@ -474,8 +474,10 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct 
  kvm_vcpu *vcpu, gva_t eaddr,
}
 
/* Find the HPTE in the hash table */
  + preempt_disable();
index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
 HPTE_V_VALID | HPTE_V_ABSENT);
  + preempt_enable();
 
  Which means we need to add the preempt_enable after unlocking the
  HPTE, not here.
 
 Yes. Sorry, but I am not sure whether we can call
 preempt_disable/enable() in realmode. I think that since thread_info is
 allocated at a linear address, we can use preempt_disable/enable()
 inside kvmppc_hv_find_lock_hpte(), right?

 Your analysis correctly pointed out that we can get a deadlock if we
 can be preempted while holding a lock on a HPTE.  That means that we
 have to disable preemption before taking an HPTE lock and keep it
 disabled until after we unlock the HPTE.  Since the point of
 kvmppc_hv_find_lock_hpte() is to lock the HPTE and return with it
 locked, we can't have the preempt_enable() inside it.  The
 preempt_enable() has to come after we have unlocked the HPTE.  That is
 also why we can't have the preempt_enable() where your patch put it;
 it needs to be about 9 lines further down, after the statement
 hptep[0] = v.  (We also need to make sure to re-enable preemption in
 the index < 0 case.)

Oh, yes, I will fix it as you said. My attention was drawn to the
trick of calling a kernel function in realmode, and I missed the exact
point where the lock is released.
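
Spelled out, the placement Paul describes looks roughly like this (a sketch
only; the [PATCH v2 1/2] posted below implements exactly this in
kvmppc_mmu_book3s_64_hv_xlate()):

	preempt_disable();
	index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
					 HPTE_V_VALID | HPTE_V_ABSENT);
	if (index < 0) {
		preempt_enable();	/* re-enable in the error path too */
		return -ENOENT;
	}
	/* ... read out the HPTE and guest_rpte while it is locked ... */
	asm volatile("lwsync" : : : "memory");
	hptep[0] = v;			/* drop HPTE_V_HVLOCK ... */
	preempt_enable();		/* ... only then allow preemption */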

Thanks and regards,
Pingfan


[PATCH v2 1/2] powerpc/kvm: fix rare but potential deadlock scene

2013-11-06 Thread Liu Ping Fan
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and
realmode, it can trigger a deadlock.

Suppose the following scene:

Two physical cpus cpuM and cpuN, two VM instances A and B, each VM with a group of vcpus.

If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched out,
and on cpuN, vcpu_A_2 tries to lock X in realmode, then cpuN will be stuck in
realmode for a long time.

What makes things even worse is if the following happens:
  On cpuM, bitlock X is held; on cpuN, Y is held.
  vcpu_B_2 tries to lock Y on cpuM in realmode
  vcpu_A_2 tries to lock X on cpuN in realmode

Oops! deadlock happens

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 043eec8..dbc1478 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -474,10 +474,13 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
}
 
/* Find the HPTE in the hash table */
+   preempt_disable();
index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
 HPTE_V_VALID | HPTE_V_ABSENT);
-   if (index < 0)
+   if (index < 0) {
+   preempt_enable();
return -ENOENT;
+   }
hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
v = hptep[0] & ~HPTE_V_HVLOCK;
gr = kvm->arch.revmap[index].guest_rpte;
@@ -485,6 +488,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
/* Unlock the HPTE */
asm volatile("lwsync" : : : "memory");
hptep[0] = v;
+   preempt_enable();
 
gpte->eaddr = eaddr;
gpte->vpage = ((v & HPTE_V_AVPN) << 4) | ((eaddr >> 12) & 0xfff);
-- 
1.8.1.4



[PATCH v2 2/2] powerpc/kvm: remove redundant assignment

2013-11-06 Thread Liu Ping Fan
ret is assigned twice with the same value, so remove the later one.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
Acked-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index dbc1478..9b97b42 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -733,7 +733,6 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
lock_rmap(rmap);
 
/* Check if we might have been invalidated; let the guest retry if so */
-   ret = RESUME_GUEST;
if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) {
unlock_rmap(rmap);
goto out_unlock;
-- 
1.8.1.4



Re: [PATCH 2/3] powerpc/kvm: fix rare but potential deadlock scene

2013-11-05 Thread Liu ping fan
On Wed, Nov 6, 2013 at 1:04 PM, Paul Mackerras pau...@samba.org wrote:
 On Tue, Nov 05, 2013 at 03:42:43PM +0800, Liu Ping Fan wrote:
 Since kvmppc_hv_find_lock_hpte() is called from both virtmode and
 realmode, so it can trigger the deadlock.

 Good catch, we should have preemption disabled while ever we have a
 HPTE locked.

 @@ -474,8 +474,10 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct 
 kvm_vcpu *vcpu, gva_t eaddr,
   }

   /* Find the HPTE in the hash table */
 + preempt_disable();
   index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
HPTE_V_VALID | HPTE_V_ABSENT);
 + preempt_enable();

 Which means we need to add the preempt_enable after unlocking the
 HPTE, not here.

Yes. Sorry, but I am not sure whether we can call
preempt_disable/enable() in realmode. I think that since thread_info is
allocated at a linear address, we can use preempt_disable/enable()
inside kvmppc_hv_find_lock_hpte(), right?

Thanks and regards,
Pingfan

 Regards,
 Paul.


Re: [PATCH 1/3] powerpc/kvm: simplify the entering logic for secondary thread

2013-11-05 Thread Liu ping fan
On Wed, Nov 6, 2013 at 1:01 PM, Paul Mackerras pau...@samba.org wrote:
 On Tue, Nov 05, 2013 at 03:42:42PM +0800, Liu Ping Fan wrote:
 After the primary vcpu changes vcore_state to VCORE_RUNNING, there is
 very little chance to schedule the secondary vcpu. So if we change the

 Why do you say there is very little chance to run the secondary vcpu?

 code sequence around setting vcore_state to VCORE_RUNNING and disabling
 preemption, we lose little. But we simplify the entering logic, based on
 the fact that if the primary vcpu runs, the secondary vcpu cannot be scheduled.

 If a vcpu does a hypercall or something else that requires the host
 (kernel or userspace) to do something, that can happen in the context
 of the vcpu task for that vcpu.  That vcpu task can run on another
 core (unless it has been pinned).  When it is finished we would like
 the vcpu to continue executing in the guest as soon as possible.  The
 code that you remove in this patch enables that to happen without
 having to wait until the other threads exit the guest.  So I don't
 think it is a good idea to remove this code.

Yes, you are right.

Thanks and regards,
Pingfan


 Regards,
 Paul.


[PATCH 2/3] powerpc/kvm: fix rare but potential deadlock scene

2013-11-04 Thread Liu Ping Fan
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and
realmode, it can trigger a deadlock.

Suppose the following scene:

Two physical cpus cpuM and cpuN, two VM instances A and B, each VM with a group of vcpus.

If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched out,
and on cpuN, vcpu_A_2 tries to lock X in realmode, then cpuN will be stuck in
realmode for a long time.

What makes things even worse is if the following happens:
  On cpuM, bitlock X is held; on cpuN, Y is held.
  vcpu_B_2 tries to lock Y on cpuM in realmode
  vcpu_A_2 tries to lock X on cpuN in realmode

Oops! deadlock happens

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 043eec8..28160ac 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -474,8 +474,10 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
}
 
/* Find the HPTE in the hash table */
+   preempt_disable();
index = kvmppc_hv_find_lock_hpte(kvm, eaddr, slb_v,
 HPTE_V_VALID | HPTE_V_ABSENT);
+   preempt_enable();
if (index < 0)
return -ENOENT;
hptep = (unsigned long *)(kvm->arch.hpt_virt + (index << 4));
-- 
1.8.1.4



[PATCH 1/3] powerpc/kvm: simplify the entering logic for secondary thread

2013-11-04 Thread Liu Ping Fan
After the primary vcpu changes vcore_state to VCORE_RUNNING, there is
very little chance to schedule the secondary vcpu. So if we change the
code sequence around setting vcore_state to VCORE_RUNNING and disabling
preemption, we lose little. But we simplify the entering logic, based on
the fact that if the primary vcpu runs, the secondary vcpu cannot be scheduled.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 62a2b5a..38b1fc0 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1222,8 +1222,8 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
kvmppc_create_dtl_entry(vcpu, vc);
}
 
-   vc->vcore_state = VCORE_RUNNING;
preempt_disable();
+   vc->vcore_state = VCORE_RUNNING;
spin_unlock(&vc->lock);
 
kvm_guest_enter();
@@ -1351,12 +1351,7 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, 
struct kvm_vcpu *vcpu)
 * this thread straight away and have it join in.
 */
if (!signal_pending(current)) {
-   if (vc->vcore_state == VCORE_RUNNING &&
-   VCORE_EXIT_COUNT(vc) == 0) {
-   vcpu->arch.ptid = vc->n_runnable - 1;
-   kvmppc_create_dtl_entry(vcpu, vc);
-   kvmppc_start_thread(vcpu);
-   } else if (vc->vcore_state == VCORE_SLEEPING) {
+   if (vc->vcore_state == VCORE_SLEEPING) {
wake_up(&vc->wq);
}
 
-- 
1.8.1.4



[PATCH 3/3] powerpc/kvm: remove redundant assignment

2013-11-04 Thread Liu Ping Fan
Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 28160ac..7682837 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -731,7 +731,6 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
lock_rmap(rmap);
 
/* Check if we might have been invalidated; let the guest retry if so */
-   ret = RESUME_GUEST;
if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) {
unlock_rmap(rmap);
goto out_unlock;
-- 
1.8.1.4
