RE: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020

2018-08-09 Thread Bharat Bhushan


> -----Original Message-----
> From: Scott Wood [mailto:o...@buserror.net]
> Sent: Thursday, August 9, 2018 11:42 AM
> Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020
> 
> On Thu, 2018-08-09 at 03:28 +, Bharat Bhushan wrote:
> > > -----Original Message-----
> > > From: Scott Wood [mailto:o...@buserror.net]
> > > Sent: Wednesday, August 8, 2018 11:27 PM
> > > Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020
> > >
> > > On Wed, 2018-08-08 at 06:28 +, Bharat Bhushan wrote:
> > > > > -----Original Message-----
> > > > > From: Scott Wood [mailto:o...@buserror.net]
> > > > > Sent: Wednesday, August 8, 2018 11:26 AM
> > > > > Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020
> > > > >
> > > > > On Wed, 2018-08-08 at 03:44 +, Bharat Bhushan wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Scott Wood [mailto:o...@buserror.net]
> > > > > > > Sent: Wednesday, August 8, 2018 2:44 AM
> > > > > > > Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020
> > > > > > >
> > > > > > > On Fri, 2018-07-27 at 15:18 +0530, Bharat Bhushan wrote:
> > > > > > > > MPIC on NXP (Freescale) P2020 supports the following irq
> > > > > > > > ranges:
> > > > > > > >   0 - 11    (External interrupt)
> > > > > > > >   16 - 79   (Internal interrupt)
> > > > > > > >   176 - 183 (Messaging interrupt)
> > > > > > > >   224 - 231 (Shared message signaled interrupt)
> > > > > > >
> > > > > > > Why don't you convert to the 4-cell interrupt specifiers
> > > > > > > that make dealing with these ranges less error-prone?
> > > > > >
> > > > > > Ok, will do if we agree to have this series, as per the comment
> > > > > > on the other patch.
> > > > >
> > > > > If you're concerned with errors, this would be a good thing to
> > > > > do regardless. Actually, it seems that p2020si-post.dtsi already
> > > > > uses 4-cell interrupts.
> > > > >
> > > > > What is motivating this patchset?  Is there something wrong in
> > > > > the existing dts files?
> > > >
> > > > There is no error in the device tree. The main motivation is to
> > > > improve the code, for the following reasons:
> > > >   - During a code study it was found that if a reserved irq number
> > > > is used, there is no check in the driver. The irq will be configured
> > > > as if it were valid, and the interrupt will never fire.
> > >
> > > Again, a wrong interrupt number won't fire, whether an interrupt by
> > > that number exists or not.  I wouldn't mind a sanity check in the
> > > driver if the programming model made it properly discoverable, but I
> > > don't think it's worth messing with device trees just for this (and
> > > even less so given that there don't seem to be new chips coming out
> > > that this would be relevant for).
> >
> > Fair enough, we can use the MPIC version to define the supported
> > interrupt ranges. Will that be acceptable?
> 
> It's better than device tree changes but I'm not convinced it's
> worthwhile just to suppress some simulator warnings. If the warnings
> really bother you, you can use pic-no-reset in the device tree
> (assuming this isn't some new chip that you want to make sure doesn't
> fall over when the usua
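
For illustration, a version-keyed sanity check in the MPIC driver could
look roughly like the sketch below. The ranges are the P2020 values
quoted at the top of the thread; the struct, table and helper names are
hypothetical, not existing mpic.c code.

struct mpic_irq_range {
	unsigned int start;
	unsigned int end;
};

/* Hypothetical per-version table; these are the P2020 ranges. */
static const struct mpic_irq_range p2020_irq_ranges[] = {
	{   0,  11 },	/* external interrupts */
	{  16,  79 },	/* internal interrupts */
	{ 176, 183 },	/* messaging interrupts */
	{ 224, 231 },	/* shared MSIs */
};

/* Reject hardware source numbers that fall in a reserved hole. */
static bool mpic_src_is_supported(unsigned int src)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(p2020_irq_ranges); i++)
		if (src >= p2020_irq_ranges[i].start &&
		    src <= p2020_irq_ranges[i].end)
			return true;
	return false;
}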

Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

2018-08-09 Thread Nicholas Piggin
On Thu, 09 Aug 2018 16:34:07 +1000
Michael Ellerman  wrote:

> "Aneesh Kumar K.V"  writes:
> > On 08/08/2018 08:26 PM, Michael Ellerman wrote:  
> >> Mahesh J Salgaonkar  writes:  
> >>> From: Mahesh Salgaonkar 
> >>>
> >>> Introduce a recovery action for recovered memory errors (MCEs).
> >>> There are soft memory errors like SLB Multihit, which can be the
> >>> result of bad hardware OR a software BUG. The kernel can easily
> >>> recover from these soft errors by flushing the SLB contents. After
> >>> the recovery the kernel can still continue to function without any
> >>> issue. But in some scenarios we may keep getting these soft errors
> >>> until the root cause is fixed. To be able to analyze and find the
> >>> root cause, the best way is to gather enough data and system state
> >>> at the time of the MCE. Hence this patch introduces a sysctl knob
> >>> where the user can decide either to continue after recovery or
> >>> panic the kernel to capture the dump.
> >> 
> >> I'm not convinced we want this.
> >> 
> >> As we've discovered it's often not possible to reconstruct what happened
> >> based on a dump anyway.
> >> 
> >> The key thing you need is the content of the SLB and that's not included
> >> in a dump.
> >> 
> >> So I think we should dump the SLB content when we get the MCE (which
> >> this series does) and any other useful info, and then if we can recover
> >> we should.  
> >
> > The reasoning there is: what if we got the multi-hit due to some
> > corruption in slb_cache_ptr, i.e. some part of the kernel wrongly
> > updating the paca data structure through a bad pointer. That is
> > far-fetched, but possible, right? Hence the idea that, if we don't
> > have much insight into why an SLB multi-hit occurred from the dmesg
> > (which includes the SLB contents, slb_cache contents, etc.), there
> > should be an easy way to force a dump that might assist in further
> > debugging.
> 
> If you're debugging something complex that you can't determine from the
> SLB dump then you should be running a debug kernel anyway. And if
> anything, you want to drop into xmon and sit there, preserving the most
> state, rather than taking a dump.

I'm not saying a dump specifically, just some form of crash. And we
really should have an option for xmon on panic, but that's another story.

I think HA/failover kind of environments use options like this too. If
anything starts going bad they don't want to try limping along but stop
ASAP.

Thanks,
Nick


Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

2018-08-09 Thread Ananth N Mavinakayanahalli
On Thu, Aug 09, 2018 at 06:02:53PM +1000, Nicholas Piggin wrote:
> On Thu, 09 Aug 2018 16:34:07 +1000
> Michael Ellerman  wrote:
> 
> > "Aneesh Kumar K.V"  writes:
> > > On 08/08/2018 08:26 PM, Michael Ellerman wrote:  
> > >> Mahesh J Salgaonkar  writes:  
> > >>> From: Mahesh Salgaonkar 
> > >>>
> > >>> Introduce a recovery action for recovered memory errors (MCEs).
> > >>> There are soft memory errors like SLB Multihit, which can be the
> > >>> result of bad hardware OR a software BUG. The kernel can easily
> > >>> recover from these soft errors by flushing the SLB contents. After
> > >>> the recovery the kernel can still continue to function without any
> > >>> issue. But in some scenarios we may keep getting these soft errors
> > >>> until the root cause is fixed. To be able to analyze and find the
> > >>> root cause, the best way is to gather enough data and system state
> > >>> at the time of the MCE. Hence this patch introduces a sysctl knob
> > >>> where the user can decide either to continue after recovery or
> > >>> panic the kernel to capture the dump.
> > >> 
> > >> I'm not convinced we want this.
> > >> 
> > >> As we've discovered it's often not possible to reconstruct what happened
> > >> based on a dump anyway.
> > >> 
> > >> The key thing you need is the content of the SLB and that's not included
> > >> in a dump.
> > >> 
> > >> So I think we should dump the SLB content when we get the MCE (which
> > >> this series does) and any other useful info, and then if we can recover
> > >> we should.  
> > >
> > > The reasoning there is: what if we got the multi-hit due to some
> > > corruption in slb_cache_ptr, i.e. some part of the kernel wrongly
> > > updating the paca data structure through a bad pointer. That is
> > > far-fetched, but possible, right? Hence the idea that, if we don't
> > > have much insight into why an SLB multi-hit occurred from the dmesg
> > > (which includes the SLB contents, slb_cache contents, etc.), there
> > > should be an easy way to force a dump that might assist in further
> > > debugging.
> > 
> > If you're debugging something complex that you can't determine from the
> > SLB dump then you should be running a debug kernel anyway. And if
> > anything, you want to drop into xmon and sit there, preserving the most
> > state, rather than taking a dump.
> 
> I'm not saying a dump specifically, just some form of crash. And we
> really should have an option for xmon on panic, but that's another story.

That's fine during development or in a lab, not something we could
enforce in a customer environment, could we?

> I think HA/failover kind of environments use options like this too. If
> anything starts going bad they don't want to try limping along but stop
> ASAP.

Right. And in this particular case, the question is whether we can
guarantee no corruption (leading up to, or after, the multihit recovery)
when running a customer workload...

Ananth



[PATCH] powerpc/lib: Use patch_site to patch copy_32 functions once cache is enabled

2018-08-09 Thread Christophe Leroy
The symbol memcpy_nocache_branch, defined in order to allow patching
of the memset function once the cache is enabled, leads to confusing
reports by the perf tool.

Using the new patch_site functionality solves this issue.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/asm-prototypes.h | 1 +
 arch/powerpc/kernel/setup_32.c            | 7 +++----
 arch/powerpc/lib/copy_32.S                | 9 ++++++---
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index 70fdc5b9b9fb..1f4691ce4126 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -146,6 +146,7 @@ void _kvmppc_save_tm_pr(struct kvm_vcpu *vcpu, u64 guest_msr);
 /* Patch sites */
 extern s32 patch__call_flush_count_cache;
 extern s32 patch__flush_count_cache_return;
+extern s32 patch__memset_nocache, patch__memcpy_nocache;
 
 extern long flush_count_cache;
 
diff --git a/arch/powerpc/kernel/setup_32.c b/arch/powerpc/kernel/setup_32.c
index 0e3743343280..ba969278bf4d 100644
--- a/arch/powerpc/kernel/setup_32.c
+++ b/arch/powerpc/kernel/setup_32.c
@@ -97,11 +97,10 @@ notrace unsigned long __init early_init(unsigned long dt_ptr)
  * We do the initial parsing of the flat device-tree and prepares
  * for the MMU to be fully initialized.
  */
-extern unsigned int memset_nocache_branch; /* Insn to be replaced by NOP */
-
 notrace void __init machine_init(u64 dt_ptr)
 {
-   unsigned int *addr = &memset_nocache_branch;
+	unsigned int *addr = (unsigned int *)((unsigned long)&patch__memset_nocache +
+					       patch__memset_nocache);
unsigned long insn;
 
/* Configure static keys first, now that we're relocated. */
@@ -110,7 +109,7 @@ notrace void __init machine_init(u64 dt_ptr)
/* Enable early debugging if any specified (see udbg.h) */
udbg_early_init();
 
-   patch_instruction((unsigned int *)&memcpy, PPC_INST_NOP);
+   patch_instruction_site(&patch__memcpy_nocache, PPC_INST_NOP);
 
insn = create_cond_branch(addr, branch_target(addr), 0x82);
patch_instruction(addr, insn);  /* replace b by bne cr0 */
diff --git a/arch/powerpc/lib/copy_32.S b/arch/powerpc/lib/copy_32.S
index da425bb6b369..ba66846fe973 100644
--- a/arch/powerpc/lib/copy_32.S
+++ b/arch/powerpc/lib/copy_32.S
@@ -13,6 +13,7 @@
 #include <asm/errno.h>
 #include <asm/ppc_asm.h>
 #include <asm/export.h>
+#include <asm/code-patching-asm.h>
 
 #define COPY_16_BYTES  \
lwz r7,4(r4);   \
@@ -107,8 +108,8 @@ _GLOBAL(memset)
 * Skip optimised bloc until cache is enabled. Will be replaced
 * by 'bne' during boot to use normal procedure if r4 is not zero
 */
-_GLOBAL(memset_nocache_branch)
-   b   2f
+5: b   2f
+   patch_site  5b, patch__memset_nocache
 
clrlwi  r7,r6,32-LG_CACHELINE_BYTES
add r8,r7,r5
@@ -168,7 +169,9 @@ _GLOBAL(memmove)
/* fall through */
 
 _GLOBAL(memcpy)
-   b   generic_memcpy
+1: b   generic_memcpy
+   patch_site  1b, patch__memcpy_nocache
+
add r7,r3,r5/* test if the src & dst overlap */
add r8,r4,r5
cmplw   0,r4,r7
-- 
2.13.3
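
For reference, the patch_site assembler macro used above comes from
patch 1/5 of the patch_site series (arch/powerpc/include/asm/code-patching-asm.h).
Reproduced here from memory, so treat it as a sketch: it emits the
label's address as a self-relative offset into .rodata under the given
symbol name, which the C helpers later resolve.

.macro patch_site label name
	.pushsection ".rodata"
	.balign 4
	.global \name
\name:
	.4byte	\label - .
	.popsection
.endm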



Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

2018-08-09 Thread Nicholas Piggin
On Thu, 9 Aug 2018 13:39:45 +0530
Ananth N Mavinakayanahalli  wrote:

> On Thu, Aug 09, 2018 at 06:02:53PM +1000, Nicholas Piggin wrote:
> > On Thu, 09 Aug 2018 16:34:07 +1000
> > Michael Ellerman  wrote:
> >   
> > > "Aneesh Kumar K.V"  writes:  
> > > > On 08/08/2018 08:26 PM, Michael Ellerman wrote:
> > > >> Mahesh J Salgaonkar  writes:
> > > >>> From: Mahesh Salgaonkar 
> > > >>>
> > > >>> Introduce a recovery action for recovered memory errors (MCEs).
> > > >>> There are soft memory errors like SLB Multihit, which can be the
> > > >>> result of bad hardware OR a software BUG. The kernel can easily
> > > >>> recover from these soft errors by flushing the SLB contents.
> > > >>> After the recovery the kernel can still continue to function
> > > >>> without any issue. But in some scenarios we may keep getting
> > > >>> these soft errors until the root cause is fixed. To be able to
> > > >>> analyze and find the root cause, the best way is to gather enough
> > > >>> data and system state at the time of the MCE. Hence this patch
> > > >>> introduces a sysctl knob where the user can decide either to
> > > >>> continue after recovery or panic the kernel to capture the dump.
> > > >> 
> > > >> I'm not convinced we want this.
> > > >> 
> > > >> As we've discovered it's often not possible to reconstruct what 
> > > >> happened
> > > >> based on a dump anyway.
> > > >> 
> > > >> The key thing you need is the content of the SLB and that's not 
> > > >> included
> > > >> in a dump.
> > > >> 
> > > >> So I think we should dump the SLB content when we get the MCE (which
> > > >> this series does) and any other useful info, and then if we can recover
> > > >> we should.
> > > >
> > > > The reasoning there is: what if we got the multi-hit due to some
> > > > corruption in slb_cache_ptr, i.e. some part of the kernel wrongly
> > > > updating the paca data structure through a bad pointer. That is
> > > > far-fetched, but possible, right? Hence the idea that, if we don't
> > > > have much insight into why an SLB multi-hit occurred from the dmesg
> > > > (which includes the SLB contents, slb_cache contents, etc.), there
> > > > should be an easy way to force a dump that might assist in further
> > > > debugging.
> > > 
> > > If you're debugging something complex that you can't determine from
> > > the SLB dump then you should be running a debug kernel anyway. And if
> > > anything, you want to drop into xmon and sit there, preserving the
> > > most state, rather than taking a dump.
> > 
> > I'm not saying a dump specifically, just some form of crash. And we
> > really should have an option for xmon on panic, but that's another
> > story.
> 
> That's fine during development or in a lab, not something we could
> enforce in a customer environment, could we?

xmon on panic? Not something to enforce, but IMO (without thinking about
it too much, but having encountered it several times) it should probably
be tied to the xmon-on-BUG option.

> 
> > I think HA/failover kind of environments use options like this too. If
> > anything starts going bad they don't want to try limping along but stop
> > ASAP.  
> 
> Right. And in this particular case, the question is whether we can
> guarantee no corruption (leading up to, or after, the multihit
> recovery) when running a customer workload...

I think that's an element of it. If SLB corruption is caused by
software then we could already have memory corruption. If it's hardware
then presumably we're supposed to have some guarantee of error rates.
But still you would say a machine that has taken no MCEs is less likely
to have a problem than one that has taken some MCEs!

It's not just corruption either; I've run into bugs where we get huge
streams of HMIs, for example, which all get recovered properly but
performance would have been in the toilet.

Anyway, being policy, maybe we could drop this patch from the SLB MCE
series and introduce it afterwards if we think it's necessary. For an
SLB multi-hit caused by a software bug in SLB handling, I'd say Michael
is pretty much right about just needing the MCE output with the SLB
contents.

Thanks,
Nick


Re: [PATCH] lib/test_hexdump: fix failure on big endian cpu

2018-08-09 Thread Andy Shevchenko
On Thu, 2018-08-09 at 16:18 +1000, Michael Ellerman wrote:
> rashmica  writes:
> > On 08/08/18 17:25, Michael Ellerman wrote:
> > > Christophe Leroy  writes:
> > > > 
> > mpe I sent a patch doing that a while ago and you obviously didn't
> > like it because you never merged it :P

Hmm... I (as an author of the test case) never saw that patch.

> Sorry, I wasn't sure who should merge it, and never followed up.

lib/* is most of the time under Andrew's responsibility; since it's
orphaned in MAINTAINERS anyone can push it, though I think it's good to
notify Andrew.

-- 
Andy Shevchenko 
Intel Finland Oy



Re: [PATCH 1/5] powerpc/asm: Add a patch_site macro & helpers for patching instructions

2018-08-09 Thread Michael Ellerman
Christophe LEROY  writes:
> Le 08/08/2018 à 18:30, Christophe LEROY a écrit :
>> Le 23/07/2018 à 17:07, Michael Ellerman a écrit :
...
>>> diff --git a/arch/powerpc/include/asm/code-patching.h 
>>> b/arch/powerpc/include/asm/code-patching.h
>>> index 812535f40124..b2051234ada8 100644
>>> --- a/arch/powerpc/include/asm/code-patching.h
>>> +++ b/arch/powerpc/include/asm/code-patching.h
>>> @@ -32,6 +32,8 @@ unsigned int create_cond_branch(const unsigned int 
>>> *addr,
>>>   int patch_branch(unsigned int *addr, unsigned long target, int flags);
>>>   int patch_instruction(unsigned int *addr, unsigned int instr);
>>>   int raw_patch_instruction(unsigned int *addr, unsigned int instr);
>>> +int patch_instruction_site(s32 *addr, unsigned int instr);
>>> +int patch_branch_site(s32 *site, unsigned long target, int flags);
>> 
>> Why use s32* instead of unsigned int* as usual for a pointer to code?
>
> Forget my stupid question, I didn't see it was a relative address and 
> not an absolute one.

No worries. 

It is a bit non-obvious at first glance: it looks like the s32 * points
to the instruction, but it actually points to an s32 holding the
instruction's offset relative to the s32 itself.

We could add a typedef to try and make that more obvious, but I
generally don't like typedefs that hide pointerness.
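
A minimal sketch of how a site is resolved, close to (but not
guaranteed to match) the helpers in the series:

/* The s32 stores the instruction's offset relative to the s32 itself,
 * so resolving a site is plain pointer arithmetic.
 */
static inline unsigned int *patch_site_addr(s32 *site)
{
	return (unsigned int *)((unsigned long)site + *site);
}

static inline int patch_instruction_site(s32 *site, unsigned int instr)
{
	return patch_instruction(patch_site_addr(site), instr);
}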

cheers


Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

2018-08-09 Thread Michal Suchánek
On Thu, 9 Aug 2018 18:33:33 +1000
Nicholas Piggin  wrote:

> On Thu, 9 Aug 2018 13:39:45 +0530
> Ananth N Mavinakayanahalli  wrote:
> 
> > On Thu, Aug 09, 2018 at 06:02:53PM +1000, Nicholas Piggin wrote:  
> > > On Thu, 09 Aug 2018 16:34:07 +1000
> > > Michael Ellerman  wrote:
> > > 
> > > > "Aneesh Kumar K.V"  writes:
> > > > > On 08/08/2018 08:26 PM, Michael Ellerman wrote:  
> > > > >> Mahesh J Salgaonkar  writes:  
> > > > >>> From: Mahesh Salgaonkar 
> > > > >>>
> > > > >>> Introduce a recovery action for recovered memory errors
> > > > >>> (MCEs). There are soft memory errors like SLB Multihit,
> > > > >>> which can be the result of bad hardware OR a software BUG.
> > > > >>> The kernel can easily recover from these soft errors by
> > > > >>> flushing the SLB contents. After the recovery the kernel can
> > > > >>> still continue to function without any issue. But in some
> > > > >>> scenarios we may keep getting these soft errors until the
> > > > >>> root cause is fixed. To be able to analyze and find the root
> > > > >>> cause, the best way is to gather enough data and system
> > > > >>> state at the time of the MCE. Hence this patch introduces a
> > > > >>> sysctl knob where the user can decide either to continue
> > > > >>> after recovery or panic the kernel to capture the dump.
> > > > >> 
> > > > >> I'm not convinced we want this.
> > > > >> 
> > > > >> As we've discovered it's often not possible to reconstruct
> > > > >> what happened based on a dump anyway.
> > > > >> 
> > > > >> The key thing you need is the content of the SLB and that's
> > > > >> not included in a dump.
> > > > >> 
> > > > >> So I think we should dump the SLB content when we get the
> > > > >> MCE (which this series does) and any other useful info, and
> > > > >> then if we can recover we should.  
> > > > >
> > > > > The reasoning there is: what if we got the multi-hit due to
> > > > > some corruption in slb_cache_ptr, i.e. some part of the kernel
> > > > > wrongly updating the paca data structure through a bad
> > > > > pointer. That is far-fetched, but possible, right? Hence the
> > > > > idea that, if we don't have much insight into why an SLB
> > > > > multi-hit occurred from the dmesg (which includes the SLB
> > > > > contents, slb_cache contents, etc.), there should be an easy
> > > > > way to force a dump that might assist in further debugging.
> > > > 
> > > > If you're debugging something complex that you can't determine
> > > > from the SLB dump then you should be running a debug kernel
> > > > anyway. And if anything, you want to drop into xmon and sit
> > > > there, preserving the most state, rather than taking a dump.
> > > 
> > > I'm not saying a dump specifically, just some form of crash.
> > > And we really should have an option for xmon on panic, but that's
> > > another story.
> > 
> > That's fine during development or in a lab, not something we could
> > enforce in a customer environment, could we?  
> 
xmon on panic? Not something to enforce, but IMO (without thinking
about it too much, but having encountered it several times) it should
probably be tied to the xmon-on-BUG option.

You should get that with this patch and xmon=on, or am I missing
something?

Thanks

Michal


Re: [PATCH] lib/test_hexdump: fix failure on big endian cpu

2018-08-09 Thread Michael Ellerman
Andy Shevchenko  writes:
> On Thu, 2018-08-09 at 16:18 +1000, Michael Ellerman wrote:
>> rashmica  writes:
>> > On 08/08/18 17:25, Michael Ellerman wrote:
>> > > Christophe Leroy  writes:
>> > > > 
>> > mpe I sent a patch doing that awhile ago and you obviously didn't
>> > like
>> > it because you never merged it :P
>
> Hmm... I (as an author of the test case) never saw that patch.

I probably told Rashmica to send it to the linuxppc list, and then I was
meant to follow up on who should merge it. So my fault.

>> Sorry, I wasn't sure who should merge it, and never followed up.
>
> lib/* is most of the time under Andrew's responsibility; since it's
> orphaned in MAINTAINERS anyone can push it, though I think it's good to
> notify Andrew.

Yeah. I was going to apply it with your Ack because it seems pretty
uncontroversial, and there's nothing else in linux-next touching that
file so it should be conflict-free.

I guess I'll wait and see if Andrew would prefer to take it.

cheers


Re: [PATCH v6 2/2] powerpc: Use cpu_smallcore_sibling_mask at SMT level on bigcores

2018-08-09 Thread Srikar Dronamraju
* Gautham R. Shenoy  [2018-08-09 11:02:08]:

> 
> 3) ppc64_cpu --smt=2
>SMT domain ceases to exist as each domain consists of just one
>group.
> 

When seen in isolation, the above reads as if the ppc64_cpu --smt=2
output itself says "SMT domain ceases to exist".

> @@ -999,7 +1012,17 @@ static void add_cpu_to_masks(int cpu)
>  {
>   int first_thread = cpu_first_thread_sibling(cpu);
>   int chipid = cpu_to_chip_id(cpu);
> - int i;
> +
> + struct thread_groups tg;
> + int i, cpu_group_start = -1;
> +
> + if (has_big_cores) {
> + struct device_node *dn = of_get_cpu_node(cpu, NULL);
> +

Shouldn't we check dn for validity here, and isn't an of_node_put()
missing?

> + parse_thread_groups(dn, &tg);
> + cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
> + cpumask_set_cpu(cpu, cpu_smallcore_sibling_mask(cpu));
> + }
> 
>   /*
>* This CPU will not be in the online mask yet so we need to manually

The rest looks good



Re: [PATCH v6 1/2] powerpc: Detect the presence of big-cores via "ibm,thread-groups"

2018-08-09 Thread Srikar Dronamraju
* Gautham R. Shenoy  [2018-08-09 11:02:07]:

> 
>  int threads_per_core, threads_per_subcore, threads_shift;
> +bool has_big_cores;
>  cpumask_t threads_core_mask;
>  EXPORT_SYMBOL_GPL(threads_per_core);
>  EXPORT_SYMBOL_GPL(threads_per_subcore);
>  EXPORT_SYMBOL_GPL(threads_shift);
> +EXPORT_SYMBOL_GPL(has_big_cores);

Why do we need EXPORT_SYMBOL_GPL?

>  EXPORT_SYMBOL_GPL(threads_core_mask);
> 
> + *
> + * Returns 0 on success, -EINVAL if the property does not exist,
> + * -ENODATA if property does not have a value, and -EOVERFLOW if the
> + * property data isn't large enough.
> + */
> +int parse_thread_groups(struct device_node *dn,
> + struct thread_groups *tg)
> +{
> + unsigned int nr_groups, threads_per_group, property;
> + int i;
> + u32 thread_group_array[3 + MAX_THREAD_LIST_SIZE];
> + u32 *thread_list;
> + size_t total_threads;
> + int ret;
> +
> + ret = of_property_read_u32_array(dn, "ibm,thread-groups",
> +  thread_group_array, 3);
> +
> + if (ret)
> + goto out_err;
> +
> + property = thread_group_array[0];
> + nr_groups = thread_group_array[1];
> + threads_per_group = thread_group_array[2];
> + total_threads = nr_groups * threads_per_group;
> +

Shouldn't we check property and nr_groups here? If property is not 1 or
nr_groups < 1, we should error out. There's no point in calling the
second of_property_read if the property isn't right.


Nit:
Can't we assign directly to tg->property etc., and hence avoid the local
variables property, nr_groups and threads_per_group? (A sketch combining
this with the check above follows.)
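
A hedged sketch of the two suggestions combined (the reviewer's idea,
not code from the patch), as a fragment of parse_thread_groups():

	tg->property = thread_group_array[0];
	tg->nr_groups = thread_group_array[1];
	tg->threads_per_group = thread_group_array[2];

	/* Sketch only: error out early, before reading the full array. */
	if (tg->property != 1 || tg->nr_groups < 1)
		return -EINVAL;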

> + ret = of_property_read_u32_array(dn, "ibm,thread-groups",
> +  thread_group_array,
> +  3 + total_threads);
> +
> +static inline bool dt_has_big_core(struct device_node *dn,
> +struct thread_groups *tg)
> +{
> + if (parse_thread_groups(dn, tg))
> + return false;
> +
> + if (tg->property != 1)
> + return false;
> +
> + if (tg->nr_groups < 1)
> + return false;

These checks could be avoided if we did them in parse_thread_groups().

>  /**
>   * setup_cpu_maps - initialize the following cpu maps:
>   *  cpu_possible_mask
> @@ -457,6 +605,7 @@ void __init smp_setup_cpu_maps(void)
>   int cpu = 0;
>   int nthreads = 1;
> 
> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> index 755dc98..f5717de 100644
> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -18,6 +18,7 @@
>  #include <asm/smp.h>
>  #include <asm/pmc.h>
>  #include <asm/firmware.h>
> +#include <asm/cputhreads.h>
> 
>  #include "cacheinfo.h"
>  #include "setup.h"
> @@ -1025,6 +1026,33 @@ static ssize_t show_physical_id(struct device *dev,
>  }
>  static DEVICE_ATTR(physical_id, 0444, show_physical_id, NULL);
> 
> +static ssize_t show_small_core_siblings(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct cpu *cpu = container_of(dev, struct cpu, dev);
> + struct device_node *dn = of_get_cpu_node(cpu->dev.id, NULL);
> + struct thread_groups tg;
> + int i, j;
> + ssize_t ret = 0;
> +

Here we need to check dn for validity and error out accordingly; see the
sketch after this review.


> + if (parse_thread_groups(dn, &tg))
> + return -ENODATA;

Did we miss an of_node_put(dn)?

> +
> + i = get_cpu_thread_group_start(cpu->dev.id, &tg);
> +
> + if (i == -1)
> + return -ENODATA;
> +
> + for (j = 0; j < tg.threads_per_group - 1; j++)
> + ret += sprintf(buf + ret, "%d,", tg.thread_list[i + j]);

Here we are making the assumption that the group start will always be
the first thread in the thread group. However, we didn't make the same
assumption in get_cpu_thread_group_start().
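
A sketch of the dn handling the review asks for (illustrative only;
names taken from the quoted patch, not an actual fix):

	struct device_node *dn = of_get_cpu_node(cpu->dev.id, NULL);
	struct thread_groups tg;

	if (!dn)
		return -ENODEV;

	if (parse_thread_groups(dn, &tg)) {
		of_node_put(dn);
		return -ENODATA;
	}
	of_node_put(dn);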



[PATCH] powerpc/mm/hash: Remove unnecessary do { }while(0) loop

2018-08-09 Thread Aneesh Kumar K.V
Avoid Coverity false-positive warnings like:

*** CID 187347:  Control flow issues  (UNREACHABLE)
/arch/powerpc/mm/hash_native_64.c: 819 in native_flush_hash_range()
813        slot += hidx & _PTEIDX_GROUP_IX;
814        hptep = htab_address + slot;
815        want_v = hpte_encode_avpn(vpn, psize, ssize);
816        hpte_v = hpte_get_old_v(hptep);
817
818        if (!HPTE_V_COMPARE(hpte_v, want_v) || !(hpte_v & HPTE_V_VALID))
>>> CID 187347:  Control flow issues  (UNREACHABLE)

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash-64k.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index c81793d47af9..f82ee8a3b561 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -137,10 +137,9 @@ extern bool __rpte_sub_valid(real_pte_t rpte, unsigned long index);
shift = mmu_psize_defs[psize].shift;\
for (index = 0; vpn < __end; index++,   \
 vpn += (1L << (shift - VPN_SHIFT))) {  \
-   if (!__split || __rpte_sub_valid(rpte, index))  \
-   do {
+   if (!__split || __rpte_sub_valid(rpte, index))
 
-#define pte_iterate_hashed_end() } while(0); } } while(0)
+#define pte_iterate_hashed_end()  } } while(0)
 
 #define pte_pagesize_index(mm, addr, pte)  \
(((pte) & H_PAGE_COMBO)? MMU_PAGE_4K: MMU_PAGE_64K)
-- 
2.17.1
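
For context, the macro pair brackets a per-subpage loop body, roughly as
in the sketch below (hedged reconstruction of a caller such as
native_flush_hash_range(); the old do { } while (0) wrapped this body,
and Coverity flagged its closing while as unreachable when the body ends
in a branch):

	pte_iterate_hashed_subpages(rpte, psize, vpn, index, shift) {
		/* runs once per valid hashed subpage of this PTE */
		hash = hpt_hash(vpn, shift, ssize);
		/* ... locate and invalidate the HPTE for this vpn ... */
	} pte_iterate_hashed_end();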



[PATCH] powerpc/mm/tlbflush: update the mmu_gather page size while iterating address range

2018-08-09 Thread Aneesh Kumar K.V
This patch makes sure we update the mmu_gather page size even if we are
requesting a fullmm flush. This avoids triggering the VM_WARN_ON in code
paths like __tlb_remove_page_size() that explicitly check that the page
size of the range being removed matches the mmu_gather page size.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/tlb.h | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 97ecef697e1b..f0e571b2dc7c 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -49,13 +49,11 @@ static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
 static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 unsigned int page_size)
 {
-   if (tlb->fullmm)
-   return;
-
if (!tlb->page_size)
tlb->page_size = page_size;
else if (tlb->page_size != page_size) {
-   tlb_flush_mmu(tlb);
+   if (!tlb->fullmm)
+   tlb_flush_mmu(tlb);
/*
 * update the page size after flush for the new
 * mmu_gather.
-- 
2.17.1
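
The check being tripped lives in the generic mmu_gather code; a
simplified sketch of __tlb_remove_page_size() from mm/memory.c of this
era (not an exact copy):

bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
			    int page_size)
{
	/* Fires if a caller's page size disagrees with the gather's. */
	VM_WARN_ON(tlb->page_size != page_size);

	/* ... queue the page and report whether the batch is full ... */
	return false;
}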



[PATCH] powerpc/powernv/idle: Fix build error

2018-08-09 Thread Aneesh Kumar K.V
Fix the build error below by using strlcpy instead of strncpy:

In function 'pnv_parse_cpuidle_dt',
    inlined from 'pnv_init_idle_states' at arch/powerpc/platforms/powernv/idle.c:840:7,
    inlined from '__machine_initcall_powernv_pnv_init_idle_states' at arch/powerpc/platforms/powernv/idle.c:870:1:
arch/powerpc/platforms/powernv/idle.c:820:3: error: 'strncpy' specified bound 16 equals destination size [-Werror=stringop-truncation]
   strncpy(pnv_idle_states[i].name, temp_string[i],
   ^~~~
           PNV_IDLE_NAME_LEN);
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/powernv/idle.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index ecb002c5db83..35f699ebb662 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -817,7 +817,7 @@ static int pnv_parse_cpuidle_dt(void)
goto out;
}
for (i = 0; i < nr_idle_states; i++)
-   strncpy(pnv_idle_states[i].name, temp_string[i],
+   strlcpy(pnv_idle_states[i].name, temp_string[i],
PNV_IDLE_NAME_LEN);
nr_pnv_idle_states = nr_idle_states;
rc = 0;
-- 
2.17.1
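
For context, the semantic difference (an illustration, not part of the
patch): with a bound equal to the destination size, strncpy() can leave
the buffer without a terminating NUL, which is exactly what GCC 8's
-Wstringop-truncation warns about; strlcpy() always NUL-terminates,
truncating if needed.

	char name[16];	/* PNV_IDLE_NAME_LEN is 16 here */

	/* May fill all 16 bytes, leaving no room for the NUL. */
	strncpy(name, "a-long-idle-state-name", sizeof(name));

	/* Copies at most 15 bytes and always writes the NUL. */
	strlcpy(name, "a-long-idle-state-name", sizeof(name));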



Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-08-09 Thread Alex Williamson
On Thu, 9 Aug 2018 14:21:29 +1000
Alexey Kardashevskiy  wrote:

> On 08/08/2018 18:39, Alexey Kardashevskiy wrote:
> > 
> > 
> > On 02/08/2018 02:16, Alex Williamson wrote:  
> >> On Wed, 1 Aug 2018 18:37:35 +1000
> >> Alexey Kardashevskiy  wrote:
> >>  
> >>> On 01/08/2018 00:29, Alex Williamson wrote:  
>  On Tue, 31 Jul 2018 14:03:35 +1000
>  Alexey Kardashevskiy  wrote:
>  
> > On 31/07/2018 02:29, Alex Williamson wrote:
> >> On Mon, 30 Jul 2018 18:58:49 +1000
> >> Alexey Kardashevskiy  wrote:
> >>> After some local discussions, it was pointed out that force-disabling
> >>> nvlinks won't buy us much: for an nvlink to work, both sides need to
> >>> enable it, so a malicious guest cannot penetrate good ones (or the
> >>> host) unless a good guest enables the link, which won't happen with a
> >>> well-behaved guest. And if two guests become malicious, they can still
> >>> only harm each other, which they could do anyway via other means such
> >>> as the network. This is different from PCIe: once a PCIe link is
> >>> unavoidably enabled, a well-behaved device cannot firewall itself from
> >>> peers, as it is up to the upstream bridge(s) to decide the routing;
> >>> with nvlink2 a GPU still has means to protect itself, just like a
> >>> guest can run "firewalld" for the network.
> >>>
> >>> Although it would be a nice feature to have an extra barrier between
> >>> GPUs, is the inability to block the links in the hypervisor still a
> >>> blocker for V100 passthrough?
> >>
> >> How is the NVLink configured by the guest, is it 'on'/'off' or are
> >> specific routes configured?   
> >
> > The GPU-GPU links need not be blocked, and need to be enabled
> > (==trained) by a driver in the guest. There are no routes between GPUs
> > in the NVLink fabric; these are direct links, it is just a switch on
> > each side, and both switches need to be on for a link to work.
> 
>  Ok, but there is at least the possibility of multiple direct links per
>  GPU, the very first diagram I find of NVlink shows 8 interconnected
>  GPUs:
> 
>  https://www.nvidia.com/en-us/data-center/nvlink/
> >>>
> >>> Our design is like the left part of the picture, but it is just a detail.
> >>
> >> Unless we can specifically identify a direct link vs a mesh link, we
> >> shouldn't be making assumptions about the degree of interconnect.
> >>
>  So if each switch enables one direct, point to point link, how does the
>  guest know which links to open for which peer device?
> >>>
> >>> It uses PCI config space on GPUs to discover the topology.  
> >>
> >> So do we need to virtualize this config space if we're going to
> >> virtualize the topology?
> >>  
>  And of course
>  since we can't see the spec, a security audit is at best hearsay :-\
> >>>
> >>> Yup, the exact discovery protocol is hidden.  
> >>
> >> It could be reverse engineered...
> >>  
> > The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> > is controlled via the emulated PCI bridges which I pass through together
> > with the GPU.
> 
>  So there's a special emulated switch, is that how the guest knows which
>  GPUs it can enable NVLinks to?
> >>>
> >>> Since it only has PCI config space (there is nothing relevant in the
> >>> device tree at all), I assume (double checking with the NVIDIA folks
> >>> now) the guest driver enables them all, tests which pair works and
> >>> disables the ones which do not. This gives a malicious guest a tiny
> >>> window of opportunity to break into a good guest. Hm :-/  
> >>
> >> Let's not minimize that window, that seems like a prime candidate for
> >> an exploit.
> >>  
> >> If the former, then isn't a non-malicious
> >> guest still susceptible to a malicious guest?  
> >
> > A non-malicious guest needs to turn its switch on for a link to a GPU
> > which belongs to a malicious guest.
> 
>  Actual security, or obfuscation, will we ever know...
> >>> If the latter, how is
> >> routing configured by the guest given that the guest view of the
> >> topology doesn't match physical hardware?  Are these routes
> >> deconfigured by device reset?  Are they part of the save/restore
> >> state?  Thanks,  
> 
>  Still curious what happens to these routes on reset.  Can a later user
>  of a GPU inherit a device where the links are already enabled?  Thanks,  
>    
> >>>
> >>> I am told that the GPU reset disables links. As a side effect, we get
> >>> an HMI (a hardware fault which resets the host machine) when trying to
> >>> access the GPU RAM, which indicates that the link is down, as the
> >>> memory is only accessible via the nvlink. We have special fencing code
> >>> in our host firmware (skiboot) to fence this memory on P

Re: [PATCH 2/2] powerpc/64s: reimplement book3s idle code in C

2018-08-09 Thread Gautham R Shenoy
Hello Nicholas,
On Fri, Aug 03, 2018 at 02:13:50PM +1000, Nicholas Piggin wrote:
> Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
> specific HV idle code to the powernv platform code. Generic book3s
> assembly stubs are kept in common code and used only to save and
> restore the stack frame and non-volatile GPRs just before going to
> idle and just after waking up, which can return into C code.
> 
> Moving the HMI, SPR, OPAL, locking, etc. to C is the only real
> way this stuff will cope with non-trivial new CPU implementation
> details, firmware changes, etc., without becoming unmaintainable.
> 
> This is not a strict translation to C code, there are some
> differences.
> 
> - Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
>   but saves and restores them all explicitly.
> 
> - The optimisation where EC=ESL=0 idle modes did not have to save
>   GPRs or change MSR is restored, because it's now simple to do.
>   State loss modes that did not actually lose GPRs can use this
>   optimization too.
> 
> - KVM secondary entry code is now more of a call/return style
>   rather than spaghetti. nap_state_lost is not required because
>   KVM always returns via the NVGPR-restoring path.
> 
> This seems pretty solid, so needs more review and testing now. The
> KVM changes are pretty significant and complicated. POWER7 needs to
> be tested too.
> 
> Open question:
> - Why does P9 restore some of the PMU SPRs (but not others), and
>   P8 only zeroes them?

We are restoring MMCR0 (from the value saved on the stack) and MMCR1,
MMCR2, and MMCRA from the stop_sprs in the PACA. We saw that MMCRH and
MMCRC are cleared on both POWER8 and POWER9, hence we didn't restore
them. MMCRS is being initialized by the KVM code.

Is there anything apart from these that needs to be restored?

> 
> Since RFC v1:
> - Now tested and working with POWER9 hash and radix.
> - KVM support added. This took a bit of work to untangle and might
>   still have some issues, but POWER9 seems to work including hash on
>   radix with dependent threads mode.
> - This snowballed a bit because of KVM and other details making it
>   not feasible to leave POWER7/8 code alone. That's only half done
>   at the moment.
> - So far this trades about 800 lines of asm for 500 of C. With POWER7/8
>   support done it might be another hundred or so lines of C.
> 
> Since RFC v2:
> - Fixed deep state SLB reloading
> - Now tested and working with POWER8.
> - Accounted for most feedback.
> 
> Since RFC v3:
> - Rebased to powerpc merge + idle state bugfix
> - Split SLB flush/restore code out and shared with MCE code (pseries
>   MCE patches can also use).
> - More testing on POWER8 including KVM with secondaries.
> - Performance testing looks good. EC=ESL=0 is about 5% faster, other
>   stop states look a little faster too.
> - Adjusted SPR saving to handler POWER7, haven't tested it.


This patch looks good to me.

A couple of comments below. 

> ---

[... snip ..]

> @@ -178,23 +177,30 @@ struct paca_struct {
>  #endif
> 
>  #ifdef CONFIG_PPC_POWERNV
> - /* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
> - u32 *core_idle_state_ptr;
> - u8 thread_idle_state;   /* PNV_THREAD_RUNNING/NAP/SLEEP */
> - /* Mask to indicate thread id in core */
> - u8 thread_mask;
> - /* Mask to denote subcore sibling threads */
> - u8 subcore_sibling_mask;
> - /* Flag to request this thread not to stop */
> - atomic_t dont_stop;
> - /* The PSSCR value that the kernel requested before going to stop */
> - u64 requested_psscr;
> + /* PowerNV idle fields */
> + /* PNV_CORE_IDLE_* bits, all siblings work on thread 0 paca */
> + unsigned long idle_state;
> + union {
> + /* P7/P8 specific fields */
> + struct {
> + /* PNV_THREAD_RUNNING/NAP/SLEEP */
> + u8 thread_idle_state;
> + /* Mask to indicate thread id in core */
> + u8 thread_mask;

This is no longer needed. We can get this from cpu_thread_in_core()
in the C code.

The only place we are currently using this is to dump the value of
thread_mask from xmon, but not anywhere else in the idle entry code.



> + /* Mask to denote subcore sibling threads */
> + u8 subcore_sibling_mask;
> + };
> 
> - /*
> -  * Save area for additional SPRs that need to be
> -  * saved/restored during cpuidle stop.
> -  */
> - struct stop_sprs stop_sprs;
> + /* P9 specific fields */
> + struct {
> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +			/* The PSSCR value that the kernel requested before going to stop */
> + u64 requested_psscr;
> + /* Flag to request this thread not to stop */
> + atomic_t dont_stop;
> +#endif
> + };
> + };
>  #endif
[..snip..]

>

Re: [PATCH v3 2/2] powerpc/pseries: Wait for completion of hotplug events during PRRN handling

2018-08-09 Thread Nathan Fontenot
On 08/08/2018 10:29 AM, John Allen wrote:
> While handling PRRN events, the time to handle the actual hotplug events
> dwarfs the time it takes to perform the device tree updates and queue the
> hotplug events. In the case that PRRN events are being queued continuously,
> hotplug events have been observed to be queued faster than the kernel can
> actually handle them. This patch avoids the problem by waiting for a
> hotplug request to complete before queueing more hotplug events.
> 
> Signed-off-by: John Allen 

In the V2 thread it was mentioned that we could just call the DLPAR
operation directly instead of going through the workqueue. I have
written a patch to do this that also cleans up some of the request
handling.

Requests that come through the hotplug interrupt still use the
workqueue. The other requests, PRRN and sysfs, just call the dlpar
handler directly. This eliminates the need for the wait conditional and
return-code handling in the workqueue handler, and should solve the
issue that John addresses with his patch.

This still needs testing but I wanted to get people's thoughts.

-Nathan

---
 arch/powerpc/platforms/pseries/dlpar.c|   37 +++--
 arch/powerpc/platforms/pseries/mobility.c |   18 +-
 arch/powerpc/platforms/pseries/pseries.h  |5 ++--
 arch/powerpc/platforms/pseries/ras.c  |2 +-
 4 files changed, 19 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
index a0b20c03f078..052c4f2ba0a0 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -32,8 +32,6 @@ static struct workqueue_struct *pseries_hp_wq;
 struct pseries_hp_work {
struct work_struct work;
struct pseries_hp_errorlog *errlog;
-   struct completion *hp_completion;
-   int *rc;
 };
 
 struct cc_workarea {
@@ -329,7 +327,7 @@ int dlpar_release_drc(u32 drc_index)
return 0;
 }
 
-static int handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_elog)
+int handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_elog)
 {
int rc;
 
@@ -371,20 +369,13 @@ static void pseries_hp_work_fn(struct work_struct *work)
struct pseries_hp_work *hp_work =
container_of(work, struct pseries_hp_work, work);
 
-   if (hp_work->rc)
-   *(hp_work->rc) = handle_dlpar_errorlog(hp_work->errlog);
-   else
-   handle_dlpar_errorlog(hp_work->errlog);
-
-   if (hp_work->hp_completion)
-   complete(hp_work->hp_completion);
+   handle_dlpar_errorlog(hp_work->errlog);
 
kfree(hp_work->errlog);
kfree((void *)work);
 }
 
-void queue_hotplug_event(struct pseries_hp_errorlog *hp_errlog,
-struct completion *hotplug_done, int *rc)
+void queue_hotplug_event(struct pseries_hp_errorlog *hp_errlog)
 {
struct pseries_hp_work *work;
struct pseries_hp_errorlog *hp_errlog_copy;
@@ -397,13 +388,9 @@ void queue_hotplug_event(struct pseries_hp_errorlog *hp_errlog,
if (work) {
INIT_WORK((struct work_struct *)work, pseries_hp_work_fn);
work->errlog = hp_errlog_copy;
-   work->hp_completion = hotplug_done;
-   work->rc = rc;
queue_work(pseries_hp_wq, (struct work_struct *)work);
} else {
-   *rc = -ENOMEM;
kfree(hp_errlog_copy);
-   complete(hotplug_done);
}
 }
 
@@ -521,18 +508,15 @@ static int dlpar_parse_id_type(char **cmd, struct pseries_hp_errorlog *hp_elog)
 static ssize_t dlpar_store(struct class *class, struct class_attribute *attr,
   const char *buf, size_t count)
 {
-   struct pseries_hp_errorlog *hp_elog;
-   struct completion hotplug_done;
+   struct pseries_hp_errorlog hp_elog;
char *argbuf;
char *args;
int rc;
 
args = argbuf = kstrdup(buf, GFP_KERNEL);
-   hp_elog = kzalloc(sizeof(*hp_elog), GFP_KERNEL);
-   if (!hp_elog || !argbuf) {
+   if (!argbuf) {
pr_info("Could not allocate resources for DLPAR operation\n");
kfree(argbuf);
-   kfree(hp_elog);
return -ENOMEM;
}
 
@@ -540,25 +524,22 @@ static ssize_t dlpar_store(struct class *class, struct class_attribute *attr,
 * Parse out the request from the user, this will be in the form:
 *
 */
-   rc = dlpar_parse_resource(&args, hp_elog);
+   rc = dlpar_parse_resource(&args, &hp_elog);
if (rc)
goto dlpar_store_out;
 
-   rc = dlpar_parse_action(&args, hp_elog);
+   rc = dlpar_parse_action(&args, &hp_elog);
if (rc)
goto dlpar_store_out;
 
-   rc = dlpar_parse_id_type(&args, hp_elog);
+   rc = dlpar_parse_id_type(&args, &hp_elog);
if (rc)
goto dlpar_store_out;
 
-   init_completion(&ho

Re: ssb: Remove SSB_WARN_ON, SSB_BUG_ON and SSB_DEBUG

2018-08-09 Thread Kalle Valo
Michael Büsch  writes:

> Use the standard WARN_ON instead.
> If a small kernel is desired, WARN_ON can be disabled globally.
>
> Also remove SSB_DEBUG. Besides WARN_ON it only adds a tiny debug check.
> Include this check unconditionally.
>
> Signed-off-by: Michael Buesch 

Applied manually:

209b43759d65 ssb: Remove SSB_WARN_ON, SSB_BUG_ON and SSB_DEBUG

-- 
Kalle Valo


Re: [PATCH 2/2] powerpc/64s: reimplement book3s idle code in C

2018-08-09 Thread Nicholas Piggin
On Thu, 9 Aug 2018 20:15:04 +0530
Gautham R Shenoy  wrote:

> Hello Nicholas,
> On Fri, Aug 03, 2018 at 02:13:50PM +1000, Nicholas Piggin wrote:
> > Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
> > specific HV idle code to the powernv platform code. Generic book3s
> > assembly stubs are kept in common code and used only to save and
> > restore the stack frame and non-volatile GPRs just before going to
> > idle and just after waking up, which can return into C code.
> > 
> > Moving the HMI, SPR, OPAL, locking, etc. to C is the only real
> > way this stuff will cope with non-trivial new CPU implementation
> > details, firmware changes, etc., without becoming unmaintainable.
> > 
> > This is not a strict translation to C code, there are some
> > differences.
> > 
> > - Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
> >   but saves and restores them all explicitly.
> > 
> > - The optimisation where EC=ESL=0 idle modes did not have to save
> >   GPRs or change MSR is restored, because it's now simple to do.
> >   State loss modes that did not actually lose GPRs can use this
> >   optimization too.
> > 
> > - KVM secondary entry code is now more of a call/return style
> >   rather than spaghetti. nap_state_lost is not required beccause
> >   KVM always returns via NVGPR restorig path.
> > 
> > This seems pretty solid, so needs more review and testing now. The
> > KVM changes are pretty significant and complicated. POWER7 needs to
> > be tested too.
> > 
> > Open question:
> > - Why does P9 restore some of the PMU SPRs (but not others), and
> >   P8 only zeroes them?  
> 
> We are restoring MMCR0 (from the value saved on the stack) and MMCR1,
> MMCR2, and MMCRA from the stop_sprs in the PACA. We saw that MMCRH and
> MMCRC are cleared on both POWER8 and POWER9, hence we didn't restore
> them. MMCRS is being initialized by the KVM code.
> 
> Is there anything apart from these that needs to be restored?

No, I'm wondering why we restore those on POWER9 when POWER8 does not
restore them, only zeroes them. What is the difference with POWER9?

I will leave that in for now so we don't change too much with one patch,
but it would be nice to document a bit better the reasons for saving or
clearing SPRs.

> 
> > 
> > Since RFC v1:
> > - Now tested and working with POWER9 hash and radix.
> > - KVM support added. This took a bit of work to untangle and might
> >   still have some issues, but POWER9 seems to work including hash on
> >   radix with dependent threads mode.
> > - This snowballed a bit because of KVM and other details making it
> >   not feasible to leave POWER7/8 code alone. That's only half done
> >   at the moment.
> > - So far this trades about 800 lines of asm for 500 of C. With POWER7/8
> >   support done it might be another hundred or so lines of C.
> > 
> > Since RFC v2:
> > - Fixed deep state SLB reloading
> > - Now tested and working with POWER8.
> > - Accounted for most feedback.
> > 
> > Since RFC v3:
> > - Rebased to powerpc merge + idle state bugfix
> > - Split SLB flush/restore code out and shared with MCE code (pseries
> >   MCE patches can also use).
> > - More testing on POWER8 including KVM with secondaries.
> > - Performance testing looks good. EC=ESL=0 is about 5% faster, other
> >   stop states look a little faster too.
> > - Adjusted SPR saving to handler POWER7, haven't tested it.  
> 
> 
> This patch looks good to me.
> 
> A couple of comments below. 
> 
> > ---  
> 
> [... snip ..]
> 
> > @@ -178,23 +177,30 @@ struct paca_struct {
> >  #endif
> > 
> >  #ifdef CONFIG_PPC_POWERNV
> > -   /* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
> > -   u32 *core_idle_state_ptr;
> > -   u8 thread_idle_state;   /* PNV_THREAD_RUNNING/NAP/SLEEP */
> > -   /* Mask to indicate thread id in core */
> > -   u8 thread_mask;
> > -   /* Mask to denote subcore sibling threads */
> > -   u8 subcore_sibling_mask;
> > -   /* Flag to request this thread not to stop */
> > -   atomic_t dont_stop;
> > -   /* The PSSCR value that the kernel requested before going to stop */
> > -   u64 requested_psscr;
> > +   /* PowerNV idle fields */
> > +   /* PNV_CORE_IDLE_* bits, all siblings work on thread 0 paca */
> > +   unsigned long idle_state;
> > +   union {
> > +   /* P7/P8 specific fields */
> > +   struct {
> > +   /* PNV_THREAD_RUNNING/NAP/SLEEP */
> > +   u8 thread_idle_state;
> > +   /* Mask to indicate thread id in core */
> > +   u8 thread_mask;  
> 
> This is no longer needed. We can get this from cpu_thread_in_core()
> in the C code.
> 
> The only place we are currently using this is to dump the value of
> thread_mask from xmon, but not anywhere else in the idle entry code.

Good catch, removed it.

> > +   /* Mask to denote subcore sibling threads */
> > +   u8 subcore_sibling_mask;
> > +   };
> > 
> > -   /*
> > -*

Re: [PATCH] powerpc/mm/tlbflush: update the mmu_gather page size while iterating address range

2018-08-09 Thread Nicholas Piggin
On Thu,  9 Aug 2018 19:06:59 +0530
"Aneesh Kumar K.V"  wrote:

> This patch makes sure we update the mmu_gather page size even if we are
> requesting a fullmm flush. This avoids triggering the VM_WARN_ON in
> code paths like __tlb_remove_page_size() that explicitly check that the
> page size of the range being removed matches the mmu_gather page size.
> 
> Signed-off-by: Aneesh Kumar K.V 

Acked-by: Nicholas Piggin 

Thanks, sorry about that.

> ---
>  arch/powerpc/include/asm/tlb.h | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
> index 97ecef697e1b..f0e571b2dc7c 100644
> --- a/arch/powerpc/include/asm/tlb.h
> +++ b/arch/powerpc/include/asm/tlb.h
> @@ -49,13 +49,11 @@ static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
>  static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
>unsigned int page_size)
>  {
> - if (tlb->fullmm)
> - return;
> -
>   if (!tlb->page_size)
>   tlb->page_size = page_size;
>   else if (tlb->page_size != page_size) {
> - tlb_flush_mmu(tlb);
> + if (!tlb->fullmm)
> + tlb_flush_mmu(tlb);
>   /*
>* update the page size after flush for the new
>* mmu_gather.



Re: [PATCH v02] powerpc/mobility: Fix node detach/rename problem

2018-08-09 Thread Michael Ellerman
Michael Bringmann  writes:
> On 08/08/2018 09:02 AM, Michael Ellerman wrote:
>> Michael Bringmann  writes:
>>> diff --git a/arch/powerpc/platforms/pseries/mobility.c 
>>> b/arch/powerpc/platforms/pseries/mobility.c
>>> index e245a88..efc9442 100644
>>> --- a/arch/powerpc/platforms/pseries/mobility.c
>>> +++ b/arch/powerpc/platforms/pseries/mobility.c
>>> @@ -22,6 +22,9 @@
>>>  #include 
>>>  #include "pseries.h"
>>>  
>>> +extern int of_free_phandle_cache(void);
>>> +extern void of_populate_phandle_cache(void);
>> 
>> We don't do that, they should be in a header.
>> 
>> But that's a minor problem given that the patch doesn't compile, because
>> both those functions are static.
>
> I am building against the latest 'linux-ppc' kernel.  It includes patch

OK you must be using the master branch.

> Commit b9952b5218added5577e4a3443969bc20884cea9 Mon Sep 17 00:00:00 2001
> From: Frank Rowand 
> Date: Thu, 12 Jul 2018 14:00:07 -0700
> Subject: of: overlay: update phandle cache on overlay apply and remove

That only landed in v4.18-rc6, so it's not in my next branch which is
where patches like this targeted for the next release are applied.

> which makes the functions static.  I will rebuild and test with an
> earlier version if you will specify which one.

No that's fine it will just have to wait until next and master are
merged before it can go in.

cheers


[PATCH v2 1/2] powerpc/64s: move machine check SLB flushing to mm/slb.c

2018-08-09 Thread Nicholas Piggin
The machine check code that flushes and restores bolted segments in
real mode belongs in mm/slb.c. This will also be used by pseries
machine check and idle code in future changes.

Signed-off-by: Nicholas Piggin 

Since v1:
- Restore the test for slb_shadow (mpe)
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  3 ++
 arch/powerpc/kernel/mce_power.c   | 26 +
 arch/powerpc/mm/slb.c | 39 +++
 3 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 2f74bdc805e0..d4e398185b3a 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -497,6 +497,9 @@ extern void hpte_init_native(void);
 
 extern void slb_initialize(void);
 extern void slb_flush_and_rebolt(void);
+extern void slb_flush_all_realmode(void);
+extern void __slb_restore_bolted_realmode(void);
+extern void slb_restore_bolted_realmode(void);
 
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index d6756af6ec78..3497c8329c1d 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -62,11 +62,8 @@ static unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 #ifdef CONFIG_PPC_BOOK3S_64
 static void flush_and_reload_slb(void)
 {
-   struct slb_shadow *slb;
-   unsigned long i, n;
-
/* Invalidate all SLBs */
-   asm volatile("slbmte %0,%0; slbia" : : "r" (0));
+   slb_flush_all_realmode();
 
 #ifdef CONFIG_KVM_BOOK3S_HANDLER
/*
@@ -76,22 +73,17 @@ static void flush_and_reload_slb(void)
if (get_paca()->kvm_hstate.in_guest)
return;
 #endif
-
-   /* For host kernel, reload the SLBs from shadow SLB buffer. */
-   slb = get_slb_shadow();
-   if (!slb)
+   if (early_radix_enabled())
return;
 
-   n = min_t(u32, be32_to_cpu(slb->persistent), SLB_MIN_SIZE);
-
-   /* Load up the SLB entries from shadow SLB */
-   for (i = 0; i < n; i++) {
-   unsigned long rb = be64_to_cpu(slb->save_area[i].esid);
-   unsigned long rs = be64_to_cpu(slb->save_area[i].vsid);
+	/*
+	 * This probably shouldn't happen, but it's possible this is
+	 * called in early boot before the SLB shadows are allocated.
+	 */
+	if (!get_slb_shadow())
+		return;
 
-   rb = (rb & ~0xFFFul) | i;
-   asm volatile("slbmte %0,%1" : : "r" (rs), "r" (rb));
-   }
+   slb_restore_bolted_realmode();
 }
 #endif
 
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index cb796724a6fc..0b095fa54049 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -90,6 +90,45 @@ static inline void create_shadowed_slbe(unsigned long ea, 
int ssize,
 : "memory" );
 }
 
+/*
+ * Insert bolted entries into SLB (which may not be empty, so don't clear
+ * slb_cache_ptr).
+ */
+void __slb_restore_bolted_realmode(void)
+{
+   struct slb_shadow *p = get_slb_shadow();
+   enum slb_index index;
+
+	/* No isync needed because realmode. */
+   for (index = 0; index < SLB_NUM_BOLTED; index++) {
+   asm volatile("slbmte  %0,%1" :
+: "r" (be64_to_cpu(p->save_area[index].vsid)),
+  "r" (be64_to_cpu(p->save_area[index].esid)));
+   }
+}
+
+/*
+ * Insert the bolted entries into an empty SLB.
+ * This is not the same as rebolt because the bolted segments are not
+ * changed, just loaded from the shadow area.
+ */
+void slb_restore_bolted_realmode(void)
+{
+   __slb_restore_bolted_realmode();
+   get_paca()->slb_cache_ptr = 0;
+}
+
+/*
+ * This flushes all SLB entries including 0, so it must be realmode.
+ */
+void slb_flush_all_realmode(void)
+{
+	/* slbia does not invalidate entry 0, so clear it with slbmte first. */
+	asm volatile("slbmte %0,%0; slbia" : : "r" (0));
+}
+
 static void __slb_flush_and_rebolt(void)
 {
/* If you change this make sure you change SLB_NUM_BOLTED
-- 
2.17.0



[PATCH v2 2/2] powerpc/64s: reimplement book3s idle code in C

2018-08-09 Thread Nicholas Piggin
Reimplement Book3S idle code in C, moving POWER7/8/9
implementation-specific HV idle code to the powernv platform code.

Book3S assembly stubs are kept in common code and used only to save
the stack frame and non-volatile GPRs before executing architected
idle instructions, and to restore the stack, reload GPRs, and return
to C after waking from idle.
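As a rough sketch of the resulting shape (the isa300_ helper names follow the
prefix mentioned in the changelog below, but their exact signatures here are
assumptions for illustration):

/* Sketch only -- not the patch's exact code. */
static unsigned long power9_idle_stop_sketch(unsigned long psscr)
{
	unsigned long srr1;

	if (!(psscr & PSSCR_EC)) {
		/* EC=ESL=0: GPRs and MSR survive, so skip the save path. */
		srr1 = isa300_idle_stop_noloss(psscr);
	} else {
		/*
		 * The asm stub saves the stack frame and non-volatile
		 * GPRs, executes stop, and returns here after wakeup;
		 * SPRs are then restored in C rather than via
		 * ->cpu_restore.
		 */
		srr1 = isa300_idle_stop_mayloss(psscr);
	}
	return srr1;
}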

The complex logic dealing with threads and subcores, locking, SPRs,
HMIs, timebase resync, etc., is all done in C which makes it more
maintainable.

This is not a strict translation to C code, there are some
significant differences:

- Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
  but saves and restores them itself.

- The optimisation where EC=ESL=0 idle modes did not have to save GPRs
  or change MSR is restored, because it's now simple to do. ESL=1
  sleeps that do not lose GPRs can use this optimisation too.

- KVM secondary entry and cede is now more of a call/return style,
  rather than branchy. nap_state_lost is not required because KVM
  always returns via the NVGPR-restoring path.

Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Nicholas Piggin 

Left to do:
- KVM could use more review, it's pretty tricky. Not sure if what
  I'm doing with the emergency stack is kosher. But it runs pretty fine
  here with a POWER9 SMP+SMT guest. Possible to streamline
  KVM cede code now that idle function saves nv gprs for us?

Open question:
- Why does P9 restore some of the PMU SPRs (but not others), and
  P8 only zeroes them?

Since RFC v1:
- Now tested and working with POWER9 hash and radix.
- KVM support added. This took a bit of work to untangle and might
  still have some issues, but POWER9 seems to work including hash on
  radix with dependent threads mode.
- This snowballed a bit because of KVM and other details making it
  not feasible to leave POWER7/8 code alone. That's only half done
  at the moment.
- So far this trades about 800 lines of asm for 500 of C. With POWER7/8
  support done it might be another hundred or so lines of C.

Since RFC v2:
- Fixed deep state SLB reloading
- Now tested and working with POWER8.
- Accounted for most feedback.

Since RFC v3:
- Rebased to powerpc merge + idle state bugfix
- Split SLB flush/restore code out and shared with MCE code (pseries
  MCE patches can also use).
- More testing on POWER8 including KVM with secondaries.
- Performance testing looks good. EC=ESL=0 is about 5% faster, other
  stop states look a little faster too.
- Adjusted SPR saving to handle POWER7, haven't tested it.

Since submission v1:
- More review comments from Gautham.
- Rename isa3_ to isa300_ prefix.
- Tinkered with some comments, copyright notice, changelog.
- Cede and regular idle do not go via KVM secondary wakeup code path,
  so hwthread_state stores and barriers can be simplified, and some
  KVM code paths simplified a little.
---
 arch/powerpc/include/asm/cpuidle.h   |   19 +-
 arch/powerpc/include/asm/paca.h  |   38 +-
 arch/powerpc/include/asm/processor.h |9 +-
 arch/powerpc/include/asm/reg.h   |7 +-
 arch/powerpc/kernel/asm-offsets.c|   18 -
 arch/powerpc/kernel/dt_cpu_ftrs.c|   21 +-
 arch/powerpc/kernel/exceptions-64s.S |   17 +-
 arch/powerpc/kernel/idle_book3s.S| 1012 +++---
 arch/powerpc/kernel/setup-common.c   |4 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |   86 +-
 arch/powerpc/platforms/powernv/idle.c|  826 ++
 arch/powerpc/platforms/powernv/subcore.c |2 +-
 arch/powerpc/xmon/xmon.c |   24 +-
 13 files changed, 877 insertions(+), 1206 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h 
b/arch/powerpc/include/asm/cpuidle.h
index 43e5f31fe64d..9844b3ded187 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -27,10 +27,11 @@
  * the THREAD_WINKLE_BITS are set, which indicate which threads have not
  * yet woken from the winkle state.
  */
-#define PNV_CORE_IDLE_LOCK_BIT			0x10000000
+#define NR_PNV_CORE_IDLE_LOCK_BIT		28
+#define PNV_CORE_IDLE_LOCK_BIT			(1ULL << NR_PNV_CORE_IDLE_LOCK_BIT)
 
+#define PNV_CORE_IDLE_WINKLE_COUNT_SHIFT	16
 #define PNV_CORE_IDLE_WINKLE_COUNT		0x00010000
-#define PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT	0x00080000
 #define PNV_CORE_IDLE_WINKLE_COUNT_BITS		0x000F0000
 #define PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT	8
 #define PNV_CORE_IDLE_THREAD_WINKLE_BITS	0x0000FF00
@@ -68,16 +69,6 @@
 #define ERR_DEEP_STATE_ESL_MISMATCH	-2
 
 #ifndef __ASSEMBLY__
-/* Additional SPRs that need to be saved/restored during stop */
-struct stop_sprs {
-   u64 pid;
-   u64 ldbar;
-   u64 fscr;
-   u64 hfscr;
-   u64 mmcr1;
-   u64 mmcr2;
-   u64 mmcra;
-};
 
 #define PNV_IDLE_NAME_LEN	16
 struct pnv_idle_states_t {
@@ -92,10 +83,6 @@ struct pnv_idle_states_t {
 
 extern struct pnv_idle_states_t *pnv_idle_states

[PATCH v3] powerpc/topology: Check at boot for topology updates

2018-08-09 Thread Srikar Dronamraju
On a shared lpar, pHyp will not update the cpu associativity at boot
time. Just after boot, the system recognizes itself as a shared lpar and
triggers a request for the correct cpu associativity. But by then the
scheduler has already created/destroyed its sched domains.

This causes
- Broken load balancing across nodes, causing islands of cores.
- Performance degradation, especially if the system is lightly loaded.
- dmesg to wrongly report all cpus to be in Node 0.
- Messages in dmesg saying "arch topology borken".
- With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity
  node sched domain"), rcu stalls at boot up.

From a scheduler maintainer's perspective, moving cpus from one node to
another or creating more numa levels after boot is not appropriate
without some notification to user space.
https://lore.kernel.org/lkml/20150406214558.ga38...@linux.vnet.ibm.com/T/#u

The sched_domains_numa_masks table, which is used to generate cpumasks,
is only created at boot time, just before the sched domains are built,
and is never updated. Hence, it's better to get the topology correct
before the sched domains are created.

For example, on a 64-core POWER8 shared lpar, dmesg reports

[2.088360] Brought up 512 CPUs
[2.088368] Node 0 CPUs: 0-511
[2.088371] Node 1 CPUs:
[2.088373] Node 2 CPUs:
[2.088375] Node 3 CPUs:
[2.088376] Node 4 CPUs:
[2.088378] Node 5 CPUs:
[2.088380] Node 6 CPUs:
[2.088382] Node 7 CPUs:
[2.088386] Node 8 CPUs:
[2.088388] Node 9 CPUs:
[2.088390] Node 10 CPUs:
[2.088392] Node 11 CPUs:
...
[3.916091] BUG: arch topology borken
[3.916103]  the DIE domain not a subset of the NUMA domain
[3.916105] BUG: arch topology borken
[3.916106]  the DIE domain not a subset of the NUMA domain
...

numactl/lscpu output will still be correct, with cores spread across
all nodes.

Socket(s): 64
NUMA node(s):  12
Model: 2.0 (pvr 004d 0200)
Model name:POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type:   para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s):152-159,248-255,344-351,440-447
NUMA node11 CPU(s):160-167,256-263,352-359,448-455

Currently on this lpar, the scheduler detects 2 levels of NUMA and
creates NUMA sched domains for all cpus, but it finds a single DIE
domain consisting of all cpus. Hence it deletes all NUMA sched domains.

To address this, split the topology update init such that the first
part detects vphn/prrn soon after the cpus are set up, and the second
part force-updates the topology just before the scheduler creates the
sched domains.
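A sketch of that split (the function and variable names here are
illustrative, not necessarily the patch's actual identifiers):

static bool topology_update_needed;

/* Part 1: soon after the cpus are set up, note whether this is a
 * shared lpar whose associativity will need correcting. */
void __init early_shared_lpar_detect(void)
{
	if (lppaca_shared_proc(get_lppaca()))
		topology_update_needed = true;
}

/* Part 2: just before the scheduler builds its sched domains, force a
 * synchronous associativity update so sched_domains_numa_masks is
 * generated from the correct node assignments. */
void __init topology_force_update(void)
{
	if (topology_update_needed)
		numa_update_cpu_topology(false);
}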

With the fix, dmesg reports

[0.491336] numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 
368-375 464-471
[0.491351] numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 
376-383 472-479
[0.491359] numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 
384-391 480-487
[0.491366] numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 
392-399 488-495
[0.491374] numa: Node 4 CPUs: 208-215 304-311 400-407 496-503
[0.491379] numa: Node 5 CPUs: 168-175 264-271 360-367 456-463
[0.491384] numa: Node 6 CPUs: 128-135 224-231 320-327 416-423
[0.491389] numa: Node 7 CPUs: 136-143 232-239 328-335 424-431
[0.491394] numa: Node 8 CPUs: 216-223 312-319 408-415 504-511
[0.491399] numa: Node 9 CPUs: 144-151 240-247 336-343 432-439
[0.491404] numa: Node 10 CPUs: 152-159 248-255 344-351 440-447
[0.491409] numa: Node 11 CPUs: 160-167 256-263 352-359 448-455

and lscpu would also report

Socket(s): 64
NUMA node(s):  12
Model: 2.0 (pvr 004d 0200)
Model name:POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type:   para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU