Re: [PATCH 1/3] powerpc: make CPU selection logic generic in Makefile

2018-08-07 Thread Michael Ellerman
Christophe Leroy  writes:

> Currently, when adding a new CPU for selection, both
> Kconfig.cputype and Makefile have to be modified.
>
> This patch moves into Kconfig.cputype the name of the CPU to be
> passed to the -mcpu= argument.
>
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/Makefile  |  8 +---
>  arch/powerpc/platforms/Kconfig.cputype | 15 +++
>  2 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> index 9704ab360d39..9a5642552abc 100644
> --- a/arch/powerpc/Makefile
> +++ b/arch/powerpc/Makefile
> @@ -175,13 +175,7 @@ ifdef CONFIG_MPROFILE_KERNEL
>  endif
>  endif
>  
> -CFLAGS-$(CONFIG_CELL_CPU) += $(call cc-option,-mcpu=cell)
> -CFLAGS-$(CONFIG_POWER5_CPU) += $(call cc-option,-mcpu=power5)
> -CFLAGS-$(CONFIG_POWER6_CPU) += $(call cc-option,-mcpu=power6)
> -CFLAGS-$(CONFIG_POWER7_CPU) += $(call cc-option,-mcpu=power7)
> -CFLAGS-$(CONFIG_POWER8_CPU) += $(call cc-option,-mcpu=power8)
> -CFLAGS-$(CONFIG_POWER9_CPU) += $(call cc-option,-mcpu=power9)
> -CFLAGS-$(CONFIG_PPC_8xx) += $(call cc-option,-mcpu=860)
> +CFLAGS-$(CONFIG_SPECIAL_CPU_BOOL) += $(call cc-option,-mcpu=$(CONFIG_SPECIAL_CPU))

This looks good.

I'll rename it from "SPECIAL_CPU" to "TARGET_CPU" because that's the
terminology used in the GCC docs, eg:

-mcpu=name
   Specify the name of the target processor, optionally suffixed by one 
or more feature modifiers.


cheers


Re: [PATCH v3 00/14] Implement use of HW assistance on TLB table walk on 8xx

2018-08-07 Thread Christophe LEROY

Hi

Please note that I'm currently deeply reworking this series in order to 
make it clearer and hence simpler to review. So don't consider taking 
it yet (except maybe the first patch of the series, which is a bugfix).


Christophe

Le 29/05/2018 à 17:50, Christophe Leroy a écrit :

The purpose of this series is to implement hardware assistance for TLB table
walks on the 8xx.

First part is to make L1 entries and L2 entries independent.
For that, we need to alter ioremap functions in order to handle GUARD attribute
at the PGD/PMD level.

Last part is to reuse PTE fragment implemented on PPC64 in order to
not waste 16k Pages for page tables as only 4k are used.

Tested successfully on 8xx.

Successful compilation on the following defconfigs (v3):
ppc64_defconfig
ppc64e_defconfig

Successful compilation on the following defconfigs (v2):
ppc64_defconfig
ppc64e_defconfig
pseries_defconfig
pmac32_defconfig
linkstation_defconfig
corenet32_smp_defconfig
ppc40x_defconfig
storcenter_defconfig
ppc44x_defconfig

Changes in v3:
  - Fixed an issue in the 09/14 when CONFIG_PIN_TLB_TEXT was not enabled
  - Added performance measurement in the 09/14 commit log
  - Rebased on latest 'powerpc/merge' tree, which conflicted with 13/14

Changes in v2:
  - Removed the first 3 patches, which have been applied already
  - Fixed compilation errors reported by Michael
  - Squashed the commonalisation of ioremap functions into a single patch
  - Fixed the use of pte_fragment
  - Added a patch optimising perf counting of TLB misses and instructions

Christophe Leroy (14):
   Revert "powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for
 CONFIG_SWAP"
   powerpc: move io mapping functions into ioremap.c
   powerpc: make ioremap_bot common to PPC32 and PPC64
   powerpc: common ioremap functions.
   powerpc: use _ALIGN_DOWN macro for VMALLOC_BASE
   powerpc/nohash32: allow setting GUARDED attribute in the PMD directly
   powerpc/8xx: set GUARDED attribute in the PMD directly
   powerpc/8xx: Remove PTE_ATOMIC_UPDATES
   powerpc/mm: Use hardware assistance in TLB handlers on the 8xx
   powerpc/8xx: reunify TLB handler routines
   powerpc/8xx: Free up SPRN_SPRG_SCRATCH2
   powerpc/mm: Make pte_fragment_alloc() common to PPC32 and PPC64
   powerpc/mm: Use pte_fragment_alloc() on 8xx
   powerpc/8xx: Move SW perf counters in first 32kb of memory

  arch/powerpc/include/asm/book3s/32/pgtable.h |  29 +-
  arch/powerpc/include/asm/book3s/64/pgtable.h |   2 +
  arch/powerpc/include/asm/highmem.h   |  11 -
  arch/powerpc/include/asm/hugetlb.h   |   4 +-
  arch/powerpc/include/asm/machdep.h   |   2 +-
  arch/powerpc/include/asm/mmu-8xx.h   |  38 +--
  arch/powerpc/include/asm/mmu_context.h   |  53 
  arch/powerpc/include/asm/nohash/32/pgalloc.h |  56 +++-
  arch/powerpc/include/asm/nohash/32/pgtable.h |  77 +++--
  arch/powerpc/include/asm/nohash/32/pte-8xx.h |   6 +-
  arch/powerpc/include/asm/nohash/pgtable.h|   4 +
  arch/powerpc/include/asm/page.h  |   2 +-
  arch/powerpc/include/asm/pgtable-types.h |   4 +
  arch/powerpc/kernel/head_8xx.S   | 409 +++
  arch/powerpc/mm/8xx_mmu.c|  12 +-
  arch/powerpc/mm/Makefile |   2 +-
  arch/powerpc/mm/dma-noncoherent.c|   2 +-
  arch/powerpc/mm/dump_linuxpagetables.c   |  32 ++-
  arch/powerpc/mm/hugetlbpage.c|  12 +
  arch/powerpc/mm/init_32.c|   6 +-
  arch/powerpc/mm/{pgtable_64.c => ioremap.c}  | 239 ++--
  arch/powerpc/mm/mem.c|  16 +-
  arch/powerpc/mm/mmu_context_book3s64.c   |  44 ---
  arch/powerpc/mm/mmu_context_nohash.c |   4 +
  arch/powerpc/mm/pgtable-book3s64.c   |  72 -
  arch/powerpc/mm/pgtable.c|  82 ++
  arch/powerpc/mm/pgtable_32.c | 167 +++
  arch/powerpc/mm/pgtable_64.c | 177 
  arch/powerpc/platforms/Kconfig.cputype   |  19 ++
  29 files changed, 657 insertions(+), 926 deletions(-)
  copy arch/powerpc/mm/{pgtable_64.c => ioremap.c} (53%)



Re: [PATCH v6 00/11] hugetlb: Factorize hugetlb architecture primitives

2018-08-07 Thread Ingo Molnar


* Alexandre Ghiti  wrote:

> [CC linux-mm for inclusion in -mm tree]
>
> In order to reduce copy/paste of functions across architectures and then
> make the riscv hugetlb port (and future ports) simpler and smaller, this
> patchset intends to factorize the numerous hugetlb primitives that are
> defined across all the architectures.
>
> Except for prepare_hugepage_range, this patchset moves the versions that
> are just pass-through to standard pte primitives into
> asm-generic/hugetlb.h by using the same #ifdef semantic that can be
> found in asm-generic/pgtable.h, i.e. __HAVE_ARCH_***.
>
> The s390 architecture has not been tackled in this series since it does
> not use asm-generic/hugetlb.h at all.
>
> This patchset has been compiled on all addressed architectures with
> success (except for parisc, but the problem does not come from this
> series).
>
> v6:
>   - Remove nohash/32 and book3s/32 powerpc specific implementations in
>     order to use the generic ones.
>   - Add all the Reviewed-by, Acked-by and Tested-by tags in the commits,
>     thanks to everyone.
>
> v5:
>   As suggested by Mike Kravetz, no need to move the #include for arm and
>   x86 architectures; let it live at the top of the file.
>
> v4:
>   Fix powerpc build error due to misplacing of #include outside of
>   #ifdef CONFIG_HUGETLB_PAGE, as pointed out by Christophe Leroy.
>
> v1, v2, v3:
>   Same version, just problems with email provider and misuse of the
>   --batch-size option of git send-email
>
> Alexandre Ghiti (11):
>   hugetlb: Harmonize hugetlb.h arch specific defines with pgtable.h
>   hugetlb: Introduce generic version of hugetlb_free_pgd_range
>   hugetlb: Introduce generic version of set_huge_pte_at
>   hugetlb: Introduce generic version of huge_ptep_get_and_clear
>   hugetlb: Introduce generic version of huge_ptep_clear_flush
>   hugetlb: Introduce generic version of huge_pte_none
>   hugetlb: Introduce generic version of huge_pte_wrprotect
>   hugetlb: Introduce generic version of prepare_hugepage_range
>   hugetlb: Introduce generic version of huge_ptep_set_wrprotect
>   hugetlb: Introduce generic version of huge_ptep_set_access_flags
>   hugetlb: Introduce generic version of huge_ptep_get
> 
>  arch/arm/include/asm/hugetlb-3level.h| 32 +-
>  arch/arm/include/asm/hugetlb.h   | 30 --
>  arch/arm64/include/asm/hugetlb.h | 39 +++-
>  arch/ia64/include/asm/hugetlb.h  | 47 ++-
>  arch/mips/include/asm/hugetlb.h  | 40 +++--
>  arch/parisc/include/asm/hugetlb.h| 33 +++
>  arch/powerpc/include/asm/book3s/32/pgtable.h |  6 --
>  arch/powerpc/include/asm/book3s/64/pgtable.h |  1 +
>  arch/powerpc/include/asm/hugetlb.h   | 43 ++
>  arch/powerpc/include/asm/nohash/32/pgtable.h |  6 --
>  arch/powerpc/include/asm/nohash/64/pgtable.h |  1 +
>  arch/sh/include/asm/hugetlb.h| 54 ++---
>  arch/sparc/include/asm/hugetlb.h | 40 +++--
>  arch/x86/include/asm/hugetlb.h   | 69 --
>  include/asm-generic/hugetlb.h| 88 
> +++-
>  15 files changed, 135 insertions(+), 394 deletions(-)

Re: Several suspected memory leaks

2018-08-07 Thread Catalin Marinas
(catching up with emails)

On Wed, 11 Jul 2018 at 00:40, Benjamin Herrenschmidt
 wrote:
> On Tue, 2018-07-10 at 17:17 +0200, Paul Menzel wrote:
> > On a the IBM S822LC (8335-GTA) with Ubuntu 18.04 I built Linux master
> > – 4.18-rc4+, commit 092150a2 (Merge branch 'for-linus'
> > of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid) – with
> > kmemleak. Several issues are found.
>
> Some of these are completely uninteresting though and look like
> kmemleak bugs to me :-)
>
> > [] __pud_alloc+0x80/0x270
> > [<07135d64>] hash__map_kernel_page+0x30c/0x4d0
> > [<71677858>] __ioremap_at+0x108/0x140
> > [<0023e921>] __ioremap_caller+0x130/0x180
> > [<9dbc3923>] icp_native_init_one_node+0x5cc/0x760
> > [<15f3168a>] icp_native_init+0x70/0x13c
> > [<60ed>] xics_init+0x38/0x1ac
> > [<88dbf9d1>] pnv_init_IRQ+0x30/0x5c
>
> This is the interrupt controller mapping its registers, why on earth
> would that be considered a leak ? kmemleak needs to learn to ignore
> kernel page tables allocations.

Indeed, that's just a false positive for powerpc. Kmemleak ignores
page allocations and most architectures use __get_free_pages() for the
page table. In this particular case, the powerpc code uses
kmem_cache_alloc() and that's tracked by kmemleak. Since the pgd
stores the __pa(pud), kmemleak doesn't detect this pointer and reports
it as a leak. To work around this, you can pass SLAB_NOLEAKTRACE to
kmem_cache_create() in pgtable_cache_add()
(arch/powerpc/mm/init-common.c).
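
The workaround can be sketched as follows (kernel code, not buildable
standalone; the exact arguments at the kmem_cache_create() call site in
arch/powerpc/mm/init-common.c may differ):

```c
/* In pgtable_cache_add(): pass SLAB_NOLEAKTRACE so kmemleak does not
 * track page-table allocations whose only live reference is a __pa()
 * physical address that the scanner cannot follow. Sketch only. */
new = kmem_cache_create(name, table_size, align,
			SLAB_NOLEAKTRACE, ctor);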

-- 
Catalin


[PATCH] powerpc/64: Disable irq restore warning for now

2018-08-07 Thread Michael Ellerman
We recently added a warning in arch_local_irq_restore() to check that
the soft masking state matches reality.

Unfortunately it trips in a few places, which are not entirely trivial
to fix. The key problem is if we're doing function_graph tracing of
restore_math(), the warning pops and then seems to recurse. It's not
entirely clear because the system continuously oopses on all CPUs,
with the output interleaved and unreadable.

It's also been observed on a G5 coming out of idle.

Until we can fix those cases disable the warning for now.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/irq.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index ca941f1e83a9..916ddc4aac44 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -261,9 +261,16 @@ notrace void arch_local_irq_restore(unsigned long mask)
 */
irq_happened = get_irq_happened();
if (!irq_happened) {
-#ifdef CONFIG_PPC_IRQ_SOFT_MASK_DEBUG
-   WARN_ON(!(mfmsr() & MSR_EE));
-#endif
+   /*
+* FIXME. Here we'd like to be able to do:
+*
+* #ifdef CONFIG_PPC_IRQ_SOFT_MASK_DEBUG
+*   WARN_ON(!(mfmsr() & MSR_EE));
+* #endif
+*
+* But currently it hits in a few paths, we should fix those and
+* enable the warning.
+*/
return;
}
 
-- 
2.14.1



Re: [RFC PATCH 0/3] New device-tree format and Opal based idle save-restore

2018-08-07 Thread Michael Ellerman
Akshay Adiga  writes:

> Previously, if an older kernel runs on newer firmware, it may enable
> all available states irrespective of its capability of handling them.
> The new device tree format adds a compatible flag, so that only a kernel
> which has the capability to handle that version of a stop state will
> enable it.
>
> Older kernel will still see stop0 and stop0_lite in older format and we
> will depricate it after some time.
>
> 1) The idea is to bump up the version string in firmware if we find a bug
> or regression in stop states. A fix will be provided in linux which would
> now know about the bumped up version of stop states, whereas a kernel
> without the fixes would ignore the states.
>
> 2) Slowly deprecate cpuidle/cpuhotplug threshold which is hard-coded
> into cpuidle-powernv driver. Instead use compatible strings to indicate
> if idle state is suitable for cpuidle and hotplug.
>
> New idle state device tree format :
>power-mgt {
> ...
>  ibm,enabled-stop-levels = <0xec00>;
>  ibm,cpu-idle-state-psscr-mask = <0x0 0x3003ff 0x0 0x3003ff>;
>  ibm,cpu-idle-state-latencies-ns = <0x3e8 0x7d0>;
>  ibm,cpu-idle-state-psscr = <0x0 0x330 0x0 0x300330>;
>  ibm,cpu-idle-state-flags = <0x10 0x101000>;
>  ibm,cpu-idle-state-residency-ns = <0x2710 0x4e20>;
>  ibm,idle-states {
>  stop4 {
>  flags = <0x207000>;
>  compatible = "ibm,state-v1",
> "cpuidle",
> "opal-supported";
>  psscr-mask = <0x0 0x3003ff>;
>  handle = <0x102>;
>  latency-ns = <0x186a0>;
>  residency-ns = <0x989680>;
>  psscr = <0x0 0x300374>;
>   };
> ...
> stop11 {
>  ...
>  compatible = "ibm,state-v1",
> "cpuoffline",
> "opal-supported";
>  ...
>   };
>  };
>
> Skiboot patch-set for device-tree is posted here :
> https://patchwork.ozlabs.org/project/skiboot/list/?series=58934

I don't see a device tree binding documented anywhere?

There is an existing binding defined for ARM chips, presumably it
doesn't do everything we need. But are there good reasons why we are not
using it as a base?

See: Documentation/devicetree/bindings/arm/idle-states.txt


The way you're using compatible is not really consistent with its
traditional meaning.

eg, you have multiple states with:

compatible = "ibm,state-v1",
"cpuoffline",
"opal-supported";


This would typically mean that all those states are "compatible" with
some semantics defined by the name "ibm,state-v1". What you're trying to
say (I think) is that each state is "version 1" of *that state*. And
only kernels that understand version 1 should use the state.

And "cpuoffline" and "opal-supported" definitely don't belong in
compatible AFAICS, they should simply be boolean properties of the node.
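
What Michael is suggesting might look roughly like this (a sketch only,
not a defined binding; the property names are taken from the strings in
the RFC and are illustrative):

```dts
stop4 {
	/* version stays in compatible: only kernels that understand
	 * "ibm,state-v1" semantics will use this node */
	compatible = "ibm,state-v1";

	/* capabilities become boolean properties of the node, rather
	 * than extra compatible strings */
	cpuidle;
	opal-supported;

	psscr-mask = <0x0 0x3003ff>;
	handle = <0x102>;
	latency-ns = <0x186a0>;
	residency-ns = <0x989680>;
	psscr = <0x0 0x300374>;
};
```

A kernel would then test for the capability with of_property_read_bool()
instead of matching compatible strings.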

cheers


Re: [PATCH] misc: ibmvsm: Fix wrong assignment of return code

2018-08-07 Thread Michael Ellerman
"Bryant G. Ly"  writes:

> From: "Bryant G. Ly" 
>
> Currently the assignment is flipped and rc is always 0.

If you'd left rc uninitialised at the start of the function the compiler
would have caught it for you.

And what is the consequence of the bug? Nothing, complete system crash,
subtle data corruption?

Also this should be tagged:

Fixes: 0eca353e7ae7 ("misc: IBM Virtual Management Channel Driver (VMC)")

cheers

> diff --git a/drivers/misc/ibmvmc.c b/drivers/misc/ibmvmc.c
> index 8f82bb9..b8aaa68 100644
> --- a/drivers/misc/ibmvmc.c
> +++ b/drivers/misc/ibmvmc.c
> @@ -2131,7 +2131,7 @@ static int ibmvmc_init_crq_queue(struct 
> crq_server_adapter *adapter)
>   retrc = plpar_hcall_norets(H_REG_CRQ,
>  vdev->unit_address,
>  queue->msg_token, PAGE_SIZE);
> - retrc = rc;
> + rc = retrc;
>  
>   if (rc == H_RESOURCE)
>   rc = ibmvmc_reset_crq_queue(adapter);



Re: [PATCH] of/fdt: Remove PPC32 longtrail hack in memory scan

2018-08-07 Thread Michael Ellerman
Rob Herring  writes:
> On Mon, Jul 30, 2018 at 4:47 AM Michael Ellerman  wrote:
>> Rob Herring  writes:
>> > On Thu, Jul 26, 2018 at 11:36 PM Michael Ellerman  
>> > wrote:
>> >> When the OF code was originally made common by Grant in commit
>> >> 51975db0b733 ("of/flattree: merge early_init_dt_scan_memory() common
>> >> code") (Feb 2010), the common code inherited a hack to handle
>> >> PPC "longtrail" machines, which had a "memory@0" node with no
>> >> device_type.
>> >>
>> >> That check was then made to only apply to PPC32 in b44aa25d20e2 ("of:
>> >> Handle memory@0 node on PPC32 only") (May 2014).
>> >>
>> >> But according to Paul Mackerras the "longtrail" machines are long
>> >> dead, if they were ever seen in the wild at all. If someone does still
>> >> have one, we can handle this firmware wart in powerpc platform code.
>> >>
>> >> So remove the hack once and for all.
>> >
>> > Yay. I guess Power Macs and other quirks will never die...
>>
>> Not soon.
>>
>> In base.c I see:
>>  - the hack in arch_find_n_match_cpu_physical_id()
>>- we should just move that into arch code, it's a __weak arch hook
>>  after all.
>
> Except then we'd have to export __of_find_n_match_cpu_property. I
> somewhat prefer it like it is because arch specific functions tend to
> encourage duplication. But if the implementation is completely
> different like sparc, then yes, a separate implementation makes sense.

OK I'll leave it as-is then.

>>  - a PPC hack in of_alias_scan(), I guess we need to retain that
>>behaviour, but it's pretty minor anyway.
>
> It would be nice to know what platform(s) needs this as I don't have a
> clue.

Yeah.

> It would also be nice if I had some dumps of DTs for some of
> these OpenFirmware systems.

Yeah it would. What if I threw some in a git tree somewhere?

I guess fully unpacked is the most useful form, ie. just a direct copy
from /proc/device-tree.

cheers


[PATCH] powerpc/64s: idle_power4 fix PACA_IRQ_HARD_DIS accounting

2018-08-07 Thread Nicholas Piggin
When idle_power4 hard disables interrupts then finds a soft pending
interrupt, it returns with interrupts hard disabled but without
PACA_IRQ_HARD_DIS set. Commit 9b81c0211c ("powerpc/64s: make
PACA_IRQ_HARD_DIS track MSR[EE] closely") added a warning for that
condition.

Fix this by adding the PACA_IRQ_HARD_DIS for that case.

Signed-off-by: Nicholas Piggin 

---
This was half tested by modifying power4_idle to work on later CPUs
because I have no G5.

 arch/powerpc/kernel/idle_power4.S | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/idle_power4.S 
b/arch/powerpc/kernel/idle_power4.S
index dd7471fe20bd..a09b3c7ca176 100644
--- a/arch/powerpc/kernel/idle_power4.S
+++ b/arch/powerpc/kernel/idle_power4.S
@@ -32,6 +32,8 @@ END_FTR_SECTION_IFCLR(CPU_FTR_CAN_NAP)
cmpwi   0,r4,0
beqlr
 
+   /* This sequence is similar to prep_irq_for_idle() */
+
/* Hard disable interrupts */
mfmsr   r7
rldicl  r0,r7,48,1
@@ -41,10 +43,15 @@ END_FTR_SECTION_IFCLR(CPU_FTR_CAN_NAP)
/* Check if something happened while soft-disabled */
lbz r0,PACAIRQHAPPENED(r13)
cmpwi   cr0,r0,0
-   bnelr
+   bne-2f
 
-   /* Soft-enable interrupts */
+   /*
+* Soft-enable interrupts. This will make power4_fixup_nap return
+* to our caller with interrupts enabled (soft and hard). The caller
+* can cope with either interrupts disabled or enabled upon return.
+*/
 #ifdef CONFIG_TRACE_IRQFLAGS
+   /* Tell the tracer interrupts are on, because idle responds to them. */
mflrr0
std r0,16(r1)
stdur1,-128(r1)
@@ -73,3 +80,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
isync
b   1b
 
+2: /* Return if an interrupt had happened while soft disabled */
+   /* Set the HARD_DIS flag because interrupts are now hard disabled */
+   ori r0,r0,PACA_IRQ_HARD_DIS
+   stb r0,PACAIRQHAPPENED(r13)
+   blr
-- 
2.17.0



[PATCH v2] powerpc/tm: Print 64-bits MSR

2018-08-07 Thread Breno Leitao
On a kernel TM Bad Thing program exception, the Machine State Register
(MSR) is not being properly displayed. The exception code dumps a 32-bit
value but the MSR is a 64-bit register for all platforms that have HTM
enabled.

This patch dumps the MSR value as a 64-bit value instead of 32 bits. In
order to do so, the 'reason' variable could not be used, since it
trimmed the MSR to 32 bits (int).

Signed-off-by: Breno Leitao 
---
 arch/powerpc/kernel/traps.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 0e17dcb48720..cd561fd89532 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1402,7 +1402,7 @@ void program_check_exception(struct pt_regs *regs)
goto bail;
} else {
printk(KERN_EMERG "Unexpected TM Bad Thing exception "
-  "at %lx (msr 0x%x)\n", regs->nip, reason);
+  "at %lx (msr 0x%lx)\n", regs->nip, regs->msr);
die("Unrecoverable exception", regs, SIGABRT);
}
}
-- 
2.16.3



Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-07 Thread Christoph Hellwig
On Tue, Aug 07, 2018 at 04:42:44PM +1000, Benjamin Herrenschmidt wrote:
> Note that I can make it so that the same DMA ops (basically standard
> swiotlb ops without arch hacks) work for both "direct virtio" and
> "normal PCI" devices.
> 
> The trick is simply in the arch to setup the iommu to map the swiotlb
> bounce buffer pool 1:1 in the iommu, so the iommu essentially can be
> ignored without affecting the physical addresses.
> 
> If I do that, *all* I need is a way, from the guest itself (again, the
> other side doesn't know anything about it), to force virtio to use the
> DMA ops as if there was an iommu, that is, use whatever dma ops were
> setup by the platform for the pci device.

In that case just setting VIRTIO_F_IOMMU_PLATFORM in the flags should
do the work (even if that isn't strictly what the current definition
of the flag actually means).  On the qemu side you'll need to make
sure you have a way to set VIRTIO_F_IOMMU_PLATFORM without emulating
an iommu, but with code to take dma offsets into account if your
platform has any (various power platforms seem to have them, not sure
if it affects your config).


[PATCH v3] selftests/powerpc: Kill child processes on SIGINT

2018-08-07 Thread Breno Leitao
There are some powerpc selftests, such as tm/tm-unavailable, that run for a
long period (>120 seconds), and if one is interrupted, e.g. by pressing
CTRL-C (SIGINT), the foreground process (harness) dies but the child
process and threads continue to execute (with PPID = 1 now) in the
background.

In this case, you'd think the whole test had exited, but there are remaining
threads and processes still executing in the background. Sometimes these
zombie processes do annoying things, such as consuming the whole CPU or
dumping things to STDOUT.

This patch fixes the problem by attaching an empty signal handler to
SIGINT in the harness process. This handler will interrupt (EINTR) the
parent process' waitpid() call, letting the code follow the normal flow,
which will kill all the processes in the child process group.

This patch also fixes a typo.

Signed-off-by: Breno Leitao 
Signed-off-by: Gustavo Romero 
---
 tools/testing/selftests/powerpc/harness.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/powerpc/harness.c 
b/tools/testing/selftests/powerpc/harness.c
index 66d31de60b9a..9d7166dfad1e 100644
--- a/tools/testing/selftests/powerpc/harness.c
+++ b/tools/testing/selftests/powerpc/harness.c
@@ -85,13 +85,13 @@ int run_test(int (test_function)(void), char *name)
return status;
 }
 
-static void alarm_handler(int signum)
+static void sig_handler(int signum)
 {
-   /* Jut wake us up from waitpid */
+   /* Just wake us up from waitpid */
 }
 
-static struct sigaction alarm_action = {
-   .sa_handler = alarm_handler,
+static struct sigaction sig_action = {
+   .sa_handler = sig_handler,
 };
 
 void test_harness_set_timeout(uint64_t time)
@@ -106,8 +106,14 @@ int test_harness(int (test_function)(void), char *name)
test_start(name);
test_set_git_version(GIT_VERSION);
 
-   if (sigaction(SIGALRM, &alarm_action, NULL)) {
-   perror("sigaction");
+   if (sigaction(SIGINT, &sig_action, NULL)) {
+   perror("sigaction (sigint)");
+   test_error(name);
+   return 1;
+   }
+
+   if (sigaction(SIGALRM, &sig_action, NULL)) {
+   perror("sigaction (sigalrm)");
test_error(name);
return 1;
}
-- 
2.16.3



[PATCH v7 0/9] powerpc/pseries: Machine check handler improvements.

2018-08-07 Thread Mahesh J Salgaonkar
This patch series includes some improvements to the Machine check handler
for pseries. Patch 1 fixes a buffer overrun issue if the rtas extended
error log size is greater than RTAS_ERROR_LOG_MAX.
Patch 2 fixes an issue where the machine check handler crashes the
kernel while accessing a vmalloc-ed buffer while in nmi context.
Patch 3 fixes an endian bug while restoring r3 in the MCE handler.
Patch 5 implements a real mode mce handler and flushes the SLBs on SLB error.
Patch 6 displays the MCE error details on the console.
Patch 7 saves and dumps the SLB contents on SLB MCE errors to improve
debuggability.
Patch 8 adds a sysctl knob for recovery action on recovered MCEs.
Patch 9 consolidates mce early real mode handling code.

Change in V7:
- Fold Michal's patch into patch 5
- Handle MSR_RI=0 and evil context case in MC handler in patch 5.
- Patch 7: Print slb cache ptr value and slb cache data.
- Move patch 8 to patch 9.
- Introduce patch 8 to add a sysctl knob for recovery action on recovered MCEs.

Change in V6:
- Introduce patch 8 to consolidate early real mode handling code.
- Address Nick's comment on erroneous hunk.

Change in V5:
- Use min_t instead of max_t.
- Fix an issue reported by kbuild test robot and address review comments.

Change in V4:
- Flush the SLBs in real mode mce handler to handle SLB errors for entry 0.
- Allocate buffers per cpu to hold rtas error log and old slb contents.
- Defer the logging of rtas error log to irq work queue.

Change in V3:
- Moved patch 5 to patch 2

Change in V2:
- patch 3: Display additional info (NIP and task info) in MCE error details.
- patch 5: Fix endian bug while restoring r3 in MCE handler.
---

Mahesh Salgaonkar (9):
  powerpc/pseries: Avoid using the size greater than RTAS_ERROR_LOG_MAX.
  powerpc/pseries: Defer the logging of rtas error to irq work queue.
  powerpc/pseries: Fix endianness while restoring r3 in MCE handler.
  powerpc/pseries: Define MCE error event section.
  powerpc/pseries: flush SLB contents on SLB MCE errors.
  powerpc/pseries: Display machine check error details.
  powerpc/pseries: Dump the SLB contents on SLB MCE errors.
  powerpc/mce: Add sysctl control for recovery action on MCE.
  powernv/pseries: consolidate code for mce early handling.


 arch/powerpc/include/asm/book3s/64/mmu-hash.h |8 +
 arch/powerpc/include/asm/machdep.h|1 
 arch/powerpc/include/asm/mce.h|2 
 arch/powerpc/include/asm/paca.h   |7 +
 arch/powerpc/include/asm/rtas.h   |  116 
 arch/powerpc/kernel/exceptions-64s.S  |   42 
 arch/powerpc/kernel/mce.c |   73 +++-
 arch/powerpc/kernel/traps.c   |3 
 arch/powerpc/mm/slb.c |   79 
 arch/powerpc/platforms/powernv/setup.c|   15 ++
 arch/powerpc/platforms/pseries/pseries.h  |1 
 arch/powerpc/platforms/pseries/ras.c  |  242 +++--
 arch/powerpc/platforms/pseries/setup.c|   27 +++
 13 files changed, 588 insertions(+), 28 deletions(-)

--
Signature



[PATCH v7 1/9] powerpc/pseries: Avoid using the size greater than RTAS_ERROR_LOG_MAX.

2018-08-07 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

The global mce data buffer that is used to copy the rtas error log is
2048 (RTAS_ERROR_LOG_MAX) bytes in size. Before the copy we read
extended_log_length from the rtas error log header, then use the max of
extended_log_length and RTAS_ERROR_LOG_MAX as the size of the data to be
copied. Ideally the platform (phyp) will never send an extended error log
with size > 2048. But if that happens, then we have a risk of buffer
overrun and corruption. Fix this by using min_t instead.

Fixes: d368514c3097 ("powerpc: Fix corruption when grabbing FWNMI data")
Reported-by: Michal Suchanek 
Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/platforms/pseries/ras.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/ras.c 
b/arch/powerpc/platforms/pseries/ras.c
index 5e1ef9150182..ef104144d4bc 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -371,7 +371,7 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct 
pt_regs *regs)
int len, error_log_length;
 
error_log_length = 8 + rtas_error_extended_log_length(h);
-   len = max_t(int, error_log_length, RTAS_ERROR_LOG_MAX);
+   len = min_t(int, error_log_length, RTAS_ERROR_LOG_MAX);
memset(global_mce_data_buf, 0, RTAS_ERROR_LOG_MAX);
memcpy(global_mce_data_buf, h, len);
errhdr = (struct rtas_error_log *)global_mce_data_buf;



[PATCH v7 2/9] powerpc/pseries: Defer the logging of rtas error to irq work queue.

2018-08-07 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

rtas_log_buf is a buffer to hold RTAS event data that is communicated
to the kernel by the hypervisor. This buffer is then used to pass RTAS
event data to user space through proc fs. This buffer is allocated from
the vmalloc (non-linear mapping) area.

On a machine check interrupt, register r3 points to the RTAS extended
event log passed by the hypervisor that contains the MCE event. The
pseries machine check handler then logs this error into rtas_log_buf.
Since rtas_log_buf is a vmalloc-ed (non-linear) buffer, we end up taking
a page fault (vector 0x300) while accessing it. Since the machine check
interrupt handler runs in NMI context we can not afford to take any
page fault. Page faults are not honored in NMI context and cause a
kernel panic. Apart from that, as Nick pointed out, pSeries_log_error()
also takes a spin_lock while logging the error, which is not safe in NMI
context. It may end up in a deadlock if we get another MCE before
releasing the lock. Fix this by deferring the logging of the rtas error
to an irq work queue.

The current implementation uses two different buffers to hold the rtas
error log depending on whether an extended log is provided or not. This
makes it a bit difficult to identify which buffer has valid data that
needs to be logged later in irq work. Simplify this by using a single
buffer, one per paca, and copy the rtas log to it irrespective of
whether an extended log is provided or not. Allocate this buffer below
the RMA region so that it can be accessed in the real mode mce handler.

Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt")
Cc: sta...@vger.kernel.org
Reviewed-by: Nicholas Piggin 
Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/paca.h|3 ++
 arch/powerpc/platforms/pseries/ras.c   |   47 ++--
 arch/powerpc/platforms/pseries/setup.c |   16 +++
 3 files changed, 51 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 6d34bd71139d..7f22929ce915 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -252,6 +252,9 @@ struct paca_struct {
void *rfi_flush_fallback_area;
u64 l1d_flush_size;
 #endif
+#ifdef CONFIG_PPC_PSERIES
+   u8 *mce_data_buf;   /* buffer to hold per cpu rtas errlog */
+#endif /* CONFIG_PPC_PSERIES */
 } cacheline_aligned;
 
 extern void copy_mm_to_paca(struct mm_struct *mm);
diff --git a/arch/powerpc/platforms/pseries/ras.c 
b/arch/powerpc/platforms/pseries/ras.c
index ef104144d4bc..14a46b07ab2f 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -32,11 +33,13 @@
 static unsigned char ras_log_buf[RTAS_ERROR_LOG_MAX];
 static DEFINE_SPINLOCK(ras_log_buf_lock);
 
-static char global_mce_data_buf[RTAS_ERROR_LOG_MAX];
-static DEFINE_PER_CPU(__u64, mce_data_buf);
-
 static int ras_check_exception_token;
 
+static void mce_process_errlog_event(struct irq_work *work);
+static struct irq_work mce_errlog_process_work = {
+   .func = mce_process_errlog_event,
+};
+
 #define EPOW_SENSOR_TOKEN  9
 #define EPOW_SENSOR_INDEX  0
 
@@ -330,16 +333,20 @@ static irqreturn_t ras_error_interrupt(int irq, void 
*dev_id)
A) >= 0x7000) && ((A) < 0x7ff0)) || \
(((A) >= rtas.base) && ((A) < (rtas.base + rtas.size - 16
 
+static inline struct rtas_error_log *fwnmi_get_errlog(void)
+{
+   return (struct rtas_error_log *)local_paca->mce_data_buf;
+}
+
 /*
  * Get the error information for errors coming through the
  * FWNMI vectors.  The pt_regs' r3 will be updated to reflect
  * the actual r3 if possible, and a ptr to the error log entry
  * will be returned if found.
  *
- * If the RTAS error is not of the extended type, then we put it in a per
- * cpu 64bit buffer. If it is the extended type we use global_mce_data_buf.
+ * Use one buffer mce_data_buf per cpu to store RTAS error.
  *
- * The global_mce_data_buf does not have any locks or protection around it,
+ * The mce_data_buf does not have any locks or protection around it,
  * if a second machine check comes in, or a system reset is done
  * before we have logged the error, then we will get corruption in the
  * error log.  This is preferable over holding off on calling
@@ -349,7 +356,7 @@ static irqreturn_t ras_error_interrupt(int irq, void *dev_id)
 static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
 {
unsigned long *savep;
-   struct rtas_error_log *h, *errhdr = NULL;
+   struct rtas_error_log *h;
 
/* Mask top two bits */
regs->gpr[3] &= ~(0x3UL << 62);
@@ -362,22 +369,20 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
savep = __va(regs->gpr[3]);
regs->gpr[3] = savep[0];/* restore original r3 */
 
-   /* If it isn't an extended log we can use the per cpu 64bit buffer */
h = (st

[PATCH v7 3/9] powerpc/pseries: Fix endianness while restoring r3 in MCE handler.

2018-08-07 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

During a Machine Check interrupt on the pseries platform, register r3
points to the RTAS extended event log passed by the hypervisor. Since the
hypervisor uses r3 to pass the pointer to the RTAS log, it stores the
original r3 value at the start of the memory (first 8 bytes) pointed to
by r3. Since the hypervisor stores this info and the RTAS log is in BE
format, Linux should make sure to restore the r3 value in the correct
endian format.

Without this patch, when the MCE handler returns after recovery to the
code that caused the MCE, it may end up with a Data SLB Access interrupt
for an invalid address, followed by a kernel panic or hang.

[   62.878965] Severe Machine check interrupt [Recovered]
[   62.878968]   NIP [dca301b8]: init_module+0x1b8/0x338 [bork_kernel]
[   62.878969]   Initiator: CPU
[   62.878970]   Error type: SLB [Multihit]
[   62.878971] Effective address: dca7
cpu 0xa: Vector: 380 (Data SLB Access) at [c000fc7775b0]
pc: c09694c0: vsnprintf+0x80/0x480
lr: c09698e0: vscnprintf+0x20/0x60
sp: c000fc777830
   msr: 82009033
   dar: a803a30c00d0
  current = 0xcbc9ef00
  paca= 0xc0001eca5c00   softe: 3irq_happened: 0x01
pid   = 8860, comm = insmod
[c000fc7778b0] c09698e0 vscnprintf+0x20/0x60
[c000fc7778e0] c016b6c4 vprintk_emit+0xb4/0x4b0
[c000fc777960] c016d40c vprintk_func+0x5c/0xd0
[c000fc777980] c016cbb4 printk+0x38/0x4c
[c000fc7779a0] dca301c0 init_module+0x1c0/0x338 [bork_kernel]
[c000fc777a40] c000d9c4 do_one_initcall+0x54/0x230
[c000fc777b00] c01b3b74 do_init_module+0x8c/0x248
[c000fc777b90] c01b2478 load_module+0x12b8/0x15b0
[c000fc777d30] c01b29e8 sys_finit_module+0xa8/0x110
[c000fc777e30] c000b204 system_call+0x58/0x6c
--- Exception: c00 (System Call) at 7fff8bda0644
SP (7fffdfbfe980) is in userspace

This patch fixes this issue.

Fixes: a08a53ea4c97 ("powerpc/le: Enable RTAS events support")
Cc: sta...@vger.kernel.org
Reviewed-by: Nicholas Piggin 
Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/platforms/pseries/ras.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index 14a46b07ab2f..851ce326874a 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -367,7 +367,7 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct pt_regs *regs)
}
 
savep = __va(regs->gpr[3]);
-   regs->gpr[3] = savep[0];/* restore original r3 */
+   regs->gpr[3] = be64_to_cpu(savep[0]);   /* restore original r3 */
 
h = (struct rtas_error_log *)&savep[1];
/* Use the per cpu buffer from paca to store rtas error log */



[PATCH v7 4/9] powerpc/pseries: Define MCE error event section.

2018-08-07 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

On pseries, the machine check error details are part of the RTAS extended
event log, passed under the Machine Check exception section. This patch
adds the definition of the RTAS MCE event section and related helper
functions.

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/rtas.h |  111 +++
 1 file changed, 111 insertions(+)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 71e393c46a49..adc677c5e3a4 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -185,6 +185,13 @@ static inline uint8_t rtas_error_disposition(const struct rtas_error_log *elog)
return (elog->byte1 & 0x18) >> 3;
 }
 
+static inline
+void rtas_set_disposition_recovered(struct rtas_error_log *elog)
+{
+   elog->byte1 &= ~0x18;
+   elog->byte1 |= (RTAS_DISP_FULLY_RECOVERED << 3);
+}
+
 static inline uint8_t rtas_error_extended(const struct rtas_error_log *elog)
 {
return (elog->byte1 & 0x04) >> 2;
@@ -275,6 +282,7 @@ inline uint32_t rtas_ext_event_company_id(struct rtas_ext_event_log_v6 *ext_log)
 #define PSERIES_ELOG_SECT_ID_CALL_HOME (('C' << 8) | 'H')
 #define PSERIES_ELOG_SECT_ID_USER_DEF  (('U' << 8) | 'D')
 #define PSERIES_ELOG_SECT_ID_HOTPLUG   (('H' << 8) | 'P')
+#define PSERIES_ELOG_SECT_ID_MCE   (('M' << 8) | 'C')
 
 /* Vendor specific Platform Event Log Format, Version 6, section header */
 struct pseries_errorlog {
@@ -326,6 +334,109 @@ struct pseries_hp_errorlog {
 #define PSERIES_HP_ELOG_ID_DRC_COUNT   3
 #define PSERIES_HP_ELOG_ID_DRC_IC  4
 
+/* RTAS pseries MCE errorlog section */
+#pragma pack(push, 1)
+struct pseries_mc_errorlog {
+   __be32  fru_id;
+   __be32  proc_id;
+   uint8_t error_type;
+   union {
+   struct {
+   uint8_t ue_err_type;
+   /*
+* X        1: Permanent or Transient UE.
+*  X       1: Effective address provided.
+*   X      1: Logical address provided.
+*    XX    2: Reserved.
+*      XXX 3: Type of UE error.
+*/
+   uint8_t reserved_1[6];
+   __be64  effective_address;
+   __be64  logical_address;
+   } ue_error;
+   struct {
+   uint8_t soft_err_type;
+   /*
+* X        1: Effective address provided.
+*  XXXXX   5: Reserved.
+*       XX 2: Type of SLB/ERAT/TLB error.
+*/
+   uint8_t reserved_1[6];
+   __be64  effective_address;
+   uint8_t reserved_2[8];
+   } soft_error;
+   } u;
+};
+#pragma pack(pop)
+
+/* RTAS pseries MCE error types */
+#define PSERIES_MC_ERROR_TYPE_UE   0x00
+#define PSERIES_MC_ERROR_TYPE_SLB  0x01
+#define PSERIES_MC_ERROR_TYPE_ERAT 0x02
+#define PSERIES_MC_ERROR_TYPE_TLB  0x04
+#define PSERIES_MC_ERROR_TYPE_D_CACHE  0x05
+#define PSERIES_MC_ERROR_TYPE_I_CACHE  0x07
+
+/* RTAS pseries MCE error sub types */
+#define PSERIES_MC_ERROR_UE_INDETERMINATE  0
+#define PSERIES_MC_ERROR_UE_IFETCH 1
+#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_IFETCH 2
+#define PSERIES_MC_ERROR_UE_LOAD_STORE 3
+#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_LOAD_STORE 4
+
+#define PSERIES_MC_ERROR_SLB_PARITY0
+#define PSERIES_MC_ERROR_SLB_MULTIHIT  1
+#define PSERIES_MC_ERROR_SLB_INDETERMINATE 2
+
+#define PSERIES_MC_ERROR_ERAT_PARITY   1
+#define PSERIES_MC_ERROR_ERAT_MULTIHIT 2
+#define PSERIES_MC_ERROR_ERAT_INDETERMINATE3
+
+#define PSERIES_MC_ERROR_TLB_PARITY1
+#define PSERIES_MC_ERROR_TLB_MULTIHIT  2
+#define PSERIES_MC_ERROR_TLB_INDETERMINATE 3
+
+static inline uint8_t rtas_mc_error_type(const struct pseries_mc_errorlog *mlog)
+{
+   return mlog->error_type;
+}
+
+static inline uint8_t rtas_mc_error_sub_type(
+   const struct pseries_mc_errorlog *mlog)
+{
+   switch (mlog->error_type) {
+   case PSERIES_MC_ERROR_TYPE_UE:
+   return (mlog->u.ue_error.ue_err_type & 0x07);
+   case PSERIES_MC_ERROR_TYPE_SLB:
+   case PSERIES_MC_ERROR_TYPE_ERAT:
+   case PSERIES_MC_ERROR_TYPE_TLB:
+   return (mlog->u.soft_error.soft_err_type & 0x03);
+   default:
+   return 0;
+   }
+}
+
+static inline uint64_t rtas_mc_get_effective_addr(
+   const struct pseries_mc_errorlog *mlog)
+{
+   uint64_t addr = 0;
+
+   switch (mlog->error_ty

[PATCH v7 5/9] powerpc/pseries: flush SLB contents on SLB MCE errors.

2018-08-07 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

On pseries, as of today the system crashes if we get a machine check
exception due to SLB errors. These are soft errors and can be fixed by
flushing the SLBs, so the kernel can continue to function instead of
crashing. We do this in real mode before turning on the MMU; otherwise
we would run into nested machine checks. This patch now fetches the
RTAS error log in real mode and flushes the SLBs on SLB errors.

Signed-off-by: Mahesh Salgaonkar 
Signed-off-by: Michal Suchanek 
---

Changes in V7:
- Fold Michal's patch into this patch.
- Handle MSR_RI=0 and evil context case in MC handler.
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |1 
 arch/powerpc/include/asm/machdep.h|1 
 arch/powerpc/kernel/exceptions-64s.S  |  112 +
 arch/powerpc/kernel/mce.c |   15 +++
 arch/powerpc/mm/slb.c |6 +
 arch/powerpc/platforms/powernv/setup.c|   11 ++
 arch/powerpc/platforms/pseries/pseries.h  |1 
 arch/powerpc/platforms/pseries/ras.c  |   51 +++
 arch/powerpc/platforms/pseries/setup.c|1 
 9 files changed, 195 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 50ed64fba4ae..cc00a7088cf3 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -487,6 +487,7 @@ extern void hpte_init_native(void);
 
 extern void slb_initialize(void);
 extern void slb_flush_and_rebolt(void);
+extern void slb_flush_and_rebolt_realmode(void);
 
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index a47de82fb8e2..b4831f1338db 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -108,6 +108,7 @@ struct machdep_calls {
 
/* Early exception handlers called in realmode */
int (*hmi_exception_early)(struct pt_regs *regs);
+   long(*machine_check_early)(struct pt_regs *regs);
 
/* Called during machine check exception to retrive fixup address. */
bool(*mce_check_early_recovery)(struct pt_regs *regs);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 285c6465324a..cb06f219570a 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -332,6 +332,9 @@ TRAMP_REAL_BEGIN(machine_check_pSeries)
 machine_check_fwnmi:
SET_SCRATCH0(r13)   /* save r13 */
EXCEPTION_PROLOG_0(PACA_EXMC)
+BEGIN_FTR_SECTION
+   b   machine_check_pSeries_early
+END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
 machine_check_pSeries_0:
EXCEPTION_PROLOG_1(PACA_EXMC, KVMTEST_PR, 0x200)
/*
@@ -343,6 +346,90 @@ machine_check_pSeries_0:
 
 TRAMP_KVM_SKIP(PACA_EXMC, 0x200)
 
+TRAMP_REAL_BEGIN(machine_check_pSeries_early)
+BEGIN_FTR_SECTION
+   EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
+   mr  r10,r1  /* Save r1 */
+   ld  r1,PACAMCEMERGSP(r13)   /* Use MC emergency stack */
+   subir1,r1,INT_FRAME_SIZE/* alloc stack frame*/
+   mfspr   r11,SPRN_SRR0   /* Save SRR0 */
+   mfspr   r12,SPRN_SRR1   /* Save SRR1 */
+   EXCEPTION_PROLOG_COMMON_1()
+   EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
+   EXCEPTION_PROLOG_COMMON_3(0x200)
+   addir3,r1,STACK_FRAME_OVERHEAD
+   BRANCH_LINK_TO_FAR(machine_check_early) /* Function call ABI */
+   ld  r12,_MSR(r1)
+   andi.   r11,r12,MSR_PR  /* See if coming from user. */
+   bne 2f  /* continue in V mode if we are. */
+
+   /*
+* At this point we are not sure what context we came from.
+* We may be in the middle of switching stacks. r1 may not be valid.
+* Hence stay on emergency stack, call machine_check_exception and
+* return from the interrupt.
+* But before that, check if this is an un-recoverable exception.
+* If yes, then stay on emergency stack and panic.
+*/
+   andi.   r11,r12,MSR_RI
+   bne 1f
+
+   /*
+* Check if we have successfully handled/recovered from error, if not
+* then stay on emergency stack and panic.
+*/
+   cmpdi   r3,0/* see if we handled MCE successfully */
+   bne 1f  /* if handled then return from interrupt */
+
+   LOAD_HANDLER(r10,unrecover_mce)
+   mtspr   SPRN_SRR0,r10
+   ld  r10,PACAKMSR(r13)
+   /*
+* We are going down. But there are chances that we might get hit by
+* another MCE during panic path and we may run into unstable state
+* with no way out. Hence, turn ME bit off while going down, so that
+* when another MCE is hit du

[PATCH v7 6/9] powerpc/pseries: Display machine check error details.

2018-08-07 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

Extract the MCE error details from the RTAS extended log and display
them on the console.

With this patch you should now see mce logs like below:

[  142.371818] Severe Machine check interrupt [Recovered]
[  142.371822]   NIP [dca301b8]: init_module+0x1b8/0x338 [bork_kernel]
[  142.371822]   Initiator: CPU
[  142.371823]   Error type: SLB [Multihit]
[  142.371824] Effective address: dca7

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/rtas.h  |5 +
 arch/powerpc/platforms/pseries/ras.c |  132 ++
 2 files changed, 137 insertions(+)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index adc677c5e3a4..9b3c6e06dad1 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -197,6 +197,11 @@ static inline uint8_t rtas_error_extended(const struct rtas_error_log *elog)
return (elog->byte1 & 0x04) >> 2;
 }
 
+static inline uint8_t rtas_error_initiator(const struct rtas_error_log *elog)
+{
+   return (elog->byte2 & 0xf0) >> 4;
+}
+
 #define rtas_error_type(x) ((x)->byte3)
 
 static inline
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index e4420f7c8fda..656b35a42d93 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -427,6 +427,135 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
return 0; /* need to perform reset */
 }
 
+#define VAL_TO_STRING(ar, val) ((val < ARRAY_SIZE(ar)) ? ar[val] : "Unknown")
+
+static void pseries_print_mce_info(struct pt_regs *regs,
+   struct rtas_error_log *errp)
+{
+   const char *level, *sevstr;
+   struct pseries_errorlog *pseries_log;
+   struct pseries_mc_errorlog *mce_log;
+   uint8_t error_type, err_sub_type;
+   uint64_t addr;
+   uint8_t initiator = rtas_error_initiator(errp);
+   int disposition = rtas_error_disposition(errp);
+
+   static const char * const initiators[] = {
+   "Unknown",
+   "CPU",
+   "PCI",
+   "ISA",
+   "Memory",
+   "Power Mgmt",
+   };
+   static const char * const mc_err_types[] = {
+   "UE",
+   "SLB",
+   "ERAT",
+   "TLB",
+   "D-Cache",
+   "Unknown",
+   "I-Cache",
+   };
+   static const char * const mc_ue_types[] = {
+   "Indeterminate",
+   "Instruction fetch",
+   "Page table walk ifetch",
+   "Load/Store",
+   "Page table walk Load/Store",
+   };
+
+   /* SLB sub errors valid values are 0x0, 0x1, 0x2 */
+   static const char * const mc_slb_types[] = {
+   "Parity",
+   "Multihit",
+   "Indeterminate",
+   };
+
+   /* TLB and ERAT sub errors valid values are 0x1, 0x2, 0x3 */
+   static const char * const mc_soft_types[] = {
+   "Unknown",
+   "Parity",
+   "Multihit",
+   "Indeterminate",
+   };
+
+   if (!rtas_error_extended(errp)) {
+   pr_err("Machine check interrupt: Missing extended error log\n");
+   return;
+   }
+
+   pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
+   if (pseries_log == NULL)
+   return;
+
+   mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
+
+   error_type = rtas_mc_error_type(mce_log);
+   err_sub_type = rtas_mc_error_sub_type(mce_log);
+
+   switch (rtas_error_severity(errp)) {
+   case RTAS_SEVERITY_NO_ERROR:
+   level = KERN_INFO;
+   sevstr = "Harmless";
+   break;
+   case RTAS_SEVERITY_WARNING:
+   level = KERN_WARNING;
+   sevstr = "";
+   break;
+   case RTAS_SEVERITY_ERROR:
+   case RTAS_SEVERITY_ERROR_SYNC:
+   level = KERN_ERR;
+   sevstr = "Severe";
+   break;
+   case RTAS_SEVERITY_FATAL:
+   default:
+   level = KERN_ERR;
+   sevstr = "Fatal";
+   break;
+   }
+
+   printk("%s%s Machine check interrupt [%s]\n", level, sevstr,
+   disposition == RTAS_DISP_FULLY_RECOVERED ?
+   "Recovered" : "Not recovered");
+   if (user_mode(regs)) {
+   printk("%s  NIP: [%016lx] PID: %d Comm: %s\n", level,
+   regs->nip, current->pid, current->comm);
+   } else {
+   printk("%s  NIP [%016lx]: %pS\n", level, regs->nip,
+   (void *)regs->nip);
+   }
+   printk("%s  Initiator: %s\n", level,
+   VAL_TO_STRING(initiators, initiator));
+
+   switch (error_type) {
+   case PSERIES_MC_ERROR_TYPE_UE:
+   printk("%s  Error

[PATCH v7 7/9] powerpc/pseries: Dump the SLB contents on SLB MCE errors.

2018-08-07 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

If we get a machine check exception due to SLB errors, then dump the
current SLB contents, which will be very helpful in debugging the root
cause of the SLB errors. Introduce an exclusive per-cpu buffer to hold
the faulty SLB entries. In real mode, the MCE handler saves the old SLB
contents into this buffer, accessible through the paca, and prints them
out later in virtual mode.

With this patch the console will log SLB contents like below on SLB MCE
errors:

[  507.297236] SLB contents of cpu 0x1
[  507.297237] Last SLB entry inserted at slot 16
[  507.297238] 00 c800 400ea1b217000500
[  507.297239]   1T  ESID=   c0  VSID=  ea1b217 LLP:100
[  507.297240] 01 d800 400d43642f000510
[  507.297242]   1T  ESID=   d0  VSID=  d43642f LLP:110
[  507.297243] 11 f800 400a86c85f000500
[  507.297244]   1T  ESID=   f0  VSID=  a86c85f LLP:100
[  507.297245] 12 7f000800 4008119624000d90
[  507.297246]   1T  ESID=   7f  VSID=  8119624 LLP:110
[  507.297247] 13 1800 00092885f5150d90
[  507.297247]  256M ESID=1  VSID=   92885f5150 LLP:110
[  507.297248] 14 01000800 4009e7cb5d90
[  507.297249]   1T  ESID=1  VSID=  9e7cb50 LLP:110
[  507.297250] 15 d800 400d43642f000510
[  507.297251]   1T  ESID=   d0  VSID=  d43642f LLP:110
[  507.297252] 16 d800 400d43642f000510
[  507.297253]   1T  ESID=   d0  VSID=  d43642f LLP:110
[  507.297253] --
[  507.297254] SLB cache ptr value = 3
[  507.297254] Valid SLB cache entries:
[  507.297255] 00 EA[0-35]=7f000
[  507.297256] 01 EA[0-35]=1
[  507.297257] 02 EA[0-35]= 1000
[  507.297257] Rest of SLB cache entries:
[  507.297258] 03 EA[0-35]=7f000
[  507.297258] 04 EA[0-35]=1
[  507.297259] 05 EA[0-35]= 1000
[  507.297260] 06 EA[0-35]=   12
[  507.297260] 07 EA[0-35]=7f000

Suggested-by: Aneesh Kumar K.V 
Suggested-by: Michael Ellerman 
Signed-off-by: Mahesh Salgaonkar 
---

Changes in V7:
- Print slb cache ptr value and slb cache data
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |7 ++
 arch/powerpc/include/asm/paca.h   |4 +
 arch/powerpc/mm/slb.c |   73 +
 arch/powerpc/platforms/pseries/ras.c  |   10 +++
 arch/powerpc/platforms/pseries/setup.c|   10 +++
 5 files changed, 103 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index cc00a7088cf3..5a3fe282076d 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -485,9 +485,16 @@ static inline void hpte_init_pseries(void) { }
 
 extern void hpte_init_native(void);
 
+struct slb_entry {
+   u64 esid;
+   u64 vsid;
+};
+
 extern void slb_initialize(void);
 extern void slb_flush_and_rebolt(void);
 extern void slb_flush_and_rebolt_realmode(void);
+extern void slb_save_contents(struct slb_entry *slb_ptr);
+extern void slb_dump_contents(struct slb_entry *slb_ptr);
 
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 7f22929ce915..233d25ff6f64 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -254,6 +254,10 @@ struct paca_struct {
 #endif
 #ifdef CONFIG_PPC_PSERIES
u8 *mce_data_buf;   /* buffer to hold per cpu rtas errlog */
+
+   /* Capture SLB related old contents in MCE handler. */
+   struct slb_entry *mce_faulty_slbs;
+   u16 slb_save_cache_ptr;
 #endif /* CONFIG_PPC_PSERIES */
 } cacheline_aligned;
 
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index e89f675f1b5e..16a53689ffd4 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -151,6 +151,79 @@ void slb_flush_and_rebolt_realmode(void)
get_paca()->slb_cache_ptr = 0;
 }
 
+void slb_save_contents(struct slb_entry *slb_ptr)
+{
+   int i;
+   unsigned long e, v;
+
+   /* Save slb_cache_ptr value. */
+   get_paca()->slb_save_cache_ptr = get_paca()->slb_cache_ptr;
+
+   if (!slb_ptr)
+   return;
+
+   for (i = 0; i < mmu_slb_size; i++) {
+   asm volatile("slbmfee  %0,%1" : "=r" (e) : "r" (i));
+   asm volatile("slbmfev  %0,%1" : "=r" (v) : "r" (i));
+   slb_ptr->esid = e;
+   slb_ptr->vsid = v;
+   slb_ptr++;
+   }
+}
+
+void slb_dump_contents(struct slb_entry *slb_ptr)
+{
+   int i, n;
+   unsigned long e, v;
+   unsigned long llp;
+
+   if (!slb_ptr)
+   return;
+
+   pr_err("SLB contents of cpu 0x%x\n", smp_processor_id());
+   pr_err("Last SLB entry inserted at slot %lld\n", get_paca()->stab_rr);
+
+   for (i = 0; i < mmu_slb_size; i++) {
+   e = slb

[PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

2018-08-07 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

Introduce a recovery action for recovered memory errors (MCEs). There are
soft memory errors like SLB Multihit, which can be the result of bad
hardware or a software bug. The kernel can easily recover from these soft
errors by flushing the SLB contents. After the recovery the kernel can
still continue to function without any issue. But in some scenarios we may
keep getting these soft errors until the root cause is fixed. To be able
to analyze and find the root cause, the best way is to gather enough data
and system state at the time of the MCE. Hence this patch introduces a
sysctl knob where the user can decide either to continue after recovery or
to panic the kernel to capture a dump. This allows one to configure a
kernel to capture a dump on MCE and then toggle back to recovery while the
dump is being analyzed.

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/mce.h |2 +
 arch/powerpc/kernel/mce.c  |   58 
 arch/powerpc/kernel/traps.c|3 +-
 arch/powerpc/platforms/powernv/setup.c |4 ++
 4 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 3a1226e9b465..d46e1903878d 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -202,6 +202,8 @@ struct mce_error_info {
 #define MCE_EVENT_RELEASE  true
 #define MCE_EVENT_DONTRELEASE  false
 
+extern int recover_on_mce;
+
 extern void save_mce_event(struct pt_regs *regs, long handled,
   struct mce_error_info *mce_err, uint64_t nip,
   uint64_t addr, uint64_t phys_addr);
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index ae17d8aa60c4..5e2ab5cade81 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -631,3 +632,60 @@ long hmi_exception_realmode(struct pt_regs *regs)
 
return 1;
 }
+
+/*
+ * Recovery action for recovered memory errors.
+ *
+ * There are soft memory errors like SLB Multihit, which can be a result of
+ * a bad hardware OR software BUG. Kernel can easily recover from these
+ * soft errors by flushing SLB contents. After the recovery kernel can
+ * still continue to function without any issue. But in some scenarios we
+ * may keep getting these soft errors until the root cause is fixed. To be
+ * able to analyze and find the root cause, the best way is to gather enough
+ * data and system state at the time of MCE. Introduce a sysctl knob where
+ * user can decide either to continue after recovery or panic the kernel
+ * to capture the dump. This will allow one to configure a kernel to capture
+ * dump on MCE and then toggle back to recovery while dump is being analyzed.
+ *
+ * recover_on_mce == 0
+ * panic/crash the kernel to trigger dump capture.
+ *
+ * recover_on_mce == 1
+ * continue after MCE recovery. (no panic)
+ */
+int recover_on_mce;
+
+#ifdef CONFIG_SYSCTL
+/*
+ * Register the sysctl to define memory error recovery action.
+ */
+static struct ctl_table machine_check_ctl_table[] = {
+   {
+   .procname   = "recover_on_mce",
+   .data   = &recover_on_mce,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {}
+};
+
+static struct ctl_table machine_check_sysctl_root[] = {
+   {
+   .procname   = "kernel",
+   .mode   = 0555,
+   .child  = machine_check_ctl_table,
+   },
+   {}
+};
+
+static int __init register_machine_check_sysctl(void)
+{
+   register_sysctl_table(machine_check_sysctl_root);
+
+   return 0;
+}
+__initcall(register_machine_check_sysctl);
+#endif /* CONFIG_SYSCTL */
+
+core_param(recover_on_mce, recover_on_mce, int, 0644);
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 0e17dcb48720..246477c790e8 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -70,6 +70,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC_CORE)
 int (*__debugger)(struct pt_regs *regs) __read_mostly;
@@ -727,7 +728,7 @@ void machine_check_exception(struct pt_regs *regs)
else if (cur_cpu_spec->machine_check)
recover = cur_cpu_spec->machine_check(regs);
 
-   if (recover > 0)
+   if ((recover > 0) && recover_on_mce)
goto bail;
 
if (debugger_fault_handler(regs))
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index b74c93bc2e55..d13278029a94 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "powernv.h"
 
@@ -147,6 +148,9 @@ static void __in

[PATCH v7 9/9] powernv/pseries: consolidate code for mce early handling.

2018-08-07 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

Now that other platforms also implement a real mode MCE handler, let's
consolidate the code by sharing the existing powernv machine check
early code. Rename machine_check_powernv_early to
machine_check_common_early and reuse the code.

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/kernel/exceptions-64s.S |  138 +++---
 1 file changed, 28 insertions(+), 110 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index cb06f219570a..2f85a7baf026 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -243,14 +243,13 @@ EXC_REAL_BEGIN(machine_check, 0x200, 0x100)
SET_SCRATCH0(r13)   /* save r13 */
EXCEPTION_PROLOG_0(PACA_EXMC)
 BEGIN_FTR_SECTION
-   b   machine_check_powernv_early
+   b   machine_check_common_early
 FTR_SECTION_ELSE
b   machine_check_pSeries_0
 ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
 EXC_REAL_END(machine_check, 0x200, 0x100)
 EXC_VIRT_NONE(0x4200, 0x100)
-TRAMP_REAL_BEGIN(machine_check_powernv_early)
-BEGIN_FTR_SECTION
+TRAMP_REAL_BEGIN(machine_check_common_early)
EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
/*
 * Register contents:
@@ -306,7 +305,9 @@ BEGIN_FTR_SECTION
/* Save r9 through r13 from EXMC save area to stack frame. */
EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
mfmsr   r11 /* get MSR value */
+BEGIN_FTR_SECTION
ori r11,r11,MSR_ME  /* turn on ME bit */
+END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
ori r11,r11,MSR_RI  /* turn on RI bit */
LOAD_HANDLER(r12, machine_check_handle_early)
 1: mtspr   SPRN_SRR0,r12
@@ -325,7 +326,6 @@ BEGIN_FTR_SECTION
andcr11,r11,r10 /* Turn off MSR_ME */
b   1b
b   .   /* prevent speculative execution */
-END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
 
 TRAMP_REAL_BEGIN(machine_check_pSeries)
.globl machine_check_fwnmi
@@ -333,7 +333,7 @@ machine_check_fwnmi:
SET_SCRATCH0(r13)   /* save r13 */
EXCEPTION_PROLOG_0(PACA_EXMC)
 BEGIN_FTR_SECTION
-   b   machine_check_pSeries_early
+   b   machine_check_common_early
 END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
 machine_check_pSeries_0:
EXCEPTION_PROLOG_1(PACA_EXMC, KVMTEST_PR, 0x200)
@@ -346,90 +346,6 @@ machine_check_pSeries_0:
 
 TRAMP_KVM_SKIP(PACA_EXMC, 0x200)
 
-TRAMP_REAL_BEGIN(machine_check_pSeries_early)
-BEGIN_FTR_SECTION
-   EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
-   mr  r10,r1  /* Save r1 */
-   ld  r1,PACAMCEMERGSP(r13)   /* Use MC emergency stack */
-   subir1,r1,INT_FRAME_SIZE/* alloc stack frame*/
-   mfspr   r11,SPRN_SRR0   /* Save SRR0 */
-   mfspr   r12,SPRN_SRR1   /* Save SRR1 */
-   EXCEPTION_PROLOG_COMMON_1()
-   EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
-   EXCEPTION_PROLOG_COMMON_3(0x200)
-   addir3,r1,STACK_FRAME_OVERHEAD
-   BRANCH_LINK_TO_FAR(machine_check_early) /* Function call ABI */
-   ld  r12,_MSR(r1)
-   andi.   r11,r12,MSR_PR  /* See if coming from user. */
-   bne 2f  /* continue in V mode if we are. */
-
-   /*
-* At this point we are not sure what context we came from.
-* We may be in the middle of switching stacks. r1 may not be valid.
-* Hence stay on emergency stack, call machine_check_exception and
-* return from the interrupt.
-* But before that, check if this is an un-recoverable exception.
-* If yes, then stay on emergency stack and panic.
-*/
-   andi.   r11,r12,MSR_RI
-   bne 1f
-
-   /*
-* Check if we have successfully handled/recovered from error, if not
-* then stay on emergency stack and panic.
-*/
-   cmpdi   r3,0/* see if we handled MCE successfully */
-   bne 1f  /* if handled then return from interrupt */
-
-   LOAD_HANDLER(r10,unrecover_mce)
-   mtspr   SPRN_SRR0,r10
-   ld  r10,PACAKMSR(r13)
-   /*
-* We are going down. But there are chances that we might get hit by
-* another MCE during panic path and we may run into unstable state
-* with no way out. Hence, turn ME bit off while going down, so that
-* when another MCE is hit during panic path, hypervisor will
-* power cycle the lpar, instead of getting into MCE loop.
-*/
-   li  r3,MSR_ME
-   andcr10,r10,r3  /* Turn off MSR_ME */
-   mtspr   SPRN_SRR1,r10
-   RFI_TO_KERNEL
-   b   .
-
-   /* Stay on emergency stack and return from interrupt. */
-1: LOAD_HANDLER(r10,mce_return)
-   mtspr   SPRN_SRR0,r10
-   ld  r10,PACAKMSR(r13)
-   mtspr   SPRN_SRR1,r10
-   RFI_TO_KERNEL
-  

Re: [PATCH] misc: ibmvsm: Fix wrong assignment of return code

2018-08-07 Thread Bryant G. Ly


On 8/7/18 7:28 AM, Michael Ellerman wrote:

> "Bryant G. Ly"  writes:
>
>> From: "Bryant G. Ly" 
>>
>> Currently the assignment is flipped and rc is always 0.
> If you'd left rc uninitialised at the start of the function the compiler
> would have caught it for you.
>
> And what is the consequence of the bug? Nothing, complete system crash,
> subtle data corruption?

The consequence would be that if the CRQ registration failed the first time
due to not enough resources, it would never reset and try again.

If it fails due to any other error, it would just fail to send the CRQ init
message and then wait for the client to init, which would never happen.

We would also have a memory leak, since in the error case the DMA would
never get unmapped and the message queue would never be freed.

>
> Also this should be tagged:
>
> Fixes: 0eca353e7ae7 ("misc: IBM Virtual Management Channel Driver (VMC)")
>
> cheers
>
Yep, sorry I forgot to add the Fixes:..

-Bryant




Re: [PATCH v2] powerpc/tm: Print 64-bits MSR

2018-08-07 Thread Segher Boessenkool
On Tue, Aug 07, 2018 at 10:35:00AM -0300, Breno Leitao wrote:
> On a kernel TM Bad thing program exception, the Machine State Register
> (MSR) is not being properly displayed. The exception code dumps a 32-bits
> value but MSR is a 64 bits register for all platforms that have HTM
> enabled.
> 
> This patch dumps the MSR value as a 64-bits value instead of 32 bits. In
> order to do so, the 'reason' variable could not be used, since it trimmed
> MSR to 32-bits (int).

So maybe reason should be a long instead of an int?


Segher


Re: [PATCH v7 5/9] powerpc/pseries: flush SLB contents on SLB MCE errors.

2018-08-07 Thread Michal Suchánek
Hello,


On Tue, 07 Aug 2018 19:47:14 +0530
"Mahesh J Salgaonkar"  wrote:

> From: Mahesh Salgaonkar 
> 
> On pseries, as of today system crashes if we get a machine check
> exceptions due to SLB errors. These are soft errors and can be fixed
> by flushing the SLBs so the kernel can continue to function instead of
> system crash. We do this in real mode before turning on MMU. Otherwise
> we would run into nested machine checks. This patch now fetches the
> rtas error log in real mode and flushes the SLBs on SLB errors.
> 
> Signed-off-by: Mahesh Salgaonkar 
> Signed-off-by: Michal Suchanek 
> ---
> 
> Changes in V7:
> - Fold Michal's patch into this patch.
> - Handle MSR_RI=0 and evil context case in MC handler.
> ---
>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |1 
>  arch/powerpc/include/asm/machdep.h|1 
>  arch/powerpc/kernel/exceptions-64s.S  |  112
> +
> arch/powerpc/kernel/mce.c |   15 +++
> arch/powerpc/mm/slb.c |6 +
> arch/powerpc/platforms/powernv/setup.c|   11 ++
> arch/powerpc/platforms/pseries/pseries.h  |1
> arch/powerpc/platforms/pseries/ras.c  |   51 +++
> arch/powerpc/platforms/pseries/setup.c|1 9 files changed,
> 195 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h index
> 50ed64fba4ae..cc00a7088cf3 100644 ---
> a/arch/powerpc/include/asm/book3s/64/mmu-hash.h +++
> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h @@ -487,6 +487,7 @@
> extern void hpte_init_native(void); 
>  extern void slb_initialize(void);
>  extern void slb_flush_and_rebolt(void);
> +extern void slb_flush_and_rebolt_realmode(void);
>  
>  extern void slb_vmalloc_update(void);
>  extern void slb_set_size(u16 size);
> diff --git a/arch/powerpc/include/asm/machdep.h
> b/arch/powerpc/include/asm/machdep.h index a47de82fb8e2..b4831f1338db
> 100644 --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -108,6 +108,7 @@ struct machdep_calls {
>  
>   /* Early exception handlers called in realmode */
>   int (*hmi_exception_early)(struct pt_regs
> *regs);
> + long(*machine_check_early)(struct pt_regs
> *regs); 
>   /* Called during machine check exception to retrive fixup
> address. */ bool  (*mce_check_early_recovery)(struct
> pt_regs *regs); diff --git a/arch/powerpc/kernel/exceptions-64s.S
> b/arch/powerpc/kernel/exceptions-64s.S index
> 285c6465324a..cb06f219570a 100644 ---
> a/arch/powerpc/kernel/exceptions-64s.S +++
> b/arch/powerpc/kernel/exceptions-64s.S @@ -332,6 +332,9 @@
> TRAMP_REAL_BEGIN(machine_check_pSeries) machine_check_fwnmi:
>   SET_SCRATCH0(r13)   /* save r13 */
>   EXCEPTION_PROLOG_0(PACA_EXMC)
> +BEGIN_FTR_SECTION
> + b   machine_check_pSeries_early
> +END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
>  machine_check_pSeries_0:
>   EXCEPTION_PROLOG_1(PACA_EXMC, KVMTEST_PR, 0x200)
>   /*
> @@ -343,6 +346,90 @@ machine_check_pSeries_0:
>  
>  TRAMP_KVM_SKIP(PACA_EXMC, 0x200)
>  
> +TRAMP_REAL_BEGIN(machine_check_pSeries_early)
> +BEGIN_FTR_SECTION
> + EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
> + mr  r10,r1  /* Save r1 */
> + ld  r1,PACAMCEMERGSP(r13)   /* Use MC emergency
> stack */
> + subir1,r1,INT_FRAME_SIZE/* alloc stack
> frame */
> + mfspr   r11,SPRN_SRR0   /* Save SRR0 */
> + mfspr   r12,SPRN_SRR1   /* Save SRR1 */
> + EXCEPTION_PROLOG_COMMON_1()
> + EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
> + EXCEPTION_PROLOG_COMMON_3(0x200)
> + addir3,r1,STACK_FRAME_OVERHEAD
> + BRANCH_LINK_TO_FAR(machine_check_early) /* Function call ABI
> */
> + ld  r12,_MSR(r1)
> + andi.   r11,r12,MSR_PR  /* See if coming
> from user. */
> + bne 2f  /* continue in V mode
> if we are. */ +
> + /*
> +  * At this point we are not sure about what context we come
> from.
> +  * We may be in the middle of swithing stack. r1 may not be
> valid.
> +  * Hence stay on emergency stack, call
> machine_check_exception and
> +  * return from the interrupt.
> +  * But before that, check if this is an un-recoverable
> exception.
> +  * If yes, then stay on emergency stack and panic.
> +  */
> + andi.   r11,r12,MSR_RI
> + bne 1f
> +
> + /*
> +  * Check if we have successfully handled/recovered from
> error, if not
> +  * then stay on emergency stack and panic.
> +  */
> + cmpdi   r3,0/* see if we handled MCE
> successfully */
> + bne 1f  /* if handled then return from
> interrupt */ +
> + LOAD_HANDLER(r10,unrecover_mce)
> + mtspr   SPRN_SRR0,r10
> + ld  r10,PACAKMSR(r13)
> + /*
> +  * We are going down. But there are chances that we 

Re: [PATCH v2] powerpc/tm: Print 64-bits MSR

2018-08-07 Thread Christophe LEROY




On 07/08/2018 at 15:35, Breno Leitao wrote:

On a kernel TM Bad thing program exception, the Machine State Register
(MSR) is not being properly displayed. The exception code dumps a 32-bits
value but MSR is a 64 bits register for all platforms that have HTM
enabled.

This patch dumps the MSR value as a 64-bits value instead of 32 bits. In
order to do so, the 'reason' variable could not be used, since it trimmed
MSR to 32-bits (int).


reason is not always regs->msr, see get_reason(), although in your case 
it is.


I think it would be better to change 'reason' to 'unsigned long' instead 
of replacing it by regs->msr for the printk.


Christophe




Signed-off-by: Breno Leitao 
---
  arch/powerpc/kernel/traps.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 0e17dcb48720..cd561fd89532 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1402,7 +1402,7 @@ void program_check_exception(struct pt_regs *regs)
goto bail;
} else {
printk(KERN_EMERG "Unexpected TM Bad Thing exception "
-  "at %lx (msr 0x%x)\n", regs->nip, reason);
+  "at %lx (msr 0x%lx)\n", regs->nip, regs->msr);
die("Unrecoverable exception", regs, SIGABRT);
}
}



Re: [RFC 3/5] powerpc/mpic: Add support for non-contiguous irq ranges

2018-08-07 Thread Rob Herring
On Fri, Jul 27, 2018 at 03:17:59PM +0530, Bharat Bhushan wrote:
> Freescale MPIC h/w may not support all interrupt sources reported
> by hardware, "last-interrupt-source" or platform. On these platforms
> a misconfigured device tree that assigns one of the reserved
> interrupts leaves a non-functioning system without warning.

There are lots of ways to misconfigure DTs. I don't think this is 
special and needs a property. We've had some interrupt mask or valid 
properties in the past, but generally don't accept those.

> 
> This patch adds a "supported-irq-ranges" property in the device tree to
> provide the range of supported interrupt sources. If a reserved
> interrupt is used then the driver will not program the h/w, which it
> does currently, and will throw a warning.
> 
> Signed-off-by: Bharat Bhushan 
> ---
>  .../devicetree/bindings/powerpc/fsl/mpic.txt   |   8 ++
>  arch/powerpc/include/asm/mpic.h|   9 ++
>  arch/powerpc/sysdev/mpic.c | 113 
> +++--
>  3 files changed, 121 insertions(+), 9 deletions(-)


Re: [PATCH v2] powerpc/tm: Print 64-bits MSR

2018-08-07 Thread Breno Leitao
Hi,

On 08/07/2018 02:15 PM, Christophe LEROY wrote:
> On 07/08/2018 at 15:35, Breno Leitao wrote:
>> On a kernel TM Bad thing program exception, the Machine State Register
>> (MSR) is not being properly displayed. The exception code dumps a 32-bits
>> value but MSR is a 64 bits register for all platforms that have HTM
>> enabled.
>>
>> This patch dumps the MSR value as a 64-bits value instead of 32 bits. In
>> order to do so, the 'reason' variable could not be used, since it trimmed
>> MSR to 32-bits (int).
> 
> reason is not always regs->msr, see get_reason(), although in your case it 
> is.
> 
> I think it would be better to change 'reason' to 'unsigned long' instead of
> replacing it by regs->msr for the printk.

That was my initial approach, but this code seems to run on 32-bit
systems, and I do not want to change the whole 'reason' bit width without
having a 32-bit system to test on, at least.

Also, it is a bit weird doing something as:

printk("(msr 0x%lx)", reason);

I personally think that the follow code is much more readable:

printk(" (msr 0x%lx)...", regs->msr);


Re: [PATCH v2] powerpc/tm: Print 64-bits MSR

2018-08-07 Thread LEROY Christophe

Breno Leitao  a écrit :


Hi,

On 08/07/2018 02:15 PM, Christophe LEROY wrote:

On 07/08/2018 at 15:35, Breno Leitao wrote:

On a kernel TM Bad thing program exception, the Machine State Register
(MSR) is not being properly displayed. The exception code dumps a 32-bits
value but MSR is a 64 bits register for all platforms that have HTM
enabled.

This patch dumps the MSR value as a 64-bits value instead of 32 bits. In
order to do so, the 'reason' variable could not be used, since it trimmed
MSR to 32-bits (int).


reason is not always regs->msr, see get_reason(), although in your  
case it is.


I think it would be better to change 'reason' to 'unsigned long' instead of
replacing it by regs->msr for the printk.


That was my initial approach, but this code seems to run on 32-bit
systems, and I do not want to change the whole 'reason' bit width without
having a 32-bit system to test on, at least.


But 'unsigned long' is still 32 bits on ppc32, so it makes no
difference compared to 'unsigned int'.

And I will test it for you if needed

Christophe



Also, it is a bit weird doing something as:

printk("(msr 0x%lx)", reason);

I personally think that the follow code is much more readable:

printk(" (msr 0x%lx)...", regs->msr);





Re: [PATCH v2 2/2] powerpc/pseries: Wait for completion of hotplug events during PRRN handling

2018-08-07 Thread John Allen

On Wed, Aug 01, 2018 at 11:16:22PM +1000, Michael Ellerman wrote:

John Allen  writes:


On Mon, Jul 23, 2018 at 11:41:24PM +1000, Michael Ellerman wrote:

John Allen  writes:


While handling PRRN events, the time to handle the actual hotplug events
dwarfs the time it takes to perform the device tree updates and queue the
hotplug events. In the case that PRRN events are being queued continuously,
hotplug events have been observed to be queued faster than the kernel can
actually handle them. This patch avoids the problem by waiting for a
hotplug request to complete before queueing more hotplug events.


Have you tested this patch in isolation, ie. not with patch 1?


While I was away on vacation, I believe a build was tested with just 
this patch and not the first and it has been running with no problems.  
However, I think they've had problems recreating the problem in general 
so it may just be that the environment is not set up properly to recreate 
the issue.





So do we need the hotplug work queue at all? Can we just call
handle_dlpar_errorlog() directly?

Or are we using the work queue to serialise things? And if so would a
mutex be better?


Right, the workqueue is meant to serialize all hotplug events and it
gets used for more than just PRRN events. I believe the motivation for
using the workqueue over a mutex is that KVM guests initiate hotplug
events through the hotplug interrupt and can queue fairly large requests
meaning that in this scenario, waiting for a lock would block interrupts
for a while.


OK, but that just means that path needs to schedule work to run later.


Using the workqueue allows us to serialize hotplug events
from different sources in the same way without worrying about the
context in which the event is generated.


A lock would be so much simpler.

It looks like we have three callers of queue_hotplug_event(), the dlpar
code, the mobility code and the ras interrupt.

The dlpar code already waits synchronously:

 init_completion(&hotplug_done);
 queue_hotplug_event(hp_elog, &hotplug_done, &rc);
 wait_for_completion(&hotplug_done);

You're changing mobility to do the same (this patch), leaving only the
ras interrupt that actually queues work and returns.


So it really seems like a mutex would do the trick, and the ras
interrupt would be the only case that needs to schedule work for later.


I think you may be right, but I would need some feedback from Nathan 
Fontenot before I redesign the queue. He's been thinking about that 
design for longer than I have and may know something that I don't 
regarding the reason we're using a workqueue rather than a mutex.


Given that the bug this is meant to address is pretty high priority, 
would you consider the wait_for_completion an acceptable stopgap while a 
more substantial redesign of this code is discussed?


-John



Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-07 Thread Benjamin Herrenschmidt
On Tue, 2018-08-07 at 06:55 -0700, Christoph Hellwig wrote:
> On Tue, Aug 07, 2018 at 04:42:44PM +1000, Benjamin Herrenschmidt wrote:
> > Note that I can make it so that the same DMA ops (basically standard
> > swiotlb ops without arch hacks) work for both "direct virtio" and
> > "normal PCI" devices.
> > 
> > The trick is simply in the arch to setup the iommu to map the swiotlb
> > bounce buffer pool 1:1 in the iommu, so the iommu essentially can be
> > ignored without affecting the physical addresses.
> > 
> > If I do that, *all* I need is a way, from the guest itself (again, the
> > other side doesn't know anything about it), to force virtio to use the
> > DMA ops as if there was an iommu, that is, use whatever dma ops were
> > setup by the platform for the pci device.
> 
> In that case just setting VIRTIO_F_IOMMU_PLATFORM in the flags should
> do the work (even if that isn't strictly what the current definition
> of the flag actually means).  On the qemu side you'll need to make
> sure you have a way to set VIRTIO_F_IOMMU_PLATFORM without emulating
> an iommu, but with code to take dma offsets into account if your
> plaform has any (various power plaforms seem to have them, not sure
> if it affects your config).

Something like that, yes. I prefer a slightly different way (see below),
but in both cases it should alleviate your concerns since it means
there would be no particular mucking around with DMA ops at all, virtio
would just use whatever "normal" ops we establish for all PCI devices
on that platform, which will be standard ones.

(swiotlb ones today and the new "integrated" ones you're cooking
tomorrow).

As for the flag itself, while we could set it from qemu when we get
notified that the guest is going secure, both Michael and I think it's
rather gross: it requires qemu to iterate over all virtio devices and
"poke" something into them.

It also means qemu will need some other internal nasty flag that says
"set that bit but don't do iommu".

It's nicer if we have a way in the guest virtio driver to do something
along the lines of

if ((flags & VIRTIO_F_IOMMU_PLATFORM) || arch_virtio_wants_dma_ops())

Which would have the same effect and means the issue is entirely
contained in the guest.

Cheers,
Ben.




Re: [RFC 3/5] powerpc/mpic: Add support for non-contiguous irq ranges

2018-08-07 Thread Scott Wood
On Tue, 2018-08-07 at 12:09 -0600, Rob Herring wrote:
> On Fri, Jul 27, 2018 at 03:17:59PM +0530, Bharat Bhushan wrote:
> > Freescale MPIC h/w may not support all interrupt sources reported
> > by hardware, "last-interrupt-source" or platform. On these platforms
> > a misconfigured device tree that assigns one of the reserved
> > interrupts leaves a non-functioning system without warning.
> 
> There are lots of ways to misconfigure DTs. I don't think this is 
> special and needs a property.

Yeah, the system will be just as non-functioning if you specify a valid-but-
wrong-for-the-device interrupt number.

>  We've had some interrupt mask or valid 
> properties in the past, but generally don't accept those.

FWIW, some of them like protected-sources and mpic-msgr-receive-mask aren't
for detecting errors, but are for partitioning (though the former is obsolete
with pic-no-reset).

-Scott



Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020

2018-08-07 Thread Scott Wood
On Fri, 2018-07-27 at 15:18 +0530, Bharat Bhushan wrote:
> MPIC on NXP (Freescale) P2020 supports following irq
> ranges:
>   > 0 - 11  (External interrupt)
>   > 16 - 79 (Internal interrupt)
>   > 176 - 183   (Messaging interrupt)
>   > 224 - 231   (Shared message signaled interrupt)

Why don't you convert to the 4-cell interrupt specifiers that make dealing
with these ranges less error-prone?

> diff --git a/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> b/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> index 1006950..49ff348 100644
> --- a/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> +++ b/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> @@ -57,6 +57,11 @@ void __init mpc85xx_rdb_pic_init(void)
>   MPIC_BIG_ENDIAN |
>   MPIC_SINGLE_DEST_CPU,
>   0, 256, " OpenPIC  ");
> + } else if (of_machine_is_compatible("fsl,P2020RDB-PC")) {
> + mpic = mpic_alloc(NULL, 0,
> +   MPIC_BIG_ENDIAN |
> +   MPIC_SINGLE_DEST_CPU,
> +   0, 0, " OpenPIC  ");
>   } else {
>   mpic = mpic_alloc(NULL, 0,
> MPIC_BIG_ENDIAN |

I don't think we want to grow a list of every single revision of every board
in these platform files.

-Scott



[PATCH] powerpc/powernv: Add support for NPU2 relaxed-ordering mode

2018-08-07 Thread Reza Arbab
From: Alistair Popple 

Some device drivers support out of order access to GPU memory. This does
not affect the CPU view of memory but it does affect the GPU view, so it
should only be enabled once the GPU driver has requested it. Add APIs
allowing a driver to do so.

Signed-off-by: Alistair Popple 
[ar...@linux.ibm.com: Rebase, add commit log]
Signed-off-by: Reza Arbab 
---
 arch/powerpc/include/asm/opal-api.h|  4 ++-
 arch/powerpc/include/asm/opal.h|  3 ++
 arch/powerpc/include/asm/powernv.h | 12 
 arch/powerpc/platforms/powernv/npu-dma.c   | 39 ++
 arch/powerpc/platforms/powernv/opal-wrappers.S |  2 ++
 5 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 3bab299..be6fe23e 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -208,7 +208,9 @@
 #define OPAL_SENSOR_READ_U64   162
 #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR   164
 #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR   165
-#define OPAL_LAST  165
+#define OPAL_NPU_SET_RELAXED_ORDER 168
+#define OPAL_NPU_GET_RELAXED_ORDER 169
+#define OPAL_LAST  169
 
 #define QUIESCE_HOLD   1 /* Spin all calls at entry */
 #define QUIESCE_REJECT 2 /* Fail all calls with OPAL_BUSY */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index e1b2910..48bea30 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -43,6 +43,9 @@ int64_t opal_npu_spa_clear_cache(uint64_t phb_id, uint32_t 
bdfn,
uint64_t PE_handle);
 int64_t opal_npu_tl_set(uint64_t phb_id, uint32_t bdfn, long cap,
uint64_t rate_phys, uint32_t size);
+int64_t opal_npu_set_relaxed_order(uint64_t phb_id, uint16_t bdfn,
+  bool request_enabled);
+int64_t opal_npu_get_relaxed_order(uint64_t phb_id, uint16_t bdfn);
 int64_t opal_console_write(int64_t term_number, __be64 *length,
   const uint8_t *buffer);
 int64_t opal_console_read(int64_t term_number, __be64 *length,
diff --git a/arch/powerpc/include/asm/powernv.h 
b/arch/powerpc/include/asm/powernv.h
index 2f3ff7a..874ec6d 100644
--- a/arch/powerpc/include/asm/powernv.h
+++ b/arch/powerpc/include/asm/powernv.h
@@ -22,6 +22,8 @@ extern void pnv_npu2_destroy_context(struct npu_context 
*context,
 extern int pnv_npu2_handle_fault(struct npu_context *context, uintptr_t *ea,
unsigned long *flags, unsigned long *status,
int count);
+int pnv_npu2_request_relaxed_ordering(struct pci_dev *pdev, bool enable);
+int pnv_npu2_get_relaxed_ordering(struct pci_dev *pdev);
 
 void pnv_tm_init(void);
 #else
@@ -39,6 +41,16 @@ static inline int pnv_npu2_handle_fault(struct npu_context 
*context,
return -ENODEV;
 }
 
+static int pnv_npu2_request_relaxed_ordering(struct pci_dev *pdev, bool enable)
+{
+   return -ENODEV;
+}
+
+static int pnv_npu2_get_relaxed_ordering(struct pci_dev *pdev)
+{
+   return -ENODEV;
+}
+
 static inline void pnv_tm_init(void) { }
 static inline void pnv_power9_force_smt4(void) { }
 #endif
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 8cdf91f..038dc1e 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "powernv.h"
 #include "pci.h"
@@ -988,3 +989,41 @@ int pnv_npu2_init(struct pnv_phb *phb)
 
return 0;
 }
+
+/*
+ * Request relaxed ordering be enabled or disabled for the given PCI device.
+ * This function may or may not actually enable relaxed ordering depending on
+ * the exact system configuration. Use pnv_npu2_get_relaxed_ordering() below to
+ * determine the current state of relaxed ordering.
+ */
+int pnv_npu2_request_relaxed_ordering(struct pci_dev *pdev, bool enable)
+{
+   struct pci_controller *hose;
+   struct pnv_phb *phb;
+   int rc;
+
+   hose = pci_bus_to_host(pdev->bus);
+   phb = hose->private_data;
+
+   rc = opal_npu_set_relaxed_order(phb->opal_id,
+   PCI_DEVID(pdev->bus->number, pdev->devfn),
+   enable);
+   if (rc != OPAL_SUCCESS && rc != OPAL_CONSTRAINED)
+   return -EPERM;
+
+   return 0;
+}
+EXPORT_SYMBOL(pnv_npu2_request_relaxed_ordering);
+
+int pnv_npu2_get_relaxed_ordering(struct pci_dev *pdev)
+{
+   struct pci_controller *hose;
+   struct pnv_phb *phb;
+
+   hose = pci_bus_to_host(pdev->bus);
+   phb = hose->private_data;
+
+   return opal_npu_get_relaxed_order(phb->opal_id,
+   PCI_DEVID(pdev->bus->number, pdev->de

RE: [RFC 3/5] powerpc/mpic: Add support for non-contiguous irq ranges

2018-08-07 Thread Bharat Bhushan


> -Original Message-
> From: Scott Wood [mailto:o...@buserror.net]
> Sent: Wednesday, August 8, 2018 2:34 AM
> To: Rob Herring ; Bharat Bhushan
> 
> Cc: b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> ga...@kernel.crashing.org; mark.rutl...@arm.com;
> kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
> ker...@vger.kernel.org; keesc...@chromium.org;
> tyr...@linux.vnet.ibm.com; j...@perches.com
> Subject: Re: [RFC 3/5] powerpc/mpic: Add support for non-contiguous irq
> ranges
> 
> On Tue, 2018-08-07 at 12:09 -0600, Rob Herring wrote:
> > On Fri, Jul 27, 2018 at 03:17:59PM +0530, Bharat Bhushan wrote:
> > > Freescale MPIC h/w may not support all interrupt sources reported by
> > > hardware, "last-interrupt-source" or platform. On these platforms a
> > > misconfigured device tree that assigns one of the reserved
> > > interrupts leaves a non-functioning system without warning.
> >
> > There are lots of ways to misconfigure DTs. I don't think this is
> > special and needs a property.
> 
> Yeah, the system will be just as non-functioning if you specify a valid-but-
> wrong-for-the-device interrupt number.

There is one additional benefit of these changes: the MPIC has reserved
regions for unsupported interrupts, and reads/writes to these reserved
regions seem to have no effect. The MPIC driver reads/writes the reserved
regions during init/uninit and save/restore of state.

Let me know if it makes sense to have these changes for the reasons
mentioned.

Thanks
-Bharat

> 
> >  We've had some interrupt mask or valid properties in the past, but
> > generally don't accept those.
> 
> FWIW, some of them like protected-sources and mpic-msgr-receive-mask
> aren't for detecting errors, but are for partitioning (though the former is
> obsolete with pic-no-reset).
> 
> -Scott



RE: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020

2018-08-07 Thread Bharat Bhushan


> -Original Message-
> From: Scott Wood [mailto:o...@buserror.net]
> Sent: Wednesday, August 8, 2018 2:44 AM
> To: Bharat Bhushan ;
> b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> ga...@kernel.crashing.org; mark.rutl...@arm.com;
> kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
> ker...@vger.kernel.org
> Cc: r...@kernel.org; keesc...@chromium.org; tyr...@linux.vnet.ibm.com;
> j...@perches.com
> Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020
> 
> On Fri, 2018-07-27 at 15:18 +0530, Bharat Bhushan wrote:
> > MPIC on NXP (Freescale) P2020 supports following irq
> > ranges:
> >   > 0 - 11  (External interrupt)
> >   > 16 - 79 (Internal interrupt)
> >   > 176 - 183   (Messaging interrupt)
> >   > 224 - 231   (Shared message signaled interrupt)
> 
> Why don't you convert to the 4-cell interrupt specifiers that make dealing
> with these ranges less error-prone?

Ok, will do if we agree to have this series, as per the comment on the other patch.

> 
> > diff --git a/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> > b/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> > index 1006950..49ff348 100644
> > --- a/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> > +++ b/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> > @@ -57,6 +57,11 @@ void __init mpc85xx_rdb_pic_init(void)
> > MPIC_BIG_ENDIAN |
> > MPIC_SINGLE_DEST_CPU,
> > 0, 256, " OpenPIC  ");
> > +   } else if (of_machine_is_compatible("fsl,P2020RDB-PC")) {
> > +   mpic = mpic_alloc(NULL, 0,
> > + MPIC_BIG_ENDIAN |
> > + MPIC_SINGLE_DEST_CPU,
> > + 0, 0, " OpenPIC  ");
> > } else {
> > mpic = mpic_alloc(NULL, 0,
> >   MPIC_BIG_ENDIAN |
> 
> I don't think we want to grow a list of every single revision of every board 
> in
> these platform files.

One other confusing observation I have is that "irq_count" from platform code 
is given precedence over "last-interrupt-source" in the device tree.
Shouldn't the device tree have precedence? Otherwise there is no point in using 
"last-interrupt-source" if platform code passes "irq_count" to mpic_alloc().

Thanks
-Bharat

> 
> -Scott



[PATCH] powerpc/topology: Check at boot for topology updates

2018-08-07 Thread Srikar Dronamraju
On a shared lpar, Phyp will not update the cpu associativity at boot
time. Just after boot, the system does recognize itself as a shared lpar
and triggers a request for the correct cpu associativity. But by then the
scheduler would have already created/destroyed its sched domains.

This causes
- Broken load balance across Nodes causing islands of cores.
- Performance degradation esp if the system is lightly loaded
- dmesg to wrongly report all cpus to be in Node 0.
- Messages in dmesg saying borken topology.
- With commit: 051f3ca02e46432 ("sched/topology: Introduce NUMA identity
  node sched domain"), can cause rcu stalls at boot up.

From a scheduler maintainer's perspective, moving cpus from one node to
another or creating more numa levels after boot is not appropriate
without some notification to the user space. 
https://lore.kernel.org/lkml/20150406214558.ga38...@linux.vnet.ibm.com/T/#u

The sched_domains_numa_masks table which is used to generate cpumasks is
only created at boot time just before creating sched domains and never
updated.  Hence, its better to get the topology correct before the sched
domains are created.

For example on 64 core Power 8 shared lpar, dmesg reports

[2.088360] Brought up 512 CPUs
[2.088368] Node 0 CPUs: 0-511
[2.088371] Node 1 CPUs:
[2.088373] Node 2 CPUs:
[2.088375] Node 3 CPUs:
[2.088376] Node 4 CPUs:
[2.088378] Node 5 CPUs:
[2.088380] Node 6 CPUs:
[2.088382] Node 7 CPUs:
[2.088386] Node 8 CPUs:
[2.088388] Node 9 CPUs:
[2.088390] Node 10 CPUs:
[2.088392] Node 11 CPUs:
...
[3.916091] BUG: arch topology borken
[3.916103]  the DIE domain not a subset of the NUMA domain
[3.916105] BUG: arch topology borken
[3.916106]  the DIE domain not a subset of the NUMA domain
...

numactl/lscpu output will still be correct with cores spreading across
all nodes.

Socket(s): 64
NUMA node(s):  12
Model: 2.0 (pvr 004d 0200)
Model name:POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type:   para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s):152-159,248-255,344-351,440-447
NUMA node11 CPU(s):160-167,256-263,352-359,448-455

Currently on this lpar, the scheduler detects 2 levels of Numa and
created numa sched domains for all cpus, but it finds a single DIE
domain consisting of all cpus. Hence it deletes all numa sched domains.

To address this, split the topology update init: the first part detects
vphn/prrn soon after cpus are set up, and the second force-updates the
topology just before the scheduler creates sched domains.

With the fix, dmesg reports

[0.491336] numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 
368-375 464-471
[0.491351] numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 
376-383 472-479
[0.491359] numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 
384-391 480-487
[0.491366] numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 
392-399 488-495
[0.491374] numa: Node 4 CPUs: 208-215 304-311 400-407 496-503
[0.491379] numa: Node 5 CPUs: 168-175 264-271 360-367 456-463
[0.491384] numa: Node 6 CPUs: 128-135 224-231 320-327 416-423
[0.491389] numa: Node 7 CPUs: 136-143 232-239 328-335 424-431
[0.491394] numa: Node 8 CPUs: 216-223 312-319 408-415 504-511
[0.491399] numa: Node 9 CPUs: 144-151 240-247 336-343 432-439
[0.491404] numa: Node 10 CPUs: 152-159 248-255 344-351 440-447
[0.491409] numa: Node 11 CPUs: 160-167 256-263 352-359 448-455

and lscpu would also report

Socket(s): 64
NUMA node(s):  12
Model: 2.0 (pvr 004d 0200)
Model name:POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type:   para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node

Re: [PATCH v6 00/11] hugetlb: Factorize hugetlb architecture primitives

2018-08-07 Thread Alex Ghiti

Thanks for your time,

Alex

On 07/08/2018 at 09:54, Ingo Molnar wrote:

* Alexandre Ghiti  wrote:


[CC linux-mm for inclusion in -mm tree]

In order to reduce copy/paste of functions across architectures and then
make riscv hugetlb port (and future ports) simpler and smaller, this
patchset intends to factorize the numerous hugetlb primitives that are
defined across all the architectures.

Except for prepare_hugepage_range, this patchset moves the versions that
are just pass-through to standard pte primitives into
asm-generic/hugetlb.h by using the same #ifdef semantic that can be
found in asm-generic/pgtable.h, i.e. __HAVE_ARCH_***.

s390 architecture has not been tackled in this series since it does not
use asm-generic/hugetlb.h at all.

This patchset has been compiled on all addressed architectures with
success (except for parisc, but the problem does not come from this
series).

v6:
   - Remove nohash/32 and book3s/32 powerpc specific implementations in
     order to use the generic ones.
   - Add all the Reviewed-by, Acked-by and Tested-by in the commits,
     thanks to everyone.

v5:
   As suggested by Mike Kravetz, no need to move the #include for arm
   and x86 architectures, let it live at the top of the file.

v4:
   Fix powerpc build error due to misplacing of #include outside of
   #ifdef CONFIG_HUGETLB_PAGE, as pointed out by Christophe Leroy.

v1, v2, v3:
   Same version, just problems with email provider and misuse of
   --batch-size option of git send-email

Alexandre Ghiti (11):
   hugetlb: Harmonize hugetlb.h arch specific defines with pgtable.h
   hugetlb: Introduce generic version of hugetlb_free_pgd_range
   hugetlb: Introduce generic version of set_huge_pte_at
   hugetlb: Introduce generic version of huge_ptep_get_and_clear
   hugetlb: Introduce generic version of huge_ptep_clear_flush
   hugetlb: Introduce generic version of huge_pte_none
   hugetlb: Introduce generic version of huge_pte_wrprotect
   hugetlb: Introduce generic version of prepare_hugepage_range
   hugetlb: Introduce generic version of huge_ptep_set_wrprotect
   hugetlb: Introduce generic version of huge_ptep_set_access_flags
   hugetlb: Introduce generic version of huge_ptep_get

  arch/arm/include/asm/hugetlb-3level.h| 32 +-
  arch/arm/include/asm/hugetlb.h   | 30 --
  arch/arm64/include/asm/hugetlb.h | 39 +++-
  arch/ia64/include/asm/hugetlb.h  | 47 ++-
  arch/mips/include/asm/hugetlb.h  | 40 +++--
  arch/parisc/include/asm/hugetlb.h| 33 +++
  arch/powerpc/include/asm/book3s/32/pgtable.h |  6 --
  arch/powerpc/include/asm/book3s/64/pgtable.h |  1 +
  arch/powerpc/include/asm/hugetlb.h   | 43 ++
  arch/powerpc/include/asm/nohash/32/pgtable.h |  6 --
  arch/powerpc/include/asm/nohash/64/pgtable.h |  1 +
  arch/sh/include/asm/hugetlb.h| 54 ++---
  arch/sparc/include/asm/hugetlb.h | 40 +++--
  arch/x86/include/asm/hugetlb.h   | 69 --
  include/asm-generic/hugetlb.h| 88 +++-
  15 files changed, 135 insertions(+), 394 deletions(-)

The x86 bits look good to me (assuming it's all tested on all relevant 
architectures, etc.)

Acked-by: Ingo Molnar 

Thanks,

Ingo


RE: [RFC 3/5] powerpc/mpic: Add support for non-contiguous irq ranges

2018-08-07 Thread Bharat Bhushan


> -Original Message-
> From: Scott Wood [mailto:o...@buserror.net]
> Sent: Wednesday, August 8, 2018 11:21 AM
> To: Bharat Bhushan ; Rob Herring
> 
> Cc: b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> ga...@kernel.crashing.org; mark.rutl...@arm.com;
> kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
> ker...@vger.kernel.org; keesc...@chromium.org;
> tyr...@linux.vnet.ibm.com; j...@perches.com
> Subject: Re: [RFC 3/5] powerpc/mpic: Add support for non-contiguous irq
> ranges
> 
> On Wed, 2018-08-08 at 03:37 +, Bharat Bhushan wrote:
> > > -Original Message-
> > > From: Scott Wood [mailto:o...@buserror.net]
> > > Sent: Wednesday, August 8, 2018 2:34 AM
> > > To: Rob Herring ; Bharat Bhushan
> > > 
> > > Cc: b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> > > ga...@kernel.crashing.org; mark.rutl...@arm.com;
> > > kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> > > devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
> > > ker...@vger.kernel.org; keesc...@chromium.org;
> > > tyr...@linux.vnet.ibm.com; j...@perches.com
> > > Subject: Re: [RFC 3/5] powerpc/mpic: Add support for non-contiguous
> > > irq ranges
> > >
> > > On Tue, 2018-08-07 at 12:09 -0600, Rob Herring wrote:
> > > > On Fri, Jul 27, 2018 at 03:17:59PM +0530, Bharat Bhushan wrote:
> > > > > Freescale MPIC h/w may not support all interrupt sources
> > > > > reported by hardware, "last-interrupt-source" or platform. On
> > > > > these platforms a misconfigured device tree that assigns one of
> > > > > the reserved interrupts leaves a non-functioning system without
> warning.
> > > >
> > > > There are lots of ways to misconfigure DTs. I don't think this is
> > > > special and needs a property.
> > >
> > > Yeah, the system will be just as non-functioning if you specify a
> > > valid-but-wrong-for-the-device interrupt number.
> >
> > One additional benefit of this change: the MPIC has reserved regions
> > for unsupported interrupts, and reads/writes to these reserved
> > regions seem to have no effect.
> > The MPIC driver reads/writes the reserved regions during init/uninit
> > and save/restore of state.
> >
> > Let me know if it makes sense to have these changes for the mentioned
> > reasons.
> 
> The driver has been doing this forever with no ill effect.

Yes, there are no issues reported.

>  What is the  motivation for this change?

On the simulation model I see warnings when accessing the reserved region, so
this patch is just an effort to improve that.

Thanks
-Bharat

> 
> -Scott



Re: [RFC 3/5] powerpc/mpic: Add support for non-contiguous irq ranges

2018-08-07 Thread Scott Wood
On Wed, 2018-08-08 at 03:37 +, Bharat Bhushan wrote:
> > -Original Message-
> > From: Scott Wood [mailto:o...@buserror.net]
> > Sent: Wednesday, August 8, 2018 2:34 AM
> > To: Rob Herring ; Bharat Bhushan
> > 
> > Cc: b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> > ga...@kernel.crashing.org; mark.rutl...@arm.com;
> > kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> > devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
> > ker...@vger.kernel.org; keesc...@chromium.org;
> > tyr...@linux.vnet.ibm.com; j...@perches.com
> > Subject: Re: [RFC 3/5] powerpc/mpic: Add support for non-contiguous irq
> > ranges
> > 
> > On Tue, 2018-08-07 at 12:09 -0600, Rob Herring wrote:
> > > On Fri, Jul 27, 2018 at 03:17:59PM +0530, Bharat Bhushan wrote:
> > > > Freescale MPIC h/w may not support all interrupt sources reported by
> > > > hardware, "last-interrupt-source" or platform. On these platforms a
> > > > misconfigured device tree that assigns one of the reserved
> > > > interrupts leaves a non-functioning system without warning.
> > > 
> > > There are lots of ways to misconfigure DTs. I don't think this is
> > > special and needs a property.
> > 
> > Yeah, the system will be just as non-functioning if you specify a
> > valid-but-wrong-for-the-device interrupt number.
> 
> One additional benefit of this change: the MPIC has reserved regions for
> unsupported interrupts, and reads/writes to these reserved regions seem to
> have no effect.
> The MPIC driver reads/writes the reserved regions during init/uninit and
> save/restore of state.
> 
> Let me know if it makes sense to have these changes for the mentioned reasons.

The driver has been doing this forever with no ill effect.  What is the
motivation for this change?

-Scott



Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020

2018-08-07 Thread Scott Wood
On Wed, 2018-08-08 at 03:44 +, Bharat Bhushan wrote:
> > -Original Message-
> > From: Scott Wood [mailto:o...@buserror.net]
> > Sent: Wednesday, August 8, 2018 2:44 AM
> > To: Bharat Bhushan ;
> > b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> > ga...@kernel.crashing.org; mark.rutl...@arm.com;
> > kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> > devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
> > ker...@vger.kernel.org
> > Cc: r...@kernel.org; keesc...@chromium.org; tyr...@linux.vnet.ibm.com;
> > j...@perches.com
> > Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020
> > 
> > On Fri, 2018-07-27 at 15:18 +0530, Bharat Bhushan wrote:
> > > MPIC on NXP (Freescale) P2020 supports following irq
> > > ranges:
> > >   > 0 - 11  (External interrupt)
> > >   > 16 - 79 (Internal interrupt)
> > >   > 176 - 183   (Messaging interrupt)
> > >   > 224 - 231   (Shared message signaled interrupt)
> > 
> > Why don't you convert to the 4-cell interrupt specifiers that make dealing
> > with these ranges less error-prone?
> 
> Ok , will do if we agree to have this series as per comment on other patch.

If you're concerned with errors, this would be a good thing to do regardless.
Actually, it seems that p2020si-post.dtsi already uses 4-cell interrupts.

What is motivating this patchset?  Is there something wrong in the existing
dts files?


> 
> > 
> > > diff --git a/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> > > b/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> > > index 1006950..49ff348 100644
> > > --- a/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> > > +++ b/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> > > @@ -57,6 +57,11 @@ void __init mpc85xx_rdb_pic_init(void)
> > >   MPIC_BIG_ENDIAN |
> > >   MPIC_SINGLE_DEST_CPU,
> > >   0, 256, " OpenPIC  ");
> > > + } else if (of_machine_is_compatible("fsl,P2020RDB-PC")) {
> > > + mpic = mpic_alloc(NULL, 0,
> > > +   MPIC_BIG_ENDIAN |
> > > +   MPIC_SINGLE_DEST_CPU,
> > > +   0, 0, " OpenPIC  ");
> > >   } else {
> > >   mpic = mpic_alloc(NULL, 0,
> > > MPIC_BIG_ENDIAN |
> > 
> > I don't think we want to grow a list of every single revision of every
> > board in
> > these platform files.
> 
> One other confusing observation I have is that "irq_count" from platform
> code is given precedence over "last-interrupt-source" in the device tree.
> Should not the device tree have precedence? Otherwise there is no point
> in using "last-interrupt-source" if platform code passes "irq_count" to
> mpic_alloc().

Maybe, though I don't think it matters much given that last-interrupt-source
was only added to avoid having to pass irq_count in platform code.

-Scott



Re: [RFC PATCH 0/3] New device-tree format and Opal based idle save-restore

2018-08-07 Thread Gautham R Shenoy
Hello Michael,

On Tue, Aug 07, 2018 at 10:15:37PM +1000, Michael Ellerman wrote:
> > Skiboot patch-set for device-tree is posted here :
> > https://patchwork.ozlabs.org/project/skiboot/list/?series=58934
> 
> I don't see a device tree binding documented anywhere?
> 
> There is an existing binding defined for ARM chips, presumably it
> doesn't do everything we need. But are there good reasons why we are not
> using it as a base?
> 
> See: Documentation/devicetree/bindings/arm/idle-states.txt
>

In case of ARM, the idle-states node is a child of cpus node. Each
child of the idle-states node is a node describing that particular
idle state.

idle-states {
entry-method = "psci";

CPU_RETENTION_0_0: cpu-retention-0-0 {
compatible = "arm,idle-state";
arm,psci-suspend-param = <0x001>;
entry-latency-us = <20>;
exit-latency-us = <40>;
min-residency-us = <80>;
status = "disabled";
};


CPU_SLEEP_0_0: cpu-sleep-0-0 {
compatible = "arm,idle-state";
local-timer-stop;
arm,psci-suspend-param = <0x001>;
entry-latency-us = <250>;
exit-latency-us = <500>;
min-residency-us = <950>;
status = "okay";
};

.
.
.
}


Furthermore, each CPU can have a different set of cpu-idle states
due to the asymmetric nature of the processor units on the board.
Thus, there is an additional property for each cpu called
cpu-idle-states which points to the containers of the idle states
themselves.


cpus {
#size-cells = <0>;
#address-cells = <2>;

CPU0: cpu@0 {
device_type = "cpu";
compatible = "arm,cortex-a57";
reg = <0x0 0x0>;
enable-method = "psci";
cpu-idle-states = <&CPU_RETENTION_0_0 &CPU_SLEEP_0_0
   &CLUSTER_RETENTION_0 &CLUSTER_SLEEP_0>;
};

. . .
. . .
. . .
. . .

CPU8: cpu@1 {
device_type = "cpu";
compatible = "arm,cortex-a53";
reg = <0x1 0x0>;
enable-method = "psci";
cpu-idle-states = <&CPU_RETENTION_1_0 &CPU_SLEEP_1_0
   &CLUSTER_RETENTION_1 &CLUSTER_SLEEP_1>;
};


In our case, we already have an "ibm,opal/power-mgt/" node in the
device tree where we have defined the idle states so far. This was the
reason to put the new device tree format under this existing node,
which has been specially earmarked for power-management-related bits,
instead of defining the new format under the cpus node.

Also, in our case, since all the CPU nodes are symmetric they will
have the same set of idle states. Hence, we wouldn't need the
"cpu-idle-states" property for each CPU.

As for the properties of idle states themselves, the only common
things between the ARM idle-states and our case are the compatible,
exit-latency-us, min-residency-us. In addition to this we need the
flags which indicate the nature of the resource loss (Hypervisors
state loss, Timebase loss, etc..) , the psscr_val and the psscr_mask
corresponding to the stop states which the ARM device-tree doesn't
provide.

For this reason we have opted for a new binding, since the overlap
between these two platforms is minimal.

> 
> The way you're using compatible is not really consistent with its
> traditional meaning.
> 
> eg, you have multiple states with:
> 
> compatible = "ibm,state-v1",
> "cpuoffline",
> "opal-supported";
> 
> 
> This would typically mean that all those states are "compatible" with
> some semantics defined by the name "ibm,state-v1". What you're trying to
> say (I think) is that each state is "version 1" of *that state*. And
> only kernels that understand version 1 should use the state.

Ok, I see what you mean here. Perhaps we should have had something
like "ibm,stop0-v1", "ibm,stop1-v2", "ibm,stop2-v2" etc., where
version 1, version 2 etc. pertain to the versions of those specific
states.

Thus a kernel that knows about "version 1" of stop0 and stop2 and
"version 2" of stop1 will end up using only stop0 and stop1, since it
doesn't know "version 2" of stop2.

In such a case, the kernel should fall back to OPAL for stop2. Does this
make sense?

> 
> And "cpuoffline" and "opal-supported" definitely don't belong in
> compatible AFAICS, they should simply be boolean properties of the
> node.

I agree. These should be flags.

> 
> cheers
> 



RE: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020

2018-08-07 Thread Bharat Bhushan


> -Original Message-
> From: Scott Wood [mailto:o...@buserror.net]
> Sent: Wednesday, August 8, 2018 11:26 AM
> To: Bharat Bhushan ;
> b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> ga...@kernel.crashing.org; mark.rutl...@arm.com;
> kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
> ker...@vger.kernel.org
> Cc: r...@kernel.org; keesc...@chromium.org; tyr...@linux.vnet.ibm.com;
> j...@perches.com
> Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020
> 
> On Wed, 2018-08-08 at 03:44 +, Bharat Bhushan wrote:
> > > -Original Message-
> > > From: Scott Wood [mailto:o...@buserror.net]
> > > Sent: Wednesday, August 8, 2018 2:44 AM
> > > To: Bharat Bhushan ;
> > > b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> > > ga...@kernel.crashing.org; mark.rutl...@arm.com;
> > > kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> > > devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
> > > ker...@vger.kernel.org
> > > Cc: r...@kernel.org; keesc...@chromium.org;
> > > tyr...@linux.vnet.ibm.com; j...@perches.com
> > > Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for
> > > P2020
> > >
> > > On Fri, 2018-07-27 at 15:18 +0530, Bharat Bhushan wrote:
> > > > MPIC on NXP (Freescale) P2020 supports following irq
> > > > ranges:
> > > >   > 0 - 11  (External interrupt)
> > > >   > 16 - 79 (Internal interrupt)
> > > >   > 176 - 183   (Messaging interrupt)
> > > >   > 224 - 231   (Shared message signaled interrupt)
> > >
> > > Why don't you convert to the 4-cell interrupt specifiers that make
> > > dealing with these ranges less error-prone?
> >
> > Ok , will do if we agree to have this series as per comment on other patch.
> 
> If you're concerned with errors, this would be a good thing to do regardless.
> Actually, it seems that p2020si-post.dtsi already uses 4-cell interrupts.
> 
> What is motivating this patchset?  Is there something wrong in the existing
> dts files?

There is no error in the device tree. The main motivation is to improve the
code for the following reasons:
  - While studying the code it was found that if a reserved irq number is
used, there is no check in the driver: the irq will be configured as if
correct and the interrupt will never fire.
  - Warnings were observed on the development platform (simulator) on
reads/writes to reserved MPIC regions during init.
  
> 
> 
> >
> > >
> > > > diff --git a/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> > > > b/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> > > > index 1006950..49ff348 100644
> > > > --- a/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> > > > +++ b/arch/powerpc/platforms/85xx/mpc85xx_rdb.c
> > > > @@ -57,6 +57,11 @@ void __init mpc85xx_rdb_pic_init(void)
> > > > MPIC_BIG_ENDIAN |
> > > > MPIC_SINGLE_DEST_CPU,
> > > > 0, 256, " OpenPIC  ");
> > > > +   } else if (of_machine_is_compatible("fsl,P2020RDB-PC")) {
> > > > +   mpic = mpic_alloc(NULL, 0,
> > > > + MPIC_BIG_ENDIAN |
> > > > + MPIC_SINGLE_DEST_CPU,
> > > > + 0, 0, " OpenPIC  ");
> > > > } else {
> > > > mpic = mpic_alloc(NULL, 0,
> > > >   MPIC_BIG_ENDIAN |
> > >
> > > I don't think we want to grow a list of every single revision of
> > > every board in these platform files.
> >
> > One other confusing observation I have is that "irq_count" from
> > platform code is given precedence over "last-interrupt-source" in the
> > device tree.
> > Should not the device tree have precedence? Otherwise there is no
> > point in using "last-interrupt-source" if platform code passes
> > "irq_count" to mpic_alloc().
> 
> Maybe, though I don't think it matters much given that last-interrupt-source
> was only added to avoid having to pass irq_count in platform code.

Thanks for clarifying.

My understanding was that "last-interrupt-source" was added to ensure that we
can override the value passed from platform code. In that case we do not need
to change the code and can control this from the device tree.

Thanks
-Bharat


> 
> -Scott



[PATCH] lib/test_hexdump: fix failure on big endian cpu

2018-08-07 Thread Christophe Leroy
On a big-endian CPU, test_hexdump fails as follows. The logs show
that the bytes are expected in reversed order.

[...]
[   16.643648] test_hexdump: Len: 24 buflen: 130 strlen: 97
[   16.648681] test_hexdump: Result: 97 'be32db7b 0a1893b2 70bac424 7d83349b 
a69c31ad 9c0face9.2.{p..$}.4...1.'
[   16.660951] test_hexdump: Expect: 97 '7bdb32be b293180a 24c4ba70 9b34837d 
ad319ca6 e9ac0f9c.2.{p..$}.4...1.'
[   16.673129] test_hexdump: Len: 8 buflen: 130 strlen: 77
[   16.678113] test_hexdump: Result: 77 'be32db7b0a1893b2   
  .2.{'
[   16.688660] test_hexdump: Expect: 77 'b293180a7bdb32be   
  .2.{'
[   16.699170] test_hexdump: Len: 6 buflen: 131 strlen: 87
[   16.704238] test_hexdump: Result: 87 'be32 db7b 0a18 
  .2.{..'
[   16.715511] test_hexdump: Expect: 87 '32be 7bdb 180a 
  .2.{..'
[   16.726864] test_hexdump: Len: 24 buflen: 131 strlen: 97
[   16.731902] test_hexdump: Result: 97 'be32db7b 0a1893b2 70bac424 7d83349b 
a69c31ad 9c0face9.2.{p..$}.4...1.'
[   16.744175] test_hexdump: Expect: 97 '7bdb32be b293180a 24c4ba70 9b34837d 
ad319ca6 e9ac0f9c.2.{p..$}.4...1.'
[   16.756379] test_hexdump: Len: 32 buflen: 131 strlen: 101
[   16.761507] test_hexdump: Result: 101 'be32db7b0a1893b2 70bac4247d83349b 
a69c31ad9c0face9 4cd1199943b1af0c  .2.{p..$}.4...1.L...C...'
[   16.774212] test_hexdump: Expect: 101 'b293180a7bdb32be 9b34837d24c4ba70 
e9ac0f9cad319ca6 0cafb1439919d14c  .2.{p..$}.4...1.L...C...'
[   16.786763] test_hexdump: failed 801 out of 1184 tests

This patch fixes it.

Fixes: 64d1d77a44697 ("hexdump: introduce test suite")
Signed-off-by: Christophe Leroy 
---
 lib/test_hexdump.c | 28 +++-
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/lib/test_hexdump.c b/lib/test_hexdump.c
index 3f415d8101f3..626f580b4ff7 100644
--- a/lib/test_hexdump.c
+++ b/lib/test_hexdump.c
@@ -18,7 +18,7 @@ static const unsigned char data_b[] = {
 
 static const unsigned char data_a[] = ".2.{p..$}.4...1.L...C...";
 
-static const char * const test_data_1_le[] __initconst = {
+static const char * const test_data_1[] __initconst = {
"be", "32", "db", "7b", "0a", "18", "93", "b2",
"70", "ba", "c4", "24", "7d", "83", "34", "9b",
"a6", "9c", "31", "ad", "9c", "0f", "ac", "e9",
@@ -32,16 +32,33 @@ static const char * const test_data_2_le[] __initconst = {
"d14c", "9919", "b143", "0caf",
 };
 
+static const char * const test_data_2_be[] __initconst = {
+   "be32", "db7b", "0a18", "93b2",
+   "70ba", "c424", "7d83", "349b",
+   "a69c", "31ad", "9c0f", "ace9",
+   "4cd1", "1999", "43b1", "af0c",
+};
+
 static const char * const test_data_4_le[] __initconst = {
"7bdb32be", "b293180a", "24c4ba70", "9b34837d",
"ad319ca6", "e9ac0f9c", "9919d14c", "0cafb143",
 };
 
+static const char * const test_data_4_be[] __initconst = {
+   "be32db7b", "0a1893b2", "70bac424", "7d83349b",
+   "a69c31ad", "9c0face9", "4cd11999", "43b1af0c",
+};
+
 static const char * const test_data_8_le[] __initconst = {
"b293180a7bdb32be", "9b34837d24c4ba70",
"e9ac0f9cad319ca6", "0cafb1439919d14c",
 };
 
+static const char * const test_data_8_be[] __initconst = {
+   "be32db7b0a1893b2", "70bac4247d83349b",
+   "a69c31ad9c0face9", "4cd1199943b1af0c",
+};
+
 #define FILL_CHAR  '#'
 
 static unsigned total_tests __initdata;
@@ -56,6 +73,7 @@ static void __init test_hexdump_prepare_test(size_t len, int 
rowsize,
size_t l = len;
int gs = groupsize, rs = rowsize;
unsigned int i;
+   const bool is_be = IS_ENABLED(CONFIG_CPU_BIG_ENDIAN);
 
if (rs != 16 && rs != 32)
rs = 16;
@@ -67,13 +85,13 @@ static void __init test_hexdump_prepare_test(size_t len, 
int rowsize,
gs = 1;
 
if (gs == 8)
-   result = test_data_8_le;
+   result = is_be ? test_data_8_be : test_data_8_le;
else if (gs == 4)
-   result = test_data_4_le;
+   result = is_be ? test_data_4_be : test_data_4_le;
else if (gs == 2)
-   result = test_data_2_le;
+   result = is_be ? test_data_2_be : test_data_2_le;
else
-   result = test_data_1_le;
+   result = test_data_1;
 
/* hex dump */
p = test;
-- 
2.13.3



Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-07 Thread Christoph Hellwig
On Wed, Aug 08, 2018 at 06:32:45AM +1000, Benjamin Herrenschmidt wrote:
> As for the flag itself, while we could set it from qemu when we get
> notified that the guest is going secure, both Michael and I think it's
> rather gross, it requires qemu to go iterate all virtio devices and
> "poke" something into them.

You don't need to set them at the time you go secure.  You just need to
set the flag from the beginning on any VM you might want to go secure.
Or for simplicity just any VM - if the DT/ACPI tables exposed by
qemu are good enough, that will always exclude an iommu and not set a
DMA offset, so nothing will change on the qemu side of the processing,
and with the new direct calls for the direct dma ops, performance in
the guest won't change either.

> It's nicer if we have a way in the guest virtio driver to do something
> along the lines of
> 
>   if ((flags & VIRTIO_F_IOMMU_PLATFORM) || arch_virtio_wants_dma_ops())
> 
> Which would have the same effect and means the issue is entirely
> contained in the guest.

It would not be the same effect.  The problem with that is that you must
now assume that your qemu knows that, for example, you might be passing
a dma offset if the bus otherwise requires it.  Or in other words:
you potentially break the contract between qemu and the guest of always
passing down physical addresses.  If we explicitly change that contract
through using a flag that says you pass bus addresses, everything is fine.

Note that in practice your scheme will probably just work for your
initial prototype, but chances are it will get us in trouble later on.