Re: [PATCH] powerpc/powernv: Tell OPAL about our MMU mode

2017-06-27 Thread Michael Ellerman
Benjamin Herrenschmidt  writes:

> That will allow OPAL to configure the CPU in an optimal way.
>
> Signed-off-by: Benjamin Herrenschmidt 
> ---
>
> The matching OPAL change has been sent to the skiboot list.
>
> Setting those bits in the reinit() call with an older OPAL
> will result in the call returning an error, which Linux ignores,
> but it will still work in the sense that it honors the other
> flags it understands (the endian switch ones).

My Tuleta disagrees (P8 DD2.1)

Booting with this applied, console output just stops at:

Early memory node ranges
  node   0: [mem 0x-0x0007]
  node   1: [mem 0x0008-0x000f]
  node  16: [mem 0x0010-0x0017]
  node  17: [mem 0x0018-0x001f]
Initmem setup node 0 [mem 0x-0x0007]
On node 0 totalpages: 524288
  DMA zone: 512 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 524288 pages, LIFO batch:1


Which doesn't really make sense. FSP says it's running (runtime).

The end of the OPAL log is below.

I think your patch means we're now calling slw_reinit(), whereas
previously we would skip it?
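
(For context: skiboot's opal_reinit_cpus() handles the endian/HILE flags
itself and only falls back to slw_reinit() for any flags left over, so a
new MMU-mode flag would now reach slw_reinit() where the old flags did
not. A rough editorial paraphrase of that logic -- reconstructed rather
than quoted from skiboot, with the HILE helper name invented for
illustration:)

static int64_t opal_reinit_cpus(uint64_t flags)
{
        int64_t rc = OPAL_UNSUPPORTED;

        if (flags & (OPAL_REINIT_CPUS_HILE_BE | OPAL_REINIT_CPUS_HILE_LE)) {
                bool hile = !!(flags & OPAL_REINIT_CPUS_HILE_LE);

                /* The endian switch is handled here and then masked out... */
                flags &= ~(OPAL_REINIT_CPUS_HILE_BE | OPAL_REINIT_CPUS_HILE_LE);
                set_hile_on_all_cpus(hile);     /* hypothetical helper name */
                rc = OPAL_SUCCESS;
        }

        /* ...so any remaining flags (e.g. new MMU-mode bits) hit slw_reinit() */
        if (flags != 0)
                rc = slw_reinit(flags);

        return rc;
}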

cheers


[   77.369430361,5] SkiBoot skiboot-5.4.5 starting...
...
[ 3657.439295457,7] OPAL: CPU re-init with flags: 0x1
[ 3657.439297714,5] OPAL: Switch to big-endian OS
[ 3657.439331559,6] CPU: Switching HILE on all CPUs to 0
[ 3657.439334329,7] CPU: [0020] HID0 set to 0x
[ 3657.439338421,7] CPU: [0021] HID0 set to 0x
[ 3657.59269,7] CPU: [0022] HID0 set to 0x
[ 3657.449579783,7] CPU: [0023] HID0 set to 0x
[ 3657.454700263,7] CPU: [0024] HID0 set to 0x
[ 3657.459820839,7] CPU: [0025] HID0 set to 0x
[ 3657.464941340,7] CPU: [0026] HID0 set to 0x
[ 3657.470061789,7] CPU: [0027] HID0 set to 0x
[ 3657.475182489,7] CPU: [0028] HID0 set to 0x
[ 3657.480303113,7] CPU: [0029] HID0 set to 0x
[ 3657.485423724,7] CPU: [002a] HID0 set to 0x
[ 3657.490544363,7] CPU: [002b] HID0 set to 0x
[ 3657.495665019,7] CPU: [002c] HID0 set to 0x
[ 3657.500785664,7] CPU: [002d] HID0 set to 0x
[ 3657.505906312,7] CPU: [002e] HID0 set to 0x
[ 3657.511026948,7] CPU: [002f] HID0 set to 0x
[ 3658.004148353,7] CPU: [0060] HID0 set to 0x
[ 3658.009268971,7] CPU: [0061] HID0 set to 0x
[ 3658.014389562,7] CPU: [0062] HID0 set to 0x
[ 3658.019510236,7] CPU: [0063] HID0 set to 0x
[ 3658.024630888,7] CPU: [0064] HID0 set to 0x
[ 3658.029751538,7] CPU: [0065] HID0 set to 0x
[ 3658.034872229,7] CPU: [0066] HID0 set to 0x
[ 3658.039992873,7] CPU: [0067] HID0 set to 0x
[ 3658.045113621,7] CPU: [0068] HID0 set to 0x
[ 3658.050234216,7] CPU: [0069] HID0 set to 0x
[ 3658.055354862,7] CPU: [006a] HID0 set to 0x
[ 3658.060475496,7] CPU: [006b] HID0 set to 0x
[ 3658.065596111,7] CPU: [006c] HID0 set to 0x
[ 3658.070716747,7] CPU: [006d] HID0 set to 0x
[ 3658.075837392,7] CPU: [006e] HID0 set to 0x
[ 3658.080958012,7] CPU: [006f] HID0 set to 0x
[ 3658.086078859,7] CPU: [0070] HID0 set to 0x
[ 3658.091199390,7] CPU: [0071] HID0 set to 0x
[ 3658.096320012,7] CPU: [0072] HID0 set to 0x
[ 3658.101440692,7] CPU: [0073] HID0 set to 0x
[ 3658.106561322,7] CPU: [0074] HID0 set to 0x
[ 3658.111681979,7] CPU: [0075] HID0 set to 0x
[ 3658.116802652,7] CPU: [0076] HID0 set to 0x
[ 3658.121923328,7] CPU: [0077] HID0 set to 0x
[ 3658.127044809,7] CPU: [00a0] HID0 set to 0x
[ 3658.132165386,7] CPU: [00a1] HID0 set to 0x
[ 3658.137286102,7] CPU: [00a2] HID0 set to 0x
[ 3658.142406857,7] CPU: [00a3] HID0 set to 0x
[ 3658.147527622,7] CPU: [00a4] HID0 set to 0x
[ 3658.152648378,7] CPU: [00a5] HID0 set to 0x
[ 3658.157769125,7] CPU: [00a6] HID0 set to 0x
[ 3658.162889853,7] CPU: [00a7] HID0 set to 0x
[ 3658.168010827,7] CPU: [00b0] HID0 set to 0x
[ 3658.173131569,7] CPU: [00b1] HID0 set to 0x
[ 3658.178252302,7] CPU: [00b2] HID0 set to 0x
[ 3658.183373016,7] CPU: [00b3] HID0 set to 0x
[ 3658.188493773,7] CPU: [00b4] 

Re: [PATCH] powernv:idle: Clear r12 on wakeup from stop lite

2017-06-27 Thread Nicholas Piggin
On Wed, 28 Jun 2017 06:46:49 +0530
Akshay Adiga  wrote:

> pnv_wakeup_noloss expects R12 to contain SRR1 value to determine if
> the wakeup reason is an HMI in CHECK_HMI_INTERRUPT.
> 
> When we wakeup with ESL=0, SRR1 will not contain the wakeup reason, so
> there is no point setting R12 to SRR1.
> 
> However, we don't set R12 at all, so R12 contains garbage and is still
> used to check for an HMI on the assumption that it holds SRR1, causing the
> OPAL msglog to be filled with the following print:
>   HMI: Received HMI interrupt: HMER = 0x0040
> 
> This patch clears R12 after waking up from stop with ESL=EC=0, so that
> we don't accidentally enter the HMI handler in pnv_wakeup_noloss if
> R12[42:45] indicates an HMI as the wakeup reason.
> 
> The bug existed prior to commit 9d29250136f6 ("powerpc/64s/idle: Avoid SRR
> usage in idle sleep/wake paths") but was never hit in practice.
> 
> Signed-off-by: Akshay Adiga 
> Fixes: 9d29250136f6 ("powerpc/64s/idle: Avoid SRR usage in idle
> sleep/wake paths")

Thanks guys, appreciate you finding and fixing my bug :)

I think this looks like the best fix. Really minor nitpick but you
could adjust the line widths on the comment slightly (mpe might do
that when merging).

Reviewed-by: Nicholas Piggin 


> ---
>  arch/powerpc/kernel/idle_book3s.S | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/idle_book3s.S 
> b/arch/powerpc/kernel/idle_book3s.S
> index 1ea14b9..34794fd 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -256,6 +256,21 @@ power_enter_stop:
>   bne  .Lhandle_esl_ec_set
>   IDLE_STATE_ENTER_SEQ(PPC_STOP)
>   li  r3,0  /* Since we didn't lose state, return 0 */
> + /*
> +  * pnv_wakeup_noloss expects R12 to contain SRR1 value
> +  * to determine if the wakeup reason is an HMI in
> +  * CHECK_HMI_INTERRUPT.
> +  *
> +  * However, when we wakeup with ESL=0,
> +  * SRR1 will not contain the wakeup reason,
> +  * so there is no point setting R12 to SRR1.
> +  *
> +  * Further, we clear R12 here, so that we
> +  * don't accidentally enter the HMI
> +  * in pnv_wakeup_noloss if the
> +  * R12[42:45] == WAKE_HMI.
> +  */
> + li  r12, 0
>   b   pnv_wakeup_noloss
>  
>  .Lhandle_esl_ec_set:



Re: [PATCH] powerpc/powernv: Rework local TLB flush for boot and MCE on POWER9

2017-06-27 Thread Nicholas Piggin
On Wed, 28 Jun 2017 08:21:55 +0530
"Aneesh Kumar K.V"  wrote:

> Nicholas Piggin  writes:
> 
> > There are two cases outside the normal address space management
> > where a CPU's local TLB is to be flushed:
> >
> >   1. Host boot; in case something has left stale entries in the
> >  TLB (e.g., kexec).
> >
> >   2. Machine check; to clean corrupted TLB entries.
> >
> > CPU state restore from deep idle states also flushes the TLB. However
> > this seems to be a side effect of reusing the boot code to set CPU
> > state, rather than a requirement itself.
> >
> > This type of TLB flush is coded inflexibly, several times for each CPU
> > type, and they have a number of problems with ISA v3.0B:
> >
> > - The current radix mode of the MMU is not taken into account. tlbiel
> >   is undefined if the R field does not match the current radix mode.
> >
> > - ISA v3.0B hash mode should be flushing the partition and process
> >   table caches.
> >
> > - ISA v3.0B radix mode should be flushing partition and process table
> >   caches, and also the page walk cache.
> >
> > To improve this situation, consolidate the flushing code and implement
> > it in C and inline asm under the mm/ directory, and add ISA v3.0B cases
> > for radix and hash.
> >
> > Take it out from early cputable detection hooks, and move it later in
> > the boot process after the MMU registers are set up and before
> > relocation is first turned on.
> >
> > Provide capability for LPID flush to specify radix mode.
> >
> > TLB flush is no longer called when restoring from deep idle states.  
> 
> 
> I am not sure the new location of the TLB flush is correct/perfect. For example,
> maybe we should do it before htab_initialize() so that we start with
> everything flushed? But otherwise
> 
> Reviewed-by: Aneesh Kumar K.V 


Thanks for taking a look over it. The location of the flush is based on
the thinking that:

1. We don't have to flush while MSR IR/DR = 0 because real mode
   translation entries should be correct (if not we have much bigger
   problems). But we must flush before setting IR/DR.

2. We should flush after all setup is done (e.g., all SPRs set) in
   case there is some influence on internal translation structures
   or invalidation.

The conclusion is that we should flush just before turning on MSR IR/DR.
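
As a rough sketch of that ordering (an editorial illustration, not code
from the patch; apart from tlbiel_all() the helper names are made up):

/* Early boot, still running with MSR[IR] = MSR[DR] = 0 (real mode) */
static void __init early_mmu_bringup(void)
{
        /* 1. Finish all MMU/SPR setup first */
        setup_mmu_registers();

        /* 2. Then flush stale translations (e.g. left behind by kexec) */
        tlbiel_all();

        /* 3. Only now turn relocation on for the first time */
        enable_relocation();
}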

If there is something wrong with my assumptions, it would be
important to adjust the patch.

Thanks,
Nick



Re: [PATCH] powerpc/powernv: Rework local TLB flush for boot and MCE on POWER9

2017-06-27 Thread Aneesh Kumar K.V
Nicholas Piggin  writes:

> There are two cases outside the normal address space management
> where a CPU's local TLB is to be flushed:
>
>   1. Host boot; in case something has left stale entries in the
>  TLB (e.g., kexec).
>
>   2. Machine check; to clean corrupted TLB entries.
>
> CPU state restore from deep idle states also flushes the TLB. However
> this seems to be a side effect of reusing the boot code to set CPU
> state, rather than a requirement itself.
>
> This type of TLB flush is coded inflexibly, several times for each CPU
> type, and they have a number of problems with ISA v3.0B:
>
> - The current radix mode of the MMU is not taken into account. tlbiel
>   is undefined if the R field does not match the current radix mode.
>
> - ISA v3.0B hash mode should be flushing the partition and process
>   table caches.
>
> - ISA v3.0B radix mode should be flushing partition and process table
>   caches, and also the page walk cache.
>
> To improve this situation, consolidate the flushing code and implement
> it in C and inline asm under the mm/ directory, and add ISA v3.0B cases
> for radix and hash.
>
> Take it out from early cputable detection hooks, and move it later in
> the boot process after the MMU registers are set up and before
> relocation is first turned on.
>
> Provide capability for LPID flush to specify radix mode.
>
> TLB flush is no longer called when restoring from deep idle states.


I am not sure the new location of the TLB flush is correct/perfect. For example,
maybe we should do it before htab_initialize() so that we start with
everything flushed? But otherwise

Reviewed-by: Aneesh Kumar K.V 


>
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/book3s/64/tlbflush-hash.h |  1 +
>  .../powerpc/include/asm/book3s/64/tlbflush-radix.h |  3 +
>  arch/powerpc/include/asm/book3s/64/tlbflush.h  | 34 +
>  arch/powerpc/include/asm/cputable.h| 12 
>  arch/powerpc/kernel/cpu_setup_power.S  | 43 
>  arch/powerpc/kernel/cputable.c | 14 
>  arch/powerpc/kernel/dt_cpu_ftrs.c  | 42 ---
>  arch/powerpc/kernel/mce_power.c| 61 +---
>  arch/powerpc/kvm/book3s_hv_ras.c   |  6 +-
>  arch/powerpc/mm/hash_native_64.c   | 82 
> ++
>  arch/powerpc/mm/hash_utils_64.c|  4 ++
>  arch/powerpc/mm/pgtable-radix.c|  4 ++
>  arch/powerpc/mm/tlb-radix.c| 57 +++
>  13 files changed, 189 insertions(+), 174 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h 
> b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> index 2f6373144e2c..c02ece27fd7b 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> @@ -50,6 +50,7 @@ static inline void arch_leave_lazy_mmu_mode(void)
>
>  #define arch_flush_lazy_mmu_mode()  do {} while (0)
>
> +extern void hash__tlbiel_all(unsigned int action);
>
>  extern void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize,
>   int ssize, unsigned long flags);
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h 
> b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
> index cc7fbde4f53c..e7b767a3b2fa 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
> @@ -10,6 +10,8 @@ static inline int mmu_get_ap(int psize)
>   return mmu_psize_defs[psize].ap;
>  }
>
> +extern void radix__tlbiel_all(unsigned int action);
> +
>  extern void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma,
>  unsigned long start, unsigned long 
> end);
>  extern void radix__flush_tlb_range_psize(struct mm_struct *mm, unsigned long 
> start,
> @@ -44,4 +46,5 @@ extern void radix__flush_tlb_lpid(unsigned long lpid);
>  extern void radix__flush_tlb_all(void);
>  extern void radix__flush_tlb_pte_p9_dd1(unsigned long old_pte, struct 
> mm_struct *mm,
>   unsigned long address);
> +
>  #endif
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h 
> b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> index 72b925f97bab..a6f3a210d4de 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> @@ -7,6 +7,40 @@
>  #include 
>  #include 
>
> +/* TLB flush actions. Used as argument to tlbiel_all() */
> +enum {
> + TLB_INVAL_SCOPE_GLOBAL = 0, /* invalidate all TLBs */
> + TLB_INVAL_SCOPE_LPID = 1,   /* invalidate TLBs for current LPID */
> +};
> +
> +static inline void tlbiel_all(void)
> +{
> + /*
> +  * This is used for host machine check and bootup.
> +  *
> +  * This could be 

Re: [RFC v4 17/17] procfs: display the protection-key number associated with a vma

2017-06-27 Thread Michael Ellerman
Ram Pai  writes:

> Display the pkey number associated with the vma in smaps of a task.
> The key will be seen as below:
>
> VmFlags: rd wr mr mw me dw ac key=0

Why wouldn't we just emit a "ProtectionKey:" line like x86 does?

See their arch_show_smap().

You should probably also do what x86 does, which is to not display the
key on CPUs that don't support keys.
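
For reference, the x86 hook looks roughly like the sketch below (the
exact upstream code may differ slightly; the OSPKE check and vma_pkey()
usage are assumptions based on how the x86 feature works):

void arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
{
        /* Say nothing on CPUs without protection-key support */
        if (!boot_cpu_has(X86_FEATURE_OSPKE))
                return;

        seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
}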

cheers


Re: [PATCH backport pre-4.9] powerpc/slb: Force a full SLB flush when we insert for a bad EA

2017-06-27 Thread Michael Ellerman
Greg Kroah-Hartman  writes:
> On Thu, Jun 22, 2017 at 04:52:51PM +1000, Michael Ellerman wrote:
>> The SLB miss handler calls slb_allocate_realmode() in order to create an
>> SLB entry for the faulting address. At the very start of that function
>> we check that the faulting Effective Address (EA) is less than
>> PGTABLE_RANGE (ignoring the region), ie. is it an address which could
>> possibly fit in the virtual address space.
...
>> ---
>>  arch/powerpc/mm/slb_low.S | 10 ++
>>  1 file changed, 10 insertions(+)
>> 
>> Note this patch is not upstream. The bug was fixed differently in
>> upstream prior to the bug being identified.
>
> Now applied to 4.4 and 3.18-stable kernels, thanks,

Thanks.

cheers


[PATCH v4 6/6] powerpc/mm: Enable ZONE_DEVICE on powerpc

2017-06-27 Thread Oliver O'Halloran
Flip the switch. Running around and screaming "IT'S ALIVE" is optional,
but recommended.

Signed-off-by: Oliver O'Halloran 
---
v3: Only select when building for 64bit Book3-S
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d556f9557f04..4526c9ba09b6 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -131,6 +131,7 @@ config PPC
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UBSAN_SANITIZE_ALL
+   select ARCH_HAS_ZONE_DEVICE if PPC_BOOK3S_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
-- 
2.9.4



[PATCH v4 5/6] powerpc/mm: Wire up hpte_removebolted for powernv

2017-06-27 Thread Oliver O'Halloran
From: Anton Blanchard 

Adds support for removing bolted mappings (i.e. kernel linear mappings) on
powernv. This is needed to support the memory hot-unplug operations which
are required for the teardown of DAX/PMEM devices.

Reviewed-by: Balbir Singh 
Reviewed-by: Rashmica Gupta 
Signed-off-by: Anton Blanchard 
Signed-off-by: Oliver O'Halloran 
---
v1 -> v2: Fixed the commit author
  Added VM_WARN_ON() if we attempt to remove an unbolted hpte
---
 arch/powerpc/mm/hash_native_64.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index 65bb8f33b399..b534d041cfe8 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -407,6 +407,38 @@ static void native_hpte_updateboltedpp(unsigned long 
newpp, unsigned long ea,
tlbie(vpn, psize, psize, ssize, 0);
 }
 
+/*
+ * Remove a bolted kernel entry. Memory hotplug uses this.
+ *
+ * No need to lock here because we should be the only user.
+ */
+static int native_hpte_removebolted(unsigned long ea, int psize, int ssize)
+{
+   unsigned long vpn;
+   unsigned long vsid;
+   long slot;
+   struct hash_pte *hptep;
+
+   vsid = get_kernel_vsid(ea, ssize);
+   vpn = hpt_vpn(ea, vsid, ssize);
+
+   slot = native_hpte_find(vpn, psize, ssize);
+   if (slot == -1)
+   return -ENOENT;
+
+   hptep = htab_address + slot;
+
+   VM_WARN_ON(!(be64_to_cpu(hptep->v) & HPTE_V_BOLTED));
+
+   /* Invalidate the hpte */
+   hptep->v = 0;
+
+   /* Invalidate the TLB */
+   tlbie(vpn, psize, psize, ssize, 0);
+   return 0;
+}
+
+
 static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
   int bpsize, int apsize, int ssize, int local)
 {
@@ -725,6 +757,7 @@ void __init hpte_init_native(void)
mmu_hash_ops.hpte_invalidate= native_hpte_invalidate;
mmu_hash_ops.hpte_updatepp  = native_hpte_updatepp;
mmu_hash_ops.hpte_updateboltedpp = native_hpte_updateboltedpp;
+   mmu_hash_ops.hpte_removebolted = native_hpte_removebolted;
mmu_hash_ops.hpte_insert= native_hpte_insert;
mmu_hash_ops.hpte_remove= native_hpte_remove;
mmu_hash_ops.hpte_clear_all = native_hpte_clear;
-- 
2.9.4



[PATCH v4 4/6] powerpc/mm: Add devmap support for ppc64

2017-06-27 Thread Oliver O'Halloran
Add support for the devmap bit on PTEs and PMDs for PPC64 Book3S.  This
is used to differentiate device backed memory from transparent huge
pages since they are handled in more or less the same manner by the core
mm code.

Cc: Aneesh Kumar K.V 
Signed-off-by: Oliver O'Halloran 
---
v1 -> v2: Properly differentiate THP and PMD Devmap entries. The
mm core assumes that pmd_trans_huge() and pmd_devmap() are mutually
exclusive and v1 had pmd_trans_huge() being true on a devmap pmd.

v2 -> v3:
Remove setting of _PAGE_SPECIAL in pmd_mkdevmap()
Make pud_pfn() a BUILD_BUG()
Remove unnecessary _PAGE_DEVMAP check in hash__pmd_trans_huge()

v3 -> v4:
Moved pud_pfn() outside #ifdef THP. This is required to work
around a build breakage.
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 45 
 arch/powerpc/include/asm/book3s/64/radix.h   |  2 +-
 arch/powerpc/mm/hugetlbpage.c|  2 +-
 arch/powerpc/mm/pgtable-book3s64.c   |  4 +--
 arch/powerpc/mm/pgtable-hash64.c |  4 ++-
 arch/powerpc/mm/pgtable-radix.c  |  3 +-
 arch/powerpc/mm/pgtable_64.c |  2 +-
 7 files changed, 55 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 85bc9875c3be..c0737c86a362 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -5,6 +5,7 @@
 
 #ifndef __ASSEMBLY__
 #include 
+#include 
 #endif
 
 /*
@@ -79,6 +80,9 @@
 
 #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software dirty tracking 
*/
 #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
+#define _PAGE_DEVMAP   _RPAGE_SW1 /* software: ZONE_DEVICE page */
+#define __HAVE_ARCH_PTE_DEVMAP
+
 /*
  * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE
  * Instead of fixing all of them, add an alternate define which
@@ -599,6 +603,16 @@ static inline pte_t pte_mkhuge(pte_t pte)
return pte;
 }
 
+static inline pte_t pte_mkdevmap(pte_t pte)
+{
+   return __pte(pte_val(pte) | _PAGE_SPECIAL|_PAGE_DEVMAP);
+}
+
+static inline int pte_devmap(pte_t pte)
+{
+   return !!(pte_raw(pte) & cpu_to_be64(_PAGE_DEVMAP));
+}
+
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
/* FIXME!! check whether this need to be a conditional */
@@ -1146,6 +1160,37 @@ static inline bool arch_needs_pgtable_deposit(void)
return true;
 }
 
+
+static inline pmd_t pmd_mkdevmap(pmd_t pmd)
+{
+   return __pmd(pmd_val(pmd) | (_PAGE_PTE | _PAGE_DEVMAP));
+}
+
+static inline int pmd_devmap(pmd_t pmd)
+{
+   return pte_devmap(pmd_pte(pmd));
+}
+
+static inline int pud_devmap(pud_t pud)
+{
+   return 0;
+}
+
+static inline int pgd_devmap(pgd_t pgd)
+{
+   return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+static inline const int pud_pfn(pud_t pud)
+{
+   /*
+* Currently all calls to pud_pfn() are gated around a pud_devmap()
+* check so this should never be used. If it grows another user we
+* want to know about it.
+*/
+   BUILD_BUG();
+   return 0;
+}
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
b/arch/powerpc/include/asm/book3s/64/radix.h
index ac16d1943022..ba43754e96d2 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -252,7 +252,7 @@ static inline int radix__pgd_bad(pgd_t pgd)
 
 static inline int radix__pmd_trans_huge(pmd_t pmd)
 {
-   return !!(pmd_val(pmd) & _PAGE_PTE);
+   return (pmd_val(pmd) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE;
 }
 
 static inline pmd_t radix__pmd_mkhuge(pmd_t pmd)
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index ceeab69cf7fc..5c4645e73cc8 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -945,7 +945,7 @@ pte_t *__find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned 
long ea,
if (pmd_none(pmd))
return NULL;
 
-   if (pmd_trans_huge(pmd)) {
+   if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
if (is_thp)
*is_thp = true;
ret_pte = (pte_t *) pmdp;
diff --git a/arch/powerpc/mm/pgtable-book3s64.c 
b/arch/powerpc/mm/pgtable-book3s64.c
index 5fcb3dd74c13..31eed8fa8e99 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -32,7 +32,7 @@ int pmdp_set_access_flags(struct vm_area_struct *vma, 
unsigned long address,
 {
int changed;
 #ifdef CONFIG_DEBUG_VM
-   WARN_ON(!pmd_trans_huge(*pmdp));
+   WARN_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));

[PATCH v4 3/6] powerpc/vmemmap: Add altmap support

2017-06-27 Thread Oliver O'Halloran
Adds support to powerpc for the altmap feature of ZONE_DEVICE memory. An
altmap is a driver provided region that is used to provide the backing
storage for the struct pages of ZONE_DEVICE memory. In situations where
a large amount of ZONE_DEVICE memory is being added to the system, the
altmap reduces pressure on main system memory by allowing the mm/
metadata to be stored on the device itself rather than in main memory.

Reviewed-by: Balbir Singh 
Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/mm/init_64.c | 15 +--
 arch/powerpc/mm/mem.c | 16 +---
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 8851e4f5dbab..225fbb8034e6 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -171,13 +172,17 @@ int __meminit vmemmap_populate(unsigned long start, 
unsigned long end, int node)
pr_debug("vmemmap_populate %lx..%lx, node %d\n", start, end, node);
 
for (; start < end; start += page_size) {
+   struct vmem_altmap *altmap;
void *p;
int rc;
 
if (vmemmap_populated(start, page_size))
continue;
 
-   p = vmemmap_alloc_block(page_size, node);
+   /* altmap lookups only work at section boundaries */
+   altmap = to_vmem_altmap(SECTION_ALIGN_DOWN(start));
+
+   p =  __vmemmap_alloc_block_buf(page_size, node, altmap);
if (!p)
return -ENOMEM;
 
@@ -242,6 +247,8 @@ void __ref vmemmap_free(unsigned long start, unsigned long 
end)
 
for (; start < end; start += page_size) {
unsigned long nr_pages, addr;
+   struct vmem_altmap *altmap;
+   struct page *section_base;
struct page *page;
 
/*
@@ -257,9 +264,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long 
end)
continue;
 
page = pfn_to_page(addr >> PAGE_SHIFT);
+   section_base = pfn_to_page(vmemmap_section_start(start));
nr_pages = 1 << page_order;
 
-   if (PageReserved(page)) {
+   altmap = to_vmem_altmap((unsigned long) section_base);
+   if (altmap) {
+   vmem_altmap_free(altmap, nr_pages);
+   } else if (PageReserved(page)) {
/* allocated from bootmem */
if (page_size < PAGE_SIZE) {
/*
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 8e9bef964dbf..8541f18694a4 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -151,11 +152,20 @@ int arch_remove_memory(u64 start, u64 size)
 {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
-   struct zone *zone;
+   struct vmem_altmap *altmap;
+   struct page *page;
int ret;
 
-   zone = page_zone(pfn_to_page(start_pfn));
-   ret = __remove_pages(zone, start_pfn, nr_pages);
+   /*
+* If we have an altmap then we need to skip over any reserved PFNs
+* when querying the zone.
+*/
+   page = pfn_to_page(start_pfn);
+   altmap = to_vmem_altmap((unsigned long) page);
+   if (altmap)
+   page += vmem_altmap_offset(altmap);
+
+   ret = __remove_pages(page_zone(page), start_pfn, nr_pages);
if (ret)
return ret;
 
-- 
2.9.4



[PATCH v4 2/6] powerpc/vmemmap: Reshuffle vmemmap_free()

2017-06-27 Thread Oliver O'Halloran
Removes an indentation level and shuffles some code around to make the
following patch cleaner. No functional changes.

Reviewed-by: Balbir Singh 
Signed-off-by: Oliver O'Halloran 
---
v1 -> v2: Remove broken initialiser
---
 arch/powerpc/mm/init_64.c | 48 ---
 1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index ec84b31c6c86..8851e4f5dbab 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -234,13 +234,15 @@ static unsigned long vmemmap_list_free(unsigned long 
start)
 void __ref vmemmap_free(unsigned long start, unsigned long end)
 {
unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+   unsigned long page_order = get_order(page_size);
 
start = _ALIGN_DOWN(start, page_size);
 
pr_debug("vmemmap_free %lx...%lx\n", start, end);
 
for (; start < end; start += page_size) {
-   unsigned long addr;
+   unsigned long nr_pages, addr;
+   struct page *page;
 
/*
 * the section has already be marked as invalid, so
@@ -251,29 +253,29 @@ void __ref vmemmap_free(unsigned long start, unsigned 
long end)
continue;
 
addr = vmemmap_list_free(start);
-   if (addr) {
-   struct page *page = pfn_to_page(addr >> PAGE_SHIFT);
-
-   if (PageReserved(page)) {
-   /* allocated from bootmem */
-   if (page_size < PAGE_SIZE) {
-   /*
-* this shouldn't happen, but if it is
-* the case, leave the memory there
-*/
-   WARN_ON_ONCE(1);
-   } else {
-   unsigned int nr_pages =
-   1 << get_order(page_size);
-   while (nr_pages--)
-   free_reserved_page(page++);
-   }
-   } else
-   free_pages((unsigned long)(__va(addr)),
-   get_order(page_size));
-
-   vmemmap_remove_mapping(start, page_size);
+   if (!addr)
+   continue;
+
+   page = pfn_to_page(addr >> PAGE_SHIFT);
+   nr_pages = 1 << page_order;
+
+   if (PageReserved(page)) {
+   /* allocated from bootmem */
+   if (page_size < PAGE_SIZE) {
+   /*
+* this shouldn't happen, but if it is
+* the case, leave the memory there
+*/
+   WARN_ON_ONCE(1);
+   } else {
+   while (nr_pages--)
+   free_reserved_page(page++);
+   }
+   } else {
+   free_pages((unsigned long)(__va(addr)), page_order);
}
+
+   vmemmap_remove_mapping(start, page_size);
}
 }
 #endif
-- 
2.9.4



[PATCH v4 1/6] mm, x86: Add ARCH_HAS_ZONE_DEVICE to Kconfig

2017-06-27 Thread Oliver O'Halloran
Currently ZONE_DEVICE depends on X86_64, and this will get unwieldy as
new architectures (and platforms) get ZONE_DEVICE support. Move to an
arch-selected Kconfig option to save us the trouble.

Cc: linux...@kvack.org
Acked-by: Ingo Molnar 
Acked-by: Balbir Singh 
Signed-off-by: Oliver O'Halloran 
---
Andrew, the rest of the series should be going in via the ppc tree, but
since there's nothing ppc-specific about this patch, do you want to
take it via mm?
--
v2: Added missing hunk.
---
 arch/x86/Kconfig | 1 +
 mm/Kconfig   | 6 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 37a14d7a4e3f..569e39a8293d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -61,6 +61,7 @@ config X86
select ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_UBSAN_SANITIZE_ALL
+   select ARCH_HAS_ZONE_DEVICE if X86_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/mm/Kconfig b/mm/Kconfig
index 5027cbc251f9..48b1af447fa7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -668,12 +668,16 @@ config IDLE_PAGE_TRACKING
 
  See Documentation/vm/idle_page_tracking.txt for more details.
 
+# arch_add_memory() comprehends device memory
+config ARCH_HAS_ZONE_DEVICE
+   bool
+
 config ZONE_DEVICE
bool "Device memory (pmem, etc...) hotplug support"
depends on MEMORY_HOTPLUG
depends on MEMORY_HOTREMOVE
depends on SPARSEMEM_VMEMMAP
-   depends on X86_64 #arch_add_memory() comprehends device memory
+   depends on ARCH_HAS_ZONE_DEVICE
 
help
  Device memory hotplug support allows for establishing pmem,
-- 
2.9.4



[PATCH] powernv:idle: Clear r12 on wakeup from stop lite

2017-06-27 Thread Akshay Adiga
pnv_wakeup_noloss expects R12 to contain SRR1 value to determine if
the wakeup reason is an HMI in CHECK_HMI_INTERRUPT.

When we wakeup with ESL=0, SRR1 will not contain the wakeup reason, so
there is no point setting R12 to SRR1.

However, we don't set R12 at all, so R12 contains garbage and is still
used to check for an HMI on the assumption that it holds SRR1, causing the
OPAL msglog to be filled with the following print:
HMI: Received HMI interrupt: HMER = 0x0040

This patch clears R12 after waking up from stop with ESL=EC=0, so that
we don't accidentally enter the HMI handler in pnv_wakeup_noloss if
R12[42:45] indicates an HMI as the wakeup reason.

The bug existed prior to commit 9d29250136f6 ("powerpc/64s/idle: Avoid SRR
usage in idle sleep/wake paths") but was never hit in practice.

Signed-off-by: Akshay Adiga 
Fixes: 9d29250136f6 ("powerpc/64s/idle: Avoid SRR usage in idle
sleep/wake paths")
---
 arch/powerpc/kernel/idle_book3s.S | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/powerpc/kernel/idle_book3s.S 
b/arch/powerpc/kernel/idle_book3s.S
index 1ea14b9..34794fd 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -256,6 +256,21 @@ power_enter_stop:
bne  .Lhandle_esl_ec_set
IDLE_STATE_ENTER_SEQ(PPC_STOP)
li  r3,0  /* Since we didn't lose state, return 0 */
+   /*
+* pnv_wakeup_noloss expects R12 to contain SRR1 value
+* to determine if the wakeup reason is an HMI in
+* CHECK_HMI_INTERRUPT.
+*
+* However, when we wakeup with ESL=0,
+* SRR1 will not contain the wakeup reason,
+* so there is no point setting R12 to SRR1.
+*
+* Further, we clear R12 here, so that we
+* don't accidentally enter the HMI
+* in pnv_wakeup_noloss if the
+* R12[42:45] == WAKE_HMI.
+*/
+   li  r12, 0
b   pnv_wakeup_noloss
 
 .Lhandle_esl_ec_set:
-- 
2.5.5
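
(For context on how stale R12 contents get misread: pnv_wakeup_noloss
runs CHECK_HMI_INTERRUPT, which inspects the wake-reason field in bits
42:45 of R12. A rough C rendering of that test, with the encodings
assumed to match arch/powerpc/include/asm/reg.h:)

#define SRR1_WAKEMASK_P8        0x003c0000      /* reason-for-wakeup field, bits 42:45 */
#define SRR1_WAKEHMI            0x00280000      /* woken by a Hypervisor Maintenance Interrupt */

static inline bool wake_was_hmi(unsigned long r12)
{
        /* With garbage in R12 this can spuriously report an HMI */
        return (r12 & SRR1_WAKEMASK_P8) == SRR1_WAKEHMI;
}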



Re: [PATCH] powerpc/nvram: use memdup_user

2017-06-27 Thread Kees Cook
On Fri, Apr 28, 2017 at 6:45 PM, Geliang Tang  wrote:
> Use memdup_user() helper instead of open-coding to simplify the code.
>
> Signed-off-by: Geliang Tang 

Thanks! Applied for -next.

-Kees

> ---
>  arch/powerpc/kernel/nvram_64.c | 14 +-
>  1 file changed, 5 insertions(+), 9 deletions(-)
>
> diff --git a/arch/powerpc/kernel/nvram_64.c b/arch/powerpc/kernel/nvram_64.c
> index eae61b0..496d639 100644
> --- a/arch/powerpc/kernel/nvram_64.c
> +++ b/arch/powerpc/kernel/nvram_64.c
> @@ -792,21 +792,17 @@ static ssize_t dev_nvram_write(struct file *file, const 
> char __user *buf,
> count = min_t(size_t, count, size - *ppos);
> count = min(count, PAGE_SIZE);
>
> -   ret = -ENOMEM;
> -   tmp = kmalloc(count, GFP_KERNEL);
> -   if (!tmp)
> -   goto out;
> -
> -   ret = -EFAULT;
> -   if (copy_from_user(tmp, buf, count))
> +   tmp = memdup_user(buf, count);
> +   if (IS_ERR(tmp)) {
> +   ret = PTR_ERR(tmp);
> goto out;
> +   }
>
> ret = ppc_md.nvram_write(tmp, count, ppos);
>
> -out:
> kfree(tmp);
> +out:
> return ret;
> -
>  }
>
>  static long dev_nvram_ioctl(struct file *file, unsigned int cmd,
> --
> 2.9.3
>



-- 
Kees Cook
Pixel Security


Re: [PATCH 6/7] drm/tilcdc: clean up ifdef hacks around iowrite64

2017-06-27 Thread Arnd Bergmann
On Mon, Jun 26, 2017 at 6:26 PM, Logan Gunthorpe  wrote:
> Hi Jyri,
>
> Thanks for the ack. However, I'm reworking this patch set to use the
> include/linux/io-64-nonatomic* headers which will explicitly devolve
> into two 32-bit transfers. It's not clear whether this is appropriate
> for the tilcdc driver as it was never setup to use 32-bit transfers
> (unlike the others I had patched).
>
> If you think it's ok, I can still patch this driver to use the
> non-atomic headers. Otherwise I can leave it out. Please let me know.

You'd have to first figure out whether this device is of the lo-hi
or the hi-lo variant, or doesn't allow the I/O to be split at all.
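
For reference, the lo-hi flavour of those headers devolves a 64-bit MMIO
write into two 32-bit writes roughly as sketched below (the hi-lo header
does the two writes in the opposite order):

static inline void lo_hi_writeq(u64 val, volatile void __iomem *addr)
{
        writel(val, addr);              /* low 32 bits first... */
        writel(val >> 32, addr + 4);    /* ...then the high 32 bits */
}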

Note that we could theoretically define ARM to use strd/ldrd
for writeq/readq, but I would expect that to be wrong with many
other devices that can use the existing io-64-nonatomic headers.

The comment in set_scanout() suggests that we actually do rely
on the write64 to be atomic, so we probably don't want to change
this driver.

 Arnd


Re: [PATCH v2] fsl/fman: add dependency on HAS_DMA

2017-06-27 Thread David Miller
From: Madalin Bucur 
Date: Mon, 26 Jun 2017 18:47:00 +0300

> A previous commit (5567e989198b5a8d) inserted a dependency on DMA
> API that requires HAS_DMA to be added in Kconfig.
> 
> Signed-off-by: Madalin Bucur 

Applied, thank you.


Re: [PATCH v2] powerpc/powernv: Enable PCI peer-to-peer

2017-06-27 Thread Frederic Barrat



On 27/06/2017 at 14:32, David Laight wrote:

> From: Frederic Barrat
>> Sent: 26 June 2017 19:09
>> P9 has support for PCI peer-to-peer, enabling a device to write in the
>> mmio space of another device directly, without interrupting the CPU.
>>
>> This patch adds support for it on powernv, by adding a new API to be
>> called by drivers. The pnv_pci_set_p2p(...) call configures an
>> 'initiator', i.e the device which will issue the mmio operation, and a
>> 'target', i.e. the device on the receiving side.
> ...
>
> Two questions:
>
> 1) How does the driver get the correct address to give to the 'initiator'
>    in order to perform an access to the 'target'?


That's left out of this patch intentionally. The assumption is that 
there's some handshake happening between the 2 drivers. But that's an 
area where we could work to make it easier in the future.



> 2) Surely the API call the driver makes should be architecture neutral,
>    returning an error on other architectures.


The point of the patch is just to enable it on p9. I've heard of a more 
generic, on-going effort, at the PCI API level, which would be 
cross-arch. But here we just want to allow it for p9 to allow some early 
drivers to take advantage of it if they choose to.
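
For illustration, a driver would use the call described above roughly as
in the sketch below (the exact prototype and flag names are assumptions
based on this thread, not a quote from the patch):

static int enable_p2p_stores(struct pci_dev *initiator, struct pci_dev *target)
{
        /* Allow 'initiator' to issue MMIO stores directly into 'target' */
        return pnv_pci_set_p2p(initiator, target,
                               OPAL_PCI_P2P_ENABLE | OPAL_PCI_P2P_STORE);
}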


  Fred



> At least some x86 cpus also support peer-to-peer writes,
> I believe they can work between cpu chips.
> PCIe bridges might support them (or be configurable to support them).
>
> David





RE: [PATCH] soc/qman: Sleep instead of stuck hacking jiffies.

2017-06-27 Thread Leo Li


> -Original Message-
> From: Linuxppc-dev [mailto:linuxppc-dev-
> bounces+leoli=freescale@lists.ozlabs.org] On Behalf Of David Laight
> Sent: Monday, June 26, 2017 10:55 AM
> To: 'Karim Eshapa' ; o...@buserror.net
> Cc: Roy Pledge ; linux-ker...@vger.kernel.org;
> Claudiu Manoil ; colin.k...@canonical.com;
> linuxppc-dev@lists.ozlabs.org; linux-arm-ker...@lists.infradead.org
> Subject: RE: [PATCH] soc/qman: Sleep instead of stuck hacking jiffies.
> 
> From: Karim Eshapa
> > Sent: 25 June 2017 16:14
> > Using msleep() instead of getting stuck in a
> > long delay will be more efficient.
> ...
> > --- a/drivers/soc/fsl/qbman/qman.c
> > +++ b/drivers/soc/fsl/qbman/qman.c
> > @@ -1084,11 +1084,7 @@ static int drain_mr_fqrni(struct qm_portal *p)
> >  * entries well before the ring has been fully consumed, so
> >  * we're being *really* paranoid here.
> >  */
> > -   u64 now, then = jiffies;
> > -
> > -   do {
> > -   now = jiffies;
> > -   } while ((then + 10000) > now);
> > +   msleep(1);
> ...
> How is that in any way equivalent?
> If HZ is 1000 the old code loops for 10 seconds.
> If HZ is 250 (common for some distros) it loops for 40 seconds.
> 
> Clearly both are horrid, but it isn't at all clear that a 1ms sleep is 
> performing
> the same job.
> 
> My guess is that this code is never called, and broken if actually called.

It was indeed broken.  The intent was to wait for 10000 cycles but it was
mistakenly coded as 10000 jiffies.  I think we chose 1ms as it is not too
long and almost guarantees the 10000-cycle delay.

Regards,
Leo


RE: [v3] drivers:soc:fsl:qbman:qman.c: Sleep instead of stuck hacking jiffies.

2017-06-27 Thread Leo Li


> -Original Message-
> From: Scott Wood [mailto:o...@buserror.net]
> Sent: Saturday, June 24, 2017 9:47 PM
> To: Karim Eshapa 
> Cc: Roy Pledge ; linux-ker...@vger.kernel.org;
> Claudiu Manoil ; colin.k...@canonical.com;
> linuxppc-dev@lists.ozlabs.org; linux-arm-ker...@lists.infradead.org; Leo Li
> 
> Subject: Re: [v3] drivers:soc:fsl:qbman:qman.c: Sleep instead of stuck
> hacking jiffies.
> 
> On Fri, May 05, 2017 at 07:45:18AM +0200, Karim Eshapa wrote:
> > Using msleep() instead of getting stuck in a
> > long delay will be more efficient.
> >
> > Signed-off-by: Karim Eshapa 
> > ---
> >  drivers/soc/fsl/qbman/qman.c | 6 +-
> >  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> Acked-by: Scott Wood 
> 
> (though the subject line should be "soc/qman: ...")
> 
> Leo, are you going to send this patch (and other qman patches) via arm-soc?

Yes.  I can take it through the pull request for soc/fsl via arm-soc.  As
mentioned in the feedback from David in another email, we should probably
update the comment and commit message to explain how 10000 cycles becomes 1ms.

Regards,
Leo


Re: [next-20170609] Oops while running CPU off-on (cpuset.c/cpuset_can_attach)

2017-06-27 Thread Tejun Heo
Hello, Abdul.

Sorry about the long delay.

On Mon, Jun 12, 2017 at 04:53:42PM +0530, Abdul Haleem wrote:
> linux-next kernel crashed while running CPU offline and online.
> 
> Machine: Power 8 LPAR
> Kernel : 4.12.0-rc4-next-20170609
> gcc : version 5.2.1
> config: attached
> testcase: CPU off/on
> 
> for i in $(seq 100);do 
> for j in $(seq 0 15);do 
> echo 0 >  /sys/devices/system/cpu/cpu$j/online
> sleep 5
> echo 1 > /sys/devices/system/cpu/cpu$j/online
> done
> done
> 
...
> NIP [c01d6868] cpuset_can_attach+0x58/0x1b0

Can you please map this to the source line?

> LR [c01d6858] cpuset_can_attach+0x48/0x1b0
> Call Trace:
> [cc72b9a0] [c01d6858] cpuset_can_attach+0x48/0x1b0
> (unreliable)
> [cc72ba00] [c01cbe80] cgroup_migrate_execute+0xb0/0x450
> [cc72ba80] [c01d3754] cgroup_transfer_tasks+0x1c4/0x360
> [cc72bba0] [c01d923c] cpuset_hotplug_workfn+0x86c/0xa20
> [cc72bca0] [c011aa44] process_one_work+0x1e4/0x580
> [cc72bd30] [c011ae78] worker_thread+0x98/0x5c0
> [cc72bdc0] [c0124058] kthread+0x168/0x1b0
> [cc72be30] [c000b2e8] ret_from_kernel_thread+0x5c/0x74
> Instruction dump:
> f821ffa1 7c7d1b78 6000 6000 38810020 7fa3eb78 3f42ffed 4bff4c25 
> 6000 3b5a0448 3d420020 eb610020  7f43d378 e929
> f92af200 

Thanks.

-- 
tejun


Re: [RFC v4 09/17] powerpc: call the hash functions with the correct pkey value

2017-06-27 Thread Aneesh Kumar K.V



On Tuesday 27 June 2017 03:41 PM, Ram Pai wrote:

Pass the correct protection key value to the hash functions on
page fault.

Signed-off-by: Ram Pai 
---
  arch/powerpc/include/asm/pkeys.h | 11 +++
  arch/powerpc/mm/hash_utils_64.c  |  4 
  arch/powerpc/mm/mem.c|  6 ++
  3 files changed, 21 insertions(+)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index ef1c601..1370b3f 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -74,6 +74,17 @@ static inline bool mm_pkey_is_allocated(struct mm_struct 
*mm, int pkey)
  }

  /*
+ * return the protection key of the vma corresponding to the
+ * given effective address @ea.
+ */
+static inline int mm_pkey(struct mm_struct *mm, unsigned long ea)
+{
+   struct vm_area_struct *vma = find_vma(mm, ea);
+   int pkey = vma ? vma_pkey(vma) : 0;
+   return pkey;
+}
+
+/*



That is not going to work in the hash fault path, right? We can't do a
find_vma() there without holding the mmap_sem.


-aneesh



[PATCH] powerpc: conditionally compile platform-specific serial drivers

2017-06-27 Thread Hannes Reinecke
mpsc.c and mpc52xx-psc.c are platform-specific serial drivers, and
should be compiled for the respective platforms only.

Signed-off-by: Hannes Reinecke 
---
 arch/powerpc/boot/Makefile | 7 ---
 arch/powerpc/boot/serial.c | 4 
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index e82f333..1a4609c 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -99,14 +99,15 @@ src-wlib-y := string.S crt0.S crtsavres.S stdio.c 
decompress.c main.c \
$(libfdt) libfdt-wrapper.c \
ns16550.c serial.c simple_alloc.c div64.S util.S \
elf_util.c $(zlib-y) devtree.c stdlib.c \
-   oflib.c ofconsole.c cuboot.c mpsc.c cpm-serial.c \
-   uartlite.c mpc52xx-psc.c opal.c
+   oflib.c ofconsole.c cuboot.c cpm-serial.c \
+   uartlite.c opal.c
+src-wlib-$(CONFIG_PPC_MPC52XX) += mpc52xx-psc.c
 src-wlib-$(CONFIG_PPC64_BOOT_WRAPPER) +=  opal-calls.S
 src-wlib-$(CONFIG_40x) += 4xx.c planetcore.c
 src-wlib-$(CONFIG_44x) += 4xx.c ebony.c bamboo.c
 src-wlib-$(CONFIG_8xx) += mpc8xx.c planetcore.c fsl-soc.c
 src-wlib-$(CONFIG_PPC_82xx) += pq2.c fsl-soc.c planetcore.c
-src-wlib-$(CONFIG_EMBEDDED6xx) += mv64x60.c mv64x60_i2c.c ugecon.c fsl-soc.c
+src-wlib-$(CONFIG_EMBEDDED6xx) += mpsc.c mv64x60.c mv64x60_i2c.c ugecon.c 
fsl-soc.c
 
 src-plat-y := of.c epapr.c
 src-plat-$(CONFIG_40x) += fixed-head.S ep405.c cuboot-hotfoot.c \
diff --git a/arch/powerpc/boot/serial.c b/arch/powerpc/boot/serial.c
index e04c1e4..7b5c02b 100644
--- a/arch/powerpc/boot/serial.c
+++ b/arch/powerpc/boot/serial.c
@@ -120,15 +120,19 @@ int serial_console_init(void)
if (dt_is_compatible(devp, "ns16550") ||
dt_is_compatible(devp, "pnpPNP,501"))
rc = ns16550_console_init(devp, _cd);
+#ifdef CONFIG_EMBEDDED6xx
else if (dt_is_compatible(devp, "marvell,mv64360-mpsc"))
rc = mpsc_console_init(devp, _cd);
+#endif
else if (dt_is_compatible(devp, "fsl,cpm1-scc-uart") ||
 dt_is_compatible(devp, "fsl,cpm1-smc-uart") ||
 dt_is_compatible(devp, "fsl,cpm2-scc-uart") ||
 dt_is_compatible(devp, "fsl,cpm2-smc-uart"))
rc = cpm_console_init(devp, _cd);
+#ifdef CONFIG_PPC_MPC52XX
else if (dt_is_compatible(devp, "fsl,mpc5200-psc-uart"))
rc = mpc5200_psc_console_init(devp, _cd);
+#endif
else if (dt_is_compatible(devp, "xlnx,opb-uartlite-1.00.b") ||
 dt_is_compatible(devp, "xlnx,xps-uartlite-1.00.a"))
rc = uartlite_console_init(devp, _cd);
-- 
1.8.5.6



Re: [RFC 2/4] libnvdimm: Add a device-tree interface

2017-06-27 Thread Oliver
Hi Mark,

Thanks for the review and sorry, I really should have added more
context. I was originally just going to send this to the linux-nvdimm
list, but I figured the wider device-tree community might be
interested too.

Preamble:

Non-volatile DIMMs (nvdimms) are otherwise normal DDR DIMMs that are
based on some kind of non-volatile memory with DRAM-like performance
(i.e. not flash). The best known example would probably be Intel's 3D
XPoint technology, but there are a few others around. The non-volatile
aspect makes them useful as storage devices and being part of the
memory space allows the backing storage to be exposed to userspace via
mmap() provided the kernel supports it. The mmap() trick is enabled by
the kernel supporting "direct access" aka DAX.
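
As a minimal userspace illustration of that mmap() path (the mount point
and file name are made up, and error handling is omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/mnt/pmem/data", O_RDWR | O_CREAT, 0644);
        char *p;

        ftruncate(fd, 4096);

        /* Loads and stores through 'p' hit the NVDIMM pages directly */
        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        strcpy(p, "hello, persistent world");
        msync(p, 4096, MS_SYNC);        /* request a flush to persistent media */

        munmap(p, 4096);
        close(fd);
        return 0;
}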

With that out of the way...

On Tue, Jun 27, 2017 at 8:43 PM, Mark Rutland  wrote:
> Hi,
>
> On Tue, Jun 27, 2017 at 08:28:49PM +1000, Oliver O'Halloran wrote:
>> A fairly bare-bones set of device-tree bindings so libnvdimm can be used
>> on powerpc and other, less cool, device-tree based platforms.
>
> ;)
>
>> Cc: devicet...@vger.kernel.org
>> Signed-off-by: Oliver O'Halloran 
>> ---
>> The current bindings are essentially this:
>>
>> nonvolatile-memory {
>>   compatible = "nonvolatile-memory", "special-memory";
>>   ranges;
>>
>>   region@0 {
>>   compatible = "nvdimm,byte-addressable";
>>   reg = <0x0 0x1000>;
>>   };
>>
>>   region@1000 {
>>   compatible = "nvdimm,byte-addressable";
>>   reg = <0x1000 0x1000>;
>>   };
>> };
>
> This needs to have a proper binding document under
> Documentation/devicetree/bindings/. Something like the reserved-memory
> bindings would be a good template.
>
> If we want thet "nvdimm" vendor-prefix, that'll have to be reserved,
> too (see Documentation/devicetree/bindings/vendor-prefixes.txt).

It's on my TODO list, I just wanted to get some comments on the
overall approach before doing the rest of the grunt work.

>
> What is "special-memory"? What other memory types would be described
> here?
>
> What exactly does "nvdimm,byte-addressable" imply? I suspect that you
> also expect such memory to be compatible with mappings using (some)
> cacheable attributes?

I think it's always been assumed that nvdimm memory can be treated as
cacheable system memory for all intents and purposes. It might be
useful to be able to override it on a per-bus or per-region basis
though.

>
> Perhaps the byte-addressable property should be a boolean property on
> the region, rather than part of the compatible string.
See below.

>> To handle interleave sets, etc. the plan was to add an extra property with the
>> interleave stride and a "mapping" property with <, dimm-start-offset>
>> tuples for each dimm in the interleave set. Block MMIO regions can be added
>> with a different compatible type, but I'm not too concerned with them for
>> now.
>
> Sorry, I'm not too familiar with nonvolatile memory. What are interleave
> sets?

An interleave set refers to a group of DIMMs which share a physical
address range. The addresses in the range are assigned to different
backing DIMMs to improve performance. E.g.:

Addr 0 to Addr 127 are on DIMM0, Addr 128 to 255 are on DIMM1, Addr
256 to 383 are on DIMM0, etc.

software needs to be aware of the interleave pattern so it can
localise memory errors to a specific DIMM.
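
A toy sketch of that address-to-DIMM mapping (the stride and DIMM count
are made-up parameters):

#include <stdio.h>

/* Which DIMM in the interleave set backs a given address, for a fixed stride */
static unsigned int addr_to_dimm(unsigned long addr,
                                 unsigned long stride, unsigned int ndimms)
{
        return (addr / stride) % ndimms;
}

int main(void)
{
        unsigned long addr;

        /* 128-byte stride, 2 DIMMs: 0..127 -> DIMM0, 128..255 -> DIMM1, ... */
        for (addr = 0; addr < 512; addr += 128)
                printf("addr %3lu..%3lu -> DIMM%u\n",
                       addr, addr + 127, addr_to_dimm(addr, 128, 2));
        return 0;
}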

>
> What are block MMIO regions?

NVDIMMs come in two flavours: byte addressable and block aperture. The
byte addressable type can be treated as conventional memory while the
block aperture type are essentially an MMIO block device. Their
contents are accessed via the MMIO window rather than being presented
to the system as RAM so they don't have any of the features that make
NVDIMMs interesting. It would be nice if we could punt them into a
different driver, unfortunately ACPI allows storage on one DIMM to be
partitioned into byte addressable and block regions and libnvdimm
provides the management interface for both. Dan Williams, who
maintains libnvdimm and the ACPI interface to it, would be a better
person to ask about the finer details.

>
> Is there any documentation one can refer to for any of this?

Documentation/nvdimm/nvdimm.txt has a fairly detailed overview of how
libnvdimm operates. The short version is that libnvdimm provides a
"nvdimm_bus" container for "regions" and "dimms." Regions are chunks
of memory and come in the block or byte types mentioned above, while
DIMMs refer to the physical devices. A firmware specific driver
converts the firmware's hardware description into a set of DIMMs, a
set of regions, and a set of relationships between the two.

On top of that, regions are partitioned into "namespaces" which are
then exported to userspace as either a block device (with PAGE_SIZE
blocks) or as a "DAX device." In the block device case a filesystem is
used to manage the storage and provided the filesystem 

Re: [PATCH 0/8] Support for 24x7 hcall interface version 2

2017-06-27 Thread Michael Ellerman
Thiago Jung Bauermann  writes:

> Hello,
>
> The hypervisor interface to access 24x7 performance counters (which collect
> performance information from system power on to system power off) has been
> extended in POWER9 adding new fields to the request and result element
> structures.
>
> Also, results for some domains now return more than one result element and
> those need to be added to get a total count.
>
> The first two patches fix bugs in the existing code. The following 4
> patches are code improvements and the last two finally implement support
> for the changes in POWER9 described above.
>
> POWER8 systems only support version 1 of the interface, while POWER9
> systems only support version 2. I tested these patches on POWER8 to verify
> that there are no regressions, and also on POWER9 DD1.

Where is version 2 documented?

And what happens when we boot on a POWER9 in POWER8 compatibility mode?

cheers


RE: [PATCH v2] powerpc/powernv: Enable PCI peer-to-peer

2017-06-27 Thread David Laight
From: Frederic Barrat
> Sent: 26 June 2017 19:09
> P9 has support for PCI peer-to-peer, enabling a device to write in the
> mmio space of another device directly, without interrupting the CPU.
> 
> This patch adds support for it on powernv, by adding a new API to be
> called by drivers. The pnv_pci_set_p2p(...) call configures an
> 'initiator', i.e the device which will issue the mmio operation, and a
> 'target', i.e. the device on the receiving side.
...

Two questions:

1) How does the driver get the correct address to give to the 'initiator'
   in order to perform an access to the 'target'?

2) Surely the API call the driver makes should be architecture neutral,
   returning an error on other architectures.

At least some x86 cpus also support peer-to-peer writes,
I believe they can work between cpu chips.
PCIe bridges might support them (or be configurable to support them).

David



Re: powernv/npu-dma.c: Add explicit flush when sending an ATSD

2017-06-27 Thread Michael Ellerman
On Tue, 2017-06-20 at 08:37:28 UTC, Alistair Popple wrote:
> NPU2 requires an extra explicit flush to an active GPU PID when sending
> address translation shoot downs (ATSDs) to reliably flush the GPU TLB. This
> patch adds just such a flush at the end of each sequence of ATSDs.
> 
> We can safely use PID 0 which is always reserved and active on the GPU. PID
> 0 is only used for init_mm which will never be a user mm on the GPU. To
> enforce this we add a check in pnv_npu2_init_context() just in case someone
> tries to use PID 0 on the GPU.
> 
> Signed-off-by: Alistair Popple 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/bbd5ff50afffcf4a01d05367524736

cheers


Re: [PATCH] kernel/power/suspend: use CONFIG_HAVE_SET_MEMORY for include condition

2017-06-27 Thread Balbir Singh
On Tue, Jun 27, 2017 at 7:07 AM, Rafael J. Wysocki  wrote:
> On Monday, June 26, 2017 01:34:52 PM Balbir Singh wrote:
>> On Sat, Jun 3, 2017 at 11:27 PM, Pavel Machek  wrote:
>> > On Sat 2017-06-03 20:52:32, Balbir Singh wrote:
>> >> Kbuild reported a build failure when CONFIG_STRICT_KERNEL_RWX was
>> >> enabled on powerpc. We don't yet have ARCH_HAS_SET_MEMORY and ppc32
>> >> saw a build failure.
>> >>
>> >> fixes(50327dd kernel/power/snapshot.c: use set_memory.h header)
>> >>
>> >> I've only done a basic compile test with a config that has
>> >> hibernation enabled.
>> >>
>> >> Cc: "Rafael J. Wysocki" 
>> >> Cc: Len Brown 
>> > Acked-by: Pavel Machek 
>>
>> Ping. Could we please pick this up? It breaks any attempt to support
>> STRICT_KERNEL_RWX on powerpc
>
> Yes, I'm going to pick it up for 4.13.

Thanks,
Balbir Singh.


Re: [RFC v4 02/17] mm: ability to disable execute permission on a key at creation

2017-06-27 Thread Balbir Singh
On Tue, 2017-06-27 at 03:11 -0700, Ram Pai wrote:
> Currently sys_pkey_create() provides the ability to disable read
> and write permission on the key at creation. powerpc has the
> hardware support to disable execute on a pkey as well. This patch
> enhances the interface to let execute be disabled at key creation
> time. x86 does not allow this, so the next patch will add the
> ability in x86 to return an error if PKEY_DISABLE_EXECUTE is
> specified.
> 
> Signed-off-by: Ram Pai 
> ---

Acked-by: Balbir Singh 



Re: [RFC v4 03/17] x86: key creation with PKEY_DISABLE_EXECUTE disallowed

2017-06-27 Thread Balbir Singh
On Tue, 2017-06-27 at 03:11 -0700, Ram Pai wrote:
> x86 does not support disabling execute permissions on a pkey.
> 
> Signed-off-by: Ram Pai 
> ---
>  arch/x86/kernel/fpu/xstate.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index c24ac1e..d582631 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -900,6 +900,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, 
> int pkey,
>   if (!boot_cpu_has(X86_FEATURE_OSPKE))
>   return -EINVAL;
>  
> + if (init_val & PKEY_DISABLE_EXECUTE)
> + return -EINVAL;
> +
>   /* Set the bits we need in PKRU:  */
>   if (init_val & PKEY_DISABLE_ACCESS)
>   new_pkru_bits |= PKRU_AD_BIT;

I am not an x86 expert. IIUC, execute disable is done via allocating an
execute_only_pkey and checking vma_key via AD + vma_flags against VM_EXEC.

Your patch looks good to me

Acked-by: Balbir Singh 

Balbir Singh.



Re: [PATCH v4 1/9] powerpc/lib/code-patching: Use alternate map for patch_instruction()

2017-06-27 Thread Balbir Singh
On Tue, 2017-06-27 at 10:32 +0200, Christophe LEROY wrote:
> 
> On 27/06/2017 at 09:48, Balbir Singh wrote:
> > This patch creates the window using text_poke_area, allocated
> > via get_vm_area(). text_poke_area is per CPU to avoid locking.
> > text_poke_area for each cpu is set up using late_initcall; prior
> > to setup of these alternate mapping areas, we continue to use
> > direct writes to change/modify kernel text. With the ability
> > to use alternate mappings to write to kernel text, it provides
> > us the freedom to then turn text read-only and implement
> > CONFIG_STRICT_KERNEL_RWX.
> > 
> > This code is CPU hotplug aware to ensure that the we have mappings
> > for any new cpus as they come online and tear down mappings for
> > any cpus that are offline.
> > 
> > Other arches do similar things, but use fixmaps. The reason
> > for not using fixmaps is to make use of any randomization in
> > the future.
> > 
> > Signed-off-by: Balbir Singh 
> > ---
> >   arch/powerpc/lib/code-patching.c | 160 
> > ++-
> >   1 file changed, 156 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/powerpc/lib/code-patching.c 
> > b/arch/powerpc/lib/code-patching.c
> > index 500b0f6..19b8368 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
> > @@ -12,23 +12,172 @@
> >   #include 
> >   #include 
> >   #include 
> > -#include 
> > -#include 
> > +#include 
> > +#include 
> >   #include 
> >   #include 
> >   
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> >   
> > -int patch_instruction(unsigned int *addr, unsigned int instr)
> > +static int __patch_instruction(unsigned int *addr, unsigned int instr)
> >   {
> > int err;
> >   
> > __put_user_size(instr, addr, 4, err);
> > if (err)
> > return err;
> > -   asm ("dcbst 0, %0; sync; icbi 0,%0; sync; isync" : : "r" (addr));
> > +   asm ("dcbst 0, %0; sync; icbi 0,%0; sync; isync" :: "r" (addr));
> > +   return 0;
> > +}
> > +
> > +#ifdef CONFIG_STRICT_KERNEL_RWX
> > +static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> > +
> > +static int text_area_cpu_up(unsigned int cpu)
> > +{
> > +   struct vm_struct *area;
> > +
> > +   area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> > +   if (!area) {
> > +   WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> > +   cpu);
> > +   return -1;
> > +   }
> > +   this_cpu_write(text_poke_area, area);
> > +   return 0;
> > +}
> > +
> > +static int text_area_cpu_down(unsigned int cpu)
> > +{
> > +   free_vm_area(this_cpu_read(text_poke_area));
> > +   return 0;
> > +}
> > +
> > +/*
> > + * This is an early_initcall and early_initcalls happen at the right time
> > + * for us, after slab is enabled and before we mark ro pages R/O. In the
> > + * future if get_vm_area is randomized, this will be more flexible than
> > + * fixmap
> > + */
> > +static int __init setup_text_poke_area(void)
> > +{
> > +   BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> > +   "powerpc/text_poke:online", text_area_cpu_up,
> > +   text_area_cpu_down));
> > +
> > +   pr_info("text_poke area ready...\n");
> > +   return 0;
> > +}
> > +
> > +/*
> > + * This can be called for kernel text or a module.
> > + */
> > +static int map_patch_area(void *addr, unsigned long text_poke_addr)
> > +{
> > +   unsigned long pfn;
> > +   int err;
> > +
> > +   if (is_vmalloc_addr(addr))
> > +   pfn = vmalloc_to_pfn(addr);
> > +   else
> > +   pfn = __pa_symbol(addr) >> PAGE_SHIFT;
> > +
> > +   err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT),
> > +   pgprot_val(PAGE_KERNEL));
> > +   pr_devel("Mapped addr %lx with pfn %lx:%d\n", text_poke_addr, pfn, err);
> > +   if (err)
> > +   return -1;
> > +   return 0;
> > +}
> > +
> > +static inline int unmap_patch_area(unsigned long addr)
> > +{
> > +   pte_t *ptep;
> > +   pmd_t *pmdp;
> > +   pud_t *pudp;
> > +   pgd_t *pgdp;
> > +
> > +   pgdp = pgd_offset_k(addr);
> > +   if (unlikely(!pgdp))
> > +   return -EINVAL;
> > +   pudp = pud_offset(pgdp, addr);
> > +   if (unlikely(!pudp))
> > +   return -EINVAL;
> > +   pmdp = pmd_offset(pudp, addr);
> > +   if (unlikely(!pmdp))
> > +   return -EINVAL;
> > +   ptep = pte_offset_kernel(pmdp, addr);
> > +   if (unlikely(!ptep))
> > +   return -EINVAL;
> > +
> > +   pr_devel("clearing mm %p, pte %p, addr %lx\n", &init_mm, ptep, addr);
> > +   /*
> > +* In hash, pte_clear flushes the tlb, in radix, we have to
> > +*/
> > +   pte_clear(&init_mm, addr, ptep);
> > +   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> > return 0;
> >   }
> >   
> > +int patch_instruction(unsigned int *addr, unsigned int instr)
> > +{
> > +   int err;
> > +   unsigned int *dest = NULL;
> > +   unsigned long flags;
> > +   unsigned long text_poke_addr;
> > +   unsigned long kaddr = (unsigned long)addr;
> > +
> > +   /*
> > +* 
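Putting the helpers above together, patch_instruction() ends up doing roughly
the following (a sketch consistent with the description above, not the exact
code from the patch):

static int patch_instruction_sketch(unsigned int *addr, unsigned int instr)
{
	unsigned long text_poke_addr, flags, offset;
	unsigned int *dest;
	int err;

	/* Before the per-CPU areas are set up (early boot), or if setup
	 * failed, fall back to a direct write. */
	if (!this_cpu_read(text_poke_area))
		return __patch_instruction(addr, instr);

	local_irq_save(flags);
	text_poke_addr = (unsigned long)__this_cpu_read(text_poke_area)->addr;
	err = map_patch_area(addr, text_poke_addr);
	if (err)
		goto out;

	/* Patch through the temporary alias at the same page offset,
	 * then tear the mapping down again. */
	offset = (unsigned long)addr & ~PAGE_MASK;
	dest = (unsigned int *)(text_poke_addr + offset);
	err = __patch_instruction(dest, instr);

	unmap_patch_area(text_poke_addr);
out:
	local_irq_restore(flags);
	return err;
}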

Re: [PATCH v4 3/9] powerpc/kprobes/optprobes: Move over to patch_instruction

2017-06-27 Thread Balbir Singh
On Tue, 2017-06-27 at 10:34 +0200, Christophe LEROY wrote:
> 
> Le 27/06/2017 à 09:48, Balbir Singh a écrit :
> > With text moving to read-only migrate optprobes to using
> > the patch_instruction infrastructure. Without this optprobes
> > will fail and complain.
> > 
> > Signed-off-by: Balbir Singh 
> 
> Didn't Michael pick it up already ?
>

Yes, he did. I posted the entire series and I'll let him keep the
better versions he has edited. I spoke to him, but I was not 100%
sure what was picked up; the email responses mentioned 3 patches,
but I thought 4 were picked up.

Balbir Singh.



Patch "powerpc/slb: Force a full SLB flush when we insert for a bad EA" has been added to the 4.4-stable tree

2017-06-27 Thread gregkh

This is a note to let you know that I've just added the patch titled

powerpc/slb: Force a full SLB flush when we insert for a bad EA

to the 4.4-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 powerpc-slb-force-a-full-slb-flush-when-we-insert-for-a-bad-ea.patch
and it can be found in the queue-4.4 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From m...@ellerman.id.au  Tue Jun 27 12:58:25 2017
From: Michael Ellerman 
Date: Thu, 22 Jun 2017 16:52:51 +1000
Subject: powerpc/slb: Force a full SLB flush when we insert for a bad EA
To: sta...@vger.kernel.org  , 
linuxppc-dev@lists.ozlabs.org 
Cc: lkml  , Greg Kroah-Hartman  

Message-ID: <87k244sed8@concordia.ellerman.id.au>

From: Michael Ellerman 

[Note this patch is not upstream. The bug fix was fixed differently in
upstream prior to the bug being identified.]

The SLB miss handler calls slb_allocate_realmode() in order to create an
SLB entry for the faulting address. At the very start of that function
we check that the faulting Effective Address (EA) is less than
PGTABLE_RANGE (ignoring the region), ie. is it an address which could
possibly fit in the virtual address space.

For an EA which fails that test, we branch out of line (to label 8), but
we still go on to create an SLB entry for the address. The SLB entry we
create has a VSID of 0, which means it will never match anything in the
hash table and so can't actually translate to a physical address.

However that SLB entry will be inserted in the SLB, and so needs to be
managed properly like any other SLB entry. In particular we need to
insert the SLB entry in the SLB cache, so that it will be flushed when
the process is descheduled.

And that is where the bugs begin. The first bug is that slb_finish_load()
uses cr7 to decide if it should insert the SLB entry into the SLB cache.
When we come from the invalid EA case we don't set cr7, it just has some
junk value from userspace. So we may or may not insert the SLB entry in
the SLB cache. If we fail to insert it, we may then incorrectly leave it
in the SLB when the process is descheduled.

The second bug is that even if we do happen to add the entry to the SLB
cache, we do not have enough bits in the SLB cache to remember the full
ESID value for very large EAs.

For example if a process branches to 0x788c545a1800, that results in
a 256MB SLB entry with an ESID of 0x788c545a1. But each entry in the SLB
cache is only 32-bits, meaning we truncate the ESID to 0x88c545a1. This
has the same effect as the first bug, we incorrectly leave the SLB entry
in the SLB when the process is descheduled.

When a process accesses an invalid EA it results in a SEGV signal being
sent to the process, which typically results in the process being
killed. Process death isn't instantaneous however, the process may catch
the SEGV signal and continue somehow, or the kernel may start writing a
core dump for the process, either of which means it's possible for the
process to be preempted while it's processing the SEGV but before it's
been killed.

If that happens, when the process is scheduled back onto the CPU we will
allocate a new SLB entry for the NIP, which will insert a second entry
into the SLB for the bad EA. Because we never flushed the original
entry, due to either bug one or two, we now have two SLB entries that
match the same EA.

If another access is made to that EA, either by the process continuing
after catching the SEGV, or by a second process accessing the same bad
EA on the same CPU, we will trigger an SLB multi-hit machine check
exception. This has been observed happening in the wild.

The fix is when we hit the invalid EA case, we mark the SLB cache as
being full. This causes us to not insert the truncated ESID into the SLB
cache, and means when the process is switched out we will flush the
entire SLB. Note that this works both for the original fault and for a
subsequent call to slb_allocate_realmode() from switch_slb().

Because we mark the SLB cache as full, it doesn't really matter what
value is in cr7, but rather than leaving it as something random we set
it to indicate the address was a kernel address. That also skips the
attempt to insert it in the SLB cache which is a nice side effect.

Another way to fix the bug would be to make the entries in the SLB cache
wider, so that we don't truncate the ESID. However this would be a more
intrusive change as it alters the size and layout of the paca.

This bug was fixed in upstream by commit f0f558b131db ("powerpc/mm:
Preserve CFAR value on SLB miss caused by access to bogus address"),
which changed the way we handle a bad EA entirely removing this bug in

Re: [PATCH backport pre-4.9] powerpc/slb: Force a full SLB flush when we insert for a bad EA

2017-06-27 Thread Greg Kroah-Hartman
On Thu, Jun 22, 2017 at 04:52:51PM +1000, Michael Ellerman wrote:
> The SLB miss handler calls slb_allocate_realmode() in order to create an
> SLB entry for the faulting address. At the very start of that function
> we check that the faulting Effective Address (EA) is less than
> PGTABLE_RANGE (ignoring the region), ie. is it an address which could
> possibly fit in the virtual address space.
> 
> For an EA which fails that test, we branch out of line (to label 8), but
> we still go on to create an SLB entry for the address. The SLB entry we
> create has a VSID of 0, which means it will never match anything in the
> hash table and so can't actually translate to a physical address.
> 
> However that SLB entry will be inserted in the SLB, and so needs to be
> managed properly like any other SLB entry. In particular we need to
> insert the SLB entry in the SLB cache, so that it will be flushed when
> the process is descheduled.
> 
> And that is where the bugs begin. The first bug is that slb_finish_load()
> uses cr7 to decide if it should insert the SLB entry into the SLB cache.
> When we come from the invalid EA case we don't set cr7, it just has some
> junk value from userspace. So we may or may not insert the SLB entry in
> the SLB cache. If we fail to insert it, we may then incorrectly leave it
> in the SLB when the process is descheduled.
> 
> The second bug is that even if we do happen to add the entry to the SLB
> cache, we do not have enough bits in the SLB cache to remember the full
> ESID value for very large EAs.
> 
> For example if a process branches to 0x788c545a1800, that results in
> a 256MB SLB entry with an ESID of 0x788c545a1. But each entry in the SLB
> cache is only 32-bits, meaning we truncate the ESID to 0x88c545a1. This
> has the same effect as the first bug, we incorrectly leave the SLB entry
> in the SLB when the process is descheduled.
> 
> When a process accesses an invalid EA it results in a SEGV signal being
> sent to the process, which typically results in the process being
> killed. Process death isn't instantaneous however, the process may catch
> the SEGV signal and continue somehow, or the kernel may start writing a
> core dump for the process, either of which means it's possible for the
> process to be preempted while it's processing the SEGV but before it's
> been killed.
> 
> If that happens, when the process is scheduled back onto the CPU we will
> allocate a new SLB entry for the NIP, which will insert a second entry
> into the SLB for the bad EA. Because we never flushed the original
> entry, due to either bug one or two, we now have two SLB entries that
> match the same EA.
> 
> If another access is made to that EA, either by the process continuing
> after catching the SEGV, or by a second process accessing the same bad
> EA on the same CPU, we will trigger an SLB multi-hit machine check
> exception. This has been observed happening in the wild.
> 
> The fix is when we hit the invalid EA case, we mark the SLB cache as
> being full. This causes us to not insert the truncated ESID into the SLB
> cache, and means when the process is switched out we will flush the
> entire SLB. Note that this works both for the original fault and for a
> subsequent call to slb_allocate_realmode() from switch_slb().
> 
> Because we mark the SLB cache as full, it doesn't really matter what
> value is in cr7, but rather than leaving it as something random we set
> it to indicate the address was a kernel address. That also skips the
> attempt to insert it in the SLB cache which is a nice side effect.
> 
> Another way to fix the bug would be to make the entries in the SLB cache
> wider, so that we don't truncate the ESID. However this would be a more
> intrusive change as it alters the size and layout of the paca.
> 
> This bug was fixed in upstream by commit f0f558b131db ("powerpc/mm:
> Preserve CFAR value on SLB miss caused by access to bogus address"),
> which changed the way we handle a bad EA entirely removing this bug in
> the process.
> 
> Signed-off-by: Michael Ellerman 
> Reviewed-by: Paul Mackerras 
> ---
>  arch/powerpc/mm/slb_low.S | 10 ++
>  1 file changed, 10 insertions(+)
> 
> Note this patch is not upstream. The bug fix was fixed differently in
> upstream prior to the bug being identified.

Now applied to 4.4 and 3.18-stable kernels, thanks,

greg k-h


Patch "powerpc/slb: Force a full SLB flush when we insert for a bad EA" has been added to the 3.18-stable tree

2017-06-27 Thread gregkh

This is a note to let you know that I've just added the patch titled

powerpc/slb: Force a full SLB flush when we insert for a bad EA

to the 3.18-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 powerpc-slb-force-a-full-slb-flush-when-we-insert-for-a-bad-ea.patch
and it can be found in the queue-3.18 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From m...@ellerman.id.au  Tue Jun 27 12:58:25 2017
From: Michael Ellerman 
Date: Thu, 22 Jun 2017 16:52:51 +1000
Subject: powerpc/slb: Force a full SLB flush when we insert for a bad EA
To: sta...@vger.kernel.org  , 
linuxppc-dev@lists.ozlabs.org 
Cc: lkml  , Greg Kroah-Hartman  

Message-ID: <87k244sed8@concordia.ellerman.id.au>

From: Michael Ellerman 

[Note this patch is not upstream. The bug fix was fixed differently in
upstream prior to the bug being identified.]

The SLB miss handler calls slb_allocate_realmode() in order to create an
SLB entry for the faulting address. At the very start of that function
we check that the faulting Effective Address (EA) is less than
PGTABLE_RANGE (ignoring the region), ie. is it an address which could
possibly fit in the virtual address space.

For an EA which fails that test, we branch out of line (to label 8), but
we still go on to create an SLB entry for the address. The SLB entry we
create has a VSID of 0, which means it will never match anything in the
hash table and so can't actually translate to a physical address.

However that SLB entry will be inserted in the SLB, and so needs to be
managed properly like any other SLB entry. In particular we need to
insert the SLB entry in the SLB cache, so that it will be flushed when
the process is descheduled.

And that is where the bugs begin. The first bug is that slb_finish_load()
uses cr7 to decide if it should insert the SLB entry into the SLB cache.
When we come from the invalid EA case we don't set cr7, it just has some
junk value from userspace. So we may or may not insert the SLB entry in
the SLB cache. If we fail to insert it, we may then incorrectly leave it
in the SLB when the process is descheduled.

The second bug is that even if we do happen to add the entry to the SLB
cache, we do not have enough bits in the SLB cache to remember the full
ESID value for very large EAs.

For example if a process branches to 0x788c545a1800, that results in
a 256MB SLB entry with an ESID of 0x788c545a1. But each entry in the SLB
cache is only 32-bits, meaning we truncate the ESID to 0x88c545a1. This
has the same effect as the first bug, we incorrectly leave the SLB entry
in the SLB when the process is descheduled.

When a process accesses an invalid EA it results in a SEGV signal being
sent to the process, which typically results in the process being
killed. Process death isn't instantaneous however, the process may catch
the SEGV signal and continue somehow, or the kernel may start writing a
core dump for the process, either of which means it's possible for the
process to be preempted while it's processing the SEGV but before it's
been killed.

If that happens, when the process is scheduled back onto the CPU we will
allocate a new SLB entry for the NIP, which will insert a second entry
into the SLB for the bad EA. Because we never flushed the original
entry, due to either bug one or two, we now have two SLB entries that
match the same EA.

If another access is made to that EA, either by the process continuing
after catching the SEGV, or by a second process accessing the same bad
EA on the same CPU, we will trigger an SLB multi-hit machine check
exception. This has been observed happening in the wild.

The fix is when we hit the invalid EA case, we mark the SLB cache as
being full. This causes us to not insert the truncated ESID into the SLB
cache, and means when the process is switched out we will flush the
entire SLB. Note that this works both for the original fault and for a
subsequent call to slb_allocate_realmode() from switch_slb().

Because we mark the SLB cache as full, it doesn't really matter what
value is in cr7, but rather than leaving it as something random we set
it to indicate the address was a kernel address. That also skips the
attempt to insert it in the SLB cache which is a nice side effect.

Another way to fix the bug would be to make the entries in the SLB cache
wider, so that we don't truncate the ESID. However this would be a more
intrusive change as it alters the size and layout of the paca.

This bug was fixed in upstream by commit f0f558b131db ("powerpc/mm:
Preserve CFAR value on SLB miss caused by access to bogus address"),
which changed the way we handle a bad EA entirely removing this bug 

Re: [RFC v4 01/17] mm: introduce an additional vma bit for powerpc pkey

2017-06-27 Thread Balbir Singh
On Tue, 2017-06-27 at 03:11 -0700, Ram Pai wrote:
> Currently there are only 4bits in the vma flags to support 16 keys
> on x86.  powerpc supports 32 keys, which needs 5bits. This patch
> introduces an addition bit in the vma flags.
> 
> Signed-off-by: Ram Pai 
> ---
>  fs/proc/task_mmu.c |  6 +-
>  include/linux/mm.h | 18 +-
>  2 files changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index f0c8b33..2ddc298 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -666,12 +666,16 @@ static void show_smap_vma_flags(struct seq_file *m, 
> struct vm_area_struct *vma)
>   [ilog2(VM_MERGEABLE)]   = "mg",
>   [ilog2(VM_UFFD_MISSING)]= "um",
>   [ilog2(VM_UFFD_WP)] = "uw",
> -#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
> +#ifdef CONFIG_ARCH_HAS_PKEYS
>   /* These come out via ProtectionKey: */
>   [ilog2(VM_PKEY_BIT0)]   = "",
>   [ilog2(VM_PKEY_BIT1)]   = "",
>   [ilog2(VM_PKEY_BIT2)]   = "",
>   [ilog2(VM_PKEY_BIT3)]   = "",
> +#endif /* CONFIG_ARCH_HAS_PKEYS */
> +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> + /* Additional bit in ProtectionKey: */
> + [ilog2(VM_PKEY_BIT4)]   = "",
>  #endif

Not sure why these are linked with smap bits, but I guess the keys live
in the Supervisor Mode Access Prevention area?

Balbir Singh.



Re: [RFC v4 00/17] powerpc: Memory Protection Keys

2017-06-27 Thread Balbir Singh
On Tue, 2017-06-27 at 03:11 -0700, Ram Pai wrote:
> Memory protection keys enable an application to protect its
> address space from inadvertent access or corruption from
> itself.
> 
> The overall idea:
> 
>  A process allocates a key and associates it with
>  an address range within its address space.
>  The process then can dynamically set read/write
>  permissions on the key without involving the
>  kernel. Any code that violates the permissions
>  of the address space, as defined by its associated
>  key, will receive a segmentation fault.
> 
> This patch series enables the feature on PPC64 HPTE
> platform.
> 
> ISA3.0 section 5.7.13 describes the detailed specifications.
> 
> 
> Testing:
>   This patch series has passed all the protection key
>   tests available in  the selftests directory.
>   The tests are updated to work on both x86 and powerpc.
> 
> version v4:
>   (1) patches no more depend on the pte bits to program
>   the hpte -- comment by Balbir
>   (2) documentation updates
>   (3) fixed a bug in the selftest.
>   (4) unlike x86, powerpc lets signal handler change key
>   permission bits; the change will persist across
>   signal handler boundaries. Earlier we allowed
>   the signal handler to modify a field in the siginfo
>   structure which would then be used by the kernel
>   to program the key protection register (AMR)
>   -- resolves an issue raised by Ben.
>   "Calls to sys_swapcontext with a made-up context
>   will end up with a crap AMR if done by code who
>   didn't know about that register".
>   (5) these changes enable protection keys on 4k-page 
>   kernel as well.

I have not looked at the full series, but it seems cleaner than the original
one and the side-effect is that we can support 4k as well. Nice!

Balbir Singh.



Re: [RFC 2/4] libnvdimm: Add a device-tree interface

2017-06-27 Thread Mark Rutland
Hi,

On Tue, Jun 27, 2017 at 08:28:49PM +1000, Oliver O'Halloran wrote:
> A fairly bare-bones set of device-tree bindings so libnvdimm can be used
> on powerpc and other, less cool, device-tree based platforms.

;)

> Cc: devicet...@vger.kernel.org
> Signed-off-by: Oliver O'Halloran 
> ---
> The current bindings are essentially this:
> 
> nonvolatile-memory {
>   compatible = "nonvolatile-memory", "special-memory";
>   ranges;
> 
>   region@0 {
>   compatible = "nvdimm,byte-addressable";
>   reg = <0x0 0x1000>;
>   };
> 
>   region@1000 {
>   compatible = "nvdimm,byte-addressable";
>   reg = <0x1000 0x1000>;
>   };
> };

This needs to have a proper binding document under
Documentation/devicetree/bindings/. Something like the reserved-memory
bindings would be a good template.

If we want the "nvdimm" vendor-prefix, that'll have to be reserved,
too (see Documentation/devicetree/bindings/vendor-prefixes.txt).

What is "special-memory"? What other memory types would be described
here?

What exactly does "nvdimm,byte-addressable" imply? I suspect that you
also expect such memory to be compatible with mappings using (some)
cacheable attributes?

Perhaps the byte-addressable property should be a boolean property on
the region, rather than part of the compatible string.
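i.e. something like this on the driver side (a sketch; "byte-addressable"
as a boolean property name is only a suggestion, and of_nvdimm_add_byte is
the helper from this patch):

	/* Sketch: pick the region type from a boolean property instead
	 * of the compatible string. */
	if (of_property_read_bool(np, "byte-addressable"))
		rc = of_nvdimm_add_byte(bus, np);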

> To handle interleave sets, etc. the plan was to add an extra property with the
> interleave stride and a "mapping" property with <, dimm-start-offset>
> tuples for each dimm in the interleave set. Block MMIO regions can be added
> with a different compatible type, but I'm not too concerned with them for
> now.

Sorry, I'm not too familiar with nonvolatile memory. What are interleave
sets?

What are block MMIO regions?

Is there any documentation one can refer to for any of this?

[...]

> +static const struct of_device_id of_nvdimm_bus_match[] = {
> + { .compatible = "nonvolatile-memory" },
> + { .compatible = "special-memory" },
> + { },
> +};

Why both? Is the driver handling other "special-memory"?

Thanks,
Mark.


[RFC 4/4] powerpc/powernv: Create platform devs for nvdimm buses

2017-06-27 Thread Oliver O'Halloran
Scan the devicetree for nonvolatile-memory buses and instantiate a
platform device for them.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/opal.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal.c 
b/arch/powerpc/platforms/powernv/opal.c
index 59684b4af4d1..bd3ed78f6f04 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -775,6 +775,9 @@ static int __init opal_init(void)
/* Create i2c platform devices */
opal_pdev_init("ibm,opal-i2c");
 
+   /* Handle non-volatile memory devices */
+   opal_pdev_init("nonvolatile-memory");
+
/* Setup a heartbeat thread if requested by OPAL */
opal_init_heartbeat();
 
-- 
2.9.4



[RFC 3/4] powerpc: Add pmem API support

2017-06-27 Thread Oliver O'Halloran
Adds powerpc64 implementations of:

memcpy_flushcache()
arch_wb_cache_pmem()
arch_invalidate_pmem()

Which form the architecture-specific portion of the persistent memory
API. These functions provide cache-management primitives for the DAX
drivers and libNVDIMM.
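For context, a pmem/DAX consumer ends up using these primitives roughly as
follows (an illustrative sketch, not part of this patch; 'dst' is assumed to
be a kernel mapping of persistent memory set up elsewhere):

static void pmem_write_sketch(void *dst, const void *src, size_t len)
{
	/* Copy and push the new data out of the CPU caches so it is
	 * durable once it reaches the platform's persistence domain. */
	memcpy_flushcache(dst, src, len);
}

static void pmem_reread_sketch(void *buf, size_t len)
{
	/* Drop stale cache lines before reading a range that may have
	 * been modified behind the CPU's back (e.g. after clearing
	 * poison). */
	arch_invalidate_pmem(buf, len);
}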

Signed-off-by: Oliver O'Halloran 
---
This should go on top of the ZONE_DEVICE patches. If you want a full tree
there's one here that's based on next-20170626 with Dan's libnvdimm-pending
branch merged in: https://github.com/oohal/linux/tree/ppc-nvdimm-4.13
---
 arch/powerpc/Kconfig|  1 +
 arch/powerpc/include/asm/pmem.h | 42 +
 2 files changed, 43 insertions(+)
 create mode 100644 arch/powerpc/include/asm/pmem.h

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 4526c9ba09b6..f551f3a26130 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -127,6 +127,7 @@ config PPC
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
+   select ARCH_HAS_PMEM_API if PPC64
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/powerpc/include/asm/pmem.h b/arch/powerpc/include/asm/pmem.h
new file mode 100644
index ..7b0282e420fc
--- /dev/null
+++ b/arch/powerpc/include/asm/pmem.h
@@ -0,0 +1,42 @@
+/*
+ * Copyright(c) 2017 IBM Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __ASM_PMEM_H__
+#define __ASM_PMEM_H__
+
+#include 
+#include 
+
+#ifdef CONFIG_ARCH_HAS_PMEM_API
+static inline void arch_wb_cache_pmem(void *addr, size_t size)
+{
+   unsigned long start = (unsigned long) addr;
+   flush_inval_dcache_range(start, start + size);
+}
+
+static inline void arch_invalidate_pmem(void *addr, size_t size)
+{
+   unsigned long start = (unsigned long) addr;
+   flush_inval_dcache_range(start, start + size);
+}
+
+static inline void *memcpy_flushcache(void *dest, const void *src, size_t size)
+{
+   unsigned long start = (unsigned long) dest;
+
+   memcpy(dest, src, size);
+   flush_inval_dcache_range(start, start + size);
+
+   return dest;
+}
+#endif /* CONFIG_ARCH_HAS_PMEM_API */
+#endif /* __ASM_PMEM_H__ */
-- 
2.9.4



[RFC 2/4] libnvdimm: Add a device-tree interface

2017-06-27 Thread Oliver O'Halloran
A fairly bare-bones set of device-tree bindings so libnvdimm can be used
on powerpc and other, less cool, device-tree based platforms.

Cc: devicet...@vger.kernel.org
Signed-off-by: Oliver O'Halloran 
---
The current bindings are essentially this:

nonvolatile-memory {
compatible = "nonvolatile-memory", "special-memory";
ranges;

region@0 {
compatible = "nvdimm,byte-addressable";
reg = <0x0 0x1000>;
};

region@1000 {
compatible = "nvdimm,byte-addressable";
reg = <0x1000 0x1000>;
};
};

To handle interleave sets, etc. the plan was to add an extra property with the
interleave stride and a "mapping" property with <, dimm-start-offset>
tuples for each dimm in the interleave set. Block MMIO regions can be added
with a different compatible type, but I'm not too concerned with them for
now.

Does this sound reasonable? Is there anything this scheme would make difficult?
---
 drivers/nvdimm/Kconfig |  10 +++
 drivers/nvdimm/Makefile|   1 +
 drivers/nvdimm/of_nvdimm.c | 209 +
 3 files changed, 220 insertions(+)
 create mode 100644 drivers/nvdimm/of_nvdimm.c

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 5bdd499b5f4f..72d147b55596 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -102,4 +102,14 @@ config NVDIMM_DAX
 
  Select Y if unsure
 
+config OF_NVDIMM
+   tristate "Device-tree support for NVDIMMs"
+   depends on OF
+   default LIBNVDIMM
+   help
+ Allows byte addressable persistent memory regions to be described in the
+ device-tree.
+
+ Select Y if unsure.
+
 endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 909554c3f955..622961f4849d 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -3,6 +3,7 @@ obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
 obj-$(CONFIG_ND_BLK) += nd_blk.o
 obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
+obj-$(CONFIG_OF_NVDIMM) += of_nvdimm.o
 
 nd_pmem-y := pmem.o
 
diff --git a/drivers/nvdimm/of_nvdimm.c b/drivers/nvdimm/of_nvdimm.c
new file mode 100644
index ..359808200feb
--- /dev/null
+++ b/drivers/nvdimm/of_nvdimm.c
@@ -0,0 +1,209 @@
+/*
+ * Copyright 2017, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, you can access it online at
+ * http://www.gnu.org/licenses/gpl-2.0.html.
+ */
+
+#define pr_fmt(fmt) "of_nvdimm: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static const struct attribute_group *region_attr_groups[] = {
+   &nd_region_attribute_group,
+   &nd_device_attribute_group,
+   NULL,
+};
+
+static int of_nvdimm_add_byte(struct nvdimm_bus *bus, struct device_node *np)
+{
+   struct nd_region_desc ndr_desc;
+   struct resource temp_res;
+   struct nd_region *region;
+
+   /*
+* byte regions should only have one address range
+*/
+   if (of_address_to_resource(np, 0, &temp_res)) {
+   pr_warn("Unable to parse reg[0] for %s\n", np->full_name);
+   return -ENXIO;
+   }
+
+   pr_debug("Found %pR for %s\n", &temp_res, np->full_name);
+
+   memset(&ndr_desc, 0, sizeof(ndr_desc));
+   ndr_desc.res = &temp_res;
+   ndr_desc.attr_groups = region_attr_groups;
+#ifdef CONFIG_NUMA
+   ndr_desc.numa_node = of_node_to_nid(np);
+#endif
+   set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+
+   region = nvdimm_pmem_region_create(bus, &ndr_desc);
+   if (!region)
+   return -ENXIO;
+
+   /*
+* Bind the region to the OF node we spawned it from. We
+* already bumped the node's refcount while walking the
+* bus.
+*/
+   to_nd_region_dev(region)->of_node = np;
+
+   return 0;
+}
+
+/*
+ * 'data' is a pointer to the function that handles registering the device
+ * on the nvdimm bus.
+ */
+static struct of_device_id of_nvdimm_dev_types[] = {
+   { .compatible = "nvdimm,byte-addressable", .data = of_nvdimm_add_byte },
+   { },
+};
+
+static void of_nvdimm_parse_one(struct nvdimm_bus *bus,
+   struct device_node *node)
+{
+   int (*parse_node)(struct nvdimm_bus *, struct device_node *);
+   const struct of_device_id *match;
+   int rc;
+
+   if (of_node_test_and_set_flag(node, 

[RFC 1/4] libnvdimm: add to_{nvdimm,nd_region}_dev()

2017-06-27 Thread Oliver O'Halloran
struct device contains the ->of_node pointer so that devices can be
associated with the device-tree node that created them on DT platforms.
libnvdimm hides the struct device for regions and nvdimm devices inside
of an opaque structure so this patch adds accessors for each to allow
the of_nvdimm driver to set the of_node pointer.
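The intended use in the of_nvdimm driver (patch 2 of this series) is simply:

	region = nvdimm_pmem_region_create(bus, &ndr_desc);
	if (region)
		to_nd_region_dev(region)->of_node = np;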

Signed-off-by: Oliver O'Halloran 
---
 drivers/nvdimm/dimm_devs.c   | 6 ++
 drivers/nvdimm/region_devs.c | 6 ++
 include/linux/libnvdimm.h| 2 ++
 3 files changed, 14 insertions(+)

diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index f0d1b7e5de01..cbddac011181 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -227,6 +227,12 @@ struct nvdimm *to_nvdimm(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nvdimm);
 
+struct device *to_nvdimm_dev(struct nvdimm *nvdimm)
+{
+   return &nvdimm->dev;
+}
+EXPORT_SYMBOL_GPL(to_nvdimm_dev);
+
 struct nvdimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr)
 {
struct nd_region *nd_region = &ndbr->nd_region;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index cbaab4210c39..6c3988135fd5 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -182,6 +182,12 @@ struct nd_region *to_nd_region(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(to_nd_region);
 
+struct device *to_nd_region_dev(struct nd_region *region)
+{
+   return &region->dev;
+}
+EXPORT_SYMBOL_GPL(to_nd_region_dev);
+
 struct nd_blk_region *to_nd_blk_region(struct device *dev)
 {
struct nd_region *nd_region = to_nd_region(dev);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 550761477005..10fbc523ff95 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -139,6 +139,8 @@ struct nd_region *to_nd_region(struct device *dev);
 struct nd_blk_region *to_nd_blk_region(struct device *dev);
 struct nvdimm_bus_descriptor *to_nd_desc(struct nvdimm_bus *nvdimm_bus);
 struct device *to_nvdimm_bus_dev(struct nvdimm_bus *nvdimm_bus);
+struct device *to_nvdimm_dev(struct nvdimm *nvdimm);
+struct device *to_nd_region_dev(struct nd_region *region);
 const char *nvdimm_name(struct nvdimm *nvdimm);
 struct kobject *nvdimm_kobj(struct nvdimm *nvdimm);
 unsigned long nvdimm_cmd_mask(struct nvdimm *nvdimm);
-- 
2.9.4



[RFC v4 17/17] procfs: display the protection-key number associated with a vma

2017-06-27 Thread Ram Pai
Display the pkey number associated with the vma in smaps of a task.
The key will be seen as below:

VmFlags: rd wr mr mw me dw ac key=0

Signed-off-by: Ram Pai 
---
 Documentation/filesystems/proc.txt |  3 ++-
 fs/proc/task_mmu.c | 22 +++---
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 4cddbce..a8c74aa 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -423,7 +423,7 @@ SwapPss:   0 kB
 KernelPageSize:4 kB
 MMUPageSize:   4 kB
 Locked:0 kB
-VmFlags: rd ex mr mw me dw
+VmFlags: rd ex mr mw me dw key=
 
 the first of these lines shows the same information as is displayed for the
 mapping in /proc/PID/maps.  The remaining lines show the size of the mapping
@@ -491,6 +491,7 @@ manner. The codes are the following:
 hg  - huge page advise flag
 nh  - no-huge page advise flag
 mg  - mergable advise flag
+key= - the memory protection key number
 
 Note that there is no guarantee that every flag and associated mnemonic will
 be present in all further kernel releases. Things get changed, the flags may
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2ddc298..d2eb096 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1,4 +1,6 @@
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -666,22 +668,20 @@ static void show_smap_vma_flags(struct seq_file *m, 
struct vm_area_struct *vma)
[ilog2(VM_MERGEABLE)]   = "mg",
[ilog2(VM_UFFD_MISSING)]= "um",
[ilog2(VM_UFFD_WP)] = "uw",
-#ifdef CONFIG_ARCH_HAS_PKEYS
-   /* These come out via ProtectionKey: */
-   [ilog2(VM_PKEY_BIT0)]   = "",
-   [ilog2(VM_PKEY_BIT1)]   = "",
-   [ilog2(VM_PKEY_BIT2)]   = "",
-   [ilog2(VM_PKEY_BIT3)]   = "",
-#endif /* CONFIG_ARCH_HAS_PKEYS */
-#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
-   /* Additional bit in ProtectionKey: */
-   [ilog2(VM_PKEY_BIT4)]   = "",
-#endif
};
size_t i;
 
seq_puts(m, "VmFlags: ");
for (i = 0; i < BITS_PER_LONG; i++) {
+#ifdef CONFIG_ARCH_HAS_PKEYS
+   if (i == ilog2(VM_PKEY_BIT0)) {
+   int keyvalue = vma_pkey(vma);
+
+   i += ilog2(arch_max_pkey())-1;
+   seq_printf(m, "key=%d ", keyvalue);
+   continue;
+   }
+#endif /* CONFIG_ARCH_HAS_PKEYS */
if (!mnemonics[i][0])
continue;
if (vma->vm_flags & (1UL << i)) {
-- 
1.8.3.1



[RFC v4 16/17] Documentation: PowerPC specific updates to memory protection keys

2017-06-27 Thread Ram Pai
Add documentation updates that capture PowerPC specific changes.

Signed-off-by: Ram Pai 
---
 Documentation/vm/protection-keys.txt | 89 
 1 file changed, 69 insertions(+), 20 deletions(-)

diff --git a/Documentation/vm/protection-keys.txt 
b/Documentation/vm/protection-keys.txt
index b643045..889f32e 100644
--- a/Documentation/vm/protection-keys.txt
+++ b/Documentation/vm/protection-keys.txt
@@ -1,21 +1,46 @@
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
-which will be found on future Intel CPUs.
+Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature found in
+new generations of Intel CPUs and on PowerPC 7 and higher CPUs.
 
 Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
-
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
-
-There are two new instructions (RDPKRU/WRPKRU) for reading and writing
-to the new register.  The feature is only available in 64-bit mode,
-even though there is theoretically space in the PAE PTEs.  These
-permissions are enforced on data access only and have no effect on
+protections, but without requiring modification of the page tables when an
+application changes protection domains.
+
+
+On Intel:
+
+   It works by dedicating 4 previously ignored bits in each page table
+   entry to a "protection key", giving 16 possible keys.
+
+   There is also a new user-accessible register (PKRU) with two separate
+   bits (Access Disable and Write Disable) for each key.  Being a CPU
+   register, PKRU is inherently thread-local, potentially giving each
+   thread a different set of protections from every other thread.
+
+   There are two new instructions (RDPKRU/WRPKRU) for reading and writing
+   to the new register.  The feature is only available in 64-bit mode,
+   even though there is theoretically space in the PAE PTEs.  These
+   permissions are enforced on data access only and have no effect on
+   instruction fetches.
+
+
+On PowerPC:
+
+   It works by dedicating 5 hash-page table entry bits to a "protection 
key",
+   giving 32 possible keys.
+
+   There  is  a  user-accessible  register (AMR)  with  two separate bits;
+   Access Disable and  Write  Disable, for  each key.  Being  a  CPU
+   register,  AMR  is inherently  thread-local,  potentially  giving  each
+   thread a different set of protections from every other thread.  NOTE:
+   Disabling read permission does not disable write and vice-versa.
+
+   The feature is available on 64-bit HPTE mode only.
+   'mtspr 0xd, mem' writes into the AMR register
+   'mfspr mem, 0xd' reads the AMR register.
+
+
+
+Permissions are enforced on data access only and have no effect on
 instruction fetches.
 
 === Syscalls ===
@@ -28,9 +53,9 @@ There are 3 system calls which directly interact with pkeys:
  unsigned long prot, int pkey);
 
 Before a pkey can be used, it must first be allocated with
-pkey_alloc().  An application calls the WRPKRU instruction
+pkey_alloc().  An application calls the WRPKRU/AMR instruction
 directly in order to change access permissions to memory covered
-with a key.  In this example WRPKRU is wrapped by a C function
+with a key.  In this example WRPKRU/AMR is wrapped by a C function
 called pkey_set().
 
int real_prot = PROT_READ|PROT_WRITE;
@@ -52,11 +77,11 @@ is no longer in use:
munmap(ptr, PAGE_SIZE);
pkey_free(pkey);
 
-(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
+(Note: pkey_set() is a wrapper for the RDPKRU,WRPKRU or AMR instructions.
  An example implementation can be found in
  tools/testing/selftests/x86/protection_keys.c)
 
-=== Behavior ===
+=== Behavior =
 
 The kernel attempts to make protection keys consistent with the
 behavior of a plain mprotect().  For instance if you do this:
@@ -83,3 +108,27 @@ with a read():
 The kernel will send a SIGSEGV in both cases, but si_code will be set
 to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
 the plain mprotect() permissions are violated.
+
+
+
+   Semantic differences
+
+The following semantic differences exist between x86 and power.
+
+a) powerpc allows creation of a key with 

[RFC v4 15/17] Documentation: Move protecton key documentation to arch neutral directory

2017-06-27 Thread Ram Pai
Since PowerPC and Intel both support memory protection keys, moving
the documenation to arch-neutral directory.

Signed-off-by: Ram Pai 
---
 Documentation/vm/protection-keys.txt  | 85 +++
 Documentation/x86/protection-keys.txt | 85 ---
 2 files changed, 85 insertions(+), 85 deletions(-)
 create mode 100644 Documentation/vm/protection-keys.txt
 delete mode 100644 Documentation/x86/protection-keys.txt

diff --git a/Documentation/vm/protection-keys.txt 
b/Documentation/vm/protection-keys.txt
new file mode 100644
index 000..b643045
--- /dev/null
+++ b/Documentation/vm/protection-keys.txt
@@ -0,0 +1,85 @@
+Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
+which will be found on future Intel CPUs.
+
+Memory Protection Keys provides a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables
+when an application changes protection domains.  It works by
+dedicating 4 previously ignored bits in each page table entry to a
+"protection key", giving 16 possible keys.
+
+There is also a new user-accessible register (PKRU) with two separate
+bits (Access Disable and Write Disable) for each key.  Being a CPU
+register, PKRU is inherently thread-local, potentially giving each
+thread a different set of protections from every other thread.
+
+There are two new instructions (RDPKRU/WRPKRU) for reading and writing
+to the new register.  The feature is only available in 64-bit mode,
+even though there is theoretically space in the PAE PTEs.  These
+permissions are enforced on data access only and have no effect on
+instruction fetches.
+
+=== Syscalls ===
+
+There are 3 system calls which directly interact with pkeys:
+
+   int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
+   int pkey_free(int pkey);
+   int pkey_mprotect(unsigned long start, size_t len,
+ unsigned long prot, int pkey);
+
+Before a pkey can be used, it must first be allocated with
+pkey_alloc().  An application calls the WRPKRU instruction
+directly in order to change access permissions to memory covered
+with a key.  In this example WRPKRU is wrapped by a C function
+called pkey_set().
+
+   int real_prot = PROT_READ|PROT_WRITE;
+   pkey = pkey_alloc(0, PKEY_DENY_WRITE);
+   ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 
0);
+   ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
+   ... application runs here
+
+Now, if the application needs to update the data at 'ptr', it can
+gain access, do the update, then remove its write access:
+
+   pkey_set(pkey, 0); // clear PKEY_DENY_WRITE
+   *ptr = foo; // assign something
+   pkey_set(pkey, PKEY_DENY_WRITE); // set PKEY_DENY_WRITE again
+
+Now when it frees the memory, it will also free the pkey since it
+is no longer in use:
+
+   munmap(ptr, PAGE_SIZE);
+   pkey_free(pkey);
+
+(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
+ An example implementation can be found in
+ tools/testing/selftests/x86/protection_keys.c)
+
+=== Behavior ===
+
+The kernel attempts to make protection keys consistent with the
+behavior of a plain mprotect().  For instance if you do this:
+
+   mprotect(ptr, size, PROT_NONE);
+   something(ptr);
+
+you can expect the same effects with protection keys when doing this:
+
+   pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ);
+   pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
+   something(ptr);
+
+That should be true whether something() is a direct access to 'ptr'
+like:
+
+   *ptr = foo;
+
+or when the kernel does the access on the application's behalf like
+with a read():
+
+   read(fd, ptr, 1);
+
+The kernel will send a SIGSEGV in both cases, but si_code will be set
+to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
+the plain mprotect() permissions are violated.
diff --git a/Documentation/x86/protection-keys.txt 
b/Documentation/x86/protection-keys.txt
deleted file mode 100644
index b643045..000
--- a/Documentation/x86/protection-keys.txt
+++ /dev/null
@@ -1,85 +0,0 @@
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
-which will be found on future Intel CPUs.
-
-Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
-
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of 

[RFC v4 14/17] selftest: PowerPC specific test updates to memory protection keys

2017-06-27 Thread Ram Pai
Abstracted out the arch specific code into the header file, and
added powerpc specific changes.

a) added 4k-backed hpte, memory allocator, powerpc specific.
b) added three test cases where the key is associated after the page is
accessed/allocated/mapped.
c) cleaned up the code to make checkpatch.pl happy

Signed-off-by: Ram Pai 
---
 tools/testing/selftests/vm/pkey-helpers.h| 230 +--
 tools/testing/selftests/vm/protection_keys.c | 567 ---
 2 files changed, 518 insertions(+), 279 deletions(-)

diff --git a/tools/testing/selftests/vm/pkey-helpers.h 
b/tools/testing/selftests/vm/pkey-helpers.h
index b202939..69bfa89 100644
--- a/tools/testing/selftests/vm/pkey-helpers.h
+++ b/tools/testing/selftests/vm/pkey-helpers.h
@@ -12,13 +12,72 @@
 #include 
 #include 
 
-#define NR_PKEYS 16
-#define PKRU_BITS_PER_PKEY 2
+/* Define some kernel-like types */
+#define  u8 uint8_t
+#define u16 uint16_t
+#define u32 uint32_t
+#define u64 uint64_t
+
+#ifdef __i386__ /* arch */
+
+#define SYS_mprotect_key 380
+#define SYS_pkey_alloc  381
+#define SYS_pkey_free   382
+#define REG_IP_IDX REG_EIP
+#define si_pkey_offset 0x14
+
+#define NR_PKEYS   16
+#define NR_RESERVED_PKEYS  1
+#define PKRU_BITS_PER_PKEY 2
+#define PKEY_DISABLE_ACCESS 0x1
+#define PKEY_DISABLE_WRITE 0x2
+#define HPAGE_SIZE (1UL<<21)
+
+#define INIT_PRKU 0x0UL
+
+#elif __powerpc64__ /* arch */
+
+#define SYS_mprotect_key 386
+#define SYS_pkey_alloc  384
+#define SYS_pkey_free   385
+#define si_pkey_offset 0x20
+#define REG_IP_IDX PT_NIP
+#define REG_TRAPNO PT_TRAP
+#define REG_AMR 45
+#define gregs gp_regs
+#define fpregs fp_regs
+
+#define NR_PKEYS   32
+#define NR_RESERVED_PKEYS  3
+#define PKRU_BITS_PER_PKEY 2
+#define PKEY_DISABLE_ACCESS 0x3  /* disable read and write */
+#define PKEY_DISABLE_WRITE 0x2
+#define HPAGE_SIZE (1UL<<24)
+
+#define INIT_PRKU 0x3UL
+#else /* arch */
+
+   NOT SUPPORTED
+
+#endif /* arch */
+
 
 #ifndef DEBUG_LEVEL
 #define DEBUG_LEVEL 0
 #endif
 #define DPRINT_IN_SIGNAL_BUF_SIZE 4096
+
+
+static inline u32 pkey_to_shift(int pkey)
+{
+#ifdef __i386__ /* arch */
+   return pkey * PKRU_BITS_PER_PKEY;
+#elif __powerpc64__ /* arch */
+   return (NR_PKEYS - pkey - 1) * PKRU_BITS_PER_PKEY;
+#endif /* arch */
+}
+
+
 extern int dprint_in_signal;
 extern char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE];
 static inline void sigsafe_printf(const char *format, ...)
@@ -53,53 +112,76 @@ static inline void sigsafe_printf(const char *format, ...)
 #define dprintf3(args...) dprintf_level(3, args)
 #define dprintf4(args...) dprintf_level(4, args)
 
-extern unsigned int shadow_pkru;
-static inline unsigned int __rdpkru(void)
+extern u64 shadow_pkey_reg;
+
+static inline u64 __rdpkey_reg(void)
 {
+#ifdef __i386__ /* arch */
unsigned int eax, edx;
unsigned int ecx = 0;
-   unsigned int pkru;
+   unsigned int pkey_reg;
 
asm volatile(".byte 0x0f,0x01,0xee\n\t"
 : "=a" (eax), "=d" (edx)
 : "c" (ecx));
-   pkru = eax;
-   return pkru;
+#elif __powerpc64__ /* arch */
+   u64 eax;
+   u64 pkey_reg;
+
+   asm volatile("mfspr %0, 0xd" : "=r" ((u64)(eax)));
+#endif /* arch */
+   pkey_reg = (u64)eax;
+   return pkey_reg;
 }
 
-static inline unsigned int _rdpkru(int line)
+static inline u64 _rdpkey_reg(int line)
 {
-   unsigned int pkru = __rdpkru();
+   u64 pkey_reg = __rdpkey_reg();
 
-   dprintf4("rdpkru(line=%d) pkru: %x shadow: %x\n",
-   line, pkru, shadow_pkru);
-   assert(pkru == shadow_pkru);
+   dprintf4("rdpkey_reg(line=%d) pkey_reg: %lx shadow: %lx\n",
+   line, pkey_reg, shadow_pkey_reg);
+   assert(pkey_reg == shadow_pkey_reg);
 
-   return pkru;
+   return pkey_reg;
 }
 
-#define rdpkru() _rdpkru(__LINE__)
+#define rdpkey_reg() _rdpkey_reg(__LINE__)
 
-static inline void __wrpkru(unsigned int pkru)
+static inline void __wrpkey_reg(u64 pkey_reg)
 {
-   unsigned int eax = pkru;
+#ifdef __i386__ /* arch */
+   unsigned int eax = pkey_reg;
unsigned int ecx = 0;
unsigned int edx = 0;
 
-   dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
+   dprintf4("%s() changing %lx to %lx\n",
+__func__, __rdpkey_reg(), pkey_reg);
asm volatile(".byte 0x0f,0x01,0xef\n\t"
 : : "a" (eax), "c" (ecx), "d" (edx));
-   assert(pkru == __rdpkru());
+   dprintf4("%s() PKRUP after changing %lx to %lx\n",
+   __func__, __rdpkey_reg(), pkey_reg);
+#else /* arch */
+   u64 eax = pkey_reg;
+
+   dprintf4("%s() changing %llx to %llx\n",
+__func__, __rdpkey_reg(), pkey_reg);
+   asm volatile("mtspr 0xd, %0" : : "r" ((unsigned long)(eax)) : "memory");
+   

[RFC v4 13/17] selftest: Move protecton key selftest to arch neutral directory

2017-06-27 Thread Ram Pai
Signed-off-by: Ram Pai 
---
 tools/testing/selftests/vm/Makefile   |1 +
 tools/testing/selftests/vm/pkey-helpers.h |  219 
 tools/testing/selftests/vm/protection_keys.c  | 1395 +
 tools/testing/selftests/x86/Makefile  |2 +-
 tools/testing/selftests/x86/pkey-helpers.h|  219 
 tools/testing/selftests/x86/protection_keys.c | 1395 -
 6 files changed, 1616 insertions(+), 1615 deletions(-)
 create mode 100644 tools/testing/selftests/vm/pkey-helpers.h
 create mode 100644 tools/testing/selftests/vm/protection_keys.c
 delete mode 100644 tools/testing/selftests/x86/pkey-helpers.h
 delete mode 100644 tools/testing/selftests/x86/protection_keys.c

diff --git a/tools/testing/selftests/vm/Makefile 
b/tools/testing/selftests/vm/Makefile
index cbb29e4..1d32f78 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -17,6 +17,7 @@ TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
 TEST_GEN_FILES += mlock-random-test
 TEST_GEN_FILES += virtual_address_range
+TEST_GEN_FILES += protection_keys
 
 TEST_PROGS := run_vmtests
 
diff --git a/tools/testing/selftests/vm/pkey-helpers.h 
b/tools/testing/selftests/vm/pkey-helpers.h
new file mode 100644
index 000..b202939
--- /dev/null
+++ b/tools/testing/selftests/vm/pkey-helpers.h
@@ -0,0 +1,219 @@
+#ifndef _PKEYS_HELPER_H
+#define _PKEYS_HELPER_H
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define NR_PKEYS 16
+#define PKRU_BITS_PER_PKEY 2
+
+#ifndef DEBUG_LEVEL
+#define DEBUG_LEVEL 0
+#endif
+#define DPRINT_IN_SIGNAL_BUF_SIZE 4096
+extern int dprint_in_signal;
+extern char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE];
+static inline void sigsafe_printf(const char *format, ...)
+{
+   va_list ap;
+
+   va_start(ap, format);
+   if (!dprint_in_signal) {
+   vprintf(format, ap);
+   } else {
+   int len = vsnprintf(dprint_in_signal_buffer,
+   DPRINT_IN_SIGNAL_BUF_SIZE,
+   format, ap);
+   /*
+* len is amount that would have been printed,
+* but actual write is truncated at BUF_SIZE.
+*/
+   if (len > DPRINT_IN_SIGNAL_BUF_SIZE)
+   len = DPRINT_IN_SIGNAL_BUF_SIZE;
+   write(1, dprint_in_signal_buffer, len);
+   }
+   va_end(ap);
+}
+#define dprintf_level(level, args...) do { \
+   if (level <= DEBUG_LEVEL)   \
+   sigsafe_printf(args);   \
+   fflush(NULL);   \
+} while (0)
+#define dprintf0(args...) dprintf_level(0, args)
+#define dprintf1(args...) dprintf_level(1, args)
+#define dprintf2(args...) dprintf_level(2, args)
+#define dprintf3(args...) dprintf_level(3, args)
+#define dprintf4(args...) dprintf_level(4, args)
+
+extern unsigned int shadow_pkru;
+static inline unsigned int __rdpkru(void)
+{
+   unsigned int eax, edx;
+   unsigned int ecx = 0;
+   unsigned int pkru;
+
+   asm volatile(".byte 0x0f,0x01,0xee\n\t"
+: "=a" (eax), "=d" (edx)
+: "c" (ecx));
+   pkru = eax;
+   return pkru;
+}
+
+static inline unsigned int _rdpkru(int line)
+{
+   unsigned int pkru = __rdpkru();
+
+   dprintf4("rdpkru(line=%d) pkru: %x shadow: %x\n",
+   line, pkru, shadow_pkru);
+   assert(pkru == shadow_pkru);
+
+   return pkru;
+}
+
+#define rdpkru() _rdpkru(__LINE__)
+
+static inline void __wrpkru(unsigned int pkru)
+{
+   unsigned int eax = pkru;
+   unsigned int ecx = 0;
+   unsigned int edx = 0;
+
+   dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
+   asm volatile(".byte 0x0f,0x01,0xef\n\t"
+: : "a" (eax), "c" (ecx), "d" (edx));
+   assert(pkru == __rdpkru());
+}
+
+static inline void wrpkru(unsigned int pkru)
+{
+   dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
+   /* will do the shadow check for us: */
+   rdpkru();
+   __wrpkru(pkru);
+   shadow_pkru = pkru;
+   dprintf4("%s(%08x) pkru: %08x\n", __func__, pkru, __rdpkru());
+}
+
+/*
+ * These are technically racy. since something could
+ * change PKRU between the read and the write.
+ */
+static inline void __pkey_access_allow(int pkey, int do_allow)
+{
+   unsigned int pkru = rdpkru();
+   int bit = pkey * 2;
+
+   if (do_allow)
+   pkru &= (1<

[RFC v4 12/17] powerpc: Deliver SEGV signal on pkey violation

2017-06-27 Thread Ram Pai
The value of the AMR register at the time of exception
is made available in gp_regs[PT_AMR] of the siginfo.

The value of the pkey, whose protection got violated,
is made available in si_pkey field of the siginfo structure.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/paca.h|  1 +
 arch/powerpc/include/uapi/asm/ptrace.h |  3 ++-
 arch/powerpc/kernel/asm-offsets.c  |  5 
 arch/powerpc/kernel/exceptions-64s.S   | 16 +--
 arch/powerpc/kernel/signal_32.c|  5 
 arch/powerpc/kernel/signal_64.c|  4 +++
 arch/powerpc/kernel/traps.c| 49 ++
 arch/powerpc/mm/fault.c|  2 ++
 8 files changed, 82 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 1c09f8f..a41afd3 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -92,6 +92,7 @@ struct paca_struct {
struct dtl_entry *dispatch_log_end;
 #endif /* CONFIG_PPC_STD_MMU_64 */
u64 dscr_default;   /* per-CPU default DSCR */
+   u64 paca_amr;   /* value of amr at exception */
 
 #ifdef CONFIG_PPC_STD_MMU_64
/*
diff --git a/arch/powerpc/include/uapi/asm/ptrace.h 
b/arch/powerpc/include/uapi/asm/ptrace.h
index 8036b38..7ec2428 100644
--- a/arch/powerpc/include/uapi/asm/ptrace.h
+++ b/arch/powerpc/include/uapi/asm/ptrace.h
@@ -108,8 +108,9 @@ struct pt_regs {
 #define PT_DAR 41
 #define PT_DSISR 42
 #define PT_RESULT 43
-#define PT_DSCR 44
 #define PT_REGS_COUNT 44
+#define PT_DSCR 44
+#define PT_AMR 45
 
 #define PT_FPR048  /* each FP reg occupies 2 slots in this space */
 
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 709e234..17f5d8a 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -241,6 +241,11 @@ int main(void)
OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   OFFSET(PACA_AMR, paca_struct, paca_amr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime);
OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user);
OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime);
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 3fd0528..a4de1b4 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -493,9 +493,15 @@ EXC_COMMON_BEGIN(data_access_common)
ld  r12,_MSR(r1)
ld  r3,PACA_EXGEN+EX_DAR(r13)
lwz r4,PACA_EXGEN+EX_DSISR(r13)
-   li  r5,0x300
std r3,_DAR(r1)
std r4,_DSISR(r1)
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   andis.  r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
+   beq+1f
+   mfspr   r5,SPRN_AMR
+   std r5,PACA_AMR(r13)
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+1: li  r5,0x300
 BEGIN_MMU_FTR_SECTION
b   do_hash_page/* Try to handle as hpte fault */
 MMU_FTR_SECTION_ELSE
@@ -561,9 +567,15 @@ EXC_COMMON_BEGIN(instruction_access_common)
ld  r12,_MSR(r1)
ld  r3,_NIP(r1)
andis.  r4,r12,0x5820
-   li  r5,0x400
std r3,_DAR(r1)
std r4,_DSISR(r1)
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   andis.  r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
+   beq+1f
+   mfspr   r5,SPRN_AMR
+   std r5,PACA_AMR(r13)
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+1: li  r5,0x400
 BEGIN_MMU_FTR_SECTION
b   do_hash_page/* Try to handle as hpte fault */
 MMU_FTR_SECTION_ELSE
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 97bb138..9c4a7f3 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -500,6 +500,11 @@ static int save_user_regs(struct pt_regs *regs, struct 
mcontext __user *frame,
   (unsigned long) &frame->tramp[2]);
}
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   if (__put_user(get_paca()->paca_amr, >mc_gregs[PT_AMR]))
+   return 1;
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
return 0;
 }
 
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index c83c115..86a4262 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -174,6 +174,10 @@ static long setup_sigcontext(struct sigcontext __user *sc,
if (set != NULL)
	err |=  __put_user(set->sig[0], &sc->oldmask);
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   err |= __put_user(get_paca()->paca_amr, 

[RFC v4 11/17] powerpc: Handle exceptions caused by pkey violation

2017-06-27 Thread Ram Pai
Handle Data and Instruction exceptions caused by memory
protection keys.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/mmu_context.h | 12 ++
 arch/powerpc/include/asm/reg.h |  2 +-
 arch/powerpc/mm/fault.c| 20 +
 arch/powerpc/mm/pkeys.c| 79 ++
 4 files changed, 112 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index da7e943..71fffe0 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
 {
 }
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+bool arch_pte_access_permitted(pte_t pte, bool write);
+bool arch_vma_access_permitted(struct vm_area_struct *vma,
+   bool write, bool execute, bool foreign);
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+   /* by default, allow everything */
+   return true;
+}
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
bool write, bool execute, bool foreign)
 {
/* by default, allow everything */
return true;
 }
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index ba110dd..6e2a860 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -286,7 +286,7 @@
 #define   DSISR_SET_RC 0x0004  /* Failed setting of R/C bits */
 #define   DSISR_PGDIRFAULT  0x0002  /* Fault on page directory */
 #define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | DSISR_PAGEATTR_CONFLT | \
-   DSISR_BADACCESS | DSISR_BIT43)
+   DSISR_BADACCESS | DSISR_KEYFAULT | DSISR_BIT43)
 #define SPRN_TBRL  0x10C   /* Time Base Read Lower Register (user, R/O) */
 #define SPRN_TBRU  0x10D   /* Time Base Read Upper Register (user, R/O) */
 #define SPRN_CIR   0x11B   /* Chip Information Register (hyper, R/0) */
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 3a7d580..3d71984 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -261,6 +261,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long 
address,
}
 #endif
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   if (error_code & DSISR_KEYFAULT) {
+   code = SEGV_PKUERR;
+   goto bad_area_nosemaphore;
+   }
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
/* We restore the interrupt state now */
if (!arch_irq_disabled_regs(regs))
local_irq_enable();
@@ -441,6 +448,19 @@ int do_page_fault(struct pt_regs *regs, unsigned long 
address,
WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
 #endif /* CONFIG_PPC_STD_MMU */
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+   is_exec, 0)) {
+   code = SEGV_PKUERR;
+   goto bad_area;
+   }
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+   /* handle_mm_fault() needs to know if its a instruction access
+* fault.
+*/
+   if (is_exec)
+   flags |= FAULT_FLAG_INSTRUCTION;
/*
 * If for any reason at all we couldn't handle the fault,
 * make sure we exit gracefully rather than endlessly redo
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 11a32b3..514f503 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -27,6 +27,37 @@ static inline bool pkey_allows_readwrite(int pkey)
return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
 }
 
+static inline bool pkey_allows_read(int pkey)
+{
+   int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+
+   if (!(read_uamor() & (0x3ul << pkey_shift)))
+   return true;
+
+   return !(read_amr() & (AMR_AD_BIT << pkey_shift));
+}
+
+static inline bool pkey_allows_write(int pkey)
+{
+   int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+
+   if (!(read_uamor() & (0x3ul << pkey_shift)))
+   return true;
+
+   return !(read_amr() & (AMR_WD_BIT << pkey_shift));
+}
+
+static inline bool pkey_allows_execute(int pkey)
+{
+   int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+
+   if (!(read_uamor() & (0x3ul << pkey_shift)))
+   return true;
+
+   return !(read_iamr() & (IAMR_EX_BIT << pkey_shift));
+}
+
+
 /*
  * set the access right in AMR IAMR and UAMOR register
  * for @pkey to that specified in @init_val.
@@ -175,3 +206,51 @@ int __arch_override_mprotect_pkey(struct vm_area_struct 
*vma, int prot,
 */
return 

[RFC v4 10/17] powerpc: Macro the mask used for checking DSI exception

2017-06-27 Thread Ram Pai
Replace the magic number used to check for a DSI exception
with a meaningful value.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/reg.h   | 7 ++-
 arch/powerpc/kernel/exceptions-64s.S | 2 +-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 7e50e47..ba110dd 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -272,16 +272,21 @@
 #define SPRN_DAR   0x013   /* Data Address Register */
 #define SPRN_DBCR  0x136   /* e300 Data Breakpoint Control Reg */
 #define SPRN_DSISR 0x012   /* Data Storage Interrupt Status Register */
+#define   DSISR_BIT32  0x8000  /* not defined */
 #define   DSISR_NOHPTE 0x4000  /* no translation found */
+#define   DSISR_PAGEATTR_CONFLT0x2000  /* page attribute 
conflict */
+#define   DSISR_BIT35  0x1000  /* not defined */
 #define   DSISR_PROTFAULT  0x0800  /* protection fault */
 #define   DSISR_BADACCESS  0x0400  /* bad access to CI or G */
 #define   DSISR_ISSTORE0x0200  /* access was a store */
 #define   DSISR_DABRMATCH  0x0040  /* hit data breakpoint */
-#define   DSISR_NOSEGMENT  0x0020  /* SLB miss */
 #define   DSISR_KEYFAULT   0x0020  /* Key fault */
+#define   DSISR_BIT43  0x0010  /* not defined */
 #define   DSISR_UNSUPP_MMU 0x0008  /* Unsupported MMU config */
 #define   DSISR_SET_RC 0x0004  /* Failed setting of R/C bits */
 #define   DSISR_PGDIRFAULT  0x0002  /* Fault on page directory */
+#define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | DSISR_PAGEATTR_CONFLT | \
+   DSISR_BADACCESS | DSISR_BIT43)
 #define SPRN_TBRL  0x10C   /* Time Base Read Lower Register (user, R/O) */
 #define SPRN_TBRU  0x10D   /* Time Base Read Upper Register (user, R/O) */
 #define SPRN_CIR   0x11B   /* Chip Information Register (hyper, R/0) */
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index ae418b8..3fd0528 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1411,7 +1411,7 @@ USE_TEXT_SECTION()
.balign IFETCH_ALIGN_BYTES
 do_hash_page:
 #ifdef CONFIG_PPC_STD_MMU_64
-   andis.  r0,r4,0xa410/* weird error? */
+   andis.  r0,r4,DSISR_PAGE_FAULT_MASK@h
bne-handle_page_fault   /* if not, try to insert a HPTE */
andis.  r0,r4,DSISR_DABRMATCH@h
bne-handle_dabr_fault
-- 
1.8.3.1



[RFC v4 09/17] powerpc: call the hash functions with the correct pkey value

2017-06-27 Thread Ram Pai
Pass the correct protection key value to the hash functions on
page fault.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/pkeys.h | 11 +++
 arch/powerpc/mm/hash_utils_64.c  |  4 
 arch/powerpc/mm/mem.c|  6 ++
 3 files changed, 21 insertions(+)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index ef1c601..1370b3f 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -74,6 +74,17 @@ static inline bool mm_pkey_is_allocated(struct mm_struct 
*mm, int pkey)
 }
 
 /*
+ * return the protection key of the vma corresponding to the
+ * given effective address @ea.
+ */
+static inline int mm_pkey(struct mm_struct *mm, unsigned long ea)
+{
+   struct vm_area_struct *vma = find_vma(mm, ea);
+   int pkey = vma ? vma_pkey(vma) : 0;
+   return pkey;
+}
+
+/*
  * Returns a positive, 5-bit key on success, or -1 on failure.
  */
 static inline int mm_pkey_alloc(struct mm_struct *mm)
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 7e67dea..403f75d 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1319,6 +1319,10 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
goto bail;
}
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   pkey = mm_pkey(mm, ea);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
if (hugeshift) {
if (is_thp)
rc = __hash_page_thp(ea, access, vsid, (pmd_t *)ptep,
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index ec890d3..0fcaa48 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -541,8 +541,14 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned 
long address,
return;
}
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   hash_preload_pkey(vma->vm_mm, address, access, trap, vma_pkey(vma));
+#else
hash_preload(vma->vm_mm, address, access, trap);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 #endif /* CONFIG_PPC_STD_MMU */
+
 #if (defined(CONFIG_PPC_BOOK3E_64) || defined(CONFIG_PPC_FSL_BOOK3E)) \
&& defined(CONFIG_HUGETLB_PAGE)
if (is_vm_hugetlb_page(vma))
-- 
1.8.3.1



[RFC v4 08/17] powerpc: Program HPTE key protection bits

2017-06-27 Thread Ram Pai
Map the PTE protection key bits to the HPTE key protection bits
while creating HPTE entries.
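
As a quick illustration of the bit placement, here is a standalone
userspace re-implementation of the mapping (this is not kernel code;
the HPTE_R_KEY_BIT* values mirror the mmu-hash.h definitions added
below, written out in full here):

	#include <stdio.h>

	#define HPTE_R_KEY_BIT0	0x2000000000000000UL	/* part of HPTE_R_KEY_HI */
	#define HPTE_R_KEY_BIT1	0x1000000000000000UL
	#define HPTE_R_KEY_BIT2	0x0000000000000800UL	/* part of HPTE_R_KEY_LO */
	#define HPTE_R_KEY_BIT3	0x0000000000000400UL
	#define HPTE_R_KEY_BIT4	0x0000000000000200UL

	static unsigned long pkey_to_hpte_pkey_bits(int pkey)
	{
		return	((pkey & 0x10) ? HPTE_R_KEY_BIT0 : 0UL) |
			((pkey & 0x08) ? HPTE_R_KEY_BIT1 : 0UL) |
			((pkey & 0x04) ? HPTE_R_KEY_BIT2 : 0UL) |
			((pkey & 0x02) ? HPTE_R_KEY_BIT3 : 0UL) |
			((pkey & 0x01) ? HPTE_R_KEY_BIT4 : 0UL);
	}

	int main(void)
	{
		/* pkey 5 = 0b00101 -> HPTE_R_KEY_BIT2 | HPTE_R_KEY_BIT4 = 0xa00 */
		printf("pkey 5 -> 0x%lx\n", pkey_to_hpte_pkey_bits(5));
		return 0;
	}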

Signed-off-by: Ram Pai 
---
 Makefile  | 2 +-
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 +
 arch/powerpc/include/asm/pkeys.h  | 9 +
 arch/powerpc/mm/hash_utils_64.c   | 4 
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 470bd4d..141ea4e 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 VERSION = 4
 PATCHLEVEL = 12
 SUBLEVEL = 0
-EXTRAVERSION = -rc3
+EXTRAVERSION = -rc3-64k
 NAME = Fearless Coyote
 
 # *DOCUMENTATION*
diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index aa3c299..721a4c3 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -90,6 +90,8 @@
 #define HPTE_R_PP0 ASM_CONST(0x8000)
 #define HPTE_R_TS  ASM_CONST(0x4000)
 #define HPTE_R_KEY_HI  ASM_CONST(0x3000)
+#define HPTE_R_KEY_BIT0ASM_CONST(0x2000)
+#define HPTE_R_KEY_BIT1ASM_CONST(0x1000)
 #define HPTE_R_RPN_SHIFT   12
 #define HPTE_R_RPN ASM_CONST(0x0000)
 #define HPTE_R_RPN_3_0 ASM_CONST(0x01fff000)
@@ -104,6 +106,9 @@
 #define HPTE_R_C   ASM_CONST(0x0080)
 #define HPTE_R_R   ASM_CONST(0x0100)
 #define HPTE_R_KEY_LO  ASM_CONST(0x0e00)
+#define HPTE_R_KEY_BIT2ASM_CONST(0x0800)
+#define HPTE_R_KEY_BIT3ASM_CONST(0x0400)
+#define HPTE_R_KEY_BIT4ASM_CONST(0x0200)
 
 #define HPTE_V_1TB_SEG ASM_CONST(0x4000)
 #define HPTE_V_VRMA_MASK   ASM_CONST(0x4001ff00)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 41bf5d4..ef1c601 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -23,6 +23,15 @@ static inline unsigned long  pkey_to_vmflag_bits(int pkey)
((pkey & 0x10UL) ? VM_PKEY_BIT4 : 0x0UL));
 }
 
+static inline unsigned long  pkey_to_hpte_pkey_bits(int pkey)
+{
+   return  (((pkey & 0x10) ? HPTE_R_KEY_BIT0 : 0x0UL) |
+   ((pkey & 0x8) ? HPTE_R_KEY_BIT1 : 0x0UL) |
+   ((pkey & 0x4) ? HPTE_R_KEY_BIT2 : 0x0UL) |
+   ((pkey & 0x2) ? HPTE_R_KEY_BIT3 : 0x0UL) |
+   ((pkey & 0x1) ? HPTE_R_KEY_BIT4 : 0x0UL));
+}
+
 /*
  * Bits are in BE format.
  * NOTE: key 31, 1, 0 are not used.
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 2254ff0..7e67dea 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -231,6 +231,10 @@ unsigned long htab_convert_pte_flags(unsigned long 
pteflags, int pkey)
 */
rflags |= HPTE_R_M;
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   rflags |= pkey_to_hpte_pkey_bits(pkey);
+#endif
+
return rflags;
 }
 
-- 
1.8.3.1



[RFC v4 07/17] powerpc: make the hash functions protection-key aware

2017-06-27 Thread Ram Pai
Prepare the hash functions to be aware of protection keys.
This key will later be used to program the HPTE.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/hash.h |  2 +-
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 14 ++-
 arch/powerpc/mm/hash64_4k.c   |  4 ++--
 arch/powerpc/mm/hash64_64k.c  |  8 +++
 arch/powerpc/mm/hash_utils_64.c   | 34 ++-
 arch/powerpc/mm/hugepage-hash64.c |  4 ++--
 arch/powerpc/mm/hugetlbpage-hash64.c  |  5 ++--
 arch/powerpc/mm/mem.c |  1 +
 arch/powerpc/mm/mmu_decl.h|  5 +++-
 9 files changed, 48 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index 4e957b0..3c1ef01 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -92,7 +92,7 @@ static inline int hash__pgd_bad(pgd_t pgd)
 
 extern void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned long pte, int huge);
-extern unsigned long htab_convert_pte_flags(unsigned long pteflags);
+extern unsigned long htab_convert_pte_flags(unsigned long pteflags, int pkey);
 /* Atomic PTE updates */
 static inline unsigned long hash__pte_update(struct mm_struct *mm,
 unsigned long addr,
diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 6981a52..aa3c299 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -430,11 +430,11 @@ static inline unsigned long hpt_hash(unsigned long vpn,
 #define HPTE_NOHPTE_UPDATE 0x2
 
 extern int __hash_page_4K(unsigned long ea, unsigned long access,
- unsigned long vsid, pte_t *ptep, unsigned long trap,
- unsigned long flags, int ssize, int subpage_prot);
+ unsigned long vsid, pte_t *ptep, unsigned long trap,
+ unsigned long flags, int ssize, int subpage_prot, int pkey);
 extern int __hash_page_64K(unsigned long ea, unsigned long access,
   unsigned long vsid, pte_t *ptep, unsigned long trap,
-  unsigned long flags, int ssize);
+  unsigned long flags, int ssize, int pkey);
 struct mm_struct;
 unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
 extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
@@ -444,16 +444,18 @@ extern int hash_page(unsigned long ea, unsigned long 
access, unsigned long trap,
 unsigned long dsisr);
 int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long 
vsid,
 pte_t *ptep, unsigned long trap, unsigned long flags,
-int ssize, unsigned int shift, unsigned int mmu_psize);
+int ssize, unsigned int shift, unsigned int mmu_psize,
+int pkey);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern int __hash_page_thp(unsigned long ea, unsigned long access,
   unsigned long vsid, pmd_t *pmdp, unsigned long trap,
-  unsigned long flags, int ssize, unsigned int psize);
+  unsigned long flags, int ssize, unsigned int psize,
+  int pkey);
 #else
 static inline int __hash_page_thp(unsigned long ea, unsigned long access,
  unsigned long vsid, pmd_t *pmdp,
  unsigned long trap, unsigned long flags,
- int ssize, unsigned int psize)
+ int ssize, unsigned int psize, int pkey)
 {
BUG();
return -1;
diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
index 6fa450c..6765ba2 100644
--- a/arch/powerpc/mm/hash64_4k.c
+++ b/arch/powerpc/mm/hash64_4k.c
@@ -18,7 +18,7 @@
 
 int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
   pte_t *ptep, unsigned long trap, unsigned long flags,
-  int ssize, int subpg_prot)
+  int ssize, int subpg_prot, int pkey)
 {
unsigned long hpte_group;
unsigned long rflags, pa;
@@ -53,7 +53,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
 * PP bits. _PAGE_USER is already PP bit 0x2, so we only
 * need to add in 0x1 if it's a read-only user page
 */
-   rflags = htab_convert_pte_flags(new_pte);
+   rflags = htab_convert_pte_flags(new_pte, pkey);
 
if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
index 1a68cb1..9ce4d7b 100644

[RFC v4 06/17] powerpc: Implementation for sys_mprotect_pkey() system call

2017-06-27 Thread Ram Pai
This system call associates the pkey with the vma corresponding to
the given address range.
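
Until a libc wrapper exists, userspace can invoke the new syscall
directly. A small sketch, assuming the syscall number wired up by this
patch (error handling omitted):

	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef __NR_pkey_mprotect
	#define __NR_pkey_mprotect 386	/* powerpc number, as added by this patch */
	#endif

	static int pkey_mprotect(void *addr, size_t len, int prot, int pkey)
	{
		/* associate @pkey with the vma(s) covering [addr, addr + len) */
		return syscall(__NR_pkey_mprotect, addr, len, prot, pkey);
	}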

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/mman.h|  8 ++-
 arch/powerpc/include/asm/pkeys.h   | 17 ++-
 arch/powerpc/include/asm/systbl.h  |  1 +
 arch/powerpc/include/asm/unistd.h  |  4 +-
 arch/powerpc/include/uapi/asm/unistd.h |  1 +
 arch/powerpc/mm/pkeys.c| 93 +-
 6 files changed, 117 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 30922f6..067eec2 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -13,6 +13,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -22,7 +23,12 @@
 static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
unsigned long pkey)
 {
-   return (prot & PROT_SAO) ? VM_SAO : 0;
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   return (((prot & PROT_SAO) ? VM_SAO : 0) |
+   pkey_to_vmflag_bits(pkey));
+#else
+   return ((prot & PROT_SAO) ? VM_SAO : 0);
+#endif
 }
 #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
 
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 7bc8746..41bf5d4 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -14,6 +14,15 @@
VM_PKEY_BIT3 | \
VM_PKEY_BIT4)
 
+static inline unsigned long  pkey_to_vmflag_bits(int pkey)
+{
+   return (((pkey & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) |
+   ((pkey & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) |
+   ((pkey & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) |
+   ((pkey & 0x8UL) ? VM_PKEY_BIT3 : 0x0UL) |
+   ((pkey & 0x10UL) ? VM_PKEY_BIT4 : 0x0UL));
+}
+
 /*
  * Bits are in BE format.
  * NOTE: key 31, 1, 0 are not used.
@@ -42,6 +51,12 @@
 #define mm_set_pkey_is_reserved(mm, pkey) (PKEY_INITIAL_ALLOCAION & \
pkeybit_mask(pkey))
 
+
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+   return (vma->vm_flags & ARCH_VM_PKEY_FLAGS) >> VM_PKEY_SHIFT;
+}
+
 static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
 {
/* a reserved key is never considered as 'explicitly allocated' */
@@ -114,7 +129,7 @@ static inline int arch_set_user_pkey_access(struct 
task_struct *tsk, int pkey,
return __arch_set_user_pkey_access(tsk, pkey, init_val);
 }
 
-static inline pkey_mm_init(struct mm_struct *mm)
+static inline void pkey_mm_init(struct mm_struct *mm)
 {
mm_pkey_allocation_map(mm) = PKEY_INITIAL_ALLOCAION;
/* -1 means unallocated or invalid */
diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 22dd776..b33b551 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -390,3 +390,4 @@
 SYSCALL(statx)
 SYSCALL(pkey_alloc)
 SYSCALL(pkey_free)
+SYSCALL(pkey_mprotect)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index e0273bc..daf1ba9 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,12 +12,10 @@
 #include 
 
 
-#define NR_syscalls386
+#define NR_syscalls387
 
 #define __NR__exit __NR_exit
 
-#define __IGNORE_pkey_mprotect
-
 #ifndef __ASSEMBLY__
 
 #include 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index 7993a07..71ae45e 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -396,5 +396,6 @@
 #define __NR_statx 383
 #define __NR_pkey_alloc384
 #define __NR_pkey_free 385
+#define __NR_pkey_mprotect 386
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index b97366e..11a32b3 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -15,6 +15,17 @@
 #include /* PKEY_*   */
 #include 
 
+#define pkeyshift(pkey) ((arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY)
+
+static inline bool pkey_allows_readwrite(int pkey)
+{
+   int pkey_shift = pkeyshift(pkey);
+
+   if (!(read_uamor() & (0x3UL << pkey_shift)))
+   return true;
+
+   return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
+}
 
 /*
  * set the access right in AMR IAMR and UAMOR register
@@ -68,7 +79,60 @@ int __arch_set_user_pkey_access(struct task_struct *tsk, int 
pkey,
 
 int __execute_only_pkey(struct mm_struct *mm)
 {
-   return -1;
+   bool need_to_set_mm_pkey = false;
+   int execute_only_pkey = mm->context.execute_only_pkey;
+   int ret;
+
+   /* Do we need to assign a pkey for mm's execute-only maps? */
+   if (execute_only_pkey == -1) {
+   /* Go allocate one to use, which might 

[RFC v4 05/17] powerpc: store and restore the pkey state across context switches

2017-06-27 Thread Ram Pai
Store and restore the AMR, IAMR and UAMOR register state of the task
before scheduling out and after scheduling in, respectively.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/processor.h |  5 +
 arch/powerpc/kernel/process.c| 18 ++
 2 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index a2123f2..1f714df 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -310,6 +310,11 @@ struct thread_struct {
struct thread_vr_state ckvr_state; /* Checkpointed VR state */
unsigned long   ckvrsave; /* Checkpointed VRSAVE */
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   unsigned long   amr;
+   unsigned long   iamr;
+   unsigned long   uamor;
+#endif
 #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
void*   kvm_shadow_vcpu; /* KVM internal data */
 #endif /* CONFIG_KVM_BOOK3S_32_HANDLER */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index baae104..37d001a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1096,6 +1096,11 @@ static inline void save_sprs(struct thread_struct *t)
t->tar = mfspr(SPRN_TAR);
}
 #endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   t->amr = mfspr(SPRN_AMR);
+   t->iamr = mfspr(SPRN_IAMR);
+   t->uamor = mfspr(SPRN_UAMOR);
+#endif
 }
 
 static inline void restore_sprs(struct thread_struct *old_thread,
@@ -1131,6 +1136,14 @@ static inline void restore_sprs(struct thread_struct 
*old_thread,
mtspr(SPRN_TAR, new_thread->tar);
}
 #endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   if (old_thread->amr != new_thread->amr)
+   mtspr(SPRN_AMR, new_thread->amr);
+   if (old_thread->iamr != new_thread->iamr)
+   mtspr(SPRN_IAMR, new_thread->iamr);
+   if (old_thread->uamor != new_thread->uamor)
+   mtspr(SPRN_UAMOR, new_thread->uamor);
+#endif
 }
 
 struct task_struct *__switch_to(struct task_struct *prev,
@@ -1686,6 +1699,11 @@ void start_thread(struct pt_regs *regs, unsigned long 
start, unsigned long sp)
current->thread.tm_texasr = 0;
current->thread.tm_tfiar = 0;
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   current->thread.amr   = 0x0ul;
+   current->thread.iamr  = 0x0ul;
+   current->thread.uamor = 0x0ul;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 }
 EXPORT_SYMBOL(start_thread);
 
-- 
1.8.3.1



[RFC v4 04/17] powerpc: Implement sys_pkey_alloc and sys_pkey_free system call

2017-06-27 Thread Ram Pai
sys_pkey_alloc() allocates and returns an available pkey.
sys_pkey_free() frees up the pkey.

A total of 32 keys are supported on powerpc. However, pkeys 0, 1 and 31
are reserved, so effectively we have 29 pkeys.

Each key can be initialized to disable read, write and execute
permissions. Unlike x86, powerpc also allows a key to be initialized
to disable execute.
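
A hedged userspace sketch of allocating and freeing a key with execute
disabled (raw syscall numbers are used since glibc wrappers may not
exist yet; the numbers are the powerpc ones added by this patch, and
PKEY_DISABLE_EXECUTE comes from the earlier uapi patch in this series):

	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef __NR_pkey_alloc
	#define __NR_pkey_alloc		384	/* powerpc numbers from this patch */
	#define __NR_pkey_free		385
	#endif
	#ifndef PKEY_DISABLE_EXECUTE
	#define PKEY_DISABLE_EXECUTE	0x4
	#endif

	int main(void)
	{
		/* allocate a key that starts out with execute disabled */
		long pkey = syscall(__NR_pkey_alloc, 0, PKEY_DISABLE_EXECUTE);

		if (pkey < 0)
			return 1;

		/* ... attach it to a mapping with pkey_mprotect() ... */

		syscall(__NR_pkey_free, pkey);
		return 0;
	}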

Signed-off-by: Ram Pai 
---
 arch/powerpc/Kconfig |  15 
 arch/powerpc/include/asm/book3s/64/mmu.h |  10 +++
 arch/powerpc/include/asm/book3s/64/pgtable.h |  62 ++
 arch/powerpc/include/asm/pkeys.h | 124 +++
 arch/powerpc/include/asm/systbl.h|   2 +
 arch/powerpc/include/asm/unistd.h|   4 +-
 arch/powerpc/include/uapi/asm/unistd.h   |   2 +
 arch/powerpc/mm/Makefile |   1 +
 arch/powerpc/mm/mmu_context_book3s64.c   |   5 ++
 arch/powerpc/mm/pkeys.c  |  88 +++
 10 files changed, 310 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/include/asm/pkeys.h
 create mode 100644 arch/powerpc/mm/pkeys.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f7c8f99..81202e5 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -871,6 +871,21 @@ config SECCOMP
 
  If unsure, say Y. Only embedded should say N here.
 
+config PPC64_MEMORY_PROTECTION_KEYS
+   prompt "PowerPC Memory Protection Keys"
+   def_bool y
+   # Note: only available in 64-bit mode
+   depends on PPC64
+   select ARCH_USES_HIGH_VMA_FLAGS
+   select ARCH_HAS_PKEYS
+   ---help---
+ Memory Protection Keys provides a mechanism for enforcing
+ page-based protections, but without requiring modification of the
+ page tables when an application changes protection domains.
+
+ For details, see Documentation/powerpc/protection-keys.txt
+
+ If unsure, say y.
 endmenu
 
 config ISA_DMA_API
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index 77529a3..0c0a2a8 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -108,6 +108,16 @@ struct patb_entry {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
struct list_head iommu_group_mem_list;
 #endif
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   /*
+* Each bit represents one protection key.
+* bit set   -> key allocated
+* bit unset -> key available for allocation
+*/
+   u32 pkey_allocation_map;
+   s16 execute_only_pkey; /* key holding execute-only protection */
+#endif
 } mm_context_t;
 
 /*
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 85bc987..87e9a89 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -428,6 +428,68 @@ static inline void huge_ptep_set_wrprotect(struct 
mm_struct *mm,
pte_update(mm, addr, ptep, 0, _PAGE_PRIVILEGED, 1);
 }
 
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+
+#include 
+static inline u64 read_amr(void)
+{
+   return mfspr(SPRN_AMR);
+}
+static inline void write_amr(u64 value)
+{
+   mtspr(SPRN_AMR, value);
+}
+static inline u64 read_iamr(void)
+{
+   return mfspr(SPRN_IAMR);
+}
+static inline void write_iamr(u64 value)
+{
+   mtspr(SPRN_IAMR, value);
+}
+static inline u64 read_uamor(void)
+{
+   return mfspr(SPRN_UAMOR);
+}
+static inline void write_uamor(u64 value)
+{
+   mtspr(SPRN_UAMOR, value);
+}
+
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+static inline u64 read_amr(void)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+   return -1;
+}
+static inline void write_amr(u64 value)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+static inline u64 read_uamor(void)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+   return -1;
+}
+static inline void write_uamor(u64 value)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+static inline u64 read_iamr(void)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+   return -1;
+}
+static inline void write_iamr(u64 value)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
   unsigned long addr, pte_t *ptep)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
new file mode 100644
index 000..7bc8746
--- /dev/null
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -0,0 +1,124 @@
+#ifndef _ASM_PPC64_PKEYS_H
+#define _ASM_PPC64_PKEYS_H
+
+
+#define arch_max_pkey()  32
+
+#define 

[RFC v4 03/17] x86: key creation with PKEY_DISABLE_EXECUTE disallowed

2017-06-27 Thread Ram Pai
x86 does not support disabling execute permissions on a pkey.

Signed-off-by: Ram Pai 
---
 arch/x86/kernel/fpu/xstate.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c24ac1e..d582631 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -900,6 +900,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int 
pkey,
if (!boot_cpu_has(X86_FEATURE_OSPKE))
return -EINVAL;
 
+   if (init_val & PKEY_DISABLE_EXECUTE)
+   return -EINVAL;
+
/* Set the bits we need in PKRU:  */
if (init_val & PKEY_DISABLE_ACCESS)
new_pkru_bits |= PKRU_AD_BIT;
-- 
1.8.3.1



[RFC v4 02/17] mm: ability to disable execute permission on a key at creation

2017-06-27 Thread Ram Pai
Currently sys_pkey_alloc() provides the ability to disable read
and write permissions on a key at creation time. powerpc has the
hardware support to disable execute on a pkey as well. This patch
enhances the interface to allow disabling execute at key creation
time. x86 does not allow this, hence the next patch will add the
ability for x86 to return an error if PKEY_DISABLE_EXECUTE is
specified.

Signed-off-by: Ram Pai 
---
 include/uapi/asm-generic/mman-common.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/uapi/asm-generic/mman-common.h 
b/include/uapi/asm-generic/mman-common.h
index 8c27db0..bf4fa07 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -74,7 +74,9 @@
 
 #define PKEY_DISABLE_ACCESS0x1
 #define PKEY_DISABLE_WRITE 0x2
+#define PKEY_DISABLE_EXECUTE   0x4
 #define PKEY_ACCESS_MASK   (PKEY_DISABLE_ACCESS |\
-PKEY_DISABLE_WRITE)
+PKEY_DISABLE_WRITE  |\
+PKEY_DISABLE_EXECUTE)
 
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
-- 
1.8.3.1



[RFC v4 01/17] mm: introduce an additional vma bit for powerpc pkey

2017-06-27 Thread Ram Pai
Currently there are only 4 bits in the vma flags to support 16 keys
on x86. powerpc supports 32 keys, which need 5 bits. This patch
introduces an additional bit in the vma flags.

Signed-off-by: Ram Pai 
---
 fs/proc/task_mmu.c |  6 +-
 include/linux/mm.h | 18 +-
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f0c8b33..2ddc298 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -666,12 +666,16 @@ static void show_smap_vma_flags(struct seq_file *m, 
struct vm_area_struct *vma)
[ilog2(VM_MERGEABLE)]   = "mg",
[ilog2(VM_UFFD_MISSING)]= "um",
[ilog2(VM_UFFD_WP)] = "uw",
-#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+#ifdef CONFIG_ARCH_HAS_PKEYS
/* These come out via ProtectionKey: */
[ilog2(VM_PKEY_BIT0)]   = "",
[ilog2(VM_PKEY_BIT1)]   = "",
[ilog2(VM_PKEY_BIT2)]   = "",
[ilog2(VM_PKEY_BIT3)]   = "",
+#endif /* CONFIG_ARCH_HAS_PKEYS */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   /* Additional bit in ProtectionKey: */
+   [ilog2(VM_PKEY_BIT4)]   = "",
 #endif
};
size_t i;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7cb17c6..3d35bcc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -208,21 +208,29 @@ extern int overcommit_kbytes_handler(struct ctl_table *, 
int, void __user *,
 #define VM_HIGH_ARCH_BIT_1 33  /* bit only usable on 64-bit 
architectures */
 #define VM_HIGH_ARCH_BIT_2 34  /* bit only usable on 64-bit 
architectures */
 #define VM_HIGH_ARCH_BIT_3 35  /* bit only usable on 64-bit 
architectures */
+#define VM_HIGH_ARCH_BIT_4 36  /* bit only usable on 64-bit arch */
 #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
+#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
-#if defined(CONFIG_X86)
-# define VM_PATVM_ARCH_1   /* PAT reserves whole VMA at 
once (x86) */
-#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
+#ifdef CONFIG_ARCH_HAS_PKEYS
 # define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
-# define VM_PKEY_BIT0  VM_HIGH_ARCH_0  /* A protection key is a 4-bit value */
+# define VM_PKEY_BIT0  VM_HIGH_ARCH_0
 # define VM_PKEY_BIT1  VM_HIGH_ARCH_1
 # define VM_PKEY_BIT2  VM_HIGH_ARCH_2
 # define VM_PKEY_BIT3  VM_HIGH_ARCH_3
-#endif
+#endif /* CONFIG_ARCH_HAS_PKEYS */
+
+#if defined(CONFIG_PPC64_MEMORY_PROTECTION_KEYS)
+# define VM_PKEY_BIT4  VM_HIGH_ARCH_4 /* additional key bit used on ppc64 */
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+
+#if defined(CONFIG_X86)
+# define VM_PATVM_ARCH_1   /* PAT reserves whole VMA at 
once (x86) */
 #elif defined(CONFIG_PPC)
 # define VM_SAOVM_ARCH_1   /* Strong Access Ordering 
(powerpc) */
 #elif defined(CONFIG_PARISC)
-- 
1.8.3.1



[RFC v4 00/17] powerpc: Memory Protection Keys

2017-06-27 Thread Ram Pai
Memory protection keys enable applications to protect their
address space from inadvertent access or corruption by
themselves.

The overall idea:

 A process allocates a key and associates it with
 an address range within its address space.
 The process can then dynamically set read/write
 permissions on the key without involving the
 kernel. Any code that violates the permissions
 of the address space, as defined by its associated
 key, will receive a segmentation fault.
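
A minimal end-to-end sketch of the idea, using raw syscall numbers as
wired up for powerpc in this series (the numbers, flag value and the
fault behaviour are assumptions based on the individual patches; error
handling omitted, and the final store deliberately faults):

	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		/* 1. allocate a key with write access disabled */
		long pkey = syscall(384 /* __NR_pkey_alloc */, 0,
				    0x2 /* PKEY_DISABLE_WRITE */);

		/* 2. associate the key with the mapping */
		syscall(386 /* __NR_pkey_mprotect */, buf, 4096,
			PROT_READ | PROT_WRITE, pkey);

		/* 3. any store now takes a key fault:
		 *    SIGSEGV with si_code == SEGV_PKUERR */
		buf[0] = 1;

		return 0;
	}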

This patch series enables the feature on PPC64 HPTE
platform.

ISA3.0 section 5.7.13 describes the detailed specifications.


Testing:
This patch series has passed all the protection key
tests available in  the selftests directory.
The tests are updated to work on both x86 and powerpc.

version v4:
(1) patches no longer depend on the pte bits to program
the hpte -- comment by Balbir
(2) documentation updates
(3) fixed a bug in the selftest.
(4) unlike x86, powerpc lets the signal handler change key
permission bits; the change will persist across
signal handler boundaries. Earlier we allowed
the signal handler to modify a field in the siginfo
structure which would then be used by the kernel
to program the key protection register (AMR)
-- resolves an issue raised by Ben.
"Calls to sys_swapcontext with a made-up context
will end up with a crap AMR if done by code who
didn't know about that register".
(5) these changes enable protection keys on 4k-page
kernels as well.

version v3:
(1) split the patches into smaller consumable
patches.
(2) added the ability to disable execute permission
on a key at creation.
(3) rename  calc_pte_to_hpte_pkey_bits() to
pte_to_hpte_pkey_bits() -- suggested by Anshuman
(4) some code optimization and clarity in
do_page_fault()  
(5) A bug fix while invalidating a hpte slot in 
__hash_page_4K() -- noticed by Aneesh


version v2:
(1) documentation and selftest added
(2) fixed a bug in 4k hpte backed 64k pte where page
invalidation was not done correctly, and 
initialization of second-part-of-the-pte was not
done correctly if the pte was not yet Hashed
with a hpte.  Reported by Aneesh.
(3) Fixed ABI breakage caused in siginfo structure.
Reported by Anshuman.


version v1: Initial version

Ram Pai (17):
  mm: introduce an additional vma bit for powerpc pkey
  mm: ability to disable execute permission on a key at creation
  x86: key creation with PKEY_DISABLE_EXECUTE disallowed
  powerpc: Implement sys_pkey_alloc and sys_pkey_free system call
  powerpc: store and restore the pkey state across context switches
  powerpc: Implementation for sys_mprotect_pkey() system call
  powerpc: make the hash functions protection-key aware
  powerpc: Program HPTE key protection bits
  powerpc: call the hash functions with the correct pkey value
  powerpc: Macro the mask used for checking DSI exception
  powerpc: Handle exceptions caused by pkey violation
  powerpc: Deliver SEGV signal on pkey violation
  selftest: Move protecton key selftest to arch neutral directory
  selftest: PowerPC specific test updates to memory protection keys
  Documentation: Move protecton key documentation to arch neutral
directory
  Documentation: PowerPC specific updates to memory protection keys
  procfs: display the protection-key number associated with a vma

 Documentation/filesystems/proc.txt|3 +-
 Documentation/vm/protection-keys.txt  |  134 +++
 Documentation/x86/protection-keys.txt |   85 --
 Makefile  |2 +-
 arch/powerpc/Kconfig  |   15 +
 arch/powerpc/include/asm/book3s/64/hash.h |2 +-
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |   19 +-
 arch/powerpc/include/asm/book3s/64/mmu.h  |   10 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  |   62 ++
 arch/powerpc/include/asm/mman.h   |8 +-
 arch/powerpc/include/asm/mmu_context.h|   12 +
 arch/powerpc/include/asm/paca.h   |1 +
 arch/powerpc/include/asm/pkeys.h  |  159 +++
 arch/powerpc/include/asm/processor.h  |5 +
 arch/powerpc/include/asm/reg.h|7 +-
 arch/powerpc/include/asm/systbl.h |3 +
 arch/powerpc/include/asm/unistd.h |6 +-
 arch/powerpc/include/uapi/asm/ptrace.h|3 +-
 arch/powerpc/include/uapi/asm/unistd.h|3 +
 arch/powerpc/kernel/asm-offsets.c |5 +
 arch/powerpc/kernel/exceptions-64s.S  |   18 +-
 arch/powerpc/kernel/process.c |   18 +
 

Re: [PATCH 1/2] powerpc/fadump: return 0 on re-registration

2017-06-27 Thread Michal Suchánek
On Mon, 26 Jun 2017 16:06:00 +0200
Michal Suchanek  wrote:

> When fadump is already registered return success.
> 
> Currently EEXIST is returned which is difficult to handle race-free in
> userspace when shell scripts are used. If multiple writers are trying
> to write '1' there is no difference in whichever succeeds so just
> return 0 to all.
> 
> Signed-off-by: Michal Suchanek 
> ---
>  arch/powerpc/kernel/fadump.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/fadump.c
> b/arch/powerpc/kernel/fadump.c index 436aedf195ab..5a7355381dac 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -1214,7 +1214,6 @@ static ssize_t fadump_register_store(struct
> kobject *kobj, break;
>   case '1':
>   if (fw_dump.dump_registered == 1) {
> - ret = -EEXIST;
>   goto unlock_out;
>   }
>   /* Register Firmware-assisted dump */

Forget about this one.

It breaks another case when fadump is registered and you need to
re-register to account for change in system configuration.

Thanks

Michal


Re: [PATCH] kallsyms: optimize kallsyms_lookup_name() for a few cases

2017-06-27 Thread Naveen N. Rao
On 2017/04/27 11:21AM, Masami Hiramatsu wrote:
> On Thu, 27 Apr 2017 01:38:10 +0530
> "Naveen N. Rao"  wrote:
> 
> > Michael Ellerman wrote:
> > > "Naveen N. Rao"  writes:
> > >> diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
> > >> index 6a3b249a2ae1..d134b060564f 100644
> > >> --- a/kernel/kallsyms.c
> > >> +++ b/kernel/kallsyms.c
> > >> @@ -205,6 +205,12 @@ unsigned long kallsyms_lookup_name(const char *name)
> > >>  unsigned long i;
> > >>  unsigned int off;
> > >>  
> > >> +if (!name || *name == '\0')
> > >> +return false;
> > >> +
> > >> +if (strnchr(name, MODULE_NAME_LEN, ':'))
> > >> +return module_kallsyms_lookup_name(name);
> > >> +
> > >>  for (i = 0, off = 0; i < kallsyms_num_syms; i++) {
> > >>  off = kallsyms_expand_symbol(off, namebuf, 
> > >> ARRAY_SIZE(namebuf));
> > >   ... 
> > >   }
> > >   return module_kallsyms_lookup_name(name);
> > > 
> > > Is the rest of the context.
> > > 
> > > Which looks a bit odd, we already did module lookup previously?
> > > 
> > > But it's correct, because you can lookup a symbol in a module without a
> > > module prefix, it just looks in every module.
> > 
> > Yes.
> > 
> > > 
> > > You could invert the logic, ie. check that there isn't a ":" in the name
> > > and only in that case do the for loop, always falling back to module
> > > lookup.
> > > 
> > > Or just add a comment explaining why we call module lookup in two places.
> > 
> > Good point. Here's a v2 - I'm using a goto so as to not indent the code too 
> > much.
> > 
> > Thanks for the review!
> > - Naveen
> > 
> > --
> > [PATCH v2] kallsyms: optimize kallsyms_lookup_name() for a few cases
> > 
> > 1. Fail early for invalid/zero length symbols.
> > 2. Detect names of the form <module>:<symbol> and skip checking for kernel
> > symbols in that case.
> > 
> 
> Looks good to me.
> 
> Reviewed-by: Masami Hiramatsu 

Thanks, Masami!

I am not quite sure who maintains kallsyms...

Ingo,
Can you please help with merging this patch?

Thanks,
Naveen

> 
> Thanks,
> 
> 
> > Signed-off-by: Naveen N. Rao 
> > ---
> >  kernel/kallsyms.c | 8 
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
> > index 6a3b249a2ae1..f7558dc5c6ac 100644
> > --- a/kernel/kallsyms.c
> > +++ b/kernel/kallsyms.c
> > @@ -205,12 +205,20 @@ unsigned long kallsyms_lookup_name(const char *name)
> > unsigned long i;
> > unsigned int off;
> >  
> > +   if (!name || *name == '\0')
> > +   return 0;
> > +
> > +   /* For symbols of the form <module>:<symbol>, only check the modules */
> > +   if (strnchr(name, MODULE_NAME_LEN, ':'))
> > +   goto mod;
> > +
> > for (i = 0, off = 0; i < kallsyms_num_syms; i++) {
> > off = kallsyms_expand_symbol(off, namebuf, ARRAY_SIZE(namebuf));
> >  
> > if (strcmp(namebuf, name) == 0)
> > return kallsyms_sym_address(i);
> > }
> > +mod:
> > return module_kallsyms_lookup_name(name);
> >  }
> >  EXPORT_SYMBOL_GPL(kallsyms_lookup_name);
> > -- 
> > 2.12.2
> > 
> 
> 
> -- 
> Masami Hiramatsu 
> 



Re: [PATCH v4 3/9] powerpc/kprobes/optprobes: Move over to patch_instruction

2017-06-27 Thread Christophe LEROY



On 27/06/2017 at 09:48, Balbir Singh wrote:

With text moving to read-only migrate optprobes to using
the patch_instruction infrastructure. Without this optprobes
will fail and complain.

Signed-off-by: Balbir Singh 


Didn't Michael pick it up already?

Christophe



---
  arch/powerpc/kernel/optprobes.c | 58 ++---
  1 file changed, 37 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
index ec60ed0..1c7326c 100644
--- a/arch/powerpc/kernel/optprobes.c
+++ b/arch/powerpc/kernel/optprobes.c
@@ -158,12 +158,13 @@ void arch_remove_optimized_kprobe(struct optimized_kprobe 
*op)
  void patch_imm32_load_insns(unsigned int val, kprobe_opcode_t *addr)
  {
/* addis r4,0,(insn)@h */
-   *addr++ = PPC_INST_ADDIS | ___PPC_RT(4) |
- ((val >> 16) & 0x);
+   patch_instruction((unsigned int *)addr, PPC_INST_ADDIS | ___PPC_RT(4) |
+   ((val >> 16) & 0x));
+   addr++;
  
  	/* ori r4,r4,(insn)@l */

-   *addr = PPC_INST_ORI | ___PPC_RA(4) | ___PPC_RS(4) |
-   (val & 0x);
+   patch_instruction((unsigned int *)addr, PPC_INST_ORI | ___PPC_RA(4) |
+   ___PPC_RS(4) | (val & 0x));
  }
  
  /*

@@ -173,24 +174,28 @@ void patch_imm32_load_insns(unsigned int val, 
kprobe_opcode_t *addr)
  void patch_imm64_load_insns(unsigned long val, kprobe_opcode_t *addr)
  {
/* lis r3,(op)@highest */
-   *addr++ = PPC_INST_ADDIS | ___PPC_RT(3) |
- ((val >> 48) & 0x);
+   patch_instruction((unsigned int *)addr, PPC_INST_ADDIS | ___PPC_RT(3) |
+   ((val >> 48) & 0x));
+   addr++;
  
  	/* ori r3,r3,(op)@higher */

-   *addr++ = PPC_INST_ORI | ___PPC_RA(3) | ___PPC_RS(3) |
- ((val >> 32) & 0x);
+   patch_instruction((unsigned int *)addr, PPC_INST_ORI | ___PPC_RA(3) |
+   ___PPC_RS(3) | ((val >> 32) & 0x));
+   addr++;
  
  	/* rldicr r3,r3,32,31 */

-   *addr++ = PPC_INST_RLDICR | ___PPC_RA(3) | ___PPC_RS(3) |
- __PPC_SH64(32) | __PPC_ME64(31);
+   patch_instruction((unsigned int *)addr, PPC_INST_RLDICR | ___PPC_RA(3) |
+   ___PPC_RS(3) | __PPC_SH64(32) | __PPC_ME64(31));
+   addr++;
  
  	/* oris r3,r3,(op)@h */

-   *addr++ = PPC_INST_ORIS | ___PPC_RA(3) | ___PPC_RS(3) |
- ((val >> 16) & 0x);
+   patch_instruction((unsigned int *)addr, PPC_INST_ORIS | ___PPC_RA(3) |
+   ___PPC_RS(3) | ((val >> 16) & 0x));
+   addr++;
  
  	/* ori r3,r3,(op)@l */

-   *addr = PPC_INST_ORI | ___PPC_RA(3) | ___PPC_RS(3) |
-   (val & 0x);
+   patch_instruction((unsigned int *)addr, PPC_INST_ORI | ___PPC_RA(3) |
+   ___PPC_RS(3) | (val & 0x));
  }
  
  int arch_prepare_optimized_kprobe(struct optimized_kprobe *op, struct kprobe *p)

@@ -198,7 +203,8 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
*op, struct kprobe *p)
kprobe_opcode_t *buff, branch_op_callback, branch_emulate_step;
kprobe_opcode_t *op_callback_addr, *emulate_step_addr;
long b_offset;
-   unsigned long nip;
+   unsigned long nip, size;
+   int rc, i;
  
  	kprobe_ppc_optinsn_slots.insn_size = MAX_OPTINSN_SIZE;
  
@@ -231,8 +237,15 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe *op, struct kprobe *p)

goto error;
  
  	/* Setup template */

-   memcpy(buff, optprobe_template_entry,
-   TMPL_END_IDX * sizeof(kprobe_opcode_t));
+   /* We can optimize this via patch_instruction_window later */
+   size = (TMPL_END_IDX * sizeof(kprobe_opcode_t)) / sizeof(int);
+   pr_devel("Copying template to %p, size %lu\n", buff, size);
+   for (i = 0; i < size; i++) {
+   rc = patch_instruction((unsigned int *)buff + i,
+   *((unsigned int *)(optprobe_template_entry) + i));
+   if (rc < 0)
+   goto error;
+   }
  
  	/*

 * Fixup the template with instructions to:
@@ -261,8 +274,10 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
*op, struct kprobe *p)
if (!branch_op_callback || !branch_emulate_step)
goto error;
  
-	buff[TMPL_CALL_HDLR_IDX] = branch_op_callback;

-   buff[TMPL_EMULATE_IDX] = branch_emulate_step;
+   patch_instruction((unsigned int *)buff + TMPL_CALL_HDLR_IDX,
+   branch_op_callback);
+   patch_instruction((unsigned int *)buff + TMPL_EMULATE_IDX,
+   branch_emulate_step);
  
  	/*

 * 3. load instruction to be emulated into relevant register, and
@@ -272,8 +287,9 @@ int arch_prepare_optimized_kprobe(struct 

Re: [PATCH v4 2/9] powerpc/kprobes: Move kprobes over to patch_instruction

2017-06-27 Thread Christophe LEROY



On 27/06/2017 at 09:48, Balbir Singh wrote:

arch_arm/disarm_probe use direct assignment for copying
instructions, replace them with patch_instruction

Signed-off-by: Balbir Singh 


Didn't Michael pick it up already?

Christophe


---
  arch/powerpc/kernel/kprobes.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 01addfb..a52bae8 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -164,7 +164,7 @@ NOKPROBE_SYMBOL(arch_prepare_kprobe);
  
  void arch_arm_kprobe(struct kprobe *p)

  {
-   *p->addr = BREAKPOINT_INSTRUCTION;
+   patch_instruction(p->addr, BREAKPOINT_INSTRUCTION);
flush_icache_range((unsigned long) p->addr,
   (unsigned long) p->addr + sizeof(kprobe_opcode_t));
  }
@@ -172,7 +172,7 @@ NOKPROBE_SYMBOL(arch_arm_kprobe);
  
  void arch_disarm_kprobe(struct kprobe *p)

  {
-   *p->addr = p->opcode;
+   patch_instruction(p->addr, p->opcode);
flush_icache_range((unsigned long) p->addr,
   (unsigned long) p->addr + sizeof(kprobe_opcode_t));
  }



Re: [PATCH v4 1/9] powerpc/lib/code-patching: Use alternate map for patch_instruction()

2017-06-27 Thread Christophe LEROY



On 27/06/2017 at 09:48, Balbir Singh wrote:

This patch creates the window using text_poke_area, allocated
via get_vm_area(). text_poke_area is per CPU to avoid locking.
text_poke_area for each cpu is setup using late_initcall, prior
to setup of these alternate mapping areas, we continue to use
direct write to change/modify kernel text. With the ability
to use alternate mappings to write to kernel text, it provides
us the freedom to then turn text read-only and implement
CONFIG_STRICT_KERNEL_RWX.

This code is CPU hotplug aware to ensure that we have mappings
for any new cpus as they come online, and to tear down mappings
for any cpus that go offline.

Other arches do similar things, but use fixmaps. The reason
for not using fixmaps is to make use of any randomization in
the future.

Signed-off-by: Balbir Singh 
---
  arch/powerpc/lib/code-patching.c | 160 ++-
  1 file changed, 156 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 500b0f6..19b8368 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -12,23 +12,172 @@
  #include 
  #include 
  #include 
-#include 
-#include 
+#include 
+#include 
  #include 
  #include 
  
+#include 

+#include 
+#include 
+#include 
  
-int patch_instruction(unsigned int *addr, unsigned int instr)

+static int __patch_instruction(unsigned int *addr, unsigned int instr)
  {
int err;
  
  	__put_user_size(instr, addr, 4, err);

if (err)
return err;
-   asm ("dcbst 0, %0; sync; icbi 0,%0; sync; isync" : : "r" (addr));
+   asm ("dcbst 0, %0; sync; icbi 0,%0; sync; isync" :: "r" (addr));
+   return 0;
+}
+
+#ifdef CONFIG_STRICT_KERNEL_RWX
+static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
+
+static int text_area_cpu_up(unsigned int cpu)
+{
+   struct vm_struct *area;
+
+   area = get_vm_area(PAGE_SIZE, VM_ALLOC);
+   if (!area) {
+   WARN_ONCE(1, "Failed to create text area for cpu %d\n",
+   cpu);
+   return -1;
+   }
+   this_cpu_write(text_poke_area, area);
+   return 0;
+}
+
+static int text_area_cpu_down(unsigned int cpu)
+{
+   free_vm_area(this_cpu_read(text_poke_area));
+   return 0;
+}
+
+/*
+ * This is an early_initcall and early_initcalls happen at the right time
+ * for us, after slab is enabled and before we mark ro pages R/O. In the
+ * future if get_vm_area is randomized, this will be more flexible than
+ * fixmap
+ */
+static int __init setup_text_poke_area(void)
+{
+   BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+   "powerpc/text_poke:online", text_area_cpu_up,
+   text_area_cpu_down));
+
+   pr_info("text_poke area ready...\n");
+   return 0;
+}
+
+/*
+ * This can be called for kernel text or a module.
+ */
+static int map_patch_area(void *addr, unsigned long text_poke_addr)
+{
+   unsigned long pfn;
+   int err;
+
+   if (is_vmalloc_addr(addr))
+   pfn = vmalloc_to_pfn(addr);
+   else
+   pfn = __pa_symbol(addr) >> PAGE_SHIFT;
+
+   err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT),
+   pgprot_val(PAGE_KERNEL));
+   pr_devel("Mapped addr %lx with pfn %lx:%d\n", text_poke_addr, pfn, err);
+   if (err)
+   return -1;
+   return 0;
+}
+
+static inline int unmap_patch_area(unsigned long addr)
+{
+   pte_t *ptep;
+   pmd_t *pmdp;
+   pud_t *pudp;
+   pgd_t *pgdp;
+
+   pgdp = pgd_offset_k(addr);
+   if (unlikely(!pgdp))
+   return -EINVAL;
+   pudp = pud_offset(pgdp, addr);
+   if (unlikely(!pudp))
+   return -EINVAL;
+   pmdp = pmd_offset(pudp, addr);
+   if (unlikely(!pmdp))
+   return -EINVAL;
+   ptep = pte_offset_kernel(pmdp, addr);
+   if (unlikely(!ptep))
+   return -EINVAL;
+
+   pr_devel("clearing mm %p, pte %p, addr %lx\n", _mm, ptep, addr);
+   /*
+* In hash, pte_clear flushes the tlb, in radix, we have to
+*/
+   pte_clear(&init_mm, addr, ptep);
+   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
return 0;
  }
  
+int patch_instruction(unsigned int *addr, unsigned int instr)

+{
+   int err;
+   unsigned int *dest = NULL;
+   unsigned long flags;
+   unsigned long text_poke_addr;
+   unsigned long kaddr = (unsigned long)addr;
+
+   /*
+* During early early boot patch_instruction is called
+* when text_poke_area is not ready, but we still need
+* to allow patching. We just do the plain old patching
+* We use slab_is_available and per cpu read * via this_cpu_read
+* of text_poke_area. Per-CPU areas might not be up early
+* this can create problems with just using this_cpu_read()
+*/
+   if (!slab_is_available() 

Re: [PATCH v3 4/6] powerpc/mm: Add devmap support for ppc64

2017-06-27 Thread Oliver
On Tue, Jun 27, 2017 at 12:33 PM, Michael Ellerman  wrote:
> kbuild test robot  writes:
>
>> Hi Oliver,
>>
>> [auto build test ERROR on powerpc/next]
>> [also build test ERROR on v4.12-rc6 next-20170623]
>> [if your patch is applied to the wrong git tree, please drop us a note to 
>> help improve the system]
>>
>> url:
>> https://github.com/0day-ci/linux/commits/Oliver-O-Halloran/mm-x86-Add-ARCH_HAS_ZONE_DEVICE-to-Kconfig/20170625-102522
>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
>> next
>> config: powerpc-defconfig (attached as .config)
>> compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
>> reproduce:
>> wget 
>> https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
>> ~/bin/make.cross
>> chmod +x ~/bin/make.cross
>> # save the attached .config to linux build tree
>> make.cross ARCH=powerpc
>>
>> All errors (new ones prefixed by >>):
>>
>>mm/gup.c: In function '__gup_device_huge_pud':
 mm/gup.c:1329:14: error: implicit declaration of function 'pud_pfn' 
 [-Werror=implicit-function-declaration]
>>  fault_pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>>  ^~~
>>cc1: some warnings being treated as errors
>
> The key here is that CONFIG_TRANSPARENT_HUGEPAGE=n.
>
> So this needs to be fixed before I can merge this.
>
> I think the problem is just that pud_pfn() is inside #ifdef
> CONFIG_TRANSPARENT_HUGEPAGE but shouldn't be.

I'm not 100% sold on making pud_pfn() independent of THP. pmd_pfn() is
used a few times in generic code, but the usages are always gated by a
#ifdef CONFIG_TRANSPARENT_HUGEPAGE so I think we should be doing the
same here. I sent a patch[1] yesterday to fix the usage in gup.c, but
I'll do a respin if you want.

Thanks,
Oliver

[1] http://marc.info/?l=linux-mm=149845912612363=4


[PATCH v4 9/9] powerpc/Kconfig: Enable STRICT_KERNEL_RWX

2017-06-27 Thread Balbir Singh
We have the basic support in place: patching of R/O
text sections, and linker script changes to extend the alignment
of text and data. We also have mark_rodata_ro().

NOTE: There is a temporary work-around for disabling
STRICT_KERNEL_RWX if CONFIG_HIBERNATION is enabled

Signed-off-by: Balbir Singh 
---
 arch/powerpc/Kconfig | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 7d95c1d..cda69f3 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -163,6 +163,8 @@ config PPC
select HAVE_ARCH_MMAP_RND_COMPAT_BITS   if COMPAT
select HAVE_ARCH_SECCOMP_FILTER
select HAVE_ARCH_TRACEHOOK
+   select ARCH_HAS_STRICT_KERNEL_RWX   if (PPC_BOOK3S_64 && 
!HIBERNATION)
+   select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
select HAVE_CBPF_JITif !PPC64
select HAVE_CONTEXT_TRACKINGif PPC64
select HAVE_DEBUG_KMEMLEAK
-- 
2.9.4



[PATCH v4 8/9] powerpc/mm/radix: Implement mark_rodata_ro() for radix

2017-06-27 Thread Balbir Singh
The patch splits the linear page mapping such that
the parts containing kernel text are mapped with 2M pages and the
others are mapped with the largest possible size - 1G. The downside
of this is that we split a 1G mapping into 512 2M mappings
for the kernel, but in the absence of that we cannot support
R/O areas at 1G granularity: the kernel text is much smaller, and
using 1G as the granularity would waste a lot of space at the cost
of optimizing the TLB. The text itself should fit into about
6-8 mappings, so the effect should not be all that bad.
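
For reference, a rough illustration of the arithmetic behind the
estimate above (the ~16MB text size used here is an assumption, not a
measured value from this patch):

	#include <stdio.h>

	int main(void)
	{
		unsigned long text_size = 16UL << 20;	/* assumed kernel text ~16MB */
		unsigned long sz_2m = 2UL << 20;
		unsigned long sz_1g = 1UL << 30;

		/* mappings needed for the text region at 2M granularity */
		printf("2M mappings for text: %lu\n", text_size / sz_2m);	/* 8 */

		/* cost of demoting one 1G mapping to 2M mappings */
		printf("2M mappings per 1G:  %lu\n", sz_1g / sz_2m);		/* 512 */
		return 0;
	}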

Signed-off-by: Balbir Singh 
---
 arch/powerpc/mm/pgtable-radix.c | 68 +++--
 1 file changed, 66 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 0797c4e..bdd85f9 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -23,6 +24,8 @@
 
 #include 
 
+int mmu_radix_linear_psize = PAGE_SIZE;
+
 static int native_register_process_table(unsigned long base, unsigned long 
pg_sz,
 unsigned long table_size)
 {
@@ -112,7 +115,53 @@ int radix__map_kernel_page(unsigned long ea, unsigned long 
pa,
 #ifdef CONFIG_STRICT_KERNEL_RWX
 void radix__mark_rodata_ro(void)
 {
-   pr_warn("Not yet implemented for radix\n");
+   unsigned long start = (unsigned long)_stext;
+   unsigned long end = (unsigned long)__init_begin;
+   unsigned long idx;
+   unsigned int step, shift;
+   pgd_t *pgdp;
+   pud_t *pudp;
+   pmd_t *pmdp;
+   pte_t *ptep;
+
+   if (!mmu_has_feature(MMU_FTR_KERNEL_RO)) {
+   pr_info("R/O rodata not supported\n");
+   return;
+   }
+
+   shift = mmu_psize_defs[mmu_radix_linear_psize].shift;
+   step = 1 << shift;
+
+   start = ((start + step - 1) >> shift) << shift;
+   end = (end >> shift) << shift;
+
+   pr_devel("marking ro start %lx, end %lx, step %x\n",
+   start, end, step);
+
+   for (idx = start; idx < end; idx += step) {
+   pgdp = pgd_offset_k(idx);
+   pudp = pud_alloc(&init_mm, pgdp, idx);
+   if (!pudp)
+   continue;
+   if (pud_huge(*pudp)) {
+   ptep = (pte_t *)pudp;
+   goto update_the_pte;
+   }
+   pmdp = pmd_alloc(&init_mm, pudp, idx);
+   if (!pmdp)
+   continue;
+   if (pmd_huge(*pmdp)) {
+   ptep = pmdp_ptep(pmdp);
+   goto update_the_pte;
+   }
+   ptep = pte_alloc_kernel(pmdp, idx);
+   if (!ptep)
+   continue;
+update_the_pte:
+   radix__pte_update(&init_mm, idx, ptep, _PAGE_WRITE, 0, 0);
+   }
+   radix__flush_tlb_kernel_range(start, end);
+
 }
 #endif
 
@@ -131,6 +180,7 @@ static int __meminit create_physical_mapping(unsigned long 
start,
 {
unsigned long vaddr, addr, mapping_size = 0;
pgprot_t prot;
+   unsigned long max_mapping_size;
 
start = _ALIGN_UP(start, PAGE_SIZE);
for (addr = start; addr < end; addr += mapping_size) {
@@ -139,9 +189,12 @@ static int __meminit create_physical_mapping(unsigned long 
start,
 
gap = end - addr;
previous_size = mapping_size;
+   max_mapping_size = PUD_SIZE;
 
+retry:
if (IS_ALIGNED(addr, PUD_SIZE) && gap >= PUD_SIZE &&
-   mmu_psize_defs[MMU_PAGE_1G].shift)
+   mmu_psize_defs[MMU_PAGE_1G].shift &&
+   PUD_SIZE <= max_mapping_size)
mapping_size = PUD_SIZE;
else if (IS_ALIGNED(addr, PMD_SIZE) && gap >= PMD_SIZE &&
 mmu_psize_defs[MMU_PAGE_2M].shift)
@@ -149,6 +202,17 @@ static int __meminit create_physical_mapping(unsigned long 
start,
else
mapping_size = PAGE_SIZE;
 
+   if (mapping_size == PUD_SIZE &&
+   addr <= __pa_symbol(__init_begin) &&
+   (addr + mapping_size) >= __pa_symbol(_stext)) {
+   max_mapping_size = PMD_SIZE;
+   goto retry;
+   }
+
+   if (addr <= __pa_symbol(__init_begin) &&
+   (addr + mapping_size) >= __pa_symbol(_stext))
+   mmu_radix_linear_psize = mapping_size;
+
if (mapping_size != previous_size) {
print_mapping(start, addr, previous_size);
start = addr;
-- 
2.9.4



[PATCH v4 7/9] powerpc/mm/hash: Implement mark_rodata_ro() for hash

2017-06-27 Thread Balbir Singh
With hash we update the bolted pte to mark it read-only. We rely
on MMU_FTR_KERNEL_RO to generate the correct permissions
for read-only text. The radix implementation just prints a warning
for now.

Signed-off-by: Balbir Singh 
---
 arch/powerpc/include/asm/book3s/64/hash.h  |  3 +++
 arch/powerpc/include/asm/book3s/64/radix.h |  4 +++
 arch/powerpc/mm/pgtable-hash64.c   | 41 ++
 arch/powerpc/mm/pgtable-radix.c|  7 +
 arch/powerpc/mm/pgtable_64.c   |  9 +++
 5 files changed, 64 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index 4e957b0..0ce513f 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -89,6 +89,9 @@ static inline int hash__pgd_bad(pgd_t pgd)
 {
return (pgd_val(pgd) == 0);
 }
+#ifdef CONFIG_STRICT_KERNEL_RWX
+extern void hash__mark_rodata_ro(void);
+#endif
 
 extern void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned long pte, int huge);
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
b/arch/powerpc/include/asm/book3s/64/radix.h
index ac16d19..368cb54 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -116,6 +116,10 @@
 #define RADIX_PUD_TABLE_SIZE   (sizeof(pud_t) << RADIX_PUD_INDEX_SIZE)
 #define RADIX_PGD_TABLE_SIZE   (sizeof(pgd_t) << RADIX_PGD_INDEX_SIZE)
 
+#ifdef CONFIG_STRICT_KERNEL_RWX
+extern void radix__mark_rodata_ro(void);
+#endif
+
 static inline unsigned long __radix_pte_update(pte_t *ptep, unsigned long clr,
   unsigned long set)
 {
diff --git a/arch/powerpc/mm/pgtable-hash64.c b/arch/powerpc/mm/pgtable-hash64.c
index 8b85a14..7e9c924 100644
--- a/arch/powerpc/mm/pgtable-hash64.c
+++ b/arch/powerpc/mm/pgtable-hash64.c
@@ -11,8 +11,12 @@
 
 #include 
 #include 
+#include 
 
 #include 
+#include 
+#include 
+#include 
 #include 
 
 #include "mmu_decl.h"
@@ -342,3 +346,40 @@ int hash__has_transparent_hugepage(void)
return 1;
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#ifdef CONFIG_STRICT_KERNEL_RWX
+void hash__mark_rodata_ro(void)
+{
+   unsigned long start = (unsigned long)_stext;
+   unsigned long end = (unsigned long)__init_begin;
+   unsigned long idx;
+   unsigned int step, shift;
+   unsigned long newpp = PP_RXXX;
+
+   if (!mmu_has_feature(MMU_FTR_KERNEL_RO)) {
+   pr_info("R/O rodata not supported\n");
+   return;
+   }
+
+   shift = mmu_psize_defs[mmu_linear_psize].shift;
+   step = 1 << shift;
+
+   start = ((start + step - 1) >> shift) << shift;
+   end = (end >> shift) << shift;
+
+   pr_devel("marking ro start %lx, end %lx, step %x\n",
+   start, end, step);
+
+   if (start == end) {
+   pr_warn("could not set rodata ro, relocate the start"
+   " of the kernel to a 0x%x boundary\n", step);
+   return;
+   }
+
+   for (idx = start; idx < end; idx += step)
+   /* Not sure if we can do much with the return value */
+   mmu_hash_ops.hpte_updateboltedpp(newpp, idx, mmu_linear_psize,
+   mmu_kernel_ssize);
+
+}
+#endif
diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 6c062f9..0797c4e 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -109,6 +109,13 @@ int radix__map_kernel_page(unsigned long ea, unsigned long 
pa,
return 0;
 }
 
+#ifdef CONFIG_STRICT_KERNEL_RWX
+void radix__mark_rodata_ro(void)
+{
+   pr_warn("Not yet implemented for radix\n");
+}
+#endif
+
 static inline void __meminit print_mapping(unsigned long start,
   unsigned long end,
   unsigned long size)
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 8d2d674..34d8633 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -487,3 +487,12 @@ void mmu_partition_table_set_entry(unsigned int lpid, 
unsigned long dw0,
 }
 EXPORT_SYMBOL_GPL(mmu_partition_table_set_entry);
 #endif /* CONFIG_PPC_BOOK3S_64 */
+
+#ifdef CONFIG_STRICT_KERNEL_RWX
+void mark_rodata_ro(void)
+{
+   if (radix_enabled())
+   return radix__mark_rodata_ro();
+   return hash__mark_rodata_ro();
+}
+#endif
-- 
2.9.4



[PATCH v4 6/9] powerpc/platform/pseries/lpar: Fix updatepp and updateboltedpp

2017-06-27 Thread Balbir Singh
PAPR has pp0 in bit 55; currently we assume that pp0 is bit 0
(all bits in IBM order). This patch fixes the pp0 handling for
both of these routines that use H_PROTECT.

(e58e87a powerpc/mm: Update _PAGE_KERNEL_RO)
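As a worked example of the shift used in the diff below, assuming HPTE_R_PP0 is the most-significant bit of the HPTE second doubleword (0x8000000000000000, an assumed value, not quoted from this patch):

#include <stdio.h>

int main(void)
{
	unsigned long hpte_r_pp0 = 0x8000000000000000UL;	/* assumed value */
	unsigned long papr_pp0   = 1UL << (63 - 55);		/* IBM bit 55 */

	/* both print 0x100: shifting right by 55 lands pp0 on IBM bit 55 */
	printf("%#lx %#lx\n", hpte_r_pp0 >> 55, papr_pp0);
	return 0;
}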

Signed-off-by: Balbir Singh 
---
 arch/powerpc/platforms/pseries/lpar.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 6541d0b..2d36571 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -301,7 +301,7 @@ static long pSeries_lpar_hpte_updatepp(unsigned long slot,
   int ssize, unsigned long inv_flags)
 {
unsigned long lpar_rc;
-   unsigned long flags = (newpp & 7) | H_AVPN;
+   unsigned long flags;
unsigned long want_v;
 
want_v = hpte_encode_avpn(vpn, psize, ssize);
@@ -309,6 +309,15 @@ static long pSeries_lpar_hpte_updatepp(unsigned long slot,
pr_devel("update: avpnv=%016lx, hash=%016lx, f=%lx, psize: %d ...",
 want_v, slot, flags, psize);
 
+   /*
+* Move pp0 and set the mask, pp0 is bit 55
+* We ignore the keys for now.
+*/
+   if (mmu_has_feature(MMU_FTR_KERNEL_RO))
+   flags = ((newpp & HPTE_R_PP0) >> 55) | (newpp & 7) | H_AVPN;
+   else
+   flags = (newpp & 7) | H_AVPN;
+
lpar_rc = plpar_pte_protect(flags, slot, want_v);
 
if (lpar_rc == H_NOT_FOUND) {
@@ -379,7 +388,15 @@ static void pSeries_lpar_hpte_updateboltedpp(unsigned long 
newpp,
slot = pSeries_lpar_hpte_find(vpn, psize, ssize);
BUG_ON(slot == -1);
 
-   flags = newpp & 7;
+   /*
+* Move pp0 and set the mask, pp0 is bit 55
+* We ignore the keys for now.
+*/
+   if (mmu_has_feature(MMU_FTR_KERNEL_RO))
+   flags = ((newpp & HPTE_R_PP0) >> 55) | (newpp & 7);
+   else
+   flags = (newpp & 7);
+
lpar_rc = plpar_pte_protect(flags, slot, 0);
 
BUG_ON(lpar_rc != H_SUCCESS);
-- 
2.9.4



[PATCH v4 5/9] powerpc/vmlinux.lds: Align __init_begin to 16M

2017-06-27 Thread Balbir Singh
For CONFIG_STRICT_KERNEL_RWX, align __init_begin to 16M.
We use 16M since it is the larger of 2M on radix and 16M
on hash for our linear mapping. The plan is to have
.text, .rodata and everything up to __init_begin marked
as RX. Note that this still leaves executable read-only data;
we could further align rodata to another 16M boundary.
Keeping text plus rodata read-only-executable is a trade-off
against making text read-only-executable and rodata read-only.

We don't use multiple PT_LOAD entries in PHDRS because we are
not sure all bootloaders support them. This patch keeps the
PHDRS in vmlinux.lds.S as they are, with just one PT_LOAD for
all of the kernel, marked RWX (7).

Signed-off-by: Balbir Singh 
---
 arch/powerpc/kernel/vmlinux.lds.S | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index ace6b65..b1a2505 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -8,6 +8,12 @@
 #include 
 #include 
 
+#ifdef CONFIG_STRICT_KERNEL_RWX
+#define STRICT_ALIGN_SIZE  (1 << 24)
+#else
+#define STRICT_ALIGN_SIZE  PAGE_SIZE
+#endif
+
 ENTRY(_stext)
 
 PHDRS {
@@ -123,7 +129,7 @@ SECTIONS
PROVIDE32 (etext = .);
 
/* Read-only data */
-   RODATA
+   RO_DATA(PAGE_SIZE)
 
EXCEPTION_TABLE(0)
 
@@ -140,7 +146,7 @@ SECTIONS
 /*
  * Init sections discarded at runtime
  */
-   . = ALIGN(PAGE_SIZE);
+   . = ALIGN(STRICT_ALIGN_SIZE);
__init_begin = .;
INIT_TEXT_SECTION(PAGE_SIZE) :kernel
 
-- 
2.9.4



[PATCH v4 4/9] powerpc/xmon: Add patch_instruction() support for xmon

2017-06-27 Thread Balbir Singh
Move from mwrite() to patch_instruction() for xmon for
breakpoint addition and removal.

Signed-off-by: Balbir Singh 
---
 arch/powerpc/xmon/xmon.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index a728e19..08e367e 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -53,6 +53,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_PPC64
 #include 
@@ -837,7 +838,8 @@ static void insert_bpts(void)
store_inst(&bp->instr[0]);
if (bp->enabled & BP_CIABR)
continue;
-   if (mwrite(bp->address, &bpinstr, 4) != 4) {
+   if (patch_instruction((unsigned int *)bp->address,
+   bpinstr) != 0) {
printf("Couldn't write instruction at %lx, "
   "disabling breakpoint there\n", bp->address);
bp->enabled &= ~BP_TRAP;
@@ -874,7 +876,8 @@ static void remove_bpts(void)
continue;
if (mread(bp->address, &instr, 4) == 4
&& instr == bpinstr
-   && mwrite(bp->address, &bp->instr, 4) != 4)
+   && patch_instruction(
+   (unsigned int *)bp->address, bp->instr[0]) != 0)
printf("Couldn't remove breakpoint at %lx\n",
   bp->address);
else
-- 
2.9.4



[PATCH v4 3/9] powerpc/kprobes/optprobes: Move over to patch_instruction

2017-06-27 Thread Balbir Singh
With text moving to read-only, migrate optprobes to using
the patch_instruction() infrastructure. Without this, optprobes
will fail and complain.

Signed-off-by: Balbir Singh 
---
 arch/powerpc/kernel/optprobes.c | 58 ++---
 1 file changed, 37 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
index ec60ed0..1c7326c 100644
--- a/arch/powerpc/kernel/optprobes.c
+++ b/arch/powerpc/kernel/optprobes.c
@@ -158,12 +158,13 @@ void arch_remove_optimized_kprobe(struct optimized_kprobe 
*op)
 void patch_imm32_load_insns(unsigned int val, kprobe_opcode_t *addr)
 {
/* addis r4,0,(insn)@h */
-   *addr++ = PPC_INST_ADDIS | ___PPC_RT(4) |
- ((val >> 16) & 0xffff);
+   patch_instruction((unsigned int *)addr, PPC_INST_ADDIS | ___PPC_RT(4) |
+   ((val >> 16) & 0xffff));
+   addr++;
 
/* ori r4,r4,(insn)@l */
-   *addr = PPC_INST_ORI | ___PPC_RA(4) | ___PPC_RS(4) |
-   (val & 0xffff);
+   patch_instruction((unsigned int *)addr, PPC_INST_ORI | ___PPC_RA(4) |
+   ___PPC_RS(4) | (val & 0xffff));
 }
 
 /*
@@ -173,24 +174,28 @@ void patch_imm32_load_insns(unsigned int val, 
kprobe_opcode_t *addr)
 void patch_imm64_load_insns(unsigned long val, kprobe_opcode_t *addr)
 {
/* lis r3,(op)@highest */
-   *addr++ = PPC_INST_ADDIS | ___PPC_RT(3) |
- ((val >> 48) & 0xffff);
+   patch_instruction((unsigned int *)addr, PPC_INST_ADDIS | ___PPC_RT(3) |
+   ((val >> 48) & 0xffff));
+   addr++;
 
/* ori r3,r3,(op)@higher */
-   *addr++ = PPC_INST_ORI | ___PPC_RA(3) | ___PPC_RS(3) |
- ((val >> 32) & 0xffff);
+   patch_instruction((unsigned int *)addr, PPC_INST_ORI | ___PPC_RA(3) |
+   ___PPC_RS(3) | ((val >> 32) & 0xffff));
+   addr++;
 
/* rldicr r3,r3,32,31 */
-   *addr++ = PPC_INST_RLDICR | ___PPC_RA(3) | ___PPC_RS(3) |
- __PPC_SH64(32) | __PPC_ME64(31);
+   patch_instruction((unsigned int *)addr, PPC_INST_RLDICR | ___PPC_RA(3) |
+   ___PPC_RS(3) | __PPC_SH64(32) | __PPC_ME64(31));
+   addr++;
 
/* oris r3,r3,(op)@h */
-   *addr++ = PPC_INST_ORIS | ___PPC_RA(3) | ___PPC_RS(3) |
- ((val >> 16) & 0xffff);
+   patch_instruction((unsigned int *)addr, PPC_INST_ORIS | ___PPC_RA(3) |
+   ___PPC_RS(3) | ((val >> 16) & 0xffff));
+   addr++;
 
/* ori r3,r3,(op)@l */
-   *addr = PPC_INST_ORI | ___PPC_RA(3) | ___PPC_RS(3) |
-   (val & 0xffff);
+   patch_instruction((unsigned int *)addr, PPC_INST_ORI | ___PPC_RA(3) |
+   ___PPC_RS(3) | (val & 0xffff));
 }
 
 int arch_prepare_optimized_kprobe(struct optimized_kprobe *op, struct kprobe 
*p)
@@ -198,7 +203,8 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
*op, struct kprobe *p)
kprobe_opcode_t *buff, branch_op_callback, branch_emulate_step;
kprobe_opcode_t *op_callback_addr, *emulate_step_addr;
long b_offset;
-   unsigned long nip;
+   unsigned long nip, size;
+   int rc, i;
 
kprobe_ppc_optinsn_slots.insn_size = MAX_OPTINSN_SIZE;
 
@@ -231,8 +237,15 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
*op, struct kprobe *p)
goto error;
 
/* Setup template */
-   memcpy(buff, optprobe_template_entry,
-   TMPL_END_IDX * sizeof(kprobe_opcode_t));
+   /* We can optimize this via patch_instruction_window later */
+   size = (TMPL_END_IDX * sizeof(kprobe_opcode_t)) / sizeof(int);
+   pr_devel("Copying template to %p, size %lu\n", buff, size);
+   for (i = 0; i < size; i++) {
+   rc = patch_instruction((unsigned int *)buff + i,
+   *((unsigned int *)(optprobe_template_entry) + i));
+   if (rc < 0)
+   goto error;
+   }
 
/*
 * Fixup the template with instructions to:
@@ -261,8 +274,10 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
*op, struct kprobe *p)
if (!branch_op_callback || !branch_emulate_step)
goto error;
 
-   buff[TMPL_CALL_HDLR_IDX] = branch_op_callback;
-   buff[TMPL_EMULATE_IDX] = branch_emulate_step;
+   patch_instruction((unsigned int *)buff + TMPL_CALL_HDLR_IDX,
+   branch_op_callback);
+   patch_instruction((unsigned int *)buff + TMPL_EMULATE_IDX,
+   branch_emulate_step);
 
/*
 * 3. load instruction to be emulated into relevant register, and
@@ -272,8 +287,9 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
*op, struct kprobe *p)
/*
 * 4. branch back from 

[PATCH v4 2/9] powerpc/kprobes: Move kprobes over to patch_instruction

2017-06-27 Thread Balbir Singh
arch_arm_kprobe()/arch_disarm_kprobe() use direct assignment for
copying instructions; replace those assignments with patch_instruction().

Signed-off-by: Balbir Singh 
---
 arch/powerpc/kernel/kprobes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 01addfb..a52bae8 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -164,7 +164,7 @@ NOKPROBE_SYMBOL(arch_prepare_kprobe);
 
 void arch_arm_kprobe(struct kprobe *p)
 {
-   *p->addr = BREAKPOINT_INSTRUCTION;
+   patch_instruction(p->addr, BREAKPOINT_INSTRUCTION);
flush_icache_range((unsigned long) p->addr,
   (unsigned long) p->addr + sizeof(kprobe_opcode_t));
 }
@@ -172,7 +172,7 @@ NOKPROBE_SYMBOL(arch_arm_kprobe);
 
 void arch_disarm_kprobe(struct kprobe *p)
 {
-   *p->addr = p->opcode;
+   patch_instruction(p->addr, p->opcode);
flush_icache_range((unsigned long) p->addr,
   (unsigned long) p->addr + sizeof(kprobe_opcode_t));
 }
-- 
2.9.4



[PATCH v4 1/9] powerpc/lib/code-patching: Use alternate map for patch_instruction()

2017-06-27 Thread Balbir Singh
This patch creates the window using text_poke_area, allocated
via get_vm_area(). text_poke_area is per CPU to avoid locking.
The text_poke_area for each CPU is set up using a late_initcall;
prior to the setup of these alternate mapping areas we continue
to use direct writes to change/modify kernel text. The ability
to use alternate mappings to write to kernel text gives us the
freedom to then turn the text read-only and implement
CONFIG_STRICT_KERNEL_RWX.

This code is CPU hotplug aware, ensuring that we create mappings
for any new CPUs as they come online and tear down mappings for
any CPUs that go offline.

Other arches do similar things, but use fixmaps. The reason
for not using fixmaps is to make use of any randomization in
the future.
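As a usage illustration (a hypothetical caller, not part of this series), code that needs to modify now read-only kernel text simply goes through patch_instruction() and lets it pick the alternate mapping when one is available:

#include <asm/code-patching.h>	/* assumed location of patch_instruction() */
#include <asm/ppc-opcode.h>	/* PPC_INST_NOP ("ori r0,r0,0", 0x60000000) */

/* hypothetical helper: overwrite one instruction with a nop */
static int example_nop_out(unsigned int *insn)
{
	return patch_instruction(insn, PPC_INST_NOP);
}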

Signed-off-by: Balbir Singh 
---
 arch/powerpc/lib/code-patching.c | 160 ++-
 1 file changed, 156 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 500b0f6..19b8368 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -12,23 +12,172 @@
 #include 
 #include 
 #include 
-#include 
-#include 
+#include 
+#include 
 #include 
 #include 
 
+#include 
+#include 
+#include 
+#include 
 
-int patch_instruction(unsigned int *addr, unsigned int instr)
+static int __patch_instruction(unsigned int *addr, unsigned int instr)
 {
int err;
 
__put_user_size(instr, addr, 4, err);
if (err)
return err;
-   asm ("dcbst 0, %0; sync; icbi 0,%0; sync; isync" : : "r" (addr));
+   asm ("dcbst 0, %0; sync; icbi 0,%0; sync; isync" :: "r" (addr));
+   return 0;
+}
+
+#ifdef CONFIG_STRICT_KERNEL_RWX
+static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
+
+static int text_area_cpu_up(unsigned int cpu)
+{
+   struct vm_struct *area;
+
+   area = get_vm_area(PAGE_SIZE, VM_ALLOC);
+   if (!area) {
+   WARN_ONCE(1, "Failed to create text area for cpu %d\n",
+   cpu);
+   return -1;
+   }
+   this_cpu_write(text_poke_area, area);
+   return 0;
+}
+
+static int text_area_cpu_down(unsigned int cpu)
+{
+   free_vm_area(this_cpu_read(text_poke_area));
+   return 0;
+}
+
+/*
+ * This is an early_initcall and early_initcalls happen at the right time
+ * for us, after slab is enabled and before we mark ro pages R/O. In the
+ * future if get_vm_area is randomized, this will be more flexible than
+ * fixmap
+ */
+static int __init setup_text_poke_area(void)
+{
+   BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+   "powerpc/text_poke:online", text_area_cpu_up,
+   text_area_cpu_down));
+
+   pr_info("text_poke area ready...\n");
+   return 0;
+}
+
+/*
+ * This can be called for kernel text or a module.
+ */
+static int map_patch_area(void *addr, unsigned long text_poke_addr)
+{
+   unsigned long pfn;
+   int err;
+
+   if (is_vmalloc_addr(addr))
+   pfn = vmalloc_to_pfn(addr);
+   else
+   pfn = __pa_symbol(addr) >> PAGE_SHIFT;
+
+   err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT),
+   pgprot_val(PAGE_KERNEL));
+   pr_devel("Mapped addr %lx with pfn %lx:%d\n", text_poke_addr, pfn, err);
+   if (err)
+   return -1;
+   return 0;
+}
+
+static inline int unmap_patch_area(unsigned long addr)
+{
+   pte_t *ptep;
+   pmd_t *pmdp;
+   pud_t *pudp;
+   pgd_t *pgdp;
+
+   pgdp = pgd_offset_k(addr);
+   if (unlikely(!pgdp))
+   return -EINVAL;
+   pudp = pud_offset(pgdp, addr);
+   if (unlikely(!pudp))
+   return -EINVAL;
+   pmdp = pmd_offset(pudp, addr);
+   if (unlikely(!pmdp))
+   return -EINVAL;
+   ptep = pte_offset_kernel(pmdp, addr);
+   if (unlikely(!ptep))
+   return -EINVAL;
+
+   pr_devel("clearing mm %p, pte %p, addr %lx\n", _mm, ptep, addr);
+   /*
+* In hash, pte_clear flushes the tlb, in radix, we have to
+*/
+   pte_clear(&init_mm, addr, ptep);
+   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
return 0;
 }
 
+int patch_instruction(unsigned int *addr, unsigned int instr)
+{
+   int err;
+   unsigned int *dest = NULL;
+   unsigned long flags;
+   unsigned long text_poke_addr;
+   unsigned long kaddr = (unsigned long)addr;
+
+   /*
+* During early boot patch_instruction() is called before
+* text_poke_area is ready, but we still need to allow patching,
+* so we fall back to plain old direct patching. We check both
+* slab_is_available() and a per-cpu read of text_poke_area via
+* this_cpu_read(); per-CPU areas might not be up that early,
+* which can create problems with using this_cpu_read() alone.
+*/
+   if (!slab_is_available() || !this_cpu_read(text_poke_area))
+   return 

[PATCH v4 0/9] Provide STRICT_KERNEL_RWX for powerpc

2017-06-27 Thread Balbir Singh
Provide STRICT_KERNEL_RWX for PPC64/BOOK3S

These patches enable RX mappings of kernel text.
As a trade-off, rodata is mapped RX as well; there
are more details in the patch descriptions.

As a prerequisite for R/O text, patch_instruction
is moved over to using a separate mapping that
allows writes to kernel text. xmon/ftrace/kprobes
have been moved over to work with patch_instruction.

There is a bug fix: the updatepp and updateboltedpp
(pseries) providers did not use the flags as described in
PAPR (patch 6). I would like to see that patch marked for
stable; I've not added a Cc: stable myself.

Another build failure was reported because, instead
of using ARCH_HAS_SET_MEMORY as a gate for set_memory.h
inclusion, some of the infrastructure in the core kernel
uses CONFIG_STRICT_KERNEL_RWX. I've sent a fix for the
latter. It should be picked up for 4.13, at which
time we can remove the config dependency on !HIBERNATION
in arch/powerpc/Kconfig.

This version received testing under CONFIG_RELOCATABLE_TEST.
CONFIG_STRICT_KERNEL_RWX does not work well with that config;
it disables the feature when relocation is on and prints
a warning message. The radix variant of CONFIG_RELOCATABLE_TEST
did not boot either with or without this config; I am
investigating the issue. I also suspect optprobes might not be
compatible with relocatable kernels (even without these patches).

After these changes go in we can get the PPC32 variant
of the same feature, based on patches already posted
by Christophe.

Changelog v4:
Multiple cleanups to patch_instruction() based on
review comments from Michael Ellerman
Changes to Kconfig to make the feature selectable
Changelog v3:
Support radix
Drop ptdump patch, already picked from v2
Changelog v2:
Support optprobes via patch_instruction

Balbir Singh (9):
  powerpc/lib/code-patching: Use alternate map for patch_instruction()
  powerpc/kprobes: Move kprobes over to patch_instruction
  powerpc/kprobes/optprobes: Move over to patch_instruction
  powerpc/xmon: Add patch_instruction() support for xmon
  powerpc/vmlinux.lds: Align __init_begin to 16M
  powerpc/platform/pseries/lpar: Fix updatepp and updateboltedpp
  powerpc/mm/hash: Implement mark_rodata_ro() for hash
  powerpc/mm/radix: Implement mark_rodata_ro() for radix
  powerpc/Kconfig: Enable STRICT_KERNEL_RWX

 arch/powerpc/Kconfig   |   2 +
 arch/powerpc/include/asm/book3s/64/hash.h  |   3 +
 arch/powerpc/include/asm/book3s/64/radix.h |   4 +
 arch/powerpc/kernel/kprobes.c  |   4 +-
 arch/powerpc/kernel/optprobes.c|  58 +++
 arch/powerpc/kernel/vmlinux.lds.S  |  10 +-
 arch/powerpc/lib/code-patching.c   | 160 -
 arch/powerpc/mm/pgtable-hash64.c   |  41 
 arch/powerpc/mm/pgtable-radix.c|  73 -
 arch/powerpc/mm/pgtable_64.c   |   9 ++
 arch/powerpc/platforms/pseries/lpar.c  |  21 +++-
 arch/powerpc/xmon/xmon.c   |   7 +-
 12 files changed, 358 insertions(+), 34 deletions(-)

-- 
2.9.4



Re: [PATCH] powerpc/perf: Add POWER9 alternate PM_RUN_CYC and PM_RUN_INST_CMPL events

2017-06-27 Thread Madhavan Srinivasan


On Monday 19 June 2017 05:21 AM, Anton Blanchard wrote:

From: Anton Blanchard 

Similar to POWER8, POWER9 can count run cycles and run instructions
completed on more than one PMU.


Acked-by: Madhavan Srinivasan 



Signed-off-by: Anton Blanchard 
---
  arch/powerpc/perf/power9-events-list.h | 4 
  arch/powerpc/perf/power9-pmu.c | 2 ++
  2 files changed, 6 insertions(+)

diff --git a/arch/powerpc/perf/power9-events-list.h 
b/arch/powerpc/perf/power9-events-list.h
index 71a6bfee5c02..e9e417eefa59 100644
--- a/arch/powerpc/perf/power9-events-list.h
+++ b/arch/powerpc/perf/power9-events-list.h
@@ -51,8 +51,12 @@ EVENT(PM_DTLB_MISS,  0x300fc)
  EVENT(PM_ITLB_MISS,   0x400fc)
  /* Run_Instructions */
  EVENT(PM_RUN_INST_CMPL,   0x500fa)
+/* Alternate event code for PM_RUN_INST_CMPL */
+EVENT(PM_RUN_INST_CMPL_ALT,0x400fa)
  /* Run_cycles */
  EVENT(PM_RUN_CYC, 0x600f4)
+/* Alternate event code for Run_cycles */
+EVENT(PM_RUN_CYC_ALT,  0x200f4)
  /* Instruction Dispatched */
  EVENT(PM_INST_DISP,   0x200f2)
  EVENT(PM_INST_DISP_ALT,   0x300f2)
diff --git a/arch/powerpc/perf/power9-pmu.c b/arch/powerpc/perf/power9-pmu.c
index 018f8e90ac35..b9168163b2b2 100644
--- a/arch/powerpc/perf/power9-pmu.c
+++ b/arch/powerpc/perf/power9-pmu.c
@@ -107,6 +107,8 @@ extern struct attribute_group isa207_pmu_format_group;
  /* Table of alternatives, sorted by column 0 */
  static const unsigned int power9_event_alternatives[][MAX_ALT] = {
{ PM_INST_DISP, PM_INST_DISP_ALT },
+   { PM_RUN_CYC_ALT,   PM_RUN_CYC },
+   { PM_RUN_INST_CMPL_ALT, PM_RUN_INST_CMPL },
  };

  static int power9_get_alternatives(u64 event, unsigned int flags, u64 alt[])




[PATCH 2/2] powerpc/smp: Convert NR_CPUS to nr_cpu_ids

2017-06-27 Thread Santosh Sivaraj
nr_cpu_ids can be limited by the nr_cpus boot parameter, whereas NR_CPUS is a
compile-time constant, so NR_CPUS shouldn't be compared against during a CPU kick.
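An illustrative sketch (not taken from the patch) of why the check must use nr_cpu_ids: with CONFIG_NR_CPUS=2048 but a boot with nr_cpus=4, nr_cpu_ids is 4, so a kick for CPU 100 must be rejected, while a comparison against NR_CPUS would let it through.

#include <linux/cpumask.h>	/* nr_cpu_ids */
#include <linux/errno.h>

/* hypothetical helper showing the intended bound */
static int example_kick_cpu_check(int nr)
{
	/* reject anything outside the CPUs the kernel actually set up */
	if (nr < 0 || nr >= nr_cpu_ids)
		return -EINVAL;
	return 0;
}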

Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/kernel/smp.c| 2 +-
 arch/powerpc/platforms/cell/smp.c| 2 +-
 arch/powerpc/platforms/powernv/smp.c | 2 +-
 arch/powerpc/platforms/pseries/smp.c | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 05bf583..4180197 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -112,7 +112,7 @@ int smp_generic_cpu_bootable(unsigned int nr)
 #ifdef CONFIG_PPC64
 int smp_generic_kick_cpu(int nr)
 {
-   if (nr < 0 || nr >= NR_CPUS)
+   if (nr < 0 || nr >= nr_cpu_ids)
return -EINVAL;
 
/*
diff --git a/arch/powerpc/platforms/cell/smp.c 
b/arch/powerpc/platforms/cell/smp.c
index ee8c535..f84d52a 100644
--- a/arch/powerpc/platforms/cell/smp.c
+++ b/arch/powerpc/platforms/cell/smp.c
@@ -115,7 +115,7 @@ static void smp_cell_setup_cpu(int cpu)
 
 static int smp_cell_kick_cpu(int nr)
 {
-   if (nr < 0 || nr >= NR_CPUS)
+   if (nr < 0 || nr >= nr_cpu_ids)
return -EINVAL;
 
if (!smp_startup_cpu(nr))
diff --git a/arch/powerpc/platforms/powernv/smp.c 
b/arch/powerpc/platforms/powernv/smp.c
index 292825f..40dae96 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -63,7 +63,7 @@ static int pnv_smp_kick_cpu(int nr)
long rc;
uint8_t status;
 
-   if (nr < 0 || nr >= NR_CPUS)
+   if (nr < 0 || nr >= nr_cpu_ids)
return -EINVAL;
 
/*
diff --git a/arch/powerpc/platforms/pseries/smp.c 
b/arch/powerpc/platforms/pseries/smp.c
index c82182a..24785f6 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -151,7 +151,7 @@ static void smp_setup_cpu(int cpu)
 
 static int smp_pSeries_kick_cpu(int nr)
 {
-   if (nr < 0 || nr >= NR_CPUS)
+   if (nr < 0 || nr >= nr_cpu_ids)
return -EINVAL;
 
if (!smp_startup_cpu(nr))
-- 
2.9.4



[PATCH 1/2] powerpc/smp: Do not BUG_ON if invalid CPU during kick

2017-06-27 Thread Santosh Sivaraj
During secondary start, we do not need to BUG_ON() if an invalid CPU number
is passed. We already print an error if a secondary cannot be started, so
just return an error instead.

Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/kernel/smp.c| 3 ++-
 arch/powerpc/platforms/cell/smp.c| 3 ++-
 arch/powerpc/platforms/powernv/smp.c | 3 ++-
 arch/powerpc/platforms/pseries/smp.c | 3 ++-
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index df2a416..05bf583 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -112,7 +112,8 @@ int smp_generic_cpu_bootable(unsigned int nr)
 #ifdef CONFIG_PPC64
 int smp_generic_kick_cpu(int nr)
 {
-   BUG_ON(nr < 0 || nr >= NR_CPUS);
+   if (nr < 0 || nr >= NR_CPUS)
+   return -EINVAL;
 
/*
 * The processor is currently spinning, waiting for the
diff --git a/arch/powerpc/platforms/cell/smp.c 
b/arch/powerpc/platforms/cell/smp.c
index 895560f..ee8c535 100644
--- a/arch/powerpc/platforms/cell/smp.c
+++ b/arch/powerpc/platforms/cell/smp.c
@@ -115,7 +115,8 @@ static void smp_cell_setup_cpu(int cpu)
 
 static int smp_cell_kick_cpu(int nr)
 {
-   BUG_ON(nr < 0 || nr >= NR_CPUS);
+   if (nr < 0 || nr >= NR_CPUS)
+   return -EINVAL;
 
if (!smp_startup_cpu(nr))
return -ENOENT;
diff --git a/arch/powerpc/platforms/powernv/smp.c 
b/arch/powerpc/platforms/powernv/smp.c
index c04c87a..292825f 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -63,7 +63,8 @@ static int pnv_smp_kick_cpu(int nr)
long rc;
uint8_t status;
 
-   BUG_ON(nr < 0 || nr >= NR_CPUS);
+   if (nr < 0 || nr >= NR_CPUS)
+   return -EINVAL;
 
/*
 * If we already started or OPAL is not supported, we just
diff --git a/arch/powerpc/platforms/pseries/smp.c 
b/arch/powerpc/platforms/pseries/smp.c
index 52ca6b3..c82182a 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -151,7 +151,8 @@ static void smp_setup_cpu(int cpu)
 
 static int smp_pSeries_kick_cpu(int nr)
 {
-   BUG_ON(nr < 0 || nr >= NR_CPUS);
+   if (nr < 0 || nr >= NR_CPUS)
+   return -EINVAL;
 
if (!smp_startup_cpu(nr))
return -ENOENT;
-- 
2.9.4



RE: [PATCH 1/2] fsl/fman: propagate dma_ops

2017-06-27 Thread Madalin-cristian Bucur
> -Original Message-
> From: geert.uytterhoe...@gmail.com [mailto:geert.uytterhoe...@gmail.com]
> On Behalf Of Geert Uytterhoeven
> Sent: Monday, June 26, 2017 7:24 PM
> To: Madalin-cristian Bucur 
> Cc: net...@vger.kernel.org; David S. Miller ;
> linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH 1/2] fsl/fman: propagate dma_ops
> 
> Hi Madalin,
> 
> On Mon, Jun 26, 2017 at 4:55 PM, Madalin-cristian Bucur
>  wrote:
> >> -Original Message-
> >> From: geert.uytterhoe...@gmail.com
> [mailto:geert.uytterhoe...@gmail.com]
> >> On Behalf Of Geert Uytterhoeven
> >> Sent: Monday, June 26, 2017 10:49 AM
> >> To: Madalin-cristian Bucur 
> >> Cc: net...@vger.kernel.org; David S. Miller ;
> >> linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
> >> Subject: Re: [PATCH 1/2] fsl/fman: propagate dma_ops
> >>
> > On Mon, Jun 19, 2017 at 5:04 PM, Madalin Bucur 
> >> wrote:
> >> > Make sure dma_ops are set, to be later used by the Ethernet driver.
> >> >
> >> > Signed-off-by: Madalin Bucur 
> >> > ---
> >> >  drivers/net/ethernet/freescale/fman/mac.c | 2 ++
> >> >  1 file changed, 2 insertions(+)
> >> >
> >> > diff --git a/drivers/net/ethernet/freescale/fman/mac.c
> >> b/drivers/net/ethernet/freescale/fman/mac.c
> >> > index 0b31f85..6e67d22f 100644
> >> > --- a/drivers/net/ethernet/freescale/fman/mac.c
> >> > +++ b/drivers/net/ethernet/freescale/fman/mac.c
> >> > @@ -623,6 +623,8 @@ static struct platform_device
> >> *dpaa_eth_add_device(int fman_id,
> >> > goto no_mem;
> >> > }
> >> >
> >> > +   set_dma_ops(&pdev->dev, get_dma_ops(priv->dev));
> >> > +
> >>
> >> When compile-testing with NO_DMA=y:
> >>
> >> drivers/net/ethernet/freescale/fman/mac.c: In function
> >> ‘dpaa_eth_add_device’:
> >> drivers/net/ethernet/freescale/fman/mac.c:626: error: implicit
> >> declaration of function ‘set_dma_ops’
> >>
> >> Reverting commit 5567e989198b5a8d fixes this regression in v4.12-rc7.
> >>
> >> Why is this change needed?
> >> There's no single other call to the DMA API in this file?
> >
> > We're setting here the dma_ops that are later used in the other
> driver/patch.
> > The problem is we now depend upon DMA but do not explicitly declare it:
> >
> > < HAS_DMA'
> > in its Kconfig>>
> 
> Sure. But only if the driver really uses DMA.
> I can stick a set_dma_ops() call in whatever driver, but that doesn't
> mean it will
> suddenly use DMA.
> Why does the FMan driver suddenly has a dependency on DMA, if it doesn't
> use DMA?
> 
> > I'll need to add this to the FMan driver Kconfig.
> 
> Why does the FMan driver need this?
> Why can't this call be done in the driver that uses the DMA API?

The DPAA Ethernet driver makes use of DMA ops. It used to get them from
an API call (arch_setup_dma_ops) that was not exported. The DPAA Ethernet
driver that makes use of the FMan devices does not get the dma_ops because
it probes neither as an OF platform device nor through ACPI. It probes as a
platform device based on information prepared by the FMan driver. What the
FMan change [1] does is supplement the information shared with the Ethernet
driver with the dma_ops that the FMan driver gets during OF probing. There
are no scenarios where one can use the DPAA drivers with NO_DMA, as far as I know.
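A minimal sketch of the propagation described above (device and function names here are illustrative, not the driver's actual code): the FMan MAC device, probed via OF and therefore carrying valid dma_ops, copies them onto the platform device it creates for the DPAA Ethernet driver.

#include <linux/dma-mapping.h>
#include <linux/platform_device.h>

/* hypothetical helper mirroring the propagation */
static void example_propagate_dma_ops(struct device *fman_mac_dev,
				      struct platform_device *dpaa_eth_pdev)
{
	set_dma_ops(&dpaa_eth_pdev->dev, get_dma_ops(fman_mac_dev));
}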

For general info on the DPAA drivers please refer to the documentation
found in Documentation/networking/dpaa.txt. For the probing of the Ethernet
driver see change [2] and dpaa_eth_add_device() in fsl/fman, dpaa_eth_probe()
in dpaa_eth.
 
[1] 5567e989198b5a8d fsl/fman: propagate dma_ops
[2] fb52728a9294d97d dpaa_eth: reuse the dma_ops provided by the FMan MAC device

> Gr{oetje,eeting}s,
> 
> Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-
> m68k.org
> 
> In personal conversations with technical people, I call myself a hacker.
> But
> when I'm talking to journalists I just say "programmer" or something like
> that.
> -- Linus Torvalds

Thanks,
Madalin


RE: [PATCH] fsl/fman: add dependency on HAS_DMA

2017-06-27 Thread Madalin-cristian Bucur
> -Original Message-
> From: geert.uytterhoe...@gmail.com [mailto:geert.uytterhoe...@gmail.com]
> On Behalf Of Geert Uytterhoeven
> Sent: Monday, June 26, 2017 7:17 PM
> To: Fabio Estevam 
> Cc: Madalin-cristian Bucur ;
> net...@vger.kernel.org; David S. Miller ; linuxppc-
> d...@lists.ozlabs.org; linux-kernel 
> Subject: Re: [PATCH] fsl/fman: add dependency on HAS_DMA
> 
> On Mon, Jun 26, 2017 at 5:20 PM, Fabio Estevam  wrote:
> > On Mon, Jun 26, 2017 at 12:12 PM, Madalin Bucur 
> wrote:
A previous commit inserted a dependency on the DMA API, which requires
HAS_DMA to be added in Kconfig.
> >
> > It would be nice to specify the commit that caused this.
> 
> That would be commit 5567e989198b5a8d ("fsl/fman: propagate dma_ops").
> 
> However, none of the fman code uses any DMA API calls, so IMHO
> the set_dma_ops() should be done somewhere else.

The Ethernet driver is making use of the DMA ops set here.

> Gr{oetje,eeting}s,
> 
> Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-
> m68k.org
> 
> In personal conversations with technical people, I call myself a hacker.
> But
> when I'm talking to journalists I just say "programmer" or something like
> that.
> -- Linus Torvalds