[RFC v3 06/23] powerpc: use helper functions in __hash_page_4K() for 64K PTE

2017-06-21 Thread Ram Pai
Replace redundant code in __hash_page_4K() with the helper
functions get_hidx_gslot() and set_hidx_slot().

Signed-off-by: Ram Pai 
---
 arch/powerpc/mm/hash64_64k.c | 24 ++--
 1 file changed, 6 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
index 5cbdaa9..cb48a60 100644
--- a/arch/powerpc/mm/hash64_64k.c
+++ b/arch/powerpc/mm/hash64_64k.c
@@ -103,18 +103,12 @@ int __hash_page_4K(unsigned long ea, unsigned long 
access, unsigned long vsid,
if (__rpte_sub_valid(rpte, subpg_index)) {
int ret;
 
-   hash = hpt_hash(vpn, shift, ssize);
-   hidx = __rpte_to_hidx(rpte, subpg_index);
-   if (hidx & _PTEIDX_SECONDARY)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += hidx & _PTEIDX_GROUP_IX;
-
-   ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
+   gslot = get_hidx_gslot(vpn, shift, ssize, rpte, subpg_index);
+   ret = mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn,
 MMU_PAGE_4K, MMU_PAGE_4K,
 ssize, flags);
/*
-*if we failed because typically the HPTE wasn't really here
+* if we failed because typically the HPTE wasn't really here
 * we try an insertion.
 */
if (ret == -1)
@@ -214,15 +208,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
 * Since we have H_PAGE_BUSY set on ptep, we can be sure
 * nobody is undating hidx.
 */
-   hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
-   rpte.hidx &= ~(0xfUL << (subpg_index << 2));
-   *hidxp = rpte.hidx  | (slot << (subpg_index << 2));
-   new_pte = mark_subptegroup_valid(new_pte, subpg_index);
-   new_pte |=  H_PAGE_HASHPTE;
-   /*
-* check __real_pte for details on matching smp_rmb()
-*/
-   smp_wmb();
+   new_pte |= H_PAGE_HASHPTE;
+   new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot);
+
*ptep = __pte(new_pte & ~H_PAGE_BUSY);
return 0;
 }
-- 
1.8.3.1



[RFC v3 01/23] powerpc: Free up four 64K PTE bits in 4K backed HPTE pages

2017-06-21 Thread Ram Pai
Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
in the 4K backed HPTE pages. These bits continue to be used
for 64K backed HPTE pages in this patch, but will be freed
up in the next patch. The bit numbers are big-endian as
defined in ISA 3.0.

The patch does the following change to the 64K PTE format

H_PAGE_BUSY moves from bit 3 to bit 9
H_PAGE_F_SECOND which occupied bit 4 moves to the second part
of the pte.
H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
second part of the pte.

The four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
are initialized to 0xF, indicating an invalid slot. If an HPTE
gets cached in a 0xF slot (i.e. the 7th slot of the secondary group),
it is released immediately. In other words, even though 0xF is a
valid slot, we discard it and treat it as an invalid slot. This
gives us the opportunity to not depend on a bit in the primary PTE
in order to determine the validity of a slot.

When we release an HPTE in the 0xF slot we also release a
legitimate primary slot and unmap that entry. This is to
ensure that we do get a legitimate non-0xF slot the next time we
retry for a slot.

Though treating the 0xF slot as invalid reduces the number of available
slots and may have an effect on performance, the probability
of hitting a 0xF slot is extremely low.

Compared to the current scheme, the above described scheme reduces
the number of false hash table updates significantly and has the
added advantage of releasing four valuable PTE bits for other
purposes.

This idea was jointly developed by Paul Mackerras, Aneesh, Michael
Ellerman and myself.

The 4K PTE format remains unchanged currently.
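
For illustration, treating slot 0xF as soft-invalid boils down to the
small helper visible later in this series (book3s/64/hash.h); shown
here as a sketch:

	/* 0xF, i.e. the 7th slot of the secondary group, is never
	 * treated as a valid cached slot, even though the hardware
	 * could legitimately hand it out.
	 */
	static inline bool hpte_soft_invalid(unsigned long slot)
	{
		return ((slot & 0xfUL) == 0xfUL);
	}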

Signed-off-by: Ram Pai 

Conflicts:
arch/powerpc/include/asm/book3s/64/hash.h
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  7 +++
 arch/powerpc/include/asm/book3s/64/hash-64k.h | 17 ---
 arch/powerpc/include/asm/book3s/64/hash.h | 12 +++--
 arch/powerpc/mm/hash64_64k.c  | 70 +++
 arch/powerpc/mm/hash_utils_64.c   |  4 +-
 5 files changed, 66 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index b4b5e6b..9c2c8f1 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -16,6 +16,13 @@
 #define H_PUD_TABLE_SIZE   (sizeof(pud_t) << H_PUD_INDEX_SIZE)
 #define H_PGD_TABLE_SIZE   (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
 
+#define H_PAGE_F_SECOND  _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
+#define H_PAGE_F_GIX   (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
+#define H_PAGE_F_GIX_SHIFT 56
+
+#define H_PAGE_BUSY  _RPAGE_RSV1 /* software: PTE & hash are busy */
+#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
+
 /* PTE flags to conserve for HPTE identification */
 #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
 H_PAGE_F_SECOND | H_PAGE_F_GIX)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 9732837..3f49941 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -10,20 +10,21 @@
  * 64k aligned address free up few of the lower bits of RPN for us
  * We steal that here. For more deatils look at pte_pfn/pfn_pte()
  */
-#define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
-#define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
+#define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
+#define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
+#define H_PAGE_F_SECOND  _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
+#define H_PAGE_F_GIX   (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
+#define H_PAGE_F_GIX_SHIFT 56
+
+#define H_PAGE_BUSY  _RPAGE_RPN42 /* software: PTE & hash are busy */
+#define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
+
 /*
  * We need to differentiate between explicit huge page and THP huge
  * page, since THP huge page also need to track real subpage details
  */
 #define H_PAGE_THP_HUGE  H_PAGE_4K_PFN
 
-/*
- * Used to track subpage group valid if H_PAGE_COMBO is set
- * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
- */
-#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
-
 /* PTE flags to conserve for HPTE identification */
 #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
 H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index 4e957b0..ac049de 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -8,11 +8,8 @@
  *
  */
 #define H_PTE_NONE_MASK  _PAGE_HPTEFLAGS
-#define H_PAGE_F_GIX_SHIF

[RFC v3 07/23] powerpc: use helper functions in __hash_page_4K() for 4K PTE

2017-06-21 Thread Ram Pai
Replace redundant code with the helper functions
get_hidx_gslot() and set_hidx_slot().

Signed-off-by: Ram Pai 
---
 arch/powerpc/mm/hash64_4k.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
index 6fa450c..c673829 100644
--- a/arch/powerpc/mm/hash64_4k.c
+++ b/arch/powerpc/mm/hash64_4k.c
@@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
   pte_t *ptep, unsigned long trap, unsigned long flags,
   int ssize, int subpg_prot)
 {
+   real_pte_t rpte;
unsigned long hpte_group;
unsigned long rflags, pa;
unsigned long old_pte, new_pte;
@@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
 * need to add in 0x1 if it's a read-only user page
 */
rflags = htab_convert_pte_flags(new_pte);
+   rpte = __real_pte(__pte(old_pte), ptep);
 
if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -64,13 +66,10 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
/*
 * There MIGHT be an HPTE for this pte
 */
-   hash = hpt_hash(vpn, shift, ssize);
-   if (old_pte & H_PAGE_F_SECOND)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
+   unsigned long gslot = get_hidx_gslot(vpn, shift,
+   ssize, rpte, 0);
 
-   if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_4K,
+   if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, MMU_PAGE_4K,
   MMU_PAGE_4K, ssize, flags) == -1)
old_pte &= ~_PAGE_HPTEFLAGS;
}
@@ -118,8 +117,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
return -1;
}
new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
-   new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
-   (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+   new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
}
*ptep = __pte(new_pte & ~H_PAGE_BUSY);
return 0;
-- 
1.8.3.1



[RFC v3 02/23] powerpc: introduce set_hidx_slot helper

2017-06-21 Thread Ram Pai
Introduce set_hidx_slot(), which sets the (H_PAGE_F_SECOND|H_PAGE_F_GIX)
bits at the appropriate location in a 4K PTE. In the
case of a 64K PTE, it sets the bits in the second part of the PTE. Though
the implementation for the former just needs the slot parameter, it does
take some additional parameters to keep the prototype consistent.

This function will come in handy as we work towards re-arranging the
bits in the later patches.
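
As a rough usage sketch (the actual call sites are converted in later
patches of this series), a fault handler records the slot returned by
the HPTE insertion like this:

	/* 'slot' was just returned by the HPTE insertion */
	new_pte |= H_PAGE_HASHPTE;
	new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot);
	*ptep = __pte(new_pte & ~H_PAGE_BUSY);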

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  7 +++
 arch/powerpc/include/asm/book3s/64/hash-64k.h | 16 
 2 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 9c2c8f1..cef644c 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -55,6 +55,13 @@ static inline int hash__hugepd_ok(hugepd_t hpd)
 }
 #endif
 
+static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+   unsigned int subpg_index, unsigned long slot)
+{
+   return (slot << H_PAGE_F_GIX_SHIFT) &
+   (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
 static inline char *get_hpte_slot_array(pmd_t *pmdp)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 3f49941..4bac70a 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -75,6 +75,22 @@ static inline unsigned long __rpte_to_hidx(real_pte_t rpte, 
unsigned long index)
return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
 }
 
+static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+   unsigned int subpg_index, unsigned long slot)
+{
+   unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
+
+   rpte.hidx &= ~(0xfUL << (subpg_index << 2));
+   *hidxp = rpte.hidx  | (slot << (subpg_index << 2));
+   /*
+* Avoid race with __real_pte()
+* hidx must be committed to memory before committing
+* the pte.
+*/
+   smp_wmb();
+   return 0x0UL;
+}
+
 #define __rpte_to_pte(r)   ((r).pte)
 extern bool __rpte_sub_valid(real_pte_t rpte, unsigned long index);
 /*
-- 
1.8.3.1



[RFC v3 03/23] powerpc: introduce get_hidx_gslot helper

2017-06-21 Thread Ram Pai
Introduce get_hidx_gslot() which returns the slot number of the HPTE
in the global hash table.

This function will come in handy as we work towards re-arranging the
PTE bits in the later patches.
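
A minimal sketch of the intended usage (mirroring the later conversions
in this series; 'rpte', 'vpn' and friends are assumed to be already set
up by the caller):

	unsigned long gslot;

	/* global slot = hash-group base + index within the group */
	gslot = get_hidx_gslot(vpn, shift, ssize, rpte, subpg_index);
	mmu_hash_ops.hpte_invalidate(gslot, vpn, psize, psize, ssize, local);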

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/hash.h |  3 +++
 arch/powerpc/mm/hash_utils_64.c   | 14 ++
 2 files changed, 17 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index ac049de..e7cf03a 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -162,6 +162,9 @@ static inline bool hpte_soft_invalid(unsigned long slot)
return ((slot & 0xfUL) == 0xfUL);
 }
 
+unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
+   int ssize, real_pte_t rpte, unsigned int subpg_index);
+
 /* This low level function performs the actual PTE insertion
  * Setting the PTE depends on the MMU type and other factors. It's
  * an horrible mess that I'm not going to try to clean up now but
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 1b494d0..99f97754c 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1591,6 +1591,20 @@ static inline void tm_flush_hash_page(int local)
 }
 #endif
 
+unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
+   int ssize, real_pte_t rpte, unsigned int subpg_index)
+{
+   unsigned long hash, slot, hidx;
+
+   hash = hpt_hash(vpn, shift, ssize);
+   hidx = __rpte_to_hidx(rpte, subpg_index);
+   if (hidx & _PTEIDX_SECONDARY)
+   hash = ~hash;
+   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+   slot += hidx & _PTEIDX_GROUP_IX;
+   return slot;
+}
+
 /* WARNING: This is called from hash_low_64.S, if you change this prototype,
  *  do not forget to update the assembly call site !
  */
-- 
1.8.3.1



[RFC v3 04/23] powerpc: Free up four 64K PTE bits in 64K backed HPTE pages

2017-06-21 Thread Ram Pai
Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
in the 64K backed HPTE pages. This, along with the earlier
patch, will entirely free up the four bits from the 64K PTE.
The bit numbers are big-endian as defined in ISA 3.0.

This patch does the following change to 64K PTE that is
backed by 64K HPTE.

H_PAGE_F_SECOND which occupied bit 4 moves to the second part
of the pte.
H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
second part of the pte.

Since bit 7 is now freed up, we move H_PAGE_BUSY from bit 9
to bit 7. This tries to minimize gaps so that contiguous bits
can be allocated if needed in the future.

The second part of the PTE will hold
(H_PAGE_F_SECOND|H_PAGE_F_GIX) at bits 60, 61, 62 and 63.

The above PTE changes are applicable to hugetlb pages as well.
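
In other words, the second 64-bit word now carries one 4-bit hidx per
4K subpage (index 0 being used for a 64K-backed page). A sketch of the
packing, mirroring set_hidx_slot()/__rpte_to_hidx() in this series:

	unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);

	/* store: clear the old 4-bit hidx, then or in the new slot */
	rpte.hidx &= ~(0xfUL << (subpg_index << 2));
	*hidxp = rpte.hidx | (slot << (subpg_index << 2));

	/* load: pick the 4 bits back out */
	hidx = (rpte.hidx >> (subpg_index << 2)) & 0xfUL;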

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/hash-64k.h | 28 +--
 arch/powerpc/mm/hash64_64k.c  | 17 
 arch/powerpc/mm/hugetlbpage-hash64.c  | 16 ++-
 3 files changed, 23 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 4bac70a..7b5dbf3 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -12,11 +12,8 @@
  */
 #define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
 #define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
-#define H_PAGE_F_SECOND  _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
-#define H_PAGE_F_GIX   (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
-#define H_PAGE_F_GIX_SHIFT 56
 
-#define H_PAGE_BUSY  _RPAGE_RPN42 /* software: PTE & hash are busy */
+#define H_PAGE_BUSY  _RPAGE_RPN44 /* software: PTE & hash are busy */
 #define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated HPTE */
 
 /*
@@ -26,8 +23,7 @@
 #define H_PAGE_THP_HUGE  H_PAGE_4K_PFN
 
 /* PTE flags to conserve for HPTE identification */
-#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
-H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
+#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO)
 /*
  * we support 16 fragments per PTE page of 64K size.
  */
@@ -55,24 +51,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
unsigned long *hidxp;
 
rpte.pte = pte;
-   rpte.hidx = 0;
-   if (pte_val(pte) & H_PAGE_COMBO) {
-   /*
-* Make sure we order the hidx load against the H_PAGE_COMBO
-* check. The store side ordering is done in __hash_page_4K
-*/
-   smp_rmb();
-   hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
-   rpte.hidx = *hidxp;
-   }
+   /*
+* The store side ordering is done in set_hidx_slot()
+*/
+   smp_rmb();
+   hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
+   rpte.hidx = *hidxp;
return rpte;
 }
 
 static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long 
index)
 {
-   if ((pte_val(rpte.pte) & H_PAGE_COMBO))
-   return (rpte.hidx >> (index<<2)) & 0xf;
-   return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
+   return ((rpte.hidx >> (index<<2)) & 0xfUL);
 }
 
 static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
index a16cd28..5cbdaa9 100644
--- a/arch/powerpc/mm/hash64_64k.c
+++ b/arch/powerpc/mm/hash64_64k.c
@@ -231,6 +231,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
unsigned long vsid, pte_t *ptep, unsigned long trap,
unsigned long flags, int ssize)
 {
+   real_pte_t rpte;
unsigned long hpte_group;
unsigned long rflags, pa;
unsigned long old_pte, new_pte;
@@ -267,6 +268,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
} while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
 
rflags = htab_convert_pte_flags(new_pte);
+   rpte = __real_pte(__pte(old_pte), ptep);
 
if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -274,16 +276,13 @@ int __hash_page_64K(unsigned long ea, unsigned long 
access,
 
vpn  = hpt_vpn(ea, vsid, ssize);
if (unlikely(old_pte & H_PAGE_HASHPTE)) {
+   unsigned long gslot;
+
/*
 * There MIGHT be an HPTE for this pte
 */
-   hash = hpt_hash(vpn, shift, ssize);
-   if (old_pte & H_PAGE_F_SECOND)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
-
-   if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
+   gslot = get_hidx_gslot(vpn, shift, 

[RFC v3 05/23] powerpc: capture the PTE format changes in the dump pte report

2017-06-21 Thread Ram Pai
H_PAGE_F_SECOND and H_PAGE_F_GIX are no longer in the 64K main PTE.
Capture these changes in the pte dump report.

Signed-off-by: Ram Pai 
---
 arch/powerpc/mm/dump_linuxpagetables.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/dump_linuxpagetables.c 
b/arch/powerpc/mm/dump_linuxpagetables.c
index 44fe483..5627edd 100644
--- a/arch/powerpc/mm/dump_linuxpagetables.c
+++ b/arch/powerpc/mm/dump_linuxpagetables.c
@@ -213,7 +213,7 @@ struct flag_info {
.val= H_PAGE_4K_PFN,
.set= "4K_pfn",
}, {
-#endif
+#else /* CONFIG_PPC_64K_PAGES */
.mask   = H_PAGE_F_GIX,
.val= H_PAGE_F_GIX,
.set= "f_gix",
@@ -224,6 +224,7 @@ struct flag_info {
.val= H_PAGE_F_SECOND,
.set= "f_second",
}, {
+#endif /* CONFIG_PPC_64K_PAGES */
 #endif
.mask   = _PAGE_SPECIAL,
.val= _PAGE_SPECIAL,
-- 
1.8.3.1



[RFC v3 10/23] mm: provide the ability to disable execute on a key at creation

2017-06-21 Thread Ram Pai
Currently sys_pkey_create() provides the ability to disable read
and write permission on the key at creation. powerpc has the
hardware support to disable execute on a pkey as well. This patch
enhances the interface to let execute be disabled at key creation
time. x86 does not allow this, hence the next patch will add the
ability in x86 to return an error if PKEY_DISABLE_EXECUTE is
specified.
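
A minimal (hypothetical) userspace sketch of what this allows, assuming
the usual pkey_alloc() wrapper around the syscall:

	/* allocate a key that denies execute access from the start */
	int pkey = pkey_alloc(0, PKEY_DISABLE_EXECUTE);

	/* pkey < 0: the kernel rejected it (x86 does, see next patch) */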

Signed-off-by: Ram Pai 
---
 include/uapi/asm-generic/mman-common.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/uapi/asm-generic/mman-common.h 
b/include/uapi/asm-generic/mman-common.h
index 8c27db0..bf4fa07 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -74,7 +74,9 @@
 
 #define PKEY_DISABLE_ACCESS  0x1
 #define PKEY_DISABLE_WRITE 0x2
+#define PKEY_DISABLE_EXECUTE   0x4
 #define PKEY_ACCESS_MASK   (PKEY_DISABLE_ACCESS |\
-PKEY_DISABLE_WRITE)
+PKEY_DISABLE_WRITE  |\
+PKEY_DISABLE_EXECUTE)
 
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
-- 
1.8.3.1



[RFC v3 08/23] powerpc: use helper functions in flush_hash_page()

2017-06-21 Thread Ram Pai
Replace redundant code in flush_hash_page() with the helper function
get_hidx_gslot().

Signed-off-by: Ram Pai 
---
 arch/powerpc/mm/hash_utils_64.c | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 99f97754c..b3bc5d6 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1611,23 +1611,18 @@ unsigned long get_hidx_gslot(unsigned long vpn, 
unsigned long shift,
 void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, int ssize,
 unsigned long flags)
 {
-   unsigned long hash, index, shift, hidx, slot;
+   unsigned long index, shift, gslot;
int local = flags & HPTE_LOCAL_UPDATE;
 
DBG_LOW("flush_hash_page(vpn=%016lx)\n", vpn);
pte_iterate_hashed_subpages(pte, psize, vpn, index, shift) {
-   hash = hpt_hash(vpn, shift, ssize);
-   hidx = __rpte_to_hidx(pte, index);
-   if (hidx & _PTEIDX_SECONDARY)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += hidx & _PTEIDX_GROUP_IX;
-   DBG_LOW(" sub %ld: hash=%lx, hidx=%lx\n", index, slot, hidx);
+   gslot = get_hidx_gslot(vpn, shift, ssize, pte, index);
+   DBG_LOW(" sub %ld: gslot=%lx\n", index, gslot);
/*
 * We use same base page size and actual psize, because we don't
 * use these functions for hugepage
 */
-   mmu_hash_ops.hpte_invalidate(slot, vpn, psize, psize,
+   mmu_hash_ops.hpte_invalidate(gslot, vpn, psize, psize,
 ssize, local);
} pte_iterate_hashed_end();
 
-- 
1.8.3.1



[RFC v3 11/23] x86: key creation with PKEY_DISABLE_EXECUTE is disallowed

2017-06-21 Thread Ram Pai
x86 does not support disabling execute permissions on a pkey.

Signed-off-by: Ram Pai 
---
 arch/x86/kernel/fpu/xstate.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c24ac1e..d582631 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -900,6 +900,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int 
pkey,
if (!boot_cpu_has(X86_FEATURE_OSPKE))
return -EINVAL;
 
+   if (init_val & PKEY_DISABLE_EXECUTE)
+   return -EINVAL;
+
/* Set the bits we need in PKRU:  */
if (init_val & PKEY_DISABLE_ACCESS)
new_pkru_bits |= PKRU_AD_BIT;
-- 
1.8.3.1



[RFC v3 12/23] powerpc: Implement sys_pkey_alloc and sys_pkey_free system call

2017-06-21 Thread Ram Pai
sys_pkey_alloc() allocates and returns an available pkey.
sys_pkey_free() frees up the pkey.

A total of 32 keys are supported on powerpc. However, pkeys 0, 1 and 31
are reserved, so effectively we have 29 pkeys.

Each key can be initialized to disable read, write and execute
permissions. On powerpc a key can also be initialized to disable execute.
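
A rough userspace sketch of the new interface, assuming SYS_pkey_alloc
and SYS_pkey_free are defined (the selftests later in this series
hard-code 384 and 385 on powerpc):

	int keys[29], n = 0;

	/* drain the 29 allocatable keys, then hand them all back */
	while (n < 29) {
		int k = syscall(SYS_pkey_alloc, 0, 0);
		if (k < 0)
			break;		/* no keys left */
		keys[n++] = k;
	}
	while (n > 0)
		syscall(SYS_pkey_free, keys[--n]);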

Signed-off-by: Ram Pai 
---
 arch/powerpc/Kconfig |  15 
 arch/powerpc/include/asm/book3s/64/mmu.h |  10 +++
 arch/powerpc/include/asm/book3s/64/pgtable.h |  62 ++
 arch/powerpc/include/asm/pkeys.h | 124 +++
 arch/powerpc/include/asm/systbl.h|   2 +
 arch/powerpc/include/asm/unistd.h|   4 +-
 arch/powerpc/include/uapi/asm/unistd.h   |   2 +
 arch/powerpc/mm/Makefile |   1 +
 arch/powerpc/mm/mmu_context_book3s64.c   |   5 ++
 arch/powerpc/mm/pkeys.c  |  88 +++
 10 files changed, 310 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/include/asm/pkeys.h
 create mode 100644 arch/powerpc/mm/pkeys.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f7c8f99..b6960617 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -871,6 +871,21 @@ config SECCOMP
 
  If unsure, say Y. Only embedded should say N here.
 
+config PPC64_MEMORY_PROTECTION_KEYS
+   prompt "PowerPC Memory Protection Keys"
+   def_bool y
+   # Note: only available in 64-bit mode
+   depends on PPC64 && PPC_64K_PAGES
+   select ARCH_USES_HIGH_VMA_FLAGS
+   select ARCH_HAS_PKEYS
+   ---help---
+ Memory Protection Keys provides a mechanism for enforcing
+ page-based protections, but without requiring modification of the
+ page tables when an application changes protection domains.
+
+ For details, see Documentation/powerpc/protection-keys.txt
+
+ If unsure, say y.
 endmenu
 
 config ISA_DMA_API
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index 77529a3..0c0a2a8 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -108,6 +108,16 @@ struct patb_entry {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
struct list_head iommu_group_mem_list;
 #endif
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   /*
+* Each bit represents one protection key.
+* bit set   -> key allocated
+* bit unset -> key available for allocation
+*/
+   u32 pkey_allocation_map;
+   s16 execute_only_pkey; /* key holding execute-only protection */
+#endif
 } mm_context_t;
 
 /*
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 85bc987..87e9a89 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -428,6 +428,68 @@ static inline void huge_ptep_set_wrprotect(struct 
mm_struct *mm,
pte_update(mm, addr, ptep, 0, _PAGE_PRIVILEGED, 1);
 }
 
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+
+#include 
+static inline u64 read_amr(void)
+{
+   return mfspr(SPRN_AMR);
+}
+static inline void write_amr(u64 value)
+{
+   mtspr(SPRN_AMR, value);
+}
+static inline u64 read_iamr(void)
+{
+   return mfspr(SPRN_IAMR);
+}
+static inline void write_iamr(u64 value)
+{
+   mtspr(SPRN_IAMR, value);
+}
+static inline u64 read_uamor(void)
+{
+   return mfspr(SPRN_UAMOR);
+}
+static inline void write_uamor(u64 value)
+{
+   mtspr(SPRN_UAMOR, value);
+}
+
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+static inline u64 read_amr(void)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+   return -1;
+}
+static inline void write_amr(u64 value)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+static inline u64 read_uamor(void)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+   return -1;
+}
+static inline void write_uamor(u64 value)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+static inline u64 read_iamr(void)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+   return -1;
+}
+static inline void write_iamr(u64 value)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
   unsigned long addr, pte_t *ptep)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
new file mode 100644
index 000..7bc8746
--- /dev/null
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -0,0 +1,124 @@
+#ifndef _ASM_PPC64_PKEYS_H
+#define _ASM_PPC64_PKEYS_H
+
+
+#define arch_max_pkey()  32
+
+#define AMR_AD_BIT 

[RFC v3 09/23] mm: introduce an additional vma bit for powerpc pkey

2017-06-21 Thread Ram Pai
Currently there are only 4 bits in the vma flags to support 16 keys
on x86. powerpc supports 32 keys, which needs 5 bits. This patch
introduces an additional bit in the vma flags.

Signed-off-by: Ram Pai 
---
 fs/proc/task_mmu.c |  6 +-
 include/linux/mm.h | 18 +-
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f0c8b33..2ddc298 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -666,12 +666,16 @@ static void show_smap_vma_flags(struct seq_file *m, 
struct vm_area_struct *vma)
[ilog2(VM_MERGEABLE)]   = "mg",
[ilog2(VM_UFFD_MISSING)]= "um",
[ilog2(VM_UFFD_WP)] = "uw",
-#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+#ifdef CONFIG_ARCH_HAS_PKEYS
/* These come out via ProtectionKey: */
[ilog2(VM_PKEY_BIT0)]   = "",
[ilog2(VM_PKEY_BIT1)]   = "",
[ilog2(VM_PKEY_BIT2)]   = "",
[ilog2(VM_PKEY_BIT3)]   = "",
+#endif /* CONFIG_ARCH_HAS_PKEYS */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   /* Additional bit in ProtectionKey: */
+   [ilog2(VM_PKEY_BIT4)]   = "",
 #endif
};
size_t i;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7cb17c6..3d35bcc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -208,21 +208,29 @@ extern int overcommit_kbytes_handler(struct ctl_table *, 
int, void __user *,
 #define VM_HIGH_ARCH_BIT_1 33  /* bit only usable on 64-bit 
architectures */
 #define VM_HIGH_ARCH_BIT_2 34  /* bit only usable on 64-bit 
architectures */
 #define VM_HIGH_ARCH_BIT_3 35  /* bit only usable on 64-bit 
architectures */
+#define VM_HIGH_ARCH_BIT_4 36  /* bit only usable on 64-bit arch */
 #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
+#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
-#if defined(CONFIG_X86)
-# define VM_PATVM_ARCH_1   /* PAT reserves whole VMA at 
once (x86) */
-#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
+#ifdef CONFIG_ARCH_HAS_PKEYS
 # define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
-# define VM_PKEY_BIT0  VM_HIGH_ARCH_0  /* A protection key is a 4-bit value */
+# define VM_PKEY_BIT0  VM_HIGH_ARCH_0
 # define VM_PKEY_BIT1  VM_HIGH_ARCH_1
 # define VM_PKEY_BIT2  VM_HIGH_ARCH_2
 # define VM_PKEY_BIT3  VM_HIGH_ARCH_3
-#endif
+#endif /* CONFIG_ARCH_HAS_PKEYS */
+
+#if defined(CONFIG_PPC64_MEMORY_PROTECTION_KEYS)
+# define VM_PKEY_BIT4  VM_HIGH_ARCH_4 /* additional key bit used on ppc64 */
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+
+#if defined(CONFIG_X86)
+# define VM_PATVM_ARCH_1   /* PAT reserves whole VMA at 
once (x86) */
 #elif defined(CONFIG_PPC)
 # define VM_SAOVM_ARCH_1   /* Strong Access Ordering 
(powerpc) */
 #elif defined(CONFIG_PARISC)
-- 
1.8.3.1



[RFC v3 13/23] powerpc: store and restore the pkey state across context switches

2017-06-21 Thread Ram Pai
Store and restore the AMR, IAMR and UAMOR register state of the task
before scheduling out and after scheduling in, respectively.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/processor.h |  5 +
 arch/powerpc/kernel/process.c| 18 ++
 2 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index a2123f2..1f714df 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -310,6 +310,11 @@ struct thread_struct {
struct thread_vr_state ckvr_state; /* Checkpointed VR state */
unsigned long   ckvrsave; /* Checkpointed VRSAVE */
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   unsigned long   amr;
+   unsigned long   iamr;
+   unsigned long   uamor;
+#endif
 #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
void*   kvm_shadow_vcpu; /* KVM internal data */
 #endif /* CONFIG_KVM_BOOK3S_32_HANDLER */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index baae104..37d001a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1096,6 +1096,11 @@ static inline void save_sprs(struct thread_struct *t)
t->tar = mfspr(SPRN_TAR);
}
 #endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   t->amr = mfspr(SPRN_AMR);
+   t->iamr = mfspr(SPRN_IAMR);
+   t->uamor = mfspr(SPRN_UAMOR);
+#endif
 }
 
 static inline void restore_sprs(struct thread_struct *old_thread,
@@ -1131,6 +1136,14 @@ static inline void restore_sprs(struct thread_struct 
*old_thread,
mtspr(SPRN_TAR, new_thread->tar);
}
 #endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   if (old_thread->amr != new_thread->amr)
+   mtspr(SPRN_AMR, new_thread->amr);
+   if (old_thread->iamr != new_thread->iamr)
+   mtspr(SPRN_IAMR, new_thread->iamr);
+   if (old_thread->uamor != new_thread->uamor)
+   mtspr(SPRN_UAMOR, new_thread->uamor);
+#endif
 }
 
 struct task_struct *__switch_to(struct task_struct *prev,
@@ -1686,6 +1699,11 @@ void start_thread(struct pt_regs *regs, unsigned long 
start, unsigned long sp)
current->thread.tm_texasr = 0;
current->thread.tm_tfiar = 0;
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   current->thread.amr   = 0x0ul;
+   current->thread.iamr  = 0x0ul;
+   current->thread.uamor = 0x0ul;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 }
 EXPORT_SYMBOL(start_thread);
 
-- 
1.8.3.1



[RFC v3 14/23] powerpc: Implementation for sys_mprotect_pkey() system call

2017-06-21 Thread Ram Pai
This system call associates the pkey with the PTEs of all
pages covering the given address range.
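
A minimal userspace sketch (mirroring the example in
Documentation/vm/protection-keys.txt later in this series):

	int real_prot = PROT_READ | PROT_WRITE;
	int pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);
	void *ptr = mmap(NULL, PAGE_SIZE, PROT_NONE,
			 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

	/* tag every PTE in the range with 'pkey' */
	int ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);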

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 22 ++-
 arch/powerpc/include/asm/mman.h  | 14 -
 arch/powerpc/include/asm/pkeys.h | 21 ++-
 arch/powerpc/include/asm/systbl.h|  1 +
 arch/powerpc/include/asm/unistd.h|  4 +-
 arch/powerpc/include/uapi/asm/unistd.h   |  1 +
 arch/powerpc/mm/pkeys.c  | 93 +++-
 7 files changed, 148 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 87e9a89..bc845cd 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -37,6 +37,7 @@
 #define _RPAGE_RSV20x0800UL
 #define _RPAGE_RSV30x0400UL
 #define _RPAGE_RSV40x0200UL
+#define _RPAGE_RSV5  0x00040UL
 
 #define _PAGE_PTE  0x4000UL/* distinguishes PTEs 
from pointers */
 #define _PAGE_PRESENT  0x8000UL/* pte contains a 
translation */
@@ -56,6 +57,20 @@
 /* Max physical address bit as per radix table */
 #define _RPAGE_PA_MAX  57
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+#define H_PAGE_PKEY_BIT0   _RPAGE_RSV1
+#define H_PAGE_PKEY_BIT1   _RPAGE_RSV2
+#define H_PAGE_PKEY_BIT2   _RPAGE_RSV3
+#define H_PAGE_PKEY_BIT3   _RPAGE_RSV4
+#define H_PAGE_PKEY_BIT4   _RPAGE_RSV5
+#else /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+#define H_PAGE_PKEY_BIT0   0
+#define H_PAGE_PKEY_BIT1   0
+#define H_PAGE_PKEY_BIT2   0
+#define H_PAGE_PKEY_BIT3   0
+#define H_PAGE_PKEY_BIT4   0
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 /*
  * Max physical address bit we will use for now.
  *
@@ -122,7 +137,12 @@
 #define PAGE_PROT_BITS  (_PAGE_SAO | _PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT | \
 H_PAGE_4K_PFN | _PAGE_PRIVILEGED | _PAGE_ACCESSED | \
 _PAGE_READ | _PAGE_WRITE |  _PAGE_DIRTY | _PAGE_EXEC | 
\
-_PAGE_SOFT_DIRTY)
+_PAGE_SOFT_DIRTY | \
+H_PAGE_PKEY_BIT0 | \
+H_PAGE_PKEY_BIT1 | \
+H_PAGE_PKEY_BIT2 | \
+H_PAGE_PKEY_BIT3 | \
+H_PAGE_PKEY_BIT4)
 /*
  * We define 2 sets of base prot bits, one for basic pages (ie,
  * cacheable kernel and user pages) and one for non cacheable
diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 30922f6..624f6a2 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -13,6 +13,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -22,13 +23,24 @@
 static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
unsigned long pkey)
 {
-   return (prot & PROT_SAO) ? VM_SAO : 0;
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   return (((prot & PROT_SAO) ? VM_SAO : 0) |
+   pkey_to_vmflag_bits(pkey));
+#else
+   return ((prot & PROT_SAO) ? VM_SAO : 0);
+#endif
 }
 #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
 
 static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
 {
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   return (vm_flags & VM_SAO) ?
+   __pgprot(_PAGE_SAO | vmflag_to_page_pkey_bits(vm_flags)) :
+   __pgprot(0 | vmflag_to_page_pkey_bits(vm_flags));
+#else
return (vm_flags & VM_SAO) ? __pgprot(_PAGE_SAO) : __pgprot(0);
+#endif
 }
 #define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)
 
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 7bc8746..0f3dca8 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -14,6 +14,19 @@
VM_PKEY_BIT3 | \
VM_PKEY_BIT4)
 
+#define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
+   ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) |\
+   ((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) |\
+   ((key & 0x8UL) ? VM_PKEY_BIT3 : 0x0UL) |\
+   ((key & 0x10UL) ? VM_PKEY_BIT4 : 0x0UL))
+
+#define vmflag_to_page_pkey_bits(vm_flags)  \
+   (((vm_flags & VM_PKEY_BIT0) ? H_PAGE_PKEY_BIT4 : 0x0UL)| \
+   ((vm_flags & VM_PKEY_BIT1) ? H_PAGE_PKEY_BIT3 : 0x0UL) | \
+   ((vm_flags & VM_PKEY_BIT2) ? H_PAGE_PKEY_BIT2 : 0x0UL) | \
+   ((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) | \
+   ((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL))
+
 /*
  * Bits are in BE format.
  * NOTE: key 31, 1, 0 are not used.
@

[RFC v3 18/23] powerpc: Deliver SEGV signal on pkey violation

2017-06-21 Thread Ram Pai
The value of the AMR register at the time of the exception
is made available in gp_regs[PT_AMR] of the signal context.

This field can be used to reprogram the permission bits of
any valid pkey.

Similarly, the value of the pkey whose protection got violated
is made available in the si_pkey field of the siginfo structure.
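
A rough sketch of how a SIGSEGV handler might consume these two fields
(names follow the selftests later in this series; si_pkey is shown as a
direct siginfo field for brevity, while the selftests actually locate it
via si_pkey_offset):

	static void segv_handler(int sig, siginfo_t *si, void *vucontext)
	{
		ucontext_t *uctxt = vucontext;

		/* which key's protection was violated */
		int pkey = si->si_pkey;

		/* AMR value the kernel saved at exception time */
		unsigned long amr = uctxt->uc_mcontext.gp_regs[PT_AMR];

		/* ... reprogram the permission bits for 'pkey' ... */
	}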

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/paca.h|  1 +
 arch/powerpc/include/uapi/asm/ptrace.h |  3 ++-
 arch/powerpc/kernel/asm-offsets.c  |  5 
 arch/powerpc/kernel/exceptions-64s.S   | 16 +--
 arch/powerpc/kernel/signal_32.c| 14 ++
 arch/powerpc/kernel/signal_64.c| 14 ++
 arch/powerpc/kernel/traps.c| 49 ++
 arch/powerpc/mm/fault.c|  2 ++
 8 files changed, 101 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 1c09f8f..a41afd3 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -92,6 +92,7 @@ struct paca_struct {
struct dtl_entry *dispatch_log_end;
 #endif /* CONFIG_PPC_STD_MMU_64 */
u64 dscr_default;   /* per-CPU default DSCR */
+   u64 paca_amr;   /* value of amr at exception */
 
 #ifdef CONFIG_PPC_STD_MMU_64
/*
diff --git a/arch/powerpc/include/uapi/asm/ptrace.h 
b/arch/powerpc/include/uapi/asm/ptrace.h
index 8036b38..7ec2428 100644
--- a/arch/powerpc/include/uapi/asm/ptrace.h
+++ b/arch/powerpc/include/uapi/asm/ptrace.h
@@ -108,8 +108,9 @@ struct pt_regs {
 #define PT_DAR 41
 #define PT_DSISR 42
 #define PT_RESULT 43
-#define PT_DSCR 44
 #define PT_REGS_COUNT 44
+#define PT_DSCR 44
+#define PT_AMR 45
 
 #define PT_FPR048  /* each FP reg occupies 2 slots in this space */
 
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 709e234..17f5d8a 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -241,6 +241,11 @@ int main(void)
OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   OFFSET(PACA_AMR, paca_struct, paca_amr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime);
OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user);
OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime);
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 3fd0528..a4de1b4 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -493,9 +493,15 @@ EXC_COMMON_BEGIN(data_access_common)
ld  r12,_MSR(r1)
ld  r3,PACA_EXGEN+EX_DAR(r13)
lwz r4,PACA_EXGEN+EX_DSISR(r13)
-   li  r5,0x300
std r3,_DAR(r1)
std r4,_DSISR(r1)
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   andis.  r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
+   beq+1f
+   mfspr   r5,SPRN_AMR
+   std r5,PACA_AMR(r13)
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+1: li  r5,0x300
 BEGIN_MMU_FTR_SECTION
b   do_hash_page/* Try to handle as hpte fault */
 MMU_FTR_SECTION_ELSE
@@ -561,9 +567,15 @@ EXC_COMMON_BEGIN(instruction_access_common)
ld  r12,_MSR(r1)
ld  r3,_NIP(r1)
andis.  r4,r12,0x5820
-   li  r5,0x400
std r3,_DAR(r1)
std r4,_DSISR(r1)
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   andis.  r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
+   beq+1f
+   mfspr   r5,SPRN_AMR
+   std r5,PACA_AMR(r13)
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+1: li  r5,0x400
 BEGIN_MMU_FTR_SECTION
b   do_hash_page/* Try to handle as hpte fault */
 MMU_FTR_SECTION_ELSE
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 97bb138..059766a 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -500,6 +500,11 @@ static int save_user_regs(struct pt_regs *regs, struct 
mcontext __user *frame,
   (unsigned long) &frame->tramp[2]);
}
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   if (__put_user(get_paca()->paca_amr, &frame->mc_gregs[PT_AMR]))
+   return 1;
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
return 0;
 }
 
@@ -661,6 +666,9 @@ static long restore_user_regs(struct pt_regs *regs,
long err;
unsigned int save_r2 = 0;
unsigned long msr;
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   unsigned long amr;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 #ifdef CONFIG_VSX
int i;
 #endif
@@ -750,6 +758,12 @@ static long resto

[RFC v3 16/23] powerpc: Macro the mask used for checking DSI exception

2017-06-21 Thread Ram Pai
Replace the magic number used to check for DSI exception
with a meaningful value.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/reg.h   | 7 ++-
 arch/powerpc/kernel/exceptions-64s.S | 2 +-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 7e50e47..ba110dd 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -272,16 +272,21 @@
 #define SPRN_DAR   0x013   /* Data Address Register */
 #define SPRN_DBCR  0x136   /* e300 Data Breakpoint Control Reg */
 #define SPRN_DSISR 0x012   /* Data Storage Interrupt Status Register */
+#define   DSISR_BIT32  0x8000  /* not defined */
 #define   DSISR_NOHPTE 0x4000  /* no translation found */
+#define   DSISR_PAGEATTR_CONFLT0x2000  /* page attribute 
conflict */
+#define   DSISR_BIT35  0x1000  /* not defined */
 #define   DSISR_PROTFAULT  0x0800  /* protection fault */
 #define   DSISR_BADACCESS  0x0400  /* bad access to CI or G */
 #define   DSISR_ISSTORE0x0200  /* access was a store */
 #define   DSISR_DABRMATCH  0x0040  /* hit data breakpoint */
-#define   DSISR_NOSEGMENT  0x0020  /* SLB miss */
 #define   DSISR_KEYFAULT   0x0020  /* Key fault */
+#define   DSISR_BIT43  0x0010  /* not defined */
 #define   DSISR_UNSUPP_MMU 0x0008  /* Unsupported MMU config */
 #define   DSISR_SET_RC 0x0004  /* Failed setting of R/C bits */
 #define   DSISR_PGDIRFAULT  0x0002  /* Fault on page directory */
+#define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | DSISR_PAGEATTR_CONFLT | \
+   DSISR_BADACCESS | DSISR_BIT43)
 #define SPRN_TBRL  0x10C   /* Time Base Read Lower Register (user, R/O) */
 #define SPRN_TBRU  0x10D   /* Time Base Read Upper Register (user, R/O) */
 #define SPRN_CIR   0x11B   /* Chip Information Register (hyper, R/0) */
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index ae418b8..3fd0528 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1411,7 +1411,7 @@ USE_TEXT_SECTION()
.balign IFETCH_ALIGN_BYTES
 do_hash_page:
 #ifdef CONFIG_PPC_STD_MMU_64
-   andis.  r0,r4,0xa410/* weird error? */
+   andis.  r0,r4,DSISR_PAGE_FAULT_MASK@h
bne-handle_page_fault   /* if not, try to insert a HPTE */
andis.  r0,r4,DSISR_DABRMATCH@h
bne-handle_dabr_fault
-- 
1.8.3.1



[RFC v3 15/23] powerpc: Program HPTE key protection bits

2017-06-21 Thread Ram Pai
Map the PTE protection key bits to the HPTE key protection bits
while creating HPTE entries.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 +
 arch/powerpc/include/asm/pkeys.h  | 7 +++
 arch/powerpc/mm/hash_utils_64.c   | 5 +
 3 files changed, 17 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 6981a52..f7a6ed3 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -90,6 +90,8 @@
 #define HPTE_R_PP0 ASM_CONST(0x8000)
 #define HPTE_R_TS  ASM_CONST(0x4000)
 #define HPTE_R_KEY_HI  ASM_CONST(0x3000)
+#define HPTE_R_KEY_BIT0ASM_CONST(0x2000)
+#define HPTE_R_KEY_BIT1ASM_CONST(0x1000)
 #define HPTE_R_RPN_SHIFT   12
 #define HPTE_R_RPN ASM_CONST(0x0000)
 #define HPTE_R_RPN_3_0 ASM_CONST(0x01fff000)
@@ -104,6 +106,9 @@
 #define HPTE_R_C   ASM_CONST(0x0080)
 #define HPTE_R_R   ASM_CONST(0x0100)
 #define HPTE_R_KEY_LO  ASM_CONST(0x0e00)
+#define HPTE_R_KEY_BIT2ASM_CONST(0x0800)
+#define HPTE_R_KEY_BIT3ASM_CONST(0x0400)
+#define HPTE_R_KEY_BIT4ASM_CONST(0x0200)
 
 #define HPTE_V_1TB_SEG ASM_CONST(0x4000)
 #define HPTE_V_VRMA_MASK   ASM_CONST(0x4001ff00)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 0f3dca8..af3882f 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -27,6 +27,13 @@
((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) | \
((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL))
 
+#define pte_to_hpte_pkey_bits(pteflags)\
+   (((pteflags & H_PAGE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL) |\
+   ((pteflags & H_PAGE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) | \
+   ((pteflags & H_PAGE_PKEY_BIT2) ? HPTE_R_KEY_BIT2 : 0x0UL) | \
+   ((pteflags & H_PAGE_PKEY_BIT3) ? HPTE_R_KEY_BIT3 : 0x0UL) | \
+   ((pteflags & H_PAGE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL))
+
 /*
  * Bits are in BE format.
  * NOTE: key 31, 1, 0 are not used.
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index b3bc5d6..34bc94c 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -230,6 +231,10 @@ unsigned long htab_convert_pte_flags(unsigned long 
pteflags)
 */
rflags |= HPTE_R_M;
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   rflags |= pte_to_hpte_pkey_bits(pteflags);
+#endif
+
return rflags;
 }
 
-- 
1.8.3.1



[RFC v3 20/23] selftest: PowerPC specific test updates to memory protection keys

2017-06-21 Thread Ram Pai
Abstracted out the arch specific code into the header file, and
added powerpc specific changes.

a) added a 4k-backed hpte memory allocator, powerpc specific.
b) added three test cases where the key is associated after the page is
accessed/allocated/mapped (see the sketch below).
c) cleaned up the code to make checkpatch.pl happy
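
Sketch of test case (b), the key being attached only after the page has
already been faulted in (helper names follow protection_keys.c; treat
this as illustrative):

	int pkey;
	char *ptr = malloc(PAGE_SIZE);

	ptr[0] = 0;			/* touch: the page is mapped now */
	pkey = sys_pkey_alloc(0, 0);
	sys_mprotect_pkey(ptr, PAGE_SIZE, PROT_READ | PROT_WRITE, pkey);
	pkey_access_deny(pkey);
	/* a subsequent read of ptr[0] should now take a key fault */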

Signed-off-by: Ram Pai 
---
 tools/testing/selftests/vm/pkey-helpers.h| 230 +--
 tools/testing/selftests/vm/protection_keys.c | 562 ---
 2 files changed, 513 insertions(+), 279 deletions(-)

diff --git a/tools/testing/selftests/vm/pkey-helpers.h 
b/tools/testing/selftests/vm/pkey-helpers.h
index b202939..69bfa89 100644
--- a/tools/testing/selftests/vm/pkey-helpers.h
+++ b/tools/testing/selftests/vm/pkey-helpers.h
@@ -12,13 +12,72 @@
 #include 
 #include 
 
-#define NR_PKEYS 16
-#define PKRU_BITS_PER_PKEY 2
+/* Define some kernel-like types */
+#define  u8 uint8_t
+#define u16 uint16_t
+#define u32 uint32_t
+#define u64 uint64_t
+
+#ifdef __i386__ /* arch */
+
+#define SYS_mprotect_key 380
+#define SYS_pkey_alloc  381
+#define SYS_pkey_free   382
+#define REG_IP_IDX REG_EIP
+#define si_pkey_offset 0x14
+
+#define NR_PKEYS   16
+#define NR_RESERVED_PKEYS  1
+#define PKRU_BITS_PER_PKEY 2
+#define PKEY_DISABLE_ACCESS  0x1
+#define PKEY_DISABLE_WRITE 0x2
+#define HPAGE_SIZE (1UL<<21)
+
+#define INIT_PRKU 0x0UL
+
+#elif __powerpc64__ /* arch */
+
+#define SYS_mprotect_key 386
+#define SYS_pkey_alloc  384
+#define SYS_pkey_free   385
+#define si_pkey_offset 0x20
+#define REG_IP_IDX PT_NIP
+#define REG_TRAPNO PT_TRAP
+#define REG_AMR45
+#define gregs gp_regs
+#define fpregs fp_regs
+
+#define NR_PKEYS   32
+#define NR_RESERVED_PKEYS  3
+#define PKRU_BITS_PER_PKEY 2
+#define PKEY_DISABLE_ACCESS  0x3  /* disable read and write */
+#define PKEY_DISABLE_WRITE 0x2
+#define HPAGE_SIZE (1UL<<24)
+
+#define INIT_PRKU 0x3UL
+#else /* arch */
+
+   NOT SUPPORTED
+
+#endif /* arch */
+
 
 #ifndef DEBUG_LEVEL
 #define DEBUG_LEVEL 0
 #endif
 #define DPRINT_IN_SIGNAL_BUF_SIZE 4096
+
+
+static inline u32 pkey_to_shift(int pkey)
+{
+#ifdef __i386__ /* arch */
+   return pkey * PKRU_BITS_PER_PKEY;
+#elif __powerpc64__ /* arch */
+   return (NR_PKEYS - pkey - 1) * PKRU_BITS_PER_PKEY;
+#endif /* arch */
+}
+
+
 extern int dprint_in_signal;
 extern char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE];
 static inline void sigsafe_printf(const char *format, ...)
@@ -53,53 +112,76 @@ static inline void sigsafe_printf(const char *format, ...)
 #define dprintf3(args...) dprintf_level(3, args)
 #define dprintf4(args...) dprintf_level(4, args)
 
-extern unsigned int shadow_pkru;
-static inline unsigned int __rdpkru(void)
+extern u64 shadow_pkey_reg;
+
+static inline u64 __rdpkey_reg(void)
 {
+#ifdef __i386__ /* arch */
unsigned int eax, edx;
unsigned int ecx = 0;
-   unsigned int pkru;
+   unsigned int pkey_reg;
 
asm volatile(".byte 0x0f,0x01,0xee\n\t"
 : "=a" (eax), "=d" (edx)
 : "c" (ecx));
-   pkru = eax;
-   return pkru;
+#elif __powerpc64__ /* arch */
+   u64 eax;
+   u64 pkey_reg;
+
+   asm volatile("mfspr %0, 0xd" : "=r" ((u64)(eax)));
+#endif /* arch */
+   pkey_reg = (u64)eax;
+   return pkey_reg;
 }
 
-static inline unsigned int _rdpkru(int line)
+static inline u64 _rdpkey_reg(int line)
 {
-   unsigned int pkru = __rdpkru();
+   u64 pkey_reg = __rdpkey_reg();
 
-   dprintf4("rdpkru(line=%d) pkru: %x shadow: %x\n",
-   line, pkru, shadow_pkru);
-   assert(pkru == shadow_pkru);
+   dprintf4("rdpkey_reg(line=%d) pkey_reg: %lx shadow: %lx\n",
+   line, pkey_reg, shadow_pkey_reg);
+   assert(pkey_reg == shadow_pkey_reg);
 
-   return pkru;
+   return pkey_reg;
 }
 
-#define rdpkru() _rdpkru(__LINE__)
+#define rdpkey_reg() _rdpkey_reg(__LINE__)
 
-static inline void __wrpkru(unsigned int pkru)
+static inline void __wrpkey_reg(u64 pkey_reg)
 {
-   unsigned int eax = pkru;
+#ifdef __i386__ /* arch */
+   unsigned int eax = pkey_reg;
unsigned int ecx = 0;
unsigned int edx = 0;
 
-   dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
+   dprintf4("%s() changing %lx to %lx\n",
+__func__, __rdpkey_reg(), pkey_reg);
asm volatile(".byte 0x0f,0x01,0xef\n\t"
 : : "a" (eax), "c" (ecx), "d" (edx));
-   assert(pkru == __rdpkru());
+   dprintf4("%s() PKRUP after changing %lx to %lx\n",
+   __func__, __rdpkey_reg(), pkey_reg);
+#else /* arch */
+   u64 eax = pkey_reg;
+
+   dprintf4("%s() changing %llx to %llx\n",
+__func__, __rdpkey_reg(), pkey_reg);
+   asm volatile("mtspr 0xd, %0" : : "r" ((unsigned long)(eax)) : "memory");
+   dprintf4("%s() PKRUP after changi

[RFC v3 19/23] selftest: Move protection key selftest to arch neutral directory

2017-06-21 Thread Ram Pai
Signed-off-by: Ram Pai 
---
 tools/testing/selftests/vm/Makefile   |1 +
 tools/testing/selftests/vm/pkey-helpers.h |  219 
 tools/testing/selftests/vm/protection_keys.c  | 1395 +
 tools/testing/selftests/x86/Makefile  |2 +-
 tools/testing/selftests/x86/pkey-helpers.h|  219 
 tools/testing/selftests/x86/protection_keys.c | 1395 -
 6 files changed, 1616 insertions(+), 1615 deletions(-)
 create mode 100644 tools/testing/selftests/vm/pkey-helpers.h
 create mode 100644 tools/testing/selftests/vm/protection_keys.c
 delete mode 100644 tools/testing/selftests/x86/pkey-helpers.h
 delete mode 100644 tools/testing/selftests/x86/protection_keys.c

diff --git a/tools/testing/selftests/vm/Makefile 
b/tools/testing/selftests/vm/Makefile
index cbb29e4..1d32f78 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -17,6 +17,7 @@ TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
 TEST_GEN_FILES += mlock-random-test
 TEST_GEN_FILES += virtual_address_range
+TEST_GEN_FILES += protection_keys
 
 TEST_PROGS := run_vmtests
 
diff --git a/tools/testing/selftests/vm/pkey-helpers.h 
b/tools/testing/selftests/vm/pkey-helpers.h
new file mode 100644
index 000..b202939
--- /dev/null
+++ b/tools/testing/selftests/vm/pkey-helpers.h
@@ -0,0 +1,219 @@
+#ifndef _PKEYS_HELPER_H
+#define _PKEYS_HELPER_H
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define NR_PKEYS 16
+#define PKRU_BITS_PER_PKEY 2
+
+#ifndef DEBUG_LEVEL
+#define DEBUG_LEVEL 0
+#endif
+#define DPRINT_IN_SIGNAL_BUF_SIZE 4096
+extern int dprint_in_signal;
+extern char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE];
+static inline void sigsafe_printf(const char *format, ...)
+{
+   va_list ap;
+
+   va_start(ap, format);
+   if (!dprint_in_signal) {
+   vprintf(format, ap);
+   } else {
+   int len = vsnprintf(dprint_in_signal_buffer,
+   DPRINT_IN_SIGNAL_BUF_SIZE,
+   format, ap);
+   /*
+* len is amount that would have been printed,
+* but actual write is truncated at BUF_SIZE.
+*/
+   if (len > DPRINT_IN_SIGNAL_BUF_SIZE)
+   len = DPRINT_IN_SIGNAL_BUF_SIZE;
+   write(1, dprint_in_signal_buffer, len);
+   }
+   va_end(ap);
+}
+#define dprintf_level(level, args...) do { \
+   if (level <= DEBUG_LEVEL)   \
+   sigsafe_printf(args);   \
+   fflush(NULL);   \
+} while (0)
+#define dprintf0(args...) dprintf_level(0, args)
+#define dprintf1(args...) dprintf_level(1, args)
+#define dprintf2(args...) dprintf_level(2, args)
+#define dprintf3(args...) dprintf_level(3, args)
+#define dprintf4(args...) dprintf_level(4, args)
+
+extern unsigned int shadow_pkru;
+static inline unsigned int __rdpkru(void)
+{
+   unsigned int eax, edx;
+   unsigned int ecx = 0;
+   unsigned int pkru;
+
+   asm volatile(".byte 0x0f,0x01,0xee\n\t"
+: "=a" (eax), "=d" (edx)
+: "c" (ecx));
+   pkru = eax;
+   return pkru;
+}
+
+static inline unsigned int _rdpkru(int line)
+{
+   unsigned int pkru = __rdpkru();
+
+   dprintf4("rdpkru(line=%d) pkru: %x shadow: %x\n",
+   line, pkru, shadow_pkru);
+   assert(pkru == shadow_pkru);
+
+   return pkru;
+}
+
+#define rdpkru() _rdpkru(__LINE__)
+
+static inline void __wrpkru(unsigned int pkru)
+{
+   unsigned int eax = pkru;
+   unsigned int ecx = 0;
+   unsigned int edx = 0;
+
+   dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
+   asm volatile(".byte 0x0f,0x01,0xef\n\t"
+: : "a" (eax), "c" (ecx), "d" (edx));
+   assert(pkru == __rdpkru());
+}
+
+static inline void wrpkru(unsigned int pkru)
+{
+   dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
+   /* will do the shadow check for us: */
+   rdpkru();
+   __wrpkru(pkru);
+   shadow_pkru = pkru;
+   dprintf4("%s(%08x) pkru: %08x\n", __func__, pkru, __rdpkru());
+}
+
+/*
+ * These are technically racy, since something could
+ * change PKRU between the read and the write.
+ */
+static inline void __pkey_access_allow(int pkey, int do_allow)
+{
+   unsigned int pkru = rdpkru();
+   int bit = pkey * 2;
+
+   if (do_allow)
+   pkru &= (1

[RFC v3 21/23] Documentation: Move protection key documentation to arch neutral directory

2017-06-21 Thread Ram Pai
Since PowerPC and Intel both support memory protection keys, move
the documentation to an arch-neutral directory.

Signed-off-by: Ram Pai 
---
 Documentation/vm/protection-keys.txt  | 85 +++
 Documentation/x86/protection-keys.txt | 85 ---
 2 files changed, 85 insertions(+), 85 deletions(-)
 create mode 100644 Documentation/vm/protection-keys.txt
 delete mode 100644 Documentation/x86/protection-keys.txt

diff --git a/Documentation/vm/protection-keys.txt b/Documentation/vm/protection-keys.txt
new file mode 100644
index 000..b643045
--- /dev/null
+++ b/Documentation/vm/protection-keys.txt
@@ -0,0 +1,85 @@
+Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
+which will be found on future Intel CPUs.
+
+Memory Protection Keys provides a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables
+when an application changes protection domains.  It works by
+dedicating 4 previously ignored bits in each page table entry to a
+"protection key", giving 16 possible keys.
+
+There is also a new user-accessible register (PKRU) with two separate
+bits (Access Disable and Write Disable) for each key.  Being a CPU
+register, PKRU is inherently thread-local, potentially giving each
+thread a different set of protections from every other thread.
+
+There are two new instructions (RDPKRU/WRPKRU) for reading and writing
+to the new register.  The feature is only available in 64-bit mode,
+even though there is theoretically space in the PAE PTEs.  These
+permissions are enforced on data access only and have no effect on
+instruction fetches.
+
+=== Syscalls ===
+
+There are 3 system calls which directly interact with pkeys:
+
+   int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
+   int pkey_free(int pkey);
+   int pkey_mprotect(unsigned long start, size_t len,
+ unsigned long prot, int pkey);
+
+Before a pkey can be used, it must first be allocated with
+pkey_alloc().  An application calls the WRPKRU instruction
+directly in order to change access permissions to memory covered
+with a key.  In this example WRPKRU is wrapped by a C function
+called pkey_set().
+
+   int real_prot = PROT_READ|PROT_WRITE;
+   pkey = pkey_alloc(0, PKEY_DENY_WRITE);
+   ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 
0);
+   ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
+   ... application runs here
+
+Now, if the application needs to update the data at 'ptr', it can
+gain access, do the update, then remove its write access:
+
+   pkey_set(pkey, 0); // clear PKEY_DENY_WRITE
+   *ptr = foo; // assign something
+   pkey_set(pkey, PKEY_DENY_WRITE); // set PKEY_DENY_WRITE again
+
+Now when it frees the memory, it will also free the pkey since it
+is no longer in use:
+
+   munmap(ptr, PAGE_SIZE);
+   pkey_free(pkey);
+
+(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
+ An example implementation can be found in
+ tools/testing/selftests/x86/protection_keys.c)
+
+=== Behavior ===
+
+The kernel attempts to make protection keys consistent with the
+behavior of a plain mprotect().  For instance if you do this:
+
+   mprotect(ptr, size, PROT_NONE);
+   something(ptr);
+
+you can expect the same effects with protection keys when doing this:
+
+   pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ);
+   pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
+   something(ptr);
+
+That should be true whether something() is a direct access to 'ptr'
+like:
+
+   *ptr = foo;
+
+or when the kernel does the access on the application's behalf like
+with a read():
+
+   read(fd, ptr, 1);
+
+The kernel will send a SIGSEGV in both cases, but si_code will be set
+to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
+the plain mprotect() permissions are violated.
diff --git a/Documentation/x86/protection-keys.txt b/Documentation/x86/protection-keys.txt
deleted file mode 100644
index b643045..000
--- a/Documentation/x86/protection-keys.txt
+++ /dev/null
@@ -1,85 +0,0 @@
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
-which will be found on future Intel CPUs.
-
-Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
-
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every oth

[RFC v3 22/23] Documentation: PowerPC specific updates to memory protection keys

2017-06-21 Thread Ram Pai
Add documentation updates that capture PowerPC specific changes.

Signed-off-by: Ram Pai 
---
 Documentation/vm/protection-keys.txt | 65 +---
 1 file changed, 45 insertions(+), 20 deletions(-)

diff --git a/Documentation/vm/protection-keys.txt b/Documentation/vm/protection-keys.txt
index b643045..965ad75 100644
--- a/Documentation/vm/protection-keys.txt
+++ b/Documentation/vm/protection-keys.txt
@@ -1,21 +1,46 @@
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
-which will be found on future Intel CPUs.
+Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature found in
+newer generations of Intel CPUs and on PowerPC 7 and higher CPUs.
 
 Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
-
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
-
-There are two new instructions (RDPKRU/WRPKRU) for reading and writing
-to the new register.  The feature is only available in 64-bit mode,
-even though there is theoretically space in the PAE PTEs.  These
-permissions are enforced on data access only and have no effect on
+protections, but without requiring modification of the page tables when an
+application changes protection domains.
+
+
+On Intel:
+
+   It works by dedicating 4 previously ignored bits in each page table
+   entry to a "protection key", giving 16 possible keys.
+
+   There is also a new user-accessible register (PKRU) with two separate
+   bits (Access Disable and Write Disable) for each key.  Being a CPU
+   register, PKRU is inherently thread-local, potentially giving each
+   thread a different set of protections from every other thread.
+
+   There are two new instructions (RDPKRU/WRPKRU) for reading and writing
+   to the new register.  The feature is only available in 64-bit mode,
+   even though there is theoretically space in the PAE PTEs.  These
+   permissions are enforced on data access only and have no effect on
+   instruction fetches.
+
+
+On PowerPC:
+
+   It works by dedicating 5 page table entry bits to a "protection key",
+   giving 32 possible keys.
+
+   There is a user-accessible register (AMR) with two separate bits,
+   Access Disable and Write Disable, for each key.  Being a CPU
+   register, AMR is inherently thread-local, potentially giving each
+   thread a different set of protections from every other thread.  NOTE:
+   Disabling read permission does not disable write and vice-versa.
+
+   The feature is available on 64-bit HPTE mode only.
+   'mtspr 0xd, mem' writes into the AMR register.
+   'mfspr mem, 0xd' reads the AMR register.
+
+
+
+Permissions are enforced on data access only and have no effect on
 instruction fetches.
 
 === Syscalls ===
@@ -28,9 +53,9 @@ There are 3 system calls which directly interact with pkeys:
  unsigned long prot, int pkey);
 
 Before a pkey can be used, it must first be allocated with
-pkey_alloc().  An application calls the WRPKRU instruction
+pkey_alloc().  An application calls the WRPKRU/AMR instruction
 directly in order to change access permissions to memory covered
-with a key.  In this example WRPKRU is wrapped by a C function
+with a key.  In this example WRPKRU/AMR is wrapped by a C function
 called pkey_set().
 
int real_prot = PROT_READ|PROT_WRITE;
@@ -52,11 +77,11 @@ is no longer in use:
munmap(ptr, PAGE_SIZE);
pkey_free(pkey);
 
-(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
+(Note: pkey_set() is a wrapper for the RDPKRU/WRPKRU instructions or the AMR register.
  An example implementation can be found in
  tools/testing/selftests/x86/protection_keys.c)
 
-=== Behavior ===
+=== Behavior =
 
 The kernel attempts to make protection keys consistent with the
 behavior of a plain mprotect().  For instance if you do this:
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC v3 23/23] procfs: display the protection-key number associated with a vma

2017-06-21 Thread Ram Pai
Display the pkey number associated with the vma in smaps of a task.
The key will be seen as below:

VmFlags: rd wr mr mw me dw ac key=0
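
For illustration, a small, self-contained userspace sketch (not part of
this patch) that dumps every VmFlags line of the calling process so the
new key= field can be inspected; it assumes a kernel with this patch
applied:

/*
 * Illustrative sketch: print every VmFlags line from /proc/self/smaps
 * so the "key=<N>" field added by this patch can be seen.  A real tool
 * would match the line to a specific mapping.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[256];

	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "VmFlags:", 8))
			fputs(line, stdout);

	fclose(f);
	return 0;
}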

Signed-off-by: Ram Pai 
---
 Documentation/filesystems/proc.txt |  3 ++-
 fs/proc/task_mmu.c | 22 +++---
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 4cddbce..a8c74aa 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -423,7 +423,7 @@ SwapPss:   0 kB
 KernelPageSize:4 kB
 MMUPageSize:   4 kB
 Locked:0 kB
-VmFlags: rd ex mr mw me dw
+VmFlags: rd ex mr mw me dw key=
 
 the first of these lines shows the same information as is displayed for the
 mapping in /proc/PID/maps.  The remaining lines show the size of the mapping
@@ -491,6 +491,7 @@ manner. The codes are the following:
 hg  - huge page advise flag
 nh  - no-huge page advise flag
 mg  - mergable advise flag
+key= - the memory protection key number
 
 Note that there is no guarantee that every flag and associated mnemonic will
 be present in all further kernel releases. Things get changed, the flags may
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2ddc298..d2eb096 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1,4 +1,6 @@
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -666,22 +668,20 @@ static void show_smap_vma_flags(struct seq_file *m, 
struct vm_area_struct *vma)
[ilog2(VM_MERGEABLE)]   = "mg",
[ilog2(VM_UFFD_MISSING)]= "um",
[ilog2(VM_UFFD_WP)] = "uw",
-#ifdef CONFIG_ARCH_HAS_PKEYS
-   /* These come out via ProtectionKey: */
-   [ilog2(VM_PKEY_BIT0)]   = "",
-   [ilog2(VM_PKEY_BIT1)]   = "",
-   [ilog2(VM_PKEY_BIT2)]   = "",
-   [ilog2(VM_PKEY_BIT3)]   = "",
-#endif /* CONFIG_ARCH_HAS_PKEYS */
-#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
-   /* Additional bit in ProtectionKey: */
-   [ilog2(VM_PKEY_BIT4)]   = "",
-#endif
};
size_t i;
 
seq_puts(m, "VmFlags: ");
for (i = 0; i < BITS_PER_LONG; i++) {
+#ifdef CONFIG_ARCH_HAS_PKEYS
+   if (i == ilog2(VM_PKEY_BIT0)) {
+   int keyvalue = vma_pkey(vma);
+
+   i += ilog2(arch_max_pkey())-1;
+   seq_printf(m, "key=%d ", keyvalue);
+   continue;
+   }
+#endif /* CONFIG_ARCH_HAS_PKEYS */
if (!mnemonics[i][0])
continue;
if (vma->vm_flags & (1UL << i)) {
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC v3 17/23] powerpc: Handle exceptions caused by violation of pkey protection

2017-06-21 Thread Ram Pai
Handle Data and Instruction exceptions caused by memory
protection-key.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/mmu_context.h | 12 +
 arch/powerpc/include/asm/pkeys.h   |  9 
 arch/powerpc/include/asm/reg.h |  2 +-
 arch/powerpc/mm/fault.c| 20 
 arch/powerpc/mm/pkeys.c| 90 ++
 5 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index da7e943..71fffe0 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
 {
 }
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+bool arch_pte_access_permitted(pte_t pte, bool write);
+bool arch_vma_access_permitted(struct vm_area_struct *vma,
+   bool write, bool execute, bool foreign);
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+   /* by default, allow everything */
+   return true;
+}
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
bool write, bool execute, bool foreign)
 {
/* by default, allow everything */
return true;
 }
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index af3882f..a83722e 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -14,6 +14,15 @@
VM_PKEY_BIT3 | \
VM_PKEY_BIT4)
 
+static inline u16 pte_flags_to_pkey(unsigned long pte_flags)
+{
+   return ((pte_flags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0) |
+   ((pte_flags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0) |
+   ((pte_flags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0) |
+   ((pte_flags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0) |
+   ((pte_flags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0);
+}
+
 #define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) |\
((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) |\
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index ba110dd..6e2a860 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -286,7 +286,7 @@
 #define   DSISR_SET_RC 0x0004  /* Failed setting of R/C bits */
 #define   DSISR_PGDIRFAULT  0x0002  /* Fault on page directory */
 #define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | DSISR_PAGEATTR_CONFLT | \
-   DSISR_BADACCESS | DSISR_BIT43)
+   DSISR_BADACCESS | DSISR_KEYFAULT | DSISR_BIT43)
 #define SPRN_TBRL  0x10C   /* Time Base Read Lower Register (user, R/O) */
 #define SPRN_TBRU  0x10D   /* Time Base Read Upper Register (user, R/O) */
 #define SPRN_CIR   0x11B   /* Chip Information Register (hyper, R/0) */
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 3a7d580..3d71984 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -261,6 +261,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long 
address,
}
 #endif
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   if (error_code & DSISR_KEYFAULT) {
+   code = SEGV_PKUERR;
+   goto bad_area_nosemaphore;
+   }
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
/* We restore the interrupt state now */
if (!arch_irq_disabled_regs(regs))
local_irq_enable();
@@ -441,6 +448,19 @@ int do_page_fault(struct pt_regs *regs, unsigned long 
address,
WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
 #endif /* CONFIG_PPC_STD_MMU */
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+   is_exec, 0)) {
+   code = SEGV_PKUERR;
+   goto bad_area;
+   }
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+   /* handle_mm_fault() needs to know if it's an instruction access
+* fault.
+*/
+   if (is_exec)
+   flags |= FAULT_FLAG_INSTRUCTION;
/*
 * If for any reason at all we couldn't handle the fault,
 * make sure we exit gracefully rather than endlessly redo
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 11a32b3..439241a 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -27,6 +27,37 @@ static inline bool pkey_allows_readwrite(int pkey)
return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
 }
 
+static inline bool pkey_allows_read(int pkey)
+{
+   int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+

[RFC v3 00/23] powerpc: Memory Protection Keys

2017-06-21 Thread Ram Pai
Memory protection keys enable applications to protect their
address space from inadvertent access or corruption by
themselves.

The overall idea:

 A process allocates a key and associates it with
 an address range within its address space.
 The process can then dynamically set read/write
 permissions on the key without involving the
 kernel. Any code that violates the permissions
 of the address space, as defined by its associated
 key, will receive a segmentation fault.
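
To make that flow concrete, below is a minimal, illustrative userspace
sketch. It is not part of this series: it assumes the SYS_pkey_* syscall
numbers and the PKEY_DISABLE_WRITE flag are exported by the installed
headers, omits error handling, and only describes the per-thread register
write (PKRU on x86, AMR on powerpc) in a comment, since such a wrapper
is architecture specific.

/*
 * Illustrative sketch only: allocate a key that starts out
 * write-disabled and attach it to an anonymous mapping.
 */
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef PKEY_DISABLE_WRITE
#define PKEY_DISABLE_WRITE 0x2	/* assumed to match the uapi value */
#endif

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *ptr = mmap(NULL, page, PROT_NONE,
			 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

	/* allocate a key whose initial access rights deny writes */
	int pkey = syscall(SYS_pkey_alloc, 0, PKEY_DISABLE_WRITE);

	/* attach the key; the mapping itself stays read/write */
	syscall(SYS_pkey_mprotect, ptr, page, PROT_READ | PROT_WRITE, pkey);

	/*
	 * A write through 'ptr' would now fault with SEGV_PKUERR.  To
	 * update the data, the thread clears the write-disable bit for
	 * 'pkey' in its per-thread register (PKRU/AMR) without entering
	 * the kernel, then sets it again when done.
	 */

	syscall(SYS_pkey_free, pkey);
	munmap(ptr, page);
	return 0;
}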

This patch series enables the feature on PPC64.
It is enabled on the HPTE 64K-page platform.

ISA3.0 section 5.7.13 describes the detailed specifications.


Testing:
This patch series has passed all the protection key
tests available in the selftests directory.
The tests are updated to work on both x86 and powerpc.

version v3:
(1) split the patches into smaller consumable
patches.
(2) added the ability to disable execute permission
on a key at creation.
(3) renamed calc_pte_to_hpte_pkey_bits() to
pte_to_hpte_pkey_bits() -- suggested by Anshuman
(4) some code optimization and clarity in
do_page_fault()
(5) a bug fix while invalidating an HPTE slot in
__hash_page_4K() -- noticed by Aneesh


version v2:
(1) documentation and selftest added
(2) fixed a bug in 4k hpte backed 64k pte where page
invalidation was not done correctly, and 
initialization of second-part-of-the-pte was not
done correctly if the pte was not yet hashed
with an HPTE.  Reported by Aneesh.
(3) fixed ABI breakage caused in the siginfo structure.
Reported by Anshuman.

Outstanding known issue:
Calls to sys_swapcontext with a made-up context will end 
up with a crap AMR if done by code that didn't know about
that register. -- Reported by Ben.

version v1: Initial version

Thanks-to: Dave Hansen, Aneesh, Paul Mackerras,
   Michael Ellermen


Ram Pai (23):
  powerpc: Free up four 64K PTE bits in 4K backed HPTE pages
  powerpc: introduce set_hidx_slot helper
  powerpc: introduce get_hidx_gslot helper
  powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
  powerpc: capture the PTE format changes in the dump pte report
  powerpc: use helper functions in __hash_page_4K() for 64K PTE
  powerpc: use helper functions in __hash_page_4K() for 4K PTE
  powerpc: use helper functions in flush_hash_page()
  mm: introduce an additional vma bit for powerpc pkey
  mm: provide the ability to disable execute on a key at creation
  x86: key creation with PKEY_DISABLE_EXECUTE is disallowed
  powerpc: Implement sys_pkey_alloc and sys_pkey_free system call
  powerpc: store and restore the pkey state across context switches
  powerpc: Implementation for sys_mprotect_pkey() system call
  powerpc: Program HPTE key protection bits
  powerpc: Macro the mask used for checking DSI exception
  powerpc: Handle exceptions caused by violation of pkey protection
  powerpc: Deliver SEGV signal on pkey violation
  selftest: Move protection key selftest to arch neutral directory
  selftest: PowerPC specific test updates to memory protection keys
  Documentation: Move protection key documentation to arch neutral
directory
  Documentation: PowerPC specific updates to memory protection keys
  procfs: display the protection-key number associated with a vma

 Documentation/filesystems/proc.txt|3 +-
 Documentation/vm/protection-keys.txt  |  110 ++
 Documentation/x86/protection-keys.txt |   85 --
 arch/powerpc/Kconfig  |   15 +
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |   14 +
 arch/powerpc/include/asm/book3s/64/hash-64k.h |   53 +-
 arch/powerpc/include/asm/book3s/64/hash.h |   15 +-
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |5 +
 arch/powerpc/include/asm/book3s/64/mmu.h  |   10 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  |   84 +-
 arch/powerpc/include/asm/mman.h   |   14 +-
 arch/powerpc/include/asm/mmu_context.h|   12 +
 arch/powerpc/include/asm/paca.h   |1 +
 arch/powerpc/include/asm/pkeys.h  |  159 +++
 arch/powerpc/include/asm/processor.h  |5 +
 arch/powerpc/include/asm/reg.h|7 +-
 arch/powerpc/include/asm/systbl.h |3 +
 arch/powerpc/include/asm/unistd.h |6 +-
 arch/powerpc/include/uapi/asm/ptrace.h|3 +-
 arch/powerpc/include/uapi/asm/unistd.h|3 +
 arch/powerpc/kernel/asm-offsets.c |5 +
 arch/powerpc/kernel/exceptions-64s.S  |   18 +-
 arch/powerpc/kernel/process.c |   18 +
 arch/powerpc/kernel/signal_32.c   |   14 +
 arch/powerpc/kernel/signal_64.c   |   14 +
 arch/powerpc/kernel/traps.c   |   49 +
 arch/powerpc/mm/Makefile

Re: [PATCH] kbuild: replace genhdr-y with generated-y, deprecating genhdr-y

2017-06-21 Thread Masahiro Yamada
2017-06-09 17:29 GMT+09:00 Masahiro Yamada :
> Prior to commit fcc8487d477a ("uapi: export all headers under uapi
> directories"), genhdr-y was meant to specify generated UAPI headers.
>
> - generated-y: generated headers (other than asm-generic wrappers)
> - header-y   : headers to be exported
> - genhdr-y   : generated headers to be exported (generated-y + header-y)
>
> Now headers under UAPI directories are all exported.  So, there is no
> more difference between generated-y and genhdr-y.
>
> We see two users of genhdr-y, arch/{arm,x86}/include/uapi/asm/Kbuild.
> They generate some headers in arch/{arm,x86}/include/uapi/generated/
> directories, which are obviously exported.
>
> Replace genhdr-y with generated-y, and deprecate genhdr-y.
>
> Signed-off-by: Masahiro Yamada 

Applied to linux-kbuild/kbuild.



-- 
Best Regards
Masahiro Yamada
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE

2017-06-21 Thread Roman Gushchin
We want to limit the number of tasks which have access
to the memory reserves. To ensure progress, it's enough
to have one such process at a time.

If we need to kill the whole cgroup, let's give access to the
memory reserves only to the first process in the list, which is
(usually) the biggest process.
This gives us a good chance that all other processes will be able
to quit without access to the memory reserves.

Otherwise, to keep making forward progress, let's grant access to the
memory reserves to tasks which can't be reaped by the oom_reaper.
As this is done from the oom reaper thread, which handles the
oom reaper queue sequentially, there is no high risk of having too many
such processes at the same time.

To implement this solution, we need to stop using the TIF_MEMDIE flag
as a universal marker for oom victim tasks. That's not a big issue,
as we have the oom_mm pointer/tsk_is_oom_victim(), which are just better.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tejun Heo 
Cc: Tetsuo Handa 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 kernel/exit.c |  2 +-
 mm/oom_kill.c | 31 ++-
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index d211425..5b95d74 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -554,7 +554,7 @@ static void exit_mm(void)
task_unlock(current);
mm_update_next_owner(mm);
mmput(mm);
-   if (test_thread_flag(TIF_MEMDIE))
+   if (tsk_is_oom_victim(current))
exit_oom_victim();
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 489ab69..b55bd18 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -556,8 +556,18 @@ static void oom_reap_task(struct task_struct *tsk)
struct mm_struct *mm = tsk->signal->oom_mm;
 
/* Retry the down_read_trylock(mmap_sem) a few times */
-   while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, 
mm))
+   while (attempts++ < MAX_OOM_REAP_RETRIES &&
+  !__oom_reap_task_mm(tsk, mm)) {
+
+   /*
+* If the task has no access to the memory reserves,
+* grant it access to help the task exit.
+*/
+   if (!test_tsk_thread_flag(tsk, TIF_MEMDIE))
+   set_tsk_thread_flag(tsk, TIF_MEMDIE);
+
schedule_timeout_idle(HZ/10);
+   }
 
if (attempts <= MAX_OOM_REAP_RETRIES)
goto done;
@@ -647,16 +657,13 @@ static inline void wake_oom_reaper(struct task_struct 
*tsk)
  */
 static void mark_oom_victim(struct task_struct *tsk)
 {
-   struct mm_struct *mm = tsk->mm;
-
WARN_ON(oom_killer_disabled);
-   /* OOM killer might race with memcg OOM */
-   if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
-   return;
 
/* oom_mm is bound to the signal struct life time. */
-   if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
-   mmgrab(tsk->signal->oom_mm);
+   if (cmpxchg(&tsk->signal->oom_mm, NULL, tsk->mm) != NULL)
+   return;
+
+   mmgrab(tsk->signal->oom_mm);
 
/*
 * Make sure that the task is woken up from uninterruptible sleep
@@ -665,7 +672,13 @@ static void mark_oom_victim(struct task_struct *tsk)
 * that TIF_MEMDIE tasks should be ignored.
 */
__thaw_task(tsk);
-   atomic_inc(&oom_victims);
+
+   /*
+* If there are no oom victims in flight,
+* give the task access to the memory reserves.
+*/
+   if (atomic_inc_return(&oom_victims) == 1)
+   set_tsk_thread_flag(tsk, TIF_MEMDIE);
 }
 
 /**
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[v3 3/6] mm, oom: cgroup-aware OOM killer debug info

2017-06-21 Thread Roman Gushchin
Dump the cgroup oom badness score, as well as the name
of the chosen victim cgroup.

Here is how it looks in dmesg:
[   18.824495] Choosing a victim memcg because of the system-wide OOM
[   18.826911] Cgroup /A1: 200805
[   18.827996] Cgroup /A2: 273072
[   18.828937] Cgroup /A2/B3: 51
[   18.829795] Cgroup /A2/B4: 272969
[   18.830800] Cgroup /A2/B5: 52
[   18.831890] Chosen cgroup /A2/B4: 272969

Signed-off-by: Roman Gushchin 
Cc: Tejun Heo 
Cc: Johannes Weiner 
Cc: Li Zefan 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 mm/memcontrol.c | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bdb5103..4face20 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2669,7 +2669,15 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
+
+   pr_info("Choosing a victim memcg because of the %s",
+   oc->memcg ?
+   "memory limit reached of cgroup " :
+   "system-wide OOM\n");
if (oc->memcg) {
+   pr_cont_cgroup_path(oc->memcg->css.cgroup);
+   pr_cont("\n");
+
chosen_memcg = oc->memcg;
parent = oc->memcg;
}
@@ -2683,6 +2691,10 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 
points = mem_cgroup_oom_badness(iter, oc->nodemask);
 
+   pr_info("Cgroup ");
+   pr_cont_cgroup_path(iter->css.cgroup);
+   pr_cont(": %ld\n", points);
+
if (points > chosen_memcg_points) {
chosen_memcg = iter;
chosen_memcg_points = points;
@@ -2731,6 +2743,10 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
oc->chosen_memcg = chosen_memcg;
}
 
+   pr_info("Chosen cgroup ");
+   pr_cont_cgroup_path(chosen_memcg->css.cgroup);
+   pr_cont(": %ld\n", oc->chosen_points);
+
/*
 * Even if we have to kill all tasks in the cgroup,
 * we need to select the biggest task to start with.
@@ -2739,7 +2755,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 */
oc->chosen_points = 0;
mem_cgroup_scan_tasks(chosen_memcg, oom_evaluate_task, oc);
-   }
+   } else if (oc->chosen)
+   pr_info("Chosen task %s (%d) in root cgroup: %ld\n",
+   oc->chosen->comm, oc->chosen->pid, oc->chosen_points);
 
rcu_read_unlock();
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection

2017-06-21 Thread Roman Gushchin
The OOM killer should avoid unnecessary kills. To prevent them, during
the task list traversal we check for tasks which were previously
selected as oom victims. If there is such a task, a new victim
is not selected.

This approach is sub-optimal (we're doing a costly iteration over the
task list every time) and will not work for the cgroup-aware oom killer.

We already have the oom_victims counter, which can be used effectively
for this purpose.

If there are victims in flight, don't do anything; if the counter
falls to 0, there are no more oom victims left.
So, it's a good time to start looking for a new victim.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tejun Heo 
Cc: Tetsuo Handa 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 mm/oom_kill.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925..e3aaf5c8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc)
if (oom_killer_disabled)
return false;
 
+   /*
+* If there are oom victims in flight, we don't need to select
+* a new victim.
+*/
+   if (atomic_read(&oom_victims) > 0)
+   return true;
+
if (!is_memcg_oom(oc)) {
blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
if (freed > 0)
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[v3 6/6] mm,oom,docs: describe the cgroup-aware OOM killer

2017-06-21 Thread Roman Gushchin
Update cgroups v2 docs.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 Documentation/cgroup-v2.txt | 44 
 1 file changed, 44 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index a86f3cb..7a1a1ac 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -44,6 +44,7 @@ CONTENTS
 5-2-1. Memory Interface Files
 5-2-2. Usage Guidelines
 5-2-3. Memory Ownership
+5-2-4. Cgroup-aware OOM Killer
   5-3. IO
 5-3-1. IO Interface Files
 5-3-2. Writeback
@@ -799,6 +800,26 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
 
+  memory.oom_kill_all_tasks
+
+   A read-write single value file which exists on non-root
+   cgroups.  The default is "0".
+
+   Defines whether the OOM killer should treat the cgroup
+   as a single entity during the victim selection.
+
+   If set, it will cause the OOM killer to kill all tasks belonging
+   to the cgroup, in case of either a system-wide or a cgroup-wide OOM.
+
+  memory.oom_score_adj
+
+   A read-write single value file which exists on non-root
+   cgroups.  The default is "0".
+
+   OOM killer score adjustment, which has a similar meaning
+   to a per-process value, available via /proc/<pid>/oom_score_adj.
+   Should be in a range [-1000, 1000].
+
   memory.events
 
A read-only flat-keyed file which exists on non-root cgroups.
@@ -1028,6 +1049,29 @@ POSIX_FADV_DONTNEED to relinquish the ownership of 
memory areas
 belonging to the affected files to ensure correct memory ownership.
 
 
+5-2-4. Cgroup-aware OOM Killer
+
+Cgroup v2 memory controller implements a cgroup-aware OOM killer.
+It means that it treats memory cgroups as first class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, hierarchically looking for the largest memory
+consumer. By default, it will look for the biggest task in the
+biggest leaf cgroup.
+
+But a user can change this behavior by enabling the per-cgroup
+oom_kill_all_tasks option. If set, it causes the OOM killer to treat
+the whole cgroup as an indivisible memory consumer. If it is
+selected as an OOM victim, all tasks belonging to it will be killed.
+
+Tasks in the root cgroup are treated as independent memory consumers,
+and are compared with other memory consumers (e.g. leaf cgroups).
+The root cgroup doesn't support the oom_kill_all_tasks feature.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
 5-3. IO
 
 The "io" controller regulates the distribution of IO resources.  This
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[v3 4/6] mm, oom: introduce oom_score_adj for memory cgroups

2017-06-21 Thread Roman Gushchin
Introduce a per-memory-cgroup oom_score_adj setting.
A read-write single value file which exists on non-root
cgroups. The default is "0".

It will have a similar meaning to a per-process value,
available via /proc/<pid>/oom_score_adj.
Should be in a range [-1000, 1000].

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tejun Heo 
Cc: Tetsuo Handa 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h |  3 +++
 mm/memcontrol.c| 36 
 2 files changed, 39 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c59926c..b84a050 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -203,6 +203,9 @@ struct mem_cgroup {
/* kill all tasks in the subtree in case of OOM */
bool oom_kill_all_tasks;
 
+   /* OOM kill score adjustment */
+   short oom_score_adj;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4face20..e474eba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5333,6 +5333,36 @@ static ssize_t memory_oom_kill_all_tasks_write(struct 
kernfs_open_file *of,
return nbytes;
 }
 
+static int memory_oom_score_adj_show(struct seq_file *m, void *v)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+   short oom_score_adj = memcg->oom_score_adj;
+
+   seq_printf(m, "%d\n", oom_score_adj);
+
+   return 0;
+}
+
+static ssize_t memory_oom_score_adj_write(struct kernfs_open_file *of,
+   char *buf, size_t nbytes, loff_t off)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+   int oom_score_adj;
+   int err;
+
+   err = kstrtoint(strstrip(buf), 0, &oom_score_adj);
+   if (err)
+   return err;
+
+   if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
+   oom_score_adj > OOM_SCORE_ADJ_MAX)
+   return -EINVAL;
+
+   memcg->oom_score_adj = (short)oom_score_adj;
+
+   return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5459,6 +5489,12 @@ static struct cftype memory_files[] = {
.write = memory_oom_kill_all_tasks_write,
},
{
+   .name = "oom_score_adj",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .seq_show = memory_oom_score_adj_show,
+   .write = memory_oom_score_adj_write,
+   },
+   {
.name = "events",
.flags = CFTYPE_NOT_ON_ROOT,
.file_offset = offsetof(struct mem_cgroup, events_file),
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[v3 2/6] mm, oom: cgroup-aware OOM killer

2017-06-21 Thread Roman Gushchin
Traditionally, the OOM killer operates at the process level.
Under oom conditions, it finds a process with the highest oom score
and kills it.

This behavior doesn't suit systems with many running containers
well. There are three main issues:

1) There is no fairness between containers. A small container with
a few large processes will be chosen over a large one with a huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases a much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in the case of a system-wide OOM.

3) Per-process oom_score_adj affects global OOM, so it's a breach
in the isolation.

To address these issues, a cgroup-aware OOM killer is introduced.

Under OOM conditions, it tries to find the biggest memory consumer
and free memory by killing the corresponding task(s). The difference
from the "traditional" OOM killer is that it can treat memory cgroups
as memory consumers as well as single processes.

By default, it will look for the biggest leaf cgroup and kill
the largest task inside.

But a user can change this behavior by enabling the per-cgroup
oom_kill_all_tasks option. If set, it causes the OOM killer to treat
the whole cgroup as an indivisible memory consumer. If it is
selected as an OOM victim, all tasks belonging to it will be killed.

Tasks in the root cgroup are treated as independent memory consumers,
and are compared with other memory consumers (e.g. leaf cgroups).
The root cgroup doesn't support the oom_kill_all_tasks feature.
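
For illustration, a hedged sketch of how a workload manager could opt a
cgroup into this behavior. The cgroup path is made up, and the
memory.oom_kill_all_tasks file is the interface proposed by this patch;
it also assumes cgroup v2 is mounted at /sys/fs/cgroup.

/*
 * Illustrative sketch: opt the (made-up) cgroup /A2/B4 into the
 * kill-all-tasks behavior by writing "1" to the proposed
 * memory.oom_kill_all_tasks file.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/A2/B4/memory.oom_kill_all_tasks";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "1", 1) != 1)
		perror("write");
	close(fd);
	return 0;
}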

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h |  20 ++
 include/linux/oom.h|   3 +
 mm/memcontrol.c| 155 ++
 mm/oom_kill.c  | 164 +
 4 files changed, 285 insertions(+), 57 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3914e3d..c59926c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -199,6 +200,9 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
 
+   /* kill all tasks in the subtree in case of OOM */
+   bool oom_kill_all_tasks;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
 
@@ -342,6 +346,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct 
cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+   css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)   \
container_of(counter, struct mem_cgroup, member)
 
@@ -480,6 +489,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -739,6 +750,10 @@ static inline bool task_in_mem_cgroup(struct task_struct 
*task,
return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -926,6 +941,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+   return false;
+}
 #endif /* CONFIG_MEMCG */
 
 static inline void __inc_memcg_state(struct mem_cgroup *memcg,
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8a266e2..b7ec3bd 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -39,6 +39,7 @@ struct oom_control {
unsigned long totalpages;
struct task_struct *chosen;
unsigned long chosen_points;
+   struct mem_cgroup *chosen_memcg;
 };
 
 extern struct mutex oom_lock;
@@ -79,6 +80,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 544d47e..bdb5103 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2625,6 +2625,128 @@ static inline bool memcg_has_children(struct mem_cgr

Re: [PATCH v7 08/36] x86/mm: Add support to enable SME in early boot processing

2017-06-21 Thread Thomas Gleixner
On Wed, 21 Jun 2017, Tom Lendacky wrote:
> On 6/21/2017 10:38 AM, Thomas Gleixner wrote:
> > /*
> >  * Sanitize CPU configuration and retrieve the modifier
> >  * for the initial pgdir entry which will be programmed
> >  * into CR3. Depends on enabled SME encryption, normally 0.
> >  */
> > call __startup_secondary_64
> > 
> >  addq$(init_top_pgt - __START_KERNEL_map), %rax
> > 
> > You can hide that stuff in C-code nicely without adding any cruft to the
> > ASM code.
> > 
> 
> Moving the call to verify_cpu into the C-code might be quite a bit of
> change.  Currently, the verify_cpu code is included code and not a
> global function.

Ah. Ok. I missed that.

> I can still do the __startup_secondary_64() function and then look to
> incorporate verify_cpu into both __startup_64() and
> __startup_secondary_64() as a post-patch to this series.

Yes, just having __startup_secondary_64() for now and there the extra bits
for that encryption stuff is fine.

> At least the secondary path will have a base C routine to which
> modifications can be made in the future if needed.  How does that sound?

Sounds like a plan.
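
For reference, a rough sketch of the shape such a helper could take
(function name and placement as discussed above; this is a kernel-context
fragment, not the actual implementation from the series):

/* Kernel-context fragment, not a standalone program. */
unsigned long __init __startup_secondary_64(void)
{
	/*
	 * Additional CPU sanitization could be folded in here later.
	 * For now just return the pgdir modifier for the initial CR3
	 * value: the SME encryption mask, which is 0 when SME is
	 * inactive or not compiled in.
	 */
	return sme_get_me_mask();
}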
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 26/34] iommu/amd: Allow the AMD IOMMU to work with memory encryption

2017-06-21 Thread Tom Lendacky

On 6/21/2017 11:59 AM, Borislav Petkov wrote:

On Wed, Jun 21, 2017 at 05:37:22PM +0200, Joerg Roedel wrote:

Do you mean this is like the last exception case in that document above:

"
   - Pointers to data structures in coherent memory which might be modified
 by I/O devices can, sometimes, legitimately be volatile.  A ring buffer
 used by a network adapter, where that adapter changes pointers to
 indicate which descriptors have been processed, is an example of this
 type of situation."

?


So currently (without this patch) the build_completion_wait function
does not take a volatile parameter, only wait_on_sem() does.

Wait_on_sem() needs it because its purpose is to poll a memory location
which is changed by the iommu-hardware when its done with command
processing.


Right, the reason above - memory modifiable by an IO device. You could
add a comment there explaining the need for the volatile.


But the 'volatile' in build_completion_wait() looks unnecessary, because
the function does not poll the memory location. It only uses the
pointer, converts it to a physical address and writes it to the command
to be queued.


Ok.


Ok, so the (now) current version of the patch that doesn't change the
function signature is the right way to go.

Thanks,
Tom



Thanks.


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 08/36] x86/mm: Add support to enable SME in early boot processing

2017-06-21 Thread Tom Lendacky

On 6/21/2017 10:38 AM, Thomas Gleixner wrote:

On Wed, 21 Jun 2017, Tom Lendacky wrote:

On 6/21/2017 2:16 AM, Thomas Gleixner wrote:

Why is this an unconditional function? Isn't the mask simply 0 when the MEM
ENCRYPT support is disabled?


I made it unconditional because of the call from head_64.S. I can't make
use of the C level static inline function and since the mask is not a
variable if CONFIG_AMD_MEM_ENCRYPT is not configured (#defined to 0) I
can't reference the variable directly.

I could create a #define in head_64.S that changes this to load rax with
the variable if CONFIG_AMD_MEM_ENCRYPT is configured or a zero if it's
not or add a #ifdef at that point in the code directly. Thoughts on
that?


See below.


That does not make any sense. Neither the call to sme_encrypt_kernel() nor
the following call to sme_get_me_mask().

__startup_64() is already C code, so why can't you simply call that from
__startup_64() in C and return the mask from there?


I was trying to keep it explicit as to what was happening, but I can
move those calls into __startup_64().


That's much preferred. And the return value wants to be documented in both
C and ASM code.


Will do.




I'll still need the call to sme_get_me_mask() in the secondary_startup_64
path, though (depending on your thoughts to the above response).


 call verify_cpu

 movq$(init_top_pgt - __START_KERNEL_map), %rax

So if you make that:

/*
 * Sanitize CPU configuration and retrieve the modifier
 * for the initial pgdir entry which will be programmed
 * into CR3. Depends on enabled SME encryption, normally 0.
 */
call __startup_secondary_64

 addq$(init_top_pgt - __START_KERNEL_map), %rax

You can hide that stuff in C-code nicely without adding any cruft to the
ASM code.



Moving the call to verify_cpu into the C-code might be quite a bit of
change.  Currently, the verify_cpu code is included code and not a
global function.  I can still do the __startup_secondary_64() function
and then look to incorporate verify_cpu into both __startup_64() and
__startup_secondary_64() as a post-patch to this series. At least the
secondary path will have a base C routine to which modifications can
be made in the future if needed.  How does that sound?

Thanks,
Tom


Thanks,

tglx


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 26/34] iommu/amd: Allow the AMD IOMMU to work with memory encryption

2017-06-21 Thread Borislav Petkov
On Wed, Jun 21, 2017 at 05:37:22PM +0200, Joerg Roedel wrote:
> > Do you mean this is like the last exception case in that document above:
> > 
> > "
> >   - Pointers to data structures in coherent memory which might be modified
> > by I/O devices can, sometimes, legitimately be volatile.  A ring buffer
> > used by a network adapter, where that adapter changes pointers to
> > indicate which descriptors have been processed, is an example of this
> > type of situation."
> > 
> > ?
> 
> So currently (without this patch) the build_completion_wait function
> does not take a volatile parameter, only wait_on_sem() does.
> 
> Wait_on_sem() needs it because its purpose is to poll a memory location
> which is changed by the iommu-hardware when its done with command
> processing.

Right, the reason above - memory modifiable by an IO device. You could
add a comment there explaining the need for the volatile.

> But the 'volatile' in build_completion_wait() looks unnecessary, because
> the function does not poll the memory location. It only uses the
> pointer, converts it to a physical address and writes it to the command
> to be queued.

Ok.

Thanks.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 08/36] x86/mm: Add support to enable SME in early boot processing

2017-06-21 Thread Thomas Gleixner
On Wed, 21 Jun 2017, Tom Lendacky wrote:
> On 6/21/2017 2:16 AM, Thomas Gleixner wrote:
> > Why is this an unconditional function? Isn't the mask simply 0 when the MEM
> > ENCRYPT support is disabled?
> 
> I made it unconditional because of the call from head_64.S. I can't make
> use of the C level static inline function and since the mask is not a
> variable if CONFIG_AMD_MEM_ENCRYPT is not configured (#defined to 0) I
> can't reference the variable directly.
> 
> I could create a #define in head_64.S that changes this to load rax with
> the variable if CONFIG_AMD_MEM_ENCRYPT is configured or a zero if it's
> not or add a #ifdef at that point in the code directly. Thoughts on
> that?

See below.

> > That does not make any sense. Neither the call to sme_encrypt_kernel() nor
> > the following call to sme_get_me_mask().
> > 
> > __startup_64() is already C code, so why can't you simply call that from
> > __startup_64() in C and return the mask from there?
> 
> I was trying to keep it explicit as to what was happening, but I can
> move those calls into __startup_64().

That's much preferred. And the return value wants to be documented in both
C and ASM code.

> I'll still need the call to sme_get_me_mask() in the secondary_startup_64
> path, though (depending on your thoughts to the above response).

call verify_cpu

movq$(init_top_pgt - __START_KERNEL_map), %rax

So if you make that:

/*
 * Sanitize CPU configuration and retrieve the modifier
 * for the initial pgdir entry which will be programmed
 * into CR3. Depends on enabled SME encryption, normally 0.
 */
call __startup_secondary_64

addq$(init_top_pgt - __START_KERNEL_map), %rax

You can hide that stuff in C-code nicely without adding any cruft to the
ASM code.

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 26/34] iommu/amd: Allow the AMD IOMMU to work with memory encryption

2017-06-21 Thread Joerg Roedel
On Thu, Jun 15, 2017 at 11:41:12AM +0200, Borislav Petkov wrote:
> On Wed, Jun 14, 2017 at 03:40:28PM -0500, Tom Lendacky wrote:
> > > WARNING: Use of volatile is usually wrong: see 
> > > Documentation/process/volatile-considered-harmful.rst
> > > #134: FILE: drivers/iommu/amd_iommu.c:866:
> > > +static void build_completion_wait(struct iommu_cmd *cmd, volatile u64 
> > > *sem)
> > > 
> > 
> > The semaphore area is written to by the device so the use of volatile is
> > appropriate in this case.
> 
> Do you mean this is like the last exception case in that document above:
> 
> "
>   - Pointers to data structures in coherent memory which might be modified
> by I/O devices can, sometimes, legitimately be volatile.  A ring buffer
> used by a network adapter, where that adapter changes pointers to
> indicate which descriptors have been processed, is an example of this
> type of situation."
> 
> ?

So currently (without this patch) the build_completion_wait function
does not take a volatile parameter, only wait_on_sem() does.

Wait_on_sem() needs it because its purpose is to poll a memory location
which is changed by the iommu-hardware when its done with command
processing.

But the 'volatile' in build_completion_wait() looks unnecessary, because
the function does not poll the memory location. It only uses the
pointer, converts it to a physical address and writes it to the command
to be queued.


Regards,

Joerg

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 25/36] swiotlb: Add warnings for use of bounce buffers with SME

2017-06-21 Thread Tom Lendacky

On 6/21/2017 5:50 AM, Borislav Petkov wrote:

On Fri, Jun 16, 2017 at 01:54:36PM -0500, Tom Lendacky wrote:

Add warnings to let the user know when bounce buffers are being used for
DMA when SME is active.  Since the bounce buffers are not in encrypted
memory, these notifications are to allow the user to determine some
appropriate action - if necessary.  Actions can range from utilizing an
IOMMU, replacing the device with another device that can support 64-bit
DMA, ignoring the message if the device isn't used much, etc.

Signed-off-by: Tom Lendacky 
---
  include/linux/dma-mapping.h |   11 +++
  include/linux/mem_encrypt.h |8 
  lib/swiotlb.c   |3 +++
  3 files changed, 22 insertions(+)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 4f3eece..ee2307e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -10,6 +10,7 @@
  #include 
  #include 
  #include 
+#include 
  
  /**

   * List of possible attributes associated with a DMA mapping. The semantics
@@ -577,6 +578,11 @@ static inline int dma_set_mask(struct device *dev, u64 
mask)
  
  	if (!dev->dma_mask || !dma_supported(dev, mask))

return -EIO;
+
+   /* Since mask is unsigned, this can only be true if SME is active */
+   if (mask < sme_dma_mask())
+   dev_warn(dev, "SME is active, device will require DMA bounce 
buffers\n");
+
*dev->dma_mask = mask;
return 0;
  }
@@ -596,6 +602,11 @@ static inline int dma_set_coherent_mask(struct device 
*dev, u64 mask)
  {
if (!dma_supported(dev, mask))
return -EIO;
+
+   /* Since mask is unsigned, this can only be true if SME is active */
+   if (mask < sme_dma_mask())
+   dev_warn(dev, "SME is active, device will require DMA bounce 
buffers\n");


Looks to me like those two checks above need to be a:

void sme_check_mask(struct device *dev, u64 mask)
{
 if (!sme_me_mask)
 return;

 /* Since mask is unsigned, this can only be true if SME is active */
 if (mask < (((u64)sme_me_mask << 1) - 1))
 dev_warn(dev, "SME is active, device will require DMA bounce 
buffers\n");
}

which gets called and sme_dma_mask() is not really needed.


Makes a lot of sense, I'll update the patch.

Thanks,
Tom




--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 08/36] x86/mm: Add support to enable SME in early boot processing

2017-06-21 Thread Tom Lendacky

On 6/21/2017 2:16 AM, Thomas Gleixner wrote:

On Fri, 16 Jun 2017, Tom Lendacky wrote:

diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index a105796..988b336 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -15,16 +15,24 @@
  
  #ifndef __ASSEMBLY__
  
+#include 

+
  #ifdef CONFIG_AMD_MEM_ENCRYPT
  
  extern unsigned long sme_me_mask;
  
+void __init sme_enable(void);

+
  #else /* !CONFIG_AMD_MEM_ENCRYPT */
  
  #define sme_me_mask	0UL
  
+static inline void __init sme_enable(void) { }

+
  #endif/* CONFIG_AMD_MEM_ENCRYPT */
  
+unsigned long sme_get_me_mask(void);


Why is this an unconditional function? Isn't the mask simply 0 when the MEM
ENCRYPT support is disabled?


I made it unconditional because of the call from head_64.S. I can't make
use of the C level static inline function and since the mask is not a
variable if CONFIG_AMD_MEM_ENCRYPT is not configured (#defined to 0) I
can't reference the variable directly.

I could create a #define in head_64.S that changes this to load rax with
the variable if CONFIG_AMD_MEM_ENCRYPT is configured or a zero if it's
not or add a #ifdef at that point in the code directly. Thoughts on
that?




diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 6225550..ef12729 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -78,7 +78,29 @@ startup_64:
call__startup_64
popq%rsi
  
-	movq	$(early_top_pgt - __START_KERNEL_map), %rax

+   /*
+* Encrypt the kernel if SME is active.
+* The real_mode_data address is in %rsi and that register can be
+* clobbered by the called function so be sure to save it.
+*/
+   push%rsi
+   callsme_encrypt_kernel
+   pop %rsi


That does not make any sense. Neither the call to sme_encrypt_kernel() nor
the following call to sme_get_me_mask().

__startup_64() is already C code, so why can't you simply call that from
__startup_64() in C and return the mask from there?


I was trying to keep it explicit as to what was happening, but I can
move those calls into __startup_64(). I'll still need the call to
sme_get_me_mask() in the secondary_startup_64 path, though (depending on
your thoughts on the above response).
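
A rough sketch of what that could look like (the exact __startup_64()
signature at this point in the series, and how the mask gets back into
head_64.S, are assumptions here, not the final patch):

/* arch/x86/kernel/head64.c -- illustrative only */
unsigned long __head __startup_64(unsigned long physaddr)
{
	/* ... existing early page-table fixups ... */

	sme_encrypt_kernel();		/* no-op when SME is not active */

	/*
	 * Hand the encryption mask back so head_64.S can OR it into the
	 * page-table address it loads into %cr3.
	 */
	return sme_get_me_mask();
}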




@@ -98,7 +120,20 @@ ENTRY(secondary_startup_64)
/* Sanitize CPU configuration */
call verify_cpu
  
-	movq	$(init_top_pgt - __START_KERNEL_map), %rax

+   /*
+* Get the SME encryption mask.
+*  The encryption mask will be returned in %rax so we do an ADD
+*  below to be sure that the encryption mask is part of the
+*  value that will be stored in %cr3.
+*
+* The real_mode_data address is in %rsi and that register can be
+* clobbered by the called function so be sure to save it.
+*/
+	push	%rsi
+	call	sme_get_me_mask
+	pop	%rsi


Do we really need a call here? The mask is established at this point, so
it's either 0 when the encryption stuff is not compiled in or it can be
retrieved from a variable which is accessible at this point.



Same as above, this can be updated based on the decided approach.

Thanks,
Tom


+
+	addq	$(init_top_pgt - __START_KERNEL_map), %rax
  1:
  
  	/* Enable PAE mode, PGE and LA57 */


Thanks,

tglx




Re: [PATCH v7 07/36] x86/mm: Don't use phys_to_virt in ioremap() if SME is active

2017-06-21 Thread Tom Lendacky

On 6/21/2017 2:37 AM, Thomas Gleixner wrote:

On Fri, 16 Jun 2017, Tom Lendacky wrote:

Currently there is a check if the address being mapped is in the ISA
range (is_ISA_range()), and if it is then phys_to_virt() is used to
perform the mapping.  When SME is active, however, this will result
in the mapping having the encryption bit set when it is expected that
an ioremap() should not have the encryption bit set. So only use the
phys_to_virt() function if SME is not active

Reviewed-by: Borislav Petkov 
Signed-off-by: Tom Lendacky 
---
  arch/x86/mm/ioremap.c |7 +--
  1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 4c1b5fd..a382ba9 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -13,6 +13,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
@@ -106,9 +107,11 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
}
  
  	/*

-* Don't remap the low PCI/ISA area, it's always mapped..
+* Don't remap the low PCI/ISA area, it's always mapped.
+*   But if SME is active, skip this so that the encryption bit
+*   doesn't get set.
 */
-   if (is_ISA_range(phys_addr, last_addr))
+   if (is_ISA_range(phys_addr, last_addr) && !sme_active())
return (__force void __iomem *)phys_to_virt(phys_addr);


More thoughts about that.

Making this conditional on !sme_active() is not the best idea. I'd rather
remove that whole thing and make it unconditional so the code pathes get
always exercised and any subtle wreckage is detected on a broader base and
not only on that hard to access and debug SME capable machine owned by Joe
User.


Ok, that sounds good.  I'll remove the check and usage of phys_to_virt()
and update the changelog with additional detail about that.

Thanks,
Tom



Thanks,

tglx




Re: [PATCH v7 07/36] x86/mm: Don't use phys_to_virt in ioremap() if SME is active

2017-06-21 Thread Tom Lendacky

On 6/20/2017 3:55 PM, Thomas Gleixner wrote:

On Fri, 16 Jun 2017, Tom Lendacky wrote:


Currently there is a check if the address being mapped is in the ISA
range (is_ISA_range()), and if it is then phys_to_virt() is used to
perform the mapping.  When SME is active, however, this will result
in the mapping having the encryption bit set when it is expected that
an ioremap() should not have the encryption bit set. So only use the
phys_to_virt() function if SME is not active


This does not make sense to me. What the heck has phys_to_virt() to do with
the encryption bit. Especially why would the encryption bit be set on that
mapping in the first place?


The default is that all entries that get added to the pagetables have
the encryption bit set unless specifically overridden.  Any __va() or
phys_to_virt() calls will result in a pagetable mapping that has the
encryption bit set.  For ioremap, the PAGE_KERNEL_IO protection is used
which does not have the encryption bit set.
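
A minimal illustration of that distinction (mmio_phys and ram_phys are
made-up example addresses, not something from the patch):

	/* ioremap() uses PAGE_KERNEL_IO, so the encryption bit stays clear */
	void __iomem *regs = ioremap(mmio_phys, PAGE_SIZE);

	/* __va()/phys_to_virt() go through the direct map, which carries
	 * the encryption bit by default when SME is active */
	void *buf = phys_to_virt(ram_phys);

	memset(buf, 0, PAGE_SIZE);	/* transparently encrypted in DRAM */
	writel(0, regs);		/* plain, unencrypted MMIO access  */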



I'm probably missing something, but this wants some coherent explanation
understandable by mere mortals both in the changelog and the code comment.


I'll add some additional info to the changelog and code.

Thanks,
Tom



Thanks,

tglx




Fw: [PATCH v4 3/5] input: add a EV_SW event for ratchet switch

2017-06-21 Thread Mauro Carvalho Chehab
Hi Dmitry,

Ping.

How do you want to proceed with that?

Regards,
Mauro

Forwarded message:

Date: Sat, 15 Apr 2017 19:50:45 -0300
From: Mauro Carvalho Chehab 
To: Dmitry Torokhov 
Cc: linux-in...@vger.kernel.org, Benjamin Tissoires 
, Jiri Kosina , Jonathan Corbet 
, Roderick Colenbrander , 
Stuart Yoder , "David S. Miller" , 
Ingo Tuchscherer , Florian Fainelli 
, Ping Cheng , Hans Verkuil 
, Kamil Debski , Douglas Anderson 
, linux-doc@vger.kernel.org
Subject: Re: [PATCH v4 3/5] input: add a EV_SW event for ratchet switch


Em Sat, 15 Apr 2017 11:04:36 -0700
Dmitry Torokhov  escreveu:

> Hi Mauro,
> 
> On Tue, Apr 11, 2017 at 10:29:40AM -0300, Mauro Carvalho Chehab wrote:  
> > Some mice have a switch on their wheel, allowing to switch
> > between ratchet and free wheel mode. Add support for it.
> > 
> > Signed-off-by: Mauro Carvalho Chehab 
> > ---
> >  Documentation/input/event-codes.txt| 12 
> >  include/linux/mod_devicetable.h|  2 +-
> >  include/uapi/linux/input-event-codes.h |  4 +++-
> >  3 files changed, 16 insertions(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/input/event-codes.txt 
> > b/Documentation/input/event-codes.txt
> > index 50352ab5f6d4..5dbd45db9bf6 100644
> > --- a/Documentation/input/event-codes.txt
> > +++ b/Documentation/input/event-codes.txt
> > @@ -206,6 +206,18 @@ Upon resume, if the switch state is the same as before 
> > suspend, then the input
> >  subsystem will filter out the duplicate switch state reports. The driver 
> > does
> >  not need to keep the state of the switch at any time.
> >  
> > +A few EV_SW codes have special meanings:
> > +
> > +* SW_RATCHET:
> > +
> > +  - Some mice have a special switch for their wheel that allows to change
> > +    between free wheel mode and ratchet mode. When the switch is in ratchet
> > +    mode (ON state), the wheel will offer some resistance to movement. It
> > +    may also provide tactile feedback when scrolled.
> > +
> > +Note that some mice have a ratchet switch that does not generate a
> > +software event.
> 
> So it is still not clear to me why we need the 2 discrete events. Either
> we key off the behavior off the new REL event, or from switch, but not
> both.  

The two events are independent. Clicking the wheel button just
sets it to free wheel mode or back to ratchet mode. It doesn't switch
the resolution.

The high resolution events are sent only when userspace sets
the mouse to high resolution mode.

I wrote a patch series for Solaar which allows switching between
low resolution and high resolution modes and controls whether the
wheel movement is normal or inverted:

https://github.com/pwr/Solaar/pull/351

It uses the hidraw interface to switch between the two modes.
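
For the kernel side, a driver that owns the device would report the
proposed switch through the input core roughly like this (illustrative
only; the surrounding function names are made up, only the input API
calls are real):

static void ratchet_setup(struct input_dev *input)
{
	input_set_capability(input, EV_SW, SW_RATCHET);
}

static void ratchet_report(struct input_dev *input, bool ratchet_on)
{
	input_report_switch(input, SW_RATCHET, ratchet_on);
	input_sync(input);
}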

> Also, it is unclear to me if allocating a new event for "hires" wheel is
> optimal. This still does not solve the question about resolution (how
> high is "hires" and what to do if Logitech will come out with
> ultra-high-resolution wheel next year, or if we need to express
> resolution for other relative events).  

How "high" is the resolution can be queried on those devices.
Not sure how to report it to userspace, though. Ok, one application
could query it via hidraw interface (my Solaar patches do that
when solaar is called with the "show" parameter).

Perhaps an ioctl? Or do you have a better idea?

> 
> Thanks.
>   
> > +
> >  EV_MSC:
> >  --
> >  EV_MSC events are used for input and output events that do not fall under 
> > other
> > diff --git a/include/linux/mod_devicetable.h 
> > b/include/linux/mod_devicetable.h
> > index a3e8c572a046..79dd7dbf5442 100644
> > --- a/include/linux/mod_devicetable.h
> > +++ b/include/linux/mod_devicetable.h
> > @@ -292,7 +292,7 @@ struct pcmcia_device_id {
> >  #define INPUT_DEVICE_ID_LED_MAX0x0f
> >  #define INPUT_DEVICE_ID_SND_MAX0x07
> >  #define INPUT_DEVICE_ID_FF_MAX 0x7f
> > -#define INPUT_DEVICE_ID_SW_MAX 0x0f
> > +#define INPUT_DEVICE_ID_SW_MAX 0x1f
> >  
> >  #define INPUT_DEVICE_ID_MATCH_BUS  1
> >  #define INPUT_DEVICE_ID_MATCH_VENDOR   2
> > diff --git a/include/uapi/linux/input-event-codes.h 
> > b/include/uapi/linux/input-event-codes.h
> > index da48d4079511..da83e231e93d 100644
> > --- a/include/uapi/linux/input-event-codes.h
> > +++ b/include/uapi/linux/input-event-codes.h
> > @@ -789,7 +789,9 @@
> >  #define SW_LINEIN_INSERT   0x0d  /* set = inserted */
> >  #define SW_MUTE_DEVICE 0x0e  /* set = device disabled */
> >  #define SW_PEN_INSERTED0x0f  /* set = pen inserted */
> > -#define SW_MAX 0x0f
> > +#define SW_RATCHET 0x10  /* set = ratchet mode,
> > +unset: free wheel */
> > +#define SW_MAX 0x1f
> >  #define SW_CNT (SW_MAX+1)
> >  
> >  /*
> > -- 
> > 2.9.3
> > 
>   



Thanks,
Mauro




Re: [PATCH v7 06/36] x86/mm: Add Secure Memory Encryption (SME) support

2017-06-21 Thread Tom Lendacky

On 6/20/2017 3:49 PM, Thomas Gleixner wrote:

On Fri, 16 Jun 2017, Tom Lendacky wrote:
  
+config ARCH_HAS_MEM_ENCRYPT

+   def_bool y
+   depends on X86


That one is silly. The config switch is in the x86 KConfig file, so X86 is
on. If you intended to move this to some generic place outside of
x86/Kconfig then this should be

config ARCH_HAS_MEM_ENCRYPT
bool

and x86/Kconfig should have

select ARCH_HAS_MEM_ENCRYPT

and that should be selected by AMD_MEM_ENCRYPT


This is used for deciding whether to include the asm/mem_encrypt.h file,
so it needs to be enabled whether AMD_MEM_ENCRYPT is configured or not. I'll
leave it in the x86/Kconfig file and remove the 'depends on' line.
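
Presumably the generic header then keys off that symbol along these
lines (a sketch of the pattern being described, not the exact patch):

/* include/linux/mem_encrypt.h */
#ifdef CONFIG_ARCH_HAS_MEM_ENCRYPT
#include <asm/mem_encrypt.h>
#else
#define sme_me_mask	0UL
#endif

static inline bool sme_active(void)
{
	return !!sme_me_mask;
}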

Thanks,
Tom




+config AMD_MEM_ENCRYPT
+   bool "AMD Secure Memory Encryption (SME) support"
+   depends on X86_64 && CPU_SUP_AMD
+   ---help---
+ Say yes to enable support for the encryption of system memory.
+ This requires an AMD processor that supports Secure Memory
+ Encryption (SME).


Thanks,

tglx




Re: [PATCH v7 25/36] swiotlb: Add warnings for use of bounce buffers with SME

2017-06-21 Thread Borislav Petkov
On Fri, Jun 16, 2017 at 01:54:36PM -0500, Tom Lendacky wrote:
> Add warnings to let the user know when bounce buffers are being used for
> DMA when SME is active.  Since the bounce buffers are not in encrypted
> memory, these notifications are to allow the user to determine some
> appropriate action - if necessary.  Actions can range from utilizing an
> IOMMU, replacing the device with another device that can support 64-bit
> DMA, ignoring the message if the device isn't used much, etc.
> 
> Signed-off-by: Tom Lendacky 
> ---
>  include/linux/dma-mapping.h |   11 +++
>  include/linux/mem_encrypt.h |8 
>  lib/swiotlb.c   |3 +++
>  3 files changed, 22 insertions(+)
> 
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 4f3eece..ee2307e 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /**
>   * List of possible attributes associated with a DMA mapping. The semantics
> @@ -577,6 +578,11 @@ static inline int dma_set_mask(struct device *dev, u64 mask)
>  
>   if (!dev->dma_mask || !dma_supported(dev, mask))
>   return -EIO;
> +
> + /* Since mask is unsigned, this can only be true if SME is active */
> + if (mask < sme_dma_mask())
> + dev_warn(dev, "SME is active, device will require DMA bounce buffers\n");
> +
>   *dev->dma_mask = mask;
>   return 0;
>  }
> @@ -596,6 +602,11 @@ static inline int dma_set_coherent_mask(struct device *dev, u64 mask)
>  {
>   if (!dma_supported(dev, mask))
>   return -EIO;
> +
> + /* Since mask is unsigned, this can only be true if SME is active */
> + if (mask < sme_dma_mask())
> + dev_warn(dev, "SME is active, device will require DMA bounce buffers\n");

Looks to me like those two checks above need to be a:

void sme_check_mask(struct device *dev, u64 mask)
{
if (!sme_me_mask)
return;

/* Since mask is unsigned, this can only be true if SME is active */
if (mask < (((u64)sme_me_mask << 1) - 1))
		dev_warn(dev, "SME is active, device will require DMA bounce buffers\n");
}

which gets called and sme_dma_mask() is not really needed.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH 0/5] irq: generic-chip: resource management improvements

2017-06-21 Thread Marc Zyngier
On 31/05/17 17:06, Bartosz Golaszewski wrote:
> This series is a follow-up to [1].
> 
> Some users of irq_alloc_generic_chip() are modules which can be
> removed (e.g. gpio-ml-ioh) but have no means of freeing the allocated
> generic chip.
> 
> Last time it was suggested to provide irq_destroy_generic_chip() which
> would undo both irq_remove_generic_chip() and irq_alloc_generic_chip().
> 
> This functionality is provided by patch 2/5 with 1/5 adding the option
> to only free the allocated memory.
> 
> Patch 3/5 exports a function that will be used in the devres variant
> of irq_alloc_generic_chip().
> 
> Patches 4/5 and 5/5 add resource managed versions of
> irq_alloc_generic_chip() & irq_setup_generic_chip(). They will be used
> in drivers where applicable. Device resources are released in reverse
> order so it's ok to call devm_irq_alloc_generic_chip() and then
> devm_irq_setup_generic_chip().
> 
> [1] https://lkml.org/lkml/2017/3/8/550
> 
> Bartosz Golaszewski (5):
>   irq: generic-chip: provide irq_free_generic_chip()
>   irq: generic-chip: provide irq_destroy_generic_chip()
>   irq: generic-chip: export irq_init_generic_chip() locally
>   irq: generic-chip: provide devm_irq_alloc_generic_chip()
>   irq: generic-chip: provide devm_irq_setup_generic_chip()
> 
>  Documentation/driver-model/devres.txt |  2 +
>  include/linux/irq.h   | 22 +
>  kernel/irq/devres.c   | 86 
> +++
>  kernel/irq/generic-chip.c |  7 ++-
>  kernel/irq/internals.h| 11 +
>  5 files changed, 124 insertions(+), 4 deletions(-)
> 
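
(For readers unfamiliar with the proposed API, a probe path using the
new devres variants would look roughly like this; the signatures are
assumed to mirror irq_alloc_generic_chip()/irq_setup_generic_chip() with
a struct device * prepended, and the register/irq setup is made up for
the example:)

static int foo_probe(struct platform_device *pdev)
{
	void __iomem *reg_base = NULL;	/* devm_ioremap_resource() in a real driver */
	unsigned int irq_base = 0;	/* irq_alloc_descs() or irq domain in a real driver */
	struct irq_chip_generic *gc;

	gc = devm_irq_alloc_generic_chip(&pdev->dev, "foo", 1, irq_base,
					 reg_base, handle_level_irq);
	if (!gc)
		return -ENOMEM;

	gc->chip_types[0].chip.irq_mask   = irq_gc_mask_set_bit;
	gc->chip_types[0].chip.irq_unmask = irq_gc_mask_clr_bit;

	/* devres releases in reverse order, so setup after alloc is fine */
	return devm_irq_setup_generic_chip(&pdev->dev, gc, IRQ_MSK(8),
					   IRQ_GC_INIT_MASK_CACHE,
					   IRQ_NOREQUEST, 0);
}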

Looks OK to me. For the series:

Acked-by: Marc Zyngier 

M.
-- 
Jazz is not dead. It just smells funny...


Re: [PATCH 0/5] irq: generic-chip: resource management improvements

2017-06-21 Thread Bartosz Golaszewski
2017-06-20 16:14 GMT+02:00 Thomas Gleixner :
> On Tue, 20 Jun 2017, Bartosz Golaszewski wrote:
>> 2017-06-20 12:41 GMT+02:00 Marc Zyngier :
>> > There was a kbuild report from June 1st with worrying warnings on x86_64
>> > (though I couldn't see how that was related to these patches). What's
>> > the status of that?
>> >
>> > Thanks,
>> >
>> > M.
>> > --
>> > Jazz is not dead. It just smells funny...
>>
>> Snap, I looked at it, determined that it was just a header included in
>> include/linux/irq.h (unrelated to the patch) and forgot to comment
>> about it.
>>
>> I've never seen this warning on my setup and don't see it now with rc6.
>
> Yep, that's a genuine x86 snafu. No idea how that got attributed to your
> patch.

So are the patches ok and can be merged for 4.13?

Thanks,
Bartosz


Re: [PATCH v7 24/36] x86, swiotlb: Add memory encryption support

2017-06-21 Thread Borislav Petkov
On Fri, Jun 16, 2017 at 01:54:24PM -0500, Tom Lendacky wrote:
> Since DMA addresses will effectively look like 48-bit addresses when the
> memory encryption mask is set, SWIOTLB is needed if the DMA mask of the
> device performing the DMA does not support 48-bits. SWIOTLB will be
> initialized to create decrypted bounce buffers for use by these devices.
> 
> Signed-off-by: Tom Lendacky 
> ---
>  arch/x86/include/asm/dma-mapping.h |5 ++-
>  arch/x86/include/asm/mem_encrypt.h |5 +++
>  arch/x86/kernel/pci-dma.c  |   11 +--
>  arch/x86/kernel/pci-nommu.c|2 +
>  arch/x86/kernel/pci-swiotlb.c  |   15 +-
>  arch/x86/mm/mem_encrypt.c  |   22 +++
>  include/linux/swiotlb.h|1 +
>  init/main.c|   10 +++
>  lib/swiotlb.c  |   54 
> +++-
>  9 files changed, 108 insertions(+), 17 deletions(-)

Reviewed-by: Borislav Petkov 

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v7 23/36] x86, realmode: Decrypt trampoline area if memory encryption is active

2017-06-21 Thread Borislav Petkov
On Fri, Jun 16, 2017 at 01:54:12PM -0500, Tom Lendacky wrote:
> When Secure Memory Encryption is enabled, the trampoline area must not
> be encrypted. A CPU running in real mode will not be able to decrypt
> memory that has been encrypted because it will not be able to use addresses
> with the memory encryption mask.
> 
> Signed-off-by: Tom Lendacky 
> ---
>  arch/x86/realmode/init.c |8 
>  1 file changed, 8 insertions(+)

Subject: x86/realmode: ...

other than that:

Reviewed-by: Borislav Petkov 
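
(The shape of the change under review, for context -- a sketch, not the
patch itself; base/size stand for the trampoline area from the
surrounding function, and whether the decryption helper is named exactly
like this at this point in the series is an assumption:)

	/* arch/x86/realmode/init.c */
	if (sme_active())
		/*
		 * The trampoline runs in real mode, which cannot use the
		 * encryption mask, so clear the C-bit on its pages.
		 */
		set_memory_decrypted((unsigned long)base, size >> PAGE_SHIFT);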

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v7 20/36] x86, mpparse: Use memremap to map the mpf and mpc data

2017-06-21 Thread Borislav Petkov
On Fri, Jun 16, 2017 at 01:53:38PM -0500, Tom Lendacky wrote:
> The SMP MP-table is built by UEFI and placed in memory in a decrypted
> state. These tables are accessed using a mix of early_memremap(),
> early_memunmap(), phys_to_virt() and virt_to_phys(). Change all accesses
> to use early_memremap()/early_memunmap(). This allows for proper setting
> of the encryption mask so that the data can be successfully accessed when
> SME is active.
> 
> Signed-off-by: Tom Lendacky 
> ---
>  arch/x86/kernel/mpparse.c |   98 
> -
>  1 file changed, 70 insertions(+), 28 deletions(-)

Reviewed-by: Borislav Petkov 

Please put the conversion to pr_fmt() on the TODO list for later.

Thanks.
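
For context, the access pattern the patch converts everything to is
roughly the following (a sketch only; the real code walks the MP table
and checksums it):

static void __init example_access_mpf(unsigned long mpf_base)
{
	struct mpf_intel *mpf;

	mpf = early_memremap(mpf_base, sizeof(*mpf));
	if (!mpf)
		return;

	/* ... consume the table while the temporary mapping is live ... */

	early_memunmap(mpf, sizeof(*mpf));
}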

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v7 10/36] x86/mm: Provide general kernel support for memory encryption

2017-06-21 Thread Borislav Petkov
On Wed, Jun 21, 2017 at 09:18:59AM +0200, Thomas Gleixner wrote:
> That looks wrong. It's not decrypted it's rather unencrypted, right?

Yeah, in previous versions of the patchset, "decrypted" and
"unencrypted" were both present, so we settled on "decrypted" for the
nomenclature.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v7 07/36] x86/mm: Don't use phys_to_virt in ioremap() if SME is active

2017-06-21 Thread Thomas Gleixner
On Fri, 16 Jun 2017, Tom Lendacky wrote:
> Currently there is a check if the address being mapped is in the ISA
> range (is_ISA_range()), and if it is then phys_to_virt() is used to
> perform the mapping.  When SME is active, however, this will result
> in the mapping having the encryption bit set when it is expected that
> an ioremap() should not have the encryption bit set. So only use the
> phys_to_virt() function if SME is not active
> 
> Reviewed-by: Borislav Petkov 
> Signed-off-by: Tom Lendacky 
> ---
>  arch/x86/mm/ioremap.c |7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index 4c1b5fd..a382ba9 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -106,9 +107,11 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
>   }
>  
>   /*
> -  * Don't remap the low PCI/ISA area, it's always mapped..
> +  * Don't remap the low PCI/ISA area, it's always mapped.
> +  *   But if SME is active, skip this so that the encryption bit
> +  *   doesn't get set.
>*/
> - if (is_ISA_range(phys_addr, last_addr))
> + if (is_ISA_range(phys_addr, last_addr) && !sme_active())
>   return (__force void __iomem *)phys_to_virt(phys_addr);

More thoughts about that.

Making this conditional on !sme_active() is not the best idea. I'd rather
remove that whole thing and make it unconditional so the code pathes get
always exercised and any subtle wreckage is detected on a broader base and
not only on that hard to access and debug SME capable machine owned by Joe
User.

Thanks,

tglx


Re: [PATCH v7 10/36] x86/mm: Provide general kernel support for memory encryption

2017-06-21 Thread Thomas Gleixner
On Fri, 16 Jun 2017, Tom Lendacky wrote:
>  
> +#ifndef pgprot_encrypted
> +#define pgprot_encrypted(prot)   (prot)
> +#endif
> +
> +#ifndef pgprot_decrypted

That looks wrong. It's not decrypted it's rather unencrypted, right?
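
(Whatever the final name, the generic fallbacks quoted above are no-ops;
an architecture with SME would be expected to override them roughly like
this -- a sketch, not the exact x86 definitions from the series:)

#define pgprot_encrypted(prot)	__pgprot(pgprot_val(prot) | sme_me_mask)
#define pgprot_decrypted(prot)	__pgprot(pgprot_val(prot) & ~sme_me_mask)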

Thanks,

tglx


Re: [PATCH v7 08/36] x86/mm: Add support to enable SME in early boot processing

2017-06-21 Thread Thomas Gleixner
On Fri, 16 Jun 2017, Tom Lendacky wrote:
> diff --git a/arch/x86/include/asm/mem_encrypt.h 
> b/arch/x86/include/asm/mem_encrypt.h
> index a105796..988b336 100644
> --- a/arch/x86/include/asm/mem_encrypt.h
> +++ b/arch/x86/include/asm/mem_encrypt.h
> @@ -15,16 +15,24 @@
>  
>  #ifndef __ASSEMBLY__
>  
> +#include 
> +
>  #ifdef CONFIG_AMD_MEM_ENCRYPT
>  
>  extern unsigned long sme_me_mask;
>  
> +void __init sme_enable(void);
> +
>  #else/* !CONFIG_AMD_MEM_ENCRYPT */
>  
>  #define sme_me_mask  0UL
>  
> +static inline void __init sme_enable(void) { }
> +
>  #endif   /* CONFIG_AMD_MEM_ENCRYPT */
>  
> +unsigned long sme_get_me_mask(void);

Why is this an unconditional function? Isn't the mask simply 0 when the MEM
ENCRYPT support is disabled?

> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 6225550..ef12729 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -78,7 +78,29 @@ startup_64:
> 	call	__startup_64
> 	popq	%rsi
>  
> -	movq	$(early_top_pgt - __START_KERNEL_map), %rax
> + /*
> +  * Encrypt the kernel if SME is active.
> +  * The real_mode_data address is in %rsi and that register can be
> +  * clobbered by the called function so be sure to save it.
> +  */
> +	push	%rsi
> +	call	sme_encrypt_kernel
> +	pop	%rsi

That does not make any sense. Neither the call to sme_encrypt_kernel() nor
the following call to sme_get_me_mask().

__startup_64() is already C code, so why can't you simply call that from
__startup_64() in C and return the mask from there?

> @@ -98,7 +120,20 @@ ENTRY(secondary_startup_64)
>   /* Sanitize CPU configuration */
>   call verify_cpu
>  
> -	movq	$(init_top_pgt - __START_KERNEL_map), %rax
> + /*
> +  * Get the SME encryption mask.
> +  *  The encryption mask will be returned in %rax so we do an ADD
> +  *  below to be sure that the encryption mask is part of the
> +  *  value that will be stored in %cr3.
> +  *
> +  * The real_mode_data address is in %rsi and that register can be
> +  * clobbered by the called function so be sure to save it.
> +  */
> +	push	%rsi
> +	call	sme_get_me_mask
> +	pop	%rsi

Do we really need a call here? The mask is established at this point, so
it's either 0 when the encryption stuff is not compiled in or it can be
retrieved from a variable which is accessible at this point.

> +
> +	addq	$(init_top_pgt - __START_KERNEL_map), %rax
>  1:
>  
>   /* Enable PAE mode, PGE and LA57 */

Thanks,

tglx