[RFC v3 06/23] powerpc: use helper functions in __hash_page_4K() for 64K PTE
replace redundant code in __hash_page_4K() with helper functions get_hidx_gslot() and set_hidx_slot() Signed-off-by: Ram Pai --- arch/powerpc/mm/hash64_64k.c | 24 ++-- 1 file changed, 6 insertions(+), 18 deletions(-) diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c index 5cbdaa9..cb48a60 100644 --- a/arch/powerpc/mm/hash64_64k.c +++ b/arch/powerpc/mm/hash64_64k.c @@ -103,18 +103,12 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid, if (__rpte_sub_valid(rpte, subpg_index)) { int ret; - hash = hpt_hash(vpn, shift, ssize); - hidx = __rpte_to_hidx(rpte, subpg_index); - if (hidx & _PTEIDX_SECONDARY) - hash = ~hash; - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP; - slot += hidx & _PTEIDX_GROUP_IX; - - ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, + gslot = get_hidx_gslot(vpn, shift, ssize, rpte, subpg_index); + ret = mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, MMU_PAGE_4K, MMU_PAGE_4K, ssize, flags); /* -*if we failed because typically the HPTE wasn't really here +* if we failed because typically the HPTE wasn't really here * we try an insertion. */ if (ret == -1) @@ -214,15 +208,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid, * Since we have H_PAGE_BUSY set on ptep, we can be sure * nobody is undating hidx. */ - hidxp = (unsigned long *)(ptep + PTRS_PER_PTE); - rpte.hidx &= ~(0xfUL << (subpg_index << 2)); - *hidxp = rpte.hidx | (slot << (subpg_index << 2)); - new_pte = mark_subptegroup_valid(new_pte, subpg_index); - new_pte |= H_PAGE_HASHPTE; - /* -* check __real_pte for details on matching smp_rmb() -*/ - smp_wmb(); + new_pte |= H_PAGE_HASHPTE; + new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot); + *ptep = __pte(new_pte & ~H_PAGE_BUSY); return 0; } -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC v3 01/23] powerpc: Free up four 64K PTE bits in 4K backed HPTE pages
Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6, in the 4K backed HPTE pages. These bits continue to be used for 64K backed HPTE pages in this patch, but will be freed up in the next patch. The bit numbers are big-endian as defined in the ISA3.0 The patch does the following change to the 64K PTE format H_PAGE_BUSY moves from bit 3 to bit 9 H_PAGE_F_SECOND which occupied bit 4 moves to the second part of the pte. H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the second part of the pte. the four bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot is initialized to 0xF indicating an invalid slot. If a HPTE gets cached in a 0xF slot(i.e 7th slot of secondary), it is released immediately. In other words, even though 0xF is a valid slot we discard and consider it as an invalid slot;i.e HPTE(). This gives us an opportunity to not depend on a bit in the primary PTE in order to determine the validity of a slot. When we release aHPTE in the 0xF slot we also release a legitimate primary slot andunmapthat entry. This is to ensure that we do get a legimate non-0xF slot the next time we retry for a slot. Though treating 0xF slot as invalid reduces the number of available slots and may have an effect on the performance, the probabilty of hitting a 0xF is extermely low. Compared to the current scheme, the above described scheme reduces the number of false hash table updates significantly and has the added advantage of releasing four valuable PTE bits for other purpose. This idea was jointly developed by Paul Mackerras, Aneesh, Michael Ellermen and myself. 4K PTE format remain unchanged currently. Signed-off-by: Ram Pai Conflicts: arch/powerpc/include/asm/book3s/64/hash.h --- arch/powerpc/include/asm/book3s/64/hash-4k.h | 7 +++ arch/powerpc/include/asm/book3s/64/hash-64k.h | 17 --- arch/powerpc/include/asm/book3s/64/hash.h | 12 +++-- arch/powerpc/mm/hash64_64k.c | 70 +++ arch/powerpc/mm/hash_utils_64.c | 4 +- 5 files changed, 66 insertions(+), 44 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h index b4b5e6b..9c2c8f1 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h @@ -16,6 +16,13 @@ #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE) #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE) +#define H_PAGE_F_SECOND_RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */ +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44) +#define H_PAGE_F_GIX_SHIFT 56 + +#define H_PAGE_BUSY_RPAGE_RSV1 /* software: PTE & hash are busy */ +#define H_PAGE_HASHPTE _RPAGE_RPN43/* PTE has associated HPTE */ + /* PTE flags to conserve for HPTE identification */ #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \ H_PAGE_F_SECOND | H_PAGE_F_GIX) diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h index 9732837..3f49941 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h @@ -10,20 +10,21 @@ * 64k aligned address free up few of the lower bits of RPN for us * We steal that here. 
For more deatils look at pte_pfn/pfn_pte() */ -#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */ -#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */ +#define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */ +#define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */ +#define H_PAGE_F_SECOND_RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */ +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44) +#define H_PAGE_F_GIX_SHIFT 56 + +#define H_PAGE_BUSY_RPAGE_RPN42 /* software: PTE & hash are busy */ +#define H_PAGE_HASHPTE _RPAGE_RPN43/* PTE has associated HPTE */ + /* * We need to differentiate between explicit huge page and THP huge * page, since THP huge page also need to track real subpage details */ #define H_PAGE_THP_HUGE H_PAGE_4K_PFN -/* - * Used to track subpage group valid if H_PAGE_COMBO is set - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND - */ -#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND) - /* PTE flags to conserve for HPTE identification */ #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \ H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO) diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h index 4e957b0..ac049de 100644 --- a/arch/powerpc/include/asm/book3s/64/hash.h +++ b/arch/powerpc/include/asm/book3s/64/hash.h @@ -8,11 +8,8 @@ * */ #define H_PTE_NONE_MASK_PAGE_HPTEFLAGS -#define H_PAGE_F_GIX_SHIF
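To make the 0xF-slot convention described above concrete, here is a small standalone sketch (not kernel code; the 0x8 and 0x7 values mirror the kernel's _PTEIDX_SECONDARY and _PTEIDX_GROUP_IX masks) that enumerates what each 4-bit hidx value encodes and which single value this series discards as invalid:

#include <stdbool.h>
#include <stdio.h>

#define PTEIDX_SECONDARY 0x8UL  /* HPTE lives in the secondary hash group */
#define PTEIDX_GROUP_IX  0x7UL  /* slot index 0..7 within the group */

/* same test as the hpte_soft_invalid() helper used later in this series */
static bool hidx_soft_invalid(unsigned long hidx)
{
        return (hidx & 0xfUL) == 0xfUL;
}

int main(void)
{
        unsigned long hidx;

        for (hidx = 0; hidx <= 0xf; hidx++)
                printf("hidx %#4lx: %s group, slot %lu%s\n", hidx,
                       (hidx & PTEIDX_SECONDARY) ? "secondary" : "primary",
                       hidx & PTEIDX_GROUP_IX,
                       hidx_soft_invalid(hidx) ? "  <-- treated as invalid" : "");
        return 0;
}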
[RFC v3 07/23] powerpc: use helper functions in __hash_page_4K() for 4K PTE
replace redundant code with helper functions get_hidx_gslot() and set_hidx_slot() Signed-off-by: Ram Pai --- arch/powerpc/mm/hash64_4k.c | 14 ++ 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c index 6fa450c..c673829 100644 --- a/arch/powerpc/mm/hash64_4k.c +++ b/arch/powerpc/mm/hash64_4k.c @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid, pte_t *ptep, unsigned long trap, unsigned long flags, int ssize, int subpg_prot) { + real_pte_t rpte; unsigned long hpte_group; unsigned long rflags, pa; unsigned long old_pte, new_pte; @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid, * need to add in 0x1 if it's a read-only user page */ rflags = htab_convert_pte_flags(new_pte); + rpte = __real_pte(__pte(old_pte), ptep); if (cpu_has_feature(CPU_FTR_NOEXECUTE) && !cpu_has_feature(CPU_FTR_COHERENT_ICACHE)) @@ -64,13 +66,10 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid, /* * There MIGHT be an HPTE for this pte */ - hash = hpt_hash(vpn, shift, ssize); - if (old_pte & H_PAGE_F_SECOND) - hash = ~hash; - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP; - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT; + unsigned long gslot = get_hidx_gslot(vpn, shift, + ssize, rpte, 0); - if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_4K, + if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, MMU_PAGE_4K, MMU_PAGE_4K, ssize, flags) == -1) old_pte &= ~_PAGE_HPTEFLAGS; } @@ -118,8 +117,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid, return -1; } new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE; - new_pte |= (slot << H_PAGE_F_GIX_SHIFT) & - (H_PAGE_F_SECOND | H_PAGE_F_GIX); + new_pte |= set_hidx_slot(ptep, rpte, 0, slot); } *ptep = __pte(new_pte & ~H_PAGE_BUSY); return 0; -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC v3 02/23] powerpc: introduce set_hidx_slot helper
Introduce set_hidx_slot() which sets the (H_PAGE_F_SECOND|H_PAGE_F_GIX) bits at the appropriate location in the PTE of 4K PTE. In the case of 64K PTE, it sets the bits in the second part of the PTE. Though the implementation for the former just needs the slot parameter, it does take some additional parameters to keep the prototype consistent. This function will come in handy as we work towards re-arranging the bits in the later patches. Signed-off-by: Ram Pai --- arch/powerpc/include/asm/book3s/64/hash-4k.h | 7 +++ arch/powerpc/include/asm/book3s/64/hash-64k.h | 16 2 files changed, 23 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h index 9c2c8f1..cef644c 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h @@ -55,6 +55,13 @@ static inline int hash__hugepd_ok(hugepd_t hpd) } #endif +static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte, + unsigned int subpg_index, unsigned long slot) +{ + return (slot << H_PAGE_F_GIX_SHIFT) & + (H_PAGE_F_SECOND | H_PAGE_F_GIX); +} + #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline char *get_hpte_slot_array(pmd_t *pmdp) diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h index 3f49941..4bac70a 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h @@ -75,6 +75,22 @@ static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index) return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf; } +static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte, + unsigned int subpg_index, unsigned long slot) +{ + unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE); + + rpte.hidx &= ~(0xfUL << (subpg_index << 2)); + *hidxp = rpte.hidx | (slot << (subpg_index << 2)); + /* +* Avoid race with __real_pte() +* hidx must be committed to memory before committing +* the pte. +*/ + smp_wmb(); + return 0x0UL; +} + #define __rpte_to_pte(r) ((r).pte) extern bool __rpte_sub_valid(real_pte_t rpte, unsigned long index); /* -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
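For context, the call pattern introduced by the later patches in this series (06 and 07) looks like this; the 64K-PTE variant stores the hidx in the second half of the PTE and returns 0, while the 4K-PTE variant simply hands back the H_PAGE_F_SECOND/H_PAGE_F_GIX bits to be OR-ed into the PTE:

        /* 64K PTE whose subpage 'subpg_index' is backed by a 4K HPTE */
        new_pte |= H_PAGE_HASHPTE;
        new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot);

        /* 4K PTE (the subpage index is always 0) */
        new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
        new_pte |= set_hidx_slot(ptep, rpte, 0, slot);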
[RFC v3 03/23] powerpc: introduce get_hidx_gslot helper
Introduce get_hidx_gslot() which returns the slot number of the HPTE in the global hash table. This function will come in handy as we work towards re-arranging the PTE bits in the later patches. Signed-off-by: Ram Pai --- arch/powerpc/include/asm/book3s/64/hash.h | 3 +++ arch/powerpc/mm/hash_utils_64.c | 14 ++ 2 files changed, 17 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h index ac049de..e7cf03a 100644 --- a/arch/powerpc/include/asm/book3s/64/hash.h +++ b/arch/powerpc/include/asm/book3s/64/hash.h @@ -162,6 +162,9 @@ static inline bool hpte_soft_invalid(unsigned long slot) return ((slot & 0xfUL) == 0xfUL); } +unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift, + int ssize, real_pte_t rpte, unsigned int subpg_index); + /* This low level function performs the actual PTE insertion * Setting the PTE depends on the MMU type and other factors. It's * an horrible mess that I'm not going to try to clean up now but diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index 1b494d0..99f97754c 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -1591,6 +1591,20 @@ static inline void tm_flush_hash_page(int local) } #endif +unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift, + int ssize, real_pte_t rpte, unsigned int subpg_index) +{ + unsigned long hash, slot, hidx; + + hash = hpt_hash(vpn, shift, ssize); + hidx = __rpte_to_hidx(rpte, subpg_index); + if (hidx & _PTEIDX_SECONDARY) + hash = ~hash; + slot = (hash & htab_hash_mask) * HPTES_PER_GROUP; + slot += hidx & _PTEIDX_GROUP_IX; + return slot; +} + /* WARNING: This is called from hash_low_64.S, if you change this prototype, * do not forget to update the assembly call site ! */ -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
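As a usage illustration, patch 08 of this series converts flush_hash_page() to this helper; the open-coded hash/hidx arithmetic in the loop reduces to:

        pte_iterate_hashed_subpages(pte, psize, vpn, index, shift) {
                gslot = get_hidx_gslot(vpn, shift, ssize, pte, index);
                mmu_hash_ops.hpte_invalidate(gslot, vpn, psize, psize,
                                             ssize, local);
        } pte_iterate_hashed_end();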
[RFC v3 04/23] powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6 in the 64K backed HPTE pages. This along with the earlier patch will entirely free up the four bits from 64K PTE. The bit numbers are big-endian as defined in the ISA3.0 This patch does the following change to 64K PTE that is backed by 64K HPTE. H_PAGE_F_SECOND which occupied bit 4 moves to the second part of the pte. H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the second part of the pte. since bit 7 is now freed up, we move H_PAGE_BUSY from bit 9 to bit 7. Trying to minimize gaps so that contiguous bits can be allocated if needed in the future. The second part of the PTE will hold (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63. The above PTE changes is applicable to hugetlbpages aswell. Signed-off-by: Ram Pai --- arch/powerpc/include/asm/book3s/64/hash-64k.h | 28 +-- arch/powerpc/mm/hash64_64k.c | 17 arch/powerpc/mm/hugetlbpage-hash64.c | 16 ++- 3 files changed, 23 insertions(+), 38 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h index 4bac70a..7b5dbf3 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h @@ -12,11 +12,8 @@ */ #define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */ #define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */ -#define H_PAGE_F_SECOND_RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */ -#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44) -#define H_PAGE_F_GIX_SHIFT 56 -#define H_PAGE_BUSY_RPAGE_RPN42 /* software: PTE & hash are busy */ +#define H_PAGE_BUSY_RPAGE_RPN44 /* software: PTE & hash are busy */ #define H_PAGE_HASHPTE _RPAGE_RPN43/* PTE has associated HPTE */ /* @@ -26,8 +23,7 @@ #define H_PAGE_THP_HUGE H_PAGE_4K_PFN /* PTE flags to conserve for HPTE identification */ -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \ -H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO) +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO) /* * we support 16 fragments per PTE page of 64K size. */ @@ -55,24 +51,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep) unsigned long *hidxp; rpte.pte = pte; - rpte.hidx = 0; - if (pte_val(pte) & H_PAGE_COMBO) { - /* -* Make sure we order the hidx load against the H_PAGE_COMBO -* check. 
The store side ordering is done in __hash_page_4K -*/ - smp_rmb(); - hidxp = (unsigned long *)(ptep + PTRS_PER_PTE); - rpte.hidx = *hidxp; - } + /* +* The store side ordering is done in set_hidx_slot() +*/ + smp_rmb(); + hidxp = (unsigned long *)(ptep + PTRS_PER_PTE); + rpte.hidx = *hidxp; return rpte; } static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index) { - if ((pte_val(rpte.pte) & H_PAGE_COMBO)) - return (rpte.hidx >> (index<<2)) & 0xf; - return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf; + return ((rpte.hidx >> (index<<2)) & 0xfUL); } static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte, diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c index a16cd28..5cbdaa9 100644 --- a/arch/powerpc/mm/hash64_64k.c +++ b/arch/powerpc/mm/hash64_64k.c @@ -231,6 +231,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access, unsigned long vsid, pte_t *ptep, unsigned long trap, unsigned long flags, int ssize) { + real_pte_t rpte; unsigned long hpte_group; unsigned long rflags, pa; unsigned long old_pte, new_pte; @@ -267,6 +268,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access, } while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte))); rflags = htab_convert_pte_flags(new_pte); + rpte = __real_pte(__pte(old_pte), ptep); if (cpu_has_feature(CPU_FTR_NOEXECUTE) && !cpu_has_feature(CPU_FTR_COHERENT_ICACHE)) @@ -274,16 +276,13 @@ int __hash_page_64K(unsigned long ea, unsigned long access, vpn = hpt_vpn(ea, vsid, ssize); if (unlikely(old_pte & H_PAGE_HASHPTE)) { + unsigned long gslot; + /* * There MIGHT be an HPTE for this pte */ - hash = hpt_hash(vpn, shift, ssize); - if (old_pte & H_PAGE_F_SECOND) - hash = ~hash; - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP; - slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT; - - if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K, + gslot = get_hidx_gslot(vpn, shift,
[RFC v3 05/23] powerpc: capture the PTE format changes in the dump pte report
The H_PAGE_F_SECOND,H_PAGE_F_GIX are not in the 64K main-PTE. capture these changes in the dump pte report. Signed-off-by: Ram Pai --- arch/powerpc/mm/dump_linuxpagetables.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c index 44fe483..5627edd 100644 --- a/arch/powerpc/mm/dump_linuxpagetables.c +++ b/arch/powerpc/mm/dump_linuxpagetables.c @@ -213,7 +213,7 @@ struct flag_info { .val= H_PAGE_4K_PFN, .set= "4K_pfn", }, { -#endif +#else /* CONFIG_PPC_64K_PAGES */ .mask = H_PAGE_F_GIX, .val= H_PAGE_F_GIX, .set= "f_gix", @@ -224,6 +224,7 @@ struct flag_info { .val= H_PAGE_F_SECOND, .set= "f_second", }, { +#endif /* CONFIG_PPC_64K_PAGES */ #endif .mask = _PAGE_SPECIAL, .val= _PAGE_SPECIAL, -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC v3 10/23] mm: provide the ability to disable execute on a key at creation
Currently sys_pkey_create() provides the ability to disable read and write permission on the key at creation. powerpc has the hardware support to disable execute on a pkey as well. This patch enhances the interface to allow disabling execute at key creation time. x86 does not allow this, hence the next patch will add the ability in x86 to return an error if PKEY_DISABLE_EXECUTE is specified. Signed-off-by: Ram Pai --- include/uapi/asm-generic/mman-common.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index 8c27db0..bf4fa07 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -74,7 +74,9 @@ #define PKEY_DISABLE_ACCESS0x1 #define PKEY_DISABLE_WRITE 0x2 +#define PKEY_DISABLE_EXECUTE 0x4 #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\ -PKEY_DISABLE_WRITE) +PKEY_DISABLE_WRITE |\ +PKEY_DISABLE_EXECUTE) #endif /* __ASM_GENERIC_MMAN_COMMON_H */ -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
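A hedged userspace sketch of what the new flag enables (not part of the patch; it assumes the libc headers export SYS_pkey_alloc, and on x86 the call is expected to fail once the next patch is applied):

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef PKEY_DISABLE_EXECUTE
#define PKEY_DISABLE_EXECUTE 0x4        /* value added by this patch */
#endif

int main(void)
{
        long pkey = syscall(SYS_pkey_alloc, 0UL,
                            (unsigned long)PKEY_DISABLE_EXECUTE);

        if (pkey < 0)
                perror("pkey_alloc(PKEY_DISABLE_EXECUTE)");   /* e.g. on x86 */
        else
                printf("execute-disabled pkey %ld allocated\n", pkey);
        return 0;
}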
[RFC v3 08/23] powerpc: use helper functions in flush_hash_page()
replace redundant code in flush_hash_page() with helper functions get_hidx_gslot() and set_hidx_slot() Signed-off-by: Ram Pai --- arch/powerpc/mm/hash_utils_64.c | 13 - 1 file changed, 4 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index 99f97754c..b3bc5d6 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -1611,23 +1611,18 @@ unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift, void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, int ssize, unsigned long flags) { - unsigned long hash, index, shift, hidx, slot; + unsigned long index, shift, gslot; int local = flags & HPTE_LOCAL_UPDATE; DBG_LOW("flush_hash_page(vpn=%016lx)\n", vpn); pte_iterate_hashed_subpages(pte, psize, vpn, index, shift) { - hash = hpt_hash(vpn, shift, ssize); - hidx = __rpte_to_hidx(pte, index); - if (hidx & _PTEIDX_SECONDARY) - hash = ~hash; - slot = (hash & htab_hash_mask) * HPTES_PER_GROUP; - slot += hidx & _PTEIDX_GROUP_IX; - DBG_LOW(" sub %ld: hash=%lx, hidx=%lx\n", index, slot, hidx); + gslot = get_hidx_gslot(vpn, shift, ssize, pte, index); + DBG_LOW(" sub %ld: gslot=%lx\n", index, gslot); /* * We use same base page size and actual psize, because we don't * use these functions for hugepage */ - mmu_hash_ops.hpte_invalidate(slot, vpn, psize, psize, + mmu_hash_ops.hpte_invalidate(gslot, vpn, psize, psize, ssize, local); } pte_iterate_hashed_end(); -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC v3 11/23] x86: key creation with PKEY_DISABLE_EXECUTE is disallowed
x86 does not support disabling execute permissions on a pkey. Signed-off-by: Ram Pai --- arch/x86/kernel/fpu/xstate.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index c24ac1e..d582631 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -900,6 +900,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, if (!boot_cpu_has(X86_FEATURE_OSPKE)) return -EINVAL; + if (init_val & PKEY_DISABLE_EXECUTE) + return -EINVAL; + /* Set the bits we need in PKRU: */ if (init_val & PKEY_DISABLE_ACCESS) new_pkru_bits |= PKRU_AD_BIT; -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC v3 12/23] powerpc: Implement sys_pkey_alloc and sys_pkey_free system call
Sys_pkey_alloc() allocates and returns available pkey Sys_pkey_free() frees up the pkey. Total 32 keys are supported on powerpc. However pkey 0,1 and 31 are reserved. So effectively we have 29 pkeys. Each key can be initialized to disable read, write and execute permissions. On powerpc a key can be initialize to disable execute. Signed-off-by: Ram Pai --- arch/powerpc/Kconfig | 15 arch/powerpc/include/asm/book3s/64/mmu.h | 10 +++ arch/powerpc/include/asm/book3s/64/pgtable.h | 62 ++ arch/powerpc/include/asm/pkeys.h | 124 +++ arch/powerpc/include/asm/systbl.h| 2 + arch/powerpc/include/asm/unistd.h| 4 +- arch/powerpc/include/uapi/asm/unistd.h | 2 + arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/mmu_context_book3s64.c | 5 ++ arch/powerpc/mm/pkeys.c | 88 +++ 10 files changed, 310 insertions(+), 3 deletions(-) create mode 100644 arch/powerpc/include/asm/pkeys.h create mode 100644 arch/powerpc/mm/pkeys.c diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index f7c8f99..b6960617 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -871,6 +871,21 @@ config SECCOMP If unsure, say Y. Only embedded should say N here. +config PPC64_MEMORY_PROTECTION_KEYS + prompt "PowerPC Memory Protection Keys" + def_bool y + # Note: only available in 64-bit mode + depends on PPC64 && PPC_64K_PAGES + select ARCH_USES_HIGH_VMA_FLAGS + select ARCH_HAS_PKEYS + ---help--- + Memory Protection Keys provides a mechanism for enforcing + page-based protections, but without requiring modification of the + page tables when an application changes protection domains. + + For details, see Documentation/powerpc/protection-keys.txt + + If unsure, say y. endmenu config ISA_DMA_API diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h index 77529a3..0c0a2a8 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu.h +++ b/arch/powerpc/include/asm/book3s/64/mmu.h @@ -108,6 +108,16 @@ struct patb_entry { #ifdef CONFIG_SPAPR_TCE_IOMMU struct list_head iommu_group_mem_list; #endif + +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + /* +* Each bit represents one protection key. 
+* bit set -> key allocated +* bit unset -> key available for allocation +*/ + u32 pkey_allocation_map; + s16 execute_only_pkey; /* key holding execute-only protection */ +#endif } mm_context_t; /* diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 85bc987..87e9a89 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -428,6 +428,68 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm, pte_update(mm, addr, ptep, 0, _PAGE_PRIVILEGED, 1); } + +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + +#include +static inline u64 read_amr(void) +{ + return mfspr(SPRN_AMR); +} +static inline void write_amr(u64 value) +{ + mtspr(SPRN_AMR, value); +} +static inline u64 read_iamr(void) +{ + return mfspr(SPRN_IAMR); +} +static inline void write_iamr(u64 value) +{ + mtspr(SPRN_IAMR, value); +} +static inline u64 read_uamor(void) +{ + return mfspr(SPRN_UAMOR); +} +static inline void write_uamor(u64 value) +{ + mtspr(SPRN_UAMOR, value); +} + +#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ + +static inline u64 read_amr(void) +{ + WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__); + return -1; +} +static inline void write_amr(u64 value) +{ + WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__); +} +static inline u64 read_uamor(void) +{ + WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__); + return -1; +} +static inline void write_uamor(u64 value) +{ + WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__); +} +static inline u64 read_iamr(void) +{ + WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__); + return -1; +} +static inline void write_iamr(u64 value) +{ + WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__); +} + +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ + + #define __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep) diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h new file mode 100644 index 000..7bc8746 --- /dev/null +++ b/arch/powerpc/include/asm/pkeys.h @@ -0,0 +1,124 @@ +#ifndef _ASM_PPC64_PKEYS_H +#define _ASM_PPC64_PKEYS_H + + +#define arch_max_pkey() 32 + +#define AMR_AD_BIT
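A hedged userspace sketch exercising the two new syscalls; the syscall numbers are the powerpc ones used by this series' selftest and are assumptions here. With keys 0, 1 and 31 reserved as described above, the loop would be expected to stop after at most 29 allocations:

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SYS_pkey_alloc  384     /* powerpc numbers per this series */
#define SYS_pkey_free   385

int main(void)
{
        long keys[32];
        int n = 0;

        while (n < 32 && (keys[n] = syscall(SYS_pkey_alloc, 0UL, 0UL)) >= 0)
                n++;
        printf("%d pkeys allocated before exhaustion\n", n);

        while (n--)
                syscall(SYS_pkey_free, keys[n]);
        return 0;
}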
[RFC v3 09/23] mm: introduce an additional vma bit for powerpc pkey
Currently there are only 4bits in the vma flags to support 16 keys on x86. powerpc supports 32 keys, which needs 5bits. This patch introduces an addition bit in the vma flags. Signed-off-by: Ram Pai --- fs/proc/task_mmu.c | 6 +- include/linux/mm.h | 18 +- 2 files changed, 18 insertions(+), 6 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index f0c8b33..2ddc298 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -666,12 +666,16 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) [ilog2(VM_MERGEABLE)] = "mg", [ilog2(VM_UFFD_MISSING)]= "um", [ilog2(VM_UFFD_WP)] = "uw", -#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS +#ifdef CONFIG_ARCH_HAS_PKEYS /* These come out via ProtectionKey: */ [ilog2(VM_PKEY_BIT0)] = "", [ilog2(VM_PKEY_BIT1)] = "", [ilog2(VM_PKEY_BIT2)] = "", [ilog2(VM_PKEY_BIT3)] = "", +#endif /* CONFIG_ARCH_HAS_PKEYS */ +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + /* Additional bit in ProtectionKey: */ + [ilog2(VM_PKEY_BIT4)] = "", #endif }; size_t i; diff --git a/include/linux/mm.h b/include/linux/mm.h index 7cb17c6..3d35bcc 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -208,21 +208,29 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *, #define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */ +#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit arch */ #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0) #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1) #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2) #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3) +#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ -#if defined(CONFIG_X86) -# define VM_PATVM_ARCH_1 /* PAT reserves whole VMA at once (x86) */ -#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) +#ifdef CONFIG_ARCH_HAS_PKEYS # define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0 -# define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */ +# define VM_PKEY_BIT0 VM_HIGH_ARCH_0 # define VM_PKEY_BIT1 VM_HIGH_ARCH_1 # define VM_PKEY_BIT2 VM_HIGH_ARCH_2 # define VM_PKEY_BIT3 VM_HIGH_ARCH_3 -#endif +#endif /* CONFIG_ARCH_HAS_PKEYS */ + +#if defined(CONFIG_PPC64_MEMORY_PROTECTION_KEYS) +# define VM_PKEY_BIT4 VM_HIGH_ARCH_4 /* additional key bit used on ppc64 */ +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ + + +#if defined(CONFIG_X86) +# define VM_PATVM_ARCH_1 /* PAT reserves whole VMA at once (x86) */ #elif defined(CONFIG_PPC) # define VM_SAOVM_ARCH_1 /* Strong Access Ordering (powerpc) */ #elif defined(CONFIG_PARISC) -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
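As a concrete illustration of why the fifth vma bit is needed, here is a worked example following the pkey_to_vmflag_bits() encoding that a later patch in this series adds (sketch only; the VM_PKEY_BIT* flags come from linux/mm.h):

        unsigned long key = 17;         /* 0b10001, representable only with 5 bits */
        unsigned long vmflags =
                ((key & 0x1UL)  ? VM_PKEY_BIT0 : 0UL) |
                ((key & 0x2UL)  ? VM_PKEY_BIT1 : 0UL) |
                ((key & 0x4UL)  ? VM_PKEY_BIT2 : 0UL) |
                ((key & 0x8UL)  ? VM_PKEY_BIT3 : 0UL) |
                ((key & 0x10UL) ? VM_PKEY_BIT4 : 0UL);
        /* vmflags == VM_PKEY_BIT0 | VM_PKEY_BIT4 */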
[RFC v3 13/23] powerpc: store and restore the pkey state across context switches
Store and restore the AMR, IAMR and UMOR register state of the task before scheduling out and after scheduling in, respectively. Signed-off-by: Ram Pai --- arch/powerpc/include/asm/processor.h | 5 + arch/powerpc/kernel/process.c| 18 ++ 2 files changed, 23 insertions(+) diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h index a2123f2..1f714df 100644 --- a/arch/powerpc/include/asm/processor.h +++ b/arch/powerpc/include/asm/processor.h @@ -310,6 +310,11 @@ struct thread_struct { struct thread_vr_state ckvr_state; /* Checkpointed VR state */ unsigned long ckvrsave; /* Checkpointed VRSAVE */ #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */ +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + unsigned long amr; + unsigned long iamr; + unsigned long uamor; +#endif #ifdef CONFIG_KVM_BOOK3S_32_HANDLER void* kvm_shadow_vcpu; /* KVM internal data */ #endif /* CONFIG_KVM_BOOK3S_32_HANDLER */ diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index baae104..37d001a 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -1096,6 +1096,11 @@ static inline void save_sprs(struct thread_struct *t) t->tar = mfspr(SPRN_TAR); } #endif +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + t->amr = mfspr(SPRN_AMR); + t->iamr = mfspr(SPRN_IAMR); + t->uamor = mfspr(SPRN_UAMOR); +#endif } static inline void restore_sprs(struct thread_struct *old_thread, @@ -1131,6 +1136,14 @@ static inline void restore_sprs(struct thread_struct *old_thread, mtspr(SPRN_TAR, new_thread->tar); } #endif +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + if (old_thread->amr != new_thread->amr) + mtspr(SPRN_AMR, new_thread->amr); + if (old_thread->iamr != new_thread->iamr) + mtspr(SPRN_IAMR, new_thread->iamr); + if (old_thread->uamor != new_thread->uamor) + mtspr(SPRN_UAMOR, new_thread->uamor); +#endif } struct task_struct *__switch_to(struct task_struct *prev, @@ -1686,6 +1699,11 @@ void start_thread(struct pt_regs *regs, unsigned long start, unsigned long sp) current->thread.tm_texasr = 0; current->thread.tm_tfiar = 0; #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */ +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + current->thread.amr = 0x0ul; + current->thread.iamr = 0x0ul; + current->thread.uamor = 0x0ul; +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ } EXPORT_SYMBOL(start_thread); -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC v3 14/23] powerpc: Implementation for sys_mprotect_pkey() system call
This system call, associates the pkey with PTE of all pages covering the given address range. Signed-off-by: Ram Pai --- arch/powerpc/include/asm/book3s/64/pgtable.h | 22 ++- arch/powerpc/include/asm/mman.h | 14 - arch/powerpc/include/asm/pkeys.h | 21 ++- arch/powerpc/include/asm/systbl.h| 1 + arch/powerpc/include/asm/unistd.h| 4 +- arch/powerpc/include/uapi/asm/unistd.h | 1 + arch/powerpc/mm/pkeys.c | 93 +++- 7 files changed, 148 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 87e9a89..bc845cd 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -37,6 +37,7 @@ #define _RPAGE_RSV20x0800UL #define _RPAGE_RSV30x0400UL #define _RPAGE_RSV40x0200UL +#define _RPAGE_RSV50x00040UL #define _PAGE_PTE 0x4000UL/* distinguishes PTEs from pointers */ #define _PAGE_PRESENT 0x8000UL/* pte contains a translation */ @@ -56,6 +57,20 @@ /* Max physical address bit as per radix table */ #define _RPAGE_PA_MAX 57 +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS +#define H_PAGE_PKEY_BIT0 _RPAGE_RSV1 +#define H_PAGE_PKEY_BIT1 _RPAGE_RSV2 +#define H_PAGE_PKEY_BIT2 _RPAGE_RSV3 +#define H_PAGE_PKEY_BIT3 _RPAGE_RSV4 +#define H_PAGE_PKEY_BIT4 _RPAGE_RSV5 +#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ +#define H_PAGE_PKEY_BIT0 0 +#define H_PAGE_PKEY_BIT1 0 +#define H_PAGE_PKEY_BIT2 0 +#define H_PAGE_PKEY_BIT3 0 +#define H_PAGE_PKEY_BIT4 0 +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ + /* * Max physical address bit we will use for now. * @@ -122,7 +137,12 @@ #define PAGE_PROT_BITS (_PAGE_SAO | _PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT | \ H_PAGE_4K_PFN | _PAGE_PRIVILEGED | _PAGE_ACCESSED | \ _PAGE_READ | _PAGE_WRITE | _PAGE_DIRTY | _PAGE_EXEC | \ -_PAGE_SOFT_DIRTY) +_PAGE_SOFT_DIRTY | \ +H_PAGE_PKEY_BIT0 | \ +H_PAGE_PKEY_BIT1 | \ +H_PAGE_PKEY_BIT2 | \ +H_PAGE_PKEY_BIT3 | \ +H_PAGE_PKEY_BIT4) /* * We define 2 sets of base prot bits, one for basic pages (ie, * cacheable kernel and user pages) and one for non cacheable diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h index 30922f6..624f6a2 100644 --- a/arch/powerpc/include/asm/mman.h +++ b/arch/powerpc/include/asm/mman.h @@ -13,6 +13,7 @@ #include #include +#include #include /* @@ -22,13 +23,24 @@ static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot, unsigned long pkey) { - return (prot & PROT_SAO) ? VM_SAO : 0; +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + return (((prot & PROT_SAO) ? VM_SAO : 0) | + pkey_to_vmflag_bits(pkey)); +#else + return ((prot & PROT_SAO) ? VM_SAO : 0); +#endif } #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey) static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags) { +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + return (vm_flags & VM_SAO) ? + __pgprot(_PAGE_SAO | vmflag_to_page_pkey_bits(vm_flags)) : + __pgprot(0 | vmflag_to_page_pkey_bits(vm_flags)); +#else return (vm_flags & VM_SAO) ? __pgprot(_PAGE_SAO) : __pgprot(0); +#endif } #define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags) diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h index 7bc8746..0f3dca8 100644 --- a/arch/powerpc/include/asm/pkeys.h +++ b/arch/powerpc/include/asm/pkeys.h @@ -14,6 +14,19 @@ VM_PKEY_BIT3 | \ VM_PKEY_BIT4) +#define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \ + ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) |\ + ((key & 0x4UL) ? 
VM_PKEY_BIT2 : 0x0UL) |\ + ((key & 0x8UL) ? VM_PKEY_BIT3 : 0x0UL) |\ + ((key & 0x10UL) ? VM_PKEY_BIT4 : 0x0UL)) + +#define vmflag_to_page_pkey_bits(vm_flags) \ + (((vm_flags & VM_PKEY_BIT0) ? H_PAGE_PKEY_BIT4 : 0x0UL)| \ + ((vm_flags & VM_PKEY_BIT1) ? H_PAGE_PKEY_BIT3 : 0x0UL) | \ + ((vm_flags & VM_PKEY_BIT2) ? H_PAGE_PKEY_BIT2 : 0x0UL) | \ + ((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) | \ + ((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL)) + /* * Bits are in BE format. * NOTE: key 31, 1, 0 are not used. @
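A hedged userspace sketch of the new syscall in action; the powerpc syscall numbers below come from this series' selftest and are assumptions here:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SYS_pkey_alloc          384
#define SYS_pkey_mprotect       386

int main(void)
{
        long psize = sysconf(_SC_PAGESIZE);
        void *ptr = mmap(NULL, psize, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        long pkey = syscall(SYS_pkey_alloc, 0UL, 0UL);

        if (ptr == MAP_FAILED || pkey < 0)
                return 1;

        /* associate 'pkey' with the PTEs covering [ptr, ptr + psize) */
        if (syscall(SYS_pkey_mprotect, ptr, (unsigned long)psize,
                    PROT_READ | PROT_WRITE, (int)pkey))
                perror("pkey_mprotect");
        return 0;
}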
[RFC v3 18/23] powerpc: Deliver SEGV signal on pkey violation
The value of the AMR register at the time of exception is made available in gp_regs[PT_AMR] of the siginfo. This field can be used to reprogram the permission bits of any valid pkey. Similarly the value of the pkey, whose protection got violated, is made available at si_pkey field of the siginfo structure. Signed-off-by: Ram Pai --- arch/powerpc/include/asm/paca.h| 1 + arch/powerpc/include/uapi/asm/ptrace.h | 3 ++- arch/powerpc/kernel/asm-offsets.c | 5 arch/powerpc/kernel/exceptions-64s.S | 16 +-- arch/powerpc/kernel/signal_32.c| 14 ++ arch/powerpc/kernel/signal_64.c| 14 ++ arch/powerpc/kernel/traps.c| 49 ++ arch/powerpc/mm/fault.c| 2 ++ 8 files changed, 101 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h index 1c09f8f..a41afd3 100644 --- a/arch/powerpc/include/asm/paca.h +++ b/arch/powerpc/include/asm/paca.h @@ -92,6 +92,7 @@ struct paca_struct { struct dtl_entry *dispatch_log_end; #endif /* CONFIG_PPC_STD_MMU_64 */ u64 dscr_default; /* per-CPU default DSCR */ + u64 paca_amr; /* value of amr at exception */ #ifdef CONFIG_PPC_STD_MMU_64 /* diff --git a/arch/powerpc/include/uapi/asm/ptrace.h b/arch/powerpc/include/uapi/asm/ptrace.h index 8036b38..7ec2428 100644 --- a/arch/powerpc/include/uapi/asm/ptrace.h +++ b/arch/powerpc/include/uapi/asm/ptrace.h @@ -108,8 +108,9 @@ struct pt_regs { #define PT_DAR 41 #define PT_DSISR 42 #define PT_RESULT 43 -#define PT_DSCR 44 #define PT_REGS_COUNT 44 +#define PT_DSCR 44 +#define PT_AMR 45 #define PT_FPR048 /* each FP reg occupies 2 slots in this space */ diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 709e234..17f5d8a 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -241,6 +241,11 @@ int main(void) OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id); OFFSET(PACAKEXECSTATE, paca_struct, kexec_state); OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default); + +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + OFFSET(PACA_AMR, paca_struct, paca_amr); +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ + OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime); OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user); OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime); diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 3fd0528..a4de1b4 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -493,9 +493,15 @@ EXC_COMMON_BEGIN(data_access_common) ld r12,_MSR(r1) ld r3,PACA_EXGEN+EX_DAR(r13) lwz r4,PACA_EXGEN+EX_DSISR(r13) - li r5,0x300 std r3,_DAR(r1) std r4,_DSISR(r1) +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + andis. r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */ + beq+1f + mfspr r5,SPRN_AMR + std r5,PACA_AMR(r13) +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ +1: li r5,0x300 BEGIN_MMU_FTR_SECTION b do_hash_page/* Try to handle as hpte fault */ MMU_FTR_SECTION_ELSE @@ -561,9 +567,15 @@ EXC_COMMON_BEGIN(instruction_access_common) ld r12,_MSR(r1) ld r3,_NIP(r1) andis. r4,r12,0x5820 - li r5,0x400 std r3,_DAR(r1) std r4,_DSISR(r1) +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + andis. 
r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */ + beq+1f + mfspr r5,SPRN_AMR + std r5,PACA_AMR(r13) +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ +1: li r5,0x400 BEGIN_MMU_FTR_SECTION b do_hash_page/* Try to handle as hpte fault */ MMU_FTR_SECTION_ELSE diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c index 97bb138..059766a 100644 --- a/arch/powerpc/kernel/signal_32.c +++ b/arch/powerpc/kernel/signal_32.c @@ -500,6 +500,11 @@ static int save_user_regs(struct pt_regs *regs, struct mcontext __user *frame, (unsigned long) &frame->tramp[2]); } +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + if (__put_user(get_paca()->paca_amr, &frame->mc_gregs[PT_AMR])) + return 1; +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ + return 0; } @@ -661,6 +666,9 @@ static long restore_user_regs(struct pt_regs *regs, long err; unsigned int save_r2 = 0; unsigned long msr; +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + unsigned long amr; +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ #ifdef CONFIG_VSX int i; #endif @@ -750,6 +758,12 @@ static long resto
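A hedged sketch of a handler consuming the new information: the faulting key arrives in si_pkey of the siginfo and the AMR value at the time of the exception in gp_regs[PT_AMR] of the ucontext. The PT_AMR index and the si_pkey offset follow this series' selftest and may need adjusting against the headers actually installed:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <ucontext.h>

#define PT_AMR          45      /* gp_regs index added by this patch */
#define SI_PKEY_OFFSET  0x20    /* powerpc offset used by the selftest */

static void segv_handler(int sig, siginfo_t *si, void *vucontext)
{
        ucontext_t *uc = vucontext;
        unsigned int pkey;

        /* older libcs lack siginfo->si_pkey, so read it by offset */
        memcpy(&pkey, (char *)si + SI_PKEY_OFFSET, sizeof(pkey));
        fprintf(stderr, "pkey fault: si_code=%d key=%u amr=%llx\n",
                si->si_code, pkey,
                (unsigned long long)uc->uc_mcontext.gp_regs[PT_AMR]);
}

int main(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
        /* ... set up a pkey-protected mapping and touch it here ... */
        return 0;
}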
[RFC v3 16/23] powerpc: Macro the mask used for checking DSI exception
Replace the magic number used to check for DSI exception with a meaningful value. Signed-off-by: Ram Pai --- arch/powerpc/include/asm/reg.h | 7 ++- arch/powerpc/kernel/exceptions-64s.S | 2 +- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index 7e50e47..ba110dd 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -272,16 +272,21 @@ #define SPRN_DAR 0x013 /* Data Address Register */ #define SPRN_DBCR 0x136 /* e300 Data Breakpoint Control Reg */ #define SPRN_DSISR 0x012 /* Data Storage Interrupt Status Register */ +#define DSISR_BIT32 0x8000 /* not defined */ #define DSISR_NOHPTE 0x4000 /* no translation found */ +#define DSISR_PAGEATTR_CONFLT0x2000 /* page attribute conflict */ +#define DSISR_BIT35 0x1000 /* not defined */ #define DSISR_PROTFAULT 0x0800 /* protection fault */ #define DSISR_BADACCESS 0x0400 /* bad access to CI or G */ #define DSISR_ISSTORE0x0200 /* access was a store */ #define DSISR_DABRMATCH 0x0040 /* hit data breakpoint */ -#define DSISR_NOSEGMENT 0x0020 /* SLB miss */ #define DSISR_KEYFAULT 0x0020 /* Key fault */ +#define DSISR_BIT43 0x0010 /* not defined */ #define DSISR_UNSUPP_MMU 0x0008 /* Unsupported MMU config */ #define DSISR_SET_RC 0x0004 /* Failed setting of R/C bits */ #define DSISR_PGDIRFAULT 0x0002 /* Fault on page directory */ +#define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | DSISR_PAGEATTR_CONFLT | \ + DSISR_BADACCESS | DSISR_BIT43) #define SPRN_TBRL 0x10C /* Time Base Read Lower Register (user, R/O) */ #define SPRN_TBRU 0x10D /* Time Base Read Upper Register (user, R/O) */ #define SPRN_CIR 0x11B /* Chip Information Register (hyper, R/0) */ diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index ae418b8..3fd0528 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -1411,7 +1411,7 @@ USE_TEXT_SECTION() .balign IFETCH_ALIGN_BYTES do_hash_page: #ifdef CONFIG_PPC_STD_MMU_64 - andis. r0,r4,0xa410/* weird error? */ + andis. r0,r4,DSISR_PAGE_FAULT_MASK@h bne-handle_page_fault /* if not, try to insert a HPTE */ andis. r0,r4,DSISR_DABRMATCH@h bne-handle_dabr_fault -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC v3 15/23] powerpc: Program HPTE key protection bits
Map the PTE protection key bits to the HPTE key protection bits, while creating HPTE entries. Signed-off-by: Ram Pai --- arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 + arch/powerpc/include/asm/pkeys.h | 7 +++ arch/powerpc/mm/hash_utils_64.c | 5 + 3 files changed, 17 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h index 6981a52..f7a6ed3 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h @@ -90,6 +90,8 @@ #define HPTE_R_PP0 ASM_CONST(0x8000) #define HPTE_R_TS ASM_CONST(0x4000) #define HPTE_R_KEY_HI ASM_CONST(0x3000) +#define HPTE_R_KEY_BIT0ASM_CONST(0x2000) +#define HPTE_R_KEY_BIT1ASM_CONST(0x1000) #define HPTE_R_RPN_SHIFT 12 #define HPTE_R_RPN ASM_CONST(0x0000) #define HPTE_R_RPN_3_0 ASM_CONST(0x01fff000) @@ -104,6 +106,9 @@ #define HPTE_R_C ASM_CONST(0x0080) #define HPTE_R_R ASM_CONST(0x0100) #define HPTE_R_KEY_LO ASM_CONST(0x0e00) +#define HPTE_R_KEY_BIT2ASM_CONST(0x0800) +#define HPTE_R_KEY_BIT3ASM_CONST(0x0400) +#define HPTE_R_KEY_BIT4ASM_CONST(0x0200) #define HPTE_V_1TB_SEG ASM_CONST(0x4000) #define HPTE_V_VRMA_MASK ASM_CONST(0x4001ff00) diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h index 0f3dca8..af3882f 100644 --- a/arch/powerpc/include/asm/pkeys.h +++ b/arch/powerpc/include/asm/pkeys.h @@ -27,6 +27,13 @@ ((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) | \ ((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL)) +#define pte_to_hpte_pkey_bits(pteflags)\ + (((pteflags & H_PAGE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL) |\ + ((pteflags & H_PAGE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) | \ + ((pteflags & H_PAGE_PKEY_BIT2) ? HPTE_R_KEY_BIT2 : 0x0UL) | \ + ((pteflags & H_PAGE_PKEY_BIT3) ? HPTE_R_KEY_BIT3 : 0x0UL) | \ + ((pteflags & H_PAGE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL)) + /* * Bits are in BE format. * NOTE: key 31, 1, 0 are not used. diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index b3bc5d6..34bc94c 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include @@ -230,6 +231,10 @@ unsigned long htab_convert_pte_flags(unsigned long pteflags) */ rflags |= HPTE_R_M; +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + rflags |= pte_to_hpte_pkey_bits(pteflags); +#endif + return rflags; } -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC v3 20/23] selftest: PowerPC specific test updates to memory protection keys
Abstracted out the arch specific code into the header file, and added powerpc specific changes. a) added 4k-backed hpte, memory allocator, powerpc specific. b) added three test case where the key is associated after the page is accessed/allocated/mapped. c) cleaned up the code to make checkpatch.pl happy Signed-off-by: Ram Pai --- tools/testing/selftests/vm/pkey-helpers.h| 230 +-- tools/testing/selftests/vm/protection_keys.c | 562 --- 2 files changed, 513 insertions(+), 279 deletions(-) diff --git a/tools/testing/selftests/vm/pkey-helpers.h b/tools/testing/selftests/vm/pkey-helpers.h index b202939..69bfa89 100644 --- a/tools/testing/selftests/vm/pkey-helpers.h +++ b/tools/testing/selftests/vm/pkey-helpers.h @@ -12,13 +12,72 @@ #include #include -#define NR_PKEYS 16 -#define PKRU_BITS_PER_PKEY 2 +/* Define some kernel-like types */ +#define u8 uint8_t +#define u16 uint16_t +#define u32 uint32_t +#define u64 uint64_t + +#ifdef __i386__ /* arch */ + +#define SYS_mprotect_key 380 +#define SYS_pkey_alloc 381 +#define SYS_pkey_free 382 +#define REG_IP_IDX REG_EIP +#define si_pkey_offset 0x14 + +#define NR_PKEYS 16 +#define NR_RESERVED_PKEYS 1 +#define PKRU_BITS_PER_PKEY 2 +#define PKEY_DISABLE_ACCESS0x1 +#define PKEY_DISABLE_WRITE 0x2 +#define HPAGE_SIZE (1UL<<21) + +#define INIT_PRKU 0x0UL + +#elif __powerpc64__ /* arch */ + +#define SYS_mprotect_key 386 +#define SYS_pkey_alloc 384 +#define SYS_pkey_free 385 +#define si_pkey_offset 0x20 +#define REG_IP_IDX PT_NIP +#define REG_TRAPNO PT_TRAP +#define REG_AMR45 +#define gregs gp_regs +#define fpregs fp_regs + +#define NR_PKEYS 32 +#define NR_RESERVED_PKEYS 3 +#define PKRU_BITS_PER_PKEY 2 +#define PKEY_DISABLE_ACCESS0x3 /* disable read and write */ +#define PKEY_DISABLE_WRITE 0x2 +#define HPAGE_SIZE (1UL<<24) + +#define INIT_PRKU 0x3UL +#else /* arch */ + + NOT SUPPORTED + +#endif /* arch */ + #ifndef DEBUG_LEVEL #define DEBUG_LEVEL 0 #endif #define DPRINT_IN_SIGNAL_BUF_SIZE 4096 + + +static inline u32 pkey_to_shift(int pkey) +{ +#ifdef __i386__ /* arch */ + return pkey * PKRU_BITS_PER_PKEY; +#elif __powerpc64__ /* arch */ + return (NR_PKEYS - pkey - 1) * PKRU_BITS_PER_PKEY; +#endif /* arch */ +} + + extern int dprint_in_signal; extern char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE]; static inline void sigsafe_printf(const char *format, ...) @@ -53,53 +112,76 @@ static inline void sigsafe_printf(const char *format, ...) #define dprintf3(args...) dprintf_level(3, args) #define dprintf4(args...) 
dprintf_level(4, args) -extern unsigned int shadow_pkru; -static inline unsigned int __rdpkru(void) +extern u64 shadow_pkey_reg; + +static inline u64 __rdpkey_reg(void) { +#ifdef __i386__ /* arch */ unsigned int eax, edx; unsigned int ecx = 0; - unsigned int pkru; + unsigned int pkey_reg; asm volatile(".byte 0x0f,0x01,0xee\n\t" : "=a" (eax), "=d" (edx) : "c" (ecx)); - pkru = eax; - return pkru; +#elif __powerpc64__ /* arch */ + u64 eax; + u64 pkey_reg; + + asm volatile("mfspr %0, 0xd" : "=r" ((u64)(eax))); +#endif /* arch */ + pkey_reg = (u64)eax; + return pkey_reg; } -static inline unsigned int _rdpkru(int line) +static inline u64 _rdpkey_reg(int line) { - unsigned int pkru = __rdpkru(); + u64 pkey_reg = __rdpkey_reg(); - dprintf4("rdpkru(line=%d) pkru: %x shadow: %x\n", - line, pkru, shadow_pkru); - assert(pkru == shadow_pkru); + dprintf4("rdpkey_reg(line=%d) pkey_reg: %lx shadow: %lx\n", + line, pkey_reg, shadow_pkey_reg); + assert(pkey_reg == shadow_pkey_reg); - return pkru; + return pkey_reg; } -#define rdpkru() _rdpkru(__LINE__) +#define rdpkey_reg() _rdpkey_reg(__LINE__) -static inline void __wrpkru(unsigned int pkru) +static inline void __wrpkey_reg(u64 pkey_reg) { - unsigned int eax = pkru; +#ifdef __i386__ /* arch */ + unsigned int eax = pkey_reg; unsigned int ecx = 0; unsigned int edx = 0; - dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru); + dprintf4("%s() changing %lx to %lx\n", +__func__, __rdpkey_reg(), pkey_reg); asm volatile(".byte 0x0f,0x01,0xef\n\t" : : "a" (eax), "c" (ecx), "d" (edx)); - assert(pkru == __rdpkru()); + dprintf4("%s() PKRUP after changing %lx to %lx\n", + __func__, __rdpkey_reg(), pkey_reg); +#else /* arch */ + u64 eax = pkey_reg; + + dprintf4("%s() changing %llx to %llx\n", +__func__, __rdpkey_reg(), pkey_reg); + asm volatile("mtspr 0xd, %0" : : "r" ((unsigned long)(eax)) : "memory"); + dprintf4("%s() PKRUP after changi
[RFC v3 19/23] selftest: Move protection key selftest to arch neutral directory
Signed-off-by: Ram Pai --- tools/testing/selftests/vm/Makefile |1 + tools/testing/selftests/vm/pkey-helpers.h | 219 tools/testing/selftests/vm/protection_keys.c | 1395 + tools/testing/selftests/x86/Makefile |2 +- tools/testing/selftests/x86/pkey-helpers.h| 219 tools/testing/selftests/x86/protection_keys.c | 1395 - 6 files changed, 1616 insertions(+), 1615 deletions(-) create mode 100644 tools/testing/selftests/vm/pkey-helpers.h create mode 100644 tools/testing/selftests/vm/protection_keys.c delete mode 100644 tools/testing/selftests/x86/pkey-helpers.h delete mode 100644 tools/testing/selftests/x86/protection_keys.c diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index cbb29e4..1d32f78 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -17,6 +17,7 @@ TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += userfaultfd TEST_GEN_FILES += mlock-random-test TEST_GEN_FILES += virtual_address_range +TEST_GEN_FILES += protection_keys TEST_PROGS := run_vmtests diff --git a/tools/testing/selftests/vm/pkey-helpers.h b/tools/testing/selftests/vm/pkey-helpers.h new file mode 100644 index 000..b202939 --- /dev/null +++ b/tools/testing/selftests/vm/pkey-helpers.h @@ -0,0 +1,219 @@ +#ifndef _PKEYS_HELPER_H +#define _PKEYS_HELPER_H +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define NR_PKEYS 16 +#define PKRU_BITS_PER_PKEY 2 + +#ifndef DEBUG_LEVEL +#define DEBUG_LEVEL 0 +#endif +#define DPRINT_IN_SIGNAL_BUF_SIZE 4096 +extern int dprint_in_signal; +extern char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE]; +static inline void sigsafe_printf(const char *format, ...) +{ + va_list ap; + + va_start(ap, format); + if (!dprint_in_signal) { + vprintf(format, ap); + } else { + int len = vsnprintf(dprint_in_signal_buffer, + DPRINT_IN_SIGNAL_BUF_SIZE, + format, ap); + /* +* len is amount that would have been printed, +* but actual write is truncated at BUF_SIZE. +*/ + if (len > DPRINT_IN_SIGNAL_BUF_SIZE) + len = DPRINT_IN_SIGNAL_BUF_SIZE; + write(1, dprint_in_signal_buffer, len); + } + va_end(ap); +} +#define dprintf_level(level, args...) do { \ + if (level <= DEBUG_LEVEL) \ + sigsafe_printf(args); \ + fflush(NULL); \ +} while (0) +#define dprintf0(args...) dprintf_level(0, args) +#define dprintf1(args...) dprintf_level(1, args) +#define dprintf2(args...) dprintf_level(2, args) +#define dprintf3(args...) dprintf_level(3, args) +#define dprintf4(args...) 
dprintf_level(4, args) + +extern unsigned int shadow_pkru; +static inline unsigned int __rdpkru(void) +{ + unsigned int eax, edx; + unsigned int ecx = 0; + unsigned int pkru; + + asm volatile(".byte 0x0f,0x01,0xee\n\t" +: "=a" (eax), "=d" (edx) +: "c" (ecx)); + pkru = eax; + return pkru; +} + +static inline unsigned int _rdpkru(int line) +{ + unsigned int pkru = __rdpkru(); + + dprintf4("rdpkru(line=%d) pkru: %x shadow: %x\n", + line, pkru, shadow_pkru); + assert(pkru == shadow_pkru); + + return pkru; +} + +#define rdpkru() _rdpkru(__LINE__) + +static inline void __wrpkru(unsigned int pkru) +{ + unsigned int eax = pkru; + unsigned int ecx = 0; + unsigned int edx = 0; + + dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru); + asm volatile(".byte 0x0f,0x01,0xef\n\t" +: : "a" (eax), "c" (ecx), "d" (edx)); + assert(pkru == __rdpkru()); +} + +static inline void wrpkru(unsigned int pkru) +{ + dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru); + /* will do the shadow check for us: */ + rdpkru(); + __wrpkru(pkru); + shadow_pkru = pkru; + dprintf4("%s(%08x) pkru: %08x\n", __func__, pkru, __rdpkru()); +} + +/* + * These are technically racy. since something could + * change PKRU between the read and the write. + */ +static inline void __pkey_access_allow(int pkey, int do_allow) +{ + unsigned int pkru = rdpkru(); + int bit = pkey * 2; + + if (do_allow) + pkru &= (1
[RFC v3 21/23] Documentation: Move protection key documentation to arch neutral directory
Since PowerPC and Intel both support memory protection keys, moving the documenation to arch-neutral directory. Signed-off-by: Ram Pai --- Documentation/vm/protection-keys.txt | 85 +++ Documentation/x86/protection-keys.txt | 85 --- 2 files changed, 85 insertions(+), 85 deletions(-) create mode 100644 Documentation/vm/protection-keys.txt delete mode 100644 Documentation/x86/protection-keys.txt diff --git a/Documentation/vm/protection-keys.txt b/Documentation/vm/protection-keys.txt new file mode 100644 index 000..b643045 --- /dev/null +++ b/Documentation/vm/protection-keys.txt @@ -0,0 +1,85 @@ +Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature +which will be found on future Intel CPUs. + +Memory Protection Keys provides a mechanism for enforcing page-based +protections, but without requiring modification of the page tables +when an application changes protection domains. It works by +dedicating 4 previously ignored bits in each page table entry to a +"protection key", giving 16 possible keys. + +There is also a new user-accessible register (PKRU) with two separate +bits (Access Disable and Write Disable) for each key. Being a CPU +register, PKRU is inherently thread-local, potentially giving each +thread a different set of protections from every other thread. + +There are two new instructions (RDPKRU/WRPKRU) for reading and writing +to the new register. The feature is only available in 64-bit mode, +even though there is theoretically space in the PAE PTEs. These +permissions are enforced on data access only and have no effect on +instruction fetches. + +=== Syscalls === + +There are 3 system calls which directly interact with pkeys: + + int pkey_alloc(unsigned long flags, unsigned long init_access_rights) + int pkey_free(int pkey); + int pkey_mprotect(unsigned long start, size_t len, + unsigned long prot, int pkey); + +Before a pkey can be used, it must first be allocated with +pkey_alloc(). An application calls the WRPKRU instruction +directly in order to change access permissions to memory covered +with a key. In this example WRPKRU is wrapped by a C function +called pkey_set(). + + int real_prot = PROT_READ|PROT_WRITE; + pkey = pkey_alloc(0, PKEY_DENY_WRITE); + ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); + ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey); + ... application runs here + +Now, if the application needs to update the data at 'ptr', it can +gain access, do the update, then remove its write access: + + pkey_set(pkey, 0); // clear PKEY_DENY_WRITE + *ptr = foo; // assign something + pkey_set(pkey, PKEY_DENY_WRITE); // set PKEY_DENY_WRITE again + +Now when it frees the memory, it will also free the pkey since it +is no longer in use: + + munmap(ptr, PAGE_SIZE); + pkey_free(pkey); + +(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions. + An example implementation can be found in + tools/testing/selftests/x86/protection_keys.c) + +=== Behavior === + +The kernel attempts to make protection keys consistent with the +behavior of a plain mprotect(). 
For instance if you do this: + + mprotect(ptr, size, PROT_NONE); + something(ptr); + +you can expect the same effects with protection keys when doing this: + + pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ); + pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey); + something(ptr); + +That should be true whether something() is a direct access to 'ptr' +like: + + *ptr = foo; + +or when the kernel does the access on the application's behalf like +with a read(): + + read(fd, ptr, 1); + +The kernel will send a SIGSEGV in both cases, but si_code will be set +to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when +the plain mprotect() permissions are violated. diff --git a/Documentation/x86/protection-keys.txt b/Documentation/x86/protection-keys.txt deleted file mode 100644 index b643045..000 --- a/Documentation/x86/protection-keys.txt +++ /dev/null @@ -1,85 +0,0 @@ -Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature -which will be found on future Intel CPUs. - -Memory Protection Keys provides a mechanism for enforcing page-based -protections, but without requiring modification of the page tables -when an application changes protection domains. It works by -dedicating 4 previously ignored bits in each page table entry to a -"protection key", giving 16 possible keys. - -There is also a new user-accessible register (PKRU) with two separate -bits (Access Disable and Write Disable) for each key. Being a CPU -register, PKRU is inherently thread-local, potentially giving each -thread a different set of protections from every oth
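For reference, a minimal sketch of such a pkey_set() wrapper, modeled on the selftest helpers quoted earlier in this series (the RDPKRU/WRPKRU byte encodings and register constraints are taken from that code; the helper itself is not part of any library API):

    /*
     * Hedged sketch of a userspace pkey_set() wrapper, not a library call.
     * Assumes the x86 PKRU layout: two bits per key, bit 2*pkey is Access
     * Disable (PKEY_DISABLE_ACCESS) and bit 2*pkey+1 is Write Disable
     * (PKEY_DISABLE_WRITE).
     */
    static inline unsigned int rdpkru_raw(void)
    {
            unsigned int eax, edx, ecx = 0;

            /* RDPKRU: requires %ecx = 0, returns PKRU in %eax */
            asm volatile(".byte 0x0f,0x01,0xee"
                         : "=a" (eax), "=d" (edx) : "c" (ecx));
            return eax;
    }

    static inline void wrpkru_raw(unsigned int pkru)
    {
            /* WRPKRU: %eax = new PKRU value, %ecx = %edx = 0 */
            asm volatile(".byte 0x0f,0x01,0xef"
                         : : "a" (pkru), "c" (0), "d" (0));
    }

    static inline int pkey_set(int pkey, unsigned long rights)
    {
            unsigned int pkru = rdpkru_raw();
            unsigned int shift = pkey * 2;

            pkru &= ~(3U << shift);            /* clear both bits for this key */
            pkru |= (rights & 3) << shift;     /* install the new rights */
            wrpkru_raw(pkru);
            return 0;
    }

With such a wrapper, the pkey_set(pkey, 0) / pkey_set(pkey, PKEY_DENY_WRITE) calls in the example above map onto single WRPKRU instructions, which is the point of the feature: no system call is needed to change the permissions.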
[RFC v3 22/23] Documentation: PowerPC specific updates to memory protection keys
Add documentation updates that capture PowerPC specific changes. Signed-off-by: Ram Pai --- Documentation/vm/protection-keys.txt | 65 +--- 1 file changed, 45 insertions(+), 20 deletions(-) diff --git a/Documentation/vm/protection-keys.txt b/Documentation/vm/protection-keys.txt index b643045..965ad75 100644 --- a/Documentation/vm/protection-keys.txt +++ b/Documentation/vm/protection-keys.txt @@ -1,21 +1,46 @@ -Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature -which will be found on future Intel CPUs. +Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature found in +new generation of intel CPUs and on PowerPC 7 and higher CPUs. Memory Protection Keys provides a mechanism for enforcing page-based -protections, but without requiring modification of the page tables -when an application changes protection domains. It works by -dedicating 4 previously ignored bits in each page table entry to a -"protection key", giving 16 possible keys. - -There is also a new user-accessible register (PKRU) with two separate -bits (Access Disable and Write Disable) for each key. Being a CPU -register, PKRU is inherently thread-local, potentially giving each -thread a different set of protections from every other thread. - -There are two new instructions (RDPKRU/WRPKRU) for reading and writing -to the new register. The feature is only available in 64-bit mode, -even though there is theoretically space in the PAE PTEs. These -permissions are enforced on data access only and have no effect on +protections, but without requiring modification of the page tables when an +application changes protection domains. + + +On Intel: + + It works by dedicating 4 previously ignored bits in each page table + entry to a "protection key", giving 16 possible keys. + + There is also a new user-accessible register (PKRU) with two separate + bits (Access Disable and Write Disable) for each key. Being a CPU + register, PKRU is inherently thread-local, potentially giving each + thread a different set of protections from every other thread. + + There are two new instructions (RDPKRU/WRPKRU) for reading and writing + to the new register. The feature is only available in 64-bit mode, + even though there is theoretically space in the PAE PTEs. These + permissions are enforced on data access only and have no effect on + instruction fetches. + + +On PowerPC: + + It works by dedicating 5 page table entry bits to a "protection key", + giving 32 possible keys. + + There is a user-accessible register (AMR) with two separate bits; + Access Disable and Write Disable, for each key. Being a CPU + register, AMR is inherently thread-local, potentially giving each + thread a different set of protections from every other thread. NOTE: + Disabling read permission does not disable write and vice-versa. + + The feature is available on 64-bit HPTE mode only. + 'mtspr 0xd, mem' reads the AMR register + 'mfspr mem, 0xd' writes into the AMR register. + + + +Permissions are enforced on data access only and have no effect on instruction fetches. === Syscalls === @@ -28,9 +53,9 @@ There are 3 system calls which directly interact with pkeys: unsigned long prot, int pkey); Before a pkey can be used, it must first be allocated with -pkey_alloc(). An application calls the WRPKRU instruction +pkey_alloc(). An application calls the WRPKRU/AMR instruction directly in order to change access permissions to memory covered -with a key. In this example WRPKRU is wrapped by a C function +with a key. 
In this example WRPKRU/AMR is wrapped by a C function called pkey_set(). int real_prot = PROT_READ|PROT_WRITE; @@ -52,11 +77,11 @@ is no longer in use: munmap(ptr, PAGE_SIZE); pkey_free(pkey); -(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions. +(Note: pkey_set() is a wrapper for the RDPKRU,WRPKRU or AMR instructions. An example implementation can be found in tools/testing/selftests/x86/protection_keys.c) -=== Behavior === +=== Behavior = The kernel attempts to make protection keys consistent with the behavior of a plain mprotect(). For instance if you do this: -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
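On PowerPC, the equivalent of the WRPKRU wrapper manipulates the AMR directly. A minimal, hedged sketch (SPR 0xd is the AMR; mfspr reads it and mtspr writes it; the two-bits-per-key shift follows the arch_max_pkey()/AMR_BITS_PER_PKEY arithmetic used in the kernel patches in this series):

    /*
     * Hedged sketch of userspace AMR access on PowerPC, not a library API.
     * mfspr reads the AMR (SPR 0xd) and mtspr writes it.  Each of the 32
     * keys owns two adjacent bits (Access Disable, Write Disable).
     */
    static inline unsigned long read_amr(void)
    {
            unsigned long amr;

            asm volatile("mfspr %0, 0xd" : "=r" (amr));     /* read AMR */
            return amr;
    }

    static inline void write_amr(unsigned long amr)
    {
            asm volatile("mtspr 0xd, %0" : : "r" (amr));    /* write AMR */
    }

    /* Disable both read and write access through 'pkey'. */
    static inline void pkey_deny_access(int pkey)
    {
            unsigned long shift = (32 - pkey - 1) * 2;      /* 2 bits per key */

            write_amr(read_amr() | (0x3UL << shift));
    }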
[RFC v3 23/23] procfs: display the protection-key number associated with a vma
Display the pkey number associated with the vma in smaps of a task. The key will be seen as below: VmFlags: rd wr mr mw me dw ac key=0 Signed-off-by: Ram Pai --- Documentation/filesystems/proc.txt | 3 ++- fs/proc/task_mmu.c | 22 +++--- 2 files changed, 13 insertions(+), 12 deletions(-) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 4cddbce..a8c74aa 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -423,7 +423,7 @@ SwapPss: 0 kB KernelPageSize:4 kB MMUPageSize: 4 kB Locked:0 kB -VmFlags: rd ex mr mw me dw +VmFlags: rd ex mr mw me dw key= the first of these lines shows the same information as is displayed for the mapping in /proc/PID/maps. The remaining lines show the size of the mapping @@ -491,6 +491,7 @@ manner. The codes are the following: hg - huge page advise flag nh - no-huge page advise flag mg - mergable advise flag +key= - the memory protection key number Note that there is no guarantee that every flag and associated mnemonic will be present in all further kernel releases. Things get changed, the flags may diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 2ddc298..d2eb096 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1,4 +1,6 @@ #include +#include +#include #include #include #include @@ -666,22 +668,20 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) [ilog2(VM_MERGEABLE)] = "mg", [ilog2(VM_UFFD_MISSING)]= "um", [ilog2(VM_UFFD_WP)] = "uw", -#ifdef CONFIG_ARCH_HAS_PKEYS - /* These come out via ProtectionKey: */ - [ilog2(VM_PKEY_BIT0)] = "", - [ilog2(VM_PKEY_BIT1)] = "", - [ilog2(VM_PKEY_BIT2)] = "", - [ilog2(VM_PKEY_BIT3)] = "", -#endif /* CONFIG_ARCH_HAS_PKEYS */ -#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS - /* Additional bit in ProtectionKey: */ - [ilog2(VM_PKEY_BIT4)] = "", -#endif }; size_t i; seq_puts(m, "VmFlags: "); for (i = 0; i < BITS_PER_LONG; i++) { +#ifdef CONFIG_ARCH_HAS_PKEYS + if (i == ilog2(VM_PKEY_BIT0)) { + int keyvalue = vma_pkey(vma); + + i += ilog2(arch_max_pkey())-1; + seq_printf(m, "key=%d ", keyvalue); + continue; + } +#endif /* CONFIG_ARCH_HAS_PKEYS */ if (!mnemonics[i][0]) continue; if (vma->vm_flags & (1UL << i)) { -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
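A hedged sketch of how userspace can consume the new field, scanning /proc/self/smaps for the key= mnemonic (format as documented above; error handling trimmed):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /*
     * Hedged sketch: print the protection key reported for each mapping
     * via the new "key=" VmFlags mnemonic.  Assumes the smaps layout
     * shown in the patch.
     */
    int main(void)
    {
            char line[256];
            FILE *f = fopen("/proc/self/smaps", "r");

            if (!f)
                    return 1;

            while (fgets(line, sizeof(line), f)) {
                    char *key;

                    if (strncmp(line, "VmFlags:", 8))
                            continue;
                    key = strstr(line, "key=");
                    if (key)
                            printf("mapping uses pkey %d\n", atoi(key + 4));
            }
            fclose(f);
            return 0;
    }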
[RFC v3 17/23] powerpc: Handle exceptions caused by violation of pkey protection
Handle Data and Instruction exceptions caused by memory protection-key. Signed-off-by: Ram Pai --- arch/powerpc/include/asm/mmu_context.h | 12 + arch/powerpc/include/asm/pkeys.h | 9 arch/powerpc/include/asm/reg.h | 2 +- arch/powerpc/mm/fault.c| 20 arch/powerpc/mm/pkeys.c| 90 ++ 5 files changed, 132 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index da7e943..71fffe0 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm, { } +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS +bool arch_pte_access_permitted(pte_t pte, bool write); +bool arch_vma_access_permitted(struct vm_area_struct *vma, + bool write, bool execute, bool foreign); +#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ +static inline bool arch_pte_access_permitted(pte_t pte, bool write) +{ + /* by default, allow everything */ + return true; +} static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write, bool execute, bool foreign) { /* by default, allow everything */ return true; } +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ + #endif /* __KERNEL__ */ #endif /* __ASM_POWERPC_MMU_CONTEXT_H */ diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h index af3882f..a83722e 100644 --- a/arch/powerpc/include/asm/pkeys.h +++ b/arch/powerpc/include/asm/pkeys.h @@ -14,6 +14,15 @@ VM_PKEY_BIT3 | \ VM_PKEY_BIT4) +static inline u16 pte_flags_to_pkey(unsigned long pte_flags) +{ + return ((pte_flags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0) | + ((pte_flags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0) | + ((pte_flags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0) | + ((pte_flags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0) | + ((pte_flags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0); +} + #define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \ ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) |\ ((key & 0x4UL) ? 
VM_PKEY_BIT2 : 0x0UL) |\ diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index ba110dd..6e2a860 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -286,7 +286,7 @@ #define DSISR_SET_RC 0x0004 /* Failed setting of R/C bits */ #define DSISR_PGDIRFAULT 0x0002 /* Fault on page directory */ #define DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | DSISR_PAGEATTR_CONFLT | \ - DSISR_BADACCESS | DSISR_BIT43) + DSISR_BADACCESS | DSISR_KEYFAULT | DSISR_BIT43) #define SPRN_TBRL 0x10C /* Time Base Read Lower Register (user, R/O) */ #define SPRN_TBRU 0x10D /* Time Base Read Upper Register (user, R/O) */ #define SPRN_CIR 0x11B /* Chip Information Register (hyper, R/0) */ diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 3a7d580..3d71984 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -261,6 +261,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, } #endif +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + if (error_code & DSISR_KEYFAULT) { + code = SEGV_PKUERR; + goto bad_area_nosemaphore; + } +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ + /* We restore the interrupt state now */ if (!arch_irq_disabled_regs(regs)) local_irq_enable(); @@ -441,6 +448,19 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, WARN_ON_ONCE(error_code & DSISR_PROTFAULT); #endif /* CONFIG_PPC_STD_MMU */ +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS + if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE, + is_exec, 0)) { + code = SEGV_PKUERR; + goto bad_area; + } +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ + + /* handle_mm_fault() needs to know if its a instruction access +* fault. +*/ + if (is_exec) + flags |= FAULT_FLAG_INSTRUCTION; /* * If for any reason at all we couldn't handle the fault, * make sure we exit gracefully rather than endlessly redo diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c index 11a32b3..439241a 100644 --- a/arch/powerpc/mm/pkeys.c +++ b/arch/powerpc/mm/pkeys.c @@ -27,6 +27,37 @@ static inline bool pkey_allows_readwrite(int pkey) return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift)); } +static inline bool pkey_allows_read(int pkey) +{ + int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY; +
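The fault path above reports pkey violations to userspace as SIGSEGV with si_code set to SEGV_PKUERR. A hedged sketch of a handler that tells the two cases apart (the si_pkey field follows the siginfo layout used by the existing x86 pkey support and may need a recent libc; treat it as an assumption):

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hedged sketch: distinguish pkey faults from plain permission faults. */
    static void segv_handler(int sig, siginfo_t *si, void *ctx)
    {
            if (si->si_code == SEGV_PKUERR)
                    fprintf(stderr, "pkey fault at %p (key %d)\n",
                            si->si_addr, si->si_pkey);
            else
                    fprintf(stderr, "plain access fault at %p\n", si->si_addr);
            _exit(1);
    }

    int main(void)
    {
            struct sigaction sa = { 0 };

            sa.sa_sigaction = segv_handler;
            sa.sa_flags = SA_SIGINFO;
            sigemptyset(&sa.sa_mask);
            sigaction(SIGSEGV, &sa, NULL);
            /* ... touch pkey-protected memory here ... */
            return 0;
    }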
[RFC v3 00/23] powerpc: Memory Protection Keys
Memory protection keys enable applications to protect its address space from inadvertent access or corruption from itself. The overall idea: A process allocates a key and associates it with a address range withinits address space. The process than can dynamically set read/write permissions on the key without involving the kernel. Any code that violates the permissions off the address space; as defined by its associated key, will receive a segmentation fault. This patch series enables the feature on PPC64. It is enabled on HPTE 64K-page platform. ISA3.0 section 5.7.13 describes the detailed specifications. Testing: This patch series has passed all the protection key tests available in the selftests directory. The tests are updated to work on both x86 and powerpc. version v3: (1) split the patches into smaller consumable patches. (2) added the ability to disable execute permission on a key at creation. (3) rename calc_pte_to_hpte_pkey_bits() to pte_to_hpte_pkey_bits() -- suggested by Anshuman (4) some code optimization and clarity in do_page_fault() (5) A bug fix while invalidating a hpte slot in __hash_page_4K() -- noticed by Aneesh version v2: (1) documentation and selftest added (2) fixed a bug in 4k hpte backed 64k pte where page invalidation was not done correctly, and initialization of second-part-of-the-pte was not done correctly if the pte was not yet Hashed with a hpte. Reported by Aneesh. (3) Fixed ABI breakage caused in siginfo structure. Reported by Anshuman. Outstanding known issue: Calls to sys_swapcontext with a made-up context will end up with a crap AMR if done by code who didn't know about that register. -- Reported by Ben. version v1: Initial version Thanks-to: Dave Hansen, Aneesh, Paul Mackerras, Michael Ellermen Ram Pai (23): powerpc: Free up four 64K PTE bits in 4K backed HPTE pages powerpc: introduce set_hidx_slot helper powerpc: introduce get_hidx_gslot helper powerpc: Free up four 64K PTE bits in 64K backed HPTE pages powerpc: capture the PTE format changes in the dump pte report powerpc: use helper functions in __hash_page_4K() for 64K PTE powerpc: use helper functions in __hash_page_4K() for 4K PTE powerpc: use helper functions in flush_hash_page() mm: introduce an additional vma bit for powerpc pkey mm: provide the ability to disable execute on a key at creation x86: key creation with PKEY_DISABLE_EXECUTE is disallowed powerpc: Implement sys_pkey_alloc and sys_pkey_free system call powerpc: store and restore the pkey state across context switches powerpc: Implementation for sys_mprotect_pkey() system call powerpc: Program HPTE key protection bits powerpc: Macro the mask used for checking DSI exception powerpc: Handle exceptions caused by violation of pkey protection powerpc: Deliver SEGV signal on pkey violation selftest: Move protecton key selftest to arch neutral directory selftest: PowerPC specific test updates to memory protection keys Documentation: Move protecton key documentation to arch neutral directory Documentation: PowerPC specific updates to memory protection keys procfs: display the protection-key number associated with a vma Documentation/filesystems/proc.txt|3 +- Documentation/vm/protection-keys.txt | 110 ++ Documentation/x86/protection-keys.txt | 85 -- arch/powerpc/Kconfig | 15 + arch/powerpc/include/asm/book3s/64/hash-4k.h | 14 + arch/powerpc/include/asm/book3s/64/hash-64k.h | 53 +- arch/powerpc/include/asm/book3s/64/hash.h | 15 +- arch/powerpc/include/asm/book3s/64/mmu-hash.h |5 + arch/powerpc/include/asm/book3s/64/mmu.h | 10 + 
arch/powerpc/include/asm/book3s/64/pgtable.h | 84 +- arch/powerpc/include/asm/mman.h | 14 +- arch/powerpc/include/asm/mmu_context.h| 12 + arch/powerpc/include/asm/paca.h |1 + arch/powerpc/include/asm/pkeys.h | 159 +++ arch/powerpc/include/asm/processor.h |5 + arch/powerpc/include/asm/reg.h|7 +- arch/powerpc/include/asm/systbl.h |3 + arch/powerpc/include/asm/unistd.h |6 +- arch/powerpc/include/uapi/asm/ptrace.h|3 +- arch/powerpc/include/uapi/asm/unistd.h|3 + arch/powerpc/kernel/asm-offsets.c |5 + arch/powerpc/kernel/exceptions-64s.S | 18 +- arch/powerpc/kernel/process.c | 18 + arch/powerpc/kernel/signal_32.c | 14 + arch/powerpc/kernel/signal_64.c | 14 + arch/powerpc/kernel/traps.c | 49 + arch/powerpc/mm/Makefile
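Item (2) of the v3 changes adds the ability to create a key with execute permission disabled. A hedged userspace sketch of how that is expected to be used (the PKEY_DISABLE_EXECUTE name is taken from the patch list above; the pkey_alloc()/pkey_mprotect() wrappers are assumed to come from the C library or raw syscall stubs):

    #include <sys/mman.h>

    /*
     * Hedged sketch: allocate a key that denies execute and attach it to a
     * mapping.  PKEY_DISABLE_EXECUTE is the access right introduced by this
     * series; it is passed as init_access_rights to pkey_alloc().
     */
    int deny_exec_on(void *ptr, size_t len)
    {
            int pkey = pkey_alloc(0, PKEY_DISABLE_EXECUTE);

            if (pkey < 0)
                    return -1;
            if (pkey_mprotect(ptr, len, PROT_READ | PROT_WRITE, pkey) < 0) {
                    pkey_free(pkey);
                    return -1;
            }
            return pkey;
    }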
Re: [PATCH] kbuild: replace genhdr-y with generated-y, deprecating genhdr-y
2017-06-09 17:29 GMT+09:00 Masahiro Yamada : > Prior to commit fcc8487d477a ("uapi: export all headers under uapi > directories"), genhdr-y was meant to specify generated UAPI headers. > > - generated-y: generated headers (other than asm-generic wrappers) > - header-y : headers to be exported > - genhdr-y : generated headers to be exported (generated-y + header-y) > > Now headers under UAPI directories are all exported. So, there is no > more difference between generated-y and genhdr-y. > > We see two users of genhdr-y, arch/{arm,x86}/include/uapi/asm/Kbuild. > They generate some headers in arch/{arm,x86}/include/uapi/generated/ > directories, which are obviously exported. > > Replace genhdr-y with generated-y, and deprecate genhdr-y. > > Signed-off-by: Masahiro Yamada Applied to linux-kbuild/kbuild. -- Best Regards Masahiro Yamada -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[v3 5/6] mm, oom: don't mark all oom victim tasks with TIF_MEMDIE
We want to limit the number of tasks which are having an access to the memory reserves. To ensure the progress it's enough to have one such process at the time. If we need to kill the whole cgroup, let's give an access to the memory reserves only to the first process in the list, which is (usually) the biggest process. This will give us good chances that all other processes will be able to quit without an access to the memory reserves. Otherwise, to keep going forward, let's grant the access to the memory reserves for tasks, which can't be reaped by the oom_reaper. As it will be done from the oom reaper thread, which handles the oom reaper queue consequently, there is no high risk to have too many such processes at the same time. To implement this solution, we need to stop using TIF_MEMDIE flag as an universal marker for oom victims tasks. It's not a big issue, as we have oom_mm pointer/tsk_is_oom_victim(), which are just better. Signed-off-by: Roman Gushchin Cc: Michal Hocko Cc: Vladimir Davydov Cc: Johannes Weiner Cc: Tejun Heo Cc: Tetsuo Handa Cc: kernel-t...@fb.com Cc: cgro...@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux...@kvack.org --- kernel/exit.c | 2 +- mm/oom_kill.c | 31 ++- 2 files changed, 23 insertions(+), 10 deletions(-) diff --git a/kernel/exit.c b/kernel/exit.c index d211425..5b95d74 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -554,7 +554,7 @@ static void exit_mm(void) task_unlock(current); mm_update_next_owner(mm); mmput(mm); - if (test_thread_flag(TIF_MEMDIE)) + if (tsk_is_oom_victim(current)) exit_oom_victim(); } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 489ab69..b55bd18 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -556,8 +556,18 @@ static void oom_reap_task(struct task_struct *tsk) struct mm_struct *mm = tsk->signal->oom_mm; /* Retry the down_read_trylock(mmap_sem) a few times */ - while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm)) + while (attempts++ < MAX_OOM_REAP_RETRIES && + !__oom_reap_task_mm(tsk, mm)) { + + /* +* If the task has no access to the memory reserves, +* grant it to help the task to exit. +*/ + if (!test_tsk_thread_flag(tsk, TIF_MEMDIE)) + set_tsk_thread_flag(tsk, TIF_MEMDIE); + schedule_timeout_idle(HZ/10); + } if (attempts <= MAX_OOM_REAP_RETRIES) goto done; @@ -647,16 +657,13 @@ static inline void wake_oom_reaper(struct task_struct *tsk) */ static void mark_oom_victim(struct task_struct *tsk) { - struct mm_struct *mm = tsk->mm; - WARN_ON(oom_killer_disabled); - /* OOM killer might race with memcg OOM */ - if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) - return; /* oom_mm is bound to the signal struct life time. */ - if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) - mmgrab(tsk->signal->oom_mm); + if (cmpxchg(&tsk->signal->oom_mm, NULL, tsk->mm) != NULL) + return; + + mmgrab(tsk->signal->oom_mm); /* * Make sure that the task is woken up from uninterruptible sleep @@ -665,7 +672,13 @@ static void mark_oom_victim(struct task_struct *tsk) * that TIF_MEMDIE tasks should be ignored. */ __thaw_task(tsk); - atomic_inc(&oom_victims); + + /* +* If there are no oom victims in flight, +* give the task an access to the memory reserves. +*/ + if (atomic_inc_return(&oom_victims) == 1) + set_tsk_thread_flag(tsk, TIF_MEMDIE); } /** -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[v3 3/6] mm, oom: cgroup-aware OOM killer debug info
Dump the cgroup oom badness score, as well as the name of chosen victim cgroup. Here how it looks like in dmesg: [ 18.824495] Choosing a victim memcg because of the system-wide OOM [ 18.826911] Cgroup /A1: 200805 [ 18.827996] Cgroup /A2: 273072 [ 18.828937] Cgroup /A2/B3: 51 [ 18.829795] Cgroup /A2/B4: 272969 [ 18.830800] Cgroup /A2/B5: 52 [ 18.831890] Chosen cgroup /A2/B4: 272969 Signed-off-by: Roman Gushchin Cc: Tejun Heo Cc: Johannes Weiner Cc: Li Zefan Cc: Michal Hocko Cc: Vladimir Davydov Cc: Tetsuo Handa Cc: kernel-t...@fb.com Cc: cgro...@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux...@kvack.org --- mm/memcontrol.c | 20 +++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index bdb5103..4face20 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2669,7 +2669,15 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc) if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) return false; + + pr_info("Choosing a victim memcg because of the %s", + oc->memcg ? + "memory limit reached of cgroup " : + "system-wide OOM\n"); if (oc->memcg) { + pr_cont_cgroup_path(oc->memcg->css.cgroup); + pr_cont("\n"); + chosen_memcg = oc->memcg; parent = oc->memcg; } @@ -2683,6 +2691,10 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc) points = mem_cgroup_oom_badness(iter, oc->nodemask); + pr_info("Cgroup "); + pr_cont_cgroup_path(iter->css.cgroup); + pr_cont(": %ld\n", points); + if (points > chosen_memcg_points) { chosen_memcg = iter; chosen_memcg_points = points; @@ -2731,6 +2743,10 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc) oc->chosen_memcg = chosen_memcg; } + pr_info("Chosen cgroup "); + pr_cont_cgroup_path(chosen_memcg->css.cgroup); + pr_cont(": %ld\n", oc->chosen_points); + /* * Even if we have to kill all tasks in the cgroup, * we need to select the biggest task to start with. @@ -2739,7 +2755,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc) */ oc->chosen_points = 0; mem_cgroup_scan_tasks(chosen_memcg, oom_evaluate_task, oc); - } + } else if (oc->chosen) + pr_info("Chosen task %s (%d) in root cgroup: %ld\n", + oc->chosen->comm, oc->chosen->pid, oc->chosen_points); rcu_read_unlock(); -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection
Oom killer should avoid unnecessary kills. To prevent them, during the tasks list traverse we check for task which was previously selected as oom victims. If there is such a task, new victim is not selected. This approach is sub-optimal (we're doing costly iteration over the task list every time) and will not work for the cgroup-aware oom killer. We already have oom_victims counter, which can be effectively used for the task. If there are victims in flight, don't do anything; if the counter falls to 0, there are no more oom victims left. So, it's a good time to start looking for a new victim. Signed-off-by: Roman Gushchin Cc: Michal Hocko Cc: Vladimir Davydov Cc: Johannes Weiner Cc: Tejun Heo Cc: Tetsuo Handa Cc: kernel-t...@fb.com Cc: cgro...@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux...@kvack.org --- mm/oom_kill.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 0e2c925..e3aaf5c8 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc) if (oom_killer_disabled) return false; + /* +* If there are oom victims in flight, we don't need to select +* a new victim. +*/ + if (atomic_read(&oom_victims) > 0) + return true; + if (!is_memcg_oom(oc)) { blocking_notifier_call_chain(&oom_notify_list, 0, &freed); if (freed > 0) -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[v3 6/6] mm,oom,docs: describe the cgroup-aware OOM killer
Update cgroups v2 docs. Signed-off-by: Roman Gushchin Cc: Michal Hocko Cc: Vladimir Davydov Cc: Johannes Weiner Cc: Tetsuo Handa Cc: David Rientjes Cc: Tejun Heo Cc: kernel-t...@fb.com Cc: cgro...@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux...@kvack.org --- Documentation/cgroup-v2.txt | 44 1 file changed, 44 insertions(+) diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt index a86f3cb..7a1a1ac 100644 --- a/Documentation/cgroup-v2.txt +++ b/Documentation/cgroup-v2.txt @@ -44,6 +44,7 @@ CONTENTS 5-2-1. Memory Interface Files 5-2-2. Usage Guidelines 5-2-3. Memory Ownership +5-2-4. Cgroup-aware OOM Killer 5-3. IO 5-3-1. IO Interface Files 5-3-2. Writeback @@ -799,6 +800,26 @@ PAGE_SIZE multiple when read back. high limit is used and monitored properly, this limit's utility is limited to providing the final safety net. + memory.oom_kill_all_tasks + + A read-write single value file which exits on non-root + cgroups. The default is "0". + + Defines whether the OOM killer should treat the cgroup + as a single entity during the victim selection. + + If set, it will cause the OOM killer to kill all belonging + tasks, both in case of a system-wide or cgroup-wide OOM. + + memory.oom_score_adj + + A read-write single value file which exits on non-root + cgroups. The default is "0". + + OOM killer score adjustment, which has as similar meaning + to a per-process value, available via /proc//oom_score_adj. + Should be in a range [-1000, 1000]. + memory.events A read-only flat-keyed file which exists on non-root cgroups. @@ -1028,6 +1049,29 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas belonging to the affected files to ensure correct memory ownership. +5-2-4. Cgroup-aware OOM Killer + +Cgroup v2 memory controller implements a cgroup-aware OOM killer. +It means that it treats memory cgroups as first class OOM entities. + +Under OOM conditions the memory controller tries to make the best +choise of a victim, hierarchically looking for the largest memory +consumer. By default, it will look for the biggest task in the +biggest leaf cgroup. + +But a user can change this behavior by enabling the per-cgroup +oom_kill_all_tasks option. If set, it causes the OOM killer treat +the whole cgroup as an indivisible memory consumer. In case if it's +selected as on OOM victim, all belonging tasks will be killed. + +Tasks in the root cgroup are treated as independent memory consumers, +and are compared with other memory consumers (e.g. leaf cgroups). +The root cgroup doesn't support the oom_kill_all_tasks feature. + +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM +the memory controller considers only cgroups belonging to the sub-tree +of the OOM'ing cgroup. + 5-3. IO The "io" controller regulates the distribution of IO resources. This -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
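For completeness, a hedged sketch of how a workload manager might flip the new knob from C (the mount point and cgroup path are assumptions; error handling trimmed):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hedged sketch: make the OOM killer treat 'cgroup' as one entity. */
    static int set_oom_kill_all(const char *cgroup)
    {
            char path[256];
            int fd, ret;

            snprintf(path, sizeof(path),
                     "/sys/fs/cgroup/%s/memory.oom_kill_all_tasks", cgroup);
            fd = open(path, O_WRONLY);
            if (fd < 0)
                    return -1;
            ret = (write(fd, "1", 1) == 1) ? 0 : -1;
            close(fd);
            return ret;
    }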
[v3 4/6] mm, oom: introduce oom_score_adj for memory cgroups
Introduce a per-memory-cgroup oom_score_adj setting. A read-write single value file which exits on non-root cgroups. The default is "0". It will have a similar meaning to a per-process value, available via /proc//oom_score_adj. Should be in a range [-1000, 1000]. Signed-off-by: Roman Gushchin Cc: Michal Hocko Cc: Vladimir Davydov Cc: Johannes Weiner Cc: Tejun Heo Cc: Tetsuo Handa Cc: kernel-t...@fb.com Cc: cgro...@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux...@kvack.org --- include/linux/memcontrol.h | 3 +++ mm/memcontrol.c| 36 2 files changed, 39 insertions(+) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index c59926c..b84a050 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -203,6 +203,9 @@ struct mem_cgroup { /* kill all tasks in the subtree in case of OOM */ bool oom_kill_all_tasks; + /* OOM kill score adjustment */ + short oom_score_adj; + /* handle for "memory.events" */ struct cgroup_file events_file; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4face20..e474eba 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5333,6 +5333,36 @@ static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of, return nbytes; } +static int memory_oom_score_adj_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + short oom_score_adj = memcg->oom_score_adj; + + seq_printf(m, "%d\n", oom_score_adj); + + return 0; +} + +static ssize_t memory_oom_score_adj_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + int oom_score_adj; + int err; + + err = kstrtoint(strstrip(buf), 0, &oom_score_adj); + if (err) + return err; + + if (oom_score_adj < OOM_SCORE_ADJ_MIN || + oom_score_adj > OOM_SCORE_ADJ_MAX) + return -EINVAL; + + memcg->oom_score_adj = (short)oom_score_adj; + + return nbytes; +} + static int memory_events_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); @@ -5459,6 +5489,12 @@ static struct cftype memory_files[] = { .write = memory_oom_kill_all_tasks_write, }, { + .name = "oom_score_adj", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = memory_oom_score_adj_show, + .write = memory_oom_score_adj_write, + }, + { .name = "events", .flags = CFTYPE_NOT_ON_ROOT, .file_offset = offsetof(struct mem_cgroup, events_file), -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[v3 2/6] mm, oom: cgroup-aware OOM killer
Traditionally, the OOM killer is operating on a process level. Under oom conditions, it finds a process with the highest oom score and kills it. This behavior doesn't suit well the system with many running containers. There are two main issues: 1) There is no fairness between containers. A small container with few large processes will be chosen over a large one with huge number of small processes. 2) Containers often do not expect that some random process inside will be killed. In many cases much more safer behavior is to kill all tasks in the container. Traditionally, this was implemented in userspace, but doing it in the kernel has some advantages, especially in a case of a system-wide OOM. 3) Per-process oom_score_adj affects global OOM, so it's a breache in the isolation. To address these issues, cgroup-aware OOM killer is introduced. Under OOM conditions, it tries to find the biggest memory consumer, and free memory by killing corresponding task(s). The difference the "traditional" OOM killer is that it can treat memory cgroups as memory consumers as well as single processes. By default, it will look for the biggest leaf cgroup, and kill the largest task inside. But a user can change this behavior by enabling the per-cgroup oom_kill_all_tasks option. If set, it causes the OOM killer treat the whole cgroup as an indivisible memory consumer. In case if it's selected as on OOM victim, all belonging tasks will be killed. Tasks in the root cgroup are treated as independent memory consumers, and are compared with other memory consumers (e.g. leaf cgroups). The root cgroup doesn't support the oom_kill_all_tasks feature. Signed-off-by: Roman Gushchin Cc: Michal Hocko Cc: Vladimir Davydov Cc: Johannes Weiner Cc: Tetsuo Handa Cc: David Rientjes Cc: Tejun Heo Cc: kernel-t...@fb.com Cc: cgro...@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux...@kvack.org --- include/linux/memcontrol.h | 20 ++ include/linux/oom.h| 3 + mm/memcontrol.c| 155 ++ mm/oom_kill.c | 164 + 4 files changed, 285 insertions(+), 57 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 3914e3d..c59926c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -35,6 +35,7 @@ struct mem_cgroup; struct page; struct mm_struct; struct kmem_cache; +struct oom_control; /* Cgroup-specific page state, on top of universal node page state */ enum memcg_stat_item { @@ -199,6 +200,9 @@ struct mem_cgroup { /* OOM-Killer disable */ int oom_kill_disable; + /* kill all tasks in the subtree in case of OOM */ + bool oom_kill_all_tasks; + /* handle for "memory.events" */ struct cgroup_file events_file; @@ -342,6 +346,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ return css ? 
container_of(css, struct mem_cgroup, css) : NULL; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ + css_put(&memcg->css); +} + #define mem_cgroup_from_counter(counter, member) \ container_of(counter, struct mem_cgroup, member) @@ -480,6 +489,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p) bool mem_cgroup_oom_synchronize(bool wait); +bool mem_cgroup_select_oom_victim(struct oom_control *oc); + #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; #endif @@ -739,6 +750,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task, return true; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ +} + static inline struct mem_cgroup * mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, @@ -926,6 +941,11 @@ static inline void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { } + +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) +{ + return false; +} #endif /* CONFIG_MEMCG */ static inline void __inc_memcg_state(struct mem_cgroup *memcg, diff --git a/include/linux/oom.h b/include/linux/oom.h index 8a266e2..b7ec3bd 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -39,6 +39,7 @@ struct oom_control { unsigned long totalpages; struct task_struct *chosen; unsigned long chosen_points; + struct mem_cgroup *chosen_memcg; }; extern struct mutex oom_lock; @@ -79,6 +80,8 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); +extern int oom_evaluate_task(struct task_struct *task, void *arg); + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 544d47e..bdb5103 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2625,6 +2625,128 @@ static inline bool memcg_has_children(struct mem_cgr
Re: [PATCH v7 08/36] x86/mm: Add support to enable SME in early boot processing
On Wed, 21 Jun 2017, Tom Lendacky wrote: > On 6/21/2017 10:38 AM, Thomas Gleixner wrote: > > /* > > * Sanitize CPU configuration and retrieve the modifier > > * for the initial pgdir entry which will be programmed > > * into CR3. Depends on enabled SME encryption, normally 0. > > */ > > call __startup_secondary_64 > > > > addq$(init_top_pgt - __START_KERNEL_map), %rax > > > > You can hide that stuff in C-code nicely without adding any cruft to the > > ASM code. > > > > Moving the call to verify_cpu into the C-code might be quite a bit of > change. Currently, the verify_cpu code is included code and not a > global function. Ah. Ok. I missed that. > I can still do the __startup_secondary_64() function and then look to > incorporate verify_cpu into both __startup_64() and > __startup_secondary_64() as a post-patch to this series. Yes, just having __startup_secondary_64() for now and there the extra bits for that encryption stuff is fine. > At least the secondary path will have a base C routine to which > modifications can be made in the future if needed. How does that sound? Sounds like a plan. -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
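For reference, a minimal sketch of the helper agreed on above (name taken from the discussion; sme_get_me_mask() as used elsewhere in the series; not the final patch):

    /*
     * Hedged sketch: returns the modifier that secondary_startup_64 adds to
     * the initial pgdir address before it is programmed into %cr3.  Normally
     * 0; the SME encryption mask when memory encryption is active.  Folding
     * verify_cpu into this path is left for a follow-up, as discussed.
     */
    unsigned long __startup_secondary_64(void)
    {
            return sme_get_me_mask();
    }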
Re: [PATCH v6 26/34] iommu/amd: Allow the AMD IOMMU to work with memory encryption
On 6/21/2017 11:59 AM, Borislav Petkov wrote: On Wed, Jun 21, 2017 at 05:37:22PM +0200, Joerg Roedel wrote: Do you mean this is like the last exception case in that document above: " - Pointers to data structures in coherent memory which might be modified by I/O devices can, sometimes, legitimately be volatile. A ring buffer used by a network adapter, where that adapter changes pointers to indicate which descriptors have been processed, is an example of this type of situation." ? So currently (without this patch) the build_completion_wait function does not take a volatile parameter, only wait_on_sem() does. Wait_on_sem() needs it because its purpose is to poll a memory location which is changed by the iommu-hardware when its done with command processing. Right, the reason above - memory modifiable by an IO device. You could add a comment there explaining the need for the volatile. But the 'volatile' in build_completion_wait() looks unnecessary, because the function does not poll the memory location. It only uses the pointer, converts it to a physical address and writes it to the command to be queued. Ok. Ok, so the (now) current version of the patch that doesn't change the function signature is the right way to go. Thanks, Tom Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 08/36] x86/mm: Add support to enable SME in early boot processing
On 6/21/2017 10:38 AM, Thomas Gleixner wrote: On Wed, 21 Jun 2017, Tom Lendacky wrote: On 6/21/2017 2:16 AM, Thomas Gleixner wrote: Why is this an unconditional function? Isn't the mask simply 0 when the MEM ENCRYPT support is disabled? I made it unconditional because of the call from head_64.S. I can't make use of the C level static inline function and since the mask is not a variable if CONFIG_AMD_MEM_ENCRYPT is not configured (#defined to 0) I can't reference the variable directly. I could create a #define in head_64.S that changes this to load rax with the variable if CONFIG_AMD_MEM_ENCRYPT is configured or a zero if it's not or add a #ifdef at that point in the code directly. Thoughts on that? See below. That does not make any sense. Neither the call to sme_encrypt_kernel() nor the following call to sme_get_me_mask(). __startup_64() is already C code, so why can't you simply call that from __startup_64() in C and return the mask from there? I was trying to keep it explicit as to what was happening, but I can move those calls into __startup_64(). That's much preferred. And the return value wants to be documented in both C and ASM code. Will do. I'll still need the call to sme_get_me_mask() in the secondary_startup_64 path, though (depending on your thoughts to the above response). call verify_cpu movq$(init_top_pgt - __START_KERNEL_map), %rax So if you make that: /* * Sanitize CPU configuration and retrieve the modifier * for the initial pgdir entry which will be programmed * into CR3. Depends on enabled SME encryption, normally 0. */ call __startup_secondary_64 addq$(init_top_pgt - __START_KERNEL_map), %rax You can hide that stuff in C-code nicely without adding any cruft to the ASM code. Moving the call to verify_cpu into the C-code might be quite a bit of change. Currently, the verify_cpu code is included code and not a global function. I can still do the __startup_secondary_64() function and then look to incorporate verify_cpu into both __startup_64() and __startup_secondary_64() as a post-patch to this series. At least the secondary path will have a base C routine to which modifications can be made in the future if needed. How does that sound? Thanks, Tom Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 26/34] iommu/amd: Allow the AMD IOMMU to work with memory encryption
On Wed, Jun 21, 2017 at 05:37:22PM +0200, Joerg Roedel wrote: > > Do you mean this is like the last exception case in that document above: > > > > " > > - Pointers to data structures in coherent memory which might be modified > > by I/O devices can, sometimes, legitimately be volatile. A ring buffer > > used by a network adapter, where that adapter changes pointers to > > indicate which descriptors have been processed, is an example of this > > type of situation." > > > > ? > > So currently (without this patch) the build_completion_wait function > does not take a volatile parameter, only wait_on_sem() does. > > Wait_on_sem() needs it because its purpose is to poll a memory location > which is changed by the iommu-hardware when its done with command > processing. Right, the reason above - memory modifiable by an IO device. You could add a comment there explaining the need for the volatile. > But the 'volatile' in build_completion_wait() looks unnecessary, because > the function does not poll the memory location. It only uses the > pointer, converts it to a physical address and writes it to the command > to be queued. Ok. Thanks. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 08/36] x86/mm: Add support to enable SME in early boot processing
On Wed, 21 Jun 2017, Tom Lendacky wrote: > On 6/21/2017 2:16 AM, Thomas Gleixner wrote: > > Why is this an unconditional function? Isn't the mask simply 0 when the MEM > > ENCRYPT support is disabled? > > I made it unconditional because of the call from head_64.S. I can't make > use of the C level static inline function and since the mask is not a > variable if CONFIG_AMD_MEM_ENCRYPT is not configured (#defined to 0) I > can't reference the variable directly. > > I could create a #define in head_64.S that changes this to load rax with > the variable if CONFIG_AMD_MEM_ENCRYPT is configured or a zero if it's > not or add a #ifdef at that point in the code directly. Thoughts on > that? See below. > > That does not make any sense. Neither the call to sme_encrypt_kernel() nor > > the following call to sme_get_me_mask(). > > > > __startup_64() is already C code, so why can't you simply call that from > > __startup_64() in C and return the mask from there? > > I was trying to keep it explicit as to what was happening, but I can > move those calls into __startup_64(). That's much preferred. And the return value wants to be documented in both C and ASM code. > I'll still need the call to sme_get_me_mask() in the secondary_startup_64 > path, though (depending on your thoughts to the above response). call verify_cpu movq$(init_top_pgt - __START_KERNEL_map), %rax So if you make that: /* * Sanitize CPU configuration and retrieve the modifier * for the initial pgdir entry which will be programmed * into CR3. Depends on enabled SME encryption, normally 0. */ call __startup_secondary_64 addq$(init_top_pgt - __START_KERNEL_map), %rax You can hide that stuff in C-code nicely without adding any cruft to the ASM code. Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 26/34] iommu/amd: Allow the AMD IOMMU to work with memory encryption
On Thu, Jun 15, 2017 at 11:41:12AM +0200, Borislav Petkov wrote: > On Wed, Jun 14, 2017 at 03:40:28PM -0500, Tom Lendacky wrote: > > > WARNING: Use of volatile is usually wrong: see > > > Documentation/process/volatile-considered-harmful.rst > > > #134: FILE: drivers/iommu/amd_iommu.c:866: > > > +static void build_completion_wait(struct iommu_cmd *cmd, volatile u64 > > > *sem) > > > > > > > The semaphore area is written to by the device so the use of volatile is > > appropriate in this case. > > Do you mean this is like the last exception case in that document above: > > " > - Pointers to data structures in coherent memory which might be modified > by I/O devices can, sometimes, legitimately be volatile. A ring buffer > used by a network adapter, where that adapter changes pointers to > indicate which descriptors have been processed, is an example of this > type of situation." > > ? So currently (without this patch) the build_completion_wait function does not take a volatile parameter, only wait_on_sem() does. Wait_on_sem() needs it because its purpose is to poll a memory location which is changed by the iommu-hardware when its done with command processing. But the 'volatile' in build_completion_wait() looks unnecessary, because the function does not poll the memory location. It only uses the pointer, converts it to a physical address and writes it to the command to be queued. Regards, Joerg -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 25/36] swiotlb: Add warnings for use of bounce buffers with SME
On 6/21/2017 5:50 AM, Borislav Petkov wrote: On Fri, Jun 16, 2017 at 01:54:36PM -0500, Tom Lendacky wrote: Add warnings to let the user know when bounce buffers are being used for DMA when SME is active. Since the bounce buffers are not in encrypted memory, these notifications are to allow the user to determine some appropriate action - if necessary. Actions can range from utilizing an IOMMU, replacing the device with another device that can support 64-bit DMA, ignoring the message if the device isn't used much, etc. Signed-off-by: Tom Lendacky --- include/linux/dma-mapping.h | 11 +++ include/linux/mem_encrypt.h |8 lib/swiotlb.c |3 +++ 3 files changed, 22 insertions(+) diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 4f3eece..ee2307e 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -10,6 +10,7 @@ #include #include #include +#include /** * List of possible attributes associated with a DMA mapping. The semantics @@ -577,6 +578,11 @@ static inline int dma_set_mask(struct device *dev, u64 mask) if (!dev->dma_mask || !dma_supported(dev, mask)) return -EIO; + + /* Since mask is unsigned, this can only be true if SME is active */ + if (mask < sme_dma_mask()) + dev_warn(dev, "SME is active, device will require DMA bounce buffers\n"); + *dev->dma_mask = mask; return 0; } @@ -596,6 +602,11 @@ static inline int dma_set_coherent_mask(struct device *dev, u64 mask) { if (!dma_supported(dev, mask)) return -EIO; + + /* Since mask is unsigned, this can only be true if SME is active */ + if (mask < sme_dma_mask()) + dev_warn(dev, "SME is active, device will require DMA bounce buffers\n"); Looks to me like those two checks above need to be a: void sme_check_mask(struct device *dev, u64 mask) { if (!sme_me_mask) return; /* Since mask is unsigned, this can only be true if SME is active */ if (mask < (((u64)sme_me_mask << 1) - 1)) dev_warn(dev, "SME is active, device will require DMA bounce buffers\n"); } which gets called and sme_dma_mask() is not really needed. Makes a lot of sense, I'll update the patch. Thanks, Tom -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 08/36] x86/mm: Add support to enable SME in early boot processing
On 6/21/2017 2:16 AM, Thomas Gleixner wrote: On Fri, 16 Jun 2017, Tom Lendacky wrote: diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h index a105796..988b336 100644 --- a/arch/x86/include/asm/mem_encrypt.h +++ b/arch/x86/include/asm/mem_encrypt.h @@ -15,16 +15,24 @@ #ifndef __ASSEMBLY__ +#include + #ifdef CONFIG_AMD_MEM_ENCRYPT extern unsigned long sme_me_mask; +void __init sme_enable(void); + #else /* !CONFIG_AMD_MEM_ENCRYPT */ #define sme_me_mask 0UL +static inline void __init sme_enable(void) { } + #endif/* CONFIG_AMD_MEM_ENCRYPT */ +unsigned long sme_get_me_mask(void); Why is this an unconditional function? Isn't the mask simply 0 when the MEM ENCRYPT support is disabled? I made it unconditional because of the call from head_64.S. I can't make use of the C level static inline function and since the mask is not a variable if CONFIG_AMD_MEM_ENCRYPT is not configured (#defined to 0) I can't reference the variable directly. I could create a #define in head_64.S that changes this to load rax with the variable if CONFIG_AMD_MEM_ENCRYPT is configured or a zero if it's not or add a #ifdef at that point in the code directly. Thoughts on that? diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S index 6225550..ef12729 100644 --- a/arch/x86/kernel/head_64.S +++ b/arch/x86/kernel/head_64.S @@ -78,7 +78,29 @@ startup_64: call__startup_64 popq%rsi - movq $(early_top_pgt - __START_KERNEL_map), %rax + /* +* Encrypt the kernel if SME is active. +* The real_mode_data address is in %rsi and that register can be +* clobbered by the called function so be sure to save it. +*/ + push%rsi + callsme_encrypt_kernel + pop %rsi That does not make any sense. Neither the call to sme_encrypt_kernel() nor the following call to sme_get_me_mask(). __startup_64() is already C code, so why can't you simply call that from __startup_64() in C and return the mask from there? I was trying to keep it explicit as to what was happening, but I can move those calls into __startup_64(). I'll still need the call to sme_get_me_mask() in the secondary_startup_64 path, though (depending on your thoughts to the above response). @@ -98,7 +120,20 @@ ENTRY(secondary_startup_64) /* Sanitize CPU configuration */ call verify_cpu - movq $(init_top_pgt - __START_KERNEL_map), %rax + /* +* Get the SME encryption mask. +* The encryption mask will be returned in %rax so we do an ADD +* below to be sure that the encryption mask is part of the +* value that will stored in %cr3. +* +* The real_mode_data address is in %rsi and that register can be +* clobbered by the called function so be sure to save it. +*/ + push%rsi + callsme_get_me_mask + pop %rsi Do we really need a call here? The mask is established at this point, so it's either 0 when the encryption stuff is not compiled in or it can be retrieved from a variable which is accessible at this point. Same as above, this can be updated based on the decided approach. Thanks, Tom + + addq$(init_top_pgt - __START_KERNEL_map), %rax 1: /* Enable PAE mode, PGE and LA57 */ Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 07/36] x86/mm: Don't use phys_to_virt in ioremap() if SME is active
On 6/21/2017 2:37 AM, Thomas Gleixner wrote: On Fri, 16 Jun 2017, Tom Lendacky wrote: Currently there is a check if the address being mapped is in the ISA range (is_ISA_range()), and if it is then phys_to_virt() is used to perform the mapping. When SME is active, however, this will result in the mapping having the encryption bit set when it is expected that an ioremap() should not have the encryption bit set. So only use the phys_to_virt() function if SME is not active Reviewed-by: Borislav Petkov Signed-off-by: Tom Lendacky --- arch/x86/mm/ioremap.c |7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index 4c1b5fd..a382ba9 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include @@ -106,9 +107,11 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr, } /* -* Don't remap the low PCI/ISA area, it's always mapped.. +* Don't remap the low PCI/ISA area, it's always mapped. +* But if SME is active, skip this so that the encryption bit +* doesn't get set. */ - if (is_ISA_range(phys_addr, last_addr)) + if (is_ISA_range(phys_addr, last_addr) && !sme_active()) return (__force void __iomem *)phys_to_virt(phys_addr); More thoughts about that. Making this conditional on !sme_active() is not the best idea. I'd rather remove that whole thing and make it unconditional so the code pathes get always exercised and any subtle wreckage is detected on a broader base and not only on that hard to access and debug SME capable machine owned by Joe User. Ok, that sounds good. I'll remove the check and usage of phys_to_virt() and update the changelog with additional detail about that. Thanks, Tom Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 07/36] x86/mm: Don't use phys_to_virt in ioremap() if SME is active
On 6/20/2017 3:55 PM, Thomas Gleixner wrote: On Fri, 16 Jun 2017, Tom Lendacky wrote: Currently there is a check if the address being mapped is in the ISA range (is_ISA_range()), and if it is then phys_to_virt() is used to perform the mapping. When SME is active, however, this will result in the mapping having the encryption bit set when it is expected that an ioremap() should not have the encryption bit set. So only use the phys_to_virt() function if SME is not active This does not make sense to me. What the heck has phys_to_virt() to do with the encryption bit. Especially why would the encryption bit be set on that mapping in the first place? The default is that all entries that get added to the pagetables have the encryption bit set unless specifically overridden. Any __va() or phys_to_virt() calls will result in a pagetable mapping that has the encryption bit set. For ioremap, the PAGE_KERNEL_IO protection is used which will not/does not have the encryption bit set. I'm probably missing something, but this want's some coherent explanation understandable by mere mortals both in the changelog and the code comment. I'll add some additional info to the changelog and code. Thanks, Tom Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Fw: [PATCH v4 3/5] input: add an EV_SW event for ratchet switch
Hi Dmitry, Ping. How do you want to proceed with that? Regards, Mauro Forwarded message: Date: Sat, 15 Apr 2017 19:50:45 -0300 From: Mauro Carvalho Chehab To: Dmitry Torokhov Cc: linux-in...@vger.kernel.org, Benjamin Tissoires , Jiri Kosina , Jonathan Corbet , Roderick Colenbrander , Stuart Yoder , "David S. Miller" , Ingo Tuchscherer , Florian Fainelli , Ping Cheng , Hans Verkuil , Kamil Debski , Douglas Anderson , linux-doc@vger.kernel.org Subject: Re: [PATCH v4 3/5] input: add a EV_SW event for ratchet switch Em Sat, 15 Apr 2017 11:04:36 -0700 Dmitry Torokhov escreveu: > Hi Mauro, > > On Tue, Apr 11, 2017 at 10:29:40AM -0300, Mauro Carvalho Chehab wrote: > > Some mice have a switch on their wheel, allowing to switch > > between ratchet and free wheel mode. Add support for it. > > > > Signed-off-by: Mauro Carvalho Chehab > > --- > > Documentation/input/event-codes.txt| 12 > > include/linux/mod_devicetable.h| 2 +- > > include/uapi/linux/input-event-codes.h | 4 +++- > > 3 files changed, 16 insertions(+), 2 deletions(-) > > > > diff --git a/Documentation/input/event-codes.txt > > b/Documentation/input/event-codes.txt > > index 50352ab5f6d4..5dbd45db9bf6 100644 > > --- a/Documentation/input/event-codes.txt > > +++ b/Documentation/input/event-codes.txt > > @@ -206,6 +206,18 @@ Upon resume, if the switch state is the same as before > > suspend, then the input > > subsystem will filter out the duplicate switch state reports. The driver > > does > > not need to keep the state of the switch at any time. > > > > +A few EV_SW codes have special meanings: > > + > > +* SW_RATCHET: > > + > > + - Some mice have a special switch for their wheel that allows to change > > +between free wheel mode and ratchet mode. When the switch is ratchet > > +mode (ON state), the wheel will offer some resistance for movements. It > > +may also provide a tactile feedback when scrolled. > > + > > +Note that some mice have a ratchet switch that does not generate a > > +software event. > > So it is still not clear to me why we need the 2 discrete events. Either > we key off the behavior off the new REL event, or from switch, but not > both. The two events are independent. Clicking at the Wheel button just sets it to free wheel or back to ratchet mode. It doesn't switch the resolution. The high resolution events are sent only when userspace sets the mouse to high resolution mode. I wrote patch series for Solaar with allows switching between low resolution and high resolution modes and controls if the wheel movement is normal or inverted: https://github.com/pwr/Solaar/pull/351 It uses the hidraw interface to switch between the two modes. > Also, it is unclear to me if allocating a new event for "hires" wheel is > optimal. This still does not solve the question about resolution (how > high is "hires" and what to do if Logitech will come out with > ultra-high-resolution wheel next year, or if we need to express > resolution for other relative events). How "high" is the resolution can be queried on those devices. Not sure how to report it to userspace, though. Ok, one application could query it via hidraw interface (my Solaar patches do that when solaar is called with the "show" parameter). Perhaps an ioctl? Or do you have a better idea? > > Thanks. 
> > +
> >  EV_MSC:
> >  --
> >  EV_MSC events are used for input and output events that do not fall under
> >  other
> >
> > diff --git a/include/linux/mod_devicetable.h b/include/linux/mod_devicetable.h
> > index a3e8c572a046..79dd7dbf5442 100644
> > --- a/include/linux/mod_devicetable.h
> > +++ b/include/linux/mod_devicetable.h
> > @@ -292,7 +292,7 @@ struct pcmcia_device_id {
> >  #define INPUT_DEVICE_ID_LED_MAX	0x0f
> >  #define INPUT_DEVICE_ID_SND_MAX	0x07
> >  #define INPUT_DEVICE_ID_FF_MAX		0x7f
> > -#define INPUT_DEVICE_ID_SW_MAX		0x0f
> > +#define INPUT_DEVICE_ID_SW_MAX		0x1f
> >
> >  #define INPUT_DEVICE_ID_MATCH_BUS	1
> >  #define INPUT_DEVICE_ID_MATCH_VENDOR	2
> > diff --git a/include/uapi/linux/input-event-codes.h b/include/uapi/linux/input-event-codes.h
> > index da48d4079511..da83e231e93d 100644
> > --- a/include/uapi/linux/input-event-codes.h
> > +++ b/include/uapi/linux/input-event-codes.h
> > @@ -789,7 +789,9 @@
> >  #define SW_LINEIN_INSERT	0x0d  /* set = inserted */
> >  #define SW_MUTE_DEVICE		0x0e  /* set = device disabled */
> >  #define SW_PEN_INSERTED		0x0f  /* set = pen inserted */
> > -#define SW_MAX			0x0f
> > +#define SW_RATCHET		0x10  /* set = ratchet mode,
> > +					 unset: free wheel */
> > +#define SW_MAX			0x1f
> >  #define SW_CNT			(SW_MAX+1)
> >
> >  /*
> > --
> > 2.9.3
> >
>

Thanks,
Mauro

Thanks,
Mauro
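As a concrete illustration of the proposed switch, here is a minimal
userspace sketch that queries the current ratchet state with EVIOCGSW and
then watches for EV_SW events through evdev. It assumes the SW_RATCHET code
proposed in this patch, and the device node path is made up:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/input.h>

#ifndef SW_RATCHET
#define SW_RATCHET 0x10	/* value proposed in this patch */
#endif

int main(void)
{
	unsigned long sw[SW_RATCHET / (8 * sizeof(long)) + 1];
	struct input_event ev;
	int fd = open("/dev/input/event5", O_RDONLY);	/* example device node */

	if (fd < 0)
		return 1;

	/* Query the current switch state bitmap. */
	memset(sw, 0, sizeof(sw));
	if (ioctl(fd, EVIOCGSW(sizeof(sw)), sw) >= 0)
		printf("ratchet mode is %s\n",
		       sw[SW_RATCHET / (8 * sizeof(long))] &
		       (1UL << (SW_RATCHET % (8 * sizeof(long)))) ? "on" : "off");

	/* Then report every change of the ratchet switch. */
	while (read(fd, &ev, sizeof(ev)) == sizeof(ev))
		if (ev.type == EV_SW && ev.code == SW_RATCHET)
			printf("wheel switched to %s mode\n",
			       ev.value ? "ratchet" : "free wheel");

	close(fd);
	return 0;
}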
Re: [PATCH v7 06/36] x86/mm: Add Secure Memory Encryption (SME) support
On 6/20/2017 3:49 PM, Thomas Gleixner wrote:
> On Fri, 16 Jun 2017, Tom Lendacky wrote:
>> +config ARCH_HAS_MEM_ENCRYPT
>> +	def_bool y
>> +	depends on X86
>
> That one is silly. The config switch is in the x86 Kconfig file, so X86
> is on. If you intended to move this to some generic place outside of
> x86/Kconfig then this should be
>
>	config ARCH_HAS_MEM_ENCRYPT
>		bool
>
> and x86/Kconfig should have
>
>	select ARCH_HAS_MEM_ENCRYPT
>
> and that should be selected by AMD_MEM_ENCRYPT

This is used for deciding whether to include the asm/mem_encrypt.h file,
so it needs to be on whether AMD_MEM_ENCRYPT is configured or not. I'll
leave it in the x86/Kconfig file and remove the "depends on" line.

Thanks,
Tom

>> +config AMD_MEM_ENCRYPT
>> +	bool "AMD Secure Memory Encryption (SME) support"
>> +	depends on X86_64 && CPU_SUP_AMD
>> +	---help---
>> +	  Say yes to enable support for the encryption of system memory.
>> +	  This requires an AMD processor that supports Secure Memory
>> +	  Encryption (SME).
>
> Thanks,
>
>	tglx
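The point about unconditional availability is easier to see from the
generic side. A minimal sketch of the inclusion pattern Tom describes
(an illustration of the idea, not the file from the series):

/*
 * Sketch of a generic mem_encrypt.h: the arch version can only be pulled
 * in when the architecture advertises ARCH_HAS_MEM_ENCRYPT, which is why
 * that symbol cannot depend on AMD_MEM_ENCRYPT being enabled.
 */
#ifndef __MEM_ENCRYPT_H__
#define __MEM_ENCRYPT_H__

#ifdef CONFIG_ARCH_HAS_MEM_ENCRYPT
#include <asm/mem_encrypt.h>	/* arch supplies sme_me_mask and friends */
#else
#define sme_me_mask	0UL	/* stub for architectures without SME */
#endif

#endif /* __MEM_ENCRYPT_H__ */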
Re: [PATCH v7 25/36] swiotlb: Add warnings for use of bounce buffers with SME
On Fri, Jun 16, 2017 at 01:54:36PM -0500, Tom Lendacky wrote:
> Add warnings to let the user know when bounce buffers are being used for
> DMA when SME is active. Since the bounce buffers are not in encrypted
> memory, these notifications are to allow the user to determine some
> appropriate action - if necessary. Actions can range from utilizing an
> IOMMU, replacing the device with another device that can support 64-bit
> DMA, ignoring the message if the device isn't used much, etc.
>
> Signed-off-by: Tom Lendacky
> ---
>  include/linux/dma-mapping.h | 11 +++
>  include/linux/mem_encrypt.h |  8
>  lib/swiotlb.c               |  3 +++
>  3 files changed, 22 insertions(+)
>
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 4f3eece..ee2307e 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -10,6 +10,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  /**
>   * List of possible attributes associated with a DMA mapping. The semantics
> @@ -577,6 +578,11 @@ static inline int dma_set_mask(struct device *dev, u64 mask)
>
>  	if (!dev->dma_mask || !dma_supported(dev, mask))
>  		return -EIO;
> +
> +	/* Since mask is unsigned, this can only be true if SME is active */
> +	if (mask < sme_dma_mask())
> +		dev_warn(dev, "SME is active, device will require DMA bounce buffers\n");
> +
>  	*dev->dma_mask = mask;
>  	return 0;
>  }
> @@ -596,6 +602,11 @@ static inline int dma_set_coherent_mask(struct device *dev, u64 mask)
>  {
>  	if (!dma_supported(dev, mask))
>  		return -EIO;
> +
> +	/* Since mask is unsigned, this can only be true if SME is active */
> +	if (mask < sme_dma_mask())
> +		dev_warn(dev, "SME is active, device will require DMA bounce buffers\n");

Looks to me like those two checks above need to be a:

void sme_check_mask(struct device *dev, u64 mask)
{
	if (!sme_me_mask)
		return;

	/* Since mask is unsigned, this can only be true if SME is active */
	if (mask < (((u64)sme_me_mask << 1) - 1))
		dev_warn(dev, "SME is active, device will require DMA bounce buffers\n");
}

which gets called and sme_dma_mask() is not really needed.

-- 
Regards/Gruss,
    Boris.
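Wired up, the suggestion would look roughly like this sketch (an
illustration of the idea, not the final patch):

static inline int dma_set_mask(struct device *dev, u64 mask)
{
	if (!dev->dma_mask || !dma_supported(dev, mask))
		return -EIO;

	/* Warns only when sme_me_mask is set, so no sme_dma_mask() needed. */
	sme_check_mask(dev, mask);

	*dev->dma_mask = mask;
	return 0;
}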
Re: [PATCH 0/5] irq: generic-chip: resource management improvements
On 31/05/17 17:06, Bartosz Golaszewski wrote:
> This series is a follow-up to [1].
>
> Some users of irq_alloc_generic_chip() are modules which can be
> removed (e.g. gpio-ml-ioh) but have no means of freeing the allocated
> generic chip.
>
> Last time it was suggested to provide irq_destroy_generic_chip() which
> would undo both irq_remove_generic_chip() and irq_alloc_generic_chip().
>
> This functionality is provided by patch 2/5 with 1/5 adding the option
> to only free the allocated memory.
>
> Patch 3/5 exports a function that will be used in the devres variant
> of irq_alloc_generic_chip().
>
> Patches 4/5 and 5/5 add resource managed versions of
> irq_alloc_generic_chip() & irq_setup_generic_chip(). They will be used
> in drivers where applicable. Device resources are released in reverse
> order so it's ok to call devm_irq_alloc_generic_chip() and then
> devm_irq_setup_generic_chip().
>
> [1] https://lkml.org/lkml/2017/3/8/550
>
> Bartosz Golaszewski (5):
>   irq: generic-chip: provide irq_free_generic_chip()
>   irq: generic-chip: provide irq_destroy_generic_chip()
>   irq: generic-chip: export irq_init_generic_chip() locally
>   irq: generic-chip: provide devm_irq_alloc_generic_chip()
>   irq: generic-chip: provide devm_irq_setup_generic_chip()
>
>  Documentation/driver-model/devres.txt |  2 +
>  include/linux/irq.h                   | 22 +
>  kernel/irq/devres.c                   | 86 +++
>  kernel/irq/generic-chip.c             |  7 ++-
>  kernel/irq/internals.h                | 11 +
>  5 files changed, 124 insertions(+), 4 deletions(-)

Looks OK to me. For the series:

Acked-by: Marc Zyngier

        M.

-- 
Jazz is not dead. It just smells funny...
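As an illustration, a driver probe path using the proposed devres variants
might look like the sketch below. The signatures are assumed to mirror
irq_alloc_generic_chip()/irq_setup_generic_chip() with a struct device
prepended, and all driver-specific names and values are made up:

#include <linux/irq.h>
#include <linux/platform_device.h>

#define FOO_IRQ_MASK_REG	0x04	/* made-up register offset */

static int foo_probe(struct platform_device *pdev)
{
	void __iomem *reg_base = NULL;	/* placeholder: map the real registers */
	unsigned int irq_base = 0;	/* placeholder: first allocated irq descriptor */
	struct irq_chip_generic *gc;

	gc = devm_irq_alloc_generic_chip(&pdev->dev, "foo-irq", 1, irq_base,
					 reg_base, handle_level_irq);
	if (!gc)
		return -ENOMEM;

	gc->chip_types[0].chip.irq_mask = irq_gc_mask_clr_bit;
	gc->chip_types[0].chip.irq_unmask = irq_gc_mask_set_bit;
	gc->chip_types[0].regs.mask = FOO_IRQ_MASK_REG;

	/*
	 * No explicit teardown: devres undoes the setup and then the
	 * allocation, in reverse order, when the device goes away.
	 */
	return devm_irq_setup_generic_chip(&pdev->dev, gc, IRQ_MSK(1),
					   IRQ_GC_INIT_MASK_CACHE,
					   IRQ_NOREQUEST, 0);
}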
Re: [PATCH 0/5] irq: generic-chip: resource management improvements
2017-06-20 16:14 GMT+02:00 Thomas Gleixner :
> On Tue, 20 Jun 2017, Bartosz Golaszewski wrote:
>> 2017-06-20 12:41 GMT+02:00 Marc Zyngier :
>> > There was a kbuild report from June 1st with worrying warnings on x86_64
>> > (though I couldn't see how that was related to these patches). What's
>> > the status of that?
>> >
>> > Thanks,
>> >
>> > M.
>> > --
>> > Jazz is not dead. It just smells funny...
>>
>> Snap, I looked at it, determined that it was just a header included in
>> include/linux/irq.h (unrelated to the patch) and forgot to comment
>> about it.
>>
>> I've never seen this warning on my setup and don't see it now with rc6.
>
> Yep, that's a genuine x86 snafu. No idea how that got attributed to your
> patch.

So are the patches ok and can be merged for 4.13?

Thanks,
Bartosz
Re: [PATCH v7 24/36] x86, swiotlb: Add memory encryption support
On Fri, Jun 16, 2017 at 01:54:24PM -0500, Tom Lendacky wrote:
> Since DMA addresses will effectively look like 48-bit addresses when the
> memory encryption mask is set, SWIOTLB is needed if the DMA mask of the
> device performing the DMA does not support 48-bits. SWIOTLB will be
> initialized to create decrypted bounce buffers for use by these devices.
>
> Signed-off-by: Tom Lendacky
> ---
>  arch/x86/include/asm/dma-mapping.h |  5 ++-
>  arch/x86/include/asm/mem_encrypt.h |  5 +++
>  arch/x86/kernel/pci-dma.c          | 11 +--
>  arch/x86/kernel/pci-nommu.c        |  2 +
>  arch/x86/kernel/pci-swiotlb.c      | 15 +-
>  arch/x86/mm/mem_encrypt.c          | 22 +++
>  include/linux/swiotlb.h            |  1 +
>  init/main.c                        | 10 +++
>  lib/swiotlb.c                      | 54 +++-
>  9 files changed, 108 insertions(+), 17 deletions(-)

Reviewed-by: Borislav Petkov

-- 
Regards/Gruss,
    Boris.
Re: [PATCH v7 23/36] x86, realmode: Decrypt trampoline area if memory encryption is active
On Fri, Jun 16, 2017 at 01:54:12PM -0500, Tom Lendacky wrote:
> When Secure Memory Encryption is enabled, the trampoline area must not
> be encrypted. A CPU running in real mode will not be able to decrypt
> memory that has been encrypted because it will not be able to use addresses
> with the memory encryption mask.
>
> Signed-off-by: Tom Lendacky
> ---
>  arch/x86/realmode/init.c |  8
>  1 file changed, 8 insertions(+)

Subject: x86/realmode: ...

other than that:

Reviewed-by: Borislav Petkov

-- 
Regards/Gruss,
    Boris.
Re: [PATCH v7 20/36] x86, mpparse: Use memremap to map the mpf and mpc data
On Fri, Jun 16, 2017 at 01:53:38PM -0500, Tom Lendacky wrote:
> The SMP MP-table is built by UEFI and placed in memory in a decrypted
> state. These tables are accessed using a mix of early_memremap(),
> early_memunmap(), phys_to_virt() and virt_to_phys(). Change all accesses
> to use early_memremap()/early_memunmap(). This allows for proper setting
> of the encryption mask so that the data can be successfully accessed when
> SME is active.
>
> Signed-off-by: Tom Lendacky
> ---
>  arch/x86/kernel/mpparse.c | 98 -
>  1 file changed, 70 insertions(+), 28 deletions(-)

Reviewed-by: Borislav Petkov

Please put the conversion to pr_fmt() on the TODO list for later.

Thanks.

-- 
Regards/Gruss,
    Boris.
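The access pattern being converted to looks roughly like the following
sketch (illustrative only, not the actual mpparse.c hunk): map the physical
table with early_memremap(), use it, then unmap it, instead of going through
phys_to_virt(), so the mapping can be created without the encryption bit:

#include <linux/io.h>
#include <asm/mpspec_def.h>

static int __init mpc_map_example(unsigned long physptr)
{
	struct mpc_table *mpc;
	unsigned long size;

	/* Map just the header first to learn the table length. */
	mpc = early_memremap(physptr, sizeof(*mpc));
	if (!mpc)
		return -ENOMEM;
	size = mpc->length;
	early_memunmap(mpc, sizeof(*mpc));

	/* Now map the whole table for parsing. */
	mpc = early_memremap(physptr, size);
	if (!mpc)
		return -ENOMEM;

	/* ... walk the MP configuration entries here ... */

	early_memunmap(mpc, size);
	return 0;
}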
Re: [PATCH v7 10/36] x86/mm: Provide general kernel support for memory encryption
On Wed, Jun 21, 2017 at 09:18:59AM +0200, Thomas Gleixner wrote:
> That looks wrong. It's not decrypted it's rather unencrypted, right?

Yeah, in previous versions of the patchset, "decrypted" and "unencrypted"
were both present, so we settled on "decrypted" for the nomenclature.

-- 
Regards/Gruss,
    Boris.
Re: [PATCH v7 07/36] x86/mm: Don't use phys_to_virt in ioremap() if SME is active
On Fri, 16 Jun 2017, Tom Lendacky wrote:
> Currently there is a check if the address being mapped is in the ISA
> range (is_ISA_range()), and if it is then phys_to_virt() is used to
> perform the mapping. When SME is active, however, this will result
> in the mapping having the encryption bit set when it is expected that
> an ioremap() should not have the encryption bit set. So only use the
> phys_to_virt() function if SME is not active.
>
> Reviewed-by: Borislav Petkov
> Signed-off-by: Tom Lendacky
> ---
>  arch/x86/mm/ioremap.c | 7 +-
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index 4c1b5fd..a382ba9 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -13,6 +13,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
> @@ -106,9 +107,11 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
>  	}
>
>  	/*
> -	 * Don't remap the low PCI/ISA area, it's always mapped..
> +	 * Don't remap the low PCI/ISA area, it's always mapped.
> +	 * But if SME is active, skip this so that the encryption bit
> +	 * doesn't get set.
>  	 */
> -	if (is_ISA_range(phys_addr, last_addr))
> +	if (is_ISA_range(phys_addr, last_addr) && !sme_active())
>  		return (__force void __iomem *)phys_to_virt(phys_addr);

More thoughts about that. Making this conditional on !sme_active() is not
the best idea. I'd rather remove that whole thing and make it
unconditional, so the code paths always get exercised and any subtle
wreckage is detected on a broader base and not only on that hard to
access and debug SME capable machine owned by Joe User.

Thanks,

	tglx
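A minimal sketch of that suggestion (illustrative only, not a tested patch):

static void __iomem *map_isa_region(resource_size_t phys_addr,
				    unsigned long size)
{
	/*
	 * No is_ISA_range()/phys_to_virt() shortcut: always build a real
	 * ioremap() mapping, which uses PAGE_KERNEL_IO and therefore never
	 * carries the encryption bit, whether or not SME is active.
	 */
	return ioremap(phys_addr, size);
}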
Re: [PATCH v7 10/36] x86/mm: Provide general kernel support for memory encryption
On Fri, 16 Jun 2017, Tom Lendacky wrote:
>
> +#ifndef pgprot_encrypted
> +#define pgprot_encrypted(prot)	(prot)
> +#endif
> +
> +#ifndef pgprot_decrypted

That looks wrong. It's not decrypted, it's rather unencrypted, right?

Thanks,

	tglx
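For context, a sketch of how the identity fallbacks and an SME-aware
override would fit together; the x86 forms here are assumptions based on
this series' sme_me_mask, not quotes from the patch, and the two sections
live in different headers:

/* Generic fallback, e.g. in a common header: identity mapping. */
#ifndef pgprot_encrypted
#define pgprot_encrypted(prot)	(prot)
#endif
#ifndef pgprot_decrypted
#define pgprot_decrypted(prot)	(prot)
#endif

/* Possible x86 definitions, provided before the fallback is consulted: */
#ifdef CONFIG_AMD_MEM_ENCRYPT
#define pgprot_encrypted(prot)	__pgprot(pgprot_val(prot) | sme_me_mask)
#define pgprot_decrypted(prot)	__pgprot(pgprot_val(prot) & ~sme_me_mask)
#endif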
Re: [PATCH v7 08/36] x86/mm: Add support to enable SME in early boot processing
On Fri, 16 Jun 2017, Tom Lendacky wrote:
> diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
> index a105796..988b336 100644
> --- a/arch/x86/include/asm/mem_encrypt.h
> +++ b/arch/x86/include/asm/mem_encrypt.h
> @@ -15,16 +15,24 @@
>
>  #ifndef __ASSEMBLY__
>
> +#include
> +
>  #ifdef CONFIG_AMD_MEM_ENCRYPT
>
>  extern unsigned long sme_me_mask;
>
> +void __init sme_enable(void);
> +
>  #else	/* !CONFIG_AMD_MEM_ENCRYPT */
>
>  #define sme_me_mask	0UL
>
> +static inline void __init sme_enable(void) { }
> +
>  #endif	/* CONFIG_AMD_MEM_ENCRYPT */
>
> +unsigned long sme_get_me_mask(void);

Why is this an unconditional function? Isn't the mask simply 0 when the
MEM ENCRYPT support is disabled?

> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 6225550..ef12729 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -78,7 +78,29 @@ startup_64:
>  	call	__startup_64
>  	popq	%rsi
>
> -	movq	$(early_top_pgt - __START_KERNEL_map), %rax
> +	/*
> +	 * Encrypt the kernel if SME is active.
> +	 * The real_mode_data address is in %rsi and that register can be
> +	 * clobbered by the called function so be sure to save it.
> +	 */
> +	push	%rsi
> +	call	sme_encrypt_kernel
> +	pop	%rsi

That does not make any sense. Neither the call to sme_encrypt_kernel() nor
the following call to sme_get_me_mask().

__startup_64() is already C code, so why can't you simply call that from
__startup_64() in C and return the mask from there?

> @@ -98,7 +120,20 @@ ENTRY(secondary_startup_64)
>  	/* Sanitize CPU configuration */
>  	call	verify_cpu
>
> -	movq	$(init_top_pgt - __START_KERNEL_map), %rax
> +	/*
> +	 * Get the SME encryption mask.
> +	 * The encryption mask will be returned in %rax so we do an ADD
> +	 * below to be sure that the encryption mask is part of the
> +	 * value that will stored in %cr3.
> +	 *
> +	 * The real_mode_data address is in %rsi and that register can be
> +	 * clobbered by the called function so be sure to save it.
> +	 */
> +	push	%rsi
> +	call	sme_get_me_mask
> +	pop	%rsi

Do we really need a call here? The mask is established at this point, so
it's either 0 when the encryption stuff is not compiled in or it can be
retrieved from a variable which is accessible at this point.

> +
> +	addq	$(init_top_pgt - __START_KERNEL_map), %rax
> 1:
>
>  	/* Enable PAE mode, PGE and LA57 */

Thanks,

	tglx
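A sketch of what that would look like on the C side (the prototype and the
placement of the calls are assumptions for illustration, not the posted
patch; the real __startup_64() lives in arch/x86/kernel/head64.c):

#include <asm/bootparam.h>
#include <asm/mem_encrypt.h>

unsigned long __startup_64(unsigned long physaddr, struct boot_params *bp)
{
	/* Detect SME and establish sme_me_mask before any page table fixups. */
	sme_enable();

	/* ... existing early page table fixups, OR-ing the mask into the entries ... */

	/* Encrypt the kernel image in place while still identity mapped. */
	sme_encrypt_kernel();

	/*
	 * Hand the mask back in %rax so startup_64 can simply do
	 *	addq	$(early_top_pgt - __START_KERNEL_map), %rax
	 * without any extra push/call/pop sequence; the value is 0 when
	 * SME support is not compiled in or not active.
	 */
	return sme_get_me_mask();
}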