Re: [patch V163 27/51] x86/mm/pti: Populate user PGD
On Wed, 20 Dec 2017, Juergen Gross wrote:
> On 20/12/17 01:22, Thomas Gleixner wrote:
> > On Tue, 19 Dec 2017, Thomas Gleixner wrote:
> >> On Tue, 19 Dec 2017, Ingo Molnar wrote:
> >> We don't run out of space, but the 0-day robot triggered a nasty issue.
> >>
> >> The fixmap bottom address, which contains the early_ioremap fixmap
> >> area, is:
> >>
> >>	vaddr_bt = FIXADDR_TOP - FIX_BTMAP_BEGIN * PAGE_SIZE
> >>
> >> If that address is lower than:
> >>
> >>	vaddr_end = __START_KERNEL_map + KERNEL_IMAGE_SIZE;
> >>
> >> then cleanup_highmap() will happily 0 out the PMD entry for the PTE
> >> page of FIX_BTMAP. That entry was set up earlier in
> >> early_ioremap_init().
> >>
> >> As a consequence the first call to __early_set_fixmap() which tries to
> >> install a PTE for early_ioremap() will crash and burn.
> >>
> >> Below is a nasty hack which fixes the problem. Ideally we get all of
> >> this cpu_entry_stuff out of the fixmap. I'll look into that later, but
> >> for now the patch 'fixes' the issue.
> >
> > I had a stab at moving the cpu_entry_area to some other place.
> >
> > The patch below works, but:
> >
> >  - it breaks the i386 build because I have not yet found a way to place
> >    CPU_ENTRY_AREA_BASE without creating include recursion hell
> >
> >  - it probably does not work on XEN_PV, but I'm too tired now to figure
> >    that out.
>
> The attached patch lets the system come up as XEN_PV.

I folded it back to the proper place. Thanks for looking!

	tglx
Re: [patch V163 27/51] x86/mm/pti: Populate user PGD
On 20/12/17 01:22, Thomas Gleixner wrote:
> On Tue, 19 Dec 2017, Thomas Gleixner wrote:
>> On Tue, 19 Dec 2017, Ingo Molnar wrote:
>> We don't run out of space, but the 0-day robot triggered a nasty issue.
>>
>> The fixmap bottom address, which contains the early_ioremap fixmap area,
>> is:
>>
>>	vaddr_bt = FIXADDR_TOP - FIX_BTMAP_BEGIN * PAGE_SIZE
>>
>> If that address is lower than:
>>
>>	vaddr_end = __START_KERNEL_map + KERNEL_IMAGE_SIZE;
>>
>> then cleanup_highmap() will happily 0 out the PMD entry for the PTE page
>> of FIX_BTMAP. That entry was set up earlier in early_ioremap_init().
>>
>> As a consequence the first call to __early_set_fixmap() which tries to
>> install a PTE for early_ioremap() will crash and burn.
>>
>> Below is a nasty hack which fixes the problem. Ideally we get all of this
>> cpu_entry_stuff out of the fixmap. I'll look into that later, but for now
>> the patch 'fixes' the issue.
>
> I had a stab at moving the cpu_entry_area to some other place.
>
> The patch below works, but:
>
>  - it breaks the i386 build because I have not yet found a way to place
>    CPU_ENTRY_AREA_BASE without creating include recursion hell
>
>  - it probably does not work on XEN_PV, but I'm too tired now to figure
>    that out.

The attached patch lets the system come up as XEN_PV.
Juergen

diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index e4a6fe8354f0..577fa8adb785 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include

 #define CREATE_TRACE_POINTS
 #include "vsyscall_trace.h"
@@ -351,15 +352,15 @@ void __init set_vsyscall_pgtable_user_bits(pgd_t *root)
 	pmd_t *pmd;

 	pgd = pgd_offset_pgd(root, VSYSCALL_ADDR);
-	pgd->pgd |= _PAGE_USER;
+	set_pgd(pgd, __pgd(pgd_val(*pgd) | _PAGE_USER));
 	p4d = p4d_offset(pgd, VSYSCALL_ADDR);
 #if CONFIG_PGTABLE_LEVELS >= 5
 	p4d->p4d |= _PAGE_USER;
 #endif
 	pud = pud_offset(p4d, VSYSCALL_ADDR);
-	pud->pud |= _PAGE_USER;
+	set_pud(pud, __pud(pud_val(*pud) | _PAGE_USER));
 	pmd = pmd_offset(pud, VSYSCALL_ADDR);
-	pmd->pmd |= _PAGE_USER;
+	set_pmd(pmd, __pmd(pmd_val(*pmd) | _PAGE_USER));
 }

 void __init map_vsyscall(void)
Re: [patch V163 27/51] x86/mm/pti: Populate user PGD
On Tue, 19 Dec 2017, Thomas Gleixner wrote:
> On Tue, 19 Dec 2017, Ingo Molnar wrote:
> We don't run out of space, but the 0-day robot triggered a nasty issue.
>
> The fixmap bottom address, which contains the early_ioremap fixmap area,
> is:
>
>	vaddr_bt = FIXADDR_TOP - FIX_BTMAP_BEGIN * PAGE_SIZE
>
> If that address is lower than:
>
>	vaddr_end = __START_KERNEL_map + KERNEL_IMAGE_SIZE;
>
> then cleanup_highmap() will happily 0 out the PMD entry for the PTE page
> of FIX_BTMAP. That entry was set up earlier in early_ioremap_init().
>
> As a consequence the first call to __early_set_fixmap() which tries to
> install a PTE for early_ioremap() will crash and burn.
>
> Below is a nasty hack which fixes the problem. Ideally we get all of this
> cpu_entry_stuff out of the fixmap. I'll look into that later, but for now
> the patch 'fixes' the issue.

I had a stab at moving the cpu_entry_area to some other place.

The patch below works, but:

 - it breaks the i386 build because I have not yet found a way to place
   CPU_ENTRY_AREA_BASE without creating include recursion hell

 - it probably does not work on XEN_PV, but I'm too tired now to figure
   that out.

Thanks,

	tglx

8<---

 Documentation/x86/x86_64/mm.txt         |    4
 arch/x86/events/intel/ds.c              |   53 ++--
 arch/x86/include/asm/desc.h             |    1
 arch/x86/include/asm/fixmap.h           |   89
 arch/x86/include/asm/pgtable_32_types.h |    6 -
 arch/x86/include/asm/pgtable_64_types.h |   49 ++-
 arch/x86/kernel/cpu/common.c            |  125
 arch/x86/kernel/dumpstack.c             |    1
 arch/x86/kernel/traps.c                 |    5 -
 arch/x86/mm/Makefile                    |    2
 arch/x86/mm/dump_pagetables.c           |    2
 arch/x86/mm/kasan_init_64.c             |    6 -
 arch/x86/mm/pti.c                       |   39 +++--
 arch/x86/xen/mmu_pv.c                   |    2
 b/arch/x86/include/asm/cpu_entry_area.h |   79 ++
 b/arch/x86/mm/cpu_entry_area.c          |  138
 16 files changed, 309 insertions(+), 292 deletions(-)

--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -12,7 +12,8 @@
 ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
 ... unused hole ...
 ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB)
 ... unused hole ...
-fffffe8000000000 - fffffeffffffffff (=39 bits) LDT remap for PTI
+fffffe0000000000 - fffffe7fffffffff (=39 bits) LDT remap for PTI
+fffffe8000000000 - fffffeffffffffff (=39 bits) cpu_entry_area mapping
 ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
 ... unused hole ...
 ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
@@ -36,6 +37,7 @@
 ffd4000000000000 - ffd5ffffffffffff (=49 bits) virtual memory map (512TB)
 ... unused hole ...
 ffdf000000000000 - fffffc0000000000 (=53 bits) kasan shadow memory (8PB)
 ... unused hole ...
+fffffe8000000000 - fffffeffffffffff (=39 bits) cpu_entry_area mapping
 ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
 ... unused hole ...
 ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -76,36 +76,39 @@ typedef struct { pteval_t pte; } pte_t;
 #define PGDIR_MASK	(~(PGDIR_SIZE - 1))

 /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
-#define MAXMEM		_AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL)
+#define MAXMEM			_AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL)

 #ifdef CONFIG_X86_5LEVEL
-#define VMALLOC_SIZE_TB _AC(12800, UL)
-#define __VMALLOC_BASE	_AC(0xffa0000000000000, UL)
-#define __VMEMMAP_BASE	_AC(0xffd4000000000000, UL)
-#define LDT_PGD_ENTRY	_AC(-112, UL)
-#define LDT_BASE_ADDR	(LDT_PGD_ENTRY << PGDIR_SHIFT)
+#define VMALLOC_SIZE_TB		_AC(12800, UL)
+#define __VMALLOC_BASE		_AC(0xffa0000000000000, UL)
+#define __VMEMMAP_BASE		_AC(0xffd4000000000000, UL)
+#define LDT_PGD_ENTRY		_AC(-112, UL)
+#define LDT_BASE_ADDR		(LDT_PGD_ENTRY << PGDIR_SHIFT)
 #else
-#define VMALLOC_SIZE_TB	_AC(32, UL)
-#define __VMALLOC_BASE	_AC(0xffffc90000000000, UL)
-#define __VMEMMAP_BASE	_AC(0xffffea0000000000, UL)
-#define LDT_PGD_ENTRY	_AC(-3, UL)
-#define LDT_BASE_ADDR	(LDT_PGD_ENTRY << PGDIR_SHIFT)
+#define VMALLOC_SIZE_TB		_AC(32, UL)
+#define __VMALLOC_BASE		_AC(0xffffc90000000000, UL)
+#define __VMEMMAP_BASE		_AC(0xffffea0000000000, UL)
+#define LDT_PGD_ENTRY		_AC(-4, UL)
+#define LDT_BASE_ADDR		(LDT_PGD_ENTRY << PGDIR_SHIFT)
 #endif

 #ifdef CONFIG_RANDOMIZE_MEMORY
-#define VMALLOC_START	vmalloc_base
-#define VMEMMAP_START	vmemmap_base
+#define VMALLOC_START		vmalloc_base
+#define VMEMMAP_START		vmemmap_base
 #else
-#define VMALLOC_START	__VMALLOC_BASE
-#define VMEMMAP_START	__VMEMMAP_BASE
+#def
Re: [patch V163 27/51] x86/mm/pti: Populate user PGD
On Tue, 19 Dec 2017, Ingo Molnar wrote:
> * Peter Zijlstra wrote:
> > On Mon, Dec 18, 2017 at 12:45:13PM -0800, Dave Hansen wrote:
> > > On 12/18/2017 12:41 PM, Peter Zijlstra wrote:
> > > >> I also don't think the user_shared area of the fixmap can get
> > > >> *that* big. Does anybody know offhand what the theoretical limits
> > > >> are there?
> > > > Problem there is the nr_cpus term I think, we currently have up to
> > > > 8k CPUs, but I can see that getting bigger in the future.
> > >
> > > It only matters if we go over 512GB, though. Is the per-cpu part of
> > > the fixmap ever more than 512GB/8k=64MB?
> >
> > Unlikely, I think the LDT (@ 32 pages / 128K) and the DS (@ 2*4 pages /
> > 32K) are the largest entries in there.
>
> Note that with the latest state of things the LDT is not in the fixmap
> anymore, it's mapped separately, via Andy's following patch:
>
>   e86aaee3f2d9: ("x86/pti: Put the LDT in its own PGD if PTI is on")
>
> We have the IDT, the per-CPU entry area and the Debug Store (on Intel
> CPUs) mapped in the fixmap area, in addition to the usual fixmap entries
> that are a handful of pages. (That's on 64-bit - on 32-bit we have a
> pretty large kmap area.)
>
> The biggest contribution to the size of the fixmap area is struct
> cpu_entry_area (FIX_CPU_ENTRY_AREA_BOTTOM..FIX_CPU_ENTRY_AREA_TOP), which
> is ~180k, i.e. 44 pages.
>
> Our current NR_CPUS limit is 8,192 CPUs, but even with 65,536 CPUs the
> fixmap area would still only be ~12 GB total - so we are far from running
> out of space.

We don't run out of space, but the 0-day robot triggered a nasty issue.

The fixmap bottom address, which contains the early_ioremap fixmap area, is:

	vaddr_bt = FIXADDR_TOP - FIX_BTMAP_BEGIN * PAGE_SIZE

If that address is lower than:

	vaddr_end = __START_KERNEL_map + KERNEL_IMAGE_SIZE;

then cleanup_highmap() will happily 0 out the PMD entry for the PTE page of
FIX_BTMAP. That entry was set up earlier in early_ioremap_init().

As a consequence the first call to __early_set_fixmap() which tries to
install a PTE for early_ioremap() will crash and burn.

Below is a nasty hack which fixes the problem. Ideally we get all of this
cpu_entry_stuff out of the fixmap. I'll look into that later, but for now
the patch 'fixes' the issue.

Thanks,

	tglx

8<---

--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -209,6 +209,8 @@ extern pte_t *kmap_pte;
 #define kmap_prot PAGE_KERNEL
 extern pte_t *pkmap_page_table;

+extern pmd_t *early_ioremap_page_table;
+
 void __native_set_fixmap(enum fixed_addresses idx, pte_t pte);
 void native_set_fixmap(enum fixed_addresses idx,
		       phys_addr_t phys, pgprot_t flags);
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -393,6 +393,15 @@ void __init cleanup_highmap(void)
 	for (; vaddr + PMD_SIZE - 1 < vaddr_end; pmd++, vaddr += PMD_SIZE) {
 		if (pmd_none(*pmd))
 			continue;
+		/*
+		 * Careful here. vaddr_end might be past the pmd which is
+		 * used by the early ioremap stuff. Don't clean that out as
+		 * it's already set up.
+		 */
+		if (__phys_addr_nodebug((unsigned long) pmd) ==
+		    __phys_addr_nodebug((unsigned long) early_ioremap_page_table))
+			continue;
+
 		if (vaddr < (unsigned long) _text || vaddr > end)
 			set_pmd(pmd, __pmd(0));
 	}
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -27,6 +27,8 @@

 #include "physaddr.h"

+pmd_t __initdata *early_ioremap_page_table;
+
 /*
  * Fix up the linear direct mapping of the kernel to avoid cache attribute
  * conflicts.
@@ -709,7 +711,7 @@ void __init early_ioremap_init(void)
 	pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
 	memset(bm_pte, 0, sizeof(bm_pte));
 	pmd_populate_kernel(&init_mm, pmd, bm_pte);
-
+	early_ioremap_page_table = pmd;
 	/*
	 * The boot-ioremap range spans multiple pmds, for which
	 * we are not prepared:
Re: [patch V163 27/51] x86/mm/pti: Populate user PGD
* Peter Zijlstra wrote:

> On Mon, Dec 18, 2017 at 12:45:13PM -0800, Dave Hansen wrote:
> > On 12/18/2017 12:41 PM, Peter Zijlstra wrote:
> > >> I also don't think the user_shared area of the fixmap can get *that*
> > >> big. Does anybody know offhand what the theoretical limits are there?
> > > Problem there is the nr_cpus term I think, we currently have up to 8k
> > > CPUs, but I can see that getting bigger in the future.
> >
> > It only matters if we go over 512GB, though. Is the per-cpu part of the
> > fixmap ever more than 512GB/8k=64MB?
>
> Unlikely, I think the LDT (@ 32 pages / 128K) and the DS (@ 2*4 pages /
> 32K) are the largest entries in there.

Note that with the latest state of things the LDT is not in the fixmap
anymore, it's mapped separately, via Andy's following patch:

  e86aaee3f2d9: ("x86/pti: Put the LDT in its own PGD if PTI is on")

We have the IDT, the per-CPU entry area and the Debug Store (on Intel CPUs)
mapped in the fixmap area, in addition to the usual fixmap entries that are
a handful of pages. (That's on 64-bit - on 32-bit we have a pretty large
kmap area.)

The biggest contribution to the size of the fixmap area is struct
cpu_entry_area (FIX_CPU_ENTRY_AREA_BOTTOM..FIX_CPU_ENTRY_AREA_TOP), which
is ~180k, i.e. 44 pages.

Our current NR_CPUS limit is 8,192 CPUs, but even with 65,536 CPUs the
fixmap area would still only be ~12 GB total - so we are far from running
out of space.

Btw., the DS portion of the fixmap is not 64K but 128K. Here's the current
size distribution of struct cpu_entry_area:

	::gdt			-   4K
	::tss			-  12K
	::entry_trampoline	-   4K
#if 64-bit
	::exception_stacks	-  20K
#endif
#if Intel
	::cpu_debug_store	-   4K
	::cpu_debug_buffers	- 128K
#endif

So ::cpu_debug_buffers (struct debug_store_buffers) is the biggest fixmap
chunk, by far, distributed the following way:

	::bts_buffer[]		- 64K
	::pebs_buffer[]		- 64K

Thanks,

	Ingo
Re: [patch V163 27/51] x86/mm/pti: Populate user PGD
On Mon, Dec 18, 2017 at 12:34 PM, Dave Hansen wrote:
> On 12/18/2017 03:42 AM, Thomas Gleixner wrote:
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -1120,6 +1120,11 @@ static inline void pmdp_set_wrprotect(st
>>  static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
>>  {
>> 	memcpy(dst, src, count * sizeof(pgd_t));
>> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
>> +	/* Clone the user space pgd as well */
>> +	memcpy(kernel_to_user_pgdp(dst), kernel_to_user_pgdp(src),
>> +	       count * sizeof(pgd_t));
>> +#endif
>>  }
>
> I was just thinking about this as I re-write the documentation about
> where the overhead of pti comes from.
>
> This obviously *works* for now. But, we certainly have the pti-mapped
> stuff spread much less through the address space than when this was
> thrown in here. It *seems* like we could probably do this with just 4
> PGDs:
>
>> 	pti_clone_user_shared();
>> 	pti_clone_entry_text();
>> 	pti_setup_espfix64();
>> 	pti_setup_vsyscall();
>
> The vsyscall is just one page and the espfix is *sized* to be one PGD,
> so we know each of those only takes one entry.
>
> We surely don't have 512GB of entry_text, and I don't think KASLR can
> ever cause it to span two PGD entries.

This would definitely work and, long-term, I think we should get rid of
the entry text mapping entirely. The tricky bit is that we need to
rearrange the whole memory map fairly radically for this.

We could make it more compact, too: the vsyscall page and the
cpu_entry_area stuff can share a PGD. The LDT could go in there, too. The
only requirement the LDT PGD has is that all of the next-level entries
that will ever be allocated are allocated at boot time.
Re: [patch V163 27/51] x86/mm/pti: Populate user PGD
On Mon, Dec 18, 2017 at 12:45:13PM -0800, Dave Hansen wrote:
> On 12/18/2017 12:41 PM, Peter Zijlstra wrote:
> >> I also don't think the user_shared area of the fixmap can get *that*
> >> big. Does anybody know offhand what the theoretical limits are there?
> > Problem there is the nr_cpus term I think, we currently have up to 8k
> > CPUs, but I can see that getting bigger in the future.
>
> It only matters if we go over 512GB, though. Is the per-cpu part of the
> fixmap ever more than 512GB/8k=64MB?

Unlikely, I think the LDT (@ 32 pages / 128K) and the DS (@ 2*4 pages /
32K) are the largest entries in there.
Re: [patch V163 27/51] x86/mm/pti: Populate user PGD
On 12/18/2017 12:41 PM, Peter Zijlstra wrote:
>> I also don't think the user_shared area of the fixmap can get *that*
>> big. Does anybody know offhand what the theoretical limits are there?
> Problem there is the nr_cpus term I think, we currently have up to 8k
> CPUs, but I can see that getting bigger in the future.

It only matters if we go over 512GB, though. Is the per-cpu part of the
fixmap ever more than 512GB/8k=64MB?

In any case, it's not hard to make a little loop. I think the common case,
at least, is that we only do 4 PGDs.
Re: [patch V163 27/51] x86/mm/pti: Populate user PGD
On Mon, Dec 18, 2017 at 12:34:22PM -0800, Dave Hansen wrote:
> On 12/18/2017 03:42 AM, Thomas Gleixner wrote:
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -1120,6 +1120,11 @@ static inline void pmdp_set_wrprotect(st
> >  static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
> >  {
> > 	memcpy(dst, src, count * sizeof(pgd_t));
> > +#ifdef CONFIG_PAGE_TABLE_ISOLATION
> > +	/* Clone the user space pgd as well */
> > +	memcpy(kernel_to_user_pgdp(dst), kernel_to_user_pgdp(src),
> > +	       count * sizeof(pgd_t));
> > +#endif
> >  }
>
> I was just thinking about this as I re-write the documentation about
> where the overhead of pti comes from.

The thing I thought of when I saw this earlier today was that this could
trivially be wrapped in a static_cpu_has(X86_FEATURE_PTI).

> This obviously *works* for now. But, we certainly have the pti-mapped
> stuff spread much less through the address space than when this was
> thrown in here. It *seems* like we could probably do this with just 4
> PGDs:
>
> > 	pti_clone_user_shared();
> > 	pti_clone_entry_text();
> > 	pti_setup_espfix64();
> > 	pti_setup_vsyscall();
>
> The vsyscall is just one page and the espfix is *sized* to be one PGD,
> so we know each of those only takes one entry.
>
> We surely don't have 512GB of entry_text, and I don't think KASLR can
> ever cause it to span two PGD entries.
>
> I also don't think the user_shared area of the fixmap can get *that*
> big. Does anybody know offhand what the theoretical limits are there?

Problem there is the nr_cpus term I think, we currently have up to 8k
CPUs, but I can see that getting bigger in the future.
Re: [patch V163 27/51] x86/mm/pti: Populate user PGD
On 12/18/2017 03:42 AM, Thomas Gleixner wrote:
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1120,6 +1120,11 @@ static inline void pmdp_set_wrprotect(st
>  static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
>  {
> 	memcpy(dst, src, count * sizeof(pgd_t));
> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
> +	/* Clone the user space pgd as well */
> +	memcpy(kernel_to_user_pgdp(dst), kernel_to_user_pgdp(src),
> +	       count * sizeof(pgd_t));
> +#endif
>  }

I was just thinking about this as I re-write the documentation about where
the overhead of pti comes from.

This obviously *works* for now. But, we certainly have the pti-mapped
stuff spread much less through the address space than when this was thrown
in here. It *seems* like we could probably do this with just 4 PGDs:

> 	pti_clone_user_shared();
> 	pti_clone_entry_text();
> 	pti_setup_espfix64();
> 	pti_setup_vsyscall();

The vsyscall is just one page and the espfix is *sized* to be one PGD, so
we know each of those only takes one entry.

We surely don't have 512GB of entry_text, and I don't think KASLR can ever
cause it to span two PGD entries.

I also don't think the user_shared area of the fixmap can get *that* big.
Does anybody know offhand what the theoretical limits are there?
[patch V163 27/51] x86/mm/pti: Populate user PGD
From: Dave Hansen

In clone_pgd_range() copy the init user PGDs which cover the kernel half of
the address space, so a process has all the required kernel mappings
visible.

[ tglx: Split out from the big kaiser dump and folded Andys simplification ]

Signed-off-by: Dave Hansen
Signed-off-by: Ingo Molnar
Signed-off-by: Thomas Gleixner
Reviewed-by: Borislav Petkov
Cc: Andy Lutomirski
Cc: Boris Ostrovsky
Cc: Brian Gerst
Cc: David Laight
Cc: Denys Vlasenko
Cc: Eduardo Valentin
Cc: Greg KH
Cc: H. Peter Anvin
Cc: Josh Poimboeuf
Cc: Juergen Gross
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Will Deacon
Cc: aligu...@amazon.com
Cc: daniel.gr...@iaik.tugraz.at
Cc: hu...@google.com
Cc: keesc...@google.com
---
 arch/x86/include/asm/pgtable.h |    5 +
 1 file changed, 5 insertions(+)

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1120,6 +1120,11 @@ static inline void pmdp_set_wrprotect(st
 static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
 {
 	memcpy(dst, src, count * sizeof(pgd_t));
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	/* Clone the user space pgd as well */
+	memcpy(kernel_to_user_pgdp(dst), kernel_to_user_pgdp(src),
+	       count * sizeof(pgd_t));
+#endif
 }

 #define PTE_SHIFT ilog2(PTRS_PER_PTE)