Re: [PATCH UPDATE] x86: ignore spurious faults
Ingo Molnar wrote:
> spurious faults happen all the time on SMP, in the native kernel. And
> what i mean is that Linux mprotect currently does not take advantage of
> x86's ability to just change the ptes, because there's no structured
> way to tell mm/mprotect.c that "it's safe to skip the TLB flush here".
> The flush happens in mm/mprotect.c's change_protection() function:
>
>	flush_tlb_range(vma, start, end);
>
> and that is unnecessary when we increase the protection rights, such as
> in a RO->RW change. (all that is needed is an smp_wmb() instead, to
> make sure all the pte modifications are visible when the syscall
> returns.)
>
> and it's a really rare case these days that you can find an area where
> Linux does not make use of a hardware MMU feature - so we should fix
> this ;-)

Well, I guess this isn't really specific to x86; we could always
legitimately not do a tlb flush after increasing permissions and leave
the fault handler to clean up the mess where needed.

But I don't think that's necessarily much of a win; it's cheaper to just
do the tlb flush rather than take a spurious fault, unless the faults
are very rare. If someone is doing an mprotect on a piece of memory
(esp to make it writable), my guess is that they're going to touch that
memory in the very near future.

The big win for this patch is avoiding cross-cpu tlb invalidation when
changing kernel mappings. mprotect doesn't attempt to do that anyway,
and so can incur spurious faults on other CPUs.

    J
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH UPDATE] x86: ignore spurious faults
* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> Ingo Molnar wrote:
>> thanks, applied.
>>
>> it would be nice to expose this ability of the architecture to the
>> core Linux kernel mprotect code as well, and let it skip on a TLB
>> flush when doing a RO->RW transition.
>
> The usermode fault handler already effectively does this; this patch
> just does it for kernel mode as well. I don't know if mprotect takes
> advantage of this.

spurious faults happen all the time on SMP, in the native kernel. And
what i mean is that Linux mprotect currently does not take advantage of
x86's ability to just change the ptes, because there's no structured way
to tell mm/mprotect.c that "it's safe to skip the TLB flush here". The
flush happens in mm/mprotect.c's change_protection() function:

	flush_tlb_range(vma, start, end);

and that is unnecessary when we increase the protection rights, such as
in a RO->RW change. (all that is needed is an smp_wmb() instead, to make
sure all the pte modifications are visible when the syscall returns.)

and it's a really rare case these days that you can find an area where
Linux does not make use of a hardware MMU feature - so we should fix
this ;-)

	Ingo
Re: [PATCH UPDATE] x86: ignore spurious faults
Ingo Molnar wrote:
> thanks, applied.
>
> it would be nice to expose this ability of the architecture to the
> core Linux kernel mprotect code as well, and let it skip on a TLB
> flush when doing a RO->RW transition.

The usermode fault handler already effectively does this; this patch
just does it for kernel mode as well. I don't know if mprotect takes
advantage of this.

> It could speed up valgrind and the other mprotect() users i guess?
> [and UML too perhaps]

Not valgrind (it doesn't rely on mmap protections), but electric fence
perhaps.

    J
Re: [PATCH UPDATE] x86: ignore spurious faults
* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> When changing a kernel page from RO->RW, it's OK to leave stale TLB
> entries around, since doing a global flush is expensive and they pose
> no security problem. They can, however, generate a spurious fault,
> which we should catch and simply return from (which will have the
> side-effect of reloading the TLB to the current PTE).
>
> This can occur when running under Xen, because it frequently changes
> kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
> It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
> doing a global TLB flush after changing page permissions.

thanks, applied.

it would be nice to expose this ability of the architecture to the core
Linux kernel mprotect code as well, and let it skip on a TLB flush when
doing a RO->RW transition. It could speed up valgrind and the other
mprotect() users i guess? [and UML too perhaps]

	Ingo
Re: [PATCH UPDATE] x86: ignore spurious faults
On 25/1/08 10:19, "Andi Kleen" <[EMAIL PROTECTED]> wrote:

>> Whether this is a problem in light of Xen spurious faults depends on
>> whether NMI handlers touch dynamically-allocated data.
>
> How do you define dynamically-allocated data?

Anything that could have been a read-only pte or ldt page in a previous
life with no intervening TLB flush. So get_free_page(), kmalloc(),
vmalloc(), ...

Actually I think we are fine, now I think about it some more, because we
only clear the software NMI-in-flight flag if the guest executes IRET
via the hypervisor. Most Xen Linux guests only do IRET via the
hypervisor when the current context is an NMI handler (additionally
x86_64 also does so when returning to ring 3). Most importantly for this
case, we will *not* IRET via the hypervisor when returning from a #PF
context nested in an NMI context. Hence the NMI-in-flight flag will not
be cleared, and guest virtual NMIs will not nest.

So that's a relief!

 -- Keir
Re: [PATCH UPDATE] x86: ignore spurious faults
> Whether this is a problem in light of Xen spurious faults depends on
> whether NMI handlers touch dynamically-allocated data.

How do you define dynamically-allocated data?

-Andi
Re: [PATCH UPDATE] x86: ignore spurious faults
> Whether this is a problem in light of Xen spurious faults depends on
> whether NMI handlers touch dynamically-allocated data. And if they do,
> it still depends on the exact scenario. If it is likely to be a
> problem, a Xen pv_op can flush the TLB on NMI entry, or we could have
> Xen do that implicitly before invoking the guest NMI handler.

For PV kernels it would probably be better to just implement a truly
nesting NMI trigger method instead of the one-bit state of the real
hardware.

Actually I'm not sure NMIs are really that useful for guests. Most
things done traditionally by NMIs are probably better done outside the
guest in a virtualized environment.

-Andi
Re: [PATCH UPDATE] x86: ignore spurious faults
>>> Nick Piggin <[EMAIL PROTECTED]> 25.01.08 09:38 >>>
> On Friday 25 January 2008 19:15, Jan Beulich wrote:
>> Actually, another thought: permitting (and handling) spurious faults
>> for kernel mappings conflicts with NMI handling, i.e. great care would
>> be needed to ensure the NMI path cannot touch any such mapping. So
>> even the present Xen/Linux Dom0 implementation may have some (perhaps
>> unlikely) problems here, and it would get worse if we added e.g. a
>> virtual watchdog NMI (something I am considering, which would then
>> extend the problem to DomU-s).
>
> Can you explain how they conflict?

In the same way as vmalloc faults do (which is why vmalloc_sync_all()
got introduced): a page fault nested inside an NMI will, by virtue of
executing IRET, prematurely tell the processor that NMI handling is done
(and specifically unmask further NMIs).

Jan
Re: [PATCH UPDATE] x86: ignore spurious faults
On 25/1/08 09:11, "Andi Kleen" <[EMAIL PROTECTED]> wrote:

>>> Actually, another thought: permitting (and handling) spurious faults
>>> for kernel mappings conflicts with NMI handling, i.e. great care
>>> would be needed to ensure the NMI path cannot touch any such mapping.
>>> So even the present Xen/Linux Dom0 implementation may have some
>>> (perhaps unlikely) problems here, and it would get worse if we added
>>> e.g. a virtual watchdog NMI (something I am considering, which would
>>> then extend the problem to DomU-s).
>>
>> Can you explain how they conflict?
>
> NMI is blocked by the hardware until IRET and when a page fault happens
> inside the NMI handler the early IRET unblocks it and then NMIs can
> nest, which will lead to stack corruption.

Whether this is a problem in light of Xen spurious faults depends on
whether NMI handlers touch dynamically-allocated data. And if they do,
it still depends on the exact scenario. If it is likely to be a problem,
a Xen pv_op can flush the TLB on NMI entry, or we could have Xen do that
implicitly before invoking the guest NMI handler.

Currently Xen guests do not use NMI for watchdog or oprofile, so that
rather limits the scope for problems.

 -- Keir
Re: [PATCH UPDATE] x86: ignore spurious faults
On Friday 25 January 2008 09:38:38 Nick Piggin wrote:
> On Friday 25 January 2008 19:15, Jan Beulich wrote:
>> Actually, another thought: permitting (and handling) spurious faults
>> for kernel mappings conflicts with NMI handling, i.e. great care would
>> be needed to ensure the NMI path cannot touch any such mapping. So
>> even the present Xen/Linux Dom0 implementation may have some (perhaps
>> unlikely) problems here, and it would get worse if we added e.g. a
>> virtual watchdog NMI (something I am considering, which would then
>> extend the problem to DomU-s).
>
> Can you explain how they conflict?

NMI is blocked by the hardware until IRET, and when a page fault happens
inside the NMI handler the early IRET unblocks it and then NMIs can
nest, which will lead to stack corruption.

-Andi
Re: [PATCH UPDATE] x86: ignore spurious faults
On 25/1/08 00:26, "Jeremy Fitzhardinge" <[EMAIL PROTECTED]> wrote:

>> I (obviously) don't know exactly how the TLB works in x86, but I
>> thought that on a miss, the CPU walks the pagetables first before
>> faulting? Maybe that's not the case if there is an RO entry actually
>> in the TLB?
>
> My understanding is that it will fault immediately if there's a TLB
> entry, and rewalk the tables on return from the fault before restarting
> the instruction, so there's no need for an explicit TLB flush. The TLB
> doesn't have a notion of negative cache entries, so any entry
> represents a present page of some variety.

Yes, write access with a r/o TLB entry causes the TLB entry to be
flushed and an immediate #PF with no page walk. This is a hardware
optimisation for copy-on-write demand faults. Both Intel and AMD
implement it.

 -- Keir
Re: [PATCH UPDATE] x86: ignore spurious faults
On Friday 25 January 2008 19:15, Jan Beulich wrote:
> Actually, another thought: permitting (and handling) spurious faults
> for kernel mappings conflicts with NMI handling, i.e. great care would
> be needed to ensure the NMI path cannot touch any such mapping. So even
> the present Xen/Linux Dom0 implementation may have some (perhaps
> unlikely) problems here, and it would get worse if we added e.g. a
> virtual watchdog NMI (something I am considering, which would then
> extend the problem to DomU-s).

Can you explain how they conflict?

Thanks,
Nick
Re: [PATCH UPDATE] x86: ignore spurious faults
Actually, another thought: permitting (and handling) spurious faults for
kernel mappings conflicts with NMI handling, i.e. great care would be
needed to ensure the NMI path cannot touch any such mapping. So even the
present Xen/Linux Dom0 implementation may have some (perhaps unlikely)
problems here, and it would get worse if we added e.g. a virtual
watchdog NMI (something I am considering, which would then extend the
problem to DomU-s).

Jan
Re: [PATCH UPDATE] x86: ignore spurious faults
Nick Piggin wrote:
> On Friday 25 January 2008 06:21, Jeremy Fitzhardinge wrote:
>> Matt Mackall wrote:
>>> There's perhaps an opportunity to do this lazy TLB trick in the mmap
>>> path as well, where RW mappings are initially mapped as RO so we can
>>> catch processes dirtying them and then switched to RW. If the mapping
>>> is shared across threads on multiple cores, we can defer
>>> synchronizing the TLBs on the others.
>>
>> I think spurious usermode faults are already dealt with.
>> handle_pte_fault() does essentially the same thing as this patch:
>>
>>	if (ptep_set_access_flags(vma, address, pte, entry, write_access)) {
>>		update_mmu_cache(vma, address, entry);
>>	} else {
>>		/*
>>		 * This is needed only for protection faults but the arch code
>>		 * is not yet telling us if this is a protection fault or not.
>>		 * This still avoids useless tlb flushes for .text page faults
>>		 * with threads.
>>		 */
>>		if (write_access)
>>			flush_tlb_page(vma, address);
>>	}
>
> I (obviously) don't know exactly how the TLB works in x86, but I
> thought that on a miss, the CPU walks the pagetables first before
> faulting? Maybe that's not the case if there is an RO entry actually
> in the TLB?

My understanding is that it will fault immediately if there's a TLB
entry, and rewalk the tables on return from the fault before restarting
the instruction, so there's no need for an explicit TLB flush. The TLB
doesn't have a notion of negative cache entries, so any entry represents
a present page of some variety.

    J
Re: [PATCH UPDATE] x86: ignore spurious faults
On Friday 25 January 2008 06:21, Jeremy Fitzhardinge wrote:
> Matt Mackall wrote:
>> There's perhaps an opportunity to do this lazy TLB trick in the mmap
>> path as well, where RW mappings are initially mapped as RO so we can
>> catch processes dirtying them and then switched to RW. If the mapping
>> is shared across threads on multiple cores, we can defer synchronizing
>> the TLBs on the others.
>
> I think spurious usermode faults are already dealt with.
> handle_pte_fault() does essentially the same thing as this patch:
>
>	if (ptep_set_access_flags(vma, address, pte, entry, write_access)) {
>		update_mmu_cache(vma, address, entry);
>	} else {
>		/*
>		 * This is needed only for protection faults but the arch code
>		 * is not yet telling us if this is a protection fault or not.
>		 * This still avoids useless tlb flushes for .text page faults
>		 * with threads.
>		 */
>		if (write_access)
>			flush_tlb_page(vma, address);
>	}

I (obviously) don't know exactly how the TLB works in x86, but I thought
that on a miss, the CPU walks the pagetables first before faulting?
Maybe that's not the case if there is an RO entry actually in the TLB?
Re: [PATCH UPDATE] x86: ignore spurious faults
Matt Mackall wrote:
> There's perhaps an opportunity to do this lazy TLB trick in the mmap
> path as well, where RW mappings are initially mapped as RO so we can
> catch processes dirtying them and then switched to RW. If the mapping
> is shared across threads on multiple cores, we can defer synchronizing
> the TLBs on the others.

I think spurious usermode faults are already dealt with.
handle_pte_fault() does essentially the same thing as this patch:

	if (ptep_set_access_flags(vma, address, pte, entry, write_access)) {
		update_mmu_cache(vma, address, entry);
	} else {
		/*
		 * This is needed only for protection faults but the arch code
		 * is not yet telling us if this is a protection fault or not.
		 * This still avoids useless tlb flushes for .text page faults
		 * with threads.
		 */
		if (write_access)
			flush_tlb_page(vma, address);
	}

    J
Re: [PATCH UPDATE] x86: ignore spurious faults
On Wed, 2008-01-23 at 16:28 -0800, Jeremy Fitzhardinge wrote:
> When changing a kernel page from RO->RW, it's OK to leave stale TLB
> entries around, since doing a global flush is expensive and they pose
> no security problem. They can, however, generate a spurious fault,
> which we should catch and simply return from (which will have the
> side-effect of reloading the TLB to the current PTE).
>
> This can occur when running under Xen, because it frequently changes
> kernel pages from RW->RO->RW to implement Xen's pagetable semantics.
> It could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
> doing a global TLB flush after changing page permissions.

There's perhaps an opportunity to do this lazy TLB trick in the mmap
path as well, where RW mappings are initially mapped as RO so we can
catch processes dirtying them and then switched to RW. If the mapping is
shared across threads on multiple cores, we can defer synchronizing the
TLBs on the others.

--
Mathematics is the supreme nostalgia of our time.
[PATCH UPDATE] x86: ignore spurious faults
When changing a kernel page from RO->RW, it's OK to leave stale TLB
entries around, since doing a global flush is expensive and they pose no
security problem. They can, however, generate a spurious fault, which we
should catch and simply return from (which will have the side-effect of
reloading the TLB to the current PTE).

This can occur when running under Xen, because it frequently changes
kernel pages from RW->RO->RW to implement Xen's pagetable semantics. It
could also occur when using CONFIG_DEBUG_PAGEALLOC, since it avoids
doing a global TLB flush after changing page permissions.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Harvey Harrison <[EMAIL PROTECTED]>
---
 arch/x86/mm/fault_32.c |   50
 arch/x86/mm/fault_64.c |   50
 2 files changed, 100 insertions(+)

===
--- a/arch/x86/mm/fault_32.c
+++ b/arch/x86/mm/fault_32.c
@@ -324,6 +324,51 @@ static int is_f00f_bug(struct pt_regs *r
 }

 /*
+ * Handle a spurious fault caused by a stale TLB entry.  This allows
+ * us to lazily refresh the TLB when increasing the permissions of a
+ * kernel page (RO -> RW or NX -> X).  Doing it eagerly is very
+ * expensive since that implies doing a full cross-processor TLB
+ * flush, even if no stale TLB entries exist on other processors.
+ * There are no security implications to leaving a stale TLB when
+ * increasing the permissions on a page.
+ */
+static int spurious_fault(unsigned long address,
+			  unsigned long error_code)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	/* Reserved-bit violation or user access to kernel space? */
+	if (error_code & (PF_USER | PF_RSVD))
+		return 0;
+
+	pgd = init_mm.pgd + pgd_index(address);
+	if (!pgd_present(*pgd))
+		return 0;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return 0;
+
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd))
+		return 0;
+
+	pte = pte_offset_kernel(pmd, address);
+	if (!pte_present(*pte))
+		return 0;
+
+	if ((error_code & PF_WRITE) && !pte_write(*pte))
+		return 0;
+	if ((error_code & PF_INSTR) && !pte_exec(*pte))
+		return 0;
+
+	return 1;
+}
+
+/*
  * Handle a fault on the vmalloc or module mapping area
  *
  * This assumes no large pages in there.
@@ -446,6 +491,11 @@ void __kprobes do_page_fault(struct pt_r
 	if (!(error_code & (PF_RSVD|PF_USER|PF_PROT)) &&
 	    vmalloc_fault(address) >= 0)
 		return;
+
+	/* Can handle a stale RO->RW TLB */
+	if (spurious_fault(address, error_code))
+		return;
+
 	/*
 	 * Don't take the mm semaphore here. If we fixup a prefetch
 	 * fault we could otherwise deadlock.
===
--- a/arch/x86/mm/fault_64.c
+++ b/arch/x86/mm/fault_64.c
@@ -312,6 +312,51 @@ static noinline void pgtable_bad(unsigne
 }

 /*
+ * Handle a spurious fault caused by a stale TLB entry.  This allows
+ * us to lazily refresh the TLB when increasing the permissions of a
+ * kernel page (RO -> RW or NX -> X).  Doing it eagerly is very
+ * expensive since that implies doing a full cross-processor TLB
+ * flush, even if no stale TLB entries exist on other processors.
+ * There are no security implications to leaving a stale TLB when
+ * increasing the permissions on a page.
+ */
+static int spurious_fault(unsigned long address,
+			  unsigned long error_code)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	/* Reserved-bit violation or user access to kernel space? */
+	if (error_code & (PF_USER | PF_RSVD))
+		return 0;
+
+	pgd = init_mm.pgd + pgd_index(address);
+	if (!pgd_present(*pgd))
+		return 0;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return 0;
+
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd))
+		return 0;
+
+	pte = pte_offset_kernel(pmd, address);
+	if (!pte_present(*pte))
+		return 0;
+
+	if ((error_code & PF_WRITE) && !pte_write(*pte))
+		return 0;
+	if ((error_code & PF_INSTR) && !pte_exec(*pte))
+		return 0;
+
+	return 1;
+}
+
+/*
  * Handle a fault on the vmalloc area
  *
  * This assumes no large pages in there.
@@ -443,6 +488,11 @@ asmlinkage void __kprobes do_page_fault(
 		if (vmalloc_fault(address) >= 0)
 			return;
 	}
+
+	/* Can handle a stale RO->RW TLB */
+	if (spurious_fault(address, error_code))
+		return;
+